[Trilinos-Users] [Pytrilinos-users] expected Trilinos on shared memory machine.

Wed Apr 9 12:16:43 MDT 2008

On Wed, Apr 9, 2008 at 12:30 PM, Heroux, Michael A <maherou at sandia.gov> wrote:
>
>  Daniel,
>
>  Depending on the preconditioner you are using with AztecOO, you should see
> improvement in performance running in parallel.  Performance might be
> limited by the shared bandwidth on your machine.  What kind of platform are
> you using?

How do I find out what the shared bandwidth is? I have three proposed
platforms with 2, 8, and 64 nodes. I include the details from meminfo
and cpuinfo below. Do the numbers below include what we're looking
for? Thanks!

2 node machine:

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 8
model name      : AMD Athlon(tm) MP 2400+
stepping        : 1
cpu MHz         : 2000.178
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 mmx fxsr sse syscall mp mmxext 3dnowext 3dnow ts
bogomips        : 4003.93

and memory:

MemTotal:      2076444 kB
MemFree:       1177832 kB
Buffers:           916 kB
Cached:         491864 kB
SwapCached:      77132 kB
Active:         495772 kB
Inactive:       355104 kB
HighTotal:     1179072 kB
HighFree:       363188 kB
LowTotal:       897372 kB
LowFree:        814644 kB
SwapTotal:     4000144 kB
SwapFree:      3833308 kB
Dirty:               0 kB
Writeback:           0 kB
AnonPages:      356380 kB
Mapped:          57968 kB
Slab:            35288 kB
PageTables:       4708 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:   5038364 kB
Committed_AS:  1097836 kB
VmallocTotal:   114680 kB
VmallocUsed:      3828 kB
VmallocChunk:   110804 kB

==============================================

8 node machine:

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 65
model name      : Dual-Core AMD Opteron(tm) Processor 8220
stepping        : 3
cpu MHz         : 2812.978
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext
fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm
cr8_legacy
bogomips        : 5630.17
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc

and memory:

MemTotal:     32964928 kB
MemFree:      13414884 kB
Buffers:        204924 kB
Cached:       13604200 kB
SwapCached:      14712 kB
Active:       12276440 kB
Inactive:      6657932 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:     32964928 kB
LowFree:      13414884 kB
SwapTotal:    65535992 kB
SwapFree:     65514348 kB
Dirty:             324 kB
Writeback:           0 kB
AnonPages:     5121792 kB
Mapped:          51372 kB
Slab:           541584 kB
PageTables:      19108 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:  82018456 kB
Committed_AS:  6887204 kB
VmallocTotal: 34359738367 kB
VmallocUsed:     32412 kB
VmallocChunk: 34359705719 kB

==============================================================

64 node machine:

processor  : 0
vendor     : GenuineIntel
arch       : IA-64
family     : Itanium 2
model      : 2
revision   : 1
archrev    : 0
features   : branchlong
cpu number : 0
cpu regs   : 4
cpu MHz    : 1500.000000
itc MHz    : 1500.000000
BogoMIPS   : 2244.60
siblings   : 1

and memory:

MemTotal:     516572480 kB
MemFree:      508559920 kB
Buffers:          5872 kB
Cached:        3928208 kB
SwapCached:          0 kB
Active:        4242128 kB
Inactive:      1685520 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:     516572480 kB
LowFree:      508559920 kB
SwapTotal:    10490400 kB
SwapFree:     10490400 kB
Dirty:               0 kB
Writeback:          16 kB
Mapped:        1669488 kB
Slab:           969504 kB
CommitLimit:  268776640 kB
Committed_AS:  2103568 kB
PageTables:       5552 kB
VmallocTotal: 137372805568 kB
VmallocUsed:   1010816 kB
VmallocChunk: 137371792720 kB
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
Hugepagesize:    262144 kB
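
Neither cpuinfo nor meminfo reports memory bandwidth, so the listings
above do not answer the question directly; a rough figure can be
measured instead. The following is a minimal sketch of a STREAM-style
copy test using only the Python standard library. The function name
copy_bandwidth and its parameters are hypothetical, and the result is
only an approximation for a single process, not a substitute for a
proper benchmark.

```python
import time


def copy_bandwidth(n_bytes=64 * 1024 * 1024, repeats=5):
    """Estimate memory copy bandwidth in GB/s (hypothetical helper).

    Copies one large buffer into another several times and keeps the
    fastest run. The buffers should be much larger than the CPU cache.
    """
    src = bytearray(n_bytes)
    dst = bytearray(n_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # bulk copy: reads n_bytes and writes n_bytes
        best = min(best, time.perf_counter() - start)
    # Count both the read and the write traffic.
    return 2 * n_bytes / max(best, 1e-12) / 1e9


if __name__ == "__main__":
    print(f"approx. copy bandwidth: {copy_bandwidth():.2f} GB/s")
```

Running one copy of this script per processor at the same time, and
comparing the per-process figure with a single-process run, should give
a rough idea of how much the processors compete for shared bandwidth.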

>
>  Mike
>
>
>
>  On 4/9/08 10:51 AM, "Daniel Wheeler" <daniel.wheeler2 at gmail.com> wrote:
>
>
>
> On Tue, Apr 8, 2008 at 6:22 PM, Bill Spotz <wfspotz at sandia.gov> wrote:
>  > On Apr 7, 2008, at 12:52 PM, Daniel Wheeler wrote:
>  >
>  >
>  > > In our code, for a typical problem the majority of the compute time is
>  > > spent in the "AztecOO.AztecOO(A, LHS, RHS)"  function.
>  > >
>  >
>  >  This is just a constructor.  Doesn't most of your time get spent in
>  >
>  >     Solver.Iterate(self.iterations, self.tolerance)
>  >
>  >  where Solver is the result of the constructor?
>
>  Yes. Sorry Bill. I pasted in the wrong line. So, let me reiterate the
>  question. Given that the majority of the time is being spent in
>  "Solver.Iterate(self.iterations, self.tolerance)", would you expect
>  major speed-ups by compiling Trilinos in parallel and running on a
>  shared memory machine?
>
>  --
>  Daniel Wheeler
>
>
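
To confirm where the time goes, the constructor and the Iterate call
quoted above can be timed separately. This is a minimal sketch using
only the standard library; timed is a hypothetical helper, and the
PyTrilinos lines in the comments assume that A, LHS, RHS, iterations
and tolerance already exist in the calling code.

```python
import time


def timed(label, func, *args, **kwargs):
    """Call func(*args, **kwargs), print the wall-clock time, and
    return (result, elapsed_seconds). Hypothetical helper."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.6f} s")
    return result, elapsed


# Intended use with PyTrilinos (assumed to be imported and set up):
#   solver, t_ctor = timed("constructor", AztecOO.AztecOO, A, LHS, RHS)
#   _, t_iter = timed("Iterate", solver.Iterate, iterations, tolerance)
```

Comparing the Iterate time between a serial run and a parallel run on
the same problem shows directly how much speed-up a given machine
delivers.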

-- 
Daniel Wheeler