
Best way to test for optimal settings of KPAR/NPAR/LPLANE

Posted: Tue Jul 02, 2013 4:31 pm
by hachteja
Hey all,

Currently I'm running some LOPTICS calculations that need fairly high NBANDS, ENCUT, and K-point meshes, and I want to use the parallelization options to make my calculations more efficient. However, these jobs are fairly costly in terms of computer hours, and I don't want to burn up all of my hours trying to make my calculations more efficient.

I'm a relatively new user, and was hoping for some guidance on the right way to test for optimal parallelization settings in VASP.

Thank you.

Best way to test for optimal settings of KPAR/NPAR/LPLANE

Posted: Wed Jul 10, 2013 1:45 am
by WolverBean
hachteja,
I'll paste here an email I wrote to a coworker who was performing calculations on NERSC's Hopper system. Since that's the only system I have extensive experience with, I can't say whether what's true on Hopper is true on all parallel architectures, but this might at least give you some idea of what to think about:

VASP parallelizes by bands, and by plane waves. VASP 5.3.2 and higher also parallelize by kpoints. (And of course, if you're doing an NEB calculation, those are also parallel... but I won't discuss that here.)

Let's get parallelization by planewaves out of the way now: turns out that with Hopper's architecture, parallelizing by planewaves doesn't help you (it has a very slight slowing effect, in fact). So we won't bother with it. Since the default is NOT to do planewaves in parallel, no effort is required to keep it that way.

Parallelization by bands is by far the most important. Say I'm working with a Bi8Mo12O48 lattice, and I've chosen my PAW cores such that this gives me 400 total electrons. Then I require NBANDS=200 at minimum (2 electrons per band), and to get an accurate result I'll need somewhat more than that. To have enough bands to account for all the unoccupied Mo d and Bi p valence orbitals (i.e. the first two sets of states above the Fermi level), I calculate that NBANDS=236 would be appropriate for this system.
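The band-count arithmetic above can be written out explicitly. This is just a sketch of that reasoning; the helper name and the 36 extra bands are from the example, not anything VASP computes for you:

```python
# Sketch of the minimum-NBANDS arithmetic (hypothetical helper, not a VASP tool).
def minimum_nbands(n_electrons):
    # Each band holds 2 electrons, so this is the bare minimum.
    return n_electrons // 2

# For the 400-electron Bi8Mo12O48 example:
occupied = minimum_nbands(400)   # 200 occupied bands
# Extra empty bands to cover the unoccupied Mo d and Bi p states;
# 36 is the author's estimate for this particular system.
nbands = occupied + 36           # 236
```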

The flag that controls parallelization over bands is the NPAR flag. The default is NPAR=1; at this value, all 236 of my bands will be calculated simultaneously across all however-many-processors-I'm-using. This is super inefficient, and you want to use a bigger value. How big? Well, if we were using 236 processor cores, we could set NPAR=236 and do each band on its own core. (This is what you'd HAVE to do for a Hartree Fock calculation.) BUT, not everything in the calculation is done separately for each band. The things that are the same for each band will therefore be done 236 times, when they only really need to be done once. This is not only inefficient, but will probably also cause you to run out of memory and crash. Surely, there must be a happy medium?

The usual recommendation is to set NPAR = square root of (number of processor cores). So the happy medium would be to use 236 cores, and set NPAR=15 or 16, right?

Well, not quite. First, we have a problem: 236 isn't evenly divisible by 15 or 16. So, let's round up: my rule of thumb suggested using 236 bands, but that's just a rule of thumb. Let's up it to 240, and for now assume we'll use 240 processor cores as well (we'll revisit this latter decision in a minute). With 240 cores, we could set NPAR=15 and divide the band calculation up into 240/15 = 16 parallel groups, or use NPAR=16 and divide the band calculation into 240/16 = 15 parallel groups. In the limit of infinitely fast communication between processors, it wouldn't matter in the slightest which way we did it. Of course, we're not operating in that limit. So, does it matter?

Well, on Hopper, NPAR=15 and NPAR=16 do in fact give you similar performance, but you can actually get much better performance if you set NPAR=20. Why? Because Hopper is built up from 12-core processors, with ultra-fast communication and shared memory within each 12-core unit but slower communication from one 12-core processor to another. If we use 240 cores, that means we need 20 of these 12-core processors. If we then set NPAR=20, we'll end up with each processor operating independently, and no band calculations spilling over from one processor to another. Thus, we take maximum advantage of the super-fast communication within each processor unit, without requiring much of the slower communication between units. NPAR=40 would also accomplish this, dividing each processor into two 6-core units, but my testing suggests this isn't as fast. Again, this is likely because all the non-parallelizable stuff then ends up on each processor twice. So the real sweet spot is to have each processor act as its own independent 12-core unit, and have all the non-parallelizable stuff repeated exactly once per processor (or in this case, 20 times total).

Now's a good time to revisit the question: how many processors should we use? We've determined that the number should be a multiple of 12. We know we can't use more processors than we have bands to calculate (assuming 1x1x1 kpoints, but see below). In fact, if we want to be efficient, then our number of processors should be an integral divisor of our number of bands. So for 240 bands, we could use 60 processors and NPAR=5. Then our 240 bands would be broken into 5 groups of 48, and each 12-core processor would do 48 bands at a time. This is nicely efficient: each processor core does 4 bands, and all the non-parallelizable stuff only gets duplicated 5 times. Another option would be to use 72 processors and NPAR=6. Then our 240 bands would be broken into 6 groups of 40, and each 12-core processor would do 40 bands at a time. This is somewhat less efficient: within a processor, you have to distribute 40 bands across 12 cores. The part of the processor that decides what each core should be working on at a given time will have to do more juggling to balance that load than it would if the bands divided evenly across the cores (say, 48 bands across 12 cores, as in the 60-processor case), so you'll lose a little. So even though 72 processors is more than 60, you won't see much if any speed-up in going from 60 to 72, because the parallelization is less efficient. Of course, both are much better than using 78 cores with NPAR=6: now your six groups of 13 cores don't line up with a 12-core-per-processor architecture, nor do 13 cores divide up 40 bands per group particularly nicely.

So, moral of story: choose a number of processors that evenly divides your NBANDS, and then choose NPAR such that your bands get divided evenly into chunks that match a 12-core processor architecture. For 240 bands, possible combinations are
240 core; NPAR=20
120 core; NPAR=10
60 core; NPAR=5
48 core; NPAR=4

To choose among these options, we go back to the NPAR ~ sqrt(#cores) rule, and guess that 120 cores and NPAR=10 is probably most efficient.
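The combinations above can be enumerated mechanically by checking both divisibility conditions. A small sketch under the same assumptions (240 bands, 12-core processors); it also turns up the trivially small 12- and 24-core options omitted from the list:

```python
# Enumerate (cores, NPAR) pairs for NBANDS=240 on 12-core nodes,
# following the two rules above: the core count evenly divides NBANDS,
# and NPAR assigns one whole 12-core node per band group.
nbands, cores_per_node = 240, 12

combos = []
for cores in range(cores_per_node, nbands + 1, cores_per_node):
    if nbands % cores == 0:            # cores must evenly divide NBANDS
        combos.append((cores, cores // cores_per_node))

# combos -> [(12, 1), (24, 2), (48, 4), (60, 5), (120, 10), (240, 20)]
```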

Now, suppose I want to run the calculation at 3x2x2 kpoints. Turns out for Bi8Mo12O48 with P21c symmetry, that's 10 unique kpoints. Parallelization by kpoints was motivated by the discovery that, if you run a calculation like the 3x2x2 one here at 48 cores, then at 60, then at 120, then at 240, you find out you don't cut your time by 20%, then 60%, then 80%. In fact, you might not cut your time at all! The NPAR flag only helps so much, but when you've got lots of processors, VASP still isn't terribly efficient. (Again, this comes back to wanting NPAR ~ sqrt(#cores) but at the same time having NPAR divide things into 12-core chunks.) But, what you can do is start at 48 cores with KPAR=1, then go to 96 cores with KPAR=2, then to 240 cores with KPAR=5, or even 480 cores with KPAR=10, and you will see nearly linear improvements in your efficiency. (Or, 60 cores at KPAR=1 to 120 cores at KPAR=2 to 300 cores at KPAR=5, etc...) The order is: divide kpoints over processors first; THEN implement parallelization by bands. And KPAR must evenly divide both the total number of processors and the total number of kpoints.

So with 240 bands, 10 kpoints, and 300 cores, I could set KPAR=5 to divide my 300 cores into 5 groups of 60, with each group doing 2 kpoints instead of all 300 cores doing all 10 kpoints. Then I set NPAR=5 to divide each group of 60 into 5 groups of 12 for band parallelization. The overall result would be a very efficient calculation: it would run nearly as quickly as a 48-band, 2-kpoint calculation would on 12 cores.
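That 300-core split works out as plain arithmetic. A sketch of the bookkeeping (variable names are mine; only NPAR and KPAR are actual VASP flags):

```python
# The KPAR/NPAR split from the 300-core example above.
cores, kpoints, nbands = 300, 10, 240
kpar, npar = 5, 5

# KPAR must evenly divide both the core count and the kpoint count.
assert cores % kpar == 0 and kpoints % kpar == 0

cores_per_kgroup = cores // kpar               # 60 cores per kpoint group
kpts_per_kgroup = kpoints // kpar              # each group handles 2 kpoints
cores_per_band_group = cores_per_kgroup // npar  # 12 cores: one full node
bands_per_group = nbands // npar               # 48 bands per band group
```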

My empirical observation is that you want KPAR ~ sqrt(# of unique kpoints), so KPAR=10 for 10 kpoints is probably not better than KPAR=5.

So, to sum up:

1. determine NBANDS. It must be at least large enough to include all your electrons, and the smaller your band gap, the more bands above the Fermi level you'll want. Whatever value you land on, round NBANDS up to a multiple of 12 on Hopper.
2. determine how many cores you want to use (ignoring kpoints). This will be an even divisor of NBANDS. Set NPAR accordingly.
3. determine how many unique kpoints you have. Set KPAR to be, say, 4 (if you have 4, 8, 12, etc unique kpoints), and multiply your # of cores in step 2 by 4 as well.
4. Let 'er rip!
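The four steps above condense into a quick sanity check. This is a hypothetical helper reflecting my reading of the rules of thumb in this post, nothing more:

```python
# Check a proposed (NBANDS, cores, NPAR, KPAR) combination against the
# rules of thumb above, for a machine built from 12-core processors.
def sane_settings(nbands, cores, npar, kpoints, kpar, cores_per_node=12):
    if cores % kpar or kpoints % kpar:       # KPAR must divide both
        return False
    per_kgroup = cores // kpar               # cores within one kpoint group
    return (nbands % per_kgroup == 0         # step 2: cores evenly divide NBANDS
            and per_kgroup % cores_per_node == 0
            and per_kgroup // cores_per_node == npar)  # one node per band group

# The worked example: 240 bands, 300 cores, NPAR=5, 10 kpoints, KPAR=5.
ok = sane_settings(240, 300, 5, 10, 5)
```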

I hope that helps! If other VASP users have other experience with this, please chime in!

Best way to test for optimal settings of KPAR/NPAR/LPLANE

Posted: Thu Jul 18, 2013 5:03 pm
by hachteja
never mind, replied to wrong topic