PMFAST : Distributed / Shared Memory 2-Level Particle Mesh N-Body Implementation
Copyright (C) 2004 Hugh Merz - merz@cita.utoronto.ca

This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your option) any later
version.

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with
this program; if not, write to the Free Software Foundation, Inc., 59 Temple
Place, Suite 330, Boston, MA 02111-1307 USA

------------------------------------------------------------------------------

Feb. 2nd 2005 (revised for fftw-only version of code)

PMFAST : a quick overview.

If you would like details on how the code was designed, you may be interested
in reading the paper available on the PMFAST webpage.

PMFAST is a parallel 2-level grid implementation of the particle mesh
algorithm. The total size of the grid is determined from the following
formula:

   LPS = ( LF - 2 * LFB ) * NN

where LPS is the 'length of physical space', or the box size in fine mesh
cells, LF is the length of each fine grid section, LFB is the length of the
fine grid buffer (for the included kernels LFB=24), and NN is the number of
MPI nodes you are planning on using.

In this version of the code (which uses the fftw-3.0.1 serial FFT library
instead of the IPP library) the restriction on LF has been relaxed to allow a
greater number of possible geometries. Currently the allowable values of LF
are restricted by the decomposition of the coarse mesh, LC, which is defined
by:

   LC = LPS / 4

First, LPS must be evenly divisible by the grid ratio, currently fixed at 4.
In addition, the coarse mesh must be evenly divisible by the total number of
processors used:

   mod( LC , NT * NN ) = 0

where NT is the number of threads (processors) per node. This means that LF
must satisfy:

   mod( { [ LF - 2 * LFB ] * NN } / 4 , NT * NN ) = 0

We currently perform simulations using 1 particle for every 8 fine mesh
cells, so the total number of particles used in the simulation, NP, is given
by:

   NP = ( LPS / 2 ) ** 3

We appreciate feedback, bug reports and suggestions for the code. Please send
any comments you may have to the maintainer's email address located on the
PMFAST website. We will also do our best to help people set up and execute
the code on their computing platform.

---------------
* Requirements:

   F90 compiler
   FFTW 2.1.5
   optimized serial FFT routine (FFTW 3.0.1)

Notes on requirements:

Our current production version compiles with the Intel Fortran Compiler,
which is currently available from intel.com free of charge (non-commercial
version). If one would like to employ the shared-memory parallelization of
the code, the compiler must support OpenMP.

FFTW 2.1.5 is required since there is currently no MPI support in version 3
of the library. It should be installed to support single precision and to
include MPI transforms, but without threading. For more information or to
download: http://www.fftw.org

PMFAST uses single precision for all of the major data structures, so all
libraries or external code should be compiled to support this.
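The mesh geometry constraints given in the overview above can be verified
before editing any header files. The stand-alone program below is only a
sketch (it is not part of the PMFAST distribution); it evaluates the formulas
from the overview for a set of illustrative parameter values and stops if a
divisibility condition fails.

   program check_geometry
     implicit none
     ! illustrative values -- substitute your own LF, NN and NT
     integer, parameter :: LF  = 128   ! length of each fine grid section
     integer, parameter :: LFB = 24    ! fine grid buffer (included kernels)
     integer, parameter :: NN  = 2     ! number of MPI nodes
     integer, parameter :: NT  = 2     ! threads (processors) per node
     integer    :: LPS, LC
     integer(8) :: NP

     LPS = ( LF - 2 * LFB ) * NN       ! box size in fine mesh cells
     if ( mod( LPS, 4 ) /= 0 ) stop 'LPS must be divisible by the grid ratio (4)'
     LC  = LPS / 4                     ! coarse mesh size
     if ( mod( LC, NT * NN ) /= 0 ) stop 'LC must be divisible by NT * NN'
     NP  = int( LPS / 2, 8 ) ** 3      ! 1 particle per 8 fine mesh cells

     write(*,*) 'LPS =', LPS, '  LC =', LC, '  NP =', NP
   end program check_geometry

For the LF=128, NN=2 example used later in this file this prints LPS=160,
LC=40 and NP=512000.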
The production version of the code (also available on the website) currently
uses a serial FFT routine from the Intel IPP library. Since this library is
not freely available and restricts the size of the fine mesh to a power of 2,
we have released this version of the code, which uses the serial FFT from
FFTW 3.0.1. We have found it to run at comparable speeds (less than a factor
of 2 slower, even for non-power-of-2 mesh sizes), which should make the code
much more accessible and portable. Please see http://www.fftw.org to obtain
this library. It should be installed to support single precision and without
threading support (as it is called by multiple threads simultaneously within
PMFAST).

---------------
* Instructions:

1) Download and untar (tar -xvzf) the code and initial conditions.

2) Edit parameters in the header files:

      iopar.fh    [input/output file paths]
      cosmopar.fh [cosmological parameters]
      simpar.fh   [simulation parameters]

   The parameters are described within the files.

   ** Optional program execution modes:

   * Pairwise Force Testing:

   If one would like to execute pairwise force testing, the following needs
   to be set:

      PAIRPATH in iopar.fh (output location for pair data)
      pairtest = .true. in simpar.fh
      NP = 2 in simpar.fh

   In this mode, 2 particles are placed on the grid and their positions and
   velocities are written to disk following the fine mesh velocity update
   (fine_pair.dat) and the coarse mesh velocity update (total_pair.dat).
   Parameters (a,G,dt) are scaled to 1. Each line of the above files
   contains:

      format='12f20.10', x1,y1,z1, vx1,vy1,vz1, x2,y2,z2, vx2,vy2,vz2

   One may also wish to edit the set_pair routine in pairs.f90 to modify the
   scheme used to set the pairs on the grid. A simple analysis program that
   reads in the above pair data files and calculates the fractional error in
   the forces is located at utils/pair_check/pair_check.f90
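   For reference, the pair data files can also be parsed directly. The
   program below is only a minimal sketch (it is not pair_check.f90 and does
   not compute force errors); it reads each record in the 12f20.10 layout
   described above and prints the pair separation and the magnitude of the
   relative velocity. The file name is illustrative.

      program read_pairs
        implicit none
        character(*), parameter :: fn = 'fine_pair.dat'  ! or total_pair.dat
        real(4) :: p(12)      ! x1,y1,z1, vx1,vy1,vz1, x2,y2,z2, vx2,vy2,vz2
        real(4) :: dr(3), dv(3)
        integer :: ios

        open(10, file=fn, status='old')
        do
           read(10, '(12f20.10)', iostat=ios) p
           if (ios /= 0) exit
           dr = p(7:9)   - p(1:3)     ! pair separation vector
           dv = p(10:12) - p(4:6)     ! relative velocity of the pair
           write(*,*) 'r =', sqrt(sum(dr*dr)), '  |dv| =', sqrt(sum(dv*dv))
        end do
        close(10)
      end program read_pairs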
   * Generation of Density Projections:

   input/projections can be edited to include a list of redshifts at which
   density projections are generated. For each redshift indicated, three
   projection files will be created on each node, corresponding to the
   projection of the overdensity to the midplane of each slab in each
   orthogonal dimension. These must be recombined following program
   execution. The projections are written to the location indicated in
   PROJPATH. The amount of disk space required for each output is:

      [ (4*LC)^2 + 2*(4*LC)*(4*LC/NN) ] * 4 bytes

   utils/combine_projections/combine_proj.f90 is a simple program that reads
   in the constituent projections (after placing the files within the same
   directory) and generates full box length projections.
   utils/combine_projections/topgm.f90 is a file converter that will convert
   the binary full box length projections to portable greymap (.pgm) files
   for viewing in a graphics program. If one leaves input/projections empty,
   no projections will be generated.

   * Restarting from a checkpoint:

   In order to restart from a checkpoint, one needs to set INIT_VAL=9 in
   simpar.fh, as well as selecting the redshift of the checkpoint that one
   would like to restart from (z_restart). All necessary information to
   restart the simulation is contained in OUT1/###.#params.dat, where ###.#
   is set from (z_restart), as well as the particle list checkpoint files.
   These are stored at either OUT1/xvp#.dat or OUT2/xvp#.dat, whichever is
   indicated through the params file. The # in the checkpoint files
   corresponds to the MPI rank of the node.

3) Edit input/checkpoints

   This file lists the redshifts at which particle list checkpoints are
   written to disk. In order for the simulation to run correctly, at least
   one redshift should exist in this file at which the simulation is to
   stop. The program will complete upon writing the final checkpoint or
   attaining the maximum number of timesteps, whichever comes first. Make
   sure there is sufficient disk space for the checkpoint files at the paths
   indicated in iopar.fh

4) Distribute initial conditions

   Each node's initial conditions should be placed in the directory INICOND
   and should correspond with the following format:

      filename = 'xvp#.init'  [# = rank of MPI process for node]
      format   = 'binary'
      contains:
         integer(4) nploc
         real(4)    xvp(1:6,nploc)

   If one is using the initial conditions obtained from the website (or from
   any other serial generator, for that matter) then they must be decomposed
   into the above format; a simple code to do so can be found in
   utils/decompose_ic/decompose.f90. Note that although particle positions
   in the x and y dimensions span the entire global fine mesh (0:LPS], the z
   dimension (along which the simulation is decomposed) spans (0:LPS/NN] for
   each node. As such, the z position of particles needs to be offset to
   node-relative coordinates if one is decomposing a cubical distribution.
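   For illustration, the following is a minimal sketch in the spirit of
   utils/decompose_ic/decompose.f90 (it is not that program): it reads a
   serial xvp.init, bins the particles into NN slabs along z, shifts z to
   node-relative coordinates, and writes one xvp#.init per node. It assumes
   the serial file uses the same single-array binary layout described above;
   the nc and nn values are illustrative, and form='binary' follows the
   Intel compiler convention (other compilers typically use an unformatted
   access='stream' open instead).

      program decompose_sketch
        implicit none
        integer, parameter :: nc = 160, nn = 2            ! nc = LPS, nn = NN
        real(4), parameter :: slab = real(nc) / real(nn)  ! z-extent per node
        integer(4) :: np, nploc
        integer :: rank, i
        real(4), allocatable :: xv(:,:)
        character(len=16) :: fn

        open(10, file='xvp.init', form='binary', status='old')
        read(10) np
        allocate(xv(6, np))
        read(10) xv(1:6, 1:np)
        close(10)

        do rank = 0, nn - 1
           write(fn, '(a,i1,a)') 'xvp', rank, '.init'
           open(11, file=fn, form='binary', status='replace')
           nploc = count( xv(3,:) > rank*slab .and. xv(3,:) <= (rank+1)*slab )
           write(11) nploc
           do i = 1, np
              if ( xv(3,i) > rank*slab .and. xv(3,i) <= (rank+1)*slab ) then
                 ! shift z to node-relative coordinates, spanning (0:LPS/NN]
                 write(11) xv(1:2,i), xv(3,i) - rank*slab, xv(4:6,i)
              end if
           end do
           close(11)
        end do
      end program decompose_sketch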
5) Edit the Makefile and compile pmfast

   Change the path to the fftw library, the fortran compiler, and the
   related flags. Make sure that the use of the -openmp flag corresponds to
   the proper syntax for the compiler that you are using. Depending on how
   your compiler handles fortran modules, modifications to the file
   dependencies for qsorti.f90 may be required (it works with Intel v8).
   Compile by running 'make'.

6) Compile the filefftw program.

   This is a background program that performs MPI FFTs using a different
   number of processors than are defined in the main pmfast program. It is
   located in the filefftw directory and can be compiled using the
   buildffftw.csh script after modifying the MakeIA32 / MakeIA64 make files
   to point to the proper f90 compiler, MPI implementation and fftw library.
   Also be sure to change the fpath variable in filefftw.f90 to point at the
   same directory as FFTWSWAP in iopar.fh. The buildffftw.csh script accepts
   3 arguments: the size of each fine grid section (LF), the number of nodes
   (NN), and the number of cpus / node (NCPUPN).

7) Start MPI with the total number of nodes and cpus that you would like to
   use. This should be NCPUPN * NN processes in total.

8) Start the filefftw program using all of the processes (NCPUPN * NN). Make
   sure that the filefftw processes are numbered contiguously and not
   striped across the nodes (this is the default behaviour in LAM - i.e. the
   first number_of_cpus/node processes all exist on node 1, etc).

9) Start the pmfast program, using 1 process per node (NN total). In LAM
   this can be achieved by using 'mpirun n0-7 -np 8 pmfast' with NN=8, for
   example.

10) Output is placed in the directories indicated in iopar.fh. Should one
    desire to restart the program, this can be done by editing simpar.fh to
    select the redshift at which the program should be restarted (given that
    a checkpoint was performed at that redshift).

---------
* Output:

Output files include particle checkpoints, density projections and a timestep
record. A new addition is the mass power-spectrum on the coarse mesh, which
is enabled in simpar.fh. All of these can be found in the directories
specified in iopar.fh.

Particle initial conditions are read in from binary data files, which contain
the number of particles (4 byte integer) followed by a sequential list of the
particle positions and velocities (x,y,z,vx,vy,vz). Units are in fine grid
cells. In fortran 90:

   read (10) num_particles, xv(1:6,1:num_particles)

Particle checkpoints are saved in the same fashion, as well as a parameter
file labelled by redshift that includes the parameters required to restart
the run. If rotate_cp=.true. in simpar.fh, particle checkpoints will
alternate between two locations and are overwritten every second checkpoint.
Make sure there is enough disk space for all of your desired checkpoints if
you set this flag to .false.

A small program that reads in a checkpoint file and writes a thin slab of
particle positions is located at utils/slice_proj/slice_proj.f90, along with
a supermongo macro to read in and plot the slice. Density projections are
explained in the optional program execution modes section above.

------------------------------
* Example of compiling and executing the code.

In this example we will simply put all of the files in one place on disk and
use 2 nodes, with a fine grid section of 128 cells and 2 cpus / node.

1) download the initial conditions: (128 - 48) * 2 = 160, i.e. a 160^3 total
   mesh size

2) download the pmfast tar file

3) unpack the pmfast tar file: tar -xvzf pmfast.tar.gz

4) edit iopar.fh: set all file paths to point to your desired location

5) edit cosmopar.fh: set the cosmological parameters to correspond with
   those used to generate the initial conditions.

6) edit simpar.fh and set:

      LF = 128
      NN = 2
      NCPUPN = 2
      NT = 2
      MAX_PARTICLES = 512000 / 2 * ( 1 + 2 * 24 / 128 ) * ( 1 + 50 / 100 )
                    = 528000  (expecting a particle imbalance of up to 50%)
      MAX_BUFFER = 528000 * 24 / 128 * 6 = 594000
      MAX_TAG = 52800
      MAX_NTS = 3000
      TS_RATIO_MAX = 3  (we will calculate a maximum of 3 fine timesteps per
                         sweep)
      DT_SWEEP_SCALE = 1.0
      INIT_VAL = 8

   All of the other parameters we will leave as is.

7) edit input/checkpoints and input/projections; enter the redshifts at
   which you would like checkpointing and density projections to be
   performed.

8) decompose the initial conditions: edit utils/decompose_ic/decompose.f90
   and set:

      nc = 160
      nn = 2

   compile: f90 utils/decompose_ic/decompose.f90 -o decompose.x

   Run decompose.x in the same directory as xvp.init; it should produce
   xvp0.init and xvp1.init. Place these files in the location specified in
   iopar.fh, with xvp0.init on the first node (rank 0) and xvp1.init on the
   second node (rank 1).

9) edit the Makefile. Edit library paths, compilers and compiler flags to
   suit your particular installation environment. Make by executing 'make'.

10) compile filefftw:

      cd filefftw
      buildffftw.csh 128 2 2

    You may need to edit the makefile (either MakeIA32 or MakeIA64) in the
    same fashion as the pmfast Makefile. Make sure you run this script in
    the filefftw directory to avoid clobbering the pmfast Makefile.

11) start MPI. With LAM, one can run 'lamboot lamnodes', where lamnodes is a
    list of the nodes and how many cpus we are using, e.g.:

      host1 cpu=2
      host2 cpu=2

12) now run the filefftw program using the above two nodes:

      mpirun -v -np 4 filefftw/filefftw

13) and launch pmfast with one process per node:

      mpirun -v n0-1 -np 2 pmfast

14) output should appear in the directories specified in iopar.fh.

15) to generate a "thin-slice" of the particle distribution, edit and
    compile utils/slice_proj/slice_proj.f90. This will create a formatted
    output file containing all of the particle positions within the slice
    boundaries. This file can be plotted using the included supermongo
    macro, /utils/slice_proj/slice.sm, or with another plotting utility.
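    The sketch below is a stand-alone illustration of this kind of
    thin-slice extraction (it is not slice_proj.f90): it reads one node's
    checkpoint in the binary layout described in the Output section and
    writes the positions of particles whose z coordinate falls inside an
    illustrative slab. The file names and slab bounds are placeholders, and
    form='binary' again follows the Intel convention.

      program slice_sketch
        implicit none
        character(*), parameter :: cp_file = 'xvp0.dat'  ! checkpoint, rank 0
        real(4), parameter :: zmin = 0.0, zmax = 4.0     ! slab, fine grid cells
        integer(4) :: nploc
        real(4), allocatable :: xv(:,:)
        integer :: i

        open(10, file=cp_file, form='binary', status='old')
        read(10) nploc
        allocate(xv(6, nploc))
        read(10) xv(1:6, 1:nploc)
        close(10)

        open(11, file='slice.dat', status='replace')     ! formatted output
        do i = 1, nploc
           if ( xv(3,i) > zmin .and. xv(3,i) <= zmax ) &
              write(11, '(3f12.4)') xv(1:3,i)
        end do
        close(11)
      end program slice_sketch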
16) to create viewable portable greymap files of the density projections,
    one can compile and run utils/combine_proj/combine_proj.f90 to create
    full box projections, followed by /utils/combine_proj/topgm.f90 to
    convert the projections into the .pgm format. combine_proj requires MPI
    to be linked in, and will have to be edited such that:

      LPS = 160
      NN = 2

    You will want to run the executable from the location of the
    checkpoints:

      mpirun -v n0-1 -np 2 combine_proj.x

    Following this you can edit topgm.f90 to select which checkpoint you
    want to convert, compile it, and view the .pgm in your graphics viewer
    of choice.
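    For reference, a portable greymap in the plain (P2) format is just a
    short ASCII header followed by whitespace-separated grey levels. The
    program below is only a sketch of the kind of conversion topgm.f90
    performs; it assumes (hypothetically) that a combined projection file
    holds a single LPS x LPS array of real(4) overdensities, and the input
    and output file names are placeholders.

      program topgm_sketch
        implicit none
        integer, parameter :: LPS = 160    ! full box size in fine mesh cells
        real(4) :: proj(LPS, LPS)
        real(4) :: pmin, pmax
        integer :: i, j

        ! assumed layout: one LPS x LPS real(4) array per combined projection
        open(10, file='proj_xy.dat', form='binary', status='old')
        read(10) proj
        close(10)

        ! scale linearly to 0-255 grey levels and write a plain (P2) .pgm
        pmin = minval(proj)
        pmax = maxval(proj)
        if (pmax <= pmin) pmax = pmin + 1.0   ! guard against a flat field

        open(11, file='proj_xy.pgm', status='replace')
        write(11,'(a)') 'P2'
        write(11,'(2i6)') LPS, LPS
        write(11,'(i4)') 255
        do j = 1, LPS
           write(11,'(16i4)') ( nint( 255.0*(proj(i,j)-pmin)/(pmax-pmin) ), &
                                i = 1, LPS )
        end do
        close(11)
      end program topgm_sketch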