Scientific Computing Project
Spring 2002
In both astronomy and electrostatics one is often interested in examining the development of physical systems consisting of a large number of independent particles acting under the predominant forces. With application in areas such as galaxy dynamics and plasma physics, the issue is to solve the so-called N-body problem: given N particles, what are the total forces exerted on each of them by the others? Given the answer we can simulate the development of the system, via traditional integration techniques.
The naïve approach solves this problem in quadratic time. This project will examine how well this approach fares compared to more sophisticated techniques and how well using large-scale cluster computing hardware applies to solving the problem. On this type of hardware, traditional complexity theory might not apply at all well. Traditionally one is concerned with the number of single operations ones algorithm might execute, while in large scale computing the number of single CPU instructions issued, will probably not matter much and execution time will depend largely on the number of cache-misses and network data transmission collisions and be limited by the network transmission capabilities. An algorithm running in quadratic-time in the traditional RAM model is thus not guarantied to be slower than, say an O(NlogN) time algorithm, even for considerably large N.
This project concerns it self with real-time simulations. This is obviously not to be taken literally, since real-life galaxy collisions occur over millions of years. Rather the goal is to be able to change parameters, such as initial particle distributions and see the effects immediately in a simulation scaled in time, so the collision can be seen as a smooth animation, lasting a couple of minutes, while the simulation is running.
Let us first examine the naïve approach, so as to understand the details
of the problem. In both gravitational and electrostatic systems the
interparticle forces are of the form
where
denotes the force exerted on the i'th
particle by the j'th,
is the
position of the i'th particle relative to the j'th,
ci is the charge of the particle and G the
force constant. In gravitational systems the charge is actually exactly equal
to the inertial mass, so the equation for the acceleration of the particles
from Newton's second law of motion could be simplified. The N-body
problem can then be solved by computing for each particle the sum
which can be done trivially directly in time O(N²).
Again we can apply a law by Newton, namely the third and set
so as to reduce the number of times
is evaluated, but only by a linear number.
Another approach is to try to compute the potential function
,
such that the force exerted on a particle with charge ci at
position
is given by
where
is usually called the field. Since the
gradient is a linear operator, the field must be a linear combination of all
single contributions, so
which satisfies
given
.
This equation generalizes nicely to arbitrary charge distributions, so we are
no longer forced to consider point particles:
where ρ(r′) is the charge distribution and dτ′
is an infinitesimally small region of space around
.
Computationally this function is not easily represented. We cannot generally
hope to obtain a simple (semi-)analytical representation, so that given we can
compute in constant time. What we can do trivially, is to store the N
pairs of and cis, and
s when
asked for the value of
, simply evaluate
.
This will take time O(N), and since we need to know the
potential at the location of each of the particles, we are back in an O(N²)
solution, but with even more work, because we also need compute the gradient.
When given the result of
or
,
doing the integration step is a trivial task, thus we are led to believe that
solving the N-body problem will be the execution time bottleneck. So
let us take a look at
As mentioned, the simulation will be run in real-time off a cluster-type supercomputer. This creates an interesting situation. Normally scalability is key, and indeed this project will focus on it too, but one will soon realize an upper limit to the complexity scene, physical system, to be simulated. In general, we desire for the system to be able to handle 41% more particles given twice as many CPUs (in the case of an O(N²) algorithm). This however, is not possible; consider the following setup:
where the front-end does the actual graphical animation and the back-end does the simulation. Only one machine will do the animation (makes no sense to have several machines do part of or the entire scene) and since we are in a cluster configuration, this setup is the only one that makes sense. Now suppose we want a smooth animation that is, the scene should be updated say 25 times per second. Suppose also, that it is described by a set of vectors that could potentially all change from one scene to the next. With a frame rate of >25 frames⁄s and a data transmission link between the front- and back-end of <100Mb⁄s we have a scene complexity of <4Mb⁄frame. Suppose the vectors are described in single precision in 3-space, that is 96bits per vector and we get <41,500vectors⁄frame. In the present system each particle is described by at least one such vector, which gives us a maximum number of particles simulated of at most 41,500. And these are very idealized settings. The conclusion must be, that for N > 41,500, we cannot hope for CPU scalability, only for network bandwidth scalability. The sad thing is that real-life implementations of these types of cluster-computers have a tendency to downplay the importance of these issues, and so all current and up-and-coming (as of Q202 in Denmark) implementations of such systems are based on 100Mb⁄s Ethernets, even though 1Gb⁄s nets are becoming affordable.
Having argued, that the setup should consist of a single machine doing the graphical animation of the system being modeled, let us now turn our attention to the job of the back-end. So what will go on behind the scenes, so to speak? This is where the actual simulation will take place, where the N-body problem will be solved and the integration steps computed.
The computational model is one in which there are several, say P, processors working in parallel and where the change to data by one processor will not visible to the others. Furthermore it is possible to pass messages containing any type of information from one processor to another, however this will be considered costly.
Suppose now, we want to evaluate
.
The input would be an array of particle positions
and the output would be the total force
exerted on
each of them:
A first phase could be to distribute
particles
among the P processors and compute the interparticle forces among
those particles located on each individual processor:
We have now computed part of the accumulated force on each particle, using a
minimum (P messages of size
each) of
network traffic; each particle has a hold on the contribution to the total
force on it from
of the other particles. Actually,
we are surprisingly far from done; we have to look at all N2
pairs of particles, but all we have looked at
is
pairs. Having only one processor, we would obviously be done, but the more
processors we have at our disposal, the further from the goal we are. How do we
proceed from here still minimizing message passing? What should be the next
phase? We could pick pairs of processors and swap, say all the odd numbered
particles (both position and the accumulated force) between such two, using 2P
messages (each half the size of before) and compute the odd-even particle
interaction. How far did that get us? Well, actually again no more than
pairs were handled. Does this generalize well? I do not think so. It would seem
we would need at least P such phases, which would require us to pass
P2 messages with a total size of O(NP).
Having thought of nothing better, we could distribute all particles to
all processors and not be sending any more information than that. A
brutally simple approach would thus be to pass out all particle positions to
all processors, but instead of computing all the forces, each processor would
compute the force on each of particle in a pre-designated set of size
.
The processor would then do the integration step on the particles in its set
and before next iteration pass the updated positions to the rest of the
processors.
Analyzing what went wrong, when trying to compute the forces locally on each
processor, we realize that getting hold of the contributions from the
particles, not on the processor, was a difficult task. Looking at the potential
formulation of the problem, we could hope to simply compute the potential
function from the particles on each processor and then pass the function it
self to the other processors, which will then sum up the potentials (recall, it
was a linear equation) and evaluate it at the position of each of its
particles. In the naïve way, however, the potential function was
represented was pairs of cis and
s
, so we soon realize that this is too much information and that we will be
communicating at least as much as the direct force approach. Can this be
remedied? Luckily it can! It turns out there is a simple way of representing
the potential of a gravitational or electrostatic field, namely using so-called
multipole expansions. Recall the cosine relation
where θ is the angle between the vectors
and
′. This gives us
Inserting into
,
we get
where Pn denotes the n'th Legendre
polynomial. Note that this expansion is around the origin, and that one could
easily compute the expansion around an arbitrary point, by simply translating
the distribution function.
Returning to our case, where the charge distribution is composed of point
charges, we substitute ρ(r′) with the Dirac
delta function and get
Now, we can in a sence compute
for any n
we like for any subset of particles we like. These coefficients will then act
as our representation of the potential produced by that particular subset. We
will obviously need to truncate the series, but for some suitably large p
(probably no larger than say 15), the first t coefficients will
yield sufficient precision. So now we can represent the potential due to the
particles not located on the processor and using a constant amount of
information. This gives us a new strategy: We associate
particles with each of the P processors. Let each of them handle
their particles locally, using a direct pair wise approach. Compute the
multipole expansion and pass it to the other processors. Upon receiving the
multipole expansion, compute the gradient and evaluate it for each associated
particle, adding up the contributions from the expansions from each of the
other processors. We now only need to pass O(P) messages
with a total size of O(P). And not only that, except from
the pair wise computation done locally on each processor, we only do O(N)
work. Simulating N virtual processors on the P physical
ones, we would in theory have solved the N-body problem in time O(N).
Yes, this is too good to be true, however. Notice, that
will diverge, if ε > 1, that is if
.
Hence we must not evaluate the multipole expansion of the potential in points
that are closer to the origin than any of the contributing particles! So this
solution is no good, if we cannot control the positions of the particles. The
solution was first presented in 1983 by Barnes and Hut. Their algorithm is not
designed for parallelism, but uses the divide-and-conquer strategy, so it
parallelizes easily. Furthermore, in the original article, they only used the
zero'th therm in the multipole expansion; the so-called monopole term,
which basically is equivalent to the entire charge situated at the
center-of-charge. Using this rather brutal truncation they were nevertheless
able to achieve an accuracy of about 99%. The idea is to recursively partition
space into non-overlapping regions, and store the particles in those regions in
a tree structure:
Each external node would contain some constant number of particles. Such a tree
(called a quadtree in 2D, as in the above figure, and an
octtree in 3D) could be build bottom-up in time O(hN),
where h is the height of the tree. When this was done, the tree is
traversed bottom-up, computing the multipole expansion of the potential around
the center of the region, due to the particles in the region corresponding to
each node. If the node is external, the expansion is computed directly, if it
is internal the expansion of the children is translated to give eight new
expansions, but around the center of the current node, and the simply added to
give the expansion of the potential due to the particles in the sub-tree. The
tree is then traversed one final time top-down, accumulating the potential
contributions of the nodes not in the current sub-tree. In the example above,
when visiting R0, R1 is visited
with the potential contributions from R2, R3
and R4, which was stored in the nodes during the previous
pass. Note that we can now use the multipole expansion of the potential due to
particles in one region to compute the potential at the positions of particles,
because
(at least for a considerable part of
the region). Both of these passes take time O(N), because
only a constant number of operations is done per node, so the total time to
solve the N-body problem is O(hN). Actually
the original Barnes-Hut algorithm did a root-to-leaf traversal for each
particle, applying potential contributions for each node on the way down, so
the total time of the down-ward pass becomes the sum of the height if all the
leaves, which is again O(hN). One is thus tempted to
choose the regions, so that the tree becomes balanced and the time becomes O(NlogN),
by e.g. partitioning the region such that an equal number of particles ends up
in each child region. But that could result in too large a part of the regions,
where
; imagine particles distributed on a
string along the “bottom” of a region – that would result in
a very slim child region in which where the particles contributing to the
expansion are very distant from the center, and thus space, where
would extend well into the child region above it. It is a viable option in the
name of speed, but precision will have to be sacrificed.
In a parallel setting each top-level region could be handled by a single processor, but the computational and administrative (i.e. building the tree, which is not easily parallelized) overhead is considerable and recalling the limit on N, given the real-time requirement, it is not clear which approach will be fastest.
I have implemented the quadratic time algorithm as described in sections 2 and 3. Next I will elaborate on and present the details of the problems associated with the multipole approach and how I solved and implemented them.
The alert reader would have probably already spotted a flaw in the reasoning
about
.
Recall the objective was to get a function that in a compact manner (constant
space and constant evaluation time) represented the potential due to a
collection of particles. In
,
however, the parameter θ is the angle between the point at
which we are evaluating the potential and the i'th particle,
and should as such be updated for each source charge, so how we compute
for any n we like for any subset of particles, is not entirely
clear. Luckily(!) the addition theorem states exactly what we need:
The n'th Legendre polynomial evaluated at cosine to the angle
between two points with spherical coordinates (r,θ,φ)
and (r′,α,β) is given by
where
denotes the spherical harmonics (
denotes the associated
Legendre polynomials, but I'll stop the details here. Suffice it to say,
they are real-valued). Inserting we get
and the potential becomes
so
where
is completely independent of the point of evaluation, and thus can be stored as
a representation of the potential due to a collection of particles.
Preliminary testing with these equations highlighted major problems: computing the expansion of the nodes of a tree with 40,000 particles took over five seconds on a 1.6GHz CPU.
The complexity is primarily due to two things: we are now dealing with spherical coordinates and with complex numbers. Also a lot could be gained with precomputing and tabulating functions or part of them. Fortunately to some extent, they eliminate each other. The particles comes in Cartesian coodinates, so which look horrible, but inserting into the associated Legendre polynomials we discover something nice: we have eliminated two trigonometric functions. Now, can we do the same with the rest of the trigonometric functions, namely those in the complex exponential term? Unfortunately, no, because there is a pesky little m in there, so where we need the values for m = −n…n. What we can do, however, is tabulate these values and in doing so exploit the inherent symmetry of the trigonometric functions, and the fact, that modern CPUs allows for fast computations of the associated trigonometric function (computing cosx takes 88 cycles on and AMD Athlon, while computing sinx only takes 19 additional cycles (or vice versa), if it is done while the processor is still confident that the x has not changed its value). Here is a simple illustration:
// computing xm = eimφ, for m
= −n…n
φ ← atan(ry⁄rx)Re(x0) ← 1; Im(x0) ← 0for m = 1…n doRe(xm) ← cos(mφ);
Im(xm) ← sin(mφ)Re(x-m) ← Re(xm);
Im(x−m) ← -Im(xm)od
(Note: floating point multiplications is faster than additions) so we only
spend 107p+121 cycles computing trigonometric functions, each time
we either add a particle to an expansion or evaluate the potential (p
is the truncation order and 121 for the atan(x⁄y)).
That is pretty good, considering the heavy use of complex exponentials and
spherical coordinates. When storing the multipoles, we can use the fact that x−m
is actually the complex conjugate of xm. This is a
tremendous gain when we have to transmit the potentials across a computer
network.
Tabulation goes even further, in that we can compute a table
once. Note that if the compiler were to unroll the loops that compute the sums,
would be compile-time computable, so the use of tables would not benefit us.
gcc, however, does not unroll the loops, even with -O3 and -funroll-loop flags
set (p is known at compile-time as a class template parameter), so
with that particular compiler tabulation is preferred. Modern CPUs also employ
automatic cache prefetching (and some compilers even generate explicit prefetch
instructions) so the penalty associated with accessing the tables should be
minimal. The associated Legendre polynomials are defined as recursive
differential equations, but given the point of evaluation, they can be
evaluated recursively (
depends on
for some m′ < m and n′ < n).
Using dynamic programming all the polynomials we need, can be evaluated just
once at a given point. A table containing these values is thus build each time
a particle is added or the potential evaluated. Adding a particle with charge c
at position
, for instance amounts to computing
the associated Legendre polynomials at rx⁄r,
where
and adding
to
each of the
.
At the core of the algorithm is the region hierarchy. As stated, this hierarchy logically constitutes an octtree, where internal nodes have eight children, leaves have a constant number of particles and all nodes have an associated multipole expansion. The literature touts the fact that since each region is either a leaf in the tree, or it has exactly eight children, the tree can be laid out implicitly in an array where the j'th child of the node at index i is positioned at index 8i+j+1 and that this would decrease running time. As I see it, this only makes sense if there are either very few particles or they are uniformly distributed and the source of the potential is static. Otherwise there is no limit on the potential height of the tree (other than linear) which would result in a worst case size of the array being exponential. Furthermore it would be impossible to efficiently extract a particle from one cell and insert it into another.
I therefore decided to implement the tree using pointers and dynamic allocation of nodes. If large regions with few particles should exist, I simply use a leaf instead of an internal node. This also allows for the use of virtual functions and efficient restructuring of the tree. The interface of virtual functions defines the operations on the nodes of the tree. The important ones are listed here:
bool addParticle(const Particle& p, Internal*& pThis);
void computeExpansion();
void applyForce();
void applyForce(Particle* first, Particle* last) const;
bool move();
addParticle() adds a particle to the node. If this (the node the
function is called on) is an internal node, the child region in which the
particle belongs is located. If it exists it is recursively inserted, otherwise
a leaf is created and the particle is inserted into the new leaf. If this is a
leaf, the insertion can cause overflow (recall a leaf may only contain a
constant number of particles) in which case a new internal node is created and
the particles belonging to the old leaf is inserted into that node one by one.
The new node is returned through the pThis pointer, so the tree
can be maintained upwards. The function returns true in the latter case,
indicating that the old node should be deleted and false otherwise. addParticle()
also handles cases where the particle does not belong in the region represented
by this node. In that case the particle is recursively inserted into the
parent. If this is the root a new root is created, the particle inserted, the
old root attached and the new one returned through pThis. This way
it is possible to construct an entire tree using only addParticle()
calls.
computeExpansion() is the first pass of the algorithm as it was
described in section 4 It recurses to the leaves, where it
clears the associated multipole and adds to it the associated particles one by
one. On the way back, it computes the multipoles of the internal nodes by
translating the multipoles of their children and adding up the expansion
coefficients. The first applyForce() is the downward part of the
second pass, in which nothing is done in the internal nodes. The leaves applies
the interparticle force to the associated particles using the direct pair-wise
method and calls the second applyForce() on its parent with those
particles as arguments. The second applyForce() is intended to
mean “apply the force exerted by the particles in your region to these
particles.” The first step is to partition the incoming array of
particles into those that are too close to the origin of the expansion for the
expansion to converge and those that are not. The latter ones are handles by
evaluating the gradient of the associated multipole expansion at the point of
the particles. The first ones are then used as the arguments of a recursive
call the child regions. If this recursive call proceeds all the way to a leaf,
a direct pair-wise summation is used. This will not happen too often, though,
since only a constant number of leaves are too close to any given leaf.
move(), like applyForce(), does nothing in internal
nodes. In the leaves it performs an integration step on all the particles that
is it updates the velocity of the particles with the accumulated force and then
updates the position. This might mean that a particle physically leaves the
region associated with the leaf it is in. In that case move() simply
calls addParticle() on the parent node. This in turn can cause the
leaf to run completely dry of particles, in which case move() returns
true indicating, that the leaf can be safely deleted (which the parent will do
on the way back from the recursion, so something is actually sometimes done in
the internal nodes in this pass).
Turns out, that the algorithm, the mathematics and data structure maintenance was not the only difficult part of this project. Generating sensible physically realistic input is too. One thing is stellar distribution. Stars are not uniformly distributed in galaxies and heavy stars does not mingle with lighter ones. Even more challenging is stabilizing the galaxy, a problem I consider unsolved for now. The trick is to give each star the exact right velocity – not too much or the galaxy would disintegrate and not to little or the galaxy would collapse under its own weight.
The distribution I am using now is uniform on the two angular of the spherical coordinates and with the radial coordinate normally distributed. This creates a sphere of stars, with a high concentration towards the center. It is then scaled to form an elliptical shape, which I consider well… an approximation. My first attempt at setting initial velocities was to consider the total force on each particle and let it define a centripetal force pulling towards the z-axis. This in turn defines a velocity perpendicular to the z-axis that would keep the star in orbit around the axis, should the potential be spherically symmetric – again a crude approximation.
My next attempt will be aimed at keeping the total energy of the particles at zero.
The final algorithm now only consists of taking the stars as they are
distributed and call addParticle() with them on a singular node.
Then loop over calls to first computeExpansion() then applyForce()
and finally move() At this point the particles should be displayed
on screen, which is done by traversing the tree, collecting their positions and
sending them through the network to a remote host running an animation program.
An even faster (asymptotically) algorithm exists, called Fast Multipole Method.
It employs so-called local expansions along with multipole expansions. Local
expansions are used on the way down second pass (in the first applyForce()
call) to represent the contributions to the potential of the multipole
expansions of regions neighboring the region associated with the expansions.
This way the potential of all but the leaf regions can be collected in a single
function when traversing the path from the root to that leaf, eliminating the
need for the second applyForce() function to call recursively all
the way back to the root. Using this approach each internal node and each leaf
is only visited a constant number of times, thus yielding a truly linear
scaling algorithm. I implemented these local expansions, but working with them,
I realized that they would be too slow and cause too big an overhead, compared
to the h factor of the Barnes-Hut approach, when N < 40,000
(where typically h < 11).
Following up on the success of the clean interface of the virtual functions of
the nodes of the tree, I decided to continue that line into the parallel
version. In this version two new types of nodes are added, a Master Node and a
Remote Node, each implementing the interface described in the above section.
The program I have implemented uses in all nine computational nodes – one
master and eight slaves. The algorithm of the master node has the same
structure as the sequential algorithm. The only difference is that the
particular node, the four functions are called on, is a Master Node. The Master
Node implements the four functions in an RMI fashion, simply relaying the
function calls to the slaves, through messages sent over the network. The
slaves in turn would continuously receive these messages and dispatch calls to
a Remote Node, which basically wraps around a tree; when the calls are going
down the tree the Remote Node simply relays them to the root of a tree, when
they go up (e.g. when a particle has moved beyond the region associated with
the root, the root calls addParticle() on its parent, which is now
the Remote Node) messages is broadcast across the network and the appropriate
computational node will then pick it up (e.g. upon receiving an addParticle
message, a Remote Node checks to see if it can be inserted into the region
associated with the root of the tree, and if so calls addParticle()
on it, otherwise ignore it, because it belongs to someone else).
This approach seems nice and clean; however it only goes so far, if we want to
exploit the compact nature of our potential function representation. The real
culprit is the second applyForce() function. This function must
necessarily ascend all the way to the root, because no particle is distant
enough from the root region in that they are all contained in it, and on the
way must descend down neighbor region subtrees. When the call gets to the root
of the tree of a computational node it calls applyForce() on the
Remote Node, which is supposed to relay the call to the other slaves. As with
RMI, this involves serialization and transmission of the arguments. But if we
add it all up, all leaves will cause applyForce() calls on the
Remote Node, causing serialization and transmission of all particles in the
tree, in turn causing all particles to be transmitted in every integrationstep.
To remedy this, each the Remote Node will on completion of computeExpansion()
send the multipole associated with the top of the tree. The hope is that these
expansions are described with much fewer bytes.
The above figure shows where the use of the multipole expansion is disallowed.
As stated all particles will cause an applyForce() call to the
root, but only the ones inside the grey sphere cannot use the potential
associated with the white region and must call applyForce() on its
children. The particles in the grey zone, but outside the cyan zone can then
safely use those multipoles and so on. In my implementation each Remote Node
transmits the multipoles associated with the internal nodes of the top three
levels of its tree, which constitutes a message of constant size. Upon
receiving the multipoles from the other slaves, the computeExpansion()
function of Remote Nodes stores them for the applyForce() function
to use. When applyForce() is called on a Remote Node it first
checks the stored multipoles if it can use them. The particles in the blue zone
are too close to the white region, however and must be transmitted to the slave
responsible for that region, so it then can apply the force of its potential to
them.
Recall that the computational setup is as described in section 3; the system is being simulated on a remote (cluster-) computer and has to be displayed on the local host. Typically the host that the simulation is to be displayed on is not even the front end; a common scenario would be that the user logs in from a workstation to the front end, dispatches a job script, which executes the simulation algorithm on several nodes on the back end. Each one back end node cannot be expected to contain all the data of the simulation (here in particular, the positions of all particles), hence the data need to be gathered somewhere for display. The back end nodes are often protected by a firewall, so one cannot expect to be able to send data from them directly to the workstation, via traditional protocols, like X, indeed all connection-oriented protocols seem senseless.
The back and front end being connected with a Ethernet, that guaranties high speed collision-free packet delivery, and the workstation being connected to the front end via the Internet, I decided to implement a protocol, which would have the back end nodes send each of their contributions to the system to a relay application using UDP packets for optimum performance on the Ethernet. The relay application, running on the front end, would collect these packets to get all particles in the complete system in a single “frame”. It employs double-buffering, so packets arriving out-of-order could be handled gracefully (e.g. node 3s contribution to frame i+1 arrives before node 5s contribution to frame i, would simply cause node 3s contribution to be stored in another buffer). The relay application could then connect to the workstation via a reliable protocol (TCP) and stream the collected frames through the connection. At the workstation a display-server is running, reading the stream into another double buffer system, writing in one buffer, while displaying another on the screen, using standard OpenGL API calls.
This approach introduces certain flexibility in the system. If for instance you do not need to see the simulation real time, you could simply make a relay application, that in stead of sending the frames to a remote host, simply stored them on the local harddisk for later retrieval. You can even have radically different variants of algorithms running on the back end, as long as they send the particle positions in UDP packages to a specific host.
The programs are now implemented and ready to go. New cluster computers tend to be barely up to date with respect to individual CPU processing power. Considering that the algorithm uses eight computational nodes, it would be senseless to run it on a 4-year old cluster, based on CPUs, which are 4-8 times slower than the ones readily available in newer workstations; running a serial version on such a workstation would be more productive. Initial tests were hence run on the Horseshoe supercluster, instead of the more obvious choice of the older Beobab cluster.
To facilitate the initial benchmarks, a special amputated relay application was developed; one that did not make the connection to the workstation, but instead just collected the particles, as described, and discarded them, but before doing so timing the collected frames. That is a relay application, which simulated the actual relay application, but did not transmit the frames, only counting and timing them.
The results of these benchmarks were extremely pour. A standard test consists of simulation 100 integrationsteps (frames). The relay would output the current running time each time 25 full frames had been collected. The goal of this project would then be for this time to be around one or two seconds apart, but it was between 22 and 25 seconds apart, largely indiscriminately of how many particles the system consisted of!
Having tested the now rather complex system on Beobab, and verified that it was able to render the animation on my workstation, albeit very slowly, it was also a surprise to see, that the display protocol did not behave well on Horseshoe. Too many UDP packets got lost on the way from the back to the front end, probably due to high network traffic, resulting in buffer overruns and ultimately in lost packages. Out of these only between 50 and 75 would ever arrive complete at the relay. I thus rewrote the amputated relay application to use the reliable protocol of MPI. Now all frames arrived, but at about half the speed!
With time running out, no obvious place to optimize the communication code and certainly nowhere to gain the 50x speed increase needed, I had to abandon the effort.
Before we begin the benchmarking, there are some loose ends that need to be settled. In particular we need to decide things like when are particles too close to the origin of the expansion, where should we truncate the expansion, how many particles will we allow in each leaf. The two first choices have an impact on the quality of the simulation in that if we make too extreme choices, substantial numerical errors will occur. The latter however, we can choose freely. Obviously we should choose it so as to minimize execution time. Experimentally, I found that a maximum of 48 particles per cell yields the best performance. Concerning the first two, we would like to make choices, that gives us what we want namely reasonable precision, but does not hurt performance. Originally Barns and Hutt chose the particles that were too close based on a parameter θ, such that a particle was too close to a multipole associated with a region of side length 2d, if the distance to the center of the region/origin of the expansion was less than θd. I adopted the same criteria and constructed a test, which pitted the direct pair-wise computations against the multipoles. This graph shows the average relative error of the absolute value of the gradient of the multipoles, assuming the direct pair-wise is accurate, evaluated at the points of the source charges in a galactic distribution of 4,150 particles for different values of θ. The multipoles are truncated after seven terms.
A value of
suffices to guarantee that the
multipole expansion will not diverge, but increasing the value will make the
truncation error less significant. The error of such power series ()
is of the order of the last surviving term, which is
So let us examine the effect of the number of terms kept in the expansion. Here are the same error measurements with θ = 1.9 but for varying p:
Based on these results I decided that θ = 2.5 and p = 6, yielding a relative error of 3.68‰ in this test, would be satisfactory for my needs.
With these parameters, the serial version was run on a 1.60GHz Athlon XP. One galaxy was crated and simulated for 25 integrationsteps. The following 75 integrationsteps was timed and divided by 3, so as to yield an average-of-three time for rendering 25 frames. For realtime simulation this time should be around one second and no more than two. Here are the results for the algorithms:
I must admit, I was highly surprised by the computational power required by these algorithms. I thought, that given fast enough computers, I would be able to simulate real time two galaxies consisting each of many, many thousands, maybe even millions of stars, colliding. As can be seen in the last figure, this was far from reality; only around 500-1,000 particles can be simulated on a single computer.
Another big surprise, evident in Figure 9 was that among two such radically different algorithms, no clear winner can be announced. The quadratic time algorithm seems to hold on as long as it can, and is doing an impressive job at it. The difference between the two is marginal, until the pair-wise approach must finally give up at around 4,000 particles and losses out at 5,000. As computers get faster, however, it will become feasible to simulate even larger systems real-time and I thus expect the Barnes-Hut variant to pull ahead, so I suspect the efforts I put in implementing this algorithm will pay off in 5-7 years. Also cutting back on the required precision would help the Barnes-Hut algorithm, but not significantly, if we want sensible results.
It was sad, though maybe not really as surprising, to see, that a cluster computer was not able to speed up the computation. But why exactly was it doomed to fail? I suspect it was the inherent need for a high number of synchronization points; no one node can continue the integration, before all nodes know the potential and no node can begin computing the potential before all nodes have performed the integration step and the particles have settled down and all nodes have to agree which particles have crossed boundaries from one node to another. So all nodes have to be completely synchronized at least three times per integration step, which is at least 90 times per second, which was probably too much for a cluster-type supercomputer. No amount of code optimization can change this, however with (a lot) more tuning of the parallel version, frame rate might still rise. Had my goal been to actually simulate a collision between galaxies with a given (very high) number of particles, taking whatever time it needed, the computations between the synchronization points may have been significant and Horseshoe would have definitely have been able to help out.
The code for the algorithms and applications can be found in the DAIMI
filesystem in /users/kv/Programs/SciComp/.
Please e-mail me for details on how to acquire it, compile
it, and/or run it.