Ubuntu 64 Bit and GHCs Garbage Collector

Hi,

Update: I tested a kernel 2.6.31-12 (from kernel.org) and had the same (if not even worse) behaviour as with 2.6.31-19. So, I apologize. It has nothing to do with Ubuntu per se, but probably with the linux-kernel; that makes me wonder if I haven’t done something faulty: that behaviour should have attracted the attention of some other Haskell developers in the meantime, but I couldn’t find any information on it. So, take in mind that the following could be incorrect…

finding this strange behaviour had cost me a lot of hours (see also my post on the haskell-cafe mailing list), because honestly, who do you blame if your parallel programm behaves strangely:

  1. yourself
  2. the compiler
  3. the special patches of your OS vendor to your linux kernel

Since the probabilities of an error in 2. are quite low (at lot of people use GHC and its parallelization techniques) and — at least I thought that — those in 3. are low, too (some of those people from 1. should be using my OS with GHC for parallel development), I chose 1. Hence, I spent a lot of time trying to find synchronisation and performance problems in my code, which in fact, weren’t there!

The current setup of a standard Ubuntu 9.10 (64 bit) is a vanilla kernel (2.6.31-19-generic) , used with GHC 6.12.1. Look at this small and seemingly innocent program

-- File Pi.hs
-- you need the numbers package from hackage.
module Main where
import Data.Number.CReal
import System.Environment
import GHC.Conc

main = do
    digits <- (read . head) `fmap` getArgs :: IO Int
    showCReal (fromEnum digits) pi `pseq` return ()

that you can compile with

ghc -- make -O2 -threaded Pi.hs -o pi

Benchmarking with

for i in `seq 1 5`;do time pi 5000 +RTS -N1;done
for i in `seq 1 5`;do time pi 5000 +RTS -N2;done

I received the following exemplary runtimes for the vanilla ubuntu kernel.


And this is the threadscope profile of a particular bad run:


As you can see, nearly all time is spend in the Garbage Collector, doing…things, GCs do…

I then compiled my own kernel2.633 from kernel.org and received these results:

Conclusion: At least for me and my parallel programs (and even some innocent toy programs like Pi.hs) does the standard Ubuntu 64bit kernel harm parallel performance (especially that of the GC) a lot.

I find this highly surprising:

  • Ubuntu is a widely used distribution
  • 64 bit versions are also becoming quite common
  • in my perception Haskell developers use parallelism a lot

Question. Either the intersection of these properties is quite small or my measurements have some serious flaw. I’d be glad if someone could disprove my results (and give me the opportunity to learn a bit) or support my measurements.

Visualizing rising iteration depth in a voltage diffusion model

Using my PASTHA library (more follows soon) to parallelize stencil calculations I wrote a few wrapper scripts for gnuplot and gifsicle to visualize the voltage diffusion for a metal plane for rising iteration depths:

Idea for capabilities

After looking at the source code in Capabilities.c I think a rudimentary implementation to add dynamic capabilities can work like this:

  1. Wait, until it is safe to change the capabilities structure.
  2. Allocate memory for number_of_capabilities + 1
  3. Copy old capabilities structure to new allocated one
  4. Relink to new memory location

To be honest, I don’t think it’s so easy; but at least this is a starting point I can use. Currently, I have to things to think about and which have to be solved before I can hack around in the code:

  1. How can I call C code in the RTS from Haskell?
  2. When is it safe to modify the capabilities structure?

Happy for any suggestions,
Michael

GHC’s 6.12 new dynamic linking and file sizes

GHC 6.12 finally supports dynamic linking, leading too much smaller binaries: For the standard HelloWorld example

module Main where
 
main = putStrLn "Hello World"

the binaries went from 427.000 bytes (stripped) to 10.736 bytes! That’s great :-)

Where am I?

And what am I doing here? I currently read (more or less randomized) remarks about GHC’s source code; the codebase is quite huge (and probably the biggest code base that I’ve ever worked with):

Extension  Lines
.lhs       300482
.hs        473479
.c         184039
.h         448394

combined  1406394

Found using wc -l `find . -name \*.`

Thankfully I only need a tiny fraction of it to add the dynamic-capabilities-feature: according to the commentary on the runtime system it consist of only about 50000 lines, and again, I probably need to modify / extend just a fraction of it. The code of the runtime system has an extra page on the wiki; it describes

  • the purpose of the single files. Unfortunately these are currently only one liners and links to the particular files. Maybe I will add a much more extensive description for the Capabilities-files.
  • the coding conventions for this subsystem. They are quite extensive but (in my humble opinion) very good and describe not only syntactical conventions but also some general remarks, debugging and robustness tricks.

For seeing how the capabilites are initialized I think that the file RtsFlag.c is helpful:

  • RtsFlag.c defines parsing for -N<…>, hence I see where the capabilites are set and how they fit in the big puzzle of the RTS.
  • RtsFlag.c uses and sets RtsFlags.ParFlags.nNodes (line 958). The structure is included in includes/Rts.h which includes include/rts/Flags.h (line 189 defines the important parts).
  • RtsMessage.c would probably help me in the future for printing some debugging messages.

I’m curious if I’m on the right track. By the way, take all the informations with a grain of salt, since I’m currently in the beginning of understanding the code of the runtime system.

Minor update: After a look at the commentary I’m pretty sure I need to understand much of the scheduling subsystem to extend the capabilities-system. Thankfully the wiki seems to document it.

Started hacking on GHC capabilities

Today I’ve started the small personal hacking project of implementing dynamic capatiblities in GHC’s parallel runtime system; I’ll blog about my efforts, problems and solutions and would be grateful for any hints, comments or pointers for more documentation. I began by talking with Simon Marlow and discussing the idea. He said that it’s a least worthwhile to try implementing it.

Currently I

  • made an account (mlesniak) on the GHC’s developers wiki.
  • opened a ticket (No. #3729) as a task, describing my idea: “Currently the number of capabilities can only be set at the start of a program using -N. At least for benchmarking and playing with semi-implicit code (any other real-world scenarious?) if would be nice to change the number of capabilities at runtime.”

    Current questions:

    • There is no person assigned on this task, should I assign myself?
    • It’s the only project without a particular difficulty (Diffuclty: None); should I set one or is this set by one of the “important” maintainers?
  • downloaded the source for GHC HEAD, according to Getting Started
  • was able to compile it on my 2.4 GHz Dualcore with 4 GB memory using the Building Guide. It took about 40 minutes for a full compile (yikes!)
  • took a first look into ghc/rts/ Capability.c and Capability.h. This looks quite complicated and I probably have to understand a lot of the underlying system before I can even think about trying to modify this part of the RTS itself. Hopefully, the GHC Commentary will help.

Why do I do this? Firstly, I really like to solve problems and learn new things on the way. Secondly, I was always interested in the underlying parallel runtime system of GHC but without a concrete task motivating myself to understand this huge system was difficult. Thirdly, I hope that by learning part of the system I’m able to implement some of my (currently quite vague) ideas on dynamic adaptiblity of parallel runtime systems.

I will keep the interwebs updated, but do not expect too much soon, since I do this primary in my free time. Any hints, questions, remarks, (de-)motivating comments?

GHC 6.12 RC1 and Ubuntu 9.10 (Karmic Koala)

I’m excited about GHC 6.12 and hope it is released soon. Currently I have some problems with Ubuntu 9.10 and its Linux kernel 2.6.31-14. It seems that GHC 6.12 will solve (some) of my problems.

For the examplary unoptimized code

-- Compile with ghc-6.10.4 -O2 -threaded --make Example.hs -o example
module Main where
import GHC.Conc
import Control.Concurrent
import Control.Monad
import System.Environment
import Data.List
------------------------------------------------------------
main :: IO ()
main = do
    args <- getArgs
    let threads = numCapabilities    -- threads determined by -N...
        taskDur = 1000000            -- big enough magic number
        taskNum = (read . head) args -- Number of tasks is 1st parameter
 
    queue    <- newChan
    finished <- newChan
    writeList2Chan queue (replicate taskNum taskDur)
    replicateM_ threads (forkIO (thread queue finished))
    replicateM_ taskNum (readChan finished)
------------------------------------------------------------
thread :: Chan Int -&gt; Chan Int -&gt; IO ()
thread queue finished =
    forever $ do
        task <- readChan queue
        workFor task
        writeChan finished 1
  where workFor n = foldl' (\a b -> a + sqrt (2^b)) 1 [1..n] `pseq`
           return ()

we get the following runtimes on my Dualcore Notebook:

$ time e-6.12 +RTS -N1 -RTS 16

real	0m43.994s
user	0m43.630s
sys	0m0.040s
$ time e-6.12 +RTS -N2 -RTS 16

real	0m25.987s
user	0m48.790s
sys	0m0.230s
$ time e-6.10 +RTS -N1 -RTS 16

real	0m45.883s
user	0m44.500s
sys	0m0.400s
$ time e-6.10 +RTS -N2 -RTS 16

real	0m38.930s
user	0m46.040s
sys	0m0.490s

As you can see, e-6.12 (the version compiled with ghc-6.12-20091010) delivers a good speedup whereas ghc-6.10 does not use my two cores very much. This makes testing, benchmarking and developing my parallel Haskell code currently quite difficult.


Relaunch

Hello, I’m relaunching mlesniak.com. Most of the old content (a raytracer and a tetris game in Java, biolib-Stuff) is gone at the moment but will probably be re-added over time. If you want details to a blog entry that you’ve found via Google’s cache just write me a mail.