Category Archives: Data science
I regularly open huge images and tables (>1GB) in interactive Java-based (astronomy) software such as Aladin and TopCat. Because of the way memory allocation works in Java, the area of memory where objects reside (called the heap) needs to be reserved up front using the “-Xmx” switch. Hence I tend to run memory-intensive applications using:
java -Xmx4000m
If you don’t use this flag, you will get an OutOfMemoryError exception as soon as your application exceeds the default heap size, which is typically only a few hundred megabytes.
However, I frequently found myself faced with a horrible performance experience when using a large heap. Java applications would freeze my entire (64-bit) Linux system for anywhere between 2 and 60 seconds! This happened regardless of the JVM used (I tried Oracle Java 1.7, Sun Java 1.6, GCJ 1.5). I verified that my system had plenty of memory available and was not swapping, hence a lack of memory was not to blame. A profiler revealed that these freezes were instead caused by an insane number of interrupts which ate 100% of all CPU cores in so-called “system” cycles.
The cause of these system freezes is Java’s garbage collection mechanism: a built-in automated memory management system which reclaims memory occupied by objects that are no longer in use. While this feature makes programming in Java somewhat easier than in, say, C++, it comes with the disadvantage that garbage collection in a large heap can introduce considerable overhead. Some collection algorithms deal less effectively with large heaps than others, and unfortunately, in my case, Java appeared to be using a collection strategy which paused the application for the whole duration of each garbage collection run, resulting in frequent freezes.
The trick to avoid these freezes is to tell Java to use a collection strategy which runs concurrently with the application, thereby avoiding lengthy interruptions of the entire process. This can be achieved using the “-XX:+UseConcMarkSweepGC” flag, i.e.:
java -Xmx4000m -XX:+UseConcMarkSweepGC
There are in fact many more tuning parameters which can influence the behaviour of the garbage collection, but “UseConcMarkSweepGC” looks like the first obvious thing to try if you are experiencing annoying freezes in memory-intensive Java applications.
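To see the difference for yourself, a toy program like the one below (my own sketch, not taken from any particular application) churns the heap with short-lived allocations so that the collector has to run frequently, and records the longest stall observed between loop iterations. Running it once with plain `java -Xmx4000m GcPauseDemo` and once with the `-XX:+UseConcMarkSweepGC` flag added should make any stop-the-world pauses visible:

```java
// Toy benchmark: allocate many short-lived arrays so the garbage collector
// must run often, and record the longest gap between consecutive iterations.
// Long gaps are a rough proxy for stop-the-world collection pauses.
public class GcPauseDemo {
    public static void main(String[] args) {
        byte[][] retained = new byte[64][]; // small rolling window of live objects
        long maxGapMs = 0;
        long last = System.nanoTime();
        for (int i = 0; i < 200_000; i++) {
            // Overwrite slots in a round-robin fashion, so most
            // allocations become garbage almost immediately.
            retained[i % retained.length] = new byte[16 * 1024];
            long now = System.nanoTime();
            maxGapMs = Math.max(maxGapMs, (now - last) / 1_000_000);
            last = now;
        }
        System.out.println("Longest gap between iterations: " + maxGapMs + " ms");
    }
}
```

The absolute numbers will depend heavily on your machine and JVM version; the point is only to compare the longest gap with and without the concurrent collector enabled.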
Two weeks ago, I posted an animation on YouTube showing where Comet PanSTARRS would be visible. The video attracted more than 15 000 hits, and although this is not a proper statistical analysis, I would like to draw attention to an interesting result in the demographic analytics provided by YouTube: 75% of the viewers were male.
Although the numbers are based only on the ~20% of viewers who were logged into a YouTube account while watching, statistics like these may reveal broad trends in the public interest in astronomy. If we assume that all people interested in astronomy were equally likely to have watched the animation, and if in addition we assume that these people are all equally likely to have a YouTube account regardless of their age or gender, then one might conclude that (middle-aged) men are three times as likely to seek out comet information as women (75% versus 25% of viewers). Interestingly, this is broadly consistent with the (unfortunate) trend of large male majorities in astronomy clubs and university departments.
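Under those assumptions, the implied ratio follows directly from the reported split; a one-line computation (the class name is my own) makes the arithmetic explicit:

```java
// Compute the male-to-female viewer ratio implied by the YouTube analytics.
public class ViewerRatio {
    public static void main(String[] args) {
        double maleFraction = 0.75;                   // fraction reported by YouTube analytics
        double femaleFraction = 1.0 - maleFraction;   // 0.25
        double ratio = maleFraction / femaleFraction; // male viewers per female viewer
        System.out.println("Implied male-to-female ratio: " + ratio); // prints 3.0
    }
}
```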
There is no doubt that the above assumptions are wrong to some degree, and that the YouTube statistics are hence biased. It is not clear how severe the bias is however. I tried Googling for demographic statistics of YouTube users in general, but could not find consistent information. (Does anyone know a reliable source? Are 75% of YouTube users male anyway?!)
If the biases can be accounted for using a proper statistical analysis, then the analytics offered by science-themed YouTube videos would provide a way to measure the public interest as a function of age, gender and topic.
Home directories often turn into spooky graveyards of random files, temporary directories and images of lolcats. It takes courage to delete the mess, because there may be one or two important files hiding amongst the pr0n. As a result, many scientists have grown afraid to run “ls ~” in public, fearing that the output of said command will expose them as file-hoarding maniacs. (By the way, the fear of running “ls ~” should be called domusindexophobia in Latin.)
For years I have employed the popular strategy of sticking random files on the desktop and, whenever it becomes a mess, creating a folder called “oldstuff” and moving everything into it. The strength of this strategy is that it can be repeated indefinitely (oldstuff2, oldstuff3, oldstuff4…); the weakness is that you remain a file-hoarding, domusindexophobic maniac.
In the past few months I decided to adopt a more sensible strategy. Read the rest of this entry