Jul 22 2008

My boss came up to me about naming the project I’m working on, which got me wondering why the heck we need codenames for projects anyway. That question aside, I decided I’d google a bit about this and see what I’d find:


I never thought I’d be able to tie together in one post the two main topics of this blog, C# coding and rock climbing. Well, I was wrong… watch me.


 Posted by at 2:17 pm
Jun 15 2008

It’s been a while since I’ve posted here, but hey… life is busy. Among other things, I got a new job: I’m now working at Illumina in Hayward, CA, a company in the business of sequencing DNA. So far I’m pretty happy with the environment they provide; it’s everything I’ve dreamed of for a long time: lots of smart people, lots of nice technology, and a true enterprise setting… At last I can see what it’s like! They are the current leaders in the market; let’s hope that lasts and that single-molecule sequencing technologies don’t hit us too hard. I’ve already completed a couple of projects in Perl and Bash, but I will spend most of my time doing C# development.

I went on a few trips. For Damien’s birthday party we did some Class IV whitewater rafting in Auburn. It was pretty fun and scary in the rapids, especially the “tunnel chute”, which is totally impressive. Most of the time was spent paddling, but it was still fun since we kept going at each other and trying to knock each other out of the boats. There were some totally climbable boulders in the valley we rafted down; they’re probably very inaccessible, but I could totally see some quality problems on them. Other than that, I was surprised to see that some people are still looking for gold along the American River.

Last week we went to Mount Diablo and did some climbs we don’t usually do, a nice change from always heading to the lower tier and climbing the 90-foot wall.
We were supposed to drive up to Donner Lake this weekend and do some sport climbs up there, but we didn’t follow through with it: we hadn’t realized it was that far, and half a weekend is kind of short for such a trip. Next time we’ll get it right…

Climbing-wise I haven’t done too badly lately: last week I managed to send my first V9, and I also sent a V8 during that same session.

As far as software projects go, I’ve been kind of lousy lately: I haven’t been able to finish anything while starting even more things. A Facebook app called sentit (I realized later that rockclimbing.com does the same thing, better), and also my iPhone Etcho app, which is still at the same point. Well, I’m back on Mac OS X this weekend and have taken forever to fix my hackintosh bugs, but hopefully now we’re on the right track. My next post will be about fixing those bugs; I don’t want to have to search everywhere again…

Update: well, I got the keyboard working yesterday, but now it’s not working anymore, so my next article is going to have to wait…

 Posted by at 5:49 pm
Apr 05 2008

In my previous posts about running Blat searches on the Sun Grid Engine, I mentioned I would follow up to report what kind of money we’re looking at when running such searches to map large amounts of sequence. The results were pretty impressive and nothing like my first benchmarks suggested. At first I reported that it would cost approximately $730 to map about 280 Mb of DNA sequence. That was somewhat expensive, but not expensive enough to prevent us from running it. However, I inferred this from a single run, when the right thing to do would have been to base the estimate on two runs of different sizes, because of the overhead cost of starting the program (and loading all the needed resources). It turns out the test program’s run time was mostly overhead. To my great surprise, the actual blat run on the 280 Mb of sequence, which I expected to take roughly 400-500 CPU hours, took only 11 CPU hours!! In addition, the RepeatMasker step is now surprisingly the more costly one: whereas it used to take roughly a third or a fourth of blat’s compute time on a standalone workstation, it now costs about 3 times more than blat in the grid engine setup.
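The lesson generalizes: with two pilot runs of different sizes you can separate the fixed startup overhead from the true per-megabase cost before committing to a big run. Here is a minimal sketch in bash; the pilot timings, the two-point extrapolation, and the $1/CPU-hour rate are all illustrative assumptions, not figures from my actual runs:

```shell
#!/bin/bash
# estimate the cost of a big grid run from two small pilot runs,
# by fitting a line (fixed overhead + per-Mb slope) through them

estimate() {
  local mb1=$1 t1=$2 mb2=$3 t2=$4 target_mb=$5 rate=$6
  # per-Mb cost is the slope between the two pilot runs (CPU-seconds per Mb)
  local slope=$(( (t2 - t1) / (mb2 - mb1) ))
  # fixed startup overhead is what's left of the first pilot's time
  local overhead=$(( t1 - slope * mb1 ))
  local total_sec=$(( overhead + slope * target_mb ))
  echo $(( total_sec * rate / 3600 ))   # dollars at $rate per CPU-hour
}

# pilots of 1 Mb (400 s) and 5 Mb (1200 s), extrapolated to 280 Mb at $1/CPU-hour
estimate 1 400 5 1200 280 1   # prints 15
```

With only one pilot run, the overhead term gets folded into the slope, which is exactly how my first $730 estimate went wrong.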

Under the grid engine setup on Network.com, not only can we run blat on several hundred nodes, it also runs roughly 10 times faster on each node than it does on my workstation. Either blat benefits a lot from being run in 64-bit mode, or the memory installed in the Sun hardware is top of the line, since blat is mostly memory accesses.

Apr 05 2008

Microsoft / Business Objects have changed the way Crystal Reports based applications can be deployed. It used to be complicated enough; now things have changed, and the documentation on the subject is apparently scarce. A couple of days ago I was sitting in a cafe trying to upgrade the source of one of my apps to Visual Studio 2008 when I realized the Crystal Reports merge modules were nowhere to be found, and I wasn’t able to find any info about this in the documentation… God, I hate that; everything should be findable in the docs (either the search function or the content is at fault, but somehow MSDN docs can be unsettling sometimes). I found the solution to the problem as soon as I got connected to the internet.

It used to be that you could deploy a Crystal Reports based project using merge modules; the new way to deploy Crystal Reports under Visual Studio 2008 is to use a redistributable MSI package located at “C:\Program Files\Microsoft SDKs\Windows\v6.0A\Bootstrapper\Packages\CrystalReports10_5\CRRedist2008_x86.msi”. I haven’t figured out a Visual Studio UI supported way to deploy this package at the same time as my application, and it looks as if, for now, one needs to deploy Crystal Reports separately. I will post my solution to this problem when I find out how to do this.
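In the meantime, one stopgap is to install the runtime from a script before installing the application itself. Something along these lines should work (the application MSI name is hypothetical; /i, /qn and /norestart are the standard Windows Installer switches for install, silent mode, and no reboot):

```bat
rem stopgap deployment sketch: install the Crystal Reports 2008 runtime
rem silently, then install the application MSI (name is illustrative)
msiexec /i "CRRedist2008_x86.msi" /qn /norestart
msiexec /i "MyApplication.msi" /qn
```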

Mar 09 2008

I created a little script that retrieves all the sequences for a particular Trace Archive query and stores everything into a file; that’s got to be useful to more people than just me. It uses the query_tracedb.pl file provided by NCBI.
Here it goes (updated 04/01/2008 for Mac OS X compatibility):


#!/bin/bash
# download all the sequences matching a Trace Archive query into one file,
# using NCBI's query_tracedb.pl
queryScript=./query_tracedb.pl
QUERY=$1      # e.g. "species_code='HOMO SAPIENS'"
OUTFILE=$2

#first count the number of sequences
seqCount=`$queryScript query count $QUERY`
fileCount=$(( seqCount/40000 ))    # the archive serves at most 40000 sequences per page
echo found $seqCount sequences
echo I will create $(( fileCount + 1 )) files named dataN.tgz

i=0
while [[ $i -lt $(( fileCount + 1 )) ]]; do
    echo downloading data$i.tgz
    (echo -n "retrieve_tgz fasta 0b"; $queryScript "query page_size 40000 page_number $i binary $QUERY") | $queryScript > data$i.tgz
    # fasta can be changed to all / xml
    # use "text" instead of "binary" if doing an xml query
    i=$(( i + 1 ))
done

# concatenate the downloaded archives into a single file
rm -f $OUTFILE
for file in data*.tgz; do
    echo processing $file
    tar -zxf $file --to-stdout >> $OUTFILE
done
echo $QUERY > $OUTFILE.query
seqsInFile=`grep -c ">" $OUTFILE`
echo $seqsInFile sequences found in the final file
tar -czf $OUTFILE.tar.gz $OUTFILE*

Mar 08 2008

Some follow-up on running blat on network.com. Unfortunately, running blat on smaller chunks of DNA, like individual chromosomes, wasn’t the fix after all. I quickly rearranged my qsub submissions so that individual chromosomes may be searched instead of whole genomes at once; twoBitInfo, which retrieves the information about the chromosomes in a 2bit file, was my friend for that. However, I obtained the same “killed” error message for the twoBitInfo utility, hinting seriously at some compilation issue. I went back to my efforts to cross-compile blat for amd64 on my RHEL4 box, but that still gave me error messages on the grid engine.
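For reference, the per-chromosome submission logic is simple. Here is a minimal sketch in bash: in real use the chromosome list would come from `twoBitInfo hg18.2bit stdout` (which prints one name/size pair per line); it is simulated here with printf so the sketch runs without the UCSC tools installed, and `blatChrom.sh` is a hypothetical wrapper script, not one from this post:

```shell
# list the sequences in a 2bit file as "name<TAB>size" pairs;
# in real use this would be: twoBitInfo "$1" stdout
list_chroms() {
  printf 'chr1\t247249719\nchr2\t242951149\n'
}

# submit one blat job per chromosome (dry run: print the qsub
# command instead of actually submitting it)
list_chroms hg18.2bit | while read chrom size; do
  echo "qsub -N blat_$chrom blatChrom.sh hg18.2bit $chrom"
done
```

The per-chromosome fasta files themselves can be extracted the same way with twoBitToFa before submission.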

By the way, I digress a bit, but in case you didn’t realize, testing those binaries isn’t the most fun thing to do if you don’t have a Solaris box set up: each time, you need to re-upload the binaries and scripts and test them live, wasting $1/CPU-hour in the process. Also, lately jobs have been taking forever to start, even though the job id increments only by one (hinting that no other jobs were running during my wait time).

Seeing how my efforts at cross-compilation failed miserably, I decided my next move would be to try to compile blat natively… After running grid engine jobs solely to compile blat (a job that takes 5 seconds on my machine; a bit of overkill to use the cluster, but thank god I’m not using all 5000 CPUs), I managed to compile the blatSuite successfully. The major bit of “porting” involved this error message:

pscmGfx.c: In function `colinearPoly':
pscmGfx.c:390: warning: implicit declaration of function `isinf'
gmake[1]: *** [pscmGfx.o] Error 1
gmake: *** [topLibs] Error 2

It seems the isinf function is a source of headaches for people porting to Solaris. Some suggestions (http://www.ruby-forum.com/topic/70926) are to change the compiler flags to gnu99, but that was to no avail. I resolved it by removing pscmGfx.o from the makefile, since it’s not used by blat. The mods I made to the common makefile follow; some changes may be useless, but I did not take the time to test:

CFLAGS=-L/usr/sfw/lib/amd64 -L/usr/lib -lnsl -lsocket -lresolv
HG_INC=-I../inc -I../../inc -I../../../inc -I../../../../inc -I../../../../../inc -I/usr/include -I/usr/sfw/lib/gcc/i386-pc-solaris2.10/3.4.3/include/

more on benchmarking later…

Update: benchmarking estimates tell me that it will cost $730 to repeat-mask and map a whole library’s worth of end sequences (about 280 megabases) onto the human genome.

Mar 06 2008

The standard Solaris and Opteron binaries for blat don’t work on network.com’s grid engine: the processes don’t get a chance to start and are killed on the spot.

The error message is pretty cryptic too:

/var/tmp/spool/r130c23z1/job_scripts/295234: line 52: 7090 Killed "$BLAT" "$DBPATH" "$CHUNK" "$OUTFILE" $PARAMS -dots=`echo $(($CHUNKSIZE/10))`

I will have to recompile them from the sources with different architecture flags.

Update: recompilation won’t be necessary; it seems blat is also being spontaneously killed locally on my server, a behavior I had forgotten about. It seems that blatting individual chromosomes instead of whole genomes will be the way to go.

Mar 04 2008

Repeat masking now works, thanks to a set of Perl and Bash scripts I created. I can input any fasta-formatted file I want and specify some RepeatMasker parameters together with the size of the work unit in bytes; the fasta file gets chopped into work units close to the specified size, and the jobs get distributed on the grid and processed at lightning speed. I haven’t figured out costs yet, but I will post results back here when I scale up the searches. The only current caveat is that each RepeatMasker process requires its own database, which it has to recreate at run time; I had to hack the RepeatMasker code a bit so that each process is able to do this. I calculated that the impact of this change should be about $0.002/CPU/hour, which is fortunately very little. Another downside to this little caveat is that on a 1000-CPU job the databases end up chewing up space: 3.5 GB will be used for temporary storage of the databases, which matters given that the Sun Grid Engine only allows up to 10 GB of storage space…
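The chunking step itself can be sketched in a few lines of bash and awk. This is a simplified stand-in for my actual scripts, not the scripts themselves: it splits a multi-fasta into files of roughly the requested byte size, cutting only at “>” record boundaries, then prints one submission per chunk (the demo input, the chunk_N.fa naming, and the runRM.sh wrapper are all illustrative):

```shell
# split a fasta ($1) into chunks of roughly $2 bytes, cutting only at
# ">" record boundaries so no sequence is split across two chunks
split_fasta() {
  awk -v max="$2" 'BEGIN { n = 0 }
    /^>/ && bytes >= max { close("chunk_" n ".fa"); n++; bytes = 0 }
    { print > ("chunk_" n ".fa"); bytes += length($0) + 1 }
  ' "$1"
}

# tiny demonstration input (real inputs are multi-megabase fasta files)
printf '>seq1\nACGTACGT\n>seq2\nTTTT\n>seq3\nGGGG\n' > demo.fa
split_fasta demo.fa 10

# one RepeatMasker job per chunk (dry run: print instead of submitting)
for f in chunk_*.fa; do
  echo "qsub runRM.sh $f"
done
```

Because chunks only break at record boundaries, they can slightly overshoot the target size, which is fine for load balancing purposes.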

Next step: getting blat to work. My first experience is that the scripts I developed alongside RepeatMasker’s are running fine, but the blat process gets killed out of the blue; I suspect blat is taking too much memory. I contacted tech support about this, since I don’t see a way out of it on my own.

 Posted by at 11:57 am
Feb 28 2008

I unearthed a program I wrote long ago for KOMP to present at UCSF, to an audience interested in knocking out micro-RNAs. I had to present it today, and things seem to have gone well, although Lu, who was there too, says I wasn’t bragging enough and that I’m not trying to sell my stuff hard enough. I have to agree: I hate to oversell something, and that’s how I’m made, I guess. But I admit, too, that sometimes it’s nice to hear from someone (usually a good salesman) that what you’re buying into is such a nice product, so that you’re already in love with it before you get a chance to see all the unironed bugs and missing functionality… At least you’ve been in love for some time…

Feb 27 2008

Today I’m really wiped… I was in front of the computer all day long (just like any other day…), but I think I’ve accumulated some fatigue lately by staying up late and working crazy days, and now I have a huge headache. The good news is that this morning I found out I was approved to use the Sun Grid Engine (I am not a terrorist after all), so I’m potentially going to be able to run blat jobs on thousands of CPUs and wait only half an hour (according to my calculations), versus months, to map a whole library’s worth of BAC end sequences, that is, around 325,000 sequences, for a mere $400-$1000 in CPU crunching cost. I’ll post when I get the actual numbers. The other good news is that I can use all my experience from last year’s frenzy of setting up the grid engine at BACPAC; I already have a set of scripts compatible with the grid engine framework that I can reuse to run my blat jobs… More news on that next week…
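For the curious, the back-of-the-envelope behind the “half an hour versus months” figure looks like this; the per-sequence blat time and the CPU count are illustrative assumptions, not measured numbers:

```shell
#!/bin/bash
seqs=325000
sec_per_seq=10   # assumed average blat time per BAC end sequence on one cpu
cpus=2000        # assumed number of grid slots actually granted

echo "single cpu: $(( seqs * sec_per_seq / 3600 )) hours"            # prints 902
echo "on $cpus cpus: $(( seqs * sec_per_seq / cpus / 60 )) minutes"  # prints 27
```

The real numbers will depend heavily on blat parameters and queue availability, which is why I’ll wait for the actual run before quoting any.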