Sunday, May 30, 2010

My OpenTox Workshop contribution: Linking explicit and implicit knowledge

Please find below the views I have expressed at the OpenTox Workshop in the nice Potsdam area near Berlin:

More on the meeting later.

Tuesday, May 25, 2010

CDK 1.3.5

A lot of changes in this release: the SMSD code (see doi:10.1186/1758-2946-1-12), removal of outdated code (force field, R-CDK bridge), reading of SMILES @ and @@ chiralities, a new IChemObjectBuilder interface, and several new features in the MDL IO classes. The full list:
• Fix for getBestAlignmentForLabelXY eb7529b
• Test for getBestAlignmentForLabelXY 2cb5f4d
• Renamed the ligancy classes to use the term tetrahedral since they implicitly encode for tetrahedral chirality d80628e
• fixed PMD errors on Use instanceof against interfaces not implementations and few more Signed-off-by: Syed Asad Rahman e36a2a1
• Renamed the ligancy classes to use the term tetrahedral since they implicitly encode for tetrahedral chirality 3043c17
• Introducing PMD test for CDK specific issues: 406930b
• Removed unstable forcefield code 235b04f
• The createSMILES() methods now take IAtomContainer rather than IMolecule. Originally, IMolecule was the type since it was assumed that SMILES would only be generated for connected components - but the code already handles disconnected components. Since IAtomContainers are meant to support that, this change makes sense. Also a result is that we don't have to convert a IAtomContainer to IMolecule to generate SMILES 2399a18
• updated test cases Signed-off-by: Syed Asad Rahman 32cb936
• fixed makeAtomsMapOfBondsMap with test 5bd8d64
• Emergency fix: IAminoAcid instead of AminoAcid 5193626
• A few more interfaces instead of implementations 648a2f4
• Replaced use of implementations by interfaces 6317be2
• Use an interface instead of an implementation as type b82a2f6
• Use interfaces instead of implementation 50b9e74
• MDL reading and writing atom value line, including test cases c88bd3c
• Added another test to check that a query larger than the target does not match 088afce
• Test cases for MCS updated 15fa35a
• updated MCS solution count in the VF lib, deprecated turbo MCS and provided methods for timeout c4d3bf9
• updated test for single atom ed04f25
• Replaced outdated URL with entry in WikiPedia (fixes #3002741) ad2bd3e
• Fixed outdated JavaDoc I forgot to remove (fixes #3002409) 82f40a1
• Included the qm module in the dist-all (fixes #3002622) e2c95ec
• Cleaned up unthrown exceptions 2501371
• Removed use of SMARTS parser in the test code so that a new dependency is not required e561364
• removed unwanted checks 41561a2
• commented unused code f39566c
• test cases for the IQueryAtomContainer support in the SMSD with exceptions fixed 839827b
• test cases for the IQueryAtomContainer support in the SMSD b292f7c
• IQuery* support for the SMSD, will help in SMART based searches 75d86dd
• Turbomode MCS search added f3f809e
• removed unwanted test case Signed-off-by: Syed Asad Rahman ea4960c
• VF Sub search turbo mode 5b4f00a
• Updated the SMSD code for turbo mode substructure search Refactored the SMSD class itself f5ccf1b
• Added two test cases by Daniel from my blog: http://chem-bla-ics.blogspot.com/2010/05/cip-rules-2-parsing-and-from-smiles.html 9cbc242
• Added two more unit tests, related to ring closing 0f4649c
• SMILES @ and @@ chiralities are now fully read. 2dd575e
• Added unit tests with various chiral SMILES situations. 4c2e9b7
• Implemented stereo chemistry for atoms with four ligands 9abead5
• Patch to generalize the stereo chemistry handling. be1b70a
• Minor fix to build.xml to ensure that SMSD code gets included in the large jar file ee5e0a2
• Added JavaDoc testing to the QA task; removed module-uptodate which checked if the module was already compiled causing the target to be skipped a1e1f1b
• I made sure the MDLV2000Reader considers 0,0,0 coordinates in files with a single atom as 2d and 3d coordinates. The MDLReader does not handle the 0,0,0 case explicitly, so I just added a test for 2d. It might be better to have uniform handling, but I will file a bug report for that. 42b64e1
• The RXNReaders/Writers now all handle Atom-Atom-Mappings. This was only done in the MDLRXNReader till now 6ff55ba
• Atom-Atom-Mapping is now read and written in MDL files. Note the reading until now was into ID field, which is not in line with description of ID field in ChemObject (Returns the identifier (ID) of this object). Also added tests for MDLWriter/Reader/2000Reader. 6b787c2
• Flexibility for ring start angles. d6d6ab0
• more mdl reader writer tests 353f938
• Added a test case for short line mol files. Patch by S.Kuhn, reworked by M.Rijnbeek. c4ea1d8
• MDL reading and writing UP_OR_DOWN e36983b
• Removal of references to removed R-CDK bridge fd1e57c
• Removed the R-based model package and associated jar files and test code 5733da3
• Removal of references to removed R-CDK bridge b1075d7
• Removed the R-based model package and associated jar files and test code e41372b
• Removal of references to removed R-CDK bridge d3f65d0
• Removed DocCheck from Eclipse' Build Path fcd7623
• Removed obsolete castings, or replaced by more general ones b34423d
• The big SMSD patch. 10edb71
• Fixed a ClassCastException in a unit test; I messed up (mea culpa) b5fa3dc
• Copied code from the DefaultChemObjectBuilder to handle the IBond constructor that takes an IAtom[] which I had forgotten to port to DebugChemObjectBuilder and NoNotificationChemObjectBuilder (fixes a few regressions) 25427b4
• Fixed NullPointerExceptions for LonePair's and SingleElectron's constructed with the no-argument constructors 5f34897
• Added missing cloning of single electrons 2d4c122
• Do not try to clone the atom if it does not exist 9672df0
• MDL reading and writing valency c20a0ea
• Line separator fix for RGroup writer ad706cc
• More extensive testing for removeHydrogens in atomcontainers with hetero atoms without Hs 3155d7d
• Be a bit more forthcoming with debug messages: report also the parameter types of the method 0f85f52
• This patch makes sure that the removeHydrogens method in AtomContainerManipulator sets hydrogenCount to 0 if no hydrogens were on a heavy atom. Till now, this was null, which was at least not good, in a way even wrong. 708cf8a
• converted uses of indexOf to startsWith/contains 7b9d84e
• Updated HIN reader to fix bug 2984581 f95c632
• Added unit test to see if arrays are properly cloned, and that array entries of the original are not overwritten 38d5f8d
• Unit test that the IAtom[] array is properly cloned, and overwriting entries in the clone does not overwrite entries on the original 3c1b07e
• Removed duplication of cloning. 216c160
• Apparently the super.clone() does not clone the pointer to the IAtomContainer[], causing a clone() followed by changing containers in the clone to overwrite the original IAtomContainer[]. Fixed by creating a new array. 4e5d6a1
• Moved test from the specific class to the abstract tests, as the behavior should be the same for NNMoleculeSet and DebugMoleculeSet too 068fb3b
• Fixed a typo in the test method name 8142ae5
• Got rid of some debug prints 6079160
• Fixed a regex bug which prevented the engine from properly loading descriptor class names 9bd0490
• Some code cleanup to make it more idiomatic Java 565d7c3
• New IChemObjectBuilder interface: 45f683f
• Two more tests for the issue: atom typing works fine; aromaticity detection fails: one ring is detected as aromatic (that with two nitrogens), so that it does not consider the double ring, marking the other ring as non-aromatic 3be2367

Sunday, May 23, 2010

PMD code checking: checking for programming against CDK interfaces

I wrote recently about programming against interfaces, which led to some discussion. Rajarshi commented:
patch 2 seems to replace usage of Molecule with Molecule, rather than IMolecule
Now, this particular problem was Eclipse being overly active cleaning up the imports, but we reached the point that it is actually interesting to have PMD warn about these issues. PMD is a cool piece of software that the CDK project has used for a long time now. One particularly cool thing is that you can write code warnings as XPath queries.

So, I promised Rajarshi to have a look at a custom rule to look at using IMolecule instead of Molecule. The code is actually pretty simple, using the aforementioned XPath feature:
<rule name="ReplaceMoleculeWithIMolecule"
      language="java"
      message="Use the IMolecule interfaces instead of the Molecule implementation"
      class="net.sourceforge.pmd.rules.XPathRule">
  <description>
Programming against the CDK interfaces allows users to pick their favorite
  </description>
  <priority>3</priority>
  <properties>
    <property name="xpath">
      <value>
<![CDATA[
//Type/ReferenceType/ClassOrInterfaceType[@Image='Molecule']
]]>
      </value>
    </property>
  </properties>
  <example>
<![CDATA[
Molecule mol = builder.newInstance(IMolecule.class);
]]>
  </example>
</rule>
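To get a feel for what such an XPath query selects, here is a minimal sketch that evaluates the same expression with the XPath engine from the Java standard library. The XML tree in it is hand-written and only mimics the shape of a syntax tree; it is an assumption for illustration, not actual PMD output, and the class and method names are made up.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Toy illustration: evaluates the rule's XPath against a hand-written, AST-like XML tree. */
final class XPathRuleSketch {

    // The XPath expression from the rule above.
    static final String RULE =
        "//Type/ReferenceType/ClassOrInterfaceType[@Image='Molecule']";

    /** Counts the nodes in the given XML text that the XPath expression selects. */
    static int countViolations(String astXml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hits = (NodeList) xpath.evaluate(
                RULE, new InputSource(new StringReader(astXml)), XPathConstants.NODESET);
            return hits.getLength();
        } catch (XPathExpressionException exception) {
            throw new IllegalStateException(
                "Could not evaluate XPath: " + exception.getMessage(), exception);
        }
    }

    public static void main(String[] args) {
        // Hypothetical tree for two declarations: one typed Molecule, one typed IMolecule.
        String ast =
            "<CompilationUnit>"
            + "<LocalVariableDeclaration><Type><ReferenceType>"
            + "<ClassOrInterfaceType Image='Molecule'/>"
            + "</ReferenceType></Type></LocalVariableDeclaration>"
            + "<LocalVariableDeclaration><Type><ReferenceType>"
            + "<ClassOrInterfaceType Image='IMolecule'/>"
            + "</ReferenceType></Type></LocalVariableDeclaration>"
            + "</CompilationUnit>";
        // Only the declaration typed as the implementation is selected.
        System.out.println(countViolations(ast));
    }
}
```

The attribute test on Image is what makes the query hit declarations typed as the implementation while leaving those typed as the interface alone.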

Patch pending...

Friday, May 21, 2010

Cleaner CDK Code #6: set the CDKException's cause Exception

For about two weeks now, I have been working on fixing a bug in the nonotify module: the classes were still sending around change events. I decided to finally do it the proper way: sending around change events is extra functionality, so the data module should extend the nonotify module and not the other way around, as in CDK 1.2 and CDK 1.3 up to now. There was a dirty hack in place, introducing a boolean indicating whether it should send events or not, but that turned out to be insufficient.

Anyway, while working on a major refactoring like this one, you run into many, many small code issues. Clearly, the one I mostly run into is that a lot of code is not written against the interfaces yet. However, I just ran into another bit of code that can be improved.

Set the cause Exception when throwing a CDKException
This is the particular bit of code I ran into:
try {
    molecule = (IMolecule) container.clone();
} catch (CloneNotSupportedException e) {
    throw new CDKException(
        "Error occured during clone " + e
    );
}

However, there are two small improvements that can be applied to this code. The first is that the error message passed to the CDKException constructor can reuse the message of the cause Exception. The second improvement is that we can actually pass the cause exception as cause to the CDKException. Therefore, the CDKException can be constructed as (also fixing the typo :):
new CDKException(
    "Error occurred during clone " + e.getMessage(),
    e
);
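What passing the cause buys you can be sketched without the CDK on the classpath. The SketchException class below is a made-up stand-in for CDKException, assumed only to offer a (message, cause) constructor like the one used above; the other names are hypothetical too.

```java
/** Hypothetical stand-in for CDKException; the real class lives in the CDK. */
class SketchException extends Exception {
    SketchException(String message, Throwable cause) {
        super(message, cause);
    }
}

final class CauseSketch {

    /** Wraps a low-level exception, keeping it reachable via getCause(). */
    static SketchException wrap(CloneNotSupportedException exception) {
        return new SketchException(
            "Error occurred during clone " + exception.getMessage(),
            exception
        );
    }

    public static void main(String[] args) {
        SketchException wrapped = wrap(new CloneNotSupportedException("not cloneable"));
        // The original exception is still attached to the wrapper.
        System.out.println(wrapped.getCause().getClass().getSimpleName());
        // The stack trace now carries a "Caused by:" section for the original exception.
        wrapped.printStackTrace();
    }
}
```

Without the second constructor argument, getCause() would return null and the stack trace of the original problem would be lost.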

Oh, and, of course, CDK code should use variables of at least three characters. Now, the original CDKException construction was on one line, in total having four points of improvement :)

CKAN 1.0: A major contribution to Open Knowledge

Peter reported on the CKAN 1.0 release. I browsed the 27 chemistry packages, and saw for example BODR show up:

The database of packages is community based, and you can simply log in with an OpenID. This allowed me to update some outdated information and add missing information.

As mentioned, the list contains 27 packages, out of a total of almost 1000. So, not so much chemistry there yet, but that's something we are getting used to. Worse even is that of these 27, only a small subset is Open. These are identified, I think, by the blue bar on the left of the packages in the above screenshot. Even then, many of these are labeled public domain, which we know is not very practical. This is why the CC0 waiver exists.

But it at least allows us to create overviews of the available downloadable data. The system is very easy to use. Well done!

Tuesday, May 18, 2010

Cleaner CDK Code #5: develop against interfaces

Rajarshi is at the EBI (or at least was yesterday), talking about his rcdk package (his excellent slides). One slide is about how to create a new atom; he mentions not to use new Atom() but the DefaultChemObjectBuilder instead (slide 73). I do not entirely agree with the message given.

Develop against interfaces
The slide seems to favour the DefaultChemObjectBuilder, but there are likely more suitable builders for a particular application. Currently, I am aware of two alternative builders: the DebugChemObjectBuilder, and the NoNotificationChemObjectBuilder.

The reason the CDK has a builder pattern is the following. It is cleaner to write against interfaces than against implementations, because it allows alternative implementations. I just listed all three provided by the CDK library itself, but other implementations may exist too; they might have a completely different data model, e.g. fully CMLDOM-based, or fully SQL-based. By programming against interfaces, changing the implementation becomes easy.

Now, by using new Atom() you choose a particular implementation (in this case, the one around the DefaultChemObjectBuilder). However, you would like the user to pick the implementation. This is why the CDK library itself uses builders all over the place: it assumes an IChemObjectBuilder is predefined and that is used.

For example:
IChemObjectBuilder builder = DefaultChemObjectBuilder.getInstance();
IMolecule molecule = builder.newMolecule();
IAtom atom = builder.newAtom();
molecule.addAtom(atom);
If your method actually has an IChemObject as input, it can retrieve the builder from there:
public IMolecule addToMolecule(IAtom atom) {
    IChemObjectBuilder builder = atom.getBuilder();
    IMolecule molecule = builder.newMolecule();
    molecule.addAtom(atom);
    return molecule;
}
This latter situation is the most common one in the CDK library, actually. Sometimes it does not apply, and you may need to pass an IChemObjectBuilder to a constructor. This is, for example, the case with the constructor of the SmilesParser.

Now, reconsider the first code example. By defining a builder only once, and reusing that builder in the rest of your code, you only have to change one line to use a different implementation. For example, the DebugChemObjectBuilder that sends debug messages for each set and get call to one of the data classes. I used this in the past, and solved several nasty bugs with this; just by seeing in what order data was set and read. And I only needed to change one single line of code for that.
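The one-line switch can be sketched with stripped-down stand-ins. None of the types below are CDK classes: ToyAtom, ToyBuilder, PlainBuilder and LoggingBuilder are hypothetical names, and the logging builder only loosely mimics the idea of a builder whose objects report get calls.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, stripped-down stand-ins; these are not the CDK interfaces.
interface ToyAtom { String getSymbol(); }
interface ToyBuilder { ToyAtom newAtom(String symbol); }

/** Plain implementation without any extras. */
final class PlainBuilder implements ToyBuilder {
    public ToyAtom newAtom(String symbol) {
        return () -> symbol;
    }
}

/** Implementation that records object creation and every getSymbol() call. */
final class LoggingBuilder implements ToyBuilder {
    final List<String> log = new ArrayList<>();

    public ToyAtom newAtom(String symbol) {
        log.add("newAtom(" + symbol + ")");
        return () -> {
            log.add("getSymbol() -> " + symbol);
            return symbol;
        };
    }
}

final class BuilderSketch {

    /** Client code: it sees only the interfaces, never a concrete class. */
    static String describe(ToyBuilder builder) {
        ToyAtom atom = builder.newAtom("C");
        return "atom " + atom.getSymbol();
    }

    public static void main(String[] args) {
        // The single line to change when another implementation is wanted.
        ToyBuilder builder = new PlainBuilder();
        System.out.println(describe(builder));
    }
}
```

Because describe() never names a concrete class, replacing PlainBuilder with LoggingBuilder in main() changes the behaviour of all objects created downstream without touching any other code.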

Thursday, May 13, 2010

Three ACS-RDF sessions

It has been a while since I gave an update on the ACS-RDF session in Boston. Things are settling in, and it seems there will be three sessions, each with an approximate theme. Each session will have five talks, of which one will be extended, where the speaker will be invited to set the context of the theme. Full programs will follow, but find the wordles of the current sessions below:

CINF003 on Sunday morning: RDF and Computation

CINF0031 on Monday morning: RDF and Ontologies

CINF0032 on Monday afternoon: RDF and Chemical Applications

My favourite chemistry things

Yeah, it's my turn. Standing on the shoulders of ChemJobber, Azmanam, and ScienceBase, here's a list of things I like about chemistry. To put things into perspective first, I note that ChemJobber and Azmanam focused on wet-lab chemistry, and David on fancy molecules. Now, I am a theoretical chemist, and was thinking about which direction to take with the things I like, and how general to make them. This meme is not easy, you know. But here goes:

1. chemical graph theory
Chemical graph theory is one of the common theoretical models chemists work with to make sense of chemical properties. I like it because the graph theory is fairly straightforward, but chemistry adds enough color (literally!) to create a nice complexity that has kept the cheminformatics field going strong for more than 50 years now :) For example, how to adapt the theory to deal with multi-atom bonds :)

2. rare nuclei in the NMRShiftDB
The NMRShiftDB is an Open Data repository for annotated NMR spectra. The fun here is to add NMR spectra of rare nuclei. Don't you just love a molecule with NMR shifts for all atoms?

3. metabolomics
Metabolomics is the research field that studies the small molecules of life. Plant metabolomics is particularly fun. Tens of thousands of molecules, and a lot of metabolite identification to be done, and much more. Lots of cool stuff to do here, and I am trying to secure funding for it. This is what I blogged about metabolomics before. What about this nice secondary metabolite (source: Wikipedia, CC0):

4. hexavalent carbon
Atom types is another theoretical model for chemistry. Atom typing is one of the underlying technologies of force fields, which are used in many, many fields in chemistry. Now, force fields typically take only a subset of atom types. New atom types, consequently, need to be added. One such new atom type was the hexavalent carbon. Rare, very rare, but just the amount of complexity I like about chemistry:

5. self-organizing maps
Kohonen maps, or self-organizing maps (SOM), are a machine learning method that has interesting visualization features. They have numerous applications, also in chemistry. The group where I did my PhD developed a supervised SOM, which I used to classify crystal structures (doi:10.1021/cg060872y). Another of my favorites is the reaction classification by Aires-de-Sousa et al. using unsupervised SOMs.

6. the Maillard reaction
People who know me personally know that I like tasting things. That also means I have to worry about being overweight. Taste is to a large extent governed by cooking, and the Maillard reaction plays an important role here. If you like to learn more about the chemistry of cooking, check out these two blogs.

7. Cb
Cb is a new element on the world wide web. Well, not so new anymore, and the full name is likely more familiar: Chemical blogspace. This social web application brings together blogging chemists world wide. Oh, and this meme is picked up nicely:

8. chemical abstracts
No, not the database, but the nice graphical article abstracts in chemistry journals. ChemFeeds gets it all together. BTW, there remains very much to be done about improving publishing chemistry. I blogged about that repeatedly.

9. organometallics
Organometallics is, like metabolomics, a really interesting area, with lots of complexities (pun intended :). Actually, I am not even aware of an organometallics/metabolomics mashup. Anyone with some nice pointers? I have not blogged about it much, and the one time I did was in relation to chemical graph theory.

10. sparkling fire

Burning things. Nothing more to say about that, I guess. Well, perhaps. Chemists like burning things; others might too, but chemists at least. Blowing up things too. When I was a student, I had a very friendly colleague who liked blowing up things and made TNT himself and took that to university too (stabilized, mind you :). Cool!

Anyway, while googling for something to spice up this tenth item, I ran into the book linked to on the right: Caveman Chemistry: 28 Projects, from the Creation of Fire to the Production of Plastics. The prologue nicely writes up that you need to sparkle some fire in education to get the students enlightened:
I teach chemistry at Hampden-Sydney College, a small liberal-arts college in central Virginia. The students here, by and large, do not come equipped with insatiable curiosity about my discipline and experience has convinced me that the profession of professing has more to do with motivation than with explanation; a student who is not curious will resist even the most valiant attempts at compulsory education; conversely, inquiring minds want to know. A great deal of my time, then, has been spent devising tricks, gimmicks, schemes and plots for leading stubborn horses to water, knowing full well that I can’t make them think.
Now, that leaves me with tagging a few further blogs to continue the meme. The meme is spreading fast, so I hope I do not tag someone who already is tagged. Jean-Claude, Peter, Noel, Rich, Antony, would you mind letting us know your ten favourite chemistry things?

Monday, May 10, 2010

How to use GitHub for [CDK|Bioclipse] code review

Triggered by posts in the past three days, I thought about writing up a short tutorial on how to perform code review for existing code on GitHub. This applies mainly to CDK and Bioclipse source code, but will work for any project hosted on GitHub. Even if a project is not hosted there, you could consider putting up a copy yourself. This example will demonstrate the procedure on CDK functionality in Bioclipse in the bioclipse.cheminformatics repository.

Click on the images to get a higher resolution version.

Step 1: find the class you want to review
Use the GitHub web interface to browse your way towards the source code of the class you want to review. For example, the SmartsMatchingHelper.java:

Step 2: identify something you would like to comment on
Next step is to perform some code reviewing. For example, we might want to ask something about how parseProperty() works:

Step 3: find the last commit that touched the line you would like to comment on
Git has a blame option (also called annotate) which will show you for each line who last changed that line. The GitHub web page makes this functionality available with the 'blame' link just above the first line of the source code:

This link will lead us to a page with a new column on the left side showing commit hashes, the name of the commit author, and the first few characters of the commit message. For example, the part of the web page relevant to the code we want to comment on looks like:

This shows us that commit 3ce78ba5 is the one we are interested in:

Step 4: Look up the line again and add a comment
In the web page with the appropriate commit looked up in the previous step, you scroll down to the line you want to comment on. If you hover over that line, a blue comment bubble will show up on the left side:

Clicking that blue comment icon, you get a dialog where you can enter your comment:

The next step would be to inform the commit author. GitHub actually helps here, and should send a message, like this one:

But it would certainly not hurt if you filed a bug report or sent an email.

Now, I should only convert this into a screencast...

GitHub simplifies code review and leaving comments

The workflow here is that the proposed patch gets uploaded to a GitHub branch or fork; the code reviewer is made aware of the patch, and goes to the commit page on GitHub, and hovers over the line numbers and clicks the 'Add comment' button and leaves a comment; the reviewer informs the author, and the author updates his patch.

BTW, this screenshot is for a paper I am writing about the use of blogs and source code repositories in chem- and bioinformatics education.

Saturday, May 08, 2010

Web 2.0 technologies in Student Assessment

Below should show up the wave (that is, if you have a Google Wave account), about a piece I am writing for a course on PhD Supervision I am following. The aim is to dig up old standards and how they apply to Web 2.0 technologies, including wikis, waves, blogs, source code repositories etc.

Update: there seem to be some problems loading the wave, even for those who have a wave account. Not sure why that is happening. I'm using the Madoqua robot; should I be using something else?

CIP rules #2: parsing @ and @@ from SMILES

I recently wrote about a project for a (partial) CIP implementation. This implementation is in place, and we are working towards setting up an extensive test suite. The data set we had in mind was available as SMILES and as MDL molfile. Now, the latter does not really specify the stereochemistry of the tetrahedral centers, and relies on wedge bonding. Actually, a few years ago Jonathan Brecher wrote up the IUPAC recommendation for the use of the wedge bond for chirality specification (doi:10.1351/pac200678101897), with 74 pages of rules and examples, like the following (copyright by authors or journal; I'm claiming fair use):

So, using wedges leaves plenty of room for incorrectly specifying the stereochemistry. Therefore, we decided to go for SMILES, even though Noel recently showed that processing stereochemistry in SMILES is not trivial either. The SMILES I am currently using:
• Br[C@@H](Cl)I
• Br[C@H](Cl)I
• Br[C@@]([H])(Cl)I
• Br[C@]([H])(Cl)I
• [C@]12(OC1)NCN2
• C[C@H](O)[C@H](O)C
• NC([C@H](O)C)Cl
• I1.Cl2.Br3.[C@]123CCC
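As a first impression of what has to be read from these strings, here is a small sketch that only lists the @ and @@ marks found inside bracket atoms. It is a toy scanner with made-up names, not the CDK SMILES parser, and it deliberately ignores the hard part: in SMILES, @ means anticlockwise and @@ clockwise as seen from the first neighbour, so turning a mark into actual stereochemistry requires the neighbour order, including implicit hydrogens and ring closures, which is what several of the SMILES above are meant to test.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy scanner that lists the @ / @@ marks found in bracket atoms. Not a SMILES parser. */
final class ChiralMarkSketch {

    /** Returns one entry ("@" or "@@") per bracket atom that carries a chirality mark. */
    static List<String> chiralMarks(String smiles) {
        List<String> marks = new ArrayList<>();
        int open = smiles.indexOf('[');
        while (open >= 0) {
            int close = smiles.indexOf(']', open);
            if (close < 0) {
                break; // unbalanced bracket: stop scanning
            }
            String atom = smiles.substring(open + 1, close);
            // Check "@@" first, because it also contains "@".
            if (atom.contains("@@")) {
                marks.add("@@");
            } else if (atom.contains("@")) {
                marks.add("@");
            }
            open = smiles.indexOf('[', close);
        }
        return marks;
    }

    public static void main(String[] args) {
        String[] examples = {
            "Br[C@@H](Cl)I", "Br[C@]([H])(Cl)I", "C[C@H](O)[C@H](O)C", "I1.Cl2.Br3.[C@]123CCC"
        };
        for (String smiles : examples) {
            System.out.println(smiles + " " + chiralMarks(smiles));
        }
    }
}
```

Bracket atoms without a mark, such as the explicit [H], are skipped, so the result has one entry per stereocentre that is written with @ or @@.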