DrugAI-Gen


Generate drug-like molecules using an LSTM network

This is a continuation of my earlier project, DrugAI. I have added a generator script to the project that can generate molecules. Trained for long enough, DrugAI-Gen.py generates drug-like molecules using an LSTM network. In this write-up I will give a brief idea of the script and its inner workings.

After posting this project on Reddit I got some less than positive responses from users regarding the use of SMILES. I chose SMILES because it can be easily downloaded as an organized dataset. Because the SMILES representation dates from the 1980s, a large amount of data is freely available on the web, whereas other representations may require data scraping and cleaning. The main aim of this project is to show a proof of concept, and the code can be ported to datasets from other disciplines with only minor changes.

Mechanism

As in DrugAI, we start by importing Stahl's dataset (stahl.csv) into memory. Here the one-hot encoded values are our features and the SMILES are our targets. If you open stahl.csv you will see drug molecule SMILES of various lengths, some longer than 90 characters and others shorter. As the SMILES lengths are uneven, we first make them even by appending a null character "|" to the end of each SMILES until all of them have the same length. The null character also helps to determine the end of a SMILES when predicting a molecule. After adding the null character we convert the SMILES from text to a computer-readable sequence using our custom function.
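A minimal sketch of the padding and encoding step described above. The function names (pad_smiles, build_vocab, one_hot_encode) and the three toy SMILES are hypothetical and are not taken from DrugAI-Gen.py or stahl.csv; the one-hot layout is an assumption about what a "computer-readable sequence" could look like.

```python
import numpy as np

PAD = "|"  # null character used for padding and as the end-of-SMILES marker

def pad_smiles(smiles_list, pad=PAD):
    """Right-pad every SMILES string to the length of the longest one."""
    max_len = max(len(s) for s in smiles_list)
    return [s.ljust(max_len, pad) for s in smiles_list]

def build_vocab(padded):
    """Map each distinct character to an integer index (sorted for reproducibility)."""
    chars = sorted(set("".join(padded)))
    return {c: i for i, c in enumerate(chars)}

def one_hot_encode(padded, char_to_idx):
    """Return an array of shape (n_molecules, max_len, n_chars)."""
    out = np.zeros((len(padded), len(padded[0]), len(char_to_idx)), dtype=np.float32)
    for i, s in enumerate(padded):
        for t, c in enumerate(s):
            out[i, t, char_to_idx[c]] = 1.0
    return out

# Toy input (assumption): ethanol, benzene, acetic acid
smiles = ["CCO", "c1ccccc1", "CC(=O)O"]
padded = pad_smiles(smiles)
char_to_idx = build_vocab(padded)
Y = one_hot_encode(padded, char_to_idx)

print(padded)   # every string now has the same length
print(Y.shape)  # (molecules, time-steps, characters)
```

Each row of Y along the last axis contains exactly one 1, marking which character sits at that position of the padded SMILES.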

For Keras LSTM networks the input dataset must also have one more dimension, called the time-step. We convert our input dataset without time-steps to a new dataset with time-steps. I have used the maximum length of a SMILES (105) as the time-step value. Then we code a categorical LSTM model in Keras with a dropout of 30 percent and a softmax activation function in the output. The softmax function squashes the predictions into the range (0,1) so that they sum to 1 and can be read as probabilities. I have coded DrugAI-Generator as both a regression LSTM model and a categorical LSTM model; I find the results of the categorical model promising, while the output of the regression model looks random.
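To make the time-step conversion and the softmax more concrete, here is a small NumPy sketch. It assumes the time-step dimension is added by repeating each feature vector at every time-step, which is one simple way to do it and not necessarily what DrugAI-Gen.py does; add_time_steps and softmax are hypothetical helper names. The 6-element feature vectors and the value 105 follow the text.

```python
import numpy as np

def add_time_steps(features, time_steps):
    """Repeat each feature vector at every time-step: (n, f) -> (n, time_steps, f)."""
    features = np.asarray(features, dtype=np.float32)
    return np.repeat(features[:, np.newaxis, :], time_steps, axis=1)

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Two one-hot feature vectors, expanded to 105 time-steps
X = add_time_steps([[0, 0, 0, 1, 0, 0], [0, 1, 0, 0, 0, 0]], time_steps=105)
print(X.shape)  # (2, 105, 6)

# Softmax turns arbitrary scores into values in (0, 1) that sum to 1
probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs, probs.sum())
```

The resulting shape (samples, time-steps, features) is the 3D layout that Keras recurrent layers expect as input.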

Now we just have to train the model by providing our feature and target values to the LSTM model. Because of my low-end system I trained the LSTM model for only 20 epochs, and the loss was about 1.3. This is not a good loss value, and training for more epochs should decrease it further. I tried to predict molecule structures using this model, and even so the output SMILES look promising.
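To get a feeling for what a loss of about 1.3 means, the sketch below assumes the loss is the mean categorical cross-entropy with natural logarithm (an assumption; the write-up does not name the loss function). Under that assumption, exp(-loss) is the geometric-mean probability the model assigns to the correct character. The function name and the toy arrays are hypothetical.

```python
import math
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean negative log-probability assigned to the correct class."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(-(y_true * np.log(y_pred)).sum(axis=-1).mean())

# Toy example: the correct class gets probability 0.5 in both rows
y_true = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)
y_pred = np.array([[0.5, 0.3, 0.2], [0.25, 0.5, 0.25]])
loss = categorical_cross_entropy(y_true, y_pred)
print(loss)  # equals ln(2) here

# Reading the reported loss the other way round (assumes cross-entropy, natural log)
typical_prob = math.exp(-1.3)
print(round(typical_prob, 3))  # roughly 0.27
```

So, if the assumption holds, a loss of 1.3 corresponds to the model giving the correct character a probability of roughly 0.27 on (geometric) average.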

I have used this array as the prediction feature: [[0,0,0,1,0,0],[0,1,0,0,0,0],[0,0,0,0,0,1]]. The SMILES created by the model look like this:

Epoch: 0

'[+++++[PPPPPlllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll' '))))=======]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]' 'CCrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr']>
Epoch: 20
['CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC)))))|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||', 'CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||', 'CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC|||||||||||||||||||||||||||||||||||||||||||||||||||']
As you can see, the model starts to understand the important features of SMILES data. It knows that a SMILES always ends with the null character and that in a drug/organic molecule the most important element is carbon, C. The first molecule is not a valid one (its closing parentheses have no matching opening parentheses), but the second and third generated SMILES are valid: they are octatriacontane and tetrapentacontane respectively.
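The step from model output to a SMILES string can be sketched as follows. This assumes the prediction for one molecule is an array of softmax probabilities with shape (time-steps, characters) and that the most likely character is picked at each time-step; decode, parens_balanced, the tiny vocabulary and the toy probabilities are all hypothetical. The parenthesis check is only a cheap sanity test, not a real SMILES validator.

```python
import numpy as np

PAD = "|"  # null character marking the end of a SMILES

def decode(probs, idx_to_char, pad=PAD):
    """Pick the most likely character at each time-step and cut at the first pad."""
    chars = [idx_to_char[int(i)] for i in np.argmax(probs, axis=-1)]
    return "".join(chars).split(pad, 1)[0]

def parens_balanced(smiles):
    """Cheap sanity check only; a balanced string can still be an invalid SMILES."""
    depth = 0
    for c in smiles:
        if c == "(":
            depth += 1
        elif c == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

idx_to_char = {0: "C", 1: ")", 2: PAD}  # toy vocabulary (assumption)

# Toy probabilities for 5 time-steps: C, C, ), |, |
probs = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.1, 0.7],
])

s = decode(probs, idx_to_char)
print(s)                          # CC)
print(parens_balanced(s))         # False: ")" has no matching "("
print(parens_balanced("C" * 38))  # True: a plain carbon chain
```

For an actual validity check a cheminformatics toolkit would be the better choice; the check above only catches the kind of unmatched parentheses seen in the first generated string.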

Finally, I would recommend training this script on a high-performance GPU machine for a longer time to get much cleaner results. I would also ask you to give me your suggestions, opinions and advice regarding this, and to share your own results with me so that I can improve in the future.

GitHub: https://github.com/Gananath/DrugAI/blob/master/DrugAI-Gen.py

Reference

  • https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py
  • http://karpathy.github.io/2015/05/21/rnn-effectiveness/