DrugAI-GAN


Experiments with GANs for drug-like molecule generation.

DrugAI-WGAN: a Wasserstein GAN model with a CNN; this model currently trains the fastest and probably gives the best result. The result is at the bottom of the page.

If you are visiting here for the first time, I would recommend reading my earlier project DrugAI before going any further. The main aim of this project was to carry out some personal experiments with generative adversarial networks (GANs) for sequence generation. GANs were introduced by Ian Goodfellow in 2014 for image generation. They have revolutionized the way images are generated by machines, and many research groups around the world are currently working on this algorithm. GANs are so diverse that researchers have created many variants such as StackGAN, CycleGAN, InfoGAN, DCGAN and pix2pix. I would also recommend going through those projects and seeing for yourself the surreal images created by GANs.

[Image: example results from CycleGAN]

In my previous sub-project of DrugAI I used an LSTM model to generate drug-like molecules. I wanted to see whether these kinds of molecules can also be generated using GANs. While searching for possible applications of GANs in text generation, I came across a Reddit post by Ian Goodfellow himself. According to him and another research paper, it is very hard to train a GAN on a discrete dataset. Since sequence generation involves discrete data, it would be very hard to generate any meaningful sequence with one. This project is just an experiment to see for myself what happens if a GAN is used for sequence generation.

Most of the code for this project was salvaged from my earlier LSTM model project, with an additional GAN model added to it. The GAN model was put together using code and recommendations from articles freely available on the web, and I am in no way guaranteeing its accuracy. Instead of using Gaussian noise as the input, I chose my earlier LSTM model's one-hot encoded x values as the input (z). I had already trained the GAN with Gaussian inputs, and what I observed in my initial 1000-epoch training runs was that a GAN for sequence generation tends to learn faster with one-hot encoded inputs than with Gaussian inputs.
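To make the difference concrete, here is a rough sketch of the two kinds of inputs (assuming 6 activity classes; this is an illustration, not the exact training code):

import numpy as np

n_classes = 6   # activity classes in the stahl.csv dataset (assumed)
batch_size = 3

# Option 1: the usual Gaussian noise prior for z
z_gaussian = np.random.normal(0.0, 1.0, size=(batch_size, n_classes))

# Option 2: one-hot encoded class vectors as z (what seemed to train faster here),
# e.g. the INPUT shown further below: [[0,0,0,1,0,0],[0,1,0,0,0,0],[0,0,0,0,0,1]]
labels = np.array([3, 1, 5])
z_one_hot = np.eye(n_classes)[labels]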

Because I have a very old system, training deep learning algorithms for a long duration is not possible on it. I recently came across floydhub.com, which has a free trial period for training deep learning models on either GPU or CPU. I know about AWS and its free tier for deep learning, but it requires submitting credit card or bank account details, which I am not comfortable with. This is the first time I am using a cloud server for training deep learning projects, and so far I am impressed by FloydHub and its support. The web interface is very simple, and even as a first-time user I could deploy my projects on FloydHub with ease. Initially there were some hiccups, but the support was extraordinary and fast.

I deployed my GAN project on FloydHub like this:
  • Download DrugAI-GAN.py and the stahl.csv dataset from GitHub

  • Copy and save those files inside a new folder named DrugAI

  • Install the FloydHub command line tool: $ pip install -U floyd-cli

  • Log in to FloydHub: $ floyd login

  • From the shell, cd into the DrugAI folder: $ cd DrugAI

  • Initialize a new project: $ floyd init DrugAI-GAN

  • Run the script: $ floyd run "python DrugAI-GAN.py"

Even with a cloud service like FloydHub, it has been taking me more than a day on CPU to train with the complete dataset. On the free tier a FloydHub job automatically times out after 24 hours, which is a current limitation of the service. So instead of using the complete dataset (>300 rows) I used only 30 rows. This allowed me to train the model for more epochs and thus refine it.
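A minimal sketch of that kind of subsetting, assuming the dataset is read with pandas (the actual script may load it differently):

import pandas as pd

# Load the full dataset (>300 rows) and keep only the first 30 rows
# so more epochs fit inside the 24-hour job limit.
data = pd.read_csv('stahl.csv')
data = data.head(30)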

INPUT

[[0,0,0,1,0,0],[0,1,0,0,0,0],[0,0,0,0,0,1]]

EPOCH: 0

['NNNNNNNNN33333NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN', 'P5555lllll[[[[[[[[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||', '===BBBBBBBBBBBBBBBBBBB===============================================rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr']

EPOCH: 3000+

['CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC||||||||||||||||||||||||||||||||||||||||||' 'CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC|||||||||||||||||||||||||||||||||||||' 'CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC|||||||||||||||||||||||||']

The end result seems interesting, but as many in the Reddit discussion suggested, the GAN may learn something from the dataset, but mostly the output can be rubbish. Even with a dropout layer, I am worried that the end result may be caused by overfitting.

UPDATE: I trained it for a little longer (a day, to be exact) and it seems the GAN learns only a single sequence. Whatever input I gave it, the output was always the same; this is a well-known problem in GANs called "mode collapse". In order to prevent mode collapse I retrained the discriminator using artificial noise data, and so far the result is interesting.
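Here is a minimal sketch of that idea in Keras; the discriminator below is only a stand-in with assumed dimensions, not the actual DrugAI-GAN model:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Flatten

seq_len, charset_len = 100, 30   # assumed one-hot sequence dimensions
batch_size = 32

# Stand-in discriminator; the real model in DrugAI-GAN differs.
discriminator = Sequential([
    Flatten(input_shape=(seq_len, charset_len)),
    Dense(128, activation='relu'),
    Dense(1, activation='sigmoid'),
])
discriminator.compile(loss='binary_crossentropy', optimizer='adam')

# Random "noise molecules" labelled as fake, fed to the discriminator as an
# extra update on top of the usual real/generated batches, so it stops
# accepting a single collapsed generator output.
noise_seqs = np.random.uniform(0.0, 1.0, size=(batch_size, seq_len, charset_len))
noise_labels = np.zeros((batch_size, 1))
d_loss_noise = discriminator.train_on_batch(noise_seqs, noise_labels)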

EPOCH: 6000+

['CC1NNNNNNNNNNNCCCCCCCCCCCC====CCCCCCC||||||||||||||||||||||||||||||||||||||||||||||||||||' 'CC1NNNNNNNNCCCCCCCCCCCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN111111111111111111111' 'CC1NNNNNNNNNCCCCCCCCCCCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN111111111111111111111']

Maybe training it for more epochs, with more data, and optimizing the learning rates of the generator and discriminator will produce more realistic drug molecules.

Instead of predicting a single molecule per class, we can also generate separate molecules from the same class. For example, if I want to generate a molecule from the first class, instead of giving one-hot encoded discrete values I can give it continuous values.
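As a rough sketch, assuming a trained Keras generator that takes the class vector as input (generator and decode_smiles below are placeholder names, not the actual script):

import numpy as np

# Three continuous class vectors that all emphasise the first class,
# similar to the INPUT shown in the result below.
z = np.array([
    [0.6, 0, 0, 0, 0, 0],
    [0.3, 0, 0, 0, 0, 0],
    [0.7, 0, 0, 0, 0, 0],
])

# Hypothetical calls; 'generator' is the trained Keras model and
# 'decode_smiles' converts its one-hot-like output back to characters.
# generated = generator.predict(z)
# smiles = [decode_smiles(g) for g in generated]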

RESULT

DrugAI-GAN
INPUT: [[0.6,0,0,0,0,0],[0.3,0,0,0,0,0],[0.7,0,0,0,0,0]]
['CCNNNNNNNNNNNCCCCCCC====CCCCCCC||||||||||||||||||||||||||||||||||||||||||||||||||||||||||' 'CCNNNNNNNNNNNNCCCCCCCC====CCCCCCC||||||||||||||||||||||||||||||||||||||||||||||||||||||||' 'CCNNNNNNNNNNNCCCCCCC====CCCCCCC||||||||||||||||||||||||||||||||||||||||||||||||||||||||||']

Project Github Page: https://github.com/Gananath/DrugAI/blob/master/DrugAI-GAN.py

FloydHub logs: https://github.com/Gananath/DrugAI/blob/master/Others/Logs/floydlogs_gan.txt

DrugAI-WGAN

(updated on Aug 3 2017)

INPUT: [[0, 0, 0, 1, 0, 0],[0, 1, 0, 0, 0, 0],[0, 0, 0, 0, 0, 1]]

['CC1=C(C(C(=O)O)(=CC=N2[S]CCCCCC(C(Cl)C1C4)[+])C2=C4=O|||||||||||||||||||||||||||' 'CC1=C(C(C(=O)OO(=CC=N2[N]CCCCCC(C(Cl)C1C3)[+])C2=CC=O)||||||||||||||||||||||||||' 'CC1=C(C(C(=O)O)(=CC=N2[N]CCC=CC(C(CO)C1C3)[+])C2=CC=O)||||||||||||||||||||||||||']

*After training for 1 hour on GPU
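For context on the Wasserstein loss mentioned at the top, here is a minimal Keras-style sketch of the critic side with weight clipping; this is only an illustration with assumed dimensions, not the actual DrugAI-WGAN code:

import numpy as np
from keras import backend as K
from keras.models import Sequential
from keras.layers import Conv1D, Flatten, Dense

def wasserstein_loss(y_true, y_pred):
    # With y_true = +1/-1 marking the two kinds of samples, minimising this
    # pushes the critic scores of real and generated batches apart.
    return K.mean(y_true * y_pred)

seq_len, charset_len = 100, 30   # assumed one-hot sequence dimensions

critic = Sequential([
    Conv1D(64, 3, activation='relu', input_shape=(seq_len, charset_len)),
    Flatten(),
    Dense(1),                    # linear output: a score, not a probability
])
critic.compile(loss=wasserstein_loss, optimizer='rmsprop')

# After each critic update, clip the weights to keep it roughly Lipschitz.
for layer in critic.layers:
    weights = [np.clip(w, -0.01, 0.01) for w in layer.get_weights()]
    layer.set_weights(weights)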

Project Github Page: https://github.com/Gananath/DrugAI/tree/master/DrugAI-WGAN

FloydHub logs: https://github.com/Gananath/DrugAI/blob/master/DrugAI-WGAN/floyd_logs.txt

Helpful Links:

  • https://github.com/soumith/ganhacks
  • https://medium.com/towards-data-science/gan-by-example-using-keras-on-tensorflow-backend-1a6d515a60d0