bioinformatics hackathon - building a breast cancer classifier

October 13, 2024

first hackathon

Over the summer, a friend asked me to participate in a bioinformatics hackathon with a few of his friends. At first, I was hesitant due to my lack of knowledge in biology. However, realizing this was a good learning opportunity, I caved.

development process

Before the event, our group knew that we wanted to use a convolutional neural network (CNN) to classify cancer images. However, our path to success was not very clear. We lacked knowledge about building convolutional neural networks with PyTorch, which network designs and hyperparameters would train best, and how to visually differentiate between benign and malignant cells. Still, we were eager to learn and tread into unknown territory.

In the first hour, we scoured the internet for possible datasets to train our network on. We specifically searched for datasets with three criteria:

  1. Pre-classified images (At least two classes)
    • Didn't want to spend time classifying images
  2. 60,000+ images
    • Wanted sufficient data to train on, as well as the ability to potentially modify the dataset.
  3. Small image sizes
    • Unsure how long and how many attempts it would take to train our network, so we wanted something that would train in a reasonable time.

Eventually, we settled on a dataset of 250,000 breast cancer histopathological images. Each image was 50x50 pixels and pre-classified as benign or malignant, satisfying all three of our conditions.

Next, we moved on to developing our project. We split into two teams: one to build the CNN (which I was a part of), the other to develop a web app that would connect with our trained network.

As part of the CNN development team, I watched a few videos describing exactly what a CNN was and read PyTorch's documentation. After a few hours, we had built a Python notebook that we thought would work. However, when we ran our code, we kept receiving size errors. After dedicating quite a bit of time to debugging, we realized that some images in the dataset didn't match the advertised size, so I wrote a script to remove those images. Additionally, we came to the consensus that a uniform distribution of benign and malignant images could help our network train better, so we balanced the classes, leaving us with 150,000 images.
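The cleanup looked roughly like the sketch below; the folder layout and file type are assumptions on my part.

```python
import random
from pathlib import Path
from PIL import Image

# Assumed layout: one folder of PNGs per class.
DATA_DIR = Path("data")
CLASSES = ["benign", "malignant"]

# Drop every image that isn't exactly 50x50.
for path in DATA_DIR.rglob("*.png"):
    with Image.open(path) as img:
        wrong_size = img.size != (50, 50)
    if wrong_size:
        path.unlink()

# Balance the classes by randomly discarding surplus majority-class images.
by_class = {c: list((DATA_DIR / c).glob("*.png")) for c in CLASSES}
keep = min(len(paths) for paths in by_class.values())
for paths in by_class.values():
    random.shuffle(paths)
    for path in paths[keep:]:
        path.unlink()
```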

At this point, our network was extremely basic, consisting of:

  1. A convolution layer with 3 in channels, 128 out channels, a kernel size of 3, and a padding of 1.
  2. Another convolution layer with 128 in channels, 64 out channels, a kernel size of 3, and a padding of 1.
  3. A pool layer with a kernel size of 2 and a stride of 2.
  4. Another convolution layer with 64 in channels, 32 out channels, a kernel size of 3, and a padding of 1.
  5. A linear layer with 32x25x25 = 20,000 in features and 2 out features.

We used a ReLU activation function on all convolution layers. Additionally, we normalized our images before feeding them through the network.
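In PyTorch, that first architecture looks roughly like this; the class and variable names are my reconstruction:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TumorCNN(nn.Module):
    """Our first network for 3x50x50 histopathology images."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 128, kernel_size=3, padding=1)   # 3x50x50 -> 128x50x50
        self.conv2 = nn.Conv2d(128, 64, kernel_size=3, padding=1)  # -> 64x50x50
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)          # -> 64x25x25
        self.conv3 = nn.Conv2d(64, 32, kernel_size=3, padding=1)   # -> 32x25x25
        self.fc = nn.Linear(32 * 25 * 25, 2)                       # 20,000 -> benign/malignant

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = self.pool(F.relu(self.conv2(x)))
        x = F.relu(self.conv3(x))
        return self.fc(torch.flatten(x, 1))
```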

Training our network took about 10 minutes, which was quite a lot considering our network and dataset were relatively small. Unfortunately, despite the organizers having access to SciNet's GPU cluster, we could not get it working at high speeds consistently, so we resorted to using Kaggle.

After training it on 70% of the dataset using a cross-entropy loss function, stochastic gradient descent, a learning rate of 0.01, a weight decay of 0.01, and 5 epochs, we had a training accuracy of 82%. And after testing it on the remaining images, we had a test accuracy of 82% as well.
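The training setup looked roughly like this; the batch size and normalization stats are guesses on my part, and `TumorCNN` is the network sketched above.

```python
import torch
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms

# Load the cleaned images from class subfolders and normalize them.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # placeholder stats
])
dataset = datasets.ImageFolder("data", transform=transform)

# 70/30 train/test split.
train_size = int(0.7 * len(dataset))
train_set, test_set = random_split(dataset, [train_size, len(dataset) - train_size])
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = TumorCNN().to(device)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=0.01)

for epoch in range(5):
    correct = total = 0
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        correct += (outputs.argmax(dim=1) == labels).sum().item()
        total += labels.size(0)
    print(f"epoch {epoch + 1}: train accuracy {correct / total:.2%}")
```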

We had mixed feelings about our results. On one hand, we thought 82% was quite good for our first run. On the other, 82% was far from being "useful". Thus we spent the rest of our time adjusting our hyperparameters in an attempt to achieve better results.

At first, we tried many things (variations on the sweep sketched after this list), such as:

  • Changing the batch size, both higher (to get a more "accurate descent") and lower (to potentially increase "luck")
  • Increasing our in and out channels to create a more advanced network
  • Adding more convolution and pooling layers, again to create a more advanced network
  • Increasing epochs, since our test accuracy matching our train accuracy suggested we weren't overtraining our network
  • Adjusting our learning rate, as we were not sure if it was too high or too low
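Our tuning amounted to a manual grid search over these knobs. As a sketch, with `train_and_eval` as a hypothetical helper that wraps the training loop above and returns test accuracy:

```python
from itertools import product

# Candidate values for each hyperparameter (illustrative, not our exact grid).
batch_sizes = [32, 64, 128]
learning_rates = [0.001, 0.01, 0.1]
epoch_counts = [5, 10]

best_acc, best_cfg = 0.0, None
for bs, lr, epochs in product(batch_sizes, learning_rates, epoch_counts):
    acc = train_and_eval(batch_size=bs, lr=lr, epochs=epochs)  # hypothetical helper
    print(f"bs={bs} lr={lr} epochs={epochs} -> test acc {acc:.2%}")
    if acc > best_acc:
        best_acc, best_cfg = acc, (bs, lr, epochs)

print("best config:", best_cfg, f"({best_acc:.2%})")
```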

After implementing variations of these changes for hours, we made only marginal progress, increasing our accuracy by a measly 1%. Luckily, one of the more experienced participants waddled into our room and gave us some good advice on which hyperparameters could be most effective.

He suggested three things (sketched after this list):

  1. Adjusting kernel sizes through the network
  2. Using a simpler network (fewer channels and layers), since the input is relatively small and it's faster to train
  3. Batch normalizing between layers
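Following that advice, our variants looked something like the sketch below; the exact channel counts and kernel sizes are my reconstruction. The smaller network also trained much faster, which mattered given how little time we had left.

```python
import torch
import torch.nn as nn

class TumorCNNv2(nn.Module):
    """A simpler, batch-normalized variant: larger kernel early, smaller later."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, padding=2),   # 3x50x50 -> 16x50x50
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),                           # -> 16x25x25
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # -> 32x25x25
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),                           # -> 32x12x12
        )
        self.fc = nn.Linear(32 * 12 * 12, 2)

    def forward(self, x):
        return self.fc(torch.flatten(self.features(x), 1))
```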

With this, we created many networks according to these standards and spent the next few hours training them. In the end, we peaked at an 86% test accuracy, which wasn't the worst. Given more time, I believe we could have achieved upwards of 90%, but we needed to save time to integrate the two parts of our project.

Integrating our network wasn't too difficult. We exported the trained model from the notebook, and I wrote a script to feed images forward through the network and return the result. We used Django REST Framework to create an API that our front end could call to exchange information.
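The endpoint was along these lines; the file name, label order, and transform are assumptions on my part, and the view gets wired into urls.py as usual.

```python
import torch
from PIL import Image
from rest_framework.parsers import MultiPartParser
from rest_framework.response import Response
from rest_framework.views import APIView
from torchvision import transforms

# Load the exported model once at startup (assumed file name).
MODEL = torch.load("model.pt", map_location="cpu")
MODEL.eval()

TRANSFORM = transforms.Compose([
    transforms.Resize((50, 50)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # placeholder stats
])
LABELS = ["benign", "malignant"]  # assumed class order

class ClassifyView(APIView):
    parser_classes = [MultiPartParser]

    def post(self, request):
        # Uploaded files are file-like, so PIL can open them directly.
        image = Image.open(request.FILES["image"]).convert("RGB")
        with torch.no_grad():
            logits = MODEL(TRANSFORM(image).unsqueeze(0))
        return Response({"prediction": LABELS[logits.argmax(dim=1).item()]})
```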

tldr: final product

After about 30 hours total, we submitted our final product: a web app that classifies user-uploaded breast cancer histopathological images as benign or malignant, and displays a gradient heat map of the last convolution layer's weights.
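For the heat map, a Grad-CAM-style pass over the last convolution layer would look roughly like this; `model` and `image` are hypothetical names, and our exact method may have differed.

```python
import torch
import torch.nn.functional as F

# Capture the last conv layer's activations and the gradients flowing into it.
activations, gradients = {}, {}
model.conv3.register_forward_hook(lambda m, i, o: activations.update(value=o.detach()))
model.conv3.register_full_backward_hook(lambda m, gi, go: gradients.update(value=go[0].detach()))

logits = model(image.unsqueeze(0))        # image: normalized 3x50x50 tensor
logits[0, logits.argmax(dim=1).item()].backward()

# Weight each activation map by its average gradient, then sum and rectify.
weights = gradients["value"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * activations["value"]).sum(dim=1)).squeeze(0)
cam = cam / (cam.max() + 1e-8)            # normalize to [0, 1] for display
```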

learning

Did we win any awards? No. But I learned many new things through this experience:

  1. Pytorch
    • Before this, I had only created my own neural networks from scratch and used TensorFlow.
  2. Convolutional Neural Networks
    • I had previously understood and used multilayer perceptron networks, so my knowledge translated well to CNNs and grew from there.
  3. Hyperparameters
    • Before, my knowledge of hyperparameters was quite limited. Now, I understand the consequences of each change much more deeply and how to use them effectively.
  4. Data Processing
    • One judge mentioned that he works with machine learning in the medical sector, which aligned well with our project. He discussed how the most important part of training a network is getting good data.
  5. Super Computers
    • Not mentioned before, but while working on this, I was fortunate to have conversations with the staff. Through these, I learned that a few universities and organizations across Canada have supercomputers available for educational use. I would like to gain access to one some day.