A tutorial on reading captchas with TensorFlow

TensorFlow is an open-source machine learning library for research and production. Its high-level Keras API provides building blocks to create and train deep learning models.

The official TensorFlow tutorials are a good place to start playing with some basic examples.

This tutorial gives another example, a slightly more complicated and realistic one: reading captcha images with the TensorFlow Keras API. Since I am a newbie myself, there may be mistakes I am not even aware of. If you find something wrong or have suggestions, please do tell me: submit an issue or leave a comment.

When I wrote this tutorial, I was using TensorFlow version 1.10.0. All the code can be downloaded from https://github.com/zxdong262/tf-captcha-reader


In this tutorial, we use git, python3, pip3 and npm. I will skip the installation of python3, pip3 and git and leave that to you.

We will use npm to install a font package, which the Python code uses to render randomly generated digits and English letters. You can download and install the latest Node.js release from https://nodejs.org (npm is installed along with it), or use nvm, which is the recommended way to install Node.js and npm. Simply run:

curl -o- https://raw.githubusercontent.com/creationix/nvm/v0.33.11/install.sh | bash
# or
wget -qO- https://raw.githubusercontent.com/creationix/nvm/v0.33.11/install.sh | bash
# if that still fails, read the nvm documentation: https://github.com/creationix/nvm#installation
# once nvm is installed, open a new terminal and install Node.js version 8
nvm install 8

Let's get the code running first; we can talk about how it works later.

# clone the repo
git clone https://github.com/zxdong262/tf-captcha-reader.git
cd tf-captcha-reader

# install the font pack
npm i

# install the python libs
pip3 install tensorflow numpy Pillow scipy opencv-python --user

# all ok, then fire it
python3 main.py

The output should look like this:

tensorflow version: 1.10.0
loading data
trainData: 35952
testData: 9000
Epoch 1/25
2018-08-22 08:49:53.459550: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
35952/35952 [==============================] - 3s 84us/step - loss: 0.7811 - acc: 0.8287
Epoch 2/25
35952/35952 [==============================] - 2s 56us/step - loss: 0.0741 - acc: 0.9677
Epoch 25/25
35952/35952 [==============================] - 2s 57us/step - loss: 0.0730 - acc: 0.9678
9000/9000 [==============================] - 0s 25us/step
evaluate test data set:
test_loss: 0.2720036897820214
test_acc: 0.94
predict example-image(example-images/example-captcha.png):
['l', '8', 'V', 'T']

The steps

The goal is to read captcha images with TensorFlow. Let's get there in a few steps. You can certainly read the source code and skip my descriptions, but these explanations should help as a guide.

  1. Write an image generator function to generate captcha images. This simulates real-world captchas.

    In this demo, we create captcha images with 4 or 5 letters or digits. Each character gets a random color, is rotated by a random number of degrees, and is placed in a line at a random distance from its neighbor, so characters may overlap. It is fairly simple, but not a trivial captcha: the digit `1` and the letters `l` and `I` can be hard to distinguish, sometimes even for human eyes. An example image would look like this:

    The main function looks like this. It creates the image for a single random character; 4 or 5 of these are then pasted onto a blank image.

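    from random import randint
    from PIL import Image, ImageDraw, ImageFont

    # TEXT_IMAGE_SIZE and randomChar() are defined elsewhere in the source
    def createTextImg():
      '''
      create image with random char.
      '''
      # white background, fully transparent
      BG_COLOR = (255, 255, 255, 0)
      # random, fairly dark text color
      R = randint(60, 190)
      G = randint(60, 190)
      B = randint(60, 190)
      # random rotation in degrees
      rotate = randint(0, 13)
      TEXT_COLOR = (R, G, B)
      TEXT_POS = (0, 0)
      fontSize = 26
      img = Image.new('RGBA', TEXT_IMAGE_SIZE, color = BG_COLOR)
      # the font file comes from the npm package installed with `npm i`
      font = ImageFont.truetype('./node_modules/open-sans-fonts/open-sans/Regular/OpenSans-Regular.ttf', size=fontSize)
      d = ImageDraw.Draw(img)
      char = randomChar()
      d.text(TEXT_POS, char, fill=TEXT_COLOR, font=font)
      img = img.rotate(rotate, expand=0)
      return img, char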

    This generator function is in https://github.com/zxdong262/tf-captcha-reader/blob/master/img/imageGenerator.py
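
The layout part of the generator (placing 4 or 5 characters in a line at random distances, possibly overlapping) can be sketched without any imaging library. This is only an illustration: the helper name `layout_offsets`, the character width, and the gap range are hypothetical and not taken from the repo.

```python
import random

def layout_offsets(count, char_width=20, min_gap=-6, max_gap=4, rng=random):
    """Return the left x-coordinate for each of `count` characters.

    A negative gap means a character overlaps its left neighbour.
    All sizes are illustrative assumptions, not the repo's values.
    """
    offsets = []
    x = 0
    for _ in range(count):
        offsets.append(x)
        # advance by the character width plus a random, possibly negative, gap
        x += char_width + rng.randint(min_gap, max_gap)
    return offsets

rng = random.Random(42)        # seeded so the run is repeatable
count = rng.randint(4, 5)      # 4 or 5 characters per captcha
offsets = layout_offsets(count, rng=rng)
print(count, offsets)
```

Each offset would then be the x position at which one character image is pasted onto the blank captcha image.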

  2. Use OpenCV to split the image into single images, each containing only one character/digit.

    Source code is in https://github.com/zxdong262/tf-captcha-reader/blob/master/img/imageGrouping.py

    `cv2.findContours` finds the outline of every separate shape, from which we can get bounding boxes. Some situations still need to be handled: overlapping characters, stray dots that must be removed, shapes inside another shape (think of `e` as an example), and shapes that should be merged (`i` is detected as two shapes, the dot and the stem). The helper functions for these are in the source code. This part could surely be done with machine learning too, but not in this simple demo.

    Then we can use the `image.crop` function to get the split images and convert each one to a numpy array.
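
The box clean-up described in this step can be sketched in plain Python. This is not the repo's helper code. It assumes boxes are `(x, y, w, h)` tuples, the form `cv2.boundingRect` returns, and the function names and both thresholds are hypothetical. Boxes that overlap almost fully along the x axis are merged (the dot and stem of `i`, or the hole inside `e`), and tiny leftover boxes are dropped as stray dots.

```python
def union(a, b):
    """Smallest box that covers both a and b. Boxes are (x, y, w, h)."""
    x1, y1 = min(a[0], b[0]), min(a[1], b[1])
    x2 = max(a[0] + a[2], b[0] + b[2])
    y2 = max(a[1] + a[3], b[1] + b[3])
    return (x1, y1, x2 - x1, y2 - y1)

def x_overlap_ratio(a, b):
    """Horizontal overlap as a fraction of the narrower box's width."""
    left = max(a[0], b[0])
    right = min(a[0] + a[2], b[0] + b[2])
    return max(0, right - left) / max(1, min(a[2], b[2]))

def clean_boxes(boxes, min_area=9, merge_ratio=0.8):
    merged = []
    for box in sorted(boxes):  # left to right
        if merged and x_overlap_ratio(merged[-1], box) >= merge_ratio:
            # same character: dot + stem, or a shape inside another shape
            merged[-1] = union(merged[-1], box)
        else:
            merged.append(box)
    # whatever is still tiny after merging is treated as a stray dot
    return [b for b in merged if b[2] * b[3] >= min_area]

boxes = [
    (2, 10, 4, 14),   # stem of "i"
    (2, 3, 4, 4),     # dot of "i"
    (12, 8, 14, 16),  # outline of "e"
    (15, 12, 7, 4),   # hole inside "e"
    (40, 2, 2, 2),    # stray dot
]
result = clean_boxes(boxes)
print(result)
```

Two neighbouring characters that overlap only slightly have a low overlap ratio, so they stay separate in this sketch. Heavily overlapping characters would need extra handling.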

  3. Build training and test data sets from the split images, feed them to the Keras API, and do the training and evaluation with TensorFlow.

    The separated images are expanded to 28x28.

    The image data is converted from RGB to a single grayscale value per pixel with the formula `R * 299/1000 + G * 587/1000 + B * 114/1000` (range 0~255), then divided by 255 to get a float (range 0 ~ 1).

    The final data for one image is a numpy array with shape 28x28:

    [[0,0,...0.1256], ... [0,0,...0.1256]]
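
As a rough illustration of this preprocessing, here is a numpy sketch. It assumes that "expanded" means centering the crop on a 28x28 canvas with a background value of 0. The repo may resize or pad differently, and the function names are hypothetical.

```python
import numpy as np

SIZE = 28  # target width and height, as in the tutorial

def pad_to_square(rgb, size=SIZE, fill=0):
    """Center an (h, w, 3) uint8 image on a size x size canvas (assumed approach)."""
    h, w, _ = rgb.shape
    if h > size or w > size:
        raise ValueError('image is larger than the target canvas')
    canvas = np.full((size, size, 3), fill, dtype=np.uint8)
    top = (size - h) // 2
    left = (size - w) // 2
    canvas[top:top + h, left:left + w] = rgb
    return canvas

def to_gray_float(rgb):
    """Apply R*299/1000 + G*587/1000 + B*114/1000, then scale to 0..1."""
    weights = np.array([299, 587, 114])
    gray = rgb.astype(np.int64) @ weights / 1000  # range 0..255
    return gray / 255.0

# stand-in for one split character: a 20x12 mid-gray crop
crop = np.full((20, 12, 3), 128, dtype=np.uint8)
data = to_gray_float(pad_to_square(crop))
print(data.shape, data.min(), data.max())
```

The result has the 28x28 shape and the 0 to 1 value range that the model input expects.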

    Labels are indexes into the character/digit pool ([a-zA-Z0-9]), from 0 to 61.

    The function that creates the training/test data arrays looks like this:

    def createData(n):
      '''
      create data and labels array, with length = n
      '''
      data = []
      labels = []
      for i in range(n):
        (img, text) = createImg(i)
        tlist = list(text)
        le = len(tlist)
        # save the split images of the first captcha as examples
        shouldSave = i == 0
        imgs = imageSplit(img, charCount=le, shouldSaveExample=shouldSave)
        for j in range(len(imgs)):
          im = imgs[j]
          # convert each split image to a 28x28 float array
          arr = convertToDataArray(im, i, j, tlist[j])
          data.append(arr)
        # map each character to its index in the pool
        tlist = list(
          map(lambda x: CHAR_INDEX_DIC[x], tlist)
        )
        # if shouldSave:
        #   print(tlist, data)
        labels = labels + tlist
      return (np.array(data), np.array(labels))
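
The character-to-label mapping used in `createData` can be sketched as follows. The names `CHAR_POOL`, `CHAR_INDEX_DIC` and `CHAR_DIC` appear in the tutorial code, but how they are built here, and the pool order a-z, A-Z, 0-9, are assumptions. The repo may order the pool differently.

```python
import string

# Assumed pool order: a-z, A-Z, 0-9 (the repo's actual order may differ)
CHAR_POOL = string.ascii_lowercase + string.ascii_uppercase + string.digits
CHAR_INDEX_DIC = {ch: i for i, ch in enumerate(CHAR_POOL)}  # char -> label
CHAR_DIC = {i: ch for i, ch in enumerate(CHAR_POOL)}        # label -> char

text = 'l8VT'
labels = [CHAR_INDEX_DIC[ch] for ch in text]      # what the model trains on
decoded = ''.join(CHAR_DIC[i] for i in labels)    # what we read back after predicting
print(len(CHAR_POOL), labels, decoded)
```

With this pool there are 62 classes, so the model's output layer needs `len(CHAR_POOL)` units.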

    Then do the training and evaluation. Note that the input size depends on the size of our single-character images, and the output size is the size of the char/digit pool.

    With `epochs=25`, we get `test_acc: 0.94`. You can do better by adjusting the parameters; I leave that to you to play with.

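    (trainData, trainLabels) = createData(8000)
    (testData, testLabels) = createData(2000)
    print('trainData:', len(trainData))
    print('testData:', len(testData))
    model = keras.Sequential([
      # flatten each 28x28 image into a 784-element vector
      keras.layers.Flatten(input_shape=(28, 28)),
      keras.layers.Dense(128, activation=tf.nn.relu),
      keras.layers.Dense(len(CHAR_POOL), activation=tf.nn.softmax)
    ])
    # the compile settings are an assumption here; check main.py for the exact ones
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    model.fit(trainData, trainLabels, epochs=25)
    test_loss, test_acc = model.evaluate(testData, testLabels)
    print('test_loss:', test_loss)
    print('test_acc:', test_acc)

  4. Finally, let's read one image with the trained model.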

    print('predict example-image(example-images/example-captcha.png):')
    with Image.open('example-images/example-captcha.png') as img:
      imgs = imageSplit(img)
      data = []
      i = 200
      for j in range(len(imgs)):
        im = imgs[j]
        arr = convertToDataArray(im, i, j, 'unknown')
        # add a batch dimension: (28, 28) -> (1, 28, 28)
        arr = np.expand_dims(arr, 0)
        predictions = model.predict(arr)
        prediction = predictions[0]
        # index of the most probable character
        prediction = np.argmax(prediction)
        char = ''
        try:
          char = CHAR_DIC[prediction]
        except (KeyError, IndexError):
          char = ''
        data.append(char)
      print(data)

    We will get the right result:

    ['l', '8', 'V', 'T']
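
To make the shapes in the prediction step concrete, here is a numpy-only sketch of what a network of this kind computes for one character image: flatten, a 128-unit ReLU layer, then a softmax over the character pool. The weights are random and untrained, so the predicted character is meaningless; only the shapes and the argmax decoding matter. The pool order is an assumption.

```python
import string
import numpy as np

# Assumed pool order: a-z, A-Z, 0-9
CHAR_POOL = string.ascii_lowercase + string.ascii_uppercase + string.digits

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def forward(image, w1, b1, w2, b2):
    """Flatten -> Dense(128, relu) -> Dense(len(CHAR_POOL), softmax)."""
    x = image.reshape(-1)                 # 28x28 -> 784
    hidden = np.maximum(0, x @ w1 + b1)   # relu
    return softmax(hidden @ w2 + b2)

rng = np.random.default_rng(0)  # random weights, for shapes only
w1 = rng.normal(0, 0.05, (28 * 28, 128))
b1 = np.zeros(128)
w2 = rng.normal(0, 0.05, (128, len(CHAR_POOL)))
b2 = np.zeros(len(CHAR_POOL))

image = rng.random((28, 28))    # stand-in for one split character
probs = forward(image, w1, b1, w2, b2)
char = CHAR_POOL[int(np.argmax(probs))]  # same argmax step as in the code above
print(probs.shape, round(float(probs.sum()), 6), char)
```

The input layer size follows from the 28x28 image and the output size from the pool, which is why changing either one means changing the model definition.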

That's it. Have fun!