Background
I was challenged to recognize handwritten amounts as accurately as possible, while keeping the false positive rate below 0.01%. Since the number of samples in the dataset was fixed, data augmentation was the logical choice. A quick search turned up no off-the-shelf augmentation methods for Optical Character Recognition (OCR), so I rolled up my sleeves and wrote an augmentation routine myself. I used it during training and it helped my model reach its goals. Read on to learn more.
By introducing small changes every time an image is fed to the model during training, the model is less likely to overfit and generalizes better. I use this with TrOCR, but any other model should benefit as well.
Test setup
Since I can't share images from the proprietary dataset, I would have liked to use samples from the IAM Handwriting Database, but I never received a reply to my permission request. So I created a few examples of my own for the demo.
I'm going to use the OpenCV and Albumentations libraries for three types of modifications: morphology, noise, and transformations.
OpenCV is a well-known computer vision library. Albumentations is a relatively new Python library for simple but powerful image augmentation.
There's also a nice demo site where you can try out Albumentations' features. But since I couldn't test with my own images there, I created a Jupyter notebook to render all the augmented images in this article. Please feel free to open it in Colab and experiment.
We'll start by showing the individual modifications with some explanations, then cover techniques that combine them. I'll assume that all images are grayscale and have been contrast-enhanced (e.g., with CLAHE).
The first augmentation technique: morphological modification
These operations relate to the shape of structures in the image. In simple terms: they can make lines of text look like they were written with a thin or a thick pen. They are known as erosion and dilation. Unfortunately, these aren't currently included in the Albumentations library, so I'll do it with the help of OpenCV.
To produce the effect of a wide line-width pen, we can thicken the strokes of the original image:
Erosion, on the other hand, simulates the effect of writing with a thin pen:
Be careful here that the last parameter, the number of iterations, isn't set too high (at most 3 here), otherwise the handwriting is removed completely.
cv2.dilate(img, kernel, iterations=random.randint(1, 3))
For our dataset we could only go up to 1, so it really depends on your data.
The second augmentation technique: noise
We can remove black pixels from the image, or add white pixels. There are several ways to achieve this. I've tried many methods; here's my shortlist:
RandomRain with a black drop color is very destructive. Even for me it's hard to read the text afterwards, which is why I set the probability of it happening very low:
RandomShadow blurs the text with lines of varying intensity:
PixelDropout simply turns random pixels black:
RandomRain with a white drop color makes the text disintegrate, which makes training harder in a different way than the black variant. It resembles the low quality you get when someone takes a photo of a fax. You can set the probability of this transformation higher.
To a lesser extent, PixelDropout to white has a similar effect, but it mostly makes the whole image fade:
The third augmentation technique: transformations
ShiftScaleRotate: be careful with the parameters here, and try to avoid text being cut off or pushed outside the original frame. This transform shifts, scales, and rotates at the same time. Make sure you don't overdo the parameter ranges, otherwise results like the first sample become likely: you can see it actually moved part of the text out of the image. This can be prevented by choosing a larger bounding box, effectively adding more white space around the text.
Blur: the good old reliable. It will be applied at varying intensities.
Finale: putting it all together
That's where the power lies: we can randomly combine these effects to create a unique image in every training epoch. Care must be taken not to apply too many methods of the same type. We can manage this with the OneOf function in the Albumentations library. OneOf contains a list of possible transformations and, as its name suggests, executes only one of them, with probability p. So it makes sense to group transformations that have roughly similar effects, to avoid overdoing it. Here's the function:
import random
import cv2
import numpy as np
import albumentations as A
from PIL import Image

# gets a PIL image and returns an augmented PIL image
def augment_img(img):
    # only augment 3/4 of the images
    if random.randint(1, 4) > 3:
        return img

    img = np.asarray(img)  # convert to numpy for opencv

    # morphological alterations
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
    if random.randint(1, 5) == 1:
        # dilation because the image is not inverted
        img = cv2.erode(img, kernel, iterations=random.randint(1, 2))
    if random.randint(1, 6) == 1:
        # erosion because the image is not inverted
        img = cv2.dilate(img, kernel, iterations=1)

    transform = A.Compose([
        A.OneOf([
            # add black pixel noise
            A.OneOf([
                A.RandomRain(brightness_coefficient=1.0, drop_length=2, drop_width=2,
                             drop_color=(0, 0, 0), blur_value=1, rain_type='drizzle', p=0.05),
                A.RandomShadow(p=1),
                A.PixelDropout(p=1),
            ], p=0.9),
            # add white pixel noise
            A.OneOf([
                A.PixelDropout(dropout_prob=0.5, drop_value=255, p=1),
                A.RandomRain(brightness_coefficient=1.0, drop_length=2, drop_width=2,
                             drop_color=(255, 255, 255), blur_value=1, rain_type=None, p=1),
            ], p=0.9),
        ], p=1),
        # transformations
        A.OneOf([
            A.ShiftScaleRotate(shift_limit=0, scale_limit=0.25, rotate_limit=2,
                               border_mode=cv2.BORDER_CONSTANT, value=(255, 255, 255), p=1),
            A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0, rotate_limit=8,
                               border_mode=cv2.BORDER_CONSTANT, value=(255, 255, 255), p=1),
            A.ShiftScaleRotate(shift_limit=0.02, scale_limit=0.15, rotate_limit=11,
                               border_mode=cv2.BORDER_CONSTANT, value=(255, 255, 255), p=1),
            A.Affine(shear=random.randint(-5, 5), mode=cv2.BORDER_CONSTANT,
                     cval=(255, 255, 255), p=1),
        ], p=0.5),
        A.Blur(blur_limit=5, p=0.25),
    ])
    img = transform(image=img)['image']
    return Image.fromarray(img)
p stands for the probability that the transformation is applied. It is a value between 0 and 1, where 1 means it always happens and 0 means it never does.
So, let's see how it works in action:
Looks pretty good, right?
Another way:
In the EASTER 2.0 paper, the authors propose the TACo technique. It stands for Tiling and Corruption, and it does something like this:
I haven't tried this method yet, because my gut tells me the original image gets corrupted too badly. It seems to me that if I can't read it, neither can the computer. Then again, as a human, when you see 'TA█O' you might guess it's 'TACO': we look at the surrounding letters, and taco is a common word. But a dictionary-backed computer might just as well read it as 'TAMO', which happens to be the English word for Japanese ash.
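For reference, here's a rough sketch of how I understand the tiling-and-corruption idea. This is my own toy version, not the paper's implementation: it slices the image into vertical tiles and blanks out a random subset with white. The tile width and corruption probability are my guesses; see the paper for the real recipe:

```python
import random
import numpy as np

def taco_vertical(img, tile_w=8, p_corrupt=0.25):
    """Toy TACo-style corruption: blank random vertical tiles with white."""
    out = img.copy()
    for x in range(0, img.shape[1], tile_w):
        if random.random() < p_corrupt:
            out[:, x:x + tile_w] = 255
    return out

img = np.random.randint(0, 256, size=(32, 128), dtype=np.uint8)
corrupted = taco_vertical(img)
```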
Conclusion
We discussed a number of image operations and their benefits for OCR tasks. I hope this has helped you, or at least given you some inspiration to try them out. You can use my method as a baseline, but you may need to fine-tune some parameters to fit your dataset. Please let me know how much your model's accuracy improves!
I've shared the full technique in this Jupyter notebook:
https://github.com/Toon-nooT/notebooks/blob/main/OCR_data_augmentations.ipynb
Further reading
https://opencv.org/
https://albumentations.ai/
https://fki.tic.heia-fr.ch/databases/iam-handwriting-database
https://arxiv.org/abs/2205.14879
https://github.com/Toon-nooT/notebooks