Advanced Image Preprocessing & Augmentation Pipeline for Brain Tumor MRI Datasets (Freely available)
Working with medical images, especially MRI scans, can get tricky because different scanners, resolutions, intensity ranges, and noise levels create inconsistent datasets.
When the dataset isn’t uniform, even powerful CNN models like VGG, ResNet, or EfficientNet struggle during training.
To solve this, here’s a complete preprocessing and augmentation workflow that automatically prepares MRI brain images for deep-learning models.
Everything runs in Google Colab and is well suited to multi-class tumor classification projects.
1. Installing the Required Libraries
We begin by installing essential packages like OpenCV, Pillow, and tqdm. These tools handle image processing, file conversions, and progress visualization.
# Install Required Packages
!pip install opencv-python-headless Pillow tqdm
2. Importing Libraries & Connecting Google Drive
Since MRI datasets are usually stored in Google Drive, the first step is to mount it inside Google Colab.
# Import Libraries
import os
import cv2
import numpy as np
import shutil
from tqdm import tqdm
from PIL import Image
from google.colab import drive
# Mount Google Drive
drive.mount('/content/drive')
We then define paths for the input dataset and the output folder where the processed images will be stored:
input_path = '/content/drive/MyDrive/dataset-agumented'
output_path = '/content/processed_brain_dataset'
os.makedirs(output_path, exist_ok=True)
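Before running the full pipeline, it can save time to confirm that the expected category folders actually exist under the input path. Here is a minimal, stdlib-only sketch; the helper name `verify_dataset_layout` and the temporary-directory demo are illustrative additions, not part of the original script:

```python
import os
import tempfile

def verify_dataset_layout(root, categories):
    """Return the list of expected category folders missing under root."""
    return [c for c in categories if not os.path.isdir(os.path.join(root, c))]

# Demo against a throwaway directory instead of the real Drive path
categories = ['glioma_tumor', 'meningioma_tumor', 'no_tumor', 'pituitary_tumor']
with tempfile.TemporaryDirectory() as root:
    for c in categories[:3]:          # create only three of the four folders
        os.makedirs(os.path.join(root, c))
    missing = verify_dataset_layout(root, categories)
    print(missing)                    # reports the one folder we did not create
```

Running the same check on the real `input_path` before the processing loop turns a confusing mid-run crash into a clear error message.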
3. Advanced Preprocessing Function Explained
The main part of this project is a custom function called advanced_preprocess_image(). It generates multiple augmented versions of each MRI image, making the dataset richer and more model-friendly.
Here’s what the function does step-by-step:
- Grayscale conversion – simplifies MRI images without losing structure
- Binary thresholding – highlights strong boundaries
- HSV jitter – adds brightness/contrast variations
- Gamma correction – randomly brightens or darkens the scan
- Random zooming – simulates cropping variations
- Rotation (+30° and −30°) – helps the model handle orientation changes
def advanced_preprocess_image(img):
    """Return a list of augmented variants of a 224x224 BGR image."""
    processed = []
    h, w = img.shape[:2]

    # 1. Grayscale (replicated to 3 channels so the CNN input shape stays consistent)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    gray_3ch = cv2.cvtColor(gray, cv2.COLOR_GRAY2BGR)
    processed.append(gray_3ch)

    # 2. Black & white threshold
    _, bw = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)
    bw_3ch = cv2.cvtColor(bw, cv2.COLOR_GRAY2BGR)
    processed.append(bw_3ch)

    # 3. Random color jitter (HSV) -- int16 arithmetic with np.clip so negative
    # offsets don't wrap around in uint8 (randint's upper bound is exclusive,
    # so 21 makes the range a symmetric -20..+20)
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    h_, s_, v_ = cv2.split(hsv)
    s_ = np.clip(s_.astype(np.int16) + np.random.randint(-20, 21), 0, 255).astype(np.uint8)
    v_ = np.clip(v_.astype(np.int16) + np.random.randint(-20, 21), 0, 255).astype(np.uint8)
    hsv_jittered = cv2.merge([h_, s_, v_])
    jittered_img = cv2.cvtColor(hsv_jittered, cv2.COLOR_HSV2BGR)
    processed.append(jittered_img)

    # 4. Gamma correction via a 256-entry lookup table
    gamma = np.random.uniform(0.5, 1.5)
    inv_gamma = 1.0 / gamma
    table = np.array([((i / 255.0) ** inv_gamma) * 255
                      for i in np.arange(256)]).astype("uint8")
    gamma_corrected = cv2.LUT(img, table)
    processed.append(gamma_corrected)

    # 5. Random zoom: crop a random 80-95% window, then resize back to 224x224
    zoom_factor = np.random.uniform(0.8, 0.95)
    nh, nw = int(h * zoom_factor), int(w * zoom_factor)
    starty = np.random.randint(0, h - nh)
    startx = np.random.randint(0, w - nw)
    zoomed = img[starty:starty + nh, startx:startx + nw]
    zoomed = cv2.resize(zoomed, (224, 224))
    processed.append(zoomed)

    # 6. Rotation +/-30 degrees around the image center
    for angle in [30, -30]:
        M = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
        rotated = cv2.warpAffine(img, M, (w, h))
        rotated = cv2.resize(rotated, (224, 224))
        processed.append(rotated)

    return processed
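To see exactly what the gamma-correction step is doing, here is a numpy-only sketch of the same lookup table with a fixed gamma instead of a random one. Applying the table with fancy indexing is equivalent to cv2.LUT on a single channel; the fixed value 0.5 is my choice for illustration, not something the pipeline uses:

```python
import numpy as np

gamma = 0.5                 # fixed for reproducibility; the pipeline samples 0.5-1.5
inv_gamma = 1.0 / gamma
# Map each possible uint8 intensity i through (i/255) ** (1/gamma), rescaled to 0-255
table = np.array([((i / 255.0) ** inv_gamma) * 255 for i in range(256)]).astype("uint8")

pixels = np.array([0, 64, 128, 255], dtype=np.uint8)
print(table[pixels])        # with gamma < 1 the mid-tones are pushed darker
```

Black and white stay fixed at 0 and 255 while everything in between shifts, which is why the augmentation changes brightness without clipping the extremes.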
4. Processing All Four MRI Categories
The dataset contains four folders: glioma_tumor, meningioma_tumor, no_tumor, and pituitary_tumor.
The script loops through each category, resizes every image to 224×224, applies the preprocessing function, and saves each generated version.
categories = ['glioma_tumor', 'meningioma_tumor', 'no_tumor', 'pituitary_tumor']
image_count = 0
for category in categories:
    input_folder = os.path.join(input_path, category)
    output_folder = os.path.join(output_path, category)
    os.makedirs(output_folder, exist_ok=True)
    for img_name in tqdm(os.listdir(input_folder), desc=f"Processing {category}"):
        img_path = os.path.join(input_folder, img_name)
        img = cv2.imread(img_path)
        if img is None:  # skip anything that isn't a readable image
            continue
        img = cv2.resize(img, (224, 224))
        processed_imgs = advanced_preprocess_image(img)
        # Save the resized original
        out_filename = f"{category}_{image_count:04d}_orig.jpg"
        cv2.imwrite(os.path.join(output_folder, out_filename), img)
        # Save each augmented variant
        for i, p_img in enumerate(processed_imgs):
            out_filename = f"{category}_{image_count:04d}_aug{i+1}.jpg"
            cv2.imwrite(os.path.join(output_folder, out_filename), p_img)
        image_count += 1
5. Exporting the Final Processed Dataset
Once all images are processed, we zip the dataset so it's easy to download and use for training.
shutil.make_archive('/content/processed_brain_images', 'zip', output_path)
print("Brain tumor dataset preprocessing + augmentation complete and zipped.")
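As a quick sanity check after zipping, you can count how many image files actually made it into the archive. This is a pure-stdlib sketch; the demo builds a tiny throwaway archive rather than touching the real dataset, and the names `src` and `demo_dataset` are my own:

```python
import os
import shutil
import tempfile
import zipfile

# Build a tiny stand-in for the processed dataset: 2 categories x 3 files
src = tempfile.mkdtemp()
for category in ['glioma_tumor', 'no_tumor']:
    os.makedirs(os.path.join(src, category))
    for i in range(3):
        open(os.path.join(src, category, f'{category}_{i}.jpg'), 'w').close()

archive = shutil.make_archive(os.path.join(tempfile.gettempdir(), 'demo_dataset'),
                              'zip', src)
with zipfile.ZipFile(archive) as zf:
    files = [n for n in zf.namelist() if n.endswith('.jpg')]
print(len(files))  # all six .jpg files across both categories
```

Pointing the same `zipfile` check at `/content/processed_brain_images.zip` confirms nothing was silently skipped before you download it.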
Final Thoughts
This preprocessing pipeline gives a medical image dataset a solid foundation for training: it normalizes image sizes, evens out intensity differences, improves contrast, and adds useful variations that make the dataset larger and more diverse.
If you're planning to use models like ResNet, VGG, EfficientNet, or MobileNet, this pipeline gives them clean, consistent, and augmented images, which typically improves accuracy and speeds up convergence.
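If you plan to feed the processed folder straight into a training framework, one common extra step is splitting each category into train and validation subsets. Here is a hedged, stdlib-only sketch; the 80/20 ratio and the helper name `split_dataset` are my own choices, not part of the original pipeline:

```python
import os
import random
import shutil
import tempfile

def split_dataset(src_root, dst_root, categories, val_ratio=0.2, seed=42):
    """Copy each category into dst_root/train/<cat> and dst_root/val/<cat>."""
    rng = random.Random(seed)
    counts = {}
    for category in categories:
        files = sorted(os.listdir(os.path.join(src_root, category)))
        rng.shuffle(files)
        n_val = int(len(files) * val_ratio)
        splits = {'val': files[:n_val], 'train': files[n_val:]}
        for split, names in splits.items():
            out_dir = os.path.join(dst_root, split, category)
            os.makedirs(out_dir, exist_ok=True)
            for name in names:
                shutil.copy(os.path.join(src_root, category, name),
                            os.path.join(out_dir, name))
        counts[category] = (len(splits['train']), len(splits['val']))
    return counts

# Demo on synthetic files: 10 images in one category -> 8 train / 2 val
src, dst = tempfile.mkdtemp(), tempfile.mkdtemp()
os.makedirs(os.path.join(src, 'no_tumor'))
for i in range(10):
    open(os.path.join(src, 'no_tumor', f'img_{i}.jpg'), 'w').close()
counts = split_dataset(src, dst, ['no_tumor'])
print(counts)
```

The fixed seed keeps the split reproducible across runs, and the resulting train/ and val/ layout is the directory structure most Keras and PyTorch folder-based loaders expect.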
If you want a more advanced version with CLAHE, denoising, edge detection, or auto-segmentation, let me know in the comments and I can cover that too.