From pixels to prediction — why flattening is essential in neural networks.
Imagine you’re designing a visual quality control system for a steel manufacturing line. Cameras capture images of sheet surfaces. Your goal? Automatically detect imperfections like cracks, folds, or corrosion.
But as you dig in, things become less obvious:

- What format is an image in?
- Why can’t we just feed these images into a model directly?
- Why does everyone say “flatten the image”?
- And what’s the big deal about convolution and pooling?
Let’s break it all down, visually and conceptually.
Let’s take a realistic image size to reflect industrial use — for instance, a grayscale steel surface image of size 128 × 128 pixels. That’s 16,384 pixels in total.
Stored as a matrix, it has 128 rows, each with 128 values:

[
  [ 45, 47, …, 255 ],
  [ 42, 50, …, 243 ],
  …
  [ 38, 29, …, 144 ]
]
An image is just a grid of numbers (sketched in NumPy below). For a grayscale image:

- Each number (pixel) represents the brightness of that point: 0 = black, 255 = white.
- A color image is 3D: the Red, Green, and Blue (RGB) channels are stacked.
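To make that concrete, here’s a minimal NumPy sketch. The arrays are randomly generated stand-ins for a real camera frame, not actual sensor data:

import numpy as np

# Stand-in for one grayscale frame: a 128x128 grid of intensities (0-255).
gray = np.random.randint(0, 256, size=(128, 128), dtype=np.uint8)
print(gray.shape)              # (128, 128) -> 16,384 pixels
print(gray.min(), gray.max())  # brightness values between 0 (black) and 255 (white)

# A color image adds a third axis for the R, G, B channels.
rgb = np.random.randint(0, 256, size=(128, 128, 3), dtype=np.uint8)
print(rgb.shape)               # (128, 128, 3)

# Flattening the raw grid gives a 1D vector of 16,384 values,
# but it throws away the 2D neighbourhood structure.
flat = gray.reshape(-1)
print(flat.shape)              # (16384,)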
Flattening at this point would destroy that structure. So before flattening, we apply filters and sliding windows to extract patterns, which brings us to convolution.

Convolution Helps the Model Learn to “See”

A convolutional layer uses small filters (like 3×3 grids) to scan the image and extract features.
Imagine placing a 3×3 filter over the top-left corner of the image, then sliding it across the image from left to right, top to bottom — like a scanner.
Here’s how it works visually:
Image Region (3x3 from 128x128):
[
[34, 36, 39],
[40, 45, 48],
[42, 50, 52]
]
Filter:
[
[ 1, 0, -1],
[ 1, 0, -1],
[ 1, 0, -1]
]
→ Element-wise multiplication → sum → output to feature map.
This sliding process moves over the entire 128×128 image, producing a slightly smaller feature map that highlights specific patterns (like vertical edges in this example).

- The filter slides over the image
- At each step, it multiplies & sums values
- The result: a new feature map that highlights patterns (see the sketch below)
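Here’s a minimal sketch of that multiply-and-sum step in NumPy, assuming a stride of 1 and no padding (a “valid” convolution). In a real CNN the kernel values are learned during training rather than hard-coded:

import numpy as np

# The 3x3 image region and vertical-edge filter from the example above.
region = np.array([[34, 36, 39],
                   [40, 45, 48],
                   [42, 50, 52]])
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])

# One step: element-wise multiply, then sum.
print(np.sum(region * kernel))  # (34 - 39) + (40 - 48) + (42 - 52) = -23

# Sliding the filter across the whole image builds the feature map.
def convolve2d(image, kernel):
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            feature_map[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return feature_map

image = np.random.randint(0, 256, size=(128, 128))
print(convolve2d(image, kernel).shape)  # (126, 126), slightly smaller than the input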
Example uses:

- Detect edges
- Identify curves or textures

Crucially, these filters are learned during training — the model figures out which patterns matter.
After convolution, you often apply pooling to:

- Reduce size (fewer values to process)
- Keep the strongest signals
From:
[3 5 1 2
4 8 0 1
9 7 6 2
3 4 2 8]
To:
[8 2
9 8]
You take the maximum in each 2x2 block.
This keeps important features, removes noise, and makes the model robust to small changes.
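Here’s a small NumPy sketch that reproduces the 2x2 max-pooling above; it assumes the feature map’s height and width are both even:

import numpy as np

# The 4x4 feature map from the example above.
fmap = np.array([[3, 5, 1, 2],
                 [4, 8, 0, 1],
                 [9, 7, 6, 2],
                 [3, 4, 2, 8]])

def max_pool_2x2(x):
    h, w = x.shape
    # Split into non-overlapping 2x2 blocks and keep each block's maximum.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

print(max_pool_2x2(fmap))
# [[8 2]
#  [9 8]]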
Pooling operations are typically applied repeatedly after successive convolutional layers. This process helps the network:

- Summarise important patterns from increasingly abstract feature maps
- Reduce computation by shrinking intermediate representations
- Maintain key information while ignoring irrelevant noise
A typical CNN looks like this:
Input Image (128x128)
↓
Conv Layer 1 → Pooling 1
↓
Conv Layer 2 → Pooling 2
↓
Conv Layer 3 → Pooling 3
↓
Flatten → Dense Layer
↓
Output
Each convolutional layer applies a new set of filters, learning to detect increasingly complex patterns:

- Layer 1 might learn edges or blobs
- Layer 2 might capture shapes or contours
- Layer 3 could detect cracks, textures, or irregularities
Each pooling layer ensures that only the most salient information moves forward — enabling flattening to summarise these learned patterns effectively.
After several layers of convolution + pooling:

- You have compact, rich feature maps.
- It’s time to flatten this multi-dimensional output into a 1D vector.

Now flattening makes sense — you’re summarising patterns, not raw pixels.

Here’s what this pipeline looks like in Keras:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    # Three conv + pool blocks, matching the pipeline sketched above.
    Conv2D(16, (3, 3), activation='relu', input_shape=(128, 128, 1)),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(32, (3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    # Flatten only after features have been learned and condensed.
    Flatten(),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid')  # binary output: defect vs. no defect
])

model.summary()
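To actually train this model you would compile and fit it. The sketch below assumes X_train holds grayscale images shaped (num_samples, 128, 128, 1) scaled to [0, 1] and y_train holds binary labels (1 = defective, 0 = clean); those arrays are placeholders, not part of the original example:

# Binary classification: defect vs. no defect.
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# X_train / y_train are assumed to already exist (see note above).
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)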