Learn CV – Week 1: Images, Convolution, Filtering, Edges
Day 1 — Images as Signals
Learning Task (~15 min)
Read Programming Computer Vision with Python, Ch. 1–2 (free online). Skimming is fine EXCEPT the sections on pixel access and array shape — read those carefully.
Coding Task (~15 min)
T1. Load an image. Print its .shape, .dtype, and the raw value of one pixel.
T2. Slice a 50×50 patch from the center. Display it.
T3. Zero out the red channel across the entire image. Display the result.
T4. Set a 100×100 region to pure white (255). What does that look like in the array?
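If you get stuck on T1 to T4, the following is a minimal sketch of the array operations involved. It uses a synthetic random array instead of a loaded file so it runs anywhere, and it assumes RGB channel order; that is an assumption, since cv2.imread returns BGR, in which case red is index 2.

```python
import numpy as np

# Synthetic stand-in for a loaded image: 200 rows, 300 columns, 3 channels.
# Channel order is ASSUMED to be RGB here; cv2.imread would give BGR instead.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(200, 300, 3), dtype=np.uint8)

print(image.shape)    # (rows, cols, channels)
print(image.dtype)
print(image[10, 20])  # one pixel is just three numbers

# T2: 50x50 patch around the centre, using plain slicing.
h, w = image.shape[:2]
cy, cx = h // 2, w // 2
patch = image[cy - 25:cy + 25, cx - 25:cx + 25]

# T3: zero one channel (index 0 is red only under the RGB assumption).
no_red = image.copy()
no_red[:, :, 0] = 0

# T4: assigning a scalar broadcasts over every channel in the region.
white = image.copy()
white[0:100, 0:100] = 255
```

Working on copies keeps the original array intact, so you can display each result next to the input.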
Thinking Exercise
E1. You have a 5×5 grayscale image. Write out a grid of made-up values. Without code: what happens visually if you multiply every value by 0.5? What about subtracting 128 from each?
E2. A friend says "pixels store color." You say "pixels store numbers." Who is more useful? Why does the distinction matter for writing CV code?
Test Questions
Q1. A grayscale image is 640×480. How many individual numbers represent it?
Q2. Why does a color image have shape (H, W, 3) and not (H, W)?
Q3. You set every pixel to 0. What do you see? You set every pixel to 255. What do you see?
What does this tell you about what "image operations" actually are?
Day 2 — Convolution (Theory)
Learning Task (~25 min)
Watch: 3Blue1Brown — "But what is a convolution?" (full video, ~23 min). Do NOT skim. Pause when the sliding window animation runs — make sure you can predict the output value before he shows it.
Thinking Exercise
E1. A 3×3 kernel of all 1/9 slides over an image. What is it computing at each position?
What does the output look like compared to the input?
E2. Now the kernel is all zeros except the center, which is 1.
What does the output look like? Why?
E3. A kernel has -1 on the left column, +1 on the right column, 0 in the middle.
Without running anything: what kind of image regions will produce large output values?
What regions will produce near-zero output?
E4. If a 3×3 kernel is applied to a 100×100 image with no padding,
what is the output size? Derive it — don't guess.
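Once you have derived E4 by hand, one way to check the reasoning is to count kernel positions directly. The sketch below does this for a smaller case (size 7, kernel 3) so the exercise itself stays open; both function names are illustrative, and stride 1 is assumed.

```python
def valid_output_size(image_size, kernel_size):
    """Closed form for one axis: no padding, stride 1 (assumed)."""
    return image_size - kernel_size + 1


def count_valid_positions(image_size, kernel_size):
    """Brute force: count start indices where the kernel fits entirely."""
    count = 0
    for start in range(image_size):
        if start + kernel_size <= image_size:
            count += 1
    return count


# Both approaches should agree for any sizes you try.
print(valid_output_size(7, 3), count_valid_positions(7, 3))  # 5 5
```

Apply the same function to each axis separately for a non-square image.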
Test Questions
Q1. In your own words: what are the three things that happen at every kernel position?
Q2. What determines what a convolution "looks for" in an image?
Q3. You apply two different kernels to the same image and get two different outputs. What does that tell you about the relationship between kernel weights and detected features?
Q4. Why is convolution described as "translation equivariant"? What does that property mean for a detector sliding over a scene?
Day 3 — Convolution (Code)
Learning Task (~10 min)
Re-read your Day 2 notes. Sketch the nested loop on paper before opening an editor. No new reading today — the task IS the learning.
Coding Task (~30 min)
T1. Implement convolve2d(image, kernel) from scratch using only NumPy.
Use zero-padding so the output is the same size as the input.
No scipy.signal, no cv2.filter2D — just loops and array indexing.
T2. Apply your function with a 3×3 box blur kernel (all 1/9).
Verify it matches cv2.filter2D using np.allclose.
T3. Apply this kernel to a grayscale image and display the result:
[[-1,-1,-1],[-1,8,-1],[-1,-1,-1]]
Before running: write down your prediction. After running: note where you were wrong.
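After your own attempt at T1, you can compare against the reference sketch below. It is one possible implementation, not the only correct one: it assumes a 2D grayscale input and odd kernel dimensions, and it flips the kernel, which is what makes it a true convolution. Note that cv2.filter2D computes correlation (no flip), so the two only agree directly for kernels that are symmetric under a 180-degree rotation, such as the ones in T2 and T3.

```python
import numpy as np


def convolve2d(image, kernel):
    """Same-size 2D convolution of a grayscale image with zero padding."""
    image = np.asarray(image, dtype=np.float64)
    kernel = np.asarray(kernel, dtype=np.float64)
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2           # assumes odd kernel dimensions
    flipped = kernel[::-1, ::-1]        # convolution flips the kernel
    padded = np.pad(image, ((ph, ph), (pw, pw)), mode="constant")
    out = np.zeros_like(image)
    for y in range(image.shape[0]):
        for x in range(image.shape[1]):
            region = padded[y:y + kh, x:x + kw]   # window centred on (y, x)
            out[y, x] = np.sum(region * flipped)  # multiply, then add up
    return out


# Usage: 3x3 box blur on a constant image.
box = np.full((3, 3), 1.0 / 9.0)
blurred = convolve2d(np.full((5, 5), 9.0), box)
print(blurred)
```

Working in float64 avoids uint8 overflow inside the loop; convert back only when you display.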
Thinking Exercise
E1. Your manual result and cv2.filter2D differ slightly at image borders.
What causes this? What assumption is each making at the edge?
E2. The kernel in T3 sums to zero, so it behaves as a Laplacian-style edge detector rather than a sharpener, and it can produce values outside 0–255.
What happens when you display them? What should you do about it?
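To test your answer to E2, the sketch below compares three ways of forcing out-of-range values into uint8. The response array is made up for illustration; which option is appropriate depends on what you want the display to show.

```python
import numpy as np

# Hypothetical filter response with values outside the displayable range.
response = np.array([[-300, -10, 0],
                     [128, 255, 900]], dtype=np.int32)

# Plain integer cast: silently wraps modulo 256.
wrapped = response.astype(np.uint8)

# Clip first: values saturate at 0 and 255.
clipped = np.clip(response, 0, 255).astype(np.uint8)

# Min-max rescale: keeps relative ordering, changes absolute brightness.
lo, hi = response.min(), response.max()
rescaled = ((response - lo) / (hi - lo) * 255).astype(np.uint8)

print(wrapped)
print(clipped)
print(rescaled)
```

For a zero-sum kernel, displaying the absolute value of the response before rescaling is another common choice, since negative and positive responses both mark edges.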
Test Questions
Q1. You implemented the loop. Explain in one sentence what region * kernel computes
and why you then call np.sum() on it.
Q2. What breaks if you skip padding? Be specific about output shape and border behavior.
Q3. Your convolve2d is slow on a 1000×1000 image. Why? What does cv2.filter2D do differently?
Day 4 — Filtering & Noise
Learning Task (~15 min)
Read the OpenCV docs for GaussianBlur, medianBlur, and bilateralFilter.
Skim the parameter descriptions, but read carefully what each filter preserves vs. destroys.
Coding Task (~20 min)
T1. Load a grayscale image. Add Gaussian noise manually using np.random.normal.
T2. Apply all three filters (Gaussian, median, bilateral) with comparable window sizes.
Display them side by side.
T3. Find an edge in the image (a hard boundary between two regions).
Crop a tight strip across that edge. Apply each filter. Print the pixel values across
the strip for each result. Which filter preserves the step the best?
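For the noise step in T1, a minimal sketch follows. The function name, the default sigma of 20, and the seed argument are illustrative choices; it uses NumPy's Generator API, whose normal method plays the same role as np.random.normal.

```python
import numpy as np


def add_gaussian_noise(image, sigma=20.0, seed=None):
    """Return a uint8 copy of a grayscale image with additive Gaussian noise."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(loc=0.0, scale=sigma, size=image.shape)
    noisy = image.astype(np.float64) + noise       # add in float to avoid wraparound
    return np.clip(noisy, 0, 255).astype(np.uint8)  # saturate, then convert back


# Usage on a flat mid-gray test image (a stand-in for a loaded photo).
clean = np.full((64, 64), 128, dtype=np.uint8)
noisy = add_gaussian_noise(clean, sigma=20.0, seed=0)
print(noisy.min(), noisy.max(), noisy.std())
```

Adding the noise directly to a uint8 array would wrap around at 0 and 255, which is why the sum is done in float and clipped afterwards.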
Thinking Exercise
E1. You are blurring an image to remove noise. What are you always trading away?
Name the trade-off precisely.
E2. Portrait photo with sharp facial edges vs. X-ray with speckle noise throughout.
Which filter for each, and why? What property of the filter drives your choice?
E3. A Gaussian kernel with sigma=0.5 vs. sigma=5.
Before running: predict how the output images differ. What does sigma control geometrically?
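After you have written down your prediction for E3, you can inspect the kernel weights themselves. The sketch below builds a normalized 1D Gaussian; the 3-sigma cutoff for the radius is an assumption made here for illustration and is not necessarily the rule OpenCV uses to pick a kernel size.

```python
import numpy as np


def gaussian_kernel_1d(sigma, radius=None):
    """Normalized 1D Gaussian weights centred on zero."""
    if radius is None:
        radius = max(1, int(np.ceil(3 * sigma)))  # assumed 3-sigma cutoff
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    weights = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return weights / weights.sum()                # weights sum to 1


narrow = gaussian_kernel_1d(0.5)
wide = gaussian_kernel_1d(5.0)
print(len(narrow), narrow.round(3))
print(len(wide), wide.round(3))

# A 2D Gaussian kernel is the outer product of two 1D kernels.
kernel_2d = np.outer(narrow, narrow)
```

Comparing how many taps carry meaningful weight in each kernel is a direct way to see what sigma changes.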
Test Questions
Q1. "Low-pass filter": what does low-pass mean in terms of what it keeps and what it removes?
Q2. Why does Gaussian blurring destroy edge information even though it reduces noise?
Q3. What makes bilateral filtering structurally different from Gaussian? What additional information does it use at each kernel position?
Day 5 — Gradients & Edges
Learning Task (~15 min)
Read the OpenCV tutorial on Sobel and Canny (~2 pages). Read carefully: what Sobel computes in X vs. Y, and what the two Canny thresholds control. Skim: the math derivations.
Coding Task (~20 min)
T1. Apply cv2.Sobel in X and Y directions separately. Display both outputs.
T2. Compute gradient magnitude from T1: sqrt(Gx² + Gy²). Normalize and display.
T3. Apply cv2.Canny. Compare its output to your magnitude map from T2.
Where does Canny produce cleaner results? Where does it lose detail?
T4. Re-implement Sobel X manually using cv2.filter2D with the explicit kernel.
Verify it matches cv2.Sobel output, using the same ddepth for both calls and ksize=3.
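To see the arithmetic behind T1 and T2 without OpenCV, the sketch below applies the standard 3×3 Sobel kernels with a plain NumPy loop to a synthetic step image. The helper correlate_valid is a hypothetical name; it slides the kernel without flipping or padding, so the output is 2 pixels smaller along each axis.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float64)
SOBEL_Y = SOBEL_X.T  # same weights, rotated to respond along the other axis


def correlate_valid(image, kernel):
    """Slide the kernel with no flip and no padding."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out


# Synthetic test image: dark left half, bright right half.
image = np.zeros((8, 8))
image[:, 4:] = 255.0

gx = correlate_valid(image, SOBEL_X)
gy = correlate_valid(image, SOBEL_Y)
magnitude = np.sqrt(gx ** 2 + gy ** 2)

# Normalize to 0-255 for display.
display = (magnitude / magnitude.max() * 255).astype(np.uint8)
print(gx)
print(gy)
```

Keeping gx and gy in float preserves the sign of the gradient, which is lost if the result is stored as uint8 before computing the magnitude.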
Thinking Exercise
E1. Sobel X detects vertical edges. Sobel Y detects horizontal edges. Before running: which one lights up on a vertical stripe pattern? On a horizontal one? Derive the answer from the kernel weights; don't guess from intuition.
E2. Canny produces thinner, cleaner edges than raw Sobel magnitude. What two steps does Canny add, and what does each one eliminate?
E3. You crank both Canny thresholds very low. What happens to the output? You set them very high. What happens? What are the thresholds actually gating?
Test Questions
Q1. A large gradient magnitude at a pixel means what, physically, in the image?
Q2. Sobel X and Sobel Y are both convolutions. What specifically differs between their kernels, and why does that difference make one detect vertical vs. horizontal edges?
Q3. For a CNN trained on ImageNet, what do researchers consistently find in its first-layer learned filters? Why is that not a coincidence given what you built this week?
End of Week Test
You have a noisy photograph of a street scene. Your task is to detect the edges of buildings, signs, and road markings as cleanly as possible.
Walk me through your full pipeline from raw image to edge map: what operations you apply, in what order, with what parameters, and why. For each step, explain what you are preserving and what you are sacrificing. Then tell me: which single step in your pipeline would a CNN learn to approximate automatically — and how?
End of Week Assessment
[ ] I can explain what a pixel is without using the word "color"
[ ] I can write a convolution loop from scratch in ~15 lines, no reference
[ ] I can look at a 3×3 kernel and predict roughly what it detects
[ ] I understand why Gaussian blur trades noise reduction for edge sharpness
[ ] I can produce a clean edge map from a noisy image using Sobel + Canny
[ ] I can explain why CNN layer 1 tends to learn oriented, Sobel-like edge filters, and why that is expected
Gate: If fewer than 4 boxes are checked confidently, revisit the days behind the unchecked boxes before Week 2.
What's Next
Week 2 builds directly on this stack:
- Frequency domain — why blur is "low-pass," why edges are "high-frequency," and what that framing unlocks for understanding filters without doing the math
- Image pyramids & scale — how the same structure appears at multiple resolutions, and why CNNs need pooling layers
- Histogram of Oriented Gradients (HOG) — the conceptual bridge between hand-crafted gradient features and what CNNs learn to do automatically