Highlights: An image is just a matrix — but the Transformer eats sequences of vectors. The whole “vision” trick of ViT lives in how that matrix is turned into a sequence. We cut a 224×224 image into a fixed 14×14 grid of 16×16 patches, flatten each, and project it through ONE learned Linear layer. Patches are not sampled — every cell of the grid becomes a token, in fixed order. We prepend a learnable [CLS]…
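The matrix-to-sequence step described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the post's own implementation: the RGB channel count, the tiny `EMBED_DIM`, the random weights, and the all-zero `[CLS]` stand-in are assumptions chosen to keep the demo small, and the helper names `patchify` and `linear` are hypothetical.

```python
import random

IMAGE_SIZE = 224  # from the text: 224x224 input image
PATCH_SIZE = 16   # from the text: 16x16 patches, giving a 14x14 grid
CHANNELS = 3      # assumption: RGB input
EMBED_DIM = 8     # assumption: tiny embedding width so the demo runs fast


def patchify(image, patch_size):
    """Cut an H x W x C nested-list image into flattened patches.

    Every grid cell becomes one patch, visited in fixed row-major order.
    """
    height, width = len(image), len(image[0])
    assert height % patch_size == 0 and width % patch_size == 0
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            for y in range(top, top + patch_size):
                for x in range(left, left + patch_size):
                    flat.extend(image[y][x])  # append this pixel's channels
            patches.append(flat)
    return patches


def linear(vectors, weight, bias):
    """Apply one shared affine map (v @ weight + bias) to every vector."""
    columns = list(zip(*weight))  # one column of weights per output unit
    return [
        [sum(a * w for a, w in zip(vec, col)) + b for col, b in zip(columns, bias)]
        for vec in vectors
    ]


rng = random.Random(0)
image = [
    [[rng.random() for _ in range(CHANNELS)] for _ in range(IMAGE_SIZE)]
    for _ in range(IMAGE_SIZE)
]

patches = patchify(image, PATCH_SIZE)
patch_dim = PATCH_SIZE * PATCH_SIZE * CHANNELS

# Random values stand in for the learned parameters of the single Linear layer.
weight = [[rng.gauss(0.0, 0.02) for _ in range(EMBED_DIM)] for _ in range(patch_dim)]
bias = [0.0] * EMBED_DIM
tokens = linear(patches, weight, bias)

# Stand-in for the learnable [CLS] vector, prepended to the patch tokens.
cls_token = [0.0] * EMBED_DIM
sequence = [cls_token] + tokens

print(len(patches), len(patches[0]), len(sequence), len(sequence[0]))
# prints: 196 768 197 8
```

The 196 comes from the 14×14 grid, and 768 from 16·16·3 values per flattened patch. In practice a tensor library would replace the nested loops with a reshape followed by a single matrix multiply; the loops are spelled out here only to make the fixed patch order visible.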