GPT-2 + Vision — built on stackformer

Model: gurumurthy3/gpt2-stackformer-vision_V2. A frozen GPT-2 backbone and a frozen ViT-B/16 vision encoder are connected by a Perceiver resampler and sparse cross-attention layers (trained on Flickr8k); the cross-attention layers read 128 compressed visual tokens. The model supports plain text-to-text continuation and image-to-text captioning.
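To illustrate how a Perceiver-style resampler can compress a variable number of vision-encoder tokens into a fixed set of 128 visual tokens, here is a minimal NumPy sketch. It is not the stackformer implementation: the function names (`softmax`, `resample`), the single attention head, the random weights, and the patch count of 197 (assuming a 224x224 input, giving 196 patches plus a class token) are all assumptions made for illustration. A real resampler would typically stack several such layers with multi-head attention and feed-forward blocks.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def resample(patch_tokens, latents, w_q, w_k, w_v):
    """One single-head cross-attention step (hypothetical helper).

    The latent vectors act as queries over the patch tokens, so the
    output always has as many rows as there are latents, regardless
    of how many patch tokens come in.
    """
    q = latents @ w_q                         # (num_latents, d_model)
    k = patch_tokens @ w_k                    # (num_patches, d_model)
    v = patch_tokens @ w_v                    # (num_patches, d_model)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (num_latents, num_patches)
    attn = softmax(scores, axis=-1)           # each latent's weights sum to 1
    return latents + attn @ v                 # residual connection


rng = np.random.default_rng(0)
d_model = 768       # hidden width of ViT-B
num_patches = 197   # assumption: 224x224 input -> 196 patches + class token
num_latents = 128   # number of compressed visual tokens in the description

# Random stand-ins for encoder outputs and learned parameters.
patch_tokens = rng.standard_normal((num_patches, d_model))
latents = rng.standard_normal((num_latents, d_model)) * 0.02
w_q, w_k, w_v = (
    rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(3)
)

visual_tokens = resample(patch_tokens, latents, w_q, w_k, w_v)
print(visual_tokens.shape)  # (128, 768)
```

The point of the design is that the text side sees a fixed-length visual context: the cross-attention layers in the language model only ever attend over 128 vectors, whatever the encoder's sequence length.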

Note: this model's text backbone is GPT-2 small, kept frozen during training. The model was trained for only a few epochs of vision-language alignment on Flickr8k (5 h on a single T4), so expect short, simple captions rather than long, fluent prose. The visual context only influences generation, so the "Image → Text" prompt prefix is optional steering, not a hard constraint.