Previous text-to-video approaches required every image and video used in training to be the same size, which meant significant pre-processing to cut videos down to size. But because Sora trains on “patches” rather than on the full frame of the video, it can gobble up any video or image without it first being cut down.
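
To make the idea concrete, here is a minimal sketch of how clips of different sizes can be turned into a single kind of training input. It is not Sora's actual code, which has not been published. The function name `patchify`, the patch sizes, and the choice to zero-pad the edges are all assumptions made for illustration.

```python
import numpy as np

def patchify(video, pt=2, ph=16, pw=16):
    """Split a (frames, height, width, channels) array into flattened patches.

    Hypothetical helper: patch sizes and zero-padding are illustrative choices.
    """
    t, h, w, c = video.shape
    # Zero-pad each axis up to the next multiple of the patch size,
    # so nothing has to be cropped away.
    pad = [(0, -t % pt), (0, -h % ph), (0, -w % pw), (0, 0)]
    video = np.pad(video, pad)
    t, h, w, _ = video.shape
    nt, nh, nw = t // pt, h // ph, w // pw
    # Carve the array into a grid of (pt, ph, pw) blocks ...
    x = video.reshape(nt, pt, nh, ph, nw, pw, c)
    # ... bring the grid axes to the front, the within-patch axes to the back ...
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # ... and flatten each block into one row.
    return x.reshape(nt * nh * nw, pt * ph * pw * c)

wide_clip = np.zeros((8, 64, 128, 3))   # 8 frames, landscape
tall_clip = np.zeros((4, 96, 48, 3))    # 4 frames, portrait
image = np.zeros((1, 40, 40, 3))        # a still image as a one-frame video

for name, clip in [("wide", wide_clip), ("tall", tall_clip), ("image", image)]:
    print(name, patchify(clip).shape)
```

Each input produces a different number of patches, but every patch has the same length, so a model that consumes a sequence of patches never needs the inputs to share a frame size.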

