
Vertical video cropping AI on iOS (auto reframe, 2026)

Vertical video cropping AI on iOS: how auto-reframe tracks faces and action, why most apps crop badly, and where Clipolette runs the reframe on-device on iPhone.


If you searched for vertical video cropping AI on iOS, your problem is specific: you have horizontal footage (a podcast shot 16:9, an interview on a tripod, a webinar recording, a landscape phone clip) and you need it to be 9:16 vertical for TikTok, Reels, or Shorts, with the speaker’s face actually in frame the whole time. Cropping the center of a 16:9 frame to 9:16 by hand works for exactly one shot: the one where the subject never moves. The moment two people are talking, or one person walks, or the camera was off-center to begin with, a fixed center crop puts a shoulder and an ear on screen while the face slides out the side.

What you want is the crop region to follow the action — to sit on the face when one person talks, swing to the second face when the conversation turns, and widen out when there’s on-screen movement worth keeping. That’s what “vertical video cropping AI” means: an auto-reframe pass that watches the footage and moves the 9:16 window to keep the important thing centered. This post is about how that actually works, why most iOS apps do it badly, and what to check before you trust an app to reframe footage you’re going to post.

What auto-reframe actually has to do

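A vertical crop of horizontal footage is a moving window. The source is, say, 1920×1080, and the output is 1080×1920. To fill that output, the crop window inside the source is the full 1080 pixels tall and only about 608 pixels wide (1080 × 9/16 ≈ 608), then scaled up to 1080×1920. That is 608 of the source’s 1920 horizontal pixels: you’re throwing away roughly two-thirds of the horizontal frame on every single frame, and the only question that matters is which third you keep.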
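The window arithmetic can be checked in a few lines. This is a minimal sketch; the function names are illustrative, and rounding the width to an even number is an assumption made because H.264 encoders with 4:2:0 chroma generally expect even dimensions.

```python
def crop_window_size(src_w: int, src_h: int, aspect_w: int = 9, aspect_h: int = 16) -> tuple[int, int]:
    """Largest aspect_w:aspect_h window that fits inside the source frame."""
    # Start with a full-height window; the width follows from the aspect ratio.
    # Round to the nearest even number (assumption: encoder-friendly sizes).
    crop_h = src_h
    crop_w = 2 * round(src_h * aspect_w / aspect_h / 2)
    if crop_w > src_w:
        # Source is narrower than the target aspect: use full width instead.
        crop_w = src_w
        crop_h = 2 * round(src_w * aspect_h / aspect_w / 2)
    return crop_w, crop_h


def crop_x_range(src_w: int, crop_w: int, center_x: float) -> tuple[int, int]:
    """Left and right edges of a window centred on center_x, clamped to the frame."""
    left = round(center_x - crop_w / 2)
    left = max(0, min(left, src_w - crop_w))
    return left, left + crop_w


crop_w, crop_h = crop_window_size(1920, 1080)
kept_fraction = crop_w / 1920  # share of the horizontal frame that survives
print(crop_w, crop_h, round(kept_fraction, 3))
print(crop_x_range(1920, crop_w, 960))   # subject in the middle of the frame
print(crop_x_range(1920, crop_w, 1900))  # subject near the right edge: window is clamped
```

The clamp in `crop_x_range` matters in practice: a face near the edge of the source cannot be perfectly centered, because the window is not allowed to leave the frame.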

A good auto-reframe pass answers that question per-frame with three jobs running together:

Subject detection. Find the thing that matters in each frame. For talking-head footage that’s a face, usually the face that’s currently speaking. For movement footage it’s the moving subject. For screen-share or product footage it’s the active region. This is a vision model running on every sampled frame, not a one-time guess at the start.

Speaker attribution. When there are two or more faces, the window should be on whoever is talking. Doing this well means correlating the audio — who’s speaking right now — with the faces on screen, and moving the crop to the active speaker. An app that only does face detection without speaker attribution will sit between two people or pick the wrong one.
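One way to picture the audio-to-face correlation is the sketch below. It is an illustration under stated assumptions, not how any particular app does it: `speech_energy` and the per-face `mouth_motion_by_face` scores are hypothetical per-frame signals that an audio analyser and a face-landmark model would have to supply, and Pearson correlation with a `min_corr` threshold is just one simple matching rule.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def active_speaker(speech_energy, mouth_motion_by_face, min_corr=0.3):
    """Return the id of the face whose mouth motion best tracks the audio.

    Returns None when no face correlates above min_corr, for example
    during silence or when the speaker is off screen.
    """
    best_id, best_corr = None, min_corr
    for face_id, motion in mouth_motion_by_face.items():
        corr = pearson(speech_energy, motion)
        if corr > best_corr:
            best_id, best_corr = face_id, corr
    return best_id


# Hypothetical signals over six sampled frames.
speech_energy = [0.1, 0.8, 0.9, 0.2, 0.7, 0.1]
mouth_motion_by_face = {
    "A": [0.0, 0.6, 0.7, 0.1, 0.5, 0.0],  # moves with the audio
    "B": [0.3, 0.3, 0.2, 0.3, 0.3, 0.3],  # mostly still
}
speaker = active_speaker(speech_energy, mouth_motion_by_face)
print(speaker)
```

Returning `None` instead of forcing a choice gives the caller a way to hold the previous crop target rather than jump to an arbitrary face.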

Motion smoothing. The crop window cannot snap. If it jumps to the exact center of the detected face on every frame, the output jitters — a nauseating micro-shake as the detection wobbles a few pixels frame to frame. The window has to ease toward the target with damping, hold still during small movements, and pan smoothly during large ones. This is the difference between “professional reframe” and “the crop is having a seizure.”
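The ease-and-hold behaviour can be sketched as a dead zone plus exponential damping. This is a minimal illustration; the `dead_zone` and `damping` values are assumptions chosen for the example, not tuned settings from any app.

```python
def smooth_path(targets, dead_zone=12.0, damping=0.15):
    """Ease a crop centre toward noisy per-frame targets.

    dead_zone: offsets (in pixels) at or below this are ignored, so small
               detector wobble does not move the window at all.
    damping:   fraction of the remaining distance covered per frame
               (0 < damping <= 1); smaller values pan more slowly.
    """
    if not targets:
        return []
    center = float(targets[0])
    path = [center]
    for target in targets[1:]:
        offset = target - center
        if abs(offset) > dead_zone:
            center += damping * offset  # ease toward the target, never snap
        path.append(center)
    return path


# Detector wobble of a few pixels: the window holds still.
held = smooth_path([960, 963, 957, 961, 958])

# The target jumps 440 px (for example, a speaker change): the window pans over many frames.
pan = smooth_path([960] + [1400] * 60)
print(held)
print(round(pan[1], 1), round(pan[-1], 1))
```

One consequence of this simple design is that the window settles up to `dead_zone` pixels short of the target. That is the price of holding still during small movements.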

Get any one of these wrong and the output looks off. Subject detection alone gives you a face that’s in frame but a crop that jitters. Smoothing without speaker attribution gives you a buttery-smooth window pointed at the wrong person. All three together is what makes auto-reframe feel like a human operator did it.

Why this is hard to do well on iOS specifically

iOS is, on paper, the ideal place to run this. The Neural Engine in the A-series and M-series chips is built for exactly this kind of vision inference, and Apple’s own Vision framework ships face detection, body pose, and saliency APIs that run on-device. The hardware to do face tracking and saliency at near-realtime is in every recent iPhone.

The gap is between “the hardware can do it” and “the app actually does it.” Most vertical-crop apps on the App Store fall into one of three failure modes:

Center-crop with no AI at all. A large number of “convert to vertical” tools just take the middle 9:16 slice and call it done. Fast, free, and wrong for anything but a perfectly centered single subject. These aren’t doing AI reframe; they’re doing a fixed crop with a marketing label.

Cloud reframe behind a native shell. The app looks native, but the reframe runs on a server. You upload the footage, their GPU runs the tracking, you download the result. This works but it means a 60-minute source uploads before anything happens, it doesn’t run offline, and footage under NDA can’t go through it. On an iPhone, often on cellular, the upload is the slowest part of the whole job.

On-device but face-only. Some apps do run a local face detector and crop to it, but skip speaker attribution and use weak smoothing. Single-subject footage looks fine. Put two people in frame and the window can’t decide who to follow; it either parks between them or flips back and forth on every word. The reframe is technically AI, technically on-device, and still not usable for the interview and podcast footage most people are trying to reframe.

The result is that searching “vertical video cropping AI iOS” returns a mix of fixed-crop tools wearing an AI label, cloud tools wearing a native label, and a few genuinely local tools whose tracking falls apart on multi-person footage. Knowing which one you’re looking at takes a specific test, below.

The on-device reframe pipeline on iPhone

Here’s what a full on-device reframe actually does on a recent iPhone, frame by frame, with no server involved:

  1. Sample the source. The video is read in place from local storage — Photos, Files, or an imported file. No upload, no copy to a staging bucket.
  2. Detect faces and saliency per sampled frame. A Vision-class model running on the Neural Engine returns face bounding boxes and a saliency map for each sampled frame. On an A17 Pro / A18 this runs at well above realtime for 1080p.
  3. Attribute the active speaker. The audio track is analyzed for speech activity and correlated with the on-screen faces, so the crop target is the person currently talking, not just the biggest face.
  4. Build a smoothed crop path. The per-frame targets are turned into a damped, eased path — the window holds during small motion, pans smoothly during large motion, and never snaps. This is computed over the whole clip, so the path is globally smooth rather than reactive frame-by-frame.
  5. Render the vertical output. The 9:16 window is sampled from the source along the smoothed path and encoded to the platform-preferred spec: H.264 high profile, 30fps, 1080×1920 (1920 vertical pixels), audio normalized.
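Step 4 is the one that benefits most from seeing the whole clip. The sketch below shows the idea with a centered moving average, a simple non-causal filter standing in for whatever path optimiser a real app uses. The `radius` value and the 1920/608 defaults are assumptions for illustration.

```python
def global_crop_path(targets, src_w=1920, crop_w=608, radius=15):
    """Centred moving average over per-frame target centres, clamped to the frame.

    Each output frame averages targets from `radius` frames before to
    `radius` frames after it, so the filter uses future frames as well
    as past ones.
    """
    half = crop_w / 2
    n = len(targets)
    path = []
    for i in range(n):
        lo = max(0, i - radius)
        hi = min(n, i + radius + 1)
        window = targets[lo:hi]
        center = sum(window) / len(window)
        # Keep the whole crop window inside the source frame.
        center = max(half, min(center, src_w - half))
        path.append(center)
    return path


# Speaker change at frame 30: the target jumps from x=500 to x=1400.
targets = [500] * 30 + [1400] * 30
path = global_crop_path(targets)
print(path[0], path[29], path[30], path[-1])
```

Because the average looks ahead, the window has already started moving by frame 29, one frame before the target jumps. A purely reactive filter can only begin its pan after the change has happened, which is why it reads as lag.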

End-to-end on an iPhone 15 Pro for a 60-second clip: a few seconds of compute. For a full 60-minute source being cut into clips and reframed, the reframe is a small fraction of the total — transcription and selection dominate. The point is that every step runs on the iPhone’s own silicon; airplane mode doesn’t break any of it.

Clipolette runs this exact reframe as part of its pipeline on iPhone 15 Pro and newer. It doesn’t just reframe — it transcribes the source, picks the highlight moments, captions them, and reframes each clip to vertical on-device. One $9.99/mo purchase covers iPhone, iPad, Mac, and Vision Pro, with a 3-day free trial. Install Clipolette from the App Store, drop in one horizontal source, and the first run shows you whether the auto-reframe holds the face the way you’d crop it by hand.

How to test an auto-reframe app before you trust it

You can’t tell from the App Store screenshots whether an app does real speaker-aware reframe or a fixed center crop. A five-minute test on your own footage tells you everything:

  1. Use a two-person clip. The single most revealing test footage is a 30-second interview or podcast clip where two people are visible and they trade turns talking. Single-subject footage hides every weakness; two-person footage exposes them all.
  2. Watch the crop during a speaker change. When the conversation turns from person A to person B, does the window move to B? If it stays on A, sits between them, or only moves when B’s face is bigger, there’s no speaker attribution — just face detection.
  3. Look for jitter on a held shot. Find a moment where the subject is sitting still and watch the edges of the frame. A good reframe is rock-steady. A weak one shows a constant micro-wobble as the detection box jitters.
  4. Pull the network and re-run. Turn on airplane mode and reframe the same clip. A native on-device app finishes normally. A cloud-behind-a-shell app fails at the upload step. This sorts the two architectures in thirty seconds.
  5. Check the pan, not just the position. During a moment where someone moves across the frame, does the window pan smoothly to follow, or does it lurch in steps? Smooth panning means real motion smoothing; stepping means a naive per-frame crop.

Run that on two or three candidates and the field collapses fast. Most fail step 2 or step 4.

Where auto-reframe still needs a human

Auto-reframe has limits. It is very good at the common cases and still needs a hand in a few:

Whiteboards, slides, and text on screen. If the important content is a slide or a whiteboard off to one side while the speaker is centered, the face-tracking crop will follow the face and cut off the slide. These shots need a manual crop region, or a different aspect ratio that keeps both. A good app lets you override the auto-crop per clip; Clipolette does, by dragging the crop region.

Fast, unpredictable action. This covers sports, dance, or any footage where the subject moves faster than a smoothed window can follow without either lagging or jerking. The trade-off between “keeps up” and “stays smooth” has no perfect setting for chaotic motion. Manual keyframing in a pro editor wins here.

Three or more people in frame. Two-person attribution is reliable. Three or more, in a tight shot where they overlap, can confuse the speaker attribution during fast cross-talk. The fix is usually a manual crop or accepting a slightly wider window that holds all of them.
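The “slightly wider window” fallback can be sketched as a fit check on the union of the face boxes. This is an illustration only; the `(x, width)` box format, the `margin` value, and the function name are assumptions.

```python
def group_window(face_boxes, src_w=1920, crop_w=608, margin=40):
    """Return (left, width) of a source window that holds every face box.

    face_boxes: list of (x, width) tuples in source pixels.
    If all faces fit inside the normal crop width, the normal width is
    kept; otherwise the window widens just enough to hold them.
    """
    left_edge = min(x for x, _ in face_boxes) - margin
    right_edge = max(x + w for x, w in face_boxes) + margin
    needed = right_edge - left_edge
    width = max(crop_w, min(needed, src_w))
    center = (left_edge + right_edge) / 2
    left = round(center - width / 2)
    left = max(0, min(left, src_w - width))  # stay inside the frame
    return left, width


# Two faces close together: the normal 608 px window holds both.
tight = group_window([(700, 120), (900, 120)])

# Three faces spread across the frame: the window has to widen.
spread = group_window([(300, 150), (900, 150), (1400, 150)])
print(tight, spread)
```

A widened window is no longer 9:16, so the output needs letterboxing or a different aspect ratio. That is the trade-off of holding everyone in frame.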

Footage where the subject is deliberately off-center. Some cinematography puts the subject in a third of the frame on purpose, with negative space carrying meaning. Auto-reframe will “correct” this by centering the subject, which destroys the composition. For intentional framing, manual override is the right call.

The pattern that works: let the reframe do the 90% of talking-head and interview footage it’s good at, and reach for a manual crop on the slide-heavy or deliberately-composed shots. An app that only offers auto with no override forces you to re-cut those in another tool; an app that offers both lets you stay in one place.

How vertical cropping fits the rest of the workflow

Auto-reframe is one step in a larger loop, and it’s most useful when it’s part of the whole pipeline rather than a standalone tool you bolt on at the end. If you’re starting from a long source, the “turn long video into TikTok on iPhone” post covers the full selection-plus-reframe loop end to end. For the iPad version of the same hardware story, “on-device video AI on iPad” goes deep on the Neural Engine side. The “AI reels creator for iPad Pro” post frames the same reframe step around the Reels use case.

If your source is specifically an interview, “interview to Instagram reel AI” covers the two-speaker reframe case in detail: the exact footage that breaks weak reframe tools. And if you came to this from a competitor, the “Descript alternative for iPhone” post covers the broader head-to-head on iOS.

When a cloud reframe tool is still the right call

To be fair about fit, the on-device path isn’t the answer for everyone:

  • You ingest from YouTube URLs. Cloud tools download and reframe server-side from a pasted link. The on-device path needs the file on the iPhone first.
  • You need a shared team review of reframed clips. On-device is solo; there’s no cloud workspace to pass clips around a team.
  • You’re on an iPhone older than the 15 Pro. The on-device reframe pipeline needs recent A-series silicon. Older iPhones won’t run the full local stack at usable speed.
  • Your footage is chaotic motion you’d hand-keyframe anyway. If you were going to manually reframe in a pro editor regardless, the AI step doesn’t save you the work.

If none of those describe you, and your job is the common case of reframing interview, podcast, webinar, or talking-head footage to vertical, the on-device path is faster (no upload), private (nothing leaves the phone), and works offline.

The bottom line

Vertical video cropping AI on iOS is the difference between a fixed center crop that loses the face and a moving window that follows the speaker the way a human operator would. Doing it well takes three things together: per-frame subject detection, audio-based speaker attribution, and motion smoothing that keeps the window steady. Most iOS apps ship one or two of the three, or run the whole thing on a server behind a native-looking shell.

The fastest way to tell which kind of app you’re looking at is the two-person-clip test: reframe a 30-second interview, watch whether the crop follows the active speaker, check for jitter on held shots, and pull the network to see if it still runs. Install Clipolette from the App Store on an iPhone 15 Pro or newer, drop in one horizontal interview clip, and watch the crop track the conversation. At $9.99/mo flat with one purchase covering iPhone, iPad, Mac, and Vision Pro, the reframe runs on the silicon you already paid for, and the footage never leaves the device.