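Why video prompts are harder to write than image prompts
An image only needs one still frame; a video has to evolve that frame along a timeline, and every frame has to stay physically plausible. Beginners tend to fail in three ways. First, the prompt is too abstract: the model cannot tell whether the subject or the camera is moving, so the whole shot drifts. Second, the prompt is overloaded: several actions are stacked into 5 seconds, and the model either picks one or fails at all of them. Third, the prompt never mentions stability: the model falls back to a push-in, pull-out, or pan and ruins the static shot you wanted.
So a video prompt must state four things explicitly: the subject's action (slow and specific), the camera move (static, push in, pull out, lateral track, or tilt), the duration (3, 5, or 8 seconds), and stability (hold static / no shake). Leave out any one of them and the shot gets out of control.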
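One way to make the four required elements hard to forget is to build the prompt from a small structure that refuses to produce output when any of them is blank. The sketch below is only an illustration: `ShotSpec` and its field names are hypothetical and do not belong to any model's API.

```python
from dataclasses import dataclass


@dataclass
class ShotSpec:
    """Hypothetical container for the four elements a video prompt must state."""
    subject_action: str    # e.g. "a barista slowly pours espresso into a glass cup"
    camera: str            # e.g. "camera holds static"
    duration_seconds: int  # e.g. 5
    stability: str         # e.g. "no camera shake"

    def to_prompt(self) -> str:
        # Collect the names of any text fields that were left blank.
        text_fields = {
            "subject_action": self.subject_action,
            "camera": self.camera,
            "stability": self.stability,
        }
        missing = [name for name, value in text_fields.items() if not value.strip()]
        if self.duration_seconds <= 0:
            missing.append("duration_seconds")
        if missing:
            raise ValueError(f"missing prompt elements: {', '.join(missing)}")
        # Join the four elements in a fixed order: action, camera, duration, stability.
        return ", ".join([
            self.subject_action.strip(),
            self.camera.strip(),
            f"{self.duration_seconds}-second clip",
            self.stability.strip(),
        ])
```

Calling `ShotSpec("a barista slowly pours espresso into a glass cup", "camera holds static", 4, "no camera shake").to_prompt()` returns a single comma-separated prompt; leaving `camera` empty raises a `ValueError` instead of silently sending an incomplete prompt.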
Four common shot skeletons
Static atmosphere shot (most stable)
[scene + subject] + camera holds static + gentle [micro motion: drift, sway, ripple] + [time of day + light] + [no shake]
Example: a single raindrop falling on a glass window at night, camera holds static, neon lights blurred in background, gentle vertical impact ripple, no camera shake. This is the most stable shot type, with the highest success rate at 3 to 5 seconds.
Slow push-in (dolly in)
[scene] + slow steady dolly in toward [subject] + [motion direction] + over 5 seconds + cinematic pacing + no jitter
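The keywords are "slow", "steady", and "over X seconds". Without a stated duration, the model falls back to its own default push-in speed, which is usually too fast.
Lateral follow shot (tracking shot)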
[subject walking/moving] + camera tracks horizontally to the right at the same pace + [environment scrolling past] + medium shot + steady gimbal feel
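The most common tracking failure is a subject that moves fast while the camera falls behind, so "at the same pace" must be written explicitly. "steady gimbal feel" keeps the frame from shaking.
Close-up on a single action (action close-up)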
close-up of [subject performing single action] + [single specific motion verb: pours, lifts, turns] + slow motion 120fps look + shallow depth of field + camera holds static
An action close-up must contain exactly one action, meaning one verb. "pours and stirs and lifts" will break the model; each clip gets only one core action.
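The four skeletons can be kept as reusable templates so that no bracketed slot is left unfilled by accident. This is a minimal sketch using Python's standard `string.Template`; the slot names (`scene`, `subject`, `verb`, and so on) and the `fill_skeleton` helper are assumptions made for illustration, and the templates are slightly simplified versions of the skeletons above.

```python
from string import Template

# Simplified versions of the four skeletons, with $-placeholders for the slots.
SKELETONS = {
    "static": Template(
        "$scene, camera holds static, gentle $micro_motion, $light, no camera shake"
    ),
    "dolly_in": Template(
        "$scene, slow steady dolly in toward $subject, over $seconds seconds, "
        "cinematic pacing, no jitter"
    ),
    "tracking": Template(
        "$subject, camera tracks horizontally to the right at the same pace, "
        "$environment, medium shot, steady gimbal feel"
    ),
    "action_closeup": Template(
        "close-up of $subject, $verb, slow motion 120fps look, "
        "shallow depth of field, camera holds static"
    ),
}


def fill_skeleton(kind: str, **slots: str) -> str:
    # Enforce the one-verb rule for action close-ups.
    if kind == "action_closeup" and len(slots.get("verb", "").split()) != 1:
        raise ValueError("action close-up takes exactly one motion verb")
    # Template.substitute raises KeyError when a slot is missing,
    # so an incomplete skeleton never reaches the model.
    return SKELETONS[kind].substitute(**slots)
```

For example, `fill_skeleton("action_closeup", subject="melted chocolate on a croissant", verb="drips")` yields a complete close-up prompt, while passing `verb="pours and stirs"` is rejected.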
Video prompt structure diagram
Bad example vs. good example
✗ Bad example
a beautiful cinematic video of a girl walking in a forest, magical, dreamy, stunning, amazing 4k
No camera move, no duration, no stability cue, and the action is a vague "walking". The model improvises, and 10 runs give 10 different results.
✓ Good example
a young woman in a wool coat walks slowly forward through a misty pine forest, camera tracks horizontally to the right at the same walking pace, 5-second shot, soft morning backlight, cinematic shallow depth of field, steady gimbal feel, no camera shake
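Everything is covered: the subject's walking speed (slowly), the camera move (tracks horizontally), pace matching (same pace), duration (5-second), light (backlight), and stability (no shake). Ten runs come out highly consistent in style.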
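The difference between the two examples can be checked mechanically with a rough keyword scan. The sketch below assumes a small set of regular expressions for each element; the patterns are illustrative guesses based on the wording used in this article, not a vocabulary that any model documents.

```python
import re

# Hypothetical keyword patterns; extend them for the wording your model responds to.
REQUIRED = {
    "action_pace": r"\bslow(ly)?\b|\bsteady\b|\breal-time\b",
    "camera": r"\bcamera (holds|tracks|pans|tilts)\b|\bdolly\b",
    "duration": r"\b\d+-second\b|\bover \d+ seconds\b",
    "stability": r"\bno (camera )?shake\b|\bno jitter\b|\bsteady\b|\bholds (completely )?static\b",
}


def missing_elements(prompt: str) -> list[str]:
    """Return the names of required elements that the prompt never mentions."""
    text = prompt.lower()
    return [name for name, pattern in REQUIRED.items() if not re.search(pattern, text)]
```

Run against the bad example, this reports all four elements as missing; run against the good example, it returns an empty list. A keyword scan cannot judge whether the prompt is any good, only whether each element was mentioned at all.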
Five real samples
a single raindrop slides down a foggy window at night, camera holds static, neon city lights blurred in the background, slow motion 120fps look, gentle vertical motion only, shallow depth of field, no camera shake, 5-second clip
The best starter shot for beginners: a static camera plus a single micro-motion, which almost any model renders reliably.
close-up of a barista's hands slowly pouring espresso into a glass cup, warm cafe interior blurred behind, single action of pouring only, camera holds static, soft side light from the right, real-time pacing, 4-second clip
"single action of pouring only" stops the model from adding extra actions such as stirring, lifting, or setting the cup down.
a young woman in a black trench coat walks forward through a rain-soaked Tokyo alley at night, camera tracks horizontally to the right at the same walking pace, neon reflections on wet ground, shallow depth of field, steady gimbal feel, 6-second cinematic shot
Pace matching is the key. "at the same walking pace" keeps the camera from running faster or slower than the person.
extreme close-up of melted chocolate slowly dripping onto a glossy croissant, camera holds completely static, single dripping motion, warm side light, shallow depth of field, slow motion 120fps look, 3-second clip
The core of a food clip is a single action (dripping); slow motion magnifies the texture detail.
misty mountain valley at sunrise, slow steady drone dolly forward over the treetops, sunlight breaking through clouds, very gentle pacing over 8 seconds, cinematic wide shot, no jitter, smooth motion
"slow steady drone dolly forward" gives the model an explicit shot type and direction, and "over 8 seconds" controls the push-in speed.
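The 6 most common failure points in video
"moving", "walking", and "interacting" are all vague words. Replace them with specific verbs: "slowly pours", "lifts the cup", "turns the head to the left".
"she walks in, sits down, picks up the cup, drinks" packs 4 actions into 5 seconds and is bound to fail. One shot gets only 1 core action.
Without a duration, the model uses its own default pacing. State "3-second / 5-second / 8-second clip" together with "slowly / steady / real-time".
Without "no shake / steady", most models default to slight handheld shake, which ruins a static atmosphere shot.
Fast subject motion combined with a large camera move almost always breaks the model. Lock one and let the other move.
"masterpiece, best quality, 8k" and similar image-quality tags have almost no effect on video and only compete for the model's attention. For video, "cinematic, shallow depth of field" is enough.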
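Two of these failure points, vague verbs and leftover image-quality tags, are simple enough to catch with a word scan before a prompt is submitted. The following is a minimal sketch under that assumption; the word lists come from the examples in this section, and `lint_prompt` is a hypothetical helper. It does not attempt to detect stacked actions or conflicting motion.

```python
import re

# Word lists taken from the examples in this section; extend as needed.
VAGUE_VERBS = {"moving", "walking", "interacting"}
IMAGE_QUALITY_TAGS = ("masterpiece", "best quality", "8k")


def lint_prompt(prompt: str) -> list[str]:
    """Return warnings for vague verbs and image-only quality tags."""
    text = prompt.lower()
    words = set(re.findall(r"[a-z]+", text))
    warnings = []
    # Whole-word match, so "walks" is not confused with "walking".
    vague = sorted(VAGUE_VERBS & words)
    if vague:
        warnings.append(f"vague verbs: {', '.join(vague)}")
    # Substring match, because some tags contain spaces or digits.
    tags = [tag for tag in IMAGE_QUALITY_TAGS if tag in text]
    if tags:
        warnings.append(f"image-quality tags: {', '.join(tags)}")
    return warnings
```

A prompt such as "a girl walking in a forest, masterpiece, best quality, 8k" triggers both warnings, while the barista sample passes cleanly.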
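Comparison of mainstream video models
| Model | Typical duration | Strengths | Notes |
|---|---|---|---|
| Runway Gen-3 | 5-10 s | Cinematic look, character shots, tracking | Good action continuity; sensitive to camera-move keywords |
| Pika 2.x | 3-5 s | Short atmosphere pieces, concept videos | Needs short, explicit action descriptions |
| Kling 2.x | 5-10 s | Character performance, product ads | Handles Chinese well; prompts can be written directly in Chinese |
| Seedance 2.0 | 5-8 s | Landscape-format cinematic shots, camera moves | See this site's Seedance page |
| Sora (partially available) | 10-60 s | Long-take narrative | Prompts can read closer to natural language |
| Hailuo / MiniMax | 5-6 s | Real people and landscapes | Handles long Chinese sentences well |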
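When a shot has a fixed target length, the typical durations in the table can be turned into a quick lookup. The sketch below copies the ranges from the table as-is; they are this article's rough guidance, not limits published by the vendors, and `models_for_duration` is a hypothetical helper.

```python
# Typical clip lengths in seconds, copied from the table above.
# Treat them as rough guidance, not hard vendor limits.
TYPICAL_SECONDS = {
    "Runway Gen-3": (5, 10),
    "Pika 2.x": (3, 5),
    "Kling 2.x": (5, 10),
    "Seedance 2.0": (5, 8),
    "Sora": (10, 60),
    "Hailuo / MiniMax": (5, 6),
}


def models_for_duration(seconds: int) -> list[str]:
    """List the models whose typical range includes the requested length."""
    return [
        name
        for name, (low, high) in TYPICAL_SECONDS.items()
        if low <= seconds <= high
    ]
```

For an 8-second drone shot, `models_for_duration(8)` returns Runway Gen-3, Kling 2.x, and Seedance 2.0; a 3-second clip leaves only Pika 2.x.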