05 Mar 2015

How to understand video scaling and framerate conversion - part one


There's some amazing and rather beautiful science called for when we scale video and convert frame rates. This rather technical-sounding subject turns out to be a fascinating story.

You'd be forgiven for thinking that converting video images from one format to another should be reasonably straightforward. After all, we're used to seeing computers effortlessly scale pictures up and down in photo viewers and web pages at the roll of a mouse wheel. Frame rates never seem to be a problem, either; your monitor probably runs at 60 or 72 frames per second, yet you can watch YouTube videos shot all over the world without any problem whatsoever. Surely, video conversion couldn't be simpler, right?

Why scaling video is more difficult than you think

Well, sort of. In one sense, you can just buy a converter, such as Teranex from Blackmagic, and it'll do a solid job, but it's worth understanding why something as apparently straightforward as scaling an image down can be done so badly. Zooming out in a photo viewer application frequently reveals problems, as fine detail in the image interacts with the pixels of the display. Computers can also do a bad job of converting frame rates, where the frame rate of the video and the rate at which the monitor displays images interact such that some frames are displayed for longer than others. This might be okay for YouTube, but it can become visible in slow pans or zooms. In broadcast television, where all of the equipment involved in the display chain is synchronised to the frame rate of the video, things must be smooth.

Problems of both frame rate interpolation and scaling are instances of aliasing. Aliasing is familiar, particularly in real-time 3D graphics applications, but it's a problem in information theory that applies whenever detail in a digitally sampled signal is so fine that its frequency exceeds half of the sampling rate (the familiar Nyquist limit). The easiest way to understand this is with audio, where a signal sampled 48,000 times per second cannot reproduce any audio frequency above 24,000Hz. This is well chosen because 24kHz is more than adequate to record the sounds most humans can hear. If, however, we want to reduce that sampling rate to 24,000Hz (perhaps to save storage space, for a rough version), we can't just leave out every other sample; the result would be aliasing. We must first pass the audio through a low-pass filter to remove all frequencies above 12,000Hz.
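To make the audio case concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions rather than anyone's production resampler: the helper names (sine, lowpass_taps, fir_filter), the 18kHz test tone and the 101-tap Hamming-windowed sinc filter are all choices made for this example.

```python
import math

def sine(freq_hz, rate_hz, count):
    """Return `count` samples of a unit sine wave at freq_hz, sampled at rate_hz."""
    return [math.sin(2 * math.pi * freq_hz * n / rate_hz) for n in range(count)]

def lowpass_taps(cutoff_hz, rate_hz, num_taps=101):
    """Hamming-windowed sinc low-pass filter (an illustrative design choice)."""
    fc = cutoff_hz / rate_hz              # cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]      # normalise for unity gain at 0Hz

def fir_filter(samples, taps):
    """Apply the filter, keeping only outputs where the taps fully overlap the input."""
    k = len(taps)
    return [sum(taps[j] * samples[i + j] for j in range(k))
            for i in range(len(samples) - k + 1)]

# An 18kHz tone is legal at 48kHz sampling, but lies above the new 12kHz limit.
tone = sine(18000, 48000, 480)

# Naive conversion to 24kHz: just leave out every other sample.
naive = tone[::2]

# The result is sample-for-sample identical to an inverted 6kHz tone: an alias.
alias = sine(6000, 24000, len(naive))
worst_mismatch = max(abs(a + b) for a, b in zip(naive, alias))

# Correct conversion: low-pass at 12kHz first, then drop samples.
filtered = fir_filter(tone, lowpass_taps(12000, 48000))[::2]

print("naive vs inverted 6kHz tone, worst mismatch:", worst_mismatch)
print("peak level after filtering first:", max(abs(s) for s in filtered))
```

The naive version turns an 18kHz tone into a full-strength 6kHz tone that was never in the recording, whereas filtering first leaves almost nothing behind, which is the correct outcome for a sound the new sampling rate cannot represent.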

Exactly the same issue applies when scaling images, in the two dimensions of an image rather than the one linear dimension of audio. In this case, the image must first be low-pass filtered (that is, blurred) to ensure that no detail finer than the Nyquist limit of the desired output exists. We've talked in some detail about how Gaussian blurs work, and these are sometimes used. These concepts can be demonstrated by a simple experiment: take a high resolution photographic image and rescale it to one-third of its original size in Photoshop, using the "nearest neighbour" scaling option, which simply leaves out groups of pixels to achieve the desired output size. Notice the aliasing. Repeat the same procedure, but first apply a radius-1.5 (that is, diameter 3) Gaussian blur to the image, and notice that the aliasing is, largely, gone.
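The Photoshop experiment can be approximated in a few lines of plain Python. This is a rough sketch, not a description of what Photoshop does internally: it assumes the blur "radius" can be treated as the Gaussian's standard deviation, it uses a synthetic image of one-pixel stripes instead of a photograph, and the function names are invented for the example.

```python
import math

def gaussian_kernel(sigma):
    """Normalised 1D Gaussian weights, truncated at three standard deviations."""
    radius = math.ceil(3 * sigma)
    weights = [math.exp(-(k * k) / (2 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_rows(image, kernel):
    """Blur each row horizontally, repeating the edge pixels at the borders."""
    radius = len(kernel) // 2
    out = []
    for row in image:
        width = len(row)
        out.append([
            sum(kernel[k + radius] * row[min(max(x + k, 0), width - 1)]
                for k in range(-radius, radius + 1))
            for x in range(width)
        ])
    return out

def transpose(image):
    return [list(column) for column in zip(*image)]

def gaussian_blur(image, sigma):
    """Separable blur: horizontal pass, then vertical pass via transposition."""
    kernel = gaussian_kernel(sigma)
    return transpose(blur_rows(transpose(blur_rows(image, kernel)), kernel))

def nearest_neighbour(image, factor):
    """Shrink by keeping only the centre pixel of each factor-by-factor block."""
    offset = factor // 2
    return [row[offset::factor] for row in image[offset::factor]]

# Test image: vertical stripes one pixel wide, the finest detail possible.
stripes = [[255 if x % 2 == 0 else 0 for x in range(30)] for _ in range(6)]

naive = nearest_neighbour(stripes, 3)
smooth = nearest_neighbour(gaussian_blur(stripes, 1.5), 3)

print("nearest neighbour only:", naive[0])
print("blurred first:         ", [round(v, 1) for v in smooth[0]])
```

Without the blur, the one-third-size image still shows full-contrast black and white stripes, now three times coarser relative to the picture than anything in the original. With the blur, the interior of the row settles to a mid grey, which is what stripes too fine to resolve ought to look like; the outermost pixels differ slightly because of the simple edge handling.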

The fact that it's taken this long to discuss the practicalities of scaling things down makes it clear that things aren't necessarily as simple as they look. What's more, these are tasks where we're not creating new information; we're simply removing information without unpleasant side-effects. Scaling things up, on the other hand, as when we need to use a piece of crucial footage that's only available in standard definition in a high-definition production, is much more difficult. Most editing software will attempt it, although many packages simply perform either a linear or pseudo-linear interpolation, duplicating or blending neighbouring pixels according to a set of basic rules. We're all familiar with the soft appearance of a blown-up computer image.
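As a small illustration of why simple upscaling looks either blocky or soft, the sketch below enlarges a single row of pixels containing one hard edge. The function names are hypothetical and the left-aligned sampling positions are a simplification chosen for clarity; real scalers differ in detail.

```python
def upscale_nearest(row, factor):
    """Enlarge by duplicating each source pixel `factor` times."""
    return [row[i // factor] for i in range(len(row) * factor)]

def upscale_linear(row, factor):
    """Enlarge by linearly interpolating between neighbouring source pixels."""
    last = len(row) - 1
    out = []
    for i in range(len(row) * factor):
        pos = min(i / factor, last)   # position measured in source pixels
        a = int(pos)                  # source pixel to the left
        b = min(a + 1, last)          # source pixel to the right
        w = pos - a                   # how far we are towards the right pixel
        out.append((1 - w) * row[a] + w * row[b])
    return out

edge = [0, 0, 255, 255]               # a hard black-to-white edge

print(upscale_nearest(edge, 4))
print(upscale_linear(edge, 4))
```

Duplication keeps the edge hard but adds no detail, so pictures look blocky. Interpolation spreads the edge over a ramp of intermediate greys (63.75, 127.5 and 191.25 here), which is the softness mentioned above. Neither approach creates the detail a genuinely higher-resolution image would contain.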

Why scaling frame rates is even more difficult

Compared to scaling video image size, scaling frame rates is definitely the more difficult task. Reducing frame rate by leaving out frames, on the face of it, is pretty straightforward, at least if we wish to divide the frame rate by an integer (reducing it to one half, one third and so on), as is common for 15fps internet video shot on 30fps devices. Strictly speaking, for maximum apparent smoothness of movement, a reduced-frame-rate version should show greater motion blur per frame. This is possible with optical flow interpolation (more below), but is often overlooked in simple situations, such as where the frame rate is halved.
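The difference between dropping frames and adding motion blur can be sketched with a toy "video" in which each frame is a single row of pixels. Note the assumption: averaging adjacent frames is used here as a crude stand-in for the extra blur, not the optical flow approach mentioned above, and all the names are invented for the example.

```python
def moving_dot(num_frames, width):
    """Toy footage: a bright dot moving one pixel to the right per frame."""
    return [[255 if x == n else 0 for x in range(width)]
            for n in range(num_frames)]

def drop_frames(frames, factor):
    """Divide the frame rate by an integer by keeping every factor-th frame."""
    return frames[::factor]

def average_groups(frames, factor):
    """Divide the frame rate by averaging each group of `factor` frames,
    which roughly imitates a longer exposure per output frame."""
    out = []
    for start in range(0, len(frames) - factor + 1, factor):
        group = frames[start:start + factor]
        out.append([sum(pixels) / factor for pixels in zip(*group)])
    return out

frames_30 = moving_dot(8, 8)              # pretend this was shot at 30fps
dropped_15 = drop_frames(frames_30, 2)    # 15fps: sharp dot jumping two pixels
smeared_15 = average_groups(frames_30, 2) # 15fps: dot smeared across two pixels

print(dropped_15[0], dropped_15[1])
print(smeared_15[0], smeared_15[1])
```

In the dropped version each frame still holds a crisp dot that now jumps two pixels at a time. In the averaged version the dot becomes a half-brightness smear covering the distance travelled during the output frame, which is closer to what a camera shooting at the lower rate with a longer exposure would have recorded.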

The problem is in making small frame rate changes, perhaps to move from 29.97 to 25 frames per second when broadcasting US-produced material in the UK, or where frame rate must be increased, as we might need to do when producing a slow-motion version of a shot. Both require the generation of new frames that represent a point in time between two pre-existing frames. Basic techniques for this simply involve dissolving between frames or groups of frames. However, as with simple interpolation when scaling up images, this is flawed; the point in time between two frames is not accurately represented by an average of their pixel values. The originally-broadcast versions of shows like Star Trek: The Next Generation are good examples of fairly primitive, 1980s standards conversion, which used these basic techniques. It's particularly visible in the star fields, which shimmer across the screen in a slightly staccato series of five five-frame dissolves per second, appearing slightly soft, due to the interpolation from NTSC's 525 to PAL's 625 line image. For a big-budget TV show, it looked terrible, technically, although it was state of the art at the time.
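A dissolve-based converter is simple enough to sketch directly. The following is a toy model under stated assumptions, not a reconstruction of any real standards converter: frames are single rows of pixels, the function name is invented, and the rates are rounded to 30 and 25 for readability (an exact rate could be passed as the string "30000/1001").

```python
from fractions import Fraction

def blend_convert(frames, src_fps, dst_fps):
    """Change frame rate by cross-dissolving between the two nearest source frames."""
    step = Fraction(src_fps) / Fraction(dst_fps)  # source frames per output frame
    last = len(frames) - 1
    out = []
    i = 0
    while i * step <= last:
        t = i * step                  # output frame's time, in source frame units
        a = int(t)                    # source frame at or before that time
        b = min(a + 1, last)          # source frame after it
        w = float(t - a)              # dissolve amount towards frame b
        out.append([(1 - w) * p + w * q
                    for p, q in zip(frames[a], frames[b])])
        i += 1
    return out

# Toy footage: a bright dot (a "star") moving one pixel per frame.
source = [[255 if x == n else 0 for x in range(8)] for n in range(7)]

converted = blend_convert(source, 30, 25)   # 30fps down to 25fps
slow_mo = blend_convert(source, 25, 50)     # double the frames for slow motion

print(len(source), "frames in,", len(converted), "frames out")
print([round(v, 1) for v in converted[1]])
print(slow_mo[1])
```

The in-between frame of the slow-motion version contains two half-brightness dots at the old and new positions, not one dot halfway between them. That is the flaw described above: the average of two frames is a double image, not a picture of the intermediate moment.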

Essentially the same problem

Frame rate scaling improves massively with optical flow interpolation, where the computer attempts to recognise objects in the scene based on their appearance and track their motion to produce new frames from old, but in fact it's the same issue. Whether we're scaling time or pixel dimensions, up or down, the problem is the same: accurately determining what new sample comes between two old samples, be they pixels separated by either space or time. Doing this properly requires active analysis of the image, to determine which changes in colour define edges, which pixel groups represent a particular object (and should therefore be moved wholesale to a new location in a newly-generated frame), and other techniques. Properly done, this is very hard work, especially if we want to involve high frame rates, beyond-HD resolution or stereoscopy.
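The idea of moving pixels rather than averaging them can be shown with a deliberately tiny sketch. It is a heavily simplified stand-in for optical flow, with loud assumptions: frames are single rows of pixels, the whole frame shares one motion vector, pixels wrap around at the edges, and motion is found by brute-force matching. Real systems estimate motion per object or per pixel, and nothing here describes how any particular product works.

```python
def estimate_shift(prev, nxt, max_shift=4):
    """Find the whole-frame shift (in pixels) that best maps prev onto nxt,
    by trying every candidate and keeping the one with the smallest error."""
    n = len(prev)
    best_shift, best_error = 0, float("inf")
    for shift in range(-max_shift, max_shift + 1):
        error = sum(abs(prev[x] - nxt[(x + shift) % n]) for x in range(n))
        if error < best_error:
            best_shift, best_error = shift, error
    return best_shift

def motion_interpolate(prev, nxt, t=0.5):
    """Build the frame at time t (0 = prev, 1 = nxt) by moving prev's pixels
    part of the way along the estimated motion."""
    n = len(prev)
    move = round(estimate_shift(prev, nxt) * t)
    return [prev[(x - move) % n] for x in range(n)]

def dissolve(prev, nxt, t=0.5):
    """The simple alternative: average the pixel values."""
    return [(1 - t) * p + t * q for p, q in zip(prev, nxt)]

frame_a = [0, 0, 255, 0, 0, 0, 0, 0, 0, 0]   # dot at pixel 2
frame_b = [0, 0, 0, 0, 0, 0, 255, 0, 0, 0]   # dot at pixel 6

print("dissolve:          ", dissolve(frame_a, frame_b))
print("motion compensated:", motion_interpolate(frame_a, frame_b))
```

The dissolve produces two faint dots at pixels 2 and 6, while the motion-compensated version places a single full-brightness dot at pixel 4, where the object plausibly was at the halfway point. The hard part in practice is everything this sketch avoids: many objects moving differently, objects that appear or are hidden between frames, and doing the analysis fast enough for video.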

For the purpose of this series, Blackmagic has supplied us with a Teranex 3D processor, which does an impressive job of implementing all these techniques in electronics, carefully and with good quality, in a single 1U rack package. Over the next three articles, we'll be looking at examples of footage and the problems faced when converting them from one format to another, as well as exploring Teranex's special features in the world of noise reduction and other processing.


Phil Rhodes

Phil Rhodes is a Cinematographer, Technologist, Writer and above all Communicator. Never afraid to speak his mind, and always worth listening to, he's a frequent contributor to RedShark.
