When shooting video using multiple cameras, you must synchronize their output so you can cut back and forth between them, or show a picture in picture. In my bicycle videos, I use forward-facing and rearward-facing cameras at the same time, and cameras mounted on different bicycles. Synchronizing 2D video is relatively easy. 3D video and multi-channel sound require tight synchronization and special equipment or techniques.
The high-tech way to synchronize video and audio recordings is called time code. Time code is traditionally used where recorders are synchronized by a single timing signal sent over cables. A streaming digital signal with timing information is recorded along with the audio and video. Time code can control the playback speed of analog audio tape, and can fast-forward and rewind tapes to synchronize them with each other.
Time code also may be sent and received using wireless devices, but as of now, no consumer-grade helmet camera supports this feature. In any case, a wireless signal only carries over a limited distance, and is vulnerable to interference.
Professional-grade equipment which derives time code from GPS satellite signals is available. GPS can synchronize any number of cameras, anywhere on Planet Earth, and can also identify the location of the shoot to within a few feet, and the direction in which a camera is pointing. It only works where there is a view of the open sky, though, and is still expensive and complicated to use.
Timing information may similarly be taken from shortwave radio time reference stations or computer networks.
I expect that a smartphone app will be offering GPS timing sooner rather than later, but smartphone cameras are rather limited, and this feature would be more useful in a dedicated camera. Some dedicated cameras do include GPS capability, but this is generally a high-end feature.
How do we synchronize when shooting on a budget, lacking time code?
We fall back on the classic synchronizing technique used in film: a marker visible in the image and audible in the soundtrack. Bicycle video takes are generally rather long – you start recording, and then ride – so setting this marker isn't much of an inconvenience.
A traditional wooden slate clapperboard
The clapperboard – with its chalkboard to indicate a take number – is the time-honored time-alignment tool in the motion-picture industry. The clapperboard provides a timing reference at the start of a take, and the speed of all the cameras is synchronized to the power-line frequency. Once clips are aligned on the editing table, they remain in sync.
As I use video cameras which record audio, there's no need for a clapperboard's written identification of the take. I make a verbal announcement instead. For synchronization, I use a hand clap, visible to every camera and recorded in every soundtrack. If I am using two cameras on my helmet to get front and rear views, I will back up to a window so the rear-facing camera shows the reflection, or tap the side of the helmet to shake both images and make a sound.
In the short video clip, the main image is from a camera on my helmet. The picture-in-picture is from a camera duct-taped to the rear rack of a friend's bicycle. My camera is looking forward at him and his is looking back at me. You will see and hear the hand clap, and then hear my short announcement identifying the take. Here is a still of the frame of the video where my hands come together:
The timing of all modern video cameras and digital audio recorders is set by quartz-crystal electronic clocks. Timing standards are very precise. Cameras will typically run for many minutes without a noticeable loss of synchronization. We can thank documentary filmmaker Richard Leacock for introducing crystal control. He used it with modified 8mm film and audio cassette recorders at the Massachusetts Institute of Technology in the early 1970s -- but there's no need any more to modify equipment.
Now, for a bit of technical background to help you use this technique when editing video.
A video consists of a series of frames (individual images) shown one after another quickly enough to create the illusion of motion. The rate is nominally 30 frames per second (actually, 29.97 for color signals) in countries which use 60 Hertz (cycles per second) AC power; the rate is 25 frames per second in countries which use 50 Hertz AC power. The frame rate is nominally ½ the power-line frequency so that any effect of imperfect power-supply filtering stays in the same place from one frame to the next or moves very slowly, rather than causing annoying flickering or wobbling.
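A quick illustration of the frame-rate arithmetic (a sketch of my own; the helper function is hypothetical, not from any video software):

```python
# Frame durations implied by the nominal rates above.

def frame_duration_ms(frames_per_second):
    """Duration of one video frame, in milliseconds."""
    return 1000.0 / frames_per_second

# 60 Hz countries: nominally 30 fps, actually 29.97 for color.
print(round(frame_duration_ms(29.97), 2))  # ~33.37 ms per frame
# 50 Hz countries: 25 fps, half the 50 Hz power-line frequency.
print(round(frame_duration_ms(25), 2))     # 40.0 ms per frame
```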
Video editing works only in one-frame increments -- 1/25 or 1/30 second. Special software can adjust audio more precisely, and that may be desirable for some purposes; I'll discuss that later.
Some digital video storage formats align the audio to each frame. Other formats use keyframing, where the audio is only tacked to the video once every few frames. Between keyframes, these formats only store the differences between images, resulting in smaller digital files. If a video clip doesn't start with a keyframe, the audio and video may start slightly out of step – and then they stay that way.
All of the better video editing software applications give you a way to realign tracks to adjust timing. To do this, you need a clear cue in every track -- your hand clap.
Sometimes, you need to find any cue you can. One of my favorites in street videos is the sight and sound of a car rolling over a manhole cover. Footsteps also are good. Aligning on speech is harder, though the sound of the letter P begins when the mouth opens suddenly, and the waveform in the soundtrack shows a sudden increase in volume. The letters B and M are less good, because the vocal cords are vibrating before the mouth opens, but may be usable. If you do much aligning on speech, you will find that you are beginning to learn to read lips!
Sound travels only about 1100 feet (335 meters) per second, so a distant camera will record delayed audio – like when you see a dribbled basketball out of sync with its sound. A distance of only 30 feet or 10 meters will throw synchronization off by nearly one frame. For good synchronization on an audible cue, all microphones need to be within a few feet of the cameras. Align a distant camera on the visual cue, rather than the audible one. An additional hand clap near the distant camera will align its audio and video tracks.
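The arithmetic behind these figures can be sketched as follows (my own illustration; the function name is made up, and the 1100 ft/s value is the rounded figure used above):

```python
# Audio delay, in video frames, for a camera at a given distance.

SPEED_OF_SOUND_FT_PER_S = 1100.0  # rounded figure from the text

def audio_delay_frames(distance_ft, fps=30):
    """Frames of lag between picture and sound at this distance."""
    return distance_ft / SPEED_OF_SOUND_FT_PER_S * fps

print(round(audio_delay_frames(30), 2))  # ~0.82 frame at 30 feet
print(round(audio_delay_frames(37), 2))  # ~1.01 frames -- about one frame
```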
When editing, I zoom in on the tracks until I can move back and forth one frame at a time. I turn on the “audio scrubbing” function so I can hear the hand clap or other cue. I set a marker for the hand clap separately for each audio and video track, then slide the tracks until all the markers align.
Moving video forward one frame at a time in Pinnacle Studio, the editing suite I use, plays the audio for the frame that is being displayed. So, as I reach the same frame where the hands come together, I hear the sound of the clap, and see its sharp peak in the audio waveform display.
The image below is of the tracks from the video example above, in the video editing application Pinnacle Studio 12. In the thumbnail images, you can see the first frame of each video. The lower thumbnail holds an icon indicating that it contains a picture in picture. The time line (orange line near the top) is expanded so each tick represents an individual frame of video. The cursor is aligned over four markers, which I placed at the hand clap frame in each video and audio track, then slid left or right until they lined up. The video in the lower track is grayed out because it is locked, so I can move the audio independently of it.
Hand clap editing in Pinnacle Studio v. 15
I align a take this way, save the resulting file, and then when I am editing, I save under a new filename so I still have the original full-length take to re-use.
You might ask: with timing only to the nearest 1/25 or 1/30 second, will there be an echo? No! The sense of hearing and the frame rate of video are nicely matched. Humans can't hear an echo unless it is delayed by 1/30 of a second or more, a phenomenon called the “precedence effect” or the “Haas effect”, after the scientist who first described it. Two clips aligned to the nearest frame will be out of sync by no more than ½ frame, 1/50 or 1/60 of a second. The error may either increase or decrease slowly over time if the camera speeds are very slightly different -- but I have yet to have to readjust timing in clips up to 15 minutes long. If the timing of two cameras is off by 1/2 frame, you might choose the alignment so drift decreases the error as the take progresses.
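Checking the half-frame claim against the roughly 30-millisecond echo threshold (a sketch of my own, using the figures from the paragraph above):

```python
# Worst-case misalignment of frame-aligned clips vs. the echo threshold.

HAAS_THRESHOLD_MS = 1000.0 / 30   # about 33 ms; echoes shorter than this fuse

for fps in (25, 29.97):
    half_frame_ms = 1000.0 / fps / 2.0   # clips aligned to the nearest frame
    print(fps, round(half_frame_ms, 1), half_frame_ms < HAAS_THRESHOLD_MS)
    # 25 fps: 20.0 ms; 29.97 fps: ~16.7 ms -- both below the threshold
```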
Sometimes, synchronization requires conversion of the video format due to peculiarities of the camera, editing software, or both. The Aiptek digital video recorder with my helmetcamera.com camera records at the standard 29.97 frames per second, but Pinnacle Studio reports it as 25. Similarly, in Pinnacle Studio, an Insight POV HD helmet camera reports speeds slightly different from the 29.97 at which it records. Clips from these cameras do not stay in sync with others. Converting the files into another format solves the problem, and then clips will stay in sync. I use AVS4YOU software and convert to AVI with the Xvid/DivX MPEG-4 codec, which keeps the file sizes down -- though this codec uses keyframing, so I sometimes will have to realign the audio to the video.
There are two applications in which closer synchronization is important: a very small timing error can seriously disturb 3D imaging, and multi-channel (stereo or surround) sound.
Let's say that a pair of cameras recording 3D are panning from left to right. Then if the timing of the right-eye camera is slightly late, its image will be displaced to the left and objects will appear closer -- and so on, with the opposite delay and/or direction of panning. If the cameras are panning up or down, the images will become displaced vertically, resulting in eyestrain and/or failure of the images to fuse. The problem also occurs with objects in motion in the picture.
Timing must be very precise -- to a small fraction of a millisecond -- to avoid these problems. As a practical matter, the same timing source must initiate the scanning of each frame in both cameras. Tiny differences in the cameras' clock rate, affecting the rate of the scans once underway, are, however, probably unimportant.
GoPro offers a 3D system with a case for two cameras and a wired connection between them; however, this is expensive. I'm hoping that a manufacturer of less expensive cameras might also offer a 3D system. Some action cameras are small and light compared with the GoPro, making a 3D rig less cumbersome and allowing closer spacing between the cameras for close-up shots.
Now let's consider multi-channel audio. If you are feeding the right and left channels from each recorder to the right and left channels of a stereo mix, then each stereo pair keeps its timing. You need to adjust levels to select which stereo pair you will use at a particular time. Even with nicely-synchronized video, though, two pairs of stereo microphones recording at the same time can create odd stereo perspectives or echoes if one pair of microphones is much farther from the sound source than the other. If the sound source visible in the image is distant, you may deliberately advance the audio so it appears in sync with the video. That makes sense, for example, for a telephoto shot of a basketball being dribbled. If realism is important, though, beware! In many war documentaries, a mortar shell or rocket lands in the distance, and the dubbed sound of the explosion occurs at the same instant as the flash. It Just Doesn't Happen That Way!!!
Surround sound from a stereo microphone on each of two bicycles one behind the other can be very compelling. A front/rear bicycle spacing of 20 feet or so will result in a good surround image in a typical listening room. Shadowing the microphones with the riders' bodies -- front microphone ahead, rear microphone behind -- will increase the front/rear separation. You then of course need to take extra trouble to screen the front microphones from the wind. Use manual level control when using multiple recorders, and adjust levels later. Different automatic level settings between channels of a recording will cause shifts in the auditory image.
Frame-by-frame timing is not accurate enough to establish a stereo or surround-sound image, and slight speed differences between recorders also become a problem. The simplest solution is to use a single, multitrack recorder. You might use a wireless microphone on one of the bicycles, connected to a multitrack recorder on the other, but then signal dropouts and interference could be a problem. Running multiple recorders from the same clock source also is technically correct, and technically possible if they are on the same bicycle and can be wired together -- for example, with cameras recording 3D -- though I'm not sure whether the GoPro rig synchronizes the clock, or only the frame scans.
It's more cumbersome, but possible, to export soundtracks to an audio editing application to adjust their speed. The free audio editing application Audacity -- available for Windows, the Macintosh and Linux -- can do this. The examples I'll give here use Audacity, but other audio applications also can do this. You will want a hand clap or other similar cue at both the beginning and the end of a take the first time you adjust speeds to match one another. The hand clap should be at the same distances from the microphones both times, to avoid timing errors.
The Audacity screen shot at the right is from near the end of an 8-minute-long clip. A hand clap near the beginning has already been aligned so that it falls at the same time in both stereo tracks. Alignment is easiest if you delete audio up to the exact time of the hand clap in each track. Find your hand clap (or another transient sound) near the end of the tracks, and select the time span between its occurrences, as in the selection shown in the image. The numerical readout at the bottom of the image gives the time at the start and end. It is easiest to set the time display to read in samples, avoiding the need to convert from minutes and seconds to a single number. Some simple math tells us the ratio by which the speed of the lower track must be adjusted:
speed ratio = (Selection End - Hand Clap Time) / (Selection Start - Hand Clap Time)

or equivalently, since Selection End = Selection Start + Length:

speed ratio = (Selection Start + Length - Hand Clap Time) / (Selection Start - Hand Clap Time)
Adjust the speed using the Change Speed effect in Audacity. Save the result from your computer's calculator applet for later use (a text file is fine for this purpose) and also paste the result into the "Percent Change" field in Audacity's Change Speed dialog box -- but then scroll left to make sure you have the correct number, because the dialog box will at first only display the trailing digits. Readjust the sync if needed and trim the start of each track to the same time, so they will align identically to frames of video.
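Here is a sketch of the speed-correction math, working in sample counts as Audacity displays them. The function names and example sample numbers are mine, invented for illustration; note that Audacity's "Percent Change" field expects a relative percentage (zero means no change), so the ratio must be converted before pasting:

```python
def speed_ratio(clap_start, sel_start, sel_end):
    """Ratio between the two tracks' speeds.

    clap_start: sample position of the beginning hand clap (already
                aligned in both tracks).
    sel_start:  position of the ending hand clap in the reference track.
    sel_end:    position of the ending hand clap in the track to correct.
    """
    return (sel_end - clap_start) / (sel_start - clap_start)

def percent_change(ratio):
    """Convert the ratio to the relative percentage the dialog expects."""
    return (ratio - 1.0) * 100.0

# Example: beginning clap at sample 48,000; ending claps 960 samples
# apart after about 8 minutes at 44.1 kHz.
r = speed_ratio(48_000, 21_168_000, 21_168_960)
print(round(percent_change(r), 5))  # ~0.00455 percent
```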
Though Audacity can display multiple tracks, it can only output one pair of stereo channels at a time. Export each track separately as a .WAV file, muting the other track when exporting. Alignment of the front and rear recordings to within a millisecond or two is easy to achieve -- only a fraction of the delay due to the speed of sound from one bicycle to the other.
Import the tracks into your video editor, aligning their start to the same frame. Sounds which the front loudspeakers produce first will be heard at the front, and vice versa. As already described, no echo will be audible if all the speakers produce the same sound within about 30 milliseconds of each other.
Once you have determined the drift between two cameras, you probably can apply the same correction to additional clips from the same cameras, so you then need only a hand clap at the start of each clip.
This procedure will work with digital recordings. Analog recorders' speed stability is not good enough, though their accuracy generally suffices to align an analog audio track to video.
You could also record surround sound from one bicycle, placing microphones at its forward and rearward extremes. This way, you can use a single recorder and avoid the timing and level-control issues. Many camcorders can in fact record 4-channel sound.
Last Updated: by John Allen