Recording
Use a quality microphone
Pick a quiet place with little background noise or disturbances
Have the "talent" speak clearly and slowly, slower than feels natural, with good enunciation, discernible breaks between words, and plenty of pauses. Have them speak slightly louder than usual ("project"), but not so much that it sounds unnatural.
Record a short test take before starting to ensure equipment is functional and audio quality is good
Have all text to be recorded printed out and numbered
Begin recording. Have them read each phrase in order with a short pause after each (~3 seconds). Have them read the number (in English, if possible) before each phrase, with a short pause (~1 second) between the number and the phrase. The numbers will aid greatly in identifying which phrase is which, especially if they were recorded in a language other than your own.
Try to record all the phrases in one take (one audio file). Don't use a separate file for each phrase. If the recording is interrupted by background noise or the speaker makes a mistake, let the recorder keep running and continue when possible, starting again with that same phrase.
Do several takes.
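The numbered printout described above can be generated from a plain text file with one phrase per line. This is a minimal sketch; the function name `number_phrases` and the file name `phrases.txt` are assumptions for illustration, not part of any existing tool.

```shell
#!/bin/sh
# number_phrases: prefix each non-empty line of a phrase list with its number.
# Reads the files named as arguments, or standard input if none are given.
number_phrases() {
    awk 'NF { n++; printf "%d. %s\n", n, $0 }' "$@"
}

# Example: two phrases in, two numbered lines out.
printf 'Good morning\nHow old is the child?\n' | number_phrases
```

With a real phrase list, `number_phrases phrases.txt > script.txt` would produce a file ready to print. Blank lines are skipped so they do not consume a number.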
(many of these next steps assume Linux tools)
Extracting and Splitting
Extract the recordings from the recording device
Convert the recordings into .wav format (most mp3 players have an option to create .wav files)
Open each .wav file in 'audacity' (music editing program)
You should be able to see the numbers and phrases clearly in the waveform. For each phrase, select the portion you want to extract as the audio clip (with a slight pause both before and after the speech). Skip attempts that are not usable. Once selected, note the start time and length of the selection shown at the bottom of the window. Write down these times along with the phrase number and the file you are looking at.
Extract out the salient excerpts. For each phrase:
sox [ recording file ] [ extract file ] trim [ start time ] [ length ]
for example:
sox ZOOM0003.wav phrase5_take3.wav trim 1:01.622 10.578
Put the phrase # first in the extract's file name, so they will group together.
Among all the takes for each phrase, choose the best one and discard the rest
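If the times noted in Audacity are typed into a text file, the sox commands can be generated rather than written by hand. The sketch below assumes a hypothetical "cut list" format (phrase number, take number, recording file, start time, length, separated by spaces) and a hypothetical function name `trim_commands`. It only prints the commands, so they can be reviewed before anything is run. File names containing spaces are not handled.

```shell
#!/bin/sh
# trim_commands: turn a cut list into one sox command per usable excerpt.
# Assumed cut-list columns (whitespace separated, hypothetical format):
#   phrase#  take#  recording-file  start-time  length
# Blank lines and lines starting with '#' are skipped.
trim_commands() {
    while read -r phrase take recording start length; do
        case "$phrase" in ''|'#'*) continue ;; esac
        printf 'sox %s phrase%s_take%s.wav trim %s %s\n' \
            "$recording" "$phrase" "$take" "$start" "$length"
    done
}

# Example using the excerpt from the text above.
printf '# phrase take file start length\n5 3 ZOOM0003.wav 1:01.622 10.578\n' |
    trim_commands
```

Assuming the cut list is saved as `cuts.txt`, run `trim_commands < cuts.txt` to inspect the commands, then `trim_commands < cuts.txt | sh` to execute them.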
Post-processing
The clips must be encoded as mp3. For speech, 64 kbit/s mono encoding should be adequate. If the audio contains other noises or music, 96 kbit/s mono could be considered. For very high quality applications (for example, when the CommCare user will be using headphones), use 128 kbit/s stereo. 64 kbit/s mono requires ~8 KB (8,000 bytes) per second of audio.
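To budget storage on the phone, the size of a clip can be estimated from its bitrate and length. The helper below is a rough sketch (the name `clip_size_bytes` is made up for this example). It ignores mp3 header and tag overhead and assumes the encoder holds the stated average bitrate. At 64 kbit/s it gives 8,000 bytes, roughly 7.8 KiB, per second.

```shell
#!/bin/sh
# clip_size_bytes: approximate size of a clip at a given average bitrate.
#   $1 = bitrate in kbit/s, $2 = duration in whole seconds
# 1 kbit = 1000 bits and 1 byte = 8 bits, so bytes = kbps * 1000 / 8 * seconds.
clip_size_bytes() {
    echo $(( $1 * 1000 / 8 * $2 ))
}

clip_size_bytes 64 1     # 8000 bytes for one second at 64 kbit/s
clip_size_bytes 64 11    # 88000 bytes for an 11-second clip
clip_size_bytes 128 11   # 176000 bytes for the same clip at 128 kbit/s
```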
To convert to 64kbps:
lame --abr 64 -mm --noreplaygain phrase5_take3.wav phrase5.mp3
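When there are many clips, the same lame command can be generated for every chosen take. This sketch assumes the `phraseN_takeM.wav` naming used in the example above and that only the best take of each phrase is left in the directory. The function name `encode_commands` is hypothetical. As written it prints the commands without running them and does not handle file names containing spaces.

```shell
#!/bin/sh
# encode_commands: print one lame command per .wav file named as an argument.
# The output name drops the "_takeM" suffix (phrase5_take3.wav -> phrase5.mp3),
# so two takes of the same phrase would map to the same output file.
encode_commands() {
    for wav in "$@"; do
        mp3="${wav%.wav}"         # strip the .wav extension
        mp3="${mp3%_take*}.mp3"   # strip a trailing _takeM, if present
        printf 'lame --abr 64 -mm --noreplaygain %s %s\n' "$wav" "$mp3"
    done
}

encode_commands phrase5_take3.wav phrase12_take1.wav
```

In a directory of extracted clips, `encode_commands *.wav` lists the commands for review and `encode_commands *.wav | sh` runs them.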
Now we need to make all the recordings approximately the same volume:
mp3gain -r -c -d 10 *.mp3 (assuming all the mp3s are in the current directory)
The -d 10 is a volume boost (here, 10 dB) applied to all files after they have been normalized to the same volume. It is needed because the default volume level tends to sound quiet on the phones. Tailor the amount of boost to your deployment and the devices you will use. Each 10 dB of boost approximately doubles perceived loudness.
Don't boost too much or clipping will occur (the strength of the signal is boosted beyond the maximum that the sound file can represent; the rest is 'clipped' off). Excessive clipping sounds harsh and severely degrades sound quality. You can view the amount of clipping in Audacity.
[Figure: ok clipping]
[Figure: bad clipping]
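The relationship between a dB boost and clipping can be worked out with a little arithmetic. The sketch below uses the standard formulas for amplitude gain, 10^(dB/20), and for headroom, -20*log10(peak). The function names are made up for this example, and the peak value is something you would read off yourself (for instance from the waveform in Audacity) after the clip has been normalized. It illustrates the arithmetic only and is not how mp3gain makes its decisions.

```shell
#!/bin/sh
# db_to_amplitude: amplitude multiplier for a gain in dB, i.e. 10^(dB/20).
db_to_amplitude() {
    awk -v db="$1" 'BEGIN { printf "%.2f\n", exp(db / 20 * log(10)) }'
}

# headroom_db: largest boost (in dB) a clip can take before its peak reaches
# full scale. $1 = peak amplitude as a fraction of full scale (0 < peak <= 1).
headroom_db() {
    awk -v peak="$1" 'BEGIN { printf "%.1f\n", -20 * log(peak) / log(10) }'
}

db_to_amplitude 10   # 3.16: a 10 dB boost multiplies sample values by about 3.16
headroom_db 0.5      # 6.0: a clip peaking at half of full scale has about 6 dB to spare
```

A clip whose peak already sits above roughly a third of full scale will therefore clip somewhere with a 10 dB boost; whether that amount of clipping is acceptable is best judged by listening on the target device.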