Modern computer vision programs can track fingers easily in the 2D plane, but struggle to track depth. This is a problem because the depth of a finger drastically changes which chord is actually being played.
When a computer vision model tracks a finger as overlapping a string, there are three possibilities: the finger is pressing the string down at that fret, the finger is resting on the string and muting it, or the finger is hovering above the string, leaving it ringing open.
Since computer vision algorithms currently struggle to distinguish between these three possibilities, we present an approach where the audio of the chord is analyzed to determine the actual chord being played.
This notebook takes two inputs. The first is the hypothetical output of a computer vision algorithm, in the form [n, n, n, n, n, n], where each n can be X to represent a muted string, 0 to represent an open string, or any integer 0 < n < 20 to represent a fretted note.
A list of candidate chords is then created from this input: every fretted value (0 < n < 20) is also tried as 0 (open) and X (muted), and all resulting combinations are enumerated.
The second input, in the same [n, n, n, n, n, n] form, represents the "actual" chord being played. Audio of this chord is synthesized using the Karplus-Strong algorithm, then matched against the synthesized audio of each chord in the candidate list to identify the most likely candidate.
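A minimal Karplus-Strong chord synthesizer looks roughly like this. It is a sketch under stated assumptions: SR, the open-string frequencies, the 0.996 decay factor, and the simple normalization are choices made here for illustration, and the notebook's synth_chord_array may differ.

```python
import numpy as np

SR = 44100  # assumed sample rate

# Standard-tuning open-string frequencies, low E to high e (Hz)
OPEN_HZ = [82.41, 110.00, 146.83, 196.00, 246.94, 329.63]

def karplus_strong(freq, dur=2.2, sr=SR):
    """Minimal Karplus-Strong pluck: a noise burst fed through a delay
    line with a two-sample averaging lowpass in the feedback loop."""
    n_samples = int(dur * sr)
    delay = int(sr / freq)                 # delay length sets the pitch
    buf = np.random.uniform(-1, 1, delay)  # initial noise excitation
    out = np.empty(n_samples)
    for i in range(n_samples):
        out[i] = buf[i % delay]
        buf[i % delay] = 0.996 * 0.5 * (buf[i % delay] + buf[(i + 1) % delay])
    return out

def synth_chord_array(frets, dur=2.2, sr=SR):
    """Sum one Karplus-Strong pluck per sounding string; 'X' strings are silent."""
    mix = np.zeros(int(dur * sr))
    for open_hz, fret in zip(OPEN_HZ, frets):
        if fret == 'X':
            continue
        mix += karplus_strong(open_hz * 2 ** (fret / 12), dur, sr)
    return mix / max(1, sum(f != 'X' for f in frets))  # crude normalization
```

Each fret number is converted to a frequency by shifting the open-string pitch up n semitones (a factor of 2^(n/12)), so the same routine covers open and fretted strings.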
To identify which candidate matches the played sound, we perform a Fast Fourier Transform (FFT) on each synthesized audio clip. The FFT converts each time-domain waveform into the frequency domain, revealing the harmonic spectrum of the chord. Each chord has a unique combination of dominant frequencies corresponding to its notes.
By comparing the FFT spectra of the actual chord and each candidate chord, the algorithm identifies which candidate’s frequency profile most closely matches the real sound.
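One simple way to score that comparison, shown here as a hedged sketch rather than the notebook's exact metric, is cosine similarity between magnitude spectra: the candidate whose spectrum points in the same direction as the real sound's spectrum wins. The synth argument stands in for whatever chord synthesizer is in use.

```python
import numpy as np

def spectral_similarity(a, b):
    """Cosine similarity between the magnitude spectra of two equal-length clips."""
    A = np.abs(np.fft.rfft(a))
    B = np.abs(np.fft.rfft(b))
    return float(np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B) + 1e-12))

def best_candidate(test_audio, candidates, synth, **kw):
    """Synthesize each candidate fingering and return the closest spectral match."""
    scores = [spectral_similarity(test_audio, synth(c, **kw)) for c in candidates]
    return candidates[int(np.argmax(scores))]
```

Magnitude spectra are used (rather than the complex FFT) so that small timing offsets between the real and synthesized clips do not penalize a correct candidate.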
All inputs are defined in Cell 19.
Modify each n in test_audio = synth_chord_array([n, n, n, n, n, n], dur=2.2, sr=SR) to set the test chord, i.e. the chord that is actually being heard.
Modify each n in candidates = enumerate_possible_chords([n, n, n, n, n, n]) to set the input chord, i.e. what the CV algorithm detects.
The default setup is an E minor chord being played, but a finger is also detected overlapping the first fret of the G string. This is a good real-world example of the applicability of this approach because:
See cell 2 for an explanation of the default setup.