Balancing bias and performance in polyphonic piano transcription systems
Current state-of-the-art methods for polyphonic piano transcription tend to use high-capacity neural networks. Most models are trained "end-to-end" and learn a mapping from audio input to pitch labels. They require large training corpora consisting of many audio recordings of different piano models and temporally aligned pitch labels. It has been shown in previous work that neural network-based systems struggle to generalize to unseen note combinations, as they tend to learn note combinations by heart. Semi-supervised linear matrix decomposition is a frequently used alternative approach to piano transcription, one that does not have this particular drawback. The disadvantages of linear methods start to show when they encounter recordings of pieces played on unseen pianos, a scenario where neural networks seem relatively untroubled. A recently proposed approach called "Differentiable Dictionary Search" (DDS) combines the modeling capacity of deep density models with the linear mixing model of matrix decomposition in order to balance the mutual advantages and disadvantages of the standalone approaches, making it better suited to model unseen sources, while generalization to unseen note combinations should be unaffected, because the mixing model is not learned, and thus cannot acquire a corpus bias. In its initially proposed form, however, DDS is too inefficient in utilizing computational resources to be applied to piano music transcription. To reduce computational demands and memory requirements, we propose a number of modifications. These adjustments finally enable a fair comparison of our modified DDS variant with a semi-supervised matrix decomposition baseline, as well as a state-of-the-art, deep neural network-based system that is trained end-to-end. In systematic experiments with both musical and "unmusical" piano recordings (real musical pieces and unusual chords), we provide quantitative and qualitative analyses at the frame level, characterizing the behavior of the modified approach, along with a comparison to several related methods. The results generally show the fundamental promise of the model, and in particular demonstrate improvement in situations where a corpus bias incurred by learning from musical material of a specific genre would be problematic.
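
As an illustration of the linear mixing model that semi-supervised matrix decomposition relies on, the following is a minimal sketch and not the DDS method or the baseline evaluated in this work. It assumes a fixed, non-negative dictionary `W` (one spectral template per note) and estimates non-negative activations `H` such that `V ≈ W @ H`, using the standard multiplicative update for the Euclidean cost. The function name `estimate_activations`, the toy dimensions, and the random data are all hypothetical and chosen only for illustration.

```python
import numpy as np


def estimate_activations(V, W, n_iter=200, eps=1e-12):
    """Estimate non-negative activations H with V ≈ W @ H, keeping W fixed.

    V: (n_bins, n_frames) non-negative magnitude spectrogram.
    W: (n_bins, n_notes) non-negative dictionary of note templates.
    """
    rng = np.random.default_rng(0)
    # Strictly positive initialization, so every entry can still be updated.
    H = rng.random((W.shape[1], V.shape[1])) + eps
    WtV = W.T @ V
    WtW = W.T @ W
    for _ in range(n_iter):
        # Multiplicative update for the Euclidean cost; H stays non-negative.
        H *= WtV / (WtW @ H + eps)
    return H


# Toy data (assumption): random templates and sparse random activations.
rng = np.random.default_rng(42)
n_bins, n_notes, n_frames = 64, 8, 20
W = rng.random((n_bins, n_notes))
mask = rng.random((n_notes, n_frames)) < 0.3
H_true = rng.random((n_notes, n_frames)) * mask
V = W @ H_true  # the mixture is a linear combination of the templates

H_est = estimate_activations(V, W)
rel_error = np.linalg.norm(V - W @ H_est) / np.linalg.norm(V)
print(f"relative reconstruction error: {rel_error:.4f}")
```

Because only the activations are estimated and the mixing itself is a fixed linear operation, nothing in this sketch depends on which note combinations occur in a training corpus. That is the property the abstract attributes to matrix decomposition methods.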