Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025)
Original abstract:
In modeling musical surprisal with computational methods, it has been proposed to use the information content (IC) of one-step predictions from an autoregressive model as a proxy for surprisal in symbolic music. With an appropriately chosen model, the IC of musical events has been shown to correlate with human perception of surprise and with aspects of complexity, including tonal and rhythmic complexity.
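For reference, the IC in question is the standard information-theoretic quantity: the negative log-probability the model assigns to an event given its context. In bits (the notation with model parameters θ is ours, not the paper's):

```latex
% Information content of event x_t given its preceding context x_{<t},
% under the model's one-step predictive distribution:
\mathrm{IC}(x_t \mid x_{<t}) = -\log_2 p_\theta(x_t \mid x_{<t})
```

Under this definition, events the model finds improbable in context receive high IC, which is what motivates using it as a surprisal proxy.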
This work investigates whether an analogous methodology can be applied to music audio. We train an autoregressive Transformer model to predict compressed latent audio representations of a pretrained autoencoder network.
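A minimal sketch of how such one-step IC estimates can be obtained is given below. Everything here is an assumption for illustration: `TinyTransformerLM` is a placeholder toy model, the latents are treated as discrete codes, and the paper's actual architecture, tokenization, and training setup are not specified here.

```python
# Minimal sketch: per-step information content (IC) of latent audio
# tokens under a toy autoregressive Transformer (placeholder model,
# not the paper's implementation).
import math
import torch
import torch.nn as nn

class TinyTransformerLM(nn.Module):
    def __init__(self, vocab_size=1024, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):  # tokens: (batch, time) integer latent codes
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.shape[1])
        h = self.encoder(self.embed(tokens), mask=causal)
        return self.head(h)  # (batch, time, vocab) next-token logits

@torch.no_grad()
def information_content(model, tokens):
    """IC(x_t) = -log2 p(x_t | x_<t) for t = 1..T-1, in bits."""
    logits = model(tokens[:, :-1])             # predictions from contexts x_<t
    logp = torch.log_softmax(logits, dim=-1)
    targets = tokens[:, 1:].unsqueeze(-1)      # actual next tokens x_t
    return -logp.gather(-1, targets).squeeze(-1) / math.log(2)

# Example (random tokens stand in for autoencoder latents):
# ic = information_content(TinyTransformerLM(), torch.randint(0, 1024, (1, 128)))
```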
We verify learning effects by estimating the decrease in IC with repetitions. We investigate the mean IC of musical segment types (e.g., A or B) and find that, on average, segment types appearing later in a piece have a higher IC than earlier ones. We investigate the IC's relation to audio and musical features and find it correlated with timbral variations and loudness and, to a lesser extent, with dissonance, rhythmic complexity, and onset density.
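Both the segment-level averaging and the feature correlations admit a simple reading, sketched below under assumptions of ours: per-frame segment labels are available, the IC curve has been resampled to the feature frame rate, and librosa's RMS energy and onset-strength envelope stand in for loudness and onset density.

```python
# Assumed analysis sketch (not the authors' exact pipeline): mean IC per
# segment type, and Pearson correlations of framewise IC with features.
import librosa
import numpy as np
from scipy.stats import pearsonr

def mean_ic_per_segment(ic, labels):
    """Average IC within each annotated segment type (e.g., 'A', 'B')."""
    ic, labels = np.asarray(ic), np.asarray(labels)
    return {seg: float(ic[labels == seg].mean()) for seg in np.unique(labels)}

def feature_correlations(ic, y, sr=44100, hop_length=512):
    """Correlate IC with RMS loudness and onset strength, the latter a
    rough stand-in for onset density."""
    rms = librosa.feature.rms(y=y, hop_length=hop_length)[0]
    onset = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop_length)
    n = min(len(ic), len(rms), len(onset))
    return {
        "loudness": pearsonr(ic[:n], rms[:n]),        # (r, p-value)
        "onset_density": pearsonr(ic[:n], onset[:n]),
    }
```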
Finally, we investigate whether the IC can predict EEG responses to songs and thus model humans' surprisal in music.
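One way such an EEG analysis could look, an assumption on our part in the spirit of temporal-response-function models rather than the paper's documented procedure, is to regress an EEG channel on time-lagged copies of the IC curve:

```python
# Speculative sketch: ridge regression of one EEG channel on time-lagged
# IC (a temporal-response-function-style analysis; assumed, not the
# authors' documented method).
import numpy as np
from sklearn.linear_model import Ridge

def fit_ic_to_eeg(ic, eeg, max_lag=32, alpha=1.0):
    # Design matrix of causal lags 0..max_lag-1 of the IC curve.
    X = np.stack([np.roll(ic, lag) for lag in range(max_lag)], axis=1)
    X[:max_lag, :] = 0.0  # zero out samples contaminated by wrap-around
    model = Ridge(alpha=alpha).fit(X, eeg)
    return model, model.score(X, eeg)  # fitted weights and R^2
```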
We provide code for our method at github.com/sonycslparis/audioic.