
Cleaning 21 years of messy music data

Patrick Radomski · 6 min read

When I built the Matinee Idle explorer, the data behind it came from the RNZ website: 15,411 song plays, typed into a web form by human hands over 21 years. And when a person types band names thousands of times across two decades, things drift. Different presenter on shift, a stray autocorrect, a tired Tuesday afternoon. The archive is a fossil record of human inconsistency, and after the explorer went live I went back to tidy it up.

There were two problems.

The same band, spelled five ways

When one artist appears under several spellings, their play count gets split across all of them, so no single entry shows the true total. Dusty Springfield was credited four different ways (the misspellings included Springfiled, Springfeild and Spingfield). Eilert Pilarm, an obscure Swedish Elvis impersonator, managed three spellings despite barely anyone knowing he exists.

The undisputed champion is Dave Dee, Dozy, Beaky, Mick & Tich, which is surely the worst band name ever committed to a CD sleeve, and the data agrees. A name with six words and a fistful of punctuation, it appeared five different ways. Nobody could agree on Dozy versus Dozey, or Beaky versus Beeky, or where the commas went. If you set out to design a band name that humans would be physically incapable of typing consistently, you would land on this one.

To find the variants I ran an edit-distance pass over all 5,571 unique artist names, flagging any pair that was only one or two characters apart. That surfaced 178 near-duplicate pairs. I reviewed each by hand and applied 134 corrections, throwing out the coincidental ones: The Beatles and The Eagles are a character or two apart, but they are not, it turns out, the same band.
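For the curious, here is a minimal sketch of what an edit-distance pass like this could look like in Python. It is not the code behind the explorer: the function names, the lowercasing step and the brute-force pairwise comparison are all assumptions made for illustration, and only the "one or two characters apart" threshold comes from the description above.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with the usual two-row table."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def near_duplicates(names, max_distance=2):
    """Return (name_a, name_b, distance) for every pair within max_distance.

    Assumption: names are compared case-insensitively. Every pair is
    checked, which is slow but workable for a few thousand names.
    """
    pairs = []
    for a, b in combinations(sorted(set(names)), 2):
        # The distance is never smaller than the difference in length.
        if abs(len(a) - len(b)) > max_distance:
            continue
        distance = edit_distance(a.lower(), b.lower())
        if distance <= max_distance:
            pairs.append((a, b, distance))
    return pairs


if __name__ == "__main__":
    sample = ["Dusty Springfield", "Dusty Spingfield", "The Beatles", "The Eagles"]
    for a, b, distance in near_duplicates(sample):
        print(f"{distance}: {a!r} vs {b!r}")
```

A pass like this flags the Beatles and the Eagles just as readily as a genuine typo, which is why the flagged pairs still need a human to look at them.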

[Chart: Artists with the most spelling variants. Number of distinct spellings found in the raw archive for a single artist.]

Six ways to write "and"

The second problem was collaborations. A song credited to two artists was invisible to each of them on their own. "Serge Gainsbourg and Brigitte Bardot", "William Shatner & Joe Jackson", "Iggy Pop with Deborah Harry": none of these showed up under the individual names. And the archive had at least six ways of writing a collaboration, from an ampersand to "feat." to a plain comma.

[Chart: How collaborations were credited. Collaboration credits in the archive by separator type. About 1 in 8 songs is a collaboration.]

Splitting these is harder than it looks, because the word "and" doesn't always mean a collaboration. It's a separator in "Serge Gainsbourg and Brigitte Bardot", but it's part of the name in "Kid Creole and the Coconuts" and "Me First and the Gimme Gimmes". So I only split a credit if the second artist passes one of two tests: they appear somewhere else in the archive as a standalone act, or they turn up alongside three or more different primary artists. That correctly frees Brigitte Bardot while leaving the Coconuts attached to Kid Creole, where they belong.
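The two tests can be sketched roughly as follows. This is an illustration under stated assumptions rather than the explorer's actual code: the separator list is a guess covering only some of the forms mentioned above, the helper names (`split_once`, `build_splitter`) are hypothetical, and the sketch only ever splits a credit at its first separator.

```python
import re
from collections import defaultdict

# Assumed separators: "and", "&", "with", "feat." and a plain comma.
SEPARATOR = re.compile(r"\s+(?:and|&|with|feat\.)\s+|,\s+", re.IGNORECASE)


def split_once(credit):
    """Split a credit at its first separator, or return None if it has none."""
    parts = SEPARATOR.split(credit.strip(), maxsplit=1)
    if len(parts) != 2:
        return None
    return parts[0].strip(), parts[1].strip()


def build_splitter(credits, min_partners=3):
    """Build a split function from every credit string in the archive."""
    # Test 1 data: every credit exactly as it appears on its own.
    standalone = {credit.strip().lower() for credit in credits}
    # Test 2 data: for each second artist, the primary artists seen before them.
    partners = defaultdict(set)
    for credit in credits:
        pair = split_once(credit)
        if pair:
            primary, secondary = pair
            partners[secondary.lower()].add(primary.lower())

    def split(credit):
        pair = split_once(credit)
        if pair is None:
            return [credit.strip()]
        primary, secondary = pair
        key = secondary.lower()
        is_standalone = key in standalone
        has_many_partners = len(partners.get(key, ())) >= min_partners
        if is_standalone or has_many_partners:
            return [primary, secondary]
        # Otherwise treat the whole string as one act's name.
        return [credit.strip()]

    return split


if __name__ == "__main__":
    # Made-up mini archive, for illustration only.
    archive = [
        "Serge Gainsbourg and Brigitte Bardot",
        "Brigitte Bardot",
        "Kid Creole and the Coconuts",
    ]
    split = build_splitter(archive)
    for credit in archive:
        print(credit, "->", split(credit))
```

In this toy archive "Brigitte Bardot" appears as a credit of her own, so the pair is split, while "the Coconuts" never appears alone and has only one partner, so Kid Creole keeps them.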

The fix that started all this, fittingly, was a duo: "Bonnie and Clyde" was showing up as two separate songs because one entry used "and" and another used "&". Now, for the first time, you can click Brigitte Bardot in the explorer and see every song she's on, including the ones credited to the pair.

87 artists whose name was scattered across multiple spellings
134 corrections applied in a single automated pass
1 in 8 songs in the archive is a collaboration credit

None of this changes the spirit of the show, but it does mean the numbers are slightly closer to what was played on air. The moderately cleaned data is live in the explorer now.


The Matinee Idle archive explorer is at matineeidle.radomski.co.nz. The original write-up, with the most-played songs and artists, is over here.
