Cleaning 21 years of messy music data
When I built the Matinee Idle explorer, the data behind it came from the RNZ website: 15,411 song plays, typed into a web form by human hands over 21 years. And when a person types band names thousands of times across two decades, things drift. Different presenter on shift, a stray autocorrect, a tired Tuesday afternoon. The archive is a fossil record of human inconsistency, and after the explorer went live I went back to tidy it up.
There were two problems.
The same band, spelled five ways
When one artist appears under several spellings, their play count gets split across all of them, so nobody's count is right. Dusty Springfield was credited four different ways: the correct spelling, plus Springfiled, Springfeild and Spingfield. Eilert Pilarm, an obscure Swedish Elvis impersonator, managed three spellings despite barely anyone knowing he exists.
The undisputed champion is Dave Dee, Dozy, Beaky, Mick & Tich, which is surely the worst band name ever committed to a CD sleeve, and the data agrees. A name with six words and a fistful of punctuation, it appeared five different ways. Nobody could agree on Dozy versus Dozey, or Beaky versus Beeky, or where the commas went. If you set out to design a band name that humans would be physically incapable of typing consistently, you would land on this one.
To find the variants I ran an edit-distance pass over all 5,571 unique artist names, flagging any pair that was only one or two single-character edits (an insertion, a deletion or a substitution) apart. That surfaced 178 near-duplicate pairs. I reviewed each by hand and applied 134 corrections, throwing out the coincidental ones: The Beatles and The Eagles are only two edits apart, but they are not, it turns out, the same band.
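For the curious, here is a minimal sketch of what an edit-distance pass like that could look like in Python. It is not the script behind the explorer: the function names `edit_distance` and `near_duplicates`, the lowercasing and the length shortcut are all assumptions made for illustration, and the distance used is plain Levenshtein distance.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def near_duplicates(names, max_distance=2):
    """Return (name, other, distance) for every pair within max_distance."""
    unique = sorted(set(names))
    pairs = []
    for left, right in combinations(unique, 2):
        # Names whose lengths differ by more than the threshold
        # cannot be within it, so skip the expensive comparison.
        if abs(len(left) - len(right)) > max_distance:
            continue
        # Assumption: case differences should not count as edits.
        distance = edit_distance(left.lower(), right.lower())
        if distance <= max_distance:
            pairs.append((left, right, distance))
    return pairs


sample = [
    "Dusty Springfield",
    "Dusty Springfiled",
    "The Beatles",
    "The Eagles",
    "Eilert Pilarm",
]
for left, right, distance in near_duplicates(sample):
    print(f"{distance}: {left!r} vs {right!r}")
```

On this sample it flags both the Springfield typo and the Beatles/Eagles pair, which is exactly why every flagged pair still needs a human to look at it. Comparing every name against every other is quadratic, but a few thousand names is small enough for that not to matter much.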
Six ways to write "and"
The second problem was collaborations. A song credited to two artists was invisible to each of them on their own. "Serge Gainsbourg and Brigitte Bardot", "William Shatner & Joe Jackson", "Iggy Pop with Deborah Harry": none of these showed up under the individual names. And the archive had at least six ways of writing a collaboration, from an ampersand to "feat." to a plain comma.
Splitting these is harder than it looks, because the word "and" doesn't always mean a collaboration. It's a separator in "Serge Gainsbourg and Brigitte Bardot", but it's part of the name in "Kid Creole and the Coconuts" and "Me First and the Gimme Gimmes". So I only split a credit if the second artist passes one of two tests: they appear somewhere else in the archive as a standalone act, or they turn up alongside three or more different primary artists. That correctly frees Brigitte Bardot while leaving the Coconuts attached to Kid Creole, where they belong.
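The two tests can be sketched in a few lines of Python. This is an illustration under assumptions, not the code that produced the cleaned data: the separator pattern covers only some of the ways a collaboration can be written (commas are left out on purpose), each credit is split at its first separator only, and the names `SEPARATOR`, `split_credits` and `min_partners` are hypothetical.

```python
import re
from collections import defaultdict

# Assumed separators; the real archive has more variants than this.
SEPARATOR = re.compile(r"\s+(?:and|&|with|feat\.?|featuring)\s+", re.IGNORECASE)


def split_credits(credits, min_partners=3):
    """Map each credit to the list of artists it should count towards."""
    # Credits with no separator at all are standalone acts.
    standalone = {c.casefold() for c in credits if not SEPARATOR.search(c)}

    # For each would-be second artist, collect the distinct primaries
    # they have been credited alongside.
    partners = defaultdict(set)
    for credit in credits:
        parts = SEPARATOR.split(credit, maxsplit=1)
        if len(parts) == 2:
            primary, secondary = parts
            partners[secondary.casefold()].add(primary.casefold())

    result = {}
    for credit in credits:
        parts = SEPARATOR.split(credit, maxsplit=1)
        if len(parts) == 2:
            key = parts[1].casefold()
            # Test 1: the second artist exists on their own elsewhere.
            # Test 2: they appear with enough different primary artists.
            if key in standalone or len(partners[key]) >= min_partners:
                result[credit] = parts
                continue
        result[credit] = [credit]
    return result


archive = [
    "Serge Gainsbourg and Brigitte Bardot",
    "Brigitte Bardot",
    "Kid Creole and the Coconuts",
]
for credit, artists in split_credits(archive).items():
    print(credit, "->", artists)
```

In this toy archive Brigitte Bardot has a standalone entry, so the Gainsbourg credit is split in two, while "the Coconuts" never appear alone or with anyone else and so stay attached to Kid Creole.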
The fix that started all this, fittingly, was a duo: "Bonnie and Clyde" was showing up as two separate songs because one entry used "and" and another used "&". Now, for the first time, you can click Brigitte Bardot in the explorer and see every song she's on, including the ones credited to the pair.
None of this changes the spirit of the show, but it does mean the numbers are slightly closer to what was played on air. The moderately cleaned data is live in the explorer now.
The Matinee Idle archive explorer is at matineeidle.radomski.co.nz. The original write-up, with the most-played songs and artists, is over here.