Data Retrieval System | tiogiras.games

During the first semester of my master's degree, we had a brief excursion into data management and retrieval. To complete the course, our group of three was tasked with creating software of our choosing. The only restriction was that we had to use OpenSearch as our data management software.

The Idea

The goal of the project was not to create innovative new software but to gain experience in retrieving and managing large quantities of data. We therefore decided to crawl and store music data. Our core goals for the software were, first, to crawl songs from multiple sources; second, to combine the crawled songs into a unified format; and last, to display the songs in a web application and, most importantly, to be able to play them there.

Crawling the Data

My main responsibility in the project was to crawl and unify music data from different external sources. We chose the Spotify API and the MusicBrainz API as our sources. We used both APIs in the same way: the crawler takes the name of an artist and then crawls all related songs and albums.

Gathering data from the Spotify API was rather easy. All it took was a careful read of the documentation and connecting a paid account to the project. Spotify already includes all relevant information in its API responses, and thanks to our project account we ran into almost no API restrictions.

Working with the MusicBrainz API, on the other hand, required significantly more effort. First of all, we encountered harsh restrictions because the newly created project account did not work correctly. We managed to work around these restrictions by caching responses and splitting our requests into multiple chunks, but this greatly increased loading times. Furthermore, MusicBrainz did not make it easy to include all relevant information in our batched requests, so we had to send even more single requests, which increased loading times even further.
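To make the caching and chunking idea more concrete, here is a minimal Python sketch. It is not the project's actual code: `CachedBatchFetcher`, `fetch_batch`, the chunk size, and the one-second pause are all hypothetical placeholders, and the real request logic would live inside whatever function is passed in as `fetch_batch`.

```python
import time
from typing import Callable, Dict, List


def chunked(items: List[str], size: int) -> List[List[str]]:
    """Split items into consecutive chunks of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


class CachedBatchFetcher:
    """Fetch records in small chunks and remember every answer.

    `fetch_batch` is a hypothetical callable that takes a list of IDs
    and returns a dict mapping each ID to its record. Chunk size and
    delay are assumed values, not documented API limits.
    """

    def __init__(
        self,
        fetch_batch: Callable[[List[str]], Dict[str, dict]],
        chunk_size: int = 25,
        delay_seconds: float = 1.0,
    ) -> None:
        self._fetch_batch = fetch_batch
        self._chunk_size = chunk_size
        self._delay_seconds = delay_seconds
        self._cache: Dict[str, dict] = {}

    def get(self, ids: List[str]) -> Dict[str, dict]:
        # Only ask the API for IDs we have not seen before (deduplicated,
        # original order preserved).
        missing = [i for i in dict.fromkeys(ids) if i not in self._cache]
        for n, chunk in enumerate(chunked(missing, self._chunk_size)):
            if n > 0:
                # Pause between chunks to stay under the rate limit.
                time.sleep(self._delay_seconds)
            self._cache.update(self._fetch_batch(chunk))
        return {i: self._cache[i] for i in ids if i in self._cache}
```

The trade-off described above shows up directly in this sketch: every additional chunk adds one pause, so the first crawl of a large artist is slow, while repeated lookups are answered from the cache without any request.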

After receiving data from both APIs, we had to unify the crawled data. To do so, we first combined duplicate entries from both APIs into a single unified object. In terms of IDs and data correctness, we prioritized Spotify over MusicBrainz, because another API used later in the project only works with Spotify IDs. If the MusicBrainz data included songs that were missing from the Spotify data, we saved the MusicBrainz entry to the database. If Spotify data for such a song is crawled at a later point, it automatically overwrites the existing MusicBrainz data.
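The priority rule can be sketched in a few lines of Python. This is an illustration under assumptions, not the project's implementation: the field names, the `upsert` helper, the in-memory dict used as a stand-in for the database, and matching songs by artist and title are all hypothetical.

```python
from typing import Dict, Tuple

SongKey = Tuple[str, str]


def song_key(song: dict) -> SongKey:
    """Match duplicates by artist and title, ignoring case and outer spaces.

    This matching rule is an assumption made for the sketch.
    """
    return (song["artist"].strip().casefold(), song["title"].strip().casefold())


def upsert(store: Dict[SongKey, dict], song: dict, source: str) -> None:
    """Insert or merge a crawled song, letting Spotify win every conflict."""
    key = song_key(song)
    incoming = {**song, "source": source}
    existing = store.get(key)
    if existing is None:
        # First time we see this song: store it regardless of source.
        store[key] = incoming
    elif source == "spotify":
        # Spotify values (including the ID) overwrite what is stored;
        # fields that only MusicBrainz provided are kept.
        store[key] = {**existing, **incoming}
    else:
        # MusicBrainz only fills gaps and never overrides stored values.
        store[key] = {**incoming, **existing}
```

Because the merge is applied on every write, the order of crawling does not matter: a MusicBrainz entry saved first is replaced as soon as the Spotify version arrives, and a MusicBrainz entry arriving later cannot displace Spotify data.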

Making it pretty

Up to this point, the crawled data is stored in OpenSearch, which runs inside a Docker container. All song metadata that is not relevant for indexing or searching is kept in a local database. To make the data accessible to the user, we need to recombine the two parts.

With the frontend we built, the user can search by song, artist, album, release date, … and the most relevant results are displayed. The search criteria are passed to OpenSearch, which returns the most suitable hits. Using the IDs of the returned songs, we then look up all further information about each song in the local database. This includes embeds such as images and the playable song iframe. All of this information is then shown to the user in the frontend.
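The two-step lookup could look roughly like the following Python sketch. Several things here are assumptions for illustration only: SQLite as the local database, the `song_details` table and its columns, and the searched field names `title`, `artist`, and `album`. The response shape with `hits.hits`, `_id`, and `_source` is the standard OpenSearch search response.

```python
import sqlite3
from typing import List


def build_search_body(text: str, size: int = 10) -> dict:
    """Build a multi_match query over the (assumed) indexed fields."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": text,
                "fields": ["title", "artist", "album"],
            }
        },
    }


def enrich_hits(response: dict, db: sqlite3.Connection) -> List[dict]:
    """Join OpenSearch hits with the extra details from the local database."""
    results = []
    for hit in response["hits"]["hits"]:
        # The document ID returned by OpenSearch is the lookup key.
        row = db.execute(
            "SELECT image_url, embed_url FROM song_details WHERE song_id = ?",
            (hit["_id"],),
        ).fetchone()
        details = {"image_url": row[0], "embed_url": row[1]} if row else {}
        # Keep the order OpenSearch returned, which is by relevance.
        results.append({"id": hit["_id"], **hit["_source"], **details})
    return results
```

With the opensearch-py client, the query body would be passed along the lines of `client.search(index="songs", body=build_search_body("queen"))`, where the index name `songs` is again hypothetical, and the returned dictionary would go straight into `enrich_hits`.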

Have a look

If you would like to have a look at the project code yourself, feel free to head over to GitHub and explore the repository.