CouchDB & Media Metadata

I’ve been looking for a project to experiment with CouchDB for a while. I find the concept of NoSQL interesting, and the way CouchDB works seems like a sensible way to approach things (the fact that it’s open source and part of the ASF is also important). I’ll be using Python, so the existence of a good couchdb module is another important factor.

I have been spending some time looking at our media collections and wanting to start sorting things out, so using CouchDB as the store for the data seemed an ideal solution. Of course, having not done much with NoSQL, I anticipated a steep learning curve 🙂

Our media files are arranged in a number of different directory structures, but most of the files have embedded metadata (of varying degrees of completeness and relevance). Asking any parser to figure out whether the directory structure is artist/album or some other odd combination (we have quite a few) isn’t likely to result in a robust solution, so I’m using the metadata as the primary source of information. That meant my first step was to code a small media scanner that finds the media files I’m interested in and extracts their metadata. Once I could extract the data I started examining it to better understand how to structure the documents.
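A minimal sketch of the file-finding half of such a scanner. It assumes that guessing the MIME type from the file extension (via the standard `mimetypes` module) is good enough to decide what counts as media; a real scanner would more likely inspect the file content. The name `find_media_files` is made up for illustration.

```python
import mimetypes
import os

MEDIA_TYPES = {"audio", "video", "image"}


def find_media_files(root):
    """Yield (path, mime_type) for every audio, video or image file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Assumption: the extension-based guess stands in for real content sniffing.
            mime_type, _encoding = mimetypes.guess_type(path)
            # Skip anything whose type is unknown or is not media.
            if mime_type and mime_type.split("/", 1)[0] in MEDIA_TYPES:
                yield path, mime_type
```

Because the directory layout is ignored entirely, the same walk works whether the files are stored as artist/album or in any other arrangement.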

**Files**

Representing each file as a document was an easy initial choice. Each file document will have:

- filename
- extension
- size (in bytes)
- time added (timestamp)
- last modified date (timestamp)
- full mime-type of the content
- type (as per the mime-type for the file content, i.e. one of audio, video or image)
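As a rough illustration, a document with these fields could be built as below. Only `filename` and `type` are field names taken from the views in this post; `extension`, `size`, `added`, `modified` and `mimetype` are assumed names, and the MIME type is guessed from the extension rather than from the content.

```python
import mimetypes
import os
import time


def build_file_document(path):
    """Return a dict holding the file-level fields listed above."""
    stat = os.stat(path)
    mime_type, _encoding = mimetypes.guess_type(path)
    mime_type = mime_type or "application/octet-stream"
    return {
        "filename": os.path.basename(path),
        "extension": os.path.splitext(path)[1].lstrip(".").lower(),
        "size": stat.st_size,                 # bytes
        "added": int(time.time()),            # when this document was built
        "modified": int(stat.st_mtime),       # last modification of the file
        "mimetype": mime_type,                # full type, e.g. audio/mpeg
        "type": mime_type.split("/", 1)[0],   # major type: audio, video or image
    }
```

The resulting dict is plain JSON-compatible data, so it can be saved as a CouchDB document as it stands.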

To get a list of the files I have indexed I can then write a simple view that checks for the existence of a filename and a type and returns the information I need, e.g.

function (doc) { if (doc.filename && doc.type) { emit(doc.filename, doc); } }

Similarly to get a list of the files of a certain type I can write a view

function (doc) { if (doc.filename && doc.type) { emit(doc.type, doc); } }

Adding a reduce function of _count and querying with group=true will then give me a count of how many files of each type are listed. So far, so good, so simple.
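Views like these live in a design document. The sketch below shows the shape of one holding both views, with the `_count` reduce on the second. The design document name `files` and the view names `by_name` and `by_type` are invented for the example.

```python
import json

design_doc = {
    "_id": "_design/files",  # hypothetical design document name
    "language": "javascript",
    "views": {
        # One row per file, keyed on filename.
        "by_name": {
            "map": "function (doc) { if (doc.filename && doc.type) "
                   "{ emit(doc.filename, doc); } }",
        },
        # One row per file, keyed on type; _count totals the rows.
        "by_type": {
            "map": "function (doc) { if (doc.filename && doc.type) "
                   "{ emit(doc.type, doc); } }",
            "reduce": "_count",
        },
    },
}

# The JSON printed here is what would be saved to the database.
print(json.dumps(design_doc, indent=2))
```

Querying `by_type` with `group=true` returns one row per distinct key, so each media type appears once with its count.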

*The CouchDB documentation suggests generating document IDs yourself instead of relying on CouchDB to generate them. Given that the IDs generated by the system are already unique I’m not sure I fully agree with the need to generate them myself, so I have decided to leave it to CouchDB. The ID is a 32-character hex string (128 bits), which I imagine will be nicely unique 🙂*
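For a sense of what such an ID looks like, or should generating IDs on the client later prove useful, Python’s `uuid` module produces a string of the same shape. This is only an illustration; CouchDB’s own generator may use a different algorithm.

```python
import uuid


def new_document_id():
    """Return a random 32-character hex string (128 bits)."""
    return uuid.uuid4().hex
```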

**Metadata**

After some looking around I have parsers for audio, video and images that each produce a Python key:value dictionary, whose entries I simply add as fields to the document. This exploits an advantage of NoSQL: I can add fields with any name or type of data to the document. The parsers also return information about the media itself, so that when examining an image, for example, the width and height are returned and stored. At present I have one parser per media type, but hopefully I’ve left things flexible enough to simply add another parser whenever I feel there is a better one available.
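One way to keep the parsers swappable is a small registry keyed on media type, sketched below. The parser functions are placeholders that return fixed values rather than reading a file, and the choice to let file-level fields win over parser output (via `setdefault`) is an assumption, not something stated above.

```python
def parse_audio(path):
    # Placeholder: a real parser would read the tags from the file.
    return {"artist": "Example Artist", "title": "Example Title"}


def parse_image(path):
    # Placeholder values standing in for extracted dimensions.
    return {"width": 640, "height": 480}


# Replacing a parser means changing one entry here.
PARSERS = {"audio": parse_audio, "image": parse_image}


def add_metadata(doc, path):
    """Merge parser output into doc without overwriting existing fields."""
    parser = PARSERS.get(doc.get("type"))
    if parser is None:
        return doc  # no parser registered for this media type
    for key, value in parser(path).items():
        doc.setdefault(key, value)
    return doc
```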

**Collections**

Having all the information about the files allowed me to start looking at how it should be grouped into collections. While it’s great to have a series of audio files, it’s more useful to know how they interrelate. But deciding how to group documents into collections raises more questions.

The simplest way is to have a multiple item key in a view. This has the advantage of being dynamic (each time a document is added/edited the view is updated) and requires no additional effort when adding or editing documents.

function (doc) { if (doc.filename && doc.type && doc.artist) { emit([doc.type, doc.artist], doc); } }

Adding a reduce of _count gives me a list of type and artist with the number of documents for each. Again, this is pretty simple, and by using the startkey/endkey parameters it is easily searchable. It does, however, require a new view to be written for each collection I want, which reduces the flexibility I was seeking. I also realised that certain types of collection could have their own information, e.g. a link to an artist’s website or Wikipedia page. Adding subcollections (e.g. a list of albums per artist) would be possible by expanding the view to include the collection item as an additional key, but once again this needs the view to be written ahead of time, which reduces its usefulness.
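The startkey/endkey search over a composite key can be sketched as a query string. The example below selects every artist for one media type; it relies on CouchDB’s view collation placing an object (`{}`) after any string, so `[type, {}]` closes the range. The function name is made up, and the view URL the parameters would be appended to is left out.

```python
import json
from urllib.parse import urlencode


def type_range_query(media_type):
    """Build query parameters covering all [media_type, artist] keys."""
    params = {
        "startkey": json.dumps([media_type]),
        # {} collates after every string, so this bounds the artist component.
        "endkey": json.dumps([media_type, {}]),
        # With the _count reduce, group=true yields one row per distinct key.
        "group": "true",
    }
    return urlencode(params)
```

Note that the key values must be JSON-encoded before being URL-encoded, which is why `json.dumps` is applied first.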

The answer seemed to be to create a document for a collection, but CouchDB doesn’t have joins or relationships between documents. I decided to use a list of document IDs to create the relationship, which seemed logical and allowed me to check for uniqueness easily. To each collection document I added a collection key containing a string describing what it was, i.e. ‘album’ or ‘artist’. Initially I decided to simply add a list of the document IDs, but then I realised that each collection was so generic that this wouldn’t be enough. It’s easy to imagine a situation where I have audio and video files from the same artist, but if I’m listening to music from that artist I don’t want to also listen to the soundtrack of their videos. Being able to address this by simply altering the structure of the data within the document I create (adding the type of media) shows how much flexibility the NoSQL approach offers.
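The exact layout of a collection document is not spelled out here, so the following is only one possible shape: document IDs grouped under their media type, with a helper that keeps each list free of duplicates. The `items` field and both function names are hypothetical; `collection` and `name` match the fields used in the views.

```python
def new_collection(kind, name):
    """Create an empty collection document, e.g. kind 'artist' or 'album'."""
    return {"collection": kind, "name": name, "items": {}}


def add_item(collection, media_type, doc_id):
    """Record doc_id under its media type, ignoring IDs already present."""
    ids = collection["items"].setdefault(media_type, [])
    if doc_id not in ids:
        ids.append(doc_id)
    return collection
```

With this shape, playing an artist’s music means reading only `items["audio"]`, leaving any video documents for the same artist untouched.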

Creating a view

function (doc) { if (doc.collection && doc.name) { emit([doc.collection, doc.name], doc); } }

allows me to quickly get a list of all the collections ordered by type and name. How can I easily get the documents contained within each collection? The simple option is to get the collection document and then iterate over the items, but thanks to some help on the CouchDB IRC channel I found that a simpler way is

```javascript
function (doc) {
  if (doc.collection && doc.name && doc.ids) {
    for (var i in doc.ids) {
      emit([doc.collection, i], {'_id': doc.ids[i]});
    }
  }
}
```

Passing the *include_docs=true* flag when requesting this view will return the documents contained within the collection. Using the *startkey* and *endkey* parameters allows quick and easy searching.
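To make the effect of include_docs concrete, here is a small in-memory imitation in Python. It is not CouchDB code: `database` is a plain dict of documents keyed by ID, and the two functions only mimic the behaviour where a row whose value contains `_id` gets that linked document attached instead of the emitting one.

```python
def map_collection(doc):
    """Loose Python rendering of the map function above: one row per linked ID."""
    if doc.get("collection") and doc.get("name") and doc.get("ids"):
        for index, doc_id in enumerate(doc["ids"]):
            yield [doc["collection"], index], {"_id": doc_id}


def query_with_include_docs(database):
    """Mimic include_docs=true: attach the document named by value['_id']."""
    rows = []
    for doc in database.values():
        for key, value in map_collection(doc):
            rows.append({"key": key, "value": value,
                         "doc": database.get(value["_id"])})
    rows.sort(key=lambda row: row["key"])  # views return rows in key order
    return rows
```

Each returned row carries the file document itself in `doc`, so no second round of lookups is needed to list a collection’s contents.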

**Thumbnails & Cover Art**

Many media files have an embedded thumbnail image which the metadata extractors can retrieve. I thought it would be good to store this somewhere, but certain problems became apparent. Attaching the thumbnail to the document is possible, but by default the Python library doesn’t pass the attachments=true query parameter, so subsequent retrieval of the document omitted the attachment data. It’s easy enough to fix, but then I wondered about the sense in storing identical images this way. Imagine an album with 12 audio tracks, each with the same embedded cover art image: surely it’s more sensible to store the image once, as an attachment on a separate document, and then store the ID of that document in the track documents? I think that an MD5 hash of the image contents may allow me to identify duplicates, but this is something I’m still looking at.
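A sketch of the MD5 idea, under the assumption that the hash of the image bytes is used directly as the ID of the cover art document. `cover_store` is a plain dict standing in for the database, and the `cover_art` field and `cover-` prefix are invented for the example.

```python
import hashlib


def cover_art_id(image_bytes):
    """Derive an ID from the image content, so identical images share it."""
    return "cover-" + hashlib.md5(image_bytes).hexdigest()


def attach_cover(doc, image_bytes, cover_store):
    """Store the image once in cover_store and reference it from doc."""
    art_id = cover_art_id(image_bytes)
    cover_store.setdefault(art_id, image_bytes)  # no-op for a duplicate image
    doc["cover_art"] = art_id
    return doc
```

With this approach the 12 tracks of an album would all point at a single stored image, provided their embedded images are byte-for-byte identical; re-encoded copies of the same artwork would still hash differently.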

**Unique Documents**

The final issue I need to address is how to deal with duplicates. Uniqueness of filenames is easily checked with a simple view keyed on the filename, but as I said I’m using the metadata rather than filenames, so there is a high probability that over time the same item will make an appearance under different filenames. Using a view with a composite key looks the most promising, but of course the components of the key will be different for different types of media.
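The per-type composite key could be prototyped in Python before committing to a view. The field choices in `IDENTITY_FIELDS` are guesses at what might identify an item of each type, not a tested scheme, and every name here is hypothetical.

```python
# Assumed identifying fields per media type.
IDENTITY_FIELDS = {
    "audio": ("artist", "album", "title"),
    "video": ("title", "duration"),
    "image": ("width", "height", "taken"),
}


def identity_key(doc):
    """Return a composite key for doc, or None if a needed field is missing."""
    fields = IDENTITY_FIELDS.get(doc.get("type"))
    if not fields or any(field not in doc for field in fields):
        return None
    return [doc["type"]] + [doc[field] for field in fields]


def find_duplicates(docs):
    """Return (first_filename, duplicate_filename) pairs sharing a key."""
    seen, duplicates = {}, []
    for doc in docs:
        key = identity_key(doc)
        if key is None:
            continue  # not enough metadata to compare
        key = tuple(key)
        if key in seen:
            duplicates.append((seen[key], doc["filename"]))
        else:
            seen[key] = doc["filename"]
    return duplicates
```

Documents with incomplete metadata are skipped rather than matched, which avoids false positives at the cost of missing some duplicates.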

**Summary**

I’m sure that the above will not have contained many surprises for people who have used NoSQL before, and using it for metadata is hardly a big leap, but it’s been useful to experiment with. If there are better ways of doing things then please feel free to point them out, as I’m interested to learn more about CouchDB.

So far I’ve found a solution for all of the things I wanted to do with CouchDB, but coming from an SQL background it’s taken a while to get used to the differences. Futon is a great way of looking at the documents and playing with map/reduce; without it, it would have taken me far longer to get to where I am. The flexibility to change the data within the documents without needing to adjust a schema is a very nice change. I think this type of data suits the NoSQL approach, and having spent the time I’ll be far more willing to use it in future.
