If you’ve been playing with CouchDB then you have probably run into the problem of having multiple documents that share a relationship, yet are complex enough that denormalizing them together doesn’t make sense.
For the sake of example let’s assume we have Authors and Posts. Further, let’s assume that an Author has many Posts and that these are represented as independent documents. The goal here is to find posts created by a specific author.
In SQL, this would probably be represented as a table of authors and a table of posts. Getting the aggregate result would be something along the lines of:
SELECT * FROM authors JOIN posts ON authors.id = posts.author_id
This is just a simple, standard JOIN; there is nothing very interesting about it. However, in a document-oriented system this “join” operation doesn’t really exist. Enter view collation.
View collation allows us to define a map/reduce function pair that takes in multiple document types, aggregates them onto a single key value, and gives us a means to search on author id.
Let’s say we have an author document like the following:
{'type' : 'author', 'name' : 'Chris Chandler', '_id' : '22d43eaa7e06c9c37ed3e0489401a506' }
and some number of post documents similar to:
{'type' : 'post', 'title' : 'Hello world', 'body' : 'body text', 'author' : '22d43eaa7e06c9c37ed3e0489401a506' }
In this case our “foreign key” is 22d43eaa7e06c9c37ed3e0489401a506. The mapping function needs to connect these records based on that key, something like the following:
function(doc)
{
    if(doc.type == 'author')
    {
        emit(doc._id, doc);
    }
    else if(doc.type == 'post')
    {
        emit(doc.author, doc);
    }
}
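As a rough illustration of what this map function emits, the sketch below runs it outside CouchDB. The `emit` stub, the `rows` array, and the post `_id` values are stand-ins invented for the example; CouchDB supplies its own `emit` and keeps the rows in the view index.

```javascript
// Sample documents shaped like the ones above (the post _ids are made up).
const docs = [
  { _id: 'post-1', type: 'post', title: 'Hello world', body: 'body text',
    author: '22d43eaa7e06c9c37ed3e0489401a506' },
  { _id: '22d43eaa7e06c9c37ed3e0489401a506', type: 'author', name: 'Chris Chandler' },
  { _id: 'post-2', type: 'post', title: 'Hello world 2', body: 'body text',
    author: '22d43eaa7e06c9c37ed3e0489401a506' }
];

// Stand-in for CouchDB's emit(): collect every key/value pair in an array.
const rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// The map function from the article, given a name so it can be called here.
function map(doc) {
  if (doc.type == 'author') {
    emit(doc._id, doc);
  } else if (doc.type == 'post') {
    emit(doc.author, doc);
  }
}

docs.forEach(map);

// Every row ends up under the same key: the author's _id.
const keys = rows.map(function (row) { return row.key; });
const types = rows.map(function (row) { return row.value.type; });
```

Here `keys` holds the author’s `_id` three times, while `types` shows that one of those rows is the author record and the other two are posts.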
This view will generate an intermediate hash table containing entries with the author’s key. In essence we have one key (the author’s) pointing to either a post the author has created, or the author’s record itself. To make this view answer the question ‘Show me all posts created by a certain author’ we need to write a reduce function that removes the unnecessary author records so the final table will only contain author keys pointing to lists of posts.
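function(keys, values, rereduce)
{
    var posts = [];
    for(var i = 0; i < values.length; i++)
    {
        if(rereduce)
        {
            // On rereduce, each value is an array of posts returned by an earlier reduce call.
            posts = posts.concat(values[i]);
        }
        else if(values[i].type == 'post')
        {
            posts.push(values[i]);
        }
    }
    return posts;
}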
The final result set appears as:
"22d43eaa7e06c9c37ed3e0489401a506"
[
{_id: "d0d0ea6de45c9f4ff983f12a9fed9008", _rev: "2624588756", body: "Weee!", title: "Hello world 2", author: "22d43eaa7e06c9c37ed3e0489401a506", type: "post"},
{_id: "9de65ae955ecc2ea35055b9339f1651c", _rev: "2347078231", body: "Weee!", title: "Hello world", author: "22d43eaa7e06c9c37ed3e0489401a506", type: "post"},
{_id: "5d1ad3eed26f84879835fd47e44f7f55", _rev: "1163133569", body: "Weee!", title: "Hello world 2", author: "22d43eaa7e06c9c37ed3e0489401a506", type: "post"},
{_id: "0717ae0da9bf5919da0957268667c3f4", _rev: "1063237208", body: "Weee!", title: "Hello world 3", author: "22d43eaa7e06c9c37ed3e0489401a506", type: "post"}
]
Hey,
Using collation to solve the related documents query is definitely a big part of CouchDB Map/Reduce queries, but in this case I think you’ve jumped the gun a bit. You don’t need to use a reduce at all. Instead, your map would simply be “function(doc) {if(doc.type == 'post') emit(doc.author, doc);}” and you’d query the view with http://127.0.0.1:5984/db_name/_view/posts/by_author?key="22d43eaa7e06c9c37ed3e0489401a506" (the key is the author’s _id, written as a JSON string).
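A minimal sketch of that map-only approach, assuming post documents shaped like the ones in the article (`type: 'post'`, with `author` holding the author’s `_id`). The `emit` stub and the `queryByKey` and `viewUrl` helpers are hypothetical names made up for the example; the URL path is the one quoted in the comment above.

```javascript
// Stand-in for CouchDB's emit(); the real one writes to the view index.
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Map-only view: index posts by the _id of their author. No reduce needed.
function map(doc) {
  if (doc.type == 'post') {
    emit(doc.author, doc);
  }
}

// Mimics ?key=...: keep only the rows whose key matches exactly.
function queryByKey(viewRows, key) {
  return viewRows.filter(function (row) { return row.key === key; });
}

// View keys are JSON values, so a string key is quoted and then URL-encoded.
function viewUrl(key) {
  return 'http://127.0.0.1:5984/db_name/_view/posts/by_author?key=' +
    encodeURIComponent(JSON.stringify(key));
}

var authorId = '22d43eaa7e06c9c37ed3e0489401a506';
[
  { _id: 'p1', type: 'post', title: 'Hello world', author: authorId },
  { _id: 'p2', type: 'post', title: 'Other post', author: 'someone-else' },
  { _id: authorId, type: 'author', name: 'Chris Chandler' }
].forEach(map);

var postsByAuthor = queryByKey(rows, authorId);
```

Because only posts are emitted, the author record never shows up in the result and there is nothing left for a reduce to filter out.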
And on a side note, the reduce provided will break on unbounded data sets. In your case, the amount of data being returned grows linearly with the number of input keys. I.e., if you had N posts, your end result data size would be N*some_factor. Because of implementation details, the data returned from a reduce function should grow at less than log(num_rows_processed).
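For contrast, here is a sketch of a reduce whose output does not grow with the input: it returns a single number per key. It assumes the view’s map emits posts only; the function is given the name `reduce` just so it can be called below.

```javascript
// Reduce that returns one number, so its output stays tiny no matter
// how many posts an author has.
function reduce(keys, values, rereduce) {
  if (rereduce) {
    // values are partial counts from earlier reduce calls; add them up.
    var total = 0;
    for (var i = 0; i < values.length; i++) {
      total += values[i];
    }
    return total;
  }
  // First pass: values are the emitted posts; only how many there are matters.
  return values.length;
}

// Two first-pass calls over separate chunks of rows, then one rereduce.
// keys is passed as null here only because this reduce never reads it.
var chunkA = reduce(null, [{ type: 'post' }, { type: 'post' }, { type: 'post' }], false);
var chunkB = reduce(null, [{ type: 'post' }, { type: 'post' }], false);
var postCount = reduce(null, [chunkA, chunkB], true);
```

Whether the input is five posts or five million, the value stored at each step is still one integer.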
A slightly different method would be to return only the X most recent posts or some such. There are a few examples of such things floating around the interweb.
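One way such a “most recent X” reduce might look is sketched below. It assumes each emitted value carries a `created_at` ISO date string, which the documents in the article do not have, and `MAX_POSTS` is a made-up constant. The output is capped at `MAX_POSTS` entries, so it stays bounded however many posts go in.

```javascript
var MAX_POSTS = 3; // hypothetical cap on how many posts the reduce keeps

// Newest first; ISO date strings sort correctly as plain strings.
function newestFirst(a, b) {
  if (a.created_at === b.created_at) return 0;
  return a.created_at < b.created_at ? 1 : -1;
}

function reduce(keys, values, rereduce) {
  // On rereduce every value is already an array of posts, so flatten one level.
  var candidates = rereduce ? [].concat.apply([], values) : values.slice();
  candidates.sort(newestFirst);
  return candidates.slice(0, MAX_POSTS);
}

var olderChunk = reduce(null, [
  { title: 'a', created_at: '2009-01-01' },
  { title: 'b', created_at: '2009-01-03' },
  { title: 'c', created_at: '2009-01-02' }
], false);
var newerChunk = reduce(null, [
  { title: 'd', created_at: '2009-01-05' },
  { title: 'e', created_at: '2009-01-04' }
], false);
var latest = reduce(null, [olderChunk, newerChunk], true);
var latestTitles = latest.map(function (post) { return post.title; });
```

Emitting only a title and a date, rather than the whole post, would keep the stored values smaller still.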
I’m wondering if there’s such a thing in CouchDB as a view that spans multiple databases (i.e. multiple files; one database is one file, I believe).
I’ve read about people splitting application data across a large number of CouchDB databases. The standard example I’ve seen mentioned is having a separate database for each user account, which offers the possibility of more easily letting a user replicate their data to their own machine and replicate back later.
This all sounds fine, but could we then write a view to do something like return a sorted list of all the users?
In other words, are ‘views’ tightly coupled to ‘databases’ in CouchDB? Or could views be defined across multiple databases?
The rate limiting on the amount of data returned from reduce functions has to do with how the implementation stores partial reduction results in the btree. In terms of changing the implementation, I haven’t heard anyone pushing for it.
RE: Views over multiple DBs, the answer is definitely not. A view will only ever be able to touch documents in the db where it is defined.
The ‘proper’ way (as in, the first thing that occurred to me) would be to have a db that aggregates all the data you’d want to see. In the case of 1 db per user, I’d probably have a meta db that stored info on all users.
Thanks for the replies guys. I was guessing that, but it’s nice to have the confirmation.
It sounds like in the “meta db” case, what you’re effectively doing is rolling your own instance of the sort of meta-view I was asking about. To CouchDB it would simply be another database, and it would be up to other parts of your application to keep it synchronized with the other databases it is aggregating from.
It sounds like there is no compelling reason to do this sort of sharding for an app that fits on one server (or one disk, let’s say), unless one is convinced that map-reduce views across databases are not and will not be wanted or needed.
When/if you need to split databases to meet scaling demands, then you’ll have databases on separate compute nodes and so a CouchDB instance level view that aggregates them wouldn’t be too useful anyway. The application would need some other way of combining results.
When running a map-reduce across multiple databases, the ‘map’ part wouldn’t need to change at all. The ‘reduce’ is where it gets interesting. I could see wanting to reduce each database (and store as is done currently), and then re-reduce between separate databases to combine the output into one result.
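To make that idea concrete, here is a small in-memory sketch. Nothing in it is a CouchDB API: each “database” is just an array of documents, `reduceAcrossDatabases` is a hypothetical helper, and the map function receives `emit` as a second argument (in CouchDB it is a global). It also ignores grouping by key and produces a single overall result.

```javascript
// Works for both passes: sums emitted values or partial sums alike.
function sumReduce(keys, values, rereduce) {
  var total = 0;
  for (var i = 0; i < values.length; i++) {
    total += values[i];
  }
  return total;
}

// Hypothetical helper: map + reduce inside each database, then rereduce
// the per-database results into one value.
function reduceAcrossDatabases(databases, mapFn, reduceFn) {
  var partials = databases.map(function (docs) {
    var keys = [];
    var values = [];
    docs.forEach(function (doc) {
      mapFn(doc, function emit(key, value) {
        keys.push([key, doc._id]);
        values.push(value);
      });
    });
    return reduceFn(keys, values, false);
  });
  return reduceFn(null, partials, true);
}

// The map is the same for every database: emit 1 for each post.
function countPosts(doc, emit) {
  if (doc.type == 'post') {
    emit(doc.author, 1);
  }
}

var userDbA = [
  { _id: 'a1', type: 'post', author: 'alice' },
  { _id: 'a2', type: 'post', author: 'alice' }
];
var userDbB = [
  { _id: 'b0', type: 'author', name: 'Bob' },
  { _id: 'b1', type: 'post', author: 'bob' }
];
var totalPosts = reduceAcrossDatabases([userDbA, userDbB], countPosts, sumReduce);
```

The per-database partial results are the part that could be computed and stored where the data lives; only the final rereduce needs to see all of them.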
Assuming we’re in a distributed system with databases on multiple nodes, the question is where does the reduction happen? The vanilla MapReduce model is to have an arbitrarily large set of reducers, have all key-value pairs with the same key sent to a common reducer, and then run the reduce and collect the results from the reducers.
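The vanilla model described here can be sketched in a few lines. This is an in-memory toy, not how CouchDB or any particular MapReduce framework does it: `hashKey`, `shuffle`, and `runReducers` are made-up names, and the “reducers” are plain arrays rather than separate nodes.

```javascript
// Simple deterministic string hash, so equal keys always hash the same.
function hashKey(key) {
  var text = JSON.stringify(key);
  var hash = 0;
  for (var i = 0; i < text.length; i++) {
    hash = (hash * 31 + text.charCodeAt(i)) >>> 0;
  }
  return hash;
}

// Shuffle: send each pair to the reducer picked by hashing its key, so
// all pairs that share a key land on the same reducer.
function shuffle(pairs, numReducers) {
  var buckets = [];
  for (var r = 0; r < numReducers; r++) {
    buckets.push([]);
  }
  pairs.forEach(function (pair) {
    buckets[hashKey(pair.key) % numReducers].push(pair);
  });
  return buckets;
}

// Each reducer groups its pairs by key and reduces every group; the
// per-reducer results are then collected into one object.
function runReducers(buckets, reduceFn) {
  var results = {};
  buckets.forEach(function (bucket) {
    var groups = {};
    bucket.forEach(function (pair) {
      var k = JSON.stringify(pair.key);
      (groups[k] = groups[k] || []).push(pair.value);
    });
    Object.keys(groups).forEach(function (k) {
      results[k] = reduceFn(groups[k]);
    });
  });
  return results;
}

var pairs = [
  { key: 'alice', value: 1 }, { key: 'bob', value: 1 },
  { key: 'alice', value: 1 }, { key: 'carol', value: 1 },
  { key: 'alice', value: 1 }
];
var counts = runReducers(shuffle(pairs, 3), function (values) { return values.length; });
```

Because a key never spans two reducers, each reducer can finish its groups without talking to the others, which is what sets this model apart from re-reducing partial results.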
I could see a place for some companion piece of software to do this sort of distributed view re-reduce for CouchDB setups. It could take the form of another CouchDB instance (or likely more than one) running a particular application geared toward such a re-reduction. (And taking it a step farther, multiple MapReduce passes would be more of the same).
What’s interesting to me now is that this may not be too hard to build out of multiple CouchDB instances, without changing CouchDB itself at all. The main reason for piling on “more CouchDBs” would be to persist the results at each step to disk, so that if the same view is asked for twice, it is still only computed once.
Sorry if I’ve rambled. I’m interested in what people’s thoughts are on that. Thanks again for the replies!
But for me, what you showed here is not a join operation; it is more like a union over a master-detail relationship. And I can’t really see a way to implement multiple real join operations, as rereduce will break things down.
Can somebody point me in some direction?
Let’s say you have these documents:
{id:nikon, type:camera, descr:my comment, etc}
{id:canon, type:camera, descr:my comment2, price: 43$, etc}
{id:myphoto, madewith:canon, type:jpg, dim:{}, etc}
{id:myphoto2, madewith:nikon, type:jpg, dim:{}, etc}
and how do you join them to get
{id:myphoto, madewith:canon, type:camera, descr:my comment2, price: 43$, etc, type:jpg, dim:{}, etc}
{id:myphoto2, madewith:nikon, type:camera, descr:my comment, type:jpg, dim:{}, etc}
@kamiseq
It sounds like what you’re trying to do would probably be easier if you merged the camera data into each photo as a sub-document. Since CouchDB doesn’t really have “joins” per se, the closest you can get is collating data, which is really just emitting matching keys from different documents. In truth, I really need to revise this article :-). Try emit([doc._id, 1], something) for the cameras and emit([doc.madewith, 2], something) for the photos and combine them in the reduce phase. It’s possible CouchDB will complain that this reduction isn’t reducing fast enough, since the resultant value would be larger than the original.
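Here is a rough sketch of that collation idea, with one deliberate change: instead of combining in the reduce phase, it merges the sorted rows in application code, which sidesteps the reduce-size concern. It assumes cameras are keyed by their own `_id` so that each camera sorts just before the photos that reference it. The `emit` stub, `compareKeys`, and `joinRows` are invented for the example, and the sort is only a simplified stand-in for CouchDB’s view collation.

```javascript
// Stand-in for CouchDB's emit().
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Cameras are keyed by their own _id, photos by the camera they reference.
// The second key element (1 or 2) puts a camera ahead of its photos.
function map(doc) {
  if (doc.type == 'camera') {
    emit([doc._id, 1], doc);
  } else if (doc.type == 'jpg') {
    emit([doc.madewith, 2], doc);
  }
}

// Simplified ordering for [string, number] keys (an approximation of
// how CouchDB would order these rows).
function compareKeys(a, b) {
  if (a[0] !== b[0]) return a[0] < b[0] ? -1 : 1;
  return a[1] - b[1];
}

// Walk the sorted rows and attach the preceding camera to each photo
// as a sub-document.
function joinRows(sortedRows) {
  var joined = [];
  var camera = null;
  sortedRows.forEach(function (row) {
    if (row.key[1] === 1) {
      camera = row.value;
    } else if (camera && camera._id === row.key[0]) {
      joined.push({
        _id: row.value._id,
        type: row.value.type,
        madewith: row.value.madewith,
        camera: camera
      });
    }
  });
  return joined;
}

[
  { _id: 'nikon', type: 'camera', descr: 'my comment' },
  { _id: 'canon', type: 'camera', descr: 'my comment2', price: '43$' },
  { _id: 'myphoto', type: 'jpg', madewith: 'canon' },
  { _id: 'myphoto2', type: 'jpg', madewith: 'nikon' }
].forEach(map);

rows.sort(function (a, b) { return compareKeys(a.key, b.key); });
var joined = joinRows(rows);
```

Nesting the camera under its own `camera` field also avoids the clash between the two `type` values in the flattened result asked for above.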
Awesome advice, thanks!
In my original example I was going to aggregate some of the post information into the author but opted to do it this way. I should have taken a more critical look at what I was doing. Your suggested change to the map function seems somewhat obvious now that you’ve mentioned it :-).
I’m curious about the nature of the implementation of the reduction function. Is there something specific about the way CouchDB is currently handling reductions that requires the resulting data set to grow at a rate less than log(num_rows_processed)? Is this something that’s likely to change or something more indicative of the nature of Map/Reduce?
I could be wrong, but my understanding of the implementation precludes a view from being able to access multiple databases. Since the context of the request is determined by the path of the REST API (e.g. http://127.0.0.1:5984/some_db/view) and since the map function only takes docs, there isn’t an intermediate step to refer to external documents from other databases. I think it would also be difficult because the map functions are run ahead of time and the results are stored in b-trees for performance reasons (I need to confirm this). Those b-trees would have to be aware of multiple possible database origins.
I’ve seen several Ruby examples so far that show using multiple databases in cases like 1:1 of model-to-database and a few of 1:1 of user-to-database. To make something like that work you would probably have to aggregate the data in your application layer. Maybe have a “master” database that has references to all the other databases. My initial concern would be how well CouchDB handles that many databases. I’ve seen SQL-based implementations of this and it forms more of an anti-pattern than anything. It is a nightmare to code against, and a nightmare to maintain.
So, I’m going to hypothesize that at this point views are tightly coupled to databases. The only way I can think of to manage that case right now would be to aggregate in the app layer. If I find out anything to the contrary, I’ll post it! Thanks for your comment :-)!