Doctrine: saving you some tables and headaches

I like database schema normalization, but sometimes it just creates too much overhead.

For example, you may want to let your users enter a number of URLs (Facebook, LinkedIn, Twitter, blog, company, etc.). The straightforward solution seems to be adding a Url table with a foreign key (user_id=1) to link it to the User. Sometimes you want to reuse the Url table for other entities, such as Project or Organization. In that case, you’d probably go with a Url table containing the name of the entity it belongs to and an id (entity=”User”, entity_id=1), which requires some extra logic in your code.
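
Just to make that “extra logic” concrete, every lookup against such a shared table has to filter on both the owning entity’s name and its id. A rough sketch (the entity class and field names here are hypothetical):

    // Hypothetical polymorphic lookup: fetch the URLs that belong to one user.
    $urls = $entityManager->getRepository('AppBundle\Entity\Url')->findBy(array(
        'entity'   => 'User',
        'entityId' => $user->getId(),
    ));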

But then I step back and ask myself: “What do I gain by storing URLs in a separate table?” Usually, the answer is “Nothing, really.” I often serialize such data to a text column using JSON:
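
For a user with a few profile links, the column might end up holding something like this (the keys and URLs below are made up):

    // A handful of links kept in one PHP array...
    $urls = array(
        'twitter'  => 'https://twitter.com/example',
        'linkedin' => 'https://www.linkedin.com/in/example',
        'blog'     => 'https://blog.example.com',
    );

    // ...and stored in a single text column as one JSON string:
    // {"twitter":"https:\/\/twitter.com\/example","linkedin":"https:\/\/www.linkedin.com\/in\/example","blog":"https:\/\/blog.example.com"}
    echo json_encode($urls);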

This format is very natural for me to work with, saves me tables, and has a positive effect on performance, since I no longer need LEFT JOINs. For years, I wrote my APIs in a way that automatically deserialized the JSON when loading from the database and serialized it before storing it back.

I recently discovered that Doctrine 2 has a nifty column type called “json_array”. You basically interact with your model’s property as though it were a regular PHP array, e.g. $user->getUrls() and $user->setUrls(). Doctrine then takes care of the serialization and deserialization for you, which makes adding new columns of this type faster and easier than ever.
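
As a rough sketch of what such a mapping can look like with Doctrine 2 annotations (the entity and property names below are just an example):

    <?php

    use Doctrine\ORM\Mapping as ORM;

    /**
     * @ORM\Entity
     */
    class User
    {
        /**
         * @ORM\Id
         * @ORM\Column(type="integer")
         * @ORM\GeneratedValue
         */
        private $id;

        /**
         * Stored as a JSON string in a single column; Doctrine serializes
         * and deserializes it automatically.
         *
         * @ORM\Column(type="json_array", nullable=true)
         */
        private $urls = array();

        public function getUrls()
        {
            return $this->urls;
        }

        public function setUrls(array $urls)
        {
            $this->urls = $urls;

            return $this;
        }
    }

With that in place, $user->setUrls(array('twitter' => 'https://twitter.com/example')) and $user->getUrls() work on a plain PHP array, and the JSON handling never shows up in application code.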

Reference for the Doctrine mapping types: http://docs.doctrine-project.org/en/latest/reference/basic-mapping.html#doctrine-mapping-types

Happy coding!

13 thoughts on “Doctrine: saving you some tables and headaches”

  1. Feras

    I don’t agree with you on several points:

    1) If this sort of performance overhead from joins matters to you, you should NOT be using Doctrine to begin with.

    2) Storing values like that undermines the whole point of using a relational database, since no operations can be performed on them. If you need to store such unstructured data, a non-relational database such as CouchDB or MongoDB is what you need. Actually, a lot of applications use that approach and store the relational data in an RDBMS and the unstructured data in a NoSQL database, which obviously makes sense.

    3) Depending on how complex that serialized data is and what your schema looks like (indexes, etc.), serializing might actually be slower than joins when you have to do it on huge numbers of records.

    4) If all your API does is serialize/deserialize some data and store it, and performance is key, why would you ever use an SQL solution?

    I really don’t mean to rant on this post, but I see so many PHP developers posting similar stuff and doing such premature and unnecessary optimizations on open source projects that they just make the code one big pile of crap, essentially ruining it.

  2. I understand your reaction, since this approach doesn’t look “pure”.

    When choosing a type of storage, it doesn’t have to be either-or. Find the right balance of approaches that works for your unique case.

    Note that in the example only a small portion of the data is serialized, and that data is far from complex, so the overhead is negligible. A few CPU cycles during the occasional write are less expensive than a join on a multimillion-record table during the countless reads. As long as this approach is used sparingly, I wouldn’t worry.

    Besides, the optimization is only a positive side effect. The main reason is to simplify the schema and the code in order to reduce development time and later maintenance costs. I have seen complex schemas with hundreds of tables created for every type of list imaginable (plus the corresponding models), and developers were simply lost in that clutter.

    Once again, this approach shouldn’t be used everywhere, because it doesn’t make sense in every case. It’s good to know that the method exists and has its advantages in certain cases. In the end, every tool has the potential to be misused, so I have to trust that people will think before making decisions.

  3. gggeek

    At the cost of being pedantic – trying to make it clear even for the youngest readers:
    even in the simple case mentioned, and without taking any scaling/performance considerations into account (hey, not everyone becomes the next Twitter), what you lose with this solution is the possibility of querying the url table for analytic/statistical information:
    - how many users have a twitter link?
    - how many links are tagged twitter?
    Making a batch update also becomes a much heavier operation.

    My advice would be to think about the business logic first, optimize second.

    • foo

      PostgreSQL actually supports JSON as a native data type and also has functions for querying those JSON fields.

      I think the example given in the article is a good example of a case where storing JSON makes sense.
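
      For instance – assuming PostgreSQL 9.3+, a json column named urls on the users table, and a Doctrine DBAL connection (all of these names are made up) – gggeek’s first question could be answered with something like:

        // Hypothetical sketch: count the users that have a twitter link
        // stored inside the serialized "urls" json column.
        $sql = "SELECT COUNT(*) FROM users WHERE urls->>'twitter' IS NOT NULL";
        $count = $conn->fetchColumn($sql); // $conn is a Doctrine\DBAL\Connection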

  4. Your data is still searchable when serialized. I have implemented solutions for analytics and batch operations on big recordsets and much more complex data structures than the one illustrated.

    Once again, it all depends on how much you want to focus on theoretical features rather than the ones at hand. One can spend countless hours planning for the what-ifs that never happen (premature architecture). You should ask yourself, “How likely am I to implement all these other things in the near future?” In the worst case, you can refactor. In the meantime, you’d be delivering real value and return on investment for your client/employer.

    This is my philosophy, and it’s the one that enabled me to get into the project rescue business. Maybe if I had unlimited time, I could have spent more time “purifying” my code and planning for all the potential features that I might or might not implement :)

  5. Feras

    I find the article useful in that it shows this overlooked feature, but again I would advise against such an application. If you read your comment:

    “Your data is still searchable when serialized. I have implemented solutions for analytics and batch operations on big recordsets and much more complex data structures than the one illustrated.”

    What do you think is messier – starting to write those solutions that do magic in memory and try to do operations on otherwise relational data, or just having a hundred tables in the schema, as you say, and leaving the highly optimized DB daemon to do those operations for you… That’s my humble opinion.

  6. The reality is that there are plenty of use cases where it makes sense to serialize data into an RDBMS. At the most extreme level, one can see it as just a transaction-enabled file system with well-understood solutions for backup and replication.

    Now, between that and a properly normalized RDBMS there are a lot of different use cases. If I just have a few records, do I really want to bring in another database technology that needs to be scaled and made fault tolerant?

    Also, if I never/rarely need to query the data and never/rarely need to do batch operations on it, does it even matter? Furthermore, it’s not like NoSQL solutions are very good at batch operations; in fact, it might be easier to batch update JSON in an RDBMS than in many NoSQL solutions because there is simply no ad hoc query language that allows batch updates in many of them.

    Furthermore, the JSON approach can often also simply be used for denormalization, where the application or triggers keep the serialized copies updated.

    So the conclusion: Do not oppose serialized data in an RDBMS on principle.

  7. Lee Davis wrote some interesting things regarding Doctrine. I suggest you take a look at this link he gave me: http://www.duckheads.co.uk/in-orms-defence/192

  8. Victor P.

    There are so many reasons why serializing data like this inside an RDBMS is terrible, but I see that some of the most logical ones have already been listed.

    Anna, you can’t call it premature architecture, or “purifying” your code. These are best practices for a reason, and there’s a good reason for data normalization.

    The cost of storing this data correctly is not that big, and by serializing your data this way, “refactoring” your data and schema later may prove to be a lot harder than you originally thought.

    I use Doctrine, and I make sure that I create the correct relationships and proper models so that my tables meet third normal form.

  9. Feras

    @Lukas
    I mostly disagree here:
    “it might be easier to batch update JSON in an RDBMS than in many NoSQL solutions because there is simply no ad hoc query language that allows batch updates in many of them.” I think you are looking at the problem from the wrong angle. Sure, if you store the serialized JSON in NoSQL it might be harder, but then the whole point of using a schema-less NoSQL solution is lost. The idea of using NoSQL for unstructured data is exactly that – you don’t need to serialize the data, since it doesn’t have to have a fixed structure, and making batch operations on it is trivial… Having to do it in an RDBMS means having to use JSON decoding where the software supports it, and then manually extracting each JSON field from the set and performing operations on it. Here is a brief example in MySQL:
    http://planet.mysql.com/entry/?id=36677
    Now I don’t know how efficient that is and so on, but it looks more like a patch than a real solution. Also, I am not against denormalization, but I am for denormalization when it is used for performance gains, not to simplify the schema or save development time, since it’s going to bite you later if you do it for those purposes. And if your denormalization is done for performance gains, then chances are you won’t mind adding additional software to the stack, such as NoSQL, since the performance gains will generally be larger. That’s my humble opinion.

  10. Victor P.

    @Feras

    Agreed. The correct solution when you have an issue related to dynamic schemas is to use a NoSQL database, and then serialization is not necessary.

    And if you really needed this to be in a relational database but still dynamic enough, you could add a link table that has a many-to-one relationship with the user table and a type column.

    It seems to me that the problem described above is not a schema issue but a design issue.

    But a NoSQL approach would be the most correct in a scenario where you truly have a dynamic schema.

    I think it’s great that Doctrine has this data type, but I’m not certain that it was intended to be used like this.

  11. Caponica

    I have to agree with Lukas and Anna here… there are plenty of use cases where it makes sense to serialize some small amount of data (relative to the whole).

    Throwing out an existing system to replace it with NoSQL is not realistic, and slavishly following principles of normalisation even where they do not serve your actual needs is not good practice.

    Anna, in relation to your original article… have you tried creating an admin or form interface for data serialized in this way? If so, was that complex, or are there built-in Symfony / Sonata / etc. form types that help with this?

    • @Caponica
      Extracting form data is very easy in such frameworks. In Symfony, I’d call ->getData() to get the form’s data as an array.

      I would then use json_encode() to serialize the array. If I’m using Doctrine’s “json_array” type for a column, then I can simply assign the array to that column and let Doctrine do the serialization.
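
      A rough sketch of the whole round trip – assuming a Symfony form whose fields map to the individual links, and the “json_array” mapping described in the article ($form, $request, $user and $em are taken as given here):

        // In a controller action: bind the request to the form.
        $form->handleRequest($request);

        if ($form->isValid()) {
            // getData() returns the submitted values as a plain PHP array.
            $user->setUrls($form->getData());

            // No json_encode() needed with the "json_array" column type:
            // Doctrine serializes the array when the entity is flushed.
            $em->persist($user);
            $em->flush();
        }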
