Querying an EAV Table - microsoft.public.sqlserver.programming | Google Groups
Ran across this thread recently on the Microsoft SQL Server programming list which was provided in response to a performance question on the postgres performance list. The basic premise is that storing entity-attribute-value (EAV) tuples in a relational database is a bad idea for a number of reasons.
One of those is extra storage overhead required when you have to store additional information about each value. Though this is becoming a weaker argument as storage space continues to get cheaper.
A second issue is data integrity. With a normal database schema, there are a number of data integrity rules that can be imposed into the structure of the database disallowing entry of invalid data. These include foreign keys, data types and even triggers for more complex integrity checks. Even the table structure itself is a form of data integrity checking since it assures that, for example, a person cannot have two birthdates since there is only one column for birthdate. In an EAV solution, those integrity constraints would have to be handled by the programmer. As pointed out by the thread, trying to do data integrity checking in the database for an EAV setup is very difficult.
Probably the most compeling issue is exactly the one we ran into on the web application I work on. Basically, if you store your data in EAV form, it becomes a nightmare to do any kind of decent reporting on that data set in a performant manner. Even though you store it in EAV format, users tend to want to query the data as if the attributes were columns in a mythical table. Thus you write queries where every column must be a join or a subselect back to the main table. Relational databases aren't a good match for this kind of data structure.
The seduction of EAV is that you can model any kind of attribute on an entity without having to do schema changes. The only alternative to EAV is often very wide tables for each entity that can be sparsely populated. This is especially troublesome if the user can add arbitrary attributes on the fly as they can in our application.
Suffice it to say that we've learned the hard way the pain of trying to make EAV queries performant. If we were to start over again today, I would love to experiment with using database columns for attributes. Schema changes in Postgres and many other databases can now be done transactionally, so there are far fewer issues related to frequent schema changes.
But, the reality is that I don't think anyone has come up with a really good way to handle arbitrary attributes on entities and still allow decent performance for queries and reports. No matter what solution you choose, you will eventually run into database limits. With EAV, it's often data querying. With attributes as columns, it's column limits. We've gotten pretty good, which is why we are in business, but our design has it's tradeoffs and we still struggle with performance at times.
The reason we keep struggling with it though is that EAV is very attractive to certain customers. Users love to be able to mold their environment to fit their organization or business and not be told how to structure it. So, there is a huge upside on the user side, which is why we continue to find good solutions to the storage of that data.
Ran across this thread recently on the Microsoft SQL Server programming list which was provided in response to a performance question on the postgres performance list. The basic premise is that storing entity-attribute-value (EAV) tuples in a relational database is a bad idea for a number of reasons.
One of those is extra storage overhead required when you have to store additional information about each value. Though this is becoming a weaker argument as storage space continues to get cheaper.
A second issue is data integrity. With a normal database schema, there are a number of data integrity rules that can be imposed into the structure of the database disallowing entry of invalid data. These include foreign keys, data types and even triggers for more complex integrity checks. Even the table structure itself is a form of data integrity checking since it assures that, for example, a person cannot have two birthdates since there is only one column for birthdate. In an EAV solution, those integrity constraints would have to be handled by the programmer. As pointed out by the thread, trying to do data integrity checking in the database for an EAV setup is very difficult.
Probably the most compeling issue is exactly the one we ran into on the web application I work on. Basically, if you store your data in EAV form, it becomes a nightmare to do any kind of decent reporting on that data set in a performant manner. Even though you store it in EAV format, users tend to want to query the data as if the attributes were columns in a mythical table. Thus you write queries where every column must be a join or a subselect back to the main table. Relational databases aren't a good match for this kind of data structure.
The seduction of EAV is that you can model any kind of attribute on an entity without having to do schema changes. The only alternative to EAV is often very wide tables for each entity that can be sparsely populated. This is especially troublesome if the user can add arbitrary attributes on the fly as they can in our application.
Suffice it to say that we've learned the hard way the pain of trying to make EAV queries performant. If we were to start over again today, I would love to experiment with using database columns for attributes. Schema changes in Postgres and many other databases can now be done transactionally, so there are far fewer issues related to frequent schema changes.
But, the reality is that I don't think anyone has come up with a really good way to handle arbitrary attributes on entities and still allow decent performance for queries and reports. No matter what solution you choose, you will eventually run into database limits. With EAV, it's often data querying. With attributes as columns, it's column limits. We've gotten pretty good, which is why we are in business, but our design has it's tradeoffs and we still struggle with performance at times.
The reason we keep struggling with it though is that EAV is very attractive to certain customers. Users love to be able to mold their environment to fit their organization or business and not be told how to structure it. So, there is a huge upside on the user side, which is why we continue to find good solutions to the storage of that data.


0 comments:
Post a Comment