Wednesday, August 27, 2008

Hadoop - Open source map reduce

FrontPage - Hadoop Wiki

Our application has a background scheduler that works fairly well for many of its tasks. But I wonder if it would be useful to look into a tool like Hadoop, or another implementation of map reduce, for finer-grained tasklets. The goal would be to move as much of the actual application work as possible out of the HTTP request->response cycle and into the background.

What gets me truly excited about tools like Hadoop and languages like Erlang is the ability to take advantage of new hardware easily when we upgrade, as well as having an easy path for expansion once it's decided that it's time to expand.

I've not had a lot of time to research Hadoop, but my understanding is that it provides an open source framework for using the map-reduce concept popularized by Google. Hadoop also provides a file system that is shared across all the nodes.

Hadoop is written in Java, but they do have an example of a MapReduce job written in Python, compiled via Jython and then executed on a cluster.

I also ran across an article which indicates that you can write Hadoop jobs directly in Python w/o the conversion step: Writing An Hadoop MapReduce Program In Python
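To make the map-reduce idea concrete, here is a minimal word count sketch in plain Python. It does not use Hadoop at all; the function names (mapper, reducer, word_count) are made up for illustration, and the sorted() call only stands in for the sort-by-key that a real framework performs between the map and reduce steps.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map step: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reducer(pairs):
    """Reduce step: sum the counts for each distinct word.

    A real framework sorts mapper output by key before reducing;
    sorted() stands in for that shuffle/sort phase here.
    """
    ordered = sorted(pairs, key=itemgetter(0))
    for word, group in groupby(ordered, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def word_count(lines):
    """Run the map step and the reduce step back to back in one process."""
    return dict(reducer(mapper(lines)))


if __name__ == "__main__":
    sample = ["Hadoop is open source", "map reduce is a simple idea"]
    for word, count in sorted(word_count(sample).items()):
        print(word, count)
```

The appeal is that mapper and reducer never share state, so in principle the same two functions could be spread over many machines without changing them.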

Related link: Greenplum and Aster Data have both just announced the integration of MapReduce into their SQL MPP data warehouse products

Interestingly, I believe Greenplum Database is based on PostgreSQL.

How Hard Could It Be?: How I Learned to Love Middle Managers

Monday, August 25, 2008

Huge speed ups in javascript coming to Firefox

Brendan's Roadmap Updates: TraceMonkey: JavaScript Lightspeed

A lot of work is going into making the Javascript engine in Firefox a lot faster. With the graphs on this page, it looks like the next major version of Firefox is going to be a whole lot faster, making Rich Internet Applications using Javascript/HTML/CSS that much better. So, that's one less thing that environments like Adobe Flash and Microsoft's Silverlight can hold over the somewhat crufty but far more ubiquitous Javascript/HTML/CSS environment.

So, will the older, ugly Javascript/HTML/CSS catch up enough to the newer RIA environments that they never really take off across the web as a whole? Or will the other benefits provided by Flash and Silverlight cause most websites to move over to the "flashier" RIA environment?

Monday, August 18, 2008

Why entity-attribute-value is bad

Querying an EAV Table - microsoft.public.sqlserver.programming | Google Groups

Ran across this thread recently on the Microsoft SQL Server programming list which was provided in response to a performance question on the postgres performance list. The basic premise is that storing entity-attribute-value (EAV) tuples in a relational database is a bad idea for a number of reasons.

One of those is the extra storage overhead required when you have to store additional information about each value, though this is becoming a weaker argument as storage space continues to get cheaper.

A second issue is data integrity. With a normal database schema, there are a number of data integrity rules that can be imposed into the structure of the database disallowing entry of invalid data. These include foreign keys, data types and even triggers for more complex integrity checks. Even the table structure itself is a form of data integrity checking since it assures that, for example, a person cannot have two birthdates since there is only one column for birthdate. In an EAV solution, those integrity constraints would have to be handled by the programmer. As pointed out by the thread, trying to do data integrity checking in the database for an EAV setup is very difficult.
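As a rough illustration of the integrity point, the sketch below uses Python's built-in sqlite3 module as a stand-in for a real database server. The table names (person, person_eav) and the helper functions are invented for this example, and the CHECK constraint is a deliberately crude date-shape test. The conventional table rejects a malformed birthdate, while the EAV table accepts both a second birthdate and a nonsense value.

```python
import sqlite3


def build_db():
    """Create one conventional table and one EAV table in memory."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE person (
            id        INTEGER PRIMARY KEY,
            name      TEXT NOT NULL,
            -- crude shape check: YYYY-MM-DD, and only one column to hold it
            birthdate TEXT CHECK (birthdate LIKE '____-__-__')
        );
        CREATE TABLE person_eav (
            entity_id INTEGER NOT NULL,
            attribute TEXT NOT NULL,
            value     TEXT
        );
    """)
    return conn


def try_insert(conn, sql, params):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


if __name__ == "__main__":
    conn = build_db()
    person_sql = "INSERT INTO person (id, name, birthdate) VALUES (?, ?, ?)"
    eav_sql = "INSERT INTO person_eav (entity_id, attribute, value) VALUES (?, ?, ?)"

    print(try_insert(conn, person_sql, (1, "Ann", "1980-01-01")))  # accepted
    print(try_insert(conn, person_sql, (2, "Bob", "not a date")))  # rejected
    print(try_insert(conn, eav_sql, (1, "birthdate", "1980-01-01")))  # accepted
    print(try_insert(conn, eav_sql, (1, "birthdate", "not a date")))  # accepted too
```

Nothing in the EAV table knows that "birthdate" should be a date or that a person should have only one, so those checks end up in application code.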

Probably the most compelling issue is exactly the one we ran into on the web application I work on. Basically, if you store your data in EAV form, it becomes a nightmare to do any kind of decent reporting on that data set in a performant manner. Even though you store it in EAV format, users tend to want to query the data as if the attributes were columns in a mythical table. Thus you write queries where every column must be a join or a subselect back to the main table. Relational databases aren't a good match for this kind of data structure.
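Here is a small sketch of what such a report query tends to look like, again using Python's sqlite3 module as a stand-in. The schema (entity, attr_value) and the sample data are hypothetical and are not our application's actual tables. Each attribute the user wants to see as a column costs one more join back to the value table.

```python
import sqlite3


def build_eav():
    """Create a tiny EAV data set in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE entity (id INTEGER PRIMARY KEY);
        CREATE TABLE attr_value (
            entity_id INTEGER REFERENCES entity(id),
            attribute TEXT,
            value     TEXT
        );
        INSERT INTO entity (id) VALUES (1), (2);
        INSERT INTO attr_value VALUES
            (1, 'name', 'Ann'), (1, 'city', 'Boston'), (1, 'plan', 'gold'),
            (2, 'name', 'Bob'), (2, 'city', 'Denver');
    """)
    return conn


# One LEFT JOIN per attribute that should appear as a report column.
REPORT_SQL = """
    SELECT e.id, n.value AS name, c.value AS city, p.value AS plan
    FROM entity e
    LEFT JOIN attr_value n ON n.entity_id = e.id AND n.attribute = 'name'
    LEFT JOIN attr_value c ON c.entity_id = e.id AND c.attribute = 'city'
    LEFT JOIN attr_value p ON p.entity_id = e.id AND p.attribute = 'plan'
    ORDER BY e.id
"""


def report(conn):
    """Pivot the EAV rows into one row per entity."""
    return conn.execute(REPORT_SQL).fetchall()


if __name__ == "__main__":
    for row in report(build_eav()):
        print(row)
```

With three attributes this is merely ugly. With dozens of user-defined attributes, plus filtering and sorting on them, the join count is what makes the reports slow.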

The seduction of EAV is that you can model any kind of attribute on an entity without having to do schema changes. Often the only alternative to EAV is a very wide, sparsely populated table for each entity. This is especially troublesome if the user can add arbitrary attributes on the fly, as they can in our application.

Suffice it to say that we've learned the hard way the pain of trying to make EAV queries performant. If we were to start over again today, I would love to experiment with using database columns for attributes. Schema changes in Postgres and many other databases can now be done transactionally, so there are far fewer issues related to frequent schema changes.
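As a small sketch of what a transactional schema change means, the example below adds a column inside a transaction and then rolls it back. It uses SQLite through Python's sqlite3 module only because it needs no server; it is a stand-in, not a demonstration of Postgres itself. The table, the column name and the helper functions are hypothetical.

```python
import sqlite3


def column_names(conn, table):
    """List the column names of a table (the name is the second field of each row)."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


def add_column_then_rollback():
    """Add a column inside a transaction, then undo it with ROLLBACK."""
    # isolation_level=None turns off the module's implicit transaction
    # handling, so BEGIN and ROLLBACK below are fully under our control.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")

    conn.execute("BEGIN")
    conn.execute("ALTER TABLE person ADD COLUMN shoe_size INTEGER")
    inside = column_names(conn, "person")   # new column visible here
    conn.execute("ROLLBACK")
    after = column_names(conn, "person")    # and gone again here

    conn.close()
    return inside, after


if __name__ == "__main__":
    inside, after = add_column_then_rollback()
    print("inside transaction:", inside)
    print("after rollback:", after)
```

If adding a user-defined attribute as a real column fails halfway through, the whole change can be rolled back and the schema is left as it was.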

But the reality is that I don't think anyone has come up with a really good way to handle arbitrary attributes on entities and still allow decent performance for queries and reports. No matter what solution you choose, you will eventually run into database limits. With EAV, it's often data querying. With attributes as columns, it's column limits. We've gotten pretty good at it, which is why we are in business, but our design has its tradeoffs and we still struggle with performance at times.

The reason we keep struggling with it, though, is that EAV is very attractive to certain customers. Users love to be able to mold their environment to fit their organization or business and not be told how to structure it. So there is a huge upside on the user side, which is why we continue to look for good solutions to the storage of that data.

Saturday, August 16, 2008

Organize your electronic equipment with pegboard and ties

Tip Testers: DIY Pegboard Home Network Wall

A nice way to organize all those little gadgets that seem to multiply at home. Pegboards and ties.

Setting up Mozilla Weave on your Server

Marios Tziortzis » Blog » Setting up Mozilla Weave on your Server

Mozilla Weave lets you synchronize links, saved passwords and form data between multiple computers. I've used Foxmarks for a couple of years to synchronize bookmarks between computers. But at this point, Foxmarks doesn't do passwords. It appears Weave does. In addition, of course, you can run it on your own server and keep control of your own data, which is nice.

Thursday, August 7, 2008

TrueCrypt - multiplatform on the fly encryption

TrueCrypt - Free Open-Source On-The-Fly Disk Encryption Software for Windows Vista/XP, Mac OS X and Linux

TrueCrypt is a really great tool for creating encrypted volumes on a computer. You enter a password and the volume is mounted like another drive on your computer. You can then use it normally. When it's not in use, you unmount it and the data stays encrypted and unreadable without the password. A must-have if you use a laptop and have company data residing on it. It is arguably needed even for personal desktops w/ personal data.

Versions - OSX svn client

Versions - Mac Subversion Client

I tend to prefer the command line for most things, but for exploring a repository, it's a lot easier to use a GUI client. Versions is by far the best one I've found so far that runs on OSX. It's in beta currently and will eventually cost something to use.