Feb
04
posted at: 12:44 AM
I did some thinking about data tonight while writing a web service to wrap a data object for a little side experiment brando and I are doing, which, by the way, will be officially unveiled as an Alpha once we get production hosting up and a few more of the essential features out of the way.
Basically, the problem statement goes like this: we are creating a site that, if it succeeded and gained a user base of, say, 50,000 users, could do some serious data damage. The reason is that users can contribute to the site with little or no effort, and if each of those 50,000 users generated 15 items of content per day, well... we're talking about potentially 750,000 rows of data per day. And you thought myspace was slow.
With hundreds of millions of rows of data looming on the horizon if we experienced rapid user growth, we'd have to find a way to offload it. And quick.
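To put numbers on that, here is a quick back-of-the-envelope sketch. The 50,000 users and 15 items per day are the guesses from above; the function names and the one-year horizon are just mine for illustration.

```python
# Back-of-the-envelope growth estimate using the figures from the post.
USERS = 50_000                 # hoped-for user base
ITEMS_PER_USER_PER_DAY = 15    # guessed contribution rate


def rows_per_day(users: int = USERS, items: int = ITEMS_PER_USER_PER_DAY) -> int:
    """Rows written per day if every user contributes at the guessed rate."""
    return users * items


def rows_after(days: int, users: int = USERS, items: int = ITEMS_PER_USER_PER_DAY) -> int:
    """Total rows accumulated after `days` days at a constant rate."""
    return rows_per_day(users, items) * days


if __name__ == "__main__":
    print(f"{rows_per_day():,} rows per day")
    print(f"{rows_after(365):,} rows after one year")
```

At a constant rate that is a bit over a quarter of a billion rows in the first year, which is where "hundreds of millions" comes from.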
Since every record of data is time-sensitive, very recent data is worth much more to the user experience than data that is, say, six months old. That said, we could archive older data to additional data silos on a nightly basis. But we're not off the hook with just that.
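A nightly archive job could look roughly like the sketch below. This is only an illustration using SQLite from the Python standard library: the `items` table, its columns, and the 180-day cutoff are all made up for the example, and a real job would talk to whatever database the live site and the silos actually run on.

```python
import sqlite3
from datetime import datetime, timedelta

# Hypothetical schema shared by the live database and the archive silo.
SCHEMA = (
    "CREATE TABLE IF NOT EXISTS items ("
    "id INTEGER PRIMARY KEY, user_id INTEGER, created_at TEXT, body TEXT)"
)


def archive_old_rows(live: sqlite3.Connection,
                     archive: sqlite3.Connection,
                     now: datetime,
                     keep_days: int = 180) -> int:
    """Copy rows older than `keep_days` into the archive, then delete them
    from the live database. Returns the number of rows moved."""
    # created_at is stored as ISO 8601 text, so string comparison orders by time.
    cutoff = (now - timedelta(days=keep_days)).isoformat()
    rows = live.execute(
        "SELECT id, user_id, created_at, body FROM items WHERE created_at < ?",
        (cutoff,),
    ).fetchall()
    # Write to the archive first, so a failure never loses data from live.
    archive.executemany("INSERT INTO items VALUES (?, ?, ?, ?)", rows)
    archive.commit()
    live.execute("DELETE FROM items WHERE created_at < ?", (cutoff,))
    live.commit()
    return len(rows)
```

Copying before deleting means a crash halfway through leaves duplicates rather than holes, which is the easier mess to clean up.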
There would also be scenarios where users request data that has been archived. Perhaps there would be a page that aggregates a user's usage data over the entire lifespan of that user's account. The solution to this scenario is Service Oriented Architecture (SOA). Each data silo would be capped at a certain size and would hold data spanning a specific time range. A web service provider in front of each silo would answer SOAP requests for data in specific time spans.
So say every 267 days we create a new data silo (about 200 million rows of data at 750,000 rows per day). If I wanted to get an aggregate of my user data, the application would make one web request to a silo for each 267-day timespan over the life of the user. Maybe we could cache the result in the application scope for pagination purposes, but I'm sure you're getting the point here: you sacrifice initial page load time on a legacy data request to avoid massive database bloat on your production server.
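Here is a rough sketch of that fan-out. Everything named here is hypothetical: `EPOCH` stands in for whatever day the first silo starts, `fetch` stands in for the SOAP call to a single silo, and the plain dict plays the role of the application-scope cache.

```python
from datetime import date
from typing import Callable, Dict, List, Tuple

SILO_SPAN_DAYS = 267        # one silo per 267 days, as described above
EPOCH = date(2000, 1, 1)    # hypothetical start date of the first silo


def silo_index(day: date) -> int:
    """Which silo holds data written on `day` (0 is the first silo)."""
    return (day - EPOCH).days // SILO_SPAN_DAYS


def silos_for_user(signup: date, today: date) -> List[int]:
    """Every silo that could hold rows for a user active from signup to today."""
    return list(range(silo_index(signup), silo_index(today) + 1))


def aggregate(user_id: int,
              signup: date,
              today: date,
              fetch: Callable[[int, int], int],
              cache: Dict[Tuple[int, date], int]) -> int:
    """Sum a per-silo figure over the user's lifetime: one request per silo,
    cached so that paging through the results does not repeat the requests."""
    key = (user_id, today)
    if key not in cache:
        cache[key] = sum(fetch(silo, user_id)
                         for silo in silos_for_user(signup, today))
    return cache[key]
```

A user who signed up 600 days ago touches three silos, so the first pageview pays for three web requests and every later page is served from the cache.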
Distributed applications rock.