[nycphp-talk] [OT] number of files in a directory?

tedd tedd at sperling.com
Mon Jan 2 19:57:13 EST 2006


>I started a site a couple years ago which had a similar problem. 
>Users can create pages and each page had one or many assets.
>
>At first I just did /content/page_id/assets.ext. After I got to 
>32,000 directories, it stopped working. At that point I moved to a 
>system where I had 250 directories named 0, 1000, 2000, 3000, etc. 
>Each of those directories had 1000 sub directories in it, each 
>named after the page_id. After I got to around 300,000 sub 
>directories and a little over a million files, I moved to a 
>completely md5 based system.
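
For anyone following along, the intermediate bucketed layout described above could be sketched roughly like this. The function name, the BUCKET_SIZE constant and the /content root are my own placeholders, not anything from Max's code, and I am assuming the bucket name is simply the page_id rounded down to the nearest 1000.

```python
from pathlib import PurePosixPath

# Assumption: 1000 page directories per top-level bucket, as described above.
BUCKET_SIZE = 1000


def bucketed_page_dir(page_id: int, root: str = "/content") -> PurePosixPath:
    """Return the directory for a page under the bucketed layout.

    Hypothetical helper: the bucket is page_id rounded down to a
    multiple of BUCKET_SIZE, so page 1234 lives under bucket "1000".
    """
    if page_id < 0:
        raise ValueError("page_id must be non-negative")
    bucket = (page_id // BUCKET_SIZE) * BUCKET_SIZE
    return PurePosixPath(root) / str(bucket) / str(page_id)


if __name__ == "__main__":
    print(bucketed_page_dir(1234))  # /content/1000/1234
```

The 32,000 wall in the first layout is probably the per-directory subdirectory limit of the filesystem (ext3 tops out at roughly that number), which is why adding one level of buckets was enough to get past it.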
>
>I tried to avoid keeping track of files in the database as it tends 
>to get messy, but I suppose it's inevitable. The benefits of an md5 
>system for me outweigh a non md5 system. This may not be the case 
>for you but the main drawing points to me were:
>
>1) Lowered server I/O, there were quite a few duplicated files. Each 
>time two identical files were read, they were pulled from two 
>different spots on the disk. My site quickly grew to using huge 
>amounts of I/O. (around 15,000-20,000 hits a minute on the content 
>server). This definitely helped out for me as some files were 
>duplicated over 1,000 times.
>
>2) Made it a lot easier to keep track of what was being used and 
>what wasn't. Without a DB back end I couldn't tell which files I 
>could delete and which I needed to keep without writing a script 
>that basically checked every directory for a matching entry in the 
>database.
>
>3) Lowered disk space. Again duplicate files.
>
>4) Allowed me to ban certain images and other files, and to mass 
>delete anything that had a certain md5 attached to it. This is very 
>useful if you ever need to moderate content or have troublesome users.
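
A minimal sketch of how points 1 through 4 fall out of keying assets by their md5. This is only an illustration: the AssetIndex class and its method names are made up, and the dictionaries stand in for whatever database tables the real site uses.

```python
import hashlib


class AssetIndex:
    """In-memory stand-in for the database that tracks assets (hypothetical)."""

    def __init__(self):
        self.refs = {}       # md5 hex digest -> set of page_ids that use it
        self.banned = set()  # md5 hex digests that may not be stored

    def add(self, page_id, data: bytes) -> str:
        """Register an asset for a page; identical bytes share one digest."""
        digest = hashlib.md5(data).hexdigest()
        if digest in self.banned:
            raise PermissionError(f"asset {digest} is banned")
        self.refs.setdefault(digest, set()).add(page_id)
        return digest

    def remove(self, page_id, digest: str) -> None:
        """Drop one page's reference to an asset."""
        self.refs.get(digest, set()).discard(page_id)

    def unused(self):
        """Digests no page references any more, i.e. files safe to delete."""
        return [d for d, pages in self.refs.items() if not pages]

    def ban(self, digest: str):
        """Ban a digest and return the pages that were using it."""
        self.banned.add(digest)
        return self.refs.pop(digest, set())


if __name__ == "__main__":
    index = AssetIndex()
    first = index.add(1, b"same bytes")
    second = index.add(2, b"same bytes")
    print(first == second)    # True: one stored file serves both pages
    print(sorted(index.ban(first)))  # [1, 2]
```

Since a duplicate upload maps to the same digest, it is stored and read once, and a ban or a cleanup pass is a lookup on one key rather than a walk over every directory.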
>
>The downside is that you have to make sure your code really keeps 
>track of your file system and you aren't accessing it by hand. 
>Another thing you might worry about with md5 is collisions. If this 
>is a mission critical system, you may want to avoid md5, as it is 
>possible (though very unlikely by accident) that you will encounter 
>collisions. I've read that anyone with a decent computer can 
>deliberately create an md5 collision in about an hour, so that's 
>something to keep in mind.
>
>The way I structured my file system was three levels of single 
>character directories.
>
>/content/[a-f0-9]/[a-f0-9]/[a-f0-9]/filename.ext (16*16*16 = 4096 directories)
>
>This way I can take any asset md5 and figure out its location on 
>the file system without database access. It also leaves ample room 
>for expansion and makes it easy to move large (or small) chunks to 
>other servers. At this point I am using this system for over a 
>million files, and most of the sub directories only have a few 
>hundred files in them, tops.
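
The lookup described above might look something like the sketch below. Two things here are my guesses and not stated in the post: that the three directory levels are the first three hex characters of the digest, and that the stored filename is the full digest plus the original extension. asset_path is a made-up name.

```python
import hashlib
from pathlib import PurePosixPath


def asset_path(data: bytes, ext: str, root: str = "/content") -> PurePosixPath:
    """Map file contents to a path three single-character levels deep.

    Assumptions (not from the original post): the first three hex
    characters of the md5 digest name the directories, and the file
    itself is named <digest>.<ext>.
    """
    digest = hashlib.md5(data).hexdigest()
    level1, level2, level3 = digest[0], digest[1], digest[2]
    return PurePosixPath(root) / level1 / level2 / level3 / f"{digest}.{ext}"


if __name__ == "__main__":
    # The md5 of empty input is d41d8cd98f00b204e9800998ecf8427e.
    print(asset_path(b"", "png"))
    # /content/d/4/1/d41d8cd98f00b204e9800998ecf8427e.png
```

With 4096 leaf directories, a million files works out to about 244 per directory on average, which matches the "few hundred" figure. If deliberate collisions are a concern, swapping hashlib.md5 for hashlib.sha256 leaves the layout unchanged.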
>
>If you decide to use a single directory approach, you will most 
>likely run into quite a few problems. I remember when I had around 
>15,000-20,000 files, everything I did in that directory became extra 
>slow. With the setup I'm using now I don't really get any lag.
>
>Hope that helped.
>-Max
>
>

This is beginning to sound like a problem that a binary tree 
might solve. Does anyone have any references for B-trees in PHP?

tedd
-- 
--------------------------------------------------------------------------------
http://sperling.com/


