[nycphp-talk] [OT] number of files in a directory?

max goldberg max.goldberg at gmail.com
Mon Jan 2 19:36:58 EST 2006


I started a site a couple of years ago that had a similar problem. Users could
create pages, and each page had one or more assets.

At first I just did /content/page_id/assets.ext. Once I got to 32,000
directories, it stopped working. At that point I moved to a system where I
had 250 directories named 0, 1000, 2000, 3000, etc. Each of those
directories had 1000 subdirectories in them, each named after a page_id.
After I got to around 300,000 subdirectories and a little over a million
files, I moved to a completely md5-based system.
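
For illustration, the bucketed layout described above could be computed
roughly like this. This is only a sketch of my reading of the description:
the function name, the /content root as a default, and the rule "round the
page_id down to the nearest 1000" are assumptions, not the original code.

```python
import posixpath

BUCKET_SIZE = 1000  # assumption: each top-level directory covers 1000 page ids


def bucketed_page_dir(page_id: int, root: str = "/content") -> str:
    """Return the directory for a page, e.g. /content/2000/2345 for page 2345."""
    if page_id < 0:
        raise ValueError("page_id must be non-negative")
    # Round down to the nearest multiple of BUCKET_SIZE to pick the bucket.
    bucket = (page_id // BUCKET_SIZE) * BUCKET_SIZE
    return posixpath.join(root, str(bucket), str(page_id))
```

The point of the extra level is only to keep the number of entries in any
one directory bounded. It does nothing about duplicate files.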

I tried to avoid keeping track of files in the database, as it tends to get
messy, but I suppose it's inevitable. For me, the benefits of an md5-based
system outweigh those of a non-md5 system. This may not be the case for you,
but the main selling points for me were:

1) Lowered server I/O. There were quite a few duplicated files, and each time
two identical files were read, they were pulled from two different spots on
the disk. My site quickly grew to use huge amounts of I/O (around
15,000-20,000 hits a minute on the content server). This definitely helped,
as some files were duplicated over 1,000 times.

2) Made it a lot easier to keep track of what was being used and what
wasn't. Without tracking files in the DB, I couldn't tell which files I could
delete and which I needed to keep, short of writing a script that basically
checked every directory for a matching entry in the database.

3) Lowered disk usage. Again, duplicate files.

4) Allowed me to ban certain images and other files, and to mass delete
everything that had a certain md5 attached to it. This is very useful if you
ever need to moderate content or deal with troublesome users.
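
To make the four points above a bit more concrete, here is a toy sketch of
the bookkeeping involved. Everything in it is hypothetical: the AssetIndex
class and its method names are made up for illustration, and plain
dictionaries stand in for both the database table and the disk.

```python
import hashlib


class AssetIndex:
    """Toy stand-in for the database table that tracks stored files."""

    def __init__(self):
        self.blobs = {}      # md5 hex digest -> file bytes (stands in for the disk)
        self.refs = {}       # md5 hex digest -> set of page ids using that file
        self.banned = set()  # digests that may not be stored at all

    def add(self, page_id, data):
        """Store data for a page; identical content is kept only once."""
        digest = hashlib.md5(data).hexdigest()
        if digest in self.banned:
            raise PermissionError("banned file: " + digest)
        self.blobs.setdefault(digest, data)
        self.refs.setdefault(digest, set()).add(page_id)
        return digest

    def release(self, page_id, digest):
        """Record that a page no longer uses a file."""
        self.refs.get(digest, set()).discard(page_id)

    def unused(self):
        """Digests that no page references any more (candidates for deletion)."""
        return [d for d in self.blobs if not self.refs.get(d)]

    def ban(self, digest):
        """Ban a digest, drop the stored copy, and return the affected pages."""
        self.banned.add(digest)
        self.blobs.pop(digest, None)
        return self.refs.pop(digest, set())
```

With something like this, two pages uploading the same image end up sharing
one stored copy, unused() answers "what can I delete?", and ban() removes a
file for every page at once.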

The downside is that you have to make sure your code really keeps track of
your file system and that you aren't accessing it by hand. Another thing you
might worry about with md5 is collisions. If this is a mission-critical
system, you may want to avoid md5, as it is possible (but somewhat unlikely)
that you will encounter collisions. I've read that anyone with a decent
computer can create an md5 collision in about an hour, so that's something to
keep in mind.
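
One possible guard against collisions, sketched here as an assumption rather
than anything from the original setup, is to treat a matching md5 only as a
hint and compare the actual bytes before deduplicating. The function name and
the read_existing callback are hypothetical.

```python
import hashlib


def safe_to_dedupe(new_data, existing_digest, read_existing):
    """Return True only if the digest matches AND the stored bytes are identical.

    read_existing is a zero-argument callable that returns the bytes of the
    file already stored under existing_digest (hypothetical helper).
    """
    if hashlib.md5(new_data).hexdigest() != existing_digest:
        return False
    # Digest matches: confirm with a full byte comparison to rule out a collision.
    return read_existing() == new_data
```

The extra read only happens when an upload's digest matches an existing
file, so the cost is paid on duplicates rather than on every upload.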

The way I structured my file system was three levels of single-character
directories, each named with one hex digit (0-9, a-f):

/content/[0-9a-f]/[0-9a-f]/[0-9a-f]/filename.ext (16*16*16 = 4096 leaf directories)

This way I can take any asset's md5 and figure out its location on the file
system without database access. It also leaves ample room for expansion and
for moving large (or small) chunks to other servers. At this point I am using
this system for over a million files, and most of the subdirectories only
have a few hundred files in them, tops.
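
A minimal sketch of that lookup follows. It assumes the three directory
levels are the first three hex characters of the md5 and that the file is
named after the full digest plus its extension. Neither detail is spelled out
above, so treat both as guesses, along with the function name.

```python
import hashlib
import posixpath


def asset_path(data: bytes, ext: str, root: str = "/content") -> str:
    """Map file contents to /content/x/y/z/<md5><ext> (assumed layout)."""
    digest = hashlib.md5(data).hexdigest()  # 32 lowercase hex characters
    # Assumption: the first three hex characters pick the three directory levels.
    return posixpath.join(root, digest[0], digest[1], digest[2], digest + ext)
```

Because md5 output is spread evenly, a million files over 4096 leaf
directories works out to roughly 1,000,000 / 4096, or about 244 files each,
which lines up with "a few hundred."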

If you decide to use a single-directory approach, you will most likely run
into quite a few problems. I remember when I had around 15,000-20,000 files,
everything I did in that directory became extra slow. With the setup I'm
using now I don't really get any lag.

Hope that helped.
-Max

On 12/31/05, Marc Antony Vose <suzerain at suzerain.com> wrote:
>
> Hey all:
>
> First of all:  Happy New Year!
>
> Secondly: I am rebuilding a site that was coded somewhat sloppily,
> and they have product images all stored in one directory (a script
> that I am not writing auto-uploads them to the web server from
> elsewhere).  Presently, this directory contains about 33,000 files.
> It will be more like 75,000 when the site launches, if things remain
> the same.
>
> The question is:  should I be worried about this, or was this only a
> problem several years ago? (I remember people at one time attempting
> to not put too many files in one place.)
>
> If I should be worried, what could happen?  Will we ever reach a hard
> limit of files per directory?
>
> Is it better if each product instead has its own directory inside
> there (i.e., 75,000 directories), each with as many files as we need
> inside, or is that just the same problem?
>
> Cheers,
>
> --
> Marc Antony Vose
> http://www.suzerain.com/
>
> Imagination is more important than knowledge.
> -- Albert Einstein
> _______________________________________________
> New York PHP Talk Mailing List
> AMP Technology
> Supporting Apache, MySQL and PHP
> http://lists.nyphp.org/mailman/listinfo/talk
> http://www.nyphp.org
>