[nycphp-talk] [OT] number of files in a directory?

Kerem Tuzemen keremtuzemen at gmail.com
Mon Jan 2 20:21:12 EST 2006


Hi Folks, 

Have you ever considered using reiserfs? I would highly recommend it, especially if you are dealing with small files.

Take a look at http://www.namesys.com/ or google around for reiserfs.

They have good documentation and a bunch of benchmarks.

I am not sure whether version 4 of reiserfs is stable yet; it's been a while since I read the site thoroughly. But in my experience, version 3, which comes standard with most Linux distros, is a solid solution. (I am currently using it in production on a very busy server serving around 2 million PHP pages with an average of 10 photos per page. Guess where the photos are stored.) Just enable reiserfs support in the kernel, format your partition with reiserfs, and you're ready to go.

I hope this helps everybody who is looking for a similar solution.

Cheers,

Kerem Tuzemen
[Shameless plug] Ex NewYorker trapped in DC who is looking for a job in the NYC [/Shameless plug] 


  ----- Original Message ----- 
  From: max goldberg 
  To: NYPHP Talk 
  Sent: Monday, January 02, 2006 7:36 PM
  Subject: Re: [nycphp-talk] [OT] number of files in a directory?


  I started a site a couple of years ago which had a similar problem. Users could create pages, and each page had one or more assets.

  At first I just did /content/page_id/assets.ext. After I got to 32,000 directories, it stopped working. At that point I moved to a system where I had 250 top-level directories named 0, 1000, 2000, 3000, etc. Each of those directories had 1000 subdirectories in it, each named after a page_id. After I got to around 300,000 subdirectories and a little over a million files, I moved to a completely md5-based system.
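The bucketed layout described above can be sketched as follows. This is a minimal illustration in Python, not the code the site actually ran; the function name `bucketed_dir`, the `/content` root, and the rule that each bucket is named after the lowest page_id it holds are assumptions drawn from the "0, 1000, 2000, 3000" naming.

```python
import posixpath

# Assumption: each top-level bucket covers 1000 consecutive page_ids.
BUCKET_SIZE = 1000


def bucketed_dir(page_id, root="/content"):
    """Return the directory for a page, e.g. /content/1000/1234.

    The bucket name is the page_id rounded down to a multiple of
    BUCKET_SIZE, so no top-level directory holds more than
    BUCKET_SIZE subdirectories.
    """
    if page_id < 0:
        raise ValueError("page_id must be non-negative")
    bucket = (page_id // BUCKET_SIZE) * BUCKET_SIZE
    return posixpath.join(root, str(bucket), str(page_id))
```

The path is computed from the page_id alone, so no lookup table is needed, but the number of top-level buckets still grows with the number of pages.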

  I tried to avoid keeping track of files in the database as it tends to get messy, but I suppose it's inevitable. For me, the benefits of an md5 system outweigh those of a non-md5 system. This may not be the case for you, but the main draws for me were:

  1) Lowered server I/O. There were quite a few duplicated files, and each time two identical files were read, they were pulled from two different spots on the disk. My site quickly grew to using huge amounts of I/O (around 15,000-20,000 hits a minute on the content server). This definitely helped, as some files were duplicated over 1,000 times.

  2) Made it a lot easier to keep track of what was being used and what wasn't. Without a DB back end I couldn't tell which files I could delete and which I needed to keep without writing a script that basically checked every directory for a matching entry in the database. 

  3) Lowered disk usage. Again, duplicate files.

  4) Allowed me to ban certain images and other files, and to mass delete everything that had a certain md5 attached to it. This is very useful if you ever need to moderate content or have troublesome users.
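The deduplication and banning points in the list above can be illustrated with a small sketch. It is written in Python for illustration only and makes several assumptions: the function name `store_asset`, the flat `store_dir` layout, and the in-memory `banned` set are all hypothetical stand-ins, not the poster's implementation.

```python
import hashlib
import os
from typing import Optional, Set


def store_asset(data: bytes, store_dir: str, banned: Set[str]) -> Optional[str]:
    """Store data under its md5 digest and return the digest.

    Returns None without writing anything if the digest is banned.
    Identical content always maps to the same file name, so a
    duplicate upload costs no extra disk space.
    """
    digest = hashlib.md5(data).hexdigest()
    if digest in banned:
        return None
    path = os.path.join(store_dir, digest)
    if not os.path.exists(path):  # write each distinct content only once
        with open(path, "wb") as f:
            f.write(data)
    return digest
```

Because the digest is the key, banning a file or deleting every copy of it comes down to a single digest lookup rather than a scan of every page's directory.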

  The downside is that you have to make sure your code really keeps track of your file system and that you aren't accessing it by hand. Another thing you might worry about with md5 is collisions. If this is a mission-critical system, you may want to avoid md5, as it is possible (but somewhat unlikely) that you will encounter collisions. I've read that anyone with a decent computer can create an md5 collision in about an hour, so that's something to keep in mind.

  The way I structured my file system was three levels of single character directories. 

  /content/[0-9a-f]/[0-9a-f]/[0-9a-f]/filename.ext (16*16*16 = 4096 directories)

  This way I can take any asset's md5 and figure out its location on the file system without database access. It also leaves ample room for expansion, as well as for moving large (or small) chunks to other servers. At this point I am using this system for over a million files, and most of the subdirectories only have a few hundred files in them, tops.

  If you decide to use a single-directory approach, you will most likely run into quite a few problems. I remember when I had around 15,000-20,000 files, everything I did in that directory became extra slow. With the setup I'm using now, I don't really get any lag.
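A minimal sketch of how a path could be derived from an md5 under this three-level layout, written in Python for illustration. The post does not say which characters of the digest name the directories or how the file itself is named, so using the first three hex characters and naming the file `<digest><ext>` are assumptions, as are the function names.

```python
import hashlib
import posixpath

HEX_DIGITS = set("0123456789abcdef")


def md5_of_file(path, chunk_size=65536):
    """Hash a file in chunks so large assets are not loaded into memory at once."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def sharded_path(digest, ext, root="/content"):
    """Map a 32-character md5 hex digest to root/x/y/z/<digest><ext>.

    Assumption: x, y, z are the first three hex characters of the digest.
    """
    digest = digest.lower()
    if len(digest) != 32 or not set(digest) <= HEX_DIGITS:
        raise ValueError("expected a 32-character md5 hex digest")
    return posixpath.join(root, digest[0], digest[1], digest[2], digest + ext)
```

Since md5 output is close to uniformly distributed, files should spread roughly evenly across the 4096 leaf directories, which matches the "few hundred files" per directory reported for about a million files.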

  Hope that helped.
  -Max

  On 12/31/05, Marc Antony Vose <suzerain at suzerain.com> wrote:
    Hey all:

    First of all:  Happy New Year!

    Secondly: I am rebuilding a site that was coded somewhat sloppily,
    and they have product images all stored in one directory (a script
    that I am not writing auto-uploads them to the web server from
    elsewhere).  Presently, this directory contains about 33,000 files. 
    It will be more like 75,000 when the site launches, if things remain
    the same.

    The question is:  should I be worried about this, or was this only a
    problem several years ago? (I remember people at one time attempting 
    to not put too many files in one place.)

    If I should be worried, what could happen?  Will we ever reach a hard
    limit of files per directory?

    Is it better if each product instead has its own directory inside 
    there (i.e., 75,000 directories), each with as many files as we need
    inside, or is that just the same problem?

    Cheers,

    --
    Marc Antony Vose
    http://www.suzerain.com/ 

    Imagination is more important than knowledge.
    -- Albert Einstein
    _______________________________________________
    New York PHP Talk Mailing List
    AMP Technology
    Supporting Apache, MySQL and PHP
    http://lists.nyphp.org/mailman/listinfo/talk
    http://www.nyphp.org




