[nycphp-talk] [OT] number of files in a directory?

max goldberg max.goldberg at gmail.com
Tue Jan 3 00:55:05 EST 2006


I was trying to explain that I use a straight md5 sum of each file's
contents as the file name, to keep from storing and serving duplicate
files. When a user uploads a file that already exists in my system, the
upload is discarded and the page just links to the previously uploaded
version. I have no concern about people "stealing" my images, as the
filename has relatively little to do with it. The files are stored under
the 32 character md5 plus a filetype extension.

I guess it really depends on what your major goals are.
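
Roughly, the idea can be sketched like this in PHP. This is only a minimal
sketch, not my production code: the function names asset_path() and
store_asset() are hypothetical, and using the first three characters of the
md5 as the three directory levels is an assumption (the layout quoted below
only says the location can be worked out from the md5).

```php
<?php
// Hypothetical helper: derive the storage path from the file contents.
// Assumption: the first three hex characters of the md5 name the three
// single-character directory levels.
function asset_path(string $root, string $contents, string $ext): string
{
    $hash = md5($contents); // 32 hex characters
    return sprintf('%s/%s/%s/%s/%s.%s',
        $root, $hash[0], $hash[1], $hash[2], $hash, $ext);
}

// Hypothetical helper: write the file only if no identical file is
// already stored, and return the path either way.
function store_asset(string $root, string $contents, string $ext): string
{
    $path = asset_path($root, $contents, $ext);
    if (!file_exists($path)) {
        $dir = dirname($path);
        if (!is_dir($dir)) {
            mkdir($dir, 0755, true); // create all three levels at once
        }
        file_put_contents($path, $contents);
    }
    return $path;
}
```

Uploading the same bytes twice returns the same path, so the second upload
never touches the disk, and the path can be recomputed from the md5 alone.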

On 1/3/06, Anirudh Zala <arzala at gmail.com> wrote:
>
> On Tue, 03 Jan 2006 06:06:58 +0530, max goldberg <max.goldberg at gmail.com>
> wrote:
>
> > I started a site a couple years ago which had a similar problem. Users
> > can create pages and each page had one or many assets.
> >
> > At first I just did /content/page_id/assets.ext, after I got to 32,000
> > directories, it stopped working. At that point I moved to a system where
> > I had 250 directories like, 0, 1000, 2000, 3000, etc. Each of those
> > directories had 1000 sub directories in them, each named after the
> > page_id. After I got to around 300,000 sub directories and a little over
> > a million files, I moved to a completely md5 based system.
> >
> > I tried to avoid keeping track of files in the database as it tends to
> > get messy, but I suppose it's inevitable. The benefits of an md5 system
> > for me outweigh a non md5 system. This may not be the case for you but
> > the main drawing points to me were:
>
> Sometimes you need to store image-related values in the database, such as
> size, height, width and original file name; for that, DB storage is
> inevitable. I also prefer to store this image information in the DB
> directly so I can display it anywhere, rather than finding those values
> on the fly using functions like "getimagesize()". But if you think you
> will not need that information and you just need to display the images,
> then you can avoid using the DB.
>
> >
> > 1) Lowered server I/O, there were quite a few duplicated files. Each
> > time two identical files were read, they were pulled from two different
> > spots on the disk. My site quickly grew to using huge amounts of I/O
> > (around 15,000-20,000 hits a minute on the content server). This
> > definitely helped out for me as some files were duplicated over 1,000
> > times.
> >
> > 2) Made it a lot easier to keep track of what was being used and what
> > wasn't. Without a DB back end I couldn't tell which files I could delete
> > and which I needed to keep without writing a script that basically
> > checked every directory for a matching entry in the database.
> >
> > 3) Lowered disk space. Again duplicate files.
> >
> > 4) Allowed me to ban certain images and other files, and mass delete
> > things that had a certain md5 attached to them. This is very useful if
> > you will ever need to moderate or have troublesome users.
> >
> > The downside is that you have to make sure your code really keeps track
> > of your file system and you aren't accessing it by hand. Another thing
> > you might worry about using md5 is collisions. If this is a mission
> > critical system, you may want to avoid md5 as it is possible (but
> > somewhat unlikely) you will encounter collisions. I've read anyone with
> > a decent computer can create an md5 collision in about an hour, so
> > that's something to keep in mind.
> >
>
> I don't understand how there can be collisions when generating a random
> hash. Consider the method below to generate a 16 character random hash.
>
>
> $hash=substr(md5(md5(time().rand().$_SERVER['REMOTE_ADDR'].microtime()).time()),0,16);
>
> Can you ever get a duplicate hash with the above method? I don't think so,
> although using "md5()" alone might produce one. However, the main purpose
> of using a hash value in the filename is to prevent theft, because this
> way people can't directly download images by guessing the directory and
> file name structure of your website. Moreover, a file name like
>
> \RECORDID_SOME>10CHARACTERHASH.EXT (i.e 123456_aswe34567bg.jpg)
>
> where we combine the primary key ID (123456) of the particular record
> that this image is attached to with the above mentioned 10 to 16
> character hash, will solve both of our purposes. Since RECORDID is always
> unique, you will never have duplication of images, that is for sure.
>
> > The way I structured my file system was three levels of single character
> > directories.
> >
> > /content/a-f0-9/a-f0-9/a-f0-9/filename.ext (4096 directories (16*16*16))
> >
> > This way I can take any asset md5 and figure out its location on the
> > file system without database access, and it leaves ample room for
> > expansion, as well as moving large (or small) chunks to other servers.
> > At this point I am using this system for over a million files and most
> > of the sub directories only have a few hundred files in them tops.
>
> This is another good mechanism for storing a large number of files
> efficiently. But I would like to see a real example of this
> "filename.ext". If it is just like "ASSET.ext", i.e. 1.jpg, 1234.jpg,
> 4567.jpg, then I would say that your system is prone to theft of your
> images. md5 or another kind of hash gives you protection against that,
> since people can't guess the exact file names of your images, and even if
> they try hard, they won't benefit much.
>
> >
> > If you decide to use a single directory approach you will most likely
> > run into quite a few problems. I remember when I had around 15-20,000
> > files, everything I did in that directory became extra slow. With the
> > setup I'm using now I don't really get any lag.
> >
> > Hope that helped.
> > -Max
> >
> > On 12/31/05, Marc Antony Vose <suzerain at suzerain.com> wrote:
> >>
> >> Hey all:
> >>
> >> First of all:  Happy New Year!
> >>
> >> Secondly: I am rebuilding a site that was coded somewhat sloppily,
> >> and they have product images all stored in one directory (a script
> >> that I am not writing auto-uploads them to the web server from
> >> elsewhere).  Presently, this directory contains about 33,000 files.
> >> It will be more like 75,000 when the site launches, if things remain
> >> the same.
> >>
> >> The question is:  should I be worried about this, or was this only a
> >> problem several years ago? (I remember people at one time attempting
> >> to not put too many files in one place.)
> >>
> >> If I should be worried, what could happen?  Will we ever reach a hard
> >> limit of files per directory?
> >>
> >> Is it better if each product instead has its own directory inside
> >> there (i.e., 75,000 directories), each with as many files as we need
> >> inside, or is that just the same problem?
> >>
> >> Cheers,
> >>
> >> --
> >> Marc Antony Vose
> >> http://www.suzerain.com/
> >>
> >> Imagination is more important than knowledge.
> >> -- Albert Einstein
> >> _______________________________________________
> >> New York PHP Talk Mailing List
> >> AMP Technology
> >> Supporting Apache, MySQL and PHP
> >> http://lists.nyphp.org/mailman/listinfo/talk
> >> http://www.nyphp.org
> >>
> >
>
>
>
> --
> -----------------------------------------------------
> Anirudh Zala (Production Manager)
> ASPL, http://www.aspl.in
> Ph: +91 281 245 1894
> arzala at gmail.com
> -----------------------------------------------------