[nycphp-talk] [OT] number of files in a directory?

Marc Antony Vose suzerain at suzerain.com
Tue Jan 3 02:18:31 EST 2006


Hey y'all:

Thanks for the thoughtful notes.  As usual, people here have given me 
entirely new ways to think about things.

In my case, I'm not accepting uploads from web site visitors; all 
these images are coming from the owner of the site.  They are a store 
that sells one-of-a-kind rare items, and at any given time there will 
be an inventory of ~75,000.

Each item in the inventory will have something like 6-10 images, each 
used for different purposes (for example, we will be displaying 
rotatable versions of the products via Flash).

However, they've indicated a future desire to be able to offer people 
order histories, and so forth, so for the time being sold items will 
be kept in the system.  So the probability of getting into the 
millions pretty quickly with images is definitely there.

It seems, then, like the most sensible idea for me is to create md5() 
hashed directories based on the product ID number (rather than image 
ID number), as I don't think it's necessary for me to store 
information about every image in the database.

So, if I translated the product ID number into something like

IMAGES_DIR/ab5/81d

I could then store inside that directory any images which are needed 
for the particular product.  Something like:

12345_signature.jpg
12345_rotate_1.jpg
12345_rotate_2.jpg
12345_rotate_3.jpg
12345_rotate_4.jpg
12345_rotate_5.jpg
12345_rotate_6.jpg
12345_thumb.jpg
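
As a rough sketch of that layout, the path for any one image could be
built like this. The value of IMAGES_DIR, the function name
product_image_path() and its $role parameter are made up for
illustration, and the type declarations assume PHP 7 or later:

```php
<?php
// Assumed base directory; the real IMAGES_DIR would come from site config.
const IMAGES_DIR = '/var/www/images';

/**
 * Build the full path for one image of a product (hypothetical helper).
 *
 * The directory is taken from the first six hex characters of
 * md5(product ID), split into two levels of three characters.
 * The filename keeps the product ID as a prefix, so two products that
 * land in the same directory cannot overwrite each other's files.
 */
function product_image_path(int $productId, string $role, string $baseDir = IMAGES_DIR): string
{
    $hash   = md5((string) $productId);
    $subdir = substr($hash, 0, 3) . '/' . substr($hash, 3, 3);

    return sprintf('%s/%s/%d_%s.jpg', $baseDir, $subdir, $productId, $role);
}

// Example: path for the first rotation frame of product 12345.
echo product_image_path(12345, 'rotate_1'), PHP_EOL;
```

For product ID 1234 this gives the 81d/c9b directory, which matches the
example in the quoted message below.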

I suppose it is probable under this scenario that two different
products will end up with the same path, since we're only using the
first 6 characters of the hash, but it shouldn't really matter as
long as I have the images keyed with the product ID as well.
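
To put a rough number on that: two levels of three hex characters give
16^6 = 16,777,216 possible directories. Here is a back-of-the-envelope
estimate, assuming md5() spreads the IDs evenly and using the ~75,000
inventory figure from above (the function name is just for
illustration):

```php
<?php
/**
 * Expected number of pairs of items that share a bucket when $items
 * items are spread uniformly at random over $buckets buckets.
 * This is the standard birthday-problem estimate n(n-1) / (2N).
 */
function expected_shared_pairs(int $items, int $buckets): float
{
    return $items * ($items - 1) / (2 * $buckets);
}

$directories = pow(16, 6); // two levels of three hex characters each
$products    = 75000;      // approximate inventory size

printf(
    "%d directories, about %.1f pairs of products sharing one\n",
    $directories,
    expected_shared_pairs($products, $directories)
);
```

That works out to roughly 168 pairs of products sharing a directory, so
some sharing is all but certain, but each directory still holds only a
handful of files.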

Anyone see any major red flags with this strategy?

Cheers,

Marc


>max goldberg wrote:
>>  The downside is that you have to make sure your code really keeps track
>>  of your file system and you aren't accessing it by hand. Another thing
>>  you might worry about using md5 is collisions. If this is a mission
>>  critical system, you may want to avoid md5 as it is possible (but
>>  somewhat unlikely) you will encounter collisions. I've read anyone with
>>  a decent computer can create an md5 collision in about an hour, so
>>  that's something to keep in mind.
>
>Yeah, this is probably the best solution.  To avoid collisions what
>you want to do is assign a unique database ID to every asset, use that
>ID to create the MD5 hash, then store the asset with a filename
>containing that unique ID.  That should eliminate collisions.  The worst
>that can happen is that you'll have two different files in the same
>directory but with different filenames, which is cool.
>
>A function like this could be used to both plant the file in the MD5
>filesystem and extract its path later on based on that unique ID:
>
>function get_upload_target($file_id) {
>    // Hash the ID; the first six hex characters become two directory levels.
>    $hash_id = md5($file_id);
>    $subdir = substr($hash_id, 0, 3) . '/' . substr($hash_id, 3, 3);
>    return $subdir;
>}
>
>Use case: someone uploads the file "mykitty.jpg" and it's inserted into
>the database as id=1234.  get_upload_target(1234) returns:
>
>   81d/c9b
>
>The file is then written as $ASSET_DIR/81d/c9b/1234
>
>Or 1234.jpg, or 1234.mykitty.jpg, whatever.  I like to give the file a
>recognizable file type extension.
>
>To extract that file later, just run the ID through get_upload_target()
>again to build the filesystem path.


