[nycphp-talk] CSV file Reading

Michael B Allen mba2000 at ioplex.com
Mon Mar 26 11:41:06 EDT 2007


On Mon, 26 Mar 2007 19:58:50 +0530
"Aniesh joseph" <anieshjoseph at gmail.com> wrote:

> Hello
> 
> 
> I have to read large CSV files, up to 10 MB in size. I tried to read each
> line using the getcsv() method, but it did not work well. I have to run some
> checks on the contents of the CSV files, such as finding duplicate rows or
> rows with missing contents.
> 
> Can anybody suggest a method for reading large CSV files?

The trick to parsing large files is to completely process each line and
then discard it before reading the next one. The memory for strings that
have already been processed should then be reclaimed, although I'm not
certain that it will be. You might want to run a simple test that reads
and discards lines from a file that is much bigger than the memory limit.
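
As a rough illustration of that model, here is a minimal sketch in C (the
same language as the parser further down). It reads a stream one line at a
time into a single fixed buffer and counts rows that contain an empty field.
The names count_rows_with_empty_fields, has_empty_field and LINE_MAX_LEN are
made up for this example, quoting is deliberately ignored, and lines longer
than the buffer are assumed not to occur.

```c
#include <assert.h>
#include <stdio.h>

#define LINE_MAX_LEN 4096  /* assumed upper bound on one line (illustrative) */

/* Return 1 if the line has an empty field, else 0. Quoting is ignored. */
static int
has_empty_field(const char *line, int sep)
{
    const char *p;
    int at_field_start = 1;

    for (p = line;; p++) {
        if (*p == sep || *p == '\n' || *p == '\r' || *p == '\0') {
            if (at_field_start)
                return 1;          /* nothing between two delimiters */
            if (*p != sep)
                return 0;          /* end of line, all fields had content */
            at_field_start = 1;
        } else {
            at_field_start = 0;
        }
    }
}

/* Process and discard each line in turn; memory use stays constant. */
long
count_rows_with_empty_fields(FILE *fp, int sep)
{
    char line[LINE_MAX_LEN];
    long bad = 0;

    assert(fp != NULL);
    while (fgets(line, sizeof line, fp) != NULL) {
        if (has_empty_field(line, sep))
            bad++;
        /* nothing is kept: the next fgets() overwrites the buffer */
    }
    return bad;
}
```

Checking for duplicate rows would need some per-row state, for example a
hash of each row, but probably not the rows themselves.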

If the CSV interface you're using right now doesn't support that model
then you'll have to write your own CSV parser. The code below is one
that I wrote in C. Fortunately PHP is very much like C, so it shouldn't
be too hard to translate. It's used in production environments by
major software products, free and otherwise. If you do translate to PHP,
perhaps you can post it back on the list.

Note that this code looks complicated, but it's actually one of the
smallest CSV parsers you'll find, and it handles more cases correctly
than just about any other. Parsing quotes and quotes within quotes
is non-trivial.
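
To make the "quotes within quotes" case concrete: inside a quoted field a
doubled quote stands for one literal quote, and a separator inside the
quotes is just data. The following is only a small sketch of that rule for
a single field. The function name unquote_field and its return convention
are invented for this example and are not part of libmba.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Decode one quoted CSV field that starts at src[0] == '"'.
 * Writes the decoded text to dst (capacity dn) and returns the number of
 * source characters consumed, including both enclosing quotes.
 * Returns -1 on malformed input or if dst is too small.
 */
int
unquote_field(const char *src, char *dst, size_t dn)
{
    size_t i = 1, j = 0;

    assert(src != NULL && dst != NULL);
    if (src[0] != '"' || dn == 0)
        return -1;
    for (;;) {
        if (src[i] == '\0')
            return -1;              /* unterminated quoted field */
        if (src[i] == '"') {
            if (src[i + 1] != '"')
                break;              /* a lone quote closes the field */
            i++;                    /* doubled quote: skip one, keep one */
        }
        if (j + 1 >= dn)
            return -1;              /* no room for char plus terminator */
        dst[j++] = src[i++];
    }
    dst[j] = '\0';
    return (int)(i + 1);
}
```

For example, the input "say ""hi"", ok" decodes to say "hi", ok and the
comma stays part of the field.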

Obviously you'll need to change the sinput parameter to a file or some
kind of stream source, and return an array instead of having the caller
provide a buffer.

Mike

int
csv_parse_str(struct sinput *in,
            unsigned char *buf,
            size_t bn,
            unsigned char *row[],
            int rn,
            int sep,
            int flags)
{
    int trim, quotes, ch, state, r, j, t, inquotes;

    trim = flags & CSV_TRIM;
    quotes = flags & CSV_QUOTES;
    state = ST_START;
    inquotes = 0;
    ch = r = j = t = 0;

    memset(row, 0, sizeof(unsigned char *) * rn);

    while (rn && bn && (ch = snextch(in)) > 0) {
        switch (state) {
            case ST_START:
                if (ch != '\n' && ch != sep && isspace(ch)) {
                    if (!trim) {
                        buf[j++] = ch; bn--;
                        t = j;
                    }
                    break;
                } else if (quotes && ch == '"') {
                    j = t = 0;
                    state = ST_COLLECT;
                    inquotes = 1;
                    break;
                }
                state = ST_COLLECT; /* fall through */
            case ST_COLLECT:
                if (inquotes) {
                    if (ch == '"') {
                        state = ST_END_QUOTE;
                        break;
                    }
                } else if (ch == sep || ch == '\n') {
                    row[r++] = buf; rn--;
                    if (ch == '\n' && t && buf[t - 1] == '\r') {
                        t--; bn++; /* crlf -> lf */
                    }
                    buf[t] = '\0'; bn--;
                    buf += t + 1;
                    j = t = 0;
                    state = ST_START;
                    inquotes = 0;
                    if (ch == '\n') {
                        rn = 0;
                    }
                    break;
                } else if (quotes && ch == '"') {
                    PMNF(errno = EILSEQ, ": unexpected quote in element %d", (r + 1));
                    return -1;
                }
                buf[j++] = ch; bn--;
                if (!trim || isspace(ch) == 0) {
                    t = j;
                }
                break;
            case ST_TAILSPACE:
            case ST_END_QUOTE:
                if (ch == sep || ch == '\n') {
                    row[r++] = buf; rn--;
                    buf[j] = '\0'; bn--;
                    buf += j + 1;
                    j = t = 0;
                    state = ST_START;
                    inquotes = 0;
                    if (ch == '\n') {
                        rn = 0;
                    }
                    break;
                } else if (quotes && ch == '"' && state != ST_TAILSPACE) {
                    buf[j++] = '"'; bn--; /* nope, just an escaped quote */
                    t = j;
                    state = ST_COLLECT;
                    break;
                } else if (isspace(ch)) {
                    state = ST_TAILSPACE;
                    break;
                }
                errno = EILSEQ;
                PMNF(errno, ": bad end quote in element %d", (r + 1));
                return -1;
        }
    }
    if (ch == -1) {
        AMSG("");
        return -1;
    }
    if (bn == 0) {
        PMNO(errno = E2BIG);
        return -1;
    }
    if (rn) {
        if (inquotes && state != ST_END_QUOTE) {
            PMNO(errno = EILSEQ);
            return -1;
        }
        row[r] = buf;
        buf[t] = '\0';
    }

    return in->count;
}
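
The function above relies on a few names that are not shown in this post
(struct sinput, snextch, the ST_* states and the CSV_* flags). The sketch
below is one guess at how a file-backed input source could look. It is not
libmba's actual definition. The struct layout and the constant values are
assumptions; only the return convention of snextch (greater than 0 for a
character, 0 at end of input, -1 on error) and the count member are
inferred from how the parser uses them.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical parser states and flags; the values are assumptions. */
enum { ST_START, ST_COLLECT, ST_TAILSPACE, ST_END_QUOTE };
#define CSV_TRIM   0x01
#define CSV_QUOTES 0x02

/* Hypothetical stand-in for the input source, backed by a stdio stream. */
struct sinput {
    FILE *fp;   /* stream to read from */
    int count;  /* number of characters consumed so far */
};

/*
 * Return the next character (> 0), 0 at end of input, or -1 on a read
 * error. A NUL byte in the file would also read as 0, so this sketch
 * assumes text input.
 */
int
snextch(struct sinput *in)
{
    int ch;

    assert(in != NULL && in->fp != NULL);
    ch = fgetc(in->fp);
    if (ch == EOF)
        return ferror(in->fp) ? -1 : 0;
    in->count++;
    return ch;
}
```

With something like this in place, each call to the parser would consume
one row from the stream, so the whole file never has to be in memory.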

Note: This code comes from "libmba" and is MIT licensed (similar to the
BSD license without the advertising clause).

-- 
Michael B Allen
PHP Active Directory Kerberos SSO
http://www.ioplex.com/


