The Joys of CSV

I’ve been working with CSV files a lot recently, mostly as a way of building web-based management information tools out of SAGE data.

But I’ve always really hated working with the interface to Text::CSV_XS. So I put together Text::CSV::Simple. You just point it at the file you want, and read out all the rows:

use Text::CSV::Simple;

my $parser = Text::CSV::Simple->new;
my @data = $parser->read_file($datafile);

You can tell it you only want certain fields:

$parser->want_fields(1, 2, 4, 8);

And that you want each row returned as a hashref rather than just a listref:

$parser->field_map(qw/id name null town/);
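To make the effect of those two calls concrete, here is a rough sketch in plain Perl of what selecting fields and then naming them amounts to. It is not the module’s actual implementation: shape_row is a hypothetical helper, the sample row is made up, and it assumes that field numbers are 0-based array indices and that a field named 'null' is simply thrown away.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::CSV::Simple: pick the wanted
# columns from one already-parsed row, then name them via a field map.
sub shape_row {
    my ($row, $want, $map) = @_;
    my @picked = @{$row}[@$want];   # array slice; 0-based indices (assumed)
    my %record;
    @record{@$map} = @picked;       # hash slice pairs each name with a value
    delete $record{null};           # assumed: 'null' marks a discarded field
    return \%record;
}

# Made-up sample row with nine columns (indices 0 to 8).
my @row = ('x', 17, 'Alice', 'y', 'n/a', 'z', 'z', 'z', 'Belfast');
my $rec = shape_row(\@row, [1, 2, 4, 8], [qw/id name null town/]);

printf "%s: %s (%s)\n", $rec->{id}, $rec->{name}, $rec->{town};
```

The point is only that the field list and the field map line up position by position, so the third wanted field (index 4 here) is the one that ends up discarded.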

There are also trigger points where you can pre- and post-process the data.

It’s certainly made dealing with CSV much easier for me. And it seems to be useful for other people too, as within a few weeks of its release I’ve had several feature requests and bug reports. Usually it takes a couple of months for a new module of mine to build up enough steam to get that.

However, several people have now reported a problem that I hadn’t even considered: it doesn’t handle newlines inside strings. This disturbed me, as I hadn’t realised until then that CSV files could contain embedded newlines at all! Of course, I can’t find any sensible documentation anywhere of what the CSV file format actually does and doesn’t allow. It seems that Microsoft just made it a de facto standard by making it the main export format from Excel, without ever really specifying how it can be used. The few sites I found that claim to provide more details on the format contradict each other (e.g. over the issue of header rows).

But it certainly does seem that linebreaks are acceptable, as long as they’re properly quoted. This shoots my whole approach to parsing the files apart, and means I’m going to have to go back and pretty much rewrite the module from scratch. I may even have to lose one of my trigger points: I still want Text::CSV_XS to do the actual parsing for me, but I’ll need to hook in at a different level now.
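To show why reading a file one physical line at a time goes wrong here, the following is a small sketch in plain Perl. It is an illustration of the problem rather than the fix planned for the module: logical_records is a hypothetical helper, the sample data is invented, and it assumes that quotes inside a field are escaped by doubling them, so that a record is complete exactly when it contains an even number of double quotes.

```perl
use strict;
use warnings;

# Hypothetical helper: group physical lines into logical CSV records.
# Assumes embedded quotes are escaped by doubling (""), so a record is
# complete exactly when its count of double-quote characters is even.
sub logical_records {
    my @lines = @_;
    my (@records, $buffer);
    for my $line (@lines) {
        # Re-join with the newline that split() removed.
        $buffer = defined $buffer ? "$buffer\n$line" : $line;
        my $quotes = () = $buffer =~ /"/g;   # count the quote characters
        next if $quotes % 2;                 # odd: still inside a quoted field
        push @records, $buffer;
        undef $buffer;
    }
    push @records, $buffer if defined $buffer;   # unterminated final record
    return @records;
}

# Invented sample: the first record has a linebreak inside a quoted field.
my $csv     = qq{1,"Smith","12 High St\nBelfast"\n2,"Jones","1 Main Rd"};
my @lines   = split /\n/, $csv;
my @records = logical_records(@lines);

printf "%d physical lines, %d logical records\n",
    scalar @lines, scalar @records;
```

A naive line-at-a-time reader would see three rows in this data, the first two of them broken. Handing each reassembled record to a real parser is one possible way to carry on using Text::CSV_XS for the parsing itself, though letting the parser read from the filehandle directly is probably the cleaner route.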

Of course I face my normal Open Source dilemma with this. The code clearly has a bug, but it’s not one that has any effect on me. None of the CSV files I have to deal with have linebreaks inside records. If the code wasn’t released, I’d apply my XP YAGNI principles, and defer the fix until I needed it. In some ways I’d like to be able to tell people who reported the bug that I’ll happily accept a patch if they can fix it, but otherwise they’ll have to wait until I need it. But having public code out there with known bugs irks me, so I guess I’ll just have to find the time from somewhere to fix it myself!
