CSV_XS(3) User Contributed Perl Documentation CSV_XS(3)
NAME
Text::CSV_XS - comma-separated values manipulation routines
SYNOPSIS
use Text::CSV_XS;
my @rows;
my $csv = Text::CSV_XS->new ({ binary => 1 }) or
die "Cannot use CSV: ".Text::CSV_XS->error_diag ();
open my $fh, "<:encoding(utf8)", "test.csv" or die "test.csv: $!";
while (my $row = $csv->getline ($fh)) {
$row->[2] =~ m/pattern/ or next; # 3rd field should match
push @rows, $row;
}
$csv->eof or $csv->error_diag ();
close $fh;
$csv->eol ("\r\n");
open $fh, ">:encoding(utf8)", "new.csv" or die "new.csv: $!";
$csv->print ($fh, $_) for @rows;
close $fh or die "new.csv: $!";
DESCRIPTION
Text::CSV_XS provides facilities for the composition and decomposition of comma-separated values. An instance of the Text::CSV_XS class
can combine fields into a CSV string and parse a CSV string into fields.
The module accepts either strings or files as input and can use any user-specified characters as delimiters, separators, and escapes, so
it is perhaps better called ASV (anything separated values) rather than just CSV.
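As a minimal sketch of the composition and decomposition described above, the following round-trips a list of fields through a CSV string. The field values are made up for illustration, and "string ()" is the accessor assumed here for retrieving the result of "combine ()".

```perl
use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new ({ binary => 1 })
    or die "Cannot use CSV: ".Text::CSV_XS->error_diag ();

# Compose: join a list of fields into one CSV string.
# The sample values are hypothetical.
my @in = ("id", "Hello, World", 'say "hi"');
$csv->combine (@in) or die "combine failed";
my $line = $csv->string ();

# Decompose: split that string back into its fields.
$csv->parse ($line) or die "parse failed: ".$csv->error_diag ();
my @out = $csv->fields ();

print "$line\n";
print join ("|", @out), "\n";
```

Only the fields that contain a comma or a double quote are quoted in the combined string; the plain field is written as-is.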
Embedded newlines
Important Note: The default behavior is to accept only ASCII characters. This means that fields cannot contain newlines. If your data
contains newlines embedded in fields, or characters above 0x7e (tilde), or binary data, you *must* set "binary => 1" in the call to "new
()". To cover the widest range of parsing options, you will always want to set binary.
Even then, you still have to pass a complete record to the "parse ()" method, which the usual way of reading input line by line does
not guarantee:
my $csv = Text::CSV_XS->new ({ binary => 1, eol => $/ });
while (<>) { # WRONG!
$csv->parse ($_);
my @fields = $csv->fields ();
}
This will break, because "while (<>)" splits the input at every line ending without regard to quoting, so a record with an embedded
newline reaches "parse ()" in pieces. If you need to support embedded newlines, the way to go is either
my $csv = Text::CSV_XS->new ({ binary => 1, eol => $/ });
while (my $row = $csv->getline (*ARGV)) {
my @fields = @$row;
}
or, more safely in perl 5.6 and up
my $csv = Text::CSV_XS->new ({ binary => 1, eol => $/ });
open my $io, "<", $file or die "$file: $!";
while (my $row = $csv->getline ($io)) {
my @fields = @$row;
}
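To see the difference in practice, here is a small self-contained sketch that reads a record with an embedded newline through "getline ()". It assumes an in-memory file handle as a stand-in for a real file, and the sample data is invented for the example.

```perl
use strict;
use warnings;
use IO::Handle;
use Text::CSV_XS;

# Two records; the second field of the first record spans two physical lines.
# (Hypothetical sample data.)
my $data = qq{1,"first line\nsecond line",x\n2,plain,y\n};

my $csv = Text::CSV_XS->new ({ binary => 1 })
    or die "Cannot use CSV: ".Text::CSV_XS->error_diag ();

# An in-memory file handle stands in for a real file here.
open my $io, "<", \$data or die "in-memory handle: $!";

my @records;
while (my $row = $csv->getline ($io)) {
    push @records, $row;
}
$csv->eof or $csv->error_diag ();
close $io;

# Count physical lines to compare against the number of parsed records.
my $physical_lines = ($data =~ tr/\n//);
printf "%d records read from %d physical lines\n", scalar @records, $physical_lines;
```

Because "getline ()" does the reading itself, it can keep consuming input until the quoted field is closed, which a plain line-by-line loop cannot do.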
Unicode (UTF8)
On parsing (both for "getline ()" and "parse ()"), if the source is marked as UTF8, then all fields that are marked binary will also be
marked UTF8.
On combining ("print ()" and "combine ()"), if any of the combining fields was marked UTF8, the resulting string will be marked UTF8.
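The two rules above can be checked with a short sketch like the following. It assumes "utf8::is_utf8 ()" as the way to inspect the UTF8 mark and uses an arbitrary sample character; it avoids printing the character itself so that no output layer is needed.

```perl
use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new ({ binary => 1 })
    or die "Cannot use CSV: ".Text::CSV_XS->error_diag ();

# A character string; a code point above 0xFF forces Perl to mark it UTF8.
# (U+263A is just an arbitrary sample character.)
my $smiley = "\x{263A}";

# Combining: one UTF8-marked field marks the whole resulting string.
$csv->combine ("plain", $smiley) or die "combine failed";
my $line = $csv->string ();
print "combined string marked UTF8: ", (utf8::is_utf8 ($line) ? "yes" : "no"), "\n";

# Parsing: a UTF8-marked source yields UTF8-marked binary fields.
$csv->parse ($line) or die "parse failed";
my @fields = $csv->fields ();
print "parsed field marked UTF8: ", (utf8::is_utf8 ($fields[1]) ? "yes" : "no"), "\n";
```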
For complete control over encoding, please use Text::CSV::Encoded:
use Text::CSV::Encoded;
my $csv = Text::CSV::Encoded->new ({
encoding_in => "iso-8859-1", # the encoding comes into Perl
encoding_out => "cp1252", # the encoding comes out of Perl
});
$csv = Text::CSV::Encoded->new ({ encoding => "utf8" });
# combine () and print () accept *literally* utf8 encoded data
# parse () and getline () return *literally* utf8 encoded data
$csv = Text::CSV::Encoded->new ({ encoding => undef }); # default
# combine () and print () accept UTF8 marked data
# parse () and getline () return UTF8 marked data
SPECIFICATION
While no formal specification for CSV exists, RFC 4180 1) describes a common format and establishes "text/csv" as the MIME type registered
with the IANA.
Many informal documents exist that describe the CSV format. How To: The Comma Separated Value (CSV) File Format 2) provides an overview of
the CSV format in the most widely used applications and explains how it can best be used and supported.
1) http://tools.ietf.org/html/rfc4180
2) http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm
The basic rules are as follows:
CSV is a delimited data format that has fields/columns separated by the comma character and records/rows separated by newlines. Fields that
contain a special character (comma, newline, or double quote) must be enclosed in double quotes. However, if a line contains a single
entry which is the empty string, it may be enclosed in double quotes. If a field's value contains a double quote character it is escaped by
placing another double quote character next to it. The CSV file format does not require a specific character encoding, byte order, or line
terminator format.
o Each record is one line terminated by a line feed (ASCII/LF=0x0A) or a carriage return and line feed pair (ASCII/CRLF=0x0D 0x0A);
however, line breaks can be embedded in fields.
o Fields are separated by commas.
o Allowable characters within a CSV field include 0x09 (tab) and the inclusive range of 0x20 (space) through 0x7E (tilde). In binary mode
all characters are accepted, at least in quoted fields.
o A field within CSV must be surrounded by double-quotes to contain the separator character (comma).
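The quoting rules above can be observed by combining one field at a time and looking at the result. This is only an illustrative sketch with made-up sample fields; "string ()" is assumed as the accessor for the combined output.

```perl
use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new ({ binary => 1 })
    or die "Cannot use CSV: ".Text::CSV_XS->error_diag ();

# Hypothetical sample fields: plain text, a comma, a double quote, a newline.
my @samples = ("plain", "a,b", 'a"b', "a\nb");

my %quoted;
for my $field (@samples) {
    $csv->combine ($field) or die "combine failed for a sample field";
    my $out = $csv->string ();
    $quoted{$field} = $out;

    (my $shown = $out) =~ s/\n/\\n/g;   # make the embedded newline visible
    print "$shown\n";
}
```

The plain field is emitted unchanged, while each field holding a special character is wrapped in double quotes and an embedded double quote is doubled.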
Though the rules above form the clearest and most restrictive definition, Text::CSV_XS is considerably more liberal and allows extensions:
o Line termination by a single carriage return is accepted by default
o The separation-, quote-, and escape- characters can be any ASCII character in the range from 0x20 (space) to 0x7E (tilde). Characters
outside this range may or may not work as expected. Multibyte characters, like U+060C (ARABIC COMMA), U+FF0C (FULLWIDTH COMMA), U+241B
(SYMBOL FOR ESCAPE), U+2424 (SYMBOL FOR NEWLINE), U+FF02 (FULLWIDTH QUOTATION MARK), and U+201C (LEFT DOUBLE QUOTATION MARK) (to give
some examples of what might look promising) are therefore not allowed.
If you use perl-5.8.2 or higher, these three attributes are utf8-decoded, to increase the likelihood of success. This way U+00FE will be
allowed as a quote character.
o A field within CSV must be surrounded by double-quotes to contain an embedded double-quote, represented by a pair of consecutive
double-quotes. In binary mode you may additionally use the sequence ""0" (a double-quote followed by the digit 0) to represent a NULL byte.
o Several violations of the above specification may be allowed by passing options to the object creator.
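As a sketch of the "anything separated values" idea, the object below is created with a semicolon as separator and a single quote as both quote and escape character. The attribute names "sep_char", "quote_char", and "escape_char" are the ones assumed here for these three settings, and the input line is invented for the example.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Semicolon-separated, single-quoted data (assumed attribute names, see above).
my $csv = Text::CSV_XS->new ({
    binary      => 1,
    sep_char    => ";",
    quote_char  => "'",
    escape_char => "'",
}) or die "Cannot use CSV: ".Text::CSV_XS->error_diag ();

# Hypothetical input: the second field holds the separator,
# the third an escaped quote character.
my $input = "a;'b;c';'it''s'";

$csv->parse ($input) or die "parse failed: ".$csv->error_diag ();
my @fields = $csv->fields ();
print join ("|", @fields), "\n";

# Writing the fields back uses the same three characters.
$csv->combine (@fields) or die "combine failed";
my $output = $csv->string ();
print "$output\n";
```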
FUNCTIONS
version ()
(Class method) Returns the current module version.
new (\%attr)
(Class method) Returns a new instance of Text::CSV_XS. The object's attributes are described by the (optional) hash ref "\%attr". Currently
the following attributes are available:
eol An end-of-line string to add to rows. "undef" is replaced with an empty string. The default is "$\". Common values for "eol" are "