CSV_XS(3) User Contributed Perl Documentation CSV_XS(3)
NAME
Text::CSV_XS - comma-separated values manipulation routines
SYNOPSIS
use Text::CSV_XS;
my @rows;
my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
open my $fh, "<:encoding(utf8)", "test.csv" or die "test.csv: $!";
while (my $row = $csv->getline ($fh)) {
$row->[2] =~ m/pattern/ or next; # 3rd field should match
push @rows, $row;
}
close $fh;
$csv->eol ("\r\n");
open $fh, ">:encoding(utf8)", "new.csv" or die "new.csv: $!";
$csv->print ($fh, $_) for @rows;
close $fh or die "new.csv: $!";
DESCRIPTION
Text::CSV_XS provides facilities for the composition and decomposition of comma-separated values. An instance of the Text::CSV_XS class
will combine fields into a CSV string and parse a CSV string into fields.
The module accepts either strings or files as input and supports the use of user-specified characters for delimiters, separators, and
escapes.
Embedded newlines
Important Note: The default behavior is to accept only ASCII characters in the range from 0x20 (space) to 0x7E (tilde). This means that
fields can not contain newlines. If your data contains newlines embedded in fields, or characters above 0x7e (tilde), or binary data, you
must set "binary => 1" in the call to "new". To cover the widest range of parsing options, you will always want to set binary.
But you still have to pass a complete record to the "parse" method, which is harder than it looks in the usual line-by-line reading
loop:
my $csv = Text::CSV_XS->new ({ binary => 1, eol => $/ });
while (<>) { # WRONG!
$csv->parse ($_);
my @fields = $csv->fields ();
}
will break, because the "while (<>)" loop reads physical lines without regard to quoting, so a record with an embedded newline arrives
in pieces. If you need to support embedded newlines, the way to go is to not pass "eol" in the parser (it accepts "\n", "\r", and "\r\n"
by default) and then use "getline":
my $csv = Text::CSV_XS->new ({ binary => 1 });
open my $io, "<", $file or die "$file: $!";
while (my $row = $csv->getline ($io)) {
my @fields = @$row;
}
The old(er) way of using global file handles is still supported:
while (my $row = $csv->getline (*ARGV)) {
my @fields = @$row;
}
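The effect of "binary" on embedded newlines can be sketched with an in-memory string instead of a file. This is a minimal illustration, not part of the original documentation; the variable names are hypothetical, and it assumes "auto_diag" is left off so that a failed "parse" simply returns false.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# One record whose second field contains an embedded newline.
my $line = qq{1,"two\nlines",3};

# Default parser: a newline inside a quoted field is rejected.
my $strict    = Text::CSV_XS->new ();
my $ok_strict = $strict->parse ($line);

# Binary parser: the same record is accepted as three fields.
my $binary    = Text::CSV_XS->new ({ binary => 1 });
my $ok_binary = $binary->parse ($line);
my @fields    = $binary->fields ();
```

Here "parse" is safe only because the whole record is already in one string; when reading from a file, "getline" remains the way to obtain complete records.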
Unicode
Unicode is only tested to work with perl-5.8.2 and up.
On parsing (both for "getline" and "parse"), if the source is marked as UTF8, then all fields that are marked binary will also be marked
UTF8.
For complete control over encoding, please use Text::CSV::Encoded:
use Text::CSV::Encoded;
my $csv = Text::CSV::Encoded->new ({
encoding_in => "iso-8859-1", # the encoding of data coming into Perl
encoding_out => "cp1252", # the encoding of data going out of Perl
});
$csv = Text::CSV::Encoded->new ({ encoding => "utf8" });
# combine () and print () accept *literally* utf8 encoded data
# parse () and getline () return *literally* utf8 encoded data
$csv = Text::CSV::Encoded->new ({ encoding => undef }); # default
# combine () and print () accept UTF8 marked data
# parse () and getline () return UTF8 marked data
On combining ("print" and "combine"), if any of the combining fields was marked UTF8, the resulting string will be marked UTF8. Note
however that fields that precede the first UTF8-marked field and contain 8-bit characters that were not upgraded to UTF8 will remain
bytes in the resulting string too, causing errors. If you pass data of different encodings, or you don't know whether the encodings
differ, force every field to be upgraded before you pass it on:
$csv->print ($fh, [ map { utf8::upgrade (my $x = $_); $x } @data ]);
SPECIFICATION
While no formal specification for CSV exists, RFC 4180 1) describes a common format and establishes "text/csv" as the MIME type registered
with the IANA.
Many informal documents exist that describe the CSV format. How To: The Comma Separated Value (CSV) File Format 2) provides an overview of
the CSV format in the most widely used applications and explains how it can best be used and supported.
1) http://tools.ietf.org/html/rfc4180
2) http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm
The basic rules are as follows:
CSV is a delimited data format that has fields/columns separated by the comma character and records/rows separated by newlines. Fields that
contain a special character (comma, newline, or double quote) must be enclosed in double quotes. However, if a line contains a single
entry that is the empty string, it may be enclosed in double quotes. If a field's value contains a double quote character it is escaped by
placing another double quote character next to it. The CSV file format does not require a specific character encoding, byte order, or line
terminator format.
o Each record is a single line ended by a line feed (ASCII/LF=0x0A) or a carriage return and line feed pair (ASCII/CRLF=0x0D 0x0A);
line breaks may, however, be embedded in fields.
o Fields are separated by commas.
o Allowable characters within a CSV field include 0x09 (tab) and the inclusive range of 0x20 (space) through 0x7E (tilde). In binary mode
all characters are accepted, at least in quoted fields.
o A field within CSV must be surrounded by double-quotes to contain the separator character (comma).
Though this is the clearest and most restrictive definition, Text::CSV_XS is far more liberal than this, and allows extensions:
o Line termination by a single carriage return is accepted by default
o The separation, quote, and escape characters can be any ASCII character in the range from 0x20 (space) to 0x7E (tilde). Characters
outside this range may or may not work as expected. Multibyte characters, like U+060C (ARABIC COMMA), U+FF0C (FULLWIDTH COMMA), U+241B
(SYMBOL FOR ESCAPE), U+2424 (SYMBOL FOR NEWLINE), U+FF02 (FULLWIDTH QUOTATION MARK), and U+201C (LEFT DOUBLE QUOTATION MARK) (to give
some examples of what might look promising) are therefore not allowed.
If you use perl-5.8.2 or higher, these three attributes are utf8-decoded, to increase the likelihood of success. This way U+00FE will be
allowed as a quote character.
o A field within CSV must be surrounded by double-quotes to contain an embedded double-quote, represented by a pair of consecutive
double-quotes. In binary mode you may additionally use the sequence ""0" (a double quote followed by the digit 0) to represent a NULL
byte.
o Several violations of the above specification may be allowed by passing options to the object creator.
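The quoting and escaping rules above can be sketched as a round trip through "combine" and "parse". This is a minimal illustration under default attributes apart from "binary"; the field values and variable names are made up for the example.

```perl
use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new ({ binary => 1 });

# One plain field, one containing the separator, one containing double quotes.
my @in = ("plain", "has,comma", 'say "hi"');

$csv->combine (@in) or die "combine failed";
my $line = $csv->string ();
# $line is now: plain,"has,comma","say ""hi"""

# Parsing the string gives back the original fields.
$csv->parse ($line) or die "parse failed";
my @out = $csv->fields ();
```

Only the fields that need it are quoted, and each embedded double quote is doubled, as the rules describe.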
FUNCTIONS
version
(Class method) Returns the current module version.
new
(Class method) Returns a new instance of Text::CSV_XS. The object's attributes are described by the (optional) hash ref "\%attr".
my $csv = Text::CSV_XS->new ({ attributes ... });
The following attributes are available:
eol An end-of-line string to add to rows.
When not passed in a parser instance, the default behavior is to accept "\n", "\r", and "\r\n", so it is probably safer to not specify
"eol" at all. Passing "undef" or the empty string behaves the same.
Common values for "eol" are "