sed isn't really the right tool for dealing with columns. awk would be the correct tool; it understands columns as columns, without weird regex convolutions.
As long as your CSV is actually comma separated -- uses , to separate columns and nowhere else -- this may work:
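For instance, a minimal awk sketch (the file name is a placeholder, and it assumes commas appear only as separators):

awk -F, -v OFS=, '{ for (i = 2; i <= NF; i++) printf "%s%s", $i, (i < NF ? OFS : ORS) }' file.csv

This drops the first field and reprints the remaining fields, comma separated, with no regex gymnastics.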
If your CSV isn't actually that simple, a real parser that understands quotes is required, and things start getting hard.
I can trust that the first column, the "ABC" Internal ID field, will never contain a comma, but not the following fields; Name is a likely suspect.
I tried the command you suggested; it looks to have worked, preserving the remainder of my record.
Quote:
sed isn't really the right tool for dealing with columns. awk would be the correct tool; it understands columns as columns, without weird regex convolutions.
I beg to disagree. Actually, sed is a very good tool for manipulating any text data, whether it is arranged in tables or not. Whether one prefers awk or sed is more a matter of taste, because both languages are Turing-complete.
Quote:
Originally Posted by lojkyelo
I found that the following works to remove the first column in my file when my CSV is delimited with a simple comma:
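Presumably a command along these lines (the file name is a placeholder):

sed 's/[^,]*,//' myfile.csv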
However, I have a new file where the fields are enclosed in double quotes. A general example of the file:
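For illustration (the values are invented), a line might look like:

"ABC123","John Smith","New York"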
Desired outcome:
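With the first field and its comma removed, that would be:

"John Smith","New York"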
Let us first examine what your regexp does:
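That is, presumably:

s/[^,]*,//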
This searches for a string of zero or more non-commas ("[^,]*"; since zero is allowed, the field may be empty and the string is in effect optional), followed by a comma. The matching string will be deleted (replaced by a null string).
Let us have a look at the composition of your fields: any single field is a sequence of zero or more non-commas, followed by a comma. This is true regardless of whether the field is enclosed in double quotes or not. So it seems that you do not have to change your regexp at all; a double quote is just a character like any other.
What might happen, though, is that commas enclosed in quotes should not be treated as field separators, like this:
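For example, a line like this (an invented sample):

"abc,def","mno",p

Read naively, every comma separates fields, giving four fields; read with the quotes in mind, only two commas are real separators.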
As your script is now (and, by the way, the awk script too), this would be interpreted as the first field ending after "c", the second field after "f", and so on. But such a line should probably be interpreted as three fields, ending after "f", after "o", and at end-of-line. To accommodate this, you need to enhance your definition of what a "field" is a little: a field is a sequence of zero or more strings enclosed in double quotes, mixed with zero or more non-commas, followed by a comma. The following regexp is based on this definition:
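Reconstructed from the walkthrough that follows, the substitution would look like this (the file name is again a placeholder):

sed 's/\(\("[^"]*"\)*[^,]*\)*,//' myfile.csv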
Let us peel this apart. Basically it is quite easy, but the nesting level makes it somewhat difficult to understand:
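The innermost piece:

"[^"]*"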
This is a quoted string: a double quote, followed by zero or more non-double quotes, followed by another double quote.
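Then, grouped and mixed with plain characters:

\("[^"]*"\)*[^,]*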
The double-quoted string is grouped in brackets, so that the following "*" means zero or more occurrences of this expression. This is followed by zero or more non-commas, which would be characters outside the double-quoted strings, so the expression even allows for mixed quoted and unquoted field contents.
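And the complete field:

\(\("[^"]*"\)*[^,]*\)*,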
This whole expression is again grouped, made optional with the "*" (so that we allow for empty fields), and followed by a final ",", which is the field separator.
You could use this as a blueprint to manipulate other fields easily, not just the first. Suppose you wanted to change the fourth field to "@@@": use another grouping to bring one whole field together, skip the first three occurrences of a field, and work on the fourth. To preserve the first three fields' contents we have to use another grouping:
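A sketch of such a command (my reconstruction; only the "@@@" and the fourth-field goal come from the text, the rest is assumed):

sed 's/^\(\(\("[^"]*"\)*[^,]*\)*,\(\("[^"]*"\)*[^,]*\)*,\(\("[^"]*"\)*[^,]*\)*,\)\(\("[^"]*"\)*[^,]*\)*/\1@@@/' myfile.csv

The extra \( ... \) around the first three fields captures them as \1, so the replacement \1@@@ writes them back unchanged and rewrites only the fourth field.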
Which is more simply written as:
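Possibly by collapsing the three repeated field expressions into a counted repetition (again, my guess at the intended simplification):

sed 's/^\(\(\(\("[^"]*"\)*[^,]*\)*,\)\{3\}\)\(\("[^"]*"\)*[^,]*\)*/\1@@@/' myfile.csv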
Which is where we started, except that the code given in the 1st message in this thread was intended to handle cases where the field separator was a comma, while the input file we're processing here has a comma followed by a space as the field separator.