Visit The New, Modern Unix Linux Community


Using sed to remove a column from a CSV


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Using sed to remove a column from a CSV
# 1  
Using sed to remove a column from a CSV

I found that the following works to remove the first column in my file when my CSV is delimited with a simple comma:

Code:
sed -i 's/[^,]*,//' file.csv

However, I have a new file where the fields are encapsulated with double quotes - general example of file:

Code:
"Internal ID", "External ID", "Name"
"123", "ABC", "ABC Incorporated"

Desired outcome:

Code:
"External ID", "Name"
"ABC", "ABC Incorporated"

Can the above sed be modified to handle the "," delimiter? If so, how? Or are there better alternatives?

Any help be appreciated.

Last edited by lojkyelo; 08-08-2014 at 11:47 AM.. Reason: Formatting
# 2  
sed isn't really the right tool for dealing with columns. awk would be the correct tool, it understands columns as columns without weird regex convolutions.

As long as your CSV is actually comma separated -- uses , to separate columns and nowhere else -- this may work:

Code:
awk -F"," -v OFS="," '{ $1="" ; $0=substr($0,2) } 1' inputfile > outputfile

If your CSV isn't actually a CSV, a recursive parser that understands quotes is required and things start getting hard.
This User Gave Thanks to Corona688 For This Post:
# 3  
I can trust that the first column will never contain a comma within the "ABC" Internal ID field but not in the following fields, Name is likely suspect.

I tried you suggested command out, looks to have worked preserving the remainder of my record.

Thanks,
This User Gave Thanks to lojkyelo For This Post:
# 4  
Another approach:
Code:
awk -F", " 'sub($1 FS,x)' file

# 5  
Quote:
Originally Posted by Corona688
sed isn't really the right tool for dealing with columns. awk would be the correct tool, it understands columns as columns without weird regex convolutions.
I beg to disagree. Actually sed is a very good tool for manipulating any text data, regardless of being in tables or not. If one prefers awk or sed is more a matter of taste, because both languages are Turing-complete.

Quote:
Originally Posted by lojkyelo
I found that the following works to remove the first column in my file when my CSV is delimited with a simple comma:

Code:
sed -i 's/[^,]*,//' file.csv

However, I have a new file where the fields are encapsulated with double quotes - general example of file:

Code:
"Internal ID", "External ID", "Name"
"123", "ABC", "ABC Incorporated"

Desired outcome:

Code:
"External ID", "Name"
"ABC", "ABC Incorporated"

Let us first examine what your regexp does:

Code:
s/[^,]*,//

This searches for a string of non-commas in the length of zero (=empty field) or more ("[^,]*", so in fact this makes this string optional), followed by a comma. The resulting string will be deleted (replaced by a null-string).

Let us have a look at the composition of your fields: any single field is "a sequence of zero or more non-, followed by a comma. This is true regardless of the field being enclosed in double quotes or not. So it seems that you do not have to change your regexp at all. A double-quote is just a character like any other.

What might happen is that commata enclosed in quotes should not be treated as field-separators, like this:

Code:
"abc,def","ghi,jkl,mno","..."

As your script is now (and, btw., the awk script too) this would be interpreted as the first field ending after "c", the second field after "f", etc.. But probably such a line should be interpreted as 3 fields, ending after "f", "o" and end-of-line. To accomodate for this you need to enhance your definition of what a "field" is a little: a field is a sequence of zero or more strings enclosed in double-quotes mixed with zero or more non-commata, followed by a comma. The following regexp is based on this definition:

Code:
s/\(\("[^"]*"\)*[^,]*\)*,//

Let us peel this apart. Basically it is quite easy, but the nesting level makes it somewhat difficult to understand:

Code:
"[^"]*"

This is a quoted string: a double quote, followed by zero or more non-double quotes, followed by another double quote.

Code:
\("[^"]*"\)*[^,]*

The double-quoted string grouped in brackets, so that the following "*" means zero or more occurences of this expression. This is followed by zero or more non-commas, which would be characters outside the double-quoted strings, so that the expression even allows for mixed quoted and non-quoted field contents.

Code:
\(\("[^"]*"\)*[^,]*\)*,

This whole expression is again grouped, made optional with the "*" (so that we allow for empty fields) and followed by a final ",", which is the record separator.

You could use this as a blueprint to manipulate other fields easily, not just the first. Suppose you would want to change the fourth field to "@@@". Use another grouping to bring one whole field together, skip the first three occurences of a field and work on the fourth. To preserve the first three fields contents we have to use another grouping:

Code:
s/\(\(\(\("[^"]*"\)*[^,]*\)*,\)\{3\}\)\(\("[^"]*"\)*[^,]*\)*/\1@@@/


I hope this helps.

bakunin

Last edited by bakunin; 08-08-2014 at 02:55 PM..
# 6  
This should work too:
Code:
sed 's/[^,]*, \(.*\)/\1/' file

# 7  
Quote:
Originally Posted by Franklin52
This should work too:
Code:
sed 's/[^,]*, \(.*\)/\1/' file

Which is more simply written as:
Code:
sed 's/[^,]*, //' file

Which is where we started, except the code given in the 1st message in this thread was intended to handle cases where the field separator was a comma; while the input file we're processing here has a comma followed by a space sas the field separator.

Last edited by Don Cragun; 08-08-2014 at 03:59 PM.. Reason: Add note.
This User Gave Thanks to Don Cragun For This Post:

Previous Thread | Next Thread
Thread Tools Search this Thread
Search this Thread:
Advanced Search

Test Your Knowledge in Computers #33
Difficulty: Easy
In HTTPS, the letter 'S' stands for 'safe'.
True or False?

9 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Need a piece of shell scripting to remove column from a csv file

Hi, I need to remove first column from a csv file and i can do this by using below command. cut -f1 -d, --complement Mytest.csv I need to implement this in shell scripting, Whenever i am using the above command alone in command line it is working fine. I have 5 files in my directory and... (3 Replies)
Discussion started by: Samah
3 Replies

2. Shell Programming and Scripting

Remove the values from a certain column without deleting the Column name in a .CSV file

(14 Replies)
Discussion started by: dhruuv369
14 Replies

3. Shell Programming and Scripting

Sort, sed, and zero padding date column csv bash scripting

Hello people, I am having problem to sort, sed and zero padding of column in csv file. 7th column only. Input of csv file: 1,2,3,4,5,6,4/1/2010 12:00 AM,8 1,2,3,4,5,6,3/11/2010 9:39 AM,8 1,2,3,4,5,6,5/12/2011 3:43 PM,8 1,2,3,4,5,6,12/20/2009 7:23 PM,8 Output:... (5 Replies)
Discussion started by: sean1357
5 Replies

4. UNIX for Dummies Questions & Answers

Use sed on column (csv) file if data in colmns is greater > than?

I have a data file that has 14 columns. I cannot use awk or perl but sed is installed on my host. I would like to delete a line if fields 10, 11 or twelve is greater than 999.99. How is this done using sed? :wall: sed '/^*,*,*,*,*,*,*,*,*,*,*,*,*,*,/d' infile 1 2 3 4 ... (2 Replies)
Discussion started by: Chris Eagleson
2 Replies

5. Shell Programming and Scripting

Remove line feed from csv file column

Hi All, i have a csv file . In the 7th column i have data that has line feed in it. Requirement is to remove the line feed from the 7th column whenever it appears There are 11 columns in the file C1,C2,C3,C4,C5,C6,C7,C8,C9,C10,C11 The value in C7 contains line feed ( Alt + Enter ),... (2 Replies)
Discussion started by: r_t_1601
2 Replies

6. Shell Programming and Scripting

Remove line feed from csv file column

Hi All, My requirement is to remove line (3 Replies)
Discussion started by: r_t_1601
3 Replies

7. Shell Programming and Scripting

Sed or awk script to remove text / or perform calculations from large CSV files

I have a large CSV files (e.g. 2 million records) and am hoping to do one of two things. I have been trying to use awk and sed but am a newbie and can't figure out how to get it to work. Any help you could offer would be greatly appreciated - I'm stuck trying to remove the colon and wildcards in... (6 Replies)
Discussion started by: metronomadic
6 Replies

8. Shell Programming and Scripting

Remove text from a csv file using sed

I am trying to remove the ita from this file: "1234ita","john","smith" "56789ita","jim","thomas" the command i am using is: sed '/^ita/d' infile.csv > outfile.csv it is running but doing nothing, very rarely use sed so trying to learn it any help would be appreciated (2 Replies)
Discussion started by: Pablo_beezo
2 Replies

9. Shell Programming and Scripting

sed csv remove conditionally

Hello, I have many csv file, but I would like to delete lines with some values in a column conditionally. My example look like this, ex1e, ex2g, ex39, pasg, ssg, mrlc, pc, kb, coop -112, -53, -177, 64, 62, 71, 1, 487, 20 -101, -61, -53, 0, 32767, 51, 0, ... (6 Replies)
Discussion started by: Jae
6 Replies

Featured Tech Videos