Using sed to remove a column from a CSV

08-08-2014

Registered User

4, 1

Join Date: Jul 2014

Last Activity: 4 September 2014, 12:07 PM EDT

Posts: 4

Thanks Given: 3

Thanked 1 Time in 1 Post

Using sed to remove a column from a CSV

I found that the following works to remove the first column in my file when my CSV is delimited with a simple comma:

Code:

sed -i 's/[^,]*,//' file.csv

However, I have a new file where the fields are encapsulated with double quotes - general example of file:

Code:

"Internal ID", "External ID", "Name"
"123", "ABC", "ABC Incorporated"

Desired outcome:

Code:

"External ID", "Name"
"ABC", "ABC Incorporated"

Can the above sed be modified to handle the "," delimiter? If so, how? Or are there better alternatives?

Any help be appreciated.

Last edited by lojkyelo; 08-08-2014 at 11:47 AM.. Reason: Formatting

lojkyelo

View Public Profile for lojkyelo

Find all posts by lojkyelo

08-08-2014

Registered User

23,310, 4,623

Join Date: Aug 2005

Last Activity: 7 July 2020, 11:47 AM EDT

Location: Saskatchewan

Posts: 23,310

Thanks Given: 1,331

Thanked 4,623 Times in 4,217 Posts

sed isn't really the right tool for dealing with columns. awk would be the correct tool, it understands columns as columns without weird regex convolutions.

As long as your CSV is actually comma separated -- uses , to separate columns and nowhere else -- this may work:

Code:

awk -F"," -v OFS="," '{ $1="" ; $0=substr($0,2) } 1' inputfile > outputfile

If your CSV isn't actually a CSV, a recursive parser that understands quotes is required and things start getting hard.

This User Gave Thanks to Corona688 For This Post:

Corona688

View Public Profile for Corona688

Visit Corona688's homepage!

Find all posts by Corona688

08-08-2014

Registered User

4, 1

Join Date: Jul 2014

Last Activity: 4 September 2014, 12:07 PM EDT

Posts: 4

Thanks Given: 3

Thanked 1 Time in 1 Post

I can trust that the first column will never contain a comma within the "ABC" Internal ID field but not in the following fields, Name is likely suspect.

I tried you suggested command out, looks to have worked preserving the remainder of my record.

Thanks,

This User Gave Thanks to lojkyelo For This Post:

lojkyelo

View Public Profile for lojkyelo

Find all posts by lojkyelo

08-08-2014

Registered User

7,747, 559

Join Date: Feb 2007

Last Activity: 20 April 2020, 11:28 AM EDT

Location: The Netherlands

Posts: 7,747

Thanks Given: 139

Thanked 559 Times in 520 Posts

Another approach:

Code:

awk -F", " 'sub($1 FS,x)' file

Franklin52

View Public Profile for Franklin52

Find all posts by Franklin52

08-08-2014

Registered User

6,384, 2,214

Join Date: May 2005

Last Activity: 28 October 2019, 4:59 PM EDT

Location: In the leftmost byte of /dev/kmem

Posts: 6,384

Thanks Given: 143

Thanked 2,214 Times in 1,548 Posts

Quote:

Originally Posted by Corona688

sed isn't really the right tool for dealing with columns. awk would be the correct tool, it understands columns as columns without weird regex convolutions.

I beg to disagree. Actually sed is a very good tool for manipulating any text data, regardless of being in tables or not. If one prefers awk or sed is more a matter of taste, because both languages are Turing-complete.

Quote:

Originally Posted by lojkyelo

I found that the following works to remove the first column in my file when my CSV is delimited with a simple comma:

Code:

sed -i 's/[^,]*,//' file.csv

However, I have a new file where the fields are encapsulated with double quotes - general example of file:

Code:

"Internal ID", "External ID", "Name"
"123", "ABC", "ABC Incorporated"

Desired outcome:

Code:

"External ID", "Name"
"ABC", "ABC Incorporated"

Let us first examine what your regexp does:

Code:

s/[^,]*,//

This searches for a string of non-commas in the length of zero (=empty field) or more ("[^,]*", so in fact this makes this string optional), followed by a comma. The resulting string will be deleted (replaced by a null-string).

Let us have a look at the composition of your fields: any single field is "a sequence of zero or more non-, followed by a comma. This is true regardless of the field being enclosed in double quotes or not. So it seems that you do not have to change your regexp at all. A double-quote is just a character like any other.

What might happen is that commata enclosed in quotes should not be treated as field-separators, like this:

Code:

"abc,def","ghi,jkl,mno","..."

As your script is now (and, btw., the awk script too) this would be interpreted as the first field ending after "c", the second field after "f", etc.. But probably such a line should be interpreted as 3 fields, ending after "f", "o" and end-of-line. To accomodate for this you need to enhance your definition of what a "field" is a little: a field is a sequence of zero or more strings enclosed in double-quotes mixed with zero or more non-commata, followed by a comma. The following regexp is based on this definition:

Code:

s/\(\("[^"]*"\)*[^,]*\)*,//

Let us peel this apart. Basically it is quite easy, but the nesting level makes it somewhat difficult to understand:

Code:

"[^"]*"

This is a quoted string: a double quote, followed by zero or more non-double quotes, followed by another double quote.

Code:

\("[^"]*"\)*[^,]*

The double-quoted string grouped in brackets, so that the following "*" means zero or more occurences of this expression. This is followed by zero or more non-commas, which would be characters outside the double-quoted strings, so that the expression even allows for mixed quoted and non-quoted field contents.

Code:

\(\("[^"]*"\)*[^,]*\)*,

This whole expression is again grouped, made optional with the "*" (so that we allow for empty fields) and followed by a final ",", which is the record separator.

You could use this as a blueprint to manipulate other fields easily, not just the first. Suppose you would want to change the fourth field to "@@@". Use another grouping to bring one whole field together, skip the first three occurences of a field and work on the fourth. To preserve the first three fields contents we have to use another grouping:

Code:

s/\(\(\(\("[^"]*"\)*[^,]*\)*,\)\{3\}\)\(\("[^"]*"\)*[^,]*\)*/\1@@@/

I hope this helps.

bakunin

Last edited by bakunin; 08-08-2014 at 02:55 PM..

bakunin

View Public Profile for bakunin

Find all posts by bakunin

08-08-2014

Registered User

7,747, 559

Join Date: Feb 2007

Last Activity: 20 April 2020, 11:28 AM EDT

Location: The Netherlands

Posts: 7,747

Thanks Given: 139

Thanked 559 Times in 520 Posts

This should work too:

Code:

sed 's/[^,]*, \(.*\)/\1/' file

Franklin52

View Public Profile for Franklin52

Find all posts by Franklin52

08-08-2014

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

Quote:

Originally Posted by Franklin52

This should work too:

Code:

sed 's/[^,]*, \(.*\)/\1/' file

Which is more simply written as:

Code:

sed 's/[^,]*, //' file

Which is where we started, except the code given in the 1st message in this thread was intended to handle cases where the field separator was a comma; while the input file we're processing here has a comma followed by a space sas the field separator.

Last edited by Don Cragun; 08-08-2014 at 03:59 PM.. Reason: Add note.

This User Gave Thanks to Don Cragun For This Post:

Don Cragun

View Public Profile for Don Cragun

Find all posts by Don Cragun

Shell Programming and Scripting

Using sed to remove a column from a CSV

9 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Need a piece of shell scripting to remove column from a csv file

Discussion started by: Samah

2. Shell Programming and Scripting

Remove the values from a certain column without deleting the Column name in a .CSV file

Discussion started by: dhruuv369

3. Shell Programming and Scripting

Sort, sed, and zero padding date column csv bash scripting

Discussion started by: sean1357

4. UNIX for Dummies Questions & Answers

Use sed on column (csv) file if data in colmns is greater > than?

Discussion started by: Chris Eagleson

5. Shell Programming and Scripting

Remove line feed from csv file column

Discussion started by: r_t_1601

6. Shell Programming and Scripting

Remove line feed from csv file column

Discussion started by: r_t_1601

7. Shell Programming and Scripting

Sed or awk script to remove text / or perform calculations from large CSV files

Discussion started by: metronomadic

8. Shell Programming and Scripting

Remove text from a csv file using sed

Discussion started by: Pablo_beezo

9. Shell Programming and Scripting

sed csv remove conditionally

Discussion started by: Jae