Problem with changing field separators in a file


 
# 1  
Old 02-17-2011

I have a file with content as shown below.

cat t2 :
Code:
100,100,"X",1234,"12A",,,"ab,c"

The comma is the field separator. String fields are enclosed in double quotes, and a comma inside double quotes should not be treated as a field separator.

I am trying to replace this field separator with a distinct character such as a pipe or \001 and then perform some analysis.

I have used the Perl command below. It works correctly but is slow: my file has about 7 million rows and the command takes about 45 minutes.
Code:
cat t2 | perl -M'Text::ParseWords' -ne 'print (join("\001" => quotewords(",",0, $_)))' | cat -v
 
100^A100^AX^A1234^A12A^A^A^Aab,c

I would appreciate any advice on making this script run faster, or an alternative approach using Unix commands such as awk or sed.

Last edited by Franklin52; 02-17-2011 at 08:57 AM.. Reason: Please use code tags
# 2  
Old 02-17-2011
Do you want to replace ALL commas in your file with something else?
Code:
sed -i 's/,/|/g' infile

This changes every comma to a pipe.

Before using sed -i, run the command without -i so you can see what it will do before the file is modified.

So:
Code:
sed 's/,/|/g' infile

# 3  
Old 02-17-2011
Yes, but only the field separators. If a comma is within double quotes, leave it as it is.
# 4  
Old 02-17-2011
Give this a try:
Code:
sed 's:\("[^",][^",]*\),\([^"]*"\):\1\2:g' infile

Note that if there is more than one comma within a pair of double quotes, it will only remove one per pass, so you may have to run it several times.

Code:
# echo '100,100,"X",1234,"12A",,,"ab,c"'
100,100,"X",1234,"12A",,,"ab,c"
# echo '100,100,"X",1234,"12A",,,"ab,c"' | sed 's:\("[^",][^",]*\),\([^"]*"\):\1\2:g'
100,100,"X",1234,"12A",,,"abc"


Last edited by ctsgnb; 02-17-2011 at 10:45 AM..
# 5  
Old 02-17-2011
Thanks for this,

Code:
100,100,"X",1234,"12A",,,"abc"

but what I am trying to do is replace only the field-separator commas with a character such as a pipe or \001.

Original data:
Code:
100,100,"X",1234,"12A",,,"ab,c"

So the output should look like:
Code:
100|100|X|1234|12A|||ab,c

Any further advice, please?
# 6  
Old 02-17-2011
Code:
# echo '100,100,"X",1234,"12A",,,"ab,c"' | sed 's:\("[^",][^",]*\),\([^"]*"\):\1#\2:g' | tr ',#' '|,'
100|100|"X"|1234|"12A"|||"ab,c"

Code:
# echo '100,100,"X",1234,"12A",,,"ab,c"' | sed 's:\("[^",][^",]*\),\([^"]*"\):\1#\2:g;s:,:|:g;s:#:,:g'
100|100|"X"|1234|"12A"|||"ab,c"

Code:
# echo '100,100,"X",1234,"12A",,,"ab,c"' | sed 's:,:|:g;s:\("[^"|][^"|]*\)|\([^"]*"\):\1,\2:g'
100|100|"X"|1234|"12A"|||"ab,c"

---------- Post updated at 04:03 PM ---------- Previous update was at 03:57 PM ----------

Try this:
Code:
sed 's:,:|:g;s:\("[^"|][^"|]*\)|\([^"]*"\):\1,\2:g' infile
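As noted earlier in the thread, each pass of this kind of substitution handles only one comma per quoted field. A minimal sketch of one way around that is to repeat the restoring substitution with a sed label and the `t` branch command until nothing more changes. The function name `commas_to_pipes` is made up for illustration, and the sketch assumes the data contains no literal pipes, no escaped quotes, and no quoted field that begins with a comma.

```shell
# Hypothetical helper: turn field-separator commas into pipes while
# keeping commas inside double-quoted fields. The quotes are kept.
commas_to_pipes() {
    # 1. Turn every comma into a pipe.
    # 2. Put back one comma inside a quoted field, then branch to :a
    #    and try again; stop when no substitution was made.
    sed -e 's:,:|:g' \
        -e ':a' \
        -e 's:\("[^"|][^"|]*\)|\([^"]*"\):\1,\2:' \
        -e 'ta'
}

printf '%s\n' '100,"a,b,c",200' | commas_to_pipes
# prints: 100|"a,b,c"|200
```

The loop runs once per quoted comma, so lines with many quoted commas cost more passes.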

# 7  
Old 02-17-2011
The following assumes that the file format is as simple as it appears (no special rules such as how to quote quotes, etc):

Code:
BEGIN {
    delimiter = ","
    new_delimiter = "|"
}

{
    len = length($0)
    in_quotes = 0
    for (i = 1; i <= len; i++) {
        char = substr($0, i, 1)
        if (char == "\"") {
            in_quotes = (in_quotes ? 0 : 1)
            continue
        }
        if (char == delimiter && !in_quotes)
            char = new_delimiter
        printf("%s", char)
    }
    printf("\n")
}

Code:
$ echo '100,100,"X",1234,"12A",,,"ab,c"' | awk -f csv.awk 
100|100|X|1234|12A|||ab,c

Regards,
Alister
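On the original performance concern, one alternative that may be worth timing against the character-by-character loop is to let awk split each line on the double quote, so that unquoted text lands in the odd-numbered fields and only those need their commas replaced. This is a sketch under the same assumptions as the script above (balanced quotes, no escaped quotes); it has not been benchmarked here. The function name `change_separator` and its single argument are hypothetical, and the separator should not contain `&` or a backslash because it is used as a gsub replacement.

```shell
# Hypothetical helper: replace unquoted commas with the separator given
# as $1 and drop the double quotes.
change_separator() {
    awk -F'"' -v OFS='' -v sep="$1" '
    {
        # Splitting on the quote puts unquoted text in odd-numbered
        # fields and quoted text in even-numbered fields.
        for (i = 1; i <= NF; i += 2)
            gsub(/,/, sep, $i)
        $1 = $1    # force the record to be rebuilt without the quotes
        print
    }'
}

printf '%s\n' '100,100,"X",1234,"12A",,,"ab,c"' | change_separator '|'
# prints: 100|100|X|1234|12A|||ab,c
```

Passing '\001' instead of '|' yields the control-A separator from the first post, since awk expands escape sequences in -v assignments.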