How to extract fields from a CSV i.e comma separated where some of the fields having comma as value?


 
Thread Tools Search this Thread
Top Forums UNIX for Beginners Questions & Answers How to extract fields from a CSV i.e comma separated where some of the fields having comma as value?
# 1  
Old 02-19-2019
How to extract fields from a CSV i.e comma separated where some of the fields having comma as value?

can anyone help me!!!! How to I parse the CSV file
file name : abc.csv (csv file) The above file containing data like
Code:
abv,sfs,,hju,',',jkk wff,fst,,rgr,',',rgr ere,edf,erg,',',rgr,rgr

I have a requirement like i have to extract different field and assign them into different variables.
My Code:
Code:
cat $file | awk 'NR!=1' | while read -r line   do      a=`echo "$line" | awk -F',' '{print $1}'`      b=`echo "$line" | awk -F',' '{print $2}'`      c=`echo "$line" | awk -F',' '{print $3}'`      d=`echo "$line" | awk -F',' '{print $4}'`      e=`echo "$line" | awk -F',' '{print $5}'`      f=`echo "$line" | awk -F',' '{print $6}'`      echo "$e"      echo "$f" done

outputSmilieit gave the output as single quote)
' ' ' ' ' ' Required Output Should be likeSmiliemy 5th field having value "comma" )
, jkk , rgr , rgr rgr




Moderator's Comments:
Mod Comment Please use CODE tags as required by forum rules!

Last edited by RudiC; 02-19-2019 at 08:54 AM.. Reason: Added CODE tags.
# 2  
Old 02-19-2019
Welcome to the forum.


You're problem is quite ubiquitous and has been solved / commented on in these fora more than several times. Try the search function, or one of the links given at the lower left of this page under "More UNIX and Linux Forum Topics You Might Find Helpful".


The code with which you addressed your problem is not the most efficient one. Did you consider just reading the variables? Like
Code:
{ read dummy
  while IFS=, read a b c d e f dummy
     do   echo $e
          echo $f

     .
     .
     .
     done
 } < $file

# 3  
Old 02-19-2019
Quote:
Originally Posted by J.Jena
can anyone help me!!!! How to I parse the CSV file
file name : abc.csv (csv file) The above file containing data like
Code:
abv,sfs,,hju,',',jkk wff,fst,,rgr,',',rgr ere,edf,erg,',',rgr,rgr

I have a requirement like i have to extract different field and assign them into different variables.
My Code:
Code:
cat $file | awk 'NR!=1' | while read -r line   do      a=`echo "$line" | awk -F',' '{print $1}'`      b=`echo "$line" | awk -F',' '{print $2}'`      c=`echo "$line" | awk -F',' '{print $3}'`      d=`echo "$line" | awk -F',' '{print $4}'`      e=`echo "$line" | awk -F',' '{print $5}'`      f=`echo "$line" | awk -F',' '{print $6}'`      echo "$e"      echo "$f" done

output: (it gave the output as single quote)
' ' ' ' ' ' Required Output Should be like: (my 5th field having value "comma" )
, jkk , rgr , rgr rgr
First: even if your script line is working, it will be slow as a slug. The reason is that you call external (that is: external to the shell) programs (here: awk) for every line over and over again. Calling a program costs a considerable amount of time. You will not notice that if you call it once but if your file has, say, 1000 lines and each lines has 10 fields, you will call awk 10.000 times. That takes a considerable amount of time.

What you want to do is called "parsing" and if you want to assign the parsed values to (shell) variables in the end you can as well do the parsing itself in the shell too. Let us start with something simple: a loop that reads a string one character at a time. I suppose your input file does not contain anything fancy (like escaped characters, unprintable sequences, etc.) but only printable characters. In this case it can be done in shell alone quite simply:

Code:
#! /bin/ksh

chLine=""
chChar=""

while read chLine ; do                           # go through the whole file
     echo "Whole line: $chLine"                  # test output 
     while [ -n "$chLine" ] ; do                 # go through the line
          chChar="${chLine%${chLine#?}}"         # chop off the first character
          chLine="${chLine#?}"                   # and remove that from the line
          echo "$chChar"                         # test output
     done
done

Now let this run against any (short!) file with a few lines and see how it works. It reads one line at a time, prints it (this is just to control its workings) completely, then starts to remove one character at a time until the end of the line. The removed character is also printed to show the mechanism.

Now we need to determine if we are inside a quotation or outside. This can be done simply by introducing a logical flag which we switch every time we find a quote. We start with the flag in "off", switch it to "on" when we find a quote, switch it to "off" again when we find another quote, etc.. Here we go:

Code:
#! /bin/ksh

chLine=""
chChar=""
lInsideQuote=0

while read chLine ; do                           
     echo "Whole line: $chLine"                  
     lInsideQuote=0                              # switch off for new line
     while [ -n "$chLine" ] ; do                 
          chChar="${chLine%${chLine#?}}"          
          chLine="${chLine#?}"                    
          case "$chChar" in
               \')                               # if we found a single quote
                    if (( lInsideQuote )) ; then # flip the status of the flag
                         lInsideQuote=0
                    else
                         lInsideQuote=1
                    fi
                    ;;

               *)                                # ignore all other characters
                    ;;

          esac
                    
          if (( lInsideQuote )) ; then           # test output shows quote-status
               echo "$chChar"   (inside quote)"
          else
               echo "$chChar"   (outside quote)"
          fi
     done
done

Now we are prepared to deal with the fields themselves: we read a character a time and add that to a growing string which represents our next field. When we encounter a comma we react in one of two ways: when the quotation-flag is switched off, we found a legal separator and we output the field and start over with the next one. If the quotation-field is switched on we just the comma as a normal character to the fields value and carry on.

Here it is:


Code:
#! /bin/ksh

chLine=""
chChar=""
chField=""
lInsideQuote=0

while read chLine ; do
     echo "Whole line: $chLine"
     lInsideQuote=0
     chField=""                            # reset field buffer for next line
     while [ -n "$chLine" ] ; do
          chChar="${chLine%${chLine#?}}"
          chLine="${chLine#?}"
          case "$chChar" in
               "'")
                    if (( lInsideQuote )) ; then
                         lInsideQuote=0
                    else
                         lInsideQuote=1
                    fi
                    ;;

               ",")                                # if we found a comma
                    if (( lInsideQuote )) ; then   # inside a quoted string
                         chField="${chField},"     # we just add the comma
                    else
                                                   # outside a quoted string
                         echo "Field: $chField"  # field is finished, display it
                         chField=""                # empty buffer for next field
                    fi
                    ;;

               *)
                    chField="${chField}${chChar}"  # all other chars are just added
                    ;;

          esac
     done
     echo "last Field: $chField"  # display the last field in the line
done

Try this with a small sample file and see if it works for you. Try to understand its inner workings.

You can use this as a basic structure and refine it to your needs. Where now the echo-statements just display the field values you can insert code to actually deal with the values instead. Things i have left out for clarity are: there are two ways to quote, single and double quotes. How should the script deal with that, i.e.:

Code:
"double ' quoted", 'single " quoted'

My script only takes single-quotes into account but if you allow single- and double-quotes it gets a bit more complicated when one quote is inside a string quoted with the other type. Usually the rule is that quote characters lose their special meanings inside a quoted string. The example above would be two quoted strings, one with a single quote character in it, one with a double-quote character in it.

Also i left out some context checking: suppose this input line:

Code:
'field1','field2

The last opened quotation is not closed again. Should this lead to an error? My script ignores it simply, but you might want to raise some error because of this. How you react to such a line is your decision and you need to decide based on your purpose.

Another thing is if you have escaped quote characters like this:

Code:
"a double quote character \" inside a double-quoted string"

It is possible to parse these (hint: introduce an "escape-flag", which you set whenever you encounter a "\") but again it makes the parser more complicated.

I hope this helps.

bakunin

Last edited by bakunin; 02-19-2019 at 06:38 PM.. Reason: added forgotten quotes
# 4  
Old 02-19-2019
A start:
Code:
awk '
FNR > 1 {
   while (match($0, "\x027[^\x027]*\x027")) {
      quote_field=substr($0, RSTART, RLENGTH);
      quote_field_id="_:" c++ ":_"
      quote_fields[quote_field_id]=quote_field;
      sub(quote_field, quote_field_id);
   }
   comma_fields_count=split($0, comma_fields, ",");
   for (j=1; j<=comma_fields_count; j++) print (quote_fields[comma_fields[j]] ? quote_fields[comma_fields[j]] : comma_fields[j]);
}
' infile

This User Gave Thanks to rdrtx1 For This Post:
# 5  
Old 02-20-2019
In a recent shell, e.g. bash, all can be done with a "here string" and "parameter expansion: Pattern substitution", like



Code:
$ T=$'\001'                                                                        # set a token improbable to appear in file
$ while read LINE
    do    IFS=, read a b c d e f REST <<< ${LINE//\',\'/$T}                        # read from $LINE but with ',' replaced by token
          echo ${a//$T/,}:${b//$T/,}:${c//$T/,}:${d//$T/,}:${e//$T/,}:${f//$T/,}   # print variables with token replaced by comma
    done < file
abv:sfs::hju:,:jkk wff


Last edited by RudiC; 02-20-2019 at 05:49 AM..
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Convert fixed value fields to comma separated values

Hi All, Hope you are doing Great!!!. Today i have came up with a problem to say exactly it was for performance improvement. I have written code in perl as a solution for this to cut in specific range, but it is taking time to run for files thousands of lines so i am expecting a sed... (9 Replies)
Discussion started by: mad man
9 Replies

2. Shell Programming and Scripting

awk to parse comma separated field and removing comma in between number and double quotes

Hi Experts, Please support I have below data in file in comma seperated, but 4th column is containing comma in between numbers, bcz of which when i tried to parse the file the column 6th value(5049641141) is being removed from the file and value(222.82) in column 5 becoming value of column6. ... (3 Replies)
Discussion started by: as7951
3 Replies

3. Shell Programming and Scripting

Combine two Fields if there's a comma between

Hi All, Seeking for your assistance on how to combine two fields if there's a comma between them. What i did was, extract the records from database. file1.csv(extracted from db) 82|Supplies Station, Inc.|112012|KARISSA APPAREL, INC. - 112012|NON TRADE F/A AND... (2 Replies)
Discussion started by: znesotomayor
2 Replies

4. Shell Programming and Scripting

awk print - fields separated with comma's need to ignore inbetween double quotes

I am trying to re-format a .csv file using awk. I have 6 fields in the .csv file. Some of the fields are enclosed in double quotes and contain comma's inside the quotes. awk is breaking this into multiple fields. Sample lines from the .csv file: Device Name,Personnel,Date,Solution... (1 Reply)
Discussion started by: jxrst
1 Replies

5. Shell Programming and Scripting

Need Help - comma inside double quote in comma separated csv,

Hello there, I have a comma separated csv , and all the text field is wrapped by double quote. Issue is some text field contain comma as well inside double quote. so it is difficult to process. Input in the csv file is , 1,234,"abc,12,gh","GH234TY",34 I need output like below,... (8 Replies)
Discussion started by: Uttam Maji
8 Replies

6. Shell Programming and Scripting

How to redirect in comma separated csv from grep

Hi, I am newbie in unix. Could someone tell me how do I redirect my grep output to a csv/excel ? I have used below command but the outputs are appearing in one column Not in different column. grep -e cmd -e cmd1 test.txt | cut -f 5 | sort | uniq -c> op.csv I do not understand how do I... (14 Replies)
Discussion started by: kmajumder
14 Replies

7. Shell Programming and Scripting

Inserting additional comma delimiters in a csv file, after and before certian fields.

Hello I have a csv file which I need to insert addtional commas into. The csv is of the format field1,field2,field3,field4,...etc...,field13,field14 I need to add extra commas in each record so that the final output looks like ... (1 Reply)
Discussion started by: kamal_p_99
1 Replies

8. Shell Programming and Scripting

Extract comma separated value in unix

Hello All Can anyone please guide me how to solve the issue In the below code I am getting concat of two value in result variable with comma separated result=`sqlplus -s / <<EOF set pages 0 feed off; set feedback off; spool abc.txt select... (4 Replies)
Discussion started by: Pratik4891
4 Replies

9. Shell Programming and Scripting

Parse apart strings of comma separated data with varying number of fields

I have a situation where I am reading a text file line-by-line. Those lines of data contain comma separated fields of data. However, each line can vary in the number of fields it can contain. What I need to do is parse apart each line and write each field of data found (left to right) into a file.... (7 Replies)
Discussion started by: 2reperry
7 Replies

10. UNIX for Dummies Questions & Answers

Remove whitespaces between comma separated fields from file

Hello all, I am a unix dummy. I am trying to remove spaces between fields. I have the file in the following format 12332432, 2345 , asdfsdf ,100216 , 9999999 12332431, 2341 , asdfsd2 ,100213 , 9999999 &... (2 Replies)
Discussion started by: nitinbjoshi
2 Replies
Login or Register to Ask a Question