Sponsored Content
Top Forums UNIX for Beginners Questions & Answers How to extract fields from a CSV i.e comma separated where some of the fields having comma as value? Post 303030924 by bakunin on Tuesday 19th of February 2019 09:25:23 AM
Old 02-19-2019
Quote:
Originally Posted by J.Jena
can anyone help me!!!! How to I parse the CSV file
file name : abc.csv (csv file) The above file containing data like
Code:
abv,sfs,,hju,',',jkk wff,fst,,rgr,',',rgr ere,edf,erg,',',rgr,rgr

I have a requirement like i have to extract different field and assign them into different variables.
My Code:
Code:
cat $file | awk 'NR!=1' | while read -r line   do      a=`echo "$line" | awk -F',' '{print $1}'`      b=`echo "$line" | awk -F',' '{print $2}'`      c=`echo "$line" | awk -F',' '{print $3}'`      d=`echo "$line" | awk -F',' '{print $4}'`      e=`echo "$line" | awk -F',' '{print $5}'`      f=`echo "$line" | awk -F',' '{print $6}'`      echo "$e"      echo "$f" done

output: (it gave the output as single quote)
' ' ' ' ' ' Required Output Should be like: (my 5th field having value "comma" )
, jkk , rgr , rgr rgr
First: even if your script line is working, it will be slow as a slug. The reason is that you call external (that is: external to the shell) programs (here: awk) for every line over and over again. Calling a program costs a considerable amount of time. You will not notice that if you call it once but if your file has, say, 1000 lines and each lines has 10 fields, you will call awk 10.000 times. That takes a considerable amount of time.

What you want to do is called "parsing" and if you want to assign the parsed values to (shell) variables in the end you can as well do the parsing itself in the shell too. Let us start with something simple: a loop that reads a string one character at a time. I suppose your input file does not contain anything fancy (like escaped characters, unprintable sequences, etc.) but only printable characters. In this case it can be done in shell alone quite simply:

Code:
#! /bin/ksh

chLine=""
chChar=""

while read chLine ; do                           # go through the whole file
     echo "Whole line: $chLine"                  # test output 
     while [ -n "$chLine" ] ; do                 # go through the line
          chChar="${chLine%${chLine#?}}"         # chop off the first character
          chLine="${chLine#?}"                   # and remove that from the line
          echo "$chChar"                         # test output
     done
done

Now let this run against any (short!) file with a few lines and see how it works. It reads one line at a time, prints it (this is just to control its workings) completely, then starts to remove one character at a time until the end of the line. The removed character is also printed to show the mechanism.

Now we need to determine if we are inside a quotation or outside. This can be done simply by introducing a logical flag which we switch every time we find a quote. We start with the flag in "off", switch it to "on" when we find a quote, switch it to "off" again when we find another quote, etc.. Here we go:

Code:
#! /bin/ksh

chLine=""
chChar=""
lInsideQuote=0

while read chLine ; do                           
     echo "Whole line: $chLine"                  
     lInsideQuote=0                              # switch off for new line
     while [ -n "$chLine" ] ; do                 
          chChar="${chLine%${chLine#?}}"          
          chLine="${chLine#?}"                    
          case "$chChar" in
               \')                               # if we found a single quote
                    if (( lInsideQuote )) ; then # flip the status of the flag
                         lInsideQuote=0
                    else
                         lInsideQuote=1
                    fi
                    ;;

               *)                                # ignore all other characters
                    ;;

          esac
                    
          if (( lInsideQuote )) ; then           # test output shows quote-status
               echo "$chChar"   (inside quote)"
          else
               echo "$chChar"   (outside quote)"
          fi
     done
done

Now we are prepared to deal with the fields themselves: we read a character a time and add that to a growing string which represents our next field. When we encounter a comma we react in one of two ways: when the quotation-flag is switched off, we found a legal separator and we output the field and start over with the next one. If the quotation-field is switched on we just the comma as a normal character to the fields value and carry on.

Here it is:


Code:
#! /bin/ksh

chLine=""
chChar=""
chField=""
lInsideQuote=0

while read chLine ; do
     echo "Whole line: $chLine"
     lInsideQuote=0
     chField=""                            # reset field buffer for next line
     while [ -n "$chLine" ] ; do
          chChar="${chLine%${chLine#?}}"
          chLine="${chLine#?}"
          case "$chChar" in
               "'")
                    if (( lInsideQuote )) ; then
                         lInsideQuote=0
                    else
                         lInsideQuote=1
                    fi
                    ;;

               ",")                                # if we found a comma
                    if (( lInsideQuote )) ; then   # inside a quoted string
                         chField="${chField},"     # we just add the comma
                    else
                                                   # outside a quoted string
                         echo "Field: $chField"  # field is finished, display it
                         chField=""                # empty buffer for next field
                    fi
                    ;;

               *)
                    chField="${chField}${chChar}"  # all other chars are just added
                    ;;

          esac
     done
     echo "last Field: $chField"  # display the last field in the line
done

Try this with a small sample file and see if it works for you. Try to understand its inner workings.

You can use this as a basic structure and refine it to your needs. Where now the echo-statements just display the field values you can insert code to actually deal with the values instead. Things i have left out for clarity are: there are two ways to quote, single and double quotes. How should the script deal with that, i.e.:

Code:
"double ' quoted", 'single " quoted'

My script only takes single-quotes into account but if you allow single- and double-quotes it gets a bit more complicated when one quote is inside a string quoted with the other type. Usually the rule is that quote characters lose their special meanings inside a quoted string. The example above would be two quoted strings, one with a single quote character in it, one with a double-quote character in it.

Also i left out some context checking: suppose this input line:

Code:
'field1','field2

The last opened quotation is not closed again. Should this lead to an error? My script ignores it simply, but you might want to raise some error because of this. How you react to such a line is your decision and you need to decide based on your purpose.

Another thing is if you have escaped quote characters like this:

Code:
"a double quote character \" inside a double-quoted string"

It is possible to parse these (hint: introduce an "escape-flag", which you set whenever you encounter a "\") but again it makes the parser more complicated.

I hope this helps.

bakunin

Last edited by bakunin; 02-19-2019 at 06:38 PM.. Reason: added forgotten quotes
 

10 More Discussions You Might Find Interesting

1. UNIX for Dummies Questions & Answers

Remove whitespaces between comma separated fields from file

Hello all, I am a unix dummy. I am trying to remove spaces between fields. I have the file in the following format 12332432, 2345 , asdfsdf ,100216 , 9999999 12332431, 2341 , asdfsd2 ,100213 , 9999999 &... (2 Replies)
Discussion started by: nitinbjoshi
2 Replies

2. Shell Programming and Scripting

Parse apart strings of comma separated data with varying number of fields

I have a situation where I am reading a text file line-by-line. Those lines of data contain comma separated fields of data. However, each line can vary in the number of fields it can contain. What I need to do is parse apart each line and write each field of data found (left to right) into a file.... (7 Replies)
Discussion started by: 2reperry
7 Replies

3. Shell Programming and Scripting

Extract comma separated value in unix

Hello All Can anyone please guide me how to solve the issue In the below code I am getting concat of two value in result variable with comma separated result=`sqlplus -s / <<EOF set pages 0 feed off; set feedback off; spool abc.txt select... (4 Replies)
Discussion started by: Pratik4891
4 Replies

4. Shell Programming and Scripting

Inserting additional comma delimiters in a csv file, after and before certian fields.

Hello I have a csv file which I need to insert addtional commas into. The csv is of the format field1,field2,field3,field4,...etc...,field13,field14 I need to add extra commas in each record so that the final output looks like ... (1 Reply)
Discussion started by: kamal_p_99
1 Replies

5. Shell Programming and Scripting

How to redirect in comma separated csv from grep

Hi, I am newbie in unix. Could someone tell me how do I redirect my grep output to a csv/excel ? I have used below command but the outputs are appearing in one column Not in different column. grep -e cmd -e cmd1 test.txt | cut -f 5 | sort | uniq -c> op.csv I do not understand how do I... (14 Replies)
Discussion started by: kmajumder
14 Replies

6. Shell Programming and Scripting

Need Help - comma inside double quote in comma separated csv,

Hello there, I have a comma separated csv , and all the text field is wrapped by double quote. Issue is some text field contain comma as well inside double quote. so it is difficult to process. Input in the csv file is , 1,234,"abc,12,gh","GH234TY",34 I need output like below,... (8 Replies)
Discussion started by: Uttam Maji
8 Replies

7. Shell Programming and Scripting

awk print - fields separated with comma's need to ignore inbetween double quotes

I am trying to re-format a .csv file using awk. I have 6 fields in the .csv file. Some of the fields are enclosed in double quotes and contain comma's inside the quotes. awk is breaking this into multiple fields. Sample lines from the .csv file: Device Name,Personnel,Date,Solution... (1 Reply)
Discussion started by: jxrst
1 Replies

8. Shell Programming and Scripting

Combine two Fields if there's a comma between

Hi All, Seeking for your assistance on how to combine two fields if there's a comma between them. What i did was, extract the records from database. file1.csv(extracted from db) 82|Supplies Station, Inc.|112012|KARISSA APPAREL, INC. - 112012|NON TRADE F/A AND... (2 Replies)
Discussion started by: znesotomayor
2 Replies

9. Shell Programming and Scripting

awk to parse comma separated field and removing comma in between number and double quotes

Hi Experts, Please support I have below data in file in comma seperated, but 4th column is containing comma in between numbers, bcz of which when i tried to parse the file the column 6th value(5049641141) is being removed from the file and value(222.82) in column 5 becoming value of column6. ... (3 Replies)
Discussion started by: as7951
3 Replies

10. Shell Programming and Scripting

Convert fixed value fields to comma separated values

Hi All, Hope you are doing Great!!!. Today i have came up with a problem to say exactly it was for performance improvement. I have written code in perl as a solution for this to cut in specific range, but it is taking time to run for files thousands of lines so i am expecting a sed... (9 Replies)
Discussion started by: mad man
9 Replies
All times are GMT -4. The time now is 04:58 PM.
Unix & Linux Forums Content Copyright 1993-2022. All Rights Reserved.
Privacy Policy