CSV file parsing and validating


 
# 36  
Old 04-25-2014
Hi Srini,

It's working fine. Thank you so much for your support and help.

Regards,
Shree

---------- Post updated 04-25-14 at 02:28 AM ---------- Previous update was 04-24-14 at 11:25 PM ----------

Hi,

One more thing: I also want the column headers displayed in the badrec file.
I'm writing the above code in a bash script. Once I write the good and bad records to the goodrec and badrec files, I move them onto Hadoop HDFS. When the process completes, the script should show a success message and the counts of goodrec and badrec records on the console, and it should also display the HDFS path where the files are stored. How can this be done? How will a print statement written in the script display the result on the console?
Below is my script file :
Code:
#!/bin/bash
# First pass (cf.txt): load the column config - type, max length, NOT NULL flag.
# Second pass (df.txt): validate every field of every record against it.
awk -F "," 'NR == FNR {h = (h == "") ? $1 : (h FS $1); gsub("[)(]", "-", $2); split($2, a, "-"); d[NR] = a[1]; l[NR] = a[2]; n[NR] = ($3 == "NOT NULL") ? 1 : 0; next}
  FNR == 1 {print h > "goodrec"}
  {for(i = 1; i <= NF; i++)
    {if(((d[i] == "Numeric" && (($i + 0) == $i || $i == "")) || d[i] == "String") && (length($i) <= l[i]) && (length($i) >= n[i]))
        {f = 1} else {f = 0};
      if(f == 0) {print $0 > "badrec"; next}} print $0 > "goodrec"}' cf.txt df.txt
hadoop fs -put /home/hduser/validate/badrec /user/hduser/Dataparse/
hadoop fs -put /home/hduser/validate/goodrec /user/hduser/Dataparse/

If I execute the above code, goodrec and badrec are dumped on HDFS at the below path
Quote:
/user/hduser/Dataparse/
And on the console I would like to get:
Quote:
Parsing is Success
Count of Goodrec : 3
Count of badrec : 4
Validated records are found on the path "/user/hduser/Dataparse"
Please help me with how this can be done.
# 37  
Old 04-25-2014
If you are dumping the data into HDFS, I assume the data is very big.
In that case you can run the awk code as a streaming job and it would run faster.
Anyway, I shall provide the code in some time as I don't have a PC now.
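For reference, the same awk program can be shipped to the cluster as a Hadoop Streaming mapper so that each block of the input is validated in parallel. A sketch only; the streaming jar path and HDFS directories below are assumptions, so adjust them to your installation, and `validate.awk` is a hypothetical file holding the awk program:

```shell
# Run the validator as a streaming mapper over the raw data already in HDFS.
# The jar location varies by Hadoop distribution and version.
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input /user/hduser/Dataparse/data \
  -output /user/hduser/Dataparse/validated \
  -mapper 'awk -f validate.awk' \
  -file validate.awk
```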
# 38  
Old 04-25-2014
Yes, the data is very big, but right now I'm trying with a very small amount of data. If the data is very large, will using awk affect performance?
# 39  
Old 04-25-2014
It wouldn't degrade the performance... but compare performing the operation on a 100 GB file at a time vs. performing the task on 1 GB chunks in parallel.
That's the beauty of Hadoop.
But it's up to you to choose where to do this task.
Below is the code
Code:
awk -F "," 'NR == FNR {h = (h == "") ? $1 : (h FS $1); gsub("[)(]", "-", $2); split($2, a, "-"); d[NR] = a[1]; l[NR] = a[2]; n[NR] = ($3 == "NOT NULL") ? 1 : 0; next}
  FNR == 1 {print h > "goodrec"; print h > "badrec"}
 {for(i = 1; i <= NF; i++)
  {if(((d[i] == "Numeric" && (($i + 0) == $i || $i == "")) || d[i] == "String") && (length($i) <= l[i]) && (length($i) >= n[i]))
      {f = 1} else {f = 0};
    if(f == 0) {print $0 > "badrec"; b++; next}} print $0 > "goodrec"; g++}
  END {print "Parsing is Success";
    print "Count of Goodrec : " g;
    print "Count of badrec : " b;
    print "Validated records are found on the path \"/user/hduser/Dataparse\""}' cf.txt df.txt

This is tested; below is the output:
Code:
$ cat conf
id,Numeric(2),NOT NULL
name,String(20)
state,String(10),NOT NULL
street_No,Numeric(4)
$ cat data
abc,john,MI,201
22,Lilly,CA,405
33,Richard,CA,21Q
444,Reet5,NY,258
55,Taylor,GI,3333
66,Merry,,3333
77,,,22
$ awk -F "," 'NR == FNR {h = (h == "") ? $1 : (h FS $1); gsub("[)(]", "-", $2); split($2, a, "-"); d[NR] = a[1]; l[NR] = a[2]; n[NR] = ($3 == "NOT NULL") ? 1 : 0; next}
  FNR == 1 {print h > "goodrec"; print h > "badrec"}
 {for(i = 1; i <= NF; i++)
  {if(((d[i] == "Numeric" && (($i + 0) == $i || $i == "")) || d[i] == "String") && (length($i) <= l[i]) && (length($i) >= n[i]))
      {f = 1} else {f = 0};
    if(f == 0) {print $0 > "badrec"; b++; next}} print $0 > "goodrec"; g++}
  END {print "Parsing is Success";
    print "Count of Goodrec : " g;
print "Count of badrec : " b;
print "Validated records are found on the path \"/user/hduser/Dataparse\""}' conf data
Parsing is Success
Count of Goodrec : 2
Count of badrec : 5
Validated records are found on the path "/user/hduser/Dataparse"
$ cat goodrec
id,name,state,street_No
22,Lilly,CA,405
55,Taylor,GI,3333
$ cat badrec
id,name,state,street_No
abc,john,MI,201
33,Richard,CA,21Q
444,Reet5,NY,258
66,Merry,,3333
77,,,22
$
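A note on the validation trick used above: awk treats a field as a "strnum", so the comparison `($i + 0) == $i` is numeric only when the field actually looks like a number, which is what lets the script reject values like `21Q` in a Numeric column. A minimal demo:

```shell
# Fields read by awk behave as strnums: ($1 + 0) == $1 holds only for
# values that parse fully as numbers ("21Q" coerces to 21, which != "21Q").
printf '22\n21Q\nabc\n405\n' |
awk '{ print $1, (($1 + 0) == $1) ? "numeric" : "not numeric" }'
# prints: 22 numeric / 21Q not numeric / abc not numeric / 405 numeric
```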

---------- Post updated at 06:21 AM ---------- Previous update was at 06:20 AM ----------

Don't get confused by the last record '77'; I manually made it a bad record to check a few conditions.
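If you would rather keep the awk program untouched and build the console summary in bash itself, you can count the lines in the two output files and subtract the header. A sketch, reusing the filenames and HDFS path from the posts above; the two `printf` lines only create stand-in files in place of the real goodrec/badrec:

```shell
# Stand-in output files (header + data records), as the awk step produces them
printf 'id,name,state,street_No\n22,Lilly,CA,405\n55,Taylor,GI,3333\n' > goodrec
printf 'id,name,state,street_No\nabc,john,MI,201\n77,,,22\n' > badrec

# Subtract 1 from each line count to exclude the header row
good=$(( $(wc -l < goodrec) - 1 ))
bad=$(( $(wc -l < badrec) - 1 ))

echo "Parsing is Success"
echo "Count of Goodrec : $good"
echo "Count of badrec : $bad"
echo "Validated records are found on the path \"/user/hduser/Dataparse\""
```

Placed after the `hadoop fs -put` lines in the script, these `echo` statements print straight to the console when the script is run.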
# 40  
Old 04-25-2014
Hi Srini,

Thank you for your great support and for listening to all my queries patiently. Thank you so much.

Thanks,
Shree
# 41  
Old 04-25-2014
Hi Srini,

Thank you for your support and for listening to all my queries patiently. Thank you so much.

One more thing: can the same thing we have done so far, i.e., parsing a file against a given configuration file, also be done for an XML file validated against an XSD file? Can we use awk here, or is it possible only through the xmllint utility? (Asking this query only to know whether it's possible or not.)

Thanks,
Shree
# 42  
Old 04-25-2014
Since both are flat files with data patterns, it is possible with awk.
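That said, real XSD validation is the job of `xmllint --noout --schema schema.xsd file.xml`; awk can only do pattern-level checks. As a toy illustration of the awk side (a sketch, nowhere near a real validator; it ignores attributes with slashes, self-closing tags, comments, and CDATA), counting opening versus closing tags on a throwaway sample file:

```shell
# Create a throwaway sample (stand-in for a real XML file)
cat > sample.xml <<'EOF'
<root><id>22</id><name>Lilly</name></root>
EOF

# Count opening vs. closing tags; equal counts suggest balanced markup.
# gsub() returns the number of matches, so replacing each match with
# itself ("&") is just a counting trick.
awk '{
  o += gsub("<[a-zA-Z][^>/]*>", "&")
  c += gsub("</[a-zA-Z][^>]*>", "&")
} END { print (o == c) ? "balanced" : "unbalanced" }' sample.xml
# prints: balanced
```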