Better way to Validate column data in file.


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Better way to Validate column data in file.
# 1  
Old 12-18-2006
Better way to Validate column data in file.

I am trying to validate the third column in a pipe delimited file.
The column must be 10 char long and all digits 0-9.

I am writing out two new files from the existing file, if it would be quicker, I could leave the bad rows in the file and ignore them in the next process.

What I have is working, but is taking a long time to run.
There are over 1,000,000 rows in the file and the current code has taken 1hr 40 min to process 230,000 rows.

Main part of program.

while read line
do
echo "${line}" | awk -F"|" '{print $3 }' | read emplid

badrow="N"

# remove echo commands

if [ ${badrow} = "N" ]
then
if [ ${#emplid} -ne 10 ]
then
badrow="Y"
# echo " Bad emplid length = ${emplid}"
fi
fi


if [ ${badrow} = "N" ]
then
nonums="$(echo ${emplid} | sed 's/[0-9]//g')"

if [ ! -z "$nonums" ]
then
badrow="Y"
# echo " Bad emplid numeric = ${emplid} "
else
badrow="N"
# echo " Good emplid = ${emplid} "
fi
fi


#
# If badrow = N then write to bad file if good then write to good file.
#
# increment new counters

if [ ${badrow} = "N" ]
then
# Write good row to file.
echo "${line}" >> ${good_record}
let good_recno=${good_recno}+1
else
# Write bad row to file.
echo "${line}" >> ${bad_record}
let bad_recno=${bad_recno}+1
fi

let recno=${recno}+1

done < ${incomingpathfile}


***
Example data

0001010101|TT10101|0000011111|More data and delimiters
1001010101|SS10101|0000022222|More data and delimiters
2001010101|RR10101| 00022222|More data and delimiters
2001010101|QQ10101|O000033333|More data and delimiters
# 2  
Old 12-18-2006
nawk -f barry.awk myDate

barry.awk
Code:
BEGIN {
  FS=OFS="|"
  fileRoot="file"
}
{
  out= fileRoot ( match($3, /^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$/) ? "Good" : "Bad")
  print >> out
  close(out)
}

# 3  
Old 12-18-2006
Python alternative:
Code:
goodfile = open("goodfile.txt","a")
badfile = open("badfile.txt","a")
for line in open("inputfile"):
  thirdcol = line.split("|")[2] #get 3rd column
  if not thirdcol.isdigit() or not len(thirdcol) == 10: #check for digit and length 10
     badfile.write(line)  #write to bad file
  else:
     goodfile.write(line) #write to good file

# 4  
Old 12-19-2006
Quote:
Originally Posted by vgersh99
Code:
BEGIN {
  FS=OFS="|"
  fileRoot="file"
}
{
  out= fileRoot ( match($3, /^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$/) ? "Good" : "Bad")
  print >> out
  close(out)
}

Nice and concise, as usual. But as the original poster has a large file, I wondered if opening and closing a file every line of the input file was as time efficient as a more traditionnal code:

Code:
$3 ~ /^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$/ {
    print > "fileGood"
}

$3 !~ /^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$/ {
    print > "fileBad"
}

Timing the two scripts on a 500.000 lines file gives me this:

Code:
real    0m13.361s
user    0m6.420s
sys     0m6.500s


real    0m1.051s
user    0m0.940s
sys     0m0.110s

# 5  
Old 12-19-2006
$3 ~ /^[0-9]\{10\}$/ {
print > fileGood
}

$3 !~ /^[0-9]\{10\}$/ {
print > fileBad
}
# 6  
Old 12-19-2006
Quote:
Originally Posted by ripat
Nice and concise, as usual. But as the original poster has a large file, I wondered if opening and closing a file every line of the input file was as time efficient as a more traditionnal code:
you're right - it's just a habbit of mine to close the output descriptors if dealing with multipe files (knowing that awk's have a low set limit on the number of openned output descriptors).

As in OP's situation there will be only TWO output descriptors, there's no need to keep closing and reopenning one.

Code:
BEGIN {
  FS=OFS="|"
  fileRoot="file"
}
{
  out= fileRoot ( match($3, /^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$/) ? "Good" : "Bad")
  print >> out
  #close(out)
}

timing the scripts WITH the 'close' and withOUT the 'close' on the 100K line file:
Quote:
real 0m3.280s
user 0m1.259s
sys 0m2.019s

real 0m1.491s
user 0m0.746s
sys 0m0.739s
Quote:
Originally Posted by ripat
Code:
$3 ~ /^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$/ {
    print > "fileGood"
}

$3 !~ /^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$/ {
    print > "fileBad"
}

Speaking of speed..... This implementation will try to do TWO pattern matches for EVERY input line. In essense, you need to do just ONE and the result of your match is binary: either "Good" OR "Bad".
<modestyON>
I believe my implementation should perform a bit better
</modestyON>

Quote:
Originally Posted by ripat
Timing the two scripts on a 500.000 lines file gives me this:

Code:
real    0m13.361s
user    0m6.420s
sys     0m6.500s


real    0m1.051s
user    0m0.940s
sys     0m0.110s


Last edited by vgersh99; 12-19-2006 at 12:58 PM..
# 7  
Old 12-19-2006
Quote:
Originally Posted by aju_kup
$3 ~ /^[0-9]\{10\}$/ {
print > fileGood
}

$3 !~ /^[0-9]\{10\}$/ {
print > fileBad
}
unfortunately this paradigm with 'number of repetitions' does not work on all awk's, including Solaris' 'nawk' and gawk.
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

How to insert data into black column( Secound Column ) in excel (.XLSX) file using shell script?

Source Code of the original script is down below please run the script and try to solve this problem this is my data and I want it column wise 2019-03-20 13:00:00:000 2019-03-20 15:00:00:000 1 Operating System LAB 0 1 1 1 1 1 1 1 1 1 0 1 (5 Replies)
Discussion started by: Shubham1182
5 Replies

2. Shell Programming and Scripting

Bash to verify and validate file header and data type

The below bash is a file validation check executed that will verify the correct header count of 10 and the correct data type in each field of the tab-delimited file. The key has the data type of each field in it. My real data has 58 headers in it but only the header and next row need to be... (6 Replies)
Discussion started by: cmccabe
6 Replies

3. Shell Programming and Scripting

Change data in one column with data from another file's column

Hello, I have this file outputData: # cat /tmp/outputData __Capacity^6^NBSC01_Licences^L3_functionality_for_ESB_switch __Capacity^2100^NBSC01_Licences^Gb_over_IP __Capacity^1837^NBSC01_Licences^EDGE_BSS_Fnc __Capacity^1816^NBSC01_Licences^GPRS_CS3_and_CS4... (1 Reply)
Discussion started by: nypreH
1 Replies

4. Shell Programming and Scripting

Need a ready Shell script to validate a high volume data file

Hi, I am looking for a ready shell script that can help in loading and validating a high volume (around 4 GB) .Dat file . The data in the file has to be validated at each of its column, like the data constraint on each of the data type on each of its 60 columns and also a few other constraints... (2 Replies)
Discussion started by: Guruprasad
2 Replies

5. Shell Programming and Scripting

Generate tabular data based on a column value from an existing data file

Hi, I have a data file with : 01/28/2012,1,1,98995 01/28/2012,1,2,7195 01/29/2012,1,1,98995 01/29/2012,1,2,7195 01/30/2012,1,1,98896 01/30/2012,1,2,7083 01/31/2012,1,1,98896 01/31/2012,1,2,7083 02/01/2012,1,1,98896 02/01/2012,1,2,7083 02/02/2012,1,1,98899 02/02/2012,1,2,7083 I... (1 Reply)
Discussion started by: himanish
1 Replies

6. Shell Programming and Scripting

Compare 2 files and match column data and align data from 3 column

Hello experts, Please help me in achieving this in an easier way possible. I have 2 csv files with following data: File1 08/23/2012 12:35:47,JOB_5330 08/23/2012 12:35:47,JOB_5330 08/23/2012 12:36:09,JOB_5340 08/23/2012 12:36:14,JOB_5340 08/23/2012 12:36:22,JOB_5350 08/23/2012... (5 Replies)
Discussion started by: asnandhakumar
5 Replies

7. Shell Programming and Scripting

Replace column that matches specific pattern, with column data from another file

Can anyone please help with this? I have 2 files as given below. If 2nd column of file1 has pattern foo1@a, find the matching 1st column in file2 & replace 2nd column of file1 with file2's value. file1 abc_1 foo1@a .... abc_1 soo2@a ... def_2 soo2@a .... def_2 foo1@a ........ (7 Replies)
Discussion started by: prashali
7 Replies

8. UNIX for Dummies Questions & Answers

How to validate data of excel

I have a software which generates excel report with some specific data. The excel file format is .xls (old 2003 format) The data are in the forms like differenct cells contains numeric, string and alphanumeric data. The data per cell for specific input data is fixed. I need to retrive specific... (11 Replies)
Discussion started by: PratLinux
11 Replies

9. Shell Programming and Scripting

Replace data of one column with data on other file corresponding to transaction ID matched

Hi All, I have two files one of which having some mobile numbers and corresponding value whose sample content as follows: 9058629605,8.0 9122828964,30.0 And in second file complete details of all mobile numbers and sample content as follows and delimeter used is comma(,): ... (8 Replies)
Discussion started by: poweroflinux
8 Replies

10. Shell Programming and Scripting

How to validate a column?

Dear guru's, I am learning shellscripting and now I 'm struggeling with this problem: When the number in the left column is equal or higer then 200, I want to send an email to "postmaster" @ the corresponding domain in the right column. 220 shoes.com 217 dishwashers.net 209 ... (11 Replies)
Discussion started by: algernonz
11 Replies
Login or Register to Ask a Question