Better way to Validate column data in file.


 
# 8  
Old 12-19-2006
Vergersh99, your code worked great cutting the time from 8:30 Hrs to 4 min.

removing the close cut the time to 2 min

Can I pass a shell variable ($var) as the path for fileRoot or out?
I shouldn't be writing data files to the directory that I call my scripts from.
# 9  
Old 12-19-2006
Figured it out, I just cd to the dir that my files are in then processed the command.

Thank you.
# 10  
Old 12-19-2006
Quote:
Originally Posted by barry1
Vergersh99, your code worked great cutting the time from 8:30 Hrs to 4 min.

removing the close cut the time to 2 min

Can I provide a $ var as my path for the fileRoot or out?
I shouldn't be writing data files to the directory that I am calling my scripts from.
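Pass the directory in with -v (end it with a slash, because fileRoot is built by plain concatenation):
Code:
nawk -v path='path2directoryForGoodBadFiles/' -f barry.awk myDate
where barry.awk is:
Code: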
BEGIN {
  FS=OFS="|"
  fileRoot=path "file"
}
{
  out= fileRoot ( match($3, /^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$/) ? "Good" : "Bad")
  print >> out
  #close(out)
}
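
To make the -v approach concrete, here is a minimal sketch of the same script run inline from the shell. The sample records, the temporary directory and the `outdir` variable are made up for illustration, and the trailing-slash check is an addition that is not in the script above; plain `awk` is assumed to behave like nawk here.

```shell
# Hypothetical sample data and output directory, for illustration only.
outdir=$(mktemp -d)
printf '%s\n' 'a|b|1234567890|x' 'a|b|12345|x' 'a|b|12345abcde|x' > "$outdir/input.txt"

awk -v path="$outdir" '
BEGIN {
  FS = OFS = "|"
  # Add the separator if the caller left it off.
  if (path != "" && path !~ /\/$/) path = path "/"
  fileRoot = path "file"
}
{
  # One match per record decides which file the record goes to.
  out = fileRoot (match($3, /^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$/) ? "Good" : "Bad")
  print >> out
}
' "$outdir/input.txt"
```

After this runs, `$outdir/fileGood` holds the one record whose third field is exactly ten digits and `$outdir/fileBad` holds the other two. Keep in mind that awk processes backslash escapes in a -v value, so a path containing backslashes would need extra care.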

# 11  
Old 12-19-2006
Quote:
Originally Posted by barry1
Vergersh99, your code worked great cutting the time from 8:30 Hrs to 4 min.

removing the close cut the time to 2 min
Nice improvement, isn't it?

Quote:
Originally Posted by Vergersh99
Speaking of speed... This implementation will try to do TWO pattern matches for EVERY input line. In essence, you need to do just ONE, and the result of your match is binary: either "Good" OR "Bad".
True, I agree, but doing it my way is, surprisingly enough, 50% faster on my Linux box, for a reason that I don't understand. I thought it was the ternary test (var = condition ? true : false), so I replaced it with the more conventional if/else. No change.

My version of gawk (GNU Awk 3.1.4) seems to prefer the pattern {action} format, even with a "double match" test.

Trying to improve that "double match" I tried:

Code:
$3 ~ /^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$/ {
    print > "fileGood"
    next
}

$3 !~ /^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$/ {
    print > "fileBad"
}

If I understand the next statement correctly, when the first match succeeds awk goes no further with that record.

Only a slight change in speed. Interesting exercise anyway. :)
# 12  
Old 12-19-2006
Quote:
Originally Posted by ripat
Nice improvement isn't it?



True, I agree, but doing it my way is, surprisingly enough, 50% faster on my Linux box, for a reason that I don't understand. I thought it was the ternary test (var = condition ? true : false), so I replaced it with the more conventional if/else. No change.
No, it's not the ternary operator; that should not matter.
It's whether you're using the match() function OR the '~' matching operator.
My initial thought was that match() would be much slower because it's more 'expensive' than a simple '~' (it also sets the RSTART and RLENGTH variables). But the reality was quite different: match() was just a bit slower:
Code:
BEGIN {
  FS=OFS="|"
  fileRoot="file"
}
{
  out= fileRoot ( match($3, "^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$") ? "Good" : "Bad")
  print >> out
  #close(out)
}

Code:
BEGIN {
  FS=OFS="|"
  fileRoot="file"
}
{
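  # Parenthesize the whole conditional: without the outer parentheses awk concatenates
  # fileRoot with the 0/1 match result first, and that non-empty string is always true.
  out= fileRoot ( ($3 ~ "^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$") ? "Good" : "Bad" )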
  print >> out
  #close(out)
}

Quote:
real 0m1.460s
user 0m0.755s
sys 0m0.700s

real 0m1.432s
user 0m0.830s
sys 0m0.602s
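
For anyone unsure what separates the two forms being timed, the following small sketch prints what each one returns and what match() leaves behind in RSTART and RLENGTH. The two sample records, the shorter `[0-9]+` pattern and the variable names `viaTilde`, `viaMatch` and `result` are invented for this illustration and are not part of the scripts timed above.

```shell
# Two hypothetical records: one all-digit third field, one not.
result=$(printf '%s\n' 'id|name|1234567890' 'id|name|12ab' |
awk -F'|' '
{
  viaTilde = ($3 ~ /^[0-9]+$/)       # 1 or 0, nothing else is touched
  viaMatch = match($3, /^[0-9]+$/)   # start position or 0, and sets RSTART and RLENGTH
  print $3, viaTilde, viaMatch, RSTART, RLENGTH
}')
printf '%s\n' "$result"
```

This prints `1234567890 1 1 1 10` and then `12ab 0 0 0 -1`: both forms agree on true/false, and match() additionally records where the match starts and how long it is (0 and -1 when nothing matches).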
Quote:
Originally Posted by ripat
My version of gawk (GNU Awk 3.1.4) seems to prefer the pattern {action} format, even with a "double match" test.

Trying to improve that "double match" I tried:

Code:
$3 ~ /^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$/ {
    print > "fileGood"
    next
}

$3 !~ /^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$/ {
    print > "fileBad"
}

If I understand the next statement correctly, when the first match succeeds awk goes no further with that record.
That's right. But... if the majority of your records are 'Bad', you're still doing TWO matches: the first (Good) match fails and the second (Bad) match succeeds.
The performance in this case will vary with the 'quality' [pun intended] of your input.
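
One way to avoid that second match in the pattern {action} style, as a rough sketch: because next skips the remaining rules, the "Bad" rule needs no pattern at all. The sample records and the `workdir` variable are hypothetical, and this variant was not among the versions timed in this thread.

```shell
# Hypothetical sample input in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'a|b|1234567890' 'a|b|12345' 'a|b|0987654321' > "$workdir/input.txt"

(
  cd "$workdir" || exit 1
  awk -F'|' '
  $3 ~ /^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$/ {
      print > "fileGood"
      next                 # skip the remaining rules for this record
  }
  { print > "fileBad" }    # reached only when the rule above did not match
  ' input.txt
)
```

Every record is matched exactly once, whether it ends up in fileGood or fileBad, so the cost no longer depends on how many bad records the input contains.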
# 13  
Old 12-19-2006
Thanks a lot for all this. Learned a lot on this thread.

:)