Bash to verify and validate file header and data type


 
# 1  
Old 04-16-2017

The bash script below is a file validation check that verifies the header count (10 expected) and the data type of each field in a tab-delimited file. The key lists the data type of each field. My real data has 58 headers, but only the header and the next row need to be checked. The files below are examples that contain all possible data types; the data type of each line after the header is the same as that of the line above it. Every field contains some sort of data: a number, an alphabetic character, or a . (dot) for a null value. If the file is valid, a message saying so is written to the output; otherwise the missing header or bad data type is written to the output.
I'm not sure if the below is the best way to do this, but hopefully it is close. Each line is commented with what I think is happening. Thank you.

There are 3 example files, which represent the only possibilities.
Code:
file1  --- is a good file, validated for both header and data type in all fields
file2  --- is a bad file, not validated: the header line is good, but the data type expected in Qual is alpha and it is a . (dot), shown in red in file2
file3  --- is a bad file, not validated: the header line is not good (10 columns are expected), though the data types are as expected

key
Code:
Index    Chr    Start    End    Ref    Alt    Freq    Qual    Score    Input    ---- defined 10 column headers ----
Integar     Integar    Integar    Integar    Alpha    Alpha    Integar    Alpha    Integar    Integar   --- data type of each line after header  ----

file1
Code:
Index    Chr    Start    End    Ref    Alt    Freq    Qual    Score    Input
1    1    1    100    C    -    1    GOOD    10    .
2    2    20    200    A    C    .002    STRAND BIAS    2    .
3    2    270    400    -    GG    .036    GOOD    6    .

file2
Code:
Index    Chr    Start    End    Ref    Alt    Freq    Qual    Score    Input
1    1    1    100    C    -    1    .    10    .
2    2    20    200    A    C    .002    STRAND BIAS    2    .
3    2    270    400    -    GG    .036    GOOD    6    .

file3
Code:
Index    Chr    Start    End    Ref    Alt    Freq    Qual    Input
1    1    1    100    C    -    1    GOOD    10    .
2    2    20    200    A    C    .002    STRAND BIAS    2    .
3    2    270    400    -    GG    .036    GOOD    6    .

Code:
#!/bin/bash
# call bash script
awk -F'\t' '{print NF, "fields detected in file and they are:" ORS $0; exit}' file >> output  # detect header row in file and store in output
   if [[ $NF -eq 1 ]]; then   # display results
      echo "file has expected number of fields"   # file is validated for headers
    else
      echo "file is missing header for:"  # missing header field ...in file not-validated
      echo "$NF"
    fi  # close if.... else    
    
isnumeric()   # numeric function
{   # start block
    result=$(echo "$1" | tr -d '[[:digit:]]')  # check each field in file for numeric and store result
    echo ${#result}   # display result
}  # end block

isalpha()   # character function
{  # start block
    result=$(echo "$1" | tr -d '[[:alpha:]]')  # check each field in file for character and store result
    echo ${#result}   # display result
}  # end block
col1=""   # define col to search
col2=""   # define col to search
col3=""   # define col to search
col4=""   # define col to search
col5=""   # define col to search
col6=""   # define col to search
col7=""   # define col to search
col8=""   # define col to search
col9=""    # define col to search
col10=""  # define col to search
let retval=1  # data to check in this row

while read record  # start loop to read each column in file
do
    echo "$record" | awk -F'\t' '{print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10 }' | read col1 col2 col3 col4 col5 col6 col7 col8 col col10  # store in col name in record
    
    # check  if numeric in col
    if [[ $(isnumeric "$col1") -eq 1 && $(isnumeric "$col2") -eq 1 && $(isnumeric "$col3") -eq 1 && $(isnumeric "$col4") -eq 1 && $(isnumeric "$col7") -eq 1 && $(isnumeric "$col9") -eq 1 && $(isnumeric "$col10") -eq 1 ]]; then
         retval=1  # check data in this row
    else
         retval=0  # go back to header row
         break
    fi  # close if.... else
    
    # check if alpha in col
    if [[ $(isalpha "$col5") -eq 1 && $(isalpha "$col6") -eq 1 && $(isalpha "$col8") -eq 1 ]]; then
         retval=1  # check data in this row
    else
         retval=0  # go back to header row
         break
    fi  # close if....else
    
    if [[ $retval -eq 1 ]]; then   # display results
      echo "file is correct data type in each field"   # file isvalidated
    else
      echo "file is  not the correct data type for:"  # colums ...in file not-validated
      echo "$col1 $col2 $col3 $col4 $col5 $col6 $col7 $col8 $col9 $col10"
    fi  # close if.... else    
    
    if [[ NF == 10 && $retval -eq 1 ]]; then   # execute and display file validated
      echo "file is validated"
    else
      echo "file is not validated"
    fi
done  < file >> output  # end loop and define file to check and add to output


Last edited by cmccabe; 04-17-2017 at 09:19 AM.. Reason: added details added red color to file2, corrected syntax errors detected by shell check
# 2  
Old 04-17-2017
Code:
if [[ NF == 10 && $retval -eq 1 ]]

will always evaluate to false, because the constant string NF is not equal to the constant string 10.
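
To illustrate the point: inside [[ ]], == compares strings, and a bare word such as NF is just the literal text "NF", not a variable. A minimal sketch of the difference, assuming the field count has already been stored in a shell variable (nf, literal and numeric are names made up for this example):

```shell
#!/bin/bash
# nf and retval are hypothetical shell variables used only for illustration.
nf=10
retval=1

# The literal string "NF" is compared with the string "10": never true.
if [[ NF == 10 ]]; then literal="true"; else literal="false"; fi

# Expanding a shell variable and comparing numerically with -eq works.
if [[ $nf -eq 10 && $retval -eq 1 ]]; then numeric="true"; else numeric="false"; fi

echo "literal comparison: $literal"
echo "numeric comparison: $numeric"
```

The first test stays false no matter what the file contains; only the second one reacts to the value stored in nf.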
This User Gave Thanks to rovf For This Post:
# 3  
Old 04-17-2017
Hi.

After fixing the syntax error in:
Code:
isnumeric()   3 numeric function

shellcheck then provided messages (see end of long line):
Code:
In z2 line 38:
    if [[ $(isnumeric "$col1") -eq 1 && $(isnumeric "$col2") -eq 1 && $(isnumeric "$col3") -eq 1 && $(isnumeric "$col4") -eq 1 && $(isnumeric "$col7") -eq 1 && $(isnumeric "$col9") -eq 1 && $(isnumeric "$col10") -eq 1 &&]]; then
    ^-- SC1009: The mentioned parser error was in this if expression.
       ^-- SC1073: Couldn't parse this test expression.
                                                                                                                                                                                                                             ^-- SC1072: Unexpected keyword/token. Fix any mentioned problems and try again.

Details for shellcheck:
Code:
shellcheck      analyse shell scripts (man)
Path    : /usr/bin/shellcheck
Version : ShellCheck - shell script analysis tool
Type    : ELF 64-bit LSB executable, x86-64, version 1 (SYSV ...)
Help    : probably available with -h
Repo    : Debian 8.7 (jessie) 
Home    : http://hackage.haskell.org/package/ShellCheck

Best wishes ... cheers, drl
This User Gave Thanks to drl For This Post:
# 4  
Old 04-17-2017
I fixed the syntax errors and the script now executes, but I get the result below. I updated the post with the changes as well.

displayed in terminal:

"file is missing header for:" and then the script ends. Thank you.

output created in directory
Code:
1 fields detected in file and they are:
Index    Chr    Start    End    Ref    Alt    Freq    Qual    Score    Input


Last edited by cmccabe; 04-17-2017 at 09:22 AM.. Reason: updated output
# 5  
Old 04-17-2017
Hi.

Aside from the fact that the script has many lines, some of the lines are long. I tend to look at code if it fits within the width of a page, without me needing to scroll horizontally. Shells are very good at being able to continue pipelines. Other code can have lines terminated with \ to escape the newline. I think that aids comprehension and maintainability.

So without looking at your script in any detail, the next thing I would try is placing set -x in the script. You could also place intermediate printf/echo statements at crucial spots in your script. I use functions to turn on/off debugging output.

You could place set -x at the beginning to see everything. You could place it near the middle of the code, and then bisect the placement depending on whether you see something wrong or not.

Keep in mind that there could be more than one error.
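
As a rough sketch of the two techniques mentioned above (tracing a region with set -x / set +x, and a function that turns debugging output on or off), something along these lines could be used. The names debug, DEBUG and count_fields are hypothetical, not taken from the original script:

```shell
#!/bin/bash
# Hypothetical debug helper: prints to stderr only when DEBUG=1.
DEBUG=${DEBUG:-0}
debug() {
    if [[ $DEBUG -eq 1 ]]; then
        printf 'DEBUG: %s\n' "$*" >&2
    fi
}

# Count the tab-separated fields in the string passed as $1.
count_fields() {
    printf '%s\n' "$1" | awk -F'\t' '{ print NF; exit }'
}

header=$'Index\tChr\tStart'
debug "header is: $header"

set -x                        # start tracing: each command is echoed to stderr
nf=$(count_fields "$header")
set +x                        # stop tracing

debug "field count is: $nf"
echo "$nf"
```

Running it as DEBUG=1 ./script shows the debug lines; without DEBUG set, only the traced region and the final result appear. Moving the set -x / set +x pair around is one way to bisect where things go wrong.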

Best wishes ... cheers, drl

Last edited by drl; 04-17-2017 at 09:53 AM..
This User Gave Thanks to drl For This Post:
# 6  
Old 04-17-2017
You also have a one line awk script followed by an if statement that is using awk variables instead of bash variables. Since NF has not been defined in your bash code, the test in that if statement will also always evaluate to false.
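
Since NF exists only inside awk, one way to use the field count in a bash if statement is to capture awk's output with command substitution. A minimal sketch, assuming the first line of the file is the header; the temporary file and the variable names sample, nf and msg are made up for illustration:

```shell
#!/bin/bash
# Build a small tab-delimited sample in a temporary file (hypothetical data).
sample=$(mktemp)
printf 'Index\tChr\tStart\tEnd\tRef\tAlt\tFreq\tQual\tScore\tInput\n' > "$sample"

# awk prints the field count of the first line; the shell captures it in nf.
nf=$(awk -F'\t' 'NR == 1 { print NF; exit }' "$sample")

if [[ $nf -eq 10 ]]; then
    msg="file has expected number of fields"
else
    msg="file has $nf fields, expected 10"
fi
echo "$msg"
rm -f "$sample"
```

In the original script, $NF on the bash side is simply an unset shell variable, which is why that branch can never report success.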
This User Gave Thanks to Don Cragun For This Post:
# 7  
Old 04-17-2017
The first portion of the bash script (in bold) verifies the headers in each text file in dir and creates 2 output files, one for each unique file. That seems to be working perfectly.

The second portion of the bash script is meant to test and verify each data type. The script executes, but the data type in each field is not verified; only the headers are verified.

The key, also tab-delimited, has the defined headers and the data type of each field.

Only the header line and the line under it need to be verified, as all files in the dir will have the same format. Thank you.

file1 tab-delimited
Code:
Index   Chr Start   End Ref Alt Freq    Qual    Score   Input   ---- this file is verified with 10 headers and the data type in each field is good
1    1    1    100    C    -    1    GOOD    10    .
2    2    20    200    A    C    .002    STRAND BIAS    2    .
3    2    270    400    -    GG    .036    GOOD    6    .

file2 tab-delimited
Code:
Index   Chr Start   End Ref Alt Freq    Qual    Score    Input --- this file is verified with 10 headers but not verified as the red . in QUAL should be "GOOD" or alpha
1    1    1    100    C    -    1    .   10    .
2    2    20    200    A    C    .002    STRAND BIAS    2    .
3    2    270    400    -    GG    .036    GOOD    6    .

key
Code:
Index    Chr    Start    End    Ref    Alt    Freq    Qual    Score    Input    ---- defined 10 column headers ----
Integar     Integar    Integar    Integar    Alpha    Alpha    Integar    Alpha    Integar    Integar   --- data type of each line after header  ----

The ---- annotations are not part of each file; they are only there to help in the description.


Bash
Code:
#!/bin/bash

dir="/home/cmccabe/bash"   # directory to search for files
for f in "$dir"/*.txt; do   # start for loop
bname=$(basename "$f")    # strip off path
pref=${bname%.txt}    # strip off the .txt extension for the output name
awk '
FNR==NR {  # process all columns and rows in file
    for(n=1;n<=NF;n++)   # iterate through  each
        a[$n]  # store inarray n
    nextfile   # next file
}
NF==(n-1) {  # define NF
    print FILENAME " file has expected number of fields"   # Good file
    nextfile   # next file
}
{
    for(i=1;i<=NF;i++)  # iterate through headers
        b[$i]   # header lines
    print FILENAME " is missing header for: "   # Bad file
    for(i in a)   # read headers into i
    if(i in b==0)  # if can not find header in key
        print i    # print missing header
    nextfile  
}' /home/cmccabe/bash/key $f > /home/cmccabe/bash/${pref}_out # use key as headers to look for in files and create out for each
done

isnumeric()   # numeric function
{   # start block
    result=$(echo "$1" | tr -d '[[:digit:]]')  # check each field in file for numeric and store result
    echo ${#result}   # display result
}  # end block

isalpha()   # character function
{  # start block
    result=$(echo "$1" | tr -d '[[:alpha:]]')  # check each field in file for character and store result
    echo ${#result}   # display result
}  # end block
col1=""   # define col to search
col2=""   # define col to search
col3=""   # define col to search
col4=""   # define col to search
col5=""   # define col to search
col6=""   # define col to search
col7=""   # define col to search
col8=""   # define col to search
col9=""    # define col to search
col10=""  # define col to search
let retval=1  # data to check in this row

while read record  # start loop to read each column in file
do
    echo "$record" | awk -F'\t' '{print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10 }' | read col1 col2 col3 col4 col5 col6 col7 col8 col col10  # store in col name in record
    
    # check  if numeric in col
    if [[ $(isnumeric "$col1") -eq 1 && $(isnumeric "$col2") -eq 1 && $(isnumeric "$col3") -eq 1 && $(isnumeric "$col4") -eq 1 && $(isnumeric "$col7") -eq 1 && $(isnumeric "$col9") -eq 1 && $(isnumeric "$col10") -eq 1 ]]; then
         retval=1  # check data in this row
    else
         retval=0  # go back to header row
         break
    fi  # close if.... else
    
    # check if alpha in col
    if [[ $(isalpha "$col5") -eq 1 && $(isalpha "$col6") -eq 1 && $(isalpha "$col8") -eq 1 ]]; then
         retval=1  # check data in this row
    else
         retval=0  # go back to header row
         break
    fi  # close if....else
    
    if [[ $retval -eq 1 ]]; then   # display results
      echo "file is correct data type in each field"   # file isvalidated
    else
      echo "file is  not the correct data type for:"  # colums ...in file not-validated
      echo "$col1 $col2 $col3 $col4 $col5 $col6 $col7 $col8 $col9 $col10"
    fi  # close if.... else    
    
    if [[ NF == 10 && $retval -eq 1 ]]; then   # execute and display file validated
      echo "$f is validated"
    else
      echo "$f is not validated"
    fi
done  < $f >> /home/cmccabe/bash/${pref}_out  # end loop and define file to check and add to output
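
One likely reason the col variables stay empty in the loop above: by default bash runs each part of a pipeline in a subshell, so variables set by "... | read col1 col2 ..." are gone as soon as the pipeline finishes. The variable list also names col where col9 was presumably intended, and with awk's default output separator a value such as STRAND BIAS would be split at the space. A minimal sketch of one alternative, assuming every field is non-empty (as stated in the first post); split_record is a hypothetical helper name:

```shell
#!/bin/bash
# split_record is a hypothetical helper used only for this sketch.
split_record() {
    local record=$1
    # read runs in the current shell here (no pipeline), so the variables
    # survive. IFS is set to a tab for this one command only.
    IFS=$'\t' read -r col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 <<< "$record"
}

record=$'2\t2\t20\t200\tA\tC\t.002\tSTRAND BIAS\t2\t.'
split_record "$record"
echo "Qual=$col8 Score=$col9 Input=$col10"
```

Because a tab counts as IFS whitespace, consecutive tabs would be collapsed, so this only fits files where no field is empty.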

desired output ---- one for each file
Code:
/home/cmccabe/bash/file1.txt file has expected number of fields
/home/cmccabe/bash/file1.txt is validated
/home/cmccabe/bash/file1.txt is correct data type in each field

Code:
/home/cmccabe/bash/file2.txt has the expected number of fields
/home/cmccabe/bash/file2.txt is not the correct data type for: QUAL
/home/cmccabe/bash/file2.txt is not validated
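
A note on the type checks: isnumeric and isalpha print how many characters are left after deleting the digits or letters, so comparing the result with -eq 1 asks whether exactly one other character remains, and an all-digit field gives 0. A possible sketch using bash regular expressions instead is shown below. The patterns are assumptions inferred from the sample files rather than a confirmed specification: numeric fields accept digits with an optional decimal point or a lone . for null, and alpha fields accept letters and spaces or a lone -. The names is_numeric, is_alpha and check_row are hypothetical, and the column types are hard-coded to mirror the key.

```shell
#!/bin/bash
# Patterns are assumptions inferred from the sample files, not a confirmed spec.
num_re='^([0-9]+|[0-9]*\.[0-9]+|\.)$'        # 10, .002, or a lone "." (null)
alpha_re='^([[:alpha:]][[:alpha:] ]*|-)$'    # GOOD, STRAND BIAS, or a lone "-"

is_numeric() { [[ $1 =~ $num_re ]]; }
is_alpha()   { [[ $1 =~ $alpha_re ]]; }

# Print the names of the columns whose value has the wrong type.
# $1 is one tab-delimited data line; the types mirror the key (n or a).
check_row() {
    local names=(Index Chr Start End Ref Alt Freq Qual Score Input)
    local types=(n n n n a a n a n n)
    local fields bad=() i
    IFS=$'\t' read -r -a fields <<< "$1"
    for i in "${!names[@]}"; do
        case ${types[i]} in
            n) is_numeric "${fields[i]}" || bad+=("${names[i]}") ;;
            a) is_alpha "${fields[i]}" || bad+=("${names[i]}") ;;
        esac
    done
    echo "${bad[*]}"
}

good_row=$'1\t1\t1\t100\tC\t-\t1\tGOOD\t10\t.'
bad_row=$'1\t1\t1\t100\tC\t-\t1\t.\t10\t.'
echo "good row: '$(check_row "$good_row")'"
echo "bad row:  '$(check_row "$bad_row")'"
```

An empty result means every field matched its type; for the file2 row the sketch reports Qual, which could then be used to build the "is not the correct data type for:" message.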


Last edited by cmccabe; 04-18-2017 at 06:03 AM..