Fields shifting in file, due to null values?


 
# 1  
Old 12-02-2016

The code below runs and creates an output file with three sections. The first two sections are fine, but in the third section not every blank field gets a . put in it. I don't know whether that is what causes the last two fields in the current output to shift onto a new line, but I cannot seem to solve it. They should not be on a new line; maybe it is because of the spaces in the fields. The for loop containing if($f == "")$f = "." seems to add . to some of the nulls, but not all of them. Thank you.


Code:
# update found in reference missing in IDP
for file in /home/cmccabe/Desktop/concordance/comparison/update/*.txt ; do
    file1=${file##*/}    # Strip off directory
    getprefix=${file1%%_*.txt}
    file1=$(printf '%s\n' "/home/cmccabe/Desktop/concordance/reference/files/${file1%%_*.txt}_"*.txt) # look for matching file
    if [[ -f "$file1" ]]
    then
          awk '
BEGIN {FS = OFS = "\t"
}
NR == 1 {
outfile = FILENAME
}
FNR == NR {
o[i[++ic] = $1 OFS $2 OFS $3] = $0
}
{for(f=1;f<=19;f++)
{if($f == "")$f = "."}
}
{if($2 OFS $4 OFS $5 in o)
o[$2 OFS $4 OFS $5] = $1 OFS $2 OFS $4 OFS $5 OFS $6 OFS $7 OFS $8 OFS $9 OFS $10 OFS $11 OFS $12 OFS $13 OFS $14 OFS $15 OFS $16 OFS $17 OFS $18 OFS $19}
END {for(j = 1; j <= ic; j++)
print o[i[j]] > outfile
}' $file $file1
   fi
done

current output
Code:
Missing in IDP but found in Reference:	
CHR	POS	REF	ALT	FUNC	GENE	COVERAGE	PHRED	A[#F,#R]	C[#F,#R]	G[#F,#R]	T[#F,#R]	INS[#F,#R]	DEL[#F,#R]	SNP	MUT	FREQ	SANGER	REGION	TVC
9	138676398	-	C	exonic	KCNT1	97	13.5	0;0	0;97	0;0	0;0	0;24	0;0	.	c.2961_2962insC	24.74	FP
	Not low	 Not found
9	131337098	-	T	intronic	SPTAN1	1522	15.3	0;0	0;0	295;1227	0;0	1;277	0;0	.	c.504+4_504+5insT	18.27	
	Not low	 Not found
10	78944590	G	A	exonic	KCNMA1	2173	24.8	448;626	1;0	496;598	0;0	3;0	0;4	rs1131824	c.[687C>T]+[=]	49.42	
	Not low	 found
20	62038393	-	G	exonic	KCNQ2	140	13.6	0;0	0;0	132;8	0;0	63;2	0;0	.	c.2223_2224insC	46.43	FP
	Not low	 Not found
2	166848646	G	A	exonic	SCN1A	110	15.7	20;16	0;0	44;30	0;0	0;0	0;0	.	c.[5139C>T]+[=]	32.73	
	Not low	 found
2	166210776	C	T	exonic	SCN2A	3095	23.1	0;0	1158;1177	0;0	457;303	1;0	0;0	.	c.[2994C>T]+[=]	24.56	
	Not low	 found
9	138676400	-	C	exonic	KCNT1	98	13.5	0;0	0;0	0;98	0;0	0;19	0;0	.	c.2963_2964insC	19.39	FP
	Not low	 Not found
11	1780815	C	-	exonic	CTSD	187	12.9	0;0	9;117	0;0	0;0	0;0	0;61	rs141482597	c.283delG	32.62	RFP
	Not low	 Not found
16	10273906	-	G	exonic	GRIN2A	3252	16.7	0;2	0;0	586;2664	0;0	3;627	0;0	rs145961628	c.363_364insC	19.37	RFP
	Not low	 Not found
7	148106478	-	GT	intronic	CNTNAP2	4168	28.6	0;0	0;1	0;0	2199;1967	1129;997	0;1	rs60451214	c.3716-5_3716-4insGT	51.01	
	Not low	 Not found
18	53303101	C	G	exonic	TCF4	1822	20	2;0	0;0	739;1027	0;0	0;0	1;53	rs611326	c.[-48754C>G]+[-48754C>G]	96.93	
	Not low	 found
2	166901684	-	T	exonic	SCN1A	1540	14.4	313;1227	0;0	0;0	0;0	0;291	0;0	.	c.1530_1531insA	18.9	FP
	Not low	 Not found
7	148106476	-	TT	intronic	CNTNAP2	4170	28.6	0;0	0;1	0;0	2208;1961	1131;996	0;0	rs61232377	c.3716-7_3716-6insTT	51.01	
	Not low	 Not found

desired output
Code:
Missing in IDP but found in Reference:	
CHR	POS	REF	ALT	FUNC	GENE	COVERAGE	PHRED	A[#F,#R]	C[#F,#R]	G[#F,#R]	T[#F,#R]	INS[#F,#R]	DEL[#F,#R]	SNP	MUT	FREQ	SANGER	REGION	TVC
9	138676398	-	C	exonic	KCNT1	97	13.5	0;0	0;97	0;0	0;0	0;24	0;0	.	c.2961_2962insC	24.74	FP	Not low	Not found
9	131337098	-	T	intronic	SPTAN1	1522	15.3	0;0	0;0	295;1227	0;0	1;277	0;0	.	c.504+4_504+5insT	18.27	.	Not low	Not found
10	78944590	G	A	exonic	KCNMA1	2173	24.8	448;626	1;0	496;598	0;0	3;0	0;4	rs1131824	c.[687C>T]+[=]	49.42	.	Not low	found
20	62038393	-	G	exonic	KCNQ2	140	13.6	0;0	0;0	132;8	0;0	63;2	0;0	.	c.2223_2224insC	46.43	FP	Not low	Not found
2	166848646	G	A	exonic	SCN1A	110	15.7	20;16	0;0	44;30	0;0	0;0	0;0	.	c.[5139C>T]+[=]	32.73	.	Not low	found
2	166210776	C	T	exonic	SCN2A	3095	23.1	0;0	1158;1177	0;0	457;303	1;0	0;0	.	c.[2994C>T]+[=]	24.56	.	Not low	found
9	138676400	-	C	exonic	KCNT1	98	13.5	0;0	0;0	0;98	0;0	0;19	0;0	.	c.2963_2964insC	19.39	FP	Not low	Not found
11	1780815	C	-	exonic	CTSD	187	12.9	0;0	9;117	0;0	0;0	0;0	0;61	rs141482597	c.283delG	32.62	RFP	Not low	Not found
16	10273906	-	G	exonic	GRIN2A	3252	16.7	0;2	0;0	586;2664	0;0	3;627	0;0	rs145961628	c.363_364insC	19.37	RFP	Not low	Not found
7	148106478	-	GT	intronic	CNTNAP2	4168	28.6	0;0	0;1	0;0	2199;1967	1129;997	0;1	rs60451214	c.3716-5_3716-4insGT	51.01	.	Not low	Not found
18	53303101	C	G	exonic	TCF4	1822	20	2;0	0;0	739;1027	0;0	0;0	1;53	rs611326	c.[-48754C>G]+[-48754C>G]	96.93	.	Not low	found
2	166901684	-	T	exonic	SCN1A	1540	14.4	313;1227	0;0	0;0	0;0	0;291	0;0	.	c.1530_1531insA	18.9	FP	Not low	Not found
7	148106476	-	TT	intronic	CNTNAP2	4170	28.6	0;0	0;1	0;0	2208;1961	1131;996	0;0	rs61232377	c.3716-7_3716-6insTT	51.01	.	Not low	Not found
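
One way to check whether stray spaces, carriage returns or embedded line breaks are behind the shift is to print each line's field count and put brackets around every value so whitespace becomes visible. This is only a sketch, assuming tab-separated input; the function name show_fields is made up for illustration and is not part of the script above.

```shell
# Hypothetical diagnostic helper: show the field count of each line and
# wrap every field in brackets so blanks and stray spaces stand out.
show_fields() {
    awk 'BEGIN { FS = "\t" }
    {
        printf "line %d: %d fields:", FNR, NF
        for (f = 1; f <= NF; f++) {
            v = $f
            gsub(/\r/, "<CR>", v)      # make carriage returns visible
            printf " [%s]", v
        }
        print ""
    }' "$@"
}

# Example: a space-only field and a DOS line ending
printf 'a\t \tb\r\n' | show_fields
# prints: line 1: 3 fields: [a] [ ] [b<CR>]
```

Running it on both the update file and the reference file should show whether the short lines in the current output already exist in the input or are produced by the script.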

# 2  
Old 12-02-2016
Perhaps some fields aren't empty, but contain spaces?

Code:
if($f ~ /[ ]*/) $f=".";
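
One thing to watch with this pattern: without anchors, [ ]* also matches zero spaces, so the test succeeds on every field and not only on blank ones. A minimal sketch of an anchored variant follows, assuming tab-separated input. The function name fill_blank_fields is made up for illustration, and looping to NF instead of 19 is a choice of this sketch rather than something taken from the original script.

```shell
# Hypothetical helper: put "." into fields that are empty or contain
# only spaces, leaving every other field untouched.
fill_blank_fields() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        for (f = 1; f <= NF; f++)
            if ($f ~ /^[ ]*$/)         # anchored: the whole field is blank
                $f = "."
        print
    }' "$@"
}

# Example: second field is empty, third holds a single space
printf 'a\t\t \tb\n' | fill_blank_fields
# prints: a<TAB>.<TAB>.<TAB>b
```

The ^ and $ force the pattern to cover the whole field, so it handles both empty fields and fields made up of spaces only.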

This User Gave Thanks to Corona688 For This Post:
# 3  
Old 12-03-2016
I will try it out. Will that capture both spaces and nulls/blanks? Thank you very much.
# 4  
Old 12-05-2016
I updated the code using your suggestion, and now the last 6 fields in the output seem to be removed. Using if($f == "")$f = "." instead creates the output shown in the first post (with the spaces added and the last 2 fields shifted). Thank you.

Code:
# update found in reference missing in IDP
for file in /home/cmccabe/Desktop/concordance/comparison/update/*.txt ; do
    file1=${file##*/}    # Strip off directory
    getprefix=${file1%%_*.txt}
    file1=$(printf '%s\n' "/home/cmccabe/Desktop/concordance/reference/files/${file1%%_*.txt}_"*.txt) # look for matching file
    if [[ -f "$file1" ]]
    then
          awk '
BEGIN {FS = OFS = "\t"
}
NR == 1 {
outfile = FILENAME
}
FNR == NR {
o[i[++ic] = $1 OFS $2 OFS $3] = $0
}
{for(f=1;f<=19;f++)
{if($f ~ /[ ]*/) $f=".";}
}
{if($2 OFS $4 OFS $5 in o)
o[$2 OFS $4 OFS $5] = $1 OFS $2 OFS $4 OFS $5 OFS $6 OFS $7 OFS $8 OFS $9 OFS $10 OFS $11 OFS $12 OFS $13 OFS $14 OFS $15 OFS $16 OFS $17 OFS $18 OFS $19}
END {for(j = 1; j <= ic; j++)
print o[i[j]] > outfile
}' $file $file1
   fi
done

output (last 6 fields removed)
Code:
Missing in IDP but found in Reference:    
CHR    POS    REF    ALT    FUNC    GENE    COVERAGE    PHRED    A[#F,#R]    C[#F,#R]    G[#F,#R]    T[#F,#R]    INS[#F,#R]    DEL[#F,#R]    SNP    MUT    FREQ    SANGER    REGION    TVC
138676398    -    C    Not low     Not found
131337098    -    T    Not low     Not found
78944590    G    A    low     Not found
62038393    -    G    Not low     Not found
166848646    G    A    low     Not found
166210776    C    T    Not low     Not found
138676400    -    C    Not low     Not found
1780815    C    -    Not low     Not found
10273906    -    G    Not low     Not found
148106478    -    GT    Not low     Not found
166901684    -    T    Not low     Not found
148106476    -    TT    Not low     Not found

---------- Post updated at 07:38 AM ---------- Previous update was at 07:02 AM ----------

If if($f == "")$f = "." is used in the loop instead, the output is below:

Code:
Missing in IDP but found in Reference:    
CHR    POS    REF    ALT    FUNC    GENE    COVERAGE    PHRED    A[#F,#R]    C[#F,#R]    G[#F,#R]    T[#F,#R]    INS[#F,#R]    DEL[#F,#R]    SNP    MUT    FREQ    SANGER    REGION    TVC
9    138676398    -    C    exonic    KCNT1    97    13.5    0;0    0;97    0;0    0;0    0;24    0;0    .    c.2961_2962insC    24.74    FP     Not low     Not found
9    131337098    -    T    intronic    SPTAN1    1522    15.3    0;0    0;0    295;1227    0;0    1;277    0;0    .    c.504+4_504+5insT    18.27         Not low     Not found
10    78944590    G    A    exonic    KCNMA1    2173    24.8    448;626    1;0    496;598    0;0    3;0    0;4    rs1131824    c.[687C>T]+[=]    49.42         Not low     found
20    62038393    -    G    exonic    KCNQ2    140    13.6    0;0    0;0    132;8    0;0    63;2    0;0    .    c.2223_2224insC    46.43    FP     Not low     Not found
2    166848646    G    A    exonic    SCN1A    110    15.7    20;16    0;0    44;30    0;0    0;0    0;0    .    c.[5139C>T]+[=]    32.73         Not low     found
2    166210776    C    T    exonic    SCN2A    3095    23.1    0;0    1158;1177    0;0    457;303    1;0    0;0    .    c.[2994C>T]+[=]    24.56         Not low     found
9    138676400    -    C    exonic    KCNT1    98    13.5    0;0    0;0    0;98    0;0    0;19    0;0    .    c.2963_2964insC    19.39    FP     Not low     Not found
11    1780815    C    -    exonic    CTSD    187    12.9    0;0    9;117    0;0    0;0    0;0    0;61    rs141482597    c.283delG    32.62    RFP     Not low     Not found
16    10273906    -    G    exonic    GRIN2A    3252    16.7    0;2    0;0    586;2664    0;0    3;627    0;0    rs145961628    c.363_364insC    19.37    RFP     Not low     Not found
7    148106478    -    GT    intronic    CNTNAP2    4168    28.6    0;0    0;1    0;0    2199;1967    1129;997    0;1    rs60451214    c.3716-5_3716-4insGT    51.01         Not low     Not found
2    166901684    -    T    exonic    SCN1A    1540    14.4    313;1227    0;0    0;0    0;0    0;291    0;0    .    c.1530_1531insA    18.9    FP     Not low     Not found
7    148106476    -    TT    intronic    CNTNAP2    4170    28.6    0;0    0;1    0;0    2208;1961    1131;996    0;0    rs61232377    c.3716-7_3716-6insTT    51.01         Not low     Not found

manually edited

. added to $10 or SNP if empty
tabs used to separate $11 and $12

Code:
Missing in IDP but found in Reference:   
CHR    POS    REF    ALT    FUNC    GENE    COVERAGE    PHRED    A[#F,#R]    C[#F,#R]    G[#F,#R]    T[#F,#R]    INS[#F,#R]    DEL[#F,#R]    SNP    MUT    FREQ    SANGER    REGION    TVC
9    138676398    -    C    exonic    KCNT1    97    13.5    0;0    0;97    0;0    0;0    0;24    0;0    .    c.2961_2962insC    24.74    FP    Not low    Not found
9    131337098    -    T    intronic    SPTAN1    1522    15.3    0;0    0;0    295;1227    0;0    1;277    0;0    .    c.504+4_504+5insT    18.27    .    Not low    Not found
10    78944590    G    A    exonic    KCNMA1    2173    24.8    448;626    1;0    496;598    0;0    3;0    0;4    rs1131824    c.[687C>T]+[=]    49.42    .    Not low    found
20    62038393    -    G    exonic    KCNQ2    140    13.6    0;0    0;0    132;8    0;0    63;2    0;0    .    c.2223_2224insC    46.43    FP    Not low    Not found
2    166848646    G    A    exonic    SCN1A    110    15.7    20;16    0;0    44;30    0;0    0;0    0;0    .    c.[5139C>T]+[=]    32.73    Not low    found
2    166210776    C    T    exonic    SCN2A    3095    23.1    0;0    1158;1177    0;0    457;303    1;0    0;0    .    c.[2994C>T]+[=]    24.56    .    Not low    found
9    138676400    -    C    exonic    KCNT1    98    13.5    0;0    0;0    0;98    0;0    0;19    0;0    .    c.2963_2964insC    19.39    FP    Not low    Not found
11    1780815    C    -    exonic    CTSD    187    12.9    0;0    9;117    0;0    0;0    0;0    0;61    rs141482597    c.283delG    32.62    RFP    Not low    Not found
16    10273906    -    G    exonic    GRIN2A    3252    16.7    0;2    0;0    586;2664    0;0    3;627    0;0    rs145961628    c.363_364insC    19.37    RFP    Not low    Not found
7    148106478    -    GT    intronic    CNTNAP2    4168    28.6    0;0    0;1    0;0    2199;1967    1129;997    0;1    rs60451214    c.3716-5_3716-4insGT    51.01    .    Not low    Not found
2    166901684    -    T    exonic    SCN1A    1540    14.4    313;1227    0;0    0;0    0;0    0;291    0;0    .    c.1530_1531insA    18.9    FP     Not low    Not found
7    148106476    -    TT    intronic    CNTNAP2    4170    28.6    0;0    0;1    0;0    2208;1961    1131;996    0;0    rs61232377    c.3716-7_3716-6insTT    51.01    .    Not low    Not found
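
The manual edits above (filling blanks with . and removing the stray spaces around values such as " Not found") could probably be done in awk as well. This is a rough sketch, assuming tab-separated input that may carry DOS line endings; the function name normalize_fields is made up for illustration and has not been run against the real files.

```shell
# Hypothetical helper: trim spaces around each field, drop a trailing
# carriage return, and replace fields that end up empty with ".".
normalize_fields() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        sub(/\r$/, "")                    # remove a DOS line ending, if present
        for (f = 1; f <= NF; f++) {
            gsub(/^[ ]+|[ ]+$/, "", $f)   # trim leading and trailing spaces
            if ($f == "") $f = "."
        }
        print
    }' "$@"
}

# Example: trailing space, empty field, leading space, DOS line ending
printf 'FP \t\t Not found\r\n' | normalize_fields
# prints: FP<TAB>.<TAB>Not found
```

Spaces inside a value, as in "Not low", are left alone because only the ends of each field are trimmed.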


Last edited by cmccabe; 12-05-2016 at 09:38 AM.. Reason: added manual edit