Print . in blank fields to prevent fields from shifting


 
# 1  
Old 10-21-2016

The code below works well (kindly provided by @Don Cragun) and produces the current output shown further down. Because some of the printed fields can be blank, the fields after them shift to the left. I cannot seem to add a . to the blank fields as in the desired output. Basically, if a field is empty, print . in its place; otherwise print what the script matches. Thank you.

script
Code:
for file in /home/cmccabe/Desktop/concordance/comparison/update/*.txt ; do
    file1=${file##*/}    # Strip off directory
    getprefix=${file1%%_*.txt}    # Keep the part before the first underscore
    file1=$(printf '%s\n' "/home/cmccabe/Desktop/concordance/reference/files/${getprefix}_"*.txt) # look for matching file
    if [[ -f "$file1" ]]
    then
        awk '
        BEGIN {
            FS = OFS = "\t"
        }
        NR == 1 {
            outfile = FILENAME    # output overwrites the first input file
        }
        FNR == NR {
            # first file: save each line under a key built from fields 1-3,
            # keeping the input order in i[]
            o[i[++ic] = $1 OFS $2 OFS $3] = $0
        }
        {
            # any line whose fields 2, 4 and 5 form a saved key replaces that saved line
            if (($2 OFS $4 OFS $5) in o)
                o[$2 OFS $4 OFS $5] = $1 OFS $2 OFS $4 OFS $5 OFS $6 OFS $7 OFS $8 OFS $9 OFS $10 OFS $11 OFS $12 OFS $13 OFS $14 OFS $15 OFS $16 OFS $17 OFS $18 OFS $19
        }
        END {
            for (j = 1; j <= ic; j++)
                print o[i[j]] > outfile
        }' "$file" "$file1"
    fi
done
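
For readers unfamiliar with the two parameter expansions at the top of the loop, here is a minimal sketch of what they do. The path and file name are hypothetical and chosen only for illustration; they are not taken from the thread.

```shell
# Hypothetical input path; only the shape "<prefix>_<rest>.txt" matters.
file=/tmp/update/sampleA_run1.txt

base=${file##*/}          # remove the longest prefix matching "*/"  (the directory part)
prefix=${base%%_*.txt}    # remove the longest suffix matching "_*.txt"

printf '%s\n' "$base" "$prefix"
```

The loop then globs for a file in the reference directory whose name starts with that prefix followed by an underscore.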

current output
Code:
Missing in IDP but found in Reference:                                         
2    166848646    G    A    exonic    SCN1A    68    13    16;20    0;0    17;15    0;0    0;0    0;0        c.[5139C>T]+[=]    52.94        Not low     found 
12    52200340    A    C    exonic    SCN8A    4129    28.3    1560;1672    413;453    0;0    0;0    0;2    31;0        c.[5070A>C]+[=]    20.97        Not low     Not found 
13    77570076    -    A    exonic    CLN5    2762    26.6    2060;702    0;0    0;0    0;0    2050;696    0;0        c.526_527insA    99.42    TP    Not low     Not found 
7    148106478    -    GT    intronic    CNTNAP2    4051    28.5    0;1    0;0    0;0    2220;1829    1085;887    0;1    rs60451214    c.3716-5_3716-4insGT    48.68        Not low     Not found 
9    138678036    TGCCC    -    intronic    KCNT1    834    23.1    0;0    0;0    0;31    0;1    0;0    0;802    rs141359570    c.3178-7_3178-3delTGCCC    96.16        Not low     Not found 
7    148106476    -    TT    intronic    CNTNAP2    4052    28.8    0;0    5;0    0;0    2221;1826    1081;884    0;0    rs61232377    c.3716-7_3716-6insTT    48.49        Not low     Not found 
2    166245425    C    T    exonic    SCN2A    49    12.6    0;0    13;9    0;0    18;9    0;0    0;0        c.[5109C>T]+[=]    55.1        Not low     found

desired output
Code:
Missing in IDP but found in Reference:                                                                             
CHR    POS    REF    ALT    FUNC    GENE    COVERAGE    PHRED    A[#F,#R]    C[#F,#R]    G[#F,#R]    T[#F,#R]    INS[#F,#R]    DEL[#F,#R]    SNP    MUT    FREQ    SANGER    REGION    TVC 
2    166848646    G    A    exonic    SCN1A    68    13    16;20    0;0    17;15    0;0    0;0    0;0    .    c.[5139C>T]+[=]    52.94    .    Not low     found 
12    52200340    A    C    exonic    SCN8A    4129    28.3    1560;1672    413;453    0;0    0;0    0;2    31;0    .    c.[5070A>C]+[=]    20.97    .    Not low     Not found 
13    77570076    -    A    exonic    CLN5    2762    26.6    2060;702    0;0    0;0    0;0    2050;696    0;0    .    c.526_527insA    99.42    TP    Not low     Not found 
7    148106478    -    GT    intronic    CNTNAP2    4051    28.5    0;1    0;0    0;0    2220;1829    1085;887    0;1    rs60451214    c.3716-5_3716-4insGT    48.68    .    Not low     Not found 
9    138678036    TGCCC    -    intronic    KCNT1    834    23.1    0;0    0;0    0;31    0;1    0;0    0;802    rs141359570    c.3178-7_3178-3delTGCCC    96.16    .    Not low     Not found 
7    148106476    -    TT    intronic    CNTNAP2    4052    28.8    0;0    5;0    0;0    2221;1826    1081;884    0;0    rs61232377    c.3716-7_3716-6insTT    48.49    .    Not low     Not found 
2    166245425    C    T    exonic    SCN2A    49    12.6    0;0    13;9    0;0    18;9    0;0    0;0    .    c.[5109C>T]+[=]    55.1    .    Not low     found


Last edited by cmccabe; 10-21-2016 at 06:29 PM. Reason: fixed format
# 2  
Old 10-22-2016
Hello cmccabe,

Since you haven't shown us a sample Input_file, I haven't tested this. Could you please run the following and let us know how it goes?
Code:
for file in /home/cmccabe/Desktop/concordance/comparison/update/*.txt ; do
    file1=${file##*/}    # Strip off directory
    getprefix=${file1%%_*.txt}
    file1=$(printf '%s\n' "/home/cmccabe/Desktop/concordance/reference/files/${file1%%_*.txt}_"*.txt) # look for matching file
    if [[ -f "$file1" ]]
    then
          awk '
BEGIN {FS = OFS = "\t"
}
NR == 1 {
outfile = FILENAME
}
FNR == NR {
o[i[++ic] = $1 OFS $2 OFS $3] = $0
}
{for(i=1;i<=19;i++)
{if(!$i){$i="."}}
}
{if($2 OFS $4 OFS $5 in o)
o[$2 OFS $4 OFS $5] = $1 OFS $2 OFS $4 OFS $5 OFS $6 OFS $7 OFS $8 OFS $9 OFS $10 OFS $11 OFS $12 OFS $13 OFS $14 OFS $15 OFS $16 OFS $17 OFS $18 OFS $19           }   
END {for(j = 1; j <= ic; j++)
print o[i[j]] > outfile
}' $file $file1
   fi
done

The lines I added are the for loop over fields 1 to 19, placed just before the final if block.

Thanks,
R. Singh
# 3  
Old 10-22-2016
Hi Ravinder,
I don't remember which thread the code shown in post #1 was addressing, so I don't have any sample input either, and I haven't tested your code. Note, however, that the code:
Code:
if(!$i){$i="."}

will not only change field #i to a <period> if the field is empty; it will also change it to a period if the field contains a numeric string that evaluates to zero (e.g., 0, 0.000, and 0e+10). For cases like this, the following would be safer:
Code:
{if($i == "")$i = "."}
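
The difference between the two tests can be seen on a single tab-separated line. This is only a sketch: the function names and the sample line are made up for illustration, and the loop variable is k rather than i because the script in post #1 already uses i as an array name, and awk does not allow one name to be both an array and a scalar.

```shell
# Replace fields that are "false" in awk's sense (empty OR numerically zero).
fill_truthy() {
    awk 'BEGIN { FS = OFS = "\t" }
         { for (k = 1; k <= NF; k++) if (!$k) $k = "."; print }'
}

# Replace only fields that are the empty string.
fill_empty() {
    awk 'BEGIN { FS = OFS = "\t" }
         { for (k = 1; k <= NF; k++) if ($k == "") $k = "."; print }'
}

# Sample line: field 2 is 0, field 3 is empty.
printf 'a\t0\t\tb\n' | fill_truthy    # both the 0 and the empty field become "."
printf 'a\t0\t\tb\n' | fill_empty     # only the empty field becomes "."
```

Values such as 0;0 in the sample data do not look like numbers to awk, so they would survive either test, but a plain 0 in a coverage or frequency column would not survive the first one.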

# 4  
Old 10-24-2016
Sorry about that; I was leaving for a weekend trip. Anyway, here are the files:

file that is updated:
Code:
Missing in IDP but found in Reference:
2	166848646	G	A	exonic	SCN1A	68	13	16;20	0;0	17;15	0;0	0;0	0;0	c.[5139C>T]+[=]	52.94	Not low
12	52200340	A	C	exonic	SCN8A	4129	28.3	1560;1672	413;453	0;0	0;0	0;2	31;0	c.[5070A>C]+[=]	20.97	Not low
13	77570076	-	A	exonic	CLN5	2762	26.6	2060;702	0;0	0;0	0;0	2050;696	0;0	c.526_527insA	99.42	TP	Not low
7	148106478	-	GT	intronic	CNTNAP2	4051	28.5	0;1	0;0	0;0	2220;1829	1085;887	0;1	rs60451214	c.3716-5_3716-4insGT	48.68	Not low
9	138678036	TGCCC	-	intronic	KCNT1	834	23.1	0;0	0;0	0;31	0;1	0;0	0;802	rs141359570	c.3178-7_3178-3delTGCCC	96.16	Not low
7	148106476	-	TT	intronic	CNTNAP2	4052	28.8	0;0	5;0	0;0	2221;1826	1081;884	0;0	rs61232377	c.3716-7_3716-6insTT	48.49	Not low
2	166245425	C	T	exonic	SCN2A	49	12.6	0;0	13;9	0;0	18;9	0;0	0;0	c.[5109C>T]+[=]	55.1	Not low

current output:
Code:
Missing in IDP but found in Reference: has no . so fields shift when blank
CHR	POS	REF	ALT	FUNC	GENE	COVERAGE	PHRED	A[#F,#R]	C[#F,#R]	G[#F,#R]	T[#F,#R]	INS[#F,#R]	DEL[#F,#R]	SNP	MUT	FREQ	SANGER	REGION
 TVC 
2	166848646	G	A	exonic	SCN1A	68	13	16;20	0;0	17;15	0;0	0;0	0;0	c.[5139C>T]+[=]	52.94	Not low	 found
12	52200340	A	C	exonic	SCN8A	4129	28.3	1560;1672	413;453	0;0	0;0	0;2	31;0	c.[5070A>C]+[=]	20.97	Not low	 Not found
13	77570076	-	A	exonic	CLN5	2762	26.6	2060;702	0;0	0;0	0;0	2050;696	0;0	c.526_527insA	99.42	TP	Not low	 Not found
7	148106478	-	GT	intronic	CNTNAP2	4051	28.5	0;1	0;0	0;0	2220;1829	1085;887	0;1	rs60451214	c.3716-5_3716-4insGT	48.68	Not low	 Not found
9	138678036	TGCCC	-	intronic	KCNT1	834	23.1	0;0	0;0	0;31	0;1	0;0	0;802	rs141359570	c.3178-7_3178-3delTGCCC	96.16	Not low	 Not found
7	148106476	-	TT	intronic	CNTNAP2	4052	28.8	0;0	5;0	0;0	2221;1826	1081;884	0;0	rs61232377	c.3716-7_3716-6insTT	48.49	Not low	 Not found
2	166245425	C	T	exonic	SCN2A	49	12.6	0;0	13;9	0;0	18;9	0;0	0;0	c.[5109C>T]+[=]	55.1	Not low	 found

desired output: tab-delimited with . if the field is blank
Code:
Missing in IDP but found in Reference:			
CHR	POS	REF	ALT	FUNC	GENE	COVERAGE	PHRED	"A[#F,#R]"	"C[#F,#R]"	"G[#F,#R]"	"T[#F,#R]"	"INS[#F,#R]"	"DEL[#F,#R]"	SNP	MUT	FREQ	SANGER	REGION	TVC
2	166848646	G	A	exonic	SCN1A	68	13	16;20	0;0	17;15	0;0	0;0	0;0	.	c.[5139C>T]+[=]	52.94	.	Not low	 found
12	52200340	A	C	exonic	SCN8A	4129	28.3	1560;1672	413;453	0;0	0;0	0;2	31;0	.	c.[5070A>C]+[=]	20.97	.	Not low	 Not found
13	77570076	-	A	exonic	CLN5	2762	26.6	2060;702	0;0	0;0	0;0	2050;696	0;0	.	c.526_527insA	99.42	TP	Not low	 Not found
7	148106478	-	GT	intronic	CNTNAP2	4051	28.5	0;1	0;0	0;0	2220;1829	1085;887	0;1	rs60451214	c.3716-5_3716-4insGT	48.68	.	Not low	 Not found
9	138678036	TGCCC	-	intronic	KCNT1	834	23.1	0;0	0;0	0;31	0;1	0;0	0;802	rs141359570	c.3178-7_3178-3delTGCCC	96.16	.	Not low	 Not found
7	148106476	-	TT	intronic	CNTNAP2	4052	28.8	0;0	5;0	0;0	2221;1826	1081;884	0;0	rs61232377	c.3716-7_3716-6insTT	48.49	.	Not low	 Not found
2	166245425	C	T	exonic	SCN2A	49	12.6	0;0	13;9	0;0	18;9	0;0	0;0	.	c.[5109C>T]+[=]	55.1	.	Not low	 found

I updated the code with:
Code:
{for(i=1;i<=19;i++)
{if($i == "")$i = "."}
}

but that seemed to remove the last 6 fields from the output. Thank you.
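
One behaviour worth keeping in mind when such a loop runs to a fixed field number like 19: assigning to a field beyond NF makes awk extend the record, so a short line is padded at its end, not at the column that was actually missing. The sketch below illustrates this; pad_fields, its argument, and the sample line are hypothetical, and it does not by itself explain the missing fields reported above.

```shell
pad_fields() {
    # $1 = number of columns to pad to (hypothetical parameter)
    awk -v n="$1" 'BEGIN { FS = OFS = "\t" }
        {
            for (k = 1; k <= n; k++)
                if ($k == "") $k = "."    # for k > NF this appends a new field
            print
        }'
}

# Three fields, the second one empty; padded out to five columns.
printf 'a\t\tc\n' | pad_fields 5
```

The empty second field is filled in place because two adjacent TABs mark its position. A field that is simply absent from the line leaves no such marker, so the loop cannot tell which column it belonged to.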
# 5  
Old 10-24-2016
Hello cmccabe,

Could you please confirm that your Input_file is TAB-delimited? The example you posted doesn't seem to have TABs as the delimiter.

Thanks,
R. Singh
# 6  
Old 10-24-2016
Hi RavinderSingh13,

The input file is space-delimited, but the output is tab-delimited. Thank you.
# 7  
Old 10-24-2016
Hello cmccabe,

If I am not mistaken, the code was written for a TAB-delimited Input_file only: as you can see, FS is set to TAB in the BEGIN section, so awk cannot split the fields correctly if there are no TABs in Input_file. To see why space-delimited input will not work, take the example line 1 2 3 4 5 6 and first run it with TAB as the field separator:
Code:
echo "1              2 3 4 5   6" | awk -F"\t" '{for(i=1;i<=NF;i++){print i "---->" $i}}'
1---->1              2 3 4 5   6

As there is no TAB in the line, awk prints the whole line as a single field.
Now let's test it without setting the TAB delimiter:
Code:
echo "1              2 3 4 5   6" | awk  '{for(i=1;i<=NF;i++){print i"-->"$i}}'
1-->1
2-->2
3-->3
4-->4
5-->5
6-->6

In the outputs above, the number to the left of the arrow is the field number and the text to the right is the field's value. One thing you could try: if none of the fields in your Input_file contain spaces in their values, substitute TABs for the spaces and then run the code above. As a hint, you could use awk's gsub(/ +/,"\t",$0) function to do so.
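
A minimal sketch of that hint follows. The helper name spaces_to_tabs is made up for illustration; the second sample line borrows values from the data earlier in the thread to show the caveat about spaces inside field values.

```shell
# Hypothetical helper: turn every run of spaces into one TAB, then report
# the field count awk sees once the line has been re-split on TABs.
spaces_to_tabs() {
    awk 'BEGIN { FS = OFS = "\t" }
         {
             gsub(/ +/, "\t")     # no target given, so gsub edits $0 and awk re-splits it
             print NF ": " $0
         }'
}

printf '1              2 3 4 5   6\n' | spaces_to_tabs    # six fields
printf 'exonic SCN1A Not low\n' | spaces_to_tabs          # "Not low" is split into two fields
```

Because values such as Not low and Not found contain a space, a blanket substitution like this would add columns to the data shown in this thread, which is exactly the condition the hint warns about.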

Kindly try it and let us know how it goes.

Thanks,
R. Singh