awk output is not the correct count


 
# 1  
Old 05-31-2016

The awk below runs on file1.txt and file2 and produces the output shown further down. The file2 excerpt is just an example of the format, as the full file is ~14MB; file1.txt is attached. I am trying to count the ids that match between the two files and output the ids that are missing. Thank you.

file2
Code:
970    NM_213590    chr13    +    50571142    50592603    50586076    50587300    2    50571142,50586070,    50571899,50592603,    0    TRIM13    cmpl    cmpl    -1,0,
2166    NM_001017364    chr1    +    207262583    207273337    207262876    207273274    6    207262583,207263655,207264988,207269866,207271494,207273133,    207262934,207263826,207265165,207269960,207271609,207273337,    0    C4BPB    cmpl    cmpl    0,1,1,1,2,0,
1044    NM_152866    chr11    +    60223281    60238225    60229847    60235941    8    60223281,60228544,60229657,60230474,60231760,60233393,60234431,60235722,    60223418,60228633,60230006,60230594,60231817,60233630,60234533,60238225,    0    MS4A1    cmpl    cmpl    -1,-1,0,0,0,0,0,0,
1274    NM_020466    chr6    -    90341942    90348474    90346991    90348435    3    90341942,90347460,90348390,    90347072,90347601,90348474,    0    LYRM2    cmpl    cmpl    0,0,0,
162    NM_014912    chr10    -    93808396    94050875    93811968    94000107    10    93808396,93841076,93851586,93870832,93902785,93904701,93940719,93952233,93999102,94050682,    93812196,93841258,93851701,93870951,93902875,93904842,93940776,93952393,94000118,94050875,    0    CPEB3    cmpl    cmpl    0,1,0,1,1,1,1,0,0,-1,
1241    NM_015613    chr10    -    85991275    86001217    85991682    86001195    4    85991275,85993828,85996975,86001073,    85992659,85994134,85997442,86001217,    0    LRIT1    cmpl    cmpl    1,1,2,0,
1962    NM_206880    chr5    +    180581942    180582890    180581942    180582890    1    180581942,    180582890,    0    OR2V2    cmpl    cmpl    0,
205    NM_020464    chr6    -    138743180    138893668    138745217    138893048    7    138743180,138750847,138751529,138768137,138794442,138817355,138892846,    138745953,138750980,138754817,138768330,138794570,138817508,138893668,    0    NHSL1    cmpl    cmpl    2,1,1,0,1,1,0,
116    NM_001193342    chr20    -    45186461    45280100    45188660    45242181    13    45186461,45192052,45194867,45204211,45212210,45216697,45217798,45221042,45224795,45228609,45239084,45242098,45279949,    45188837,45192190,45195029,45204324,45212308,45216802,45217894,45221168,45224981,45228676,45239248,45242191,45280100,    0    SLC13A3    cmpl    cmpl    0,0,0,1,2,2,2,2,2,1,2,0,-1,
1038    NM_152716    chr11    -    59404191    59436511    59405862    59436368    19    59404191,59406520,59406764,59410352,59415226,59416934,59418226,59419016,59419936,59420310,59421455,59422995,59423428,59423971,59425002,59426338,59426724,59434325,59436353,    59405884,59406670,59406856,59410508,59415386,59417083,59418286,59419114,59420060,59420491,59421545,59423213,59423518,59424073,59425197,59426419,59426942,59434437,59436511,    0    PATL1    cmpl    cmpl    2,2,0,0,2,0,0,1,0,2,2,0,0,0,0,0,1,0,0,
2041    NM_198184    chr3    +    190930321    190967910    190930321    190967910    3    190930321,190936535,190967825,    190930423,190936750,190967910,    0    OSTN    cmpl    cmpl    0,0,2,
1022    NM_198183    chr11    -    57319127    57335803    57319830    57322021    4    57319127,57321909,57327809,57335321,    57319982,57322096,57327905,57335803,    0    UBE2L6    cmpl    cmpl    1,0,-1,-1,
735    NR_037612    chr22    +    19705991    19712297    19712297    19712297    12    19705991,19707124,19707328,19707637,19707842,19708071,19708290,19709162,19709344,19709759,19709950,19711376,    19706094,19707221,19707415,19707761,19707977,19708189,19708392,19709259,19709480,19709862,19711102,19712297,    0    SEPT5-GP1BB    unk    unk    -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
735    NR_037611    chr22    +    19704742    19712297    19712297    19712297    12    19704742,19707124,19707328,19707637,19707842,19708071,19708290,19709162,19709344,19709759,19709950,19711376,    19706341,19707221,19707415,19707761,19707977,19708189,19708392,19709259,19709480,19709862,19711102,19712297,    0    SEPT5-GP1BB    unk    unk    -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,


awk
Code:
awk 'NR==FNR{a[$1];next}                # file1.txt: remember each id as an array key
     $13 in a{c++; delete a[$13]}       # file2: count a matching id once, then drop it
         END{if(c) print c " ids found";
             for(k in a) print k " missing"}' file1.txt file2 > output

output
Code:
4602 ids found
POMK-SGK196 missing
ACKR1 missing
POMGNT2-GTDC2 missing

However, when I run wc -l on file1.txt there are 4609 lines. Is there a better way to count the ids that are common to both files and output the missing ones? Thank you.
# 2  
Old 05-31-2016
The discrepancy between 4602 + 3 = 4605 and 4609 comes from the fact that file1.txt lists the following four ids twice each:
Code:
IL4R
CSF3R
RANGRF
SYNE1
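
To make the numbers reconcile in a single run, the script could report lines, unique ids, matches, duplicates and missing ids together. The following is only a sketch under stated assumptions, not the original poster's script: the two sample files are tiny made-up stand-ins (file1.txt with one id per line, file2 with the id in column 13 as in the thread), and the array names seen and hit are illustrative.

```shell
tmp=$(mktemp -d)
# Hypothetical miniature inputs: IL4R appears twice in file1.txt,
# and ACKR1 is absent from file2.
printf '%s\n' TRIM13 IL4R IL4R ACKR1 > "$tmp/file1.txt"
printf 'x\tx\tx\tx\tx\tx\tx\tx\tx\tx\tx\tx\t%s\n' TRIM13 IL4R > "$tmp/file2"

report=$(awk '
    # First file: count lines and unique ids, and how often each id occurs.
    NR == FNR { lines++; if (seen[$1]++ == 0) ids++; next }
    # Second file: count each matching id only once.
    ($13 in seen) && !hit[$13]++ { found++ }
    END {
        print lines " lines, " ids " unique ids, " found " found"
        for (k in seen) {
            if (seen[k] > 1) print k " listed " seen[k] " times"
            if (!(k in hit)) print k " missing"
        }
    }' "$tmp/file1.txt" "$tmp/file2")
echo "$report"
```

With this layout, lines minus unique ids gives the number of repeated entries, and unique ids minus found gives the number of missing ids. The order of the per-id lines is not guaranteed, because awk's for-in loop visits keys in an unspecified order.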

# 3  
Old 05-31-2016
You've got duplicates in file1:

Code:
      2 CSF3R
      2 IL4R
      2 RANGRF
      2 SYNE1
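
A listing of this shape is what sort piped into uniq -c typically prints. As a hedged sketch (the reply does not say which command was actually used, and the sample file here is a made-up stand-in for file1.txt):

```shell
tmp=$(mktemp -d)
# Hypothetical miniature of file1.txt: one id per line, two ids repeated.
printf '%s\n' SYNE1 IL4R TRIM13 IL4R SYNE1 > "$tmp/file1.txt"

# sort groups identical ids together, uniq -c prefixes each id with its count,
# and the awk filter keeps only ids that occur more than once.
dup_counts=$(sort "$tmp/file1.txt" | uniq -c | awk '$1 > 1')
echo "$dup_counts"
```

If file1.txt has more than one column, the id column would need to be extracted first, for example with cut or awk, before sorting.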
