awk does not find ids with semi-colon in the name


 
# 1  
Old 04-23-2016

I am using awk to search $5 of the "input" file, using the "list" file as the search criteria. If an id from a line of "list" is found in "input", it is counted among the ids found. If it is not found in "input", it is reported as missing. The awk below runs and works for most ids, but the ids that sit in a $5 containing a ; are reported as missing even though they can be found manually in the file. I think the ; needs to be handled somewhere, but I am not sure where to add that. Thank you.

input
Code:
chrX    48933012    48933134    chrX:48933012-48933134    PRAF2;WDR45
chrX    48934078    48934193    chrX:48934078-48934193    PRAF2;WDR45
chrX    48934293    48934422    chrX:48934293-48934422    PRAF2;WDR45
chr17    42426522    42426680    chr17:42426522-42426680    GRN;L01117
chr17    42426783    42426929    chr17:42426783-42426929    GRN;L01117
chr17    30814628    30815572    chr17:30814628-30815572    AK307275;CDK5R1
chr2    234668923    234669807    chr2:234668923-234669807    UGT1A1;UGT1A10;UGT1A3;UGT1A4;UGT1A5;UGT1A6;UGT1A7;UGT1A8;UGT1A9
chr2    234675669    234675821    chr2:234675669-234675821    UGT1A1;UGT1A10;UGT1A3;UGT1A4;UGT1A5;UGT1A6;UGT1A7;UGT1A8;UGT1A9
chr12    9221325    9221448    chr12:9221325-9221448    A2M
chr12    9222330    9222419    chr12:9222330-9222419    A2M

list
Code:
PRAF
GRN
CDK5R1
UGT1A1
A2M

current output
Code:
1 ids found
CDK5R1 is missing
PRAF is missing
GRN is missing
UGT1A1 is missing

desired output
Code:
5 ids found

Code:
awk '
    NR==FNR { lookup[$0]++; next }
    ($5 in lookup) { seen[$5]++ } 
    END {
      print length(seen)" ids found"; 
      for (id in seen) delete lookup[id]; 
      for (id in lookup) print id " is missing"
}' list input > count

awk with error
Code:
awk '
>     NR==FNR { lookup[$0]+|;++; next }
>     ($5 in lookup) { seen[$5]++ } 
>     END {
>       print length(seen)" ids found"; 
>       for (id in seen) delete lookup[id]; 
>       for (id in lookup) print id " is missing"
> }' list2 input > count
awk: cmd. line:2:     NR==FNR { lookup[$0]+|;++; next }
awk: cmd. line:2:                          ^ syntax error
awk: cmd. line:2:     NR==FNR { lookup[$0]+|;++; next }
awk: cmd. line:2:                              ^ syntax error
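
For anyone following along, here is a minimal sketch of why the first script misses these ids. The `in` operator only tests whether the whole string is a key of the array, so a field such as GRN;L01117 never equals the key GRN. Splitting the field on ; first gives pieces that can be tested one at a time. The variable names field, part and out are only for illustration and are not from the original script.

```shell
out=$(awk 'BEGIN {
    lookup["GRN"]                 # one id, as if it had been read from "list"
    field = "GRN;L01117"          # a $5 value taken from "input"

    # Whole-field test: the key "GRN;L01117" does not exist in lookup[].
    print "whole field:", ((field in lookup) ? "found" : "not found")

    # Split on ";" and test each piece separately.
    n = split(field, part, ";")
    for (i = 1; i <= n; i++)
        print part[i] ":", ((part[i] in lookup) ? "found" : "not found")
}')
printf '%s\n' "$out"
```

This should report the whole field as not found, GRN as found and L01117 as not found, which is why the split has to happen on the lines of "input" rather than in the NR==FNR block.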


Last edited by cmccabe; 04-23-2016 at 10:28 AM.. Reason: added awk error
# 2  
Old 04-23-2016
Hi, try this modification to your code:
Code:
awk '
    NR==FNR { 
      lookup[$1]++
      next
    }
    { 
      split($5,F,/;/)
      for(i in F)
        if (F[i] in lookup)
          seen[F[i]]++
    } 
    END {
      print length(seen)" ids found"; 
      for (id in lookup) 
        if (!(id in seen)) 
          print id " is missing"
    }
' list input > count

Code:
4 ids found
PRAF is missing


--
Code:
Note: length(array) is a non-standard extension, so not every awk will support it
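
If your awk lacks length(array), one possible workaround (a sketch, not the only way) is to bump a counter the first time each id is seen. The counter name found and the shell variables tmp and result are my own. The files are a cut-down copy of the sample data from post #1, recreated in a scratch directory only so the snippet runs on its own.

```shell
# Recreate a shortened version of the sample files in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' PRAF GRN CDK5R1 UGT1A1 A2M > "$tmp/list"
cat > "$tmp/input" <<'EOF'
chrX    48933012    48933134    chrX:48933012-48933134    PRAF2;WDR45
chr17    42426522    42426680    chr17:42426522-42426680    GRN;L01117
chr17    30814628    30815572    chr17:30814628-30815572    AK307275;CDK5R1
chr2    234668923    234669807    chr2:234668923-234669807    UGT1A1;UGT1A10;UGT1A3
chr12    9221325    9221448    chr12:9221325-9221448    A2M
EOF

result=$(awk '
    NR==FNR { lookup[$1]; next }          # 1st file: remember each id
    {
      n = split($5, F, ";")               # 2nd file: break $5 into ids
      for (i = 1; i <= n; i++)
        if ((F[i] in lookup) && !(F[i] in seen)) {
          seen[F[i]]                      # mark the id as seen
          found++                         # count it once, no length() needed
        }
    }
    END {
      printf "%d ids found\n", found
      for (id in lookup)
        if (!(id in seen))
          print id " is missing"
    }
' "$tmp/list" "$tmp/input")

printf '%s\n' "$result"
rm -r "$tmp"
```

With this sample data it prints "4 ids found" followed by "PRAF is missing", the same result as the length(seen) version.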


Last edited by Scrutinizer; 04-24-2016 at 01:48 AM.. Reason: Grammar correction
# 3  
Old 04-23-2016
Thank you very much for your help, I really appreciate it.
# 4  
Old 04-23-2016
There really isn't any need to count the number of times you have seen an ID in the lookup[] and seen[] arrays. Assuming that your sample input data isn't really representative of the sizes of your real input files, the following suggestion might be faster or slower than Scrutinizer's suggestion since it handles the list of IDs in lookup[] (from the 1st input file) and the list of IDs in seen[] (from the 2nd input file) differently:
  • Scrutinizer's code checks lookup[] and adds entries to seen[] as it processes each input line. So seen[] will only contain elements that are also in lookup[].
  • The following code doesn't look at lookup[] while it's reading the 2nd input file. It adds elements to seen[] for each ID found in field 5 in lines in the 2nd input file. It then makes a single walk through lookup[] at the end, removing entries for IDs that are also found in seen[]. (Note that this might be a little more portable to other versions of awk because it doesn't depend on being able to use length(array name), which is an extension not required by the standards.)

You might want to compare the time taken by our two approaches with some of your real data.
Code:
awk '
FNR == NR {
	lookup[$1]
	next
}
{	for(i = split($5, F, /;/); i; i--)
		seen[F[i]]
}
END {	for(id in lookup)
		if(id in seen) {
			found++
			delete lookup[id]
		}
	print found + 0, "of", NR - FNR, "ids found"
	for(id in lookup)
		print id, "is missing"
}' list input

which, with your sample input files produces the output:
Code:
4 of 5 ids found
PRAF is missing

If you don't want the additional information ("of 5") shown in the output above, remove the corresponding code ("of", NR - FNR,) from the print statement in the above script.