awk does not find ids with semi-colon in the name

04-23-2016


I am using awk to search $5 of the "input" file, using the "list" file as the search criteria. If an id from a line of "list" is found in $5 of "input", it is counted in the ids found. If it is not found, it is printed as missing. The awk below runs and works for most ids, but ids in a $5 that contains a ; are reported as missing even though they can be found manually in the file. I am not sure where to add handling for the ; though. Thank you.


input

Code:

chrX    48933012    48933134    chrX:48933012-48933134    PRAF2;WDR45
chrX    48934078    48934193    chrX:48934078-48934193    PRAF2;WDR45
chrX    48934293    48934422    chrX:48934293-48934422    PRAF2;WDR45
chr17    42426522    42426680    chr17:42426522-42426680    GRN;L01117
chr17    42426783    42426929    chr17:42426783-42426929    GRN;L01117
chr17    30814628    30815572    chr17:30814628-30815572    AK307275;CDK5R1
chr2    234668923    234669807    chr2:234668923-234669807    UGT1A1;UGT1A10;UGT1A3;UGT1A4;UGT1A5;UGT1A6;UGT1A7;UGT1A8;UGT1A9
chr2    234675669    234675821    chr2:234675669-234675821    UGT1A1;UGT1A10;UGT1A3;UGT1A4;UGT1A5;UGT1A6;UGT1A7;UGT1A8;UGT1A9
chr12    9221325    9221448    chr12:9221325-9221448    A2M
chr12    9222330    9222419    chr12:9222330-9222419    A2M

list

Code:

PRAF
GRN
CDK5R1
UGT1A1
A2M

current output

Code:

1 ids found
CDK5R1 is missing
PRAF is missing
GRN is missing
UGT1A1 is missing

desired output

Code:

5 ids found

Code:

awk '
    NR==FNR { lookup[$0]++; next }
    ($5 in lookup) { seen[$5]++ } 
    END {
      print length(seen)" ids found"; 
      for (id in seen) delete lookup[id]; 
      for (id in lookup) print id " is missing"
}' list input > count

awk with error

Code:

awk '
>     NR==FNR { lookup[$0]+|;++; next }
>     ($5 in lookup) { seen[$5]++ } 
>     END {
>       print length(seen)" ids found"; 
>       for (id in seen) delete lookup[id]; 
>       for (id in lookup) print id " is missing"
> }' list2 input > count
awk: cmd. line:2:     NR==FNR { lookup[$0]+|;++; next }
awk: cmd. line:2:                          ^ syntax error
awk: cmd. line:2:     NR==FNR { lookup[$0]+|;++; next }
awk: cmd. line:2:                              ^ syntax error

Last edited by cmccabe; 04-23-2016 at 10:28 AM.. Reason: added awk error

cmccabe


04-23-2016


Hi, try this modification to your code:

Code:

awk '
    NR==FNR { 
      lookup[$1]++
      next
    }
    { 
      split($5,F,/;/)
      for(i in F)
        if (F[i] in lookup)
          seen[F[i]]++
    } 
    END {
      print length(seen)" ids found"; 
      for (id in lookup) 
        if (!(id in seen)) 
          print id " is missing"
    }
' list input > count

Code:

4 ids found
PRAF is missing
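The original test ($5 in lookup) only succeeds when the whole of field 5 is exactly equal to an id in list, so a field such as GRN;L01117 never matches GRN. The split() call breaks the field into its ;-separated parts so that each part can be tested on its own. A small sketch of that step in isolation (show_parts is a hypothetical helper name, not part of the script above):

```shell
# show_parts is a hypothetical helper for illustration only.
# It prints each ;-separated part of its argument on its own line.
show_parts() {
  printf '%s\n' "$1" | awk '{
    n = split($0, F, ";")        # n is the number of parts
    for (i = 1; i <= n; i++)     # walk the parts in their original order
      print i, F[i]
  }'
}

show_parts 'GRN;L01117'
```

For GRN;L01117 this prints "1 GRN" and "2 L01117"; only the first part is then found in lookup[].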

Note: length(array) is a non-standard extension, so not every awk will support it.
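Where length(array) is not available, the count can be kept by hand instead. A minimal sketch, assuming the same list and input layout as above; count_ids is a hypothetical wrapper name and found is just an illustrative counter:

```shell
# count_ids LIST INPUT
# count_ids is a hypothetical wrapper name; the awk program avoids
# length(array) by incrementing a counter the first time an id is seen.
count_ids() {
  awk '
    NR == FNR { lookup[$1]; next }          # first file: remember each id
    {
      n = split($5, F, ";")                 # second file: split field 5 on ;
      for (i = 1; i <= n; i++)
        if ((F[i] in lookup) && !(F[i] in seen)) {
          seen[F[i]]                        # mark the id as seen
          found++                           # count each id only once
        }
    }
    END {
      printf "%d ids found\n", found
      for (id in lookup)
        if (!(id in seen))
          print id " is missing"
    }
  ' "$1" "$2"
}
```

With the sample files above, count_ids list input prints "4 ids found" followed by "PRAF is missing".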

Last edited by Scrutinizer; 04-24-2016 at 01:48 AM.. Reason: Grammar correction


Scrutinizer


04-23-2016


Thank you very much for your help, I really appreciate it.

cmccabe


04-23-2016


There really isn't any need to count the number of times you have seen an ID in the lookup[] and seen[] arrays. Assuming that your sample input data isn't really representative of the sizes of your real input files, the following suggestion might be faster or slower than Scrutinizer's suggestion since it handles the list of IDs in lookup[] (from the 1st input file) and the list of IDs in seen[] (from the 2nd input file) differently:

Scrutinizer's code checks lookup[] and adds entries to seen[] as it processes each input line, so seen[] only ever contains IDs that are also in lookup[].
The following code doesn't look at lookup[] while it's reading the 2nd input file. It adds an element to seen[] for each ID found in field 5 of each line in the 2nd input file. It then makes a single walk through lookup[] at the end, removing entries for IDs that are also found in seen[]. (Note that this might be a little more portable to other versions of awk because it doesn't depend on length(array name), which is an extension not required by the standards.)

You might want to compare the time taken by our two approaches with some of your real data.

Code:

awk '
FNR == NR {
	lookup[$1]
	next
}
{	for(i = split($5, F, /;/); i; i--)
		seen[F[i]]
}
END {	for(id in lookup)
		if(id in seen) {
			found++
			delete lookup[id]
		}
	print found, "of", NR - FNR, "ids found"
	for(id in lookup)
		print id, "is missing"
}' list input

which, with your sample input files produces the output:

Code:

4 of 5 ids found
PRAF is missing

If you don't want the additional information in the output above (the "of 5"), remove the corresponding code ("of", NR - FNR,) from the first print statement in the above script.
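Both scripts report PRAF as missing because the comparison is an exact string match, and the sample input contains PRAF2 rather than PRAF. If the entry in list was meant to match any id that begins with it (an assumption; the thread does not say so), a prefix test with index() is one option. The sketch below uses the hypothetical name count_ids_prefix and compares every part of field 5 against every id in list, so it is likely to be slower on large files and will also let UGT1A1 match UGT1A10:

```shell
# count_ids_prefix LIST INPUT
# Hypothetical variant: an id from LIST counts as found when it is a
# prefix of any ;-separated part of field 5 in INPUT.
count_ids_prefix() {
  awk '
    NR == FNR { lookup[$1]; next }          # first file: remember each id
    {
      n = split($5, F, ";")                 # second file: split field 5 on ;
      for (i = 1; i <= n; i++)
        for (id in lookup)
          if (index(F[i], id) == 1)         # id starts at position 1 of this part
            seen[id]
    }
    END {
      for (id in lookup)
        if (id in seen)
          found++
      printf "%d ids found\n", found
      for (id in lookup)
        if (!(id in seen))
          print id " is missing"
    }
  ' "$1" "$2"
}
```

With the sample files from the first post, count_ids_prefix list input prints "5 ids found" and no missing lines.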

Don Cragun
