awk !seen[$]++ in else loop


 
# 1  
Old 02-19-2019
awk !seen[$]++ in else loop

Hi all,

I was searching the net for a solution to my problem, unfortunately with nothing so far.
I want to sort a tab-delimited file on more than one column and keep a line if the column I sort on has no value, but for the lines that do have a value I want only the unique ones.

I have tried the options:
Code:
sort -u -k 5,5 input file| awk '!seen[$12]++'| grep 'IPR013087'

But here I lose the lines that have nothing in the 12th column.
Another option:
Code:
sort -u -k 5,5 Acropora_digitifera_protein.fasta.tsv| awk -F "\t" '{if ($12=="") print $0; else; !seen[$12]++}'| grep 'IPR013087'

Here it looks like "!seen[$12]++" does nothing, and the output is empty.
I want to keep all lines but have only the unique ones by the 5th column and by the 12th column, meaning that lines with no value in the 12th column should be kept.

In more detail:
# My data set:
Code:
ACDI|gi|1005438440|ref|XP_015756623.1|  855e9b79f65e051746158c0f63a763a6        427     Pfam    PF00096 Zinc finger, C2H2 type  328     350     3.2E-5  T       14-02-2019      IPR013087       Zinc finger C2H2-type
ACDI|gi|1005438440|ref|XP_015756623.1|  855e9b79f65e051746158c0f63a763a6        427     SMART   SM00355         356     378     5.5E-5  T       14-02-2019      IPR013087       Zinc finger C2H2-type
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7        1063    MobiDBLite      mobidb-lite     consensus disorder prediction   646     688     -       T       14-02-2019
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7        1063    Gene3D  G3DSA:3.90.70.10                88      176     3.5E-13 T       14-02-2019
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7        1063    Gene3D  G3DSA:3.90.70.10                195     496     2.0E-66 T       14-02-2019
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7        1063    Gene3D  G3DSA:3.10.20.90                964     1044    1.5E-5  T       14-02-2019
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7        1063    Pfam    PF00443 Ubiquitin carboxyl-terminal hydrolase   96      492     1.9E-39 T       14-02-2019      IPR001394       Peptidase C19, ubiquitin carboxyl-terminal hydrolase
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7        1063    MobiDBLite      mobidb-lite     consensus disorder prediction   130     181     -       T       14-02-2019
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7        1063    CDD     cd02668 Peptidase_C19L  97      493     5.37034E-140    T       14-02-2019      IPR033841       Ubiquitin-specific peptidase 48
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7        1063    MobiDBLite      mobidb-lite     consensus disorder prediction   933     974     -       T       14-02-2019
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7        1063    MobiDBLite      mobidb-lite     consensus disorder prediction   944     961     -       T       14-02-2019
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7        1063    MobiDBLite      mobidb-lite     consensus disorder prediction   130     178     -       T       14-02-2019
Acropora_digitifera_protein.fasta.tsv 
ACDI|gi|1005433616|ref|XP_015754353.1|  3d8e1345398c9346035f2aaf36a0ba63        227     MobiDBLite      mobidb-lite     consensus disorder prediction   1       49      -       T       14-02-2019
ACDI|gi|1005492169|ref|XP_015748752.1|  9649b3b2f3e16813b541cc225f16e7e5        196     Pfam    PF04103 CD20-like family        13      156     7.0E-7  T       14-02-2019      IPR007237       CD20-like family
ACDI|gi|1005474816|ref|XP_015774180.1|  1169df0014aa2b06a4e07981d056bbcc        211     Pfam    PF03184 DDE superfamily endonuclease    3       140     1.8E-22 T       14-02-2019      IPR004875       DDE superfamily endonuclease domain
ACDI|gi|1005478159|ref|XP_015775824.1|  801de18fcf5e339f411fe95038ca00f3        192     CDD     cd01670 Death   148     181     1.65022E-6      T       14-02-2019
ACDI|gi|1005435757|ref|XP_015755391.1|  50dff494b456096e706288e96a1506e0        207     MobiDBLite      mobidb-lite     consensus disorder prediction   130     180     -       T       14-02-2019
ACDI|gi|1005480051|ref|XP_015776754.1|  c4efb60815fdf57cf0244dacf475f25d        266     Pfam    PF14997 CECR6/TMEM121 family    66      244     1.4E-22 T       14-02-2019      IPR032776       CECR6/TMEM121 family
ACDI|gi|1005453471|ref|XP_015763894.1|  4a622b0f2466759e2ab0e050856d6fcc        143     Pfam    PF04752 ChaC-like protein       6       123     1.8E-26 T       14-02-2019      IPR006840       Glutathione-specific gamma-glutamylcyclotransferase
ACDI|gi|1005420589|ref|XP_015757954.1|  5cbfe3f69839493b89232b2be5be6b49        190     Pfam    PF08499 3'5'-cyclic nucleotide phosphodiesterase N-terminal     137     188     5.8E-11 T       14-02-2019      IPR013706       3'5'-cyclic nucleotide phosphodiesterase N-terminal
ACDI|gi|1005471241|ref|XP_015772489.1|  c5c2e6c3d63d0d13b87ad195f58f54e6        234     Pfam    PF15745 AP-1 complex-associated regulatory protein      27      178     4.5E-17 T       14-02-2019      IPR031483       AP-1 complex-associated regulatory protein
ACDI|gi|1005448265|ref|XP_015761397.1|  4e8c83abd5bd43fcf3d681da11c99ac7        135     Gene3D  G3DSA:1.20.1250.20              1       112     3.0E-10 T       14-02-2019

I want to sort by the 5th and the 12th columns and have no duplicates for either of them.
The 5th column is the method hit number (for example cd/G3D/PF etc.) and the 12th is the InterPro hit number (IPR).

So the output should contain lines that are unique by the 5th column and by the 12th column, even if there is nothing in the 12th, like here:
Code:
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7        1063    MobiDBLite      mobidb-lite     consensus disorder prediction   646     688     -       T       14-02-2019
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7        1063    Gene3D  G3DSA:3.90.70.10                88      176     3.5E-13 T       14-02-2019
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7        1063    Gene3D  G3DSA:3.10.20.90                964     1044    1.5E-5  T       14-02-2019
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7        1063    Pfam    PF00443 Ubiquitin carboxyl-terminal hydrolase   96      492     1.9E-39 T       14-02-2019      IPR001394       Peptidase C19, ubiquitin carboxyl-terminal hydrolase
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7        1063    CDD     cd02668 Peptidase_C19L  97      493     5.37034E-140    T       14-02-2019      IPR033841       Ubiquitin-specific peptidase 48
ACDI|gi|1005433616|ref|XP_015754353.1|  3d8e1345398c9346035f2aaf36a0ba63        227     MobiDBLite      mobidb-lite     consensus disorder prediction   1       49      -       T       14-02-2019
ACDI|gi|1005492169|ref|XP_015748752.1|  9649b3b2f3e16813b541cc225f16e7e5        196     Pfam    PF04103 CD20-like family        13      156     7.0E-7  T       14-02-2019      IPR007237       CD20-like family
ACDI|gi|1005474816|ref|XP_015774180.1|  1169df0014aa2b06a4e07981d056bbcc        211     Pfam    PF03184 DDE superfamily endonuclease    3       140     1.8E-22 T       14-02-2019      IPR004875       DDE superfamily endonuclease domain
ACDI|gi|1005478159|ref|XP_015775824.1|  801de18fcf5e339f411fe95038ca00f3        192     CDD     cd01670 Death   148     181     1.65022E-6      T       14-02-2019
ACDI|gi|1005480051|ref|XP_015776754.1|  c4efb60815fdf57cf0244dacf475f25d        266     Pfam    PF14997 CECR6/TMEM121 family    66      244     1.4E-22 T       14-02-2019      IPR032776       CECR6/TMEM121 family
ACDI|gi|1005453471|ref|XP_015763894.1|  4a622b0f2466759e2ab0e050856d6fcc        143     Pfam    PF04752 ChaC-like protein       6       123     1.8E-26 T       14-02-2019      IPR006840       Glutathione-specific gamma-glutamylcyclotransferase
ACDI|gi|1005420589|ref|XP_015757954.1|  5cbfe3f69839493b89232b2be5be6b49        190     Pfam    PF08499 3'5'-cyclic nucleotide phosphodiesterase N-terminal     137     188     5.8E-11 T       14-02-2019      IPR013706       3'5'-cyclic nucleotide phosphodiesterase N-terminal
ACDI|gi|1005471241|ref|XP_015772489.1|  c5c2e6c3d63d0d13b87ad195f58f54e6        234     Pfam    PF15745 AP-1 complex-associated regulatory protein      27      178     4.5E-17 T       14-02-2019      IPR031483       AP-1 complex-associated regulatory protein
ACDI|gi|1005448265|ref|XP_015761397.1|  4e8c83abd5bd43fcf3d681da11c99ac7        135     Gene3D  G3DSA:1.20.1250.20              1       112     3.0E-10 T       14-02-2019
ACDI|gi|1005438440|ref|XP_015756623.1|  855e9b79f65e051746158c0f63a763a6        427     Pfam    PF00096 Zinc finger, C2H2 type  328     350     3.2E-5  T       14-02-2019      IPR013087       Zinc finger C2H2-type

Thanks for reading until here!
Hope someone will have a solution for that!

Of course I could use a solution that takes more than one line, but a one-line solution would be better...

Thanks a lot!

Last edited by ksenia; 02-19-2019 at 10:29 AM.. Reason: too long
# 2  
Old 02-19-2019
The original command contains a grep, which will reduce the output to fewer lines than the sample output shown. Not sure how much this helps (I added tabs manually to the sample data in the post to test):

Code:
sort -u -k 5,5 input_file | awk -F "\t" '(!seen[$12]++ || ! length($12)) && /IPR013087/'
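
For readers wondering why the second attempt in post #1 printed nothing, here is a minimal sketch on made-up three-column data (the rows, and the use of field 3 in place of field 12, are assumptions for illustration only). It appears that `else;` closes the else branch with an empty statement, so `!seen[$12]++` runs for every line as a bare expression inside the action block, where it is evaluated but never printed. Only the lines with an empty field are printed, and the following grep for an IPR id then removes those as well.

```shell
# Made-up, tab-separated sample: id, method, InterPro id (empty on rows c and d).
data=$(printf 'a\tPfam\tIPR1\nb\tSMART\tIPR1\nc\tGene3D\t\nd\tCDD\t\n')

# Shape of the attempt in post #1: "else;" ends the else branch with an empty
# statement, so the bare expression runs on every line and prints nothing.
broken=$(printf '%s\n' "$data" |
  awk -F '\t' '{ if ($3 == "") print $0; else; !seen[$3]++ }')

# Condition-only form: print when field 3 is empty OR its value is new.
fixed=$(printf '%s\n' "$data" |
  awk -F '\t' '$3 == "" || !seen[$3]++')

printf 'broken:\n%s\nfixed:\n%s\n' "$broken" "$fixed"
```

With this sample the first form prints only rows c and d, while the second prints rows a, c and d. Moving the test out of the braces is what lets awk apply its default print action.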

These 4 Users Gave Thanks to rdrtx1 For This Post:
# 3  
Old 02-19-2019
Amazing, it is doing exactly what I was searching for.
Thanks a lot!

--- Post updated at 04:21 PM ---

Can you please give a walk-through of "(!seen[$12]++ || ! length($12))"?
Thanks

Moderator's Comments:
Please wrap your samples and code in [CODE]....[/CODE] tags in all your posts, as per the forum's rules.

Last edited by RavinderSingh13; 02-19-2019 at 07:57 PM..
This User Gave Thanks to ksenia For This Post:
# 4  
Old 02-19-2019
Code:
awk -F "\t" '!t[$5]++ && !b[$12"_"]++ && /IPR013087/' input_file

# 5  
Old 02-19-2019
Quote:
Originally Posted by ksenia
Can you please give walk through for "(!seen[$12]++ || ! length($12))"?
Hello ksenia,

Could you please go through the following and let me know if it helps you.
This is only for understanding purposes; I haven't run it to check whether it works with the comments in place (fair warning here).

Code:
sort -u -k 5,5 input_file |                         ## Run sort on input_file and send its standard output to awk as input; read more about | (pipe) in man bash.
awk -F "\t" '                                       ## Start the awk program, whose input is the output of sort. -F sets the field separator to TAB for all lines.
(!seen[$12]++ || ! length($12)) && /IPR013087/      ## 1st condition: EITHER the 12th field occurs for the first time in the array named seen OR the length of the 12th field is ZERO.
'                                                   ## 2nd condition: the line contains the string IPR013087. The line is printed only if BOTH conditions are TRUE.
                                                    ## awk works as "condition, then action"; no action is given here, so when the condition is TRUE the default action (print the current line) happens.
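
To see the counting behind `!seen[$12]++` step by step, here is a small sketch on made-up keys, one per line (the keys and the variable names `before`, `first` and `empty` are assumptions for illustration, and the whole line stands in for field 12). The post-increment returns the old count, so negating it is true only on the first occurrence of a key.

```shell
# Made-up keys, one per line; lines 3 and 5 are empty.
keys=$(printf 'IPR1\nIPR1\n\nIPR2\n\nIPR1')

trace=$(printf '%s\n' "$keys" |
  awk '{
    before = seen[$0] + 0     # how often this key occurred on earlier lines
    first  = !seen[$0]++      # 1 only on the first occurrence, then count up
    empty  = !length($0)      # 1 when the key is the empty string
    printf "%s|%d|%d|%d|%s\n", $0, before, first, empty, ((first || empty) ? "print" : "skip")
  }')

# Columns: key | count before | first? | empty? | decision
printf '%s\n' "$trace"
```

The repeated `IPR1` lines come out as `skip`, while both empty lines come out as `print`, which is the effect of the `|| ! length($12)` part.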

Thanks,
R. Singh
These 2 Users Gave Thanks to RavinderSingh13 For This Post:
# 6  
Old 03-02-2019
I must be missing something very basic here. If the input file has tab separated fields (as stated in post #1 in this thread), then why is any pipeline needed here? Why not just use:
Code:
sort -u -t"$(printf '\t')" -k5,5 -k12,12 "input file"

Note that the $(printf '\t') in the above can be replaced by a single literal <tab> character.

This will produce at least one line of output that is not in the output you say you want, which is the line in your sample input file:
Code:
Acropora_digitifera_protein.fasta.tsv

which has empty fields for both field #5 and field #12. Since this is a unique value for that pair of fields, it seems to meet your criteria and should be displayed, shouldn't it?

This is untested since the sample input file provided did not contain any <tab>s and I wasn't sure which <space>s should be replaced by <tab>s.
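
As a rough check of the two-key sort, here is a sketch on made-up three-column rows (fields 2 and 3 stand in for fields 5 and 12; the rows and the awk rule are assumptions, not taken from the real file). `sort -u` with two keys drops a line only when the pair of keys repeats, so two rows that share an InterPro id but differ in method id both survive. The awk line is one possible reading of the requested rule: unique by method id, and unique by InterPro id only when that field is non-empty.

```shell
tab=$(printf '\t')

# Hypothetical sample: id, method id, InterPro id (empty on p4 and p5).
sample() {
  printf 'p1\tPF1\tIPR1\np2\tSM1\tIPR1\np3\tPF1\tIPR1\np4\tG3D1\t\np5\tG3D1\t\n'
}

# One line per distinct (field 2, field 3) pair.
by_pair=$(sample | LC_ALL=C sort -u -t "$tab" -k2,2 -k3,3)

# Assumed alternative rule: each field deduplicated on its own, empty field 3 exempt.
per_column=$(sample | awk -F '\t' '!m[$2]++ && ($3 == "" || !ipr[$3]++)')

printf 'by pair:\n%s\nper column:\n%s\n' "$by_pair" "$per_column"
```

On this sample the sort keeps three lines (including the `SM1` row), while the awk rule keeps only `p1` and `p4`. That matches the sample in post #1, where the SM00355 line is absent from the wanted output.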