awk !seen[$]++ in else loop

02-19-2019

Hi all,

I have been searching the net for a solution to my problem, unfortunately with no luck so far.
I want to sort a tab-delimited file on more than one column. A line should always be kept if the column I sort on has no value, but among the lines that do have a value I want only the unique ones.

I have tried the options:

Code:

sort -u -k 5,5 input_file | awk '!seen[$12]++' | grep 'IPR013087'

But here I lose the lines that have nothing in the 12th column.
Another option:

Code:

sort -u -k 5,5 Acropora_digitifera_protein.fasta.tsv| awk -F "\t" '{if ($12=="") print $0; else; !seen[$12]++}'| grep 'IPR013087'

Here it looks like "!seen[$12]++" does nothing, and the output is empty.

I want to keep all lines but have only the unique ones by the 5th column and by the 12th column; lines that have no value in the 12th column should always be kept.

In more detail:
My data set:

Code:

ACDI|gi|1005438440|ref|XP_015756623.1|  855e9b79f65e051746158c0f63a763a6        427     Pfam    PF00096 Zinc finger, C2H2 type  328     350     3.2E-5  T       14-02-2019      IPR013087       Zinc finger C2H2-type
ACDI|gi|1005438440|ref|XP_015756623.1|  855e9b79f65e051746158c0f63a763a6        427     SMART   SM00355         356     378     5.5E-5  T       14-02-2019      IPR013087       Zinc finger C2H2-type
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7        1063    MobiDBLite      mobidb-lite     consensus disorder prediction   646     688     -       T       14-02-2019
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7        1063    Gene3D  G3DSA:3.90.70.10                88      176     3.5E-13 T       14-02-2019
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7        1063    Gene3D  G3DSA:3.90.70.10                195     496     2.0E-66 T       14-02-2019
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7        1063    Gene3D  G3DSA:3.10.20.90                964     1044    1.5E-5  T       14-02-2019
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7        1063    Pfam    PF00443 Ubiquitin carboxyl-terminal hydrolase   96      492     1.9E-39 T       14-02-2019      IPR001394       Peptidase C19, ubiquitin carboxyl-terminal hydrolase
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7        1063    MobiDBLite      mobidb-lite     consensus disorder prediction   130     181     -       T       14-02-2019
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7        1063    CDD     cd02668 Peptidase_C19L  97      493     5.37034E-140    T       14-02-2019      IPR033841       Ubiquitin-specific peptidase 48
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7        1063    MobiDBLite      mobidb-lite     consensus disorder prediction   933     974     -       T       14-02-2019
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7        1063    MobiDBLite      mobidb-lite     consensus disorder prediction   944     961     -       T       14-02-2019
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7        1063    MobiDBLite      mobidb-lite     consensus disorder prediction   130     178     -       T       14-02-2019
Acropora_digitifera_protein.fasta.tsv 
ACDI|gi|1005433616|ref|XP_015754353.1|  3d8e1345398c9346035f2aaf36a0ba63        227     MobiDBLite      mobidb-lite     consensus disorder prediction   1       49      -       T       14-02-2019
ACDI|gi|1005492169|ref|XP_015748752.1|  9649b3b2f3e16813b541cc225f16e7e5        196     Pfam    PF04103 CD20-like family        13      156     7.0E-7  T       14-02-2019      IPR007237       CD20-like family
ACDI|gi|1005474816|ref|XP_015774180.1|  1169df0014aa2b06a4e07981d056bbcc        211     Pfam    PF03184 DDE superfamily endonuclease    3       140     1.8E-22 T       14-02-2019      IPR004875       DDE superfamily endonuclease domain
ACDI|gi|1005478159|ref|XP_015775824.1|  801de18fcf5e339f411fe95038ca00f3        192     CDD     cd01670 Death   148     181     1.65022E-6      T       14-02-2019
ACDI|gi|1005435757|ref|XP_015755391.1|  50dff494b456096e706288e96a1506e0        207     MobiDBLite      mobidb-lite     consensus disorder prediction   130     180     -       T       14-02-2019
ACDI|gi|1005480051|ref|XP_015776754.1|  c4efb60815fdf57cf0244dacf475f25d        266     Pfam    PF14997 CECR6/TMEM121 family    66      244     1.4E-22 T       14-02-2019      IPR032776       CECR6/TMEM121 family
ACDI|gi|1005453471|ref|XP_015763894.1|  4a622b0f2466759e2ab0e050856d6fcc        143     Pfam    PF04752 ChaC-like protein       6       123     1.8E-26 T       14-02-2019      IPR006840       Glutathione-specific gamma-glutamylcyclotransferase
ACDI|gi|1005420589|ref|XP_015757954.1|  5cbfe3f69839493b89232b2be5be6b49        190     Pfam    PF08499 3'5'-cyclic nucleotide phosphodiesterase N-terminal     137     188     5.8E-11 T       14-02-2019      IPR013706       3'5'-cyclic nucleotide phosphodiesterase N-terminal
ACDI|gi|1005471241|ref|XP_015772489.1|  c5c2e6c3d63d0d13b87ad195f58f54e6        234     Pfam    PF15745 AP-1 complex-associated regulatory protein      27      178     4.5E-17 T       14-02-2019      IPR031483       AP-1 complex-associated regulatory protein
ACDI|gi|1005448265|ref|XP_015761397.1|  4e8c83abd5bd43fcf3d681da11c99ac7        135     Gene3D  G3DSA:1.20.1250.20              1       112     3.0E-10 T       14-02-2019

I want to sort by the 5th and the 12th columns and have no duplicates in either of them.
The 5th is the method hit number (for example cd/G3D/PF etc.) and the 12th is the InterPro hit number (IPR).

So the output should contain lines that are unique by the 5th column and by the 12th column, keeping lines even if there is nothing in the 12th, like here:

Code:

ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7        1063    MobiDBLite      mobidb-lite     consensus disorder prediction   646     688     -       T       14-02-2019
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7        1063    Gene3D  G3DSA:3.90.70.10                88      176     3.5E-13 T       14-02-2019
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7        1063    Gene3D  G3DSA:3.10.20.90                964     1044    1.5E-5  T       14-02-2019
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7        1063    Pfam    PF00443 Ubiquitin carboxyl-terminal hydrolase   96      492     1.9E-39 T       14-02-2019      IPR001394       Peptidase C19, ubiquitin carboxyl-terminal hydrolase
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7        1063    CDD     cd02668 Peptidase_C19L  97      493     5.37034E-140    T       14-02-2019      IPR033841       Ubiquitin-specific peptidase 48
ACDI|gi|1005433616|ref|XP_015754353.1|  3d8e1345398c9346035f2aaf36a0ba63        227     MobiDBLite      mobidb-lite     consensus disorder prediction   1       49      -       T       14-02-2019
ACDI|gi|1005492169|ref|XP_015748752.1|  9649b3b2f3e16813b541cc225f16e7e5        196     Pfam    PF04103 CD20-like family        13      156     7.0E-7  T       14-02-2019      IPR007237       CD20-like family
ACDI|gi|1005474816|ref|XP_015774180.1|  1169df0014aa2b06a4e07981d056bbcc        211     Pfam    PF03184 DDE superfamily endonuclease    3       140     1.8E-22 T       14-02-2019      IPR004875       DDE superfamily endonuclease domain
ACDI|gi|1005478159|ref|XP_015775824.1|  801de18fcf5e339f411fe95038ca00f3        192     CDD     cd01670 Death   148     181     1.65022E-6      T       14-02-2019
ACDI|gi|1005480051|ref|XP_015776754.1|  c4efb60815fdf57cf0244dacf475f25d        266     Pfam    PF14997 CECR6/TMEM121 family    66      244     1.4E-22 T       14-02-2019      IPR032776       CECR6/TMEM121 family
ACDI|gi|1005453471|ref|XP_015763894.1|  4a622b0f2466759e2ab0e050856d6fcc        143     Pfam    PF04752 ChaC-like protein       6       123     1.8E-26 T       14-02-2019      IPR006840       Glutathione-specific gamma-glutamylcyclotransferase
ACDI|gi|1005420589|ref|XP_015757954.1|  5cbfe3f69839493b89232b2be5be6b49        190     Pfam    PF08499 3'5'-cyclic nucleotide phosphodiesterase N-terminal     137     188     5.8E-11 T       14-02-2019      IPR013706       3'5'-cyclic nucleotide phosphodiesterase N-terminal
ACDI|gi|1005471241|ref|XP_015772489.1|  c5c2e6c3d63d0d13b87ad195f58f54e6        234     Pfam    PF15745 AP-1 complex-associated regulatory protein      27      178     4.5E-17 T       14-02-2019      IPR031483       AP-1 complex-associated regulatory protein
ACDI|gi|1005448265|ref|XP_015761397.1|  4e8c83abd5bd43fcf3d681da11c99ac7        135     Gene3D  G3DSA:1.20.1250.20              1       112     3.0E-10 T       14-02-2019
ACDI|gi|1005438440|ref|XP_015756623.1|  855e9b79f65e051746158c0f63a763a6        427     Pfam    PF00096 Zinc finger, C2H2 type  328     350     3.2E-5  T       14-02-2019      IPR013087       Zinc finger C2H2-type

Thanks for reading until here!
Hope someone will have a solution for that!

Of course I could use a solution that takes more than one line, but a one-line solution would be better.

Thanks a lot!

Last edited by ksenia; 02-19-2019 at 10:29 AM.. Reason: too long

ksenia

02-19-2019

The original command contains a grep, which reduces the output to fewer lines than the sample output shown. Not sure how much this helps (I added tabs to the posted sample manually in order to test):

Code:

sort -u -k 5,5 input_file | awk -F "\t" '(!seen[$12]++ || ! length($12)) && /IPR013087/'
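
For anyone who wants to try the filter without the full file, here is a minimal sketch on invented data. The five rows are made up, column 2 stands in for $12 to keep it short, and the grep-style /IPR013087/ test is left out; only the "first occurrence OR empty" pattern is the same as in the command above.

```shell
# Hypothetical tab-delimited sample; column 2 plays the role of $12.
sample=$(printf 'a\tIPR1\nb\tIPR1\nc\t\nd\t\ne\tIPR2\n')

# Keep a row if its key is new, or if the key is empty.
kept=$(printf '%s\n' "$sample" |
    awk -F '\t' '(!seen[$2]++ || !length($2)) { print $1 }')

printf '%s\n' "$kept"
```

Row b is dropped because IPR1 was already seen on row a, while rows c and d both survive because their key is empty.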

rdrtx1

02-19-2019

Amazing, it is doing exactly what I was searching for.
Thanks a lot!

--- Post updated at 04:21 PM ---

Can you please give a walk-through of "(!seen[$12]++ || ! length($12))"?
Thanks

Moderator's Comments:

MOD's comment: Please wrap your samples and code in [CODE]....[/CODE] tags in all your posts, as per the forum's rules.

Last edited by RavinderSingh13; 02-19-2019 at 07:57 PM..

ksenia

02-19-2019

Code:

awk '!t[$5]++ && !b[$12"_"]++ && /IPR013087/' input_file
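
A possible caveat for the command above: it has no -F option, so awk splits fields on runs of spaces and tabs, and a description such as "Zinc finger, C2H2 type" shifts the later field numbers. A small check on one invented row (shortened to four tab-separated columns purely for illustration) shows the difference in field count:

```shell
# Hypothetical row with four tab-separated columns; column 3 contains spaces.
row=$(printf 'id1\tPfam\tZinc finger, C2H2 type\tIPR013087')

# Default splitting: any run of blanks or tabs separates fields.
default_nf=$(printf '%s\n' "$row" | awk '{ print NF }')

# Tab splitting: only a tab separates fields.
tab_nf=$(printf '%s\n' "$row" | awk -F '\t' '{ print NF }')

echo "default: $default_nf fields, tab: $tab_nf fields"
```

With default splitting the row has 7 fields, with -F '\t' it has 4, so $12 may not point at the InterPro column unless the tab separator is set.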

nezabudka

02-19-2019

Hello ksenia,

Could you please go through the following and let me know if it helps you.
This is for understanding purposes only; I haven't run it to check whether it still works with the comments in place (fair warning).

Code:

sort -u -k 5,5 input_file |                         ## Sort input_file, keeping one line per sort key, and pipe the result into awk.
awk -F "\t" '                                       ## Start the awk program; -F sets the field separator to TAB for every line.
(!seen[$12]++ || ! length($12)) && /IPR013087/      ## Condition 1: field 12 has not been seen before, OR field 12 is empty. Condition 2: the line contains IPR013087.
'                                                   ## End of the awk program. A line is printed only when both conditions are TRUE.
                                                    ## No action is given, so awk applies its default action: print the current line.
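
On the second attempt in the first post: inside an action block, a bare "!seen[$12]++" is only an expression. It updates the counter, but nothing is printed unless the action says so, and the lines printed by the "if" branch presumably contain no IPR013087, so the trailing grep removes them. A sketch of an if/else form that prints in both branches, on invented two-column data where column 2 stands in for $12:

```shell
# Hypothetical tab-delimited sample; column 2 plays the role of $12.
sample=$(printf 'a\tIPR1\nb\tIPR1\nc\t\nd\t\n')

out=$(printf '%s\n' "$sample" | awk -F '\t' '{
    if ($2 == "") print $1          # empty key: always keep
    else if (!seen[$2]++) print $1  # non-empty key: keep first occurrence only
}')

printf '%s\n' "$out"
```

This gives the same rows as the pattern-only form "!seen[$2]++ || !length($2)"; which one to use is a matter of taste.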

Thanks,
R. Singh

RavinderSingh13

03-02-2019

I must be missing something very basic here. If the input file has tab separated fields (as stated in post #1 in this thread), then why is any pipeline needed here? Why not just use:

Code:

sort -u -t"$(printf '\t')" -k5,5 -k12,12 "input file"

Note that the $(printf '\t') in the above can be replaced by a single literal <tab> character.

This will produce at least one line of output that is not in the output you say you want, which is the line in your sample input file:

Code:

Acropora_digitifera_protein.fasta.tsv

which has empty fields for both field #5 and field #12. Since this is a unique value for that pair of fields, it seems to meet your criteria and should be displayed, shouldn't it?

This is untested since the sample input file provided did not contain any <tab>s and I wasn't sure which <space>s should be replaced by <tab>s.
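
A small sketch of how the two sort keys combine, on invented three-column data (columns 2 and 3 stand in for fields 5 and 12, which is an assumption made only to keep the sample short). With -u, sort treats the pair of keys as one uniqueness key:

```shell
tab=$(printf '\t')

# Hypothetical rows: two share (PF1, IPR1), one is (PF1, IPR2),
# and two share PF2 with an empty last field.
count=$(printf 'x\tPF1\tIPR1\ny\tPF1\tIPR1\nz\tPF1\tIPR2\nw\tPF2\t\nv\tPF2\t\n' |
    sort -u -t "$tab" -k2,2 -k3,3 | wc -l | tr -d ' ')

echo "$count lines kept"
```

Three lines remain: one for (PF1, IPR1), one for (PF1, IPR2), and one for PF2 with an empty last field. So rows that share the first key but differ in the second are both kept, and rows with an empty second key collapse to one per first-key value, which is not the same as the awk filter that keeps every line with an empty 12th field.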

Don Cragun
