Unique extraction of rows

11-14-2013


I have a tab-delimited file in the following format:

Code:

431 kat1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
432 kat2 2 NA NA NA NA NA NA NA NA NA NA NA NA NA
433 KATe NA 3 NA NA 6 NA NA NA 10 11 NA NA NA NA
542 Kaed 2 NA NA NA NA NA NA NA NA NA NA NA NA NA
543 hkwuy NA NA NA NA 6 NA NA NA NA 11 NA NA NA NA
633 KAT1 NA 3 NA NA 6 NA NA NA 10 11 NA NA NA NA



Each row contains 16 columns, and missing values are indicated as NA. I want to extract all the rows that contain one or more numeric values, from 2 to 15, that I specify.

Suppose I want to extract the rows that contain only 2. Below is the output I need:

Code:

432 kat2 2 NA NA NA NA NA NA NA NA NA NA NA NA NA
542 Kaed 2 NA NA NA NA NA NA NA NA NA NA NA NA NA

If I want to specify more than one number, for example rows that contain only 3, 10 and 11:

Code:

433 KATe NA 3 NA NA 6 NA NA NA 10 11 NA NA NA NA
633 KAT1 NA 3 NA NA 6 NA NA NA 10 11 NA NA NA NA

I tried the following using awk to get the row containing 2:

Code:

awk -F"\t" '$3 == "2" { print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10"\t"$11"\t"$12"\t"$13"\t"$14"\t"$15"\t"$16 }' file.in

But I don't know how to specify rows that contain only "2", nor how to specify more than one number. Please let me know the best way to do this extraction in awk.

Kanja


11-14-2013


Do you insist on awk, or would grep help as well?

Code:

grep -E "^[^ ]* [^ ]*.* (3|10|11) .*$" file
431 kat1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
433 KATe NA 3 NA NA 6 NA NA NA 10 11 NA NA NA NA
543 hkwuy NA NA NA NA 6 NA NA NA NA 11 NA NA NA NA
633 KAT1 NA 3 NA NA 6 NA NA NA 10 11 NA NA NA NA

BTW - what if the sequence of the patterns is reversed, like 11 - 10 - 3 - would that have to be a hit or not? Plus, is that an AND condition (all three patterns must show up) or an OR (any would be sufficient)?

RudiC


11-14-2013


A reversed pattern order is OK. But the AND condition (all three patterns must show up) is a must when extracting rows with more than one number.
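
The requirement above (every number present, in any order) can be sketched with a chain of whole-word greps, one per number. This is only a sketch under assumptions: the file name `sample.txt` and its space-separated contents are copied from the sample data in the first post, `-w` is the widely supported whole-word option, and a wanted number that also appears in the ID column would be matched too.

```shell
# Recreate the sample data from the first post (file name is hypothetical).
cat > sample.txt <<'EOF'
431 kat1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
432 kat2 2 NA NA NA NA NA NA NA NA NA NA NA NA NA
433 KATe NA 3 NA NA 6 NA NA NA 10 11 NA NA NA NA
542 Kaed 2 NA NA NA NA NA NA NA NA NA NA NA NA NA
543 hkwuy NA NA NA NA 6 NA NA NA NA 11 NA NA NA NA
633 KAT1 NA 3 NA NA 6 NA NA NA 10 11 NA NA NA NA
EOF

# Each grep keeps only the lines containing one number as a whole word,
# so the pipeline as a whole is an AND that ignores the order of the numbers.
and_rows=$(grep -w 11 sample.txt | grep -w 10 | grep -w 3)
printf '%s\n' "$and_rows"
```

On the sample data this keeps rows 431, 433 and 633; row 543 drops out because it has 11 but neither 3 nor 10.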

Kanja


11-14-2013


This seems to work, but I feel it's not quite satisfactory for all imaginable constellations:

Code:

grep -E "^[^ ]* [^ ]*.* 3( .*|.* )10( .*|.* )11 *.*$" file
431 kat1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
433 KATe NA 3 NA NA 6 NA NA NA 10 11 NA NA NA NA
633 KAT1 NA 3 NA NA 6 NA NA NA 10 11 NA NA NA NA

Maybe you need to loop in awk over every single field after the second.

RudiC


11-14-2013


Would it help if we removed all the NAs and left those fields blank?

Kanja


11-14-2013


Something to start with, assuming a number may appear only once on a line:

Code:

awk -f kan.awk myFile
or
awk -v nums='10 3 11' -f kan.awk myFile

where kan.awk is:

Code:

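BEGIN {
  # Default to "2" when no list is passed with -v nums='...'
  if (nums == "") nums="2"
  # Split on blanks (not FS, which may be a tab) and keep the numbers as array keys
  numsN=split(nums, tA, " ")
  for(i=1;i<=numsN;i++)
    numsA[tA[i]]
}
{
  # Count the wanted numbers among the data fields (columns 1 and 2 are ID and name)
  found=0
  for(i=3;i<=NF;i++)
    if ($i in numsA)
     found++
}
# Print the row when every wanted number was seen; rows with extra numbers match too
found==numsN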
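
If "only" in the original question is read strictly (the row must hold the requested numbers and no other non-NA value), one possible variation of the script above also counts the filled data fields. This is a sketch under that assumption: the file name `sample.txt`, the variable `filled`, and the strict reading itself are not from the thread, and it still assumes a number appears at most once per line.

```shell
# Sample data from the first post, converted to real tabs (file name is hypothetical).
printf '%s\n' \
  '431 kat1 2 3 4 5 6 7 8 9 10 11 12 13 14 15' \
  '432 kat2 2 NA NA NA NA NA NA NA NA NA NA NA NA NA' \
  '433 KATe NA 3 NA NA 6 NA NA NA 10 11 NA NA NA NA' \
  '542 Kaed 2 NA NA NA NA NA NA NA NA NA NA NA NA NA' \
  '543 hkwuy NA NA NA NA 6 NA NA NA NA 11 NA NA NA NA' \
  '633 KAT1 NA 3 NA NA 6 NA NA NA 10 11 NA NA NA NA' |
  tr ' ' '\t' > sample.txt

strict_rows=$(awk -F'\t' -v nums='2' '
BEGIN {
  # Store the requested numbers as array keys.
  numsN = split(nums, tA, " ")
  for (i = 1; i <= numsN; i++) numsA[tA[i]]
}
{
  found = filled = 0
  for (i = 3; i <= NF; i++) {
    if ($i != "NA" && $i != "") filled++   # any real value in the row
    if ($i in numsA) found++               # a requested value
  }
}
# Keep the row only if all requested values are present and nothing else is.
found == numsN && filled == numsN
' sample.txt)
printf '%s\n' "$strict_rows"
```

With `nums='2'` this keeps rows 432 and 542, the output asked for in the first post. With `nums='3 10 11'` the strict reading would reject rows 433 and 633, since both also contain 6, so dropping the `filled == numsN` test may be closer to what is wanted there.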

vgersh99


11-14-2013


I'm afraid we're getting nowhere with those regexes. Try this awkthingy and come back with results:

Code:

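awk '
{
  P = 1                                   # assume the row matches
  n = split(PARA, PATT)                   # reload the wanted numbers for each row
  for (i = 3; i <= NF; i++)               # data fields start at column 3
    for (j = 1; j <= n; j++)
      if ($i == PATT[j]) delete PATT[j]   # tick off every number found
  for (k in PATT) if (PATT[k]) P = 0      # anything left over means no match
}
P
' PARA="11 3 10" file | less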

RudiC
