Extract lines with unique value using a specific column

10-06-2013

Hi there,

I need help extracting data from a tab-delimited file that looks like this:

Code:

#CHROM POS ID REF ALT Human Cow Dog Mouse Lizard
chr2 3033 . G C 0/0 0/0 0/0 1/1 0/0
chr3 35040 . G T 0/0 0/0 ./. 1/1 0/1
chr4 60584 . T G 1/1 1/1 0/1 1/1 0/0
chr10 7147815 . G A 0/0 1/1 0/0 0/0 ./.

I am only interested in rows where the Mouse value is unique when compared with the other species (the actual file has more species).

The desired output

Code:

#CHROM POS ID REF ALT Human Cow Dog Mouse Lizard
chr2 3033 . G C 0/0 0/0 0/0 1/1 0/0
chr3 35040 . G T 0/0 0/0 ./. 1/1 0/1

I will be grateful for your help

N.B. The data file is 5 GB in size.

Thanks

Last edited by houkto; 10-06-2013 at 11:57 PM..

houkto

10-07-2013

This should work:

Code:

grep '^#CHROM' file
grep 'chr[23]' file

Where file refers to /path/to/file

Last edited by sea; 10-07-2013 at 12:23 AM..

sea

10-07-2013

Hi sea,

Thanks for your reply. I may not have fully explained what I want. I am interested in values in the Mouse column that are unique (within a row) when compared with the other species. I don't see that being addressed in your code.

Thanks

houkto

10-07-2013

Erm, then it's:

Code:

grep 'chr[23]' /path/to/file | awk '{print $9}'

Hope this helps

sea

10-07-2013

Hi again,

Thanks for your fast reply. This might be confusing, but I am more interested in the value underneath Mouse (e.g. 1/1) whenever it is unique compared with the other species. When it is, I am then interested in columns such as CHROM, POS, REF, and ALT.

Thanks

houkto

10-07-2013

You say that the file is tab delimited, but there are single space characters (rather than tabs) between fields in your sample file. Which is it?

How long is the longest line in your file? What operating system are you using, and what is the LINE_MAX limit on your system? That is, what is the output from the commands:

Code:

uname -a
getconf LINE_MAX

Are you saying that you want to print lines where the contents of the 9th field on the line differ from the contents of the 6th, 7th, 8th, and 10th fields? Is it always the 9th field that matters, is it always the field with the label Mouse in the 1st line in the file that matters, or is there some other way that you will let your script know which field matters?

Are the 1st five fields always ignored when comparing fields, or do the fields to be ignored vary?

Do you really want to print the entire line, or do you just want to print the 1st (#CHROM), 2nd (POS), 4th (REF), and 5th (ALT) fields from lines with unique Mouse data as indicated in your last message? Are those fields always in the same columns?
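Depending on the answers to these questions, one possible approach is to look the column up by its header label instead of hard-coding field 9, and to print only the fields of interest. The following is a minimal sketch under assumptions not confirmed in this thread: the input really is tab-separated, the first five fields are never compared, and a missing genotype (./.) is treated as an ordinary value. The function name unique_by_label and its arguments are made up for illustration.

```shell
# Sketch only: unique_by_label FILE [LABEL]
# Finds the column whose header equals LABEL (default: Mouse) and prints
# #CHROM, POS, REF and ALT for rows where that column matches no other
# genotype column. Assumes tab-separated input, genotypes from field 6 on.
unique_by_label() {
    awk -F'\t' -v OFS='\t' -v label="${2:-Mouse}" -v first=6 '
        NR == 1 {
            # Locate the target column from the header line.
            for (i = first; i <= NF; i++)
                if ($i == label) col = i
            if (!col) {
                print "column not found: " label > "/dev/stderr"
                exit 1
            }
            print $1, $2, $4, $5    # always keep the header
            next
        }
        {
            for (i = first; i <= NF; i++)
                if (i != col && $i == $col)
                    next            # value shared with another species
            print $1, $2, $4, $5    # value is unique in this row
        }
    ' "$1"
}
```

Called as unique_by_label /path/to/file Mouse, this prints the four selected fields for the header and for each row whose Mouse value is not shared with any other species. Since awk processes the input one line at a time, the 5 GB file size should not matter for memory use.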

Don Cragun

10-07-2013

Based on your example input file, this should work:

Code:

awk '{for (i=6;i<=NF;i++) if ($i==$9 && i!=9) next}1' file
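For readers who find the one-liner terse, here is the same idea spelled out as a commented sketch. It assumes tab-separated input (drop -F'\t' if the fields are separated by spaces, as in the pasted sample). The function name unique_in_column and its optional column arguments are hypothetical and not part of the original answer.

```shell
# Sketch only: unique_in_column FILE [COL] [FIRST]
# Prints whole lines whose value in field COL (default 9) differs from
# every other field in the range FIRST..NF (FIRST defaults to 6).
unique_in_column() {
    awk -F'\t' -v col="${2:-9}" -v first="${3:-6}" '
        {
            for (i = first; i <= NF; i++)
                if (i != col && $i == $col)
                    next    # another species has the same value: skip line
            print           # no other field matched: print the whole line
        }
    ' "$1"
}
```

As in the one-liner, the header line is printed only because its labels all differ from one another; it receives no special treatment.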

Subbeh
