Extract lines with unique value using a specific column


 
# 1  
Old 10-06-2013

Hi there,

I need help extracting data from a tab-delimited file that looks like this:

Code:
#CHROM POS ID REF ALT Human Cow Dog Mouse Lizard
chr2 3033 . G C 0/0 0/0 0/0 1/1 0/0
chr3 35040 . G T 0/0 0/0 ./. 1/1 0/1
chr4 60584 . T G 1/1 1/1 0/1 1/1 0/0
chr10 7147815 . G A 0/0 1/1 0/0 0/0 ./.

I am only interested in rows where the Mouse value is unique compared with the other species (the actual file has more species).

The desired output

Code:
#CHROM POS ID REF ALT Human Cow Dog Mouse Lizard
chr2 3033 . G C 0/0 0/0 0/0 1/1 0/0
chr3 35040 . G T 0/0 0/0 ./. 1/1 0/1

I would be grateful for your help.

N.B. The data file is 5 GB in size.

Thanks

Last edited by houkto; 10-06-2013 at 11:57 PM..
# 2  
Old 10-07-2013
This should work:
Code:
grep '^#CHROM' file
grep 'chr[23]' file

Where file refers to /path/to/file

Last edited by sea; 10-07-2013 at 12:23 AM..
# 3  
Old 10-07-2013
Hi sea,

Thanks for your reply. I may not have fully explained what I want. I am interested in rows where the value in the Mouse column is unique compared with the other species in the same row. I don't see that being addressed in your code.

Thanks
# 4  
Old 10-07-2013
Erm, then it's:
Code:
grep 'chr[23]' /path/to/file | awk '{print $9}'

Hope this helps
# 5  
Old 10-07-2013
Hi again,

Thanks for your fast reply. This might be confusing, but I am more interested in the value under the Mouse column (e.g. 1/1) whenever it is unique compared with the other species. When it is, I am interested in columns such as CHROM, POS, REF and ALT.

Thanks
# 6  
Old 10-07-2013
You say that the file is tab delimited, but there are single space characters (rather than tabs) between fields in your sample file. Which is it?

How long is the longest line in your file? What operating system are you using, and what is the LINE_MAX limit on your system? That is, what is the output from the commands:
Code:
uname -a
getconf LINE_MAX

Are you saying that you want to print lines where the contents of the 9th field on the line are different from the contents of the 6th, 7th, 8th, and 10th fields? Is it always the 9th field that matters, is it always the field with the label Mouse in the 1st line in the file that matters, or is there some other way that you will let your script know which field matters?

Are the 1st five fields always ignored when comparing fields, or do the fields to be ignored vary?

Do you really want to print the entire line, or do you just want to print the 1st (#CHROM), 2nd (POS), 4th (REF), and 5th (ALT) fields from lines with unique Mouse data as indicated in your last message? Are those fields always in the same columns?
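
If the answer to that last question is that only CHROM, POS, REF and ALT are wanted, a minimal sketch could look like the following. It assumes the file really is tab delimited, that Mouse is always field 9, and that the sample columns start at field 6, as in the sample from post #1. The function name `mouse_unique_summary` and the temporary sample file are made up for illustration.

```shell
#!/bin/sh
# Sketch: for rows where field 9 (Mouse) differs from every other sample
# column (fields 6 onward), print only CHROM, POS, REF and ALT.
# Assumes tab-separated input laid out like the sample in post #1.

mouse_unique_summary() {
    # $1 = input file
    awk 'BEGIN { FS = OFS = "\t" }
        NR == 1 { print $1, $2, $4, $5; next }   # header row
        {
            for (i = 6; i <= NF; i++)
                if (i != 9 && $i == $9) next     # Mouse value is shared, skip row
            print $1, $2, $4, $5
        }' "$1"
}

# Rebuild the sample from post #1 with real tabs (10 fields per row).
sample=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    '#CHROM' POS ID REF ALT Human Cow Dog Mouse Lizard \
    chr2 3033 . G C 0/0 0/0 0/0 1/1 0/0 \
    chr3 35040 . G T 0/0 0/0 ./. 1/1 0/1 \
    chr4 60584 . T G 1/1 1/1 0/1 1/1 0/0 \
    chr10 7147815 . G A 0/0 1/1 0/0 0/0 ./. > "$sample"

summary=$(mouse_unique_summary "$sample")
printf '%s\n' "$summary"
rm -f "$sample"
```

On the sample data this prints the header fields followed by the chr2 and chr3 rows. awk reads one line at a time, so memory use should not grow with the 5 GB file size.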
This User Gave Thanks to Don Cragun For This Post:
# 7  
Old 10-07-2013
Based on your example input file, this should work:
Code:
awk '{for (i=6;i<=NF;i++) if ($i==$9 && i!=9) next}1' file

This User Gave Thanks to Subbeh For This Post:
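
The one-liner above hard-codes field 9 and relies on awk's default whitespace splitting. As a hedged variation, the sketch below looks the column up by its header label and splits on tabs only. It assumes a header on line 1 and sample columns starting at field 6; the function name `unique_in_column` and the `first` variable are illustrative, not something from the thread.

```shell
#!/bin/sh
# Sketch: keep rows where the value in a named column differs from every
# other sample column. Assumes tab-separated input, a header on line 1, and
# sample columns starting at field 6 (as in the example input).

unique_in_column() {
    # $1 = column label (e.g. Mouse), $2 = input file
    awk -F '\t' -v target="$1" -v first=6 '
        NR == 1 {
            # Locate the target column by its header label.
            for (i = first; i <= NF; i++) if ($i == target) col = i
            if (!col) { print "column not found: " target > "/dev/stderr"; exit 1 }
            print; next
        }
        {
            # Skip the row as soon as another sample column holds the same value.
            for (i = first; i <= NF; i++) if (i != col && $i == $col) next
            print
        }
    ' "$2"
}

# Rebuild the example input with real tabs (10 fields per row).
sample=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    '#CHROM' POS ID REF ALT Human Cow Dog Mouse Lizard \
    chr2 3033 . G C 0/0 0/0 0/0 1/1 0/0 \
    chr3 35040 . G T 0/0 0/0 ./. 1/1 0/1 \
    chr4 60584 . T G 1/1 1/1 0/1 1/1 0/0 \
    chr10 7147815 . G A 0/0 1/1 0/0 0/0 ./. > "$sample"

result=$(unique_in_column Mouse "$sample")
printf '%s\n' "$result"
```

Run against the example input, this keeps the header plus the chr2 and chr3 rows, matching the desired output in post #1. Passing a different label (for example `unique_in_column Lizard file`) checks another species without editing the awk program.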