Filtering rows for first two instances of a value


 
# 1  
Old 04-29-2010

Kindly help me with this problem:

My data looks like this:
Code:
SNPfile.txt
CHR_A    BP_A    SNP_A    CHR_B    BP_B    SNP_B    R2    p-SNP_A     p-SNP_B
 4    172575323    rs17056855    4    172601079    rs11945883     0.119414    0.049972656    0.031050345
 4    172575323    rs17056855    4    172603701    rs7662060     0.11978    0.049972656    0.034664046
 4    172575323    rs17056855    4    172604186    rs17056919     0.382896    0.049972656    0.037136752
 4    172575323    rs17056855    4    172604350    rs4692884     0.11925    0.049972656    0.034632158
 4    172575323    rs17056855    4    172605017    rs17056925     0.373343    0.049972656    0.037902134
 5    148767855    rs12652986    5    148773386    rs353285     0.0727606    0.049972812    0.01748838
 5    148767855    rs12652986    5    148781294    rs353299     0.0692342    0.049972812    0.018621813
 5    148767855    rs12652986    5    148781964    rs353297     0.0324627    0.049972812    0.033439399
 5    148767855    rs12652986    5    148831342    rs414582     0.00996126    0.049972812    0.035560327
 5    148767855    rs12652986    5    148854047    rs10057083     0.00890368    0.049972812    0.03527856
 5    148767855    rs12652986    5    148866716    rs3756505     0.0106361    0.049972812    0.042519464
 21    19804594    rs7280435    21    19816817    rs456482     0.0613113    0.049972845    0.040731593
 21    19804594    rs7280435    21    19833893    rs128365    0.347322     0.049972845    0.01136657
 21    19804594    rs7280435    21    19865079    rs2226396     0.0193921    0.049972845    0.012638692
 21    19804594    rs7280435    21    19865343    rs2825653     0.0189789    0.049972845    0.018289668
 6    9690024    rs17797675    6    9703199    rs7749303    0.369563     0.049973276    0.018423278
 1    220966050    rs17532708    1    220970921    rs4240934     0.270007    0.049975232    0.010602066
 1    220966050    rs17532708    1    220972176    rs2378605     0.270007    0.049975232    0.010644871
 18    32233667    rs8092959    18    32264921    rs8087319     0.00315182    0.049975447    0.043287658
 12    2440796    rs4765937    12    2456906    rs2239087    0.416573     0.049978648    0.005179932

  1. I want to sort this first on col8 and then on col9, both in ascending order.
  2. I also want to filter the sorted list, keeping only the first two occurrences of each value in col3, and write the result to another file in tab-delimited format.

Output
Code:
CHR_A	BP_A	SNP_A	CHR_B	BP_B	SNP_B	R2	p-SNP_A	p-SNP_B
4	172575323	rs17056855	4	172601079	rs11945883	0.119414	0.049972656	0.031050345
4	172575323	rs17056855	4	172604350	rs4692884	0.11925	0.049972656	0.034632158
5	148767855	rs12652986	5	148773386	rs353285	0.0727606	0.049972812	0.01748838
5	148767855	rs12652986	5	148781294	rs353299	0.0692342	0.049972812	0.018621813
21	19804594	rs7280435	21	19833893	rs128365	0.347322	0.049972845	0.01136657
21	19804594	rs7280435	21	19865079	rs2226396	0.0193921	0.049972845	0.012638692
6	9690024	rs17797675	6	9703199	rs7749303	0.369563	0.049973276	0.018423278
1	220966050	rs17532708	1	220970921	rs4240934	0.270007	0.049975232	0.010602066
1	220966050	rs17532708	1	220972176	rs2378605	0.270007	0.049975232	0.010644871
18	32233667	rs8092959	18	32264921	rs8087319	0.00315182	0.049975447	0.043287658
12	2440796	rs4765937	12	2456906	rs2239087	0.416573	0.049978648	0.005179932

If I wanted a unique list, I would take only col3 and then sort it for unique values. However, I need two rows per value, and I don't know how to do that.
In some cases a value is present only once, which adds to the problem.
I am looking for a solution in awk, since I am learning it as I go by posting some of these issues here.
I also want to thank the many of you who have helped me so far.

Sincere thanks
~GH

Last edited by genehunter; 04-29-2010 at 04:47 PM..
# 2  
Old 04-29-2010
Code:
# numeric sort on field 8, then field 9 (the header line is sorted along with the data)
sort -k8,8n -k9,9n infile > outfile

Requirement number 2 is not clear.
Does col 3 have to have repeating values, i.e., 1 .. 1, in order to qualify as the first two?

This:
Quote:
In some cases, the values are present only once, adding to the problem
suggests that the contents of field #3 somehow have to occur twice.
# 3  
Old 04-30-2010
I have posted the required output.
If a Col3 value occurs only once, that row should also be written to the output. However, when a Col3 value occurs more than once, like this:
Code:
 CHR_A	BP_A	SNP_A	CHR_B	BP_B	SNP_B	R2	p-SNP_A	p-SNP_B
4	172575323	rs17056855	4	172601079	rs11945883	0.119414	0.049972656	0.031050345
4	172575323	rs17056855	4	172603701	rs7662060	0.11978	0.049972656	0.034664046
4	172575323	rs17056855	4	172604186	rs17056919	0.382896	0.049972656	0.037136752
4	172575323	rs17056855	4	172604350	rs4692884	0.11925	0.049972656	0.034632158
4	172575323	rs17056855	4	172605017	rs17056925	0.373343	0.049972656	0.037902134

It should write only the first two instances, after sorting on the columns p-SNP_A and p-SNP_B in that order.

Code:
CHR_A	BP_A	SNP_A	CHR_B	BP_B	SNP_B	R2	p-SNP_A	p-SNP_B
4	172575323	rs17056855	4	172601079	rs11945883	0.119414	0.049972656	0.031050345
4	172575323	rs17056855	4	172604350	rs4692884	0.11925	0.049972656	0.034632158
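
The requirement as stated (sort on p-SNP_A, then p-SNP_B, and keep at most two rows per SNP_A value) could be sketched roughly as follows. This is only one possible approach, not an answer given in the thread. The function name first_two_per_snp is hypothetical, and the -g (general numeric) sort key is assumed to be available, as it is in GNU sort. The header is handled separately so that it does not get mixed into the sorted data.

```shell
# Hypothetical helper (not from the thread): keep the header, sort the data
# rows on columns 8 and 9, then keep at most two rows per column-3 value.
first_two_per_snp() {
    infile=$1
    # Header line: rebuild it with tabs between the fields.
    head -n 1 "$infile" | awk -v OFS='\t' '{ $1 = $1; print }'
    # Data rows: sort on field 8, then field 9 (-g is assumed; it also
    # understands scientific notation such as 1E-06).
    tail -n +2 "$infile" |
        sort -k8,8g -k9,9g |
        # $1 = $1 rebuilds each row with tabs; the counter lets only the
        # first two rows for each column-3 value through.
        awk -v OFS='\t' '{ $1 = $1 } ++seen[$3] <= 2'
}
```

It would be called as first_two_per_snp SNPfile.txt > filtered.txt. A value that occurs only once in col3 is still written, because its counter never exceeds 2.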

Also, can you explain how to sort a column with a mixture of values in decimals and scientific format (1E-06)?
Thanks
~GH
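
On the question of sorting a column that mixes plain decimals with scientific notation: GNU sort has a -g (general numeric) key option for this. Where that option is not available, one possible workaround is to let awk convert the value to a fixed-point sort key, sort on that key, and then strip it again. The sketch below assumes this approach; the function name sort_numeric_col is hypothetical, and it assumes that 20 decimal places are enough to tell the values apart.

```shell
# Hypothetical helper (not from the thread): sort the lines of a file on one
# column whose values may be decimals (0.0005) or scientific (1E-06).
sort_numeric_col() {
    col=$1
    infile=$2
    # "$c + 0" makes awk parse either notation as a number; the value is
    # prepended as a fixed-point key followed by a tab.
    awk -v c="$col" '{ printf "%.20f\t%s\n", $c + 0, $0 }' "$infile" |
        sort -k1,1n |   # plain numeric sort on the prepended key
        cut -f2-        # drop the key, leaving the original line intact
}
```

For example, sort_numeric_col 8 SNPfile.txt sorts on p-SNP_A. A header line has a non-numeric value in that column, which awk evaluates as 0, so the header sorts ahead of any positive values.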

---------- Post updated 04-30-10 at 01:41 AM ---------- Previous update was 04-29-10 at 03:55 PM ----------

Please help!
vgersh my savior!!
Thanks
~GH