Extract strings based on the value


 
# 1  
Old 11-05-2014

I have a file with multiple columns (in this case, the file has 3 columns):
Code:
NM_001006304 (-33.7)	XM_418228 (-38.4)	JN880447 (-33.7)
CR387600 (-33.7)	CR524203 (-36.3)	GALGA_6AKII_KRT75 (-33.7)
GALGA25_SC7 (-31.9)	CR352795 (-36.3)	NM_204172 (-31.7)
NM_204137 (-31.9)	NM_001030561 (-36.3)	
AB011672 (-31.5)	XM_414526 (-35.3)	
CR386285 (-31.3)	NM_001278076 (-35.3)	
BX930087 (-30.8)	NM_213578 (-35.0)	
CR406893 (-30.6)	NM_205141 (-34.9)	
BX930205 (-30.5)	NM_001277385 (-34.1)	
CR385278 (-30.4)	CR386046 (-34.0)	
CR406366 (-30.4)	NM_001001603 (-33.8)	
NM_001277590 (-30.3)	CR385555 (-33.5)	
XM_414551 (-30.1)	CR407317 (-33.2)	
CR386585 (-30.0)	CR391594 (-33.1)	
CR390278 (-30.0)	CR391382 (-32.8)	
NM_001277979 (-30.0)	XM_004939970 (-32.8)	
CR352458 (-29.9)	J02823 (-32.7)	
CR353040 (-29.9)	X80114 (-32.7)	
CR352882 (-29.8)	BX931544 (-32.5)	
XM_003643271 (-29.7)	CR391698 (-32.2)	
CR389895 (-29.6)	GALGA_UnR_CL2 (-29.5)	
NM_001002856 (-29.5)	L25374 (-28.6)	
BX930628 (-29.3)		
CR407317 (-29.2)		
NM_001199294 (-29.2)		
CR387217 (-28.7)		
CR389430 (-28.7)		
CR388761 (-28.5)		
NM_001185051 (-28.1)		
CR390290 (-27.9)		
GALGA25_CL8 (-27.1)		
GALGA25_CL4 (-26.8)

The strings in each column have been sorted in ascending order by the value in parentheses, as you can see. For each column, how can I extract the strings with the "n" lowest distinct values in parentheses? If n is 1, I just want to extract the strings with the lowest value in each column, so the output file looks like this:
Code:
NM_001006304 (-33.7)	XM_418228 (-38.4)	JN880447 (-33.7)
CR387600 (-33.7)		                GALGA_6AKII_KRT75 (-33.7)

If n is 2, I want to extract the strings with the 2 lowest values in parentheses for each column, so the output file looks like this:
Code:
NM_001006304 (-33.7)	XM_418228 (-38.4)	JN880447 (-33.7)
CR387600 (-33.7)	CR524203 (-36.3)	GALGA_6AKII_KRT75 (-33.7)
GALGA25_SC7 (-31.9)	CR352795 (-36.3)	NM_204172 (-31.7)
NM_204137 (-31.9)	NM_001030561 (-36.3)

If n is 3, I want to extract the strings with the 3 lowest values in parentheses for each column. Although the third column has only 2 different values (-33.7 and -31.7), I still want to extract all available strings in that column, so the output file looks like this:
Code:
NM_001006304 (-33.7)	XM_418228 (-38.4)	JN880447 (-33.7)
CR387600 (-33.7)	CR524203 (-36.3)	GALGA_6AKII_KRT75 (-33.7)
GALGA25_SC7 (-31.9)	CR352795 (-36.3)	NM_204172 (-31.7)
NM_204137 (-31.9)	NM_001030561 (-36.3)	
AB011672 (-31.5)	XM_414526 (-35.3)	
	                NM_001278076 (-35.3)

Thank you in advance!

Last edited by yuejian; 11-05-2014 at 03:30 PM..
# 2  
Old 11-05-2014
Any attempts from your side?
Why are CR386285 (-31.3) and NM_213578 (-35.0) missing from your n=4 sample output?
# 3  
Old 11-05-2014
Quote:
Originally Posted by RudiC
Any attempts from your side?
Why are CR386285 (-31.3) and NM_213578 (-35.0) missing from your n=4 sample output?
Sorry, I meant n=3; I have just corrected it in the original post.
# 4  
Old 11-05-2014
Attempts?
# 5  
Old 11-05-2014
Quote:
Originally Posted by RudiC
Attempts?
Hi RudiC, since each column is already sorted, I initially (and naively) thought I could just use the head or sed command to extract the top lines with the "n" lowest values. Then I realized that, within a column, one value may be shared by multiple strings, and I don't know how to deal with that.
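To make the difficulty concrete, here is a minimal sketch for a single column, assuming that column has already been extracted (for example with cut -f1) and is sorted ascending as in the sample. The function name lowest_n_single is hypothetical. It counts distinct values rather than lines, which is what head cannot do:

```shell
# Hypothetical helper: print the entries of ONE sorted column that carry
# the n lowest distinct values. Reads "NAME (VALUE)" lines on stdin.
lowest_n_single() {
    awk -v n="$1" '
        /^[ \t]*$/ { next }                       # ignore empty lines
        {
            val = $0
            gsub(/^[^(]*\(|\).*$/, "", val)       # keep only the number in parentheses
            val += 0                              # force a numeric value
            if (cnt == 0 || val != prev) {        # a new distinct value starts here
                cnt++
                prev = val
            }
            if (cnt > n) exit                     # past the n-th distinct value: stop
            print
        }'
}
```

With n=1, both entries that share the lowest value are printed, whereas head -n 1 would print only the first of them.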
# 6  
Old 11-05-2014
Try this, tested for n=1 through n=4:
Code:
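awk '
BEGIN   {MAX[1] = MAX[2] = MAX[3] = -1E100}     # start below any real value
        {for (i = 1; i <= 3; i++) {
                TX = $i
                gsub (/^[^(]*\(|\)/, "", TX)    # keep only the number inside the parentheses
                V[i] = TX + 0
                if (V[i] > MAX[i]) {CNT[i]++; MAX[i] = V[i]}    # new distinct value in column i
                if (CNT[i] <= n) OUT[i] = $i    # still within the n lowest distinct values
         }
         if (OUT[1] OUT[2] OUT[3]) print OUT[1] "\t" OUT[2] "\t" OUT[3]
         delete OUT
        }
' n=4 FS="\t" file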
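The script above hard-codes three columns. As a rough sketch only (not part of the original answer), the same counting idea can loop up to NF so that the column count does not need adjusting. The wrapper name top_n is hypothetical, and the sketch assumes tab-separated input with every column already sorted ascending, as in the sample file:

```shell
# Hypothetical wrapper: top_n N [FILE...]
# For every tab-separated column, keep the cells that belong to the
# N lowest distinct values of that column. Reads stdin if no file is given.
top_n() (
    n=$1
    shift
    awk -F '\t' -v n="$n" '
        {
            line = ""
            keep = 0
            for (i = 1; i <= NF; i++) {
                cell = ""
                if ($i ~ /\(/) {                          # skip empty cells
                    val = $i
                    gsub(/^[^(]*\(|\).*$/, "", val)       # number inside the parentheses
                    val += 0
                    if (cnt[i] == 0 || val != last[i]) {  # new distinct value in column i
                        cnt[i]++
                        last[i] = val
                    }
                    if (cnt[i] <= n) {                    # within the n lowest values
                        cell = $i
                        keep = 1
                    }
                }
                line = line (i > 1 ? "\t" : "") cell
            }
            if (keep) print line                          # drop rows with nothing kept
        }' "$@"
)
```

It would be called as top_n 2 file. Unlike the original, each output row has as many fields as the corresponding input row, and empty cells are not counted as a value.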

# 7  
Old 11-05-2014
Quote:
Originally Posted by RudiC
Try this, tested for n=1 through n=4:

Thank you very much. I just adjusted it to match my number of columns and it works perfectly. I also tested it with various values of n and they all work. Thank you, RudiC.
 