AWK or python??


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting AWK or python??
# 1  
Old 04-06-2011
AWK or python??

Hi!

I'm a experimental biochemist, but I'm few time ago, I started to do bioinformatics stuffs.
I have to filer the output results of some programs that I've been used.

I've learned bash and a little bit of AWK. But, now I have to do filters more complicated, and I don't know how! Smilie

What I should learn? AWK or python?

For example, here are one of the filters, that I have to do:


Original Data

HTML Code:
NAMES      Match     %Score          Signals       Path
--------------------------------------------------------------------------------------------------

AK154051    1612    83.5666148263    GTAG,GTAG    AK154051chrX155,398,170,105,317,40,1,34,38,71,36,95,136,95,27,4,14,22,42,71,0,158,556,726,1327,1644,1788,1789,1823,1861,1933,1969,2064,2203,2302,2330,2336,2352,2375,2432
AK154051    1918    99.4297563504    GTAG,GTAG,GTAG,GTAG    AK154051chr1284,128,208,192,619,696,0,11097,13260,19852,21314,21933
AK154154    378    12.3368146214    ACAC,CTGC,CTAC    AK154154chr10211,17,2,36,1,440,133,0,211,228,230,30134,119467,147210
AK154154    575    18.7663185379    AATG    AK154154chr10416,195,6,320,412,55,247,0,416,612,619,50578,50990,51045
AK154356    1082    35.3710362864    GCAG,GTAG,GTAG,ATAC,GTAG,ATCC,GTAG    AK154356chr617,2,99,4,102,33,3,23,75,56,176,154,17,112,67,108,186,30,40,46,17,1,105,99,4,2,36,0,3528,3530,3629,16150,16253,84877,84882,84908,95995,128988,129165,129320,129338,129451,129518,151565,151758,151788,151828,151874,151891,210498,210603,210708,210721,210726
AK154356    1110    36.2863680941    GTAG,GTAG,GTAG    AK154356chrX189,151,25,44,102,15,4,11,8,28,12,11,54,3,21,40,114,173,19,154,129,3,181,53,112,99,0,191,342,367,411,513,647,653,666,674,702,715,120143,120199,120207,120229,120269,120384,120558,120578,120733,131163,131170,131354,131408,131521
AK154356    2238    73.161163779    ATAC,CTTT,GTAG,GTAG,GTAG,GTAG,GTAG,ATCA,GTAG    AK154356chr1722,77,88,18,53,299,14,149,24,148,51,6,81,68,57,59,30,103,123,96,7,261,234,277,17,41,106,105,233,174,0,6819,6897,6985,7004,23372,23676,23691,27402,27427,64510,64561,93606,93687,106380,106439,106505,106535,106638,106761,107259,107267,107528,107763,108043,108061,108102,165246,165352,295436
AK154365    1932    85.0352112676    GTAG,GTAG,ATAC    AK154365chr12152,148,11,79,158,145,337,2,8,24,35,35,246,254,157,179,137,134,0,136504,136653,136665,136746,136904,137049,137386,137388,144711,144735,144771,144807,145054,145308,145641,145820,145964
AK154365    1970    86.7077464789    ATAC,GTAG,GTAG    AK154365chr12134,137,178,417,246,35,35,24,8,2,337,145,158,79,11,112,187,0,136,273,628,1046,1293,1329,1364,8533,8541,8543,8880,9025,9184,9264,9276,94145
AK154365    2093    92.1214788732    ACCC,CTAC    AK154365chr1224,276,11,332,64,341,34,38,30,213,32,602,137,134,0,25,302,314,646,719,60885,175711,175750,175781,175994,176027,176630,176777
AK154365    2109    92.8257042254    CTGC    AK154365chr12135,136,340,249,246,35,33,376,396,11,300,0,162,298,638,888,1135,1171,9541,9929,10326,10338
AK154365    2109    92.8257042254    CTGC    AK154365chr12300,11,396,376,33,35,246,249,340,136,135,0,301,313,721,9518,9552,9588,9835,10084,10424,10587
AK154650    2570    99.3428681871    GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG    AK154650chr15797,104,58,91,160,137,158,136,169,158,188,127,237,65,0,1397,2129,2390,3305,4242,6973,7253,7878,8125,9759,12393,13559,20995
Filter:
At column 1, there are groups of lines which have the same name. Inside of this groups, select the name with the highest perceptual score (column 3) and the names that have 10 or less percent of difference with the highest score.
Then, if has been selected more than one name and there are names which only have GTAG, ATAC or GCAG signals, select the one of this which have the best score.


Desired Output

HTML Code:
NAMES      Match     %Score          Signals       Path
--------------------------------------------------------------------------------------------------

AK154051    1918    99.4297563504    GTAG,GTAG,GTAG,GTAG    AK154051chr1284,128,208,192,619,696,0,11097,13260,19852,21314,21933
AK154154    378    12.3368146214    ACAC,CTGC,CTAC    AK154154chr10211,17,2,36,1,440,133,0,211,228,230,30134,119467,147210
AK154154    575    18.7663185379    AATG    AK154154chr10416,195,6,320,412,55,247,0,416,612,619,50578,50990,51045
AK154356    2238    73.161163779     ATAC,CTTT,GTAG,GTAG,GTAG,GTAG,GTAG,ATCA,GTAG     AK154356chr1722,77,88,18,53,299,14,149,24,148,51,6,81,68,57,59,30,103,123,96,7,261,234,277,17,41,106,105,233,174,0,6819,6897,6985,7004,23372,23676,23691,27402,27427,64510,64561,93606,93687,106380,106439,106505,106535,106638,106761,107259,107267,107528,107763,108043,108061,108102,165246,165352,295436
AK154365    1970    86.7077464789    ATAC,GTAG,GTAG     AK154365chr12134,137,178,417,246,35,35,24,8,2,337,145,158,79,11,112,187,0,136,273,628,1046,1293,1329,1364,8533,8541,8543,8880,9025,9184,9264,9276,94145
AK154650    2570    99.3428681871    GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG    AK154650chr15797,104,58,91,160,137,158,136,169,158,188,127,237,65,0,1397,2129,2390,3305,4242,6973,7253,7878,8125,9759,12393,13559,20995
Code suggestion are welcome Smilie

But, mainly I need to know the easier way to do this, AWK or python or another? Because I have to do many filters and operations like that.

Thanks for your time!!!
# 2  
Old 04-06-2011
I don't know about Python;
I'd recommend awk or perl; perl would be a better choice if a lot of substitutions is to be done.
Also IMHO it's easier to write/maintain complex perl scripts, and perl is easier to debug.
They are both very fast processing line-by-line.
# 3  
Old 04-06-2011
awk has an obnoxious line-length limit on many systems. perl never has any limit except available memory and address space. perl's also available more places than python is, I think, and biochemists have been using it for similar purposes for quite a long time.
This User Gave Thanks to Corona688 For This Post:
# 4  
Old 04-06-2011
So, really it's all about Horses for Courses, sledgehammers and nuts, and all that?

I suppose the real winner is Perl. It's more widely available than Python, and is more powerful than AWK.

So long as people don't start playing golf with it, it's a fine language!
# 5  
Old 04-06-2011
the last step below I didn't understand.
Code:
Then, if has been selected more than one name and there are names which only have GTAG, ATAC or GCAG signals, select the one of this which have the best score.

So first, make groups of same name (1st column), then find out the highest score in the group, this entry should be in final result.

2nd step, compare other entries, within the same group, take a look how many entries with the score difference (with highest score) <=10

if the count >=2 (the 2 here doesn't include the highest score entry?) check the signals, entry which has higher score with certain signal wins.

is that correct?

therefore in the final result, for the same name, there would be max two entries. is that right?

and in your output

why AK154650 is not there?
and why
Code:
 AK154365    1970    86.7077464789

is there but
Code:
AK154365	2109	92.8257042254

not??

i make a python script, it may not exactly match what you need. but would be a good example that python can do it easily.

since I didn't understand your last requirement, the script didn't do the last step, for the same name, the script lists out the highest score entry and the entry with diff <=10 and only has GTAG, ATAC or GCAG signals

to make it simple, assume the column delimiter is "TAB" and the sample text has no header, that is, the first line is the data.

PHP Code:
#!/usr/bin/python
def doFilter():
    
open("t.txt")
    
groups = {}
    
result = []
    
dict set(['GTAG','ATAC','GCAG'])
    try:
        
rows = [x.replace("\n","") for x in f.readlines()]
        for 
r in rows:
            
key r.split("\t")[0]
            if (
not groups.has_key(key)):
                
groups[key] = []
            
groups[key].append(r)
        
keys groups.keys()
        for 
k in keys:
            
maxl max(groups[k], key=lambda xx.split("\t")[2])
            
result.append(maxl)
            for 
l in groups[k]:
                
diff float(l.split("\t")[2]) - float(maxl.split("\t")[2])
                if(
diff >and diff <= 10 and len(set(l.split("\t")[3].split(","))-dict)==0):
                    
result.append(l)
        for 
i in result:
            print 
i
                

    
finally:
        
f.close()

doFilter() 
output:
Code:
AK154051	1918	99.4297563504	GTAG,GTAG,GTAG,GTAG	AK154051chr1284,128,208,192,619,696,0,11097,13260,19852,21314,21933
AK154154	575	18.7663185379	AATG	AK154154chr10416,195,6,320,412,55,247,0,416,612,619,50578,50990,51045
AK154365	2109	92.8257042254	CTGC	AK154365chr12135,136,340,249,246,35,33,376,396,11,300,0,162,298,638,888,1135,1171,9541,9929,10326,10338
AK154650	2570	99.3428681871	GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG,GTAG	AK154650chr15797,104,58,91,160,137,158,136,169,158,188,127,237,65,0,1397,2129,2390,3305,4242,6973,7253,7878,8125,9759,12393,13559,20995
AK154356	2238	73.161163779	ATAC,CTTT,GTAG,GTAG,GTAG,GTAG,GTAG,ATCA,GTAG	AK154356chr1722,77,88,18,53,299,14,149,24,148,51,6,81,68,57,59,30,103,123,96,7,261,234,277,17,41,106,105,233,174,0,6819,6897,6985,7004,23372,23676,23691,27402,27427,64510,64561,93606,93687,106380,106439,106505,106535,106638,106761,107259,107267,107528,107763,108043,108061,108102,165246,165352,295436


Last edited by sk1418; 04-06-2011 at 07:03 PM..
This User Gave Thanks to sk1418 For This Post:
# 6  
Old 04-06-2011
Thanks sk1418!

Your code will guide me a lot!

I didn't explain the filter in simple words, sorry about that, but it isn't easy.

Quote:
Originally Posted by sk1418
and in your output

why AK154650 is not there?
and why
Code:
 AK154365    1970    86.7077464789

is there but
Code:
AK154365    2109    92.8257042254

not??
It's beacuse
Code:
AK154365    2109    92.8257042254

has non GTAG, ATAC or GCAG signal, instead has a CTGC signal.

If the score isn't 10 percent higher, I want to selects those names which have GTAG, ATAC or GCAG signals.

Thanks you anyway! Smilie
Login or Register to Ask a Question

Previous Thread | Next Thread

9 More Discussions You Might Find Interesting

1. Programming

Create a C source and compile inside Python 1.4.0 to 3.7.0 in Python for ALL? platforms...

Hi all... As you know I like making code backwards compatible for as many platforms as possible. This Python script was in fact dedicated for the AMIGA A1200 using Pythons 1.4.0, 1.5.2, 1.6.0, 2.0.1, and 2.4.6 as that is all we have for varying levels of upgrades from a HDD and 4MB FastRam... (1 Reply)
Discussion started by: wisecracker
1 Replies

2. Windows & DOS: Issues & Discussions

How to execute python script on remote with python way..?

Hi all, I am trying to run below python code for connecting remote windows machine from unix to run an python file exist on that remote windows machine.. Below is the code I am trying: #!/usr/bin/env python import wmi c = wmi.WMI("xxxxx", user="xxxx", password="xxxxxxx")... (1 Reply)
Discussion started by: onenessboy
1 Replies

3. Shell Programming and Scripting

awk or sed or python for regular expressions ?

Linux 6.X environments (RHEL, Oracle Linux ) I could write basic shell scripts in bash. In my spare time, I was planning to learn awk or sed to deal with regular expression tasks I have to deal with. But, I gather that python is gaining popularity these days and I came to know that python has a... (5 Replies)
Discussion started by: John K
5 Replies

4. Shell Programming and Scripting

**python** unable to read the background color in python

I am working on requirement on spreadsheet in python scripting. I have a spreadsheet containing cell values and with background color. I am able to read the value value but unable to get the background color of that particular cell. Actually my requirement is to read the cell value along... (1 Reply)
Discussion started by: giridhar276
1 Replies

5. UNIX for Dummies Questions & Answers

awk/grep or parsing in python code

Hello, I am writing a python code. The output of the python code needs a little bit of parsing. From the output of python code, which has a lot of redundant data, I need to cut only those words or numbers which end with &. for example: if the output is-- "This is an example of tgbn123& what i... (0 Replies)
Discussion started by: Screech_you
0 Replies

6. Shell Programming and Scripting

python - wget xml doc and parse with awk

Well, that's what I'd do in bash :) Here's what I have so far: import urllib2 from BeautifulSoup import BeautifulStoneSoup xml = urllib2.urlopen('http://weatherlink.com/xml.php?user=blah&pass=blah') soup = BeautifulStoneSoup(xml) print soup.prettify() but all it does is grab the html... (0 Replies)
Discussion started by: unclecameron
0 Replies

7. Shell Programming and Scripting

awk or python to join alternating columns

Hi, here is my input TEST-SAM-A6ZZ SM-S6KV 0 0 2 1 2 3 3 3 2 2 2 2 1 1 2 2 1 3 1 3 1361 SM-HA4Q 0 0 2 1 3 3 3 3 2 2 2 2 1 1 2 2 1 3 3 3 4 2 4 2 2 276217 SM-H9ZG 0 0 2 2 3 3 3 3 2 2 2 2 1 1 2 2 3 3 1 3 2 2 2 2 GC15458 SM-HAQX 0 0 2 1 2 3 1 3 2 2 2 2 1 1 2 2 1 3 1 3 2 2 2 18331 SM-HA5E 0 0 2 2... (2 Replies)
Discussion started by: genehunter
2 Replies

8. Shell Programming and Scripting

Merge two rows using awk or python

Hi, Suppose I have a space delimited file like this: Serial# 1970 1971 1972 1973 1974 193532 21 2 X X X 200201 20 30 X X 40 200201 X X 13 15 X 393666 66 3 X X 5 ... (2 Replies)
Discussion started by: grossgermany
2 Replies

9. Shell Programming and Scripting

use python or awk to match names 'with error tolerance'

I think this is a very challenging problem I am facing and I have no idea how to deal with it Suppose I have two csv files A.csv Toyota Camry,1998,blue Honda Civic,1999,blue B.csv Toyota Inc. Camry, 2000km Honda Corp Civic,1500km I want to generate C.csv Toyota Camry,1998,blue... (7 Replies)
Discussion started by: grossgermany
7 Replies
Login or Register to Ask a Question