Perl to identify specific runs in input and print only lines identified


 
# 8  
Old 12-18-2016
Quote:
Originally Posted by cmccabe
In the instance of:

input
Code:
>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
tttcttttagGTATGTAAAAGAAGGAGATACAGTGTCTCAGTTTGATAGC
ATCTGTGAAGTTCAAAGTGATAAAGCTTCTGTTACCATCACTAGTCGTTA
TGATGGAGTCATTAAAAAAACTCTATTATAATCTAGACGATATTGCCTATG
TGGGGAAGCCATTAGTAGACATAGAAACGGAAGCTTTAAAAGgtattgta

there is a run of 7A (in bold in the original post; it is on the third sequence line), so if I am looking for 6A, there are 6 in that sequence. However,

Code:
perl -076 -nE 's/(.+)//; $h=$1; s/\s//g; if(/a{6}/i){say qq(>$h); say $1 while /(a{6,})/gi}' input

output
Code:
>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
AAAAAAA

desired output
Code:
>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
AAAAAA

Is there a way to print the 6A from this run instead of the entire run of 7A if only 6A is being searched for? Thank you :).
Quote:
Originally Posted by Scrutinizer
Hi Aia, that would also match 6 [Aa]'s or more. But the OP in #4 is looking for matches of exactly six [Aa]'s. Unlike the earlier suggestions, this approach would also match in the header and not just in the payload. It also does not find consecutive [Aa]'s that sit on either side of a line wrap in the payload, nor would it find mixed-case matches.
Hi Scrutinizer,
Post #4 shows that the OP would still like to print matches from runs longer than 6, but show only the requested amount, in this case 6. Please see the highlighted parts of that post.
Also, in FASTA you are not going to find six A or a in the header.
My suggestion does not match mixed cases.

I used the following exaggerated examples to test:

cat fasta02.file

Code:
>hg19_refGene_NM_001918_2 range=chr1:100700982-100701077 5'pad=10 3'pad=10 strand=- repeatMasking=none
gtctttgaagCTCTCCGTGGACAGGTTGTTCAGTTCAAGCTCTCAGACAT
TGGAGAAGGGATTAGAGAAGTAACTGTTAAAGAATGgtaagtgaat
>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
tttcttttagGTATGTAAAAGAAGGAGATACAGTGTCTCAGTTTGATAGC
ATCTGTGAAGTTCAAAGTGATAAAGCTTCTGTTACCATCACTAGTCGTTA
TGATGGAGTCATTAAAAAACTCTATTATAATCTAGACGATATTGCCTATG
TGGGGAAGCCATTAGTAGACATAGAAACGGAAGCTTTAAAAGgtattgta
ag
>hg19_refGene_NM_001918_10 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
tttcttttagGTATGTAAAAGAAGGAGATACAGTGTCTCAGTTTGATAGC
ATCTGTGAAGTTCAAAGTGATAAAGCTTCTGTTACCATCACTAGTCGTTA
TGATGGAGTCATTAAAAAACTCTATTATAATCTAGACGATATTGCCTATG
TGGGGAAGCCATTAGTAGACATAGAAAAAAaaaaaaCGGAAGCTTTAAAAGgtattgta
ag
>hg19_refGene_NM_001918_4 range=chr1:100684172-100684313 5'pad=10 3'pad=10 strand=- repeatMasking=none
ttgttaccagATTCAGAAGAAGATGTTGTTGAAACTCCTGCAGTGTCTCA
TGATGAACATACACACCAAGAGATAAAGGGCCGAAAAACACTGGCAACTC
CTGCAGTTCGCCGTCTGGCAATGGAAAACAATgtaagttctc
>hg19_refGene_NM_001918_7 range=chr1:100684172-100684313 5'pad=10 3'pad=10 strand=- repeatMasking=none
ttgttaccagATTCAGAAGAAGATGTTGTTGAAACTCCTGCAGTGTCTCA
TGATGAACATACACACCAAGAGATAAAAAAAGGGCCGAAAAACACTGGCAACTCaaaaaa
CTGCAGTTCGCCGTCTGGCAATGGAAAACAATgtaagttctc
>hg19_refGene_NM_001918_5 range=chr1:100681529-100681765 5'pad=10 3'pad=10 strand=- repeatMasking=none
cattttttagATTAAGCTGAGTGAAGTTGTTGGCTCAGGAAAAGATGGCA
GAATACTTAAAGAAGATATCCTCAACTATTTGGAAAAGCAGACAGGAGCT
ATATTGCCTCCTTCACCCAAAGTTGAAATTATGCCACCTCCACCAAAGCC
AAAAGACATGACTGTTCCTATACTAGTATCAAAACCTCCGGTATTCACAG
GCAAAGACAAAACAGAACCCATAAAAGgtaatgataa
>hg19_refGene_NM_001918_8 range=chr1:100684172-100684313 5'pad=10 3'pad=10 strand=- repeatMasking=none
ttgttaccagATTCAGAAGAAGATGTTGTTGAAACTCCTGCAGTGTCTCA
TGATGAACATACACACCAAGAGATAAaaaAAGGGCCGAAAAACACTGGCAACTC
CTGCAGTTCGCCGTCTGGCAATGGAAAACAATgtaagttctc

Here's the result:
Code:
perl -076 -naF'\n' -e '@c=/(A{6}|a{6})/g and print ">$F[0]\n@c\n"' fasta02.file
>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
AAAAAA
>hg19_refGene_NM_001918_10 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
AAAAAA AAAAAA aaaaaa
>hg19_refGene_NM_001918_7 range=chr1:100684172-100684313 5'pad=10 3'pad=10 strand=- repeatMasking=none
AAAAAA aaaaaa


Let's fix the problem of missing matches that span multiple lines:

Code:
perl -076 -naF'\n' -e 's/\n+//g; @c=/(A{6}|a{6})/g and print ">$F[0]\n@c\n"'


Last edited by Aia; 12-18-2016 at 10:57 PM.. Reason: Adds fix for multiple lines.
# 9  
Old 12-18-2016
Hi Aia, OK, I think you may be right about your interpretation of only printing 6 A's when there are 7 consecutive A's. That leaves an interesting question of what to do when a run is a multiple of 6 [Aa]'s or more. Should that then also be printed only once?

I am not sure that a run of [Aa]'s can under no circumstances be present in the FASTA header. Unlikely, sure, but impossible? The OP's original approach excludes the header from matching.

The mixed case I think could be mitigated by using:
Code:
@c=/a{6}/ig

But this also leaves the line wrap case; this approach does not find consecutive [Aa]'s that are on either side of a line wrap in the payload. So the newlines would need to be removed in the payload before the matching.

Code:
>hg19_refGene_NM_001918_9 range=chr1:100684172-100684313 5'pad=10 3'pad=10 strand=- repeatMasking=none
ttgttaccagATTCAGAAGAAGATGTTGTTGAAACTCCTGCAGTGTCTCA
ATGATGAACATACACACCAAGAGATAAAGGGCCGAAAAACACTGGCAACTCAAAA
AACTGCAGTTCGCCGTCTGGCAATGGAAAACAATgtaagttctc

# 10  
Old 12-18-2016
Quote:
Originally Posted by Scrutinizer
Hi Aia, OK, I think you may be right about your interpretation of only printing 6 A's when there are 7 consecutive A's. That leaves an interesting question of what to do when a run is a multiple of 6 [Aa]'s or more. Should that then also be printed only once?

I am not sure that a run of [Aa]'s can under no circumstances be present in the FASTA header. Unlikely, sure, but impossible? The OP's original approach excludes the header from matching.

The mixed case I think could be mitigated by using:
Code:
@c=/a{6}/ig

But this also leaves the line wrap case; this approach does not find consecutive [Aa]'s that are on either side of a line wrap in the payload. So the newlines would need to be removed in the payload before the matching.

Code:
>hg19_refGene_NM_001918_9 range=chr1:100684172-100684313 5'pad=10 3'pad=10 strand=- repeatMasking=none
ttgttaccagATTCAGAAGAAGATGTTGTTGAAACTCCTGCAGTGTCTCA
ATGATGAACATACACACCAAGAGATAAAGGGCCGAAAAACACTGGCAACTCAAAA
AACTGCAGTTCGCCGTCTGGCAATGGAAAACAATgtaagttctc

Hi, Scrutinizer,
I posted a fix for the multi-line case earlier; please see post #8.

Not matching mixed case was deliberate, by design. If I am wrong about that, this adaptation might suffice:
Quote:
The mixed case I think could be mitigated by using:
Code:
@c=/a{6}/ig


Last edited by Aia; 12-18-2016 at 11:52 PM..
# 11  
Old 12-19-2016
Thank you all for your help and explanations, I really appreciate it :).