Perl to identify specific runs in input and print only lines identified


 
# 1  
Old 12-17-2016

In the perl one-liner below I am trying to identify runs of six consecutive a or A (6a or 6A) in the sequence under each line starting with >. The code seems close, but it prints every > line whether or not its sequence contains 6a or 6A. Only the > line whose sequence contains 6a or 6A needs to be printed.

So using the input file, only the >hg19_refGene_NM_001918_3 line would be printed, because its sequence has 6a or 6A in it. The other lines are just skipped (not printed). Thank you.

input
Code:
>hg19_refGene_NM_001918_2 range=chr1:100700982-100701077 5'pad=10 3'pad=10 strand=- repeatMasking=none
gtctttgaagCTCTCCGTGGACAGGTTGTTCAGTTCAAGCTCTCAGACAT
TGGAGAAGGGATTAGAGAAGTAACTGTTAAAGAATGgtaagtgaat
>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
tttcttttagGTATGTAAAAGAAGGAGATACAGTGTCTCAGTTTGATAGC
ATCTGTGAAGTTCAAAGTGATAAAGCTTCTGTTACCATCACTAGTCGTTA
TGATGGAGTCATTAAAAAACTCTATTATAATCTAGACGATATTGCCTATG
TGGGGAAGCCATTAGTAGACATAGAAACGGAAGCTTTAAAAGgtattgta
ag
>hg19_refGene_NM_001918_4 range=chr1:100684172-100684313 5'pad=10 3'pad=10 strand=- repeatMasking=none
ttgttaccagATTCAGAAGAAGATGTTGTTGAAACTCCTGCAGTGTCTCA
TGATGAACATACACACCAAGAGATAAAGGGCCGAAAAACACTGGCAACTC
CTGCAGTTCGCCGTCTGGCAATGGAAAACAATgtaagttctc
>hg19_refGene_NM_001918_5 range=chr1:100681529-100681765 5'pad=10 3'pad=10 strand=- repeatMasking=none
cattttttagATTAAGCTGAGTGAAGTTGTTGGCTCAGGAAAAGATGGCA
GAATACTTAAAGAAGATATCCTCAACTATTTGGAAAAGCAGACAGGAGCT
ATATTGCCTCCTTCACCCAAAGTTGAAATTATGCCACCTCCACCAAAGCC
AAAAGACATGACTGTTCCTATACTAGTATCAAAACCTCCGGTATTCACAG
GCAAAGACAAAACAGAACCCATAAAAGgtaatgataa

current output
Code:
>hg19_refGene_NM_001918_2 range=chr1:100700982-100701077 5'pad=10 3'pad=10 strand=- repeatMasking=none
>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
AAAAAA
>hg19_refGene_NM_001918_4 range=chr1:100684172-100684313 5'pad=10 3'pad=10 strand=- repeatMasking=none
>hg19_refGene_NM_001918_5 range=chr1:100681529-100681765 5'pad=10 3'pad=10 strand=- repeatMasking=none

desired output
Code:
>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
AAAAAA

perl
Code:
perl -076 -nE 'chomp; s/(.+)// && say qq{>$1}; s/\s//g; say $1 while /(a{6})/gi' input

# 2  
Old 12-18-2016
Hello cmccabe,

This is not a perl solution, but if that is acceptable, could you please try the following awk and let me know how it goes.
Code:
awk '(gsub(/A|a/,"&")==6 && $0 ~ /^>/)'  Input_file

Output will be as follows.
Code:
>hg19_refGene_NM_001918_2 range=chr1:100700982-100701077 5'pad=10 3'pad=10 strand=- repeatMasking=none
>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
>hg19_refGene_NM_001918_4 range=chr1:100684172-100684313 5'pad=10 3'pad=10 strand=- repeatMasking=none
>hg19_refGene_NM_001918_5 range=chr1:100681529-100681765 5'pad=10 3'pad=10 strand=- repeatMasking=none

Thanks,
R. Singh
# 3  
Old 12-18-2016
Try:
Code:
awk '{h=$1; $1=x} toupper($0)~/A{6}/{print RS h}' RS=\> FS='\n' OFS= file

Produces:
Code:
>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none

This approach does the following:
  • It concatenates all the lines in the payload into one single line without gaps
  • Then it matches the regular expression against uppercase version of the payload
  • It prints the header if there is at least one match
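
As a small illustration of these steps, here is a sketch that runs the same awk on a made-up two-record input. The record names rec1 and rec2 and their sequences are hypothetical, not taken from the thread. In rec1 the six A's are split across a line wrap, so they only become one run after the payload lines are joined. The pattern is spelled out as AAAAAA rather than A{6} so that it should also work with awks that lack interval expressions.

```shell
# Hypothetical test input: rec1 has six A's split over two lines (AAA + AAA),
# rec2 has two separate runs of three that do not join up.
hits=$(printf '>rec1 test\nGGAAA\nAAATT\n>rec2 test\nGGAAA\nTTAAA\n' |
  awk '{h=$1; $1=x} toupper($0)~/AAAAAA/{print RS h}' RS=\> FS='\n' OFS=)

# Show which headers were selected.
printf '%s\n' "$hits"
```

With this input only the >rec1 test header should be printed.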


--
Note: some older versions of awk have trouble with the {6} interval expression. In that case:
Code:
awk '{h=$1; $1=x} toupper($0)~/AAAAAA/{print RS h}' RS=\> FS='\n' OFS= file


--
Or would you like the matches printed as well?


--

I would suggest to modify your perl approach along these lines. See if that helps:
Code:
perl -076 -nE 's/(.+)//; $h=$1; s/\s//g; if(/a{6}/i){say qq(>$h); say $1 while /(a{6,})/gi}'

This one prints each match as the full run of 6 or more.

The original perl code:
  • does not test whether there was a match before printing the header line
  • matches exactly six characters at a time, so a sequence of 12 A's, for instance, would be reported as two matches.

Last edited by Scrutinizer; 12-18-2016 at 05:20 AM..
# 4  
Old 12-18-2016
In the instance of:

input
Code:
>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
tttcttttagGTATGTAAAAGAAGGAGATACAGTGTCTCAGTTTGATAGC
ATCTGTGAAGTTCAAAGTGATAAAGCTTCTGTTACCATCACTAGTCGTTA
TGATGGAGTCATTAAAAAAACTCTATTATAATCTAGACGATATTGCCTATG
TGGGGAAGCCATTAGTAGACATAGAAACGGAAGCTTTAAAAGgtattgta

there is a run of 7A (in the third sequence line), so if I am looking for 6A, there are 6 in that sequence. However,

Code:
perl -076 -nE 's/(.+)//; $h=$1; s/\s//g; if(/a{6}/i){say qq(>$h); say $1 while /(a{6,})/gi}' input

output
Code:
>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
AAAAAAA

desired output
Code:
>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
AAAAAA

Is there a way to print the 6A from this run instead of the entire run of 7A if only 6A is being searched for? Thank you.
# 5  
Old 12-18-2016
Quick adaptation, try:
Code:
perl -076 -nE 's/(.+)//; $h=$1; s/\s//g; if(/(^|[^a])a{6}($|[^a])/i){say qq(>$h); say $2 while /(^|[^a])(a{6})($|[^a])/gi}' input

or:
Code:
perl -076 -nE '
  s/(.+)//;
  $h=$1;
  s/\s//g;
  if(/(^|[^a])a{6}($|[^a])/i) {
    say qq(>$h);
    say $2 while /(^|[^a])(a{6})($|[^a])/gi
  }
' input

# 6  
Old 12-18-2016
Code:
perl -076 -naF'\n' -e '@c=/(A{6}|a{6})/g and print ">$F[0]\n@c\n"' input

# 7  
Old 12-18-2016
Hi Aia, that would also match six or more [Aa]'s, but the OP in #4 is looking for matches of exactly six [Aa]'s. Unlike the earlier suggestions, this approach also matches in the header and not just in the payload. It does not find consecutive [Aa]'s that are on either side of a line wrap in the payload, nor does it find mixed-case matches.