Identifying pattern within a block

 
# 1  
Old 04-11-2017

I have the following file, in which every record starts with @M at the beginning of a line, so I use @M as the record separator (RS):
Code:
 @M04961:22:000000000-B5VGJ:1:1101:9280:7106 1:N:0:86
 GGCATGAAAACATACAAACCGTCTTTCCAGAAATTGTTCCAAGTATCGGCAACAGCTTTATCAATACCATGAAAAATATCAACCACACCAGAAGCAGCAT
 +
 GGGGGGGGGGGGGGGGGCCGGGGGF,EDFFGEDFG,@DGGCGGEGGG7DCGGGF68CGFFFGGGG@CGDGFFDFEFEFF:30CGAFFDFEFF8CAF;;8F
 @M04961:22:000000000-B5VGJ:1:1101:14258:7136 1:N:0:86
 GGCATGAAAACATACAACAGCGGCTTTAACCGGACGCTCGACGCCATTAATAATGTTTTCCGTAAATTCAGCGCCTTCCATGATGAGACAGGCCGTTTGA
 +
 CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDEGGEGEGEGGG
 @M04961:22:000000000-B5VGJ:1:1101:15671:7305 1:N:0:86
 GGCATGAAAACATACAAAGTAAGGGGCCGAAGCCCCTGCAATTAAAATTGTTGACCACCTACATACCAAAGACGAGCGCCTTTACGCTTGCCTTTAGTAC
 +
 CCCC@CCFFGFGEGGGGGFGGGGGGGGFGGGGGGEFGGGGGGGGGCGGGGGGGGCFFG@GFFGGGGGCCGCGFGGGGGGGGGGGFFBEGG:CFF9>CGEG
 @M04961:22:000000000-B5VGJ:1:1101:10817:7690 1:N:0:86
 ACGAGCATCATCTTGATTAAGCTCATTAGGGTTAGCCTCGGTACGGTCAGGCATCCACGGCGCTTTAAAATAGTTGTTATAGATATTCAAATAACCCTGA
 +
 CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEEFFFGGGGGG
 @M04961:22:000000000-B5VGJ:1:1101:10091:7763 1:N:0:86
 GAGCACATTGTAGCATTGTGCCAATTCATCCATTAACTTCTCAGTAACAGATACAAACTCATCACGAACGTCAGAAGCAGCCTTATGGCCGTCAACATAC
 +
 :=@FGEFFFGGGGGGGFBB@BEFGG?F,EFCCF@FGGGGGGECFGFG9,><3>FC@DFFGG9:383@FC9,>;,>78FC=FCDECFFDGFFCFFGGC?FF
 @M04961:22:000000000-B5VGJ:1:1101:14783:7784 1:N:0:86
 TCTTATTACCATTTCAACTACTCCGGTTATCGCTGGCGACTCCTTCGAGATGGACGCCGTTGGCGCTCTCCGTCTTTCTCCATTGCGTCGTGGCCTTGCT
 +
 CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDGGGGGGGCGGGGFGG
 @M04961:22:000000000-B5VGJ:1:1101:26069:7790 1:N:0:86
 CAGAACGTGAAAAAGCGTCCTGCGTGTAGCGAACTGCGATGGGCATACTGTAACCATAAGGCCACGTATTTTGCAAGCTGGCATGAAAACATACAT
 +
 CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

I am using the following script to extract the records whose sequence contains a specific string (GGCATGAAAACATACA):
Code:
 awk -vRS="@M" '/GGCATGAAAACATACA/ { print "@M"$0 }' infile

The problem is that the string should be at the beginning of the second line of the record. Thus, the desired output file should include only three records:
Code:
 @M04961:22:000000000-B5VGJ:1:1101:9280:7106 1:N:0:86
 GGCATGAAAACATACAAACCGTCTTTCCAGAAATTGTTCCAAGTATCGGCAACAGCTTTATCAATACCATGAAAAATATCAACCACACCAGAAGCAGCAT
 +
 GGGGGGGGGGGGGGGGGCCGGGGGF,EDFFGEDFG,@DGGCGGEGGG7DCGGGF68CGFFFGGGG@CGDGFFDFEFEFF:30CGAFFDFEFF8CAF;;8F
 @M04961:22:000000000-B5VGJ:1:1101:14258:7136 1:N:0:86
 GGCATGAAAACATACAACAGCGGCTTTAACCGGACGCTCGACGCCATTAATAATGTTTTCCGTAAATTCAGCGCCTTCCATGATGAGACAGGCCGTTTGA
 +
 CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDEGGEGEGEGGG
 @M04961:22:000000000-B5VGJ:1:1101:15671:7305 1:N:0:86
 GGCATGAAAACATACAAAGTAAGGGGCCGAAGCCCCTGCAATTAAAATTGTTGACCACCTACATACCAAAGACGAGCGCCTTTACGCTTGCCTTTAGTAC
 +
 CCCC@CCFFGFGEGGGGGFGGGGGGGGFGGGGGGEFGGGGGGGGGCGGGGGGGGCFFG@GFFGGGGGCCGCGFGGGGGGGGGGGFFBEGG:CFF9>CGEG

My script, however, outputs an extra record that contains the string further along its second line rather than at the start, and it prints a blank line after each record:
Code:
 @M04961:22:000000000-B5VGJ:1:1101:9280:7106 1:N:0:86
 GGCATGAAAACATACAAACCGTCTTTCCAGAAATTGTTCCAAGTATCGGCAACAGCTTTATCAATACCATGAAAAATATCAACCACACCAGAAGCAGCAT
 +
 GGGGGGGGGGGGGGGGGCCGGGGGF,EDFFGEDFG,@DGGCGGEGGG7DCGGGF68CGFFFGGGG@CGDGFFDFEFEFF:30CGAFFDFEFF8CAF;;8F
  
 @M04961:22:000000000-B5VGJ:1:1101:14258:7136 1:N:0:86
 GGCATGAAAACATACAACAGCGGCTTTAACCGGACGCTCGACGCCATTAATAATGTTTTCCGTAAATTCAGCGCCTTCCATGATGAGACAGGCCGTTTGA
 +
 CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDEGGEGEGEGGG
  
 @M04961:22:000000000-B5VGJ:1:1101:15671:7305 1:N:0:86
 GGCATGAAAACATACAAAGTAAGGGGCCGAAGCCCCTGCAATTAAAATTGTTGACCACCTACATACCAAAGACGAGCGCCTTTACGCTTGCCTTTAGTAC
 +
 CCCC@CCFFGFGEGGGGGFGGGGGGGGFGGGGGGEFGGGGGGGGGCGGGGGGGGCFFG@GFFGGGGGCCGCGFGGGGGGGGGGGFFBEGG:CFF9>CGEG
  
 @M04961:22:000000000-B5VGJ:1:1101:26069:7790 1:N:0:86
 CAGAACGTGAAAAAGCGTCCTGCGTGTAGCGAACTGCGATGGGCATACTGTAACCATAAGGCCACGTATTTTGCAAGCTGGCATGAAAACATACAT
 +
 CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

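I tried adding ^ to the string (/^GGCATGAAAACATACA/), but that does not work, because with RS="@M" the ^ anchors at the start of the record (the header line), not at the start of the second line.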
Any help will be greatly appreciated!
PS. Ideally I would like to use | sed '/^$/d' to eliminate the blank lines if at all possible

Last edited by Xterra; 04-11-2017 at 04:06 PM..
# 2  
Old 04-11-2017
Hello Xterra,

The RS setting is not the reason you are not getting the expected output; the search simply finds the string GGCATGAAAACATACA in 4 records. Could you please describe your sample Input_file and the expected output more clearly? It is not clear to me.

Thanks,
R. Singh
# 3  
Old 04-11-2017
Instead of the special RS, you can perhaps store the header line in a variable, like this:
Code:
awk -v search="GGCATGAAAACATACA" '/^@M/ {s=$0; next} $0~search {print s; c=2} (c && c--)' infile

And you can restrict the search to the beginning of a line with search="^GGCATGAAAACATACA".
# 4  
Old 04-11-2017
MadeInGermany

Thanks a TON! Exactly what I was looking for. I just added ^ to make sure only the lines with the string at the beginning of the line are selected, and changed c=2 to c=3:
Code:
 awk -v search="^GGCATGAAAACATACA" '/^@M/ {s=$0; next} $0~search {print s; c=3} (c && c--)' infile

Could you please explain to me the last part of your code, (c=2} (c && c--))?
Thanks!
# 5  
Old 04-11-2017
When the search hits, it prints the stored line and sets c=2.
(c && c--) is a pattern without an action, so awk prints $0 by default whenever it evaluates to non-zero.
Because of the && (logical AND), c-- is only evaluated when c is non-zero; c-- returns the current, non-zero value and then decrements c, so $0 is printed.
On the next line c is smaller but still non-zero, so again c is decremented and $0 is printed.
On the line after that c is zero, so the c-- is skipped and nothing is printed.
In effect, after c=X the (c && c--) prints X lines.
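To see the counter on its own, here is a minimal sketch. The six-line numeric input and the trigger value 2 are made up for illustration and are not from the thread; the counter is armed with c=3 on the matching line and then prints that line plus the two that follow.

```shell
# Hypothetical demo input: the numbers 1 through 6, one per line.
out=$(printf '%s\n' 1 2 3 4 5 6 |
  awk '$0 == 2 { c = 3 }   # on the matching line, arm the counter
       c && c--            # true while c > 0; each true test uses up one count
  ')
echo "$out"
```

This should print 2, 3 and 4: three lines, starting with the line that set c.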
BTW, if you want to always print the whole current block, regardless of how many lines follow the header, you can simplify the script to:
Code:
awk -v search="^GGCATGAAAACATACA" '/^@M/ {s=$0; p=0; next} $0~search {print s; p=1} (p)'
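A small sketch of how the flag version behaves on blocks of different lengths. The input, the short pattern GGCAT and the block contents are invented for the demo; only the awk logic follows the command above.

```shell
# Hypothetical input: headers start with "@M"; the blocks have 4, 2 and 2 body lines.
out=$(printf '%s\n' '@M1' 'GGCAT' 'x' 'y' 'z' '@M2' 'TTGGCAT' 'q' '@M3' 'GGCAT' 'w' |
  awk -v search='^GGCAT' '
    /^@M/       { s = $0; p = 0; next }   # new block: remember header, stop printing
    $0 ~ search { print s; p = 1 }        # hit: print stored header, start printing
    p                                     # print every line while the flag is set
  ')
echo "$out"
```

The first and third blocks are printed in full, and the second is skipped because its match is not at the start of the line. One caveat, as far as I can tell: if a later line in the same block also matched the search, the stored header would be printed a second time.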


Last edited by MadeInGermany; 04-11-2017 at 05:14 PM..
# 6  
Old 04-13-2017
Why not
Code:
awk -vRS="@M" '/\n GGCATGAAAACATACA/ { print "@M"$0 }' file
@M04961:22:000000000-B5VGJ:1:1101:9280:7106 1:N:0:86
 GGCATGAAAACATACAAACCGTCTTTCCAGAAATTGTTCCAAGTATCGGCAACAGCTTTATCAATACCATGAAAAATATCAACCACACCAGAAGCAGCAT
 +
 GGGGGGGGGGGGGGGGGCCGGGGGF,EDFFGEDFG,@DGGCGGEGGG7DCGGGF68CGFFFGGGG@CGDGFFDFEFEFF:30CGAFFDFEFF8CAF;;8F
 
@M04961:22:000000000-B5VGJ:1:1101:14258:7136 1:N:0:86
 GGCATGAAAACATACAACAGCGGCTTTAACCGGACGCTCGACGCCATTAATAATGTTTTCCGTAAATTCAGCGCCTTCCATGATGAGACAGGCCGTTTGA
 +
 CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDEGGEGEGEGGG
 
@M04961:22:000000000-B5VGJ:1:1101:15671:7305 1:N:0:86
 GGCATGAAAACATACAAAGTAAGGGGCCGAAGCCCCTGCAATTAAAATTGTTGACCACCTACATACCAAAGACGAGCGCCTTTACGCTTGCCTTTAGTAC
 +
 CCCC@CCFFGFGEGGGGGFGGGGGGGGFGGGGGGEFGGGGGGGGGCGGGGGGGGCFFG@GFFGGGGGCCGCGFGGGGGGGGGGGFFBEGG:CFF9>CGEG
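The blank lines in that output appear because each record already ends in a newline and print appends another one. A possible variant, sketched below, uses printf instead and checks only the second line of each record. It assumes an awk that treats a multi-character RS as a regular expression (gawk and mawk do; POSIX leaves this unspecified), and the two-record input, the short string GGCAT and the variable name s are made up for the demo.

```shell
# Hypothetical two-record sample; only the first has the string at the start of line 2.
out=$(printf '%s\n' '@M1 a' 'GGCATTT' '+' 'IIIIIII' '@M2 b' 'TTGGCAT' '+' 'IIIIIII' |
  awk -v RS='@M' -v s='GGCAT' '
    {
      split($0, line, "\n")          # line[1] = rest of header, line[2] = sequence
      if (index(line[2], s) == 1)    # s is a literal prefix of the sequence line
        printf "@M%s", $0            # the record already ends in a newline
    }')
echo "$out"
```

Only the first record should come out, with no blank line after it. Note that the quality lines in the sample already contain @ characters; if a quality string ever contained @M, this RS would split the record in the wrong place. That is a guess about possible data, not something seen in the sample, but it is a reason to prefer the line-based scripts above.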
