Finding specific series of strings or characters Post: 302562643

UNIX for Dummies Questions & Answers: Post 302562643 by Corona688 on Friday 7th of October 2011 02:15:53 PM

If awk's default handling of ORS doesn't do what you want, you'll have to print the >'s yourself:

Code:

$ cat data
>Sequence1
AGACAGATGACAGTAGACAGAT-GACGATAGCAGT
>Sequence2
AGACAGATGACAGTAGACAGATAGACGATAGCAGT
>Sequence3
AGACAGATGACAGTAGACAGATCGACGATAGCAGT
>Sequence4
AGACAGATGA-AGTAGACAGATTGACGATAGCAGT
>Sequence5
AGAC*GATGA
$ awk 'BEGIN { RS=">"; FS="\n"; ORS="" } /-/ { print ">" $0; }' data
>Sequence1
AGACAGATGACAGTAGACAGAT-GACGATAGCAGT
>Sequence4
AGACAGATGA-AGTAGACAGATTGACGATAGCAGT
$

[edit] adding a reply that explains in more detail.

---------- Post updated at 12:15 PM ---------- Previous update was at 12:05 PM ----------

You know how the FS and OFS variables control how awk splits input into fields, and what awk prints between fields on output?

RS and ORS are the exact same thing, but for records, which by default are lines. So when we do RS=">"; FS="\n" we're telling awk "each time you see >, that starts a new record", and "each time you see \n, that's a new field".
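To see that splitting in action, here is a small sketch. The two-record sample input and the `records` variable are made up for illustration; only the RS and FS settings come from the command above.

```shell
# Two FASTA-style records, the same shape as the data file above.
records=$(printf '>Sequence1\nAGAC-GAT\n>Sequence2\nAGACAGAT\n' |
    awk 'BEGIN { RS=">"; FS="\n" }
         # The text before the first > is an empty record, so skip it.
         $1 != "" { print "name=" $1 ", sequence=" $2 }')
printf '%s\n' "$records"
```

Each record is everything between two >'s, so $1 is the header line and $2 is the first line of sequence.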

When you have a statement like

Code:

EXPRESSION { code }

the { code } part is only executed when EXPRESSION is true. If you drop an unadorned /regex/ into there, it assumes you want $0 ~ /regex/. BEGIN and END are just special patterns: BEGIN matches once before any input is read, and END matches once after all records have been processed.
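A quick way to convince yourself of this is to count matches both ways. This is only a sketch: the three input lines and the names `bare`, `explicit` and `matches` are invented for the example.

```shell
matches=$(printf 'AGA-CAG\nAGACAG\nTT-GA\n' |
    awk 'BEGIN      { print "start" }      # runs once, before any input is read
         /-/        { bare++ }             # bare regex, tested against $0
         $0 ~ /-/   { explicit++ }         # the same test written out in full
         END        { print bare, explicit }')
printf '%s\n' "$matches"
```

Both counters end up with the same value, since the two patterns are the same test.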

My first try puts extra >'s on the end because the output record separator (ORS) gets printed at the end of each record, not the beginning -- the same place you'd expect a newline. So it ends up kind of off by one.
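The earlier attempt isn't quoted in this post, so the command below is only a guess at its shape, assuming it set ORS=">" and let print add the separator. The sample input and the `first_try` variable are made up for the demonstration.

```shell
# Assumed reconstruction of the first attempt, not the original command.
first_try=$(printf '>Sequence1\nAG-CT\n>Sequence2\nAGCT\n' |
    awk 'BEGIN { RS=">"; ORS=">" } /-/ { print }')
printf '%s\n' "$first_try"
```

The > lands after the matching record instead of in front of its header, which is the off-by-one effect.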

My improved version here just prepends a > to each matching record and prints it, so the >'s end up in the correct place.

So:

Code:

BEGIN {
        # Our 'newline' (record separator) will be >
        RS=">";
        # Input fields separated on real newlines
        FS="\n";
        # Each record already ends in its own newline, so don't
        # let print add another one
        ORS="";
}

# This code block gets executed only when $0 ~ /-/
# i.e. there's a - somewhere in the entire mess of input for this 'line'.
# If you wanted to just check the second field, you could do
# $2 ~ /-/ { ... }
/-/ {
        # Print a >, followed by the whole record.  Since we haven't
        # modified $1/$2/..., $0 will still contain UNMODIFIED data,
        # complete with newlines -- otherwise we might need OFS="\n"
        # to print newlines instead of spaces between fields.
        print ">" $0;
}
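The difference between matching the whole record and matching only the second field shows up when a header itself contains a dash. A minimal sketch, assuming a made-up two-record sample (the `sample` helper and the `whole` and `second` variables are hypothetical names):

```shell
# Sample input: the first header contains a dash, the second sequence does.
sample() { printf '>Seq-A\nAGACAGAT\n>SeqB\nAGAC-GAT\n'; }

# Match a dash anywhere in the record, header included.
whole=$(sample | awk 'BEGIN { RS=">"; FS="\n"; ORS="" } /-/ { print ">" $0 }')

# Match a dash only in the second field, the first sequence line.
second=$(sample | awk 'BEGIN { RS=">"; FS="\n"; ORS="" } $2 ~ /-/ { print ">" $0 }')

printf 'whole record:\n%s\n\nsecond field only:\n%s\n' "$whole" "$second"
```

Note that with FS="\n" every line is its own field, so if a sequence is wrapped over several lines, $2 only covers the first of them.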

Last edited by Corona688; 10-07-2011 at 03:26 PM..
