String pattern matching and position


 
# 1  
Old 08-30-2014

I am not an expert with Linux, but following various posts on this forum, I have been trying to write a script to match a pattern of characters occurring together in a file.
My file has approximately 200 million characters (upper and lower case), with about 50 characters per line. I have merged all the lines together to make it one line using

Code:
tr -d '\n' < input.txt > oneLineInput.txt

I now have all the characters in my file on the same line without spaces.

I am trying to count the number of times the specific characters occur together. For example, in the file below

Code:
IamTryingtobuildascriptfortrestingthetyposinmysentence

I am trying to look for the pattern 'tr' that occurs in the sentence. The script I have now is

Code:
grep -o -i oneLineInput.txt -e tr | sort | uniq -c

The above script works perfectly fine for a small file, but when I try to run it on my actual file with more than 200 million characters, it takes ages to finish the task (I lost patience and did not check the total time taken).

Is there a way I can optimize the code?

Next, I have been trying to get the position of each match. For example, in the example file above, 'tr' starts at the 4th and 27th positions. I just want the numbers as output.

Is it possible?

Thank you.
# 2  
Old 08-30-2014
By definition, grep, sort, and uniq work on text files, and the input you're feeding to grep is not a line. (A line ends with a newline character and, including the newline character, contains no more than LINE_MAX bytes. On most systems, LINE_MAX is the minimum allowed by the standards, 2048.)

The standards also require operands to follow options on the command line. So, what you are doing is not portable and will not work at all on many systems.

On Linux systems, where the command you showed might work, it will take a lot longer than processing a normal text file because you require the entire (200 MB) file to be read into the address space of grep at once.

If the command line you showed works on your system, you may be able to get the file offsets (0-based rather than 1-based) of each match (rather than the number of occurrences of TR, Tr, tR, and tr) by using the command line:
Code:
grep -bio tr oneLineInput.txt
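
To turn those 0-based offsets into the 1-based positions asked for in post #1, the number in front of each colon can be incremented. This is only a minimal sketch: it assumes a grep that supports the non-standard -b and -o options (GNU grep does) and single-byte characters, so that byte offsets equal character positions. The temporary sample file is made up for illustration and holds the example sentence from post #1.

```shell
# Hypothetical sample file holding the example sentence from post #1.
sample=$(mktemp)
printf '%s\n' 'IamTryingtobuildascriptfortrestingthetyposinmysentence' > "$sample"

# grep -b -o prints "offset:match" for every match; keep the offset and add 1.
positions=$(grep -b -i -o -e tr "$sample" | awk -F: '{ print $1 + 1 }')
printf '%s\n' "$positions"

# Counting the lines of the same output gives the number of matches.
count=$(printf '%s\n' "$positions" | wc -l | tr -d ' ')
printf 'matches: %s\n' "$count"

rm -f "$sample"
```

On the sample sentence this should report positions 4 and 27, the two positions mentioned in post #1. Note that the options come before the file operand here, as the standards require.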

This User Gave Thanks to Don Cragun For This Post:
# 3  
Old 08-30-2014
In the original files with about 50 characters per line, could patterns be spread over two consecutive lines?
This User Gave Thanks to Scrutinizer For This Post:
# 4  
Old 08-30-2014
@ Scrutinizer: The patterns in the original file are indeed spread over two consecutive lines. That is the reason I merged the two.
I did manage to get an answer for the problem from Jotne and Tom Fenech at stackoverflow.

To count the number of occurrences:

Code:
awk -F"[Tt][Rr]" '{print NF-1}' oneLineInput.txt

To get the position:

Code:
awk -F"[Tt][Rr]" 'BEGIN {print "hit\tposition"} {for (i=1;i<NF;i++) {p+=length($i);print ++a"\t"p+1+(a-1)*2}}' oneLineInput.txt

Another approach (the script below is saved as matches.awk):

Code:
{ 
    while (match($0, /[Tt][Rr]/)) {
        ++n
        m += RSTART
        $0 = substr($0, RSTART + RLENGTH)
        printf "match %d: position %d\n", n, m + n - 1
    }
}

Code:
awk -f matches.awk file
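
The position arithmetic in the script above relies on the pattern being exactly two characters long. A possible generalisation, as a sketch only, uses RLENGTH so that a pattern of any length can be passed in. The helper name match_positions and the sample file are made up for illustration, and like the scripts above it assumes an awk that copes with very long lines.

```shell
# match_positions REGEX FILE (hypothetical helper name)
# Prints the 1-based start of every non-overlapping match on each line.
match_positions() {
    awk -v re="$1" '
    {
        s = $0      # work on a copy of the line
        base = 0    # characters of this line already consumed
        # RLENGTH > 0 guards against looping forever on an empty match
        while (match(s, re) && RLENGTH > 0) {
            printf "match %d: position %d\n", ++n, base + RSTART
            base += RSTART + RLENGTH - 1
            s = substr(s, RSTART + RLENGTH)
        }
    }' "$2"
}

sample=$(mktemp)
printf '%s\n' 'IamTryingtobuildascriptfortrestingthetyposinmysentence' > "$sample"
match_positions '[Tt][Rr]' "$sample"
match_positions '[Ss][Cc][Rr][Ii][Pp][Tt]' "$sample"
rm -f "$sample"
```

Positions are counted from the start of each line, so this only gives file-wide positions for the merged one-line file. Because the pattern is passed with -v, backslashes in it are subject to awk's escape processing; bracket expressions as used here are not affected.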

Thank you for trying to help me.

@ Don Cragun: perfect explanation for why the script I tried did not work.

I am amazed by what scripting can do.
# 5  
Old 08-30-2014
The awk utility is also only defined to work when the input files it reads are text files. So, although some versions of awk can handle long lines, incomplete lines, or both, many cannot. If you would like something that should work on any UNIX or Linux system, you could try something like this:
Code:
awk '
function p(spot) {
	printf("%10d %10d\n", ++cnt, spot)
}
te && /^[Rr]/ {
	p(te)
}
{	while(match($0, /[Tt][Rr]/)) {
		p(off + RSTART)
		$0 = substr($0, 1, RSTART - 1) " " substr($0, RSTART + 1)
	}
}
{	off += length($0)
	if($0 ~ /[Tt]$/) {
		te = off
	} else	te = 0
}' input.txt

Note that this works on your input file before stripping out the <newline> characters, so instead of having to allocate 200 MB of memory to read in your one-line file, it just needs to read one ~50 character line at a time.

With the following randomly generated list of upper- and lower-case letters (except for the 1st 8 and last 8 characters in the file):
Code:
TrTRtrtRmzGArXRqWdKOmxzDWLKZVnPRRrAVNcpAflTxvLkLbs
NbZdBuopHQnEqVJiLWYHVZUfHLqUTmRPesoqVbVdgXXglCCEQC
ZRfvLdXyfgpufseFnIIboRbtDXtlttNQudyeOGyLvLGzSOPyMo
VpxGVwNJKXpYUlhZuNgIcgYuscJRzmExrJZWeeRgnHXwxkxbKh
mndPLikztEWtlovWaOddGCSEijRtrkgWWzvQADIQhsfVEAwmXQ
eIImjmJnvLTQLubbchEwLclnjVmUKuIRxmUOSmarnWYyEBKQpX
gEpdrIXIXiUsiMjQQWWIYWYCfSBwMsPQwvLHyGRwKldfvOxzar
xgwKodWiJxgAhVhlCfalWRpijwiHRlYntBOxweZrvwPPLTYpmN
REPdLIcZnBLWORUkpLCBtlTzjOmQBDVuFEAYfzLTIbyZaNVUMt
rfDzbKDxzXoCqnpWntyTrkyIrSrZTopjapZFouHDGxmlZmxswW
AcvPaJKxLSXZLCLfRZVuxusjYcKzlpZajBMvweadarCAIGjPdM
yiFAqrMDySoxpPREnFPHDQaFJDVUDsYXmbZGkhbvImOkCKfAsg
kauwlSzzrbqrBrXCLJsHXlHAdoRBjXjQbUoFJslyENNKnjIADT
RMEZvOSLWqHeeEoIUddxBxdHuuEMqTpYVTIoGUNVPxKPcSadji
ecsIoISmpwIPIqCXYdwqsvbtTKuoQflREDkZPLxtlyfOVeuKxj
LkwARhocaWFEMjZlPHtuCiYmxfqtYSGwlRSLZHzYGDZoHzvJbm
GsXLsRcuvLEQcXPRakbdeHGLrrnZgwyMFHmXNMmNNbEnfkXumM
pUSpOhpTakWOpQNohhjcuObfSfteNBMyJivKQKhPJQtrtRTrTR

it produces the output:
Code:
         1          1
         2          3
         3          5
         4          7
         5        228
         6        450
         7        470
         8        650
         9        893
        10        895
        11        897
        12        899

giving you the number of matches found and their positions in the file (not counting <newline> characters).
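
A reported position can be spot-checked by pulling the characters at that position out of the file with the newlines removed. In the sample above, for instance, match 6 at position 450 is the "t" that ends the ninth line followed by the "r" that starts the tenth. The sketch below assumes single-byte characters and a head that accepts -c (common, but not required by the standards); the helper name check_pos and the tiny two-line input are made up for illustration.

```shell
# check_pos FILE POS LEN (hypothetical helper): print LEN characters starting
# at 1-based position POS, counted without <newline> characters.
check_pos() {
    tr -d '\n' < "$1" | tail -c +"$2" | head -c "$3"
    echo
}

# Tiny made-up input: "tr" spans the line break at positions 5 and 6.
sample=$(mktemp)
printf '%s\n' 'abcMt' 'rxyz' > "$sample"
check_pos "$sample" 5 2
rm -f "$sample"
```

The pipeline works on bytes rather than lines, so it does not depend on any utility accepting a 200 MB line.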
This User Gave Thanks to Don Cragun For This Post:
# 6  
Old 09-01-2014
Indeed, it is best to keep the original file as it is. Awk can easily be adjusted to work with the original file. For example, an adjustment of Jotne's suggestion:

Code:
awk -F"[Tt][Rr]" '{gsub(/\n/,x); for (i=1;i<NF;i++) {p+=length($i); print ++a, p+1+(a-1)*2}}' RS=± file

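This may work with gawk and perhaps mawk, since they have very generous line length limits. (RS is set to a character that is assumed not to occur in the file, so the whole file is read as a single record.)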

Also a perl solution like:
Code:
perl -0077 -ne 's/\n//g; print (++$c," ",(pos() +1 -2)."\n") while /tr/gi' file

Perl may be even less likely than awk to run into line length limitations, but just like the awk approach it will read the entire file into memory, which with 200M characters is at least a 200 MB footprint...

I came up with a similar approach to Don's, but it uses index() rather than match() and it works for patterns of any length:

Code:
awk -v pattern="tr" '
BEGIN {
  pat_width=length(pattern)
}

{
  curline=tolower($0)
  chunk=rest curline
  while (pos=index(chunk,pattern)) {
    relpos+=pos
    print ++count, basepos + relpos
    chunk=substr(chunk, pos+pat_width)
    relpos+=pat_width - 1
  } 
  relpos=1-pat_width
  rest=substr(curline, length(curline) - pat_width + 2)
  basepos+=length(curline)
}

' file

Also, all the approaches so far look for the next match only AFTER the end of the previous match, so overlapping matches are not reported.

This next approach will also find matches that overlap a previous match:

Code:
awk -v pattern="trt" '
BEGIN {
  pat_width=length(pattern)
}

{
  curline=tolower($0)
  chunk=rest curline
  while (pos=index(chunk,pattern)) {
    relpos+=pos
    print ++count, basepos + relpos
    chunk=substr(chunk, pos+1)
  } 
  rest=substr(curline, length(curline) - pat_width + 2)
  basepos+=length(curline)
  relpos=-pat_width+1
}

' file

If we take the last part of Don's example, trtRTrTR, then when trying to match "trt" it will find 3 matches, while the others find only two.

Output:
Code:
1 1
2 3
3 5
4 893
5 895
6 897

Whereas the previous one (using the pattern "trt") will find:
Code:
1 1
2 5
3 893
4 897
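
The difference between the two ways of searching comes down to how far the search advances after each hit: past the whole match, or by a single character. The following is only a single-line sketch of that idea, not either of the scripts above; the helper name find_all and its overlap flag are made up for illustration.

```shell
# find_all PATTERN OVERLAP TEXT (hypothetical helper)
# OVERLAP=0: continue searching after the end of each match.
# OVERLAP=1: continue one character after the start of each match.
find_all() {
    printf '%s\n' "$3" | awk -v pattern="$1" -v overlap="$2" '
    {
        s = tolower($0); base = 0; w = length(pattern)
        while (pos = index(s, pattern)) {
            print base + pos
            step = overlap ? pos : pos + w - 1   # characters to skip
            base += step
            s = substr(s, step + 1)
        }
    }'
}

find_all trt 0 trtRTrTR   # non-overlapping search
find_all trt 1 trtRTrTR   # overlapping search
```

On the eight characters trtRTrTR the non-overlapping call reports positions 1 and 5 and the overlapping call reports 1, 3 and 5, which corresponds to the two and three matches counted above.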

 