search a regular expression and match in two (or more files) using bash


 
# 1  
Old 07-20-2011

Dear all,

I have a specific problem that I don't quite understand how to solve. I have two files, both of the same format:

XXXXXX_FIND1 bla bla bla
bla
bla
bla
bla
bla
bla
bla
bla
bla
========
(return)
XXXXXX_FIND2 bla bla bla
bla
bla
bla
bla
bla
bla
bla
bla
bla
========
(return)
etc...

The problem is that the entries appear in a different, random order in each file: file 1 contains XXXXXX_FIND1, XXXXXX_FIND3, XXXXXX_FINDX and so on in one order, and file 2 contains them in another.
What I want to do is create a new file in which matching entries from the two files are placed next to each other, like:

XXXXXX_FIND1 bla bla bla
bla
bla
bla
bla
bla
bla
bla
bla
bla
========
(return)
XXXXXX_FIND1 bla bla bla
bla
bla
bla
bla
bla
bla
bla
bla
bla
========
(return)
XXXXXX_FIND2 bla bla bla
bla
bla
bla
bla
bla
bla
bla
bla
bla
========
(return)
XXXXXX_FIND2 bla bla bla
bla
bla
bla
bla
bla
bla
bla
bla
bla
========

Note that:
1) I don't know the letters/numbers in FIND1, FIND2 etc. in advance, but they match between the files and are always five characters long.
2) There are entries that do not match; those should be left out of the output.
3) "bla" stands for content that differs between the files, and some entries have more lines of "bla" than others!

Is this possible to do with bash or awk?

Thank you in advance!
# 2  
Old 07-20-2011
Try:
Code:
cat file1 file2 | perl -n0e 'while(/.{6}_(FIND\d+).*?========\n/sg){$h{$1}.=$&};print $h{$_} for (sort keys %h)'

# 3  
Old 07-20-2011
Thanks for the quick reply, Bartus! However, I should emphasize that FIND1, FIND2 etc. are only placeholders. The real keys can be, for example, ABC1D or RTGQ1, so a random combination of numerals and letters.

Therefore, the only things the files have in common are the separator between entries and, within each entry, the part after the XXXXXX_, which is always 5 characters long and should match between the entries from each file.

Thanks again for the help!
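To make the key described above concrete, here is a minimal sketch that pulls the five-character part after the underscore out of each header line. The function name `entry_keys` is made up for illustration, and it assumes the identifier is the first whitespace-delimited word on the line and that body lines do not look like `PREFIX_KEY5` themselves.

```shell
# Hypothetical helper: read entries on stdin and print the five-character
# key of every line whose first word looks like PREFIX_KEY5.
entry_keys() {
    awk '
        {
            # Position of the first underscore in the first word (0 if none).
            us = index($1, "_")
            if (us == 0) next
            key = substr($1, us + 1)
            # Assumption: a valid key is exactly five letters or digits.
            if (length(key) == 5 && key ~ /^[A-Za-z0-9]+$/) print key
        }
    '
}
```

Running `entry_keys < file1 | sort` and the same for file2 gives two key lists that can be compared with `comm` to see which entries exist in both files.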
# 4  
Old 07-20-2011
Try this:
Code:
cat file1 file2 | perl -n0e '$h{$1}.=$& while(/X{6}_(\w{5}).*?========\n/sg);print $h{$_} for (sort keys %h)'

# 5  
Old 07-20-2011
It is still not working...
I would like to show two example output files, so that you can have a better idea of the output (see attached archive). I am new to scripting and I think that I didn't describe the problem precisely.

Thanks a lot for seeing through my problem.
# 6  
Old 07-20-2011
OK, so I can see that those "XXXXXX" weren't literal X characters. So how should those records be sorted? Only based on the part after "_"? If that is the case, then try this:
Code:
cat file1 file2 | perl -n0e '$h{$1}.=$& while(/.{6}_(\w+).*?=+\n/sg);print $h{$_} for (sort keys %h)'
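For readers who find the one-liner dense, the same grouping idea can be sketched in awk. This is not a translation of the Perl code, only an approximation under stated assumptions: the function name `group_entries` is hypothetical, an entry is taken to end at a line consisting only of `=` characters, and the key is everything after the first underscore in the first word of the header line.

```shell
# Hypothetical helper: group_entries FILE...
# Prints all entries from all files, grouped by key and sorted by key.
group_entries() {
    awk '
        FNR == 1 { in_entry = 0 }          # a new file always starts a new entry
        !in_entry {
            if ($0 == "") next             # skip the blank line between entries
            us = index($1, "_")
            key = (us > 0) ? substr($1, us + 1) : $1
            in_entry = 1
        }
        { block[key] = block[key] $0 "\n" }   # append the line to its group
        /^=+$/ { in_entry = 0 }               # separator line closes the entry
        END {
            for (k in block) keys[++n_keys] = k
            # Portable insertion sort, since plain awk has no built-in sort.
            for (i = 2; i <= n_keys; i++) {
                v = keys[i]
                for (j = i - 1; j >= 1 && keys[j] > v; j--) keys[j + 1] = keys[j]
                keys[j + 1] = v
            }
            for (i = 1; i <= n_keys; i++) printf "%s", block[keys[i]]
        }
    ' "$@"
}
```

Usage would be `group_entries file1 file2 > grouped.txt`. Like the one-liner, this prints every key, including keys that occur in only one of the files.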

# 7  
Old 07-20-2011
The thing is that if the files are concatenated with cat, I lose the information about where each entry came from (file 1 or file 2).
So, I would like to have matches only between entries from file A and entries from file B.
Furthermore, I can see that the problem has another dimension: the part after the _ is not unique. An additional criterion is therefore needed, namely the string between tabs 7 and 8 on the line where the XXXXX_XXXXX is.
I think this string should be matched first and then, once it matches, the matches should be refined according to the _XXXXX part. Entries that are not matched should not be included in the output.
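One possible reading of these requirements, sketched in awk: treat the string between tabs 7 and 8 (the 8th tab-separated field of the header line) together with the part after the underscore as a combined key, keep the two files apart, and print an entry only when the same combined key occurs in both files. The function name `match_entries` is hypothetical, and the sketch assumes that header lines are tab-separated, that entries end at a line of `=` characters, and that an entry from file A should be followed directly by its counterpart from file B.

```shell
# Hypothetical helper: match_entries FILE_A FILE_B
# Prints each entry of FILE_A that has a counterpart in FILE_B, followed by
# that counterpart. Unmatched entries from either file are dropped.
match_entries() {
    awk '
        FNR == 1 { in_entry = 0; file_no++ }
        !in_entry {
            if ($0 == "") next                       # blank line between entries
            split($0, word, /[ \t]+/)                # word[1] is PREFIX_KEY
            nt = split($0, tabbed, "\t")             # tabbed[8] lies between tabs 7 and 8
            us = index(word[1], "_")
            suffix = (us > 0) ? substr(word[1], us + 1) : ""
            key = (nt >= 8 ? tabbed[8] : "") SUBSEP suffix
            in_entry = 1
            # Remember the order in which keys first appear in FILE_A.
            if (file_no == 1 && !(key in a)) order[++n_keys] = key
        }
        file_no == 1 { a[key] = a[key] $0 "\n" }
        file_no == 2 { b[key] = b[key] $0 "\n" }
        /^=+$/ { in_entry = 0 }                      # separator closes the entry
        END {
            for (i = 1; i <= n_keys; i++) {
                key = order[i]
                if (key in b) printf "%s%s", a[key], b[key]
            }
        }
    ' "$1" "$2"
}
```

Usage would be `match_entries fileA fileB > matched.txt`. If a combined key occurs more than once within one file, all of its entries from that file are printed together; whether that is the desired behaviour depends on the real data.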