AWK: Substring search


 
# 8  
Old 08-17-2011
@polsum, let me make a suggestion, if I may. Since you are learning something new, I suggest that you learn Perl instead. Perl can do pretty much everything awk can (IMHO) and a lot more. It is not difficult to learn, either.

Try O'Reilly's Learning Perl book, to get you started. There are also a lot of tutorials on the web. I am sure this is a better skill to have, just because you can do so much more with Perl.

Good luck!!

- GP
# 9  
Old 08-19-2011
Thanks for your advice. I think you are right. If I want to learn, why not learn the best?
# 10  
Old 08-19-2011
Quote:
Originally Posted by g.pi
@polsum, let me make a suggestion, if I may. Since you are learning something new, I suggest that you learn perl instead. Perl can pretty much do what awk can (IMHO) and lots more. It is not difficult to learn, either.
Certainly you can do anything in Perl that you could in awk. This is a feature of any Turing-complete language.

But they're not the same at all. Perl is an extremely complex language with a bizarre combination of complex types and weak typing, which leaves you always blindly groping to figure out whether you've been given an array, list, string, scalar, hash, a reference to any of the above, an array containing references, etc. awk has only scalars (strings that double as numbers) and associative arrays, so there is much less confusion.

Perl also has the problem that perl doesn't play too well as a part of something else. It can do anything, sure, but the usual means to do this is to install a perl module to connect with libraries and such. Find a neat and tidy perl script to do something and you may be surprised just how much extra software, perl and other, you need to install to make this "small" script actually work. Python has the same problem. awk solves one class of problems and solves them well without trying to be everything and giving you library-shock.

Having learned Perl instead of awk, I wish I'd learned awk first -- much simpler syntax to solve some of the same problems and a better introduction to a lot of concepts that puzzled me in Perl.

In the end, which is best really depends on what you want to use it for. So learn both.

Last edited by Corona688; 08-19-2011 at 02:12 PM..
# 11  
Old 08-19-2011
Damn! Now you've confused me. I agree that AWK is simple. I once tried Perl too, and I agree with you that it's a bit more complex. But for some reason Perl is more popular. People think AWK is obsolete, but I love it. Maybe you are right, I should learn both.
# 12  
Old 08-19-2011
Quote:
Originally Posted by polsum
Damn! Now you've confused me. I agree that AWK is simple. I once tried Perl too, and I agree with you that it's a bit more complex. But for some reason Perl is more popular.
awk is specialized. It reads, processes, and writes delimited text but isn't good for much else. Its data structures are flexible but limited. Imagine trying to write a huge database application in it! There's nothing like structure.member. But as a connector between other things, or a small language for processing data, it's nice. I certainly don't think it's obsolete; it's just not the sort of thing you notice being used, since it isn't used for big applications.
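To illustrate the "nothing like structure.member" point: the usual workaround in awk is a multi-part subscript on an associative array. The sketch below is only an illustration; the function name `print_people`, the array names, and the "id name age" input layout are made up for this example.

```shell
# print_people: read "id name age" lines on stdin and print them back
# from a record-like associative array (hypothetical example data).
print_people() {
    awk '
        {
            ids[NR] = $1               # keep the ids in input order
            person[$1, "name"] = $2    # (id, field) subscript stands in for person.name
            person[$1, "age"]  = $3    # the two parts are joined internally with SUBSEP
        }
        END {
            for (k = 1; k <= NR; k++) {
                id = ids[k]
                printf "%s: %s is %d\n", id, person[id, "name"], person[id, "age"]
            }
        }
    '
}

# Demo: prints "7: ann is 31" and "9: bob is 28"
printf '%s\n' '7 ann 31' '9 bob 28' | print_people
```

This works for small jobs, but every "field" is just another key in one flat array, which is the limitation described above.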

Perl's data structures, though eye-twisting and hard to use, are complex enough for the needs of a large database application. And you can do just about anything with it by heaping on enough modules and using enough memory. It's not exactly what I'd call elegant, but it works.

Last edited by Corona688; 08-19-2011 at 02:40 PM..
# 13  
Old 08-19-2011
Quote:
Originally Posted by polsum
I want to know how many times the string in 2nd column appears in the first column as substring.
Neither the perl suggestions nor the ksh/egrep script handles this properly. They all take the second column and treat it as a regular expression. Should the text in that column contain any regexp metacharacters, they yield erroneous results. And if the text in the second column forms an invalid regular expression, they will error out.

Place a lone asterisk in the second column and you may see something similar to:
Code:
Quantifier follows nothing in regex; marked by <-- HERE in m/* <-- HERE / at substring.pl line 26, <> line 6.

Worse, a valid regular expression with metacharacters (.*) will silently return incorrect results. Perhaps your real data consists of nothing but alphanumerics, in which case the code provided should be adequate, but that wasn't made clear and your problem statement asks for fixed string matching.
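The difference between regexp matching and fixed-string matching can be seen in a small awk sketch. The function names `regex_match` and `fixed_match` are invented for this illustration; the needle is passed through the environment because `awk -v` would process backslash escapes in the value.

```shell
# Both functions take the needle as $1 and read candidate strings on stdin,
# printing the lines that "contain" the needle.
regex_match() {
    # needle is interpreted as a regular expression
    needle="$1" awk '$0 ~ ENVIRON["needle"]'
}

fixed_match() {
    # needle is compared character for character; index() returns 0 if absent
    needle="$1" awk 'index($0, ENVIRON["needle"])'
}

# Demo: the regexp version also matches "abc", because "." matches any character.
printf '%s\n' 'a.c' 'abc' 'ac' | regex_match 'a.c'   # prints a.c and abc
printf '%s\n' 'a.c' 'abc' 'ac' | fixed_match 'a.c'   # prints a.c only
```

With purely alphabetic data the two behave the same, which is why the regexp-based answers appear to work on the sample.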

Quote:
Originally Posted by polsum
Hi

I have a table like this
Code:
aaacgt cgt
cggaat acg
acgt
cgtgha
jhaja

I want to know how many times the string in 2nd column appears in the first column as substring.

For example the first string of 2nd column "cgt" occurs 3 times in the 1st column and "acg" one time.

So my desired output is
Code:
cgt 3
acg 1

Thank you very much in advance.

Quote:
Originally Posted by polsum
Thanks a lot for your replies. My file is not that big...it has 30000 rows. and It always has 2 columns.
The only sample data you've provided contradicts that statement; some of the lines have only one column.

Also, should the string appearing in the second column be searched for in column 1 of all rows or only rows that follow it? If only those that follow, does that include the current row? You should clarify, because you state above that "acg" only occurs once in column 1, and your desired output shows a 1 for "acg", but "acg" actually occurs twice in the first column.

All suggestions so far scan the entire column and in fact will return 2 where you say it should be 1.

Also, should the string occurring more than once in column 1 of a single row be counted as one instance or multiple instances? One of the perl solutions and the ksh/egrep script will count "acgacg" as one instance of "acg", while the remaining perl solution would count it as two instances.
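The two counting policies can be compared with a short awk sketch. The function names are made up for this illustration, and `count_occurrences` assumes that occurrences are not allowed to overlap; both use `index()`, so the needle is treated as a fixed string.

```shell
# count_rows: number of input lines that contain the needle at least once.
count_rows() {
    needle="$1" awk 'index($0, ENVIRON["needle"]) { n++ } END { print n + 0 }'
}

# count_occurrences: total number of non-overlapping occurrences of the needle.
count_occurrences() {
    needle="$1" awk '
        BEGIN { pat = ENVIRON["needle"]; len = length(pat) }
        len > 0 {                               # skip everything for an empty needle
            s = $0
            while ((i = index(s, pat)) > 0) {   # find the next literal occurrence
                n++
                s = substr(s, i + len)          # resume the search just past this match
            }
        }
        END { print n + 0 }
    '
}

# Demo: "acgacg" counts once per row but twice per occurrence.
printf '%s\n' 'acgacg' 'xacg' 'ttt' | count_rows acg          # prints 2
printf '%s\n' 'acgacg' 'xacg' 'ttt' | count_occurrences acg   # prints 3
```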

Regards,
Alister

Last edited by alister; 08-19-2011 at 03:56 PM..
# 14  
Old 08-19-2011
@alister - thanks for the detailed reply. Actually, it was a typo on my side in the first post: "acg" should count twice. My original file has only letters in both columns and no other characters or metacharacters, so the posted code seems to work fine.

Yes, the string in the 2nd column should be searched for in all rows of the 1st column.

By 30000 lines I meant the maximum number of lines in any column. Yes, the 2nd column has far fewer rows/lines than the first.
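Given that clarification (search every row of column 1, and "acg" counts twice), a pure awk approach might look like the sketch below. It is one possible reading of the requirements, not a definitive answer: it counts rows rather than occurrences, uses `index()` for fixed-string matching, and the function name `count_substrings` is made up here.

```shell
# count_substrings: read a one- or two-column table on stdin; for every
# string in column 2, print how many column-1 values contain it.
count_substrings() {
    awk '
        NF >= 1 { col1[++nrows] = $1 }    # remember every column-1 value
        NF >= 2 { pat[++npats] = $2 }     # remember column-2 strings in input order
        END {
            for (p = 1; p <= npats; p++) {
                hits = 0
                for (r = 1; r <= nrows; r++)
                    if (index(col1[r], pat[p]) > 0)   # fixed-string test, no regexp
                        hits++
                print pat[p], hits
            }
        }
    '
}

# Demo with the sample table from the first post: prints "cgt 3" then "acg 2".
printf '%s\n' 'aaacgt cgt' 'cggaat acg' 'acgt' 'cgtgha' 'jhaja' | count_substrings
```

The whole table is held in memory and every pattern is compared against every row, which should be unproblematic at 30000 rows. A string that appears more than once in column 2 is reported once per appearance.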