Finding most common substrings
Post 302889669 by Don Cragun on Sunday 23rd of February 2014 12:05:14 AM
With the simple awk program:
Code:
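#!/bin/ksh
awk '
# For every line after the first, count each 6-character substring of field 2.
NR > 1 {
        for(i = length($2) - 5; i >= 1; i--)
                c[substr($2, i, 6)]++
}
# Print each count and substring, highest count first.
END {
        for(i in c)
                print c[i], i | "sort -rn"
}' file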

and with file containing the data you provided in message #1, the output produced is:
Code:
13 byebye
8 yebyeb
8 ebyeby
5 ypehel
5 typehe
5 pehell
5 llobye
5 hellob
5 elloby
5 ehello
5 catcat
4 tcatdo
4 ogcatc
4 gcatca
4 dogcat
4 catdog
4 atcatd
3 yedogc
3 ogbyeb
3 obyedo
3 lobyed
3 gbyeby
3 edogca
3 dogbye
3 byedog
2 tdogby
2 obyeby
2 lobyeb
2 atdogb
1 ttypeh
1 tdogdo
1 tcatty
1 ogdogb
1 gdogby
1 dogdog
1 cattyp
1 attype
1 atdogd
1 atcatt

so "byebye" is indeed the most common six-character substring in the 2nd column of your input. But "catdog" and "typehe" don't even come close. To print just the most common substring (or, if several substrings tie for the highest count, all of them), try:
Code:
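#!/bin/ksh
awk '
# For every line after the first, count each 6-character substring of field 2.
NR > 1 {
        for(i = length($2) - 5; i >= 1; i--)
                c[substr($2, i, 6)]++
}
# Print each count and substring, highest count first.
END {
        for(i in c)
                print c[i], i | "sort -rn"
}' file | awk '
# Remember the highest count, which is on the first line of the sorted list.
NR == 1 {c = $1}
# Print every substring whose count ties that highest count.
$1 == c {print $2}'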

which just prints:
Code:
byebye
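
If you would rather avoid the sort and the second awk, the same selection can be done in one awk by finding the highest count in the END block. This is only a sketch under a few assumptions: the function name most_common and the optional window-length argument are mine (not part of the script above), it reads standard input instead of file, and the two-line sample input is made up for illustration; it is not the data from message #1.

```shell
#!/bin/sh
# Sketch: print the most common n-character substring(s) of field 2,
# ignoring the first input line. n defaults to 6, as in the post.
# most_common is a hypothetical helper name.
most_common() {
    awk -v n="${1:-6}" '
    NR > 1 {
        # Slide an n-character window across field 2 and count each substring.
        for (i = 1; i <= length($2) - n + 1; i++)
            c[substr($2, i, n)]++
    }
    END {
        # First pass: find the highest count.
        for (s in c)
            if (c[s] > max)
                max = c[s]
        # Second pass: print every substring that reaches it.
        for (s in c)
            if (c[s] == max)
                print s
    }'
}

# Made-up sample input (header line plus two data lines).
printf 'id seq\n1 byebyebye\n2 hellobyebye\n' | most_common
# prints: byebye
```

To run it against the real data, use most_common < file. When several substrings tie, they come out in whatever order awk walks the array, so pipe the output through sort if you need a stable order.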

If you want to try this on a Solaris/SunOS system, change awk from the default /usr/bin/awk to /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk.

I use the Korn shell, but any shell that supports basic Bourne shell or POSIX standard shell syntax can be used instead.
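
If the same script has to run on both Solaris/SunOS and other systems, one way to handle the awk choice is to pick the interpreter at run time. This is a rough sketch, not something I have run on every system: pick_awk and AWK are names I made up, and the only paths it tries are the ones listed above, falling back to nawk and then to plain awk.

```shell
#!/bin/sh
# Sketch: choose a standards-conforming awk where the default one may be
# the legacy version. pick_awk and AWK are hypothetical names.
pick_awk() {
    # Try the Solaris/SunOS locations mentioned in the post first.
    for candidate in /usr/xpg4/bin/awk /usr/xpg6/bin/awk; do
        if [ -x "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    # Otherwise use nawk if it is on PATH, else whatever awk is the default.
    if command -v nawk >/dev/null 2>&1; then
        printf '%s\n' nawk
    else
        printf '%s\n' awk
    fi
}

AWK=$(pick_awk)
```

After that, replace each awk in the scripts above with "$AWK".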