Finding most common substrings Post: 302889672

Sponsored Content

Top Forums Shell Programming and Scripting Finding most common substrings Post 302889672 by Don Cragun on Sunday 23rd of February 2014 12:43:35 AM

02-23-2014

Registered User

Does this help?

Code:

#!/bin/ksh
awk '
NR > 1 {# For all lines except line 1 (which contains headers; not data), loop
        # through the 2nd field and increment the number of times the six
        # character substring starting at offset i in tn the 2nd field appears:
        for(i = length($2) - 5; i >= 1; i--)
                c[substr($2, i, 6)]++
}
END {   # After all input lines have been processed, print and sort (using a
        # descreasing numeric sort) the number of times that substring occurred
        # and the substring.
        for(i in c)
                print c[i], i | "sort -rn"
}' file |       # The data to be processed comes from a file named file.
# The sorted output is then piped tnto another awk script to just print the
# most commonly appearing substrings.
awk '
# Save the count from the 1st input line (the most frequent substring).
NR == 1 {c = $1}
# Print the substring from every input line that has the same count as the
# most commonly appearing substring.
$1 == c {print $2}'

These 2 Users Gave Thanks to Don Cragun For This Post:

Don Cragun

View Public Profile for Don Cragun

Find all posts by Don Cragun

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Breaking strings into Substrings

I'm only new to shell programming and have been given a task to do a program in .sh, however I've come to a point where I'm not sure what to do. This is my code so far: # process all arguments (i.e. loop while $1 is present) while ; do # echo "Arg is $1" case $1 in -h*|-H*) echo "help...

2. Shell Programming and Scripting

Finding the most common entry in a column

Hi, I have a file with 3 columns in it that are comma separated and it has about 5000 lines. What I want to do is find the most common value in column 3 using awk or a shell script or whatever works! I'm totally stuck on how to do this. e.g. value1,value2,bob value1,value2,bob...

3. Shell Programming and Scripting

extracting substrings

Hi guys, I am stuck in this problem. Please help. I have two files. FILE1 (with records starting from '>' ) >TC1723_3 similar to Scific_A7Q9Q3 EMSPSQDYCDDYFKLTYPCTAGAQYYGRGALPVYWNYNYGAIGEALKLDLLNHPEYIEQN ATMAFQAAIWRWMNPMKKGQPSAHDAFVGNWKP >TC214_2 similar to Quiet_Ref100_Q8W2B2 Cluster;...

4. Shell Programming and Scripting

Finding longest common substring among filenames

I will be performing a task on several directories, each containing a large number of files (2500+) that follow a regular naming convention: YYYY_MM_DD_XX.foo_bar.A.B.some_different_stuff.EXT What I would like to do is automatically discover the part of the filenames that are common to all...

5. Shell Programming and Scripting

Finding Authors in Common Across Dozens of Lists

I currently have publication lists for ~3 dozen faculty members. I need to find out how many publications are in common across all faculty members - person 1 with person 2, person 1 with person 3, person 2 with person 3, person 1 with both person 2 and person 3, etc. One person may have Last1,...

6. Shell Programming and Scripting

extracting substrings from variables

Hello Everyone, I am looking for a way to extract substrings to local variables. Here is the format of the string variable i am using : /var/x/www && /usr/x/share/doc && /etc/x/logs where the substrings i must extract are the "/var/x/www" and such. I was originally thinking of using...

7. Shell Programming and Scripting

finding common numbers (contents) across 2 or 3 files

I have 3 files which are tab delimited and have numbers in it. file 1 1 2 3 4 5 6 7 File 2 3 5 7 8 File 3 1

8. Shell Programming and Scripting

Finding out the common lines in two files using 4 fields with the help of awk and UNIX

Dear All, I have 2 files. If field 1, 2, 4 and 5 matches in both file1 and file2, I want to print the whole line of file1 and file2 one after another in my output file. File1: sc2/80 20 . A T 86 F=5;U=4 sc2/60 55 . G T ...

9. Shell Programming and Scripting

Look for substrings with special characters

Hello gurus, I have a lookup table cat tmp1 \\\erw``~ 1 ^774574574565665f\] 2 ()42543^ and I`m trying to compare a bunch of strings such that, either the lookup table column 1, or the string to be looked up are substrings of each other (and return the second lookup column if yes). ...

10. UNIX for Beginners Questions & Answers

Finding common entries between 10 columns

Hello, I need to find the intersection across 10 columns. Kindly help. my file (INPUT.csv) looks like this 4_R 4_S 8_R 8_S 12_R 12_S 24_R 24_S LOC_Os01g01010 LOC_Os01g01010 LOC_Os01g01010 LOC_Os04g48290 LOC_Os01g01010 LOC_Os01g01010...

LEARN ABOUT V7

join

JOIN(1) 						      General Commands Manual							   JOIN(1)

NAME

       join - relational database operator

SYNOPSIS

       join [ options ] file1 file2

DESCRIPTION

       Join  forms,  on the standard output, a join of the two relations specified by the lines of file1 and file2.  If file1 is `-', the standard
       input is used.

       File1 and file2 must be sorted in increasing ASCII collating sequence on the fields on which they are to be joined, normally the  first	in
       each line.

       There  is  one line in the output for each pair of lines in file1 and file2 that have identical join fields.  The output line normally con-
       sists of the common field, then the rest of the line from file1, then the rest of the line from file2.

       Fields are normally separated by blank, tab or newline.	In this case, multiple separators count as one, and leading  separators  are  dis-
       carded.

       These options are recognized:

       -an    In addition to the normal output, produce a line for each unpairable line in file n, where n is 1 or 2.

       -e s   Replace empty output fields by string s.

       -jn m  Join on the mth field of file n.	If n is missing, use the mth field in each file.

       -o list
	      Each  output line comprises the fields specifed in list, each element of which has the form n.m, where n is a file number and m is a
	      field number.

       -tc    Use character c as a separator (tab character).  Every appearance of c in a line is significant.

SEE ALSO

       sort(1), comm(1), awk(1)

BUGS

       With default field separation, the collating sequence is that of sort -b; with -t, the sequence is that of a plain sort.

       The conventions of join, sort, comm, uniq, look and awk(1) are wildly incongruous.

																	   JOIN(1)

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Breaking strings into Substrings

Discussion started by: switch

2. Shell Programming and Scripting

Finding the most common entry in a column

Discussion started by: Donkey25

3. Shell Programming and Scripting

extracting substrings

Discussion started by: smriti_shridhar

4. Shell Programming and Scripting

Finding longest common substring among filenames

Discussion started by: cmcnorgan