Parse the longest matching string


 
# 1  
Old 08-01-2018

Hello experts,

I am trying to unscramble a mixed signal into component signals.

Let the list of known signals be

Code:
$ cat tmplist


DU
DU4016
GFF
GFF2010
GFF201019
G2115
G211
DU40


Let the scrambled signals (separated by "/") be

Code:
$ cat tmpsignal
([GFF201019B-//C21-DU4016/*DU/GFF2010
DU40/GFF201019-b-1-2-3-/DU4016/GFF2010/THFFF

My desired output is


Code:
([GFF201019B-//C21-DU4016/*DU/GFF2010	GFF201019	DU4016	DU	GFF2010
DU40/GFF201019-b-1-2-3-/DU4016/GFF2010/THFFF	DU40	GFF201019	GFF2010	THFFF

When I iterate over an array of known signals, I get the shortest matching signal (which can be a substring of a longer signal):


Code:
$ awk -F"/" 'NR==FNR{a[$1];next}{t=$0; for(i=1;i<=NF;i++) { for (as in a) { if ($i~as) {$i=as}}} print t,$0}' tmplist tmpsignal
([GFF201019B-//C21-DU4016/*DU/GFF2010 GFF  DU DU GFF
DU40/GFF201019-b-1-2-3-/DU4016/GFF2010/THFFF DU GFF DU GFF THFFF

Please assist: how can I catch the longest possible match? The original data has ~20k known signals and ~20 million scrambled ones.
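
A likely cause, sketched on a single field: the loop overwrites `$i` with every signal that matches, so later iterations test the already-shortened field, and any shorter signal contained in it replaces it again. Because `for (as in a)` visits the keys in an unspecified order, the result tends to drift toward the shortest signal. The snippet below is only an illustration; the function name `shrink_demo` and the fixed visiting order are assumptions made for the demo.

```shell
# Illustration only: shrink_demo is a hypothetical name, and the signals
# are visited in a fixed order (longest first) to make the effect visible.
shrink_demo() {
    printf '%s\n' GFF201019 GFF2010 GFF |
    awk '
        { sig[NR] = $0; n = NR }            # load three known signals
        END {
            f = "([GFF201019B-"             # first field of the first sample record
            for (j = 1; j <= n; j++)
                if (f ~ sig[j]) f = sig[j]  # overwrite, as in the one-liner above
            print f
        }'
}
shrink_demo    # prints GFF
```

Testing against an untouched copy of the field, and keeping only the longest hit, avoids this shrinking.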
# 2  
Old 08-01-2018
One approach might be to check the longest signals first and break on the first match. My mawk doesn't offer a sort function, so I fall back to the *nix sort command, piping into an adaptation of your code:
Code:
awk '{print length, $0}' file1 | sort -rn | awk  '
NR == FNR       {T[NR] = $2                     # known signals, longest first
                 MAX = NR
                 next
                }

                {t = $0                         # keep the unmodified record
                 for (i=1; i<=NF; i++)
                   for (j=1; j<=MAX; j++)       # first hit is the longest match
                     if (index ($i, T[j]))      {$i = T[j]
                                                 break
                                                }
                 print t, $0
                }
' - FS="/" OFS="\t" file2
([GFF201019B-//C21-DU4016/*DU/GFF2010    GFF201019        DU4016    DU    GFF2010
DU40/GFF201019-b-1-2-3-/DU4016/GFF2010/THFFF    DU40    GFF201019    DU4016    GFF2010    THFFF
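
The same idea can also be sketched without the external sort, by remembering the longest signal found in each field. This is a sketch under assumptions rather than a drop-in replacement: the function name `longest_match` is made up, the file names follow post #1, and blank lines in the signal list are skipped.

```shell
# Sketch only: longest_match is a hypothetical helper.
# $1 = list of known signals (one per line), $2 = "/"-separated records.
longest_match() {
    awk -F'/' -v OFS='\t' '
    NR == FNR { if ($0 != "") sig[++n] = $0; next }   # load signals, skip blank lines
    {
        out = $0                                       # original record comes first
        for (i = 1; i <= NF; i++) {
            best = ""
            for (j = 1; j <= n; j++)                   # keep the longest signal found in $i
                if (length(sig[j]) > length(best) && index($i, sig[j]))
                    best = sig[j]
            out = out OFS (best == "" ? $i : best)     # unmatched fields pass through
        }
        print out
    }' "$1" "$2"
}
```

Called as `longest_match tmplist tmpsignal`, it should give the same tab-separated fields as the sorted version. Unlike the sorted version it cannot break out early, so it scans the whole list for every field; with ~20k signals and ~20 million records that may well be slower.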

Are you sure THFFF should occur in the output? It is not a known signal as defined in file1.
This User Gave Thanks to RudiC For This Post: