Replace substring by longest string in common field (awk)


# 1  

Hi,

Let's say I have a pipe-separated input like so:
Code:
name_10|A|BCCC|cat_1
name_11|B|DE|cat_2
name_10|A|BC|cat_3
name_11|B|DEEEEEE|cat_4

Using awk, for records that share the same field 2, I am trying to replace every shorter string in field 3 with the longest field-3 string of that group.
The goal is the following output (field 3 changes on lines 2 and 3):
Code:
name_10|A|BCCC|cat_1
name_11|B|DEEEEEE|cat_2
name_10|A|BCCC|cat_3
name_11|B|DEEEEEE|cat_4

Here is what I have so far, but I am stuck:
Code:
echo -e "name_10|A|BCCC|cat_1\nname_11|B|DE|cat_2\nname_10|A|BC|cat_3\nname_11|B|DEEEEEE|cat_4" |
awk '
BEGIN{FS="|"}
{
    if(a[$2] < length($3)){
        a[$2]=$3
    }
}
END{
    for(i in a){
        print i FS a[i]
    }
}'

# 2  
Hello beca123456,

Could you please try the following.

Code:
awk 'BEGIN{FS=OFS="|"} FNR==NR{b[$1]=length($3)>a[$1]?$3:b[$1];a[$1]=length($3)>a[$1]?length($3):a[$1];next} length($3)<a[$1]{$3=b[$1]} 1'  Input_file  Input_file

A multi-line form of the same solution is:
Code:
awk '
BEGIN{
  FS=OFS="|"
}
FNR==NR{
  b[$1]=length($3)>a[$1]?$3:b[$1]
  a[$1]=length($3)>a[$1]?length($3):a[$1]
  next
}
length($3)<a[$1]{
  $3=b[$1]
}
1
'   Input_file  Input_file

Output will be as follows.

Code:
name_10|A|BCCC|cat_1
name_11|B|DEEEEEE|cat_2
name_10|A|BCCC|cat_3
name_11|B|DEEEEEE|cat_4
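
Note that the command above groups on $1, whereas the question asks for records with a common field 2. On the sample data both fields define the same groups, so the output is identical. The following is only a minimal sketch of how the two keys can diverge. It assumes a hypothetical extra record (name_12|A|BCCCCC|cat_5) and a helper variable k selecting the key column, neither of which is part of the original post, and it uses a one-pass approach for brevity rather than the two-pass command above.

```shell
# Hypothetical sample: name_12 shares field 2 ("A") with name_10 but not field 1.
input='name_10|A|BCCC|cat_1
name_10|A|BC|cat_3
name_12|A|BCCCCC|cat_5'

# longest_by_key KEYCOL: replace $3 by the longest $3 among records sharing $KEYCOL.
# The function name and the awk variable k are assumptions made for this sketch.
longest_by_key() {
  printf '%s\n' "$input" | awk -v k="$1" '
    BEGIN { FS = OFS = "|" }
    {
      line[NR] = $0                      # remember every record
      last = NR                          # record count, saved for END
      if (length($3) > length(best[$k])) # keep the longest $3 per key
        best[$k] = $3
    }
    END {
      for (n = 1; n <= last; n++) {
        $0 = line[n]                     # restore the record and re-split it
        $3 = best[$k]
        print
      }
    }'
}

longest_by_key 1   # grouped on field 1: the name_10 records get BCCC
longest_by_key 2   # grouped on field 2: all three records get BCCCCC
```

With the original four sample lines, both calls would print the same result.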

Thanks,
R. Singh
# 3  
Brilliant !

However, although I think I understand the following lines, I cannot place them in the context:
Code:
# When reading the input file for the first time, create an array based on $1 for which the values are $3 if length($3)>a[$1] or b[$1] if not.
# How come this line does not trigger an error since a[$1] is not yet defined?
b[$1]=length($3)>a[$1]?$3:b[$1]

# Defining a[$1]
# if length($3) > a[$1], then a[$1] equals length($3), otherwise equals a[$1]
a[$1]=length($3)>a[$1]?length($3):a[$1]

# 4  
Try also
Code:
awk -F\| '
        {LN[NR] = $0
         L      = length($3)
         if (L>MX[$2])  {MX[$2] = L
                         D3[$2] = $3
                        }
        }
END     {for (n=1; n<=NR; n++)  {$0 = LN[n]
                                 $3 = D3[$2]
                                 print
                                }
        }
' OFS=\| file
name_10|A|BCCC|cat_1
name_11|B|DEEEEEE|cat_2
name_10|A|BCCC|cat_3
name_11|B|DEEEEEE|cat_4

On awk versions that do not preserve NR's value in the END section, you will need a temporary variable to carry its value over.
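
A minimal sketch of that workaround, assuming a variable named last (a name chosen here, not taken from the post) that is updated on every record and used in place of NR in the END section. The sample input is fed through printf only to keep the example self-contained.

```shell
# fix_fields: same logic as the script above, but the record count is
# carried into END through the assumed variable "last" instead of NR.
fix_fields() {
  awk -F'|' -v OFS='|' '
    {
      LN[NR] = $0
      last = NR                  # temp var: survives into END on any awk
      L = length($3)
      if (L > MX[$2]) {          # MX[$2] is 0 the first time a key is seen
        MX[$2] = L
        D3[$2] = $3
      }
    }
    END {
      for (n = 1; n <= last; n++) {
        $0 = LN[n]
        $3 = D3[$2]
        print
      }
    }'
}

printf '%s\n' 'name_10|A|BCCC|cat_1' 'name_11|B|DE|cat_2' \
              'name_10|A|BC|cat_3' 'name_11|B|DEEEEEE|cat_4' | fix_fields
```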
# 5  
Quote:
Originally Posted by beca123456
Brilliant !
Thank you :)

As for your question: a[$1] does not throw an error because an uninitialized variable or array element in awk simply evaluates to the empty string (or to 0 in a numeric context) when it is used in a condition, so no error is raised.

Here is a detailed explanation of my solution above:

Code:
awk '                                                ##Starting awk program from here.
BEGIN{                                               ##Starting BEGIN section of this awk code here.
  FS=OFS="|"                                         ##Setting FS and OFS as pipe here.
}
FNR==NR{                                             ##FNR==NR is TRUE only while Input_file is being read for the first time.
  b[$1]=length($3)>a[$1]?$3:b[$1]                    ##Array b, indexed by $1: if the length of $3 is greater than a[$1], store $3; otherwise keep the old value.
  a[$1]=length($3)>a[$1]?length($3):a[$1]            ##Array a, indexed by $1: if the length of $3 is greater than a[$1], store length($3); otherwise keep the old value. It holds the longest length seen so far, as a number, for use in the condition below.
  next                                               ##next skips all further statements for the current line.
}
length($3)<a[$1]{                                    ##If the length of the 3rd field is less than a[$1], then
  $3=b[$1]                                           ##set the current $3 to b[$1].
}
1                                                    ##1 prints the current line, edited or not.
'  Input_file Input_file                             ##Input_file is given twice so that it is read in two passes.

# 6  
An uninitialized variable (or array element) becomes 0 in a numeric context and "" in a string context.
In this case, since 0 is the smallest possible string length, 0 is exactly the right starting value.
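
A small sketch to check this behavior, assuming nothing beyond a POSIX awk. The names x, a and uninit_demo are placeholders made up for this example.

```shell
# uninit_demo: x and a["missing"] are never assigned anywhere.
uninit_demo() {
  awk 'BEGIN {
    if (x == 0)  print "numeric: x == 0"       # compared as the number 0
    if (x == "") print "string: x is empty"    # compared as the string ""
    print length(a["missing"])                 # length of the empty string
  }'
}

uninit_demo
```

Both comparisons succeed and the reported length is 0, which is why length($3) > a[$1] works the first time a key is seen.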

With two passes through the input file, a single array holding the longest string is enough:

Code:
awk '
BEGIN { FS=OFS="|" }
# NR == FNR when reading the 1st file
NR == FNR {
# 1st file
# a[$2] holds the longest $3
  if (length(a[$2]) < length($3)) a[$2]=$3
# jump to next input cycle, do not run the following code
  next
}
# 2nd file, here: pass 2
{
# always update $3
  $3=a[$2]
  print
}' Input_file Input_file
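
The two-pass form needs a real file that can be opened twice, so it cannot read directly from a pipe such as the echo in post #1. A minimal sketch of one way around that, assuming mktemp is available. The function name longest_two_pass and the scratch file tmp are hypothetical.

```shell
# longest_two_pass: copy stdin to a scratch file, then read that file twice.
longest_two_pass() {
  tmp=$(mktemp) || return 1
  cat > "$tmp"                       # save the piped input
  awk '
    BEGIN { FS = OFS = "|" }
    NR == FNR {                      # pass 1: remember the longest $3 per $2
      if (length(a[$2]) < length($3)) a[$2] = $3
      next
    }
    { $3 = a[$2]; print }            # pass 2: always update $3
  ' "$tmp" "$tmp"
  rm -f "$tmp"
}

printf '%s\n' 'name_10|A|BCCC|cat_1' 'name_11|B|DE|cat_2' \
              'name_10|A|BC|cat_3' 'name_11|B|DEEEEEE|cat_4' | longest_two_pass
```

When the data only arrives on a pipe, the one-pass variant avoids the scratch file at the cost of holding all lines in memory.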

Alternatively, similar to post #4, with one pass through the input file: every line is stored in a line[] array and then printed in a loop in the END section.
Code:
awk '
BEGIN { FS=OFS="|" }
{
# store $0 in line[1..]
  line[NR]=$0
# a[$2] holds the longest $3
  if (length(a[$2]) < length($3)) a[$2]=$3
}
END {
  for (n=1; n<=NR; n++) {
# restore $0
    $0=line[n]
# always update $3
    $3=a[$2]
    print
  }
}
' Input_file


Last edited by MadeInGermany; 5 Days Ago at 10:23 AM..