Problem getting sed to work with variables

02-28-2018

Give this a try with a few different input files:

Code:

awk '
function PRT()  {TMP = $0
                 for (i=1; i<=MXLN; i++)        {$0 = LINE[i]
                                                 if (MAX[$2] > 1) $2 = "dup_" 0+CNT[$2]++ "_" $2
                                                 print  >  FN
                                                }
                 $0 = TMP
                }


FNR == 1        {if (NR>1) PRT()
                 FN = FILENAME
                 sub (/^(.*\/)*/, "revised_", FN)
                 delete LINE
                 delete MAX
                 delete CNT
                }
                {LINE[FNR] = $0
                 MAX[$2]++
                 MXLN = FNR
                }

END             {PRT()
                }

' OFS="\t" /tmp/test_base.txt

If your awk version doesn't support deleting a whole array with delete array, replace that statement with split ("", array).
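A minimal sketch of that fallback, assuming any POSIX awk. The array name seen and the function name count_after_clear are only for illustration and are not part of the script above.

```shell
count_after_clear() {
    awk 'BEGIN {
        seen["a"] = 1; seen["b"] = 2
        split("", seen)            # empties the array, same effect as: delete seen
        n = 0
        for (k in seen) n++        # count whatever is left
        print n
    }'
}

count_after_clear
```

This prints 0, because split() clears the target array before storing the (zero) pieces of the empty string.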

RudiC

02-28-2018

Your #1 problem is that your output file is overwritten on each loop cycle, so it will only contain the last sed output.
Your #2 problem is that [ ] are RE-special characters
(while ( ) are special only in ERE, not in BRE).
--
I think my solution does what you intended.

Code:

#!/bin/bash
# in each given base file check for duplicate names
for base_file
do
 awk '                                  
  BEGIN { FS=OFS="\t" }
# process the base_file and count the dups, +1 if a dup was met
  NR==FNR { if (dup[$2]++==1) dup[$2]++; next }
# process the base_file again, if a dup then add a dup_#_ prefix
  dup[$2]>1 { $2=("dup_" --dup[$2]-1 "_" $2) }
  { print }
 ' "$base_file" "$base_file" > revised_"$base_file"
done
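
The two-pass idea (read the same file twice, count on the first pass, rename on the second) can be tried on a few lines of made-up data. This is a simplified variation rather than the exact code above: it keeps a separate counter so the numbering runs upward from 0, and the names number_dups, total, seen and the sample rows are assumptions for illustration.

```shell
number_dups() {
    # $1 = tab-separated file with the name in column 2
    awk '
      BEGIN { FS = OFS = "\t" }
      # pass 1: count how often each name occurs
      NR == FNR { total[$2]++; next }
      # pass 2: prefix names that occur more than once, numbering from 0
      total[$2] > 1 { $2 = "dup_" (seen[$2]++) "_" $2 }
      { print }
    ' "$1" "$1"
}

demo_file=$(mktemp)
printf '1\tfoo\n2\tbar\n3\tfoo\n4\tfoo\n' > "$demo_file"
number_dups "$demo_file"
rm -f "$demo_file"
```

With the sample rows, the three foo entries come out as dup_0_foo, dup_1_foo and dup_2_foo, and bar is left alone.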

Last edited by MadeInGermany; 02-28-2018 at 05:55 PM.. Reason: added a loop over the arguments

MadeInGermany

03-01-2018

Quote:

Originally Posted by MadeInGermany

The posted solutions seem to work, but don't solve the issue of needing to make changes in other files. I have my script working by changing the input and output names so I'm not overwriting the changes made earlier in the loop.

This version makes the changes in the first file (base_file) and then in a second file (sdf_file). The second file is more complex because it is a multi-line record file in which the string that needs to be changed occurs in two places. I have added an awk call that handles this part.

Code:

#!/bin/bash

# base file to check for duplicate names
base_file=$1
# additional file where names also need to be changed
sdf_file=$2

# copy base_file to make changes in
cp -fp "$base_file" temp_file_1

# copy sdf_file to make changes in
cp -fp "$sdf_file" temp_file_sdf_1

# set input field separator to newline so each line is stored in an array element
IFS=$'\n'
# use sort and uniq to output duplicate lines in to array
dup_list=( $(sort -k2 "$base_file" | uniq -f1 -D) )

# index for modified duplicate names
name_count=0
# name of current duplicate to check for new name series
current_dup=''

# loop on list of duplicate names
for dup_name in "${dup_list[@]}"
do

   # use second field for name
   unset FIELD; IFS=$'\t' read -a FIELD <<< "$dup_name"
   # set current name value
   current_name="${FIELD[1]}"

   # if no current dup has been set
   if [ "$current_dup" == "" ]; then
      # set value to check against  for new duplicate
      current_dup=$current_name
      #create indexed dup name
      new_name='dup_'$name_count'_'$current_name
   # if the current name matches the current dup, increment counter
   elif [ "$current_dup" == "$current_name" ]; then
      # increment counter
      name_count=$((name_count+1))
      # create name based on incremented counter
      new_name='dup_'$name_count'_'$current_name
   # if there is a new dup series
   elif [ "$current_dup" != "$current_name" ]; then
      # set value to new duplicate name
      current_dup=$current_name
      # reset name index prefix value
      name_count=0
      #create new dup name
      new_name='dup_'$name_count'_'$current_name
   fi

   # replace first instance of duplicate name in base_file copy with first indexed name
   sed "0,/\t$current_name$/s//\t$new_name/" 'temp_file_1' > 'temp_file_2'

   # rename temp output so that it is the input for the next loop
   # prevents changes from being overwritten
   mv 'temp_file_2'  'temp_file_1'

   # revise current_name to find in sdf
   current_name='ID_'$current_name
   # revise new_name to write to sdf copy
   new_name='ID_'$new_name
   # set check value
   check=1
   # set found value
   found=0

   # make corresponding change to sdf file
   # this should find the name passed in as "find" on both lines where it exists
   # and then replace it with the name passed in as "replace" when output is written
   # once the first instance is found, checking stops and the rest of the file is output unchanged
   cat temp_file_sdf_1 | \
   awk -v find="$current_name" \
       -v replace="$new_name" \
       -v found=$found \
       -v check=$check ' check == 1 { OUT[++CNT] = $0;
                                      if ( $0 == find ){
                                         OUT[CNT] = replace;
                                         found = 1;
                                      }
                                      else if ( $0 == "$$$$" && found == "1") {
                                         for(i=1; i<=CNT; i++) print OUT[i];
                                         delete OUT;
                                         CNT = 0;
                                         check = 0;
                                      }
                                      else if ( $0 == "$$$$" && found == "0") {
                                         for(i=1; i<=CNT; i++) print OUT[i];
                                         delete OUT;
                                         CNT = 0;
                                       }
                                    }
                         check == 0 { print $0 }' > temp_file_sdf_2

   # rename temp output so that it is the input for the next loop
   # prevents changes from being overwritten
   mv 'temp_file_sdf_2'  'temp_file_sdf_1'

done

# change name from temp name to output file name
mv 'temp_file_1' "revised_$base_file"

# change name from temp name to output file name
mv 'temp_file_sdf_1' "revised_$sdf_file"

The awk code stores each record in an array until the end of record is reached ($$$$). Along the way, if a line is found that matches the name that needs to be changed, the array element for that line is overwritten with the revised name. When the end of record is reached, the record is written to the new file. This is also set up so that the records are only read and checked until the replacement is found and implemented. After that, the indicator "check" is set to 0 and all remaining rows are printed unchanged and unchecked.

The test files can be run with,
./rename_duplicates.sh test_base.txt test_sdf.txt

This works on the test files I have tried so far and is reasonably fast. I am still concerned about the comment that [ ] are RE-special characters. The test files attached with the script do contain these characters, and they are involved in the substitution, so I'm not sure why it is working.
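
For reference, a small sketch of what "RE-special" means for one of the names in the test data, assuming POSIX grep and sed. In a regular expression, [...] is a bracket expression that matches a single character from the set, not the literal text. The helper name escape_bre is hypothetical, and the sketch only escapes the search pattern, not the replacement.

```shell
name='2-[2-hydroxyethyl(methyl)amino]ethanol'

# As a regular expression the bracketed part matches ONE character,
# so the name does not match itself; -F compares it as a fixed string.
as_regex=$(printf '%s\n' "$name" | LC_ALL=C grep -c -- "$name" || true)
as_fixed=$(printf '%s\n' "$name" | grep -F -c -- "$name")
echo "regex matches: $as_regex, fixed-string matches: $as_fixed"

# Hypothetical helper: backslash-escape the characters that are special in a
# basic regular expression, plus the / used as the sed delimiter.
escape_bre() {
    printf '%s\n' "$1" | sed 's|[[\.*^$/]|\\&|g'
}

pattern=$(escape_bre "$name")
# & in the replacement stands for the matched text
printf '%s\n' "$name" | sed "s/$pattern\$/dup_0_&/"
```

The first line of output reports 0 regex matches and 1 fixed-string match; the sed call with the escaped pattern prints the name with the dup_0_ prefix. A name containing &, / or a backslash would also need escaping on the replacement side.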

I am certainly not married to this code, but I do need a solution that will work on multiple files. Some of the files are larger (50MB-100MB) so the above solution may be slow in some cases.

LMHmedchem

rename_duplicates.sh (3.8 KB)

test_base.txt (4.3 KB)

test_sdf.txt (129.9 KB)

03-01-2018

This solution as 1 awk program appears to work OK with your test data.

Note: I used index() and substr() instead of gsub() to avoid possible RE character issues that MadeInGermany identified.

Code:

# usage ./rename_duplicates.sh  test_base.txt  test_sdf.sdf

# base file to check for duplicate names
base_file=$1
# sdf with duplicate structures
sdf_file=$2

awk '
  BEGIN { FS=OFS="\t" }
# start each output file empty (the counter is already 1 when the second argument begins)
  FNR==1{ if(file++ >= 1) {printf "" > ("revised_" FILENAME) } }
# process the base_file and count the dups, +1 if a dup was met
  file==1 { if (dup[$2]++==1) dup[$2]++; next }

# process the base_file again, if a dup then add a dup_#_ prefix
  file==2 && dup[$2]>1 { repcnt[$2]++; $2=("dup_" repcnt[$2]-1 "_" $2)}
  file==2 { print >> "revised_"FILENAME }

# replace all duplicate keys in repcnt[] with the dup_ddd string
  file==3 {
      for(check in repcnt) {
          pos=index($0, check)
          if (pos) {
             fdup[check]++
             old=$0
             $0=""
             while(pos) {
                 $0=$0 substr(old,1,pos-1) "dup_" fdup[check] - 1 "_" check
                 old=substr(old, pos + length(check))
                 pos=index(old, check)
             }
             # Some efficiency - when all dups replaced dont check for it again
             if (fdup[check] == repcnt[check]) delete repcnt
             $0=$0 old
          }
      }
      print $0 "\n$$$$" >> "revised_"FILENAME
  }
 ' "$base_file" "$base_file" FS="" RS="\n[$]{4}\n" "$sdf_file"
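
The index()/substr() approach can also be looked at in isolation. The following is a minimal sketch, not the program above: the shell function literal_replace and the awk function replace_all are hypothetical names, and because the strings are passed with awk -v, any backslashes in them would be interpreted as escape sequences.

```shell
literal_replace() {
    # $1 = text to find, $2 = replacement; reads lines on stdin
    awk -v find="$1" -v rep="$2" '
      # replace every occurrence of old in s without using a regular expression
      function replace_all(s, old, rep,    out, pos) {
          out = ""
          while ((pos = index(s, old)) > 0) {
              out = out substr(s, 1, pos - 1) rep
              s = substr(s, pos + length(old))
          }
          return out s
      }
      { print replace_all($0, find, rep) }
    '
}

printf '%s\n' 'ID_2-[2-hydroxyethyl(methyl)amino]ethanol' |
    literal_replace '2-[2-hydroxyethyl(methyl)amino]ethanol' \
                    'dup_0_2-[2-hydroxyethyl(methyl)amino]ethanol'
```

Because index() compares plain strings, the [ ] and ( ) in the name need no escaping here.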

Last edited by Chubler_XL; 03-01-2018 at 09:00 PM..

Chubler_XL

03-02-2018

Quote:

Originally Posted by Chubler_XL

This solution as 1 awk program appears to work OK with your test data.

Note: I used index() and substr() instead of gsub() to avoid possible RE character issues that MadeInGermany identified.

Your script works in part. In the test file there are two sets of duplicate strings with three instances each,

Code:

(1R)-1,2,3,3-tetraamino-2-propen-1-ol
(1R)-1,2,3,3-tetraamino-2-propen-1-ol
(1R)-1,2,3,3-tetraamino-2-propen-1-ol
2-[2-hydroxyethyl(methyl)amino]ethanol
2-[2-hydroxyethyl(methyl)amino]ethanol
2-[2-hydroxyethyl(methyl)amino]ethanol

For the copy of the base file, these are replaced with the intended indexed unique names,

Code:

dup_0_(1R)-1,2,3,3-tetraamino-2-propen-1-ol  # on line 10 of revised_test_base.txt
dup_1_(1R)-1,2,3,3-tetraamino-2-propen-1-ol  # on line 18 of revised_test_base.txt
dup_2_(1R)-1,2,3,3-tetraamino-2-propen-1-ol  # on line 36 of revised_test_base.txt
dup_0_2-[2-hydroxyethyl(methyl)amino]ethanol # on line 20 of revised_test_base.txt
dup_1_2-[2-hydroxyethyl(methyl)amino]ethanol # on line 46 of revised_test_base.txt
dup_2_2-[2-hydroxyethyl(methyl)amino]ethanol # on line 79 of revised_test_base.txt

In the sdf file, the indexed replacement is only partial.

Code:

ID_dup_0_(1R)-1,2,3,3-tetraamino-2-propen-1-ol  # substituted on lines 366 and 390 of revised_test_sdf.txt
ID_dup_1_(1R)-1,2,3,3-tetraamino-2-propen-1-ol  # substituted on lines 728 and 752
ID_dup_2_(1R)-1,2,3,3-tetraamino-2-propen-1-ol  # substituted on lines 1540 and 1564
ID_dup_0_2-[2-hydroxyethyl(methyl)amino]ethanol # substituted on lines 818 and 842

however,

Code:

ID_dup_1_2-[2-hydroxyethyl(methyl)amino]ethanol  
ID_dup_2_2-[2-hydroxyethyl(methyl)amino]ethanol

do not appear anywhere in the file and duplicate values for ID_2-[2-hydroxyethyl(methyl)amino]ethanol still appear in the two remaining duplicate records at lines 1993,2017 and 3491,3515.

I don't see where this is failing, but I also don't understand very well what you did. It is about 100 times faster than my script, which will make a difference with the bigger files.

LMHmedchem

03-02-2018

Small mistake. Replace the line:

Code:

 if (fdup[check] == repcnt[check]) delete repcnt

with

Code:

 if (fdup[check] == repcnt[check]) delete repcnt[check]
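
The difference between the two forms, as a small stand-alone sketch (assuming an awk that accepts delete on a whole array, such as gawk; the keys "a" and "b" and the function name demo_delete are made up). With the original line, finishing one name empties the whole table, so names that are still waiting would no longer be looked up, which seems to match the partial result reported above.

```shell
demo_delete() {
    awk 'BEGIN {
        repcnt["a"] = 3; repcnt["b"] = 3

        delete repcnt["a"]             # removes only the finished key
        n = 0; for (k in repcnt) n++
        print "after delete repcnt[\"a\"]:", n

        delete repcnt                  # removes every key
        n = 0; for (k in repcnt) n++
        print "after delete repcnt:", n
    }'
}

demo_delete
```

The first count is 1 (key "b" survives), the second is 0.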

Chubler_XL
