Problem getting sed to work with variables


 
# 8  
Old 02-28-2018
Give this a try with a few different input files
Code:
awk '
function PRT()  {TMP = $0
                 for (i=1; i<=MXLN; i++)        {$0 = LINE[i]
                                                 if (MAX[$2] > 1) $2 = "dup_" 0+CNT[$2]++ "_" $2
                                                 print  >  FN
                                                }
                 $0 = TMP
                }


FNR == 1        {if (NR>1) PRT()
                 FN = FILENAME
                 sub (/^(.*\/)*/, "revised_", FN)
                 delete LINE
                 delete MAX
                 delete CNT
                }
                {LINE[FNR] = $0
                 MAX[$2]++
                 MXLN = FNR
                }

END             {PRT()
                }

' OFS="\t" /tmp/test_base.txt

If your awk version doesn't support deleting a whole array with delete array, replace it with split("", array).
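A quick way to convince yourself that the split() workaround behaves like delete is a small sketch such as the one below. The array name seen and the variable result are made up for the illustration and are not part of the script above.

```shell
# Clear an awk array portably: split("", arr) leaves arr with no elements.
result=$(awk 'BEGIN {
    seen["a"] = 1; seen["b"] = 2   # "seen" is a hypothetical array for this demo
    split("", seen)                # stand-in for: delete seen
    n = 0
    for (k in seen) n++            # count whatever elements are left
    print n
}')
echo "$result"
```

If the count printed is 0, the array was emptied and the replacement is safe to use in PRT's caller.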
This User Gave Thanks to RudiC For This Post:
# 9  
Old 02-28-2018
Your #1 problem is that the output file is overwritten on each loop cycle, so it contains only the last sed output.
Your #2 problem is that [ ] are special characters in regular expressions (RE).
(The ( ) are special only in ERE, not in the basic RE that sed uses by default.)
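To illustrate problem #2, here is a minimal sketch of escaping a name before handing it to sed as a pattern. The helper name escape_bre and the variables pattern, unescaped and escaped are invented for this example; it only covers the pattern side of s/// and assumes the name contains no newline.

```shell
# Hypothetical helper: backslash-escape the characters that are special in a
# basic RE (plus the / delimiter) so sed matches the name literally.
escape_bre() {
    printf '%s\n' "$1" | sed 's|[][\\.*^$/]|\\&|g'
}

name='2-[2-hydroxyethyl(methyl)amino]ethanol'
pattern=$(escape_bre "$name")

# Unescaped, [ ... ] is read as a bracket expression, so the literal name is not matched.
unescaped=$(printf '%s\n' "$name" | sed "s/^$name\$/MATCH/")
# Escaped, the brackets are ordinary characters and the whole line matches.
escaped=$(printf '%s\n' "$name" | sed "s/^$pattern\$/MATCH/")

echo "unescaped: $unescaped"
echo "escaped:   $escaped"
```

The replacement side has its own special characters (&, \ and the delimiter), so a name used there would need similar treatment.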
--
I think my solution does what you intended.
Code:
#!/bin/bash
# in each given base file check for duplicate names
for base_file
do
 awk '                                  
  BEGIN { FS=OFS="\t" }
# process the base_file and count the dups, +1 if a dup was met
  NR==FNR { if (dup[$2]++==1) dup[$2]++; next }
# process the base_file again, if a dup then add a dup_#_ prefix
  dup[$2]>1 { $2=("dup_" --dup[$2]-1 "_" $2) }
  { print }
 ' "$base_file" "$base_file" > revised_"$base_file"
done


Last edited by MadeInGermany; 02-28-2018 at 05:55 PM.. Reason: added a loop over the arguments
# 10  
Old 03-01-2018
Quote:
Originally Posted by MadeInGermany
Your #1 problem is that the output file is overwritten on each loop cycle, so it contains only the last sed output.
Your #2 problem is that [ ] are special characters in regular expressions (RE).
(The ( ) are special only in ERE, not in the basic RE that sed uses by default.)

The posted solutions seem to work, but don't solve the issue of needing to make changes in other files. I have my script working by changing the input and output names so I'm not overwriting the changes made earlier in the loop.

This version makes the changes in the first file, base_file, and then in a second file, sdf_file. The second file is more complex because it is a multi-line record file where the string that needs to be changed occurs in two places. I have added an awk call that handles this part.
Code:
#!/bin/bash

# base file to check for duplicate names
base_file=$1
# additional file where names also need to be changed
sdf_file=$2

# copy base_file to make changes in
cp -fp "$base_file"  temp_file_1

# copy sdf_file to make changes in
cp -fp "$sdf_file"  temp_file_sdf_1

# set input field separator to newline so each line is stored in an array element
IFS=$'\n'
# use sort and uniq to output duplicate lines in to array
dup_list=( $(cat "$base_file" | sort -k2 | uniq  -f1 -D) )

# index for modified duplicate names
name_count=0
# name of current duplicate to check for new name series
current_dup=''

# loop on list of duplicate names
for dup_name in "${dup_list[@]}"
do

   # use second field for name
   unset FIELD; IFS=$'\t' read -a FIELD <<< "$dup_name"
   # set current name value
   current_name="${FIELD[1]}"

   # if no current dup has been set
   if [ "$current_dup" == "" ]; then
      # set value to check against  for new duplicate
      current_dup=$current_name
      #create indexed dup name
      new_name='dup_'$name_count'_'$current_name
   # if the current name matches the current dup, increment counter
   elif [ "$current_dup" == "$current_name" ]; then
      # increment counter
      name_count=$((name_count+1))
      # create name based on incremented counter
      new_name='dup_'$name_count'_'$current_name
   # if there is a new dup series
   elif [ "$current_dup" != "$current_name" ]; then
      # set value to new duplicate name
      current_dup=$current_name
      # reset name index prefix value
      name_count=0
      #create new dup name
      new_name='dup_'$name_count'_'$current_name
   fi

   # replace first instance of duplicate name in base_file copy with first indexed name
   sed "0,/\t$current_name$/s//\t$new_name/" 'temp_file_1' > 'temp_file_2'

   # rename temp output so that it is the input for the next loop
   # prevents changes from being overwritten
   mv 'temp_file_2'  'temp_file_1'

   # revise current_name to find in sdf
   current_name='ID_'$current_name
   # revise new_name to write to sdf copy
   new_name='ID_'$new_name
   # set check value
   check=1
   # set found value
   found=0

   # make corresponding change to sdf file
   # this should find $current_name on both lines where it exists
   # and then replace it with $new_name when output is written
   # once the first instance is found, checking stops and the rest of the file is output unchanged
   # the -v values are quoted so names containing [ ] or ( ) are passed through unchanged
   cat temp_file_sdf_1 | \
   awk -v find="$current_name" \
       -v replace="$new_name" \
       -v found="$found" \
       -v check=$check ' check == 1 { OUT[++CNT] = $0;
                                      if ( $0 == find ){
                                         OUT[CNT] = replace;
                                         found = 1;
                                      }
                                      else if ( $0 == "$$$$" && found == "1") {
                                         for(i=1; i<=CNT; i++) print OUT[i];
                                         delete OUT;
                                         CNT = 0;
                                         check = 0;
                                      }
                                      else if ( $0 == "$$$$" && found == "0") {
                                         for(i=1; i<=CNT; i++) print OUT[i];
                                         delete OUT;
                                         CNT = 0;
                                       }
                                    }
                         check == 0 { print $0 }' > temp_file_sdf_2

   # rename temp output so that it is the input for the next loop
   # prevents changes from being overwritten
   mv 'temp_file_sdf_2'  'temp_file_sdf_1'

done

# change name from temp name to output file name
mv 'temp_file_1'  "revised_$base_file"

# change name from temp name to output file name
mv 'temp_file_sdf_1'  "revised_$sdf_file"

The awk code stores each record in an array until the end of record is reached ($$$$). Along the way, if a line is found that matches the name that needs to be changed, the array element for that line is overwritten with the revised name. When the end of record is reached, the record is written to the new file. This is also set up so that the records are only read and checked until the replacement is found and implemented. After that, the indicator "check" is set to 0 and all remaining rows are printed unchanged and unchecked.

The test files can be run with,
./rename_duplicates.sh test_base.txt test_sdf.txt

This works on the test files I have tried so far and is reasonably fast. I am still concerned about the comment that [ ] are RE-special characters. The test files attached with the script do contain these characters, and they are involved in the substitution, so I'm not sure why it is working.

I am certainly not married to this code, but I do need a solution that will work on multiple files. Some of the files are larger (50MB-100MB) so the above solution may be slow in some cases.

LMHmedchem
# 11  
Old 03-01-2018
This solution as 1 awk program appears to work OK with your test data.

Note: I used index() and substr() instead of gsub() to avoid possible RE character issues that MadeInGermany identified.
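The idea in isolation looks roughly like this. The function name literal_replace and the variables find, repl and result are made up for the sketch, and it assumes find is not empty.

```shell
# Replace every literal occurrence of a string using index() and substr(),
# so characters such as [ ] ( ) are never interpreted as a regular expression.
result=$(printf '%s\n' 'ID_2-[2-hydroxyethyl(methyl)amino]ethanol' |
awk -v find='2-[2-hydroxyethyl(methyl)amino]ethanol' \
    -v repl='dup_0_2-[2-hydroxyethyl(methyl)amino]ethanol' '
# literal_replace is a hypothetical helper; out and pos are local variables
function literal_replace(s, from, to,    out, pos) {
    out = ""
    while ((pos = index(s, from)) > 0) {
        out = out substr(s, 1, pos - 1) to     # text before the match, then the new text
        s = substr(s, pos + length(from))      # keep scanning after the match only
    }
    return out s
}
{ print literal_replace($0, find, repl) }')
echo "$result"
```

Because only the unscanned remainder is searched again, a replacement that itself contains the search string does not cause a loop.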

Code:
# usage ./rename_duplicates.sh  test_base.txt  test_sdf.sdf

# base file to check for duplicate names
base_file=$1
# sdf with duplicate structures
sdf_file=$2

awk '
  BEGIN { FS=OFS="\t" }
# truncate each output file before the first append to it
  FNR==1{ if(file++ >= 1) {printf "" > ("revised_" FILENAME) } }
# process the base_file and count the dups, +1 if a dup was met
  file==1 { if (dup[$2]++==1) dup[$2]++; next }

# process the base_file again, if a dup then add a dup_#_ prefix
  file==2 && dup[$2]>1 { repcnt[$2]++; $2=("dup_" repcnt[$2]-1 "_" $2)}
  file==2 { print >> "revised_"FILENAME }

# replace all duplicate keys in repcnt[] with the dup_ddd string
  file==3 {
      for(check in repcnt) {
          pos=index($0, check)
          if (pos) {
             fdup[check]++
             old=$0
             $0=""
             while(pos) {
                 $0=$0 substr(old,1,pos-1) "dup_" fdup[check] - 1 "_" check
                 old=substr(old, pos + length(check))
                 pos=index(old, check)
             }
             # Some efficiency - when all dups replaced dont check for it again
             if (fdup[check] == repcnt[check]) delete repcnt
             $0=$0 old
          }
      }
      print $0 "\n$$$$" >> "revised_"FILENAME
  }
 ' "$base_file" "$base_file" FS="" RS="\n[$]{4}\n" "$sdf_file"


Last edited by Chubler_XL; 03-01-2018 at 09:00 PM..
# 12  
Old 03-02-2018
Quote:
Originally Posted by Chubler_XL
This solution as 1 awk program appears to work OK with your test data.

Note: I used index() and substr() instead of gsub() to avoid possible RE character issues that MadeInGermany identified.
Your script works in part. In the test file there are two sets of duplicate strings with three instances each,
Code:
(1R)-1,2,3,3-tetraamino-2-propen-1-ol
(1R)-1,2,3,3-tetraamino-2-propen-1-ol
(1R)-1,2,3,3-tetraamino-2-propen-1-ol
2-[2-hydroxyethyl(methyl)amino]ethanol
2-[2-hydroxyethyl(methyl)amino]ethanol
2-[2-hydroxyethyl(methyl)amino]ethanol

For the copy of the base file, these are replaced with the intended indexed unique names,
Code:
dup_0_(1R)-1,2,3,3-tetraamino-2-propen-1-ol  # on line 10 of revised_test_base.txt
dup_1_(1R)-1,2,3,3-tetraamino-2-propen-1-ol  # on line 18 of revised_test_base.txt
dup_2_(1R)-1,2,3,3-tetraamino-2-propen-1-ol  # on line 36 of revised_test_base.txt
dup_0_2-[2-hydroxyethyl(methyl)amino]ethanol # on line 20 of revised_test_base.txt
dup_1_2-[2-hydroxyethyl(methyl)amino]ethanol # on line 46 of revised_test_base.txt
dup_2_2-[2-hydroxyethyl(methyl)amino]ethanol # on line 79 of revised_test_base.txt

In the sdf file, the indexed replacement is only partial.
Code:
ID_dup_0_(1R)-1,2,3,3-tetraamino-2-propen-1-ol  # substituted on lines 366 and 390 of revised_test_sdf.txt
ID_dup_1_(1R)-1,2,3,3-tetraamino-2-propen-1-ol  # substituted on lines 728 and 752
ID_dup_2_(1R)-1,2,3,3-tetraamino-2-propen-1-ol  # substituted on lines 1540 and 1564
ID_dup_0_2-[2-hydroxyethyl(methyl)amino]ethanol # substituted on lines 818 and 842

however,
Code:
ID_dup_1_2-[2-hydroxyethyl(methyl)amino]ethanol  
ID_dup_2_2-[2-hydroxyethyl(methyl)amino]ethanol

do not appear anywhere in the file and duplicate values for ID_2-[2-hydroxyethyl(methyl)amino]ethanol still appear in the two remaining duplicate records at lines 1993,2017 and 3491,3515.

I don't see where this is failing, but I also don't understand what you did very well. It is about 100 times faster than my script which will make a difference with the bigger files.

LMHmedchem
# 13  
Old 03-02-2018
Small mistake; replace this line:

Code:
 if (fdup[check] == repcnt[check]) delete repcnt

with

Code:
 if (fdup[check] == repcnt[check]) delete repcnt[check]
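The difference between the two forms can be seen in a small sketch. The keys "a" and "b" and the variable result are placeholders for this demo; the whole-array form of delete is assumed to be available, as it is in gawk and mawk.

```shell
# delete arr[key] removes one element; delete arr removes all of them.
result=$(awk 'BEGIN {
    repcnt["a"] = 3; repcnt["b"] = 3
    delete repcnt["a"]                 # only the entry for "a" goes away
    n = 0; for (k in repcnt) n++
    printf "%d ", n                    # elements left after the single delete
    delete repcnt                      # every entry goes away
    n = 0; for (k in repcnt) n++
    printf "%d\n", n                   # elements left after the whole-array delete
}')
echo "$result"
```

With the whole-array form, the first name to be fully replaced wipes the counts for all other names, which would explain why the second set of duplicates was only partly renamed in the sdf file.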
