Problem getting sed to work with variables

02-28-2018

Registered User

362, 16

Join Date: Mar 2010

Last Activity: 3 March 2020, 10:38 PM EST

Location: Boston

Posts: 362

Thanks Given: 193

Thanked 16 Times in 15 Posts


Hello,

I am processing text files looking for a string and replacing the first occurrence of the string with something else.

For the text,

Code:

id	Name
1	methyl-(2-methylpropoxy)-oxoammonium
2	N-amino-N-(methylamino)-2-nitrosoethanamine
3	3-methoxy-3-methyloxazolidin-3-ium
4	1,3-dihydroxypropan-2-yl-methyl-methyleneammonium
5	(1R)-1,2,3,3-tetraamino-2-propen-1-ol
6	2-(ethoxyamino)guanidine
7	O-[(2S)-2-aminoazopropyl]hydroxylamine
8	N-$l^{1}-oxidanyl-N-[(2-methylpropan-2-yl)oxy]methanamine
9	(1R)-1,2,3,3-tetraamino-2-propen-1-ol
10	1-amino-1-ethoxyguanidine

I am replacing the first instance of (1R)-1,2,3,3-tetraamino-2-propen-1-ol with 0_(1R)-1,2,3,3-tetraamino-2-propen-1-ol

If I do the following in sed,

Code:

sed '0,/(1R)-1,2,3,3-tetraamino-2-propen-1-ol/s//0_(1R)-1,2,3,3-tetraamino-2-propen-1-ol/' input > output.txt

I get the necessary results.

If I add variables to the command line,

Code:

current_name="(1R)-1,2,3,3-tetraamino-2-propen-1-ol";
new_name="0_(1R)-1,2,3,3-tetraamino-2-propen-1-ol";
sed -e "0,/$current_name/s//$new_name/" input > output.txt

I still get the necessary results. When, however, I assign current_name and new_name from a bash array and other bash variables,

Code:

current_name="${FIELD[1]}"
new_name='dup_'$name_count'_'$current_name

I do not get the modified output and the file is unchanged. Apparently sed is not able to match the pattern in the file. There are any number of non-standard characters in the data, so I don't know whether that is the issue. The difference I can see is that when I assign new_name="0_(1R)-1,2,3,3-tetraamino-2-propen-1-ol" as a literal, I can quote the string, but when I assign current_name="${FIELD[1]}", I cannot quote or escape special characters such as ( in the string.

It seems I am just missing some combination of single and double quotes, but I haven't been able to get past this.

Suggestions would be appreciated.

LMHmedchem

Last edited by LMHmedchem; 02-28-2018 at 01:56 AM..


02-28-2018

Registered User

15,129, 5,008

Join Date: Jul 2012

Last Activity: 4 May 2020, 4:31 PM EDT

Location: Aachen, Germany

Posts: 15,129

Thanks Given: 735

Thanked 5,008 Times in 4,483 Posts

What's the contents of ${FIELD[1]}? How did you define it?

RudiC


02-28-2018


Quote:

Originally Posted by RudiC

What's the contents of ${FIELD[1]}? How did you define it?

The script begins by reading a file and looking for duplicate values in a specific column. These are retrieved by,

Code:

# set input field separator to newline so each line is stored in an array element
IFS=$'\n'
# use sort and uniq to output duplicate lines into an array
dup_list=( $(cat "$base_file" | sort -k2 | uniq  -f1 -D) )

so the file is sorted on column 2 and uniq ignores the first column.

For the above data the output would be,

Code:

5	(1R)-1,2,3,3-tetraamino-2-propen-1-ol
9	(1R)-1,2,3,3-tetraamino-2-propen-1-ol

Then I iterate over the array to parse the lines and capture individual names,

Code:

for dup_name in "${dup_list[@]}"
do
   # parse on tab
   unset FIELD; IFS=$'\t' read -a FIELD <<< "$dup_name"
   # assign second column to current name
   current_name="${FIELD[1]}"
done

When I echo $current_name I get the correct value but it doesn't work with the sed command I posted.

LMHmedchem


02-28-2018


Your code works for me:

Code:

current_name="${FIELD[1]}"
sed -e "0,/$current_name/s//$new_name/" $base_file 
id    Name
1    methyl-(2-methylpropoxy)-oxoammonium
2    N-amino-N-(methylamino)-2-nitrosoethanamine
3    3-methoxy-3-methyloxazolidin-3-ium
4    1,3-dihydroxypropan-2-yl-methyl-methyleneammonium
5    dup_0_(1R)-1,2,3,3-tetraamino-2-propen-1-ol
6    2-(ethoxyamino)guanidine
7    O-[(2S)-2-aminoazopropyl]hydroxylamine
8    N-$l^{1}-oxidanyl-N-[(2-methylpropan-2-yl)oxy]methanamine
9    (1R)-1,2,3,3-tetraamino-2-propen-1-ol
10    1-amino-1-ethoxyguanidine

One problem might be that your input data have DOS line terminators (<CR> = 0x0D = ^M = \r); did you try without them?

BTW, your approach seems somewhat complicated. Does it do anything else or is its sole purpose to add a counter to the first instance of duplicates?
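
The check for DOS line terminators suggested above can be sketched as follows. This is a minimal example, assuming bash and standard grep and tr; the helper name has_cr and the sample file are hypothetical and not part of the thread.

```shell
#!/bin/bash
# Hypothetical helper (not from the thread): succeeds if the file named
# in $1 contains at least one carriage return character.
has_cr() {
    grep -q $'\r' -- "$1"
}

# Build a throwaway sample with a DOS line ending to show the check.
sample=$(mktemp)
printf '5\t(1R)-1,2,3,3-tetraamino-2-propen-1-ol\r\n' > "$sample"

if has_cr "$sample"; then
    echo "carriage returns found"
    # Write a copy with every carriage return removed.
    tr -d '\r' < "$sample" > "$sample.unix"
fi

rm -f "$sample" "$sample.unix"
```

If has_cr reports nothing for the real input, line endings are probably not the cause and the pattern itself is the next thing to look at.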

Last edited by RudiC; 02-28-2018 at 02:07 PM..

RudiC


02-28-2018

Registered User

5,091, 1,931

Join Date: May 2012

Last Activity: 15 July 2020, 4:46 AM EDT

Location: Simplicity

Posts: 5,091

Thanks Given: 565

Thanked 1,931 Times in 1,668 Posts

Note: sed treats the search pattern as a regular expression, so any RE-special character, or the / delimiter, in the variable's value can make the command fail or match the wrong text.
Is the goal to index all the duplicates?
If so, consider this robust awk solution:

Code:

awk '
  BEGIN { FS=OFS="\t" }
  NR==FNR { if (dup[$2]++==1) dup[$2]++; next }
  dup[$2]>1 { $2=("dup_" --dup[$2]-1 "_" $2) }
  { print }
' input input

With a trick, the dup array both discovers the duplicates and counts the index (backwards, though).
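
If sed is kept, the warning above about RE-special characters suggests escaping the variables before they are interpolated. What follows is only a sketch, assuming GNU sed (the 0,/re/ address is a GNU extension), basic regular expressions, and single-line values; the helper names escape_sed_pattern and escape_sed_replacement are hypothetical and not from the thread.

```shell
#!/bin/bash
# Hypothetical helpers (not from the thread).

# Escape the characters that are special in a basic regular expression
# ( [ \ . * ^ $ ) plus the / delimiter used by the s command.
escape_sed_pattern() {
    printf '%s' "$1" | sed 's|[[\\/.*^$]|\\&|g'
}

# Escape the characters that are special in the replacement part of
# the s command: backslash, & and the / delimiter.
escape_sed_replacement() {
    printf '%s' "$1" | sed 's|[\\/&]|\\&|g'
}

# Demo with one of the names from the thread that contains [ and ].
current_name='2-[2-hydroxyethyl(methyl)amino]ethanol'
new_name="dup_0_$current_name"

pattern=$(escape_sed_pattern "$current_name")
replacement=$(escape_sed_replacement "$new_name")

# Only the first matching line is rewritten; the second stays as it is.
demo_out=$(printf '1\t%s\n2\t%s\n' "$current_name" "$current_name" |
    sed "0,/$pattern/s//$replacement/")
printf '%s\n' "$demo_out"
```

Parentheses and braces are left alone because they are literal in a basic regular expression; with sed -E the set of characters to escape would be different.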

MadeInGermany


02-28-2018


Similar approach:

Code:

 awk '{LINE[NR] = $0; CNT[$2]++} END {for (i=1; i<=NR; i++) {$0 = LINE[i]; if (CNT[$2]-- > 1) $2 = "0_" $2; print}}' OFS="\t" file

RudiC


02-28-2018


Quote:

Originally Posted by RudiC

One problem might be that your input data have DOS line terminators (<CR> = 0x0D = ^M = \r); did you try without them?

These are Unix files, so there shouldn't be an issue with EOL. I am a bit mystified as to why it works from the command line but not from my script.

Quote:

Originally Posted by RudiC

BTW, your approach seems somewhat complicated. Does it do anything else or is its sole purpose to add a counter to the first instance of duplicates?

There are a number of things that need to be done. I need to identify and re-name duplicates in several files. Every name needs to be unique, so I am finding the dups and adding an indexed prefix to each instance. There could be more than one duplicate string.

Quote:

Originally Posted by MadeInGermany

Attention: sed uses RE, so any RE-special character or the / separator will cause a malfunction.

I suspect that something like this may be the issue but I am not sure why RudiC is able to run it.

Quote:

Originally Posted by MadeInGermany

Is the goal to index all the duplicates?

I think your code would work well if I only had one file to change. I need to change the name and then look up the name in several other files and propagate the change so that all files have the revised name.

This is the current script:

Code:

#!/bin/bash

# base file to check for duplicate names
base_file=$1

# set input field separator to newline so each line is stored in an array element
IFS=$'\n'
# use sort and uniq to output duplicate lines into an array
dup_list=( $(cat "$base_file" | sort -k2 | uniq  -f1 -D) )

# count for indexed name
name_count=0
# to identify when we have a new duplicate
current_dup=''

# loop on duplicate names
for dup_name in "${dup_list[@]}"
do

   # use second field for name
   unset FIELD; IFS=$'\t' read -a FIELD <<< "$dup_name"

   # set name value
   current_name="${FIELD[1]}"

   # if no current dup has been set
   if [ "$current_dup" == "" ]; then
      # set base to check for new duplicate
      current_dup=$current_name
      #create new dup name
      # name count is already 0 so no need to increment
      new_name='dup_'$name_count'_'$current_name
   # if the current name matches the current dup, increment counter
   elif [ "$current_dup" == "$current_name" ]; then
      # increment counter
      name_count=$((name_count+1))
      # create name based on incremented counter
      new_name='dup_'$name_count'_'$current_name
   # if there is a new dup series
   elif [ "$current_dup" != "$current_name" ]; then
      # set base to new duplicate
      current_dup=$current_name
      # reset name counter
      name_count=0
      #create new dup name
      new_name='dup_'$name_count'_'$current_name
   fi

   # test print
   echo $new_name
 
   # find first instance of dup name in base file and replace
   sed "0,/$current_name/s//$new_name/" $base_file > 'revised_'$base_file

   # make changes in other files

done

When I run this on the attached file test_base.txt, I get the printed output I expect,

Code:

dup_0_(1R)-1,2,3,3-tetraamino-2-propen-1-ol
dup_1_(1R)-1,2,3,3-tetraamino-2-propen-1-ol
dup_2_(1R)-1,2,3,3-tetraamino-2-propen-1-ol
dup_0_2-[2-hydroxyethyl(methyl)amino]ethanol
dup_1_2-[2-hydroxyethyl(methyl)amino]ethanol
dup_2_2-[2-hydroxyethyl(methyl)amino]ethanol

All of the duplicates are identified and renamed with an indexed prefix. This works very fast and I have each duplicate name in scope in the do loop where I can work on other files.

At this point I am not able to make changes in other files, which is annoying. I am sure that the logic above is overly complex.

Could the problem also be how I am reading the data into the array?

LMHmedchem

test_base.txt (4.3 KB)

change_names.sh (1.6 KB)
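
Two details of the script above may be worth checking. Each pass of the loop runs sed on the original $base_file and redirects into the same revised_ file, so only the last substitution can survive. The last names processed, such as 2-[2-hydroxyethyl(methyl)amino]ethanol, contain [ and ], which sed reads as a bracket expression rather than literal text. The sketch below sidesteps both by swapping sed for an exact string comparison in awk and by editing one working copy cumulatively. It assumes the name is the second tab-separated field; the function rename_first and the sample data are hypothetical and not from the thread.

```shell
#!/bin/bash
# Hypothetical helper (not from the thread): in file $1, rename the first
# line whose second tab-separated field is exactly $2, giving it the value $3.
rename_first() {
    local file=$1 tmp
    tmp=$(mktemp) || return 1
    # The names are passed through the environment so that awk does not
    # interpret backslashes or regex characters in them.
    old=$2 new=$3 awk '
        BEGIN { FS = OFS = "\t" }
        # Appending "" forces a string comparison, never a numeric one.
        !done && ($2 "") == ENVIRON["old"] { $2 = ENVIRON["new"]; done = 1 }
        { print }
    ' "$file" > "$tmp" && mv "$tmp" "$file"
}

# Demo: make one working copy, then apply every rename to that copy.
work=$(mktemp)
printf '5\tA[x]\n9\tA[x]\n' > "$work"
rename_first "$work" 'A[x]' 'dup_0_A[x]'
rename_first "$work" 'A[x]' 'dup_1_A[x]'
cat "$work"
rm -f "$work"
```

Because the whole field is compared, a line that was already renamed to dup_0_... no longer matches the original name, so the second call moves on to the next duplicate.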
