Your #1 problem is that your output file is overwritten on each loop cycle, so it will contain only the last sed output.
Your #2 problem is that [ ] are RE-special characters.
(The ( ) are special only in ERE, not in BRE.)
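The first problem can be reproduced in a few lines. This is a minimal sketch, not the original script: the input lines, the names in the loop, and the temporary file names are all made up for illustration.

```shell
#!/bin/sh
# Hypothetical demo: names and file paths are illustrative only.
tmpdir=$(mktemp -d)
printf 'alpha\nbeta\ngamma\n' > "$tmpdir/in.txt"

# Broken: ">" truncates out_bad.txt on every pass and every pass reads
# the untouched input, so only the last sed result survives.
for name in alpha beta; do
    sed "s/$name/dup_$name/" "$tmpdir/in.txt" > "$tmpdir/out_bad.txt"
done

# One possible fix: feed each pass the result of the previous pass.
cp "$tmpdir/in.txt" "$tmpdir/out_good.txt"
for name in alpha beta; do
    sed "s/$name/dup_$name/" "$tmpdir/out_good.txt" > "$tmpdir/step.txt"
    mv "$tmpdir/step.txt" "$tmpdir/out_good.txt"
done

bad=$(cat "$tmpdir/out_bad.txt")
good=$(cat "$tmpdir/out_good.txt")
printf 'overwritten:\n%s\n\nchained:\n%s\n' "$bad" "$good"
rm -rf "$tmpdir"
```

In the overwritten file only the change for the last name is present; the chained version keeps both.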
--
I think my solution does what you intended.
Last edited by MadeInGermany; 02-28-2018 at 05:55 PM..
Reason: added a loop over the arguments
The posted solutions seem to work, but they don't address the need to make the same changes in other files. I now have my script working by changing the input and output names so that I'm not overwriting the changes made earlier in the loop.
This version makes the changes in the first file, base_file, and then in a second file, sdf_file. The second file is more complex because it is a multi-line record file where the string that needs to be changed occurs in two places. I have added an awk call that handles this part.
The awk code stores each record in an array until the end-of-record marker ($$$$) is reached. Along the way, if a line matches the name that needs to be changed, the array element for that line is overwritten with the revised name. When the end of the record is reached, the record is written to the new file. Records are only buffered and checked until the replacement has been found and made. After that, the indicator "check" is set to 0 and all remaining lines are printed unchanged and unchecked.
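The buffering approach described above could look roughly like the sketch below. The original awk call is not shown in the thread, so everything here is an assumption: the two-record input, the variable names old, new, rec, hit and check, and the use of an exact == comparison instead of a regex match.

```shell
#!/bin/sh
# Hypothetical two-record SDF-like input; names are illustrative.
tmpdir=$(mktemp -d)
cat > "$tmpdir/test.sdf" <<'EOF'
ID_foo
atom block
> <NAME>
ID_foo
$$$$
ID_foo
atom block
> <NAME>
ID_foo
$$$$
EOF

# old/new are passed with -v; lines are compared with == (no regex), so
# regex-special characters such as [ ] ( ) in a name need no escaping.
result=$(awk -v old='ID_foo' -v new='ID_dup_0_foo' '
BEGIN { check = 1 }
check == 0 { print; next }           # replacement done: pass lines through
{
    n++
    if ($0 == old) { $0 = new; hit = 1 }
    rec[n] = $0                      # buffer the (possibly revised) line
}
$0 == "$$$$" {                       # end of record: flush the buffer
    for (i = 1; i <= n; i++) print rec[i]
    n = 0
    if (hit) check = 0               # only the first matching record changes
}
END { for (i = 1; i <= n; i++) print rec[i] }   # flush an unterminated record
' "$tmpdir/test.sdf")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Only the first record has its two name lines rewritten; the second record passes through untouched.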
The script can be run on the test files with: ./rename_duplicates.sh test_base.txt test_sdf.txt
This works on the test files I have tried so far and is reasonably fast. I am still concerned about the comment that [ ] are RE-special characters. The test files attached with the script do contain these characters, and they are part of the strings being substituted, so I'm not sure why it is working.
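One way to check whether the brackets matter for a given sed call is to try a name from the test data both unescaped and escaped. This is only a sketch: whether the original script is affected depends on how it builds its pattern, which is not shown here, and the variable names below are made up.

```shell
#!/bin/sh
# Illustrative only: the name is taken from the test data in this thread.
name='2-[2-hydroxyethyl(methyl)amino]ethanol'

# Used as a BRE, "[2-hydroxyethyl(methyl)amino]" is a bracket expression
# that matches ONE character from the set, so the pattern does not
# match the literal name and the line passes through unchanged.
unescaped=$(printf '%s\n' "$name" | sed "s/$name/dup_0_&/")

# Escape the BRE-special characters ] [ \ . * ^ $ and the / delimiter first.
escaped_name=$(printf '%s\n' "$name" | sed 's|[][\/.*^$]|\\&|g')
escaped=$(printf '%s\n' "$name" | sed "s/$escaped_name/dup_0_&/")

printf 'unescaped: %s\nescaped:   %s\n' "$unescaped" "$escaped"
```

The unescaped pattern leaves the name as it was, while the escaped pattern adds the prefix. Comparing strings literally, for example with == or index() in awk, avoids the question altogether.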
I am certainly not married to this code, but I do need a solution that will work on multiple files. Some of the files are larger (50MB-100MB) so the above solution may be slow in some cases.
This solution, written as a single awk program, appears to work with your test data.
Note: I used index() and substr() instead of gsub() to avoid the possible RE character issues that MadeInGermany identified.
Code:
# usage: ./rename_duplicates.sh test_base.txt test_sdf.sdf
# needs an awk that accepts a regular expression in RS (e.g. gawk)
# base file to check for duplicate names
base_file=$1
# sdf with duplicate structures
sdf_file=$2
awk '
BEGIN { FS=OFS="\t" }
# start each revised_ output file empty (outputs are written for files 2 and 3)
FNR==1 { if (file++ > 0) { printf "" > ("revised_" FILENAME) } }
# pass 1 over base_file: count the dups, +1 if a dup was met
file==1 { if (dup[$2]++==1) dup[$2]++; next }
# pass 2 over base_file: if a dup then add a dup_#_ prefix
file==2 && dup[$2]>1 { repcnt[$2]++; $2=("dup_" (repcnt[$2]-1) "_" $2) }
file==2 { print >> ("revised_" FILENAME) }
# sdf_file: replace all duplicate keys in repcnt[] with the dup_ddd string
file==3 {
    for (check in repcnt) {
        pos=index($0, check)
        if (pos) {
            fdup[check]++
            old=$0
            $0=""
            while (pos) {
                $0=$0 substr(old,1,pos-1) "dup_" (fdup[check]-1) "_" check
                old=substr(old, pos + length(check))
                pos=index(old, check)
            }
            # Some efficiency - when all dups of this key are replaced, do not check for it again.
            # Delete only this key: a bare "delete repcnt" would empty the whole
            # array and stop every later replacement.
            if (fdup[check] == repcnt[check]) delete repcnt[check]
            $0=$0 old
        }
    }
    print $0 "\n$$$$" >> ("revised_" FILENAME)
}
' "$base_file" "$base_file" FS="" RS="\n[$]{4}\n" "$sdf_file"
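The index()/substr() technique used in place of gsub() can be isolated as a small function. This is a sketch only: literal_replace is a hypothetical helper and is not part of the posted script, and the sample string is made up.

```shell
#!/bin/sh
# Sketch only: literal_replace is a hypothetical helper, not from the posted script.
replaced=$(printf '%s\n' 'ID_a[b]c and ID_a[b]c' |
awk -v old='a[b]c' -v new='dup_0_a[b]c' '
# Replace every literal occurrence of find in str; no regex is involved.
function literal_replace(str, find, repl,    out, pos) {
    out = ""
    while ((pos = index(str, find)) > 0) {
        out = out substr(str, 1, pos - 1) repl     # text before the hit + replacement
        str = substr(str, pos + length(find))      # continue after the hit
    }
    return out str
}
{ print literal_replace($0, old, new) }
')
printf '%s\n' "$replaced"
```

Because the search continues only in the text after each hit, the loop terminates even when the replacement itself contains the search string, as it does with the dup_N_ prefix.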
Last edited by Chubler_XL; 03-01-2018 at 09:00 PM..
In the copy of the base file, the duplicate names are replaced with the intended indexed unique names:
Code:
dup_0_(1R)-1,2,3,3-tetraamino-2-propen-1-ol # on line 10 of revised_test_base.txt
dup_1_(1R)-1,2,3,3-tetraamino-2-propen-1-ol # on line 18 of revised_test_base.txt
dup_2_(1R)-1,2,3,3-tetraamino-2-propen-1-ol # on line 36 of revised_test_base.txt
dup_0_2-[2-hydroxyethyl(methyl)amino]ethanol # on line 20 of revised_test_base.txt
dup_1_2-[2-hydroxyethyl(methyl)amino]ethanol # on line 46 of revised_test_base.txt
dup_2_2-[2-hydroxyethyl(methyl)amino]ethanol # on line 79 of revised_test_base.txt
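The numbering shown above comes from reading the base file twice: once to count each name and once to prefix the repeated ones. A stripped-down sketch of that idea on a made-up three-line file, with hypothetical array names count and seen rather than the ones in the posted script:

```shell
#!/bin/sh
# Hypothetical tab-separated input: column 2 holds the name.
tmpdir=$(mktemp -d)
printf '1\tfoo\n2\tbar\n3\tfoo\n' > "$tmpdir/base.txt"

renamed=$(awk '
BEGIN { FS = OFS = "\t" }
NR == FNR { count[$2]++; next }                       # pass 1: count each name
count[$2] > 1 { $2 = "dup_" (seen[$2]++) "_" $2 }     # pass 2: index the repeats
{ print }
' "$tmpdir/base.txt" "$tmpdir/base.txt")

printf '%s\n' "$renamed"
rm -rf "$tmpdir"
```

Names that occur once are left alone; each repeated name gets a running index starting at 0.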
In the sdf file, the indexed replacement is only partial.
Code:
ID_dup_0_(1R)-1,2,3,3-tetraamino-2-propen-1-ol # substituted on lines 366 and 390 of revised_test_sdf.txt
ID_dup_1_(1R)-1,2,3,3-tetraamino-2-propen-1-ol # substituted on lines 728 and 752
ID_dup_2_(1R)-1,2,3,3-tetraamino-2-propen-1-ol # substituted on lines 1540 and 1564
ID_dup_0_2-[2-hydroxyethyl(methyl)amino]ethanol # substituted on lines 818 and 842
The names ID_dup_1_2-[2-hydroxyethyl(methyl)amino]ethanol and ID_dup_2_2-[2-hydroxyethyl(methyl)amino]ethanol do not appear anywhere in the file, and the unchanged value ID_2-[2-hydroxyethyl(methyl)amino]ethanol still appears in the two remaining duplicate records, at lines 1993 and 2017 and at lines 3491 and 3515.
I don't see where this is failing, but I also don't fully understand what you did. It is about 100 times faster than my script, which will make a difference with the bigger files.
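A possible cause of the partial replacement, assuming the script that was run contained "delete repcnt" without a subscript: in awk that statement removes every element of the array, not just the key that has been fully replaced. Once the third (1R)-1,2,3,3-tetraamino-2-propen-1-ol record is handled, no keys would be left to check, which is consistent with the reported line numbers (last replacement at lines 1540 and 1564, untouched duplicates from line 1993 onward). The difference can be seen with two illustrative keys:

```shell
#!/bin/sh
# Illustrative keys only; repcnt mirrors the array name used in the posted script.
whole=$(awk 'BEGIN {
    repcnt["A"] = 3; repcnt["B"] = 3
    delete repcnt                 # no subscript: removes every element
    n = 0; for (k in repcnt) n++
    print n
}')

single=$(awk 'BEGIN {
    repcnt["A"] = 3; repcnt["B"] = 3
    delete repcnt["A"]            # subscript: removes only key "A"
    n = 0; for (k in repcnt) n++
    print n
}')

printf 'delete repcnt       leaves %s keys\n' "$whole"
printf 'delete repcnt["A"]  leaves %s keys\n' "$single"
```

Using delete repcnt[check] inside the loop keeps the remaining names available for later records.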