Copy of array by index value fails

02-08-2018

Hello,

I have a complicated situational find and replace that I wrote in bash because I didn't know how to do everything in awk. The code works but is very slow, as expected.

To create my modified file, I am looping through an array that was populated earlier and making some replacements at stored positions.

Code:

# loop through stored record
for ((j=0; j <= $i ; j++)) ; do
   # for the first line, add the new firstline value
   if [[ $j == "0" ]]; then
      echo $new_firstline >> $output_file
   # when the replace line is found, use the substitute value
   elif [[ $j == "$replace_line" ]]; then
      echo $new_name >> $output_file
   # output all other lines as normal
   else
      echo ${line_array[$j]} >> $output_file
   fi
done

This is obviously going to be very slow because of all the file operations. My intent was to write the new file to an array and then print the array at the end.

Code:

# store the record
while read line 
do
   # store line in array
   line_array[$i]="$line"
   # increment counter
   i=$((i+1))
done < $input_file

...
other code to test some things and make new variables
...

# declare an array to store the output
declare -a modified_file
# loop through stored file
for ((j=0; j <= $i ; j++)) ; do
   # for the first line, add the new firstline value to the output array
   if [[ $j == "0" ]]; then
      modified_file=("${modified_file[@]}" "$new_firstline")
   # when the replace line is found, add the substitute value to the output array
   elif [[ $j == "$replace_line" ]]; then
      modified_file=("${modified_file[@]}" "$new_name")
   # output all other lines as normal
   else
      modified_file=("${modified_file[@]}" "${line_array[$j]}")
   fi
done

# print the modified file array to the output file
echo ${modified_file[@]} >> $output_file

This seems like it should be right but all I get when I print modified_file[@] is a series of integers, like it is printing the array index.

What am I doing wrong here? Let me know if I didn't provide enough information.

Thanks,

LMHmedchem
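
A side note on the first loop in this post: every `>>` reopens the output file, so one cheap change is to redirect the whole loop once instead of each echo. A minimal sketch, assuming bash; the sample array contents, the values of new_firstline and new_name, and the mktemp scratch file are invented for illustration and are not taken from the original script:

```shell
#!/usr/bin/env bash
# Hypothetical sample data standing in for one stored record.
line_array=("old first line" "  keep   me" "old name" "last line")
i=3                       # index of the last stored line
replace_line=2            # index of the line to replace
new_firstline="015_PubChem_CID_1"
new_name="PubChem_CID_1"
output_file=$(mktemp)     # assumption: a scratch file for the demo

for ((j = 0; j <= i; j++)); do
   if ((j == 0)); then
      printf '%s\n' "$new_firstline"
   elif ((j == replace_line)); then
      printf '%s\n' "$new_name"
   else
      # quoting keeps the original spacing of the line
      printf '%s\n' "${line_array[j]}"
   fi
done >> "$output_file"    # the file is opened once for the whole loop

cat "$output_file"
```

The loop body is unchanged apart from quoting; only the placement of the redirection differs.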

02-08-2018

The command at the end of your script:

Code:

echo ${modified_file[@]} >> $output_file

will produce a single line of output in your output file with all sequences of 1 or more adjacent <space>s, <tab>s, and <newline>s replaced by single <space>s.

But if your input file is not a list of numbers, I don't see anything in what you have shown us that would convert text to numbers. However, there is obviously a lot of code that you haven't shown us, and we can't guess at what transformations might be taking place there.

If you would give us more details (like a couple of sample input files and the output files you are trying to produce from them), this does look like something that would be easy to do in ed, sed, or awk.

What operating system are you using?

Don Cragun
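
The collapsing onto a single line described in this reply is easy to reproduce. A small sketch, assuming bash with the default IFS; the three array elements are made up, and `+=` is shown only as a shorter way to append than rebuilding the array each time:

```shell
#!/usr/bin/env bash
# Made-up array standing in for modified_file.
modified_file=()
modified_file+=("first line")            # += appends one element
modified_file+=("   -1.3288    3.5365")
modified_file+=("M  END")

# Unquoted expansion: word splitting turns the array into one flat line.
joined=$(echo ${modified_file[@]})

# Quoted expansion with printf: one element per line, spacing intact.
per_line=$(printf '%s\n' "${modified_file[@]}")

echo "$joined"
echo "$per_line"
```

This does not explain the integers reported in the first post, but it does show why the quoted printf form is the safer way to dump an array to a file.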

02-08-2018

I am using Cygwin under Windows but also run under openSUSE 13.2.

This is the entire script and it is run with something like,

./_reformat.sh input_file output_file CompoundName Identifier InChI= 015_

It looks for certain conditions and when found, makes some modifications to the record. Most of the files I am processing contain thousands to tens of thousands of records. This is the version that writes each line to the output file as processed, the slow version.

Code:

#!/bin/sh

# file to be processed
input_file=$1
# file to write the output to
output_file=$2
# sdf tag with name field
name_tag=$3
# sdf tag with substitution field
sub_tag=$4
# string to check for on line following name tag line
check_for=$5
# prefix to add to firstline
prefix=$6

# create output file
touch $output_file

# location of line to replace with modified name
replace_line=0
# value collected to build replacement name
sub_value=''
# flag to check next line
check_next=0
# flag to do replacement
replace=0
# flag indicating that the next line should be saved for the sub name
save_next=0

# initialize line counter
i=0

# to preserve spaces
IFS=""

# read file by lines
while read line 
do

   # store line in array
   line_array[$i]="$line"
   # increment counter
   i=$((i+1))

   # if check next was set to 1 above, the next line is the one that needs to be evaluated
   if [[ $check_next == "1" ]]; then
      # reset check next, do this here so we reset even if the next line is not a match
      check_next=0
      # check for check_for as part of line
      if [[ $line =~ .*$check_for.* ]]; then
         # save line number
         replace_line=$i
         # set flag to do replacement of name
         replace=1
      fi
   fi

   # find name tag line and check if value on next line includes check_for string
   # check for name_tag as part of line
   if [[ $line =~ .*$name_tag.* ]]; then
      # set flag to check next line
      check_next=1
   fi

   # save the value in the line after sub tag has been found
   # this must come before save_next is set
   if [[ $save_next == "1" ]]; then
      # save the value from this line to use for substitute name
      sub_value=$line
      # reset flag
      save_next=0
   fi

   # look for the line with the sub tag
   if [[ $line =~ .*$sub_tag.* ]]; then
         # set flag to save next line
         save_next=1
   fi

   # when we get to the end of the record
   if [[ $line == '$$$$' ]]; then

      # if replace has been set, make replacements
      if [[ $replace == "1" ]]; then

         # create new first line value from stored substitute value
         new_firstline=$prefix'PubChem_CID_'$sub_value
         # create new name value from stored substitute value
         new_name='PubChem_CID_'$sub_value

         # decrement replace line value by one
         replace_line=$(($replace_line-1))

         # decrement line counter value by one
         i=$(($i-1))

         # loop through stored file
         for ((j=0; j <= $i ; j++)) ; do
            # for the first line, add the new firstline value
            if [[ $j == "0" ]]; then
               echo $new_firstline >> $output_file
            # when the replace line is found, use the substitute value
            elif [[ $j == "$replace_line" ]]; then
               echo $new_name >> $output_file
            # output all other lines as normal
            else
               echo ${line_array[$j]} >> $output_file
            fi
         done

      # if replace is not set, output unmodified record
      else
         for ((j=0; j < $i ; j++)) ; do
            echo ${line_array[$j]} >> $output_file
         done
      fi

      # reset for next record
      # line array
      unset line_array
      # line counter
      i=0
      # location of line to replace with modified name
      replace_line=0
      # value collected to build replacement name
      sub_value=''
      # flag to check next line
      check_next=0
      # flag to do replacement
      replace=0
      # flag indicating that the next line should be saved for the sub name
      save_next=0

   fi

done < $input_file

This is an example of input with one record that meets the conditions to be changed,

Code:

015_InChI=1S/C16H9N3O5/c20-16-11-5-14-13(23-7-24-14)4-10(11)15-17-12-2-1-9(19(21)22)3-8(12)6-18(15)16/h1-5H,6-7H2
 OpenBabel05051721102D

 24 28  0  0  0  0  0  0  0  0999 V2000
   -1.3288    3.5365    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -1.3006    2.0368    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.4974    1.1324    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -3.9702    1.4167    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -4.9528    0.2833    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -4.4626   -1.1343    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.9897   -1.4185    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.0071   -0.2852    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.5074   -0.2570    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.5170   -1.3527    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    1.9781   -1.0133    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.0026   -2.1090    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.4637   -1.7696    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.9004   -0.3346    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.3615    0.0048    0.0000 N   0  3  0  0  0  0  0  0  0  0  0  0
    6.7981    1.4398    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    7.3859   -1.0909    0.0000 O   0  5  0  0  0  0  0  0  0  0  0  0
    3.8759    0.7611    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.4148    0.4217    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.3904    1.5174    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.0708    1.1780    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -5.6593   -2.0386    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -6.8892   -1.1799    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -6.4525    0.2552    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0  0  0  0
  2  3  1  0  0  0  0
  3  4  2  0  0  0  0
  4  5  1  0  0  0  0
  5  6  2  0  0  0  0
  6  7  1  0  0  0  0
  6 22  1  0  0  0  0
  7  8  2  0  0  0  0
  8  9  1  0  0  0  0
  8  3  1  0  0  0  0
  9 10  2  0  0  0  0
 10 11  1  0  0  0  0
 11 12  2  0  0  0  0
 12 13  1  0  0  0  0
 13 14  2  0  0  0  0
 14 15  1  0  0  0  0
 14 18  1  0  0  0  0
 15 16  2  0  0  0  0
 15 17  1  0  0  0  0
 18 19  2  0  0  0  0
 19 20  1  0  0  0  0
 19 11  1  0  0  0  0
 20 21  1  0  0  0  0
 21  2  1  0  0  0  0
 21  9  1  0  0  0  0
 22 23  1  0  0  0  0
 23 24  1  0  0  0  0
 24  5  1  0  0  0  0
M  CHG  2  15   1  17  -1
M  END
> <order>
281

>  <CompoundName>
InChI=1S/C16H9N3O5/c20-16-11-5-14-13(23-7-24-14)4-10(11)15-17-12-2-1-9(19(21)22)3-8(12)6-18(15)16/h1-5H,6-7H2

>  <Identifier>
101651482

>  <InChI>
InChI=1S/C16H9N3O5/c20-16-11-5-14-13(23-7-24-14)4-10(11)15-17-12-2-1-9(19(21)22)3-8(12)6-18(15)16/h1-5H,6-7H2

>  <InChIKey>
ZIOMULGFTPCQIY-UHFFFAOYSA-N

>  <MolecularFormula>
C16H9N3O5

>  <MonoisotopicMass>
323.0542

>  <SMILES>
C1C2=C(C=CC(=C2)[N+](=O)[O-])N=C3N1C(=O)C4=CC5=C(C=C43)OCO5

$$$$

I was trying to dump the lines of the file to a new array with the code I first posted, but that didn't work.

In short, when the value on the line after <CompoundName> contains InChI=, the name value is too long for some of the tools in the chain. I address this by making a new name from the value read from the line following <Identifier> and re-write the record using the substitution name in the required places. If the line following <CompoundName> does not contain InChI=, then the record is written unmodified.

This is what the properly modified version of the record would look like,

Code:

015_PubChem_CID_101651482
 OpenBabel05051721102D

 24 28  0  0  0  0  0  0  0  0999 V2000
   -1.3288    3.5365    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -1.3006    2.0368    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.4974    1.1324    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -3.9702    1.4167    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -4.9528    0.2833    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -4.4626   -1.1343    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.9897   -1.4185    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.0071   -0.2852    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.5074   -0.2570    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.5170   -1.3527    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    1.9781   -1.0133    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.0026   -2.1090    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.4637   -1.7696    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.9004   -0.3346    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.3615    0.0048    0.0000 N   0  3  0  0  0  0  0  0  0  0  0  0
    6.7981    1.4398    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    7.3859   -1.0909    0.0000 O   0  5  0  0  0  0  0  0  0  0  0  0
    3.8759    0.7611    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.4148    0.4217    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.3904    1.5174    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.0708    1.1780    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -5.6593   -2.0386    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -6.8892   -1.1799    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -6.4525    0.2552    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0  0  0  0
  2  3  1  0  0  0  0
  3  4  2  0  0  0  0
  4  5  1  0  0  0  0
  5  6  2  0  0  0  0
  6  7  1  0  0  0  0
  6 22  1  0  0  0  0
  7  8  2  0  0  0  0
  8  9  1  0  0  0  0
  8  3  1  0  0  0  0
  9 10  2  0  0  0  0
 10 11  1  0  0  0  0
 11 12  2  0  0  0  0
 12 13  1  0  0  0  0
 13 14  2  0  0  0  0
 14 15  1  0  0  0  0
 14 18  1  0  0  0  0
 15 16  2  0  0  0  0
 15 17  1  0  0  0  0
 18 19  2  0  0  0  0
 19 20  1  0  0  0  0
 19 11  1  0  0  0  0
 20 21  1  0  0  0  0
 21  2  1  0  0  0  0
 21  9  1  0  0  0  0
 22 23  1  0  0  0  0
 23 24  1  0  0  0  0
 24  5  1  0  0  0  0
M  CHG  2  15   1  17  -1
M  END
> <order>
281

>  <CompoundName>
PubChem_CID_101651482

>  <Identifier>
101651482

>  <InChI>
InChI=1S/C16H9N3O5/c20-16-11-5-14-13(23-7-24-14)4-10(11)15-17-12-2-1-9(19(21)22)3-8(12)6-18(15)16/h1-5H,6-7H2

>  <InChIKey>
ZIOMULGFTPCQIY-UHFFFAOYSA-N

>  <MolecularFormula>
C16H9N3O5

>  <MonoisotopicMass>
323.0542

>  <SMILES>
C1C2=C(C=CC(=C2)[N+](=O)[O-])N=C3N1C(=O)C4=CC5=C(C=C43)OCO5

$$$$

Sorry for the overly long post. I was trying to solve this myself and thought I just made some syntax error in making a copy of the array.

LMHmedchem
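
The rule described in this post (rename only when the line after the name tag contains the check string, using the value after the substitution tag) can be sketched for a single record held in a bash array. This is an illustration under assumptions, not the poster's script: rename_record is a hypothetical helper, it edits a global array called record, and the sample record is heavily shortened.

```shell
#!/usr/bin/env bash
# rename_record: hypothetical helper; edits the global array "record" in place.
rename_record() {
   local name_tag=$1 sub_tag=$2 check_for=$3 prefix=$4
   local k name_idx=-1 sub_value=""
   for ((k = 0; k < ${#record[@]} - 1; k++)); do
      case ${record[k]} in
         # remember where the name value is, but only if it needs replacing
         *"<$name_tag>"*) [[ ${record[k+1]} == *"$check_for"* ]] && name_idx=$((k + 1)) ;;
         # the value to build the new name from is on the next line
         *"<$sub_tag>"*)  sub_value=${record[k+1]} ;;
      esac
   done
   # Only rewrite when the name needs replacing and an identifier was found.
   if ((name_idx >= 0)) && [[ -n $sub_value ]]; then
      record[0]="${prefix}PubChem_CID_${sub_value}"
      record[name_idx]="PubChem_CID_${sub_value}"
   fi
}

# Made-up, heavily shortened record for illustration.
record=("015_InChI=1S/C16H9N3O5" "M  END" ">  <CompoundName>" "InChI=1S/C16H9N3O5" ""
        ">  <Identifier>" "101651482" "" '$$$$')
rename_record CompoundName Identifier InChI= 015_
printf '%s\n' "${record[@]}"
```

Because the tags are located by scanning the record, this version does not depend on the order of the attribute tags.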

02-08-2018

Now this is a problem spec that is difficult to understand, and I'm not sure I did. Nevertheless, this "proof of concept" worked on a two-record file created from your sample. It requires a recent bash (don't run it with your shebang above!), and it assumes replacements in every record. Should that not be the case, additional logic is required. The cats are NOT "uuoc"s; they are needed to number the lines for later sorting. Be careful with the distinction between <TAB>s and spaces! Give it a try and come back with results...

Code:

prefix=015_
name_tag=CompoundName
sub_tag=Identifier
mapfile < <(cat -n input_file | grep -A1 "^ *1	\|$name_tag\|$sub_tag\|\$\$\$\$" | grep -v "$name_tag\|$sub_tag\|\$\$\$\$\|--\|^ *2	") REP
for ((i=0; i<${#REP[@]}; i+=3))
  do	REP[i+1]="${REP[i+1]%%	*}	PubChem_CID_${REP[i+2]#*	}"
	REP[i]="${REP[i]%%	*}	${prefix}${REP[i+1]#*	}"
	unset REP[i+2]
  done
PAT=${REP[@]%%	*}
PAT="^ *\(${PAT// /\\|}\)	"
sort -k1,1n  <(echo "${REP[@]}") <(cat -n input_file | grep -v "$PAT") | cut -f2-

RudiC
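
The `%%<TAB>*` and `#*<TAB>` expansions in the loop above split each `cat -n` line into its line number and its text. A tiny illustration, assuming bash; the sample line is made up, and `$'\t'` is used here only to make the tab visible:

```shell
#!/usr/bin/env bash
# A made-up line in cat -n format: padded number, a tab, then the text.
numbered=$'    42\tPubChem_CID_101651482'

line_no=${numbered%%$'\t'*}   # drop the first tab and everything after it
text=${numbered#*$'\t'}       # drop everything up to and including the first tab

printf '[%s] [%s]\n' "$line_no" "$text"
```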

02-08-2018

The files that I am processing are called sdf and they contain information about chemical structures with each structure contained in a record. The first section of the record holds the chemical structure and the second section holds (can hold) other associated information such as the compound name, identification numbers, measured data, etc, in the form of attribute tags where the tag is on one line and the value on the next. Unfortunately, the standard is rather loose and the same information can be located in more than one place and there is no requirement that every record have the same attribute tags or have them in the same order.

Software that reads this type of file is also all over the place as far as where any individual application will be looking for specific information or what limitations there will be. Since these applications cannot be modified (by me), it is often necessary to modify the input and, as expected, I tend to come to Linux when that happens.

In this case, there is an issue with the chemical name. IUPAC, which creates the nomenclature for chemical names, has not yet come around to understanding that chemical names should be formatted such that they can have a linear notation in standard ASCII or similar. There are many chemical names that cannot be copied and pasted into a computer file name or flat text file. When you get such a name, you need to do something else for that compound. With the data I am working with, someone substituted a different value called the InChI (International Chemical Identifier). This is a computer-compatible string but is unfortunately still not compatible with some applications (it's too long).

Years of working with such data has taught me to avoid names that begin with a number, have special characters, or are longer than 300 characters, but not everyone has come to those same conclusions. I am working with files that are 50+ MB and have thousands of records. There are generally between 50 and 75 records that need to be changed. That's too many to do by hand.

The specific case I am looking for is when the InChI value was used for the chemical name. This is identified by the line following > <CompoundName> containing the string InChI=. Where this is not the case, nothing needs to be done to the record. Where that is the case, I need to create a substitute name from something reasonable. I am using the Identifier value, which is to be found on the line following > <Identifier>.

In short, if the line following > <CompoundName> contains InChI=, I save the value on the line following > <Identifier> and use it to create a new name. That name is written to both the first line of the record (one place where apps look for the name) and to the line following > <CompoundName>. The version on the first line is a bit different but that isn't very important.

My script works, but can take an hour to do a long file. I thought that I could speed things up by storing the output in an array and then dumping it at the end as I think this is more or less what apps like awk do. I couldn't get that working.

The number of records that need to be modified is relatively small but the files are big enough to be difficult to manage. The solution should write records that do not comply with the criteria in an unaltered fashion. I have tried to write a version that knows exactly which records need to be modified and so does not process the rest (just writes to output) but that version isn't working yet. It won't be much of an improvement if I can't store the output and have to write it to a file line by line.

LMHmedchem
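
For the idea of holding the output in an array and dumping it at the end, a minimal sketch assuming bash 4 or later (needed for mapfile): load the file into an array in one step, change elements by index, and write everything back with a single printf. The scratch files and the three sample lines are made up for the demo:

```shell
#!/usr/bin/env bash
# Assumption: scratch files created only for this demo.
input_file=$(mktemp)
output_file=$(mktemp)
printf '%s\n' 'old name' '   -1.3288    3.5365' '$$$$' > "$input_file"

# Read every line into the array "lines"; -t strips the trailing newlines.
mapfile -t lines < "$input_file"

# Replace a stored line by its index.
lines[0]="new name"

# One write for the whole array, one element per output line.
printf '%s\n' "${lines[@]}" > "$output_file"
cat "$output_file"
```

Unlike a plain `while read line` loop, mapfile keeps leading whitespace and backslashes without any IFS changes.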

02-08-2018

Here is a solution using awk. I use RS (record separator) to load a whole record into $0. This makes stripping out the required values and replacing fields simple, and it avoids making multiple passes over the input.

Code:

# file to be processed
input_file=$1
# file to write the output to
output_file=$2
# sdf tag with name field
name_tag=$3
# sdf tag with substitution field
sub_tag=$4
# string to check for on line following name tag line
check_for=$5
# prefix to add to firstline
prefix=$6

awk '
{
  ID=$0
  sub(".*<" sub_tag ">\n", "", ID)
  sub(/\n.*/, "", ID);

  CN=$0
  sub(".*<" name_tag ">\n", "", CN)
  sub(/\n.*/, "", CN)

  if (CN ~ "^" check_for) {
      new_CN ="PubChem_CID_" ID
      sub(/^[^\n]*\n/, prefix new_CN  "\n")
      sub("<" name_tag ">\n[^\n]*\n", "<" name_tag ">\n" new_CN  "\n")
  }
  print $0 "\n$$$$"
}
' RS="\n[$]{4}\n" name_tag="$name_tag" sub_tag="$sub_tag" check_for="$check_for" prefix="$prefix" "$input_file" > "$output_file"

Chubler_XL
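
To see what the RS setting does in isolation, here is a small sketch with two made-up records. Note that a multi-character, regular-expression RS is an extension found in gawk and mawk; POSIX leaves the behavior unspecified when RS is longer than one character, so this assumes one of those implementations. The separator is spelled `[$][$][$][$]` here to avoid relying on interval expressions:

```shell
#!/usr/bin/env bash
# Two made-up records, each terminated by a $$$$ line as in an sdf file.
summary=$(printf 'name1\natoms1\n$$$$\nname2\natoms2\n$$$$\n' |
   awk -v RS='\n[$][$][$][$]\n' '
      {
         # $0 now holds a whole record; split it back into lines
         n = split($0, lines, "\n")
         print NR ": " lines[1] " (" n " lines)"
      }')
echo "$summary"
```

Each record arrives without its `$$$$` terminator, which is why the solution above prints it back explicitly after every record.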

02-08-2018

Insofar as I have tested this, it works and gives the same output as my script.

Just to illustrate the difference between a working solution and a well formed solution, my script took 16+ minutes to reformat a file,

Code:

time ./_reformat.sh  044_MF_L-anserine.sdf  044_MF_L-anserine_mod.sdf  CompoundName  Identifier InChI=  044_

real    16m49.969s
user    1m34.920s
sys     2m9.546s

The code posted by Chubler_XL processed the same file in just over 1 second.

Code:

time ./_reformat4.sh  044_MF_L-anserine.sdf  044_MF_L-anserine_mod.sdf  CompoundName  Identifier InChI=  044_

real    0m1.219s
user    0m0.920s
sys     0m0.045s

Thanks, this will save many hours of waiting for my code to finish.

LMHmedchem
