I have a complicated situational find and replace that I wrote in bash because I didn't know how to do everything in awk. The code works but is very slow, as expected.
To create my modified file, I am looping through an array that was populated earlier and making some replacements at stored positions.
This is obviously going to be very slow because of all the file operations. My intent was to write the new file to an array and then print the array at the end.
This seems like it should be right but all I get when I print modified_file[@] is a series of integers, like it is printing the array index.
What am I doing wrong here? Let me know if I didn't provide enough information.
Thanks,
LMHmedchem
Last edited by LMHmedchem; 02-08-2018 at 01:58 PM..
The command at the end of your script:
will produce a single line of output in your output file with all sequences of 1 or more adjacent <space>s, <tab>s, and <newline>s replaced by single <space>s.
But if your input file is not a list of numbers, I did not see anything in what you have shown us that would convert text to numbers. However, there is obviously a lot of code that you haven't shown us, and we can't guess at what transformations might be taking place there.
If you would give us more details (like a couple of sample input files and the output files you are trying to produce from them), this does look like something that would be easy to do in ed, sed, or awk.
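The command in question is not reproduced above. Assuming it was an unquoted echo of the array expansion (a guess), the following minimal sketch shows how quoting changes what gets written, and which expansion produces the array indices rather than the elements. The array contents here are made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for the poster's array; the real script is not shown.
modified_file=()
modified_file+=("first   line")       # contains a run of three spaces
modified_file+=($'second\tline')      # contains a <tab>

# Unquoted: the shell word-splits every element, so echo joins the pieces
# with single spaces and everything lands on one line.
unquoted=$(echo ${modified_file[@]})

# Quoted, one element per printf argument: one line per element,
# with the internal whitespace preserved.
quoted=$(printf '%s\n' "${modified_file[@]}")

# The "!" form expands to the indices, not the values.
indices=$(echo "${!modified_file[@]}")

printf 'unquoted: %s\n' "$unquoted"
printf 'quoted:\n%s\n' "$quoted"
printf 'indices: %s\n' "$indices"
```

If a run of integers is all that comes out, it may be worth checking whether the indices (the `!` form, or a loop counter) are being stored or printed instead of the line text.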
It looks for certain conditions and when found, makes some modifications to the record. Most of the files I am processing contain thousands to tens of thousands of records. This is the version that writes each line to the output file as processed, the slow version.
This is an example of input with one record that meets the conditions to be changed,
I was trying to dump the lines of the file to a new array with the code I first posted, but that didn't work.
In short, when the value on the line after <CompoundName> contains InChI=, the name value is too long for some of the tools in the chain. I address this by making a new name from the value read from the line following <Identifier> and re-write the record using the substitution name in the required places. If the line following <CompoundName> does not contain InChI=, then the record is written unmodified.
This is what the properly modified version of the record would look like,
Sorry for the overly long post. I was trying to solve this myself and thought I just made some syntax error in making a copy of the array.
Now this is a problem spec that is difficult to understand, and I'm not sure I did. Nevertheless, this "proof of concept" worked on a two-record file created from your sample. It requires a recent bash (don't run it with your above shebang!), and it assumes replacements in every record. Should that not be the case, additional logic is required. The cats are NOT "uuoc"s but are needed for line numbering for later sorting. Be careful with the distinction between <TAB>s and spaces! Give it a try and come back with results...
The files that I am processing are called sdf and they contain information about chemical structures with each structure contained in a record. The first section of the record holds the chemical structure and the second section holds (can hold) other associated information such as the compound name, identification numbers, measured data, etc, in the form of attribute tags where the tag is on one line and the value on the next. Unfortunately, the standard is rather loose and the same information can be located in more than one place and there is no requirement that every record have the same attribute tags or have them in the same order.
Software that reads this type of file is also all over the place as far as where any individual application will be looking for specific information or what limitations there will be. Since these applications cannot be modified (by me), it is often necessary to modify the input and, as expected, I tend to come to linux when that happens.
In this case, there is an issue with the chemical name. IUPAC, which creates the nomenclature for chemical names, has not yet come around to understanding that chemical names should be formatted such that they can have a linear notation in standard ASCII or similar. There are many chemical names that cannot be copied and pasted into a computer file name or flat text file. When you get such a name, you need to do something else for that compound. With the data I am working with, someone substituted a different value called the InChI (International Chemical Identifier). This is a computer-compatible string but is unfortunately still not compatible with some applications (it's too long).
Years of working with such data has taught me to avoid names that begin with a number, have special characters, or are longer than 300 characters, but not everyone has come to those same conclusions. I am working with files that are 50+ MB and have thousands of records. There are generally between 50 and 75 records that need to be changed. That's too many to do by hand.
The specific case I am looking for is when the InChI value was used for the chemical name. This is identified by the line following > <CompoundName> containing the string InChI=. Where this is not the case, nothing needs to be done to the record. Where that is the case, I need to create a substitute name from something reasonable. I am using the Identifier value, which is to be found on the line following > <Identifier>.
In short, if the line following > <CompoundName> contains InChI=, I save the value on the line following > <Identifier> and use it to create a new name. That name is written to both the first line of the record (one place where apps look for the name) and to the line following > <CompoundName>. The version on the first line is a bit different but that isn't very important.
My script works, but can take an hour to do a long file. I thought that I could speed things up by storing the output in an array and then dumping it at the end as I think this is more or less what apps like awk do. I couldn't get that working.
The number of records that need to be modified is relatively small but the files are big enough to be difficult to manage. The solution should write records that do not comply with the criteria in an unaltered fashion. I have tried to write a version that knows exactly which records need to be modified and so does not process the rest (just writes to output) but that version isn't working yet. It won't be much of an improvement if I can't store the output and have to write it to a file line by line.
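For reference, the "store everything in an array, write once" idea described above could be sketched in bash roughly as follows. This is not the script from this thread. The function name `fix_sdf_names` is hypothetical, records are assumed to end with a `$$$$` line, tag lines are assumed to look like `> <Identifier>` and `> <CompoundName>`, and the identifier is written verbatim in both places (the real first-line name is formatted a bit differently). `mapfile` needs bash 4 or later.

```bash
#!/usr/bin/env bash
# Sketch only: fix_sdf_names is a hypothetical name, not the original script.
fix_sdf_names() {
    local infile=$1 outfile=$2
    local -a lines
    mapfile -t lines < "$infile"          # single read: one array element per line

    local i start=0 id='' name_idx=-1
    for i in "${!lines[@]}"; do
        case ${lines[i]} in
            '>'*'<Identifier>'*)
                id=${lines[i+1]} ;;       # value sits on the line after the tag
            '>'*'<CompoundName>'*)
                # remember where the name is, but only if it is an InChI string
                [[ ${lines[i+1]} == *InChI=* ]] && name_idx=$((i+1)) ;;
            '$$$$')                       # end of record: apply the edit, then reset
                if (( name_idx >= 0 )) && [[ -n $id ]]; then
                    lines[start]=$id      # first line of the record
                    lines[name_idx]=$id   # line after > <CompoundName>
                fi
                start=$((i+1)); id=''; name_idx=-1 ;;
        esac
    done

    printf '%s\n' "${lines[@]}" > "$outfile"   # single write at the end
}
```

Called as `fix_sdf_names in.sdf out.sdf`, this touches the disk twice in total instead of once per line. Records whose name does not contain InChI= pass through unchanged.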
Here is a solution using awk. I use RS (record separator) to load a whole record into $0. This makes stripping out required values and replacing fields simple, and avoids making multiple passes over the input.
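The posted awk code is not reproduced here. As a rough sketch of the same general idea, and not the original solution, the block below sets RS to the `$$$$` terminator so each record arrives whole in $0. The wrapper name `fix_sdf_names_awk` is hypothetical, the tag-line patterns and the verbatim use of the identifier as the new name are assumptions, and a multi-character (regex) RS is an extension found in gawk and mawk rather than something POSIX guarantees.

```bash
#!/usr/bin/env bash
# Sketch only: fix_sdf_names_awk is a hypothetical wrapper name.
fix_sdf_names_awk() {
    awk '
    BEGIN {
        RS  = "\\$\\$\\$\\$\n"   # regex RS: a literal $$$$ followed by newline
        ORS = "$$$$\n"           # put the terminator back on output
    }
    NF == 0 { next }             # ignore whitespace-only leftovers after the last record
    {
        n = split($0, line, "\n")
        id = ""; name_idx = 0
        for (i = 1; i < n; i++) {
            if (line[i] ~ /^>.*<Identifier>/) {
                id = line[i + 1]                      # value follows the tag line
            } else if (line[i] ~ /^>.*<CompoundName>/ && index(line[i + 1], "InChI=")) {
                name_idx = i + 1                      # name line that needs replacing
            }
        }
        if (name_idx && id != "") {
            line[1] = id                              # first line of the record
            line[name_idx] = id                       # line after > <CompoundName>
        }
        out = line[1]
        for (i = 2; i <= n; i++) out = out "\n" line[i]
        print out
    }' "$1"
}
```

Usage would be along the lines of `fix_sdf_names_awk in.sdf > out.sdf`. Each record is read, inspected, and written exactly once, so the whole file is handled in a single pass.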
Last edited by Chubler_XL; 02-08-2018 at 10:38 PM..
Reason: Longer variable names for more readability
In so far as I have tested this, it works and gives the same output as my script.
Just to illustrate the difference between a working solution and a well formed solution, my script took 16+ minutes to reformat a file,
The code posted by Chubler_XL processed the same file in just over 1 second.
Thanks, this will save many hours of waiting for my code to finish.
LMHmedchem