02-08-2018
The files that I am processing are called sdf, and they contain information about chemical structures, with each structure contained in a record. The first section of the record holds the chemical structure and the second section holds (can hold) other associated information such as the compound name, identification numbers, measured data, etc., in the form of attribute tags where the tag is on one line and the value on the next. Unfortunately, the standard is rather loose: the same information can be located in more than one place, and there is no requirement that every record have the same attribute tags or have them in the same order.
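To make the record layout concrete, here is a minimal sketch in Python (not the shell script discussed in this post) that collects the attribute tags of one record. The sample record, the identifier value ABC-0001, and the function name read_tags are made up for illustration, and the sketch assumes each record ends with a `$$$$` line and each tag has a single-line value.

```python
def read_tags(record_lines):
    """Return {tag: value} for each '> <Tag>' line and the line after it."""
    tags = {}
    # Skip the last line: a tag header there would have no value line.
    for i, line in enumerate(record_lines[:-1]):
        text = line.strip()
        if text.startswith("> <") and text.endswith(">"):
            tags[text[3:-1]] = record_lines[i + 1].strip()
    return tags


# Hypothetical record: the structure block is reduced to a stub.
sample = [
    "InChi=1S/CH4/h1H4",      # first line of the record (a name slot)
    "  illustrative header",
    "",
    "M  END",                 # end of the structure section
    "> <Identifier>",
    "ABC-0001",
    "",
    "> <CompoundName>",
    "InChi=1S/CH4/h1H4",
    "",
    "$$$$",                   # assumed record terminator
]

tags = read_tags(sample)
print(tags["Identifier"])     # ABC-0001
```

Because the tags are looked up by name, the order in which they appear in a record does not matter.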
Software that reads this type of file is also all over the place as far as where any individual application will look for specific information or what limitations it will have. Since these applications cannot be modified (by me), it is often necessary to modify the input and, as expected, I tend to come to Linux when that happens.
In this case, there is an issue with the chemical name. IUPAC, which creates the nomenclature for chemical names, has not yet come around to understanding that chemical names should be formatted such that they can have a linear notation in standard ASCII or similar. There are many chemical names that cannot be copied and pasted into a computer file name or flat text file. When you get such a name, you need to do something else for that compound. With the data I am working with, someone substituted a different value called the InChi (International Chemical Identifier). This is a computer-compatible string but is unfortunately still not compatible with some applications (it's too long).
Years of working with such data have taught me to avoid names that begin with a number, have special characters, or are longer than 300 characters, but not everyone has come to those same conclusions. I am working with files that are 50+ MB and have thousands of records. There are generally between 50 and 75 records that need to be changed. That's too many to do by hand.
The specific case I am looking for is when the InChi value was used for the chemical name. This is identified by the line following > <CompoundName> containing the string InChi=. Where this is not the case, nothing needs to be done to the record. Where that is the case, I need to create a substitute name from something reasonable. I am using the Identifier value, which is to be found on the line following > <Identifier>.
In short, if the line following > <CompoundName> contains InChi=, I save the value on the line following > <Identifier> and use it to create a new name. That name is written to both the first line of the record (one place where apps look for the name) and to the line following > <CompoundName>. The version on the first line is a bit different but that isn't very important.
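The rule above can be sketched as follows, in Python rather than shell. This is only an illustration under assumptions: the helper names value_index and fix_record, the "cmpd_" prefix, and the sample identifier are hypothetical, and the same name is written to both places because the post does not say how the first-line version differs.

```python
def value_index(lines, tag):
    """Index of the line following '> <tag>', or None if the tag is absent."""
    header = "> <" + tag + ">"
    for i, line in enumerate(lines[:-1]):
        if line.strip() == header:
            return i + 1
    return None


def fix_record(lines, marker="InChi=", prefix="cmpd_"):
    """Return the record with the InChi name replaced, or unchanged.

    marker copies the spelling used in this post; adjust it if your
    files spell the prefix differently (e.g. 'InChI=').
    prefix is a hypothetical choice so the name does not start with a digit.
    """
    name_at = value_index(lines, "CompoundName")
    if name_at is None or marker not in lines[name_at]:
        return lines                      # nothing to do for this record
    id_at = value_index(lines, "Identifier")
    if id_at is None:
        return lines                      # no identifier to build a name from
    new_name = prefix + lines[id_at].strip()
    fixed = list(lines)
    fixed[0] = new_name                   # first line of the record
    fixed[name_at] = new_name             # line following > <CompoundName>
    return fixed


record = [
    "InChi=1S/CH4/h1H4", "  illustrative header", "", "M  END",
    "> <Identifier>", "ABC-0001", "",
    "> <CompoundName>", "InChi=1S/CH4/h1H4", "",
    "$$$$",
]
fixed = fix_record(record)
print(fixed[0])   # cmpd_ABC-0001
```

Records whose name does not contain the marker come back as the same list, so they can be written out untouched.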
My script works, but can take an hour to do a long file. I thought that I could speed things up by storing the output in an array and then dumping it at the end as I think this is more or less what apps like awk do. I couldn't get that working.
The number of records that need to be modified is relatively small but the files are big enough to be difficult to manage. The solution should write records that do not comply with the criteria in an unaltered fashion. I have tried to write a version that knows exactly which records need to be modified and so does not process the rest (just writes to output) but that version isn't working yet. It won't be much of an improvement if I can't store the output and have to write it to a file line by line.
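One way to get the pass-through behaviour described above is to stream the file one record at a time, so that only a single record is ever held in memory and unchanged records are copied as they are. The sketch below is in Python and rests on assumptions: records end with a `$$$$` line, and the names stream_records, might_need_fix, and upper_first_line are hypothetical. The fix function here is a trivial stand-in, not the real renaming logic. Python file objects buffer their writes, so writing record by record does not mean one disk write per line.

```python
import io


def stream_records(src, dst, needs_fix, fix):
    """Copy records from src to dst, rewriting only those flagged by needs_fix.

    Returns the number of records that were passed to fix.
    """
    record = []
    flagged = 0
    for line in src:
        record.append(line)
        if line.rstrip() == "$$$$":       # assumed end-of-record line
            if needs_fix(record):
                record = fix(record)
                flagged += 1
            dst.writelines(record)        # untouched records go out as read
            record = []
    dst.writelines(record)                # trailing lines with no terminator
    return flagged


def might_need_fix(record):
    """Coarse pre-filter: does any line of the record contain the marker?"""
    return any("InChi=" in line for line in record)


def upper_first_line(record):
    """Stand-in for the real fix, only to show where it plugs in."""
    return [record[0].upper()] + record[1:]


src = io.StringIO("InChi=abc\n> <Identifier>\nX1\n\n$$$$\nplain name\n$$$$\n")
dst = io.StringIO()
count = stream_records(src, dst, might_need_fix, upper_first_line)
print(count)   # 1
```

With real files, src and dst would be the objects returned by open() for the input and output paths; the loop itself stays the same.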
LMHmedchem