Adding lines to a large file


 
# 1  
Old 09-20-2014

Hello,

I have a relatively large text file (25,000K) consisting of records of data. For each record, I need to create a new line based on what is already there.

Every record has a block that looks like,
Code:
M  END
>  <ID>
1

>  <SOURCE>
KEGG

>  <SOURCE_ID>
C00002

>  <NAME>
ATP; Adenosine 5'-triphosphate

>  <SMILES>
Nc1ncnc2n(cnc12)[C@@H]1O[C@H](COP(O)(=O)OP(O)(=O)OP(O)(O)=O)[C@@H](O)[C@H]1O

>  <MIMW>
506.995745159

>  <FORMULA>
C10H16N5O13P3

$$$$

The data tag lines (> <ID>, etc.) are the same for each record (or should be); only the data on the line below each tag varies. I need to add a new field called

> <SOURCE_SOURCE_ID>

Its value is the data from > <SOURCE> concatenated with the data from > <SOURCE_ID>, separated by an underscore.

The record above would look like,
Code:
M  END
>  <ID>
1

>  <SOURCE>
KEGG

>  <SOURCE_ID>
C00002

>  <SOURCE_SOURCE_ID>
KEGG_C00002

>  <NAME>
ATP; Adenosine 5'-triphosphate

>  <SMILES>
Nc1ncnc2n(cnc12)[C@@H]1O[C@H](COP(O)(=O)OP(O)(=O)OP(O)(O)=O)[C@@H](O)[C@H]1O

>  <MIMW>
506.995745159

>  <FORMULA>
C10H16N5O13P3

$$$$

This is quite a bit beyond what I normally do with shell scripts, and I'm not sure where to start. I presume this would be some kind of while read line loop that looks for > <SOURCE> and captures the next line, looks for > <SOURCE_ID> and captures the next line, builds the new value, and inserts the new field. All other lines would just be printed unchanged. This seems like manipulating an output stream, which I know how to do in C++, but not in bash.

Suggestions would be greatly appreciated.

LMHmedchem
# 2  
Old 09-20-2014
Code:
awk '!f{f=/>[ \t]+<SOURCE>/}!s{s=/>[ \t]+<SOURCE_ID>/} f && s && !NF {print insert; f=s=""}1' insert="\n>  <SOURCE_SOURCE_ID>\nKEGG_C00002"  file
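
Note that the one-liner above inserts the literal string passed in the insert variable, so every record would receive KEGG_C00002. A minimal line-by-line sketch that instead builds the value from each record's own > <SOURCE> and > <SOURCE_ID> data might look like the following. The file names sample.sdf and sample_out.sdf are placeholders, and the sketch assumes each data value sits on the single line directly below its tag and that a blank line follows it, as in the sample record.

```shell
# Build a one-record sample file (placeholder name) to try the command on.
cat > sample.sdf <<'EOF'
M  END
>  <ID>
1

>  <SOURCE>
KEGG

>  <SOURCE_ID>
C00002

>  <NAME>
ATP; Adenosine 5'-triphosphate

$$$$
EOF

# Read line by line: remember the line below each tag, and once both values
# are known, emit the new field after the blank line that closes <SOURCE_ID>.
awk '
  want == "src" { src = $0; want = "" }             # data line below <SOURCE>
  want == "id"  { id  = $0; want = ""; ready = 1 }  # data line below <SOURCE_ID>
  /^>[ \t]+<SOURCE>/    { want = "src" }
  /^>[ \t]+<SOURCE_ID>/ { want = "id" }
  { print }                                         # every input line passes through
  ready && !NF { print ">  <SOURCE_SOURCE_ID>"; print src "_" id; print ""; ready = 0 }
' sample.sdf > sample_out.sdf
```

Because the whole data line is captured rather than a single field, a source name containing spaces would be kept intact, and no input line is altered or dropped.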

This User Gave Thanks to Akshay Hegde For This Post:
# 3  
Old 09-20-2014
Try this too:

Code:
awk '/<SOURCE>/{a=$NF};/<SOURCE_ID>/{$0=$0"\n\n>  <SOURCE_SOURCE_ID>\n"a"_"$NF}1' ORS="\n\n" RS= file

This User Gave Thanks to pilnet101 For This Post:
# 4  
Old 09-20-2014
Quote:
Originally Posted by pilnet101
Code:
awk '/<SOURCE>/{a=$NF};/<SOURCE_ID>/{$0=$0"\n\n>  <SOURCE_SOURCE_ID>\n"a"_"$NF}1' ORS="\n\n" RS= file

I decided to try this first. I ran this from the command line adding my file name at the end.

Code:
awk '/<SOURCE>/{a=$NF};/<SOURCE_ID>/{$0=$0"\n\n>  <SOURCE_SOURCE_ID>\n"a"_"$NF}1' ORS="\n\n" RS=KHHscaffolds_7108.sdf

I let it run for a while and it doesn't seem to do anything. There is no change to the file KHHscaffolds_7108.sdf and no output to the terminal. Should I be redirecting to a new file or something like that? Is it just taking a long time to run since it is processing the entire file in one pass?

I also tried,
Code:
awk '/<SOURCE>/{a=$NF};/<SOURCE_ID>/{$0=$0"\n\n>  <SOURCE_SOURCE_ID>\n"a"_"$NF}1' ORS="\n\n" RS=file KHHscaffolds_7108.sdf > KHHscaffolds_7108_r3.sdf

and this finishes quickly, but the output file is the same as the input.

LMHmedchem

---------- Post updated at 03:21 PM ---------- Previous update was at 03:12 PM ----------

Alright, after reading a bit about awk's RS variable, I see what RS= means.

This is the correct usage:

Code:
awk '/<SOURCE>/{a=$NF};/<SOURCE_ID>/{$0=$0"\n\n>  <SOURCE_SOURCE_ID>\n"a"_"$NF}1' ORS="\n\n" RS=  KHHscaffolds_7108.sdf > KHHscaffolds_7108_r3.sdf

The empty space after RS= fooled me a bit there. This worked very well. I am always amazed at how fast these things can work, even on a large file. I am sure that would have taken me a few hundred lines in C++ and I doubt it would have run nearly as fast.
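
For anyone else puzzled by the trailing RS=: awk treats any operand of the form var=value as a variable assignment rather than a file name. So RS=KHHscaffolds_7108.sdf set the record separator to that string and left awk with no input file, which is why it sat waiting on standard input. An empty RS= selects paragraph mode, where blank lines separate records and newlines also separate fields. A small sketch on made-up data (demo.txt and demo_out.txt are placeholder names):

```shell
# Two blank-line-separated blocks of made-up data (placeholder file name).
printf 'tag one\nalpha beta\n\ntag two\ngamma\n' > demo.txt

# RS= (empty) turns on paragraph mode: each block is one record, and
# $NF is the last whitespace- or newline-separated word of the block.
awk '{ print NR ": " $NF }' RS= demo.txt > demo_out.txt

cat demo_out.txt
# prints:
# 1: beta
# 2: gamma
```

Two things are worth checking when applying paragraph mode to a real SDF file. First, $NF picks up only the last word, so a source name containing spaces would be truncated. Second, paragraph mode treats a run of several blank lines as a single separator, and ORS="\n\n" adds a blank line after the final record, so it may be worth diffing the output against the input.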

LMHmedchem
# 5  
Old 09-20-2014
Glad it helped :)

Yep, that is why I love awk!