Find and modify a huge file


 
# 1  
Old 04-13-2017
Dear Forum,

I have a rather large file with a few million lines looking like this:

Code:
head -n 5 seq.txt
>KF1.8.1
010011001011100010101110000000
>DF1.6.1
0101000010111010101011111100
>XC1.3.7
010110101011101010110000011
>GG5.1.1
0100011010111010101110001101
>HK1.2.2
010000111011101101001110001010
0101011

The file consists of records: a header line with the record name (starting with >), followed by one or more lines of encoded sequence (001010...). I need to append a code to each header, taken from the following file:

Code:
head -n 5 code.txt
>KF1.8.1;code=D0:B;D1:P;D2:E;D3:C;D4:H;D5:S_(1);
>DF1.6.1;code=D0:B;D1:D;D2:F;D3:C;D4:F;D5:S_(1);
>XC1.3.7;code=D0:A;D1:D;D2:E;D3:C;D4:H;D5:H;
>GG5.1.1;code=D0:A;D1:D;D2:E;D3:C;D4:F;D5:H;
>HK1.2.2;code=D0:A;D1:F;D2:F;D3:C;D4:H;D5:K_[23];

The results should look like this:

Code:
head -n 11 res.txt
>KF1.8.1;code=D0:B;D1:P;D2:E;D3:C;D4:H;D5:S_(1);
010011001011100010101110000000
>DF1.6.1;code=D0:B;D1:D;D2:F;D3:C;D4:F;D5:S_(1);
0101000010111010101011111100
>XC1.3.7;code=D0:A;D1:D;D2:E;D3:C;D4:H;D5:H;
010110101011101010110000011
>GG5.1.1;code=D0:A;D1:D;D2:E;D3:C;D4:F;D5:H;
0100011010111010101110001101
>HK1.2.2;code=D0:A;D1:F;D2:F;D3:C;D4:H;D5:K_[23];
010000111011101101001110001010
0101011

The two files (seq.txt, code.txt) are not sorted, but they contain the same number of records.

I could use sed to change one record header at a time:
Code:
sed 's/>KF1.8.1/>KF1.8.1;code=D0:B;D1:P;D2:E;D3:C;D4:H;D5:S_(1);/g' seq.txt

or maybe write the commands into a file and execute it:

Code:
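while IFS= read -r code
do
  # The record name is everything before the first ';'
  record=$(printf '%s\n' "$code" | cut -d';' -f 1)
  # One sed command per record; each one reads all of seq.txt again
  echo "sed 's/$record/$code/g' seq.txt" >> all.txt
done < code.txt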

chmod a+x all.txt
./all.txt

but this might take some time. Does anybody have a faster and maybe more elegant way for me to modify the record headers?

Thanks for all your help!
# 2  
Old 04-13-2017
Yes, editing a huge file once as opposed to editing a huge file n times for n lines would certainly be preferable!

This should work efficiently for anywhere up to millions of sequences listed in code.txt :

Code:
$ awk -F';' 'NR==FNR { A[$1]=$0 ; next } ; /^>/ && ($1 in A) { $1=A[$1] } 1' code.txt seq.txt

>KF1.8.1;code=D0:B;D1:P;D2:E;D3:C;D4:H;D5:S_(1);
010011001011100010101110000000
>DF1.6.1;code=D0:B;D1:D;D2:F;D3:C;D4:F;D5:S_(1);
0101000010111010101011111100
>XC1.3.7;code=D0:A;D1:D;D2:E;D3:C;D4:H;D5:H;
010110101011101010110000011
>GG5.1.1;code=D0:A;D1:D;D2:E;D3:C;D4:F;D5:H;
0100011010111010101110001101
>HK1.2.2;code=D0:A;D1:F;D2:F;D3:C;D4:H;D5:K_[23];
010000111011101101001110001010
0101011

$

It works because awk has associative arrays: you can do ARRAY["something"]="ABCD". NR is the number of lines read so far across all files and FNR is the line number within the current file, so NR==FNR means 'do this only for the first file listed'. The script reads the entire list into an associative array, then makes a single pass through the huge file, substituting the matching header lines and printing everything.
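If you want to try the idea on a small sample before running it on the full data, something like the sketch below should do. It is an illustration only: the sample_* file names are made up so that the real files are not overwritten, and the warning for headers without an entry is an addition that is not part of the one-liner above.

```shell
# Two-record sample, deliberately in a different order in each file.
cat > sample_code.txt <<'EOF'
>DF1.6.1;code=D0:B;D1:D;D2:F;D3:C;D4:F;D5:S_(1);
>KF1.8.1;code=D0:B;D1:P;D2:E;D3:C;D4:H;D5:S_(1);
EOF
cat > sample_seq.txt <<'EOF'
>KF1.8.1
010011001011100010101110000000
>DF1.6.1
0101000010111010101011111100
EOF

# First file (NR==FNR): load the codes into array A, keyed by record name.
# Second file: replace headers that have an entry, warn about those that do not.
awk -F';' '
    NR == FNR { A[$1] = $0; next }
    /^>/ {
        if ($1 in A) $0 = A[$1]
        else print "no code for " $1 > "/dev/stderr"
    }
    1
' sample_code.txt sample_seq.txt > sample_res.txt

cat sample_res.txt
```

The order of the records in the code file does not matter, because each header is looked up by name rather than by position.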
# 3  
Old 04-14-2017
Quote:
Originally Posted by Corona688
It works because awk has associative arrays
Just a side note: many shells (for instance bash and zsh) have associative arrays too. The problem is that the OP did not specify whether he wants to restrict his solution to a particular shell, as the code snippet he wrote would run under several shells.
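For comparison, a rough sketch of the same lookup done with a bash associative array. It assumes bash 4 or newer, the demo_* file names and the shortened codes are invented for the example, and a shell read loop is likely to be much slower than awk on a file with millions of lines.

```shell
# Small sample inputs (hypothetical names, so real data is not overwritten).
cat > demo_code.txt <<'EOF'
>DF1.6.1;code=D0:B;D1:D;
>KF1.8.1;code=D0:B;D1:P;
EOF
cat > demo_seq.txt <<'EOF'
>KF1.8.1
0100110010
>DF1.6.1
0101000010
EOF

# Associative arrays need bash 4+, so run the loops explicitly under bash
# rather than under whatever /bin/sh happens to be.
bash -s > demo_res.txt <<'EOF'
declare -A codes

# Load demo_code.txt: key = text before the first ";", value = whole line.
while IFS= read -r line; do
    codes["${line%%;*}"]=$line
done < demo_code.txt

# Rewrite headers that have an entry; pass every other line through unchanged.
while IFS= read -r line; do
    if [[ $line == '>'* && -n ${codes[$line]+set} ]]; then
        printf '%s\n' "${codes[$line]}"
    else
        printf '%s\n' "$line"
    fi
done < demo_seq.txt
EOF

cat demo_res.txt
```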
# 4  
Old 04-14-2017
Dear Corona,

Thanks for the help and the explanation. I am not sure I understand the solution completely.

A[$1]=$0 means that every line of the first file is stored in the array; because I use -F ";", the line from the first file is split at each ";".
/^>/ && ($1 in A) - is this the if statement, i.e. the line starts with a ">" sign and $1 is somewhere in the array? But why $1? Aren't we reading file two at that point, or does the splitting on ";" apply to file two as well?

It would be great if you could find the time to explain the awk array to me a bit more. I really appreciate your help.



Moderator's Comments:
Mod Comment Please use CODE / ICODE tags as required by forum rules!

Last edited by RudiC; 04-14-2017 at 03:58 AM.. Reason: Added ICODE tags.
# 5  
Old 04-14-2017
Hello GDC,

Could you please go through the following and let me know if it helps?
Code:
awk -F';' '               # Use ";" as the field separator.
# NR==FNR is TRUE only while the first file, code.txt, is being read.
NR==FNR {
    A[$1] = $0            # Store the current line in array A under index $1.
    next                  # Skip the remaining rules and move on to the next line.
}
# Two conditions: the line starts with ">" and its first field is an index of A.
/^>/ && ($1 in A) {
    $1 = A[$1]            # Replace the first field with the value stored in A.
}
1                         # Always TRUE with no action, so the default action (print the line) runs.
' code.txt seq.txt        # Input files: code.txt first, then seq.txt.
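
On the question of why $1 is used for both files, a quick experiment may help. This is only a sketch with one shortened header line, and the variable names are made up for the example: with ";" as the separator, a code.txt line is split into several fields and $1 is just the record name, while a seq.txt header contains no ";" at all, so $1 is the whole line. Both therefore give the same array index.

```shell
code_line='>KF1.8.1;code=D0:B;D1:P;'   # a line as it appears in code.txt (shortened)
seq_line='>KF1.8.1'                    # the matching header in seq.txt

# Extract $1 from each line using the same field separator as the one-liner.
key_from_code=$(printf '%s\n' "$code_line" | awk -F';' '{ print $1 }')
key_from_seq=$(printf '%s\n' "$seq_line" | awk -F';' '{ print $1 }')

printf 'code.txt key: %s\n' "$key_from_code"
printf 'seq.txt key:  %s\n' "$key_from_seq"
```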

Thanks,
R. Singh
# 6  
Old 04-14-2017
Dear R. Singh

yes it does. Thanks for the help!