Optimised way for search & replace a value on one line in a very huge file (File Size is 24 GB).


 
# 1  
Old 09-02-2011

Hi Experts,

I had to edit a particular value in the header line of a very large file (24 GB), so I needed to search for and replace that value. I managed to do it, but it took a long time to complete. Can anyone please tell me how to do this in a more optimised way?

Thanks in advance.
Manish

Steps I followed:

1. head -1 original_file > temp
2. sed -n '2,$p' original_file >> temp
3. mv temp original_file
# 2  
Old 09-02-2011
As far as I know, not in the general case. But if the new first line has the same number of bytes as the old one, or is shorter and can be padded with spaces, then I believe you can do it quickly with low-level programming: open the file for read/write, read some bytes (512, for example) into a buffer, change them, seek back to the start, and write the buffer back.
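A minimal shell sketch of the idea above, assuming the header is plain single-byte text and the new header is no longer than the old one. The file name demo_file and the header contents are made up for illustration. The `1<>` redirection opens the file read/write without truncating it, so printf overwrites only the first bytes:

```shell
# Sketch only: overwrite the first line in place, padding with spaces.
# demo_file and its contents are hypothetical stand-ins for the real file.
demo_file=$(mktemp)
printf 'HEADER 20110901 v1\nrow 1\nrow 2\n' > "$demo_file"

old_header=$(head -n 1 "$demo_file")
new_header='HEADER 20110902'

# ${#var} counts characters, so this assumes a single-byte (ASCII) header.
if [ "${#new_header}" -gt "${#old_header}" ]; then
    echo "New header is longer than the old one; a full rewrite is needed" >&2
    exit 1
fi

# Pad the new header with trailing spaces to the old header's length.
padded=$(printf '%-*s' "${#old_header}" "$new_header")

# 1<> opens for read/write at offset 0 and does NOT truncate the file,
# so only the header bytes are overwritten.
printf '%s' "$padded" 1<> "$demo_file"

cat "$demo_file"
```

Everything after the first newline is left untouched, so the cost does not depend on the size of the file.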
# 3  
Old 09-02-2011
Thanks for your suggestion, Yazu.

I have just got the command below from a friend. It works well and takes 11 minutes to process this huge file.

Can someone please tell me whether it can be optimised further?

Code:
perl -i -pe 's/OLD/New/ if $. == 1' original_file


Last edited by radoulov; 09-03-2011 at 04:05 AM.. Reason: Code tags.
# 4  
Old 09-02-2011
There is no fundamental operation for inserting or deleting data in the middle of a file. If an edit changes the length of the data, everything after the edit has to be rewritten.

A 24 gigabyte file in 11 minutes is about 37 megabytes per second, which is actually a pretty impressive transfer rate! You have probably maxed out your disk or bus speed, so changing the program won't help significantly. It might help to write the output to a different disk than the one you're reading from.
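For reference, the arithmetic behind that figure, as a rough sketch that takes 1 GB as 1024 MB and uses integer division:

```shell
# Throughput of processing a 24 GB file in 11 minutes.
size_mb=$(( 24 * 1024 ))        # 24 GB expressed in MB
elapsed_s=$(( 11 * 60 ))        # 11 minutes in seconds
rate=$(( size_mb / elapsed_s )) # integer division, fraction dropped
echo "${rate} MB/s"
```

Since perl -i reads the whole file and writes a complete new copy, the disk is moving roughly twice that amount of data in total.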

If you could use yazu's suggestion of always keeping the string the same length, so the data afterwards doesn't need to be rewritten, that would let the edit happen in a fraction of a second...
# 5  
Old 09-02-2011

Thanks Corona688...!!

In this case we are searching for and replacing exactly 8 characters, e.g. 20110901 with 20110902. We monitored the server's performance while it ran and it was very good; it didn't swap out memory. Still, it took a long time, right? I think 11 minutes is still too long on Linux. Please correct me if I am wrong.
# 6  
Old 09-02-2011
How quickly did awk/nawk do it with sub/gsub? Just curious.
# 7  
Old 09-02-2011
Quote:
Originally Posted by manishkomar007
Thanks Corona688...!!

In this case we are searching for and replacing exactly 8 characters, e.g. 20110901 with 20110902.
Could you show us the first few lines of the file, and the data you wish replaced? If the data is always the same length and always in the same place, you can use dd to write it in...

---------- Post updated at 11:42 AM ---------- Previous update was at 11:37 AM ----------

An example:
Code:
$ cat textdata
This is line 1
This is line 2
This is the data I want replaced >>11111111<<
This is another line
etc etc until end of file.
$ printf "%s" 22222222 | dd conv=notrunc of=textdata seek=65 bs=1
$ cat textdata
This is line 1
This is line 2
This is the data I want replaced >>22222222<<
This is another line
etc etc until end of file.

The 'bs=1' tells dd to work with a block size of 1 byte, which lets us seek exactly 65 bytes into the file with seek=65. The conv=notrunc is important: it tells dd not to truncate the file but to just overwrite data that's already there.
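If you don't want to count the offset by hand, here is a sketch that looks it up in the header line first. It assumes GNU grep (where -b together with -o reports the byte offset of the match itself) and single-byte text; the file contents and variable names are made up for illustration:

```shell
# Sketch: find an 8-digit date in the header line and overwrite it in place.
file=$(mktemp)                  # hypothetical stand-in for the 24 GB file
printf 'HDR|20110901|feed\nrow 1\nrow 2\n' > "$file"
old=20110901
new=20110902

# Only the first line is scanned, so the rest of the file is never read.
# GNU grep -b -o prints "offset:match"; keep the offset of the first match.
offset=$(head -n 1 "$file" | grep -b -o "$old" | head -n 1 | cut -d: -f1)

if [ -z "$offset" ]; then
    echo "'$old' not found in the header" >&2
    exit 1
fi
if [ "${#new}" -ne "${#old}" ]; then
    echo "Replacement must be the same length as the original" >&2
    exit 1
fi

# Overwrite exactly those bytes; conv=notrunc leaves the rest alone.
printf '%s' "$new" | dd of="$file" conv=notrunc bs=1 seek="$offset" 2>/dev/null

head -n 1 "$file"
```

Because the offset is taken from the first line, it is also the offset within the whole file, which is what seek= expects.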

---------- Post updated at 12:06 PM ---------- Previous update was at 11:42 AM ----------

Another method needing BASH 3.0 or newer:

Code:
#!/bin/bash

exec 5<hugedata
exec 6<>hugedata

# Read lines one at a time from both file descriptors.
# When we find the line we want in FD 5, FD 6 will still be at the
# start of that line, allowing us to overwrite it in place.
# IFS= and -r keep whitespace and backslashes intact so lengths match.
while IFS= read -r -u 5 LINE
do
        # Match strings like >>12345678<< anywhere in the line
        # save it in BASH_REMATCH in three segments:  ...>>, 11111111, <<...
        if [[ $LINE =~ ^(.*\>\>)([0-9]+)(\<\<.*)$ ]]
        then
                NEWLINE="${BASH_REMATCH[1]}22222222${BASH_REMATCH[3]}"

                if [ "${#NEWLINE}" -ne "${#LINE}" ]
                then
                        echo "Error, new line would be different length" >&2
                        exit 1
                fi

                # Overwrite the line with a line of same length
                printf '%s\n' "${NEWLINE}" >&6
                exec 6>&-
                exec 5<&-

                echo "Found and replaced ${BASH_REMATCH[2]} with 22222222" >&2
                exit 0
        else
                IFS= read -r -u 6 SKIPPED  # Keep FD 5 and FD 6 in sync
        fi
done

echo "Warning, didn't find any data to replace" >&2
exit 1

Code:
$ cat hugedata
This is line 1
This is line 2
This is the data I want replaced >>11111111<<
This is another line
etc etc until end of file.
$ ./datarep2.sh
$ cat hugedata
This is line 1
This is line 2
This is the data I want replaced >>22222222<<
This is another line
etc etc until end of file.
$

Both methods are able to edit early lines in the file as long as their length doesn't change, without having to read or write data afterwards at all.


The dd version would be more reliable and portable if you always know where the data to replace is.

---------- Post updated at 12:27 PM ---------- Previous update was at 12:06 PM ----------

Another thing you could do is just keep the header always separate from the huge file. When you need to feed it into something, use sed or awk or whatever to get the modified header, and cat out the rest of the file. (one of the rare useful uses of cat.)
Code:
( sed 's/orig/replacement/' < header ; cat restoffile ) | programusinghugefile
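Getting to that layout needs a one-off split of the existing file, which still copies the body once. A rough sketch with made-up file names and contents:

```shell
# Sketch: split the header off once, then only ever edit the tiny header file.
work=$(mktemp -d)               # hypothetical scratch directory
printf 'HDR|20110901|feed\nrow 1\nrow 2\n' > "$work/hugefile"

# One-off split: this is the only step that touches the whole body.
head -n 1 "$work/hugefile" > "$work/header"
tail -n +2 "$work/hugefile" > "$work/restoffile"

# Later runs: modify the header on the fly and stream the body after it.
# The command substitution stands in for the program consuming the data.
combined=$( { sed 's/20110901/20110902/' < "$work/header" ; cat "$work/restoffile" ; } )
printf '%s\n' "$combined"
```

After the split, changing the date costs only as much as editing a one-line file.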


Last edited by Corona688; 09-02-2011 at 03:22 PM..