Optimised way for search & replace a value on one line in a very huge file (File Size is 24 GB).

09-02-2011

Hi Experts,

I had to edit a particular value in the header line of a very large file (24 GB), so I wanted to search for and replace that value. I managed to do it, but it took a long time to complete. Can anyone please tell me how to do it in an optimised way?

Thanks in advance.
Manish

Steps which I followed:

1. head -1 original_file | sed 's/OLD/NEW/' > temp
2. sed -n '2,$p' original_file >> temp
3. mv temp original_file
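
The steps above can be collapsed into a single pass, shown here as a minimal sketch. The function name `replace_in_header` and the sample data are made up for illustration. It still rewrites the whole file, so it is unlikely to be much faster on a 24 GB file, only simpler.

```shell
#!/bin/sh
# Hypothetical helper: replace OLD with NEW on line 1 only, via a temp file.
# Usage: replace_in_header FILE OLD NEW   (OLD is treated as a sed regex)
replace_in_header() {
    file=$1 old=$2 new=$3
    tmp="$file.tmp.$$"
    # "1s" restricts the substitution to line 1; other lines are copied as-is.
    sed "1s/$old/$new/" "$file" > "$tmp" && mv "$tmp" "$file"
}

# Small demonstration on a throwaway file.
demo=$(mktemp)
printf 'HEADER 20110901\nrow 20110901\n' > "$demo"
replace_in_header "$demo" 20110901 20110902
cat "$demo"
rm -f "$demo"
```

In the demonstration only the header changes; the second line keeps 20110901 even though it contains the same string.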

manishkomar007

09-02-2011

AFAIK, not in the general case. But if the new first line has the same number of bytes, or is shorter and can be padded with spaces, then I believe you can do it quickly with low-level programming: open the file for read/write, read some bytes (512, for example) into a buffer, change them, rewind, and write the buffer back.
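
A rough shell approximation of that idea, offered as a sketch rather than a tested tool: it reads only the first line, pads the edited line back to the original width, and overwrites those bytes with `dd conv=notrunc`. The function name `patch_header` is invented here, and the sketch assumes single-byte (ASCII) text and a reader that tolerates trailing spaces in the header.

```shell
#!/bin/sh
# Hypothetical helper: overwrite the first line of FILE in place, without
# rewriting the rest of the file. OLD is treated as a sed regex.
# Assumes single-byte (ASCII) text and that trailing spaces in the header
# are acceptable to whatever reads the file.
patch_header() {
    file=$1 old=$2 new=$3
    header=$(head -n 1 "$file")
    patched=$(printf '%s\n' "$header" | sed "s/$old/$new/")
    width=${#header}
    if [ "$width" -eq 0 ] || [ "${#patched}" -gt "$width" ]; then
        echo "cannot patch in place: header empty or replacement too long" >&2
        return 1
    fi
    # Pad with spaces to the original width, then overwrite the first
    # $width bytes. conv=notrunc leaves everything after them untouched.
    printf "%-${width}s" "$patched" |
        dd of="$file" conv=notrunc bs=1 2>/dev/null
}

# Small demonstration on a throwaway file.
demo=$(mktemp)
printf 'DATE=20110901\nrow 1\nrow 2\n' > "$demo"
patch_header "$demo" 20110901 20110902
cat "$demo"
rm -f "$demo"
```

Because only the header bytes are written, the run time should not depend on the size of the file. A longer replacement is refused rather than allowed to spill into the second line.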

yazu

09-02-2011

Thanks for your suggestion Yazu.

I have just got the command below from a friend. It works well and takes 11 minutes to process such a huge file.

Could someone please tell me whether it can be optimised further?

Code:

perl -i -pe 's/OLD/New/ if $. == 1' original_file

manishkomar007

09-02-2011

There is no fundamental operation for inserting or deleting data in the middle of a file. You have to rewrite the entire file after the edit.

A 24 gigabyte file in 11 minutes is 37 megabytes per second, which is actually a pretty impressive transfer rate! You've probably maxed out your disk or bus speed now, so changing the program won't help significantly. It might help to write the output to a different disk than the one you're reading from.
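
The arithmetic behind that figure can be checked with a few lines of shell, taking 1 GB as 1024 MB (the variable names are only for illustration):

```shell
#!/bin/sh
# Throughput estimate: 24 GB in 11 minutes, using 1 GB = 1024 MB.
size_mb=$((24 * 1024))       # 24576 MB
seconds=$((11 * 60))         # 660 s
rate=$((size_mb / seconds))  # integer division
echo "$rate MB/s"            # prints: 37 MB/s
```

Since perl -i reads the original and writes a complete new copy, the disk is moving roughly twice that amount of data in total.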

If you could use yazu's suggestion of always keeping the string the same length, so the data afterwards doesn't need to be rewritten, that would let the edit happen in a fraction of a second...

Corona688

09-02-2011

Thanks Corona688...!!

While doing this we are replacing exactly 8 characters, like 20110901 with 20110902. We were monitoring the performance of the server, which was very good; it didn't swap out memory. Still, it took so much time, right? I think 11 minutes is still a lot on Linux. Please correct me if I am wrong.

manishkomar007

09-02-2011

How quickly did awk/nawk do it with sub/gsub? Just curious.
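
For comparison, an awk version might look like the sketch below. No timing for it was reported in the thread, and since it also rewrites the whole file it would probably be limited by the disk in the same way. The function name `awk_replace_header` is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: awk equivalent of the header edit, via a temp file.
# OLD is used as a regex; "&" in NEW would be treated specially by sub().
awk_replace_header() {
    file=$1 old=$2 new=$3
    tmp="$file.tmp.$$"
    # sub() runs only when NR == 1; the trailing "1" prints every line.
    awk -v old="$old" -v new="$new" 'NR == 1 { sub(old, new) } 1' "$file" > "$tmp" &&
        mv "$tmp" "$file"
}

# Small demonstration on a throwaway file.
demo=$(mktemp)
printf 'HEADER 20110901\nrow 20110901\n' > "$demo"
awk_replace_header "$demo" 20110901 20110902
cat "$demo"
rm -f "$demo"
```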

giannicello

09-02-2011

Quote:

Originally Posted by manishkomar007

Thanks Corona688...!!

While doing this we are replacing exactly 8 characters, like 20110901 with 20110902.

Could you show us the first few lines of the file, and the data you wish replaced? If the data is always the same length and always in the same place, you can use dd to write it in...

---------- Post updated at 11:42 AM ---------- Previous update was at 11:37 AM ----------

An example:

Code:

$ cat textdata
This is line 1
This is line 2
This is the data I want replaced >>11111111<<
This is another line
etc etc until end of file.
$ printf "%s" 22222222 | dd conv=notrunc of=textdata seek=65 bs=1
$ cat textdata
This is line 1
This is line 2
This is the data I want replaced >>22222222<<
This is another line
etc etc until end of file.

The 'bs=1' tells dd to use a block size of 1 byte, which lets us seek exactly 65 bytes into the file with seek=65. The conv=notrunc is important: it tells dd not to truncate the file, so it just overwrites data that's already there.
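
If the offset is not known in advance, it can be computed rather than counted by hand. The sketch below uses awk to add up line lengths until a marker is found; the function name `offset_after` is invented for this example, and it assumes single-byte text with "\n" line endings.

```shell
#!/bin/sh
# Hypothetical helper: print the byte offset just past the first
# occurrence of MARKER in FILE. Fails if the marker is not found.
# Assumes single-byte text and "\n" line endings.
offset_after() {
    file=$1 marker=$2
    LC_ALL=C awk -v m="$marker" '
        {
            i = index($0, m)        # 1-based position in this line, 0 if absent
            if (i) { print off + i - 1 + length(m); found = 1; exit }
            off += length($0) + 1   # +1 for the newline
        }
        END { if (!found) exit 1 }
    ' "$file"
}

# Demonstration with the first three lines of the sample data.
demo=$(mktemp)
printf 'This is line 1\nThis is line 2\nThis is the data I want replaced >>11111111<<\n' > "$demo"
off=$(offset_after "$demo" '>>') || exit 1
echo "offset: $off"   # 65 for this sample, matching seek=65
printf '%s' 22222222 | dd conv=notrunc of="$demo" bs=1 seek="$off" 2>/dev/null
sed -n 3p "$demo"
rm -f "$demo"
```

awk stops reading at the first match, so for a marker near the top of a huge file this only touches the first few lines.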

---------- Post updated at 12:06 PM ---------- Previous update was at 11:42 AM ----------

Another method needing BASH 3.0 or newer:

Code:

#!/bin/bash

# Open the same file twice: FD 5 read-only for scanning, FD 6 read/write
# for overwriting. Each descriptor keeps its own file offset.
exec 5<hugedata
exec 6<>hugedata

# Read lines one at a time from both file descriptors.
# When we find the line we want on FD 5, FD 6 is still positioned at the
# start of that line, allowing us to overwrite it.
# IFS= and -r keep leading whitespace and backslashes intact, so the
# length check below compares the real line lengths.
while IFS= read -r -u 5 LINE
do
        # Match strings like >>12345678<< anywhere in the line and
        # save the line in BASH_REMATCH in three segments:  ...>>, 11111111, <<...
        if [[ $LINE =~ ^(.*\>\>)([0-9]+)(\<\<.*)$ ]]
        then
                NEWLINE="${BASH_REMATCH[1]}22222222${BASH_REMATCH[3]}"

                if [ "${#NEWLINE}" -ne "${#LINE}" ]
                then
                        echo "Error, new line would be different length" >&2
                        exit 1
                fi

                # Overwrite the line with a line of the same length
                printf '%s\n' "${NEWLINE}" >&6
                exec 6>&-
                exec 5<&-

                echo "Found and replaced ${BASH_REMATCH[2]} with 22222222" >&2
                exit 0
        else
                # Skip the same line on FD 6 to keep FD 5 and FD 6 in sync
                IFS= read -r -u 6 SKIPPED
        fi
done

echo "Warning, didn't find any data to replace" >&2
exit 1

Code:

$ cat hugedata
This is line 1
This is line 2
This is the data I want replaced >>11111111<<
This is another line
etc etc until end of file.
$ ./datarep2.sh
$ cat hugedata
This is line 1
This is line 2
This is the data I want replaced >>22222222<<
This is another line
etc etc until end of file.
$

Both methods are able to edit early lines in the file as long as their length doesn't change, without having to read or write data afterwards at all.

The dd version would be more reliable and portable if you always know where the data to replace is.
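
Before running either method on the real file, it may be worth trying the edit on a small copy and confirming that only the intended bytes changed. A sketch using `cmp -l`, which prints one line per differing byte (the sample data and file names are made up):

```shell
#!/bin/sh
# Try the edit on a small sample and count the bytes that changed.
work=$(mktemp -d)
printf 'line 1\nvalue >>11111111<<\nline 3\n' > "$work/sample"
cp "$work/sample" "$work/patched"

# The digits start at byte offset 15 in this sample:
# 7 bytes for "line 1\n" plus 8 bytes for "value >>".
printf '%s' 22222222 | dd conv=notrunc of="$work/patched" bs=1 seek=15 2>/dev/null

# The size must be unchanged, and exactly 8 bytes should differ.
size_before=$(($(wc -c < "$work/sample")))
size_after=$(($(wc -c < "$work/patched")))
changed=$(($(cmp -l "$work/sample" "$work/patched" | wc -l)))
echo "size: $size_before -> $size_after, bytes changed: $changed"
rm -rf "$work"
```

If the size differs or more bytes changed than expected, the offset or the replacement length is wrong and the real file should not be touched.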

---------- Post updated at 12:27 PM ---------- Previous update was at 12:06 PM ----------

Another thing you could do is just keep the header always separate from the huge file. When you need to feed it into something, use sed or awk or whatever to get the modified header, and cat out the rest of the file. (one of the rare useful uses of cat.)

Code:

( sed 's/orig/replacement/' < header ; cat restoffile ) | programusinghugefile
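
Splitting an existing file into those two pieces could be done once with head and tail, as in the sketch below. The split itself still reads the whole file one time; after that only the small header file ever needs editing. Here `cat` stands in for the real consumer program, and the sample data is invented.

```shell
#!/bin/sh
# One-time split of a sample file into a header and the rest.
work=$(mktemp -d)
printf 'DATE 20110901\nrow 1\nrow 2\n' > "$work/hugefile"
head -n 1 "$work/hugefile" > "$work/header"
tail -n +2 "$work/hugefile" > "$work/restoffile"

# Later runs: edit only the tiny header, then stream both parts to the
# consumer. "cat" stands in for the real program here.
result=$( ( sed 's/20110901/20110902/' < "$work/header"; cat "$work/restoffile" ) | cat )
printf '%s\n' "$result"
rm -rf "$work"
```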

Corona688
