How to quickly substitute pattern within certain range of a huge file?


 
# 1  
Old 05-15-2018

I have big files (some are >300GB!) in which some patterns need to be substituted, for example, changing runs of multiple spaces into a tab. I used this one-liner:
Code:
sed '1,18s/ \{1,\}/\t/g' infile_big.sam > outfile_big.sam

but it seems very slow: the job is still running after 24 hours! In this example, only the first 18 rows need to be changed, and the rest is left untouched.
Is there a faster way to do the job? I'm using GNU bash, version 4.4.12(1)-release (x86_64-pc-linux-gnu) on Linux 4.9.0-4-amd64 #1 SMP Debian 4.9.65-3+deb9u1 (2017-12-23) x86_64 GNU/Linux.
Thanks a lot!

Last edited by yifangt; 05-15-2018 at 05:42 PM..
# 2  
Old 05-15-2018
I'm afraid ANY approach will have to copy >300GB, even if only 18 lines are to be modified, which will take its time. Possible solutions might be using editors like ed (search these fora), or applying some dirty tricks like
Code:
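sed '1,18s/,\{1,\}/\t/g; 19q' infile_big.sam > outfile_big.sam
# Skip the first 19 lines as measured in the INPUT; the rewritten lines may differ in byte length
dd if=infile_big.sam of=outfile_big.sam skip=$(head -n 19 infile_big.sam | wc -c) iflag=skip_bytes oflag=append conv=notrunc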

Still, don't expect too much ...

Last edited by RudiC; 05-15-2018 at 06:13 PM.. Reason: Eliminated TMP file.
# 3  
Old 05-15-2018
Thanks!
Quote:
ANY approach will have to copy >300GB, even if only 18 lines are to be modified, which will take its time.
--- This is one of the things I wanted to confirm. I tried vim, but opening and closing a >300GB file was painful.
# 4  
Old 05-15-2018
I forgot to mention: increasing dd's block size to several MB will speed up the copy process dramatically, but don't go too high. And, I think the TMP file is not necessary, you can use the output file immediately. So it would read like
Code:
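sed '1,18s/,\{1,\}/\t/g; 19q' infile_big.sam > outfile_big.sam
# Skip the first 19 lines as measured in the INPUT; the rewritten lines may differ in byte length
dd if=infile_big.sam of=outfile_big.sam skip=$(head -n 19 infile_big.sam | wc -c) bs=2M iflag=skip_bytes oflag=append conv=notrunc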
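
To get a feel for the block-size effect before committing to a long copy, a few sizes can be compared on a scratch file. This is a rough sketch, assuming GNU dd; the 16 MiB sample and the three block sizes are arbitrary choices. On a small file that fits in the page cache the numbers will not match a 300GB copy over NFS, so it is best run on the real filesystem with a larger sample.

```shell
#!/bin/sh
# Sketch: compare dd block sizes on a scratch file (sizes are illustrative assumptions).
set -eu
tmp=$(mktemp -d)
sample="$tmp/sample.bin"

# Create a 16 MiB scratch file to copy.
dd if=/dev/zero of="$sample" bs=1M count=16 status=none

for bs in 512 64K 2M; do
    printf 'bs=%s: ' "$bs"
    # dd reports bytes copied, elapsed time and rate on its last stderr line.
    dd if="$sample" of="$tmp/copy_$bs" bs="$bs" 2>&1 | tail -n 1
done
```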

# 5  
Old 05-16-2018
Another thing to consider is the resources you have: memory, disk devices, and contention from other applications. If you run out of memory, you may end up swapping or paging real memory to disk, which is slow to write and, later on, to read back in.

For the disk, is it local or an attached SAN? I fear it might be an NFS or Samba mounted share, which will be slow because another server does the real IO and shovels the data across the network.

If it is not NFS or Samba, whether it is local disk or SAN is still a question. Simple local disks (no RAID controller) require writes to be committed before control returns to the program. You might see a high %SYS time in something like vmstat 3; ignore the first line of its output, which shows statistics since boot.

Local disk also may have IO contention for the physical devices.

Local hardware RAID disks or SAN-provided LUNs (hopefully fibre attached), on the other hand, should give better performance because they usually come with a large cache, so reads are anticipated and writes go to cache memory (and are committed to real disk later), which returns control to the CPU sooner.
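
A quick way to collect some of this information is sketched below. It assumes GNU stat and, optionally, vmstat from procps; the `target` variable and the 1-second interval are illustrative choices (the post above suggests 3 seconds), and the list of filesystem type names in the `case` is not exhaustive.

```shell
#!/bin/sh
# Sketch: report the filesystem type of a directory and sample system activity.
set -eu
target=${1:-.}   # directory holding the big file (assumption: current directory)

# GNU stat prints a human-readable filesystem type with -f -c %T.
fstype=$(stat -f -c %T "$target")
echo "filesystem type of $target: $fstype"

case "$fstype" in
    nfs*|cifs|smb*) echo "network share: IO goes through another server" ;;
    *)              echo "not reported as NFS/Samba; could be local disk or SAN" ;;
esac

# Sample three times and drop the first report, which is the average since boot.
if command -v vmstat >/dev/null 2>&1; then
    vmstat 1 3 | tail -n 2
fi
```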


Can you tell us more about the resources you have?




Kind regards,
Robin
# 6  
Old 05-16-2018
In the end, you know Perl has it :)
https://perldoc.perl.org/Tie/File.html

If you give it a shot, be sure to get back with results on 300 GB file.

Regards
Peasant.
# 7  
Old 05-16-2018
Thanks Robin & RudiC!
1) The storage is an NFS-mounted EMC Isilon, but I am not sure about the hardware configuration. My admin told me the network bandwidth is only 1 Gb/s. This is probably one of the reasons.
2) A tiny bug: when I tried sed '1,18s/,\{1,\}/\t/g; 19q' I got an extra line (line 19) in the output; it should be sed '1,18s/,\{1,\}/\t/g; 18q' .
Thanks a lot again!
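
The extra line in point 2 comes from how sed's `q` works: it prints the current line before quitting, so `19q` emits 19 lines and `18q` emits 18. A small check on throwaway input, where `seq` merely stands in for the big file and the variable names are made up:

```shell
#!/bin/sh
# Sketch: count how many lines sed emits before quitting at a given line.
set -eu
lines_19q=$(seq 30 | sed '19q' | wc -l)
lines_18q=$(seq 30 | sed '18q' | wc -l)
echo "19q emits $lines_19q lines, 18q emits $lines_18q lines"
```

Whichever line number is used for `q`, the number of lines skipped when appending the rest of the input has to match it.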