Replacing in huge text file

05-28-2011

Registered User

6, 0

Join Date: May 2011

Last Activity: 29 May 2011, 10:59 PM EDT

Posts: 6

Thanks Given: 0

Thanked 0 Times in 0 Posts

I have 100 huge text files (~120 MB each), which adds up to roughly 11 GB of data. The files contain nothing but digits: the value of "phi" to 10 billion digits!

I know it's huge! Here are the last few lines of one file:

Code:

0952899155 3233967444 3344925499 0276061529 7261968933 9683989044 3317145063 2771963944 5807139825 5785263278 : 999996
7076665287 1341193004 9994291160 2752806087 3098057018 7993954003 8272886989 6031743863 1213075239 5486559526 : 999997
4770078828 1376659981 9345095495 5822463216 7224348351 6200913437 5085852987 6060405404 9200077203 8324752051 : 999998
4334324783 5519682615 3340745027 7486245638 0533805208 0097461685 3057557984 4986386591 3281896020 9655014075 : 999999
6983266465 0958762067 5922249107 5144125222 8226019880 4186130718 6909500836 2519505480 1837059131 8941970031 : 1000000

Each line consists of 10 groups of 10 digits, followed by the line number. What I want to do is remove the spaces, the trailing line number, and the line breaks. I tried doing that with sed, but I keep messing up. I want the output to look like:

Code:

095289915532339674443344925499027606152972619689339683989044331714506327719639445807139825578526327870766652871341193004999 and so on.......

I'm relatively new to the shell, so I'd appreciate a little explanation so that I can learn too.

Thanks a lot.

---------- Post updated at 08:28 AM ---------- Previous update was at 07:49 AM ----------

OK, after a lot of searching I finally got it:

Code:

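for (( i = 1; i <= 100; i++ ))
do
        # $(...) substitutes the zero-padded number, e.g. 1 -> 001, giving phi-001.txt
        # sed drops " : <line number>" and spaces; tr deletes leftover CRs and newlines
        sed 's/ : [0-9]*\| //g' "phi-$(printf "%.3d" "$i").txt" | tr -d '\r \n' > "$i.txt"
done

The filenames are phi-001.txt, phi-002.txt, ..., phi-100.txt.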

Is there any simpler way to do it?

---------- Post updated at 08:29 AM ---------- Previous update was at 08:28 AM ----------

By simpler I mean more CPU- and resource-efficient.

shantanuthatte
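
On the "simpler" question, one possible variant is sketched below. It is only a sketch, assuming the files are named phi-001.txt to phi-100.txt as stated above and that every line has the "digits : line number" layout shown in the sample. The helper name clean_digits is made up for this illustration, and the output files come out as 001.txt ... 100.txt rather than 1.txt ... 100.txt.

```shell
# Sketch: strip spaces, " : <line number>" suffixes and line breaks from one file.
# clean_digits is a hypothetical helper name, not something from the thread.
clean_digits() {
    # cut keeps only the text before the first ':' on each line;
    # tr then deletes spaces, carriage returns and newlines in one pass.
    cut -d: -f1 "$1" | tr -d ' \r\n'
}

for i in $(seq -w 1 100); do          # seq -w pads to equal width: 001 ... 100
    [ -f "phi-$i.txt" ] || continue   # skip files that are not present
    clean_digits "phi-$i.txt" > "$i.txt"
done
```

Both cut and tr work as simple streaming filters with no regular expressions involved, so this is likely to be lighter on CPU than the sed version, though that would need to be measured on the real files.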

05-29-2011

Moderator

1,484, 567

Join Date: Mar 2011

Last Activity: 28 November 2020, 9:34 AM EST

Posts: 1,484

Thanks Given: 68

Thanked 567 Times in 444 Posts

If I understood the request correctly, can you try using this inside your loop?

Code:

awk -F":" '{ gsub(" ", "", $1); printf "%s", $1 }' "phi-${i}" > "$i.txt"

Peasant

05-29-2011

I tried that, but it's not removing the newlines or the " : xxxxx" data.

Anyway, I removed the spaces and imported the files into a MySQL database.
Now the problem is that querying the database takes a huge amount of time.

So what's the best way to search for a substring in approximately 18 GB of data, and is it possible to create an 18 GB text file?

shantanuthatte

05-29-2011

Hmm, what's wrong with it?
It works on your input:

Code:

$ cat phi
0952899155 3233967444 3344925499 0276061529 7261968933 9683989044 3317145063 2771963944 5807139825 5785263278 : 999996
7076665287 1341193004 9994291160 2752806087 3098057018 7993954003 8272886989 6031743863 1213075239 5486559526 : 999997
4770078828 1376659981 9345095495 5822463216 7224348351 6200913437 5085852987 6060405404 9200077203 8324752051 : 999998
4334324783 5519682615 3340745027 7486245638 0533805208 0097461685 3057557984 4986386591 3281896020 9655014075 : 999999
6983266465 0958762067 5922249107 5144125222 8226019880 4186130718 6909500836 2519505480 1837059131 8941970031 : 1000000
$ awk -F":" ' {gsub(" ","",$1); printf $1 } ' phi
09528991553233967444334492549902760615297261968933968398904433171450632771963944580713982557852632787076665287134119300499942911602752806087309805701879939540038272886989603174386312130752395486559526477007882813766599819345095495582246321672243483516200913437508585298760604054049200077203832475205143343247835519682615334074502774862456380533805208009746168530575579844986386591328189602096550140756983266465095876206759222491075144125222822601988041861307186909500836251950548018370591318941970031$

Peasant

05-29-2011

Yeah, I caught my error. Since I wanted three-digit numbers with leading zeros, I had messed that part up.

It's working fine now. Now my question is: is it possible to create an 18 GB file? I'm using x86_64 GNU/Linux (Ubuntu 10.10). And what would be the best way to search for a substring in this file?

shantanuthatte
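
For the substring search, one option that stays within the shell is grep with byte offsets. The sketch below assumes GNU grep (the -b option is not in POSIX) and a file that holds only digits; find_offsets is a made-up helper name.

```shell
# Sketch: print the byte offset (0-based) of each occurrence of a digit pattern.
# find_offsets is a hypothetical helper name used only for this illustration.
find_offsets() {
    # -F: treat the pattern as a fixed string, not a regular expression
    # -o: print only the matching part; -b: prefix it with its byte offset
    # cut then keeps just the offset that precedes the ':'
    grep -obF -- "$1" "$2" | cut -d: -f1
}
```

Usage would look like `find_offsets 31415 1.txt`. Two caveats: grep -o reports non-overlapping matches only, and grep processes input line by line, so a file that is one multi-gigabyte line may need a great deal of memory.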

05-29-2011

Yes, it's possible.
Check the block size of your disks and compare it with the table.
With a 4k block size, for example, you will be able to create a file of up to ~2 TB.

Regarding substrings, you can use awk's substr function to print substrings.
If you can tell us what you are trying to accomplish, folks here will probably suggest the best way.

Peasant
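
As a small illustration of the substr suggestion, the sketch below prints a run of characters from a given position. The helper names digits_at and bytes_at are made up for this example. The second helper is an alternative that streams the file with tail and head instead of loading a whole line into awk; it assumes a head that supports -c, as GNU and BSD head do.

```shell
# Sketch: print `len` characters starting at 1-based position `start` of each line.
# digits_at is a hypothetical helper name.
digits_at() {
    awk -v start="$1" -v len="$2" '{ print substr($0, start, len) }' "$3"
}

# Alternative sketch: same idea on raw bytes, without holding a whole line in memory.
# tail -c +N starts output at byte N (1-based); head -c M keeps the first M bytes.
bytes_at() {
    tail -c +"$1" "$3" | head -c "$2"
}
```

For example, `digits_at 1000 20 1.txt` would print 20 digits starting at position 1000 of each line in 1.txt.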

05-29-2011

How can I check the block size?
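
One way to check this on a GNU/Linux system is probably stat in filesystem mode. The sketch below assumes GNU coreutils stat (BSD and macOS stat take different options); fs_block_size is a made-up helper name.

```shell
# Sketch: report the block size of the filesystem that holds a given path.
# Assumes GNU coreutils stat. fs_block_size is a hypothetical helper name.
fs_block_size() {
    # -f: query the filesystem rather than the file itself
    # %S: fundamental block size in bytes
    stat -f -c '%S' "${1:-.}"
}

fs_block_size .   # block size of the filesystem holding the current directory
```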

What I'm trying to do is search the 10 billion digits of phi (a.k.a. the golden ratio) for number patterns, as it is said that phi will contain any number sequence provided you look long enough.

Then, once an efficient search function is done that also checks for repeated occurrences, it will be used to derive mathematical statistics about the numbers present, and more beyond that which I haven't thought through yet. Maybe linking it with the available stats, like probability and its relations, etc.

One approach I'm considering is a multi-threaded application, to speed up the process and use less RAM.

shantanuthatte
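
As a starting point for the statistics mentioned above, a digit frequency count can be sketched with tr and wc alone. This is only an illustration, not a tuned solution: it assumes a cleaned file containing nothing but digits and makes ten passes over it, one per digit. digit_counts is a made-up helper name.

```shell
# Sketch: count how often each decimal digit occurs in a cleaned file.
# digit_counts is a hypothetical helper name.
digit_counts() {
    for d in 0 1 2 3 4 5 6 7 8 9; do
        # tr -cd deletes every byte except the digit d; wc -c counts what is left.
        # The final tr strips the padding some wc implementations add.
        printf '%s %s\n' "$d" "$(tr -cd "$d" < "$1" | wc -c | tr -d ' ')"
    done
}
```

Each pass is a plain stream, so memory use stays small regardless of file size. The ten passes could also be run in parallel as separate processes, which may be a simpler route than a multi-threaded program.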
