a problem with large files


 
# 8  
Old 07-10-2010
Or, since 2.3 million lines will have to be deleted, this may be necessary:
Code:
awk 'NR==1{getline x<f}NR==x{print;getline x<f}' f=file file1 > file2

This assumes that "file" is sorted numerically; otherwise, run sort -n on it first.
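For readers who (like the O/P later in this thread) want the one-liner spelled out, here is the same program reformatted with comments; the behaviour is unchanged:
Code:
# x holds the next wanted line number, read from "file" (passed in as f).
# NR is the current line number of file1; whenever they match, print the
# line and fetch the next wanted number.
awk 'NR==1 { getline x < f }          # prime x with the first line number
     NR==x { print; getline x < f }   # print the match, advance to the next
' f=file file1 > file2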
# 9  
Old 07-11-2010
@vidyahar85
Quote:
because cat file will be really slow if file has 3 million records please avoid that.
This statement is ridiculous and has no basis in fact whatsoever.


Back on topic.

My impression is that the O/P is reading "file" and searching through "file1" once for every line in "file" to produce the output in "file2".
It would appear that "file" contains 700,000 line numbers and that "file1" contains 3,000,000 records.
Therefore the number of reads is:
700,000 times 3,000,000 = 2,100,000,000,000
We are clearly on a powerful computer or it would have got nowhere in two days.
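If that impression is correct, the script is presumably something along these lines (a guess on my part — the actual command was never posted):
Code:
# Hypothetical reconstruction of the O/P's approach: one full sed pass
# over the 3,000,000-line file1 for each of the 700,000 line numbers.
while read n
do
    sed -n "${n}p" file1 >> file2
done < file
Hence the enormous read count above.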

To my mind the issue is how to do ONE PASS through "file1" and select the record numbers contained in "file".
We need the following facts from the O/P.
1) Is "file" in numerical order? Is each record unique? Are there leading zeros in the record numbers? Is there a delimiter?
2) Does the record layout of "file1" include the record number? If so, where exactly in the record? Is there a delimiter?
3) Is there a Database and database language available which would make this task easier?
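For what it's worth, if "file" simply holds one line number per line, a one-pass version could look like the sketch below (load the wanted numbers into a hash, then test each record of "file1" for membership) — essentially the approach binlib posted, quoted in post #13 below:
Code:
# Pass 1 (FNR==NR is true only while reading "file"): record every wanted
# line number as a key of the array n.
# Pass 2: print each line of file1 whose line number is a key of n.
awk 'FNR==NR { n[$0]; next } FNR in n' file file1 > file2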
# 10  
Old 07-11-2010
Thanks a lot for your replies...
and here are the answers to your questions:
1) Is "file" in numerical order? Is each record unique? NO. Are there leading zeros in the record numbers? Is there a delimiter? NO.

2) Does the record layout of "file1" include the record number? YES. If so, where exactly in the record? Is there a delimiter? They are in one column; you can consider Enter (the newline) the delimiter.
3) Is there a Database and database language available which would make this task easier? No, I'm just trying to reformat it for a specific application.

---------- Post updated at 12:59 AM ---------- Previous update was at 12:58 AM ----------

I will try it and get back to you.
Thanks a lot.

---------- Post updated at 01:01 AM ---------- Previous update was at 12:59 AM ----------

It is just lines,
and the sed is used to print the line numbers saved in a file.
# 11  
Old 07-12-2010
As suggested in post #6, can we see a sample portion of "file" and "file1", making it clear which field is the record number?
Please confirm whether "file" can contain duplicate record numbers. If so, that needs cleaning up first.
# 12  
Old 07-12-2010
Quote:
Originally Posted by Scrutinizer
Or, since 2.3 million lines will have to be deleted, this may be necessary:
Code:
awk 'NR==1{getline x<f}NR==x{print;getline x<f}' f=file file1 > file2

This assumes that "file" is sorted numerically; otherwise, run sort -n on it first.

# 13  
Old 07-13-2010
Quote:
Originally Posted by binlib
Code:
awk 'FNR==NR{n[$0];next}FNR in n'  file file1 > file2

It didn't work... syntax error. Could you please advise?

---------- Post updated at 08:33 PM ---------- Previous update was at 08:32 PM ----------

Quote:
Originally Posted by Scrutinizer
Or, since 2.3 million lines will have to be deleted, this may be necessary:
Code:
awk 'NR==1{getline x<f}NR==x{print;getline x<f}' f=file file1 > file2

This assumes that "file" is sorted numerically; otherwise, run sort -n on it first.
It didn't work either... syntax error. Could you please explain and advise? I really need your help...
# 14  
Old 07-13-2010
Are you on Solaris? If so, use nawk or /usr/xpg4/bin/awk instead of the silly awk that is the default.
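The programs themselves are unchanged; only the interpreter is. For example (standard Solaris paths):
Code:
# The old default /usr/bin/awk on Solaris rejects these programs; invoke
# nawk or the POSIX awk instead:
nawk 'FNR==NR { n[$0]; next } FNR in n' file file1 > file2
# or
/usr/xpg4/bin/awk 'FNR==NR { n[$0]; next } FNR in n' file file1 > file2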
