Unique values from a Terabyte File


 
# 1  
Old 10-14-2008

Hi,

I have been dealing with files of only a few gigs until now and was able to get by using the sort utility. But now I have a terabyte file from which I want to filter out the unique values.

I have a server with 8 processors, 16 GB of RAM, and a 5 TB HDD. Is it worthwhile trying to use sort again for this type of problem, or is there a better solution? Any help is much appreciated.
# 2  
Old 10-14-2008
Not really.

Running a plain sort again on a terabyte-sized problem won't scale up properly, and it isn't needed either.

These types of problems, where computational cost grows with the number of records to be processed, can be handled with a map-reduce approach. This would be done by splitting the file into 'n' chunks, processing each chunk independently, and then merging the processed chunks.
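On a single machine, that split / process / merge idea can be sketched with standard tools. This is a minimal sketch assuming GNU coreutils and enough free disk for the intermediate chunks; the file names (`bigfile`, `chunk_`, `uniq_`) are illustrative:

```shell
# Split on line boundaries into ~10M-line chunks.
split -l 10000000 bigfile chunk_

# De-duplicate each chunk independently (this is the "map" step).
for f in chunk_*; do
    sort -u "$f" > "uniq_$f" && rm "$f"
done

# Merge the already-sorted chunks (the "reduce" step): -m merges
# without re-sorting, and -u drops duplicates that span chunks.
sort -m -u uniq_chunk_* > unique_values.txt
```

Each chunk only has to fit the sorter's working set, and the final merge is a single sequential pass over the sorted pieces.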
# 3  
Old 10-14-2008
So, if I have just a single server with 8 processors, would I be able to execute such an algorithm? I am a little new to these things, so I apologize if the question is silly. I was just wondering whether there is an algorithm that splits up the original file and then processes it piece by piece...

And also, what is the main problem encountered if I create a hashmap? I mean, if there are only a few unique values, where would the problem come from in the first place?
# 4  
Old 10-14-2008
If I may ask, what type of file is this? On a single-instance, rather urgent job, I was able to take a plain-text file and use the split command. It bothered me a bit, since the HDD was hit pretty hard, but the job got done. Would your file work with something that primitive?
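For reference, a bare-bones invocation along those lines (the file name is illustrative): `-l` splits on line boundaries, which matters for text data, whereas `-b` splits on byte counts and can cut a line in half.

```shell
# Produces pieces xaa, xab, xac, ... in the current directory,
# each at most 50 million lines long.
split -l 50000000 simulation_output.txt
```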
# 5  
Old 10-14-2008
Oh, this is a text file too, with a bunch of numbers from a network simulation experiment... I was thinking of actually splitting the file and getting the job done, but was just curious whether there are better ways of doing things, like matrixmadhan suggested...

Last edited by Legend986; 10-14-2008 at 09:47 PM..
# 6  
Old 10-25-2008
(I keep forgetting about this, sorry, bad memory!)


You could probably try what I posted in the thread below for your other question.

https://www.unix.com/unix-advanced-ex...un-faster.html

It handles these kinds of huge-dataset problems. Running sort over such a big file in one go would be a real grind; the best approach is to split it and achieve the same result in pieces.
# 7  
Old 10-25-2008
A hashmap, also known as an associative array, is probably best.

You might even try awk, if your version handles large files. Assume your map key is characters 1-10 of the record.
Code:
awk '!arr[substr($0,1,10)]++' myTBfile
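If the whole line is the key rather than characters 1-10, the usual idiom is the one below. Memory use grows with the number of distinct keys, not with file size, so if there really are only a few unique values this stays cheap even on a terabyte input:

```shell
# Prints the first occurrence of each distinct line, preserving
# input order; seen[$0] counts how often each line has appeared.
awk '!seen[$0]++' myTBfile
```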
