Unique values from a Terabyte File


 
# 1  
Old 10-14-2008

Hi,

I have been dealing with files of only a few gigs until now and was able to get by using the sort utility. But now I have a terabyte file from which I want to filter out the unique values.

I have a server with 8 processors, 16 GB of RAM, and a 5 TB HDD. Is it worthwhile trying to use sort again for this type of problem, or is there a better solution? Any help is much appreciated.
# 2  
Old 10-14-2008
Not really.

Running a plain sort again on a terabyte-sized problem won't scale up properly, and it isn't needed either.

These types of problems, where computational cost grows with the number of records to be processed, can be handled with a map-reduce approach. This would be done by splitting the file into 'n' chunks, processing each chunk independently, and then merging the processed chunks.
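On a single machine, that split / process / merge idea can be sketched with standard tools. This is a minimal sketch assuming GNU coreutils and enough free disk for the intermediate chunks; the file names (`bigfile`, `chunk_`, `uniq_`) are illustrative:

```shell
# Split on line boundaries into ~10M-line chunks.
split -l 10000000 bigfile chunk_

# De-duplicate each chunk independently (this is the "map" step).
for f in chunk_*; do
    sort -u "$f" > "uniq_$f" && rm "$f"
done

# Merge the already-sorted chunks (the "reduce" step): -m merges
# without re-sorting, and -u drops duplicates that span chunks.
sort -m -u uniq_chunk_* > unique_values.txt
```

Each chunk only has to fit the sorter's working set, and the final merge is a single sequential pass over the sorted pieces.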
# 3  
Old 10-14-2008
So, if I have just a single server with 8 processors, would I be able to execute such an algorithm? I am a little new to these things, so I apologize if the question is silly. I was just wondering whether there is an algorithm that splits up the original file and then processes it piece by piece...

And also, what is the main problem encountered if I create a hashmap? I mean, if there are only a few unique values, where would the problem come from in the first place?
# 4  
Old 10-14-2008
If I may ask, what type of file is this? On a single-instance, rather urgent job, I was able to take a plain-text file and use the split command. It bothered me a bit, since the HDD was hit pretty hard, but the job got done. Would your file work with something that primitive?
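For reference, a bare-bones invocation along those lines (the file name is illustrative): `-l` splits on line boundaries, which matters for text data, whereas `-b` splits on byte counts and can cut a line in half.

```shell
# Produces pieces xaa, xab, xac, ... in the current directory,
# each at most 50 million lines long.
split -l 50000000 simulation_output.txt
```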
# 5  
Old 10-14-2008
Oh, this is a text file too, with a bunch of numbers from a network simulation experiment... I was thinking of actually splitting the file and getting the job done, but was just curious whether there are better ways of doing things, like matrixmadhan suggested...

Last edited by Legend986; 10-14-2008 at 09:47 PM..
# 6  
Old 10-25-2008
(I keep forgetting about this, sorry, bad memory!)


You could probably try what I posted in the thread below for your other question.

https://www.unix.com/unix-advanced-ex...un-faster.html

It handles these kinds of huge-dataset problems. Running sort over such a big file in one go would be a real grind; the best approach is to split it and achieve the same result in pieces.
# 7  
Old 10-25-2008
A hashmap, also known as an associative array, is probably best.

You might even try awk, if your version handles large files. Assume your map key is characters 1-10 of the record.
Code:
awk '!arr[substr($0,1,10)]++' myTBfile
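If the whole line is the key rather than characters 1-10, the usual idiom is the one below. Memory use grows with the number of distinct keys, not with file size, so if there really are only a few unique values this stays cheap even on a terabyte input:

```shell
# Prints the first occurrence of each distinct line, preserving
# input order; seen[$0] counts how often each line has appeared.
awk '!seen[$0]++' myTBfile
```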
