The UNIX and Linux Forums  
#1  10-14-2008  Legend986 (Registered User)
Unique values from a Terabyte File

Hi,

Until now I have only dealt with files of a few gigabytes, and I was able to get by with the sort utility. But now I have a terabyte file from which I want to filter out the unique values.

I have a server with 8 processors, 16 GB of RAM and a 5 TB HDD. Is it worthwhile trying sort again for this kind of problem, or is there a better solution? Any help is much appreciated.
#2  10-14-2008  matrixmadhan (Forum Advisor)
Not really.

Running a plain sort again on a terabyte-sized problem won't scale well, and it isn't needed either.

Problems of this type, where the computational cost grows with the number of records to be processed, can be handled with a map-reduce approach. This would probably be done by splitting the file into 'n' chunks, processing each chunk separately, and then combining the results of the processed chunks.
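A rough sketch of that split, process and combine idea using only standard tools might look like the following. The file names, the tiny sample input and the chunk size of 2 lines are assumptions made up for illustration; a real run would use far larger chunks and the actual data file.

```shell
set -eu
export LC_ALL=C    # byte-wise comparison, kept identical across every sort step

workdir=$(mktemp -d)
# Small stand-in for the real terabyte file (hypothetical data).
printf '%s\n' 7 3 7 1 3 9 1 7 > "$workdir/input.txt"

# "Map": cut the input into chunks of N lines (2 here, far more in practice).
split -l 2 "$workdir/input.txt" "$workdir/chunk."

# Sort and de-duplicate each chunk on its own; these runs are independent,
# so they could also be started in parallel on a multi-processor machine.
for f in "$workdir"/chunk.*; do
    sort -u -o "$f.sorted" "$f"
done

# "Reduce": merge the already-sorted chunks, dropping duplicates across chunks.
sort -m -u "$workdir"/chunk.*.sorted > "$workdir/unique.txt"

cat "$workdir/unique.txt"
```

The merge step only has to read each sorted chunk sequentially, so it never needs to hold a whole chunk in memory.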
#3  10-14-2008  Legend986 (Registered User)
So, if I have just a single server with 8 processors, would I be able to execute such an algorithm? I am a little new to these things so I apologize if the question is silly. I was just wondering if there is an algorithm to just split up the original file and then process it bit by bit...

And also, what is the main problem encountered if I create a hashmap? I mean, if there are only a few unique values, where would the problem come from in the first place?
#4  10-14-2008  treesloth (Registered User)
If I may ask, what type of file is this? On a one-off, rather urgent job, I was able to take a plain-text file and use the split command. It bothered me a bit, since the HDD was hit pretty hard, but the job got done. Would your file work with something that primitive?
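For what it's worth, a minimal sketch of using split on a plain-text file could look like this. The file names and the piece size of 4 lines are made up for illustration. Splitting by line count (-l) rather than by bytes (-b) keeps every record whole.

```shell
set -eu
workdir=$(mktemp -d)

# Ten numbered lines stand in for the real data file (hypothetical data).
i=1
while [ "$i" -le 10 ]; do
    echo "$i"
    i=$((i + 1))
done > "$workdir/data.txt"

# -l cuts on line boundaries, so no record is broken across two pieces.
split -l 4 "$workdir/data.txt" "$workdir/piece."

pieces=$(ls "$workdir"/piece.* | wc -l | tr -d ' ')
echo "pieces: $pieces"

# Concatenating the pieces in name order reproduces the original file.
roundtrip=fail
cat "$workdir"/piece.* | cmp - "$workdir/data.txt" && roundtrip=ok
echo "roundtrip: $roundtrip"
```

Note that split writes a second full copy of the data, which is probably where the heavy disk activity comes from.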
#5  10-14-2008  Legend986 (Registered User)
Oh.. this is a text file too with a bunch of numbers from a network simulation experiment... I was thinking of actually splitting the file and getting the job done, but was just curious if there are better ways of doing things like matrixmadhan expressed....

Last edited by Legend986; 10-14-2008 at 08:47 PM..
#6  10-25-2008  matrixmadhan (Forum Advisor)
(I keep forgetting about this; sorry, bad memory.)

You could probably try what I posted in the thread below, in answer to your other question:

Making things run faster

It handles this kind of huge-dataset problem. Running sort over such a big file would be very taxing; it is best to split the file and get the same result that way.
#7  10-25-2008  jim mcnamara (Forum Staff)
A hashmap, or associative array (another name for the same thing), is probably best.

You might even try awk if your version handles large files. Assume your map key is characters 1-10 of the record.
Code:
awk '!arr[substr($0,1,10)]++' myTBfile
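As a small illustration of the same idea, the sketch below uses the whole line as the key instead of characters 1-10. The sample file and the array names seen and n are made up for the example. The array holds one entry per distinct key, so memory use should follow the number of unique values rather than the size of the file.

```shell
set -eu
workdir=$(mktemp -d)
# Hypothetical sample data with repeated values.
printf '%s\n' 10.5 3.2 10.5 7.0 3.2 10.5 > "$workdir/sample.txt"

# Print a line only the first time it is seen; no sorting is involved,
# so the original order of first appearance is kept.
unique=$(awk '!seen[$0]++' "$workdir/sample.txt")
printf '%s\n' "$unique"

# Variation: count how often each value occurs. The order of "for (k in n)"
# is unspecified in awk, so the result is sorted afterwards.
counts=$(awk '{ n[$0]++ } END { for (k in n) print k, n[k] }' "$workdir/sample.txt" | LC_ALL=C sort)
printf '%s\n' "$counts"
```

If nearly every line is unique, the array grows toward the size of the file and this approach will likely run out of memory, in which case splitting and merging is the safer route.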