How to get a very big file sorted by contents of another variable list in one pass?


 
# 1  
Old 10-03-2010

I have two lists.

One has 17m (yes, that's million) records, each holding the path and file name of an individual text file.

Each pathname starts with a two-character identifier showing which category it belongs to; the valid identifiers are listed in a second file that serves as a lookup table.

So for example

Code:
Big File

/BB/26607/B7763/BTHS/name1.txt
/DF/78873/YHH97/H764/name76.txt
/BB/8766/OP764/Y7644/name39.txt

The second file (usually about 42 different two-letter identifiers) looks like this:

Code:
BB
JH
DF
ND

What I want to do is not just sort the master file but split it and prioritise it by the order of the second list, ideally in one pass, because processing 17m records takes about 7 minutes per pass on the system I'm using.

I can't think of a simple way to do this, and I've done too much typing today, so could anyone give me some good ideas please?

Thanks
# 2  
Old 10-03-2010
So you want to output pathnames relating to different categories to separate files? Like those containing "BB" to one file, those containing "JH" to another, etc?
# 3  
Old 10-03-2010
Yes please....

Or drop them into an array, with the two-character reference used as the array lookup value, i.e.

${NAMEVARIABLE[$TWOCHARVAR]}

and then I'll echo that out to individual files.

Let me just add....

What happens next is that this list is used to schedule work on the text files, prioritised by the two-character order; the work is broken up and the jobs are spread to other machines that do the actual processing.
# 4  
Old 10-03-2010
Try:
Code:
perl -ne 'if (m{^/(..)}) { $fh{$1} or open($fh{$1}, ">>", $1) or die "$1: $!"; print {$fh{$1}} $_ }' bigfile

# 5  
Old 10-03-2010
Assuming you can have 40+ open files (you do not mention your OS)

Code:
#pathname for your two letter filenames
p=/path/to/new/files

# file1 = list of two-letter categories, file2 = the big list of pathnames
awk -F'/' -v p="$p" '  FILENAME=="file1" { arr[$0] = sprintf("%s/%s", p, $0) }
                    # each path starts with "/", so $1 is empty and the category is $2
                    FILENAME=="file2" { if ($2 in arr) { print $0 > (arr[$2]) }
                                        else           { print $0 } } ' file1 file2

This will move the data into the separate files. You may need to call ulimit to provide more open file descriptors. Also, any row from file2 without a match in file1 will print to the terminal.
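
Before running this against the real data, it may be worth checking the open-file limit, since awk keeps one output file open per category. The following is only a sketch on made-up sample data: the `work` directory and the names `categories.txt`, `bigfile.txt` and `unmatched.txt` are placeholders, not anything from the original post. It uses `NR==FNR` instead of hard-coded file names and sends unmatched rows to a file rather than the terminal.

```shell
#!/bin/bash
# Demo on made-up data; all file names here are assumptions for illustration.
work=$(mktemp -d)
mkdir "$work/out"

# Lookup list: one two-letter category per line, in priority order.
printf '%s\n' BB JH DF ND > "$work/categories.txt"

# Stand-in for the 17m-line list of pathnames.
printf '%s\n' \
    /BB/26607/B7763/BTHS/name1.txt \
    /DF/78873/YHH97/H764/name76.txt \
    /BB/8766/OP764/Y7644/name39.txt \
    /ZZ/1/2/3/name0.txt > "$work/bigfile.txt"

# awk holds one open output file per category, so compare the
# category count with the per-process descriptor limit first.
ncat=$(( $(wc -l < "$work/categories.txt") ))
limit=$(ulimit -n)
if [ "$limit" != "unlimited" ] && [ "$ncat" -ge "$limit" ]; then
    echo "open-file limit $limit is too low for $ncat categories" >&2
fi

# NR==FNR is true only while reading the first file (the lookup list).
awk -F'/' -v dir="$work/out" -v rest="$work/unmatched.txt" '
    NR == FNR   { out[$0] = dir "/" $0; next }   # remember each category
    ($2 in out) { print > (out[$2]); next }      # sift known categories
                { print > rest }                 # keep everything else
' "$work/categories.txt" "$work/bigfile.txt"

# List the category files that were created.
ls "$work/out"
```

If the limit turns out to be too low, `ulimit -n` followed by a larger number raises it for the current shell, up to the hard limit.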
# 6  
Old 10-03-2010
Bartus..

That's very kind and helpful. I should have said I only want to use bash, mainly because I don't have the time to learn Perl at the moment, and I make it a rule never to put code in a script that I might need a reference book for later.

If you have any ideas for doing it in bash I'd be grateful (sed, awk, grep all allowed).

Regards

---------- Post updated at 07:20 PM ---------- Previous update was at 07:15 PM ----------

Thanks Jim

I'm using Fedora for testing and Red Hat in production.

Let me see if I can get what you've suggested working; I'm open to as many suggestions as possible, though.

Regards
# 7  
Old 10-03-2010
Is this a sort or a sift?
Do you need to sort the content of the 42 files? If so, what is the sort key?

Assuming no sort, sifting the main file into individual files with names based on characters 2-3 of each record is relatively easy in a modern shell, but it will not be quick with 17 million records.

For example:

Code:
while IFS= read -r record
do
     # characters 2-3 of the record are the category
     category=${record:1:2}
     printf '%s\n' "${record}" >> "filename_${category}"
done < filename

You can then do whatever processing you need to do in priority order. For example:

Code:
while IFS= read -r category
do
     ls -ald "filename_${category}"
done < reference_file
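
Putting the two steps together, one possible way to get a single prioritised list is to sift the big file once and then concatenate the per-category files in the order given by the reference list. The following is a minimal sketch on made-up sample data; `work`, `bigfile` and `prioritised.txt` are placeholder names, and awk is used for the sift only because it tends to be much faster than a shell `while read` loop.

```shell
#!/bin/bash
# Demo on made-up data; every file name below is a placeholder.
work=$(mktemp -d)

# Reference list: two-letter categories in priority order.
printf '%s\n' BB JH DF ND > "$work/reference_file"

# Stand-in for the big list, deliberately out of priority order.
printf '%s\n' \
    /DF/78873/YHH97/H764/name76.txt \
    /BB/26607/B7763/BTHS/name1.txt \
    /ND/4410/K8821/P001/name5.txt \
    /BB/8766/OP764/Y7644/name39.txt > "$work/bigfile"

# Single pass over the big file: sift each record into
# filename_XX, where XX is characters 2-3 of the record.
awk -v prefix="$work/filename_" '
    { print > (prefix substr($0, 2, 2)) }
' "$work/bigfile"

# This loop reads only the small reference list: append each
# category file that exists, in priority order.
: > "$work/prioritised.txt"
while IFS= read -r category
do
    if [ -f "$work/filename_${category}" ]; then
        cat "$work/filename_${category}" >> "$work/prioritised.txt"
    fi
done < "$work/reference_file"

cat "$work/prioritised.txt"
```

Records keep their original relative order within each category, and categories with no records (JH in this sample) are simply skipped.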
