How to get a very big file sorted by contents of another variable list in one pass?


 
# 8  
Old 10-03-2010
This should work (it works on your small sample).
bash code:
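#!/bin/bash
# Prefix for the per-key bucket files.
P=UniqueName
# Bucket file names, in the order given by the index file.
LIST=$(sed "s/^/$P/" index | tr '\n' ' ')
# Remove stale buckets from an earlier run; -f ignores missing files.
rm -f $LIST
while IFS= read -r L
do
   # Key is the first path component: "/BB/26607/..." gives "BB".
   K=${L#/}
   K=${K%%/*}
   # Append the unmodified line to the bucket for its key.
   printf '%s\n' "$L" >>"$P$K"
done <bigfile
# Concatenate the buckets in index order (word splitting of $LIST is intended).
cat $LIST >newbigfile
cat newbigfile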
I'd like to test it with a bigger one.
Note: it prints errors for missing files (a name in index with no corresponding line in bigfile).

PS: you could try running it from /dev/shm (the ramdisk) with cd /dev/shm.
I don't know whether it is really faster. Either way, you should specify the full paths for bigfile and newbigfile.
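As a rough sketch of the same bucket idea with full paths and a scratch directory, something like the following might do. The function name sort_by_index, the scratch directory name and the argument order are illustrative assumptions, not part of the script above. It assumes every line looks like /KEY/rest, that keys contain no "/", and that the number of distinct keys stays below what awk can keep open at once.

```shell
# Sketch only: split bigfile into one bucket per key, then concatenate
# the buckets in index order. All names are illustrative assumptions.
sort_by_index() {
    # $1 = index file (one key per line), $2 = big file, $3 = output file.
    local index="$1" bigfile="$2" out="$3" base=/tmp work key
    # Prefer the ramdisk when it exists and is writable.
    if [ -d /dev/shm ] && [ -w /dev/shm ]; then base=/dev/shm; fi
    work=$(mktemp -d "$base/buckets.XXXXXX") || return 1

    # One pass over bigfile: field 2 of "/BB/26607/..." is the key "BB".
    # Lines without a key are skipped.
    awk -F/ -v dir="$work" 'NF > 1 && $2 != "" { print > (dir "/" $2) }' "$bigfile"

    # Emit the buckets in index order; keys that had no lines are skipped
    # quietly instead of producing an error from cat.
    : > "$out"
    while IFS= read -r key; do
        if [ -f "$work/$key" ]; then cat "$work/$key" >> "$out"; fi
    done < "$index"
    rm -rf "$work"
}
```

It would be called with full paths, for example: sort_by_index /path/to/index /path/to/bigfile /path/to/newbigfile. Whether the ramdisk helps depends on how much of bigfile fits in memory.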
# 9  
Old 10-03-2010
@frans
Sorry, but your script does not work. There are too many errors to discuss.
# 10  
Old 10-03-2010
Hi.

You can use a very small awk script that simply switches the output file as the key changes, provided you can first arrange the file in the order of your custom collating sequence.

The msort utility can handle such collating sequences: http://freshmeat.net/projects/msort
Code:
#!/usr/bin/env bash

# @(#) s1	Demonstrate custom collating sequence.
# msort-home http://freshmeat.net/projects/msort

# Uncomment to run script as external user.
# export PATH="/usr/local/bin:/usr/bin:/bin"
# Infrastructure details, environment, commands for forum posts. 
set +o nounset
pe() { for i;do printf "%s" "$i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe ; pe "Environment: LC_ALL = $LC_ALL, LANG = $LANG"
pe "(Versions displayed with local utility \"version\")"
c=$( ps | grep $$ | awk '{print $NF}' )
version >/dev/null 2>&1 && s=$(_eat $0 $1) || s=""
[ "$c" = "$s" ] && p="$s" || p="$c"
version >/dev/null 2>&1 && version "=o" $p specimen msort awk
set -o nounset

FILE1=${1-data1}
shift
FILE2=${1-data2}

# Sample data files with head / tail if specimen fails.
pe
specimen $FILE1 $FILE2 \
|| { pe "(head/tail)"; head -n 5 $FILE1 $FILE2; pe " ||" ;\
     tail -n 5 $FILE1 $FILE2; }

pl " Results, intermediate file:"
msort -q -n 1,1 -u n -l -c lexicographic -d "/" -s $FILE2 -1 $FILE1 |
tee t1 |
awk -F "/" '
BEGIN	{ old = "" }
old == ""	{ old = $2 ; print $0 > old;  next }
$2 != old	{ close(old) ;  old = $2 ; print $0 > old; next }
$2 == old	{ print $0 > old }
'
cat t1

pl " Created files:"
ls -lgG [BDJN]*

pl " Contents of BB:"
cat BB

exit 0

Using your data, this produces:
Code:
% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0 
GNU bash 3.2.39
specimen (local) 1.17
msort - ( /usr/bin/msort Apr 24 2008 )
GNU Awk 3.1.5

Whole: 5:0:5 of 4 lines in file "data1"
/ND/26607/B7763/BTHS/name1.txt
/BB/26607/B7763/BTHS/name1.txt
/DF/78873/YHH97/H764/name76.txt
/BB/8766/OP764/Y7644/name39.txt

Whole: 5:0:5 of 4 lines in file "data2"
BB
JH
DF
ND

-----
 Results, intermediate file:
/BB/26607/B7763/BTHS/name1.txt
/BB/8766/OP764/Y7644/name39.txt
/DF/78873/YHH97/H764/name76.txt
/ND/26607/B7763/BTHS/name1.txt

-----
 Created files:
-rw-r--r-- 1 63 Oct  3 18:47 BB
-rw-r--r-- 1 32 Oct  3 18:47 DF
-rw-r--r-- 1 31 Oct  3 18:47 ND

-----
 Contents of BB:
/BB/26607/B7763/BTHS/name1.txt
/BB/8766/OP764/Y7644/name39.txt

The first chunk of code in the script is to show the environment I used. The file BB is displayed as a sample result after all the files created from your (sorted) data are listed by name.

The drawback is that one needs msort. It was in my Debian (lenny) repository, and it is also available from freshmeat as noted.
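If msort is not available, a similar ordering can probably be had with standard tools by tagging each line with the rank of its key, sorting on the rank, and stripping the tag again. This is only a sketch and not what the script above does: the function name sort_by_rank is made up for illustration, it assumes a non-empty key list, it drops lines whose key is not in the list, and it relies on sort -s (stable), which GNU and BSD sort support but POSIX does not require.

```shell
# Sketch only: custom collating order without msort
# (decorate, sort, undecorate). Names are illustrative assumptions.
sort_by_rank() {
    # $1 = list of keys in the wanted order, $2 = data file ("/KEY/..." lines).
    # While awk reads the key list (NR == FNR) it stores each key's line
    # number as its rank; for the data file it prefixes every line with
    # the rank of its key (field 2). Lines with unknown keys are dropped.
    awk -F/ '
        NR == FNR   { rank[$0] = NR; next }
        $2 in rank  { print rank[$2] "\t" $0 }
    ' "$1" "$2" |
    sort -s -n -k1,1 |   # stable numeric sort on the rank only
    cut -f2-             # remove the rank and the tab again
}
```

For example, sort_by_rank data2 data1 should give the same order as the intermediate file shown above, and its output could be piped into the same awk splitter.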

Good luck ... cheers, drl
# 11  
Old 10-03-2010
Thanks everyone.

I went with the idea suggested by methyl in the end as it was easier to get my head around it.

drl, thanks for a tremendous example and, frankly, a shedload of work. The msort dependency made it a potential portability problem later, but once again, thanks.

BA