sed awk: split a large file to unique file names
Posted by drl, 08-29-2016, 01:05 PM
Hi.

Apologies for the length of this and for the late posting. I am always skeptical of shell solutions once we get to sizable files, 1M lines or more, because of the time involved. I focused only on the time for reading, using a test file of 1M lines whose content alternates between the keys scaffold1 and scaffold2.
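The file is generated by a local helper, create2. A minimal sketch of what such a generator might look like, with the field spacing approximated from the sample line shown below (an assumption, not the actual create2):
Code:
#!/usr/bin/env bash
# Hypothetical create2: emit 1M test lines, alternating between
# the two keys; spacing approximated from the sample line.
line1='scaffold1       928     929     C/T     +'
line2='scaffold2       928     929     C/T     +'
for ((i = 0; i < 500000; i++))
do
  printf '%s\n%s\n' "$line1" "$line2"
done > data1

Here is the driver script: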
Code:
#!/usr/bin/env bash

# @(#) s1       Demonstrate schemes to split a file based on content.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
em() { pe "$*" >&2 ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }   # re-defined as a no-op: debug output disabled
C=$HOME/bin/context && [ -f $C ] && $C awk perl gate mmsplit
inxi -c0 -C

FILE=data1
FILE_tmp=/tmp/data1$$
trap 'rm -f $FILE_tmp'  0                  # clean up temporary file on exit
trap 'rm -f $FILE_tmp ; exit 1'  1  2  15  # and on HUP, INT, TERM
rm -f file* scaffold*

# Create data file if it does not yet exist.
if [ ! -f $FILE ]
then
  ./create2
fi

pl " Input data file $FILE:"
specimen 2:2:2 -n $FILE

# Sample line:
# scaffold1       928     929     C/T     +

pl " Results, shell, unsorted:"
# Note: read collapses the whitespace after the first field to a single
# space (hence the smaller byte counts below), and each echo redirection
# re-opens and closes the output file.
time while read col1 rest; do echo "$col1 $rest" >> ${col1}.txt; done < $FILE
pe
wc scaffold*
rm scaffold*

pl " Results, awk, unsorted:"
# First time a key is seen, assign it the next fileN.txt name;
# append each line and close the file immediately.
time awk '!($1 in a){a[$1]="file"++c".txt"}{print $0 >>a[$1]; close(a[$1])}' $FILE
pe
wc file*
rm file*

pl " Results, sort the file:"
time sort -o $FILE_tmp $FILE
pe
specimen 2:2:2 -n $FILE_tmp

pl " Results, awk sorted:"
# Sorted input keeps each key contiguous, so only one output file
# needs to be open at a time.
time awk '$1 != prev{if(f)close(f);f="file"++c".txt"; prev=$1}{print > f}END{if(f)close(f)}' $FILE_tmp
pe
wc file*
rm file*

pl " Results, gate, sorted:"
time gate -f=1 -s=" " $FILE_tmp
pe
wc scaffold*
rm scaffold*

pl " Results, mmsplit, sorted:"
time mmsplit --fix=every --body=body --grep='/^scaffold(\d+)/' -i=$FILE_tmp
pe
wc body*
rm body*

exit 0

producing:
Code:
$ ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution        : Debian 8.4 (jessie) 
bash GNU bash 4.3.30
awk GNU Awk 4.1.1, API: 1.1 (GNU MPFR 3.1.2-p3, GNU MP 6.0.0)
perl 5.20.2
gate (local) 1.10
mmsplit (local) 2.0
CPU:       Triple core AMD FX-6350 Six-Core (-MCP-) cache: 6144 KB 
           clock speeds: max: 3915 MHz 1: 3915 MHz 2: 3915 MHz 3: 3915 MHz

-----
 Input data file data1:
Edges: 2:2:2 of 1000000 lines in file "data1"
     1  scaffold1       928     929     C/T     +
     2  scaffold2       928     929     C/T     +
   ---
500001  scaffold1       928     929     C/T     +
500002  scaffold2       928     929     C/T     +
   ---
999999  scaffold1       928     929     C/T     +
1000000 scaffold2       928     929     C/T     +

-----
 Results, shell, unsorted:

real    0m26.607s
user    0m17.868s
sys     0m8.624s

  500000  2500000 18000000 scaffold1.txt
  500000  2500000 18000000 scaffold2.txt
 1000000  5000000 36000000 total

-----
 Results, awk, unsorted:

real    0m19.304s
user    0m5.892s
sys     0m13.308s

  500000  2500000 21000000 file1.txt
  500000  2500000 21000000 file2.txt
 1000000  5000000 42000000 total

-----
 Results, sort the file:

real    0m0.424s
user    0m0.416s
sys     0m0.176s

Edges: 2:2:2 of 1000000 lines in file "/tmp/data110702"
     1  scaffold1       928     929     C/T     +
     2  scaffold1       928     929     C/T     +
   ---
500001  scaffold2       928     929     C/T     +
500002  scaffold2       928     929     C/T     +
   ---
999999  scaffold2       928     929     C/T     +
1000000 scaffold2       928     929     C/T     +

-----
 Results, awk sorted:

real    0m0.515s
user    0m0.420s
sys     0m0.092s

  500000  2500000 21000000 file1.txt
  500000  2500000 21000000 file2.txt
 1000000  5000000 42000000 total

-----
 Results, gate, sorted:

real    0m6.238s
user    0m6.144s
sys     0m0.092s

  500000  2500000 21000000 scaffold1
  500000  2500000 21000000 scaffold2
 1000000  5000000 42000000 total

-----
 Results, mmsplit, sorted:

real    0m2.918s
user    0m2.796s
sys     0m0.120s

  500000  2500000 21000000 body.1
  500000  2500000 21000000 body.2
 1000000  5000000 42000000 total

Comments:

This isn't just a simple split; it's a split-and-group problem. Utilities like csplit might be considered at first glance, but csplit keys off a header-like value and then transfers lines until the next occurrence of a header. Here we need to create multiple output files, each gathering the lines that share a key value.
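For contrast, a csplit call on this data (illustrative only, not one of the timed runs) would start a new piece at every matching line, giving one output file per input line rather than one per key:
Code:
# -z drops the empty leading piece; -n 7 allows enough suffix digits
# for a million pieces.
csplit -z -n 7 data1 '/^scaffold/' '{*}'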

I like the shell code because it is simple to understand, but it takes a long time.

The unsorted awk version also takes a long time, and I think that is because of the large number of closes: it closes (and must implicitly re-open) an output file for every input line.

The sorted awk version is very speedy and, even when the time for the sort is added in, seems like the best solution.

Our local perl codes gate and mmsplit are run for comparison. gate is slower, but is very simple to call.

mmsplit is faster than gate, but has a more complicated calling sequence.

So I would choose the sorted awk code from Akshay Hegde, preceded by a sort. The total real time, 0.424 + 0.515 = 0.939 seconds, is better than the other solutions.
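For reference, the sort and the awk pass can be combined into a single pipeline. This variation (a sketch, untimed) names each output file after the key itself:
Code:
sort data1 |
awk '$1 != prev { if (f) close(f); f = $1 ".txt"; prev = $1 }
     { print > f }'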

The unsorted awk could be improved by buffering lines until one had, say, 1000 of them for a key, then appending the batch to the file and closing it. That would cut down the time, but increase the complexity.
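A sketch of that buffered approach (the threshold of 1000 is arbitrary, and the END block flushes whatever remains; untimed):
Code:
awk '{ buf[$1] = buf[$1] $0 ORS            # accumulate lines per key
       if (++n[$1] >= 1000) {              # flush a full batch
         f = "file_" $1 ".txt"
         printf "%s", buf[$1] >> f; close(f)
         buf[$1] = ""; n[$1] = 0
       }
     }
     END { for (k in buf)                  # flush the partial batches
             if (buf[k] != "") {
               f = "file_" k ".txt"
               printf "%s", buf[k] >> f; close(f)
             }
     }' data1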

The maximum number of open files could become an issue, although less so for the shell than for the other scripting solutions, since the shell loop never holds more than one file open at a time. For a large number of possible key values, the solutions that work from the sorted file would probably be best.
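The per-process limit is easy to check before choosing an approach; the value shown here is a common Linux default, not taken from the test machine:
Code:
$ ulimit -n
1024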

Best wishes ... cheers, drl
