sed awk: split a large file to unique file names

08-24-2016


Dear Users,

I would appreciate your help with splitting a large file (more than 1 million lines) with sed or awk. Below is the text in the file.
input file.txt

Code:

scaffold1       928     929     C/T     +
scaffold1       942     943     G/C     +
scaffold1       959     960     C/T     +
scaffold1       994     995     G/A     +
scaffold2       1024    1025    G/A     +
scaffold2       1065    1066    G/A     +
scaffold2       1356    1357    C/T     +
scaffold2       1363    1364    G/A     +
scaffold3       1367    1368    G/A     +
scaffold3       1403    1404    G/A     +
scaffold3       1404    1405    C/T     +
scaffold3       1433    1434    G/A     +
scaffold3       1467    1468    G/A     +
scaffold4       1521    1522    G/A     +
scaffold4       63885   63886   T/G     +
scaffold4       63907   63908   G/A     +
scaffold4       63942   63943   T/C     +
scaffold4       63964   63965   G/A     +
scaffold5       63996   63997   G/A     +
scaffold5       63997   63998   T/C     +
scaffold5       64074   64075   G/T     +
scaffold100       64076   64077   C/T     +
scaffold100       64127   64128   C/T     +
scaffold120       64221   64222   A/G     +
scaffold1100       64222   64223   T/C     +
scaffold1890       64263   64264   C/T     +
scaffold2000       64281   64282   G/C     +
scaffold2001       64292   64293   C/T     +
scaffold2002      64343   64344   G/A     +
scaffold2003       64347   64348   G/T     +

Each output file should contain only the lines that share one first-column name.
Output files:
file1.txt

Code:

scaffold1       928     929     C/T     +
scaffold1       942     943     G/C     +
scaffold1       959     960     C/T     +
scaffold1       994     995     G/A     +

file2.txt

Code:

scaffold2       1024    1025    G/A     +
scaffold2       1065    1066    G/A     +
scaffold2       1356    1357    C/T     +
scaffold2       1363    1364    G/A     +

file3.txt

Code:

scaffold3       1367    1368    G/A     +
scaffold3       1403    1404    G/A     +
scaffold3       1404    1405    C/T     +
scaffold3       1433    1434    G/A     +
scaffold3       1467    1468    G/A     +

and so on.

Thank you,
kapr0001

Moderator's Comments:

Please use CODE tags as required by forum rules!

Last edited by RudiC; 08-24-2016 at 04:13 PM.. Reason: Added CODE tags.

kapr0001


08-24-2016


man split

vgersh99


08-24-2016


Hello Kapr0001,

Welcome to the forums. I hope you will enjoy learning and sharing knowledge here. Please use code tags for the commands, code, and input in your posts, as the forum rules require. The following may help you.
Let's say we have the following Input_file.

Code:

cat Input_file
scaffold1 928 929 C/T +
scaffold1 942 943 G/C +
scaffold1 959 960 C/T +
scaffold1 994 995 G/A +
scaffold2 1024 1025 G/A +
scaffold2 1065 1066 G/A +
scaffold2 1356 1357 C/T +
scaffold2 1363 1364 G/A +
scaffold3 1367 1368 G/A +
scaffold3 1403 1404 G/A +
scaffold3 1404 1405 C/T +
scaffold3 1433 1434 G/A +
scaffold3 1467 1468 G/A +
scaffold4 1521 1522 G/A +
scaffold4 63885 63886 T/G +
scaffold4 63907 63908 G/A +
scaffold4 63942 63943 T/C +
scaffold4 63964 63965 G/A +
scaffold5 63996 63997 G/A +
scaffold5 63997 63998 T/C +
scaffold5 64074 64075 G/T +

Then the following is the code.

Code:

awk '{Q=$0;sub(/[[:alpha:]]+/,X,$1);n=$1+0;A[n]=(n in A)?A[n] ORS Q:Q;if(n>num)num=n} END{for(i=1;i<=num;i++)if(i in A){f="file"i".txt";print A[i] > f;close(f)}}'  Input_file

The output will be five files named file1.txt, file2.txt, file3.txt, file4.txt and file5.txt, as follows.

Code:

cat file1.txt
scaffold1 928 929 C/T +
scaffold1 942 943 G/C +
scaffold1 959 960 C/T +
scaffold1 994 995 G/A +
  
cat file2.txt
scaffold2 1024 1025 G/A +
scaffold2 1065 1066 G/A +
scaffold2 1356 1357 C/T +
scaffold2 1363 1364 G/A +

cat file3.txt
scaffold3 1367 1368 G/A +
scaffold3 1403 1404 G/A +
scaffold3 1404 1405 C/T +
scaffold3 1433 1434 G/A +
scaffold3 1467 1468 G/A +

cat file4.txt
scaffold4 1521 1522 G/A +
scaffold4 63885 63886 T/G +
scaffold4 63907 63908 G/A +
scaffold4 63942 63943 T/C +
scaffold4 63964 63965 G/A +
  
cat file5.txt
scaffold5 63996 63997 G/A +
scaffold5 63997 63998 T/C +
scaffold5 64074 64075 G/T +

Please let us know if this helps you. Enjoy learning.

NOTE: The code above assumes that the first field contains a number. It strips the letters from that field and uses the largest number found as the upper bound when writing the output files.

Thanks,
R. Singh

Last edited by RavinderSingh13; 08-24-2016 at 12:23 PM.. Reason: Added a NOTE to solution now.

RavinderSingh13


08-24-2016


Code:

while read -r col1 rest; do printf '%s %s\n' "$col1" "$rest" >> "${col1}.txt"; done < input.txt

balajesuri


08-24-2016


If the input file is not sorted, try this:

Code:

[akshay@localhost tmp]$ awk '!($1 in a){a[$1]="file"++c".txt"}{print $0 >>a[$1]; close(a[$1])}' file

If the input file is sorted, try this:

Code:

[akshay@localhost tmp]$ awk '$1 != prev{if(f)close(f);f="file"++c".txt"; prev=$1}{print > f}END{if(f)close(f)}' file


Akshay Hegde


08-24-2016


I like the shell solution. It can be made a little more I/O-efficient, which matters when the output files are written to a network file system:

Code:

p_col1=""
while read col1 rest
do
  if [ "$col1" != "$p_col1" ]
  then
    p_col1=$col1
    exec 3>"$col1".txt
  fi
  echo "$col1 $rest" >&3
done < Input_file
exec 3>&-

The input file must be sorted on col1. Otherwise, remove any previous output files first and append with exec 3>>"$col1".txt instead.
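For unsorted input, the append variant mentioned above might look like the following sketch. The temporary directory and the three sample lines are assumptions made for illustration; the loop itself is the one from this post with `3>` changed to `3>>`, plus `read -r` and `printf` as defensive choices.

```shell
#!/bin/sh
# Sketch only: work in a scratch directory (an assumption for this demo).
workdir=$(mktemp -d) || exit 1
cd "$workdir" || exit 1

# Unsorted sample input in the thread's format (hypothetical data subset).
printf '%s\n' \
  'scaffold1 928 929 C/T +' \
  'scaffold2 1024 1025 G/A +' \
  'scaffold1 942 943 G/C +' > Input_file

# Old output files must go first, because every write below appends.
rm -f scaffold*.txt

p_col1=""
while read -r col1 rest
do
  if [ "$col1" != "$p_col1" ]
  then
    p_col1=$col1
    # Reopen descriptor 3 in append mode only when the key changes.
    exec 3>>"$col1".txt
  fi
  printf '%s %s\n' "$col1" "$rest" >&3
done < Input_file
exec 3>&-
```

On heavily interleaved input the key changes on almost every line, so this reopens a file nearly as often as the simple `>>` loop does; the saving only shows up when equal keys arrive in runs.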

Last edited by MadeInGermany; 08-24-2016 at 04:42 PM.. Reason: Removed comment about awk - close() releases the file descriptors!


MadeInGermany


08-29-2016


Hi.

Apologies for the length of this and for the late posting. I am always skeptical of shell solutions when we get to sizable files, 1M lines or more, because of the time involved. I focused only on timing, creating a test file of 1M lines whose first column is only scaffold1 or scaffold2. Here is the script:

Code:

#!/usr/bin/env bash

# @(#) s1       Demonstrate schemes to split a file based on content.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
em() { pe "$*" >&2 ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C awk perl gate mmsplit
inxi -c0 -C

FILE=data1
FILE_tmp=/tmp/data1$$
trap 'rm -f $FILE_tmp ; exit 1'  0  1  2  15
rm -f file* scaffold*

# Create data file if it does not yet exist.
if [ ! -f $FILE ]
then
  ./create2
fi

pl " Input data file $FILE:"
specimen 2:2:2 -n $FILE

# Sample line:
# scaffold1       928     929     C/T     +

pl " Results, shell, unsorted:"
time while read col1 rest; do echo "$col1 $rest" >> ${col1}.txt; done < $FILE
pe
wc scaffold*
rm scaffold*

pl " Results, awk, unsorted:"
time awk '!($1 in a){a[$1]="file"++c".txt"}{print $0 >>a[$1]; close(a[$1])}' $FILE
pe
wc file*
rm file*

pl " Results, sort the file:"
time sort -o $FILE_tmp $FILE
pe
specimen 2:2:2 -n $FILE_tmp

pl " Results, awk sorted:"
time awk '$1 != prev{if(f)close(f);f="file"++c".txt"; prev=$1}{print > f}END{if(f)close(f)}' $FILE_tmp
pe
wc file*
rm file*

pl " Results, gate, sorted:"
time gate -f=1 -s=" " $FILE_tmp
pe
wc scaffold*
rm scaffold*

pl " Results, mmsplit, sorted:"
time mmsplit --fix=every --body=body --grep='/^scaffold(\d+)/' -i=$FILE_tmp
pe
wc body*
rm body*

exit 0

producing:

Code:

$ ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution        : Debian 8.4 (jessie) 
bash GNU bash 4.3.30
awk GNU Awk 4.1.1, API: 1.1 (GNU MPFR 3.1.2-p3, GNU MP 6.0.0)
perl 5.20.2
gate (local) 1.10
mmsplit (local) 2.0
CPU:       Triple core AMD FX-6350 Six-Core (-MCP-) cache: 6144 KB 
           clock speeds: max: 3915 MHz 1: 3915 MHz 2: 3915 MHz 3: 3915 MHz

-----
 Input data file data1:
Edges: 2:2:2 of 1000000 lines in file "data1"
     1  scaffold1       928     929     C/T     +
     2  scaffold2       928     929     C/T     +
   ---
500001  scaffold1       928     929     C/T     +
500002  scaffold2       928     929     C/T     +
   ---
999999  scaffold1       928     929     C/T     +
1000000 scaffold2       928     929     C/T     +

-----
 Results, shell, unsorted:

real    0m26.607s
user    0m17.868s
sys     0m8.624s

  500000  2500000 18000000 scaffold1.txt
  500000  2500000 18000000 scaffold2.txt
 1000000  5000000 36000000 total

-----
 Results, awk, unsorted:

real    0m19.304s
user    0m5.892s
sys     0m13.308s

  500000  2500000 21000000 file1.txt
  500000  2500000 21000000 file2.txt
 1000000  5000000 42000000 total

-----
 Results, sort the file:

real    0m0.424s
user    0m0.416s
sys     0m0.176s

Edges: 2:2:2 of 1000000 lines in file "/tmp/data110702"
     1  scaffold1       928     929     C/T     +
     2  scaffold1       928     929     C/T     +
   ---
500001  scaffold2       928     929     C/T     +
500002  scaffold2       928     929     C/T     +
   ---
999999  scaffold2       928     929     C/T     +
1000000 scaffold2       928     929     C/T     +

-----
 Results, awk sorted:

real    0m0.515s
user    0m0.420s
sys     0m0.092s

  500000  2500000 21000000 file1.txt
  500000  2500000 21000000 file2.txt
 1000000  5000000 42000000 total

-----
 Results, gate, sorted:

real    0m6.238s
user    0m6.144s
sys     0m0.092s

  500000  2500000 21000000 scaffold1
  500000  2500000 21000000 scaffold2
 1000000  5000000 42000000 total

-----
 Results, mmsplit, sorted:

real    0m2.918s
user    0m2.796s
sys     0m0.120s

  500000  2500000 21000000 body.1
  500000  2500000 21000000 body.2
 1000000  5000000 42000000 total

Comments:

This isn't just a simple split, it's a split and group problem. Tools like csplit might be considered at first glance, but csplit keys off a unique header-like value and then transfers lines until the next occurrence of a header. We need to create multiple output files, each gathering the lines that share a key value.

I like the shell code because it is simple to understand, but it takes a long time.

The awk unsorted version also takes a long time, and I think it's because of the large number of closes.

The awk sorted version is very speedy and, even with the time for the sort added, seems like the best solution.

Our local perl codes gate and mmsplit are run for comparison. The gate is slower, but is very simple to call.

The mmsplit is faster than gate, but has a more complicated calling sequence.

So I would choose the awk sorted code from Akshay Hegde, but precede it with a sort. The total real time, 0.424 + 0.515 = 0.939 seconds, is better than that of the other solutions.
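As a rough sketch, the sort and the sorted-input awk program can be joined in one pipeline. The scratch directory and sample.txt are assumptions for the demo; `-s` (stable sort) is available in GNU and BSD sort but is not required by POSIX, so drop it if your sort lacks it and line order within a group does not matter.

```shell
#!/bin/sh
# Sketch only: work in a scratch directory (an assumption for this demo).
workdir=$(mktemp -d) || exit 1
cd "$workdir" || exit 1

# Small unsorted sample in the thread's format (hypothetical data subset).
printf '%s\n' \
  'scaffold2 1024 1025 G/A +' \
  'scaffold1 928 929 C/T +' \
  'scaffold2 1065 1066 G/A +' \
  'scaffold1 942 943 G/C +' \
  'scaffold3 1367 1368 G/A +' > sample.txt

# -k1,1 restricts the sort key to the first column; -s keeps the original
# order of lines that share a key. awk then opens one file per key change.
sort -s -k1,1 sample.txt |
awk '$1 != prev { if (f) close(f); f = "file" (++c) ".txt"; prev = $1 }
     { print > f }
     END { if (f) close(f) }'
```

The file numbers follow the sort order, which is lexical here, so on the full data scaffold100 would be numbered before scaffold2.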

The awk unsorted could be improved by holding strings until one had, say 1000 of them, then writing the file and closing it. That would cut down the time, but increase the complexity.
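A minimal sketch of that buffering idea follows; it was not timed in the benchmark above. The function name flush_key, the arrays buf, cnt and name, and the sample data are all hypothetical. The demo sets max=2 so that the mid-stream flush is exercised on five lines; something like max=1000 is what the suggestion has in mind for real data.

```shell
#!/bin/sh
# Sketch only: work in a scratch directory (an assumption for this demo).
workdir=$(mktemp -d) || exit 1
cd "$workdir" || exit 1

# Unsorted sample in the thread's format (hypothetical data subset).
printf '%s\n' \
  'scaffold1 928 929 C/T +' \
  'scaffold2 1024 1025 G/A +' \
  'scaffold1 942 943 G/C +' \
  'scaffold1 959 960 C/T +' \
  'scaffold2 1065 1066 G/A +' > sample.txt

awk -v max=2 '
# Append the buffered lines for key k to its file, then close the file.
function flush_key(k) {
    if (cnt[k]) {
        printf "%s", buf[k] >> (name[k])
        close(name[k])
        buf[k] = ""
        cnt[k] = 0
    }
}
# First time a key is seen: assign it the next file name.
!($1 in name) { name[$1] = "file" (++c) ".txt" }
{
    buf[$1] = buf[$1] $0 ORS          # hold the line in memory
    if (++cnt[$1] >= max) flush_key($1)
}
# Write whatever is still buffered for every key.
END { for (k in name) flush_key(k) }
' sample.txt
```

Because every write appends, output files left over from an earlier run need to be removed first.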

The issue of the maximum number of open files might be a problem, although less so for the shell than the other scripting solutions. Solutions using the sorted file would probably be best for a large number of possible group values.
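One way to judge whether open files could become a problem is to compare the number of distinct keys with the per-process descriptor limit. This is only a sketch: the sample data and messages are made up, and `ulimit -n` is assumed to be supported by your shell (bash, dash, ksh and zsh provide it).

```shell
#!/bin/sh
# Sketch only: work in a scratch directory (an assumption for this demo).
workdir=$(mktemp -d) || exit 1
cd "$workdir" || exit 1

# Sample in the thread's format (hypothetical data subset).
printf '%s\n' \
  'scaffold1 928 929 C/T +' \
  'scaffold2 1024 1025 G/A +' \
  'scaffold1 942 943 G/C +' \
  'scaffold3 1367 1368 G/A +' > sample.txt

# Count distinct first-column values.
keys=$(awk '!($1 in seen) { seen[$1]; n++ } END { print n + 0 }' sample.txt)

# Soft limit on open descriptors; a few are already taken by stdin,
# stdout and stderr, so treat the comparison as approximate.
limit=$(ulimit -n)

if [ "$limit" = "unlimited" ] || [ "$keys" -lt "$limit" ]
then
  echo "$keys distinct keys, descriptor limit $limit: keeping all files open may be feasible"
else
  echo "$keys distinct keys, descriptor limit $limit: sort first, or close files as you go"
fi
```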

Best wishes ... cheers, drl

Last edited by drl; 08-29-2016 at 10:37 PM.. Reason: Correct minor typos.

drl
