sed awk: split a large file to unique file names Post: 302980139

Posted in UNIX for Beginners Questions & Answers. Post 302980139 by Akshay Hegde on Wednesday 24th of August 2016 12:09:05 PM

08-24-2016

Moderator

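If the input file is not sorted (lines with the same first field may be scattered through the file), try this. Each line is appended to the file assigned to its first field, and that file is closed after every write so awk never keeps more than one output file open.

Code:

[akshay@localhost tmp]$ awk '!($1 in a){a[$1]="file" (++c) ".txt"}{print $0 >> (a[$1]); close(a[$1])}' file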
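For anyone who finds the one-liner hard to read, here is the same idea spread over several lines with comments. It is only a sketch: the scratch directory, the name input.txt, the variable names and the interleaved sample lines are my own assumptions for illustration, not part of the original question.

```shell
#!/bin/sh
# Work in a scratch directory (assumption) so the output files land somewhere harmless.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Hypothetical unsorted sample: lines for the same key are interleaved on purpose.
cat > input.txt <<'EOF'
scaffold2 1024 1025 G/A +
scaffold1 928 929 C/T +
scaffold2 1065 1066 G/A +
scaffold1 942 943 G/C +
scaffold3 1367 1368 G/A +
EOF

awk '
# First time a key is seen: assign it the next numbered file name.
!($1 in name) { name[$1] = "file" (++count) ".txt" }
{
    # Append, because the file is closed and reopened for every line.
    print $0 >> (name[$1])
    # Closing keeps only one output file open at a time, at some cost in speed.
    close(name[$1])
}
' input.txt
```

Note that the file numbers follow the order in which each first-column value first appears, so here scaffold2 ends up in file1.txt. Because the script appends, running it twice in a directory that already holds file1.txt and friends would add the lines a second time, so clear old output first.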

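If the input file is sorted (all lines with the same first field are adjacent), try this. A new output file is started only when the first field changes, so each file is opened and closed once.

Code:

[akshay@localhost tmp]$ awk '$1 != prev{if(f)close(f);f="file" (++c) ".txt"; prev=$1}{print > f}END{if(f)close(f)}' file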
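The sorted-input variant can be laid out the same way. Again this is a sketch under assumptions: the scratch directory, the file name input.txt and the variable names are mine, the sample is the first few lines of the poster's data, and the extra NR == 1 check is my addition so that the first line always opens a file.

```shell
#!/bin/sh
# Work in a scratch directory (assumption) so the output files land somewhere harmless.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# A few lines from the question, already grouped by the first column.
cat > input.txt <<'EOF'
scaffold1 928 929 C/T +
scaffold1 942 943 G/C +
scaffold2 1024 1025 G/A +
scaffold2 1065 1066 G/A +
scaffold3 1367 1368 G/A +
EOF

awk '
# The key changed (or this is the first line): switch to a new output file.
NR == 1 || $1 != prev {
    if (out) close(out)              # finish the previous file, if any
    out = "file" (++count) ".txt"    # next numbered file name
    prev = $1
}
# Every line goes to the file that is currently open.
{ print > out }
# Close the last file when the input ends.
END { if (out) close(out) }
' input.txt
```

This version only gives one file per first-column value when equal values sit next to each other. If the same value shows up again later in the file, it gets a second file with a new number, so use the unsorted variant when in doubt.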

Quote:

Originally Posted by kapr0001

Dear Users,

I would appreciate your help with splitting a large file (more than 1 million lines) with sed or awk. Below is the text in the file.
input file.txt
scaffold1 928 929 C/T +
scaffold1 942 943 G/C +
scaffold1 959 960 C/T +
scaffold1 994 995 G/A +
scaffold2 1024 1025 G/A +
scaffold2 1065 1066 G/A +
scaffold2 1356 1357 C/T +
scaffold2 1363 1364 G/A +
scaffold3 1367 1368 G/A +
scaffold3 1403 1404 G/A +
scaffold3 1404 1405 C/T +
scaffold3 1433 1434 G/A +
scaffold3 1467 1468 G/A +
scaffold4 1521 1522 G/A +
scaffold4 63885 63886 T/G +
scaffold4 63907 63908 G/A +
scaffold4 63942 63943 T/C +
scaffold4 63964 63965 G/A +
scaffold5 63996 63997 G/A +
scaffold5 63997 63998 T/C +
scaffold5 64074 64075 G/T +
scaffold100 64076 64077 C/T +
scaffold100 64127 64128 C/T +
scaffold120 64221 64222 A/G +
scaffold1100 64222 64223 T/C +
scaffold1890 64263 64264 C/T +
scaffold2000 64281 64282 G/C +
scaffold2001 64292 64293 C/T +
scaffold2002 64343 64344 G/A +
scaffold2003 64347 64348 G/T +

Each output file should contain the lines for one unique name in the first column.
output files
file1.txt
scaffold1 928 929 C/T +
scaffold1 942 943 G/C +
scaffold1 959 960 C/T +
scaffold1 994 995 G/A +
file2.txt
scaffold2 1024 1025 G/A +
scaffold2 1065 1066 G/A +
scaffold2 1356 1357 C/T +
scaffold2 1363 1364 G/A +
file3.txt
scaffold3 1367 1368 G/A +
scaffold3 1403 1404 G/A +
scaffold3 1404 1405 C/T +
scaffold3 1433 1434 G/A +
scaffold3 1467 1468 G/A +

and so on.

Thank you,
kapr0001

Akshay Hegde

