sed awk: split a large file to unique file names Post: 302980139

Posted by Akshay Hegde in UNIX for Beginners Questions & Answers on Wednesday 24th of August 2016 12:09:05 PM

08-24-2016

Moderator

If the input file is not sorted on the first column, try this:

Code:

[akshay@localhost tmp]$ awk '!($1 in a){a[$1]="file" (++c) ".txt"}{print >> a[$1]; close(a[$1])}' file
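For anyone who wants to see what the one-liner above is doing, here is the same idea spread over several lines and run on a tiny sample. This is only a sketch: the scratch directory, the file name input.txt, the sample rows, and the variable names `name` and `count` are my own placeholders, not part of the original command.

```shell
# Work in a scratch directory so the generated files do not clobber anything.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# Small unsorted sample (made-up rows in the same layout as the question).
cat > input.txt <<'EOF'
scaffold2 1024 1025 G/A +
scaffold1 928 929 C/T +
scaffold2 1065 1066 G/A +
scaffold1 942 943 G/C +
EOF

awk '
  # First time a key is seen: assign it the next numbered file name.
  !($1 in name) { name[$1] = "file" (++count) ".txt" }
  # Append the line to the file for its key, then close it immediately
  # so only one output file is open at any moment.
  { print >> name[$1]; close(name[$1]) }
' input.txt

cat file1.txt   # rows for scaffold2, the first key that appears
cat file2.txt   # rows for scaffold1
```

Files are numbered in order of first appearance, so here scaffold2 lands in file1.txt. Because the output is opened with `>>`, running the command a second time in the same directory appends to the existing files, so clear out old fileN.txt files first. Closing after every line keeps awk from running out of file descriptors when there are many distinct keys, at the cost of extra open/close work on a file with a million lines.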

If the input file is sorted, so that all lines with the same first column are contiguous, try this:

Code:

[akshay@localhost tmp]$ awk '$1 != prev{if(f)close(f); f="file" (++c) ".txt"; prev=$1}{print > f}END{if(f)close(f)}' file
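The sorted-input version can be read the same way. Below is a sketch of it in multi-line form on a small grouped sample; again the scratch directory, input.txt, the sample rows, and the names `out` and `count` are placeholders I picked for illustration.

```shell
# Work in a scratch directory so the generated files do not clobber anything.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# Small sample where rows with the same first column are already grouped.
cat > input.txt <<'EOF'
scaffold1 928 929 C/T +
scaffold1 942 943 G/C +
scaffold2 1024 1025 G/A +
scaffold3 1367 1368 G/A +
scaffold3 1403 1404 G/A +
EOF

awk '
  # Key changed (or first line): close the previous file, start the next one.
  $1 != prev {
    if (out) close(out)
    out = "file" (++count) ".txt"
    prev = $1
  }
  # Write the current line to the file for the current key.
  { print > out }
  # Close the last file when the input ends.
  END { if (out) close(out) }
' input.txt

wc -l file1.txt file2.txt file3.txt
```

Each output file is opened once and closed once, which should be noticeably cheaper on a large file than closing after every line. The catch is that it relies on the grouping: if a key shows up again later in the input, it gets a new file number instead of going back to its earlier file, so use the first command when the input is not grouped.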

Quote:

Originally Posted by kapr0001

Dear Users,

I would appreciate your help with splitting a large file (more than 1 million lines) using sed or awk. Below is the text in the file:
input file.txt
scaffold1 928 929 C/T +
scaffold1 942 943 G/C +
scaffold1 959 960 C/T +
scaffold1 994 995 G/A +
scaffold2 1024 1025 G/A +
scaffold2 1065 1066 G/A +
scaffold2 1356 1357 C/T +
scaffold2 1363 1364 G/A +
scaffold3 1367 1368 G/A +
scaffold3 1403 1404 G/A +
scaffold3 1404 1405 C/T +
scaffold3 1433 1434 G/A +
scaffold3 1467 1468 G/A +
scaffold4 1521 1522 G/A +
scaffold4 63885 63886 T/G +
scaffold4 63907 63908 G/A +
scaffold4 63942 63943 T/C +
scaffold4 63964 63965 G/A +
scaffold5 63996 63997 G/A +
scaffold5 63997 63998 T/C +
scaffold5 64074 64075 G/T +
scaffold100 64076 64077 C/T +
scaffold100 64127 64128 C/T +
scaffold120 64221 64222 A/G +
scaffold1100 64222 64223 T/C +
scaffold1890 64263 64264 C/T +
scaffold2000 64281 64282 G/C +
scaffold2001 64292 64293 C/T +
scaffold2002 64343 64344 G/A +
scaffold2003 64347 64348 G/T +

Each output file should contain the lines for one unique first-column name.
output files
file1.txt
scaffold1 928 929 C/T +
scaffold1 942 943 G/C +
scaffold1 959 960 C/T +
scaffold1 994 995 G/A +
file2.txt
scaffold2 1024 1025 G/A +
scaffold2 1065 1066 G/A +
scaffold2 1356 1357 C/T +
scaffold2 1363 1364 G/A +
file3.txt
scaffold3 1367 1368 G/A +
scaffold3 1403 1404 G/A +
scaffold3 1404 1405 C/T +
scaffold3 1433 1434 G/A +
scaffold3 1467 1468 G/A +

and so on.

Thank you,
kapr0001

Akshay Hegde
