I would appreciate your help with splitting a large file (more than 1 million lines) with sed or awk. Below is the text in the file:
input file.txt
My output files should be split by the unique names in the first column.
output files
file1.txt
file2.txt
file3.txt
and so on.
Thank you,
kapr0001
Moderator's Comments:
Please use CODE tags as required by forum rules!
Last edited by RudiC; 08-24-2016 at 04:13 PM..
Reason: Added CODE tags.
Welcome to the forums; we hope you will enjoy learning and sharing knowledge here. Please use CODE tags for the commands, code, and input you put into your posts, as per forum rules. The following may help you with your question.
Let's say we have following Input_file.
Then following is the code.
Output will be 5 files, named file1.txt, file2.txt, file3.txt, file4.txt and file5.txt, as follows.
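As an illustration only (the sample data here is hypothetical, not the OP's file), a minimal awk sketch of the approach being described, assuming, per the NOTE below, that the 1st field embeds the number used to name the output file:

```shell
# Hypothetical Input_file: each first field carries a digit
cat > Input_file <<'EOF'
scaffold1 aaa
scaffold1 bbb
scaffold2 ccc
EOF

# Strip everything but the digits from field 1 and use them to name the file
awk '{ n = $1; gsub(/[^0-9]/, "", n); print > ("file" n ".txt") }' Input_file
```

Each line lands in fileN.txt, where N is the number embedded in its first field.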
Please do let us know if this helps you. Enjoy learning!
NOTE: I also want to mention that the code above assumes your 1st field contains a digit; it picks out that number and uses it to build the output file names.
Thanks,
R. Singh
Last edited by RavinderSingh13; 08-24-2016 at 12:23 PM..
Reason: Added a NOTE to solution now.
I like the shell solution. It can be made a little more I/O-efficient (this matters when the output files are written to a network file system):
The input file must be sorted on col1 (otherwise: remove previous output files and append with exec 3>>"$col1".txt)
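A minimal sketch of that fd-based shell approach (hypothetical sample data; assumes whitespace-separated input sorted on col1): the output file is opened once per key on fd 3, rather than once per line.

```shell
# Hypothetical input, sorted on the first column
cat > file.txt <<'EOF'
scaffold1 aaa
scaffold1 bbb
scaffold2 ccc
EOF

prev=
while read -r col1 rest; do
  if [ "$col1" != "$prev" ]; then
    exec 3>"$col1".txt   # one open per key; use 3>> (append) for unsorted input
    prev=$col1
  fi
  printf '%s\n' "$col1 $rest" >&3
done < file.txt
exec 3>&-                # close the last output file
```

Because the input is sorted, each key's lines are contiguous, so truncating with `3>` when the key changes is safe.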
Last edited by MadeInGermany; 08-24-2016 at 04:42 PM..
Reason: Removed comment about awk - close() releases the file descriptors!
Hi.
Apologies for the length of this and for the late posting. I am always skeptical of shell solutions when we get to sizable files, 1M lines or more, because of the time involved. I focused only on the time for reading, creating a test file of 1M lines whose only line content is scaffold1 and scaffold2. Here is the script:
producing:
Comments:
This isn't just a simple split; it's a split-and-group problem. A code like csplit might be considered at first glance, but it keys off a unique header-like value, then transfers lines until the next occurrence of a header. Here we need to create multiple output files, gathering lines that share similar key values.
I like the shell code because it is simple to understand, but it takes a long time.
The awk unsorted version also takes a long time, and I think it's because of the large number of closes.
The awk sorted version is very speedy and, even when combined with the time for a sort, seems like the best solution.
Our local perl codes gate and mmsplit are run for comparison. The gate is slower, but very simple to call; mmsplit is faster than gate, but has a more complicated calling sequence.
So I would choose the awk sorted code from Akshay Hegde, but precede it with a sort. The total real time, coming in at 0.424 + 0.515 -> 0.939, is better than the other solutions.
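A minimal sketch of that sort-then-awk combination (hypothetical sample data, not the benchmark code itself): after sorting on the key, awk needs only one output file open at a time, closing it whenever the key changes.

```shell
# Hypothetical unsorted input
cat > file.txt <<'EOF'
scaffold2 ccc
scaffold1 aaa
scaffold2 ddd
EOF

# Sort on the key, then split; at most one output file is open at any moment
sort -k1,1 file.txt |
awk '$1 != prev { if (out != "") close(out); out = $1 ".txt"; prev = $1 }
     { print > out }'
```

This sidesteps the open-file-limit problem entirely, since awk never holds more than one descriptor.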
The awk unsorted version could be improved by holding strings until one had, say, 1000 of them, then writing the file and closing it. That would cut down the time, but increase the complexity.
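That buffered variant could look something like the sketch below (my own hypothetical data; FLUSH would be around 1000 in practice, and is set to 2 here only to exercise the flush path on a tiny sample).

```shell
# Hypothetical unsorted input; remove stale outputs since we append
rm -f scaffold1.txt scaffold2.txt
cat > file.txt <<'EOF'
scaffold2 ccc
scaffold1 aaa
scaffold2 ddd
EOF

# Buffer lines per key, flushing in batches so each output file is
# opened and closed far less often than once per line.
awk -v FLUSH=2 '
function flush(k) {
    printf "%s", buf[k] >> (k ".txt")   # append the buffered batch
    close(k ".txt")                     # release the descriptor at once
    buf[k] = ""; n[k] = 0
}
{ buf[$1] = buf[$1] $0 ORS; if (++n[$1] >= FLUSH) flush($1) }
END { for (k in buf) if (n[k]) flush(k) }' file.txt
```

The trade-off is exactly as described: fewer close() calls, at the cost of buffering memory and a final flush pass in END.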
The issue of the maximum number of open files might be a problem, although less so for the shell than the other scripting solutions. Solutions using the sorted file would probably be best for a large number of possible group values.
Best wishes ... cheers, drl
Last edited by drl; 08-29-2016 at 10:37 PM..
Reason: Correct minor typos.