Post 302423066 by kevintse on Thursday 20th of May 2010 05:53:52 AM
The builtin split function in AWK is too slow

I have a text file that contains 4 million lines. Each line contains two fields separated by a colon, as shown:
Code:
123:444,555,666,777,888,345
233:5444,555,666,777,888,345
623:454,585,664,773,888,345
......

I have to split the second field (which can contain up to 40,000 comma-separated items) into an array for analysis, but I find the built-in "split" function too slow for this.
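For reference, the straightforward split-based approach described above looks roughly like the following. This is only a minimal sketch: the function name `count_fields` and the two sample lines are made up for illustration, and the real analysis would replace the `printf`.

```shell
# Hypothetical baseline: split the second field on commas with split()
# and report, per line, the key, the item count and the last item.
count_fields() {
    awk -F: '{
        n = split($2, parts, ",")           # parts[1..n] hold the items
        printf "%s %d %s\n", $1, n, parts[n]
    }' "$@"
}

printf '%s\n' '123:444,555,666,777,888,345' '233:5444,555,666' | count_fields
```

Wrapping the pipeline in `time` would give a baseline to compare any alternative against.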

I tried to find an alternative to the split function. I think I found one, but there is still one part I cannot get working without your help. This is the code I have now:

Code:
awk -F: -v cmd="awk 'BEGIN { RS=\",\" } { print \$1 }'" '{ print $2 | cmd; close(cmd) }' data.txt

This code splits the second field fast enough, but I don't know how to store the resulting items in an array in the *first* AWK program shown above.

How can I solve this problem? Or do you have another alternative to the split function?
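One possible direction, sketched here under assumptions rather than as a confirmed answer: instead of starting a second awk for every line, turn all commas into newlines once with `tr` and let a single awk process rebuild the per-line array as the items stream past. The names `per_line_items` and `analyse` are hypothetical, and the sketch assumes that no item contains a colon and that every line has a non-empty second field. Whether this is actually faster than split() on the real 4-million-line file has not been measured.

```shell
# Hypothetical helper: turn commas into newlines first, then let one awk
# process rebuild a per-line array without calling split().
per_line_items() {
    tr ',' '\n' | awk -F: '
        # Hypothetical analysis step: it only reports count and last item.
        function analyse(key, arr, n) { print key, n, arr[n] }

        NF == 2 {                       # "key:first_item" starts a new line
            if (seen) analyse(key, arr, n)
            key = $1; n = 0; seen = 1   # reuse arr; n marks the valid range
            arr[++n] = $2
            next
        }
        { arr[++n] = $1 }               # every later item of the same line
        END { if (seen) analyse(key, arr, n) }
    '
}

printf '%s\n' '123:444,555,666,777,888,345' '233:5444,555,666' | per_line_items
```

The array is never cleared between lines; only the counter `n` is reset, so stale entries beyond `arr[n]` must be ignored by the analysis step.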

Thanks.

Last edited by kevintse; 05-20-2010 at 06:59 AM..
 
