split file by delimiter with csplit

Post 302680029 by drl on Wednesday 1st of August 2012 06:00:35 AM (Shell Programming and Scripting forum)

Hi.

Here is a demonstration of a gathering technique that might be useful for this problem:

Code:

#!/usr/bin/env bash

# @(#) s1	Demonstrate gather of lines, split into groups, expand.

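# pe: print all arguments joined together, then a newline.
# pl: print a blank line, a rule, and a section title.
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
# db: debug print to stderr; the second definition switches it off.
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
# Report the environment and tool versions if the local helper exists.
C=$HOME/bin/context && [ -f "$C" ] && "$C" awk tail split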

FILE=${1-data1}

pl " Input data file $FILE:"
cat "$FILE"

# Remove debris.
rm -f xa*

pl " Script \"gather\":"
cat gather

# Bunch lines of each sequence together on a single line.
./gather "$FILE" |
tail --lines=+2 |
split --lines=2

pl " Files xa created by split:"
ls xa*

# Expand file to temporary, replace original.
for file in xa*
do
awk '
	{ gsub(/=/,"\n") ; printf("%s",$0) }
' "$file" > t1
mv t1 "$file"
done

pl " Sample: file xab expanded:"
cat xab

exit 0

producing

Code:

% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.8 (lenny) 
bash GNU bash 3.2.39
awk GNU Awk 3.1.5
tail (GNU coreutils) 6.10
split (GNU coreutils) 6.10

-----
 Input data file data1:
>seq1
agtcagtc
agtcagtc
ag
>seq2
agtcagtcagtc
agtcagtcagtc
agtcagtc
>seq3
agtcagtcagtcagtc
agtcagtcagtcagtc
agtcagtcagtcagtc
>seq4
agtcagtcagtcagtcagtcagtc
>seq5
agtcagtc
>seq6
agtcagtcagtcagtcagtcagtcagtcagtcagtc
agtcagtcagtcagtcagtcagtcagtcagtcagtc
agtcagtcagtcagtcagtcagtc

-----
 Script "gather":
#!/usr/bin/env bash

# @(#) gather	Substitution of newline.

FILE=${1-data1}

awk '
BEGIN	{ RS = ">" }
	{ gsub(/\n/,"=") ; printf("%s%s\n", RS,$0) }
' $FILE

exit 0

-----
 Files xa created by split:
xaa  xab  xac

-----
 Sample: file xab expanded:
>seq3
agtcagtcagtcagtc
agtcagtcagtcagtc
agtcagtcagtcagtc
>seq4
agtcagtcagtcagtcagtcagtc

The short awk script in file gather collects the lines belonging to each sequence into one super line. It replaces the newlines with a character that does not occur in the data; here I used "=".
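
To make the super line format concrete, here is a minimal sketch that runs the same awk program as gather on a tiny inline sample. The sample data and the variable name super are made up for illustration; they are not part of the original script.

```shell
# Same awk program as in "gather", fed a small made-up sample on stdin.
super=$(
  printf '%s\n' '>seq1' 'agtc' 'ag' '>seq2' 'agtcagtc' |
  awk '
  BEGIN { RS = ">" }
        { gsub(/\n/, "=") ; printf("%s%s\n", RS, $0) }
  '
)
printf '%s\n' "$super"
# Prints:
# >
# >seq1=agtc=ag=
# >seq2=agtcagtc=
```

Each sequence now occupies exactly one line, so line-oriented tools can treat it as a unit.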

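Then tail drops the first output line, which holds only the empty record that precedes the first ">", and split writes the remaining super lines to files of 2 lines each (xaa, xab, xac for this input).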
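
As a rough illustration of the grouping step alone, the sketch below pushes five made-up super lines (plus the leading ">" line) through tail and split. It uses the short options -n and -l in place of the long GNU options in the script, and it works in a scratch directory from mktemp; both choices are assumptions of this example, not part of the original script.

```shell
# Work in a scratch directory so no existing xa* files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A leading ">" line followed by five made-up super lines.
printf '%s\n' '>' '>seq1=a=' '>seq2=b=' '>seq3=c=' '>seq4=d=' '>seq5=e=' |
tail -n +2 |      # drop the first line
split -l 2        # two super lines per output file, default prefix "x"

# xaa and xab hold two super lines each, xac holds the leftover one.
ls xa*
```

With an odd number of sequences the last file simply ends up with a single sequence.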

Another awk script expands the super lines by replacing each "=" with a real newline, and every xa file is rewritten by way of a temporary file.
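
One way to check that gathering and expanding really are inverses is a round trip on a small sample. This is only a sketch, under the assumption that "=" never occurs in the data; the file names sample, super and restored are made up for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' '>seq1' 'agtc' 'ag' '>seq2' 'agtcagtc' > sample

# Gather: one super line per sequence, minus the empty first record.
awk 'BEGIN { RS = ">" } { gsub(/\n/, "=") ; printf("%s%s\n", RS, $0) }' sample |
tail -n +2 > super

# Expand: every "=" becomes a newline again; no extra newline is printed,
# because each super line already ends in "=".
awk '{ gsub(/=/, "\n") ; printf("%s", $0) }' super > restored

cmp -s sample restored && echo "round trip OK"
```

If the data could contain "=", another character (or a short unlikely string) would have to be chosen as the stand-in for newline.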

Best wishes ... cheers, drl

csplit

CSPLIT(1)						    BSD General Commands Manual 						 CSPLIT(1)

NAME

     csplit -- split files based on context

SYNOPSIS

     csplit [-ks] [-f prefix] [-n number] file args ...

DESCRIPTION

     The csplit utility splits file into pieces using the patterns args.  If file is a dash ('-'), csplit reads from standard input.

     Files are created with a prefix of ``xx'' and two decimal digits.	The size of each file is written to standard output as it is created.  If
     an error occurs whilst files are being created, or a HUP, INT, or TERM signal is received, all files previously written are removed.

     The options are as follows:

     -f prefix
	     Create file names beginning with prefix, instead of ``xx''.

     -k      Do not remove previously created files if an error occurs or a HUP, INT, or TERM signal is received.

     -n number
	     Create file names with number decimal digits after the prefix, instead of 2.

     -s      Do not write the size of each output file to standard output as it is created.

     The args operands may be a combination of the following patterns:

     /regexp/[[+|-]offset]
	     Create a file containing the input from the current line to (but not including) the next line matching the given basic regular
	     expression.  An optional offset from the line that matched may be specified.

     %regexp%[[+|-]offset]
	     Same as above but a file is not created for the output.

     line_no
	     Create a file containing the input from the current line to (but not including) the specified line number.

     {num}   Repeat the previous pattern the specified number of times.  If it follows a line number pattern, a new file will be created for each
	     line_no lines, num times.	The first line of the file is line number 1 for historic reasons.

     After all the patterns have been processed, the remaining input data (if there is any) will be written to a new file.

     Requesting to split at a line before the current line number or past the end of the file will result in an error.
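
The interaction between a line_no operand and {num} is easy to misread, so a small sketch may help. It assumes a csplit that behaves as described above (the walk-through matches GNU csplit as far as can be told); the prefix part. and the ten-line input are made up for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Ten numbered lines on standard input. Cut before line 4, then once more
# four lines later (before line 8). Whatever is left goes to a third file.
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 |
csplit -s -f part. - 4 '{1}'

# part.00 holds lines 1-3, part.01 lines 4-7, part.02 lines 8-10.
for f in part.*; do
    printf '%s: %s\n' "$f" "$(tr '\n' ' ' < "$f")"
done
```

The first piece is one line shorter than the others because the split happens before line 4, not after it.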

ENVIRONMENT

     The LANG, LC_ALL, LC_COLLATE and LC_CTYPE environment variables affect the execution of csplit as described in environ(7).

EXIT STATUS

     The csplit utility exits 0 on success, and >0 if an error occurs.

EXAMPLES

     Split the mdoc(7) file foo.1 into one file for each section (up to 21 plus one for the rest, if any):

	   csplit -k foo.1 '%^.Sh%' '/^.Sh/' '{20}'

     Split standard input after the first 99 lines and every 100 lines thereafter:

	   csplit -k - 100 '{19}'
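
     The first example carries over to files delimited by ">" lines, such as the sequence data in the post above. The sketch below is an illustration only: the sample data and the prefix seq. are made up, and it assumes the input starts with a ">" line and holds at least three records, so that the repeat count is at least 1.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' '>seq1' 'agtc' 'ag' '>seq2' 'agtcagtc' '>seq3' 'agtcagtcagtc' > data1

n=$(grep -c '^>' data1)     # number of records (3 for this sample)

# '%^>%' : skip to the first header without creating a file
# '/^>/' : cut before the next header; {n-2} repeats that n-2 more times
# The last record is the "remaining input" and gets its own file.
csplit -s -f seq. data1 '%^>%' '/^>/' "{$((n - 2))}"

ls seq.*
```

     Computing the repeat count up front avoids the error that a too-large count would raise; the -k form shown in the examples above is the alternative when the number of records is not known.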

SEE ALSO

     sed(1), split(1), re_format(7)

STANDARDS

     The csplit utility conforms to IEEE Std 1003.1-2001 (``POSIX.1'').

HISTORY

     A csplit command appeared in PWB UNIX.

BUGS

     Input lines are limited to LINE_MAX (2048) bytes in length.

BSD
								 February 6, 2014							       BSD