Splitting a text file into smaller files with awk, how to create a different name for each new file

Tags
awk, file, shell scripts, split, text

# 1  
Old 1 Week Ago

Hello,

I have some large text files that look like,
Code:
putrescine
  Mrv1583 01041713302D          

  6  5  0  0  0  0            999 V2000
    2.0928   -0.2063    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    5.6650    0.2063    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    3.5217   -0.2063    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.2361    0.2063    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.8072    0.2063    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.9504   -0.2063    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  5  1  0  0  0  0
  2  6  1  0  0  0  0
  3  4  1  0  0  0  0
  3  5  1  0  0  0  0
  4  6  1  0  0  0  0
M  END
> <num>
1

> <name>
putrescine

$$$$
bis(hexamethylene)triamine.mol
  Mrv1583 01041713302D          

 15 14  0  0  0  0            999 V2000
    6.4898    1.0450    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    7.2042    1.4575    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.9187    1.0450    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    8.6332    1.4575    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    9.3477    1.0450    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   10.0621    1.4575    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   10.7766    1.0450    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   11.4911    1.4575    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   12.2055    1.0450    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   12.9200    1.4575    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   13.6345    1.0450    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   14.3490    1.4575    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   15.0634    1.0450    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   15.7779    1.4575    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   16.4924    1.0450    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  2  3  1  0  0  0  0
  3  4  1  0  0  0  0
  4  5  1  0  0  0  0
  5  6  1  0  0  0  0
  6  7  1  0  0  0  0
  7  8  1  0  0  0  0
  8  9  1  0  0  0  0
  9 10  1  0  0  0  0
 10 11  1  0  0  0  0
 11 12  1  0  0  0  0
 12 13  1  0  0  0  0
 13 14  1  0  0  0  0
 14 15  1  0  0  0  0
M  END
> <num>
2

> <name>
bis(hexamethylene)triamine

$$$$

There can be thousands of records, and a record has no fixed length: the number of lines and the number of tag fields between M  END and $$$$ both vary. Each record ends with the $$$$ terminator. I am trying to divide large files into a number of smaller files, each with the same number of records.

This code attempts to do this,
Code:
#! /bin/sh

# input file name
input_file=${1:-input.txt}
# output file name
output_file=${2:-output.txt}
# number of compounds per sdf file
split_number=${3:-6}

cat $input_file | \
awk -v split=$split_number ' { OUT[++CNT] = $0;  }
                $0 == "$$$$" { ++MOLS }
             $MOLS == $split { for(i in OUT) print OUT[i]; delete OUT; MOLS = 0 }
                         END { for(i in OUT) print OUT[i] }
                           ' > $output_file

by storing rows in OUT[] until a counter reaches the desired number of records per subfile, then printing the rows, clearing the array, and resetting the counter. The END block is meant to print any leftover rows if EOF is reached before the counter reaches the set number.

The obvious problem is that there is no way to change the output file name for each subsequent write, so I will only end up with the last file. I think I can change the value of $output_file with the awk code but I think the awk here runs in a different subshell than bash, so I don't think that will work.

If I could run the awk only on specific lines of the file, I think I could call awk from a bash loop and make that work but I am guessing there is an easier way. I am running this in 32-bit cygwin so have everything available from that kit.
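The bash-loop idea above can be sketched roughly as follows. This is only an illustration, assuming every record ends with a line that is exactly `$$$$`; the names `extract_records`, `sample.sdf`, and `chunk_N.sdf` are made up for the example. It rereads the input once per chunk, so a single awk pass that writes the files itself is likely to be faster on large inputs.

```shell
#!/bin/sh
# Hypothetical helper: print records first..last (1-based) of an SD file.
# Assumes every record ends with a line that is exactly "$$$$".
extract_records() {
    awk -v first="$2" -v last="$3" '
        BEGIN { rec = 1 }          # number of the record being read
        rec > last { exit }        # past the requested range: stop reading
        rec >= first               # inside the range: print the line
        $0 == "$$$$" { rec++ }     # terminator line: the next record starts
    ' "$1"
}

# Tiny made-up input with three one-line records.
printf 'A\n$$$$\nB\n$$$$\nC\n$$$$\n' > sample.sdf

per_file=2                                  # records per output file (assumed)
total=$(grep -c '^\$\$\$\$$' sample.sdf)    # number of terminator lines
n=0
first=1
while [ "$first" -le "$total" ]; do
    extract_records sample.sdf "$first" $((first + per_file - 1)) > "chunk_$n.sdf"
    first=$((first + per_file))
    n=$((n + 1))
done
```

With the sample input this leaves records A and B in chunk_0.sdf and record C in chunk_1.sdf.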

Suggestions would be appreciated.

LMHmedchem
# 2  
Old 1 Week Ago
Note: the csplit syntax below is incorrect. Ignore the example and see Don Cragun's post below.
Have you looked at the csplit command? It splits by context: the split is based on a string or a pattern, not on record length or block size. You can also make it use a fixed number of records per small output file. I think your requirement calls for a pattern.
Code:
csplit /pattern/ filename

e.g.,
Code:
   csplit /$$$$/ inputfilename

You get to specify the output filenames, so a quick read of the man page is in order, but by default they are something like xx00, xx01.
Change the prefix, and if there are literally thousands of possible output files, ask for 4 or 5 digits in the numeric suffix.

FWIW, it sounds like you need a sqlite db or something similar; maintaining thousands of files is a nightmare waiting to happen.

Last edited by jim mcnamara; 6 Days Ago at 02:49 PM.. Reason: Error.
# 3  
Old 1 Week Ago
Hi Jim,
The standard csplit synopsis is more like:
Code:
csplit [-ks] [-f prefix] [-n number] file /BRE/[offset]...

Note that the patterns with optional offsets (i.e., /BRE/[offset]) come after the file operand, not before it. Since the pattern is a basic regular expression, the dollar sign is a special character and must be escaped to be taken literally (instead of matching the end of the line). The offset is needed here because, without one, the operand /\$\$\$\$/ starts the next output file with the line that matches the BRE instead of ending the current record with that line.

Then note that each time the pattern is matched, a new output file is created. So getting each output file to contain six records is going to require an iterative process where each pass produces seven output files (the first six with one record each and the seventh with any remaining records). Then the first six will need to be combined into a real output file and the loop will then need to be repeated if there was a seventh output file.

Without specifying options for output filenames and the number of digits in the output filenames, the command:
Code:
csplit file '/\$\$\$\$/+1' '/\$\$\$\$/+1' '/\$\$\$\$/+1' '/\$\$\$\$/+1' '/\$\$\$\$/+1' '/\$\$\$\$/+1'

would produce the files xx00, xx01, xx02, xx03, xx04, and xx05, containing the first six records, respectively, from the file named file, and produce a file named xx06 containing any remaining records. But this will only work if there are at least 7 records in your input file. When a pass finally produces an error, the last input file will contain no more than six records, but you will need further processing to find out exactly how many, if that is important to you.
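The split-then-combine process described above might look roughly like this. It is a sketch under stated assumptions, not a tested recipe for real data: the input `sample.sdf`, the `rec_` prefix, and the `group_N.sdf` names are invented for the example, and the repeat count `{1}` is hard-coded for a three-record file (the pattern is applied once, repeated once more, and the third record is whatever remains).

```shell
#!/bin/sh
# Made-up sample input: three one-line records, each ended by "$$$$".
printf 'A\n$$$$\nB\n$$$$\nC\n$$$$\n' > sample.sdf

# One record per file: rec_0000, rec_0001, rec_0002.
# -s suppresses the byte counts, -f sets the prefix, -n the digit count.
# '{1}' repeats the previous pattern operand one more time.
csplit -s -f rec_ -n 4 sample.sdf '/^\$\$\$\$$/+1' '{1}'

per_file=2              # records per combined file (assumed value)
count=0
rm -f group_*.sdf       # start clean because the loop appends
for f in rec_[0-9][0-9][0-9][0-9]; do
    # The glob expands in sorted order, so records keep their input order.
    cat "$f" >> "group_$((count / per_file)).sdf"
    rm "$f"
    count=$((count + 1))
done
```

Here group_0.sdf receives records A and B and group_1.sdf receives record C. For real input the repeat count would have to be derived from the number of records, which is part of why a single awk pass is simpler.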

Hi LMHmedchem,
I would tend to just use awk for this. It is perfectly capable of creating output filenames for each record (or for each set of records in this case), counting records and grouping them together in the output, and it is also capable of reading an input file without creating a pipeline and wasting time reading and writing the file unnecessarily with cat:
Code:
#!/bin/sh

# input file name
input_file=${1:-input.txt}
# output file name
output_file=${2:-output.txt}
# number of compounds per sdf file
compounds_per_file=${3:-6}
# number of digits to add to sdf filenames
digits=${4:-4}

awk -v output_file="$output_file" \
    -v cpf="$compounds_per_file" \
    -v digits="$digits" '
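function fname() {
	# Build the next output filename: insert a zero-padded sequence
	# number before the last four characters (the assumed ".txt").
	ofn = sprintf("%s%0" digits "d%s",
		substr(output_file, 1, length(output_file) - 4),
		nr / cpf,
		substr(output_file, length(output_file)-3))
}
BEGIN {	nr = 0		# records written so far
	fname()
}
{	print > ofn	# copy every input line to the current output file
}
/\$\$\$\$/ {		# end-of-record line
	if((++nr) % cpf)
		next	# current output file is not full yet
	close(ofn)	# file is full: close it and name the next one
	fname()
}' "$input_file"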

Note that I couldn't use split as a variable name in awk because split is the name of a standard awk function. (Some versions of awk might allow you to have a variable and a function with the same name, but that is not required by the standards.)

The function used here to create the output filenames assumes that the filename you want should contain the number of digits given by digits, placed before the last four characters of output_file (i.e., before the .txt that is assumed to be at the end of the output filename). If you want to use a filename extension of a different length, you'll have to adjust the fname() function. Since you said there can be thousands of records in an input file, I set the default for digits at 4 (which will work for up to 9,999 input records even if cpf is set to 1).
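One possible adjustment for extensions of other lengths, offered as a sketch rather than as part of the script above: split output_file at its last dot instead of assuming four trailing characters. The names `base`, `ext`, `sample.sdf`, and `part.sdf` are invented for the example, the values are hard-coded instead of being taken from the command line, and the terminator test is changed to an exact comparison with `$$$$`.

```shell
#!/bin/sh
# Made-up sample input: five one-line records, each ended by "$$$$".
printf 'A\n$$$$\nB\n$$$$\nC\n$$$$\nD\n$$$$\nE\n$$$$\n' > sample.sdf

awk -v output_file="part.sdf" -v cpf=2 -v digits=3 '
function fname() {
	# Sequence number goes between the base name and the extension.
	ofn = sprintf("%s%0" digits "d%s", base, nr / cpf, ext)
}
BEGIN {
	# Split output_file at the last dot that follows the last slash.
	if (match(output_file, "[.][^./]*$")) {
		base = substr(output_file, 1, RSTART - 1)
		ext = substr(output_file, RSTART)
	} else {
		base = output_file	# no extension at all
		ext = ""
	}
	nr = 0
	fname()
}
{	print > ofn
}
$0 == "$$$$" {
	if ((++nr) % cpf)
		next
	close(ofn)
	fname()
}' sample.sdf
```

With these values the records land in part000.sdf (A and B), part001.sdf (C and D), and part002.sdf (E).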
# 4  
Old 6 Days Ago
Don - Thanks for the syntax correction. I edited my original post to point to yours.