Split file by column value, each with header


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Split file by column value, each with header
# 15  
Old 11-15-2016
Code:
column=5

rm -f *_testcase.csv
(( acol = column - 1 ))
while read line
do
   [[ -z "$line1" ]] && { line1=$line ; continue ; }
   IFS=, cols=($line)
   fl=${cols[$acol]}${yr}__testcase.csv
   [[ -f "$fl" ]] || echo "$line1" > "$fl"
   echo "$line" >> "$fl"
done < infile.csv

This User Gave Thanks to rdrtx1 For This Post:
# 16  
Old 11-15-2016
Quote:
Originally Posted by RavinderSingh13
Hello Rudi,

Thank you for nice explanation, I would like to mention here wouldn't be it good to close the opened files or if there are too many files opened it may be an issue with it. Could you please put some light on this too, will be grateful to you sir.

Thanks,
R. Singh
Yes, if the number of files threatens to exceed the system configuration parameter OPEN_MAX (my linux: 1024) or awk's internal limit (here:1022), a close (FN[$5]) after each print >> FN[$5] might help. Please note the >> redir3ction operator to append lines to the file.
# 17  
Old 11-15-2016
Hi.

At our shop, we often try to generalize problem solutions, rather than having a number of separate solutions for similar problems.

In this kind of case we created a code gate to handle these kind of situations. The center of this would be the single command:
Code:
gate -f=5 $FILE

Here is how we would start to solve this:
Code:
#!/usr/bin/env bash

# @(#) s1       Demonstrate collection of matching lines to separate files, gate.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
em() { pe "$*" >&2 ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C dixf gate
pe
dixf gate

FILE=${1-data1}
E=expected-output.txt

pl " Input data file $FILE, lines = $( wc -l < $FILE):"
head $FILE

pl " Expected output:"
grep csv $E

pl " Results, dry run:"
gate --dryrun --field=5 $FILE

pl " Results, real run:"
gate -f=5 $FILE
wc -l Testcase TATA*

pl " Documentation for gate:"
gate -h

exit 0

producing:
Code:
$ ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution        : Debian 8.6 (jessie) 
bash GNU bash 4.3.30
dixf (local) 1.12
gate (local) 1.10

gate    Group, collect, sift lines, separate lines into buckets. (what)
Path    : ~/bin/gate
Length  : 278 lines
Type    : Perl script, ASCII text executable
Shebang : #!/usr/bin/perl
Help    : probably available with [     ]-h

-----
 Input data file data1, lines = 50:
year,min,max,Instrument,Testcase
2016,201,1003,GEOTROPH-02116,TATA7980
2016,53,1011,GEOTROPH-01963,TATA7980
2016,3,1024,GEOTROPH-02067,TATA7980
2016,203,1027,GEOTROPH-02011,TATA7980
2016,135,1016,GEOTROPH-01932,TATA3405
2016,71,1036,GEOTROPH-02030,TATA7980
2016,173,1046,GEOTROPH-02033,TATA7980
2016,183,1015,GEOTROPH-02120,TATA7980
2016,47,1037,GEOTROPH-02154,TATA7980

-----
 Expected output:
TATA79802016__testcase.csv
and TATA34052016__testcase.csv

-----
 Results, dry run:
     9 entries in group (good) TATA3405
    40 entries in group (good) TATA7980
     1 entries in group (good) Testcase

-----
 Results, real run:
   1 Testcase
   9 TATA3405
  40 TATA7980
  50 total

-----
 Documentation for gate:
NAME
    gate - Group And Transfer Entries

VERSION
    1.0

USAGE
    gate [-d] [-c m,n] [-f n] [-h] [-s c] [files]

REQUIRED ARGUMENTS
    None.

OPTIONS
    -d <off>
        dry-run, determine filenames and counts. A basic sanity check on
        filenames is performed. This is recommended for a first run.

    -c m,n <0>
        Select column mode: define columns m through n as the filenames to
        which lines will be written. The columns are numbered beginning with
        1 on the left.

    -f number <0>
        Choose field number as the group. Data will be written to filenames
        corresponding to the contents of this field. The field numbering
        begins with 1 on the left.

    -h <off>
        Print this man page.

    -s character string <,>
        Select one or more characters as the separator character.

    files <STDIN>
        Specify files to process.

DESCRIPTION
    For each line read, the id value from the field or columns specified is
    isolated from each and used as a filename, to which that line is
    written. As many files are opened and written as are needed.

This may not be the final approach, but it would be a start for us.

Best wishes ... cheers, drl

---------- Post updated at 10:18 ---------- Previous update was at 09:46 ----------

Hi.
Quote:
Originally Posted by RavinderSingh13
... I would like to mention here wouldn't be it good to close the opened files or if there are too many files opened it may be an issue with it. ...
I seem to recall testing the timing between closing and not closing files in this kind of operation. My recollection is that it added a large amount of time to do the close-open-close... operation.

While there may be situations where this might be needed. I'd be tempted to consider splitting the input file, and do separate operations on the split pieces (for example). Multiple cores/CPU can help with parallelization in such cases, say with command parallel.

Best wishes ... cheers, drl
This User Gave Thanks to drl For This Post:
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Split large xml into mutiple files and with header and footer in file

Split large xml into mutiple files and with header and footer in file tried below it splits unevenly and also i need help in adding header and footer command : csplit -s -k -f my_XML_split.xml extrfile.xml "/<Document>/" {1} sample xml <?xml version="1.0" encoding="UTF-8"?><Recipient>... (36 Replies)
Discussion started by: karthik
36 Replies

2. Shell Programming and Scripting

Split a file based on encountering header

I need to split a file based on headers found Input file file1 content: ADD john mickey DROP matt sam output of file F1 john mickey output of file F2 matt sam (5 Replies)
Discussion started by: Diddy
5 Replies

3. Shell Programming and Scripting

Sort and Split file with header and custom name

Hi, I am using SUN SOLARIS (SunOS sun4v sparc SUNW, T5240). I have a huge data file with header and trailer. This file gets used into an ETL process. ETL skips the header record (which is the first record of the file) and loads the rest of the record. The file can be delimited (comma,... (5 Replies)
Discussion started by: Saanvi1
5 Replies

4. Shell Programming and Scripting

Split and add header and trailer from input file

I need to split the file based on pattern from position 34-37 while retaining the header and trailer records in each individual split file Also is it possible to output the TOM and PAT records in the same output file ? I need the output file names same as xyz_pattern_Datetimestamp.txt ... (23 Replies)
Discussion started by: techedipro
23 Replies

5. UNIX for Dummies Questions & Answers

Column Header in the Spool file

Hi All, I have a problem with the column heading while spooling the data from the database. Since i want the column header, therefore i am not using SET HEADING OFF, and i am getting the header, but for the longer column name, some of the character from the column name is missing. for... (7 Replies)
Discussion started by: Pramod_009
7 Replies

6. UNIX for Dummies Questions & Answers

Rename a header column by adding another column entry to the header column name

Hi All, I have a file example.csv which looks like this GrpID,TargetID,Signal,Avg_Num CSCH74_1_1,2007,61,256 CSCH74_1_1,212007,647,679 CSCH74_1_1,12007,3,32 CSCH74_1_1,207,299,777 I want the output as GrpID,TragetID,Signal-CSCH74_1_1,Avg_Num CSCH74_1_1,2007,61,256... (1 Reply)
Discussion started by: Vavad
1 Replies

7. Shell Programming and Scripting

Rename a header column by adding another column entry to the header column name URGENT!!

Hi All, I have a file example.csv which looks like this GrpID,TargetID,Signal,Avg_Num CSCH74_1_1,2007,61,256 CSCH74_1_1,212007,647,679 CSCH74_1_1,12007,3,32 CSCH74_1_1,207,299,777 I want the output as GrpID,TragetID,Signal-CSCH74_1_1,Avg_Num CSCH74_1_1,2007,61,256... (4 Replies)
Discussion started by: Vavad
4 Replies

8. UNIX for Dummies Questions & Answers

split header row into one column

So, I have a massive file with thousands of columns I want a list of the headers in one column in another file. So I need to strip off the top line (can use head-1) But how can I convert from this format: A B C D E F G to A B C D E F G (6 Replies)
Discussion started by: polly_falconer
6 Replies

9. Shell Programming and Scripting

Split large file and add header and footer to each small files

I have one large file, after every 200 line i have to split the file and the add header and footer to each small file? It is possible to add different header and footer to each file? (7 Replies)
Discussion started by: ashish4422
7 Replies

10. Shell Programming and Scripting

Split large file and add header and footer to each file

I have one large file, after every 200 line i have to split the file and the add header and footer to each small file? It is possible to add different header and footer to each file? (1 Reply)
Discussion started by: ashish4422
1 Replies
Login or Register to Ask a Question