Select and copy .csv files based on row and column number

02-26-2019

Registered User

2,288, 480

Join Date: Apr 2007

Last Activity: 3 May 2020, 8:28 AM EDT

Location: Saint Paul, MN USA / BSD, CentOS, Debian, OS X, Solaris

Posts: 2,288

Thanks Given: 430

Thanked 480 Times in 395 Posts

Hi, Robert

Good that you have a solution. I am going to present another solution, one that does not use awk. I definitely recommend that you learn awk, especially if you are going to deal with columns (I usually call them fields). If you did wish to use only the command line, then here is one approach. First, we will ignore files that do not have exactly 6 lines. Then we will ignore all remaining files that do not have a total of 24 fields, which we will assume 4 fields on each of the 6 lines. We will do this, in turn, by assuming that each line would have 3 commas (separators), thus 6 X 3 -> 18. The utilities we will use are wc to count lines and characters, and tr which can delete (and transform) characters. So consider a file that has your 6 lines, and 4 fields per line. We will delete everything except the commas, and then count those commas. Any file that has other than 18 will be ignored. Any filename that remains will be collected into a text string (variable).

This script has a lot of other stuff in it to show the data files, results, etc., such as a debug function db. (To see what the results look like without debugging output, simply reverse the order of the 2 lines defining the db function.) Just concentrate on the ideas above and look at the files on which wc and tr operate. Ready?

Code:

#!/usr/bin/env bash

# @(#) s2       Demonstrate counting lines and fields, wc and tr.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
em() { pe "$*" >&2 ; }
db() { : ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
C=$HOME/bin/context && [ -f $C ] && $C head wc tr

FILE=${1-data?}

pl " Input data files $FILE:"
head $FILE

pl " Line counts for data files $FILE:"
wc -l $FILE

pl " Results:"
l6x4=""
for item in $FILE
do
  if  [ $(wc -l <$item) != 6 ]
  then
    db " ignore file $item, not 6 lines"
        continue
  else
    if [ $(tr -d -c ',' <$item|wc -c) != 18 ]
        then
          db " ignore file $item, not 18 commas"
          continue
        fi
    l6x4="$l6x4 $item"
  fi
done
db " list of files to copy: $l6x4"
pe " sample: cp $l6x4 some-directory"

exit 0

producing:

Code:

$ ./s2

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-7-amd64, x86_64
Distribution        : Debian 8.11 (jessie) 
bash GNU bash 4.3.30
head (GNU coreutils) 8.23
wc (GNU coreutils) 8.23
tr (GNU coreutils) 8.23

-----
 Input data files data?:
==> data1 <==
1.1,1.2,1.3,1.4
2.1,2.2,2.3,2.4

==> data2 <==
1.1,1.2,1.3,1.4
2.1,2.2,2.3,2.4
3.1
4.1
5.1
6.1
7.1

==> data3 <==
1.1,1.2,1.3,1.4
2.1,2.2,2.3,2.4
3.1,3.2,3.3,3.4,3.5

==> data4 <==
1.1,1.2,1.3,1.4
2.1,2.2,2.3,2.4





==> data5 <==
1.1,1.2,1.3,1.4
2.1,2.2,2.3,2.4
3.1,3.2,3.3,3.4
4.1,4.2,4.3,4.4
5.1,5.2,5.3,5.4
6.1,6.2,6.3,6.4

==> data6 <==
1.1,1.2,1.3,1.4
2.1,2.2,2.3,2.4
3.1,3.2,3.3,3.4
4.1,4.2,4.3,4.4
5.1,5.2,5.3,5.4
6.1,6.2,6.3;6.4 <- semi-colon is not a comma

==> data7 <==
1.1,1.2,1.3,1.4
2.1,2.2,2.3,2.4
3.1,3.2,3.3,3.4
4.1,4.2,4.3,4.4
5.1,5.2,5.3,5.4
6.1,6.2,6.3,6.4

==> data8 <==
1.1,1.2,1.3,1.4
2.1,2.2,2.3,2.4
3.1,3.2,3.3,3.4
4.1,4.2,4.3,4.4
5.1,5.2,5.3,5.4
6.1,6.2,6.3,6.4,6.5

==> data9 <==
1.1,1.2,1.3,1.4
2.1,2.2,2.3,2.4
3.1,3.2,3.3,3.4
4.1,4.2,4.3,4.4
5.1,5.2,5.3         <- only 3 fields: so 2 commas
6.1,6.2,6.3,6.4,6.5 <- but  5 fields: so 4 commas

-----
 Line counts for data files data?:
  2 data1
  7 data2
  3 data3
  6 data4
  6 data5
  6 data6
  6 data7
  6 data8
  6 data9
 48 total

-----
 Results:
 db,  ignore file data1, not 6 lines
 db,  ignore file data2, not 6 lines
 db,  ignore file data3, not 6 lines
 db,  ignore file data4, not 18 commas
 db,  ignore file data6, not 18 commas
 db,  ignore file data8, not 18 commas
 db,  list of files to copy:  data5 data7 data9
 sample: cp  data5 data7 data9 some-directory

OK, so what's good here? It's fairly straightforward command-line stuff, although some, like the comparisons may be new. That means you don't need to learn another language like awk or perl (but, of course, you probably should.) Just simple utilities are used.

So what's bad? First, awk and perl can process multiple operations, whereas here, each file causes a number of separate executions for each file. For a small number of files, say a few hundred, that probably would not matter much, especially if you can get a script running quickly compared to the time it may take to get an awk script (or especially a perl script) running correctly. Secondly, we assumed your files are all syntactically correct, in that each line has 4 fields. If it does not, like data9, the sum of the commas could still be 18, but it may cause an error down the line.

The latter issue could be addressed by looking at each line. Another utility (hefty though it might be) is datamash, which you can use to verify the number of fields in a file. For example:

Code:

$ datamash -t, check 4 fields <data9
line 5 (3 fields):
  5.1,5.2,5.3         <- only 3 fields: so 2 commas
datamash: check failed: line 5 has 3 fields (expecting 4)

Because it is another code that deals with fields, you may wish to look at it:

Code:

datamash        command-line calculations (man)
Path    : /usr/local/bin/datamash
Version : 1.2
Type    : ELF 64-bit LSB executable, x86-64, version 1 (SYS ...)
Help    : probably available with -h,--help
Repo    : Debian 8.11 (jessie) 
Home    : https://savannah.gnu.org/projects/datamash/ (pm)
Home    : http://www.gnu.org/software/datamash (doc)

The code datamash was in my repository for a machine like:

Code:

OS, ker|rel, machine: Linux, 3.16.0-7-amd64, x86_64
Distribution        : Debian 8.11 (jessie) 
datamash (GNU datamash) 1.2

Best wishes ... cheers, drl

This User Gave Thanks to drl For This Post:

drl

View Public Profile for drl

Find all posts by drl

02-27-2019

Registered User

5, 1

Join Date: Feb 2019

Last Activity: 9 March 2019, 7:52 AM EST

Posts: 5

Thanks Given: 2

Thanked 1 Time in 1 Post

Hi drl,

Many thanks for taking the time to provide such an extensive answer - the support provided on this forum is really extraordinary. I could not have produced a script like yours with all the bells and whistles but the central for-if-else-loop seems indeed pretty straightforward.

I'm aware I should probably familiarize with awk but as a lay programmer (I use MATLAB quite a bit but that's about it) I find its syntax not particularly intuitive and time is just such a limited resource... *sigh*

In any case, many thanks again - both you and Rudi have been super helpful!

Robert

This User Gave Thanks to rcsapo For This Post:

rcsapo

View Public Profile for rcsapo

Find all posts by rcsapo

UNIX for Beginners Questions & Answers

Select and copy .csv files based on row and column number

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Get maximum per column from CSV file, based on date column

Discussion started by: ejianu

2. Shell Programming and Scripting

Row bind multiple csv files having different column headers

Discussion started by: ks_reddy

3. Shell Programming and Scripting

List files with number to select based on number

Discussion started by: mysocks

4. Shell Programming and Scripting

Comparing Select Columns from two CSV files in UNIX and create a third file based on comparision

Discussion started by: ady_koolz

5. Shell Programming and Scripting

Read in 2-column CSV, output many files based on field

Discussion started by: pxalpine

6. Shell Programming and Scripting

awk: Transpose csv row to column.

Discussion started by: drbiloukos

7. Shell Programming and Scripting

delete a row in csv file based on the date

Discussion started by: pals70423

8. Shell Programming and Scripting

How to print column based on row number

Discussion started by: Surabhi_so_mh

9. Shell Programming and Scripting

Split single file into multiple files based on the number in the column

Discussion started by: tomasl

10. Shell Programming and Scripting

Select Record based on First Column

Discussion started by: mgirinath