Finding total distinct count from multiple csv files through UNIX script


 
# 1  
Old 08-28-2017

Hi All,
I have multiple pipe-delimited csv files in a directory. I need to find the distinct count of a column in each of those files, and the total distinct count across all the files.

We can't merge the files here, as the file sizes are huge (millions of rows each). I have tried the following for each file:
Code:
cat Test1.csv|cut -d"|" -f38|uniq|wc -l
cat Test2.csv|cut -d"|" -f38|uniq|wc -l
cat Test3.csv|cut -d"|" -f38|uniq|wc -l

I need to automate the above procedure (e.g. by putting each file's count in a temporary file), as many csv files are present in the directory and I need the total distinct count of the column across all of them. Can anyone please help me with this?

Last edited by Don Cragun; 08-28-2017 at 04:00 AM.. Reason: Change HTML tags to CODE tags.
# 2  
Old 08-28-2017
Are all of your .csv files sorted on field 38? If not, your code won't work. (You will get artificially high counts for the number of distinct values in a file because uniq produces a line of output for each case where a field 38 value changes from the value found on the previous line.)

Does each distinct value in field 38 of all of your files appear in only one of your input files? If not, your code won't work. (You don't have any way to determine which distinct values in a single file appear in one or more of the other files.)

Are you always processing 3 files?

Having 3 files of a megabyte each should not cause any problem producing a single merged or sorted combined file. Why are you unable to merge them?

Why not just use a single awk script to read all of your files once and produce the output you want for each input file and for the combined input from all of the input files?

Do you really want the number of distinct field 38 values in each input file? Or, do you really just want the number of distinct field 38 values in the merged input files?
# 3  
Old 08-28-2017
Please become accustomed to providing decent context info for your problem.
It is always helpful to support a request with system info (OS and shell), related environment (variables, options), preferred tools, adequate (representative) sample input and desired output data along with the logic connecting the two, and, if they exist, any system (error) messages verbatim. This avoids ambiguities and keeps people from guessing.

Please be aware that without a prior sort, uniq can't reliably find unique entries.
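A minimal illustration with made-up data (uniq only collapses *adjacent* duplicates, so unsorted input inflates the count):
Code:
```shell
printf 'a\nb\na\n' | uniq | wc -l      # 3 -- 'a' is counted twice
printf 'a\nb\na\n' | sort -u | wc -l   # 2 -- the true distinct count
```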
# 4  
Old 08-28-2017
Hi,
Please find my responses below:
The csv files are not sorted on field 38, but field 38 contains only numerical values.
Each distinct value appears in only one file; no distinct value is present in more than one file.
We are not processing only 3 files. It can be around 100-120 files.

All of the input files (around 120) can contain millions of rows. If the data size on the system exceeds 400 GB there is a chance the system will blow up, which is why we are not merging the files.

We need the total distinct count of field 38 across all the files present in the directory.

As I'm relatively new to scripting, can you please help me with the above requirement? Your help will be very much appreciated.
# 5  
Old 08-28-2017
As RudiC suggested (and as you well know from being a member of this forum for over 2.5 years), if you don't provide sample input and corresponding desired output from a couple of sample input files, you don't tell us what operating system and shell you're using, and you don't provide a clear specification of what you are trying to do, there is a good chance that any suggestions provided might not work with your data in your environment. The following is completely untested and makes several assumptions that might be wrong...
Code:
awk -F'|' '
!(($38 + 0) in a) {	# first time this field 38 value is seen in any file
	a[$38 + 0]	# remember it ...
	c++		# ... and count it
}
END {	print c
}' *.csv
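If holding every distinct value in awk's memory becomes a concern with your data volumes, a sort-based pipeline over just field 38 should give the same total (equally untested; the setup lines here only build made-up demo files, in your directory you would run just the last line):
Code:
```shell
# demo setup: two small pipe-delimited files with 38 fields per row
cd "$(mktemp -d)"
awk 'BEGIN{s="";for(j=1;j<38;j++)s=s "x|";for(i=1;i<=3;i++)print s i}' > Test1.csv
awk 'BEGIN{s="";for(j=1;j<38;j++)s=s "x|";for(i=3;i<=5;i++)print s i}' > Test2.csv

# extract field 38 from every csv, dedupe, count; sort -u spills to
# temporary files on disk, so memory use stays bounded
cut -d'|' -f38 *.csv | sort -u | wc -l     # 5 here (values 1-5, with 3 in both files)
```
Note that, like the awk script, this counts a header value (if your files have header lines) as one extra distinct entry.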

# 6  
Old 08-28-2017
Hi Don,
Please find below the sample input files in the directory.
Code:
[omnidevint@sftp311 full_01jan13_31aug15]$ ls -ltr
total 800068
-rw-rw-r--. 1 omnidevint omnidevint  97184275 Aug 25 15:38 Final_Customer_Data_2013_2015_0.csv
-rw-rw-r--. 1 omnidevint omnidevint  88463636 Aug 25 15:38 Final_Customer_Data_2013_2015_1.csv
-rw-rw-r--. 1 omnidevint omnidevint  89536908 Aug 25 15:38 Final_Customer_Data_2013_2015_2.csv
-rw-rw-r--. 1 omnidevint omnidevint 107937776 Aug 25 15:38 Final_Customer_Data_2013_2015_3.csv
-rw-rw-r--. 1 omnidevint omnidevint 161423700 Aug 25 15:38 Final_Customer_Data_2013_2015_4.csv
-rw-rw-r--. 1 omnidevint omnidevint 129165088 Aug 25 15:38 Final_Customer_Data_2013_2015_5.csv
-rw-rw-r--. 1 omnidevint omnidevint 117823518 Aug 25 15:38 Final_Customer_Data_2013_2015_6.csv
-rw-rw-r--. 1 omnidevint omnidevint  27721536 Aug 25 15:38 Final_Customer_Data_2013_2015_7.csv

The contents of each file look like below, where field 38 contains customer_id.
Code:
 head -3 Final_Customer_Data_2013_2015_0.csv
first_name|last_name|email_address|brand|store_no|sales_associate_no|create_date|create_source|address_type_code|address_1|address_2|address_3|address_4|address_5|post_code|telephone_no|country_code|gvisa_flag|marital_status|gender|title|mobile_no|phone_opt_in_flag|mail_opt_in_flag|email_type|opt_in_flag|transaction_id|bm_customer_id|address_Active_Flag|mail_opt_in_date|phone_opt_in_date|date_last_modified|membership_date|head_of_household_flag|email_opt_in_date|points_current_balance|rewards_id|customer_id
Paula|Frediani|plfrediani@comcast.net|GYM|8888|9999|07/22/1999|STORE|BILLING|1906 Miller Ave||Belmont|CA||94002 1765|6506545255|US||U|U||6506545255|0|1|Rewards Signup||||1|02/17/2007||11/12/2016|06/18/2010|1||133|144RG2JS7F31735|58
Valerie|Burkart|valzb@sbcglobal.net|GYM|8888|9999|08/05/1999|STORE|BILLING|UNKNOWN||PLEASANTON|CA||94566|9254850902|US||U|U||9254850902|0|1|Rewards Signup|||2253998879592428|1|02/17/2007||03/18/2012|03/17/2012|1||239|1AS4V98V8G31735|255

And we are working on the Linux operating system, in the bash shell.
Code:
Linux sftp311 3.10.0-327.el7.x86_64 #1 SMP Thu Oct 29 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux

And we need the distinct customer_id count from all the csv files in the directory. Based on this info, if you can provide a script, that will be helpful. Thanks in advance.
# 7  
Old 08-28-2017
You didn't say anything about there being header lines in your files that need to be ignored, so the code I suggested in post #5 in this thread would give you a count that is 1 more than the number of different customer IDs found in all of the .csv files in the directory in which you run that script. I assume that you can manually subtract 1 from the printed result, or change the line that prints the value of c to print c-1 instead, to get the results you want.

Did you try running the code I suggested? Do you have some reason to think it is not doing what I described above?

If what I provided is not sufficient, please explain what it is doing wrong AND show us exactly the output you want from the sample input you provided.
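For reference, here is a variant that skips the first (header) line of each input file instead of subtracting 1 afterwards. It is again completely untested against your real data; the setup lines only build made-up demo files with a header row:
Code:
```shell
# demo setup: two files, each with a header line plus 38-field data rows
cd "$(mktemp -d)"
awk 'BEGIN{s="";for(j=1;j<38;j++)s=s "x|";print "h|h";for(i=1;i<=3;i++)print s i}' > Test1.csv
awk 'BEGIN{s="";for(j=1;j<38;j++)s=s "x|";print "h|h";for(i=3;i<=5;i++)print s i}' > Test2.csv

# FNR resets to 1 for each input file, so FNR > 1 skips every header line
awk -F'|' '
FNR > 1 && !(($38 + 0) in a) {
	a[$38 + 0]++
	c++
}
END {	print c
}' *.csv     # 5 here: field 38 holds 1,2,3 in Test1.csv and 3,4,5 in Test2.csv
```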