Merge files and remove duplicated rows


 
# 1  
Old 09-04-2014

In a folder I'll receive new files several times daily that I want to combine into one big file, without any duplicate rows.
The file names in the folder will look like e.g.:
Code:
MissingData_2014-08-25_09-30-18.txt
MissingData_2014-08-25_09-30-14.txt
MissingData_2014-08-26_09-30-12.txt

The content of each file will consist of a header (that I want to keep in the new big file) and a set of tab-separated fields. In addition, a last column should be added containing the date of the source file (taken either from the file name or from the header of the specific file). Example (for only two of the files):

MissingData_2014-08-25_09-30-18.txt:
Code:
Missing Data Report
Aug 25 09:30:19 CEST 2014
-----------------------------------------------------
Utility Id |	SDP Id |   	USDP Id |  	Channel Ref	
-----------------------------------------------------
NornAS	708550500021500221	819158189 	1-5SLG7   	
NornAS	708550500021500221	819158189 	1-5SLG1   	
NornAS	708550500021472047	205609001 	1-9RNN2



MissingData_2014-08-26_09-30-12.txt:
Code:
Missing Data Report
Aug 26 09:30:19 CEST 2014
-----------------------------------------------------
Utility Id |	SDP Id |    	USDP Id |   	Channel Ref	
-----------------------------------------------------
NornAS	708550500021500221	819158189	1-5SLG7   	 
NornAS	708550500021500221	819158189	1-5SLG1   	 
NornAS	708550500021472050	205609001	1-9RNN2



RESULTS IN ONE BIG FILE:
Code:
Missing Data Report
<Current date & time>
-------------------------------------------------------------------------------
Utility Id |	SDP Id |    	USDP Id |   	Channel Ref |	 Source File Name (Date)
-------------------------------------------------------------------------------
NornAS	708550500021500221	819158189 	1-5SLG7   	 2014-08-25
NornAS	708550500021500221	819158189	1-5SLG1   	 2014-08-25
NornAS	708550500021472047	205609001	1-9RNN2   	 2014-08-25
NornAS	708550500021472050	205609001	1-9RNN2   	 2014-08-25

I've tried to make a script using 'cat' to merge the files, but that doesn't help me avoid duplicates, truncate the headers, or add dates after each row:
Code:
for i in /home/Reports/MissingData*
do
    cat "$i" >> MissingData.txt
done


# 2  
Old 09-04-2014
Hi,

The simplest option is to pipe the output file through sort -u. After the loop is complete, add the line:

Code:
 sort -u MissingData.txt > newoutput.txt
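As a quick illustration with made-up rows (not the real report data), sort -u sorts the file and collapses repeated lines in one pass:

```shell
#!/bin/sh
# toy demonstration of sort -u, run in a throw-away directory
cd "$(mktemp -d)" || exit 1
printf 'b\na\nb\nc\na\n' > MissingData.txt
sort -u MissingData.txt > newoutput.txt
cat newoutput.txt
# a
# b
# c
```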

You will have to use other sort options to refine the sort; you can find these by typing:

Code:
man sort

Regards

Dave
# 3  
Old 09-04-2014
Thanks gull04,
This helps me sort the file.
Nevertheless, the header won't be a header any more, but I hope studying the 'sort' man page will enlighten me.


Does anybody know how I might add the date from the file name, as specified in the first post?
# 4  
Old 09-04-2014
Hi,

The simple way would be to build a header file as required, strip the individual headers from your files, output the de-duplicated data to a second file using "sort", and append that data to the header file that you created.
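A minimal sketch of that recipe, assuming the 5-line header shown in the samples (the function and file names are mine, not tested against the real reports):

```shell
#!/bin/sh
# merge_reports DIR OUT -- keep one header, de-duplicate the data rows
merge_reports() {
    dir=$1 out=$2
    # copy the 5-line header from the first file found
    set -- "$dir"/MissingData_*.txt
    head -n 5 "$1" > "$out"
    # strip each file's header, then sort the remaining rows uniquely
    # and append them below the header we kept
    for f in "$dir"/MissingData_*.txt
    do
        tail -n +6 "$f"
    done | sort -u >> "$out"
}
```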

Regards

Dave
# 5  
Old 09-04-2014
If duplicates are removed, which source file should be printed at the end of the line for the one entry remaining?
# 6  
Old 09-05-2014
Quote:
Originally Posted by RudiC
If duplicates are removed, what source file should be printed out a the end of line of the one entry remaining?
The last produced file. That means the text between the MissingData_ and .txt in the file name.
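For what it's worth, that piece can be cut out with plain shell parameter expansion, no external commands needed (the variable names here are just for illustration):

```shell
#!/bin/sh
f=MissingData_2014-08-25_09-30-18.txt
stamp=${f#MissingData_}   # strip the prefix  -> 2014-08-25_09-30-18.txt
stamp=${stamp%.txt}       # strip the suffix  -> 2014-08-25_09-30-18
date=${stamp%%_*}         # keep the date only -> 2014-08-25
echo "$date"
# 2014-08-25
```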
# 7  
Old 09-05-2014
Not sure if this gets even close to what you need:
Code:
awk 'NR<6; FNR<6 {next} {print $0 "\t" substr(FILENAME,13,10)|"sort -rk4 | sort -uk1,4"}' *.txt
Missing Data Report
Aug 25 09:30:19 CEST 2014
-----------------------------------------------------
Utility Id |    SDP Id |       USDP Id |      Channel Ref    
-----------------------------------------------------
NornAS    708550500021472047    205609001    1-9RNN2    2014-08-25
NornAS    708550500021472050    205609001    1-9RNN2    2014-08-26
NornAS    708550500021500221    819158189    1-5SLG1    2014-08-26
NornAS    708550500021500221    819158189    1-5SLG7    2014-08-26
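For anyone dissecting the one-liner, here it is spelled out and run on two tiny mock files (the mock data and temp directory are mine; the substr offsets assume the fixed MissingData_YYYY-MM-DD_ prefix):

```shell
#!/bin/sh
cd "$(mktemp -d)" || exit 1

# two mock reports: a 5-line header, then the same data row in each
for d in 2014-08-25 2014-08-26
do
    {
        printf 'Missing Data Report\nheader2\nheader3\nheader4\nheader5\n'
        printf 'NornAS\t708550500021500221\t819158189\t1-5SLG7\n'
    } > "MissingData_${d}_09-30-18.txt"
done

awk '
    NR  < 6            # print the 5-line header of the first file only
    FNR < 6 { next }   # skip every file-local header
    {
        # tag each data row with the date cut from the file name
        # (chars 13-22 = the YYYY-MM-DD part), then de-duplicate on
        # fields 1-4; the -rk4 pre-sort is meant to keep the latest date
        print $0 "\t" substr(FILENAME, 13, 10) | "sort -rk4 | sort -uk1,4"
    }
' *.txt > merged.txt
```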
