Issues with filtering duplicate records using gawk script


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Issues with filtering duplicate records using gawk script
# 1  
Old 04-09-2009
Issues with filtering duplicate records using gawk script

Hi All,

I have huge trade file with milions of trades.I need to remove duplicate records (e.g I have following records)
30/10/2009,trdeId1,..,..
26/10/2009.tradeId1,..,..,,
30/10/2009,tradeId2,..

In the above case i need to filter duplicate recods and I should get following output.
30/10/2009,trdeId1,..,..
30/10/2009,tradeId2,..
(trade record with latest COB date)
COB -closed of business day

I need to handle following three conditions.
1.Trade file will be sorted in ascending order on first two columns(COB date and trade id)
2.Trade file will be sorted in descending order on first two columns(COB date and trade id)
3.Trade file may not have duplicate records.
In all the above condition my code should work.

I have written following code.but it doen't seems to be working.As i m new to awk can anybody help me in getting this.

#!/usr/bin/gawk
BEGIN {
FS = ","
}
END {
print prevLine;
}
{
if( FNR == 1)
{
prevDate=$1;
prevSourceTradeId=$2;
prevLine=$0;
}
else
{
if(prevSourceTradeId==$2)
{
if((compareDate(prevDate,$1) == 1))
{
print prevLine;
flag=true
}
else
{
prevDate=$1;
prevLine=$0;
prevSourceTradeId=$2;
print prevLine;
flag=true
}
}
else
{
if(flag)
{
prevDate=$1;
prevSourceTradeId=$2;
prevLine=$0;
}
else print prevLine;
prevDate=$1;
prevSourceTradeId=$2;
prevLine=$0;
flag=false;
}
}
}
}
function compareDate(lhsDate, rhsDate)
{
lhsSize = split(lhsDate, lhsFields, "/");
rhsSize = split(rhsDate, rhsFields, "/");
if(lhsSize != rhsSize)
{
print "Invalid prevDate " lhsDate " "rhsDate;
return 0;
}
for(i = rhsSize; i > 0; i--)
{
if(lhsFields[i] > rhsFields[i]) return 1;
}
return 0;
}
{

Trade.txt file
30/03/2009,17981-G,MIDAS,,FX Euro Option,,MELLON ADM,MELLON ADM,DBSA,DBSA,26/03/2009,84450.7476,30/03/2009,,4200000,BRL,,,USD,C,B,26/05/2009,139,USD,199061.35,BRL,BRL,BRZ,,,,26/05/2009,,
30/03/2009,17980-G,MIDAS,,FX Euro Option,,MELLON ADM,MELLON ADM,DBSA,DBSA,26/03/2009,183108.5122,30/03/2009,,6600000,BRL,,,USD,C,B,26/05/2009,137,USD,374182.77,BRL,BRL,BRZ,,,,26/05/2009,,
30/03/2009,17978-G,MIDAS,,FX Euro Option,,QUEST MACRO 30,QUEST MACRO 30,DBSA,DBSA,24/03/2009,-7841.8551,30/03/2009,,-390000,BRL,,,USD,C,S,26/05/2009,139,USD,-20803.77,BRL,BRL,BRZ,,,,26/05/2009,,
30/03/2009,17977-G,MIDAS,,FX Euro Option,,ADVANTAGE QUEST,ADVANTAGE QUEST,DBSA,DBSA,24/03/2009,-1709.1223,30/03/2009,,-85000,BRL,,,USD,C,S,26/05/2009,139,USD,-4534.15,BRL,BRL,BRZ,,,,26/05/2009,,
30/03/2009,17976-G,MIDAS,,FX Euro Option,,QUEST90 FIM,QUEST90 FIM,DBSA,DBSA,24/03/2009,-9651.514,30/03/2009,,-480000,BRL,,,USD,C,S,26/05/2009,139,USD,-25604.64,BRL,BRL,BRZ,,,,26/05/2009,,
30/03/2009,17975-G,MIDAS,,FX Euro Option,,QUESTX FIM,QUESTX FIM,DBSA,DBSA,24/03/2009,-8042.9283,30/03/2009,,-400000,BRL,,,USD,C,S,26/05/2009,139,USD,-21337.2,BRL,BRL,BRZ,,,,26/05/2009,,
30/03/2009,17974-G,MIDAS,,FX Euro Option,,MELLONQUEST30,MELLONQUEST30,DBSA,DBSA,24/03/2009,-51173.1316,30/03/2009,,-2545000,BRL,,,USD,C,S,26/05/2009,139,USD,-135757.93,BRL,BRL,BRZ,,,,26/05/2009,,
30/03/2009,17973-G,MIDAS,,FX Euro Option,,MELLONQUEST I,MELLONQUEST I,DBSA,DBSA,24/03/2009,-6032.1963,30/03/2009,,-300000,BRL,,,USD,C,S,26/05/2009,139,USD,-16002.9,BRL,BRL,BRZ,,,,26/05/2009,,
30/03/2009,17972-G,MIDAS,,FX Euro Option,,QUEST MACRO 30,QUEST MACRO 30,DBSA,DBSA,24/03/2009,-16923.6655,30/03/2009,,-610000,BRL,,,USD,C,S,26/05/2009,137,USD,-34583.55,BRL,BRL,BRZ,,,,26/05/2009,,
30/03/2009,17971-G,MIDAS,,FX Euro Option,,QUEST90 FIM,QUEST90 FIM,DBSA,DBSA,24/03/2009,-21085.2226,30/03/2009,,-760000,BRL,,,USD,C,S,26/05/2009,137,USD,-43087.7,BRL,BRL,BRZ,,,,26/05/2009,,
30/03/2009,17970-G,MIDAS,,FX Euro Option,,QUESTX FIM,QUESTX FIM,DBSA,DBSA,24/03/2009,-17201.1027,30/03/2009,,-620000,BRL,,,USD,C,S,26/05/2009,137,USD,-35150.49,BRL,BRL,BRZ,,,,26/05/2009,,
30/03/2009,17969-G,MIDAS,,FX Euro Option,,MELLONQUEST30,MELLONQUEST30,DBSA,DBSA,24/03/2009,-110974.8559,30/03/2009,,-4000000,BRL,,,USD,C,S,26/05/2009,137,USD,-226777.44,BRL,BRL,BRZ,,,,26/05/2009,,
30/03/2009,17968-G,MIDAS,,FX Euro Option,,MELLONQUEST I,MELLONQUEST I,DBSA,DBSA,24/03/2009,-13316.9827,30/03/2009,,-480000,BRL,,,USD,C,S,26/05/2009,137,USD,-27213.28,BRL,BRL,BRZ,,,,26/05/2009,,
30/03/2009,17962-G,MIDAS,,FX Euro Option,,ADVANTAGE QUEST,ADVANTAGE QUEST,DBSA,DBSA,23/03/2009,-3606.6828,30/03/2009,,-130000,BRL,,,USD,C,S,26/05/2009,137,USD,-7370.26,BRL,BRL,BRZ,,,,26/05/2009,,
30/03/2009,17960-G,MIDAS,,FX Euro Option,,MELGLOMKTFICFIM,MELGLOMKTFICFIM,DBSA,DBSA,18/03/2009,-149704.8449,30/03/2009,,-15000000,BRL,,,USD,C,S,01/04/2011,2.3,USD,-6880999.5,BRL,BRL,BRZ,,,,01/04/2011,,
30/03/2009,17959-G,MIDAS,,FX Euro Option,,MELGLOMKTFICFIM,MELGLOMKTFICFIM,DBSA,DBSA,18/03/2009,-435720.3749,30/03/2009,,-20000000,BRL,,,USD,C,S,03/01/2011,2,USD,-12858000,BRL,BRL,BRZ,,,,03/01/2011,,
30/03/2009,17958-G,MIDAS,,FX Euro Option,,MELGLOMKTFICFIM,MELGLOMKTFICFIM,DBSA,DBSA,18/03/2009,-256346.867,30/03/2009,,-15000000,BRL,,,USD,C,S,03/01/2011,2.2,USD,-7200000,BRL,BRL,BRZ,,,,03/01/2011,,
30/03/2009,17957-G,MIDAS,,FX Euro Option,,MELGLOMKTFICFIM,MELGLOMKTFICFIM,DBSA,DBSA,18/03/2009,-762701.3198,30/03/2009,,-30000000,BRL,,,USD,C,S,01/07/2010,2,USD,-16455000,BRL,BRL,BRZ,,,,01/07/2010,,
30/03/2009,17956-G,MIDAS,,FX Euro Option,,MELGLOMKTFICFIM,MELGLOMKTFICFIM,DBSA,DBSA,18/03/2009,-269765.1783,30/03/2009,,-15000000,BRL,,,USD,C,S,01/07/2010,2.2,USD,-5856999,BRL,BRL,BRZ,,,,01/07/2010,,
# 2  
Old 04-10-2009
whats the problem you are facing??
1.Trade file will be sorted in ascending order on first two columns(COB date and trade id)
2.Trade file will be sorted in descending order on first two columns(COB date and trade id)
i didn't understand above two condition..
why do you want it to sort it in both ways ??
# 3  
Old 04-10-2009
Issues with filtering duplicate records using gawk script

Thanks for your reply.
consider trade file is sorted based on COB date(Close of business date)
e.g
30/10/2009,tradeid1,..,..
26/10/2009,tradeId1,..,..
30/10/2009,tradeId2,..,..
26/10/2009,tradeId2,..,..

Now I want to remove duplicate records by comparing trade ids and COB dates.
If trade ids are same then i need latest COB date record.
30/10/2009,tradeid1,..,..
30/10/2009,tradeId1,..,..

also gawk script which i wrote it should work even if no duplicate records are there.

the script which i wrote works fine when no duplicate records are there but it doest not work if duplicate records are there.
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

Filtering records of a csv file based on a value of a column

Hi, I tried filtering the records in a csv file using "awk" command listed below. awk -F"~" '$4 ~ /Active/{print }' inputfile > outputfile The output always has all the entries. The same command worked for different users from one of the forum links. content of file I was... (3 Replies)
Discussion started by: sunilmudikonda
3 Replies

2. Shell Programming and Scripting

Need help with filtering records in a file

Hi, I have following records in a file more file1.txt setting applicaction ABC for user setting applicaction CDE for user setting applicaction XXX for user logging applicaction XXX for user I need to filter out records which have strings " setting... (5 Replies)
Discussion started by: manid
5 Replies

3. Shell Programming and Scripting

Perl: filtering lines based on duplicate values in a column

Hi I have a file like this. I need to eliminate lines with first column having the same value 10 times. 13 18 1 + chromosome 1, 122638287 AGAGTATGGTCGCGGTTG 13 18 1 + chromosome 1, 128904080 AGAGTATGGTCGCGGTTG 13 18 1 - chromosome 14, 13627938 CAACCGCGACCATACTCT 13 18 1 + chromosome 1,... (5 Replies)
Discussion started by: polsum
5 Replies

4. Linux

Need awk script for removing duplicate records

I have log file having Traffic line 2011-05-21 15:11:50.356599 TCP (6), length: 52) 10.10.10.1.3020 > 10.10.10.254.50404: 2011-05-21 15:11:50.652739 TCP (6), length: 52) 10.10.10.254.50404 > 10.10.10.1.3020: 2011-05-21 15:11:50.652558 TCP (6), length: 89) 10.10.10.1.3020 >... (1 Reply)
Discussion started by: Rastamed
1 Replies

5. Shell Programming and Scripting

Filtering issues while using nawk

Hi, I am currently filtering a file that has multiple sets of data. An example of some of the data is as follows; Sat Oct 2 07:42:45 2010 01:33:46 R1_CAR_12.34 Sun Oct 3 13:09:53 2010 00:02:34 R2_BUS_56.78 Sun Oct 3 21:11:39 2010 00:43:21 R3_TRAIN_COACH_90.12 Mon Oct 4... (1 Reply)
Discussion started by: crunchie
1 Replies

6. Shell Programming and Scripting

Filtering Issues Using sed and awk

Hi, I am currently using the sed and awk commands to filter a file that has multiple sets of data in different columns. An example of part of the file i am filtering is as follows; Sat Oct 2 07:42:45 2010 01:33:46 R1_CAR_12.34 Sun Oct 3 13:09:53 2010 00:02:34 R2_BUS_56.78 Sun... (4 Replies)
Discussion started by: crunchie
4 Replies

7. Shell Programming and Scripting

Filtering issues with multiple columns in a single file

Hi, I am new to unix and would greatly appreciate some help. I have a file containing multiple colums containing different sets of data e.g. File 1: John Ireland 27_December_69 Mary England 13_March_55 Mike France 02_June_80 I am currently using the awk... (10 Replies)
Discussion started by: crunchie
10 Replies

8. Linux

Need awk script for removing duplicate records

I have huge txt file having millions of trade data. For e.g Trade.txt (first 8 lines in the file is header info) COB_DATE,TRADE_ID,SOURCE_SYSTEM_TRADE_ID,TRADE_GROUP_ID, TRADE_TYPE,DEALER_NAME,EXTERNAL_COUNTERPARTY_ID, EXTERNAL_COUNTERPARTY_NAME,DB_COUNTERPARTY_ID,... (6 Replies)
Discussion started by: nmumbarkar
6 Replies

9. UNIX for Dummies Questions & Answers

Filtering records of a file based on a value of a column

Hi all, I would like to extract records of a file based on a condition. The file contains 47 fields, and I would like to extract only those records that match a certain value in one of the columns, e.g. COL1 COL2 COL3 ............... COL47 1 XX 45 ... (4 Replies)
Discussion started by: risk_sly
4 Replies

10. UNIX for Advanced & Expert Users

Filtering duplicate lines

Does anybody know a command that filters duplicate lines out of a file. Similar to the uniq command but can handle duplicate lines no matter where they occur in a file? (9 Replies)
Discussion started by: AreaMan
9 Replies
Login or Register to Ask a Question