Awk- How to extract duplicate expressions


 
# 1  
Old 09-14-2010

How can I extract every occurrence of a repeated expression, for example both occurrences of CD in the c4 line?

input
Code:
c3       100     120     TF03_X2         +       AABDDAAABDDBCBACDBBC
c4       100     120     TF03_X3         +       AABCDAAABDDBCBACDBBC

Script
Code:
awk '{ for(i=1; i<=NF; i++)  if($5 == "+" && $6 ~/CD/)  {print index($6, "/CD_+/"),"\t",length($6),"\t",$0,"\t", "YCDY"}}' input1 | awk '!a[$0]++'

Output
Code:
0        20      c3      100     120     TF03_X2         +       AABDDAAABDDBCBACDBBC    YCDY
0        20      c4      100     120     TF03_X3         +       AABCDAAABDDBCBACDBBC    YCDY

Needed output
Code:
0        20      c3      100     120     TF03_X2         +       AABDDAAABDDBCBACDBBC    YCDY
0        20      c4      100     120     TF03_X3         +       AABCDAAABDDBCBACDBBC    YCDY
0        20      c4      100     120     TF03_X3         +       AABCDAAABDDBCBACDBBC    YCDY
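
A side note on the 0 in the first column above: awk's index() searches for a literal substring, so index($6, "/CD_+/") looks for the six characters /CD_+/ (slashes included) and finds nothing. match() takes a regular expression and sets RSTART and RLENGTH. A minimal sketch of the difference, where the wrapper name show_cd is hypothetical and only there for illustration:

```shell
# show_cd is a hypothetical helper name used only for this illustration.
show_cd() {
  awk '{
    lit = index($6, "/CD_+/")   # literal search: the slashes are part of the needle
    pos = index($6, "CD")       # literal "CD": position of the first occurrence, 0 if none
    match($6, /.CD./)           # regex search: sets RSTART and RLENGTH
    print $1, lit, pos, RSTART, substr($6, RSTART, RLENGTH)
  }' "$@"
}

# Demo on the c4 line from the input above.
printf 'c4\t100\t120\tTF03_X3\t+\tAABCDAAABDDBCBACDBBC\n' | show_cd
```

For the c4 line this prints `c4 0 4 3 BCDA`. Both index() and a single match() call report only the first occurrence, so finding every occurrence needs a loop.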

# 2  
Old 09-14-2010
Something like this?
Code:
awk 'function prmatch(s,pat,type){x=0;while(match(s,pat)){print (x+RSTART)"\t"$0"\t"type;x+=RSTART+RLENGTH-1;s=substr(s,x+1)}}
     $5 == "+" && $6~/.DC./{prmatch($6,".DC.","YDCY")}
     $5 == "-" && $6~/.CD./{prmatch($6,".CD.","YCDY")} 
     $5 == "+" && $6~/DBCADB/{prmatch($6,"DBCADB","DBCADB")} 
     $5 == "-" && $6~/BDACBD/{prmatch($6,"BDACBD","BDACBD")}' infile

output:
Code:
11      c1      100     120     TF01_X1 +       AABDDAAABDDBCADBDABC    DBCADB
15      c2      100     120     TF02_X2 -       AABDDAAABDDBCBACDBBC    YCDY
15      c3      100     120     TF03_X2 +       AABDDAAABDDBCBADCBBC    YDCY
3       c4      100     120     TF03_X3 +       AABDCAAABDDBCBADCBBC    YDCY
15      c4      100     120     TF03_X3 +       AABDCAAABDDBCBADCBBC    YDCY


Last edited by Scrutinizer; 09-14-2010 at 09:08 AM..
# 3  
Old 09-14-2010
Yes, perfect.

Thanks for solving the /.CD./ part and for modifying the code in your own way.

1. Is it possible to print the original pattern, like ACDB instead of YCDY (+) or BDCA (-)?

2. The important thing I'm looking for: instead of manually assigning all the patterns, can I supply hundreds of patterns in a separate input file and run the same pattern search across input1?

ex:
input2
Code:
 
name1  ACDC
name2  ACDB
name...........
name100 DDDD

Thanks for your time and answer.

Last edited by bumblebee_2010; 09-14-2010 at 09:49 AM.. Reason: one more question
# 4  
Old 09-14-2010
Sure.. e.g:
Code:
awk 'function prmatch(pat){x=0;s=$6;p=pat;gsub("Y",".",p);while(match(s,p))
     {print (x+RSTART)"\t"$0"\t"substr(s,RSTART,RLENGTH);x+=RSTART+RLENGTH-1;s=substr(s,x+1)}}
     $5 == "+" {prmatch("DBCADB");prmatch("YDCY")}
     $5 == "-" {prmatch("BDACBD");prmatch("YCDY")}' infile

Output:
Code:
11      c1      100     120     TF01_X1 +       AABDDAAABDDBCADBDABC    DBCADB
15      c2      100     120     TF02_X2 -       AABDDAAABDDBCBACDBBC    ACDB
15      c3      100     120     TF03_X2 +       AABDDAAABDDBCBADCBBC    ADCB
3       c4      100     120     TF03_X3 +       AABDCAAABDDBCBADCBBC    BDCA
15      c4      100     120     TF03_X3 +       AABDCAAABDDBCBADCBBC    ADCB

# 5  
Old 09-14-2010
That's helpful and handy.
And what about the second question?

Quote:
2. The important thing I'm looking for: instead of manually assigning all the patterns, can I supply hundreds of patterns in a separate input file and run the same pattern search across input1?

ex:
input2


name1 ACDC
name2 ACDB
name...........
name100 DDDD
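
The thread never answers this second question. One possible approach is sketched below, using the common NR==FNR idiom to load the pattern file before scanning the data file. Assumptions that are not from the thread: the pattern file has two whitespace-separated columns (name, motif), every motif is tried on every line regardless of the strand in column 5, Y is treated as a one-character wildcard as in post #4, and the wrapper name scan_motifs is hypothetical.

```shell
# scan_motifs PATTERN_FILE DATA_FILE   (hypothetical helper; both arguments are placeholders)
scan_motifs() {
  awk '
  NR == FNR {                      # first file: remember each motif under its name
    if (NF >= 2) { names[++n] = $1; pats[n] = $2 }
    next
  }
  {                                # second file: try every motif on column 6
    for (i = 1; i <= n; i++) {
      p = pats[i]; gsub("Y", ".", p)   # Y becomes the regex wildcard "."
      s = $6; x = 0                    # x = characters already consumed from $6
      while (match(s, p)) {
        print (x + RSTART) "\t" $0 "\t" names[i] "\t" substr(s, RSTART, RLENGTH)
        x += RSTART + RLENGTH - 1
        s = substr(s, RSTART + RLENGTH)  # continue right after this match
      }
    }
  }' "$1" "$2"
}

# Demo with two patterns from input2 and the c4 line from the first post.
pat=$(mktemp); dat=$(mktemp)
printf 'name1 ACDC\nname2 ACDB\n' > "$pat"
printf 'c4\t100\t120\tTF03_X3\t+\tAABCDAAABDDBCBACDBBC\n' > "$dat"
scan_motifs "$pat" "$dat"
rm -f "$pat" "$dat"
```

The demo prints one line: position 15, the original record, then name2 and ACDB. Lines in the pattern file with fewer than two fields are skipped. An empty pattern file would defeat the NR==FNR test, so that case needs a separate guard.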
# 6  
Old 09-16-2010
Hey, I think the previous code is throwing several errors. Could you please take a look?
For example, with this input I'm not getting all the output values correctly:

Code:
a10   123233201       123233307       name1  -       BDACADBADBADBABBBACDBDDBBCADBCABDBCCCDCCCABABACCAACBDDCAABCABDDBBDABAABACDDBBADCBAADDCDCACDCDCACAACCAADBAB
a10   123269482       123269673       name1  -       ABCBADCBCCDCACCBBCCCADCCDCCAABCCBBACDBCCBBCAAADBCCDCCACABDBBDCBBABBABACBDABABDDDBDCDBCAABBDDDACABDBADBCCCABCCCCACADCCABDBBADCAABCACBDBBAAAABAACBBCABDAAADACBBBCCCBACBBBCDBCCCDACCDCAABBDDCDCAAB
a1    26610463        26610660        name1  +       BDBBCDBCBCCAABBCBBCABAABABBCBCCCBABBABBCBCCBBABBACBCBBCCCBBBCBBCBBACBABCCDCABCDBCDBCACBBDBCBBBCADCDBDAABDBBDDCAACBDBCBCADBBBBDDCBBCDDCCDBDCCADBACCBCCCBCBCCBBBBDCBCBCDCBACCCCCCABDBBADBDCDDDBDBCACCAB

# 7  
Old 09-16-2010
I am getting:
Code:
18      a10   123233201       123233307       name1  -       BDACADBADBADBABBBACDBDDBBCADBCABDBCCCDCCCABABACCAACBDDCAABCABDDBBDABAABACDDBBADCBAADDCDCACDCDCACAACCAADBAB ACDB
36      a10   123233201       123233307       name1  -       BDACADBADBADBABBBACDBDDBBCADBCABDBCCCDCCCABABACCAACBDDCAABCABDDBBDABAABACDDBBADCBAADDCDCACDCDCACAACCAADBAB CCDC
51      a10   123233201       123233307       name1  -       BDACADBADBADBABBBACDBDDBBCADBCABDBCCCDCCCABABACCAACBDDCAABCABDDBBDABAABACDDBBADCBAADDCDCACDCDCACAACCAADBAB ACDD
9       a10   123269482       123269673       name1  -       ABCBADCBCCDCACCBBCCCADCCDCCAABCCBBACDBCCBBCAAADBCCDCCACABDBBDCBBABBABACBDABABDDDBDCDBCAABBDDDACABDBADBCCCABCCCCACADCCABDBBADCAABCACBDBBAAAABAACBBCABDAAADACBBBCCCBACBBBCDBCCCDACCDCAABBDDCDCAAB   CCDC
23      a10   123269482       123269673       name1  -       ABCBADCBCCDCACCBBCCCADCCDCCAABCCBBACDBCCBBCAAADBCCDCCACABDBBDCBBABBABACBDABABDDDBDCDBCAABBDDDACABDBADBCCCABCCCCACADCCABDBBADCAABCACBDBBAAAABAACBBCABDAAADACBBBCCCBACBBBCDBCCCDACCDCAABBDDCDCAAB   CCDC
37      a10   123269482       123269673       name1  -       ABCBADCBCCDCACCBBCCCADCCDCCAABCCBBACDBCCBBCAAADBCCDCCACABDBBDCBBABBABACBDABABDDDBDCDBCAABBDDDACABDBADBCCCABCCCCACADCCABDBBADCAABCACBDBBAAAABAACBBCABDAAADACBBBCCCBACBBBCDBCCCDACCDCAABBDDCDCAAB   CCDC
44      a10   123269482       123269673       name1  -       ABCBADCBCCDCACCBBCCCADCCDCCAABCCBBACDBCCBBCAAADBCCDCCACABDBBDCBBABBABACBDABABDDDBDCDBCAABBDDDACABDBADBCCCABCCCCACADCCABDBBADCAABCACBDBBAAAABAACBBCABDAAADACBBBCCCBACBBBCDBCCCDACCDCAABBDDCDCAAB   DCDB
89      a10   123269482       123269673       name1  -       ABCBADCBCCDCACCBBCCCADCCDCCAABCCBBACDBCCBBCAAADBCCDCCACABDBBDCBBABBABACBDABABDDDBDCDBCAABBDDDACABDBADBCCCABCCCCACADCCABDBBADCAABCACBDBBAAAABAACBBCABDAAADACBBBCCCBACBBBCDBCCCDACCDCAABBDDCDCAAB   BCDB
73      a1    26610463        26610660        name1  +       BDBBCDBCBCCAABBCBBCABAABABBCBCCCBABBABBCBCCBBABBACBCBBCCCBBBCBBCBBACBABCCDCABCDBCDBCACBBDBCBBBCADCDBDAABDBBDDCAACBDBCBCADBBBBDDCBBCDDCCDBDCCADBACCBCCCBCBCCBBBBDCBCBCDCBACCCCCCABDBBADBDCDDDBDBCACCAB     CDCA
96      a1    26610463        26610660        name1  +       BDBBCDBCBCCAABBCBBCABAABABBCBCCCBABBABBCBCCBBABBACBCBBCCCBBBCBBCBBACBABCCDCABCDBCDBCACBBDBCBBBCADCDBDAABDBBDDCAACBDBCBCADBBBBDDCBBCDDCCDBDCCADBACCBCCCBCBCCBBBBDCBCBCDCBACCCCCCABDBBADBDCDDDBDBCACCAB     ADCD
107     a1    26610463        26610660        name1  +       BDBBCDBCBCCAABBCBBCABAABABBCBCCCBABBABBCBCCBBABBACBCBBCCCBBBCBBCBBACBABCCDCABCDBCDBCACBBDBCBBBCADCDBDAABDBBDDCAACBDBCBCADBBBBDDCBBCDDCCDBDCCADBACCBCCCBCBCCBBBBDCBCBCDCBACCCCCCABDBBADBDCDDDBDBCACCAB     BDCD

Could you indicate what is wrong?
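
One likely cause, found by tracing the first line by hand: prmatch shortens s after every match, but cuts it with substr(s,x+1), where x is the offset counted from the start of the original string. From the third match on a line, too much is cut away, so matches are skipped and the reported positions drift. In the first sequence ACDD really starts at column 72, and the trace reproduces the 51 shown above. A sketch of a corrected loop that cuts relative to the current s follows. The patterns mirror post #4, and the wrapper name find_motifs is hypothetical.

```shell
# find_motifs is a hypothetical wrapper name; the awk body follows post #4.
find_motifs() {
  awk '
  function prmatch(pat,    s, p, x) {   # s, p, x are local variables
    x = 0; s = $6; p = pat
    gsub("Y", ".", p)                   # Y stands for any single character
    while (match(s, p)) {
      print (x + RSTART) "\t" $0 "\t" substr(s, RSTART, RLENGTH)
      x += RSTART + RLENGTH - 1         # absolute offset of the end of this match
      s = substr(s, RSTART + RLENGTH)   # cut relative to the current s, not to x
    }
  }
  $5 == "+" { prmatch("DBCADB"); prmatch("YDCY") }
  $5 == "-" { prmatch("BDACBD"); prmatch("YCDY") }
  ' "$@"
}

# Demo: first input line from post #6, showing only position and matched text.
printf 'a10\t123233201\t123233307\tname1\t-\tBDACADBADBADBABBBACDBDDBBCADBCABDBCCCDCCCABABACCAACBDDCAABCABDDBBDABAABACDDBBADCBAADDCDCACDCDCACAACCAADBAB\n' |
  find_motifs | cut -f1,8
```

For that line the sketch reports ACDB at 18, CCDC at 36, ACDD at 72, DCDC at 85 and ACDC at 89. Matches are still non-overlapping, so a motif that starts inside a previous match (here DCDC at 91) is not reported.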