Awk- How to extract duplicate expressions


 
# 8  
Old 09-16-2010
Yes. Sorry for not doing that in my previous post.

For example, take [a1 26610463 26610660 name1]. The script should print all the substrings that have "DC" inside (bold), but it picks up only a few of them, not all.

Code:
BDBBCDBCBCCAABBCBBCABAABABBCBCCCBABBABBCBCCBBABBACBCBBCCCBBBCBBCBBACBABCCDCABCDBCDBCACBBDBCBBBCADCDBDAABDBBDDCAACBDBCBCADBBBBDDCBBCDDCCDBDCCADBACCBCCCBCBCCBBBBDCBCBCDCBACCCCCCABDBBADBDCDDDBDBCACCAB     CDCA

# 9  
Old 09-16-2010
I found the bug... This should work better:
Code:
awk 'function prmatch(pat) { x=0; s=$6; p=pat; gsub("Y",".",p)   # Y = any one character
       while (match(s,p)) {                                      # x = characters already consumed
         print (x+RSTART) "\t" $0 "\t" substr(s,RSTART,RLENGTH)
         x += RSTART+RLENGTH-1; s = substr(s,RSTART+RLENGTH) } } # continue after the match
     $5 == "+" { prmatch("DBCADB"); prmatch("YDCY") }
     $5 == "-" { prmatch("BDACBD"); prmatch("YCDY") }' infile

output:
Code:
18      a10   123233201       123233307       name1  -       BDACADBADBADBABBBACDBDDBBCADBCABDBCCCDCCCABABACCAACBDDCAABCABDDBBDABAABACDDBBADCBAADDCDCACDCDCACAACCAADBAB ACDB
36      a10   123233201       123233307       name1  -       BDACADBADBADBABBBACDBDDBBCADBCABDBCCCDCCCABABACCAACBDDCAABCABDDBBDABAABACDDBBADCBAADDCDCACDCDCACAACCAADBAB CCDC
72      a10   123233201       123233307       name1  -       BDACADBADBADBABBBACDBDDBBCADBCABDBCCCDCCCABABACCAACBDDCAABCABDDBBDABAABACDDBBADCBAADDCDCACDCDCACAACCAADBAB ACDD
85      a10   123233201       123233307       name1  -       BDACADBADBADBABBBACDBDDBBCADBCABDBCCCDCCCABABACCAACBDDCAABCABDDBBDABAABACDDBBADCBAADDCDCACDCDCACAACCAADBAB DCDC
89      a10   123233201       123233307       name1  -       BDACADBADBADBABBBACDBDDBBCADBCABDBCCCDCCCABABACCAACBDDCAABCABDDBBDABAABACDDBBADCBAADDCDCACDCDCACAACCAADBAB ACDC
9       a10   123269482       123269673       name1  -       ABCBADCBCCDCACCBBCCCADCCDCCAABCCBBACDBCCBBCAAADBCCDCCACABDBBDCBBABBABACBDABABDDDBDCDBCAABBDDDACABDBADBCCCABCCCCACADCCABDBBADCAABCACBDBBAAAABAACBBCABDAAADACBBBCCCBACBBBCDBCCCDACCDCAABBDDCDCAAB CCDC
23      a10   123269482       123269673       name1  -       ABCBADCBCCDCACCBBCCCADCCDCCAABCCBBACDBCCBBCAAADBCCDCCACABDBBDCBBABBABACBDABABDDDBDCDBCAABBDDDACABDBADBCCCABCCCCACADCCABDBBADCAABCACBDBBAAAABAACBBCABDAAADACBBBCCCBACBBBCDBCCCDACCDCAABBDDCDCAAB CCDC
35      a10   123269482       123269673       name1  -       ABCBADCBCCDCACCBBCCCADCCDCCAABCCBBACDBCCBBCAAADBCCDCCACABDBBDCBBABBABACBDABABDDDBDCDBCAABBDDDACABDBADBCCCABCCCCACADCCABDBBADCAABCACBDBBAAAABAACBBCABDAAADACBBBCCCBACBBBCDBCCCDACCDCAABBDDCDCAAB ACDB
49      a10   123269482       123269673       name1  -       ABCBADCBCCDCACCBBCCCADCCDCCAABCCBBACDBCCBBCAAADBCCDCCACABDBBDCBBABBABACBDABABDDDBDCDBCAABBDDDACABDBADBCCCABCCCCACADCCABDBBADCAABCACBDBBAAAABAACBBCABDAAADACBBBCCCBACBBBCDBCCCDACCDCAABBDDCDCAAB CCDC
82      a10   123269482       123269673       name1  -       ABCBADCBCCDCACCBBCCCADCCDCCAABCCBBACDBCCBBCAAADBCCDCCACABDBBDCBBABBABACBDABABDDDBDCDBCAABBDDDACABDBADBCCCABCCCCACADCCABDBBADCAABCACBDBBAAAABAACBBCABDAAADACBBBCCCBACBBBCDBCCCDACCDCAABBDDCDCAAB DCDB
167     a10   123269482       123269673       name1  -       ABCBADCBCCDCACCBBCCCADCCDCCAABCCBBACDBCCBBCAAADBCCDCCACABDBBDCBBABBABACBDABABDDDBDCDBCAABBDDDACABDBADBCCCABCCCCACADCCABDBBADCAABCACBDBBAAAABAACBBCABDAAADACBBBCCCBACBBBCDBCCCDACCDCAABBDDCDCAAB BCDB
172     a10   123269482       123269673       name1  -       ABCBADCBCCDCACCBBCCCADCCDCCAABCCBBACDBCCBBCAAADBCCDCCACABDBBDCBBABBABACBDABABDDDBDCDBCAABBDDDACABDBADBCCCABCCCCACADCCABDBBADCAABCACBDBBAAAABAACBBCABDAAADACBBBCCCBACBBBCDBCCCDACCDCAABBDDCDCAAB CCDA
176     a10   123269482       123269673       name1  -       ABCBADCBCCDCACCBBCCCADCCDCCAABCCBBACDBCCBBCAAADBCCDCCACABDBBDCBBABBABACBDABABDDDBDCDBCAABBDDDACABDBADBCCCABCCCCACADCCABDBBADCAABCACBDBBAAAABAACBBCABDAAADACBBBCCCBACBBBCDBCCCDACCDCAABBDDCDCAAB CCDC
185     a10   123269482       123269673       name1  -       ABCBADCBCCDCACCBBCCCADCCDCCAABCCBBACDBCCBBCAAADBCCDCCACABDBBDCBBABBABACBDABABDDDBDCDBCAABBDDDACABDBADBCCCABCCCCACADCCABDBBADCAABCACBDBBAAAABAACBBCABDAAADACBBBCCCBACBBBCDBCCCDACCDCAABBDDCDCAAB DCDC
73      a1    26610463        26610660        name1  +       BDBBCDBCBCCAABBCBBCABAABABBCBCCCBABBABBCBCCBBABBACBCBBCCCBBBCBBCBBACBABCCDCABCDBCDBCACBBDBCBBBCADCDBDAABDBBDDCAACBDBCBCADBBBBDDCBBCDDCCDBDCCADBACCBCCCBCBCCBBBBDCBCBCDCBACCCCCCABDBBADBDCDDDBDBCACCAB   CDCA
96      a1    26610463        26610660        name1  +       BDBBCDBCBCCAABBCBBCABAABABBCBCCCBABBABBCBCCBBABBACBCBBCCCBBBCBBCBBACBABCCDCABCDBCDBCACBBDBCBBBCADCDBDAABDBBDDCAACBDBCBCADBBBBDDCBBCDDCCDBDCCADBACCBCCCBCBCCBBBBDCBCBCDCBACCCCCCABDBBADBDCDDDBDBCACCAB   ADCD
108     a1    26610463        26610660        name1  +       BDBBCDBCBCCAABBCBBCABAABABBCBCCCBABBABBCBCCBBABBACBCBBCCCBBBCBBCBBACBABCCDCABCDBCDBCACBBDBCBBBCADCDBDAABDBBDDCAACBDBCBCADBBBBDDCBBCDDCCDBDCCADBACCBCCCBCBCCBBBBDCBCBCDCBACCCCCCABDBBADBDCDDDBDBCACCAB   DDCA
126     a1    26610463        26610660        name1  +       BDBBCDBCBCCAABBCBBCABAABABBCBCCCBABBABBCBCCBBABBACBCBBCCCBBBCBBCBBACBABCCDCABCDBCDBCACBBDBCBBBCADCDBDAABDBBDDCAACBDBCBCADBBBBDDCBBCDDCCDBDCCADBACCBCCCBCBCCBBBBDCBCBCDCBACCCCCCABDBBADBDCDDDBDBCACCAB   DDCB
132     a1    26610463        26610660        name1  +       BDBBCDBCBCCAABBCBBCABAABABBCBCCCBABBABBCBCCBBABBACBCBBCCCBBBCBBCBBACBABCCDCABCDBCDBCACBBDBCBBBCADCDBDAABDBBDDCAACBDBCBCADBBBBDDCBBCDDCCDBDCCADBACCBCCCBCBCCBBBBDCBCBCDCBACCCCCCABDBBADBDCDDDBDBCACCAB   DDCC
137     a1    26610463        26610660        name1  +       BDBBCDBCBCCAABBCBBCABAABABBCBCCCBABBABBCBCCBBABBACBCBBCCCBBBCBBCBBACBABCCDCABCDBCDBCACBBDBCBBBCADCDBDAABDBBDDCAACBDBCBCADBBBBDDCBBCDDCCDBDCCADBACCBCCCBCBCCBBBBDCBCBCDCBACCCCCCABDBBADBDCDDDBDBCACCAB   BDCC
159     a1    26610463        26610660        name1  +       BDBBCDBCBCCAABBCBBCABAABABBCBCCCBABBABBCBCCBBABBACBCBBCCCBBBCBBCBBACBABCCDCABCDBCDBCACBBDBCBBBCADCDBDAABDBBDDCAACBDBCBCADBBBBDDCBBCDDCCDBDCCADBACCBCCCBCBCCBBBBDCBCBCDCBACCCCCCABDBBADBDCDDDBDBCACCAB   BDCB
165     a1    26610463        26610660        name1  +       BDBBCDBCBCCAABBCBBCABAABABBCBCCCBABBABBCBCCBBABBACBCBBCCCBBBCBBCBBACBABCCDCABCDBCDBCACBBDBCBBBCADCDBDAABDBBDDCAACBDBCBCADBBBBDDCBBCDDCCDBDCCADBACCBCCCBCBCCBBBBDCBCBCDCBACCCCCCABDBBADBDCDDDBDBCACCAB   CDCB
183     a1    26610463        26610660        name1  +       BDBBCDBCBCCAABBCBBCABAABABBCBCCCBABBABBCBCCBBABBACBCBBCCCBBBCBBCBBACBABCCDCABCDBCDBCACBBDBCBBBCADCDBDAABDBBDDCAACBDBCBCADBBBBDDCBBCDDCCDBDCCADBACCBCCCBCBCCBBBBDCBCBCDCBACCCCCCABDBBADBDCDDDBDBCACCAB   BDCD
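
For anyone following along: prmatch relies on awk's match() function, which returns 0 when nothing matches and otherwise sets RSTART (the 1-based start of the leftmost match) and RLENGTH (its length). The first column of the output is that start position plus the number of characters already cut off the front of the sequence. A minimal sketch of match() on its own; show_match is a hypothetical helper name used only for illustration:

```shell
# Hypothetical helper: report where the leftmost match of pattern $2 starts in string $1.
show_match() {
  awk -v s="$1" -v p="$2" 'BEGIN {
    if (match(s, p))
      print RSTART, RLENGTH, substr(s, RSTART, RLENGTH)   # start, length, matched text
    else
      print "no match"
  }'
}

show_match "AABCDAX" ".CD."
```

On this made-up string it should print `3 4 BCDA`: the match starts at the third character and is four characters long.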

# 10  
Old 09-16-2010
It is working very well, though it still does not pick up something like this:

Code:
ACDBCDY = this contains 2 integrated "CD"s

ACDB
BCDY

or vice versa with the others. The script gives ACDB but not BCDY.
# 11  
Old 09-17-2010
We'll just have to subtract 1 so that the next substring is extended one character to the left:
Code:
awk 'function prmatch(pat) { x=0; s=$6; p=pat; gsub("Y",".",p)     # Y = any one character
       while (match(s,p)) {                                        # x = characters already consumed
         print (x+RSTART) "\t" $0 "\t" substr(s,RSTART,RLENGTH)
         x += RSTART+RLENGTH-2; s = substr(s,RSTART+RLENGTH-1) } } # keep the last matched character
     $5 == "+" { prmatch("DBCADB"); prmatch("YDCY") }
     $5 == "-" { prmatch("BDACBD"); prmatch("YCDY") }' infile


Last edited by Scrutinizer; 09-17-2010 at 03:49 AM..
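
A side note that is not from the original posts: stepping back one character catches matches that share a single letter, but two YCDY matches can share two letters (ACDCDB contains both ACDC and DCDB), and DBCADB can overlap itself the same way. One possible variation, assuming every overlapping match should be reported, is to resume the search one character past the start of each match instead of near its end. This is only a sketch; the helper name find_overlaps, the variable names and the sample record are made up for illustration, and it prints just the position and the match rather than the whole record:

```shell
# Hypothetical helper: list every (possibly overlapping) match of pattern $1 in field 6.
find_overlaps() {
  awk -v want="$1" '
    function prmatch(pat,    s, off, p) {   # s, off, p are local variables
      s = $6; off = 0; p = pat
      gsub("Y", ".", p)                     # Y is the wildcard, as in the thread
      while (match(s, p)) {
        print (off + RSTART) "\t" substr(s, RSTART, RLENGTH)
        off += RSTART                       # characters dropped from the front so far
        s = substr(s, RSTART + 1)           # resume one character after the match start
      }
    }
    { prmatch(want) }'
}

# Made-up record in the same six-column layout as the thread.
printf 'a1\t1\t2\tname1\t-\tACDCDBCDA\n' | find_overlaps "YCDY"
```

For this sample the sketch should report ACDC at position 1, DCDB at position 3 and BCDA at position 6. The cost is more calls to match() on long sequences.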
# 12  
Old 09-17-2010
One last thing.
Your script works great for matching and finding YCDY/YDCY and the others.
Is it possible to change the script so that it prints the output that does not match YCDY/YDCY?
I tried

Code:
 
!YCDY

but I think this is the wrong way?
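
The thread has no reply to this, so what follows is only a sketch under one reading of the question: print the records whose sequence (field 6) contains none of the patterns for its strand. As far as I can tell, a bare !YCDY in awk is read as the negation of an unset variable named YCDY rather than as a pattern test, so the pattern has to be written as a regular expression with the Y wildcard turned into a dot and negated with !~. The function name no_match and the sample records are hypothetical:

```shell
# Hypothetical helper: print records with no pattern match in field 6.
no_match() {
  awk '
    $5 == "+" && $6 !~ /DBCADB|.DC./ { print }   # plus strand: neither DBCADB nor YDCY
    $5 == "-" && $6 !~ /BDACBD|.CD./ { print }   # minus strand: neither BDACBD nor YCDY
  '
}

# Made-up sample records, same column layout as in the thread.
printf '%s\n' \
  'a1 1 2 name1 + ABDCA' \
  'a2 1 2 name1 + ABCDA' \
  'a3 1 2 name1 - ABCDA' \
  'a4 1 2 name1 - ABDCA' | no_match
```

With these samples only the a2 and a4 records should come out. If the goal is instead to print the parts of a sequence that lie between matches, a different approach would be needed.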