Finding duplicates from positioned substring across lines


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Finding duplicates from positioned substring across lines
# 1  
Old 12-23-2008
Question Finding duplicates from positioned substring across lines

I have million's of records each containing exactly 50 characters and have to check the uniqueness of 4 character substring of 50 character (postion known prior) and report if any duplicates are found.

Eg. data...

AAAA00000000000000XXXX0000 0000000000... upto50 chars
AAAA00000000000000XXXY0000 0000000000... upto50 chars
AAAA00000000000000XXXY0000 0000000000... upto50 chars

output:
Duplicates are found for XXXY.

I'm new to unix scripting. Can anyone provide me direction?

~GAP
# 2  
Old 12-23-2008
Code:
awk '{ arr[substr($0,50,4))]++ } 
      END { for (i in arr) { if (arr[i]>1) {print arr[i], i}}}' inputfile

# 3  
Old 12-24-2008
Code:
nawk '{
str=substr($0,19,4)
_[str]++
}
END{
  for(i in _)
    if(_[i]>1)
       print "Duplicated found for "i
}' a.txt

Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

Finding a word through substring in a file

I have a text file that has some data like: PADHOGOA1 IOP055_VINREG5_1 ( .IO(VINREG5_1), .MONI(), .MON_D(px_IOP055_VINREG5_1_MON_D), .R0T(px_IOP054_VINREG5_0_R0T), .IO1() ); PADV30MA0 IOP056_VOUT3_IN ( .IO(VOUT3_IN), .V30M(px_IOP056_VOUT3_IN_V30M)); PADV30MA0 IOP057_VOUT3_OUT (... (2 Replies)
Discussion started by: utkarshkhanna44
2 Replies

2. Shell Programming and Scripting

Finding duplicates in a file excluding specific pattern

I have unix file like below >newuser newuser <hello hello newone I want to find the unique values in the file(excluding <,>),so that the out put should be >newuser <hello newone can any body tell me what is command to get this new file. (7 Replies)
Discussion started by: shiva2985
7 Replies

3. UNIX for Dummies Questions & Answers

Finding duplicates then copying, almost there, maybe?

Hi everyone. I'm trying to help my wife with a project, she has exported 200 images from many different folders, unfortunately there was a problem with the export and I need to find the master versions so that she doesn't have to go through and select them again. I need to: For each image in... (2 Replies)
Discussion started by: Rhinoskin
2 Replies

4. Shell Programming and Scripting

finding duplicates in csv based on key columns

Hi team, I have 20 columns csv files. i want to find the duplicates in that file based on the column1 column10 column4 column6 coulnn8 coulunm2 . if those columns have same values . then it should be a duplicate record. can one help me on finding the duplicates, Thanks in advance. ... (2 Replies)
Discussion started by: baskivs
2 Replies

5. Shell Programming and Scripting

Help finding non duplicates

I am currently creating a script to find filenames that are listed once in an input file (find non duplicates). I then want to report those single files in another file. Here is the function that I have so far: function dups_filenames { file2="" file1="" file="" dn="" ch="" pn="" ... (6 Replies)
Discussion started by: chipblah84
6 Replies

6. Shell Programming and Scripting

How to delete lines in a file that have duplicates or derive the lines that aper once

Input: a b b c d d I need: a c I know how to get this (the lines that have duplicates) : b d sort file | uniq -d But i need opossite of this. I have searched the forum and other places as well, but have found solution for everything except this variant of the problem. (3 Replies)
Discussion started by: necroman08
3 Replies

7. Shell Programming and Scripting

Finding longest common substring among filenames

I will be performing a task on several directories, each containing a large number of files (2500+) that follow a regular naming convention: YYYY_MM_DD_XX.foo_bar.A.B.some_different_stuff.EXT What I would like to do is automatically discover the part of the filenames that are common to all... (1 Reply)
Discussion started by: cmcnorgan
1 Replies

8. Shell Programming and Scripting

finding duplicates in columns and removing lines

I am trying to figure out how to scan a file like so: 1 ralphs office","555-555-5555","ralph@mail.com","www.ralph.com 2 margies office","555-555-5555","ralph@mail.com","www.ralph.com 3 kims office","555-555-5555","kims@mail.com","www.ralph.com 4 tims... (17 Replies)
Discussion started by: totus
17 Replies

9. Shell Programming and Scripting

finding the last substring...

hii, i want to know the shell command for finding the last occurance of a substring in string.. i can use grep command or sed to find out the occurance of a substring in a string but how do i find out the last occurance.shud i use grep amd and cut the string everytime and store it in a new... (7 Replies)
Discussion started by: cutelucks
7 Replies

10. Shell Programming and Scripting

finding duplicates with perl

I have a huge file (over 30mb) that I am processing through with perl. I am pulling out a list of filenames and placing it in an array called @reports. I am fine up till here. What I then want to do is go through the array and find any duplicates. If there is a duplicate, output it to the screen.... (3 Replies)
Discussion started by: dangral
3 Replies
Login or Register to Ask a Question