Shell Programming and Scripting BSD, Linux, and UNIX shell scripting — Post awk, bash, csh, ksh, perl, php, python, sed, sh, shell scripts, and other shell scripting languages questions here.

Finding duplicates from positioned substring across lines

#1  12-23-2008  gapprasath (Registered User)

I have millions of records, each exactly 50 characters long. I need to check that a 4-character substring at a fixed position (known in advance) is unique across all records, and report any duplicates that are found.

Eg. data...

AAAA00000000000000XXXX0000 0000000000... up to 50 chars
AAAA00000000000000XXXY0000 0000000000... up to 50 chars
AAAA00000000000000XXXY0000 0000000000... up to 50 chars

output:
Duplicates are found for XXXY.

I'm new to unix scripting. Can anyone provide me direction?

~GAP
#2  12-23-2008  jim mcnamara (Forum Staff)

Code:
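# start = 1-based column where the 4-character key begins (19 matches the sample data)
awk -v start=19 '{ arr[substr($0, start, 4)]++ }
      END { for (i in arr) { if (arr[i]>1) {print arr[i], i}}}' inputfile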
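
To try the counting approach on the records from the question, here is a minimal sketch. It assumes the key starts at column 19, as it does in the sample data. The file name sample.txt and the function name count_dups are made up for illustration, and the sample lines are not padded out to the full 50 characters.

```shell
# Hypothetical sample file modeled on the records in the question (padding shortened)
cat > sample.txt <<'EOF'
AAAA00000000000000XXXX0000
AAAA00000000000000XXXY0000
AAAA00000000000000XXXY0000
EOF

# count_dups FILE START: print "count key" for every 4-character key seen more than once
count_dups() {
  awk -v start="$2" '{ arr[substr($0, start, 4)]++ }
       END { for (i in arr) if (arr[i] > 1) print arr[i], i }' "$1"
}

count_dups sample.txt 19   # prints: 2 XXXY
```

The array holds one entry per distinct key, not one per record, so memory use depends on how many different keys appear rather than on the number of lines.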

#3  12-24-2008  summer_cherry (Forum Advisor)

Code:
nawk '{
str=substr($0,19,4)
_[str]++
}
END{
  for(i in _)
    if(_[i]>1)
       print "Duplicates are found for " i "."
}' a.txt
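
If awk is not a requirement, the same check can probably be done with standard text tools. The sketch below assumes the key occupies columns 19 to 22, as in the sample data; the function name dup_keys is hypothetical. It cuts out the key, sorts the keys, and lets uniq -d print each key that occurs more than once.

```shell
# dup_keys [FILE...]: print every 4-character key (columns 19-22) that appears on
# more than one line; reads standard input when no file is given
dup_keys() {
  cut -c19-22 "$@" | sort | uniq -d
}

# Sample records from the question (padding shortened), fed on standard input
printf '%s\n' \
  'AAAA00000000000000XXXX0000' \
  'AAAA00000000000000XXXY0000' \
  'AAAA00000000000000XXXY0000' | dup_keys   # prints: XXXY
```

Unlike the awk versions, this sorts one key per input line, so it trades the in-memory array for a sort pass over the extracted keys.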
