Sort and Remove duplicates

02-16-2015

Here is my task:

I need to sort two input files and remove duplicates from the output:

Sort by 13 characters starting at position 97, ascending
Sort by 1 character at position 96, ascending

If duplicates are found, retain the first one in the file.

The input files have variable-length lines; convert them to 250-character fixed-width records, padded with spaces.

Mainframe equivalent code :

https://www.unix.com/attachments/shel...tes-snipit-jpg

Here is the code I developed:

Code:

sort -nuc -k1.97,1.109 --key=1.96,1.96 file1 file2  | awk '{ printf ("%-250s\n",$0) }' > out.txt

Can any experts validate and correct me if something is wrong?

ysvsr1

02-17-2015

What OS and shell are you using?
What does your data look like?

Are there any spaces or tabs in the first 110 characters of any of your input lines?
What is the maximum line length of a line in your input files?
How big are your input files?

With -u, any lines that compare equal on the sort keys you provide are treated as duplicates of each other. The sort utility makes no statement about which line from a set of lines having identical sort keys in the input files will be copied to the output.

Don Cragun

02-17-2015

A few comments on your statement:

- the -c option would not sort:

Quote:

-c, --check, --check=diagnose-first
check for sorted input; do not sort

- as Don Cragun surmises, any white space before char 96 would start a new field and break your key definitions, which are all relative to field 1. Set the field separator to an exotic char with -t
- you can use the short form -k more than once in a command
- if lines longer than 250 chars can occur (again DC's suspicion), your printf format will not truncate them, so those output lines will be longer than 250 chars; use the precision field as well: "%-250.250s\n"
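As a scaled-down illustration of the last point, here is a minimal sketch using a width of 10 instead of 250. The input lines are made up, and the trailing | is added only to make the padding visible:

```shell
# Width only: short lines are padded, long lines pass through untruncated.
padded=$(printf 'abc\nabcdefghijklmnop\n' | awk '{ printf("%-10s|\n", $0) }')

# Width plus precision: short lines are padded, long lines are cut at 10 chars.
clamped=$(printf 'abc\nabcdefghijklmnop\n' | awk '{ printf("%-10.10s|\n", $0) }')

printf '%s\n' "width only:" "$padded" "width and precision:" "$clamped"
```

The same format with 250 in place of 10 would give the fixed-width records asked for in the first post, assuming trailing (not leading) spaces are wanted.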

RudiC

02-17-2015

OS:
Linux x86_64 x86_64 x86_64 GNU/Linux

Sample Data :

Code:

YSVSR1 Kiladi    12198ASDA 21329180928AFJASDKDKDK ED AEFKF;p FK ADS 2132309183298209381 akfjalksdfjkdajfk j231 1239128390218309

The data contains spaces, tabs, and alphanumeric values.

The sort columns can also contain alphanumeric values.

Are there any spaces or tabs in the first 110 characters of any of your input lines? Yes.

What is the maximum line length of a line in your input files? 211 characters.

How big are your input files? About 25,000 lines in each file.

ysvsr1

02-19-2015

Quote:

Originally Posted by ysvsr1

OS:
Linux x86_64 x86_64 x86_64 GNU/Linux

Sample Data :

Code:

YSVSR1 Kiladi    12198ASDA 21329180928AFJASDKDKDK ED AEFKF;p FK ADS 2132309183298209381 akfjalksdfjkdajfk j231 1239128390218309

In your sample input, the first field is YSVSR1 (the text before the first blank). Since it contains fewer than 96 characters, every line in your input files will have the same, empty sort keys.

If you are trying to use characters #97 through #109 on the line as your primary sort key and character #96 on the line as your secondary sort key, your sort keys would still all be identical. The -n option tells sort to perform a numeric comparison and to stop gathering characters for a key at the first character that is not part of a numeric value. Since characters #96 and #97 on your sample input line are both alphabetic, your sort keys will all be 0 unless you remove the -n option, even if you change the field delimiter to something that does not appear anywhere in your input files.
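A small sketch of that behaviour, using made-up alphabetic lines and assuming a sort that supports -s (GNU sort does):

```shell
words='delta
alpha
charlie'

# With -n every alphabetic key evaluates to 0, so all lines compare equal;
# -s keeps them in input order instead of falling back to a byte comparison.
numeric=$(printf '%s\n' "$words" | LC_ALL=C sort -s -n)

# Without -n the lines are compared as text.
plain=$(printf '%s\n' "$words" | LC_ALL=C sort -s)

# With -n and -u together, the three "equal" lines collapse into one.
unique_count=$(printf '%s\n' "$words" | LC_ALL=C sort -n -u | wc -l | tr -d ' ')

printf '%s\n' "numeric:" "$numeric" "plain:" "$plain" "lines left by -n -u: $unique_count"
```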

And, as has been said before, you can't rely on sort -u if you require that the first line be selected from each set of lines with identical keys. (On some systems, that might happen to be what you get sometimes, but there is no guarantee that that is the line you'll always get.)
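One possible way around this, sketched under several assumptions that the thread does not confirm: sort supports -s (stable sort, as GNU sort does), the byte \001 never occurs in the data, duplicates are defined by the two key ranges together, and trailing spaces are the padding wanted. The helper name dedup_sort and its argument order are hypothetical. A stable sort keeps lines with equal keys in input order, and awk then prints only the first line seen for each key.

```shell
# dedup_sort P_START P_END S_POS WIDTH FILE...
#   P_START..P_END : character range of the primary key
#   S_POS          : character position of the secondary key
#   WIDTH          : fixed output width (pad with spaces, truncate if longer)
dedup_sort() {
    p_start=$1; p_end=$2; s_pos=$3; width=$4
    shift 4
    sep=$(printf '\001')    # assumed absent from the data: whole line = field 1
    LC_ALL=C sort -s -t "$sep" \
        -k "1.$p_start,1.$p_end" -k "1.$s_pos,1.$s_pos" "$@" |
    LC_ALL=C awk -v ps="$p_start" -v pe="$p_end" -v sp="$s_pos" -v w="$width" '
        BEGIN { fmt = "%-" w "." w "s\n" }    # e.g. "%-250.250s\n"
        {
            # Same characters that sort used as keys.
            key = substr($0, ps, pe - ps + 1) SUBSEP substr($0, sp, 1)
            if (!seen[key]++)                 # first line for this key only
                printf fmt, $0
        }'
}

# For the positions in the first post, the call might look like:
#   dedup_sort 97 109 96 250 file1 file2 > out.txt
```

Lines shorter than the key positions simply contribute shorter (possibly empty) keys, so whether that matches the mainframe job's behaviour would need checking against real data.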

So, instead of showing us a sort command line that you know is not giving you what you want, please explain in English exactly what you are trying to do. Also explain what you mean by converting the variable-length input files to 250-character fixed-width files "with padding spaces." Do you want leading spaces to be added, or do you want trailing spaces to be added?

Don Cragun
