How to delete 'duplicated' column values and make a delimited file too?


08-31-2016


Hi,

I have the following output from an Oracle SQL statement and I want to remove duplicated column values.

I know it is possible using Oracle analytical/statistical functions but unfortunately I don't know how to use any of those.

So I've moved to plan B: awk, sed, or any other UNIX text-processing tricks.

The original output is as below:

Code:

     CHANGE REQUESTOR            START               END                 STATUS           SERVICE_NO GROUP                     RESOURCE_PERSON
----------- -------------------- ------------------- ------------------- --------------- ----------- ------------------------- --------------------
     153281 User AAA             2016-07-21 23:00:00 2016-07-22 01:00:00 Closed               466814 Support Number 1          Mars
     153282 User ABCDE           2016-07-28 10:00:00 2016-07-28 11:00:00 Closed               466875 Linux                     Martian 01
     153282 User ABCDE           2016-07-28 10:00:00 2016-07-28 11:00:00 Closed               466876 DBA                       Earthling 01
     153283 User BBB             2016-07-28 12:00:00 2016-07-28 15:00:00 Closed               467055 Storage                   Jupiter
     153286 User WXYZ            2016-07-28 18:00:00 2016-08-02 20:00:00 Closed               466877 DBA                       Earthling 02
     153286 User WXYZ            2016-07-28 18:00:00 2016-08-02 20:00:00 Closed               467105 Unix                      Martian 02
     153287 User ABCDEF          2016-08-01 10:00:00 2016-08-01 11:00:00 Closed               466923 Linux                     Martian 01
     153287 User ABCDEF          2016-08-01 10:00:00 2016-08-01 11:00:00 Closed               466924 DBA                       Earthling 01
     153288 User XXX123456       2016-08-12 10:00:00 2016-08-12 11:00:00 Closed               466812 Linux                     Martian 01
     153288 User XXX123456       2016-08-12 10:00:00 2016-08-12 11:00:00 Closed               466813 DBA                       Earthling 01
     153290 User XXXYYYZZZ       2016-08-15 18:30:00 2016-08-15 19:30:00 Closed               467098 Linux                     Martian 01
     153290 User XXXYYYZZZ       2016-08-15 18:30:00 2016-08-15 19:30:00 Closed               467099 DBA                       Earthling 01

In some rows the first five columns repeat the values of the previous row. Where that happens, I want to display those five values on their first occurrence only and leave them blank on the repeated rows.

Desired output:

Code:

     CHANGE REQUESTOR            START               END                 STATUS           SERVICE_NO GROUP                     RESOURCE_PERSON
----------- -------------------- ------------------- ------------------- --------------- ----------- ------------------------- --------------------
     153281 User AAA             2016-07-21 23:00:00 2016-07-22 01:00:00 Closed               466814 Support Number 1          Mars
     153282 User ABCDE           2016-07-28 10:00:00 2016-07-28 11:00:00 Closed               466875 Linux                     Martian 01
                                                                                              466876 DBA                       Earthling 01
     153283 User BBB             2016-07-28 12:00:00 2016-07-28 15:00:00 Closed               467055 Storage                   Jupiter
     153286 User WXYZ            2016-07-28 18:00:00 2016-08-02 20:00:00 Closed               466877 DBA                       Earthling 02
                                                                                              467105 Unix                      Martian 02
     153287 User ABCDEF          2016-08-01 10:00:00 2016-08-01 11:00:00 Closed               466923 Linux                     Martian 01
                                                                                              466924 DBA                       Earthling 01
     153288 User XXX123456       2016-08-12 10:00:00 2016-08-12 11:00:00 Closed               466812 Linux                     Martian 01
                                                                                              466813 DBA                       Earthling 01
     153290 User XXXYYYZZZ       2016-08-15 18:30:00 2016-08-15 19:30:00 Closed               467098 Linux                     Martian 01
                                                                                              467099 DBA                       Earthling 01

Finally, if possible, I would like the desired output as a delimited file (pipe- or comma-delimited) that I can open in a spreadsheet program. In that case, the repeated column values would become empty fields, leaving only the delimiters.

Any advice much appreciated. Thanks in advance.

newbie_01


08-31-2016


Code:

awk '{if(!a[$1 $2 $3 $4 $5]++){print}else{$1=$2=$3=$4=$5=$6=$7=$8="";print}}' filename
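One thing to keep in mind with this approach: assigning to any field makes awk rebuild the record, joining the fields with OFS (a single space by default), so the original column spacing is not preserved on the modified lines. A minimal sketch with made-up input (the variable name `out` is only for illustration):

```shell
# Made-up input with three spaces between the fields.
out=$(printf '%s\n' 'a   b   c' | awk '{ $1 = ""; print }')
# Assigning to $1 rebuilds $0 as "" OFS "b" OFS "c".
printf '[%s]\n' "$out"    # prints "[ b c]"
```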


itkamaraj


08-31-2016


Hello newbie_01,

Could you please try the following?

Code:

awk '{if(!A[$1,$2,$3,$4,$5,$6,$7,$8]++){print;next} else {print ""}}'  Input_file

Output will be as follows.

Code:

     CHANGE REQUESTOR            START               END                 STATUS           SERVICE_NO GROUP                     RESOURCE_PERSON
----------- -------------------- ------------------- ------------------- --------------- ----------- ------------------------- --------------------
     153281 User AAA             2016-07-21 23:00:00 2016-07-22 01:00:00 Closed               466814 Support Number 1          Mars
     153282 User ABCDE           2016-07-28 10:00:00 2016-07-28 11:00:00 Closed               466875 Linux                     Martian 01
      
     153283 User BBB             2016-07-28 12:00:00 2016-07-28 15:00:00 Closed               467055 Storage                   Jupiter
     153286 User WXYZ            2016-07-28 18:00:00 2016-08-02 20:00:00 Closed               466877 DBA                       Earthling 02
      
     153287 User ABCDEF          2016-08-01 10:00:00 2016-08-01 11:00:00 Closed               466923 Linux                     Martian 01
      
     153288 User XXX123456       2016-08-12 10:00:00 2016-08-12 11:00:00 Closed               466812 Linux                     Martian 01
      
     153290 User XXXYYYZZZ       2016-08-15 18:30:00 2016-08-15 19:30:00 Closed               467098 Linux                     Martian 01

In addition, if you don't want the blank lines and only want the first row for each unique set of values, the following may help.

Code:

awk '!A[$1,$2,$3,$4,$5,$6,$7,$8]++'   Input_file
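For anyone unfamiliar with the idiom: `A[key]++` yields the count before it is incremented, so `!A[key]++` is true only the first time a key is seen, and a true pattern with no action prints the line. A tiny sketch with made-up data keyed on just two fields (the array name `seen` and the variable `out` are only for illustration):

```shell
# Made-up sample: the second line repeats the key of the first ("153282", "User").
out=$(printf '%s\n' '153282 User 466875' '153282 User 466876' '153283 User 467055' |
  awk '!seen[$1,$2]++')   # true only on the first occurrence of each key
printf '%s\n' "$out"
```

This keeps the first and third lines and drops the second one entirely, so it suits de-duplicating whole rows rather than blanking columns.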

Thanks,
R. Singh

RavinderSingh13


08-31-2016


Hi newbie_01,
Maybe something like this would come closer to what you want. The other suggestions don't seem to convert the output to CSV format and don't seem to maintain the fixed-width fields either.

Code:

#!/bin/ksh
awk '
# Function to extract fields from an input line into an array.
function ef(line,	i) {
	# Extract fields based on column positions and strip leading and
	# trailing whitespace.
	for(i = 1; i <= nf; i++) {
		array[i] = substr(line, P[i], L[i])
		gsub(/^[[:space:]]*|[[:space:]]*$/, "", array[i])
	}
}

# Function to print a line from an array replacing initial fields that
# duplicate data printed on the previous output line with an empty field.
function pl(	dup, i) {
	# Initialize...
	dup = 1	# Set clear duplicates flag.
	for(i = 1; i <= nf; i++) {
		if(dup && array[i] == last[i])
			printf("%s", (i == nf) ? array[i] ORS : OFS)
		else {	printf("%s%s", array[i], (i == nf) ? ORS : OFS)
			last[i] = array[i]
			dup = 0
		}
	}
}
BEGIN {	# Set output field separator before reading any input lines.
	OFS = ","
}
NR == 1 {
	# Save the heading on the 1st line of the file for later processing.
	hl = $0
	next
}
NR == 2 {
	# Calculate the number of fields, their starting positions, and their
	# lengths from the 2nd header line.
	P[1] = 1
	for(i = 1; i <= NF; i++)
		P[i + 1] = P[i] + 1 + (L[i] = length($i))
	nf = NF

	# Extract the headers from the saved header line.
	ef(hl)
	# Print header.
	pl()
	next
}
{	# Extract data from the current input line and print it.
	ef($0)
	pl()
}' file

This will skip printing any leading fields that duplicate data found on the previous line (except the last field on an input line will always be printed).

With your sample input file, this produces the output:

Code:

CHANGE,REQUESTOR,START,END,STATUS,SERVICE_NO,GROUP,RESOURCE_PERSON
153281,User AAA,2016-07-21 23:00:00,2016-07-22 01:00:00,Closed,466814,Support Number 1,Mars
153282,User ABCDE,2016-07-28 10:00:00,2016-07-28 11:00:00,Closed,466875,Linux,Martian 01
,,,,,466876,DBA,Earthling 01
153283,User BBB,2016-07-28 12:00:00,2016-07-28 15:00:00,Closed,467055,Storage,Jupiter
153286,User WXYZ,2016-07-28 18:00:00,2016-08-02 20:00:00,Closed,466877,DBA,Earthling 02
,,,,,467105,Unix,Martian 02
153287,User ABCDEF,2016-08-01 10:00:00,2016-08-01 11:00:00,Closed,466923,Linux,Martian 01
,,,,,466924,DBA,Earthling 01
153288,User XXX123456,2016-08-12 10:00:00,2016-08-12 11:00:00,Closed,466812,Linux,Martian 01
,,,,,466813,DBA,Earthling 01
153290,User XXXYYYZZZ,2016-08-15 18:30:00,2016-08-15 19:30:00,Closed,467098,Linux,Martian 01
,,,,,467099,DBA,Earthling 01

You haven't said what operating system you're using. If you want to run this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk. (Note that neither awk nor nawk on a Solaris system will work with this script.)
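Since the original question also mentioned a pipe-delimited file, here is a condensed sketch of the same idea with the delimiter passed in through `-v OFS=...` instead of being set in BEGIN. It is not the script above, only an illustration under assumptions: the three-column sample, the shell variables `sample` and `out`, and the function name `emit` are all made up, and it trims plain spaces only.

```shell
# Made-up three-column sample in the same layout: heading, dashes, data.
sample=' CHANGE REQUESTOR  SERVICE_NO
------- ---------- ----------
 153282 User ABCDE     466875
 153282 User ABCDE     466876
 153283 User BBB       467055'

out=$(printf '%s\n' "$sample" | awk -v OFS='|' '
# Print one line as delimited fields, blanking leading fields that repeat
# the previous line (the last field is always printed).
function emit(line,    i, f, dup, res) {
    dup = 1
    res = ""
    for (i = 1; i <= n; i++) {
        f = substr(line, start[i], len[i])
        gsub(/^ +| +$/, "", f)              # trim spaces only
        if (dup && f == last[i]) {
            if (i < n) f = ""
        } else {
            last[i] = f
            dup = 0
        }
        res = (i > 1 ? res OFS : "") f
    }
    print res
}
NR == 1 { hdr = $0; next }                  # hold the heading until widths are known
NR == 2 {                                   # dashes line: start and length of each column
    pos = 1
    for (i = 1; i <= NF; i++) {
        start[i] = pos
        len[i] = length($i)
        pos += len[i] + 1
    }
    n = NF
    emit(hdr)
    next
}
{ emit($0) }')
printf '%s\n' "$out"
```

Changing `-v OFS='|'` to `-v OFS=','` gives comma-delimited output. Fields that themselves contain the delimiter would need quoting, which this sketch does not attempt.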


Don Cragun


08-31-2016


To preserve column width, you could try something like this:

Code:

awk '{p=sprintf("%.88s",$0); q=sprintf("%88s",x)} A[p]++{sub(p,q)}1' file

or, better, because it is not vulnerable to regex interpretation of the data:

Code:

awk '{p=substr($0,1,88); q=substr($0,89); print (A[p]++?sprintf("%88s",x):p) q}' file

The column width can be parameterized:

Code:

awk -v w=88 '{p=sprintf("%." w "s",$0); q=sprintf("%" w "s",x)} A[p]++{sub(p,q)}1' file

or

Code:

awk -v w=88 '{p=substr($0,1,w); q=substr($0,w+1); print (A[p]++?sprintf("%" w "s",x):p) q}' file

respectively.

Output:

Code:

 
     CHANGE REQUESTOR            START               END                 STATUS           SERVICE_NO GROUP                     RESOURCE_PERSON
----------- -------------------- ------------------- ------------------- --------------- ----------- ------------------------- --------------------
     153281 User AAA             2016-07-21 23:00:00 2016-07-22 01:00:00 Closed               466814 Support Number 1          Mars
     153282 User ABCDE           2016-07-28 10:00:00 2016-07-28 11:00:00 Closed               466875 Linux                     Martian 01
                                                                                              466876 DBA                       Earthling 01
     153283 User BBB             2016-07-28 12:00:00 2016-07-28 15:00:00 Closed               467055 Storage                   Jupiter
     153286 User WXYZ            2016-07-28 18:00:00 2016-08-02 20:00:00 Closed               466877 DBA                       Earthling 02
                                                                                              467105 Unix                      Martian 02
     153287 User ABCDEF          2016-08-01 10:00:00 2016-08-01 11:00:00 Closed               466923 Linux                     Martian 01
                                                                                              466924 DBA                       Earthling 01
     153288 User XXX123456       2016-08-12 10:00:00 2016-08-12 11:00:00 Closed               466812 Linux                     Martian 01
                                                                                              466813 DBA                       Earthling 01
     153290 User XXXYYYZZZ       2016-08-15 18:30:00 2016-08-15 19:30:00 Closed               467098 Linux                     Martian 01
                                                                                              467099 DBA                       Earthling 01
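The 88 in these commands is the combined width of the first five columns in the sample (11+20+19+19+15 characters plus four separating spaces). If the column widths may change, one possible variation is to derive that width from the dashes line instead of hard-coding it. This is a sketch under assumptions: the heading is on line 1 and the dashes on line 2, `k` (a made-up variable) is the number of leading columns to blank, and the narrow three-column sample is invented.

```shell
# Made-up three-column sample in the same layout: heading, dashes, data.
sample=' CHANGE REQUESTOR  SERVICE_NO
------- ---------- ----------
 153282 User ABCDE     466875
 153282 User ABCDE     466876
 153283 User BBB       467055'

out=$(printf '%s\n' "$sample" | awk -v k=2 '
NR == 2 {                           # dashes line: add up the first k column widths
    w = k - 1                       # single spaces between the k columns
    for (i = 1; i <= k; i++) w += length($i)
}
NR <= 2 { print; next }             # heading lines pass through untouched
{
    p = substr($0, 1, w)            # leading columns, compared as plain text
    q = substr($0, w + 1)           # rest of the line, always kept
    print (seen[p]++ ? sprintf("%" w "s", "") : p) q
}')
printf '%s\n' "$out"
```

With `k=2` the computed width is 18 here; for the real report `k=5` should give 88.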


Scrutinizer


09-01-2016


Brilliant!
To save a few CPU cycles, the assignment of a constant to q could be done in a BEGIN section. Wouldn't anchoring p to the begin-of-line reduce the regex vulnerability?

Please be aware that, should a page break repeat the header and underline lines, these two would be pruned as well.
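Both points can be sketched together, building on the substr variant so that no regex is applied to the data at all. Assumptions here: heading lines can be recognised by the patterns `/^ *CHANGE /` and `/^-+ /` (a guess from the sample, not something stated in the thread), and the sample with a repeated heading, the width 18, and the names `blank`, `sample` and `out` are invented for illustration.

```shell
# Made-up sample in which a page break repeats the heading and dashes.
sample=' CHANGE REQUESTOR  SERVICE_NO
------- ---------- ----------
 153282 User ABCDE     466875
 153282 User ABCDE     466876
 CHANGE REQUESTOR  SERVICE_NO
------- ---------- ----------
 153283 User BBB       467055'

out=$(printf '%s\n' "$sample" | awk -v w=18 '
BEGIN { blank = sprintf("%" w "s", "") }   # constant padding, built once
/^ *CHANGE / || /^-+ / { print; next }     # assumed heading patterns: never blank these
{
    p = substr($0, 1, w)
    print (seen[p]++ ? blank : p) substr($0, w + 1)
}')
printf '%s\n' "$out"
```

The repeated heading and dashes come through intact, while the second 153282 row still has its leading columns blanked.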

RudiC
