How to delete 'duplicated' column values and make a delimited file too?


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting How to delete 'duplicated' column values and make a delimited file too?
# 1  
Old 08-31-2016
How to delete 'duplicated' column values and make a delimited file too?

Hi,

I have the following output from an Oracle SQL statement and I want to remove duplicated column values.

I know it is possible using Oracle analytical/statistical functions but unfortunately I don't know how to use any of those.

So now, I've gone to PLAN B using awk/sed maybe or any other UNIX string tips/tricks.

The original output is as below:

Code:
     CHANGE REQUESTOR            START               END                 STATUS           SERVICE_NO GROUP                     RESOURCE_PERSON
----------- -------------------- ------------------- ------------------- --------------- ----------- ------------------------- --------------------
     153281 User AAA             2016-07-21 23:00:00 2016-07-22 01:00:00 Closed               466814 Support Number 1          Mars
     153282 User ABCDE           2016-07-28 10:00:00 2016-07-28 11:00:00 Closed               466875 Linux                     Martian 01
     153282 User ABCDE           2016-07-28 10:00:00 2016-07-28 11:00:00 Closed               466876 DBA                       Earthling 01
     153283 User BBB             2016-07-28 12:00:00 2016-07-28 15:00:00 Closed               467055 Storage                   Jupiter
     153286 User WXYZ            2016-07-28 18:00:00 2016-08-02 20:00:00 Closed               466877 DBA                       Earthling 02
     153286 User WXYZ            2016-07-28 18:00:00 2016-08-02 20:00:00 Closed               467105 Unix                      Martian 02
     153287 User ABCDEF          2016-08-01 10:00:00 2016-08-01 11:00:00 Closed               466923 Linux                     Martian 01
     153287 User ABCDEF          2016-08-01 10:00:00 2016-08-01 11:00:00 Closed               466924 DBA                       Earthling 01
     153288 User XXX123456       2016-08-12 10:00:00 2016-08-12 11:00:00 Closed               466812 Linux                     Martian 01
     153288 User XXX123456       2016-08-12 10:00:00 2016-08-12 11:00:00 Closed               466813 DBA                       Earthling 01
     153290 User XXXYYYZZZ       2016-08-15 18:30:00 2016-08-15 19:30:00 Closed               467098 Linux                     Martian 01
     153290 User XXXYYYZZZ       2016-08-15 18:30:00 2016-08-15 19:30:00 Closed               467099 DBA                       Earthling 01

Below is the desired output. There are instances where the first five columns are repeated values and where they are repeated values, I want to display those five column values on its first occurrence only.

Desired output below:

Code:
     CHANGE REQUESTOR            START               END                 STATUS           SERVICE_NO GROUP                     RESOURCE_PERSON
----------- -------------------- ------------------- ------------------- --------------- ----------- ------------------------- --------------------
     153281 User AAA             2016-07-21 23:00:00 2016-07-22 01:00:00 Closed               466814 Support Number 1          Mars
     153282 User ABCDE           2016-07-28 10:00:00 2016-07-28 11:00:00 Closed               466875 Linux                     Martian 01
                                                                                              466876 DBA                       Earthling 01
     153283 User BBB             2016-07-28 12:00:00 2016-07-28 15:00:00 Closed               467055 Storage                   Jupiter
     153286 User WXYZ            2016-07-28 18:00:00 2016-08-02 20:00:00 Closed               466877 DBA                       Earthling 02
                                                                                              467105 Unix                      Martian 02
     153287 User ABCDEF          2016-08-01 10:00:00 2016-08-01 11:00:00 Closed               466923 Linux                     Martian 01
                                                                                              466924 DBA                       Earthling 01
     153288 User XXX123456       2016-08-12 10:00:00 2016-08-12 11:00:00 Closed               466812 Linux                     Martian 01
                                                                                              466813 DBA                       Earthling 01
     153290 User XXXYYYZZZ       2016-08-15 18:30:00 2016-08-15 19:30:00 Closed               467098 Linux                     Martian 01
                                                                                              467099 DBA                       Earthling 01

A final thing that I am wanting to do if possible is to have the desired output to be a delimited file, i.e. pipe or comma delimited that I can open from a spreadsheet program. In this case, the repeated column values would have to be replaced by the delimiter character instead.

Any advice much appreciated. Thanks in advance.
# 2  
Old 08-31-2016
Code:
awk '{if(!a[$1 $2 $3 $4 $5]++){print}else{$1=$2=$3=$4=$5=$6=$7=$8="";print}}' filename

This User Gave Thanks to itkamaraj For This Post:
# 3  
Old 08-31-2016
Hello newbie_01,

Could you please try following.
Code:
awk '{if(!A[$1,$2,$3,$4,$5,$6,$7,$8]++){print;next} else {print ""}}'  Input_file

Output will be as follows.
Code:
     CHANGE REQUESTOR            START               END                 STATUS           SERVICE_NO GROUP                     RESOURCE_PERSON
----------- -------------------- ------------------- ------------------- --------------- ----------- ------------------------- --------------------
     153281 User AAA             2016-07-21 23:00:00 2016-07-22 01:00:00 Closed               466814 Support Number 1          Mars
     153282 User ABCDE           2016-07-28 10:00:00 2016-07-28 11:00:00 Closed               466875 Linux                     Martian 01
      
     153283 User BBB             2016-07-28 12:00:00 2016-07-28 15:00:00 Closed               467055 Storage                   Jupiter
     153286 User WXYZ            2016-07-28 18:00:00 2016-08-02 20:00:00 Closed               466877 DBA                       Earthling 02
      
     153287 User ABCDEF          2016-08-01 10:00:00 2016-08-01 11:00:00 Closed               466923 Linux                     Martian 01
      
     153288 User XXX123456       2016-08-12 10:00:00 2016-08-12 11:00:00 Closed               466812 Linux                     Martian 01
      
     153290 User XXXYYYZZZ       2016-08-15 18:30:00 2016-08-15 19:30:00 Closed               467098 Linux                     Martian 01

As addition, in case you don't want to print the new line and want to print only unique values then following may help.
Code:
awk '!A[$1,$2,$3,$4,$5,$6,$7,$8]++'   Input_file

Thanks,
R. Singh
# 4  
Old 08-31-2016
Hi newbie_01,
Maybe something like this would come closer to what you want. The other suggestions don't seem to convert the output to CSV format and don't seem to maintain the fixed-width fields either.
Code:
#!/bin/ksh
awk '
# Function to extract fields from an input line into an array.
function ef(line,	i) {
	# Extract fileds based on column positions and strip leading and
	# trailing whitespace.
	for(i = 1; i <= nf; i++) {
		array[i] = substr(line, P[i], L[i])
		gsub(/^[[:space:]]*|[[:space:]]*$/, "", array[i])
	}
}

# Function to print a line from an array replacing initial fields that
# duplicate data printed on the previous output line with an empty field.
function pl(	dup, i) {
	# Initialize...
	dup = 1	# Set clear duplicates flag.
	for(i = 1; i <= nf; i++) {
		if(dup && array[i] == last[i])
			printf("%s", (i == nf) ? array[i] ORS : OFS)
		else {	printf("%s%s", array[i], (i == nf) ? ORS : OFS)
			last[i] = array[i]
			dup = 0
		}
	}
}
BEGIN {	# Set output field separator before reading any input lines.
	OFS = ","
}
NR == 1 {
	# Save the heading on the 1st line of the file for later processing.
	hl = $0
	next
}
NR == 2 {
	# Calculate the number of fields, their starting positions, and their
	# lengths from the 2nd header line.
	P[1] = 1
	for(i = 1; i <= NF; i++)
		P[i + 1] = P[i] + 1 + (L[i] = length($i))
	nf = NF

	# Extract the headers from the saved header line.
	ef(hl)
	# Print header.
	pl()
	next
}
{	# Extract data from the current input line and print it.
	ef($0)
	pl()
}' file

This will skip printing any leading fields that duplicate data found on the previous line (except the last field on an input line will always be printed).

With your sample input file, this produces the output:
Code:
CHANGE,REQUESTOR,START,END,STATUS,SERVICE_NO,GROUP,RESOURCE_PERSON
153281,User AAA,2016-07-21 23:00:00,2016-07-22 01:00:00,Closed,466814,Support Number 1,Mars
153282,User ABCDE,2016-07-28 10:00:00,2016-07-28 11:00:00,Closed,466875,Linux,Martian 01
,,,,,466876,DBA,Earthling 01
153283,User BBB,2016-07-28 12:00:00,2016-07-28 15:00:00,Closed,467055,Storage,Jupiter
153286,User WXYZ,2016-07-28 18:00:00,2016-08-02 20:00:00,Closed,466877,DBA,Earthling 02
,,,,,467105,Unix,Martian 02
153287,User ABCDEF,2016-08-01 10:00:00,2016-08-01 11:00:00,Closed,466923,Linux,Martian 01
,,,,,466924,DBA,Earthling 01
153288,User XXX123456,2016-08-12 10:00:00,2016-08-12 11:00:00,Closed,466812,Linux,Martian 01
,,,,,466813,DBA,Earthling 01
153290,User XXXYYYZZZ,2016-08-15 18:30:00,2016-08-15 19:30:00,Closed,467098,Linux,Martian 01
,,,,,467099,DBA,Earthling 01

You haven't said what operating system you're using. If you want to run this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk. (Note that neither awk nor nawk on a Solaris system will not work with this script.)
These 2 Users Gave Thanks to Don Cragun For This Post:
# 5  
Old 08-31-2016
To preserve column width, you could try something like this:
Code:
$ awk '{p=sprintf("%.88s",$0); q=sprintf("%88s",x)} A[p]++{sub(p,q)}1' file

or better because not vulnerable to regex interpretation:
Code:
awk '{p=substr($0,1,88); q=substr($0,89); print (A[p]++?sprintf("%88s",x):p) q}' file

The column width can be parameterized:
Code:
awk -v w=88 '{p=sprintf("%." w "s",$0); q=sprintf("%" w "s",x)} A[p]++{sub(p,q)}1' file

or
Code:
awk -v w=88 '{p=substr($0,1,w); q=substr($0,w+1); print (A[p]++?sprintf("%" w "s",x):p) q}' file

respectively...

output
Code:
 
     CHANGE REQUESTOR            START               END                 STATUS           SERVICE_NO GROUP                     RESOURCE_PERSON
----------- -------------------- ------------------- ------------------- --------------- ----------- ------------------------- --------------------
     153281 User AAA             2016-07-21 23:00:00 2016-07-22 01:00:00 Closed               466814 Support Number 1          Mars
     153282 User ABCDE           2016-07-28 10:00:00 2016-07-28 11:00:00 Closed               466875 Linux                     Martian 01
                                                                                              466876 DBA                       Earthling 01
     153283 User BBB             2016-07-28 12:00:00 2016-07-28 15:00:00 Closed               467055 Storage                   Jupiter
     153286 User WXYZ            2016-07-28 18:00:00 2016-08-02 20:00:00 Closed               466877 DBA                       Earthling 02
                                                                                              467105 Unix                      Martian 02
     153287 User ABCDEF          2016-08-01 10:00:00 2016-08-01 11:00:00 Closed               466923 Linux                     Martian 01
                                                                                              466924 DBA                       Earthling 01
     153288 User XXX123456       2016-08-12 10:00:00 2016-08-12 11:00:00 Closed               466812 Linux                     Martian 01
                                                                                              466813 DBA                       Earthling 01
     153290 User XXXYYYZZZ       2016-08-15 18:30:00 2016-08-15 19:30:00 Closed               467098 Linux                     Martian 01
                                                                                              467099 DBA                       Earthling 01

These 2 Users Gave Thanks to Scrutinizer For This Post:
# 6  
Old 09-01-2016
Brilliant!
To save a few CPU cycles, the assignment of a constant to q could be done in a BEGIN section. Wouldn't anchoring p to the begin-of-line reduce the regex vulnerability?

Please be aware that - should a page break repeat the header and underline line - these two would be pruned as well.
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

Replace a column in tab delimited file with column in other tab delimited file,based on match

Hello Everyone.. I want to replace the retail col from FileI with cstp1 col from FileP if the strpno matches in both files FileP.txt ... (2 Replies)
Discussion started by: YogeshG
2 Replies

2. Shell Programming and Scripting

awk script to append suffix to column when column has duplicated values

Please help me to get required output for both scenario 1 and scenario 2 and need separate code for both scenario 1 and scenario 2 Scenario 1 i need to do below changes only when column1 is CR and column3 has duplicates rows/values. This inputfile can contain 100 of this duplicated rows of... (1 Reply)
Discussion started by: as7951
1 Replies

3. UNIX for Dummies Questions & Answers

Sort csv file by duplicated column value

hello, I have a large file (about 1gb) that is in a file similar to the following: I want to make it so that I can put all the duplicates where column 3 (delimited by the commas) are shown on top. Meaning all people with the same age are listed at the top. The command I used was ... (3 Replies)
Discussion started by: jl487
3 Replies

4. Shell Programming and Scripting

Delete an entire column from a tab delimited file

Hi, Can anyone please tell me about how we can delete an entire column from a tab delimited file? Mu input_file.txt looks like this: And I want the output as: I used the below code nawk -v d="1" 'BEGIN{FS=OFS="\t"}{$d=""}{print}' input_file.txtBut in the output, the first column is... (5 Replies)
Discussion started by: sampoorna
5 Replies

5. Shell Programming and Scripting

How to make tab delimited file to space delimited?

Hi How to make tab delimited file to space delimited? in put file: ABC kgy jkh ghj ash kjl o/p file: ABC kgy jkh ghj ash kjl Use code tags, thanks. (1 Reply)
Discussion started by: jagdishrout
1 Replies

6. UNIX for Dummies Questions & Answers

Extracting rows from a space delimited text file based on the values of a column

I have a space delimited text file. I want to extract rows where the third column has 0 as a value and write those rows into a new space delimited text file. How do I go about doing that? Thanks! (2 Replies)
Discussion started by: evelibertine
2 Replies

7. UNIX for Dummies Questions & Answers

How do you delete cells from a space delimited text file given row and column number?

How do you delete cells from a space delimited text file given row and column number? Letś say the row number is r and the column number is c. Thanks! (5 Replies)
Discussion started by: evelibertine
5 Replies

8. Shell Programming and Scripting

Changing one column of delimited file column to fixed width column

Hi, Iam new to unix. I have one input file . Input file : ID1~Name1~Place1 ID2~Name2~Place2 ID3~Name3~Place3 I need output such that only first column should change to fixed width column of 15 characters of length. Output File: ID1<<12 spaces>>Name1~Place1 ID2<<12... (5 Replies)
Discussion started by: manneni prakash
5 Replies

9. Shell Programming and Scripting

Delete first column in tab-delimited text-file

I have a large text-file with tab-delimited genetic data that looks like: KSC112 KSC234 0 0 1 1 A G C T I simply wan to delete the first column, but since the file has 600 000 columns, it is not possible with awk (seems to be limited at 32k columns). Does anyone have an idea how to do this? (2 Replies)
Discussion started by: andmal
2 Replies

10. Shell Programming and Scripting

Delete parts of a string of character in one given column of a tab delimited file

I would like to remove characters from column 7 so that from an input file looking like this: >HWI-EAS422_12:4:1:69:89 GGTTTAAATATTGCACAAAAGGTATAGAGCGT U0 1 0 0 ref_chr8.fa 6527777 F DD I get something like that in an output file: ... (13 Replies)
Discussion started by: matlavmac
13 Replies
Login or Register to Ask a Question