Scanning columns for duplicates and printing in one line


 
# 1  
Old 02-23-2010

Description of data:
Code:
NC_002737.1  4  F1VI4M001A3IAU F1VI4M001A3IAU F1VI4M001A3IAU F1VI4M001A3IAU
NC_006372.1  5  F1VI4M001BH0HY F1VI4M001BH0HY F1VI4M001C0ZC5 F1VI4M001DOF2X F1VI4M001AYNTS

Every field in every record is tab-separated.
There can be "n" columns.

Problem:
What I want to achieve is the following:
Code:
NC_002737.1 4 F1VI4M001A3IAU 4
NC_006372.1 5 F1VI4M001BH0HY 2 F1VI4M001C0ZC5 1 F1VI4M001DOF2X 1 F1VI4M001AYNTS 1

So far this is what is happening:
Code:
awk 'BEGIN{OFS="\t";cnt=0}{if (NF>3) {for (i=3;i<=NF;i++) {if ($(i)==$(i+1)) {cnt = cnt+1} print $1,$(i),cnt}}cnt=1}'

Output:
Code:
NC_002737.1     F1VI4M001A3IAU  2
NC_002737.1     F1VI4M001A3IAU  3
NC_002737.1     F1VI4M001A3IAU  4
NC_002737.1     F1VI4M001A3IAU  4
NC_006372.1     F1VI4M001BH0HY  2
NC_006372.1     F1VI4M001C0ZC5  1
NC_006372.1     F1VI4M001DOF2X  1
NC_006372.1     F1VI4M001AYNTS  1

I know the problem: the for loop walks the fields one by one and prints a line for each of them. I have not been able to come up with a way to get the desired output (on one line) as shown above.
Can anyone suggest a way? I could write another awk script to post-process this, but I am wondering if there is a way to get it all fixed in one go.

Thanks

# 2  
Old 02-23-2010
Quote:
I know the problem: the for loop walks the fields one by one and prints a line for each of them. I have not been able to come up with a way to get the desired output (on one line) as shown above.
Use printf instead of print: printf does not append the output record separator, so you can keep appending to the current line and print the newline yourself once the record is done.
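A minimal sketch of that idea, assuming the duplicate IDs are always adjacent (as in your sample data) and that every record has at least three fields; infile is a placeholder for the input file:

Code:
awk 'BEGIN { FS = OFS = "\t" }
{
  printf "%s%s%s", $1, OFS, $2      # start the line with the two key fields
  prev = ""; cnt = 0
  for (i = 3; i <= NF; i++)
    if ($i == prev) cnt++           # same ID as the previous field: bump its count
    else {
      if (cnt) printf "%s%s%s%s", OFS, prev, OFS, cnt  # flush the finished ID and its count
      prev = $i; cnt = 1
    }
  printf "%s%s%s%s%s", OFS, prev, OFS, cnt, RS         # flush the last ID, then end the line
}' infile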
# 3  
Old 02-23-2010
With awk[1]:


Code:
awk '{
  split(x, c); j = 0                # empty the count array c (with GNU awk: delete c)
  for (i = 0; ++i <= NF;)
    if (i < 3) printf "%s\t", $i    # pass the two key fields through
    else c[$i]++ || n[++j] = $i     # count every ID; remember each new one in first-seen order
  for (i = 0; ++i <= j;)
    printf "%s", (n[i] OFS c[n[i]] (i < j ? OFS : RS))   # ID/count pairs; newline after the last
  }' OFS='\t' infile

Perl:

Code:
perl -lane'
  %_ = (); $_{$F[$_]}++ for 2..$#F;
  print join "\t", @F[0..1], map {$_, $_{$_}} keys %_;
  ' infile

[1]. Use GNU awk (gawk), Bell Labs New awk (nawk) or X/Open awk (/usr/xpg4/bin/awk) on Solaris.
[2]. I'm assuming that every record has at least three fields. If you need something different, let me know.
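For reference, the awk version lists the IDs in the order they first appear, so on the sample data it reproduces the requested (tab-separated) output exactly:

Code:
NC_002737.1 4 F1VI4M001A3IAU 4
NC_006372.1 5 F1VI4M001BH0HY 2 F1VI4M001C0ZC5 1 F1VI4M001DOF2X 1 F1VI4M001AYNTS 1

The Perl one-liner counts the same way, but keys %_ returns the IDs in arbitrary hash order, so its ID/count pairs may come out in a different order.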

# 4  
Old 02-23-2010
Code:
while (<DATA>) {
	my @tmp = split;                     # split the record on whitespace
	my %hash;
	print $tmp[0], " ", $tmp[1], " ";    # echo the two key fields
	$hash{$_}++ for @tmp[2 .. $#tmp];    # count each ID
	foreach my $key (keys %hash) {       # note: keys returns the IDs in arbitrary order
		print $key, " ", $hash{$key}, " ";
	}
	print "\n";
}
__DATA__
NC_002737.1  4  F1VI4M001A3IAU F1VI4M001A3IAU F1VI4M001A3IAU F1VI4M001A3IAU
NC_006372.1  5   F1VI4M001BH0HY F1VI4M001BH0HY F1VI4M001C0ZC5 F1VI4M001DOF2X F1VI4M001AYNTS
test	6 a a a b b c c c d a b c d
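On the __DATA__ above this prints something like the following (the ID order within each line depends on Perl's hash ordering, so it can vary between runs):

Code:
NC_002737.1 4 F1VI4M001A3IAU 4
NC_006372.1 5 F1VI4M001BH0HY 2 F1VI4M001C0ZC5 1 F1VI4M001DOF2X 1 F1VI4M001AYNTS 1
test 6 a 4 b 3 c 4 d 2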

# 5  
Old 02-24-2010
Thanks everyone for your suggestions. I really wanted to make this work in awk, but Perl code is welcome too, and many thanks for it. I have not tried the solutions yet, but will do so now.
Cheers