Count Repetitive Number in a column and renumbering using awk


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Count Repetitive Number in a column and renumbering using awk
# 1  
Old 10-01-2018
Count Repetitive Number in a column and renumbering using awk

Unable to get the desired output. Need only the rows which has repeated values in column 5.

Input File <tab separated file>
Code:
chr1    3773797 3773797 CEP10   1
chr1    3773797 3773797 CEP104  2
chr1    3689350 3689350 SMIM1   2
chr1    3773797 3773797 CEP4    3
chr1    3773797 3773797 EP104   4
chr1    43283059    43283059    CCDC23  5
chr1    43282775    43282775    ERMAP   5

Extra column 6 should be added in column 6. Rows, with repeated valued in column 5, should be extracted and extra column 6 is added with renumbering from 1 to n.

Output File <tab separated file>
Code:
chr1    3773797 3773797 CEP104  2   1
chr1    3689350 3689350 SMIM1   2   1
chr1    43283059    43283059    CCDC23  5   2
chr1    43282775    43282775    ERMAP   5   2

Moderator's Comments:
Mod Comment Please use CODE tags as required by forum rules!

Last edited by RudiC; 10-01-2018 at 09:41 AM.. Reason: Added CODE tags.
# 2  
Old 10-01-2018
where exactly are you stuck?
This User Gave Thanks to vgersh99 For This Post:
# 3  
Old 10-01-2018
code tags, please!

I am not able to increase the column 6 value. Every time it is print 2 in the column 6.

awk 'NR == FNR {CNT[$NF]++; next} CNT[$NF] > 1 {print $0, CNT[$NF]}' file1 file1

Output I am getting.
Code:
chr1    3773797 3773797 CEP104  2 2
chr1    3689350 3689350 SMIM1   2 2
chr1    43283059    43283059    CCDC23  5 2
chr1    43282775    43282775    ERMAP   5 2

# 4  
Old 10-01-2018
Code:
 awk 'NR == FNR {CNT[$NF]++;if (!($NF in order) && CNT[$NF]>1) order[$NF]=++cnt;next} $NF in order {print $0, order[$NF]}' file file

This User Gave Thanks to vgersh99 For This Post:
# 5  
Old 10-01-2018
@Vgersh99: Thank you very much. The code is working perfectly fine.
# 6  
Old 10-01-2018
You haven't described the format of your input file. Just showing us a 10 line sample input file lets us makes lots of possibly erroneous assumptions that might or might not be valid in your actual input files.

The code you showed us in post #3 in this thread prints each row with a last field (not necessarily 5th field) that appears more than once with a new final field (not necessarily field 6) that specifies the number of times the previous final field appeared in your input file. Your code assumes that each input line contains 5 input fields, but that is not stated as a known requirement of any input file you will be processing.

The code vgersh99 suggested looks like it should do what you requested as long as each input line contains exactly 5 input fields and all lines within a given input file with the same value in the last field are adjacent in the input file. Neither of these assumptions are stated as requirements for your input files, but both assumptions hold true in your sample input file.

Note that both the code you showed us and the code vgersh99 suggested read the input file twice. For large files, this can be inefficient. If you want to process data being read from a pipe, it just won't work.

If we can assume that all records within a given input file with the same value in the 5th field are adjacent in the input file (including that all lines with less than 5 fields and all lines with an empty 5th field are adjacent), the following seems to do what you requested no matter how many fields appear on each input line and only needing to read an input file once. Again, this assumption is not stated as a requirement for your input files but holds true in your sample input file.
Code:
awk '
BEGIN {	FS = OFS = "\t"
}
{	if(++c[$5] == 1) {
		$5 = $5 OFS groups+1
		$0 = $0
		last = $0
	}
	if(c[$5] == 2) {
		print last
		$5 = $5 OFS (++groups)
		$0 = $0
		print
	}
	if(c[$5] > 2) {
		$5 = $5 OFS groups
		$0 = $0
		print
	}
}' file

Note that the parentheses in the statement:
Code:
		$5 = $5 OFS (++groups)

should not be needed. But, at least with the BSD-based awk version 20070501 distributed with macOS High Sierra version 10.13.6, this script gets a memory fault in awk when processing the first line that has a field #5 that matches a field #5 from a previous line if those parentheses are omitted.

You haven't said what operating system you're using. If you want to try this code on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.

With your sample input, the above code produces the output you requested in post #1. With the following in file:
Code:
f1
f1	f2
f1	f2	f3
f1	f2	f3	f4
f1	f2	f3	f4	f5
chr1	3773797	3773797	CEP10	1
chr1	3773797	3773797	CEP104	2
chr1	3689350	3689350	SMIM1	2
chr1	3773797	3773797	CEP4	3
chr1	3773797	3773797	EP104	4
chr1	43283059	43283059	CCDC23	5
chr1	43282775	43282775	ERMAP	5
add1	43282775	43282775	ERMAP	6	field 6	field 7.1
add2	43282775	43282775	ERMAP	6	f62	field 7.2
add3	43282775	43282775	ERMAP	6	f63	field 7.3
add4	43282775	43282775	ERMAP	6	f64	field 7.4
add5	43282775	43282775	ERMAP	6	f65	field 7.5
add6	43282775	43282775	ERMAP	6	f66	field 7.6

the output produced by the above script is:
Code:
f1					1
f1	f2				1
f1	f2	f3			1
f1	f2	f3	f4		1
chr1	3773797	3773797	CEP104	2	2
chr1	3689350	3689350	SMIM1	2	2
chr1	43283059	43283059	CCDC23	5	3
chr1	43282775	43282775	ERMAP	5	3
add1	43282775	43282775	ERMAP	6	4	field 6	field 7.1
add2	43282775	43282775	ERMAP	6	4	f62	field 7.2
add3	43282775	43282775	ERMAP	6	4	f63	field 7.3
add4	43282775	43282775	ERMAP	6	4	f64	field 7.4
add5	43282775	43282775	ERMAP	6	4	f65	field 7.5
add6	43282775	43282775	ERMAP	6	4	f66	field 7.6

Note that the output field counting the number of groups of common field #5 values is always stored in field #6 no matter how many fields were in the input file. If there were 6 or more fields in an input line, all fields after field #5 are shifted to the right in the output and the group counter is inserted in field #6. If were less than 5 fields in an input line, empty fields are inserted before output field #6.
This User Gave Thanks to Don Cragun For This Post:
# 7  
Old 10-01-2018
Code:
awk -F"\t" '
NR==FNR {if (a[$5]++) b[$5]=$5; next}
length(b[$5]) {if (!c[$NF]++) d++; print $0, d}
' file OFS="\t" file

Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Count number of unique values in each column of array

What is an efficient way of counting the number of unique values in a 400 column by 1000 row array and outputting the counts per column, assuming the unique values in the array are: A, B, C, D In other words the output should look like: Value COL1 COL2 COL3 A 50 51 52... (16 Replies)
Discussion started by: Geneanalyst
16 Replies

2. UNIX for Dummies Questions & Answers

count number of distinct values in each column with awk

Hi ! input: A|B|C|D A|F|C|E A|B|I|C A|T|I|B As the title of the thread says, I would need to get: 1|3|2|4 I tried different variants of this command, but I don't manage to obtain what I need: gawk 'BEGIN{FS=OFS="|"}{for(i=1; i<=NF; i++) a++} END {for (b in a) print b}' input ... (2 Replies)
Discussion started by: beca123456
2 Replies

3. Shell Programming and Scripting

Count number of characters in particular column

Hi i have data like abchd 124 ldskc aattggcc each separated by tab space i want to count number of characters in 4th column and print it in new column with tabspace for every line can anyone help me how to do it. Thanks. (3 Replies)
Discussion started by: bhargavpbk88
3 Replies

4. Shell Programming and Scripting

Count the number of fields in column

Hi I was going through the below thread https://www.unix.com/shell-programming-scripting/48535-how-count-number-fields-record.html I too have something similar requirement as specified in this thread but the number of columns in my case can be very high, so I am getting following error. ... (3 Replies)
Discussion started by: shekharjchandra
3 Replies

5. Shell Programming and Scripting

Count the number or row with same value in a column

This is the source file, we called it errorlist.out 196 server_a server_unix_2 CD 196 server_b server_win_1 CD 196 server_c server_win_2 CD 196 server_bd server_unix_2 CD 196 server_d server_unix_2 CD 196 server_es server_win_1 CD 196 ... (14 Replies)
Discussion started by: sQew
14 Replies

6. UNIX for Dummies Questions & Answers

count number of rows based on other column values

Could anybody help with this? I have input below ..... david,39 david,39 emelie,40 clarissa,22 bob,42 bob,42 tim,32 bob,39 david,38 emelie,47 what i want to do is count how many names there are with different ages, so output would be like this .... david,2 emelie,2 clarissa,1... (3 Replies)
Discussion started by: itsme999
3 Replies

7. Shell Programming and Scripting

to count the number of occurences of a column value

im trying to count the number of occurences of column 2 value(starting from KKK*) of the below file, file.txt using the code cat file.txt | awk ' BEGIN { print "Category Counts"} {FS=","} {NR > 2} { cats = cats + 1} END { for(c in cats) { print c, "=", cats} } ' but its returning as ... (6 Replies)
Discussion started by: michaelrozar17
6 Replies

8. UNIX for Dummies Questions & Answers

how to count number of rows and sum of column using awk

Hi All, I have the following input which i want to process using AWK. Rows,NC,amount 1,1202,0.192387 2,1201,0.111111 3,1201,0.123456 i want the following output count of rows = 3 ,sum of amount = 0.426954 Many thanks (2 Replies)
Discussion started by: pistachio
2 Replies

9. Shell Programming and Scripting

column renumbering

Hi, I am a beginner in awk scripting! I need your help; I want to replace the fifth column number (which is 15 here) here in this file for example : ATOM 142 N PRO A 15 ATOM 143 CD PRO A 15 ATOM 144 HD1 PRO A 15 ATOM ... (5 Replies)
Discussion started by: adak
5 Replies

10. Shell Programming and Scripting

Count if numbers are not repetitive

Hi All, I have an input below and i would want to do a count on all the term "aaa" and count only once if the number in first column is the same. For eg, if i use a "grep -c aaa input" command, the count will be "8". However, i would want the count to be "6" instead since 2 numbers in the 1st... (7 Replies)
Discussion started by: Raynon
7 Replies
Login or Register to Ask a Question