Count Repetitive Number in a column and renumbering using awk

10-01-2018

Registered User

4, 0

Join Date: Oct 2018

Last Activity: 22 October 2018, 3:40 AM EDT

Location: Marseille, France

Posts: 4

Thanks Given: 3

Thanked 0 Times in 0 Posts


Unable to get the desired output. I need only the rows that have repeated values in column 5.

Input File <tab separated file>

Code:

chr1    3773797 3773797 CEP10   1
chr1    3773797 3773797 CEP104  2
chr1    3689350 3689350 SMIM1   2
chr1    3773797 3773797 CEP4    3
chr1    3773797 3773797 EP104   4
chr1    43283059    43283059    CCDC23  5
chr1    43282775    43282775    ERMAP   5

An extra column should be added as column 6. Rows with repeated values in column 5 should be extracted, and the new column 6 should number each group of repeated values from 1 to n.

Output File <tab separated file>

Code:

chr1    3773797 3773797 CEP104  2   1
chr1    3689350 3689350 SMIM1   2   1
chr1    43283059    43283059    CCDC23  5   2
chr1    43282775    43282775    ERMAP   5   2

Moderator's Comments:

Please use CODE tags as required by forum rules!

Last edited by RudiC; 10-01-2018 at 09:41 AM.. Reason: Added CODE tags.

himanshu


10-01-2018

Moderator

8,825, 1,112

Join Date: Feb 2005

Last Activity: 23 August 2021, 11:26 AM EDT

Location: Foxborough, MA

Posts: 8,825

Thanks Given: 579

Thanked 1,112 Times in 1,003 Posts

where exactly are you stuck?

vgersh99


10-01-2018


code tags, please!

I am not able to increment the column 6 value. It prints 2 in column 6 every time.

Code:

awk 'NR == FNR {CNT[$NF]++; next} CNT[$NF] > 1 {print $0, CNT[$NF]}' file1 file1

Output I am getting.

Code:

chr1    3773797 3773797 CEP104  2 2
chr1    3689350 3689350 SMIM1   2 2
chr1    43283059    43283059    CCDC23  5 2
chr1    43282775    43282775    ERMAP   5 2

himanshu


10-01-2018


Code:

 awk 'NR == FNR {CNT[$NF]++;if (!($NF in order) && CNT[$NF]>1) order[$NF]=++cnt;next} $NF in order {print $0, order[$NF]}' file file

vgersh99


10-01-2018


@Vgersh99: Thank you very much. The code is working perfectly fine.

himanshu


10-01-2018

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

You haven't described the format of your input file. Just showing us a 7-line sample input file lets us make lots of assumptions that might or might not be valid in your actual input files.

The code you showed us in post #3 in this thread prints each row with a last field (not necessarily 5th field) that appears more than once with a new final field (not necessarily field 6) that specifies the number of times the previous final field appeared in your input file. Your code assumes that each input line contains 5 input fields, but that is not stated as a known requirement of any input file you will be processing.
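To make the difference between the last field and the 5th field concrete, here is a small sketch. The function name show_last_field is made up for this illustration, and the two input lines are invented samples, not taken from your data:

```shell
# Hypothetical helper: for each tab-separated line read from stdin,
# report the field count, field 5, and the last field.
show_last_field() {
  awk -F'\t' '{ print "NF=" NF, "$5=" $5, "$NF=" $NF }'
}

# A 5-field line ($NF and $5 agree) and a 7-field line (they do not).
printf 'chr1\t3689350\t3689350\tSMIM1\t2\n'  | show_last_field
printf 'a\tb\tc\td\t6\tx\ty\n'               | show_last_field
```

On a line with exactly 5 fields, $NF and $5 name the same field; on the 7-field line, $NF is the 7th field, so a count keyed on $NF no longer says anything about column 5.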

The code vgersh99 suggested looks like it should do what you requested as long as each input line contains exactly 5 input fields and all lines within a given input file with the same value in the last field are adjacent in the input file. Neither of these assumptions is stated as a requirement for your input files, but both assumptions hold true in your sample input file.

Note that both the code you showed us and the code vgersh99 suggested read the input file twice. For large files, this can be inefficient. If you want to process data being read from a pipe, it just won't work.
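One way around the two-pass limitation, if the whole input fits in memory, is to buffer the records and do the numbering in an END block. The following is only a sketch under that assumption; the function name renumber_dups is made up for this example. It keys on field 5, appends the group number as a new last field (rather than inserting it as field 6), and numbers groups in the order their first row appears, so rows with the same field 5 value do not need to be adjacent:

```shell
# Hypothetical helper: reads tab-separated records from stdin (a file or a pipe).
renumber_dups() {
  awk '
  BEGIN { FS = OFS = "\t" }
  {
    line[NR] = $0     # remember every record in input order
    key[NR]  = $5     # and the field 5 value it carried
    count[$5]++       # how many records share that value
  }
  END {
    for (i = 1; i <= NR; i++) {
      k = key[i]
      if (count[k] < 2) continue          # value seen only once: drop the row
      if (!(k in group)) group[k] = ++n   # first row of a repeated value opens a new group
      print line[i], group[k]
    }
  }'
}

# Works on a pipe because the input is read only once.
printf 'chr1\t3773797\t3773797\tCEP10\t1\nchr1\t3773797\t3773797\tCEP104\t2\nchr1\t3689350\t3689350\tSMIM1\t2\n' | renumber_dups
```

The trade-off is memory: every record is held until end of input, so this is not a good fit for very large files.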

If we can assume that all records within a given input file with the same value in the 5th field are adjacent in the input file (including that all lines with fewer than 5 fields and all lines with an empty 5th field are adjacent), the following seems to do what you requested no matter how many fields appear on each input line, and it only needs to read an input file once. Again, this assumption is not stated as a requirement for your input files but holds true in your sample input file.

Code:

awk '
BEGIN {	FS = OFS = "\t"
}
{	if(++c[$5] == 1) {
		$5 = $5 OFS groups+1
		$0 = $0
		last = $0
	}
	if(c[$5] == 2) {
		print last
		$5 = $5 OFS (++groups)
		$0 = $0
		print
	}
	if(c[$5] > 2) {
		$5 = $5 OFS groups
		$0 = $0
		print
	}
}' file

Note that the parentheses in the statement:

Code:

		$5 = $5 OFS (++groups)

should not be needed. But, at least with the BSD-based awk version 20070501 distributed with macOS High Sierra version 10.13.6, this script gets a memory fault in awk when processing the first line that has a field #5 that matches a field #5 from a previous line if those parentheses are omitted.

You haven't said what operating system you're using. If you want to try this code on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.

With your sample input, the above code produces the output you requested in post #1. With the following in file:

Code:

f1
f1	f2
f1	f2	f3
f1	f2	f3	f4
f1	f2	f3	f4	f5
chr1	3773797	3773797	CEP10	1
chr1	3773797	3773797	CEP104	2
chr1	3689350	3689350	SMIM1	2
chr1	3773797	3773797	CEP4	3
chr1	3773797	3773797	EP104	4
chr1	43283059	43283059	CCDC23	5
chr1	43282775	43282775	ERMAP	5
add1	43282775	43282775	ERMAP	6	field 6	field 7.1
add2	43282775	43282775	ERMAP	6	f62	field 7.2
add3	43282775	43282775	ERMAP	6	f63	field 7.3
add4	43282775	43282775	ERMAP	6	f64	field 7.4
add5	43282775	43282775	ERMAP	6	f65	field 7.5
add6	43282775	43282775	ERMAP	6	f66	field 7.6

the output produced by the above script is:

Code:

f1					1
f1	f2				1
f1	f2	f3			1
f1	f2	f3	f4		1
chr1	3773797	3773797	CEP104	2	2
chr1	3689350	3689350	SMIM1	2	2
chr1	43283059	43283059	CCDC23	5	3
chr1	43282775	43282775	ERMAP	5	3
add1	43282775	43282775	ERMAP	6	4	field 6	field 7.1
add2	43282775	43282775	ERMAP	6	4	f62	field 7.2
add3	43282775	43282775	ERMAP	6	4	f63	field 7.3
add4	43282775	43282775	ERMAP	6	4	f64	field 7.4
add5	43282775	43282775	ERMAP	6	4	f65	field 7.5
add6	43282775	43282775	ERMAP	6	4	f66	field 7.6

Note that the output field counting the number of groups of common field #5 values is always stored in field #6 no matter how many fields were in the input line. If there were 6 or more fields in an input line, all fields after field #5 are shifted to the right in the output and the group counter is inserted in field #6. If there were fewer than 5 fields in an input line, empty fields are inserted before output field #6.
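The insertion itself can be tried in isolation. This is a minimal sketch of the same two statements, with a made-up function name (insert_field6) and a caller-supplied value standing in for the group counter:

```shell
# Hypothetical helper: insert the value given as $1 as a new field 6
# in every tab-separated line read from stdin.
insert_field6() {
  awk -v val="$1" '
  BEGIN { FS = OFS = "\t" }
  {
    $5 = $5 OFS val   # assigning to $5 rebuilds $0, padding missing fields with empty ones
    $0 = $0           # re-split the record so val becomes $6 and later fields shift right
    print
  }'
}

printf 'f1\tf2\n'                | insert_field6 X   # short line: empty fields 3-5, then X
printf 'a\tb\tc\td\te\tf\tg\n'   | insert_field6 X   # long line: f and g move to fields 7 and 8
```

Assigning to $5 on a line with fewer than 5 fields is what creates the empty padding fields, and the $0 = $0 re-split is what turns the embedded separator into a real field boundary.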

Don Cragun


10-01-2018

Read Only

1,278, 486

Join Date: Sep 2012

Last Activity: 27 February 2020, 8:59 PM EST

Location: Houston, Texas, USA

Posts: 1,278

Thanks Given: 0

Thanked 486 Times in 451 Posts

Code:

awk -F"\t" '
NR==FNR {if (a[$5]++) b[$5]=$5; next}
length(b[$5]) {if (!c[$NF]++) d++; print $0, d}
' file OFS="\t" file

rdrtx1

