Count and print the most repeating string in each line


 
# 1  
Old 03-16-2015

Hi all,

I have a file in which each string in column 1 is associated with one or more strings in column 2. For example, in the sample input below, Gene1 in column 1 is associated with two different strings in column 2 (BP1 and BP2). For every unique string in column 1, I need to print the string from column 2 that is associated with it most often, along with its count. If several strings tie for the highest count, all of them should be printed.


Input.txt:
Code:
Gene1   BP1
Gene1   BP1
Gene1   BP2
Gene1   BP1
Gene1   BP2
Gene2   BP3
Gene2   BP3
Gene2   BP3
Gene2   BP3
Gene3   BP7
Gene3   BP8
Gene3   BP7
Gene3   BP8

Output.txt:
Code:
Gene1   BP1   3
Gene2   BP3   4
Gene3   BP7   2   BP8   2

Here,
BP1 has the highest number of connections (3 out of 5) with Gene1.
BP3 has the highest number of connections (4 out of 4) with Gene2.
BP7 and BP8 have an equal number of connections (2 each) with Gene3.

Your time is much appreciated.
Thanks!
# 2  
Old 03-17-2015
What have you tried so far?

---------- Post updated at 05:24 ---------- Previous update was at 04:06 ----------

Funny, I just looked at the output (ignoring BP8, as I see now, sorry) and got this:
Code:
#!/usr/bin/env bash
### Expected output 
#Gene1   BP1   3
#Gene2   BP3   4
#Gene3   BP7   2   BP8   2
### from input.txt

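# Note: only the first pair listed for each gene is counted, so ties
# (such as BP8 for Gene3) and later, more frequent pairs are missed.
for n in 1 2 3
do
	total_genes="$(grep "Gene$n" input.txt)"
	first_gene=$(echo "$total_genes" | head -n 1)
	first_occurrences=$(echo "$total_genes" | grep -c "$first_gene")
	printf "%s\t%s\n" \
		"$first_gene" \
		"$first_occurrences"
done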

Code:
Gene1	 BP1	3
Gene2	 BP3	4
Gene3	 BP7	2

Hope this helps to get you started.

Last edited by sea; 03-17-2015 at 03:12 AM..
# 3  
Old 03-17-2015
Hi, another one to get you started:

The code still needs to be modified to print only the highest number of connections rather than every connection with its count...

awk's associative arrays are nice for this application:
Code:
awk '{A[$1]; B[$2]; C[$1,$2]++} END{for(i in A) {s=i; for(j in B) if ((i,j) in C) s=s OFS j OFS C[i,j]; print s}}' OFS='\t' file

or, using multiple lines of coding real estate:
Code:
awk '
  {
    A[$1]
    B[$2]
    C[$1,$2]++
  } 
  END {
    for(i in A) {
      s=i
      for(j in B) 
        if ((i,j) in C)
          s=s OFS j OFS C[i,j]
      print s
    }
  }
' OFS='\t' file

This will produce:
Code:
Gene1	BP1	3	BP2	2
Gene2	BP3	4
Gene3	BP7	2	BP8	2
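
As a possible way to finish the modification mentioned above, here is a minimal sketch that also tracks the highest count per gene and prints only the strings that reach it. The function name top_per_gene and the arrays genes, vals, nv, cnt and max are made up for this illustration; the three-space separator is an assumption taken from the requested output in post #1. Input order is remembered so that the output order does not depend on awk's unspecified for-in ordering.

```shell
#!/bin/sh
# top_per_gene is a hypothetical helper name. It reads "gene string" pairs
# from the files given as arguments, or from standard input if none are given.
top_per_gene() {
  awk -v OFS='   ' '
    {
      if (!($1 in max)) genes[++ng] = $1                  # genes in input order
      if (!(($1, $2) in cnt)) vals[$1, ++nv[$1]] = $2     # strings per gene in input order
      c = ++cnt[$1, $2]                                   # count this pair
      if (c > max[$1]) max[$1] = c                        # highest count for this gene so far
    }
    END {
      for (g = 1; g <= ng; g++) {
        gene = genes[g]
        line = gene
        for (v = 1; v <= nv[gene]; v++) {
          val = vals[gene, v]
          if (cnt[gene, val] == max[gene])                # keep only the top string(s)
            line = line OFS val OFS cnt[gene, val]
        }
        print line
      }
    }
  ' "$@"
}
```

Calling it as top_per_gene Input.txt on the sample from post #1 should give one line per gene, with ties such as BP7 and BP8 on the same line.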

# 4  
Old 03-17-2015
A single awk command could do this more efficiently, but I find the logic easier to express with a couple of sort commands, one uniq command, and a read loop in the shell...
Code:
#!/bin/ksh
last_gene=""
sort Input.txt | uniq -c | sort -k2,2 -k1,1rn -k3,3 | while read -r cnt gene rp
do	if [ "$last_gene" != "$gene" ]
	then	if [ "$last_gene" != "" ]
		then	echo
		fi
		last_gene="$gene"
		last_cnt="$cnt"
		printf '%s   %s   %s' "$gene" "$rp" "$cnt"
	else	if [ "$cnt" -eq "$last_cnt" ]
		then	printf '   %s   %s' "$rp" "$cnt"
		fi
	fi
done
echo

Although written and tested using the Korn shell, any shell that uses basic Bourne shell syntax will be able to run this script and, with your given sample input, produce the output:
Code:
Gene1   BP1   3
Gene2   BP3   4
Gene3   BP7   2   BP8   2
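
To see why the read loop only has to compare each count with the first count of its group, it can help to look at what the loop actually receives. The sketch below wraps the first three pipeline stages in a function; the name rank_pairs is made up here, and the trailing awk stage is an addition that only strips the padding uniq -c puts in front of the counts.

```shell
#!/bin/sh
# rank_pairs is a hypothetical name for the sort/uniq/sort stages of the script above.
rank_pairs() {
  sort "$@" |
    uniq -c |                       # prefix each distinct line with its count
    sort -k2,2 -k1,1rn -k3,3 |      # group by gene, highest count first, then by string
    awk '{ print $1, $2, $3 }'      # normalize the leading padding added by uniq -c
}
```

Within each gene, the first line carries the highest count, so any following line of the same gene with an equal count is a tie and anything lower can be ignored.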

# 5  
Old 03-17-2015
A simple one:
Code:
sort Input.txt | uniq -c | awk '($2!=p2 || $1==p1) {print $2,$3,$1} {p1=$1; p2=$2}'

Code:
Gene1 BP1 3
Gene2 BP3 4
Gene3 BP7 2
Gene3 BP8 2

# 6  
Old 03-17-2015
Quote:
Originally Posted by MadeInGermany
A simple one:
Code:
sort Input.txt | uniq -c | awk '($2!=p2 || $1==p1) {print $2,$3,$1} {p1=$1; p2=$2}'

Code:
Gene1 BP1 3
Gene2 BP3 4
Gene3 BP7 2
Gene3 BP8 2

This works with the sample input given in post #1 in this thread, but with a slightly modified Input.txt:
Code:
Gene1   BP1
Gene1   BP2
Gene1   BP2
Gene1   BP2
Gene1   BP2
Gene2   BP3
Gene2   BP3
Gene2   BP3
Gene2   BP3
Gene3   BP7
Gene3   BP8
Gene3   BP7
Gene3   BP8

(changing some Gene1 BP1 lines to Gene1 BP2), it produces the incorrect output:
Code:
Gene1 BP1 1
Gene2 BP3 4
Gene3 BP7 2
Gene3 BP8 2

instead of the correct output:
Code:
Gene1 BP2 4
Gene2 BP3 4
Gene3 BP7 2
Gene3 BP8 2

or the requested:
Code:
Gene1   BP2   4
Gene2   BP3   4
Gene3   BP7   2   BP8   2

The second sort in the pipeline I suggested can't be skipped unless the awk code is made considerably more complex.
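
For comparison, here is a rough sketch of what the awk stage might look like when the second sort is dropped: it has to remember the running maximum for the current gene and hold back its output until the gene changes. The names top_without_second_sort, emit, gene, max and best are invented for this example, and the three-space separator is assumed from the requested output.

```shell
#!/bin/sh
# top_without_second_sort is a hypothetical name; it takes the same arguments as sort.
top_without_second_sort() {
  sort "$@" | uniq -c | awk -v OFS='   ' '
    function emit() { if (gene != "") print gene best }   # print the finished group
    $2 != gene { emit(); gene = $2; max = 0; best = "" }  # a new gene starts
    $1 > max   { max = $1; best = "" }                    # new maximum: forget earlier candidates
    $1 == max  { best = best OFS $3 OFS $1 }              # the maximum itself, or a tie with it
    END { emit() }
  '
}
```

With the modified Input.txt shown above, this should print Gene1 with BP2 and a count of 4, since the BP1 line is discarded as soon as a higher count appears.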
This User Gave Thanks to Don Cragun For This Post:
# 7  
Old 03-17-2015
Thanks for the correction. And there is yet another bug.
Hopefully fixed now:
Code:
sort Input.txt | uniq -c | sort -k2,2 -k1,1rn | awk '{ if ($2!=p2 || $1==p1) {print $2,$3,$1; p1=$1} else {p1=""}} {p2=$2}'
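
If the output has to match the layout requested in post #1, with tied strings on the same line as their gene, one more awk stage could be appended. This is only a sketch: it assumes the command above prints one "gene string count" line per top pair, in the same style as the output shown in post #5, and the name merge_ties and the three-space separator are assumptions made for the example.

```shell
#!/bin/sh
# merge_ties is a hypothetical helper: it joins consecutive lines that
# share the same first field into a single line.
merge_ties() {
  awk -v OFS='   ' '
    $1 != gene { if (NR > 1) print line; gene = $1; line = $1 }  # start a new output line
    { line = line OFS $2 OFS $3 }                                # append string and count
    END { if (NR > 0) print line }                               # print the last gene
  ' "$@"
}
```

It would be used by piping the output of the command above into merge_ties.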


Last edited by MadeInGermany; 03-17-2015 at 09:39 AM..