How many studies have unequal values for each pair?


 
# 1  
Old 07-26-2016

I have several studies (s), each of which has points (p) with values (v).
My goal is to determine, for each pair of points, how many studies have different values for the two points (counting only studies in which both points appear).

Code:
Study	Point	Value
1	p1	value1
1	p2	value2
1	p3	value1
1	p4	value3
1	p5	value3
2	p2	value1
2	p4	value1
3	p1	value1
3	p5	value5
3	p3	value1
4	p2	value1
4	p4	value5


For example, the pair (p1,p5) appears in 2 studies, study 1 (value1, value3) and study 3 (value1, value5), and the two values differ in both. So the count for this pair is 2. The pair (p1,p3) also appears in studies 1 and 3, but with the same values in each, so its count is 0.
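
The rule in that example can be checked for a single pair with a few lines of awk. This is only a minimal sketch: the helper name pair_count and the temporary file are made up for illustration, and the data is the sample from this post.

```shell
# Sample data from this post, written to a temporary file (tab-separated).
data=$(mktemp)
printf '%s\t%s\t%s\n' \
  Study Point Value \
  1 p1 value1  1 p2 value2  1 p3 value1  1 p4 value3  1 p5 value3 \
  2 p2 value1  2 p4 value1 \
  3 p1 value1  3 p5 value5  3 p3 value1 \
  4 p2 value1  4 p4 value5 > "$data"

# pair_count A B FILE (hypothetical helper): print the number of studies in
# which points A and B both appear and have different values.
pair_count() {
  awk -v a="$1" -v b="$2" '
    NR == 1 { next }              # skip the header line
    $2 == a { va[$1] = $3 }       # value of point a, keyed by study
    $2 == b { vb[$1] = $3 }       # value of point b, keyed by study
    END {
      for (s in va)
        if ((s in vb) && va[s] != vb[s])
          n++
      print n + 0                 # print 0 rather than an empty string
    }' "$3"
}

p1_p5=$(pair_count p1 p5 "$data")
p1_p3=$(pair_count p1 p3 "$data")
echo "p1,p5: $p1_p5   p1,p3: $p1_p3"
rm -f "$data"
```

Running it for (p1,p5) and (p1,p3) reproduces the counts 2 and 0 described above. Calling it once per pair would be far too slow for a large dataset; it is meant only as a way to spot-check individual pairs.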


So my desired output is


Code:
Point1	Point2	#StudiesWhereValuesAreDifferentForThisPair
p1	p2	1
p1	p4	1
p1	p5	2
p2	p3	1
p2	p4	2
p2	p5	1
p3	p4	1
p3	p5	2
p4	p5	1

I do have a working solution for this. It is fine for a small dataset but runs forever on the actual dataset, which has several thousand factors in each column.

Here is my solution

Code:
awk '{sp[$1 FS $2]=$3;s[$1];p[$2];next }END { for(ss in s) { for (p1s in p) { for (p2s in p)   { if (p1s !=p2s) { print ss,p1s,p2s,sp[ss FS p1s],sp[ss FS p2s] } }}}}' dataset | awk 'NF==5' | awk '$4!=$5'

Please help me achieve the same in a smarter way.

Last edited by senhia83; 07-26-2016 at 11:34 AM..
# 2  
Old 07-26-2016
Code:
awk '
$1+0 {a[$2]=$2; b[$2,c[$2]++]=$3}                                        # load variable names array, load pairs array
END {
   for (i in a) {
      for (j in a) {
         if (i != j) {
            for (k=0; k<c[i]; k++) {
               if (b[i,k] && b[j,k] && (b[i,k] != b[j,k])) {
                  out[i,j]++;                                            # count pair mismatches for existing pairs
               }
            }
         }
      }
   }
   print "Point1\tPoint2\t#StudiesWhereValuesAreDifferentForThisPair";   # print header
   for (i in a) {
      for (j in a) {
         if (out[i,j] && ! p[j,i] && ! p[i,j]) {                         # for non repeat pairs output
            print i "\t" j "\t" out[i,j];                                # print pair mismatch counts
            p[j,i]=p[i,j]=1;                                             # set pair array printed
         }
      }
   }
}
' infile

Verify p2 p5 from sample output.

Last edited by rdrtx1; 07-26-2016 at 01:44 PM..
# 3  
Old 07-26-2016
You are right, that pair will be 2. Thanks a lot. I need to understand your code now :)
# 4  
Old 07-29-2016
Quote:
Originally Posted by senhia83
You are right, that pair will be 2. Thanks a lot. I need to understand your code now :)
Hello senhia83,

First of all, thank you for asking a good question and for showing us what you have done to solve it. Keep it up.
Coming to your question, could you please try the following?

One Liner form of solution:
Code:
awk 'BEGIN{print "Study" OFS "Point" OFS "Value"} NR>1{C[$2];E[$2 OFS 1+C[$2]++]=$3;sub(/[[:alpha:]]/,X,$2);Q=Q>$2+0?Q:$2+0} END{;for(i in C){D[++j]=i};for(i=1;i<=length(D);i++){for(k=i+1;k<=length(D);k++){for(q=1;q<=Q;q++){if(E[D[i] OFS q] != E[D[k] OFS q] && E[D[i] OFS q] && E[D[k] OFS q]){R[D[i] OFS D[k]]++;}}}};for(u in R){print u OFS R[u]}}' OFS="\t"   Input_file

Non-one liner form of solution:
Code:
awk 'BEGIN {
    print "Study" OFS "Point" OFS "Value"
}
NR>1 {
    C[$2];
    E[$2 OFS 1+C[$2]++]=$3;
    sub(/[[:alpha:]]/,X,$2);
    Q=Q>$2+0?Q:$2+0
}
END {
    for(i in C){
        D[++j]=i
    };
    for(i=1;i<=length(D);i++){
        for(k=i+1;k<=length(D);k++){
            for(q=1;q<=Q;q++){
                if(E[D[i] OFS q] != E[D[k] OFS q] && E[D[i] OFS q] && E[D[k] OFS q]){
                    R[D[i] OFS D[k]]++;
                }
            }
        }
    };
    for(u in R){
        print u OFS R[u]
    }
}' OFS="\t"   Input_file

NOTE: This assumes that field 2 always contains a number (digit), as in the Input_file you have shown. I have tested this with GNU awk only.
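
To see what that assumption means in practice, the sub() call and the numeric conversion can be tried on their own. This is only an illustrative sketch: the point names p12 and pt3 are made up and do not come from the thread's data.

```shell
# Remove the first alphabetic character from each name, then force a numeric
# conversion, mirroring the sub(...) and $2+0 steps in the script above.
nums=$(printf '%s\n' p5 p12 pt3 |
  awk '{ sub(/[[:alpha:]]/, "", $1); print $1 + 0 }')
echo "$nums"
```

Names of the form "one letter plus digits" give the expected number, while a name such as pt3 still starts with a letter after the substitution and so converts to 0.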

Thanks,
R. Singh

Last edited by RavinderSingh13; 07-29-2016 at 11:19 AM.. Reason: Changed the non-one liner form solution's spaces and fit them to good Looking one :)
# 5  
Old 07-31-2016
Assuming that your input file is always sorted with all lines for a given study adjacent to each other (as in your sample dataset file), you might want to try the following awk script to handle your problem:
Code:
#!/bin/ksh
awk '
BEGIN {	# Before reading 1st line from the input file, set output field
	# separator to <tab> and print heading.
	OFS = "\t"
	print "Point1", "Point2", "Number_of_Studies_with_Different_Values"
}
NR == 1 {
	# Skip over the input file header line.
	next
}
$1 != last {
	# If the 1st field has changed, process all of the lines read for the
	# previous value of the 1st field.
	count()

	# Save the current value of the 1st field.
	last = $1
}
{	# Increment the number of lines seen for this 1st field value and save
	# the point name and the value from the current line.
	p[++n] = $2
	v[n] = $3
}
END {	# Process the last 1st field value.
	count()

	# Print the accumulated results.
	for(i in diffs)
		# Uncomment one and only one of the following lines.
		print i, diffs[i] | "sort"	# use this to sort output
		# print i, diffs[i]		# use this for unsorted output
}
function count(		i, j) {
	# Process the set of lines for a given 1st field value.
	# There are "n" lines in the set.  for each of the 1st "n-1" lines
	# in this set...
	for(i = 1; i < n; i++)
		# for each of the remaining lines in the set...
		for(j = i + 1; j <= n; j++)
			# if the values for those two points are different...
			if(v[i] != v[j])
				# Increment the number of times this pair of
				# points (with the two points sorted by point
				# names) has had different values.
				diffs[(p[i] < p[j]) ?  p[i] OFS p[j] : \
				    p[j] OFS p[i]]++

	# Reset the line counter for the next set.
	n = 0
}' dataset

which, with your sample data, produces the output:
Code:
Point1	Point2	Number_of_Studies_with_Different_Values
p1	p2	1
p1	p4	1
p1	p5	2
p2	p3	1
p2	p4	2
p2	p5	1
p3	p4	1
p3	p5	2

or, with unsorted output instead of sorted output:
Code:
Point1	Point2	Number_of_Studies_with_Different_Values
p3	p4	1
p3	p5	2
p2	p3	1
p2	p4	2
p2	p5	1
p1	p2	1
p1	p4	1
p1	p5	2

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.
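
If the same script has to run on both kinds of system, one possible approach is to pick the awk binary at run time. This is a minimal sketch; the variable name AWK is an arbitrary choice, and the one-line program is only a placeholder for the real script.

```shell
# Prefer the XPG4 awk where it exists (Solaris/SunOS), otherwise use
# whatever awk is first in PATH.
if [ -x /usr/xpg4/bin/awk ]; then
  AWK=/usr/xpg4/bin/awk
else
  AWK=awk
fi

# Placeholder program; the real awk script would go here.
"$AWK" 'BEGIN { print "awk selected" }'
```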

Note that, if I understand the requirements correctly, I don't believe the code suggested by rdrtx1 or RavinderSingh13 produces reliable results (although both do work for the sample data provided in post #1 in this thread). For details, see the next post.

Last edited by Don Cragun; 07-31-2016 at 04:51 AM.. Reason: Add note.
# 6  
Old 07-31-2016
Looking closer at the scripts rdrtx1 and RavinderSingh13 suggested, I find that I do not understand how they restrict comparisons of values for various points to just compare values within a single study. Although all three of our suggestions produce similar output for the sample data given in post #1 in this thread (differing only in the order of lines in the output), if we change the sample input data to:
Code:
Study	Point	Value
1	p1	value1
1	p2	value1
1	p3	value1
1	p4	value1
1	p5	value1
2	p2	value2
2	p4	value2
3	p1	value3
3	p5	value3
3	p3	value3
4	p2	value4
4	p4	value4

where all values in each study are identical, I believe the output should just be the header line. With this sample input, my suggestion produces the output:
Code:
Point1	Point2	Number_of_Studies_with_Different_Values

the code rdrtx1 suggested produces the output:
Code:
Point1	Point2	#StudiesWhereValuesAreDifferentForThisPair
p1	p2	1
p1	p4	1
p2	p3	1
p2	p5	1
p3	p4	1
p4	p5	1

and the code suggested by RavinderSingh13 produces the output:
Code:
Study Point Value
p4	p5	1
p3	p4	1
p2	p3	1
p2	p5	1
p1	p2	1
p1	p4	1

Did I misunderstand the requirements?
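
One way to make the per-study restriction explicit, without relying on the input being sorted by study, is to key every stored point and value by its study ID. The following is a minimal sketch under that assumption, not a replacement for the script in post #5. The array names (n, pt, val, diffs) and the temporary file are illustrative only, and the rows of the post #1 sample are deliberately shuffled.

```shell
# Post #1 sample data with the rows shuffled so studies are not adjacent.
data=$(mktemp)
printf '%s\t%s\t%s\n' \
  Study Point Value \
  3 p1 value1  1 p1 value1  2 p2 value1  1 p2 value2 \
  4 p4 value5  1 p3 value1  3 p5 value5  1 p4 value3 \
  2 p4 value1  1 p5 value3  3 p3 value1  4 p2 value1 > "$data"

result=$(awk '
  BEGIN { OFS = "\t" }
  NR == 1 { next }                      # skip the header line
  {
    n[$1]++                             # lines seen so far for this study
    pt[$1, n[$1]]  = $2                 # point name, keyed by study and index
    val[$1, n[$1]] = $3                 # value, keyed by study and index
  }
  END {
    for (s in n)                        # each study on its own
      for (i = 1; i < n[s]; i++)
        for (j = i + 1; j <= n[s]; j++)
          if (val[s, i] != val[s, j]) {
            a = pt[s, i]; b = pt[s, j]
            # store the pair with the two point names in sorted order
            diffs[(a < b) ? a OFS b : b OFS a]++
          }
    for (k in diffs)
      print k, diffs[k]
  }' "$data" | LC_ALL=C sort)

echo "$result"
rm -f "$data"
```

Only points that occur in the same study are ever compared, so the work grows with the sum of the squared study sizes rather than with the number of studies times the square of the number of distinct points. The trade-off is that the whole input is held in memory until the END block runs.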