How many studies have unequal values for each pair?
I have several studies (s), each of which has points (p) with values (v).
My goal is to determine, for each pair of points, how many studies have different values for the two points (where both are available).
For example, the pair (p1,p5) is involved in 2 studies, STUDY 1 (value1, value3) and STUDY 3 (value1, value5), and in both of them the two values differ, so the count for this pair is 2. The pair (p1,p3) is present in both studies 1 and 3 with the same values, so its count is 0.
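To make the counting rule concrete, here is a minimal sketch on made-up data. It assumes a three-column input (study, point, value) with a header line, and that a point appears at most once per study. The function name count_diffs and the sample rows are hypothetical; they are only chosen to match the example above.

```shell
# count_diffs: read "study point value" lines (header on line 1) from stdin and
# print each pair of points with the number of studies where their values differ.
count_diffs() {
    awk '
    NR > 1 {
        # Store the points of each study in input order, with their values.
        pt[$1, ++n[$1]] = $2
        val[$1, n[$1]] = $3
    }
    END {
        for (s in n)
            for (i = 1; i < n[s]; i++)
                for (j = i + 1; j <= n[s]; j++)
                    if (val[s, i] != val[s, j]) {
                        a = pt[s, i]; b = pt[s, j]
                        # Order the two names so (p1,p5) and (p5,p1) share a key.
                        diff[(a < b) ? a " " b : b " " a]++
                    }
        for (pair in diff)
            print pair, diff[pair]
    }' | sort
}

# Hypothetical data: p1 and p3 agree in both studies, p5 differs from both.
count_diffs <<'EOF'
Study Point Value
1 p1 value1
1 p3 value1
1 p5 value3
3 p1 value1
3 p3 value1
3 p5 value5
EOF
```

Pairs whose count is 0, such as (p1,p3) here, are simply not printed.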
I do have a working solution, which works for a small dataset but runs forever on the actual dataset, which has several thousand factors in each column.
Here is my solution
Code:
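awk '{ sp[$1 FS $2] = $3; s[$1]; p[$2]; next }
END {
  # For every study and every ordered pair of distinct points,
  # print the study, both points, and both values (empty if missing).
  for (ss in s)
    for (p1s in p)
      for (p2s in p)
        if (p1s != p2s)
          print ss, p1s, p2s, sp[ss FS p1s], sp[ss FS p2s]
}' dataset | awk 'NF==5' | awk '$4!=$5'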
awk '
$1+0 {a[$2]=$2; b[$2,c[$2]++]=$3}   # load variable names array, load pairs array
END {
   for (i in a) {
      for (j in a) {
         if (i != j) {
            for (k=0; k<c[i]; k++) {
               if (b[i,k] && b[j,k] && (b[i,k] != b[j,k])) {
                  out[i,j]++;   # count pair mismatches for existing pairs
               }
            }
         }
      }
   }
   print "Point1\tPoint2\t#StudiesWhereValuesAreDifferentForThisPair";   # print header
   for (i in a) {
      for (j in a) {
         if (out[i,j] && ! p[j,i] && ! p[i,j]) {   # for non repeat pairs output
            print i "\t" j "\t" out[i,j];          # print pair mismatch counts
            p[j,i]=p[i,j]=1;                       # set pair array printed
         }
      }
   }
}
' infile
You are right, that pair will be 2. Thanks a lot. I need to understand your code now.
Hello senhia83,
First of all, thank you for asking a good question and for showing us what you have done to solve it; keep it up.
Coming to your question, could you please try the following.
Assuming that your input file is always sorted with all lines for a given study adjacent to each other (as in your sample dataset file), you might want to try the following awk script to handle your problem:
Code:
#!/bin/ksh
awk '
BEGIN { # Before reading 1st line from the input file, set output field
        # separator to <tab> and print heading.
        OFS = "\t"
        print "Point1", "Point2", "Number_of_Studies_with_Different_Values"
}
NR == 1 {
        # Skip over the input file header line.
        next
}
$1 != last {
        # If the 1st field has changed, process all of the lines read for the
        # previous value of the 1st field.
        count()
        # Save the current value of the 1st field.
        last = $1
}
{       # Increment the number of lines seen for this 1st field value and save
        # the point name and the value from the current line.
        p[++n] = $2
        v[n] = $3
}
END {   # Process the last 1st field value.
        count()
        # Print the accumulated results.
        for(i in diffs)
                # Uncomment one and only one of the following lines.
                print i, diffs[i] | "sort"      # use this to sort output
#               print i, diffs[i]               # use this for unsorted output
}
function count( i, j) {
        # Process the set of lines for a given 1st field value.
        # There are "n" lines in the set.  For each of the 1st "n-1" lines
        # in this set...
        for(i = 1; i < n; i++)
                # for each of the remaining lines in the set...
                for(j = i + 1; j <= n; j++)
                        # if the values for those two points are different...
                        if(v[i] != v[j])
                                # Increment the number of times this pair of
                                # points (with the two points sorted by point
                                # names) has had different values.
                                diffs[(p[i] < p[j]) ? p[i] OFS p[j] : \
                                    p[j] OFS p[i]]++
        # Reset the line counter for the next set.
        n = 0
}' dataset
If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.
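The script relies on all lines for a study being adjacent. If a file is not already grouped that way, one possible preparation step is to sort everything except the header on the 1st field. This is only a sketch: the function name sort_by_study and the output name dataset.sorted are hypothetical, and it assumes the header is exactly one line.

```shell
# sort_by_study FILE: print FILE with its header line kept first and the
# remaining lines sorted on the 1st (study) field, so each study is contiguous.
sort_by_study() {
    head -n 1 "$1"
    tail -n +2 "$1" | sort -k1,1
}

# Usage (assumes a file named "dataset" as in the thread):
#   sort_by_study dataset > dataset.sorted
# and then run the awk script on dataset.sorted instead of dataset.
```

The order of lines within a study does not matter to the script, since it compares every pair of lines in the group.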
Note that if I understand the requirements correctly, I don't believe the code suggested by rdrtx1 or RavinderSingh13 produces reliable results (although both do work for the sample data provided in post #1 in this thread). For details, see the next post.
Last edited by Don Cragun; 07-31-2016 at 04:51 AM..
Reason: Add note.
Looking closer at the script rdrtx1 and RavinderSingh13 suggested, I find that I do not understand how they restrict comparisons of values for various points to just compare values within a single study. Although all three of our suggestions produce similar output for the sample data given in post #1 in this thread (differing only in the order of lines in the output), if we change the sample input data to:
where all values in each study are identical, I believe the output should just be the header line. With this sample input, my suggestion produces the output:
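One way to sanity-check that expectation on any input is to count the distinct values per study: if every study reports 1, no pair of points can differ within a study, so only the header line should be printed. The sketch below assumes a three-column file (study, point, value) with a header line; the function name distinct_per_study and the sample rows are hypothetical and are not the data from this thread.

```shell
# distinct_per_study: read "study point value" lines (header on line 1) from
# stdin and print each study followed by its number of distinct values.
distinct_per_study() {
    awk '
    # Count a (study, value) combination only the first time it is seen.
    NR > 1 && !seen[$1, $3]++ { distinct[$1]++ }
    END { for (s in distinct) print s, distinct[s] }' | sort
}

# Hypothetical input in which all values within each study are identical.
distinct_per_study <<'EOF'
Study Point Value
1 p1 a
1 p3 a
2 p1 b
2 p3 b
EOF
```

Any study reported with a count above 1 is one where at least one pair of points must differ.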