MATRIX to CSV


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting MATRIX to CSV
# 1  
Old 06-18-2013
MATRIX to CSV

Hello friends,

A big question for the UNIX INTELLIGENCE

I have a CSV file as follows:


Code:
VALUE,USER1,relatedUSER1,relatedUSER2
-1,userA,userB,userC
1,userN,userD,userB
0,userF,userH,userG
0,userT,userH,userB
1,userN,userB,userA
-1,userA,userF,userC
0,userF,userH,userG
1,userB,userJ,userU
-1,userJ,userD,userA

There are a lot of user and their frequences and positions are random in every field!

I need as a output file a CSV as follows:

SUPERUSER,relatedUSER,TOTFrequencyrelations,(1)Frequency,(0)Frequency,(-1)Frequency,VALUE


Where:
SUPERUSER= most frequent user (in column USER1) in terms of relations to other users (descendent order)
relatedUSER= most frequent user in terms of relations to USER1
TOTFrequencyrelations = total number of relations between SUPERUSER and relatedUSER
(1)Frequency = amount of positive relations
(0)Frequency = amount of neutral relations
(-1)Frequency = amount of negative relations
Value = (1)Frequency-(-1)Frequency/[(1)Frequency+(-1)Frequency)]

Example:

Code:
SUPERUSER,relatedUSER,TOTFrequencyrelations,(1)Frequency,(0)Frequency,(-1)Frequency,VALUE
userG,userB,15,6,7,1,0....
userG,userF,10,2,4,4,0...
userG,userD,5,2,2,1,0...
userJ,userH,15,6,6,3,0...
userJ,userU,8,4,1,1,0....
userJ,userB,6,2,2,1,0....
userF,userA,12,4,5,3,0.....
userF,userH,12,5,2,5,0.....

Many thanks for your big help.

Have a nice time!!
# 2  
Old 06-19-2013
The first thing that occurs to me is that the input is a columns-should-be-rows flavor, so turn it into proper tuples by making a row/line/tuple for each related. Then you can deal with it like a SQL RDBMS table. You can then cut, sort and uniq -c or while read the various columns to get ranks and statistics. For instance, assuming there are always 2 related and they are peers, here is a partial solution (forgot VALUE, discarded TOTFrequencyrelations f2 -- see if you can fix it):
Code:
(
  cut -d, -f 2,3,1 in_file
  cut -d, -f 2,4,1 in_file
 ) | (
  sort
  echo ZZZEND,ZZZEND
 ) | (
  IFS=${IFS}, lrusr=
  while read usr rusr pnz
   do
 
    if [ "$lusr" = "$usr" ]
     then
 
      (( uct++ ))
 
      if [ "$lrusr" = "$rusr" ]
       then
 
        (( ruct++ ))
 
       fi
     fi
 
    if [ "lusr-$lrusr" = "$usr-$rusr" ]
     then
 
      case "$pnz" in
       (-1)
         (( pnzn++ ))
        ;;
       (0)
         (( pnzz++ ))
        ;;
       (*)
         (( pnzp++ ))
        ;;
       esac
 
      continue
     fi
 
    if [ "$lrusr" != "" ]
     then
 
      echo "$uct,$ruct,$usr,$rusr,$pnzp,$pnzz,$pnn"
 
     fi
 
    if [ "$lrusr" = "ZZZEND" ]
     then
 
      break
 
     fi
 
    uct=1 ruct=1 pnzz=0 pnzp=0 pnzn=0 lusr=$usr lrusr=$rusr
   done
 ) | sort -nr -t, | cut -d, -f 3-7

This cuts out fields for related1 and then for related2 into a common stream and feeds it to a sort, creating a sorted tuple file on the stream. The while read loop counts the user and user+related counts for sort ordering, and the +1 p, -1 n and 0 z, spitting out counts when the key changes. A dummy input trailer makes the loop spit out the last set of values. Once sorted numerically, the counts for sorting are discarded.

Last edited by DGPickett; 06-19-2013 at 05:23 PM..
# 3  
Old 06-19-2013
I am having great difficulty trying to figure out what you want. I don't see any way to convert the data you said is in your CSV file into the example output you supplied using the descriptions of the output fields you provided (even ignoring the fact that the mathematical calculation you specified will NEVER produce the ..., ...., or ..... you show in the last field of your expected output). The fields you have labeled (xxx)Frequency appear to be counts in your output rather than frequencies; but neither term matches the data shown in your example.

Please provide sample input, desired output for that sample input, and a description of the processing you want to use to convert the input into the output that is consistent with the sample data.
# 4  
Old 06-19-2013
Hello,

many thanks for your attention,

i will try to explain what i need with a real case:

Example input:

Code:
VALUE,USER1,relatedUSER1,relatedUSER2
-1,userA,userB,userC
1,userN,userD,userB
0,userF,userH,userG
0,userT,userH,userB
1,userN,userB,userA
-1,userA,userF,userC
0,userF,userH,userU
1,userB,userJ,userU
-1,userJ,userD,userA
-1,userA,userB,userC
-1,userN,userD,userB
0,userN,userH,userG


Example output:

Code:
SUPERUSER,relatedUSER,TOTFrequencyrelations,(1)Frequency,(0)Frequency,(-1)Frequency,VALUE
userN,userB,3,2,0,1,0.3
userN,userD,2,2,0,0,1
userN,userA,1,1,0,0,1
userN,userH,1,0,1,0,0
userN,userG,1,0,1,0,0
userA,userC,3,0,0,3,-1
userA,userB,2,0,0,2,-1
userA,userF,1,0,0,1,-1
userF,userH,2,0,2,0,0
userF,userU,1,0,1,0,0
userT,userH,1,0,1,0,0
userT,userB,1,0,1,0,0
userJ,userD,1,0,0,1,-1
userJ,userA,1,0,0,1,-1


Example Values for userN and userB = 3,2,0,1,0.3
3 = number of total relations
2 = number of total positive relations
0 = number of total neutral relations
1 = number of total negative relations
0.3 = value obtain from this formula 2-1/(2+1)=0.3

I hope that this simple example will be usefull!

Thanks again!
# 5  
Old 06-19-2013
Not sure if this can be done much simpler, anyway here is an awk approach:
Code:
awk -F, '
        BEGIN {
                print "SUPERUSER,relatedUSER,TOTFrequencyrelations,(1)Frequency,(0)Frequency,(-1)Frequency,VALUE"
                OFMT = "%.1f"
        }
        NR > 1 {
                U[$2]
                U[$3]
                U[$4]
                R[$2","$3]++
                R[$2","$4]++
                R[$3","$2]++
                R[$3","$4]++
                R[$4","$2]++
                R[$4","$3]++
                if ( $1 > 0 )
                {
                        P[$2","$3]++
                        P[$2","$4]++
                        P[$3","$2]++
                        P[$3","$4]++
                        P[$4","$2]++
                        P[$4","$3]++
                }
                if ( $1 == 0 )
                {
                        N[$2","$3]++
                        N[$2","$4]++
                        N[$3","$2]++
                        N[$3","$4]++
                        N[$4","$2]++
                        N[$4","$3]++
                }
                if ( $1 < 0 )
                {
                        L[$2","$3]++
                        L[$2","$4]++
                        L[$3","$2]++
                        L[$3","$4]++
                        L[$4","$2]++
                        L[$4","$3]++
                }
        }
        END {
                for ( k in U )
                {
                        for ( l in R )
                        {
                                split ( l, V )
                                if ( V[1] == k )
                                {
                                        p = P[l] ? P[l] : 0
                                        n = N[l] ? N[l] : 0
                                        z = L[l] ? L[l] : 0
                                        print V[1], V[2], R[l], p, n, z, ( ( p + z ) > 0 ? ( p - z ) / ( p + z ) : 0 )
                                }
                        }
                }
        }
' OFS=, file

# 6  
Old 06-19-2013
Quote:
Originally Posted by kraterions
... ... ...

0.3 = value obtain from this formula 2-1/(2+1)=0.3

I hope that this simple example will be usefull!

Thanks again!
In normal arithmetic, multiplication and division have higher precedence than addition and subtraction. So the formula 2-1/(2+1) (with the results rounded to one decimal place would be 1.7; not 0.3. Did you intend for the formula (using alternative variable names that don't contain parentheses) to be:
Code:
Value = (Plus1Count-Minus1Count)/(Plus1Count+Minus1Count)
        rather than
Value = Plus1Count-Minus1Count/(Plus1Count+Minus1Count)

or should the result be 1.7?
# 7  
Old 06-20-2013
I like parens even if not needed due to priority, as so few seem to remember priority rules.

Yoda takes user, rel user 1 and rel user 2 to be peers, and does relationships trilaterally, bidirectionally. Does that fit this model?

Is zero the right answer for the divide by zero case? Maybe zero cases should be in the denominator, preventing that?
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Match columns from two csv files and update field in one of the csv file

Hi, I have a file of csv data, which looks like this: file1: 1AA,LGV_PONCEY_LES_ATHEE,1,\N,1,00020460E1,0,\N,\N,\N,\N,2,00.22335321,0.00466628 2BB,LES_POUGES_ASF,\N,200,200,00006298G1,0,\N,\N,\N,\N,1,00.30887539,0.00050312... (10 Replies)
Discussion started by: djoseph
10 Replies

2. Shell Programming and Scripting

Compare 2 csv files in ksh and o/p the difference in a new csv file

(say) I have 2 csv files - file1.csv & file2.csv as mentioned below: file1.csv ID,version,cost 1000,1,30 2000,2,40 3000,3,50 4000,4,60 file2.csv ID,version,cost 1000,1,30 2000,2,45 3000,4,55 6000,5,70 The... (7 Replies)
Discussion started by: Naresh101
7 Replies

3. Shell Programming and Scripting

Comparing 2 CSV files and sending the difference to a new csv file

(say) I have 2 csv files - file1.csv & file2.csv as mentioned below: file1.csv ID,version,cost 1000,1,30 2000,2,40 3000,3,50 4000,4,60 file2.csv ID,version,cost 1000,1,30 2000,2,45 3000,4,55 6000,5,70 ... (1 Reply)
Discussion started by: Naresh101
1 Replies

4. Shell Programming and Scripting

3 column .csv --> correlation matrix; awk, perl?

Greetings, salutations. I have a 3 column csv file with ~13 million rows and I would like to generate a correlation matrix. Interestingly, you all previously provided a solution to the inverse of this problem. Thread title: "awk? adjacency matrix to adjacency list / correlation matrix to list"... (6 Replies)
Discussion started by: R3353
6 Replies

5. Shell Programming and Scripting

Perl search csv fileA where two strings exist on another csv fileB

Hi I have two csv files, with the following formats: FileA.log: Application, This occured blah Application, That occured blah Application, Also this AnotherLog, Bob did this AnotherLog, Dave did that FileB.log: Uk, London, Application, datetime, LaterDateTime, Today it had'nt... (8 Replies)
Discussion started by: PerlNewbRP
8 Replies

6. Shell Programming and Scripting

CSV to SQL insert: Awk for strings with multiple lines in csv

Hi Fellows, I have been struggling to fix an issue in csv records to compose sql statements and have been really losing sleep over it. Here is the problem: I have csv files in the following pipe-delimited format: Column1|Column2|Column3|Column4|NEWLINE Address Type|some descriptive... (4 Replies)
Discussion started by: khayal
4 Replies

7. Shell Programming and Scripting

awk? adjacency matrix to adjacency list / correlation matrix to list

Hi everyone I am very new at awk but think that that might be the best strategy for this. I have a matrix very similar to a correlation matrix and in practical terms I need to convert it into a list containing the values from the matrix (one value per line) with the first field of the line (row... (5 Replies)
Discussion started by: stonemonkey
5 Replies

8. Ubuntu

How to convert full data matrix to linearised left data matrix?

Hi all, Is there a way to convert full data matrix to linearised left data matrix? e.g full data matrix Bh1 Bh2 Bh3 Bh4 Bh5 Bh6 Bh7 Bh1 0 0.241058 0.236129 0.244397 0.237479 0.240767 0.245245 Bh2 0.241058 0 0.240594 0.241931 0.241975 ... (8 Replies)
Discussion started by: evoll
8 Replies

9. Shell Programming and Scripting

2 problems: Mailing CSV file / parsing CSV for display

I have been trying to find a good solution for this seemingly simple task for 2 days, and I'm giving up and posting a thread. I hope someone can help me out! I'm on HPUX, using sqlplus, mailx, awk, have some other tools available, but can't install stuff that isn't already in place (without a... (6 Replies)
Discussion started by: soldstatic
6 Replies

10. Shell Programming and Scripting

diagonal matrix to square matrix

Hello, all! I am struggling with a short script to read a diagonal matrix for later retrieval. 1.000 0.234 0.435 0.123 0.012 0.102 0.325 0.412 0.087 0.098 1.000 0.111 0.412 0.115 0.058 0.091 0.190 0.045 0.058 1.000 0.205 0.542 0.335 0.054 0.117 0.203 0.125 1.000 0.587 0.159 0.357... (11 Replies)
Discussion started by: yifangt
11 Replies
Login or Register to Ask a Question