Bitwise comparison of cols


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Bitwise comparison of cols
# 1  
Old 11-07-2013
Bitwise comparison of cols

Hello,

I want to compute the bitwise number of matches in pairwise fashion for all columns. The problem is I have 18486955 rows and 750 columns. Please help with code, I believe this will take a lot of time, is there a way of tracking progress?

Input
Code:
Org1    Org2    Org3
A    A    T
A     A    A
A    G    G

Output
Code:
Org1 Org2 2
Org1 Org3 1
Org2 Org3 2

# 2  
Old 11-07-2013
Quote:
Originally Posted by ritakadm
Hello,

I want to compute the bitwise number of matches in pairwise fashion for all columns. The problem is I have 18486955 rows and 750 columns. Please help with code, I believe this will take a lot of time, is there a way of tracking progress?

Input
Code:
Org1    Org2    Org3
A    A    T
A     A    A
A    G    G

Output
Code:
Org1 Org2 2
Org1 Org3 1
Org2 Org3 2

What do the numbers in the 3rd field of your output mean? It isn't the number of different pairings found (or Org1 Org3 would be 3). It isn't the number of time both elements of the pairing are the same (or Org1 Org3 would be 0).
# 3  
Old 11-07-2013
Try this

Code:
awk 'NR==1{for(cols=1;cols<=NF;cols++)H[cols]=$cols;next}
{for(i=1;i<=NF;i++)
    for(j=i+1;j<=NF;j++)
	   if($i==$j) c[i,j]++
}
END{
   for(i=1;i<cols;i++)
      for(j=i+1;j<cols;j++)
	  print H[i],H[j],0+c[i,j]
} ' infile

---------- Post updated at 10:55 AM ---------- Previous update was at 10:02 AM ----------

Quote:
I believe this will take a lot of time, is there a way of tracking progress?
Yes, my initial testing indicates it may take 3 or 4 weeks or runtime! The following logs each block to 100 lines processed to stderr:

Code:
awk 'NR==1{for(cols=1;cols<=NF;cols++)H[cols]=$cols;next}
{for(i=1;i<=NF;i++)
    for(j=i+1;j<=NF;j++)
       if($i==$j) c[i,j]++
}
!(NR%100) { printf("%cLines processed: %09d", 13, NR)> "/dev/stderr"}
END{
   for(i=1;i<cols;i++)
      for(j=i+1;j<cols;j++)
      print H[i],H[j],0+c[i,j]
} ' infile > outfile


Last edited by Chubler_XL; 11-07-2013 at 09:11 PM..
This User Gave Thanks to Chubler_XL For This Post:
# 4  
Old 11-08-2013
Quote:
Originally Posted by Don Cragun
What do the numbers in the 3rd field of your output mean? It isn't the number of different pairings found (or Org1 Org3 would be 3). It isn't the number of time both elements of the pairing are the same (or Org1 Org3 would be 0).

The third number is the number of matches (bitwise),, Org1 is AAA,Org3 is TAG,, so only the middle A matches ..hence the 1,,so comparing AAA and AAG is 2, TAG and GTA is 0..

---------- Post updated 11-08-13 at 10:40 AM ---------- Previous update was 11-07-13 at 10:54 PM ----------

Quote:
Yes, my initial testing indicates it may take 3 or 4 weeks or runtime! The following logs each block to 100 lines processed to stderr:
Ok, I guess I have to wait,,,I just started the process with & at the end, ,,, I believe even if I close the remote terminal, it will continue running in the background?
# 5  
Old 11-08-2013
Logging out will send a sighup to your background processes causing them to stop. disown will make the jobs not receive that signal, see man page.
nohup when sending the job to background will do similar.
These 2 Users Gave Thanks to RudiC For This Post:
# 6  
Old 11-08-2013
Hi.

Most modern computers have multiple cores and/or multiple CPUs. This task is very CPU-intensive: about 280,000 comparisons per line (if my binomial calculation is correct).

So it makes sense to try and utilize all the power that the computer has. Here is an example that uses the awk code of Chubler_XL (which I will not list -- it is in a separate file "a1").

The idea is that the input file is split up and many instances are run simultaneously (hence "parallel"). This script will run 1,2, and 4 instances. The computer is a beefy server that uses a 3-GHz XEON CPU, 4-cores, each with hyper-threading:
Code:
#!/usr/bin/env bash

# @(#) s1	Demonstrate real time decrease with use of command parallel.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
edges() { local _f _n _l;: ${1?"edges: need file"}; _f=$1;_l=$(wc -l $_f);
  head -${_n:=3} $_f ; pe "--- ( $_l: lines total )" ; tail -$_n $_f ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C awk split
pe "GNU parallel 20120422"
pe "CPU: Intel Xeon CPU E31230 @ 3.201GHz"
pe "RAM: 6101MB / 16077MB"

rm -f results.[124]
FILE=${1-data0.txt}
rm -f xa?
sed -n '/^#/!p' $FILE |
split -l 100

pl " Input data file xaa:"
wc xaa

pl " Results, 1 process, 100 lines:"
time parallel --gnu --jobs=1 awk -f a1 ::: xaa > results.1
wc results.1

rm -f xa?
sed -n '/^#/!p' $FILE |
split -l 50
pl " Input data files xaa, xab:"
wc xa[ab]
pl " Results, 2 processes, 50 lines:"
time parallel --gnu --jobs=2 awk -f a1 ::: xaa xab > results.2
wc results.2

rm -f xa?
sed -n '/^#/!p' $FILE |
split -l 25
pl " Input data files xa[abcd]:"
wc xa[abcd]
pl " Results, 4 processes, 25 lines:"
time parallel --gnu --jobs=4 awk -f a1 ::: xaa xab xac xad > results.4
wc results.4

exit 0

producing:
Code:
./s1

Environment: LC_ALL = C, LANG = en_US.UTF-8
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.2.0-4-amd64, x86_64
Distribution        : Debian GNU/Linux 7.2 (wheezy, vm-server-ng) 
GNU bash 4.2.37
GNU Awk 4.0.1
split (GNU coreutils) 8.13
GNU parallel 20120422
CPU: Intel Xeon CPU E31230 @ 3.201GHz
RAM: 6101MB / 16077MB

-----
 Input data file xaa:
   100  75000 150000 xaa

-----
 Results, 1 process, 100 lines:
Lines processed: 000000100
real	0m13.971s
user	0m13.573s
sys	0m0.048s
 280875  842625 1966125 results.1

-----
 Input data files xaa, xab:
    50  37500  75000 xaa
    50  37500  75000 xab
   100  75000 150000 total

-----
 Results, 2 processes, 50 lines:

real	0m7.432s
user	0m14.077s
sys	0m0.060s
 561750 1685250 3923136 results.2

-----
 Input data files xa[abcd]:
    25  18750  37500 xaa
    25  18750  37500 xab
    25  18750  37500 xac
    25  18750  37500 xad
   100  75000 150000 total

-----
 Results, 4 processes, 25 lines:

real	0m4.392s
user	0m15.977s
sys	0m0.084s
1123500 3370500 7021826 results.4

The user time will always be about the same because we need to do n operations, regardless of how many processes are running. The real time, however, decreases almost linearly with the addition of "jobs" (processes, and, in this case, effectively cores). So one might expect a 20-fold decrease if one had 20 CPUs available. In reality, there is a slight amount of overhead from parallel, but I noticed a decrease in real time even with more than 1 job and a single CPU (on a different computer). Although this is CPU-intensive, there may be disk contention if there is a large number of processes. There is no way to predict what the value large would be, so testing will need to be done if many cores are available.

The output files are collected, and will need to be reduced to gather similar counts of the pairs. Outside of debugging, this seems like the only downside to me.

I recognize that this may be too advanced for the OP, but if he has time to spend over weeks waiting for the output, then perhaps he could enlist the help of a colleague.

For purposes of comparisons of methods that others may propose, I have uploaded a copy of the raw 1000-line 750 field/line text data file. The comments at the beginning of the file describe the file. As noted, I used only the first 100 lines.

Best wishes ... cheers, drl
This User Gave Thanks to drl For This Post:
# 7  
Old 11-09-2013
Sheer curiosity: having the main loop run to NF-1 only might yield a fraction of a percent run time reduction:
Code:
{for(i=1; i<NF; i++) ...

Unfortunately, replacing the
Code:
 if($i==$j) c[i,j]++

by
Code:
 c[i,j]+= ($i==$j)

increases run time dramatically, probably because the ++ operation is a register operation while the other is a full addition.
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Programming

Bitwise operation for state machine

Hello All, I am writing basic state machine which maintains 8 different states and there is posibility that system may be in multiple states at a time (Except for state1 to state3. menas only once state can be active at a time from state1 to state3). I have declared... (9 Replies)
Discussion started by: anand.shah
9 Replies

2. Shell Programming and Scripting

how to use bitwise or operator in /bin/sh

please any one can suggest me how to use bitesie || opearator to do this #initallize a=0 b=0 #condition if then a=0 else a=1 fi #bitwise or opeartion b = a || b Please view this code tag video for how to use code tags when posting code and data. (3 Replies)
Discussion started by: Palaniappan
3 Replies

3. FAQ Submission Queue

Analysis in bitwise XOR

The purpose of this article is revealing the unrevealed parts of the bitwise XOR. As we aware, the truth table for the XOR operator is : A B A^B 0 0 0 0 1 1 1 0 1 1 1 0 For example , 1^2 will be calculated as given below: First the operands... (1 Reply)
Discussion started by: pandeesh
1 Replies

4. Emergency UNIX and Linux Support

bitwise and between two 32 bit binaries

Hello All, i have two 16 bit binaries that in two different variables, i want to perform a bitwise AND between the two and store the result in a different variable. can anyone throw some light on doing this in a bourne shell... eg var1= 1110101010101011 ... (8 Replies)
Discussion started by: venu
8 Replies

5. Shell Programming and Scripting

Grouping matches by cols

Dear all I have a large file w. ~ 10 million lines. The first two cols have matching partners. For example: A A A B B B or A A B A B B The matches may be separated by an unknown number of lines. My intention is to group them and add a "group" value in the last col. For... (12 Replies)
Discussion started by: gbalsu
12 Replies

6. Programming

bitwise and if

Hi Suppose we have these code lines: #define _IN_USE 0x001 /* set when process slot is in use */ #define _EXITING 0x002 /* set when exit is expected */ #define _REFRESHING 0x004 ... 1 main () { 2 3 unsigned r_flags =_REFRESHING; 4 5 if (r_flag &... (3 Replies)
Discussion started by: Puntino
3 Replies

7. Shell Programming and Scripting

Bitwise negation

I am taking an online course on Unix scripting. The topic is Unix arithmetic operators and the lesson is Logical and bitwise operations. It is not clear how much storage space Unix uses to represent integers that are typed. Bitwise negation caused me to question how many bits are used to... (3 Replies)
Discussion started by: dLloydm
3 Replies

8. Programming

resetting counter using bitwise XOR

Hi ! How to reset a variable to 0 after a reset value, say 10 using bitwise XOR. For example, int cnt=0; if(cnt<10) cnt++; else cnt = 0; How can we achieve this by using XOR only. thanks, (1 Reply)
Discussion started by: mrgubbala
1 Replies

9. UNIX for Advanced & Expert Users

bitwise operators

can anybody write a program to divide a number by another number using bitwise operators (9 Replies)
Discussion started by: areef4u
9 Replies

10. Programming

Bit-fields and Bitwise operators

Hi, Is it possible to use bitwise operators in bit fields? For example: typedef struct Mystruct { unsigned char A :1 ; unsigned char B :1 ; } Mystruct; and assume struct Mystruct STR_1S, STR_2S, tempSTRS = {0}; then the following line: tempSTRS = STR_1S & STR_2S; gives the... (3 Replies)
Discussion started by: amatsaka
3 Replies
Login or Register to Ask a Question