Bitwise comparison of cols

11-07-2013

Hello,

I want to count the position-by-position (bitwise) matches between every pair of columns. The problem is that I have 18486955 rows and 750 columns. Please help with code. I believe this will take a lot of time; is there a way of tracking progress?

Input

Code:

Org1    Org2    Org3
A       A       T
A       A       A
A       G       G

Output

Code:

Org1 Org2 2
Org1 Org3 1
Org2 Org3 2

ritakadm

11-07-2013

What do the numbers in the 3rd field of your output mean? It isn't the number of different pairings found (or Org1 Org3 would be 3). It isn't the number of times both elements of the pairing are the same (or Org1 Org3 would be 0).

Don Cragun

11-07-2013

Try this:

Code:

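awk 'NR==1{for(cols=1;cols<=NF;cols++)H[cols]=$cols;next}   # save column names
{for(i=1;i<=NF;i++)
    for(j=i+1;j<=NF;j++)
        if($i==$j) c[i,j]++            # count rows where columns i and j agree
}
END{
    for(i=1;i<cols;i++)                # cols is NF+1 after the header loop
        for(j=i+1;j<cols;j++)
            print H[i],H[j],0+c[i,j]   # 0+ prints 0 for pairs that never matched
} ' infile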
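As a quick sanity check, the script above can be run against the three-row sample from the first post. This is only a sketch: the scratch directory and the file name sample.txt are assumptions made for the demonstration, and the awk program is the same logic with extra whitespace.

```shell
# Work in a scratch directory so nothing existing is overwritten.
cd "$(mktemp -d)" || exit 1

# Sample input copied from the first post (whitespace-separated).
printf '%s\n' 'Org1 Org2 Org3' 'A A T' 'A A A' 'A G G' > sample.txt

pair_counts=$(awk '
NR==1 { for (cols = 1; cols <= NF; cols++) H[cols] = $cols; next }
{ for (i = 1; i <= NF; i++)
    for (j = i + 1; j <= NF; j++)
      if ($i == $j) c[i,j]++ }
END { for (i = 1; i < cols; i++)
        for (j = i + 1; j < cols; j++)
          print H[i], H[j], 0 + c[i,j] }
' sample.txt)

printf '%s\n' "$pair_counts"
```

For the sample this should reproduce the three lines of the requested output (Org1 Org2 2, Org1 Org3 1, Org2 Org3 2).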

---------- Post updated at 10:55 AM ---------- Previous update was at 10:02 AM ----------

Quote:

I believe this will take a lot of time; is there a way of tracking progress?

Yes, my initial testing indicates it may take 3 or 4 weeks of runtime! The following logs progress to stderr after each block of 100 lines processed:

Code:

awk 'NR==1{for(cols=1;cols<=NF;cols++)H[cols]=$cols;next}
{for(i=1;i<=NF;i++)
    for(j=i+1;j<=NF;j++)
        if($i==$j) c[i,j]++
}
# every 100 lines, overwrite the counter on stderr (character 13 is a carriage return)
!(NR%100) { printf("%cLines processed: %09d", 13, NR) > "/dev/stderr" }
END{
    for(i=1;i<cols;i++)
        for(j=i+1;j<cols;j++)
            print H[i],H[j],0+c[i,j]
} ' infile > outfile
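Because the progress counter goes to stderr while the results go to stdout, the two can be sent to different files, which helps when the job runs unattended. A minimal sketch, assuming the sample data from the first post, a hypothetical log name progress.log, and a reporting interval shortened from 100 lines to 2 so that the tiny sample triggers it:

```shell
cd "$(mktemp -d)" || exit 1
printf '%s\n' 'Org1 Org2 Org3' 'A A T' 'A A A' 'A G G' > sample.txt

awk 'NR==1 { for (cols = 1; cols <= NF; cols++) H[cols] = $cols; next }
{ for (i = 1; i <= NF; i++)
    for (j = i + 1; j <= NF; j++)
      if ($i == $j) c[i,j]++ }
# Interval of 2 instead of 100, only so the small sample produces output.
!(NR % 2) { printf("%cLines processed: %09d", 13, NR) > "/dev/stderr" }
END { for (i = 1; i < cols; i++)
        for (j = i + 1; j < cols; j++)
          print H[i], H[j], 0 + c[i,j] }
' sample.txt > outfile 2> progress.log

# The counter updates are separated by carriage returns; show one per line.
tr '\r' '\n' < progress.log
echo
cat outfile
```

With the real data, running tail on the log from another terminal would show how far the job has got without touching the results file.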

Last edited by Chubler_XL; 11-07-2013 at 09:11 PM..

Chubler_XL

11-08-2013

Quote:

Originally Posted by Don Cragun

What do the numbers in the 3rd field of your output mean? It isn't the number of different pairings found (or Org1 Org3 would be 3). It isn't the number of times both elements of the pairing are the same (or Org1 Org3 would be 0).

The third number is the number of matches (bitwise, i.e. position by position). Org1 is AAA and Org3 is TAG, so only the middle A matches, hence the 1. Likewise, comparing AAA and AAG gives 2, and comparing TAG and GTA gives 0.
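To make that definition concrete, here is a small sketch that counts the positions at which two strings agree. The helper name count_matches is hypothetical and is not part of any script in this thread; it only restates the rule for a single pair of columns written as strings.

```shell
# Hypothetical helper: count positions where two equal-length strings agree.
count_matches() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = 0
    for (k = 1; k <= length(a); k++)
      if (substr(a, k, 1) == substr(b, k, 1)) n++
    print n
  }'
}

count_matches AAA TAG   # Org1 against Org3 from the sample
count_matches AAA AAG
count_matches TAG GTA
```

The three calls should print 1, 2 and 0, matching the examples given above.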

---------- Post updated 11-08-13 at 10:40 AM ---------- Previous update was 11-07-13 at 10:54 PM ----------

Quote:

Yes, my initial testing indicates it may take 3 or 4 weeks of runtime! The following logs progress to stderr after each block of 100 lines processed:

OK, I guess I have to wait. I just started the process with & at the end. I believe it will continue running in the background even if I close the remote terminal?

ritakadm

11-08-2013

Logging out (or losing the connection) will typically send a SIGHUP to your background processes, causing them to stop. disown will keep the jobs from receiving that signal; see the man page.
Starting the job with nohup when sending it to the background will do something similar.
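A minimal sketch of the nohup variant follows. The sh -c command is only a stand-in for the long awk run, and the file names job.out and job.err are assumptions made for the demonstration.

```shell
cd "$(mktemp -d)" || exit 1

# Stand-in for the long awk run (an assumption for this demo). nohup makes
# the command ignore SIGHUP; the redirections keep output off the terminal.
nohup sh -c 'echo finished' > job.out 2> job.err &
job_pid=$!
echo "started background job $job_pid"

# For a job already started with a plain "&", bash offers the builtin
# "disown -h %1", which marks job 1 so that it is not sent SIGHUP.

# Only for the demo: wait for the job and show what it wrote.
wait "$job_pid"
cat job.out
```

In real use one would not wait; the point is that the job keeps its output files and survives the end of the login session.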

RudiC

11-08-2013

Hi.

Most modern computers have multiple cores and/or multiple CPUs. This task is very CPU-intensive: with 750 columns there are 750 × 749 / 2 = 280,875 column pairs, so about 280,000 comparisons per line (if my binomial calculation is correct).
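The pair count can be checked with shell arithmetic. This is only a back-of-the-envelope sketch; it takes the row and column counts from the first post and assumes 64-bit shell integers for the total.

```shell
cols=750
rows=18486955   # row count quoted in the first post (may include the header)

# Number of unordered column pairs: n * (n - 1) / 2
pairs=$(( cols * (cols - 1) / 2 ))

# Total field comparisons over the whole file (needs 64-bit arithmetic).
total=$(( pairs * rows ))

echo "pairs per row: $pairs"
echo "comparisons in total: $total"
```

That is on the order of 5 × 10^12 comparisons for the whole file, which is why the run time is measured in weeks rather than hours.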

So it makes sense to try and utilize all the power that the computer has. Here is an example that uses the awk code of Chubler_XL (which I will not list -- it is in a separate file "a1").

The idea is that the input file is split up and many instances are run simultaneously (hence "parallel"). This script runs 1, 2, and 4 instances. The computer is a beefy server with a 3-GHz Xeon CPU that has 4 cores, each with hyper-threading:

Code:

#!/usr/bin/env bash

# @(#) s1	Demonstrate real time decrease with use of command parallel.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
edges() { local _f _n _l;: ${1?"edges: need file"}; _f=$1;_l=$(wc -l < "$_f");
  head -n "${_n:=3}" "$_f" ; pe "--- ( $_l: lines total )" ; tail -n "$_n" "$_f" ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }	# redefined as a no-op to switch the debug output off
C=$HOME/bin/context && [ -f $C ] && $C awk split
pe "GNU parallel 20120422"
pe "CPU: Intel Xeon CPU E31230 @ 3.201GHz"
pe "RAM: 6101MB / 16077MB"

rm -f results.[124]
FILE=${1-data0.txt}
rm -f xa?
# Drop the "#" comment lines, then split the rest into 100-line chunks (xaa, xab, ...).
sed -n '/^#/!p' "$FILE" |
split -l 100

pl " Input data file xaa:"
wc xaa

pl " Results, 1 process, 100 lines:"
time parallel --gnu --jobs=1 awk -f a1 ::: xaa > results.1
wc results.1

rm -f xa?
sed -n '/^#/!p' "$FILE" |
split -l 50
pl " Input data files xaa, xab:"
wc xa[ab]
pl " Results, 2 processes, 50 lines:"
time parallel --gnu --jobs=2 awk -f a1 ::: xaa xab > results.2
wc results.2

rm -f xa?
sed -n '/^#/!p' "$FILE" |
split -l 25
pl " Input data files xa[abcd]:"
wc xa[abcd]
pl " Results, 4 processes, 25 lines:"
time parallel --gnu --jobs=4 awk -f a1 ::: xaa xab xac xad > results.4
wc results.4

exit 0

producing:

Code:

./s1

Environment: LC_ALL = C, LANG = en_US.UTF-8
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.2.0-4-amd64, x86_64
Distribution        : Debian GNU/Linux 7.2 (wheezy, vm-server-ng) 
GNU bash 4.2.37
GNU Awk 4.0.1
split (GNU coreutils) 8.13
GNU parallel 20120422
CPU: Intel Xeon CPU E31230 @ 3.201GHz
RAM: 6101MB / 16077MB

-----
 Input data file xaa:
   100  75000 150000 xaa

-----
 Results, 1 process, 100 lines:
Lines processed: 000000100
real	0m13.971s
user	0m13.573s
sys	0m0.048s
 280875  842625 1966125 results.1

-----
 Input data files xaa, xab:
    50  37500  75000 xaa
    50  37500  75000 xab
   100  75000 150000 total

-----
 Results, 2 processes, 50 lines:

real	0m7.432s
user	0m14.077s
sys	0m0.060s
 561750 1685250 3923136 results.2

-----
 Input data files xa[abcd]:
    25  18750  37500 xaa
    25  18750  37500 xab
    25  18750  37500 xac
    25  18750  37500 xad
   100  75000 150000 total

-----
 Results, 4 processes, 25 lines:

real	0m4.392s
user	0m15.977s
sys	0m0.084s
1123500 3370500 7021826 results.4

The user time is always about the same because the same number of comparisons has to be done regardless of how many processes are running. The real time, however, decreases with each additional job (process and, in this case, effectively core): 13.97 s with 1 job, 7.43 s with 2, and 4.39 s with 4, which is close to linear. So one might expect something approaching a 20-fold decrease if one had 20 CPUs available. In reality there is a slight amount of overhead from parallel, but I noticed a decrease in real time with more than 1 job even on a single CPU (on a different computer). Although this task is CPU-intensive, there may be disk contention if there is a large number of processes. There is no way to predict how many processes would count as "large", so testing will be needed if many cores are available.
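The timings above can be extrapolated to the full file. This is a rough sketch only: it assumes the run time scales linearly with the number of rows, uses the 18486955 rows quoted in the first post, and applies to this test machine. The function name estimate_days is hypothetical.

```shell
# Hypothetical helper: scale a timing for 100 rows up to the whole file.
estimate_days() {
  # $1 = seconds measured per 100 rows, $2 = total number of rows
  awk -v secs="$1" -v rows="$2" \
    'BEGIN { printf "%.1f\n", rows / 100 * secs / 86400 }'
}

estimate_days 13.971 18486955   # real time measured with 1 job
estimate_days 4.392 18486955    # real time measured with 4 jobs
```

On these numbers a single process would need roughly 30 days, in line with the estimate of 3 or 4 weeks given earlier in the thread, and four jobs would bring that down to somewhat under 10 days.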

The output of all the processes is collected in a single results file, so the counts for each pair will still need to be reduced (summed) into one line per pair. Outside of debugging, this seems like the only downside to me.
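The reduce step could look something like the sketch below. The files part.1 and part.2 and their contents are made up for illustration. It also assumes every chunk was counted under the same column names; since the awk script treats the first line of its input as the header, each chunk produced by split would presumably need the header line prepended before it is processed.

```shell
cd "$(mktemp -d)" || exit 1

# Two hypothetical per-chunk result files in the "name name count" format.
printf '%s\n' 'Org1 Org2 2' 'Org1 Org3 1' 'Org2 Org3 0' > part.1
printf '%s\n' 'Org1 Org2 1' 'Org1 Org3 0' 'Org2 Org3 2' > part.2

# Sum the third field per pair, keeping the first-seen order for the output.
merged=$(awk '
{ key = $1 " " $2
  if (!(key in sum)) order[++n] = key
  sum[key] += $3 }
END { for (k = 1; k <= n; k++) print order[k], sum[order[k]] }
' part.1 part.2)

printf '%s\n' "$merged"
```

With 750 columns the array holds 280,875 keys, which should be a small job compared with the counting itself.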

I recognize that this may be too advanced for the OP, but if there is time to spend weeks waiting for the output, then perhaps the OP could enlist the help of a colleague.

For purposes of comparisons of methods that others may propose, I have uploaded a copy of the raw 1000-line 750 field/line text data file. The comments at the beginning of the file describe the file. As noted, I used only the first 100 lines.

Best wishes ... cheers, drl

data0.txt (1.43 MB)

drl

11-09-2013

Out of sheer curiosity: letting the outer loop run only to NF-1 might yield a run-time reduction of a fraction of a percent, since the last column has no partner left to compare with:

Code:

{for(i=1; i<NF; i++) ...

Unfortunately, replacing the

Code:

 if($i==$j) c[i,j]++

by

Code:

 c[i,j]+= ($i==$j)

increases run time dramatically. A likely reason is that the += form performs an array lookup and store for every pair on every line, whereas the if form touches the array only when the two fields match.
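Both forms should at least produce the same counts, which is easy to confirm on the sample from the first post before timing them on real data. A small sketch; the helper run_variant is hypothetical and simply splices the per-pair statement into the awk program, with the outer loop already shortened to NF-1.

```shell
cd "$(mktemp -d)" || exit 1
printf '%s\n' 'Org1 Org2 Org3' 'A A T' 'A A A' 'A G G' > sample.txt

run_variant() {
  # $1 is the awk statement executed for each column pair (i, j).
  awk 'NR==1 { for (cols = 1; cols <= NF; cols++) H[cols] = $cols; next }
       { for (i = 1; i < NF; i++)
           for (j = i + 1; j <= NF; j++) { '"$1"' } }
       END { for (i = 1; i < cols; i++)
               for (j = i + 1; j < cols; j++)
                 print H[i], H[j], 0 + c[i,j] }' sample.txt
}

v1=$(run_variant 'if ($i == $j) c[i,j]++')   # increment only on a match
v2=$(run_variant 'c[i,j] += ($i == $j)')     # add 0 or 1 for every pair

[ "$v1" = "$v2" ] && echo "both variants give the same counts"
```

Prefixing each call with time on a larger file would show how the run times of the two variants compare on a given awk implementation.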

RudiC
