awk code to process column pairs and find those with more than 1 set of possible values

11-01-2011

Hi,

I have a very wide dataset with pairs of columns starting from column 7 onwards (see example below).

Code:

0 123456 -1 0 0 -9 0 0 1 2 2 2 1 1 1 1 2 2...
0 123457 -1 0 0 -9 1 2 1 1 2 2 0 0 0 0 2 2...
0 123458 -1 0 0 -9 0 0 1 2 2 2 1 1 2 2 1 2...
0 123459 -1 0 0 -9 1 2 0 0 2 2 1 1 1 2 1 1...

For each pair of columns (7&8, 9&10, 11&12, ...) I would like to produce an indicator of whether more than one distinct pair of values occurs in that pair of columns. A pair 0 0 represents a missing value and should be ignored (i.e. not counted as its own class).

So for the first pair of columns the possibilities are 0 0 and 1 2. However because 0 0 represents a missing value, the only remaining combination is 1 2. Therefore there is only 1 possibility for the pair of columns.

For the second pair of columns the possibilities are 0 0, 1 1 and 1 2. Therefore there are 2 possibilities for the pair of columns.

The output for the above example would then be: 0 1 0 0 1 1 where the zero means there is only 1 possibility for the pair and a 1 means there is more than one possibility for the pair in the column.
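As an illustration of the rule described above, here is a minimal awk sketch (wrapped in a shell function) that reads the data file once and prints one flag per column pair. The function name `pair_flags` is hypothetical, and the sketch assumes space-separated fields, every row having the same number of columns, and pairs starting at column 7. It is only one possible approach, not a tested solution for the real dataset.

```shell
# pair_flags FILE: print one 0/1 flag per column pair (columns 7&8, 9&10, ...).
# A flag is 1 when more than one distinct non-missing pair occurs in that pair of columns.
pair_flags() {
  awk '
    {
      if (NF > maxnf) maxnf = NF
      for (i = 7; i < NF; i += 2) {
        # The pair "0 0" is a missing value and is not counted.
        if ($i == 0 && $(i + 1) == 0) continue
        key = i SUBSEP $i SUBSEP $(i + 1)
        # Count each distinct (column pair, value pair) combination once.
        if (!(key in seen)) { seen[key] = 1; distinct[i]++ }
      }
    }
    END {
      sep = ""
      for (i = 7; i < maxnf; i += 2) {
        printf "%s%d", sep, (distinct[i] > 1)
        sep = " "
      }
      print ""
    }
  ' "$1"
}
```

Called as `pair_flags datafile`, this keeps only the distinct value pairs in memory rather than the whole file, which may help with a very wide dataset.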

I hope that makes sense - it is a bit difficult to explain.

I would then also like to apply the filter 0 1 0 0 1 1 to a second file, which looks as follows (it has one row for every pair of columns in the first file):

Code:

 
22 gen1 12345
22 gen2 34678
22 gen3 47659
22 gen4 57647
22 gen5 68543
22 gen6 75624
.
.
.

The final output would then be a file containing the column 2 value of every row of this file whose corresponding filter entry is a 1, e.g.:

gen2
gen5
gen6
.
.
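The filtering step could be sketched in awk along these lines. The function name `apply_flags` is hypothetical, and the sketch assumes the flags are stored space-separated in a non-empty file of their own and that row N of the second file corresponds to flag N.

```shell
# apply_flags FLAGS_FILE SEARCH_FILE: print column 2 of each row of SEARCH_FILE
# whose corresponding flag in FLAGS_FILE is 1.
apply_flags() {
  awk '
    # First file: collect the flags in order.
    NR == FNR { for (i = 1; i <= NF; i++) flag[++n] = $i; next }
    # Second file: row number FNR pairs with flag number FNR.
    flag[FNR] == 1 { print $2 }
  ' "$1" "$2"
}
```

With the example flags and the six rows shown above, `apply_flags flags.txt second_file` (both file names are placeholders) would print gen2, gen5 and gen6, one per line.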

Any help would be appreciated. I have code to do this in R and in J, but both are too resource-intensive, and I thought that something using awk might work a lot better.

Many thanks in advance,
Kathryn

Last edited by vbe; 11-01-2011 at 09:45 AM..

kasan0

11-01-2011

See if this will be slower or faster than what you already have:

Code:

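#!/usr/bin/ksh
# 'b' is the data file; 'The_Search_File' is the second file (one row per column pair).
mFTemp='Temp_File'
rm -f File* ${mFTemp}
typeset -i mPair mMax mIdx=1
# Number of column pairs, taken from the first row (assumes all rows have the same width).
mMax=$(( ( $(head -n 1 b | wc -w) - 6 ) / 2 ))
# Split columns 7 onwards into pairs and append each non-missing pair to a per-pair file.
cut -d' ' -f7- b | while read mLine; do
  mPair=0
  echo ${mLine} | xargs -n2 | while read m1 m2; do
    mPair=${mPair}+1
    # Only the pair "0 0" is a missing value, so keep a pair unless both values are 0.
    if [[ "${m1}" != "0" || "${m2}" != "0" ]]; then
      echo "${m1} ${m2}" >> "File${mPair}"
    fi
  done
done
# Visit the pairs in numeric order; a File* glob would sort File10 before File2.
while (( mIdx <= mMax )); do
  mFName="File${mIdx}"
  mCnt=0
  # A pair that is missing on every row has no file and counts as 0 distinct values.
  [[ -f ${mFName} ]] && mCnt=$(sort -u ${mFName} | wc -l)
  if (( mCnt > 1 )); then
    echo '1' >> ${mFTemp}
  else
    echo '0' >> ${mFTemp}
  fi
  mIdx=${mIdx}+1
done
# Print column 2 of the search file for every row whose flag is 1.
paste -d' ' ${mFTemp} The_Search_File | while read m1 m2 m3 m4; do
  if [[ "${m1}" = "1" ]]; then
    echo ${m3}
  fi
done
rm -f File* ${mFTemp}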

Shell_Life

11-02-2011

Thanks very much for that. I've since found out that the problem is slightly different from what I thought so I'm going to re-post. I will try your code out though as it looks like it would be helpful. Thanks!

kasan0
