awk code to process column pairs and find those with more than 1 set of possible values


 
# 1  
Old 11-01-2011

Hi,

I have a very wide dataset with pairs of columns starting from column 7 onwards (see example below).
Code:
0 123456 -1 0 0 -9 0 0 1 2 2 2 1 1 1 1 2 2...
0 123457 -1 0 0 -9 1 2 1 1 2 2 0 0 0 0 2 2...
0 123458 -1 0 0 -9 0 0 1 2 2 2 1 1 2 2 1 2...
0 123459 -1 0 0 -9 1 2 0 0 2 2 1 1 1 2 1 1...

For each pair of columns (7&8, 9&10, 11&12, ...) I would like to produce an indicator of whether more than one distinct pair of values occurs in that pair of columns. The pair 0 0 represents a missing value and should be ignored (i.e. not counted as its own class).

So for the first pair of columns the possibilities are 0 0 and 1 2. However because 0 0 represents a missing value, the only remaining combination is 1 2. Therefore there is only 1 possibility for the pair of columns.

For the second pair of columns the possibilities are 0 0, 1 1 and 1 2. Therefore there are 2 possibilities for the pair of columns.

The output for the above example would then be: 0 1 0 0 1 1 where the zero means there is only 1 possibility for the pair and a 1 means there is more than one possibility for the pair in the column.

I hope that makes sense - it is a bit difficult to explain.
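
The indicator described above can be sketched in awk as a single pass over the file. This is only a sketch under stated assumptions: the function name `pair_filter` is made up for illustration, fields are assumed to be whitespace-separated, and only the exact pair 0 0 is treated as missing.

```shell
# pair_filter FILE  (hypothetical helper name)
# Prints one 0/1 indicator per column pair (7&8, 9&10, ...):
# 1 if more than one distinct non-missing pair occurs, 0 otherwise.
pair_filter() {
  awk '
  {
    if (NF > maxnf) maxnf = NF                   # widest row seen so far
    for (i = 7; i < NF; i += 2) {
      if ($i == 0 && $(i+1) == 0) continue       # 0 0 is a missing value
      key = i SUBSEP $i SUBSEP $(i+1)            # column pair + its two values
      if (!(key in seen)) { seen[key] = 1; distinct[i]++ }
    }
  }
  END {
    for (i = 7; i < maxnf; i += 2)
      printf "%s%d", (i > 7 ? " " : ""), (distinct[i] > 1)
    print ""
  }' "$1"
}
# Usage (file names are hypothetical): pair_filter genotypes.txt > filter.txt
```

Only the distinct pairs per column pair are kept in memory, so the memory use should depend on the width of the file rather than on the number of rows.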

I would then also like to apply the filtering 0 1 0 0 1 1 to a second file which looks as follows (there is one row for every pair of columns in the first file):
Code:
 
22 gen1 12345
22 gen2 34678
22 gen3 47659
22 gen4 57647
22 gen5 68543
22 gen6 75624
.
.
.

The final output would then be a file containing column 2 of every row in this file whose corresponding filter entry is 1, e.g.:
gen2
gen5
gen6
.
.
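
Applying the filter to the second file could be sketched like this, assuming the 0/1 filter has been saved as a single space-separated line in its own file and that row N of the second file corresponds to pair N. The function name `apply_filter` and the file names in the usage comment are hypothetical.

```shell
# apply_filter FILTER_FILE LIST_FILE  (hypothetical helper name)
# Prints column 2 of each LIST_FILE row whose filter entry is 1.
apply_filter() {
  awk '
  NR == FNR {                        # first file: the single filter line
    for (i = 1; i <= NF; i++) keep[i] = $i
    next
  }
  keep[FNR] == 1 { print $2 }        # second file: row N matches pair N
  ' "$1" "$2"
}
# Usage (file names are hypothetical): apply_filter filter.txt genes.txt
```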

Any help would be appreciated. I have code to do this in R and in J, but both are too resource-intensive, and I thought that something using awk might work a lot better.

Many thanks in advance,
Kathryn

Last edited by vbe; 11-01-2011 at 09:45 AM..
# 2  
Old 11-01-2011
See if this will be slower or faster than what you already have:
Code:
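#!/usr/bin/ksh
# 'b' is the data file and 'The_Search_File' the list of names; adjust both.
rm -f File* Temp_File
typeset -i mPair
# Pass 1: write every non-missing pair to one file per column pair.
cut -d' ' -f7- b | while read mLine; do
  mPair=0
  echo ${mLine} | xargs -n2 | while read m1 m2; do
    mPair=${mPair}+1
    # Zero-padded names keep the File* glob below in numeric pair order.
    mFName=$(printf 'File%06d' ${mPair})
    # Create the file even if every value is missing, so pairs stay aligned.
    touch ${mFName}
    # Only the pair 0 0 is a missing value.
    if [[ "${m1}" != "0" || "${m2}" != "0" ]]; then
      echo "${m1} ${m2}" >> ${mFName}
    fi
  done
done
# Pass 2: one indicator per pair, 1 if more than one distinct pair was seen.
mFTemp='Temp_File'
for mFName in File*; do
  sort -o ${mFName} -u ${mFName}
  mCnt=$(wc -l < ${mFName})
  # Numeric test: some wc implementations pad the count with blanks.
  if [[ ${mCnt} -le 1 ]]; then
    echo '0' >> ${mFTemp}
  else
    echo '1' >> ${mFTemp}
  fi
done
# Pass 3: print column 2 of the rows whose indicator is 1.
paste -d' ' ${mFTemp} The_Search_File | while read m1 m2 m3 m4; do
  if [[ "${m1}" = "1" ]]; then
    echo ${m3}
  fi
done
rm -f File*
rm -f ${mFTemp}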

# 3  
Old 11-02-2011
Thanks very much for that. I've since found out that the problem is slightly different from what I thought so I'm going to re-post. I will try your code out though as it looks like it would be helpful. Thanks!
 