Clustering data by matching columns


 
# 1  
Old 03-20-2012

I am stuck with my DNA clustering analysis. I thought this forum would be a great help with data manipulation. Please help me.

I have a table with 91 columns. First I want to trim the table so that it keeps only rows in which every column value is a single character: A, T, G, C or 0. Any row containing a value such as AA, AAG, AATG, Y or K has to be filtered out.
I figured the regular expression would be something like [0ATGC].
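
A minimal sketch of that row filter, assuming whitespace-separated fields and a header on the first line. An unanchored [0ATGC] would also accept longer values such as AAG, so the pattern here is anchored as ^[0ATGC]$. The function name trim_rows and the variable names are made up for illustration; the sample data is the 6-column example from this post.

```shell
# Sample table from this post (header line plus five data rows).
input='Col1 Col2 Col3 Col4 Col5 Col6
A G G G T A
A Y R R TT A
A G T T T A
A G G T 0 A
A 0 R T TT AGGGGTT'

# Keep the header (NR == 1) and every data row whose fields all match
# ^[0ATGC]$. The anchors ^ and $ restrict a field to exactly one character.
trim_rows() {
  awk 'NR > 1 { for (i = 1; i <= NF; i++) if ($i !~ /^[0ATGC]$/) next } 1'
}

trimmed=$(printf '%s\n' "$input" | trim_rows)
printf '%s\n' "$trimmed"
```

On the sample data this keeps the header and the three rows shown in the trimmed table.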

Next I want to compare all the columns pairwise and group the columns that have exactly the same values. The intermediate table output is not required.

Example input for 6 columns

Code:
Col1 Col2 Col3 Col4 Col5 Col6
A G G G T A
A Y R R TT A
A G T T T A
A G G T 0 A
A 0 R T TT AGGGGTT

Trimmed table (no output required)

Code:
Col1 Col2 Col3 Col4 Col5 Col6
A G G G T A
A G T T T A
A G G T 0 A

Clustering output (desired output)

Code:
Col1,Col6
Col2
Col3
Col4
Col5

# 2  
Old 03-20-2012
Try:
Code:
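awk 'NR>1{for (i=1;i<=NF;i++) if (length($i)>1) next}1' file |   # drop rows with multi-character values
awk 'NR==1{nf=NF; for (i=1;i<=NF;i++) n[i]=$i}   # remember the column names and count
NR>1{for (i=1;i<=nf;i++) a[i]=a[i] $i}           # one concatenated string per column
END{for (i=1;i<=nf;i++) b[a[i]]=b[a[i]] "," n[i] # group names by identical string
for (i in b) {sub(/^,/,"",b[i]); print b[i]}}'   # output order is unspecified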
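
To see why this groups identical columns, it may help to look at the per-column strings that the second awk builds: each column's values are concatenated top to bottom, and columns with the same string land in the same cluster. Because every surviving value is a single character, the concatenation is unambiguous. The sketch below only prints those strings for the trimmed example table; column_signatures is a hypothetical helper name, not something from the post.

```shell
# Trimmed example table from the first post.
trimmed='Col1 Col2 Col3 Col4 Col5 Col6
A G G G T A
A G T T T A
A G G T 0 A'

# Print each column name next to the string formed by its values, top to bottom.
column_signatures() {
  awk 'NR == 1 { nf = NF; for (i = 1; i <= nf; i++) name[i] = $i; next }
       { for (i = 1; i <= nf; i++) sig[i] = sig[i] $i }
       END { for (i = 1; i <= nf; i++) print name[i], sig[i] }'
}

signatures=$(printf '%s\n' "$trimmed" | column_signatures)
printf '%s\n' "$signatures"
```

Col1 and Col6 both come out as AAA, which is why they end up on the same output line.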

# 3  
Old 03-20-2012
Hi bartus11,
Can you also help me filter out single-character values other than A, T, G, C and 0? Single characters such as Y, M and R are present in my input and are not allowed.
Many thanks
# 4  
Old 03-20-2012
Code:
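awk 'NR>1{for (i=1;i<=NF;i++) if ($i!~/^[ACGT0]$/) next}1' file |   # keep rows made only of A, C, G, T or 0
awk 'NR==1{nf=NF; for (i=1;i<=NF;i++) n[i]=$i}   # remember the column names and count
NR>1{for (i=1;i<=nf;i++) a[i]=a[i] $i}           # one concatenated string per column
END{for (i=1;i<=nf;i++) b[a[i]]=b[a[i]] "," n[i] # group names by identical string
for (i in b) {sub(/^,/,"",b[i]); print b[i]}}'   # output order is unspecified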
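
A possible single-pass variant, offered only as a sketch: it filters and clusters in one awk program and prints the clusters in first-seen column order, since awk does not guarantee any order for "for (i in b)". It assumes whitespace-separated fields with a header on line 1, and it also skips rows whose field count differs from the header, which is an added assumption. The name cluster_columns and the sample variable are made up for illustration.

```shell
# Sample table from the first post (header plus five data rows).
input='Col1 Col2 Col3 Col4 Col5 Col6
A G G G T A
A Y R R TT A
A G T T T A
A G G T 0 A
A 0 R T TT AGGGGTT'

# cluster_columns is a hypothetical helper name, not part of the thread.
cluster_columns() {
  awk '
    NR == 1 { nf = NF; for (i = 1; i <= nf; i++) name[i] = $i; next }
    NF != nf { next }                        # assumption: skip ragged rows
    {
      for (i = 1; i <= nf; i++)
        if ($i !~ /^[0ATGC]$/) next          # drop rows with a disallowed value
      for (i = 1; i <= nf; i++)
        sig[i] = sig[i] $i                   # append this row to each column string
    }
    END {
      for (i = 1; i <= nf; i++) {
        s = sig[i]
        if (s in group) {
          group[s] = group[s] "," name[i]    # join an existing cluster
        } else {
          group[s] = name[i]                 # start a new cluster
          order[++n] = s                     # remember first-seen order
        }
      }
      for (k = 1; k <= n; k++) print group[order[k]]
    }'
}

clusters=$(printf '%s\n' "$input" | cluster_columns)
printf '%s\n' "$clusters"
```

On the sample input this prints Col1,Col6 first and then Col2 through Col5 on separate lines, matching the desired output in the first post.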

# 5  
Old 03-20-2012
Works like a charm! Thank you, bartus11.