Visit Our UNIX and Linux User Community


Getting unique based on clusters


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Getting unique based on clusters
# 1  
Old 08-16-2013
Getting unique based on clusters

Hi,

I have a file with 25 clusters and each cluster has multiple rows. I need to find the unique genes in each cluster and assign them

Code:
Annotation Cluster 2	Enrichment Score: 10.199579524507685											
Category	Term	Count	%	PValue	Genes	List Total	Pop Hits	Pop Total	Fold Enrichment	Bonferroni	Benjamini	FDR
GOTERM_CC_FAT	GO:0031012~extracellular matrix	38	9.268292683	1.42E-14	WNT5B, LTBP2, TNC, SPOCK1, POSTN, MMP3, MMP2, MMP1, TGFB2, OGN, LAMB4, SMOC2, CD44, CRISPLD2, TGFBI, COL6A3, COL6A2, COL6A1, LOX, THBS1, SPON2, TFPI2, COL11A1, FN1, COL18A1, COL4A1, LGALS3, FBN1, CCDC80, MGP, NID1, SPARC, COL5A1, PRELP, ADAMTS7, THSD4, VCAN, COL1A1	307	345	12782	4.585903791	3.92E-12	9.81E-13	1.88E-11
SP_PIR_KEYWORDS	extracellular matrix	30	7.317073171	2.91E-14	WNT5B, TNC, SPOCK1, POSTN, MMP3, MMP2, MMP1, SMOC2, LAMB4, OGN, TGFBI, COL6A3, COL6A2, COL6A1, SPON2, COL11A1, FN1, SPP1, COL18A1, COL4A1, FBN1, CCDC80, NID1, SPARC, COL5A1, PRELP, ADAMTS7, THSD4, VCAN, COL1A1	403	241	19235	5.941435087	1.16E-11	3.87E-12	4.06E-11
GOTERM_CC_FAT	GO:0005578~proteinaceous extracellular matrix	34	8.292682927	1.26E-12	WNT5B, LTBP2, TNC, SPOCK1, POSTN, MMP3, MMP2, MMP1, LAMB4, SMOC2, OGN, TGFBI, COL6A3, COL6A2, COL6A1, LOX, SPON2, TFPI2, COL11A1, FN1, COL18A1, COL4A1, LGALS3, FBN1, CCDC80, MGP, NID1, SPARC, COL5A1, PRELP, ADAMTS7, THSD4, VCAN, COL1A1	307	320	12782	4.423737785	3.48E-10	6.97E-11	1.67E-09
GOTERM_CC_FAT	GO:0044420~extracellular matrix part	16	3.902439024	1.21E-07	COL18A1, COL4A1, TNC, FBN1, CCDC80, NID1, SPARC, COL5A1, SMOC2, LAMB4, COL6A3, COL6A1, LOX, COL1A1, COL11A1, FN1	307	117	12782	5.693699713	3.34E-05	5.56E-06	1.60E-04
GOTERM_CC_FAT	GO:0005604~basement membrane	11	2.682926829	1.59E-05	COL18A1, SMOC2, LAMB4, COL4A1, TNC, FBN1, CCDC80, NID1, SPARC, COL5A1, FN1	307	78	12782	5.871627829	0.004382322	5.49E-04	0.020991649
												
Annotation Cluster 3	Enrichment Score: 8.477392514797849											
Category	Term	Count	%	PValue	Genes	List Total	Pop Hits	Pop Total	Fold Enrichment	Bonferroni	Benjamini	FDR
GOTERM_BP_FAT	GO:0001568~blood vessel development	27	6.585365854	1.05E-10	TNFRSF12A, PGF, FOXO1, ANPEP, MMP2, TGFB2, CD44, HMOX1, IL1B, LOX, THBS1, KLF5, COL18A1, FLT1, IL8, EPAS1, MYO1E, TGFBR2, APOLD1, ITGA4, ARHGAP24, COL5A1, CDH13, FOXC2, FOXC1, COL1A1, PLAU	316	245	13528	4.717850685	2.10E-07	1.05E-07	1.80E-07
GOTERM_BP_FAT	GO:0001944~vasculature development	27	6.585365854	1.79E-10	TNFRSF12A, PGF, FOXO1, ANPEP, MMP2, TGFB2, CD44, HMOX1, IL1B, LOX, THBS1, KLF5, COL18A1, FLT1, IL8, EPAS1, MYO1E, TGFBR2, APOLD1, ITGA4, ARHGAP24, COL5A1, CDH13, FOXC2, FOXC1, COL1A1, PLAU	316	251	13528	4.605073377	3.58E-07	1.19E-07	3.07E-07
GOTERM_BP_FAT	GO:0001525~angiogenesis	18	4.390243902	6.10E-08	KLF5, COL18A1, FLT1, IL8, EPAS1, TNFRSF12A, PGF, TGFBR2, APOLD1, ANPEP, ARHGAP24, TGFB2, CDH13, HMOX1, IL1B, FOXC2, THBS1, PLAU	316	148	13528	5.206637017	1.22E-04	8.70E-06	1.05E-04
GOTERM_BP_FAT	GO:0048514~blood vessel morphogenesis	21	5.12195122	1.08E-07	KLF5, COL18A1, FLT1, EPAS1, IL8, TNFRSF12A, PGF, MYO1E, TGFBR2, APOLD1, ANPEP, ITGA4, ARHGAP24, TGFB2, CDH13, HMOX1, IL1B, FOXC2, FOXC1, THBS1, PLAU	316	211	13528	4.260723499	2.15E-04	1.27E-05	1.85E-04

My output shouls be

Annotation cluster1 ABC,DEF,XYZ,MNO

Can I do this in awk?

Thanks
# 2  
Old 08-20-2013
so where are these "ABC,DEF,XYZ,MNO" come from?

Previous Thread | Next Thread
Test Your Knowledge in Computers #227
Difficulty: Easy
According to NetMarketShare, in September 2019 BSD had a 0.01% global market share of the desktop computing market.
True or False?

9 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

Print lines based upon unique values in Nth field

For some reason I am having difficulty performing what should be a fairly easy task. I would like to print lines of a file that have a unique value in the first field. For example, I have a large data-set with the following excerpt: PS003,001 MZMWR/ L-DWD// * PS003,001... (4 Replies)
Discussion started by: jvoot
4 Replies

2. Linux

To get all the columns in a CSV file based on unique values of particular column

cat sample.csv ID,Name,no 1,AAA,1 2,BBB,1 3,AAA,1 4,BBB,1 cut -d',' -f2 sample.csv | sort | uniq this gives only the 2nd column values Name AAA BBB How to I get all the columns of CSV along with this? (1 Reply)
Discussion started by: sanvel
1 Replies

3. Shell Programming and Scripting

Unique entries based on a range of numbers.

Hi, I have a matrix like this: Algorithm predicted_gene start_point end_point A x 65 85 B x 70 80 C x 75 85 D x 10 20 B y 125 130 C y 120 140 D y 200 210 Here there are four tab-separated columns. The first column is the used algorithm for prediction, and there are 4 of them A-D.... (8 Replies)
Discussion started by: flyfisherman
8 Replies

4. UNIX for Dummies Questions & Answers

Sorting and saving values based on unique entries

Hi all, I wanted to save the values of a file that contains unique entries based on a specific column (column 4). my sample file looks like the following: input file: 200006-07file.txt 145 35 10 3 147 35 12 4 146 36 11 3 145 34 12 5 143 31 15 4 146 30 14 5 desired output files:... (5 Replies)
Discussion started by: ida1215
5 Replies

5. Shell Programming and Scripting

Find unique lines based off of bytes

Hello All, I have two VERY large .csv files that I want to compare values based on substrings. If the lines are unique, then print the line. For example, if I run a diff file1.csv and file2.csv I get results similar to +_id34,brown,car,2006 +_id1,blue,train,1985... (5 Replies)
Discussion started by: jl487
5 Replies

6. Shell Programming and Scripting

Calculate difference in timestamps based on unique column value

Hi Friends, Require a quick help to write the difference between 2 timestamps based on a unique column value: Input file: 08/23/2012 12:36:09,JOB_5340,08/23/2012 12:36:14,JOB_5340 08/23/2012 12:36:22,JOB_5350,08/23/2012 12:36:26,JOB_5350 08/23/2012 13:08:51,JOB_5360,08/23/2012... (4 Replies)
Discussion started by: asnandhakumar
4 Replies

7. Shell Programming and Scripting

Converting a list to a row to create clusters based on numerical identity

Hello. I have a long list of data which has the following structure The number shows the unique identity of the word. And all homophones are clustered with the same number ID. An example will make this clear The awk script I have allows conversion of a list to row but on condition that each... (4 Replies)
Discussion started by: gimley
4 Replies

8. Shell Programming and Scripting

count the unique records based on certain columns

Hi everyone, I have a file result.txt with records as following and another file mirna.txt with a list of miRNAs e.g. miR22, miR123, miR13 etc. Gene Transcript miRNA Gar Nm_111233 miR22 Gar Nm_123440 miR22 Gar Nm_129939 miR22 Hel Nm_233900 miR13 Hel ... (6 Replies)
Discussion started by: miclow
6 Replies

9. Shell Programming and Scripting

awk : extracting unique lines based on columns

Hi, snp.txt CHR_A SNP_A BP_A_st BP_A_End CHR_B BP_B SNP_B R2 p-SNP_A p-SNP_B 5 rs1988728 74904317 74904318 5 74960646 rs1427924 0.377333 0.000740085 0.013930081 5 ... (12 Replies)
Discussion started by: genehunter
12 Replies

Featured Tech Videos