Splitting columns


 
# 1  
Old 07-30-2012

Hi Friends,

My input file has more than 20 columns

Code:
UniProtKB	A0A183	LCE6A		GO:0031424	GO_REF:0000037	IEA	UniProtKB-KW:KW-0417	P	Late cornified envelope protein 6A	LCE6A_HUMAN|C1orf44|LCE6A	protein	taxon:9606	20120303	UniProtKB		
UniProtKB	A0A5B9	TRBC2		GO:0004872	GO_REF:0000037	IEA	UniProtKB-KW:KW-0675	F	T-cell receptor beta-2 chain C region	TRBC2_HUMAN|TRBC2|TCRBC2	protein	taxon:9606	20120303	UniProtKB		
UniProtKB	A0A5B9	TRBC2		GO:0016020	GO_REF:0000039	IEA	UniProtKB-SubCell:SL-0162	C	T-cell receptor beta-2 chain C region	TRBC2_HUMAN|TRBC2|TCRBC2	protein	taxon:9606	20120303	UniProtKB		
UniProtKB	A0A5B9	TRBC2		GO:0016021	GO_REF:0000037	IEA	UniProtKB-KW:KW-0812	C	T-cell receptor beta-2 chain C region	TRBC2_HUMAN|TRBC2|TCRBC2	protein	taxon:9606	20120303	UniProtKB		
UniProtKB	A0AUX0	AMPD3		GO:0003876	GO_REF:0000019	IEA	Ensembl:ENSRNOP00000024933	F	cDNA FLJ76195, highly similar to Homo sapiens adenosine monophosphate deaminase (isoform E) (AMPD3),mRNA	A0AUX0_HUMAN|AMPD3|hCG_22452	protein	taxon:9606	20120303	ENSEMBL		
UniProtKB	A0AUX0	AMPD3		GO:0006188	GO_REF:0000002	IEA	InterPro:IPR006329	P	cDNA FLJ76195, highly similar to Homo sapiens adenosine monophosphate deaminase (isoform E) (AMPD3),mRNA	A0AUX0_HUMAN|AMPD3|hCG_22452	proteintaxon:9606	20120303	InterPro		
UniProtKB	A0AV02	SLC12A8		GO:0006811	GO_REF:0000037	IEA	UniProtKB-KW:KW-0406	P	Solute carrier family 12 member 8	S12A8_HUMAN|SLC12A8|CCC9	protein	taxon:9606	20120303	UniProtKB		
UniProtKB	A0AV02	SLC12A8		GO:0006813	GO_REF:0000037	IEA	UniProtKB-KW:KW-0633	P	Solute carrier family 12 member 8	S12A8_HUMAN|SLC12A8|CCC9	protein	taxon:9606	20120303	UniProtKB		
UniProtKB	A0AV02	SLC12A8		GO:0015293	GO_REF:0000037	IEA	UniProtKB-KW:KW-0769	F	Solute carrier family 12 member 8	S12A8_HUMAN|SLC12A8|CCC9	protein	taxon:9606	20120303	UniProtKB		
UniProtKB	A0AV02	SLC12A8		GO:0016020	GO_REF:0000039	IEA	UniProtKB-SubCell:SL-0162	C	Solute carrier family 12 member 8	S12A8_HUMAN|SLC12A8|CCC9	protein	taxon:9606	20120303	UniProtKB


My expected output is this

Code:
LCE6A GO:0031424 Late cornified envelope protein 6A
TRBC2 GO:0004872 T-cell receptor beta-2 chain C region
..........
........
.............
...........
SLC12A8 GO:0016020 Solute carrier family 12 member 8

Basically, what I am trying to do is grab $3"\t"$4"\t"$9

Please note the following.

My input file is tab separated between columns, and space separated within the 9th column.
My expected output should be formatted the same way.
The dots in my expected output mean that the records in between need to be printed. I didn't want to type everything. Hopefully, you understand what I mean.
IEA ($6) is a column of its own; it is tab separated from the other columns, not space separated.
A hint might be that the 9th column can be extracted as anything after
Code:
P or F or C in the 8th column

and eliminating the content before
Code:
_HUMAN

in $10.
# 2  
Old 07-30-2012
Code:
nawk -F'\t' '{print $3,$4,$9}' OFS='\t' myFile

# 3  
Old 07-30-2012
Quote:
Originally Posted by vgersh99
Code:
nawk -F'\t' '{print $3,$4,$9}' OFS='\t' myFile

Thanks for your time.

When I run your code, the error I see is

Code:
-bash: nawk: command not found

I tried this


Code:
awk -F'\t' '{print $3,$4,$9}' OFS='\t' test | head

The output is

Code:
LCE6A		P
TRBC2		F
TRBC2		C
TRBC2		C
AMPD3		F
AMPD3		P
SLC12A8		P
SLC12A8		P
SLC12A8		F
SLC12A8		C

Does this mean that my input is space separated? I am confused now, because I wrote all the columns out as tab separated.
# 4  
Old 07-30-2012
Maybe. Post the output of head myFile | cat -vet
# 5  
Old 07-30-2012
Quote:
Originally Posted by vgersh99
maybe. post the output of head myFile | cat -vet
Please find the output below. I just ran

Code:
cat input | cat -vet

Output:

Code:
UniProtKB^IA0A183^ILCE6A^I^IGO:0031424^IGO_REF:0000037^IIEA^IUniProtKB-KW:KW-0417^IP^ILate cornified envelope protein 6A^ILCE6A_HUMAN|C1orf44|LCE6A^Iprotein^Itaxon:9606^I20120303^IUniProtKB^I^I$
UniProtKB^IA0A5B9^ITRBC2^I^IGO:0004872^IGO_REF:0000037^IIEA^IUniProtKB-KW:KW-0675^IF^IT-cell receptor beta-2 chain C region^ITRBC2_HUMAN|TRBC2|TCRBC2^Iprotein^Itaxon:9606^I20120303^IUniProtKB^I^I$
UniProtKB^IA0A5B9^ITRBC2^I^IGO:0016020^IGO_REF:0000039^IIEA^IUniProtKB-SubCell:SL-0162^IC^IT-cell receptor beta-2 chain C region^ITRBC2_HUMAN|TRBC2|TCRBC2^Iprotein^Itaxon:9606^I20120303^IUniProtKB^I^I$
UniProtKB^IA0A5B9^ITRBC2^I^IGO:0016021^IGO_REF:0000037^IIEA^IUniProtKB-KW:KW-0812^IC^IT-cell receptor beta-2 chain C region^ITRBC2_HUMAN|TRBC2|TCRBC2^Iprotein^Itaxon:9606^I20120303^IUniProtKB^I^I$
UniProtKB^IA0AUX0^IAMPD3^I^IGO:0003876^IGO_REF:0000019^IIEA^IEnsembl:ENSRNOP00000024933^IF^IcDNA FLJ76195, highly similar to Homo sapiens adenosine monophosphate deaminase (isoform E) (AMPD3),mRNA^IA0AUX0_HUMAN|AMPD3|hCG_22452^Iprotein^Itaxon:9606^I20120303^IENSEMBL^I^I$
UniProtKB^IA0AUX0^IAMPD3^I^IGO:0006188^IGO_REF:0000002^IIEA^IInterPro:IPR006329^IP^IcDNA FLJ76195, highly similar to Homo sapiens adenosine monophosphate deaminase (isoform E) (AMPD3),mRNA^IA0AUX0_HUMAN|AMPD3|hCG_22452^Iproteintaxon:9606^I20120303^IInterPro^I^I$
UniProtKB^IA0AV02^ISLC12A8^I^IGO:0006811^IGO_REF:0000037^IIEA^IUniProtKB-KW:KW-0406^IP^ISolute carrier family 12 member 8^IS12A8_HUMAN|SLC12A8|CCC9^Iprotein^Itaxon:9606^I20120303^IUniProtKB^I^I$
UniProtKB^IA0AV02^ISLC12A8^I^IGO:0006813^IGO_REF:0000037^IIEA^IUniProtKB-KW:KW-0633^IP^ISolute carrier family 12 member 8^IS12A8_HUMAN|SLC12A8|CCC9^Iprotein^Itaxon:9606^I20120303^IUniProtKB^I^I$
UniProtKB^IA0AV02^ISLC12A8^I^IGO:0015293^IGO_REF:0000037^IIEA^IUniProtKB-KW:KW-0769^IF^ISolute carrier family 12 member 8^IS12A8_HUMAN|SLC12A8|CCC9^Iprotein^Itaxon:9606^I20120303^IUniProtKB^I^I$
UniProtKB^IA0AV02^ISLC12A8^I^IGO:0016020^IGO_REF:0000039^IIEA^IUniProtKB-SubCell:SL-0162^IC^ISolute carrier family 12 member 8^IS12A8_HUMAN|SLC12A8|CCC9^Iprotein^Itaxon:9606^I20120303^IUniProtKB$


I don't know how this would help you.

Could you please explain how this helps?

Learning every day from Unix.
# 6  
Old 07-30-2012
The '^I' sequences in the output are your tabs.
The problem is with your 4th field: it is separated from the 3rd by TWO tabs. You can see two ^I's in a row in the 'cat -vet' output above.
Try this instead:
Code:
awk -F'\t\t*' '{print $3,$4,$9}' OFS='\t' myFile
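To see why treating a run of tabs as one separator restores the field numbers you expected, here is a minimal sketch. The variable names and the shortened sample record are made up for illustration; they are not your real data.

```shell
# Hypothetical, shortened record: an empty field follows "LCE6A",
# so two tabs sit between "LCE6A" and "GO:0031424".
line=$(printf 'UniProtKB\tA0A183\tLCE6A\t\tGO:0031424\tIEA')

# Single-tab separator: the empty field is counted, so $4 is empty.
single=$(printf '%s\n' "$line" | awk -F'\t' '{ print "[" $4 "]" }')

# One-or-more-tabs separator: the run of tabs collapses, so $4 is the GO id.
collapsed=$(printf '%s\n' "$line" | awk -F'\t\t*' '{ print "[" $4 "]" }')

printf '%s\n%s\n' "$single" "$collapsed"
```

One caveat with this approach: collapsing tab runs hides every empty field, so if some other column is empty on only a few records, the later fields on those records shift left by one.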
# 7  
Old 07-30-2012
The output from the
Code:
cat -vet

shows us that field 4 is always empty, the last two fields on every line are also empty, and the only spaces in each line appear in field 10. (The cat -t option prints <tab> characters in the input as the two character sequence "^I".)
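As a small illustration of that notation, the sketch below pipes a made-up line through the same command. It assumes your cat accepts the -v, -e and -t options (GNU and BSD cat both do); the variable name is hypothetical.

```shell
# Made-up line with an empty field in the middle (two tabs in a row).
shown=$(printf 'a\t\tb\n' | cat -vet)

# Each tab is displayed as ^I and the end of the line as $.
printf '%s\n' "$shown"
```

Two ^I sequences with nothing between them are the signature of an empty tab-separated field.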

Since field 4 is always empty, it looks like you didn't realize it was there. It looks like you want to print fields 3, 5, and 10 instead of fields 3, 4, and 9.
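A minimal sketch of that suggestion, keeping a single tab as the separator so the empty field is still counted. The function name extract_columns is made up for this example, and the sample record is the first line of the input posted above, written with \t escapes.

```shell
# Hypothetical helper: reads the named files, or standard input if none given.
extract_columns() {
    awk -F'\t' -v OFS='\t' '{ print $3, $5, $10 }' "$@"
}

# First record from the thread; %b below turns each \t into a real tab.
record='UniProtKB\tA0A183\tLCE6A\t\tGO:0031424\tGO_REF:0000037\tIEA\tUniProtKB-KW:KW-0417\tP\tLate cornified envelope protein 6A\tLCE6A_HUMAN|C1orf44|LCE6A\tprotein\ttaxon:9606\t20120303\tUniProtKB\t\t'

result=$(printf '%b\n' "$record" | extract_columns)
printf '%s\n' "$result"
```

On the real file this would be called as extract_columns myFile. Because the separator is exactly one tab, the field numbers stay stable even on records where some other column happens to be empty.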