Splitting columns

07-30-2012

Hi Friends,

My input file has more than 20 columns:

Code:

UniProtKB	A0A183	LCE6A		GO:0031424	GO_REF:0000037	IEA	UniProtKB-KW:KW-0417	P	Late cornified envelope protein 6A	LCE6A_HUMAN|C1orf44|LCE6A	protein	taxon:9606	20120303	UniProtKB		
UniProtKB	A0A5B9	TRBC2		GO:0004872	GO_REF:0000037	IEA	UniProtKB-KW:KW-0675	F	T-cell receptor beta-2 chain C region	TRBC2_HUMAN|TRBC2|TCRBC2	protein	taxon:9606	20120303	UniProtKB		
UniProtKB	A0A5B9	TRBC2		GO:0016020	GO_REF:0000039	IEA	UniProtKB-SubCell:SL-0162	C	T-cell receptor beta-2 chain C region	TRBC2_HUMAN|TRBC2|TCRBC2	protein	taxon:9606	20120303	UniProtKB		
UniProtKB	A0A5B9	TRBC2		GO:0016021	GO_REF:0000037	IEA	UniProtKB-KW:KW-0812	C	T-cell receptor beta-2 chain C region	TRBC2_HUMAN|TRBC2|TCRBC2	protein	taxon:9606	20120303	UniProtKB		
UniProtKB	A0AUX0	AMPD3		GO:0003876	GO_REF:0000019	IEA	Ensembl:ENSRNOP00000024933	F	cDNA FLJ76195, highly similar to Homo sapiens adenosine monophosphate deaminase (isoform E) (AMPD3),mRNA	A0AUX0_HUMAN|AMPD3|hCG_22452	protein	taxon:9606	20120303	ENSEMBL		
UniProtKB	A0AUX0	AMPD3		GO:0006188	GO_REF:0000002	IEA	InterPro:IPR006329	P	cDNA FLJ76195, highly similar to Homo sapiens adenosine monophosphate deaminase (isoform E) (AMPD3),mRNA	A0AUX0_HUMAN|AMPD3|hCG_22452	proteintaxon:9606	20120303	InterPro		
UniProtKB	A0AV02	SLC12A8		GO:0006811	GO_REF:0000037	IEA	UniProtKB-KW:KW-0406	P	Solute carrier family 12 member 8	S12A8_HUMAN|SLC12A8|CCC9	protein	taxon:9606	20120303	UniProtKB		
UniProtKB	A0AV02	SLC12A8		GO:0006813	GO_REF:0000037	IEA	UniProtKB-KW:KW-0633	P	Solute carrier family 12 member 8	S12A8_HUMAN|SLC12A8|CCC9	protein	taxon:9606	20120303	UniProtKB		
UniProtKB	A0AV02	SLC12A8		GO:0015293	GO_REF:0000037	IEA	UniProtKB-KW:KW-0769	F	Solute carrier family 12 member 8	S12A8_HUMAN|SLC12A8|CCC9	protein	taxon:9606	20120303	UniProtKB		
UniProtKB	A0AV02	SLC12A8		GO:0016020	GO_REF:0000039	IEA	UniProtKB-SubCell:SL-0162	C	Solute carrier family 12 member 8	S12A8_HUMAN|SLC12A8|CCC9	protein	taxon:9606	20120303	UniProtKB

My expected output is this

Code:

LCE6A GO:0031424 Late cornified envelope protein 6A
TRBC2 GO:0004872 T-cell receptor beta-2 chain C region
..........
........
.............
...........
SLC12A8 GO:0016020 Solute carrier family 12 member 8

Basically, what I am trying to do is grab $3"\t"$4"\t"$9.

Please note the following.

My input file is tab-separated between columns, and the 9th column contains spaces.
My expected output should be formatted the same way.
The dots in my expected output mean that the records in between need to be printed. I didn't want to type everything. Hopefully, you understand what I mean.
IEA ($6) is a column of its own; it is tab-separated from the other columns, not space-separated.
A hint might be: the 9th column can be extracted as anything after

Code:

P or F or C in the 8th column

and eliminating the content before

Code:

_HUMAN

in $10.

jacobs.smith

07-30-2012

Code:

nawk -F'\t' '{print $3,$4,$9}' OFS='\t' myFile

vgersh99

07-30-2012

Quote:

Originally Posted by vgersh99

Code:

nawk -F'\t' '{print $3,$4,$9}' OFS='\t' myFile

Thanks for your time.

When I run your code, the error I see is

Code:

-bash: nawk: command not found

I tried this

Code:

awk -F'\t' '{print $3,$4,$9}' OFS='\t' test | head

The output is

Code:

LCE6A		P
TRBC2		F
TRBC2		C
TRBC2		C
AMPD3		F
AMPD3		P
SLC12A8		P
SLC12A8		P
SLC12A8		F
SLC12A8		C

Does this mean that my input is space separated? I am confused now, because I printed all the columns to be tab separated.

jacobs.smith

07-30-2012

Maybe. Post the output of head myFile | cat -vet
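
For reference, a minimal sketch of what cat -vet reveals, using a made-up line rather than the real file (the variable name sample is only for illustration): each tab is shown as ^I and each line end as $, so an empty field shows up as two ^I in a row.

```shell
# Hypothetical sample line: "a", an empty field, then "b", separated by tabs.
sample=$(printf 'a\t\tb\n' | cat -vet)

# Show what cat -vet printed for that line.
printf '%s\n' "$sample"
```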

vgersh99

07-30-2012

Quote:

Originally Posted by vgersh99

Maybe. Post the output of head myFile | cat -vet

Please find the output below. I just ran

Code:

cat input | cat -vet

Output:

Code:

UniProtKB^IA0A183^ILCE6A^I^IGO:0031424^IGO_REF:0000037^IIEA^IUniProtKB-KW:KW-0417^IP^ILate cornified envelope protein 6A^ILCE6A_HUMAN|C1orf44|LCE6A^Iprotein^Itaxon:9606^I20120303^IUniProtKB^I^I$
UniProtKB^IA0A5B9^ITRBC2^I^IGO:0004872^IGO_REF:0000037^IIEA^IUniProtKB-KW:KW-0675^IF^IT-cell receptor beta-2 chain C region^ITRBC2_HUMAN|TRBC2|TCRBC2^Iprotein^Itaxon:9606^I20120303^IUniProtKB^I^I$
UniProtKB^IA0A5B9^ITRBC2^I^IGO:0016020^IGO_REF:0000039^IIEA^IUniProtKB-SubCell:SL-0162^IC^IT-cell receptor beta-2 chain C region^ITRBC2_HUMAN|TRBC2|TCRBC2^Iprotein^Itaxon:9606^I20120303^IUniProtKB^I^I$
UniProtKB^IA0A5B9^ITRBC2^I^IGO:0016021^IGO_REF:0000037^IIEA^IUniProtKB-KW:KW-0812^IC^IT-cell receptor beta-2 chain C region^ITRBC2_HUMAN|TRBC2|TCRBC2^Iprotein^Itaxon:9606^I20120303^IUniProtKB^I^I$
UniProtKB^IA0AUX0^IAMPD3^I^IGO:0003876^IGO_REF:0000019^IIEA^IEnsembl:ENSRNOP00000024933^IF^IcDNA FLJ76195, highly similar to Homo sapiens adenosine monophosphate deaminase (isoform E) (AMPD3),mRNA^IA0AUX0_HUMAN|AMPD3|hCG_22452^Iprotein^Itaxon:9606^I20120303^IENSEMBL^I^I$
UniProtKB^IA0AUX0^IAMPD3^I^IGO:0006188^IGO_REF:0000002^IIEA^IInterPro:IPR006329^IP^IcDNA FLJ76195, highly similar to Homo sapiens adenosine monophosphate deaminase (isoform E) (AMPD3),mRNA^IA0AUX0_HUMAN|AMPD3|hCG_22452^Iproteintaxon:9606^I20120303^IInterPro^I^I$
UniProtKB^IA0AV02^ISLC12A8^I^IGO:0006811^IGO_REF:0000037^IIEA^IUniProtKB-KW:KW-0406^IP^ISolute carrier family 12 member 8^IS12A8_HUMAN|SLC12A8|CCC9^Iprotein^Itaxon:9606^I20120303^IUniProtKB^I^I$
UniProtKB^IA0AV02^ISLC12A8^I^IGO:0006813^IGO_REF:0000037^IIEA^IUniProtKB-KW:KW-0633^IP^ISolute carrier family 12 member 8^IS12A8_HUMAN|SLC12A8|CCC9^Iprotein^Itaxon:9606^I20120303^IUniProtKB^I^I$
UniProtKB^IA0AV02^ISLC12A8^I^IGO:0015293^IGO_REF:0000037^IIEA^IUniProtKB-KW:KW-0769^IF^ISolute carrier family 12 member 8^IS12A8_HUMAN|SLC12A8|CCC9^Iprotein^Itaxon:9606^I20120303^IUniProtKB^I^I$
UniProtKB^IA0AV02^ISLC12A8^I^IGO:0016020^IGO_REF:0000039^IIEA^IUniProtKB-SubCell:SL-0162^IC^ISolute carrier family 12 member 8^IS12A8_HUMAN|SLC12A8|CCC9^Iprotein^Itaxon:9606^I20120303^IUniProtKB$

I don't know how this would help you.

Could you please explain how this helps?

Learning every day from Unix.

jacobs.smith

07-30-2012

The '^I' sequences in the output are your tabs.
The problem is with your 4th field: it is empty, so two tabs appear in a row. You can see the two ^I's in the 'cat -vet' output above.
Try this instead, which treats a run of consecutive tabs as a single separator: awk -F'\t\t*' '{print $3,$4,$9}' OFS='\t' myFile
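
A minimal sketch of the difference, using a shortened, hypothetical record (not the real file) that has the same empty fourth field. The names rec, single and collapsed are only for illustration. With a single tab as the separator the empty field still counts; with a separator of one or more tabs it is collapsed away and the later fields shift left.

```shell
# Hypothetical shortened record: 10 tab-separated fields, field 4 empty.
rec=$(printf 'DB\tID\tLCE6A\t\tGO:0031424\tREF\tIEA\tKW\tP\tLate cornified envelope protein 6A')

# Single-tab separator: the empty field keeps its position,
# so $4 is empty and $9 is the one-letter aspect code.
single=$(printf '%s\n' "$rec" | awk -F'\t' -v OFS='\t' '{print $3, $4, $9}')

# One-or-more-tabs separator: the run of two tabs counts once,
# so $4 becomes the GO ID and $9 the description.
collapsed=$(printf '%s\n' "$rec" | awk -F'\t\t*' -v OFS='\t' '{print $3, $4, $9}')

printf '%s\n%s\n' "$single" "$collapsed"
```

Note that the collapsing form would also swallow any other empty field in the middle of a record, so it is only safe if the empty fields are in the same positions on every line.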

vgersh99

07-30-2012

The output from cat -vet shows us that field 4 is always empty, the last two fields are also empty on every line except the last (which has no trailing tabs), and the only spaces in each line appear in field 10. (The cat -t option prints <tab> characters in the input as the two-character sequence "^I".)
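
One way to see the field positions directly is to print every field next to its index. This is only a sketch on a shortened, hypothetical record with the same layout for the first 10 fields; rec and listing are illustrative names.

```shell
# Hypothetical shortened record: 10 tab-separated fields, field 4 empty.
rec=$(printf 'DB\tID\tLCE6A\t\tGO:0031424\tREF\tIEA\tKW\tP\tLate cornified envelope protein 6A')

# Print each field with its index; the brackets make empty fields visible.
listing=$(printf '%s\n' "$rec" |
  awk -F'\t' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }')

printf '%s\n' "$listing"
```

In the listing, index 4 shows up with empty brackets, and the GO ID and the description sit at indexes 5 and 10.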

Since field 4 is always empty, it looks like you didn't realize it was there. It looks like you want to print fields 3, 5, and 10 instead of fields 3, 4, and 9.
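
A minimal sketch of that, keeping a single tab as the separator and simply adjusting the field numbers. It is shown on a shortened, hypothetical record rather than the real file; rec and out are illustrative names.

```shell
# Hypothetical shortened record: 10 tab-separated fields, field 4 empty.
rec=$(printf 'DB\tID\tLCE6A\t\tGO:0031424\tREF\tIEA\tKW\tP\tLate cornified envelope protein 6A')

# Keep tab as both input and output separator and pick fields 3, 5 and 10.
out=$(printf '%s\n' "$rec" | awk -F'\t' -v OFS='\t' '{print $3, $5, $10}')

printf '%s\n' "$out"
```

On the real data the same awk program would be given the file name (myFile in this thread) instead of reading from the pipe.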

Don Cragun
