Split certain strings in a line for a specific column.

08-25-2014

Hi,

I need help extracting certain strings/words from lines of different lengths. I have 3 columns separated by a tab delimiter, like below:

Code:

Probable arabinan endo-1,5-alpha-L-arabinosidase A	(EC 3.2.1.99) (Endo-1,5-alpha-L-arabinanase A) (ABN A) abnA	Ady3G14620
Probable arabinan endo-1,5-alpha-L-arabinosidase B	(EC 3.2.1.99) (Endo-1,5-alpha-L-arabinanase B) (ABN B) abnB	Ady2G14150
Probable arabinan endo-1,5-alpha-L-arabinosidase C	(EC 3.2.1.99) (Endo-1,5-alpha-L-arabinanase C) (ABN C) abnC	Ady6G00770
Isocitrate lyase (ICL) (Isocitrase) (Isocitratase)	(EC 4.1.3.1) icl1 icl	Ady4G13510
Putative aconitate hydratase, mitochondrial (Aconitase 2)	(EC 4.2.1.-) acoB	Ady1g06810
Putative aconitate hydratase (Aconitase 3)	(EC 4.2.1.-) acoC	Ady8g07140
Aconitate hydratase, mitochondrial (Aconitase)	(EC 4.2.1.3) (Citrate hydro-lyase) (Homocitrate dehydratase)	Ady6g12930
Adenine deaminase (ADE)	(EC 3.5.4.2) (Adenine aminohydrolase) (AAH) aah1	Ady2G09150
Disintegrin and metalloproteinase domain-containing protein B (ADAM B)	(EC 3.4.24.-) ADM-B	Ady4G11150
Probable alpha-galactosidase D	(EC 3.2.1.22) (Melibiase D) aglD	Ady4G03585
Arginine biosynthesis bifunctional protein ArgJ, mitochondrial [Cleaved into: Arginine biosynthesis bifunctional protein ArgJ alpha chain; Arginine biosynthesis bifunctional protein ArgJ beta chain] [Includes: Glutamate N-acetyltransferase (GAT)	(EC 2.3.1.35) (Ornithine acetyltransferase) (OATase) (Ornithine transacetylase); Amino-acid acetyltransferase	Ady5G08120

I want to split $2 so that only the "EC x.x.x.x" part is kept and the rest of the words in $2 are ignored, then print $1, $2 (EC x.x.x.x only) and $3. I also want to remove the brackets around it. The output should look like below:

Code:

Probable arabinan endo-1,5-alpha-L-arabinosidase A	EC 3.2.1.99	Ady3G14620
Probable arabinan endo-1,5-alpha-L-arabinosidase B	EC 3.2.1.99	Ady2G14150
Probable arabinan endo-1,5-alpha-L-arabinosidase C	EC 3.2.1.99	Ady6G00770
Isocitrate lyase (ICL) (Isocitrase) (Isocitratase)	EC 4.1.3.1	Ady4G13510
Putative aconitate hydratase, mitochondrial (Aconitase 2)	EC 4.2.1.-	Ady1g06810
Putative aconitate hydratase (Aconitase 3)	EC 4.2.1.-	Ady8g07140
Aconitate hydratase, mitochondrial (Aconitase)	EC 4.2.1.3	Ady6g12930
Adenine deaminase (ADE)	EC 3.5.4.2	Ady2G09150
Disintegrin and metalloproteinase domain-containing protein B (ADAM B)	EC 3.4.24.-	Ady4G11150
Probable alpha-galactosidase D	EC 3.2.1.22	Ady4G03585
Arginine biosynthesis bifunctional protein ArgJ, mitochondrial [Cleaved into: Arginine biosynthesis bifunctional protein ArgJ alpha chain; Arginine biosynthesis bifunctional protein ArgJ beta chain] [Includes: Glutamate N-acetyltransferase (GAT)	EC 2.3.1.35	Ady5G08120

I tried the following code, but I still could not remove the words following the "EC x.x.x.x" in $2, and the sed script removes all brackets; I only need to remove the brackets around EC x.x.x.x. I am sure it should not be that complicated, but I just couldn't figure it out.

Code:

awk -F. '{print $1"."$2"."$3"."$4,$4}' inputfile | sed 's/(\|)//g'

Any help would be appreciated.

redse171

08-26-2014

"awk | sed" is - whatever the arguments to the commands might be - eo ipso wrong.

First, read the lines and split them into fields. You say they are separated by tabs. In the following, "<t>" means a literal tab and "<b>" a blank character.

Code:

field1=""
field2=""
field3=""

while IFS='<t>' read field1 field2 field3 ; do
     print - "field1: ${field1}"
     print - "field2: ${field2}"
     print - "field3: ${field3}"
     print - "----------"
done < /path/to/your/data

Check the output first! You may have to adjust your definitions. Once you are satisfied, add the next step: extract the content from field2. We use shell variable expansion for this; see the ksh man page for details.

Code:

field1=""
field2=""
field3=""

while IFS='<t>' read field1 field2 field3 ; do
     print - "field1: ${field1}"

     field2="${field2#?}"         # strip the first character, "("
     field2="${field2%%\)*}"      # strip the first ")" and everything after it

     print - "field2: ${field2}"
     print - "field3: ${field3}"
     print - "----------"
done < /path/to/your/data

Test again. If you are still satisfied, "print" the final version:

Code:

field1=""
field2=""
field3=""

while IFS='<t>' read field1 field2 field3 ; do
     field2="${field2#?}"         # strip the first character, "("
     field2="${field2%%\)*}"      # strip the first ")" and everything after it

     print - "${field1}\t${field2}\t${field3}"
done < /path/to/your/data > /path/to/your/output
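For anyone without ksh at hand, the loop above can be sketched in a more portable form. This is only a sketch under a few assumptions: the function name extract_ec is made up for illustration, printf stands in for ksh's print, and the "<t>" placeholder is replaced by a tab character generated with printf. It should run in ksh, bash, or any POSIX sh.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): reads tab-separated lines on
# stdin and keeps only the first parenthesised text of column 2.
extract_ec() {
    tab=$(printf '\t')                  # a literal tab, replacing the "<t>" placeholder
    while IFS="$tab" read -r field1 field2 field3; do
        field2="${field2#?}"            # strip the first character, "("
        field2="${field2%%\)*}"         # strip the first ")" and everything after it
        printf '%s\t%s\t%s\n' "$field1" "$field2" "$field3"
    done
}

# Example with one line of the sample data:
printf 'Adenine deaminase (ADE)\t(EC 3.5.4.2) (Adenine aminohydrolase) (AAH) aah1\tAdy2G09150\n' |
    extract_ec
```

One caveat: a tab counts as IFS whitespace, so consecutive tabs are treated as a single separator. An empty column would therefore shift the remaining fields; the sample data has no empty columns, so it is not affected.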

I hope this helps.

bakunin

08-26-2014

With awk you can use split() to further select the subfields that you need:

Code:

awk '{split($2,F,/[()]/); print $1, F[2], $3}' FS='\t' OFS='\t' file

The regular expression /[()]/ tells split() to treat either parenthesis as a separator when splitting the second field into the array "F". The second array element then contains the first text in parentheses.
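The split() approach relies on the EC number being the first parenthesised text in column 2, which holds for the sample data. If that were not guaranteed (an assumption that goes beyond what was asked), a match()-based variant could look for the "(EC ...)" group explicitly. A rough sketch, with a made-up function name ec_only and a made-up input line:

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): keeps only the first "(EC ...)"
# group of column 2, wherever it appears inside that column.
ec_only() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        # match() sets RSTART and RLENGTH to the position and length of the match.
        if (match($2, /\(EC [^)]*\)/))
            $2 = substr($2, RSTART + 1, RLENGTH - 2)   # drop the surrounding parentheses
        print
    }' "$@"
}

# Made-up line in which the EC group is not the first parenthesised text:
printf 'Some enzyme\t(Alias) (EC 1.2.3.4) gene1\tId0001\n' | ec_only
```

Lines whose second column contains no "(EC ...)" group are printed unchanged, which makes them easy to spot in the output.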

Scrutinizer

08-26-2014

Hi Scrutinizer,

It worked awesome!!! Thanks so much for your explanation. I might sound stupid, but I don't understand why the second array element (F[2]) is the first text in parentheses. What is the first element, then? Is it the parenthesis itself? Thanks.

---------- Post updated at 10:44 AM ---------- Previous update was at 10:41 AM ----------

Hi bakunin,

Thanks so much for your great response. I normally use awk and sed in my work and I am still learning, so your code is quite new to me. I will look into it, work on it, and give feedback as soon as I can. This is great, as I get a chance to learn new stuff.

redse171

08-26-2014

Good to hear... The first element of the array contains the empty string before the first opening parenthesis.

If we take the second field of the first line as an example:

Code:

(EC 3.2.1.99) (Endo-1,5-alpha-L-arabinanase A) (ABN A) abnA

Code:

F[1] contains ""
F[2] contains "EC 3.2.1.99"
F[3] contains " "
F[4] contains "Endo-1,5-alpha-L-arabinanase A"
F[5] contains " "
F[6] contains "ABN A"
F[7] contains " abnA"
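To check this yourself, you can let awk print every element that split() produces. A small sketch (the function name show_parts is made up for illustration); for the example field above it should print the same seven elements:

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): splits each input line on
# parentheses and lists every resulting array element.
show_parts() {
    awk '{
        n = split($0, F, /[()]/)          # split() returns the number of elements
        for (i = 1; i <= n; i++)
            printf "F[%d] contains \"%s\"\n", i, F[i]
    }'
}

printf '%s\n' '(EC 3.2.1.99) (Endo-1,5-alpha-L-arabinanase A) (ABN A) abnA' | show_parts
```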

Scrutinizer

08-26-2014

Cool... thanks a bunch... now I get it.

redse171
