Split certain strings in a line for a specific column.


 
# 1  
Old 08-25-2014

Hi,

I need help extracting certain strings from lines of varying length. My file has 3 columns separated by tabs, like this:

Code:
Probable arabinan endo-1,5-alpha-L-arabinosidase A	(EC 3.2.1.99) (Endo-1,5-alpha-L-arabinanase A) (ABN A) abnA	Ady3G14620
Probable arabinan endo-1,5-alpha-L-arabinosidase B	(EC 3.2.1.99) (Endo-1,5-alpha-L-arabinanase B) (ABN B) abnB	Ady2G14150
Probable arabinan endo-1,5-alpha-L-arabinosidase C	(EC 3.2.1.99) (Endo-1,5-alpha-L-arabinanase C) (ABN C) abnC	Ady6G00770
Isocitrate lyase (ICL) (Isocitrase) (Isocitratase)	(EC 4.1.3.1) icl1 icl	Ady4G13510
Putative aconitate hydratase, mitochondrial (Aconitase 2)	(EC 4.2.1.-) acoB	Ady1g06810
Putative aconitate hydratase (Aconitase 3)	(EC 4.2.1.-) acoC	Ady8g07140
Aconitate hydratase, mitochondrial (Aconitase)	(EC 4.2.1.3) (Citrate hydro-lyase) (Homocitrate dehydratase)	Ady6g12930
Adenine deaminase (ADE)	(EC 3.5.4.2) (Adenine aminohydrolase) (AAH) aah1	Ady2G09150
Disintegrin and metalloproteinase domain-containing protein B (ADAM B)	(EC 3.4.24.-) ADM-B	Ady4G11150
Probable alpha-galactosidase D	(EC 3.2.1.22) (Melibiase D) aglD	Ady4G03585
Arginine biosynthesis bifunctional protein ArgJ, mitochondrial [Cleaved into: Arginine biosynthesis bifunctional protein ArgJ alpha chain; Arginine biosynthesis bifunctional protein ArgJ beta chain] [Includes: Glutamate N-acetyltransferase (GAT)	(EC 2.3.1.35) (Ornithine acetyltransferase) (OATase) (Ornithine transacetylase); Amino-acid acetyltransferase	Ady5G08120

I want to keep only the "EC x.x.x.x" part of $2, drop the rest of $2, and remove the parentheses around it, then print $1, the trimmed $2 and $3. The output should look like this:

Code:
Probable arabinan endo-1,5-alpha-L-arabinosidase A	EC 3.2.1.99	Ady3G14620
Probable arabinan endo-1,5-alpha-L-arabinosidase B	EC 3.2.1.99	Ady2G14150
Probable arabinan endo-1,5-alpha-L-arabinosidase C	EC 3.2.1.99	Ady6G00770
Isocitrate lyase (ICL) (Isocitrase) (Isocitratase)	EC 4.1.3.1	Ady4G13510
Putative aconitate hydratase, mitochondrial (Aconitase 2)	EC 4.2.1.-	Ady1g06810
Putative aconitate hydratase (Aconitase 3)	EC 4.2.1.-	Ady8g07140
Aconitate hydratase, mitochondrial (Aconitase)	EC 4.2.1.3	Ady6g12930
Adenine deaminase (ADE)	EC 3.5.4.2	Ady2G09150
Disintegrin and metalloproteinase domain-containing protein B (ADAM B)	EC 3.4.24.-	Ady4G11150
Probable alpha-galactosidase D	EC 3.2.1.22	Ady4G03585
Arginine biosynthesis bifunctional protein ArgJ, mitochondrial [Cleaved into: Arginine biosynthesis bifunctional protein ArgJ alpha chain; Arginine biosynthesis bifunctional protein ArgJ beta chain] [Includes: Glutamate N-acetyltransferase (GAT)	EC 2.3.1.35	Ady5G08120

I tried the following, but I could not remove the words that follow "EC x.x.x.x" in $2, and the sed script removes all parentheses, whereas I only need to remove the ones around "EC x.x.x.x". I am sure it should not be that complicated, but I just couldn't figure it out.

Code:
awk -F. '{print $1"."$2"."$3"."$4,$4}' inputfile | sed 's/(\|)//g'

Any help would be appreciated.
# 2  
Old 08-26-2014
Piping awk into sed is wrong in itself, whatever the arguments to the commands might be: anything sed does in such a pipeline, awk can do on its own.
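
As a rough illustration of that point (a sketch, not part of the original reply), a single awk process can trim the second field without any sed stage. The sample line is taken from the question; the variable names `line` and `result` are only for illustration.

```shell
# Sketch: awk alone trims field 2, so no sed stage is needed.
# printf supplies the literal tabs between the three columns.
line=$(printf 'Probable alpha-galactosidase D\t(EC 3.2.1.22) (Melibiase D) aglD\tAdy4G03585')

result=$(printf '%s\n' "$line" |
  awk 'BEGIN { FS = OFS = "\t" }
       {
         sub(/^\(/, "", $2)     # drop the leading "("
         sub(/\).*$/, "", $2)   # drop the first ")" and everything after it
         print                  # fields are rejoined with tabs (OFS)
       }')

printf '%s\n' "$result"
```

This assumes, as in the sample data, that the "(EC ...)" group is always the first thing in the second column.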

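First, read the lines and split them into fields. You say they are separated by tabs, so IFS has to be set to a literal tab character. Where "<t>" appears in the following it stands for that literal tab; in ksh93 or bash you can write $'\t' instead.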

Code:
field1=""
field2=""
field3=""

# $'\t' is a literal tab (ksh93, bash); -r keeps backslashes in the data intact
while IFS=$'\t' read -r field1 field2 field3 ; do
     print - "field1: ${field1}"
     print - "field2: ${field2}"
     print - "field3: ${field3}"
     print - "----------"
done < /path/to/your/data

Check the output first. You may have to adjust the field definitions. Once you are satisfied, add the next step: extracting the content from field2. We use shell parameter expansion for this; see the ksh man page for details.

Code:
field1=""
field2=""
field3=""

# $'\t' is a literal tab (ksh93, bash); -r keeps backslashes in the data intact
while IFS=$'\t' read -r field1 field2 field3 ; do
     print - "field1: ${field1}"

     field2="${field2#?}"         # strip the first character, "("
     field2="${field2%%\)*}"      # strip the first ")" and everything after it

     print - "field2: ${field2}"
     print - "field3: ${field3}"
     print - "----------"
done < /path/to/your/data

Test again. If you are still satisfied, "print" the final version:

Code:
field1=""
field2=""
field3=""

# $'\t' is a literal tab (ksh93, bash); -r keeps backslashes in the data intact
while IFS=$'\t' read -r field1 field2 field3 ; do
     field2="${field2#?}"         # strip the first character, "("
     field2="${field2%%\)*}"      # strip the first ")" and everything after it

     print - "${field1}\t${field2}\t${field3}"
done < /path/to/your/data > /path/to/your/output
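
For shells without ksh's print builtin, the same loop can be sketched with printf. This is only an illustration under that assumption: `trim_ec` is a hypothetical function name, and it reads the tab-separated data from standard input instead of a fixed path.

```shell
# Sketch: the same loop in plain POSIX sh, with printf in place of print.
# trim_ec is a hypothetical helper; it reads tab-separated lines on stdin.
tab=$(printf '\t')   # a literal tab, obtained portably

trim_ec() {
  while IFS="$tab" read -r field1 field2 field3; do
    field2=${field2#?}       # strip the first character, "("
    field2=${field2%%\)*}    # strip the first ")" and everything after it
    printf '%s\t%s\t%s\n' "$field1" "$field2" "$field3"
  done
}

# Usage with one line from the question:
result=$(printf 'Adenine deaminase (ADE)\t(EC 3.5.4.2) (Adenine aminohydrolase) (AAH) aah1\tAdy2G09150\n' | trim_ec)
printf '%s\n' "$result"
```

On a real file it would be called as `trim_ec < inputfile > outputfile`.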


I hope this helps.

bakunin
# 3  
Old 08-26-2014
With awk you can use split() to further select the subfields that you need:
Code:
awk '{split($2,F,/[()]/); print $1, F[2], $3}' FS='\t' OFS='\t' file

/[()]/ is a regular expression that matches either parenthesis, so split() cuts the second field at every "(" and ")" and stores the pieces in the array "F". The second array element, F[2], then contains the text inside the first pair of parentheses.
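
The split() approach relies on the EC group being the first parenthesised group in the second field, which holds for the sample data. If that were not guaranteed, one possible variation (a sketch, not from the thread) is to locate the group with match() instead. The input line below is made up for illustration and does not come from the question.

```shell
# Hypothetical input: the EC group is NOT the first group in field 2.
line=$(printf 'Some protein\t(Alias X) (EC 1.2.3.4) geneX\tAdy0G00000')

result=$(printf '%s\n' "$line" |
  awk 'BEGIN { FS = OFS = "\t" }
       # match() sets RSTART and RLENGTH to the position and length of "(EC ...)"
       match($2, /\(EC [^)]*\)/) {
         # keep the matched text without its surrounding parentheses
         $2 = substr($2, RSTART + 1, RLENGTH - 2)
       }
       { print }')   # lines without an EC group are printed unchanged

printf '%s\n' "$result"
```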
# 4  
Old 08-26-2014
Hi Scrutinizer,

It worked great, and thanks so much for your explanation. I might sound stupid, but I don't understand why the second array element (F[2]) is the first text in parentheses. What is the first element then? Is it the parenthesis itself? Thanks.

---------- Post updated at 10:44 AM ---------- Previous update was at 10:41 AM ----------

Hi bakunin,

Thanks so much for your great response. I normally use awk and sed in my work and I am still learning, so your code is quite new to me. I will look into it, work on it and give feedback asap. This is great, as I got a chance to learn new stuff.
# 5  
Old 08-26-2014
Good to hear. The first element of the array contains the empty string before the first opening parenthesis.

If we take the second field of the first line as an example:
Code:
(EC 3.2.1.99) (Endo-1,5-alpha-L-arabinanase A) (ABN A) abnA

Code:
F[1] contains ""
F[2] contains "EC 3.2.1.99"
F[3] contains " "
F[4] contains "Endo-1,5-alpha-L-arabinanase A"
F[5] contains " "
F[6] contains "ABN A"
F[7] contains " abnA"
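
To check this listing yourself, a small sketch like the following prints every element that split() produces for that field. The variable names `field2` and `result` are only for illustration.

```shell
# Sketch: print each element split() produces for the sample field.
field2='(EC 3.2.1.99) (Endo-1,5-alpha-L-arabinanase A) (ABN A) abnA'

result=$(printf '%s\n' "$field2" |
  awk '{
         n = split($0, F, /[()]/)   # n is the number of pieces
         for (i = 1; i <= n; i++)
           printf "F[%d] contains \"%s\"\n", i, F[i]
       }')

printf '%s\n' "$result"
```

Six parentheses in the field act as six separators, which is why split() returns seven pieces, the first of them empty.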

# 6  
Old 08-26-2014

Cool, thanks a bunch. Now I get it.