awk split | Unix Linux Forums | UNIX for Dummies Questions & Answers

awk split

    #1  
08-12-2011
heecha

Hi Folks,

I have lines that look like this:


Code:
>m110730_101608_00120_c100168052554400000315046108261127_s1_p0/7/29_426ACGTGCTATGCGG
>m110730_101608_00120_c100168052554400000315046108261127_s1_p0/7/469_894ACGTGCTATGCGG

I want to split all lines into:


Code:
>m110730_101608_00120_c100168052554400000315046108261127_s1_p0/7/29_426     ACGTGCTATGCGG
>m110730_101608_00120_c100168052554400000315046108261127_s1_p0/7/469_894 ACGTGCTATGCGG

Then I want to print:


Code:
ACGTGCTATGCGG
ACGTGCTATGCGG


Code:
awk '{split($0,a,"[A;C;G;T]");print a[1]}'

gives:


Code:
>m110730_101608_00120_c100168052554400000315046108261127_s1_p0/7/29_426
>m110730_101608_00120_c100168052554400000315046108261127_s1_p0/7/469_894

but


Code:
awk '{split($0,a,"[A;C;G;T]");print a[2]}'

gives nothing.

I can split in other ways, e.g.,


Code:
awk '{split($0,a,"/[0-9]_[0-9]");print a[2]}'

or


Code:
awk '{split($0,a,"/[0-9]_[0-9][A;C;G;T]");print a[2]}'

a[1] always prints correctly, but a[2] is always "empty".

What am I doing wrong?

Thanks for your help.
Robert

    #2  
08-12-2011
alister
You can't expect characters that are used to split a string to be part of the result. If you split "1,2,3,4" on the comma, by definition the comma is not an allowed member of a field. Same goes with a bracket expression such as "[ACGT]"; splitting on such an expression forbids A, C, G, and T from occurring in a field.
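
A minimal sketch of that point, assuming any POSIX awk. The shortened input strings and the variable names fields and second are made up for illustration. Splitting on a comma leaves four fields with no commas in them, and splitting on "[ACGT]" leaves a[2] empty because the first two separators (A and C) are adjacent with nothing between them:

```shell
# Splitting on "," removes the commas; only the text between them survives.
fields=$(echo '1,2,3,4' | awk '{ n = split($0, a, ","); print n ": " a[1] " " a[2] " " a[3] " " a[4] }')
echo "$fields"

# Here A, C, G and T are all separators. a[1] is the text before the first A;
# a[2] is the (empty) text between the adjacent A and C.
second=$(echo '29_426ACGT' | awk '{ split($0, a, "[ACGT]"); print "[" a[1] "][" a[2] "]" }')
echo "$second"
```

The first command prints 4: 1 2 3 4 and the second prints [29_426][].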

Assuming I understood what you were trying to do, the semicolons in your bracket expressions are incorrect. Characters in a bracket expression should not be delimited. To split on the four letters "A", "C", "G", and "T", "[ACGT]" is all that's needed. Adding those semicolons will cause splitting on semicolons as well.
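
To see the effect of the semicolons, here is a small sketch on a made-up string (x;yAz is not from the data in this thread, and the variable names are only for illustration). It prints the number of fields split() produces with each bracket expression:

```shell
# With semicolons inside the brackets, ";" is a separator too: x, y, z.
with_semicolons=$(echo 'x;yAz' | awk '{ print split($0, a, "[A;C;G;T]") }')

# Without them, only the letters split the string: "x;y" and "z".
letters_only=$(echo 'x;yAz' | awk '{ print split($0, a, "[ACGT]") }')

echo "$with_semicolons $letters_only"
```

This prints 3 2: three fields when the semicolon counts as a separator, two when it does not.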

Looking at your data:

Code:
>m110730_101608_00120_c100168052554400000315046108261127_s1_p0/7/29_426ACGTGCTATGCGG

If you just want to print the base sequence at the end of the line (ACGTGCTATGCGG here), and if it's always preceded by the final digit in the line, the following will do:

Code:
sed 's/.*[[:digit:]]//'

Or if the base sequence always begins at the 4th character past the final underscore:

Code:
sed 's/.*_...//'
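
For reference, a sketch that runs both sed commands on the two sample lines from the first post, plus an awk equivalent of the first one using sub(), in case awk is preferred. The variable names are just for illustration:

```shell
lines='>m110730_101608_00120_c100168052554400000315046108261127_s1_p0/7/29_426ACGTGCTATGCGG
>m110730_101608_00120_c100168052554400000315046108261127_s1_p0/7/469_894ACGTGCTATGCGG'

# Delete everything up to and including the last digit on each line.
by_digit=$(printf '%s\n' "$lines" | sed 's/.*[[:digit:]]//')

# Delete everything up to and including the final underscore
# and the three characters after it.
by_underscore=$(printf '%s\n' "$lines" | sed 's/.*_...//')

# Same idea as the first sed command, written with awk's sub().
by_awk=$(printf '%s\n' "$lines" | awk '{ sub(/.*[0-9]/, ""); print }')

printf '%s\n' "$by_digit"
```

All three variables end up holding ACGTGCTATGCGG twice, one per input line.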

Regards,
Alister

    #3  
08-12-2011
dude2cool
Try this awk one-liner and see if it helps:


Code:
echo "m110730_101608_00120_c100168052554400000315046108261127_s1_p0/7/29_426ACGTGCTATGCGG" | awk -F'A' '{print $1 " " "A" $2 "A" $3}'


Of course, if you want to print just the ACGT... sequence:


Code:
echo "m110730_101608_00120_c100168052554400000315046108261127_s1_p0/7/29_426ACGTGCTATGCGG" | awk -F'A' '{print "A" $2 "A" $3}'
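
Note that with -F'A' every A in the sequence is a separator, so a print statement written around $2 and $3 only covers a sequence containing exactly two As. Here is a sketch that does not depend on the number of As, assuming the sequence is the run of A, C, G, and T characters at the end of the line. It uses awk's match() and RSTART; the variable names are made up for illustration:

```shell
line='m110730_101608_00120_c100168052554400000315046108261127_s1_p0/7/29_426ACGTGCTATGCGG'

# match() sets RSTART to the position where the trailing run of bases begins.
# Header, a space, then the sequence:
both=$(echo "$line" | awk 'match($0, /[ACGT]+$/) { print substr($0, 1, RSTART - 1) " " substr($0, RSTART) }')

# Sequence only:
only_seq=$(echo "$line" | awk 'match($0, /[ACGT]+$/) { print substr($0, RSTART) }')

echo "$both"
echo "$only_seq"
```

Lines that do not end in a run of bases are skipped, because match() returns 0 for them.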


    #4  
08-12-2011
heecha
Alister,
Thanks for taking a look, and for your comments, quite helpful. The solution you offered did the trick. I sincerely appreciate your time.

Best,
Robert
    #5  
08-12-2011
EAGL€
Quote:
Originally Posted by dude2cool
Try this awk one-liner and see if it helps:


Code:
echo "m110730_101608_00120_c100168052554400000315046108261127_s1_p0/7/29_426ACGTGCTATGCGG" | awk -F'A' '{print $1 " " "A" $2 "A" $3}'

Of course, if you want to print just the ACGT... sequence:


Code:
echo "m110730_101608_00120_c100168052554400000315046108261127_s1_p0/7/29_426ACGTGCTATGCGG" | awk -F'A' '{print "A" $2 "A" $3}'
Alister's solution is certainly the simplest and best, but if you want to do it with awk, then using a separator other than "A" would be better, I guess:

Code:
echo ">m110730_101608_00120_c100168052554400000315046108261127_s1_p0/7/29_426ACGTGCTATGCGG" | awk -F_ 'c=substr($NF,4,13){print c}'
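
Here the assignment c=substr($NF,4,13) doubles as the pattern, so the line is printed only when the extracted text is non-empty. The 4 and the 13 assume three digits after the last underscore and a 13-base sequence. A sketch of a variant without those fixed numbers, assuming the last "_"-separated field is always digits followed by the sequence (the second input line is made up to show a different digit count and length):

```shell
lines='>m110730_101608_00120_c100168052554400000315046108261127_s1_p0/7/29_426ACGTGCTATGCGG
>hypothetical_read/7/1000_2345ACGTT'

# Take the last "_"-separated field, strip its leading digits,
# and print whatever is left.
seqs=$(printf '%s\n' "$lines" | awk -F_ '{ s = $NF; sub(/^[0-9]+/, "", s); print s }')

printf '%s\n' "$seqs"
```

This prints ACGTGCTATGCGG for the real line and ACGTT for the made-up one.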
