awk split | Unix Linux Forums | UNIX for Dummies Questions & Answers

#1 heecha, 08-12-2011

Hi Folks,

I have lines that look like this:


Code:
>m110730_101608_00120_c100168052554400000315046108261127_s1_p0/7/29_426ACGTGCTATGCGG
>m110730_101608_00120_c100168052554400000315046108261127_s1_p0/7/469_894ACGTGCTATGCGG

I want to split all lines into:


Code:
>m110730_101608_00120_c100168052554400000315046108261127_s1_p0/7/29_426     ACGTGCTATGCGG
>m110730_101608_00120_c100168052554400000315046108261127_s1_p0/7/469_894 ACGTGCTATGCGG

Then I want to print:


Code:
ACGTGCTATGCGG
ACGTGCTATGCGG


Code:
awk '{split($0,a,"[A;C;G;T]");print a[1]}'

gives:


Code:
>m110730_101608_00120_c100168052554400000315046108261127_s1_p0/7/29_426
>m110730_101608_00120_c100168052554400000315046108261127_s1_p0/7/469_894

but


Code:
awk '{split($0,a,"[A;C;G;T]");print a[2]}'

gives nothing.

I can split in other ways, e.g.,


Code:
awk '{split($0,a,"/[0-9]_[0-9]");print a[2]}'

or


Code:
awk '{split($0,a,"/[0-9]_[0-9][A;C;G;T]");print a[2]}'

a[1] always prints correctly, but a[2] is always "empty".

What am I doing wrong?

Thanks for your help.
Robert

#2 alister, 08-12-2011
You can't expect characters that are used to split a string to be part of the result. If you split "1,2,3,4" on the comma, by definition the comma is not an allowed member of a field. Same goes with a bracket expression such as "[ACGT]"; splitting on such an expression forbids A, C, G, and T from occurring in a field.

Assuming I understood what you were trying to do, the semicolons in your bracket expressions are incorrect. Characters in a bracket expression should not be delimited. To split on the four letters "A", "C", "G", and "T", "[ACGT]" is all that's needed. Adding those semicolons will cause splitting on semicolons as well.
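
To see this in action, here is a small sketch that prints how many pieces split() returns for "[ACGT]" and what each piece holds. The function name show_split and the shortened sample string are illustrative only, not from the original post. Because A, C, G and T sit next to each other in the data, the pieces after the first should come out empty, which is why a[2] prints nothing.

```shell
# Show how many pieces split() produces and what each one holds.
# The sample string is a shortened, made-up stand-in for the real header lines.
show_split() {
    printf '%s\n' "$1" | awk '{
        n = split($0, a, "[ACGT]")      # every A, C, G or T is a separator
        printf "n=%d\n", n
        for (i = 1; i <= n; i++) printf "a[%d]=<%s>\n", i, a[i]
    }'
}

show_split 'p0/7/29_426ACGT'
```

The angle brackets make the empty pieces visible: a[1] holds the text before the first base, and a[2] shows up as <>.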

Looking at your data:

Code:
>m110730_101608_00120_c100168052554400000315046108261127_s1_p0/7/29_426ACGTGCTATGCGG

If you just want to print the base sequence at the end of the line, and if it's always preceded by the final number in the line, the following will do:

Code:
sed 's/.*[[:digit:]]//'

Or if the base sequence always begins at the 4th character past the final underscore:

Code:
sed 's/.*_...//'
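
If you would rather stay in awk, one possible equivalent is to locate the run of bases at the end of the line with match() instead of splitting. This is only a sketch: the function name trailing_bases is made up here, and it assumes the sequence is uppercase A/C/G/T and ends the line.

```shell
# Print only the trailing run of A/C/G/T characters from each line.
# Assumes the sequence is uppercase and sits at the very end of the line.
trailing_bases() {
    # match() sets RSTART to where the trailing run begins
    awk 'match($0, /[ACGT]+$/) { print substr($0, RSTART) }'
}

# The two sample lines from the first post.
printf '%s\n' \
    '>m110730_101608_00120_c100168052554400000315046108261127_s1_p0/7/29_426ACGTGCTATGCGG' \
    '>m110730_101608_00120_c100168052554400000315046108261127_s1_p0/7/469_894ACGTGCTATGCGG' \
    | trailing_bases
```

Lines that do not end in a base are skipped rather than printed unchanged, which differs from the sed versions.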

Regards,
Alister

#3 dude2cool, 08-12-2011
Try this awk one-liner and see if it helps:


Code:
echo "m110730_101608_00120_c100168052554400000315046108261127_s1_p0/7/29_426ACGTGCTATGCGG"|awk -F'A' '{print $1 " " "A" $2 $3}'

---------- Post updated at 02:36 PM ---------- Previous update was at 02:34 PM ----------

Of course, if you want to print just the ACGT... part:


Code:
echo "m110730_101608_00120_c100168052554400000315046108261127_s1_p0/7/29_426ACGTGCTATGCGG"|awk -F'A' '{print  "A" $2 $3}'
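
One caveat with -F'A': the separator is removed wherever it occurs, so any A inside the sequence is lost. For the sample line, $2 is CGTGCT and $3 is TGCGG, so "A" $2 $3 gives ACGTGCTTGCGG, one base short. Below is a hedged sketch of a variant that puts each A back. The name bases_from_first_A is hypothetical, and it assumes the first uppercase A on the line starts the sequence.

```shell
# Rebuild everything from the first "A" onward by rejoining the fields with "A".
# Assumes the first uppercase A on the line is the start of the base sequence.
bases_from_first_A() {
    awk -F'A' 'NF > 1 {
        s = ""
        for (i = 2; i <= NF; i++) s = s "A" $i   # put back each separator
        print s
    }'
}

printf '%s\n' 'm110730_101608_00120_c100168052554400000315046108261127_s1_p0/7/29_426ACGTGCTATGCGG' \
    | bases_from_first_A
```

A sequence that starts with C, G or T would still be cut at the wrong place, so this only patches the dropped-A problem.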


#4 heecha, 08-12-2011
Alister,
Thanks for taking a look, and for your comments, which were quite helpful. The solution you offered did the trick. I sincerely appreciate your time.

Best,
Robert
#5 EAGL€, 08-12-2011
Quote:
Originally Posted by dude2cool
Try this awk one-liner and see if it helps:

Code:
echo "m110730_101608_00120_c100168052554400000315046108261127_s1_p0/7/29_426ACGTGCTATGCGG"|awk -F'A' '{print $1 " " "A" $2 $3}'

---------- Post updated at 02:36 PM ---------- Previous update was at 02:34 PM ----------

Of course, if you want to print just the ACGT... part:

Code:
echo "m110730_101608_00120_c100168052554400000315046108261127_s1_p0/7/29_426ACGTGCTATGCGG"|awk -F'A' '{print "A" $2 $3}'

Alister's solution is certainly simpler and better, but if you want to do it with awk, using a separator other than "A" could be better, I guess:

Code:
echo ">m110730_101608_00120_c100168052554400000315046108261127_s1_p0/7/29_426ACGTGCTATGCGG" | awk -F_ 'c=substr($NF,4,13){print c}'
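
The substr($NF,4,13) call relies on the last number having exactly three digits and the sequence being 13 bases long. If either can vary, a possible variant keeps the _ separator but removes whatever digits lead the last field. This is a sketch only: strip_leading_digits is an illustrative name, and the second input line is made up to show a different length.

```shell
# Take the text after the final underscore and drop its leading digits.
# Assumes the base sequence directly follows the last number on the line.
strip_leading_digits() {
    awk -F_ '{
        s = $NF                 # text after the final underscore
        sub(/^[0-9]+/, "", s)   # drop the leading number, whatever its length
        print s
    }'
}

# First line is from the thread; the second is a made-up line with other lengths.
printf '%s\n' \
    '>m110730_101608_00120_c100168052554400000315046108261127_s1_p0/7/29_426ACGTGCTATGCGG' \
    '>example_s1_p0/7/5_12ACGTTGCA' \
    | strip_leading_digits
```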
