The UNIX and Linux Forums  

#1
05-09-2009
kylle345
Registered User
Join Date: May 2009
Posts: 46
Need help with a file that prints letters from a file according to another file!

So basically what I want to do is pull out DNA sequences for a particular gene name.

I have 2 files (FILE1 and FILE2) and I want the output written to a separate file (FILE3).

FILE1 and FILE2 are MASSIVE, so I am only posting examples from each file.

So FILE1 looks like this (tab delimited, 4 columns):

##gff-version 1

1154 10 + AAD6
418 7429 + AAH1
702 759 + AAT1
584 10 - ABF2
642 4894 - ACC1
651 7213 - ACN9
1055 3454 - ADE1

The next file, FILE2, looks like this:


>1154
ATCTCACTCGTAATTCTACATAATTTTGTTTATGCTTTTATTGTCATTTTATATATTGTCAGTCATTATCCTATTACATTATCAATCCTTGCATTTCAGCTTCCACTTATTTCGATGACCGCTTCTCATAACTTATGTCATCTTCTAACACCGTATATGATAATGTACCAGTAGTATGAC
>584
GCAAGCTTTATAGTGACAACAATAAGGTATCACTCGGTTACAATTACCCCCACTTCCCCT


What I want to do is match column 1 of FILE1 with the >number header in FILE2. For example, 1154 in FILE1 matches >1154 in FILE2. Next, I want to find the letter at the position given in column 2 (for 1154, that is the 10th letter, which happens to be G).

If column 3 of FILE1 is +, it should print the 8 letters in front of that position (i.e. the 8 letters in front of G, which are TCTCACTC).

But if column 3 is -, it should count from the other end. So for ABF2 on “584”, it will take the 8 letters counting from the reverse end. Instead of starting at “G” at the start of >584, it will start at “T” (the end). So the position of ABF2 will be 25 letters away from “T”, so the letter will be “C”. Then it will take the letters behind it, so CCACTTCC.

Each line of the output file should contain column 4 of FILE1, the 8 letters taken from FILE2, and column 3 of FILE1.

The final file (FILE3) will look like this:

AAD6 TCTCACTC +
ABF2 CCACTTCC -


Could someone give me some help with this? I am new to Perl and have been put in a situation where I have to program at a very high level.

Thanks
#2
05-09-2009
devtakh
Registered User
Join Date: Oct 2007
Location: Bangalore
Posts: 514
I am not clear on this part -

HTML Code:
But if column 3 is -, it should count from the other end. So for ABF2 on “584”, it will take the 8 letters counting from the reverse end. Instead of starting at “G” at the start of >584, it will start at “T” (the end). So the position of ABF2 will be 25 letters away from “T”, so the letter will be “C”. Then it will take the letters behind it, so CCACTTCC.
#3
05-10-2009
devtakh
Registered User
Join Date: Oct 2007
Location: Bangalore
Posts: 514
Try this.

Code:
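awk '
# file1: remember position, strand and gene name for each sequence id
# (skips the "##gff-version" header and blank lines; one gene per id is assumed)
FNR == NR && NF >= 4 { pos[$1] = $2; strand[$1] = $3; gene[$1] = $4 }
FNR == NR { next }
# file2: a ">id" header followed by its sequence on the next line
/^>/ {
    id = substr($1, 2)
    if (!(id in pos) || (getline seq) <= 0) next
    if (strand[id] == "+")
        str = substr(seq, pos[id] - 8, 8)            # the 8 letters before the position
    else
        str = substr(seq, length(seq) - pos[id], 8)  # 8 letters, starting pos letters before the last one
    print gene[id], str, strand[id]
}' file1 file2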

cheers,
Devaraj Takhellambam
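
For the minus-strand rule that post #2 found unclear, it may help to check the arithmetic behind the command by hand. The sketch below applies the same two `substr` calls to the sample sequences from the first post, using column 2 = 10 for both AAD6 and ABF2 as listed in FILE1. The output name `check.txt` is only an assumption for this example, and the >1154 sequence is cut to its first 20 letters because only the start matters here.

```shell
#!/bin/sh
# Worked check of the index arithmetic on the sample data from the thread.
# check.txt is a hypothetical output name used only for this sketch.
awk 'BEGIN {
    plus  = "ATCTCACTCGTAATTCTACA"   # first 20 letters of the >1154 sequence
    minus = "GCAAGCTTTATAGTGACAACAATAAGGTATCACTCGGTTACAATTACCCCCACTTCCCCT"   # >584
    p = 10                           # column 2 for both AAD6 and ABF2

    # "+" strand: the window is the 8 letters that end just before letter p.
    printf "+ start=%d %s\n", p - 8, substr(plus, p - 8, 8)

    # "-" strand: the window starts p letters before the last letter and is
    # read left to right, exactly as stored in the file.
    n = length(minus)
    printf "- start=%d %s\n", n - p, substr(minus, n - p, 8)
}' > check.txt
cat check.txt
```

This prints `+ start=2 TCTCACTC` and `- start=50 CCACTTCC`, which are the two strings shown in the expected FILE3. Note that the minus-strand letters are taken in file order; nothing is reversed or complemented, which is what the expected output in the first post implies.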
#4
05-10-2009
ghostdog74
Forum Advisor
Registered User
Join Date: Sep 2006
Posts: 2,511
How massive are your file1 and file2, in terms of MB or GB?
Also, if you are new to Perl, you should at least read up on Perl before attempting this.
The UNIX and Linux Forums Content Copyright ©1993-2009. All Rights Reserved.