bash script to parse sequence...


 
# 1  
Old 12-02-2010

Hi,

I have 4000 list files and 4000 sequence data files. Each list file contains a number of 'headers', and each data file contains 'headers and data'. I would like to extract data from the data file using the list file and write it into a new file. As each of the files is quite large, an efficient script (preferably bash) would be much appreciated. Example below:

Example list file:
HTML Code:
contig00002 length=653   numreads=34
contig00005 length=636   numreads=21
contig00015 length=662   numreads=51
contig00033 length=584   numreads=24
contig00045 length=539   numreads=19
contig00073 length=454   numreads=67
contig00046 length=660   numreads=27
contig00014 length=746   numreads=18
contig00089 length=298   numreads=19
.....
.....
Example data file:
HTML Code:
>contig00001 length=477   numreads=22
GGGGCTGACGTGGCCGCTAATACGACTCACTATAGGGAGAGTAAGTGAAT
GTCACATCGTTTGGATCAAGACCCATTTGCAGCACAAGCCCTGTTTTGTT
>contig00002 length=530   numreads=27
GGGCTGACGTGGCCGCTAATACGACTCACTATAGGGAGAGGAGGATAGGG
AGCTGAGCAGCCAGTGACAGGATCCAGCTCCAGGGGGTGAATGGGGATGG
>contig00004 length=670   numreads=22
GGGGCTGACGTGGCCGCTAATACGACTCACTATAGGGAGAGATTGTTGAA
GTGGAAAGCCATTTTGACTATTACCGCCCGGTGGCAGAAACCAAACCTGG
.....
....
Example output file:
HTML Code:
>contig00002 length=653   numreads=34
GGGCAGCTGCGGCCGCTAATACGACTCACTATAGGGAGAGGCTTGCTCAA
ATCCGCGTTCAAGGATTTCCAGATTGGTAAGAACTTCAGATTCCTTGACG
>contig00005 length=636   numreads=21
GGGCAGCTGCGGCCGCTAATACGACTCACTATAGGGAGAGATCGTGGCGA
TCGCCAATCACCCAGGTGCCGTTAGCCAGAGCTGGTTTGATGACCGTTTC
>contig00015 length=662   numreads=51
GGGCAGCTGCGGCCGCTAATACGACTCACTATAGGGAGAGAGCTCCAGCA
GAATGGACACGCCTCCTGAGCTGTGATAGGGAGAGCATAAACACGCCTCC
.....
.....
Thanks in advance.
# 2  
Old 12-02-2010
Instead of grepping the data out of the data file, look up the length and numreads values in the list file.
Try the script below.

Code:
listFile="/path/listFile.txt"
dataFile="/path/dataFile.txt"
>outputFile

cat $dataFile | while read line
do
if [ "$line" = ".*contig.*length.*" ] ; then
     header=`echo $line | cut -d" " -f1`
     optLine=`grep $header $listFile`
else
     optLine="$line"
fi
echo "$optLine" >> outputFile
done

Use sed if the cut command does not work.
R0H0N
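
A note on the header test in the script above, as a minimal sketch rather than a full fix: inside `[ ... ]`, `=` compares two strings literally, so a line is never equal to the text `.*contig.*length.*` and every line falls through to the `else` branch. In bash, `[[ ... =~ ... ]]` performs a regular-expression match instead. The function name `is_header` and the two sample lines are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical helper: succeeds when the line looks like a header such as
# ">contig00002 length=530   numreads=27".
is_header() {
    local re='^>contig[0-9]+[[:space:]]+length='
    [[ $1 =~ $re ]]
}

for line in ">contig00002 length=530   numreads=27" "GGGCTGACGTGGCC"
do
    if is_header "$line"; then
        # ${line%% *} keeps everything before the first space
        echo "header: ${line%% *}"
    else
        echo "data:   $line"
    fi
done
```

Keeping the pattern in a variable avoids quoting surprises on the right-hand side of `=~`.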
# 3  
Old 12-02-2010
Hi ROHON,

Thanks for your reply. Just two issues:

1. The code produces output that is exactly the same as the input 'DataFile'. The headers in the two files are not the same: the list file contains fewer headers. For example, 'contig00001 length=477 numreads=22' is not present in the 'list' file and should not be retrieved.

2. I've 4000 list and data files. Is there a way to loop over all the files in one run?

Cheers.
# 4  
Old 12-02-2010
I'm sure there's a way to loop over your data files. However you've told us nothing about them other than you have a lot of them so it's a little hard to help. Do they have filenames? Do you have a list of them? Are they organized in any way?

Quote:
Originally Posted by R0H0N
Code:
listFile="/path/listFile.txt"
dataFile="/path/dataFile.txt"
>outputFile

cat $dataFile | while read line
You've replaced the useless use of grep with a useless use of cat.

I suggest this instead:

Code:
while read RECORD LENGTH NUMREADS
do
        IFS="=" read G LENGTH <<< "${LENGTH}"
        IFS="=" read G NUMREADS <<< "${NUMREADS}"

        ...
done < listfile
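
To illustrate what the loop above leaves in each variable, here is a small sketch run against one line from the example list file. The function name `parse_list_line` is hypothetical, and the line layout is assumed to be exactly "name length=N numreads=M".

```shell
#!/bin/bash
# Hypothetical helper: split one list-file line into name, length and numreads.
parse_list_line() {
    local RECORD LENGTH NUMREADS G
    # Split on whitespace first; a run of spaces counts as one separator
    read -r RECORD LENGTH NUMREADS <<< "$1"
    # Then split "length=653" on "=", discarding the key into G
    IFS="=" read -r G LENGTH <<< "${LENGTH}"
    IFS="=" read -r G NUMREADS <<< "${NUMREADS}"
    echo "${RECORD} ${LENGTH} ${NUMREADS}"
}

parse_list_line "contig00002 length=653   numreads=34"
```

Setting `IFS` on the same line as `read` changes the separator for that one command only, so the outer loop still splits on whitespace.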

Still working on the data processing. Should post in a while.

---------- Post updated at 09:52 AM ---------- Previous update was at 09:43 AM ----------

Does the output data have to be produced in the same order as the input data? That's going to be a royal pain because it's in random order, whereas the input file is sorted. That means starting over at the top of the datafile for every record instead of reading as you go.

---------- Post updated at 09:55 AM ---------- Previous update was at 09:52 AM ----------

Quote:
1. The code produces output that is exactly the same as the input 'DataFile'. The headers in the two files are not the same: the list file contains fewer headers. For example, 'contig00001 length=477 numreads=22' is not present in the 'list' file and should not be retrieved.
Well, that turns this on its head. I hope you mean data file, because if the list file contains no listings that makes no sense at all... What does the data file look like then? What relationship do the list fields have with the data to be retrieved? Is length bytes? What does numreads mean? How do we calculate some sort of offset from this?
# 5  
Old 12-02-2010
Hi Rohon,

The list files are as follows: List1.txt, List2.txt, ....., List4000.txt
And the corresponding data files are: DataFile1.txt, DataFile2.txt, ...., DataFile4000.txt
List1.txt will be used for DataFile1.txt and so on.

Contents of both the files are not sorted. For example:
List1.txt
HTML Code:
contig00002 length=653   numreads=34
contig00015 length=636   numreads=21
contig00005 length=662   numreads=51
contig00045 length=584   numreads=24
contig00033 length=539   numreads=19
DataFile1.txt
HTML Code:
>contig00015 length=477   numreads=22
GGGGCTGACGTGGCCGCTAATACGACTCACTATAGGGAGAGTAAGTGAAT
GTCACATCGTTTGGATCAAGACCCATTTGCAGCACAAGCCCTGTTTTGTT
>contig00002 length=530   numreads=27
GGGCTGACGTGGCCGCTAATACGACTCACTATAGGGAGAGGAGGATAGGG
AGCTGAGCAGCCAGTGACAGGATCCAGCTCCAGGGGGTGAATGGGGATGG
>contig00005 length=670   numreads=22
GGGGCTGACGTGGCCGCTAATACGACTCACTATAGGGAGAGATTGTTGAA
GTGGAAAGCCATTTTGACTATTACCGCCCGGTGGCAGAAACCAAACCTGG
>contig00045 length=636   numreads=21
GGGCAGCTGCGGCCGCTAATACGACTCACTATAGGGAGAGATCGTGGCGA
TCGCCAATCACCCAGGTGCCGTTAGCCAGAGCTGGTTTGATGACCGTTTC
>contig00072 length=662   numreads=51
GGGCAGCTGCGGCCGCTAATACGACTCACTATAGGGAGAGAGCTCCAGCA
GAATGGACACGCCTCCTGAGCTGTGATAGGGAGAGCATAAACACGCCTCC
One way would be, read one header from the list file, say 'contig00002 length=653 numreads=34', search for it in the DataFile1.txt and retrieve the following:
'>contig00002 length=530 numreads=27
GGGCTGACGTGGCCGCTAATACGACTCACTATAGGGAGAGGAGGATAGGG
AGCTGAGCAGCCAGTGACAGGATCCAGCTCCAGGGGGTGAATGGGGATGG'

Cheers.
# 6  
Old 12-02-2010
--- removed ---

Last edited by ctsgnb; 12-02-2010 at 01:09 PM.. Reason: Ooops
# 7  
Old 12-02-2010
Quote:
Originally Posted by Fahmida
Contents of both the files are not sorted.
If nothing's sorted and it has to produce output in arbitrary order, there's no efficient solution, since there's no way but sheer brute force for finding the records.
Code:
for ((N=1; N<=4000; N++))
do
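        while read -r RECORD REST
        do
                # Open datafile into FD 5
                exec 5<"DataFile${N}.txt"

                # Find that entry in the file, leaving file pos at the next line.
                # A plain ">" is needed: "\>" is an end-of-word anchor in GNU grep
                # and never matches at the start of a line. The trailing group
                # stops contig0001 from also matching contig00012.
                ( if grep -m 1 -E "^>${RECORD}([[:space:]]|\$)"
                then
                        # Print each line until we find the next record, or EOF
                        while IFS= read -r LINE
                        do
                                [ "${LINE:0:1}" == ">" ] && break
                                echo "$LINE"
                        done
                fi ) <&5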

                # Close the file
                exec 5<&-
        done < List${N}.txt > output${N}.txt
done


Last edited by Corona688; 12-02-2010 at 01:11 PM.. Reason: ${1} should be ${N}
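
If the output may follow the order of the data file rather than the order of the list file, one pass per file pair is enough. The sketch below is an alternative under that assumption, not a replacement for the loop above: it uses awk rather than pure bash, the function name `extract_listed` is hypothetical, and it prints the headers as they appear in the data file.

```shell
#!/bin/bash
# Hypothetical helper: print every record of a data file whose name appears
# in the list file. Usage: extract_listed LISTFILE DATAFILE
extract_listed() {
    awk '
        # First file (the list): remember each record name
        NR == FNR { wanted[$1] = 1; next }
        # Header line in the data file: strip ">" and look the name up
        /^>/ { keep = (substr($1, 2) in wanted) }
        # Print headers and sequence lines of wanted records only
        keep
    ' "$1" "$2"
}

# Loop over all numbered pairs, skipping any pair that is missing
for ((N=1; N<=4000; N++))
do
    [ -f "List${N}.txt" ] && [ -f "DataFile${N}.txt" ] || continue
    extract_listed "List${N}.txt" "DataFile${N}.txt" > "output${N}.txt"
done
```

The list names are held in memory while the data file is read once from top to bottom, so nothing needs to be sorted and the data file is never rescanned.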
 