Extracting and copying text from one file to another


 
# 1  
Old 01-13-2016

Hello,

I have a .fasta file (a text file with sequence data) with just over 3 million lines of data. It looks like this:

Code:
>TCONS_00000001 gene=XLOC_000001
AATTGTGGTGAAATGACTTCTGTTAACGGAGACATCGATGATTGTTGTTACTATTTGTTCTCAGGATTCA
TTTGTCCGGTTCATACCCCGGACGGCGCCCCTTGCGGGCTGCTCAATCACCTGACAATGAACTGTATCGT
CACGAAGCATCCGGATCGCAAATTAAAGGCTGCGCTACCAACGGTGCTGGTGGATCTAGGAATGCTTCCG
TTGTCTGTTGCGAATAATTGGAAGGACTCGTACACGGTAATGCTGAATGGTAAAGTGATCGGCCTGATCG
AAGATAATATTGTTGATAAGGTGGCCCGCAAACTAAGGCAGCTGAAGATAATTGGTGAAGAGGTGCCGAA
CACGTTGGAGATCGCGCTGGTGCCGAAGAGGAAGG
>TCONS_00000002 gene=XLOC_000002
CGGATGTATATCGTGCCGTGCTTTGATCGTTTATTTGATGTCCCATTTGCTGTTGGACTTGCGGCGGTAT
TGCCGTTGTTCTCGGCCTTGGTCGTGGCCGTGTGTCTTGCGTGTTTAGGTCCGGGCTGTCTTGAGCACCA
ACTTCCAGTGTCGGTAGTGGAGCTCGTGGTTGCAGGGTTTGCTGCCGAGTTCGTTGGGGCGTTTTGATTG
TTAGGCCTCGTGAACTCGTTTTTTTCGACGCAGATATTGATTTCGAAGGTGTGTGTCTCCTTTCCTGCGG
TTGTTTCGTTTGTTTTGTCGTCGACGGCTCGACGTATTTCGTTGTACTTGAGGTGTCTTTGTTTTGTCGA
TCTTTGTTTCGATCGAGTATATTCCCAACGTTGTGGACGTTGGTCTTCATTCTTCTTATTTCAAATATTA
TATTTTTCCGGCGTTCCTCAAGATATTGGAGGCACCGTTGTTCTCTTTCGCGAAGTCGCGTGAACTCTTC

I have another text file which is a list of the gene IDs that appear in the headers, like this:

Code:
XLOC_002667
XLOC_002676
XLOC_003874

I'd like to search the first file and extract the headers + sequence data underneath, using the headers from the second file.

I've been using

Code:
grep -A1 -w -f id.txt sam_gtf.fasta > output.fasta

But it only copies some of the sequence data, rather than all of it. Can anyone help me modify the code so it will copy all of the sequence data under each header?

Thanks!
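
A note on the grep attempt above: `-A1` asks grep for exactly one line of trailing context after each match, so only the first sequence line under a matching header is printed. Here is a minimal sketch on made-up data (the file names `demo.fasta` and `demo_ids.txt` and the short sequences are hypothetical, not taken from the real files):

```shell
tmp=$(mktemp -d)   # scratch directory for the made-up demo files
printf '%s\n' '>TCONS_1 gene=XLOC_000001' 'AAAA' 'CCCC' 'GGGG' \
              '>TCONS_2 gene=XLOC_000002' 'TTTT' > "$tmp/demo.fasta"
printf '%s\n' 'XLOC_000001' > "$tmp/demo_ids.txt"

# -A1 prints each matching line plus exactly one line after it,
# so CCCC and GGGG are left out of the result.
grep_out=$(grep -A1 -w -f "$tmp/demo_ids.txt" "$tmp/demo.fasta")
printf '%s\n' "$grep_out"
```

A larger fixed number such as `-A10` would not help in general either, because records can have different numbers of sequence lines.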
# 2  
Old 01-13-2016
Hello 4galaxy7,

Could you please try the following and let me know if it helps?
Let's say the following are Input_file1 and Input_file2.
Code:
cat Input_file1
>TCONS_00000001 gene=XLOC_000001
AATTGTGGTGAAATGACTTCTGTTAACGGAGACATCGATGATTGTTGTTACTATTTGTTCTCAGGATTCA
TTTGTCCGGTTCATACCCCGGACGGCGCCCCTTGCGGGCTGCTCAATCACCTGACAATGAACTGTATCGT
CACGAAGCATCCGGATCGCAAATTAAAGGCTGCGCTACCAACGGTGCTGGTGGATCTAGGAATGCTTCCG
TTGTCTGTTGCGAATAATTGGAAGGACTCGTACACGGTAATGCTGAATGGTAAAGTGATCGGCCTGATCG
AAGATAATATTGTTGATAAGGTGGCCCGCAAACTAAGGCAGCTGAAGATAATTGGTGAAGAGGTGCCGAA
CACGTTGGAGATCGCGCTGGTGCCGAAGAGGAAGG
>TCONS_00000002 gene=XLOC_000002
CGGATGTATATCGTGCCGTGCTTTGATCGTTTATTTGATGTCCCATTTGCTGTTGGACTTGCGGCGGTAT
TGCCGTTGTTCTCGGCCTTGGTCGTGGCCGTGTGTCTTGCGTGTTTAGGTCCGGGCTGTCTTGAGCACCA
ACTTCCAGTGTCGGTAGTGGAGCTCGTGGTTGCAGGGTTTGCTGCCGAGTTCGTTGGGGCGTTTTGATTG
TTAGGCCTCGTGAACTCGTTTTTTTCGACGCAGATATTGATTTCGAAGGTGTGTGTCTCCTTTCCTGCGG
TTGTTTCGTTTGTTTTGTCGTCGACGGCTCGACGTATTTCGTTGTACTTGAGGTGTCTTTGTTTTGTCGA
TCTTTGTTTCGATCGAGTATATTCCCAACGTTGTGGACGTTGGTCTTCATTCTTCTTATTTCAAATATTA
TATTTTTCCGGCGTTCCTCAAGATATTGGAGGCACCGTTGTTCTCTTTCGCGAAGTCGCGTGAACTCTTC
>TCONS_00000002 gene=XLOC_003874
CGGATGTATATCGTGCCGTGCTTTGATCGTTTATTTGATGTCCCATTTGCTGTTGGACTTGCGGCGGTAT
TGCCGTTGTTCTCGGCCTTGGTCGTGGCCGTGTGTCTTGCGTGTTTAGGTCCGGGCTGTCTTGAGCACCA
ACTTCCAGTGTCGGTAGTGGAGCTCGTGGTTGCAGGGTTTGCTGCCGAGTTCGTTGGGGCGTTTTGATTG
TTAGGCCTCGTGAACTCGTTTTTTTCGACGCAGATATTGATTTCGAAGGTGTGTGTCTCCTTTCCTGCGG
TTGTTTCGTTTGTTTTGTCGTCGACGGCTCGACGTATTTCGTTGTACTTGAGGTGTCTTTGTTTTGTCGA
TCTTTGTTTCGATCGAGTATATTCCCAACGTTGTGGACGTTGGTCTTCATTCTTCTTATTTCAAATATTA
TATTTTTCCGGCGTTCCTCAAGATATTGGAGGCACCGTTGTTCTCTTTCGCGAAGTCGCGTGAACTCTTC
  
  
cat Input_file2
XLOC_002667
XLOC_002676
XLOC_003874
XLOC_000001

Then the following code may help you.
Code:
awk 'FNR==NR{A[$0];next} ($2 in A){C=1;print $0;next} {if($0 ~ /^>/){C=""}} {if(C){print}}'  Input_file2 FS="="  Input_file1

Output will be as follows.
Code:
>TCONS_00000001 gene=XLOC_000001
AATTGTGGTGAAATGACTTCTGTTAACGGAGACATCGATGATTGTTGTTACTATTTGTTCTCAGGATTCA
TTTGTCCGGTTCATACCCCGGACGGCGCCCCTTGCGGGCTGCTCAATCACCTGACAATGAACTGTATCGT
CACGAAGCATCCGGATCGCAAATTAAAGGCTGCGCTACCAACGGTGCTGGTGGATCTAGGAATGCTTCCG
TTGTCTGTTGCGAATAATTGGAAGGACTCGTACACGGTAATGCTGAATGGTAAAGTGATCGGCCTGATCG
AAGATAATATTGTTGATAAGGTGGCCCGCAAACTAAGGCAGCTGAAGATAATTGGTGAAGAGGTGCCGAA
CACGTTGGAGATCGCGCTGGTGCCGAAGAGGAAGG
>TCONS_00000002 gene=XLOC_003874
CGGATGTATATCGTGCCGTGCTTTGATCGTTTATTTGATGTCCCATTTGCTGTTGGACTTGCGGCGGTAT
TGCCGTTGTTCTCGGCCTTGGTCGTGGCCGTGTGTCTTGCGTGTTTAGGTCCGGGCTGTCTTGAGCACCA
ACTTCCAGTGTCGGTAGTGGAGCTCGTGGTTGCAGGGTTTGCTGCCGAGTTCGTTGGGGCGTTTTGATTG
TTAGGCCTCGTGAACTCGTTTTTTTCGACGCAGATATTGATTTCGAAGGTGTGTGTCTCCTTTCCTGCGG
TTGTTTCGTTTGTTTTGTCGTCGACGGCTCGACGTATTTCGTTGTACTTGAGGTGTCTTTGTTTTGTCGA
TCTTTGTTTCGATCGAGTATATTCCCAACGTTGTGGACGTTGGTCTTCATTCTTCTTATTTCAAATATTA
TATTTTTCCGGCGTTCCTCAAGATATTGGAGGCACCGTTGTTCTCTTTCGCGAAGTCGCGTGAACTCTTC
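
The one-liner packs several steps into a single line, so here is a commented sketch of the same idea. It is not identical to the original: it tests only header lines against the ID list. The function name `extract_by_gene` and the demo data are made up for illustration, and it assumes every header ends in `gene=<ID>` with nothing after the ID.

```shell
# extract_by_gene IDS FASTA: hypothetical wrapper, not part of any tool.
extract_by_gene() {
    awk '
        # First file (the ID list): remember every line as an array key.
        FNR == NR { ids[$0]; next }
        # Header line: switch printing on only if the text after "=" is a wanted ID.
        /^>/ { keep = ($2 in ids) }
        # Print the header and all following sequence lines while the switch is on.
        keep
    ' "$1" FS="=" "$2"
}

tmp=$(mktemp -d)   # scratch directory for made-up demo data
printf '%s\n' '>TCONS_1 gene=XLOC_000001' 'AAAA' 'CCCC' \
              '>TCONS_2 gene=XLOC_000002' 'GGGG' \
              '>TCONS_3 gene=XLOC_003874' 'TTTT' > "$tmp/in.fasta"
printf '%s\n' 'XLOC_003874' 'XLOC_000001' > "$tmp/ids.txt"

extracted=$(extract_by_gene "$tmp/ids.txt" "$tmp/in.fasta")
printf '%s\n' "$extracted"
```

Records come out in the order they appear in the FASTA file, not in the order of the ID list.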

Thanks,
R. Singh
# 3  
Old 01-13-2016
Thanks - it seems to return the sequence data but not the ID tag line. It also seems to be copying out a lot more than just the 50 or so tags in the second file.
# 4  
Old 01-13-2016
Hello 4galaxy7,

I'm not sure I understood you correctly. If you need only the header lines that contain the IDs, then the following may help.
Code:
awk 'FNR==NR{A[$0];next} ($2 in A){print $0}'  Input_file2 FS="=" Input_file1

Output will be as follows.
Code:
>TCONS_00000001 gene=XLOC_000001
>TCONS_00000002 gene=XLOC_003874

Thanks,
R. Singh
# 5  
Old 01-13-2016
Sorry, maybe I didn't explain clearly. The output you gave in your first post is exactly what I wanted, but the output that I got was:

Code:
AATTGTGGTGAAATGACTTCTGTTAACGGAGACATCGATGATTGTTGTTACTATTTGTTCTCAGGATTCA
TTTGTCCGGTTCATACCCCGGACGGCGCCCCTTGCGGGCTGCTCAATCACCTGACAATGAACTGTATCGT
CACGAAGCATCCGGATCGCAAATTAAAGGCTGCGCTACCAACGGTGCTGGTGGATCTAGGAATGCTTCCG
TTGTCTGTTGCGAATAATTGGAAGGACTCGTACACGGTAATGCTGAATGGTAAAGTGATCGGCCTGATCG
AAGATAATATTGTTGATAAGGTGGCCCGCAAACTAAGGCAGCTGAAGATAATTGGTGAAGAGGTGCCGAA
CACGTTGGAGATCGCGCTGGTGCCGAAGAGGAAGG
CGGATGTATATCGTGCCGTGCTTTGATCGTTTATTTGATGTCCCATTTGCTGTTGGACTTGCGGCGGTAT
TGCCGTTGTTCTCGGCCTTGGTCGTGGCCGTGTGTCTTGCGTGTTTAGGTCCGGGCTGTCTTGAGCACCA
ACTTCCAGTGTCGGTAGTGGAGCTCGTGGTTGCAGGGTTTGCTGCCGAGTTCGTTGGGGCGTTTTGATTG
TTAGGCCTCGTGAACTCGTTTTTTTCGACGCAGATATTGATTTCGAAGGTGTGTGTCTCCTTTCCTGCGG
TTGTTTCGTTTGTTTTGTCGTCGACGGCTCGACGTATTTCGTTGTACTTGAGGTGTCTTTGTTTTGTCGA
TCTTTGTTTCGATCGAGTATATTCCCAACGTTGTGGACGTTGGTCTTCATTCTTCTTATTTCAAATATTA
TATTTTTCCGGCGTTCCTCAAGATATTGGAGGCACCGTTGTTCTCTTTCGCGAAGTCGCGTGAACTCTTC

Etc. So the headers are missing, and I need those as well as the sequence. Is that clear?
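
The thread never establishes why the headers went missing. One combination that produces exactly this symptom with the one-liner from post #2 is an ID list containing an empty line, together with headers that carry extra text after the ID. That is only a guess about the real files. The sketch below reproduces the symptom on made-up data (the `len=8` field and the blank ID line are assumptions):

```shell
tmp=$(mktemp -d)   # scratch directory for made-up demo data
# Assumed header with an extra field after the ID, and an ID list with a blank line.
printf '%s\n' '>TCONS_1 gene=XLOC_000001 len=8' 'AAAA' 'CCCC' > "$tmp/in.fasta"
printf '%s\n' 'XLOC_000001' '' > "$tmp/ids.txt"

# Same awk program as in post #2. The blank ID line stores an empty key, and
# every sequence line has an empty $2 (no "=" in it), so each one matches.
# The header's $2 is "XLOC_000001 len", which is not in the list.
symptom=$(awk 'FNR==NR{A[$0];next} ($2 in A){C=1;print $0;next} {if($0 ~ /^>/){C=""}} {if(C){print}}' "$tmp/ids.txt" FS="=" "$tmp/in.fasta")
printf '%s\n' "$symptom"
```

If this is the cause, `grep -c '^$' id.txt` would report a non-zero count of empty lines in the ID list.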
# 6  
Old 01-13-2016
Try also
Code:
awk 'FNR == NR {T[$1]; next} {sub(/^/,">")} substr($2, 6) in T' file2 RS=">" ORS="" file1
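
For anyone wondering how this works: once `RS=">"` takes effect for file1, awk reads each FASTA record (the header plus all its sequence lines) as a single record. `sub(/^/,">")` puts back the `>` that the record separator consumed, `$2` is the `gene=XLOC_...` field, and `substr($2, 6)` drops the five characters of `gene=` before the lookup. `ORS=""` avoids an extra newline, since each record already ends with one. A small run on made-up data (file contents are hypothetical):

```shell
tmp=$(mktemp -d)   # scratch directory for made-up demo data
printf '%s\n' '>TCONS_1 gene=XLOC_000001' 'AAAA' 'CCCC' \
              '>TCONS_2 gene=XLOC_000002' 'GGGG' > "$tmp/file1"
printf '%s\n' 'XLOC_000002' > "$tmp/file2"

# file2 is read line by line (default RS); RS and ORS change only before file1.
picked=$(awk 'FNR == NR {T[$1]; next} {sub(/^/,">")} substr($2, 6) in T' "$tmp/file2" RS=">" ORS="" "$tmp/file1")
printf '%s\n' "$picked"
```

Because the ID is taken from the second whitespace-separated field, extra fields after it on the header line do not disturb the match.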

This User Gave Thanks to RudiC For This Post:
# 7  
Old 01-13-2016
Great, that has worked perfectly, thanks!