Extract specific content from data and rename its header

03-23-2010

Registered User

242, 1

Join Date: Sep 2009

Last Activity: 24 August 2018, 1:52 AM EDT

Posts: 242

Thanks Given: 27

Thanked 1 Time in 1 Post

Input file 1:

Code:

>pattern_5
GAATTCGTTCATGTAGGTTGASDASFGDSGRTYRYGHDGSDFGSDGGDSGSDGSDFGSDF
ATTTAATTATGATTCATACGTCATATGTTATTATTCAATCGTATAAAATTATGTGACCTT
SDFSDGSDFKSDAFLKJASLFJASKLFSJAKJFHASJKFHASJKFHASJKFHSJAKFHAW
>pattern_1
AAGTCTTAAGATATCACCGTCGATTAGGTTTATACAGCTTTTGTGTTATTTAAATTTGAC
ASDASRFSARFASFEDEGSDGHSDHWDYTQATWQRQOPTPEOTPWRIYRHIRGOPEIWRA
.
.

Input file 2:

Code:

pattern_5    5   15
pattern_5    18  25 
pattern_1    10  19
pattern_1    22  27
.
.

Desired output:

Code:

>pattern_5_0.01
TCGTTCATGTA
>pattern_5_0.02
TTGASDAS
>pattern_1_0.01
GATATCACCG
>pattern_1_0.02
GATTAG
.
.

I have long lists in both input file 1 and input file 2. Input file 1 is the raw data, while input file 2 gives the ranges of the input file 1 data that I want to extract into the output file. Columns 2 and 3 of input file 2 are the start and end positions (counted from 1, both inclusive) to extract from the matching record in input file 1. Each extracted piece should get a renamed header of the form "pattern_*_0.0*".
It seems that an awk or perl script should be able to achieve this goal.
Thanks a lot for any advice.

Last edited by patrick87; 03-23-2010 at 06:31 AM.. Reason: further explaining of my question

patrick87

03-23-2010

Registered User

511, 29

Join Date: Sep 2008

Last Activity: 10 November 2015, 2:16 AM EST

Location: In the beautiful World...

Posts: 511

Thanks Given: 10

Thanked 29 Times in 29 Posts

Can you explain how you get the lines marked in red in your desired output?

Code:

>pattern_5_0.01
TCGTTCATGTA
>pattern_5_0.02
TTGASDAS
>pattern_1_0.01
GATATCACCG
>pattern_1_0.02
GATTAG

malcomex999

03-24-2010

Hi malcomex999,
I just edited my question to explain how columns 2 and 3 of input file 2 are used.
Thanks for your advice.

---------- Post updated 03-24-10 at 01:53 AM ---------- Previous update was 03-23-10 at 04:32 AM ----------

Hi malcomex999,
do you have any idea how to achieve the desired goal?
Thanks a lot.

patrick87

03-24-2010

Registered User

3,231, 978

Join Date: Dec 2009

Last Activity: 11 June 2014, 8:40 PM EDT

Posts: 3,231

Thanks Given: 179

Thanked 978 Times in 791 Posts

Hi, patrick87:

Code:

awk 'FNR==NR {if (/^>/) p=substr($0,2); else a[p]=a[p] $0; next} {printf(">%s_0.%02u\n%s\n", $1, ++i[$1], substr(a[$1], $2, $3-$2+1))}' f1 f2

While processing the first file (FNR==NR), if a line begins with ">", grab everything that follows it and store it in p, the pattern name. If a line does not begin with a ">", then it is data for the current pattern, p; append the line to a[p], that pattern's entry in array a. Repeat until done with the first file.

For the second file, we use the pattern name in the first field and the index values in the second and third fields to extract the required substring from a[$1], while incrementing a counter for each pattern name seen, in the i array, i[$1].
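For anyone who finds the one-liner dense, the explanation above can be sketched as a commented, multi-line version. This is only a sketch, not alister's exact code: the function name extract_ranges, the variable names, the NF >= 3 guard (which skips filler lines such as "." in the range file), and the use of %02d in place of %02u are assumptions added here for illustration. The sample data is a shortened copy of the files in the first post.

```shell
# Hypothetical helper (name chosen for this sketch): extract_ranges SEQFILE RANGEFILE
extract_ranges() {
    awk '
        # First file: collect every record under its header name.
        FNR == NR {
            if (/^>/) name = substr($0, 2)        # header line: drop the leading ">"
            else      seq[name] = seq[name] $0    # data line: append to the current record
            next
        }
        # Second file: name, start, end (1-based, inclusive). Skip short lines.
        NF >= 3 {
            printf(">%s_0.%02d\n%s\n", $1, ++count[$1],
                   substr(seq[$1], $2, $3 - $2 + 1))
        }
    ' "$1" "$2"
}

# Shortened sample data from the first post, written to a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/f1" <<'EOF'
>pattern_5
GAATTCGTTCATGTAGGTTGASDASFGDSGRTYRYGHDGSDFGSDGGDSGSDGSDFGSDF
ATTTAATTATGATTCATACGTCATATGTTATTATTCAATCGTATAAAATTATGTGACCTT
>pattern_1
AAGTCTTAAGATATCACCGTCGATTAGGTTTATACAGCTTTTGTGTTATTTAAATTTGAC
EOF
cat > "$tmp/f2" <<'EOF'
pattern_5    5   15
pattern_5    18  25
pattern_1    10  19
pattern_1    22  27
EOF

extract_ranges "$tmp/f1" "$tmp/f2"
```

With this sample the output should match the desired output in the first post, for example ">pattern_5_0.01" followed by "TCGTTCATGTA". Note that the length passed to substr() is end - start + 1 because both positions are inclusive.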

Cheers,
Alister

Last edited by alister; 03-24-2010 at 05:31 AM..

alister

03-24-2010

Thanks alister,
I'm trying to apply your awk code to my case now.

Besides that, thanks a lot for the further explanation of your awk code.
I really appreciate your help and advice.
Thanks again ^^

---------- Post updated at 04:19 AM ---------- Previous update was at 03:43 AM ----------

Hi Alister,
Your awk code worked perfectly in my case. Thanks a lot.
May I ask what to do if my input file 2 changes like this (note that the first row now has the larger position first):

Code:

pattern_5    15  5
pattern_5    18  25 
pattern_1    10  19
pattern_1    22  27
.
.

How can I edit the awk code you suggested so that it gives the same output as above?
Do I need to add an "if" condition to the awk code for this problem?
Thanks again for your advice.

patrick87

03-24-2010

You can modify alister's code like this...

Code:

awk 'FNR==NR {if (/^>/) p=substr($0,2); 
else a[p]=a[p] $0; next} 
{printf(">%s_0.%02u\n%s\n", $1, ++i[$1], substr(a[$1], $2, ($2>=$3?$3:$3-$2+1)))}' f1 f2

malcomex999

03-24-2010

Quote:

Originally Posted by malcomex999

You can modify alister code like this...

Code:

awk 'FNR==NR {if (/^>/) p=substr($0,2); 
else a[p]=a[p] $0; next} 
{printf(">%s_0.%02u\n%s\n", $1, ++i[$1], substr(a[$1], $2, ($2>=$3?$3:$3-$2+1)))}' f1 f2

Hi, malcomex999:

That tweak is incorrect, if I understand the modification to f2 correctly. If the second field is greater than the third, then instead of being treated as the beginning index of the substring, it should be considered the end index (and the third field should correspondingly be treated as the beginning index). The correct solution requires that the second argument to substr() be modified as well, since in the case of $2 > $3, it should be $3, not $2.
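To make the difference concrete, here is a small sketch using the first line of pattern_5 from the original post and the reversed row "pattern_5 15 5". The variable names are chosen only for this illustration. With the tweak, the call becomes substr(s, 15, 5), i.e. five characters starting at position 15, whereas the intended range is positions 5 through 15.

```shell
# First data line of pattern_5, copied from input file 1 in the first post.
seq='GAATTCGTTCATGTAGGTTGASDASFGDSGRTYRYGHDGSDFGSDGGDSGSDGSDFGSDF'

# What the tweak computes for the row "pattern_5 15 5": substr(s, $2, $3)
tweaked=$(awk -v s="$seq" 'BEGIN { print substr(s, 15, 5) }')

# What is wanted: start at the smaller index, length = 15 - 5 + 1 = 11
wanted=$(awk -v s="$seq" 'BEGIN { print substr(s, 5, 15 - 5 + 1) }')

printf 'tweaked: %s\nwanted:  %s\n' "$tweaked" "$wanted"
# tweaked: AGGTT
# wanted:  TCGTTCATGTA
```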

By the way, malcomex999 and rdcwayx, thank you very much for your bit awards. It's appreciated.

Hi, patrick87:

One solution to handle both cases (even if they appear within the same file2):

Code:

awk 'FNR==NR {if (/^>/) p=substr($0,2); else a[p]=a[p] $0; next}
     {if ($2>$3) {t=$2; $2=$3; $3=t}; printf(">%s_0.%02u\n%s\n", $1, ++i[$1], substr(a[$1], $2, $3-$2+1))}' f1 f2

It works identically to my earlier solution except that it tests the second and third fields in f2. If the first index is greater than the second, their values are swapped before the substr() call.
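As a variation on the same idea, the sketch below picks the smaller and larger index into separate variables instead of swapping the fields in place. It is not the code posted above, just an illustrative alternative: the function name extract_ranges_any_order, the variables lo and hi, the NF >= 3 guard, and %02d are assumptions made for this example.

```shell
# Hypothetical helper (name chosen for this sketch):
# extract_ranges_any_order SEQFILE RANGEFILE
extract_ranges_any_order() {
    awk '
        # First file: build one string per header name.
        FNR == NR {
            if (/^>/) name = substr($0, 2)
            else      seq[name] = seq[name] $0
            next
        }
        # Second file: accept the two positions in either order.
        NF >= 3 {
            lo = ($2 < $3) ? $2 : $3    # smaller index is the start
            hi = ($2 < $3) ? $3 : $2    # larger index is the end
            printf(">%s_0.%02d\n%s\n", $1, ++count[$1],
                   substr(seq[$1], lo, hi - lo + 1))
        }
    ' "$1" "$2"
}

# Demo with one reversed row and one normal row for pattern_5.
demo=$(mktemp -d)
printf '%s\n' '>pattern_5' \
    'GAATTCGTTCATGTAGGTTGASDASFGDSGRTYRYGHDGSDFGSDGGDSGSDGSDFGSDF' > "$demo/f1"
printf '%s\n' 'pattern_5    15  5' 'pattern_5    18  25' > "$demo/f2"
extract_ranges_any_order "$demo/f1" "$demo/f2"
```

Leaving $2 and $3 untouched keeps the original line available in $0, which may be handy if you later want to print the source row for debugging. The extracted text is the same either way.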

Regards,
Alister

alister
