how to fetch substring from records into another file


 
# 1  
Old 07-31-2008

Hi all,
I'm stuck finding a solution to this problem. Please guide me if you have any ideas.

I have two files.

===FILE1===
>bi|2138271|geb|AAC15885.1|precursor [Sambucus nigra]
MRVIAAAMLYLYIVVLAICSVGIQGIDYPSVSFNLAGAKSATWDFLRMPHDLVGEDNKYNDGEPITGNII
GRDGLCVDVRNGYDTDGTPLQLWPCGTQRNQQWTFYTDDTIRSMGKCMTANGLSNGSNIMIFNCSTAVEN
AIKWEVTIDGSIINPSSG
>bi|21083|em|CAA26939.1| precursor [Ricinus communis]
MKPGGNTIVIWMYAVATWLCFGSTSGWSFTLEDNNIFPKQYPIINFTTAGATVQSYTNFIRAVRGRLTTG
ADVRHEIPVLPNRVGLPINQRFILVELSNHAELSVTLALDVTNAYVVGYRAGNSAYFFHPDNQEDAEAIT
HLFTDVQNRYTFAFGGNYDRLEQLAGNLRENIELGNGPLEEAISALYYYSTGGTQLPTL
>bi|19526601|geb|AAL87006.1| chain A [Viscum album]
YERLRLRVTHQTTGEEYFRFITLLRDYVSSGSFSNEIPLLRQSTIPVSDAQRFVLVELTNEGGDSITAAI
DVTNLYVVAYQAGDQSYFLRDAPRGAETHLFTGTTRSSLPFNGSYPDLERYAGHRDQIPLGIDQLIQSVT
ALRFPGGNTRTQARSILILIQMISEAARFNPILWRARQYINSGASFLPDVY

=====FILE2====
bi|2138271|geb|AAC15885----- 20 ---- 32
bi|2138271|geb|AAC15885 ----- 92 ---- 110
bi|19526601|geb|AAL8700 ----- 74 ---- 92
bi|2138271|geb|AAC15885 ----- 26 ---- 38
bi|21083|em|CAA26939.1| ----- 19 ---- 37
bi|21083|em|CAA26939.1| ----- 52 ---- 70

Please note that ----- is not present in the actual file; I only wrote it to make the columns visible. The real file is tab-separated.

It has three columns:
column 1 contains patterns from FILE1
column 2 contains the start position at which to cut the string
column 3 contains the end position up to which the string has to be cut
Position 1 starts on the line that consists entirely of capital letters without any spaces (the line after the header), and the sequence ends where the next '>' symbol appears.

I want a new file that contains each header (the line beginning with the '>' symbol) matching the first column of FILE2, followed by only the substring.

For example, for the first record of FILE2:

The output file should contain:
>bi|2138271|geb|AAC15885.1|precursor [Sambucus nigra] (20-32)
SVGIQGIDYPSV

and so on for all the other records of FILE2.

Your contribution is highly appreciated.
Thanks
Smriti

Last edited by smriti_shridhar; 07-31-2008 at 07:55 AM.. Reason: more clarification
# 2  
Old 07-31-2008
I am not sure that the output of the following script is what you are looking for.
Try it and adapt it.
Code:
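awk -F'\t' '
# First file (FILE2): store every requested range, grouped by key.
# Assumes all keys in FILE2 have the same length as the first one.
NR==FNR {
   if (NR==1)
      Key_len = length($1) + 1;   # key length including the leading ">"
   k = ">" substr($0,1, Key_len-1);
   n = ++Keys[k];
   From[k,n] = $2;
     To[k,n] = $3;
    Len[k,n] = $3 - $2;           # the end position itself is not included
   next;
}

# Print header and substring for each range stored for the current record.
function print_selected(    i,k) {
   if (selected) {
      k = substr(Header, 1, Key_len);
      for (i=1; i<=Keys[k]; i++) {
         printf("%s (%s-%s)\n", Header, From[k,i], To[k,i]);
         print substr(Alphabets, From[k,i], Len[k,i]);
      }
   }
}

# Second file (FILE1): a header line ends the previous record and starts a new one.
/^>/ {
   print_selected();
   selected  = (substr($0, 1, Key_len) in Keys);
   Header    = $0;
   Alphabets = "";
   next;
}

# Sequence lines of a selected record are joined into one string.
selected {
   Alphabets = Alphabets $0;
}

# Flush the last record.
END {
   print_selected();
}
' FILE2 FILE1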

Input file FILE1 :
Code:
>bi|2138271|geb|AAC15885.1|precursor [Sambucus nigra]
MRVIAAAMLYLYIVVLAICSVGIQGIDYPSVSFNLAGAKSATWDFLRMPHDLVGEDNKYNDGEPITGNII
GRDGLCVDVRNGYDTDGTPLQLWPCGTQRNQQWTFYTDDTIRSMGKCMTANGLSNGSNIMIFNCSTAVEN
AIKWEVTIDGSIINPSSG
>bi|21083|em|CAA26939.1| precursor [Ricinus communis]
MKPGGNTIVIWMYAVATWLCFGSTSGWSFTLEDNNIFPKQYPIINFTTAGATVQSYTNFIRAVRGRLTTG
ADVRHEIPVLPNRVGLPINQRFILVELSNHAELSVTLALDVTNAYVVGYRAGNSAYFFHPDNQEDAEAIT
HLFTDVQNRYTFAFGGNYDRLEQLAGNLRENIELGNGPLEEAISALYYYSTGGTQLPTL
>bi|19526601|geb|AAL87006.1| chain A [Viscum album]
YERLRLRVTHQTTGEEYFRFITLLRDYVSSGSFSNEIPLLRQSTIPVSDAQRFVLVELTNEGGDSITAAI
DVTNLYVVAYQAGDQSYFLRDAPRGAETHLFTGTTRSSLPFNGSYPDLERYAGHRDQIPLGIDQLIQSVT
ALRFPGGNTRTQARSILILIQMISEAARFNPILWRARQYINSGASFLPDVY

Input file FILE2 :
Code:
bi|2138271|geb|AAC15885 20      32
bi|2138271|geb|AAC15885 92      110
bi|19526601|geb|AAL8700 74      92
bi|2138271|geb|AAC15885 26      38
bi|21083|em|CAA26939.1| 19      37
bi|21083|em|CAA26939.1| 52      70

Output:
Code:
>bi|2138271|geb|AAC15885.1|precursor [Sambucus nigra] (20-32)
SVGIQGIDYPSV
>bi|2138271|geb|AAC15885.1|precursor [Sambucus nigra] (92-110)
LWPCGTQRNQQWTFYTDD
>bi|2138271|geb|AAC15885.1|precursor [Sambucus nigra] (26-38)
IDYPSVSFNLAG
>bi|21083|em|CAA26939.1| precursor [Ricinus communis] (19-37)
LCFGSTSGWSFTLEDNNI
>bi|21083|em|CAA26939.1| precursor [Ricinus communis] (52-70)
TVQSYTNFIRAVRGRLTT
>bi|19526601|geb|AAL87006.1| chain A [Viscum album] (74-92)
NLYVVAYQAGDQSYFLRD

Jean-Pierre.
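
A note on the end position: the script above computes Len as $3 - $2, so the character at the end position is not part of the result. That agrees with the example in the question, where 20-32 gives 12 characters. If an inclusive end were wanted instead, one more character would be needed. A small sketch using the first sequence line of FILE1 (the variable names are only illustrative, they are not part of the script above):

```shell
# First sequence line of the first record in FILE1.
seq='MRVIAAAMLYLYIVVLAICSVGIQGIDYPSVSFNLAGAKSATWDFLRMPHDLVGEDNKYNDGEPITGNII'

# End position excluded, as in the script above (length = to - from).
exclusive=$(awk -v s="$seq" -v from=20 -v to=32 'BEGIN { print substr(s, from, to - from) }')

# End position included (length = to - from + 1).
inclusive=$(awk -v s="$seq" -v from=20 -v to=32 'BEGIN { print substr(s, from, to - from + 1) }')

echo "$exclusive"   # SVGIQGIDYPSV
echo "$inclusive"   # SVGIQGIDYPSVS
```

Which of the two is right depends on how the positions in FILE2 were produced; the thread only confirms that the first form matches the expected output.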
# 3  
Old 08-01-2008
Thanks Jean

Hey! This is exactly what I was looking for. Thank you so much, Jean.

You know what, Jean? It didn't work initially. Then I removed the tab field separator so that awk uses its default (whitespace), and I also changed the separator in FILE2. Then it worked perfectly.

I had noticed before that this field separator value doesn't work in my shell. Tell me if you know why this can happen.

Thanks again.
smriti
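
One possible cause, stated as an assumption since the thread does not confirm it: if the FILE2 on disk contains spaces instead of tabs (for example after copying the sample from a web page), then with -F'\t' the whole line ends up in $1, and $2 and $3 are empty. A quick way to check is to count the fields awk sees when splitting on tabs only. The sample data below is made up for the demonstration:

```shell
# Hypothetical sample: the first line is tab-separated, the second uses spaces.
sample=$(mktemp)
printf 'bi|2138271|geb|AAC15885\t20\t32\n' >  "$sample"
printf 'bi|2138271|geb|AAC15885 92 110\n'  >> "$sample"

# Number of fields per line when only a tab counts as separator.
# 3 means the line really is tab-separated; 1 means it contains no tab at all.
fields=$(awk -F'\t' '{ print NF }' "$sample")
echo "$fields"

rm -f "$sample"
```

Running the same awk command on the real FILE2 shows at a glance which lines are not tab-separated.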
# 4  
Old 08-01-2008
One more thing, Jean.

I ran that script on files that need longer substrings extracted, and the format of the output file is getting disturbed. I tried some printf options but couldn't correct it. Please help me out.

===FILE1====
>bi|37779709|geb|AAP20876.1| kinetic [ ternata]
MASKLLLFLLPAILGLIIPRPAVAVGTNYLLSGETLDTDGHLKNGDFDFIMQEDCNAVLYNGNWQSNTAN
KGRDCKLTLTDRGELVINNGEGSAVWRSGSQSAKGNYAAVLHPEGKLVIYGPSVFKINPWVPGLNSLRLG
NVPFTCNMLFSGQVLYGDGKITARNHMLVMQGDCNLVLYGGKCDWQSNTHGNGEHCFLRLNHKGELIIKD
DDFKSIWSSQSSSKQGDYVFILQDNGYGVIYGPAIWATSSKRSVAAQETMIGMVTEKVN
>bi|146403769|geb|ABQ32294.1| pure [an eg]
MAKLLLFLLPAILGLLIPRSAVALGTNYLLSGQTLNTDGHLKNGDFDLVMQNDCNLVLYNGNWQSNTANN
GRDCKLTLTDYGELVIKNGDGSTVWRSRAKSVKGNYAAVLHPDGRLVVFGPSVFKIDPWVPGLNSLRFRN
IPFTDNLLFSGQVLYGDGRLTAKNHQLVMQGDCNLVLYGGKYGWQSNTHGNGEHCFLRLNHKGELIIKDD
DFRPSGAAVPAPSR

===FILE2====
bi|37779709|geb|AAP20876.1| 28 264
bi|146403769|geb|ABQ32294.1| 27 224

===OUTPUTFILE===
>gi|37779709|gb|AAP20876.1| lectin [Pinellia ternata] (28-264)
NYLLSGETLDTDGHLKNGDFDFIMQEDCNAVLYNGNWQSNTANKGRDCKLTLTDRGELVINNGEGSAVWRSGSQSAKGNYAAVLHPEGKLVIYGPSVFKI NPWVPGLNSLRLGNVPFTCNMLFSGQVLYGDGKITARNHMLVMQGDCNLVLYGGKCDWQSNTHGNGEHCFLRLNHKGELIIKDDDFKSIWSSQSSSKQGD YVFILQDNGYGVIYGPAIWATSSKRSVAAQETMIGM
>gi|146403769|gb|ABQ32294.1| lectin [Colocasia esculenta] (27-224)
NYLLSGQTLNTDGHLKNGDFDLVMQNDCNLVLYNGNWQSNTANNGRDCKLTLTDYGELVIKNGDGSTVWRSRAKSVKGNYAAVLHPDGRLVVFGPSVFKI DPWVPGLNSLRFRNIPFTDNLLFSGQVLYGDGRLTAKNHQLVMQGDCNLVLYGGKYGWQSNTHGNGEHCFLRLNHKGELIIKDDDFRPSGAAVPAPS


whereas the output should look like this:
>gi|37779709|gb|AAP20876.1| lectin [Pinellia ternata] (28-264)
NYLLSGETLDTDGHLKNGDFDFIMQEDCNAVLYNGNWQSNTANKGRDCKLTLTDRGELVINNGEGSAVWR
SGSQSAKGNYAAVLHPEGKLVIYGPSVFKINPWVPGLNSLRLGNVPFTCNMLFSGQVLYGDGKITARNHML
VMQGDCNLVLYGGKCDWQSNTHGNGEHCFLRLNHKGELIIKDDDFKSIWSSQSSSKQGDYVFILQDNGY
GVIYGPAIWATSSKRSVAAQETMIGM
>gi|146403769|gb|ABQ32294.1| lectin [Colocasia esculenta] (27-224)
NYLLSGQTLNTDGHLKNGDFDLVMQNDCNLVLYNGNWQSNTANNGRDCKLTLTDYGELVIKNGDGSTVWR
SRAKSVKGNYAAVLHPDGRLVVFGPSVFKIDPWVPGLNSLRFRNIPFTDNLLFSGQVLYGDGRLTAKNHQLV
MQGDCNLVLYGGKYGWQSNTHGNGEHCFLRLNHKGELIIKDDDFRPSGAAVPAPS

where each line after the header line should not have more than 70 characters.

I will be thankful to you.
-smriti
# 5  
Old 08-01-2008
Should the overlong lines be truncated or folded over multiple lines, or what?

To truncate (this truncates to eight characters; feel free to add more dots):

Code:
sed -e 's/^\(........\).*/\1/'

To fold, have a look at the fold command, or maybe use something like

Code:
sed -e 's/\(.......\)\(.\)/\1\
\2/g'

(Yes, that's a literal newline, to do the folding.) Again, this example folds at every eight characters; feel free to add more dots. (Actually the first line will be one character shorter. I'm too lazy to fix that. Or rather, the fix will depend on your sed dialect, and I don't want to go there.)

This could just as well be done with awk or perl or the truncation even with cut.
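
For the fold command mentioned above, a minimal sketch with a width of 70 (the -w option is standard). The input is made-up data: a short header line and one 150-character line standing in for an overlong sequence.

```shell
# Build a 150-character line of made-up sequence data.
long=$(awk 'BEGIN { for (i = 1; i <= 150; i++) printf "A"; print "" }')

# fold breaks every line longer than 70 characters; shorter lines pass through unchanged.
folded=$(printf '>short header\n%s\n' "$long" | fold -w 70)

# Show the resulting line lengths: 13 (header), 70, 70, 10.
lengths=$(printf '%s\n' "$folded" | awk '{ print length($0) }')
echo "$lengths"
```

Note that fold treats every line alike, so a header line longer than 70 characters would be broken as well.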

Last edited by era; 08-01-2008 at 09:16 AM..
# 6  
Old 08-01-2008
Hi era,

I don't want to truncate it, and cut would only truncate as well. One more thing: I want to apply this folding only to the lines after the line with the '>' symbol.

and
sed -e 's/\(.......\)\(.\)/\1\
\2/g'
is giving a strange output:
It is giving a / after every eight characters.

I gave the command as
sed -e 's/\(.......\)\(.\)/\1\\2/g' filename

Thanks
-smriti
# 7  
Old 08-01-2008
No, there should be a literal newline between \1\ and \2

Anyway, fold should not touch lines which are shorter than 70 characters. As long as your > lines are shorter than that, you should be fine.

Here's another one for you:

Code:
perl -pe 's/(.{70})(?!$)/$1\n/g unless m/^>/'
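
The same idea can be sketched in awk, for anyone who would rather stay with the tool used earlier in the thread. The function name fold_fasta and its width argument are made up for this example; header lines (starting with '>') are printed untouched even when they are longer than the width.

```shell
# Fold every non-header line to at most WIDTH characters (default 70).
fold_fasta() {
    awk -v width="${1:-70}" '
        /^>/            { print; next }   # header lines are never folded
        length($0) == 0 { print; next }   # keep empty lines as they are
        {
            # Print the line in consecutive chunks of at most width characters.
            for (i = 1; i <= length($0); i += width)
                print substr($0, i, width)
        }
    '
}

# Demonstration with made-up data: a 151-character header and a 150-character line.
long=$(awk 'BEGIN { for (i = 1; i <= 150; i++) printf "A"; print "" }')
lengths=$(printf '%s\n%s\n' ">$long" "$long" | fold_fasta 70 | awk '{ print length($0) }')
echo "$lengths"
```

It could be appended to the extraction script as a filter, for example `awk ... FILE2 FILE1 | fold_fasta 70`.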
