how to fetch substring from records into another file


 
# 15  
Old 08-01-2008
I tried many other combinations

After trying many other combinations, I realised that it is related to the length of column one of FILE2: if it varies, the output varies.

Sometimes it also gives a blank line at the end of the output, and if I put the last entry of FILE2 somewhere in the middle, all the entries after it are not processed. I can't understand this.

I'll try making the length of column 1 the same in all entries.

Thanks for all your support.
-smriti
# 16  
Old 08-01-2008
The order of the output corresponds to the order of the headers in file 1.
The logic is:
For each record in file 1
Perform the extractions specified in file 2 for this record
The script:
Code:
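awk '

# First file (sm2.dat): store the extraction ranges for each key.
NR==FNR {
   if (NR==1)
      Key_len = length($1) + 1;      # key length plus 1 for the leading ">"
   k = ">" substr($0,1, Key_len-1);  # prefix ">" so the key matches the start of a header
   n = ++Keys[k];                    # number of ranges stored for this key
   From[k,n] = $2;
     To[k,n] = $3;
    Len[k,n] = $3 - $2;              # extraction length (position To itself is not included)
   next;
}

# Print every requested range of the current record, wrapped at 70 characters.
function print_selected(    i,k,p,str) {
   if (selected) {
      k = substr(Header, 1, Key_len);
      for (i=1; i<=Keys[k]; i++) {
         printf("%s (%s-%s)\n", Header, From[k,i], To[k,i]);
         str = substr(Alphabets, From[k,i], Len[k,i]);
         p = 1
         while (p<=Len[k,i]) {
            print substr(str, p, 70);
            p += 70;
         }
      }
   }
}

# Header line of file 1: flush the previous record, then start a new one.
/^>/ {
   print_selected();
   selected  = (substr($0, 1, Key_len) in Keys);
   Header    = $0;
   Alphabets = "";
   next;
}

# Data line of a selected record: append it to the sequence.
selected {
   Alphabets = Alphabets $0;
}

# Flush the last record.
END {
   print_selected();
}
' sm2.dat sm1.dat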

Input file 1 (sm1.dat):
Code:
>bi|2138271|geb|AAC15885.1|precursor [Sambucus nigra]
MRVIAAAMLYLYIVVLAICSVGIQGIDYPSVSFNLAGAKSATWDFLRMPHDLVGEDNKYNDGEPITGNII
GRDGLCVDVRNGYDTDGTPLQLWPCGTQRNQQWTFYTDDTIRSMGKCMTANGLSNGSNIMIFNCSTAVEN
AIKWEVTIDGSIINPSSG
>bi|21083|em|CAA26939.1| precursor [Ricinus communis]
MKPGGNTIVIWMYAVATWLCFGSTSGWSFTLEDNNIFPKQYPIINFTTAGATVQSYTNFIRAVRGRLTTG
ADVRHEIPVLPNRVGLPINQRFILVELSNHAELSVTLALDVTNAYVVGYRAGNSAYFFHPDNQEDAEAIT
HLFTDVQNRYTFAFGGNYDRLEQLAGNLRENIELGNGPLEEAISALYYYSTGGTQLPTL
>bi|19526601|geb|AAL87006.1| chain A [Viscum album]
YERLRLRVTHQTTGEEYFRFITLLRDYVSSGSFSNEIPLLRQSTIPVSDAQRFVLVELTNEGGDSITAAI
DVTNLYVVAYQAGDQSYFLRDAPRGAETHLFTGTTRSSLPFNGSYPDLERYAGHRDQIPLGIDQLIQSVT
ALRFPGGNTRTQARSILILIQMISEAARFNPILWRARQYINSGASFLPDVY

Input file 2 (sm2.dat) :
Code:
bi|2138271|geb|AAC15885 92      110
bi|19526601|geb|AAL8700 74      92
bi|2138271|geb|AAC15885 20      132
bi|21083|em|CAA26939.1| 19      37
bi|21083|em|CAA26939.1| 52      70
bi|2138271|geb|AAC15885 26      38

Output :
Code:
>bi|2138271|geb|AAC15885.1|precursor [Sambucus nigra] (92-110)
LWPCGTQRNQQWTFYTDD
>bi|2138271|geb|AAC15885.1|precursor [Sambucus nigra] (20-132)
SVGIQGIDYPSVSFNLAGAKSATWDFLRMPHDLVGEDNKYNDGEPITGNIIGRDGLCVDVRNGYDTDGTP
LQLWPCGTQRNQQWTFYTDDTIRSMGKCMTANGLSNGSNIMI
>bi|2138271|geb|AAC15885.1|precursor [Sambucus nigra] (26-38)
IDYPSVSFNLAG
>bi|21083|em|CAA26939.1| precursor [Ricinus communis] (19-37)
LCFGSTSGWSFTLEDNNI
>bi|21083|em|CAA26939.1| precursor [Ricinus communis] (52-70)
TVQSYTNFIRAVRGRLTT
>bi|19526601|geb|AAL87006.1| chain A [Viscum album] (74-92)
NLYVVAYQAGDQSYFLRD

Jean-Pierre.
# 17  
Old 08-01-2008
Quote:
Originally Posted by smriti_shridhar
After trying many other combinations, I realised that it is related to the length of column one of FILE2: if it varies, the output varies.

Sometimes it also gives a blank line at the end of the output, and if I put the last entry of FILE2 somewhere in the middle, all the entries after it are not processed. I can't understand this.

I'll try making the length of column 1 the same in all entries.

Thanks for all your support.
-smriti
My script assumes that all keys in file 2 are the same length.
The key length is taken from the key on the first line.
Code:
NR==FNR {
   if (NR==1)
      Key_len = length($1) + 1;
   k = ">" substr($0,1, Key_len-1);

Jean-Pierre.
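
One way to check this assumption before running the script is to compare the length of every key in file 2 with the length of the first one. The sketch below is not part of the original script: check_key_lengths is a made-up helper name, and it assumes the key is the first whitespace-separated field, as in sm2.dat.

```shell
# Hypothetical helper: report every line whose key length differs
# from the key length of the first line. Exit status is 1 if any differ.
check_key_lengths() {
   awk '
   NR==1 { expected = length($1) }
   length($1) != expected {
      printf("line %d: key \"%s\" has length %d, expected %d\n",
             NR, $1, length($1), expected);
      bad = 1;
   }
   END { exit bad + 0 }
   ' "$1"
}
```

It could be called as check_key_lengths sm2.dat; no output and a zero exit status mean all keys have the same length.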
# 18  
Old 08-01-2008
OK, I got it

Yes, I get your point, but can we also do it the other way round, like this:

For each record in file 2

Perform the extractions in file 1 for this record

We could match the first column of FILE2 as a pattern against the header lines of FILE1 and, when it is found, extract the substring from the lines following the header line, up to the next '>', using the values of columns 2 and 3 of FILE2.

I just want to know your opinion on this logic, as I am in the learning phase. I'll try to work this out on my own if you think it will not be unnecessarily complicated.

I just want to develop my logic :)

Please give your suggestions and hints to elaborate on it if you find it OK.

Thanks Jean
-smriti
# 19  
Old 08-01-2008
The answer depends on the key length (first column of file 2).
If the keys are fixed length, there is no problem inverting the logic.
Code:
Read file1: Memorize each header and its data records in an array indexed by the key.
Read file2: For each record, if the key has been memorized (the key is an index in the array), perform the extraction from the memorized data.
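
The two steps above could be sketched roughly as follows. This is only an illustration, not code from the thread: extract_by_file2 is a made-up name, the key length is passed in as a third argument (an assumption, since file 1 is now read before file 2), the range is treated as inclusive of both positions, and the 70-character line wrapping of the original script is left out.

```shell
# Hypothetical helper: extract_by_file2 FILE1 FILE2 KEYLEN
extract_by_file2() {
   awk -v klen="$3" '
   # Pass 1 (file 1): memorize header and concatenated data, indexed by key.
   NR==FNR {
      if (/^>/) {
         key = substr($0, 2, klen);   # key = first klen characters after ">"
         Header[key] = $0;
         Data[key] = "";
      } else if (key != "") {
         Data[key] = Data[key] $0;
      }
      next;
   }
   # Pass 2 (file 2): one extraction per line, in the order of file 2.
   ($1 in Header) {
      printf("%s (%s-%s)\n", Header[$1], $2, $3);
      print substr(Data[$1], $2, $3 - $2 + 1);   # positions $2 to $3 inclusive
   }
   ' "$1" "$2"
}
```

With the sample files it would be called as extract_by_file2 sm1.dat sm2.dat 23. Note that the output then follows the order of file 2 rather than the order of the headers in file 1.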

If the keys are variable length, the headers and records from file 1 have to be memorized in an array indexed by record number; we can't use another index because we don't know which part of the record is the key.
For each extraction specified in file 2, we then scan the array sequentially until we find the right record (more time consuming).
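
A rough sketch of this sequential scan, under the same assumptions as before (made-up function name extract_by_scan, inclusive ranges, no 70-character wrapping). The key is compared with index() rather than a regular expression because the keys contain characters such as | and . that are special in patterns.

```shell
# Hypothetical helper: extract_by_scan FILE1 FILE2
extract_by_scan() {
   awk '
   # Pass 1 (file 1): memorize headers and data, indexed by record number.
   NR==FNR {
      if (/^>/) {
         n++;
         Header[n] = $0;
         Data[n] = "";
      } else if (n > 0) {
         Data[n] = Data[n] $0;
      }
      next;
   }
   # Pass 2 (file 2): scan the records until a header starts with ">" key.
   {
      for (i = 1; i <= n; i++) {
         if (index(Header[i], ">" $1) == 1) {
            printf("%s (%s-%s)\n", Header[i], $2, $3);
            print substr(Data[i], $2, $3 - $2 + 1);   # positions $2 to $3 inclusive
            break;                                    # first matching record wins
         }
      }
   }
   ' "$1" "$2"
}
```

If one key is a prefix of another, only the first matching record in file 1 is used, so the keys still need to be long enough to be unambiguous.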

Jean-Pierre.
# 20  
Old 08-02-2008
Thanks, I'll try this

Yes, the keys can be of fixed length; that is not a problem.
So I'll try this out and will ask you if I have any doubts.

Thanks Jean,
-smriti
# 21  
Old 08-04-2008
I have a problem with the code

Hey!

There is a small problem with the code. The substring fetched by the code misses the last character, i.e. if FILE2 has the following line:
bi|19526601|geb|AAL8700 74 92

The substring fetched starts at the 74th position but ends at position 91, i.e. the output does not contain the character at position 92.

Please help!

Thanks,
smriti
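
The missing character appears to come from the line Len[k,n] = $3 - $2; in the first block of the script. That expression gives the distance between the two positions, not the number of characters in the range: 92 - 74 = 18, but positions 74 through 92 hold 19 characters. Assuming the range in file 2 is meant to include both end positions, a likely fix is Len[k,n] = $3 - $2 + 1;. The small demonstration below (show_range is a made-up helper, not part of the script) shows the difference on a 26-letter string:

```shell
# Hypothetical helper: show_range FROM TO
show_range() {
   echo "ABCDEFGHIJKLMNOPQRSTUVWXYZ" | awk -v from="$1" -v to="$2" '{
      print substr($0, from, to - from);       # length as in the script: last position missing
      print substr($0, from, to - from + 1);   # inclusive length
   }'
}

show_range 3 6
# prints:
# CDE
# CDEF
```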