The UNIX and Linux Forums  

#1  fearboy, 12-08-2008
convert email headers' encoding?

hi all -

first, huge thanks to anyone who might be able to help me out with this. it's fairly esoteric, but it seems like there has to be an answer for me...

* the environment:

mac os x 10.5.x server
communigate pro (mail server)
bash script (read on)

* the brief:

my script is meant to parse a spam folder; it puts together a nicely-formatted summary email of all messages that have arrived in the past 24 hours, showing only the From: and Subject: lines. mechanically speaking, it works great.

* the problem:

encodings. some character sets (russian/cyrillic; japanese; presumably chinese) break my script pretty badly - a mailer will display them properly in the From or Subject line, but in the body of my email, it just shows them as garbage, i assume because my emails are using another character set. for example:

Subject: =?koi8-r?B?UmU6IMvVxMEg0M/FxMnNIM/UxNnIwdTYPw==?=

the script is smart enough to find the encoding and run the whole message through iconv - but that doesn't seem to help with the header lines, only the email body. which is ignored by the script, so...yeah.

* the question:

does anyone know of a way to properly convert these header lines, ideally into something like utf-8? alternatively, would it help if i specified some text encoding in the summary email itself instead?

for what it's worth, when the lines are displayed in the summaries, i've stripped out the Subject: and From: part, leaving only the actual subject and from text in place. in case that matters...

thanks for reading,
-john.

#2  vbe (Moderator), 12-09-2008
When I ran into character-set issues on HP-UX (which uses roman8), I used a .mailrc file with this inside:
set crt=21
set encoding=8bit
set charset=iso-8859-1
#

Might that be worth investigating?

#3  fearboy, 12-09-2008
hi vbe -

i'll check that out. in the meantime, i tried changing the charset in the emails themselves from us-ascii to utf-8 (which i think would accomplish pretty much the same thing), with no effect.

i also realized that i could've provided a little more info - sorry, folks. the accounts all have .mdir mailboxes (as opposed to .mbox) - so each message is its own rfc 822-compliant textfile. that means the script is plowing through sometimes hundreds of files per account, and pulling only what it needs (in this case, from, subject, and a couple of other things that are irrelevant to this problem).

for each message, it takes that info and writes it all to one line in a temp file, then moves on to the next. when it's processed all the messages for that account, it reads back the file it just finished writing (which consists of the from & subject lines plus that other info, like a from name and its spam score), one line at a time, and clunks those bits of info into the body of the summary email.

i guess it's a little more complicated than i remembered - but again, the mechanics are working fine; it's just this charset thing that's broken.
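
For readers following along, the per-message pass described above might look roughly like this. It is only a sketch and not the original bash script: the choice of Python, the function names `summary_line` and `summarize_folder`, and the tab separator are all assumptions made for illustration. It reads the headers only and leaves any MIME-encoded values untouched, which is exactly the state that produces the garbage in the summary.

```python
import os
from email.parser import BytesParser


def summary_line(path):
    """Return 'From<TAB>Subject' for one message file, reading headers only."""
    with open(path, "rb") as handle:
        # headersonly=True stops parsing at the blank line before the body
        msg = BytesParser().parse(handle, headersonly=True)
    fields = []
    for name in ("From", "Subject"):
        value = str(msg.get(name, ""))
        # Folded (multi-line) headers are collapsed onto a single line
        fields.append(" ".join(value.split()))
    return "\t".join(fields)


def summarize_folder(folder):
    """Build one summary line per message file in a mailbox directory."""
    lines = []
    for entry in sorted(os.listdir(folder)):
        path = os.path.join(folder, entry)
        if os.path.isfile(path):
            lines.append(summary_line(path))
    return lines
```

Calling `summarize_folder()` on a directory of message files returns the raw header values, so a subject such as the koi8-r one quoted earlier comes back still encoded.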

thanks again to anyone with a tip,
-john.

#4  cbkihong (Forum Advisor), 12-09-2008
It's no wonder switching to UTF-8 "doesn't work", because email message headers must be composed entirely of ASCII, and anything else must be encoded. UTF-8 is no exception to this rule (still, I think using UTF-8 is better than the legacy encodings; it just doesn't relate to your issue).

The subject header you quoted has been encoded as required by MIME. You can find additional information in RFC 2047 itself:

http://www.rfc-editor.org/rfc/rfc2047.txt
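
To make the RFC 2047 form concrete, here is a minimal sketch of how such an "encoded-word" is put together: `=?charset?B?base64-of-the-bytes?=`. The helper name `encode_word` is made up for illustration, and the sketch ignores the RFC's 75-character limit per encoded-word, so it is not a complete encoder. The input text is the decoded subject that appears further down in this post.

```python
import base64


def encode_word(text, charset):
    """Build a single RFC 2047 'B' encoded-word: =?charset?B?base64?="""
    # Convert the text to bytes in the target charset, then base64 those bytes
    payload = base64.b64encode(text.encode(charset)).decode("ascii")
    return "=?{}?B?{}?=".format(charset, payload)


encoded = encode_word("Re: куда поедим отдыхать?", "koi8-r")
print(encoded)
# The result contains only ASCII characters, as mail headers require
print(all(ord(ch) < 128 for ch in encoded))
```

The first line printed should be the same string as the Subject value quoted in the first post, which shows that the "garbage" is just the original text in a transport-safe form.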

I don't think you can easily find a shell script that does MIME decoding for you. Even with Perl, a set of custom modules would need to be installed to parse all that properly. If you are willing to use PHP for this parsing, it is likely the easiest route because support is built in, and you save a lot of module installation. As an example, parsing the sample you quoted:

Code:
<?php

// Actually in PHP 5, iconv_mime_decode() is the easiest way.

// Assume base64 ("B") encoding only
$matches = array();
$mstring = '=?koi8-r?B?UmU6IMvVxMEg0M/FxMnNIM/UxNnIwdTYPw==?=';
// Capture the charset and the base64 payload of the encoded-word
if (preg_match('/^=\?([^?]+)\?B\?([^?]+)\?=$/i', $mstring, $matches)) {
    list(, $charset, $encoded) = $matches;
    $str = base64_decode($encoded);
    // Convert from the declared charset to UTF-8 for display
    echo iconv($charset, "UTF-8", $str);
}

?>
So on my terminal, I got

Code:
Re: куда поедим отдыхать?
I'm not sure what it says, but it looks properly decoded.
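
If PHP is not an option, the same decoding can be sketched with Python's standard library, assuming a Python 3 interpreter is available on the server. `decode_header()` and `make_header()` come from the standard `email.header` module; the wrapper name `decode_mime_header` is made up for illustration.

```python
from email.header import decode_header, make_header


def decode_mime_header(raw):
    """Decode RFC 2047 encoded-words in a header value to a Unicode string."""
    # decode_header() splits the value into (bytes-or-str, charset) pairs;
    # make_header() reassembles them, and str() yields the decoded text.
    return str(make_header(decode_header(raw)))
```

Passing the sample subject to `decode_mime_header()` should give the same text as the PHP output above, and a plain ASCII value passes through unchanged. An unrecognized charset name would raise an exception, which this sketch does not handle.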

#5  fearboy, 12-10-2008
well, then...time to learn some php!

i'll see if i can't roll your code into something that works in my environment.

thanks for your help!

-john.

#6  cbkihong (Forum Advisor), 12-10-2008
My code was meant to show you the general process of MIME decoding (mostly the concept). It is not good enough for production use. Parsing real-world email messages is likely to be more complex because of the variations that exist.

To be frank, if you can get hold of PHP 5, as indicated in the inline comment, the simplest approach would be to use the iconv_mime_decode() function, which is a one-stop shop for what you want. There was an (intentional) flaw in my posted code: it doesn't handle the case where the encoding is quoted-printable, which MIME also supports. For simplicity, I only posted the part that decodes Base64, because that is what your sample used.

Once you have a grasp of the concepts, you may then check other languages or tools to see whether they suit your environment better than PHP. As a PHP install is typically pretty big, it may not necessarily be suitable for all deployment environments (say, those with very limited storage space).
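
As a sketch of the concept in another language, the following Python handles both encodings mentioned above: "B" (Base64) and "Q" (the quoted-printable variant used in headers). The names `WORD`, `GAP`, `decode_word` and `decode_header_value` are made up for illustration, and this is a simplified reading of RFC 2047 rather than a complete parser.

```python
import base64
import quopri
import re

# One encoded-word: =?charset?B-or-Q?payload?=
WORD = r"=\?([^?\s]+)\?([BbQq])\?([^?\s]*)\?="
ENCODED_WORD = re.compile(WORD)
# Whitespace between two adjacent encoded-words is dropped (RFC 2047).
# Simplification: the lookbehind only checks for a closing "?=".
GAP = re.compile(r"(?<=\?=)\s+(?=" + WORD + ")")


def decode_word(match):
    """Decode one encoded-word; return it unchanged if it cannot be decoded."""
    charset, encoding, payload = match.groups()
    try:
        if encoding.upper() == "B":
            raw = base64.b64decode(payload)
        else:
            # header=True makes "_" decode to a space, as the Q encoding requires
            raw = quopri.decodestring(payload.encode("ascii"), header=True)
        return raw.decode(charset, errors="replace")
    except (LookupError, ValueError):
        # Unknown charset or malformed payload: leave the text as it was
        return match.group(0)


def decode_header_value(value):
    """Replace every encoded-word in a header value with its decoded text."""
    return ENCODED_WORD.sub(decode_word, GAP.sub("", value))
```

Applied to the koi8-r sample from the first post, `decode_header_value()` should return the same text as the PHP example, and a Q-encoded value such as `=?utf-8?Q?caf=C3=A9_au_lait?=` decodes along the other branch.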