Regular expression / regex substitution on Unicode text

UNIX for Advanced & Expert Users: Post 302391747 by thomas.hedden on Tuesday 2nd of February 2010 10:51:44 PM

I have a large file encoded in Unicode that I need to convert to CSV. In general, I know how to do this with regular expression substitutions in sed or Perl, but one problem remains: I need to put a quotation mark at the end of each line to protect the last field. The usual regex substitution ...
s/$/"/
... works fine for 7-bit ASCII text, but when I run it on my Unicode text file, the double quotation mark appears at the BEGINNING of the FOLLOWING line, not at the end of the line where it is supposed to appear.
The file came from a Windows system, but piping it through dos2unix doesn't seem to make any difference. I've tried the Encode module ("use Encode;") with several different encodings, but I get the same result. Perhaps I'm doing something wrong. Does anyone know of a library function, a Perl pragma, etc., intended for this purpose that would accomplish this easily? This should be a trivial problem.
Thanks in advance for any suggestions.
Tom
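
One possible reading of the symptom above, offered as a sketch rather than a diagnosis: if each line ends in CR LF (likely for a file that came from Windows, though the post does not confirm it), `$` matches just before the LF, so the inserted quotation mark lands after the CR instead of before it. A viewer that treats CR as a line break or carriage return would then show the quote at the start of a line. The sample data and the `hexbytes` helper below are made up for illustration, and plain ASCII is used so that only the line-ending effect is visible.

```shell
# Hypothetical helper: print stdin as space-separated hex bytes on one line.
hexbytes() { od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'; }

# Made-up sample: one CSV line with a Windows-style CR LF ending.
before=$(printf 'a,b\r\n' | hexbytes)

# Append a quote at `$`, as in the post.
after=$(printf 'a,b\r\n' | sed 's/$/"/' | hexbytes)

echo "before: $before"   # before: 61 2c 62 0d 0a
echo "after:  $after"    # after:  61 2c 62 0d 22 0a
```

In the second dump the quote (22) sits between the CR (0d) and the LF (0a), which matches the description of a character that should come before the EOL showing up after it.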

---------- Post updated at 10:51 PM ---------- Previous update was at 10:49 AM ----------

As I mentioned, the regex ...
s/$/"/
... puts the `"' at the beginning of the following line, and piping through dos2unix makes no difference one way or the other. Using `\n' instead of `$' doesn't make any difference either. However, I discovered an interesting fact: The regex ...
s/\r/"/   # use `\r' instead of `$' or `\n'
... gives the expected result, with or without piping through dos2unix! I also found that a C program using a wchar_t declaration behaves similarly. That is, a character that should be output BEFORE the EOL actually appears after it if the EOL is matched as '\n', but if the EOL is matched as '\r', then the character is output as expected.
My immediate problem is solved, but I wonder whether this shows a bug in Perl or in its regular expression engine ...
This seems to be a clear case where Unicode text is handled differently from non-Unicode text.
Any opinions?
Tom
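
Building on the `\r` observation, a slightly more defensive variant anchors the match at the end of the line and treats the CR as optional, so the quote ends up directly before the line feed whether or not the input has Windows line endings. This is a minimal sketch on made-up ASCII sample data; it assumes the text reaches sed in an ASCII-compatible encoding such as UTF-8, which the post does not confirm. The `cr` variable and the `quote_eol` function are hypothetical names.

```shell
# A literal carriage return, built with printf so the script does not
# depend on sed understanding the `\r` escape.
cr=$(printf '\r')

# Hypothetical helper: replace an optional trailing CR with a closing quote.
quote_eol() { sed "s/${cr}*\$/\"/"; }

# The same made-up line, once with CR LF and once with a bare LF.
dos_result=$(printf 'a,b\r\n' | quote_eol)
unix_result=$(printf 'a,b\n' | quote_eol)

echo "$dos_result"    # a,b"
echo "$unix_result"   # a,b"
```

Under the same assumption, the Perl counterpart would be `s/\r?$/"/`, which consumes the CR if there is one and otherwise just inserts the quote before the final newline.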

Last edited by pludi; 02-03-2010 at 02:29 AM. Reason: text format, rm URL, and another text format
