02-02-2010
Regular expression / regex substitution on Unicode text
I have a large file encoded in Unicode that I need to convert to CSV. In general, I know how to do this by regular expression substitutions using sed or Perl, but one problem I am having is that I need to put a quotation mark at the end of each line to protect the last field. The usual regex substitution ...
s/$/"/
... works fine for 7-bit ASCII text, but when I run this on my Unicode text file, the double quotation mark appears at the BEGINNING of the FOLLOWING line, not at the end of the line on which it's supposed to appear.
The file came from a Windows system, but piping through dos2unix doesn't seem to make any difference. I've tried the Encode module ("use Encode;") with several different encodings, but I get the same result. Perhaps I'm doing something wrong. Does anyone know of a special library function intended for this purpose, a Perl pragma, etc., that would accomplish this easily? This should be a trivial problem.
Thanks in advance for any suggestions.
Tom
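Editor's sketch: the post does not say which Unicode encoding the file uses. Assuming it is UTF-16 with a byte-order mark and Windows CRLF line endings (a common Windows default, but an assumption here), one way to get the quote at the end of each line is to decode the bytes to text first and only then substitute. The function name `quote_line_ends` and the sample data are hypothetical, and Python is used purely for illustration; the same decode-then-substitute idea applies in Perl.

```python
import re

def quote_line_ends(raw: bytes, encoding: str = "utf-16") -> str:
    """Decode the raw bytes, then append a double quote to every line.

    Assumption: the input is UTF-16 with a BOM; pass another encoding
    name if the file turns out to be something else.
    """
    text = raw.decode(encoding)  # "utf-16" reads and strips the BOM
    # Replace each line ending (CRLF or bare LF) with a quote plus LF.
    return re.sub(r"\r?\n", '"\n', text)

# Hypothetical two-line sample, encoded the way the file is assumed to be.
sample = "a,b\r\nc,d\r\n".encode("utf-16")
result = quote_line_ends(sample)
```

Because the substitution runs on decoded characters rather than raw bytes, the line terminator is seen as a single unit and the quote lands directly in front of it.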
---------- Post updated at 10:51 PM ---------- Previous update was at 10:49 AM ----------
As I mentioned, the regex ...
s/$/"/
... puts the `"' at the beginning of the following line, and piping through dos2unix doesn't matter one way or the other. Using `\n' instead of `$' doesn't make any difference. However, I discovered an interesting fact: The regex ...
s/\r/"/ # use `\r' instead of `$' or `\n'
... gives the expected result, and piping through dos2unix still makes no difference! I also found that a C program using a wchar_t declaration behaves similarly. That is, a character that appears as if it should be output BEFORE the EOL actually appears after it, if it is matched as '\n', but if it is matched as '\r' then it is output as expected.
My immediate problem is solved, but I wonder whether this shows a bug in Perl or in its regular expression engine ...
This seems to be a clear case where Unicode text is handled differently than non-Unicode text.
Any opinions?
Tom
Last edited by pludi; 02-03-2010 at 02:29 AM..
Reason: text format, rm URL, and another text format
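Editor's sketch: one plausible reading of the behaviour described above, not confirmed anywhere in the thread, is that the file is UTF-16LE with CRLF line endings and that the substitutions ran on raw bytes. Under that assumption every character occupies two bytes, so inserting a single `"` byte in front of the newline byte shifts everything after it out of alignment, whereas replacing the `\r` byte with a `"` byte keeps the two-byte units intact. The sample data below is hypothetical, and Python is used only to make the byte layout visible.

```python
import re

# Assumption: UTF-16LE, CRLF line endings, no BOM in this small sample.
raw = "ab\r\ncd\r\n".encode("utf-16-le")

# Byte-level equivalent of s/\n/"\n/: inserts ONE byte into a stream
# of two-byte code units, so later units are read off by one byte.
shifted = re.sub(rb"\n", b'"\n', raw)

# Byte-level equivalent of s/\r/"/: swaps one byte for another, so the
# trailing zero byte of the old \r now completes the quote character.
swapped = re.sub(rb"\r", b'"', raw)

shifted_text = shifted.decode("utf-16-le", errors="replace")
swapped_text = swapped.decode("utf-16-le")
```

With these inputs `swapped_text` is the two lines each followed by a quote and a newline, while `shifted_text` no longer contains a quote before any line ending. If this is what happened, it would point to the input not being decoded before matching rather than to a defect in the regular expression engine.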