02-02-2010
Regular expression / regex substitution on Unicode text
I have a large file encoded in Unicode that I need to convert to CSV. In general, I know how to do this by regular expression substitutions using sed or Perl, but one problem I am having is that I need to put a quotation mark at the end of each line to protect the last field. The usual regex substitution ...
s/$/"/
... works fine for 7-bit ASCII text, but when I run this on my Unicode text file, the double quotation mark appears at the BEGINNING of the FOLLOWING line, not at the end of the line on which it's supposed to appear.
The file came from a Windows system, but piping it through dos2unix doesn't seem to make any difference. I've tried "use Encode;" with several different encodings, but I get the same result. Perhaps I'm doing something wrong. Does anyone know of a library function, Perl module, or pragma intended for this purpose that would accomplish this easily? This should be a trivial problem.
Thanks in advance for any suggestions.
Tom
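One likely explanation, assuming the file is UTF-16LE (which is what Windows tools usually mean by "Unicode"; the post does not confirm the encoding): every character occupies two bytes, so CR LF is stored as 0D 00 0A 00. A byte-oriented tool treats the lone 0A byte as the end of the line, so a single quote byte gets wedged between the CR and the LF and shifts everything after it out of alignment. A converter that only looks for an adjacent 0D 0A byte pair also finds nothing to change. The sketch below reproduces the effect in Python rather than Perl, purely for illustration; the function name and the sample data are made up for the example.

```python
def naive_quote_at_eol(data: bytes) -> bytes:
    # Byte-level stand-in for s/$/"/ applied line by line:
    # a quote byte (0x22) is placed before every 0x0A byte.
    return data.replace(b"\n", b'"\n')


# Hypothetical one-line samples: plain ASCII, and the assumed UTF-16LE form.
ascii_line = "ab\n".encode("ascii")
utf16_line = "ab\r\n".encode("utf-16-le")

if __name__ == "__main__":
    # 7-bit ASCII: the quote ends up right after the last character.
    print(naive_quote_at_eol(ascii_line))            # b'ab"\n'
    # UTF-16LE: one 0x22 byte lands between 0d 00 and 0a 00.
    print(utf16_line.hex(" "))                       # 61 00 62 00 0d 00 0a 00
    print(naive_quote_at_eol(utf16_line).hex(" "))   # 61 00 62 00 0d 00 22 0a 00
    # There is no adjacent CR LF byte pair for a byte-level converter to find.
    print(b"\r\n" in utf16_line)                     # False
```

The stray 22 byte is not a complete UTF-16 code unit, so anything that later decodes the output as UTF-16 will show the quote in the wrong place or as a different character altogether.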
---------- Post updated at 10:51 PM ---------- Previous update was at 10:49 AM ----------
As I mentioned, the regex ...
s/$/"/
... puts the `"' at the beginning of the following line, and piping through dos2unix doesn't matter one way or the other. Using `\n' instead of `$' doesn't make any difference. However, I discovered an interesting fact: The regex ...
s/\r/"/ # use `\r' instead of `$' or `\n'
... gives the expected result, with or without piping through dos2unix. I also found that a C program using a wchar_t declaration behaves similarly. That is, a character that appears as if it should be output BEFORE the EOL actually appears after it if it is matched as '\n', but if it is matched as '\r' then it is output as expected.
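This would be consistent with the file being UTF-16LE (an assumption, not something the post confirms): the CR is stored as 0D 00, so overwriting just the 0D byte with 0x22 leaves 22 00, which happens to be a well-formed `"' followed by the untouched LF. Here is a small Python sketch of that byte-level effect, with illustrative names only. It also shows a caveat: any other character whose encoding contains a 0D byte gets altered as well.

```python
def replace_cr_byte(data: bytes) -> bytes:
    # Byte-level stand-in for s/\r/"/ when each line holds a single CR:
    # every 0x0D byte becomes 0x22.
    return data.replace(b"\r", b'"')


if __name__ == "__main__":
    line = "ab\r\n".encode("utf-16-le")        # 61 00 62 00 0d 00 0a 00
    fixed = replace_cr_byte(line)
    print(fixed.hex(" "))                      # 61 00 62 00 22 00 0a 00
    print(repr(fixed.decode("utf-16-le")))     # 'ab"\n'

    # Caveat: U+010D (c with caron) is stored as 0d 01, so it is hit too
    # and silently turns into U+0122.
    damaged = replace_cr_byte("\u010d".encode("utf-16-le")).decode("utf-16-le")
    print(damaged == "\u0122")                 # True
```

So the `\r' substitution works by coincidence of the byte layout, not because the regex is aware of the encoding.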
My immediate problem is solved; however, I wonder whether this shows a bug in Perl or in its regular expression engine ...
This seems to be a clear case where Unicode text is handled differently than non-Unicode text.
Any opinions?
Tom
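This is probably not a regex-engine bug. It looks more like what happens when any byte-oriented substitution is run on UTF-16 data that has not been decoded first. A more robust approach is to decode the input to characters, do the substitution, and only then encode the output. Below is a minimal Python sketch of that idea, assuming the input is UTF-16 with a byte-order mark; the function name and the sample text are invented for the example. In Perl the comparable step would be to read the file through an `:encoding(UTF-16)` I/O layer instead of as raw bytes.

```python
import re


def quote_line_ends(raw: bytes, encoding: str = "utf-16") -> str:
    # Decode first so the regex sees characters, not individual bytes.
    text = raw.decode(encoding)
    # Replace each CRLF or bare LF with a closing quote plus a plain LF.
    return re.sub(r"\r?\n", '"\n', text)


if __name__ == "__main__":
    # Hypothetical two-line sample; the "utf-16" codec writes a BOM when
    # encoding and consumes it when decoding.
    raw = 'x,"last field\r\ny,"other\r\n'.encode("utf-16")
    print(quote_line_ends(raw), end="")
```

The returned string can then be written out in whatever encoding the CSV consumer expects, for example UTF-8.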
Last edited by pludi; 02-03-2010 at 02:29 AM..
Reason: text format, rm URL, and another text format