08-27-2017
Sorry for the delay in responding; I was unwell. I would like to thank everyone who took the trouble to answer my query.
The issue is that the Arabic code block is huge and has characters that look alike. Very often data entry operators use a character or characters that are not part of the character set of a language (in this case Sindhi), and these are what we term illegal. When these invalid/illegal characters end up in a dictionary, the results are disastrous, especially for storage and natural language processing.
This is the reason for my query. I tried the solutions provided and they all work, and I am really thankful to all of you for your help.
The reason I needed a simple regex is that my text editors, UltraEdit and Notepad++, both support Perl and Unix regexes, so instead of "grepping" the strings, a macro based on a regex would help me identify all such invalid strings. I am still curious why the regex did not work. Any light on this would really help.
Many thanks once again.
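For readers who want to try the idea outside an editor, here is a minimal sketch in Python of flagging characters from the Arabic block that are not in an allow-list. The `ALLOWED` string is a hypothetical, deliberately incomplete sample and not the real Sindhi inventory, and the names `ILLEGAL` and `find_illegal` are made up for illustration. The pattern uses a negative lookahead, which Perl-style engines generally accept, though whether a given editor handles it the same way is an assumption.

```python
import re

# Hypothetical, incomplete allow-list for illustration only.
# Replace it with the full character inventory of the target language.
ALLOWED = "\u0627\u0628\u067B\u067E\u062A\u0633\u06AA\u06AF"

# Any character in the basic Arabic block (U+0600..U+06FF)
# that is NOT one of the allowed characters.
ILLEGAL = re.compile(rf"(?![{ALLOWED}])[\u0600-\u06FF]")

def find_illegal(word):
    """Return (position, code point) pairs for each disallowed character."""
    return [(m.start(), f"U+{ord(m.group()):04X}") for m in ILLEGAL.finditer(word)]

if __name__ == "__main__":
    # The first word starts with U+0643, which is absent from the sample allow-list.
    for word in ("\u0643\u062A\u0627\u0628", "\u06AA\u062A\u0627\u0628"):
        print(find_illegal(word))
```

Reporting the code point rather than the glyph helps here, because look-alike characters are hard to tell apart on screen.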
GLOB(7) BSD Miscellaneous Information Manual GLOB(7)
NAME
glob -- shell-style pattern matching
DESCRIPTION
Globbing characters (wildcards) are special characters used to perform pattern matching of pathnames and command arguments in the csh(1),
ksh(1), and sh(1) shells as well as the C library functions fnmatch(3) and glob(3). A glob pattern is a word containing one or more unquoted
'?' or '*' characters, or ``[..]'' sequences.
Globs should not be confused with the more powerful regular expressions used by programs such as grep(1). While there is some overlap in the
special characters used in regular expressions and globs, their meaning is different.
The pattern elements have the following meaning:
? Matches any single character.
* Matches any sequence of zero or more characters.
[..] Matches any of the characters inside the brackets. Ranges of characters can be specified by separating two characters by a '-' (e.g.
``[a0-9]'' matches the letter 'a' or any digit). In order to represent itself, a '-' must either be quoted or the first or last
character in the character list. Similarly, a ']' must be quoted or the first character in the list if it is to represent itself
instead of the end of the list. Also, a '!' appearing at the start of the list has special meaning (see below), so to represent
itself it must be quoted or appear later in the list.
Within a bracket expression, the name of a character class enclosed in '[:' and ':]' stands for the list of all characters belonging
to that class. Supported character classes:
alnum cntrl lower space
alpha digit print upper
blank graph punct xdigit
These match characters using the macros specified in ctype(3). A character class may not be used as an endpoint of a range.
[!..] Like [..], except it matches any character not inside the brackets.
\ Matches the character following it verbatim. This is useful to quote the special characters '?', '*', '[', and '\' such that they
lose their special meaning. For example, the pattern ``\*[x]?'' matches the string ``*[x]?''.
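
The pattern elements above can be tried out with a short sketch using Python's standard `fnmatch` module. This is an illustration only, not the implementation behind fnmatch(3) or the shells: Python's module handles '?', '*', ``[..]'' and ``[!..]'' but, as far as its documentation describes, it does not treat a backslash as a quoting character and does not interpret the bracketed class names listed above. The `results` dictionary is a name chosen for this example.

```python
from fnmatch import fnmatchcase

# Each key is (pattern, candidate string); each value is whether it matches.
results = {
    ("?", "a"): fnmatchcase("a", "?"),                   # exactly one character
    ("?", "ab"): fnmatchcase("ab", "?"),                 # two characters: no match
    ("*", ""): fnmatchcase("", "*"),                     # zero characters is allowed
    ("*.txt", "notes.txt"): fnmatchcase("notes.txt", "*.txt"),
    ("[a0-9]", "7"): fnmatchcase("7", "[a0-9]"),         # 'a' or any digit
    ("[a0-9]", "b"): fnmatchcase("b", "[a0-9]"),
    ("[!a0-9]", "b"): fnmatchcase("b", "[!a0-9]"),       # negated list
    ("[!a0-9]", "a"): fnmatchcase("a", "[!a0-9]"),
}

if __name__ == "__main__":
    for (pattern, name), matched in results.items():
        print(f"{pattern!r:12} vs {name!r:12} -> {matched}")
```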
Note that when matching a pathname, the path separator '/', is not matched by a '?', or '*', character or by a ``[..]'' sequence. Thus,
/usr/*/*/X11 would match /usr/X11R6/lib/X11 and /usr/X11R6/include/X11 while /usr/*/X11 would not match either. Likewise, /usr/*/bin would
match /usr/local/bin but not /usr/bin.
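
The pathname rule can be sketched by matching one path component at a time, so that no wildcard ever crosses a '/'. The helper `path_glob_match` below is a hypothetical name written for this illustration, built on Python's `fnmatch.fnmatchcase`. It is a simplification under stated assumptions: it ignores details such as leading-dot handling and quoting that glob(3) and the shells implement.

```python
from fnmatch import fnmatchcase

def path_glob_match(pattern, path):
    """Match a path against a glob, never letting a wildcard span '/'."""
    pattern_parts = pattern.split("/")
    path_parts = path.split("/")
    # Wildcards cannot match the separator, so both sides must have
    # the same number of components.
    if len(pattern_parts) != len(path_parts):
        return False
    return all(
        fnmatchcase(part, pat) for part, pat in zip(path_parts, pattern_parts)
    )

if __name__ == "__main__":
    print(path_glob_match("/usr/*/*/X11", "/usr/X11R6/lib/X11"))
    print(path_glob_match("/usr/*/X11", "/usr/X11R6/lib/X11"))
    print(path_glob_match("/usr/*/bin", "/usr/local/bin"))
    print(path_glob_match("/usr/*/bin", "/usr/bin"))
```

Splitting on the separator first keeps the rule explicit, whereas calling `fnmatchcase` on the whole path would let '*' match across directory levels.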
SEE ALSO
fnmatch(3), glob(3), re_format(7)
HISTORY
In early versions of UNIX, the shell did not do pattern expansion itself. A dedicated program, /etc/glob, was used to perform the expansion
and pass the results to a command. In Version 7 AT&T UNIX, with the introduction of the Bourne shell, this functionality was incorporated
into the shell itself.
BSD
November 30, 2010 BSD