Regex to identify illegal characters in a perso-arabic database


 
# 8  
Old 08-27-2017
Quote:
Originally Posted by gimley
Sorry for the delay in responding but I was unwell and could not respond. I would like to thank all who took the trouble to answer my query.
The issue is that the Arabic code block is huge and has characters that look alike. Very often data entry operators use a character or characters which are not part of the character set of a language [in this case Sindhi], and these are what we term illegal. When these invalid/illegal characters are part of a dictionary, the results are disastrous, especially in storage and natural language processing.
This is the reason for my query. I tried the solutions provided and they all work, and I am really thankful to all for your help.
Why I needed a simple regex was that my text editors, UltraEdit and Notepad++, both support Perl and Unix regexes, and instead of "grepping" the strings, a macro based on a regex would help me identify all such invalid strings. I am still curious why the regex did not work. Any light on the same would really help.
Many thanks once again.
To bring what MadeInGermany said directly into your problem statement...

If the following characters are the only legal characters on a line written in Sindhi:
Code:
ابٻپڀتٺٽثٿفڦگڳڱکيدذڌڏڊڍحجڄڃچڇخعغرڙمنلسشوقصضڻطظھجھگھڪءهآ

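(note that there are no punctuation characters and no <space> or <tab> characters), then the extended regular expression (ERE; the "+" operator is not special in a POSIX basic regular expression, abbreviated BRE, but it is accepted by Perl-style regex engines as well):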
Code:
^[^ابٻپڀتٺٽثٿفڦگڳڱکيدذڌڏڊڍحجڄڃچڇخعغرڙمنلسشوقصضڻطظھجھگھڪءهآ]+$

will match a line that contains one or more non-Sindhi characters and contains no Sindhi characters.

If you want to find a non-Sindhi character, you just want the BRE:
Code:
[^ابٻپڀتٺٽثٿفڦگڳڱکيدذڌڏڊڍحجڄڃچڇخعغرڙمنلسشوقصضڻطظھجھگھڪءهآ]
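
For anyone who would rather see exactly which characters are being flagged, here is a minimal Python sketch of the same negated bracket expression. The names LEGAL, ILLEGAL_CHAR and find_illegal are mine, not from the thread, and the sketch assumes the input is already decoded Unicode text with the trailing newline removed (otherwise the newline itself is reported as illegal).

```python
import re
import unicodedata

# Legal Sindhi characters, copied verbatim from the list in this post.
LEGAL = "ابٻپڀتٺٽثٿفڦگڳڱکيدذڌڏڊڍحجڄڃچڇخعغرڙمنلسشوقصضڻطظھجھگھڪءهآ"

# Negated bracket expression: matches any single character not in LEGAL.
ILLEGAL_CHAR = re.compile("[^" + re.escape(LEGAL) + "]")


def find_illegal(line):
    """Return (offset, character, Unicode name) for each illegal character."""
    return [
        (m.start(), m.group(), unicodedata.name(m.group(), "UNNAMED"))
        for m in ILLEGAL_CHAR.finditer(line)
    ]


if __name__ == "__main__":
    # A legal letter, a Latin "x", then another legal letter.
    sample = LEGAL[0] + "x" + LEGAL[1]
    for offset, char, name in find_illegal(sample):
        print(offset, repr(char), name)
```

Printing the Unicode name is the useful part when two characters look alike on screen: the names differ even when the glyphs do not.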

If you want to find a line that contains one or more non-Sindhi characters, you could use the BRE:
Code:
^.*[^ابٻپڀتٺٽثٿفڦگڳڱکيدذڌڏڊڍحجڄڃچڇخعغرڙمنلسشوقصضڻطظھجھگھڪءهآ].*$

If you want to find a line that contains only Sindhi characters (one or more of them), you could use the extended regular expression (ERE, since it relies on "+"):
Code:
^[ابٻپڀتٺٽثٿفڦگڳڱکيدذڌڏڊڍحجڄڃچڇخعغرڙمنلسشوقصضڻطظھجھگھڪءهآ]+$
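
To see how the anchored patterns above relate to each other, the following Python sketch sorts lines into three buckets. This is only an illustration under my own assumptions: the function name classify and the labels "clean", "all-illegal", "mixed" and "empty" are invented here, and the line is stripped of its trailing newline before matching.

```python
import re

# Legal Sindhi characters, copied verbatim from the list in this post.
LEGAL = "ابٻپڀتٺٽثٿفڦگڳڱکيدذڌڏڊڍحجڄڃچڇخعغرڙمنلسشوقصضڻطظھجھگھڪءهآ"
CLASS = re.escape(LEGAL)

ONLY_ILLEGAL = re.compile("^[^" + CLASS + "]+$")  # no Sindhi character at all
ANY_ILLEGAL = re.compile("[^" + CLASS + "]")      # at least one bad character
ONLY_LEGAL = re.compile("^[" + CLASS + "]+$")     # nothing but Sindhi characters


def classify(line):
    """Label a line as 'clean', 'all-illegal', 'mixed' or 'empty'."""
    line = line.rstrip("\n")
    if ONLY_LEGAL.search(line):
        return "clean"
    if ONLY_ILLEGAL.search(line):
        return "all-illegal"
    if ANY_ILLEGAL.search(line):
        return "mixed"
    return "empty"


if __name__ == "__main__":
    a, b = LEGAL[0], LEGAL[1]
    for text in (a + b, "abc", a + "x" + b, a + " " + b, ""):
        print(repr(text), classify(text))
```

Note that a space counts as illegal here, exactly as in the bracket expressions above, so a headword containing a space is reported as "mixed".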

If you have a list of non-Sindhi characters that are incorrectly typed into a file and the corresponding Sindhi character that should have been used instead, you might want to look at the tr utility instead of trying to use an editor to manually make all of the changes.
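
Some tr implementations work on bytes rather than characters, which can be a problem with UTF-8 Arabic text, so here is a rough Python equivalent of the same one-to-one replacement idea. The two mappings below are hypothetical examples chosen only for illustration; the real list of wrong/right pairs has to come from your own data, and the names FIXES and repair are mine.

```python
# Hypothetical look-alike corrections, for illustration only.
# Replace these pairs with the audited list for your own database.
FIXES = str.maketrans({
    "\u0643": "\u06AA",  # ARABIC LETTER KAF -> ARABIC LETTER SWASH KAF
    "\u06CC": "\u064A",  # ARABIC LETTER FARSI YEH -> ARABIC LETTER YEH
})


def repair(text):
    """Replace each mapped character with its correction, like tr does."""
    return text.translate(FIXES)


if __name__ == "__main__":
    broken = "\u0643\u0627\u06CC"
    fixed = repair(broken)
    # Show code points so look-alike glyphs can be told apart.
    print([hex(ord(c)) for c in broken])
    print([hex(ord(c)) for c in fixed])
```

Characters that are not in the table pass through unchanged, so it is safe to run this over whole dictionary entries and then re-check the result with the bracket expressions above.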
These 2 Users Gave Thanks to Don Cragun For This Post:
# 9  
Old 08-27-2017
Many thanks for your kind reply and your detailed solutions. I tested the expressions which you provided and they work perfectly.