Hello,
I found this Perl script on the EuroParl website which does sentence splitting.
The script reads from a language file (attached as a zipped file) kept separately in a folder
and then splits the text into sentences accurately.
However, there are two issues which need to be solved.
a. In quite a few corpora (especially news corpora), the full stop is inadvertently omitted and there is only a hard return, as in the example below.
In that case the script sees no full stop and, instead of keeping the two lines separate, joins them into one running sentence.
How do I make Perl treat a hard return as a sentence delimiter in the script? I have tried inserting the hex values of a hard return,
but they do not seem to do the trick.
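One way to approach the hard-return problem is to split the input on line breaks first and only then look for punctuation inside each line. The following is a minimal standalone sketch, not a patch to the EuroParl script: the sub name `split_on_hard_returns`, the sample text, and the deliberately naive in-line rule (break after `.`, `?` or `!` when an upper-case letter follows) are all assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not part of the original script: treat every hard
# return as a sentence boundary, whether or not the line ends in a full stop.
sub split_on_hard_returns {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;                # normalise CRLF and bare CR to LF
    my @segments;
    for my $line (split /\n+/, $text) {
        $line =~ s/^\s+|\s+$//g;          # trim surrounding whitespace
        next if $line eq '';              # skip blank lines
        # Naive in-line rule: break after . ? ! when whitespace and an
        # upper-case letter follow. Abbreviations are not handled here.
        push @segments, split /(?<=[.?!])\s+(?=\p{Lu})/, $line;
    }
    return @segments;
}

# Made-up sample: the first line has no full stop, only a hard return.
my $sample = "Prices rose sharply today\nThe minister declined to comment. Markets closed early.\n";
print "$_\n" for split_on_hard_returns($sample);
```

If the script joins input lines into a paragraph before it applies its rules (an assumption about its internals), the equivalent change would belong at the point where the lines are joined, since by the time the punctuation rules run the hard return may already have been replaced by a space.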
b. My second query concerns other languages, such as the Indic languages, where characters such as
U+0964 DEVANAGARI DANDA
are used as sentence delimiters.
If I want to add these as delimiters, where do I insert them? I inserted it at
line 109
# add breaks for sentences that end with some sort of punctuation are followed by a sentence starter punctuation and upper case
but to no avail.
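Two things commonly get in the way when a danda is added to a rule like the one quoted above. A rule that also requires an upper-case letter after the punctuation cannot match Devanagari text, because the script has no letter case, and `\x{0964}` only matches if the input has been decoded from UTF-8 bytes into characters. The sketch below is standalone and hedged: `split_on_danda` and the sample sentences are invented for illustration, and it is not the script's own rule.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                                 # this source contains literal Devanagari
binmode STDOUT, ':encoding(UTF-8)';       # encode characters on output

# Hypothetical helper: break after DANDA (U+0964), DOUBLE DANDA (U+0965),
# '?' or '!' when whitespace follows. There is no upper-case test, since
# Devanagari has no letter case.
sub split_on_danda {
    my ($text) = @_;
    $text =~ s/([\x{0964}\x{0965}?!]+)\s+/$1\n/g;
    return grep { length } split /\n/, $text;
}

# Made-up sample sentences, for illustration only.
my $sample = "यह पहला वाक्य है। यह दूसरा वाक्य है। क्या यह तीसरा है?";
print "$_\n" for split_on_danda($sample);
```

When reading from a file or standard input, the same decoding has to happen on the way in, for example with `binmode STDIN, ':encoding(UTF-8)';` before the first read.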
I am providing a test sentence below:
Any solution to these two issues would be of great help. Since the script is open source, it would help other users as well: I will put up the script with these modifications, with due acknowledgement, on the Moses site.
Could anybody provide a solution, please? Thank you.
Hi,
I am extremely new to Perl scripting, but I need a quick fix without using Text::CSV.
I need to read in a file, pass any delimiter as an argument, and convert it to bar delimited on the output. In addition, enclose fields within double quotes in case of any embedded delimiters.
Any help would... (2 Replies)
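A rough sketch of one way to do this without Text::CSV is shown here. It assumes the input has no quoted fields of its own, so a plain `split` on the delimiter is enough; the sub name `to_bar_delimited` and the script name in the usage comment are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split one line on a literal input delimiter and join
# the fields with '|'. A field that already contains '|' or '"' is wrapped in
# double quotes, with embedded quotes doubled.
# Assumption: the input has no quoted fields or embedded input delimiters.
sub to_bar_delimited {
    my ($line, $delim) = @_;
    my @fields = split /\Q$delim\E/, $line, -1;   # -1 keeps trailing empty fields
    for my $field (@fields) {
        if ($field =~ /[|"]/) {
            $field =~ s/"/""/g;
            $field = qq("$field");
        }
    }
    return join '|', @fields;
}

# Usage (script name is illustrative): perl to_bar.pl ',' < input.txt
my $delim = shift @ARGV;
if (defined $delim) {
    while (my $line = <STDIN>) {
        chomp $line;
        print to_bar_delimited($line, $delim), "\n";
    }
}
```

`\Q...\E` makes the delimiter literal, so characters such as `.` or `|` passed on the command line are not treated as regex syntax.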
Hi,
I have a variable-length text file with no delimiter and the following schema:
Column Name   Data length
Firstname     5
Lastname      5
age           3
phoneno1      10
phoneno2      10
phoneno3      10
sample data - ... (16 Replies)
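For a fixed-width layout like the schema above, Perl's `unpack` with an `A` template is one common choice. The sketch below takes the column names and widths from the schema; the sub name `parse_record` and the sample record are invented, and it assumes every record carries all six fields.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Field names and widths follow the schema in the post.
my @names    = qw(Firstname Lastname age phoneno1 phoneno2 phoneno3);
my $template = 'A5 A5 A3 A10 A10 A10';    # 'A' strips trailing spaces per field

# Hypothetical helper: slice one record into a hash keyed by column name.
sub parse_record {
    my ($record) = @_;
    my %row;
    @row{@names} = unpack $template, $record;
    return \%row;
}

# Made-up record, padded to the schema widths, for illustration only.
my $record = sprintf '%-5s%-5s%-3s%-10s%-10s%-10s',
    'John', 'Smith', 42, '5550000001', '5550000002', '5550000003';

my $row = parse_record($record);
print join('|', @{$row}{@names}), "\n";
```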
Hello,
Splitting a sentence on the full stop, question mark, or exclamation mark is a common device. Whereas the question mark and exclamation mark do not pose much of a problem, the full stop as a sentence delimiter raises certain issues because of its varied use:
just to name a few.
Standard parsers... (9 Replies)
Hello,
I encountered this in Perl, but it might be command-line related as well:
I am sending text as an argument to the echo command on a remote computer.
If the text has alphanumeric characters only, say 'hello world', all is well. If, however, the text has metacharacters, e.g. 'hello | world' or even... (2 Replies)
Hi People,
I need some help writing a unix script that asks for a sentence to be typed in, counts the number of spaces within the sentence, and then echoes out "The Number Of Spaces In The Sentence is 4", as an example.
Thanks
Danielle (12 Replies)
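Counting spaces can be sketched in Perl with the `tr` operator, which returns the number of characters it matched. The sub name `count_spaces` and the fixed sample sentence are assumptions for illustration; an interactive version would read the sentence with `my $sentence = <STDIN>; chomp $sentence;` instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: with an empty replacement list and no /d modifier,
# tr/ // counts the spaces and leaves the string unchanged.
sub count_spaces {
    my ($sentence) = @_;
    return ($sentence =~ tr/ //);
}

# Made-up sample sentence containing four spaces.
my $sentence = 'The quick brown fox jumps';
print "The Number Of Spaces In The Sentence is ", count_spaces($sentence), "\n";
```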
Hi,
I do not have a clue how to do this, nor can I find information on it, but I have a file that looks like this (basically 3 columns, tab-delimited). I need it in a particular format in order for a program to actually read it.
chr1 2 4
chr1 2 5
chr1 3 6
chr2 1 4
chr2 2 5
... (2 Replies)
Hi,
I have 3 arrays:
@arr1=("Furthermore, apigenin treatment increased the level of association of the RNA binding protein HuR with endogenous p53 mRNA","one of the mechanisms by which apigenin induces p53 protein expression is enhancement of translation through the RNA binding protein... (1 Reply)
Hi,
I have two strings like this in an array:
For example:
@a=("Brain aging is associated with a progressive imbalance between intracellular concentration of Reactive Oxygen Species","Brain aging is associated with a progressive imbalance between intracellular concentration of Reactive... (9 Replies)
Hi everybody,
This time I am having an issue in Perl.
I have to create a comma-separated file using the following type of information. The problem is that the columns do not have any specific delimiter, so while using split I am getting different values. Somewhere it is space(S) and somewhere it is... (9 Replies)
Hi
I have a file which has about 100,000 records.
The records in it look like
Some kind of text 1234567891 abcd February 14, 2008 03:58:54 AM lmnop
This is how it looks. If you notice, there is a 2-byte space between each column, and I'm planning to replace that with '|'.
... (11 Replies)
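A substitution that turns each run of two or more spaces into a single `|` is one simple way to do this. In the sketch below the sub name `spaces_to_pipes` is made up, and the positions of the double spaces in the sample record are a guess at where the column boundaries fall.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: every run of two or more spaces becomes one '|'.
# Single spaces inside a column (as in the date) are left alone.
sub spaces_to_pipes {
    my ($line) = @_;
    $line =~ s/ {2,}/|/g;
    return $line;
}

# Sample record with assumed column boundaries marked by double spaces.
my $record = 'Some kind of text  1234567891  abcd  February 14, 2008 03:58:54 AM  lmnop';
print spaces_to_pipes($record), "\n";
```

For a whole file the same substitution also works as a one-liner, `perl -pe 's/ {2,}/|/g' infile > outfile`, which streams line by line and so copes with 100,000 records without loading the file into memory.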