Alignment tool to join text files in 2 directories to create a parallel corpus
I have two directories called English and Hindi. Each directory contains the same number of files; the only difference is the file extension. In the English directory the extension is
Code:
.english
and in the Hindi directory the extension is
Code:
.hindi
A file may contain either a single line of text or more than one line, as in the example below.
Code:
Agro1.english
in the English directory contains 22 lines, of which the first four are shown:
Code:
India Agriculture
Agriculture is art, science, and industry of managing the growth of plants and animals for human use.
In a broad sense, agriculture includes cultivation of the soil and growing and harvesting crops and breeding and raising livestock and dairying and forestry.
Regional and national agriculture are covered in more detail in individual continent, country, state, and Canadian province articles.
The same number of lines, in the same order, is provided in
Code:
Agro1.hindi
in the Hindi directory. The first four are shown as a sample:
Code:
भारतीय कृषि।
कृषि वह कला, विज्ञान और उद्योग है जो मानव उपयोग के लिए पौधों और पशुओं के विकास का प्रबंध करती है।
मोटे तौर पर कृषि में भूमि की जुताई, फसलों की रुपाई और कटाई, पशु-प्रजनन और पालन, दुग्ध-व्यवसाय और वनीकरण सम्मिलित हैं।
प्रत्येक महाद्वीप, देश, राज्य और कनाडा के प्रांतीय लेखों में प्रादेशिक और राष्ट्रीय कृषि का विस्तार से वर्णन किया गया है।
In some cases a given file may contain only one line.
What I need is to join each English line to the corresponding Hindi line with
Code:
=
as a delimiter.
The expected output for the four lines given above is shown below:
Code:
India Agriculture=भारतीय कृषि।
Agriculture is art, science, and industry of managing the growth of plants and animals for human use.=कृषि वह कला, विज्ञान और उद्योग है जो मानव उपयोग के लिए पौधों और पशुओं के विकास का प्रबंध करती है।
In a broad sense, agriculture includes cultivation of the soil and growing and harvesting crops and breeding and raising livestock and dairying and forestry.=मोटे तौर पर कृषि में भूमि की जुताई, फसलों की रुपाई और कटाई, पशु-प्रजनन और पालन, दुग्ध-व्यवसाय और वनीकरण सम्मिलित हैं।
Regional and national agriculture are covered in more detail in individual continent, country, state, and Canadian province articles.=प्रत्येक महाद्वीप, देश, राज्य और कनाडा के प्रांतीय लेखों में प्रादेशिक और राष्ट्रीय कृषि का विस्तार से वर्णन किया गया है।
Since each directory contains too many files, joining them by hand is impractical. I need an alignment tool that will do the job.
A Perl or awk script would be of great help. I do not know how to work with directories in Perl or awk, hence the request.
I work in a Windows environment.
Many thanks for your help.
If each pair of corresponding files has the same number of lines, paste can join them line by line: it creates one record with line n from each file, separated by the delimiter of your choice (the default is a tab).
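Since the question mentions Windows, where paste may not be available, here is a minimal Python sketch of the same line-by-line join applied to two whole directories. It is not the Perl or awk script that was asked for, only an illustration of the idea. The function names, the `.txt` output extension, and the separate output directory are my own choices, and the files are assumed to be UTF-8 encoded (with or without a byte order mark).

```python
from pathlib import Path


def read_lines(path):
    """Return the lines of a text file without their trailing newlines.

    Assumption: the files are UTF-8; "utf-8-sig" also drops a leading BOM.
    """
    with open(path, encoding="utf-8-sig") as handle:
        return [line.rstrip("\n") for line in handle]


def align_file(english_path, hindi_path, out_path, delimiter="="):
    """Join line n of the English file to line n of the Hindi file."""
    english = read_lines(english_path)
    hindi = read_lines(hindi_path)
    if len(english) != len(hindi):
        # Refuse to write a misaligned corpus file.
        raise ValueError(
            f"line count mismatch: {english_path} has {len(english)}, "
            f"{hindi_path} has {len(hindi)}"
        )
    with open(out_path, "w", encoding="utf-8") as out:
        for en_line, hi_line in zip(english, hindi):
            out.write(f"{en_line}{delimiter}{hi_line}\n")
    return len(english)


def align_directories(english_dir, hindi_dir, out_dir, delimiter="="):
    """Pair X.english with X.hindi and write X.txt into out_dir.

    Returns the number of aligned files and the names of English files
    that have no Hindi counterpart.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Index the Hindi files by base name, ignoring case in name and extension.
    hindi_by_stem = {
        p.stem.lower(): p
        for p in Path(hindi_dir).iterdir()
        if p.suffix.lower() == ".hindi"
    }

    aligned, skipped = 0, []
    for english_path in sorted(Path(english_dir).iterdir()):
        if english_path.suffix.lower() != ".english":
            continue
        hindi_path = hindi_by_stem.get(english_path.stem.lower())
        if hindi_path is None:
            skipped.append(english_path.name)
            continue
        align_file(english_path, hindi_path,
                   out_dir / (english_path.stem + ".txt"), delimiter)
        aligned += 1
    return aligned, skipped
```

Calling `align_directories("English", "Hindi", "Parallel")` from the folder that holds both directories would write one joined file per pair into a new `Parallel` directory (a hypothetical name). A pair whose line counts differ raises an error instead of producing misaligned output, so such files can be checked by hand.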