Alignment tool to join text files in 2 directories to create a parallel corpus


 
# 1  
Old 08-07-2018

I have two directories called English and Hindi. Each directory contains the same number of files, with matching names; the only difference is the file extension (tag), which in the English directory is
Code:
.english

and in the Hindi one the tag is
Code:
.hindi

A file may contain a single line of text or several lines, as in the example below.
Code:
Agro1.english

in the English directory contains 22 lines, of which the first four are shown:
Code:
India Agriculture
Agriculture is art, science, and industry of managing the growth of plants and animals for human use.
In a broad sense, agriculture includes cultivation of the soil and growing and harvesting crops and breeding and raising livestock and dairying and forestry.
Regional and national agriculture are covered in more detail in individual continent, country, state, and Canadian province articles.

The same number of lines, in the same order, is found in
Code:
Agro1.hindi

in the Hindi directory. The first four are shown as a sample:
Code:
भारतीय कृषि।
कृषि वह कला, विज्ञान और उद्योग है जो मानव उपयोग के लिए पौधों और पशुओं के विकास का प्रबंध करती है।
मोटे तौर पर कृषि में भूमि की जुताई, फसलों की रुपाई और कटाई, पशु-प्रजनन और पालन, दुग्ध-व्यवसाय और वनीकरण सम्मिलित हैं। 
प्रत्येक महाद्वीप, देश, राज्य और कनाडा के प्रांतीय लेखों में प्रादेशिक और राष्ट्रीय कृषि का विस्तार से वर्णन किया गया है।

In some cases a given file may contain only one line.
What I need is to join the English lines to the corresponding Hindi lines with
Code:
=

as a delimiter.
The expected output for the four lines given above is shown below:
Code:
India Agriculture=भारतीय कृषि।
Agriculture is art, science, and industry of managing the growth of plants and animals for human use.=कृषि वह कला, विज्ञान और उद्योग है जो मानव उपयोग के लिए पौधों और पशुओं के विकास का प्रबंध करती है।
In a broad sense, agriculture includes cultivation of the soil and growing and harvesting crops and breeding and raising livestock and dairying and forestry.=मोटे तौर पर कृषि में भूमि की जुताई, फसलों की रुपाई और कटाई, पशु-प्रजनन और पालन, दुग्ध-व्यवसाय और वनीकरण सम्मिलित हैं। 
Regional and national agriculture are covered in more detail in individual continent, country, state, and Canadian province articles.=प्रत्येक महाद्वीप, देश, राज्य और कनाडा के प्रांतीय लेखों में प्रादेशिक और राष्ट्रीय कृषि का विस्तार से वर्णन किया गया है।

Since each directory contains too many files, handling them manually is impractical. I need an alignment tool that will do the job.
A Perl or awk script would be of great help. I do not know how to work with directories in Perl or awk, hence the request.
I work in a Windows environment.
Many thanks for your help.
# 2  
Old 08-07-2018
Does it have to be perl or awk? Or would shell do?

Code:
# Pair each English file with its Hindi counterpart and join the lines with "="
cd English && for FN in *.english; do paste -d= "$FN" "../Hindi/${FN%.*}.hindi"; done
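The one-liner above prints every joined pair to standard output. A slightly more defensive variant might look like the sketch below, which writes one output file per pair and skips pairs that cannot be aligned. It assumes a POSIX shell with `paste` and `wc` (on Windows, for example through Cygwin or WSL). The function name `align_dirs`, the output directory argument, and the `.txt` output extension are illustrative choices, not something taken from the thread.

```shell
#!/bin/sh
# align_dirs ENGLISH_DIR HINDI_DIR OUT_DIR   (hypothetical helper name)
# Writes OUT_DIR/<name>.txt with "english=hindi" lines for every file pair.
align_dirs() {
    en_dir=$1
    hi_dir=$2
    out_dir=$3
    mkdir -p "$out_dir" || return 1

    for en in "$en_dir"/*.english; do
        [ -e "$en" ] || continue           # the glob matched nothing
        base=${en##*/}                     # strip the directory part
        base=${base%.english}              # strip the extension
        hi=$hi_dir/$base.hindi

        if [ ! -f "$hi" ]; then
            echo "skipping $base: no matching Hindi file" >&2
            continue
        fi

        # Compare line counts so misaligned pairs are reported, not joined
        en_n=$(( $(wc -l < "$en") ))
        hi_n=$(( $(wc -l < "$hi") ))
        if [ "$en_n" -ne "$hi_n" ]; then
            echo "skipping $base: $en_n English vs $hi_n Hindi lines" >&2
            continue
        fi

        paste -d= "$en" "$hi" > "$out_dir/$base.txt"
    done
}
```

With the layout described in post #1, it could be called as `align_dirs English Hindi Parallel`, where `Parallel` is an assumed name for the output directory.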

This User Gave Thanks to RudiC For This Post:
# 3  
Old 08-07-2018
Sorry for the late reply. Many thanks. I work in a Windows environment, hence the request for Perl or awk.
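Since the request is specifically for awk, the join of one file pair can also be sketched with awk doing the work, reading the Hindi file with `getline` while it walks the English file. This is only an illustration: the function name `join_pair` and the environment variable `HINDI_FILE` are made up for the example, and an awk that accepts `/dev/stderr` as an output name (as gawk does) is assumed.

```shell
#!/bin/sh
# join_pair ENGLISH_FILE HINDI_FILE   (hypothetical helper name)
# Prints "english=hindi" for each line; fails if the Hindi file is shorter.
join_pair() {
    # The Hindi path is passed through the environment so that awk does not
    # interpret backslashes in it as escape sequences.
    HINDI_FILE=$2 awk '
        BEGIN { hindi = ENVIRON["HINDI_FILE"] }
        {
            # Fetch the next Hindi line; stop if the file has run out
            if ((getline h < hindi) <= 0) {
                print "fewer lines in " hindi " than in " FILENAME > "/dev/stderr"
                exit 1
            }
            print $0 "=" h
        }
    ' "$1"
}
```

For the sample files this would be run as `join_pair English/Agro1.english Hindi/Agro1.hindi > Agro1.txt`. Extra lines at the end of the Hindi file are not detected by this sketch.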
# 4  
Old 08-07-2018
Could the paste command work here?
# 5  
Old 08-07-2018
Could you please explain in what way? Thanks.
# 6  
Old 08-07-2018
If the corresponding files have the same number of lines, paste can put matching lines on the same line: it creates one record from line n of each file, separated by the delimiter of your choice (the default is a tab).
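A small throwaway demonstration of that behaviour, using temporary files whose names are invented for the example, could look like this:

```shell
#!/bin/sh
# Build two tiny sample files in a temporary directory
tmp=$(mktemp -d)
printf 'a\nb\n' > "$tmp/left"
printf '1\n2\n' > "$tmp/right"
printf '1\n'    > "$tmp/short"

tabbed=$(paste "$tmp/left" "$tmp/right")       # default delimiter: tab
joined=$(paste -d= "$tmp/left" "$tmp/right")   # "=" as the delimiter
uneven=$(paste -d= "$tmp/left" "$tmp/short")   # second file is one line short

printf '%s\n' "$tabbed"   # a<TAB>1, then b<TAB>2
printf '%s\n' "$joined"   # a=1, then b=2
printf '%s\n' "$uneven"   # a=1, then b=  (missing lines are treated as empty)

rm -rf "$tmp"
```

The last case shows why equal line counts matter: paste does not complain when one file is shorter, it simply leaves that side of the record empty.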
# 7  
Old 08-07-2018
@wbport: see post#2 (and #3).