A beginner needing some help programming documents


 
# 1  
Old 01-31-2013

Hi all,

I'm fairly new to shell programming and Python programming. I have a Mac (OS X 10.8.2, Mountain Lion) and use the Terminal for programming. I'm trying to use Unix tools to organize some language data that I am working with. Basically, I have two word lists that I need to combine into one.

Word list 1 (Chinese):
Code:
你们
好
家明 
你
好 

Word List 2 (Chinese pinyin with numerical tone mark):
Code:
ni3men
hao3 
Jia1ming2 
ni3
hao3

My desired outcome would combine the numbers from the second word list with the characters in the first word list, like this:

Code:
你,3,们,0
好,3 
家,1,明,2 
你,3
好,3

It is important that the format is "character", comma, "number".

So far I have done the following with wordlist two:
Code:
tr '[:alpha:]' ',' <WordList2.txt | tr -s ',' >WordList2B.txt 
paste WordList1.txt WordList2B.txt > CombinedWordList.txt
tr -d '\t' <CombinedWordList.txt | tr -s '[:space:]' >CombinedWordList2.txt

My current output document looks like this:
Code:
你们,3,
好,3
家明,1,2 
你,3
好,3
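
For reference, the three commands above can be traced step by step outside the shell. This is only a rough Python sketch: it assumes the pinyin contains nothing but ASCII letters and digits, it ignores the trailing blanks in the sample data, and `tr_pipeline` is a made-up helper name rather than part of any tool.

```python
import re

def tr_pipeline(chars: str, pinyin: str) -> str:
    """Mimic the tr/paste/tr commands for one pair of lines."""
    # tr '[:alpha:]' ','  : every (ASCII) letter becomes a comma
    commas = re.sub(r"[A-Za-z]", ",", pinyin)
    # tr -s ','           : runs of commas are squeezed into one
    squeezed = re.sub(r",+", ",", commas)
    # paste + tr -d '\t'  : the two columns end up joined with no separator
    return chars + squeezed

pairs = [("你们", "ni3men"), ("好", "hao3"), ("家明", "Jia1ming2")]
for chars, pinyin in pairs:
    print(tr_pipeline(chars, pinyin))
```

The sketch shows where the structure gets lost: once the letters are squeezed away, nothing records which tone number belongs to which character, and a syllable without a tone number leaves only a bare comma.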

It is almost there, but the first and third lines need to be further integrated so that the format is "character", comma, "number", "character", comma, "number". Every single Chinese character should be followed by a number. One additional problem is that some syllables (such as the second character in the first example, 你们,3,) do not have a corresponding number. In that case I would like a zero ('0') to be inserted automatically, so the first word would appear as "你,3,们,0". Specifically, I need help with:
1) Formatting the document to appear as "character", comma, "number", "character", comma, "number" instead of "character", "character", comma, "number", comma, "number".
2) Having a zero ('0') inserted after the comma when there is not already a number.

Any help or suggestions would be greatly appreciated

Last edited by Scrutinizer; 02-01-2013 at 06:33 AM.. Reason: Please use code tags for data and code samples
# 2  
Old 02-01-2013
This is the first time I have had to struggle with UTF-8 characters, so I'm feeling a bit overstrained, and you should take my proposal as a mere pointer in the right direction. On top of that, both of your input files have trailing blanks, which I removed. If they are needed, you will have to add special handling to the code. Here's my meek approach:
Code:
awk    'NR==FNR {sub(/[^0-9]$/, "&0");gsub (/[0-9]/,",&,");  Ar[NR]=$2$4; next}
     {gsub (/.../,"&,"); $1=$1","substr (Ar[FNR],1,1); if ($2) $2=$2","substr (Ar[FNR],2,1)}
     1
    ' FS="," OFS="," file2 file3
你,3,们,0,
好,3,
家,1,明,2,
你,3,
好,3,

The trailing commas, which I didn't bother to remove, come from my rough attempt to separate the Chinese syllables. I'm sure you have better means in your locale!
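
The `/.../` pattern in the second block relies on each of these Chinese characters being three bytes long in UTF-8 when awk works on bytes. A small Python sketch of that assumption follows (the variable names are only for illustration). It holds for characters in the range U+0800 to U+FFFF, but not for ASCII or for characters outside the Basic Multilingual Plane.

```python
word = "家明"
raw = word.encode("utf-8")

# Two characters, but six bytes: this is what a byte-oriented awk sees.
print(len(word), len(raw))

# Cutting the byte string every three bytes mirrors gsub(/.../, "&,").
chunks = [raw[i:i + 3].decode("utf-8") for i in range(0, len(raw), 3)]
print(",".join(chunks))
```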

Last edited by RudiC; 02-01-2013 at 07:13 AM..
# 3  
Old 02-03-2013
Nomadblue,
RudiC's code looks reasonable, but I haven't been able to test it. I have found that awk on OS X Version 10.7.5 (Lion) counts bytes instead of characters in substr() and length(). I also found that using a regular expression to search for a space fails if the space follows a multibyte character (not just in awk, but also at least in bash, ed, ex, grep, ksh, sed, and vi). My testing was done with LANG set to en_US.UTF-8 and no LC_* environment variables set.

I would love to hear if this has been fixed in Mountain Lion.

************************
Update: I take back what I said about REs not matching spaces after multibyte characters. The characters that I originally thought were spaces were multibyte characters consisting of the octal byte sequences: 0343 0200 0200 and 0342 0200 0206. Those two characters aren't spaces, but they are in the locale's space character class.
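
For anyone who wants to identify those two byte sequences, here is a small Python sketch that decodes them and asks Python whether it treats them as whitespace. Note that `str.isspace()` reflects Python's own Unicode rules, which is an assumption on my part as a stand-in for the locale's space character class on OS X; the two need not agree in every case.

```python
import unicodedata

# The two octal byte sequences quoted in the update above.
sequences = {
    "0343 0200 0200": bytes([0o343, 0o200, 0o200]),
    "0342 0200 0206": bytes([0o342, 0o200, 0o206]),
}

for label, raw in sequences.items():
    ch = raw.decode("utf-8")
    # Print the code point, its Unicode name, and Python's whitespace verdict.
    print(label, f"U+{ord(ch):04X}", unicodedata.name(ch), ch.isspace())
```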

---------- Post updated Feb 3rd, 2013 at 13:46 ---------- Previous update was Feb 2nd, 2013 at 23:13 ----------

The following script seems to do what you want except that it does not print any trailing space character class characters at the ends of the output lines. (Note that Word list 1 had a trailing character in the space character class on lines 3 and 5, Word list 2 on lines 2 and 3, and your desired outcome on lines 2 and 3. The output produced by this script does not include any characters in the space character class.)
Code:
#!/bin/ksh
# The awk on Mac OS X Version 10.7.5 does not meet POSIX/UNIX requirements for
# handling multibyte characters (it processes bytes instead of characters) at
# least in the length() and substr() functions.  This problem should be easy to
# handle in awk, but this script is written entirely as a ksh script which does
# handle multibyte characters correctly.  (The bash on OS X Version 10.7.5 also
# handles multibyte characters correctly and, although this script uses many
# features that are not defined by the standards, this script works both with
# ksh and bash on OS X.  If using this script on another system, you will need
# to use a 1993 or later version of ksh.)

# Read Chinese string.
while IFS="" read -r c
do      # Read corresponding Chinese pinyin string with tone marks.
        IFS="" read -r cp <&3
        # Strip a trailing space character class character from each string, if
        # there is one.
        c=${c%[[:space:]]}
        cp=${cp%[[:space:]]}
        # Is there a tone mark at the end of the Chinese pinyin string?
        if [[ ${cp:$((${#cp} - 1))} != [[:digit:]] ]]
        then    # No.  Add "0" as a tone mark.
                cp="${cp}0"
        fi
        # Strip everything but tone marks from the Chinese pinyin string.
        cp=${cp//[![:digit:]]/}
        # Print the Chinese characters with their corresponding tone marks.
        sep=""  # No separator for first character pair.
        for ((i = 0; i < ${#cp}; i++))
        do      printf "%s%s,%s" "$sep" "${c:$i:1}" "${cp:$i:1}"
                sep="," # Separator for all following character pairs.
        done
        # Add the trailing newline.
        echo
done < Word_list_1 3< Word_list_2
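
Since Python was mentioned in the first post, the same approach can also be sketched there. This is not a drop-in replacement for the script above, just a minimal version of the same idea under some assumptions: `combine` and `combine_files` are hypothetical names, `strip()` removes all leading and trailing whitespace (the ksh script removes a single trailing character), and `zip()` stops at the shorter of the two strings. As in the ksh script, a "0" is only added when the last syllable lacks a tone number.

```python
import re

def combine(chars: str, pinyin: str) -> str:
    """Pair each Chinese character with the tone digit of its syllable."""
    chars = chars.strip()
    pinyin = pinyin.strip()
    # No tone mark at the end of the pinyin string? Add "0".
    if not pinyin or pinyin[-1] not in "0123456789":
        pinyin += "0"
    # Strip everything but the tone digits.
    tones = re.sub(r"[^0-9]", "", pinyin)
    # Python strings are indexed by character, so zip() pairs them up directly.
    return ",".join(f"{c},{t}" for c, t in zip(chars, tones))

def combine_files(chars_path: str, pinyin_path: str) -> list:
    """Apply combine() line by line to two UTF-8 files (paths are assumptions)."""
    with open(chars_path, encoding="utf-8") as cf, \
         open(pinyin_path, encoding="utf-8") as pf:
        return [combine(c, p) for c, p in zip(cf, pf)]

# In-memory demo using the sample data from the first post.
word_list_1 = ["你们", "好", "家明 ", "你", "好 "]
word_list_2 = ["ni3men", "hao3 ", "Jia1ming2 ", "ni3", "hao3"]
for c, p in zip(word_list_1, word_list_2):
    print(combine(c, p))
```

With real files, something like `print("\n".join(combine_files("Word_list_1", "Word_list_2")))` would be the equivalent of the loop at the end.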


Last edited by Don Cragun; 02-03-2013 at 05:50 PM.. Reason: Update with new info re: Mac OS X