awk or regex
# 1  
Old 04-29-2009

Hi!

I want to make a program that generates code like this:
{{Navedi XYZ
|avtor=XYZ1
|naslov=XYZ2
|leto_izzida=XYZ3
|zalozba=XYZ4
|kraj=XYZ5
|isbn=XYZ6
|cobiss_id=XYZ7
}}

from input like this:
<b> ODGOVORNOST............. : <a href="http://cobiss2.izum.si/scripts/cobiss?ukaz=FFRM&mode=5&id=2047435453134563&PF1=AU&PF2=TI&PF3=PY&PF4=KW&CS=a&PF5=CB&run=yes&SS1=&quot;Tauber,%20Daniel%20A.&quot;">Tauber, Daniel A.</a> - zbiratelj</b>
<b> NASLOV.................. : #The #complete Linux kit</b>
<b> IMPRESUM................ : San Francisco [etc.] : Sybex, 1995</b>
<b> FIZIČNI OPIS............ : XXIII, 419 str. ; 23 cm + CD-ROM</b>
<b> ISBN.................... : 0-7821-1669-8</b>
<b> PREDMETNE OZNAKE........ : <a href="http://cobiss2.izum.si/scripts/cobiss?ukaz=FFRM&mode=5&id=2047435453134563&PF1=AU&PF2=TI&PF3=PY&PF4=KW&CS=a&PF5=SU&run=yes&SS5=&quot;racunalnistvo&quot;">računalništvo</a> // <a href="http://cobiss2.izum.si/scripts/cobiss?ukaz=FFRM&mode=5&id=2047435453134563&PF1=AU&PF2=TI&PF3=PY&PF4=KW&CS=a&PF5=SU&run=yes&SS5=&quot;operacijski%20sistemi&quot;">operacijski sistemi</a></b>
<b>// <a href="http://cobiss2.izum.si/scripts/cobiss?ukaz=FFRM&mode=5&id=2047435453134563&PF1=AU&PF2=TI&PF3=PY&PF4=KW&CS=a&PF5=SU&run=yes&SS5=&quot;linux&quot;">linux</a> // <a href="http://cobiss2.izum.si/scripts/cobiss?ukaz=FFRM&mode=5&id=2047435453134563&PF1=AU&PF2=TI&PF3=PY&PF4=KW&CS=a&PF5=SU&run=yes&SS5=&quot;unix&quot;">unix</a> // <a href="http://cobiss2.izum.si/scripts/cobiss?ukaz=FFRM&mode=5&id=2047435453134563&PF1=AU&PF2=TI&PF3=PY&PF4=KW&CS=a&PF5=SU&run=yes&SS5=&quot;programska%20oprema&quot;">programska oprema</a> // <a href="http://cobiss2.izum.si/scripts/cobiss?ukaz=FFRM&mode=5&id=2047435453134563&PF1=AU&PF2=TI&PF3=PY&PF4=KW&CS=a&PF5=SU&run=yes&SS5=&quot;Internet&quot;">Internet</a> //</b>
<b><a href="http://cobiss2.izum.si/scripts/cobiss?ukaz=FFRM&mode=5&id=2047435453134563&PF1=AU&PF2=TI&PF3=PY&PF4=KW&CS=a&PF5=SU&run=yes&SS5=&quot;komercialni%20sistemi&quot;">komercialni sistemi</a></b>
<b> UDK..................... : 681.3.06, 519.68</b>
<b> UDK ZA STATISTIKO....... : 62+66/69</b>
<b> VRSTA GRADIVA........... : monografska publikacija, tekstovno gradivo,</b>
<b>tiskano</b>


<b> COBISS.SI-ID............ : 2952</b>
In this example the code would be:
{{Navedi CD-ROM
|avtor= Daniel A. Tauber
|naslov=The complete Linux kit
|kraj=San Francisco
|zalozba=Sybex
|leto=1995
|cobiss_id=2952
|isbn=0-7821-1669-8
}}

This is needed on the Slovenian Wikisource, since some users gave only a link to the page in the national bibliographic system (COBISS - COBISS/OPAC), but we need to cite all of these details.
# 2  
Old 04-29-2009
Maybe something like this:

Code:
$ 
$ # show the contents of "input.txt"
$ cat -n input.txt
     1  <b> ODGOVORNOST............. : <a href="http://cobiss2.izum.si/scripts/cobiss?ukaz=FFRM&mode=5&id=2047435453134563&PF1=AU&PF2=TI&PF3=PY&PF4=KW&CS=a&PF5=CB&run=yes&SS1=&quot;Tauber,%20Daniel%20A.&quot;">Tauber, Daniel A.</a> - zbiratelj</b>
     2  <b> NASLOV.................. : #The #complete Linux kit</b>                                                                     
     3  <b> IMPRESUM................ : San Francisco [etc.] : Sybex, 1995</b>                                                           
     4  <b> FIZIČNI OPIS............ : XXIII, 419 str. ; 23 cm + CD-ROM</b>                                                             
     5  <b> ISBN.................... : 0-7821-1669-8</b>                                                                                
     6  <b> PREDMETNE OZNAKE........ : <a href="http://cobiss2.izum.si/scripts/cobiss?ukaz=FFRM&mode=5&id=2047435453134563&PF1=AU&PF2=TI&PF3=PY&PF4=KW&CS=a&PF5=SU&run=yes&SS5=&quot;racunalnistvo&quot;">računalništvo</a> // <a href="http://cobiss2.izum.si/scripts/cobiss?ukaz=FFRM&mode=5&id=2047435453134563&PF1=AU&PF2=TI&PF3=PY&PF4=KW&CS=a&PF5=SU&run=yes&SS5=&quot;operacijski%20sistemi&quot;">operacijski sistemi</a></b>
     7  <b>// <a href="http://cobiss2.izum.si/scripts/cobiss?ukaz=FFRM&mode=5&id=2047435453134563&PF1=AU&PF2=TI&PF3=PY&PF4=KW&CS=a&PF5=SU&run=yes&SS5=&quot;linux&quot;">linux</a> // <a href="http://cobiss2.izum.si/scripts/cobiss?ukaz=FFRM&mode=5&id=2047435453134563&PF1=AU&PF2=TI&PF3=PY&PF4=KW&CS=a&PF5=SU&run=yes&SS5=&quot;unix&quot;">unix</a> // <a href="http://cobiss2.izum.si/scripts/cobiss?ukaz=FFRM&mode=5&id=2047435453134563&PF1=AU&PF2=TI&PF3=PY&PF4=KW&CS=a&PF5=SU&run=yes&SS5=&quot;programska%20oprema&quot;">programska oprema</a> // <a href="http://cobiss2.izum.si/scripts/cobiss?ukaz=FFRM&mode=5&id=2047435453134563&PF1=AU&PF2=TI&PF3=PY&PF4=KW&CS=a&PF5=SU&run=yes&SS5=&quot;Internet&quot;">Internet</a> //</b>
     8  <b><a href="http://cobiss2.izum.si/scripts/cobiss?ukaz=FFRM&mode=5&id=2047435453134563&PF1=AU&PF2=TI&PF3=PY&PF4=KW&CS=a&PF5=SU&run=yes&SS5=&quot;komercialni%20sistemi&quot;">komercialni sistemi</a></b>
     9  <b> UDK..................... : 681.3.06, 519.68</b>
    10  <b> UDK ZA STATISTIKO....... : 62+66/69</b>
    11  <b> VRSTA GRADIVA........... : monografska publikacija, tekstovno gradivo,</b>
    12  <b>tiskano</b>
    13  <b> COBISS.SI-ID............ : 2952</b>
    14
$
$ # run the script
$ perl -ne '{
>   /^<b> FIZI.*?(\S+)<\/b>/ and $nvd = $1;
>   /^<b> ODGOVORNOST.*?>(.*?), (.*?)<\/a>.*?<\/b>/ and $avt = "$2 $1";
>   /^<b> NASLOV.*?: (.*?)<\/b>/ and $nsl = $1; $nsl =~ s/#//g;
>   /^<b> IMPRESUM.*?: (.*?)<\/b>/ and $kzl = $1; ($krj,$zlz,$lto) = split(/ : |, /, $kzl); $krj =~ s/ \[etc\.\]//g;
>   /^<b> COBISS.*?: (\d+)<\/b>/ and $cbs = $1;
>   /^<b> ISBN.*?: (\S+)<\/b>/ and $isb = $1;
> }
> END {
>   printf("{{Navedi %s\n|avtor= %s\n|naslov= %s\n|kraj= %s\n", $nvd, $avt, $nsl, $krj);
>   printf("|zalozba= %s\n|leto= %s\n|cobiss_id= %s\n|isbn= %s\n}}\n", $zlz, $lto, $cbs, $isb);
> }' input.txt
{{Navedi CD-ROM
|avtor= Daniel A. Tauber
|naslov= The complete Linux kit
|kraj= San Francisco
|zalozba= Sybex
|leto= 1995
|cobiss_id= 2952
|isbn= 0-7821-1669-8
}}
$
$

HTH,
tyler_durden

_________________________________________________________________________________________________
"And the eighth and final rule: if this is your first time at Fight Club, you have to fight."
# 3  
Old 04-30-2009
And what if I have the following input:
<b> AVTOR................... : <a href="http://cobiss2.izum.si/scripts/cobiss?ukaz=FFRM&mode=5&id=0858170590354406&PF1=AU&PF2=TI&PF3=PY&PF4=KW&CS=a&PF5=CB&run=yes&SS1=&quot;Valvasor,%20Janez%20Vajkard&quot;">Valvasor, Janez Vajkard</a> - avtor</b>
<b> ODGOVORNOST............. : <a href="http://cobiss2.izum.si/scripts/cobiss?ukaz=FFRM&mode=5&id=0858170590354406&PF1=AU&PF2=TI&PF3=PY&PF4=KW&CS=a&PF5=CB&run=yes&SS1=&quot;Rupel,%20Mirko&quot;">Rupel, Mirko</a> - prevajalec - urednik -</b>
<b>avtor dodatnega besedila // <a href="http://cobiss2.izum.si/scripts/cobiss?ukaz=FFRM&mode=5&id=0858170590354406&PF1=AU&PF2=TI&PF3=PY&PF4=KW&CS=a&PF5=CB&run=yes&SS1=&quot;Gerlanc,%20Bogomil&quot;">Gerlanc, Bogomil</a> - urednik - avtor</b>
<b>dodatnega besedila // <a href="http://cobiss2.izum.si/scripts/cobiss?ukaz=FFRM&mode=5&id=0858170590354406&PF1=AU&PF2=TI&PF3=PY&PF4=KW&CS=a&PF5=CB&run=yes&SS1=&quot;Justin,%20Elko&quot;">Justin, Elko</a> - ilustrator - urednik</b>
<b> NASLOV.................. : Slava vojvodine Kranjske, ... z</b>
<b>zgodovinsko-topografskim opisom ...</b>
<b> IMPRESUM................ : Ljubljana : Mladinska knjiga, 1977</b>
<b> FIZIČNI OPIS............ : XV, 365 str., X pril. : ilustr. ; 29 cm</b>
<b> ZBIRKA.................. : (#Zbirka #Kultura / Mladinska knjiga)</b>
<b> OSTALI NASLOVI.......... : #Die #Ehre des Hertzogthums Crain // Slava</b>
<b>vojvodine Kranjske</b>
<b> PREDMETNE OZNAKE........ : <a href="http://cobiss2.izum.si/scripts/cobiss?ukaz=FFRM&mode=5&id=0858170590354406&PF1=AU&PF2=TI&PF3=PY&PF4=KW&CS=a&PF5=PN&run=yes&SS5=&quot;Valvasor,%20Janez%20Vajkard,%201641-1693&quot;">Valvasor, Janez Vajkard</a> (1641-1693) -</b>
<b><a href="http://cobiss2.izum.si/scripts/cobiss?ukaz=FFRM&mode=5&id=0858170590354406&PF1=AU&PF2=TI&PF3=PY&PF4=KW&CS=a&PF5=SU&run=yes&SS5=&quot;Biografije&quot;">Biografije</a> // <a href="http://cobiss2.izum.si/scripts/cobiss?ukaz=FFRM&mode=5&id=0858170590354406&PF1=AU&PF2=TI&PF3=PY&PF4=KW&CS=a&PF5=SU&run=yes&SS5=&quot;Kranjska&quot;">Kranjska</a> - <a href="http://cobiss2.izum.si/scripts/cobiss?ukaz=FFRM&mode=5&id=0858170590354406&PF1=AU&PF2=TI&PF3=PY&PF4=KW&CS=a&PF5=SU&run=yes&SS5=&quot;Domoznanstvo&quot;">Domoznanstvo</a></b>
<b> PREDMETNE OZNAKE........ : <a href="http://cobiss2.izum.si/scripts/cobiss?ukaz=FFRM&mode=5&id=0858170590354406&PF1=AU&PF2=TI&PF3=PY&PF4=KW&CS=a&PF5=SU&run=yes&SS5=&quot;Kranjska&quot;">Kranjska</a> // <a href="http://cobiss2.izum.si/scripts/cobiss?ukaz=FFRM&mode=5&id=0858170590354406&PF1=AU&PF2=TI&PF3=PY&PF4=KW&CS=a&PF5=SU&run=yes&SS5=&quot;Slovenija&quot;">Slovenija</a> //</b>
<b><a href="http://cobiss2.izum.si/scripts/cobiss?ukaz=FFRM&mode=5&id=0858170590354406&PF1=AU&PF2=TI&PF3=PY&PF4=KW&CS=a&PF5=SU&run=yes&SS5=&quot;domoznanstvo&quot;">domoznanstvo</a> // <a href="http://cobiss2.izum.si/scripts/cobiss?ukaz=FFRM&mode=5&id=0858170590354406&PF1=AU&PF2=TI&PF3=PY&PF4=KW&CS=a&PF5=SU&run=yes&SS5=&quot;17.st.&quot;">17.st.</a></b>
<b> UDK..................... : 908(497.4), 930.1(497.4):929 Valvasor J.W.,</b>
<b>929 Valvasor J.W.</b>
<b> VRSTA GRADIVA........... : monografska publikacija, tekstovno gradivo,</b>
<b>tiskano</b>
<b> COBISS.SI-ID............ : 1102110</b>

How do I add an OR condition in Perl?

I mean: if both the AVTOR and ODGOVORNOST fields are present, use AVTOR; if only one of them is present, use that one.
# 4  
Old 04-30-2009
Maybe this type of input is prettier:
<b>4. DT=m RT=a (tekstovno gradivo, tiskano) TI=Slava vojvodine Kranjske :</b>
<b>izbrana poglavja NT=Prevod dela: Die Ehre des Herzogthums Crain ; 2.000 izv. ;</b>
<b>Opombe: str. 317-335 ; Janez Vajkard Valvasor in Slava vojvodine Kranjske /</b>
<b>Branko Reis: str. 337-352 AU=<a href="http://cobiss2.izum.si/scripts/cobiss?ukaz=FFRM&amp;mode=5&amp;id=0858170590354406&amp;PF1=AU&amp;PF2=TI&amp;PF3=PY&amp;PF4=KW&amp;CS=a&amp;PF5=CB&amp;run=yes&amp;SS1=%22Valvasor,%20Janez%20Vajkard%22">Valvasor, Janez Vajkard</a> PP=Ljubljana</b>
<b>PU=Mladinska knjiga PY=1984 LA=slv (slovenski) CL=#Zbirka# Kultura ID=561694</b>
# 5  
Old 04-30-2009
Quote:
Originally Posted by smihael
And what if I have the following input:
...
How do I add an OR condition in Perl?

I mean: if both the AVTOR and ODGOVORNOST fields are present, use AVTOR; if only one of them is present, use that one.
Maybe something like this:

Code:
perl -ne '{
  # Material type: last word of the physical description (FIZIČNI OPIS).
  /^<b> FIZI.*?(\S+)<\/b>/ and $nvd = $1;
  # Names are listed as "Last, First"; store them as "First Last".
  /^<b> AVTOR.*?>(.*?), (.*?)<\/a>.*?<\/b>/ and $avt = "$2 $1";
  /^<b> ODGOVORNOST.*?>(.*?), (.*?)<\/a>.*?<\/b>/ and $odg = "$2 $1";
  /^<b> NASLOV.*?: (.*?)<\/b>/ and $nsl = $1; $nsl =~ s/#//g;
  # IMPRESUM has the form "place : publisher, year".
  /^<b> IMPRESUM.*?: (.*?)<\/b>/ and $kzl = $1; ($krj,$zlz,$lto) = split(/ : |, /, $kzl); $krj =~ s/ \[etc\.\]//g;
  /^<b> COBISS.*?: (\d+)<\/b>/ and $cbs = $1;
  /^<b> ISBN.*?: (\S+)<\/b>/ and $isb = $1;
}
END {
  # Use AVTOR when it was found, otherwise fall back to ODGOVORNOST.
  printf("{{Navedi %s\n|avtor= %s\n|naslov= %s\n|kraj= %s\n", $nvd, ($avt || $odg), $nsl, $krj);
  printf("|zalozba= %s\n|leto= %s\n|cobiss_id= %s\n|isbn= %s\n}}\n", $zlz, $lto, $cbs, $isb);
}' <input_file>

HTH,
tyler_durden

# 6  
Old 05-01-2009
Thanks :)

Oh, nice. Thanks a lot. I get it, and now I also understand the code, so I can modify it. Now I just need to write a script that fetches the IDs from sl.Wikisource.org and downloads the HTML code; I think that shouldn't be a problem ...

The next thing is to modify pywikipediabot to read data from a table ...

I'll report back on my progress :)