Script to extract forum posts


 
# 8  
Old 02-01-2011
Use:
Code:
 perl -ne  'if(/thread_id[^>]*>([^>]*)<\/a.*div style[^>]*>([^>]*)<\/div>/){print $1."\n\n".$2."\n\n";}' htmlFile

This User Gave Thanks to Klashxx For This Post:
# 9  
Old 02-01-2011
That doesn't seem to catch 'em all, though. It returns only 4 of the 25 titles and posts.

Looking at the result, I think it skips every post that contains a quote (represented as [user:quoted text]) or a link. For example, these two aren't caught:

Code:
thread_id=8083&page=3034#1299591">Sandlådan - Prataomvadsomhelstnärsomhelst-tråden</a><br><div style="padding:2px 0px 3px 0px;">[The Ultra:De är menade att skjutas upp i luften egentligen.] 
Den där typen av flares är ju inte till för att skjutas upp.</div>


Code:
thread_id=48046&page=1#1287691"><b>Arcaflex...?</b></a><br><div style="padding:2px 0px 3px 0px;">www.gamer.se

Youmeet, vad är upp?</div>

Or is it maybe the <b> </b> that breaks it in the second example?

This is the URL the HTML file comes from:

Gameplayer.se - Inlägg - KidCactus

Last edited by KidCactus; 02-01-2011 at 09:09 AM..
# 10  
Old 02-01-2011
A little dirty but:
Code:
perl -pe 's/\n//g' html > html2

Then:
Code:
perl -ne  'while(/<a[^>]*thread_id[^>]*>([^>]*)<\/a>[^<]*<br>[^>]*<div style[^>]*>([^>]*)<\/div>/g){print $1."\n\n".$2."\n\n";}' html2

# 11  
Old 02-01-2011
Dirty or not, I piped the first one to the second one, and it works like a charm! Thank you so much.

---------- Post updated at 10:38 PM ---------- Previous update was at 02:53 PM ----------

If someone has time to help me, I need some additional help with the following:

In an HTML file, I have this text:

Code:
KidCactus';"><div class="forum_thread_text"><span class="forum_text_quote"><strong>Andreas Berg:</strong> Det var inte många år sedan jag tog hjälp av Google för att koka ett ägg <img src="http://gameplayer.se/gfx/smilies/blush.gif" alt="[blush]" border=0 width=15 height=15> (till mitt försvar äter jag i princip aldrig ägg och har knappt gjort det alls, så det har inte riktigt funnits anledning för mig att veta hur länge ett ägg ska koka <img src="http://gameplayer.se/gfx/smilies/crazy.gif" alt="[crazy]" border=0 width=15 height=15>)</span><br/>Är du från <a class="forum_text_url" href="http://www.svd.se/nyheter/inrikes/artikel_774535.svd" target="_blank">Storbritannien</a>?<br/><br/>Jag googlar rätt ofta för att rättstava ord, eller för att kolla om vissa ord ens existerar utanför min hjärna.</div>

Anywhere in the file where this is found:

Code:
KidCactus';"><div class="forum_thread_text">

I want to cut out the text between that and:

Code:
</div>

So the result would be:

Code:
<span class="forum_text_quote"><strong>Andreas Berg:</strong> Det var inte många år sedan jag tog hjälp av Google för att koka ett ägg <img src="http://gameplayer.se/gfx/smilies/blush.gif" alt="[blush]" border=0 width=15 height=15> (till mitt försvar äter jag i princip aldrig ägg och har knappt gjort det alls, så det har inte riktigt funnits anledning för mig att veta hur länge ett ägg ska koka <img src="http://gameplayer.se/gfx/smilies/crazy.gif" alt="[crazy]" border=0 width=15 height=15>)</span><br/>Är du från <a class="forum_text_url" href="http://www.svd.se/nyheter/inrikes/artikel_774535.svd" target="_blank">Storbritannien</a>?<br/><br/>Jag googlar rätt ofta för att rättstava ord, eller för att kolla om vissa ord ens existerar utanför min hjärna.

If the <br/> tags could also be converted to newlines at the same time, that would be awesome. I have tried this, but something must be wrong, since I get no output at all:

Code:
perl -pne 's/\n//g' input.txt | perl -ne 'while(/KidCactus[^>]*forum_thread_text[^>]*>([^>]*)<\/div>/g){print $1."\n\n";}' > output.txt

Last edited by KidCactus; 02-01-2011 at 06:08 PM..
# 12  
Old 02-02-2011
Try this:

Code:
perl -ne 'while(m/KidCactus.;\"><div class=\"forum_thread_text\">(.*?)<\/div>/g){$a=$1;$a=~s/<br\/>/\n/g;print $a."\n";}' input.txt
