Update on vB3 Migration to Discourse - Issues and Status of BBCode Transformations


 
# 1  
Old 03-31-2020

We "completed" the migration of this vB3 site to Discourse a number of days ago. However, deeper testing by @Scrutinizer and @MadeInGermany revealed that a lot of text was mangled in the migration. We traced these bugs to two issues:
  1. A minor bug in the Ruby vbulletin.rb migration script, which transformed "\n" in code fragments into hard breaks; and
  2. A major bug in the recommended migration Ruby gem ruby-bbcode-to-md, which mangled all code fragments containing left square brackets (it strips them out completely!).

These are serious migration bugs that affect the integrity of all the hard work our members have contributed over the years, and they must be corrected.

I posted a bug report over at meta-discourse on the major bug in the Ruby gem ruby-bbcode-to-md, but the maintainer of that repo shut me down, deleted my bug reports, washed his hands of a repo with his name on it, and acted very unprofessionally, even though the bug was easily confirmed. This was not encouraging and brought my spirits down a bit at the time.

@Scrutinizer then came to the rescue (thank you!), took a step back, and did a formal analysis of the Ruby preprocessing script and the various bbcode transformations including the:
  • Ruby preprocessing method in the vbulletin.rb migration script
  • Discourse builtin bbcode support
  • Ruby gem ruby-bbcode-to-md Discourse plugin

At the same time, I was working on:
  1. Debugging the ruby-bbcode-to-md discourse plugin
  2. Writing a new PHP script to reprocess the pagetext from the vB3 MySQL DB and replace the text mangled by the ruby-bbcode-to-md gem in the Discourse Postgres DB.

@Scrutinizer created a spreadsheet and did the analysis and determined that the mangler gem ruby-bbcode-to-md was not required.

In addition, @Scrutinizer suggested we install the discourse-bbcode plugin and test it.

With good preliminary results from discourse-bbcode, I started to learn how to modify a Discourse plugin and found that the discourse-bbcode plugin was not difficult to modify (straightforward JavaScript), so I set up a development environment on my desktop as follows:
  1. Forked the discourse-bbcode on GitHub to neo-discourse-bbcode
  2. Cloned the neo-discourse-bbcode to my desktop
  3. Modified neo-discourse-bbcode using Visual Studio Code
  4. Pushed changes to my newly minted forked neo-discourse-bbcode repo on GitHub
  5. Rebuilt and tested our staging Discourse apps using the modified neo-discourse-bbcode repo in the app.yml build file.

From this test setup, I was able to add new "preliminary" bbcode tags. For example, I created two new tags which correspond (roughly) to two of our legacy tags:
  1. ICODE tag
  2. MOD tag

This forked repo is still a work-in-progress and I am still learning to modify and test and try to add other tags. Right now, I'm having some issues with preformatted elements, but that's a discussion for another day.

GitHub - unixneo/neo-discourse-bbcode: vBulletin BBCode plugin

Although we have made progress, @Scrutinizer also discovered that the Markdown fenced code blocks and the CODE bbcode tag built into Discourse do not permit BBCode inside the code block.

This means that our technique of using color to highlight sections of code in blocks when helping others currently has no solution, but we are working on this. Currently, in the migration script this bbcode is stripped out during migration. However, we want to find a way to preserve this capability and feature if possible.
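As a rough illustration of what "stripping out" means here, the sketch below removes color tags only when they sit inside a code block and leaves the rest of the post alone. This is not the actual code in the migration script; the method name strip_color_in_code and the exact tag patterns are assumptions made for the example.

```ruby
# Hypothetical helper, not taken from vbulletin.rb.
# Matches [color=...], [color] and [/color], case-insensitively.
COLOR_TAG = %r{\[/?color(?:=[^\]]*)?\]}i

# Remove color tags inside [code]...[/code] blocks only.
def strip_color_in_code(text)
  text.gsub(%r{(\[code\])(.*?)(\[/code\])}im) do
    open_tag, inner, close_tag = $1, $2, $3
    open_tag + inner.gsub(COLOR_TAG, "") + close_tag
  end
end

post = "Use [color=red]this[/color]:\n[CODE]ls [COLOR=red]-l[/COLOR] /tmp[/CODE]"
puts strip_color_in_code(post)
```

The color tag in the prose survives, while the one inside the code block is dropped, which is exactly the feature loss discussed in this post.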

However, there are gremlins around the corner:

Fixing this will more than likely require us to disable the Discourse builtin CODE tag (so far, I have not been able to override this in a plugin), and there is just about zero chance the busy folks at meta discourse will support this or even answer my question in a helpful way if I ask how to do this (disable or override their builtin code tag). Historical discussions from the meta discourse team show a near religious passion for markdown, and any deviation from what they perceive as "right" is not well received. Plus, this is a migration issue, and they are focused on the future, not the past (understandably). I see zero chance asking about this will be well received over there.

So the current options seem to be:
  • Hack the Discourse code base to disable their builtin code tag and write a plugin to do this (not certain how to do this, really and it will not be persistent when Discourse upgrades, so this seems not a reasonable possibility at the moment).
  • Ask "how to do this on meta" and get beat up severely by the meta team, who will surely oppose this and tell us to "get over it" and "move on".
  • Strip out all bbcode in block code tags during migration (the current "solution").
  • Look at highlight.js and see if we can hack that to get bbcode to work (just a wild idea at the moment).
  • Something yet to appear in the fog of all this.

I don't consider removing color indicators from our code blocks and stripping out bbcode from these fenced code blocks a major issue; but there are some who will consider it a big deal, possibly. We will be "losing" this well-liked feature if we strip it out.

So, we are still looking into this.

However, I am not planning to ask on meta. Based on historical discussions on code tags there and my "not good" experiences with two migration bug reports over there, I am sure that question will not turn out well. Migration issues are not well received over there (they are working on building for the future, and this is understandable, honestly) and we are pretty much "on our own" on this.

That is the latest.

Last edited by vbe; 03-31-2020 at 05:06 AM..
These 4 Users Gave Thanks to Neo For This Post:
# 2  
Old 04-02-2020
Today, armed with a new migration script, vbulletin_neo7.rb, I will start a migration from scratch on the staging server in order to get raw preprocessed posts from the postgres DB and upload them to the community site. First, I will do this on the staging site, as follows:
  • Start a new migration from scratch using vbulletin_neo7.rb. This will take a few days.
  • Test some problematic posts and see if they migrated correctly (I tested this yesterday on a small scale, and it looked fine). If so:
  • Dump the raw posts from the postgres DB created above, along with the mapping table between the vb posts and the discourse posts.
  • Restore the staging server with the current community snapshot.
  • Move the raw posts from the new migration to the restored staging server DB.
  • Test.

Yesterday, I tested the migration script vbulletin_neo7.rb and it worked OK.

@Scrutinizer is also working on some other bbcode migration enhancements which we will apply when he is ready to test (he has a full time job and family demands so no hurry or worry).

I want to create a new baseline without the broken ruby-bbcode-to-md plugin, using our new vbulletin_neo7.rb script and our new custom bbcode plugin, neo-discourse-bbcode, which has a solid new ICODE bbcode tag working.
# 3  
Old 04-02-2020
Update:

Change in direction (too slow to keep doing the migration over and over).

Create / write a Ruby script (done):
  • Retrieve the mappings from the vB posts to the Discourse posts stored in the Discourse postgres DB.
  • Use these postid-to-postid mappings to grab the original vB post text from each vB post in the original mysql DB.
  • Preprocess the vB post text
  • Postprocess the vB post text
  • Update the raw post in the Discourse DB
  • Test and redo.
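The steps above can be sketched roughly as follows. This is only a toy version: plain Ruby hashes stand in for the MySQL and Postgres databases, and the preprocess and postprocess bodies are hypothetical placeholders (line-ending cleanup and trailing-whitespace removal), not the real transformations in our script.

```ruby
# Hypothetical preprocessing step: normalize Windows line endings.
def preprocess(text)
  text.gsub("\r\n", "\n")
end

# Hypothetical postprocessing step: drop trailing whitespace.
def postprocess(text)
  text.rstrip
end

# post_map:      Discourse post id => vB postid (the stored mapping)
# vb_posts:      vB postid => original pagetext (stand-in for the MySQL DB)
# discourse_raw: Discourse post id => raw (stand-in for the Postgres DB)
# Returns the number of posts whose raw text was replaced.
def reprocess!(post_map, vb_posts, discourse_raw)
  updated = 0
  post_map.each do |discourse_id, vb_id|
    original = vb_posts[vb_id]
    next if original.nil? # no source text, leave the Discourse post alone
    discourse_raw[discourse_id] = postprocess(preprocess(original))
    updated += 1
  end
  updated
end

vb_posts      = { 101 => "line one\r\nline two   " }
discourse_raw = { 9001 => "mangled text" }
post_map      = { 9001 => 101 }

count = reprocess!(post_map, vb_posts, discourse_raw)
puts "updated #{count} post(s)"
p discourse_raw[9001]
```

Because only the raw column is rewritten, the loop can be rerun as often as needed before committing to the slow rebake.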

The script above processes about a million posts in 45 minutes (much faster), and when we are happy with the results we can rebake the raw posts into the cooked posts. Rebaking 1M posts takes about 16+ hours, so we avoid this when possible.

Ran this yesterday and found that all the bugs posted by @MadeInGermany before (mangled code, missing left square brackets) and the hard line break error reported by @Scrutinizer (where \n in code fragments was converted to hard line breaks) were fixed.

However, still more gremlins to slay, working on:
  • Fixing missing emoji in the preprocessing. In particular, the thumbs up emoji that Ravinder loves to use, :b:, now converts to :+1:. DONE
  • Fixing a bug in attachments and other images. DONE
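For the emoji fix, a lookup table keeps the conversion easy to extend. The sketch below is an assumption about how this could be written, not the code in our script; only the :b: to :+1: pair comes from this thread, and EMOJI_MAP and convert_emoji are made-up names.

```ruby
# Legacy vB smilie code => Discourse emoji code.
# Only the :b: mapping is taken from the thread; add more pairs as needed.
EMOJI_MAP = { ":b:" => ":+1:" }.freeze

# One pattern that matches any legacy code in the table.
EMOJI_PATTERN = Regexp.union(EMOJI_MAP.keys)

# Replace every legacy smilie code with its Discourse equivalent.
def convert_emoji(text)
  text.gsub(EMOJI_PATTERN, EMOJI_MAP)
end

puts convert_emoji("Nice script :b: thanks :b:")
```

Note that this simple version also rewrites :b: inside code fragments, so a real pass would probably need to skip those.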

However, the main reported gremlins in code fragments appear to be fixed. Now working on other missing transformations (missing emoji, images, etc).

Making progress... slowly but surely.

All work currently done on test / staging server only.
# 4  
Old 04-03-2020
Also, I am finding, by trial and error, that even before preprocessing with the Ruby migration scripts, there are some transformations which are easily done on the copy of the vB3 MySQL DB dump; for example:

Code:
UPDATE post SET  pagetext= LOWER(REGEXP_REPLACE(pagetext,'\\[ATTACH\\](.*)\\[\\/ATTACH\\]', '[IMG]https://www.unix.com/attachment.php?attachmentid=\\1[/IMG]'));

There was some bug in the Ruby attachment preprocessing and some attachment ids were lost, so instead of wasting time trying to find the bug in the Ruby preprocessing routine, it was easier to do the regex search and replace in the staged copy of the DB dump.

All these attachment images are automatically downloaded to the new discourse forum over time as well.

Note: Not being an expert in MariaDB REGEXP_REPLACE, I started with (.*) and worked my way up. The matches only worked with double backslashes (escapes) and would not match with single backslashes. None of the examples on the net worked in this REGEXP_REPLACE expression (most showed no backslashes or only one backslash), but I could get it to work by building it up from .* step by step.
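A likely explanation for the double backslashes (my reading, not something confirmed in this thread): in MariaDB's default SQL mode the backslash is an escape character inside string literals, so one backslash is consumed by the string parser before the regex engine ever sees the pattern. Ruby double-quoted strings behave the same way, so the effect can be reproduced outside the database. The constant and method names below are made up for the example.

```ruby
SAMPLE_POST = "[ATTACH]123[/ATTACH]"

# Two backslashes in the literal: the regex engine receives \[ATTACH\]...
DOUBLE_ESCAPED = "\\[ATTACH\\](.*)\\[/ATTACH\\]"
# One backslash in the literal: the string layer eats it, leaving [ATTACH]...,
# which the regex engine reads as a character class, not a literal tag.
SINGLE_ESCAPED = "\[ATTACH\](.*)\[/ATTACH\]"

# Return the first capture group of the pattern, or nil if there is no match.
def captured_id(pattern_source, text)
  text[Regexp.new(pattern_source), 1]
end

puts DOUBLE_ESCAPED
puts SINGLE_ESCAPED
p captured_id(DOUBLE_ESCAPED, SAMPLE_POST)
p captured_id(SINGLE_ESCAPED, SAMPLE_POST)
```

Printing the two strings shows what actually reaches the regex engine, which is usually the quickest way to debug this kind of escaping problem.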

Update: This change is done and confirmed working on the staging server.
# 5  
Old 04-04-2020
According to a quick check, 35 posts containing tables have been stripped or skipped from the migration:


Code:
MariaDB [vb3]> select count(postid) from post where pagetext like '%[table="head"]%';
+---------------+
| count(postid) |
+---------------+
|            35 |
+---------------+
1 row in set (5.559 sec)

MariaDB [vb3]> exit
Bye
root@discourse1-app:/shared/neo/bin# /shared/neo/bin/pg
psql (10.12 (Debian 10.12-1.pgdg100+1))
Type "help" for help.

discourse=> select count(id) from posts where raw like '%[table="head"]%';
 count 
-------
     0
(1 row)

We can easily write code to convert these tables to markdown; but since the posts were stripped from the migration, and this would require a totally fresh migration from the start, I am inclined, at this time, to just drop these 35 posts from the new site.

We could add them at a later time, manually, as doing this manually for 35 posts would take a few hours, but redoing the migration will take a week and be painful. It would take a few hours to write and test the script, to make sure it works anyway.

So, my inclination is not to worry about losing 35 posts with TABLE tags at this time (and to manually add them back in the future).
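For reference, a conversion script could look something like the sketch below. It assumes the legacy table markup puts one row per line with cells separated by "|" and the header in the first row; that layout is an assumption for this example and should be checked against the 35 real posts first. The method name vb_table_to_markdown is made up.

```ruby
# Convert [table="head"]...[/table] blocks to markdown tables.
# ASSUMPTION: one row per line, cells separated by "|", first row is the header.
def vb_table_to_markdown(text)
  text.gsub(%r{\[table="head"\](.*?)\[/table\]}im) do
    rows = $1.lines.map(&:strip).reject(&:empty?)
             .map { |line| line.split("|").map(&:strip) }
    next "" if rows.empty? # an empty table becomes nothing

    header, *body = rows
    out = []
    out << "| #{header.join(' | ')} |"
    out << "|#{header.map { '---' }.join('|')}|"
    body.each { |cells| out << "| #{cells.join(' | ')} |" }
    out.join("\n")
  end
end

puts vb_table_to_markdown("[table=\"head\"]Name|Size\nfoo|10\nbar|20[/table]")
```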
# 6  
Old 04-05-2020
Update (from one hour ago):


Updated the new community sites with latest posts, new users, likes, etc. from legacy site.

Ran an early preprocessing script against the legacy DB.

@Scrutinizer is testing a more refined version of preprocessing which will do even more migration magic. When he is ready, we will run his preprocessing script against the legacy DB and see how it looks.

Thanks for your patience.

We have already fixed the two bugs that we found in the initial launch; but are working to refine more custom bbcode issues.
# 7  
Old 04-05-2020
Well, as an update, this earlier MariaDB REGEX was flawed, my bad.

Code:
UPDATE post SET  pagetext= LOWER(REGEXP_REPLACE(pagetext,'\\[ATTACH\\](.*)\\[\\/ATTACH\\]', '[IMG]https://www.unix.com/attachment.php?attachmentid=\\1[/IMG]'));

Should have been

Code:
UPDATE post SET  pagetext= REGEXP_REPLACE(pagetext,'\\[ATTACH\\](.*?)\\[\\/ATTACH\\]', '[IMG]https://www.unix.com/attachment.php?attachmentid=\\1[/IMG]');

Wrapping the whole REGEXP_REPLACE call in LOWER lowercased not just the matched tags but all of the text in every post (unexpectedly). That will be fixed after 12 hours of rebaking the 1M posts.

In addition, my original REGEX was too greedy and I had to add the ? to make it less greedy. Somehow, I missed that during initial testing.
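Both problems are easy to reproduce outside the database. The sketch below uses Ruby's regex engine rather than MariaDB's, so treat it as an analogy; the sample post text and the names IMG_TEMPLATE, GREEDY, LAZY and convert_attachments are made up for the example.

```ruby
IMG_TEMPLATE = "[IMG]https://www.unix.com/attachment.php?attachmentid=\\1[/IMG]"

GREEDY = %r{\[ATTACH\](.*)\[/ATTACH\]}  # original pattern: .* runs to the LAST [/ATTACH]
LAZY   = %r{\[ATTACH\](.*?)\[/ATTACH\]} # fixed pattern: .*? stops at the FIRST [/ATTACH]

# Rewrite ATTACH tags into IMG tags using the given pattern.
def convert_attachments(text, pattern)
  text.gsub(pattern, IMG_TEMPLATE)
end

post = "See [ATTACH]11[/ATTACH] and [ATTACH]22[/ATTACH] in MyScript.sh"

puts convert_attachments(post, GREEDY)        # both tags swallowed into one IMG
puts convert_attachments(post, LAZY)          # one IMG per attachment
puts convert_attachments(post, LAZY).downcase # what wrapping in LOWER does to the post
```

With two attachments in one post, the greedy pattern produces a single broken IMG tag, and the lowercased line shows how file names and other case-sensitive text in the post get damaged.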

Thanks to @Peasant for catching that bug quickly today.

We are "mind warped" writing and rewriting all this (new to us) Ruby migration code. The extra eyes, fresh perspectives, and bug hunting all are much appreciated and valuable contributions to the final migration success (whenever that happens, LOL)