Location: Asia Pacific, Cyberspace, in the Dark Dystopia
Posts: 19,118
Thanks Given: 2,351
Thanked 3,359 Times in 1,878 Posts
Similar Threads for Man Pages - In Development
FYI,
I have been quietly updating the man page database adding "similar threads" for man pages.
STEP 1: Full Text MySQL DB Search Matches
The first step, after creating the DB columns, was to process each of the nearly 400K man pages: run a MySQL full-text search, match and score it against each post in the DB, and keep the top 15 matching threadids (or fewer than 15, depending on the matches and scores).
That process took a few days and resulted in around one third (forgot to record the stats at that point) of the man page entries having similar thread entries.
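For readers who want a feel for the "match, score, keep the top 15" idea in step 1, here is a minimal Python sketch. It is not the code used on the site: the real job ran MySQL full-text queries, whereas this stand-in scores by simple term overlap on in-memory data. The function names and the `(thread_id, post_text)` input shape are assumptions made for illustration.

```python
import re
from collections import Counter

MAX_MATCHES = 15  # the post keeps at most 15 thread ids per man page


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def top_threads_for_man_page(man_text, posts, limit=MAX_MATCHES):
    """Rank threads by term overlap with a man page (stand-in for full-text scoring).

    posts is an iterable of (thread_id, post_text) pairs (hypothetical shape).
    A thread's score is the best overlap score among its posts.
    """
    query_terms = set(tokenize(man_text))
    scores = Counter()
    for thread_id, post_text in posts:
        overlap = sum(1 for tok in tokenize(post_text) if tok in query_terms)
        if overlap:
            scores[thread_id] = max(scores[thread_id], overlap)
    # Highest score first; ties broken by thread id for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [thread_id for thread_id, _ in ranked[:limit]]
```

Man pages for which this returns an empty list are the "orphans" that the later steps try to match.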
STEP 2: Cross Reference Similar Man Pages in Thread DB Back to Man Page Entries
Then, for the remaining man pages with no entries from the process above (step 1), I took the similarman entries for each thread and did a simple boolean match for man page ids associated with each similar man page (created a number of weeks ago) and created a list of thread matches ordered by the thread reply count in the DB. That process will complete today (in about 3 hours from now, give or take) and there will remain a lot of man pages with no matches based on steps 1 and 2.
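The cross-reference in step 2 can be pictured as inverting the thread-to-man-page mapping. The sketch below assumes each thread record carries a `similarman` list of man page ids plus a `reply_count`; those field names and the dict layout are hypothetical, and the real work was done against the DB rather than in memory.

```python
def threads_by_man_id(threads):
    """Build an index: man page id -> threads that list it in 'similarman'."""
    index = {}
    for thread in threads:
        for man_id in thread["similarman"]:
            index.setdefault(man_id, []).append(thread)
    return index


def step2_matches(orphan_man_ids, threads, limit=15):
    """For each orphan man page, list matching thread ids, most replies first."""
    index = threads_by_man_id(threads)
    result = {}
    for man_id in orphan_man_ids:
        candidates = sorted(index.get(man_id, []),
                            key=lambda t: t["reply_count"], reverse=True)
        if candidates:  # man pages with no candidates remain orphans
            result[man_id] = [t["thread_id"] for t in candidates[:limit]]
    return result
```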
STEP 3: Boolean Matches Man Page Name with Thread Tags
Then, I will take the remaining man pages without any similar threads and repeat step two matching the name of the man page (only the query, for example 'sshd') against the tags for each thread, and order the matches by thread reply count, and keep up to 15 matches, as before.
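As a rough illustration of step 3, the following Python sketch matches a man page name against thread tags and orders the hits by reply count. The `tags`, `reply_count` and `thread_id` fields are assumed names, and an exact, case-insensitive tag comparison stands in for the boolean match described above.

```python
def step3_matches(man_name, threads, limit=15):
    """Return up to `limit` thread ids whose tags include the man page name."""
    name = man_name.lower()
    hits = [t for t in threads if name in {tag.lower() for tag in t["tags"]}]
    # Most-replied threads first, as described in the post.
    hits.sort(key=lambda t: t["reply_count"], reverse=True)
    return [t["thread_id"] for t in hits[:limit]]
```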
After that, I will look at the remaining unmatched man pages to threads and decide what match I can try next.
The purpose of all this is to create more relevant content for each man page in the DB by providing users with a list of discussion threads related to that man page; hence the name "similar threads for man pages". In addition, this could help SEO, as Google is only including between 10 and 15% of our entire man page collection in its index. I would like to increase this percentage in 2020 to somewhere between 25 and 40%.
Currently, there are a few hours remaining for step 2:
After step 2 is done, I will start step 3 (but I will remember to record a few simple stats before I start step 3).
OK.
Step 2 is done:
But as you can see, even after processing all the orphans (man pages without matches) for MySQL full-text matches (step 1) and cross referencing similarman in threads with the man pages (step 2), a whopping 63% of all man pages are still similar thread orphans:
So, I'm on to step 3 now. Boolean matches between the name of the man page and the tags (update: and the thread titles) for each thread, ordered by reply count. I changed the process (from my step 3 above) to match both thread tags and thread titles, to see if this helps speed things along toward the goal of all the man pages having at least one similar thread entry.
Let's see what happens twelve hours from now after this batch processing finishes.
I may move to straightforward boolean matches in the text of the posts (against the name of the man page) for step 4, but that seems too crude, so I'll need to ponder on this, and test it, later. But if I add the operating system, that might be too refined and result in a very small number of matches, since we have always had quite a hard time getting people to describe their OS when they post a question!
If everyone posted system details when they asked a question, the matches would be a lot better; but most people rarely do.
Step 3 is done. The result is that the orphans have dropped from 63% to 53%.
So, I now start step 4:
STEP 4: Boolean Matches Man Page Name with Post Text
I will take the remaining man pages without any similar threads and repeat step three but matching the name of the man page (only the query, for example 'sshd') against the page text for each post and get the threadid from the post, and order the matches by the number of times the thread was thanked by users, and keep up to 15 matches, as before.
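A small Python sketch of the step 4 idea follows. It assumes posts are dicts with `thread_id` and `text`, and that thanks counts per thread are available in a separate mapping; both shapes are hypothetical. A whole-word regular expression is used here as a stand-in for the boolean match, so 'sshd' does not match 'sshd_config'.

```python
import re


def step4_matches(man_name, posts, thread_thanks, limit=15):
    """Find threads with a post mentioning the man page name, most-thanked first."""
    # Whole-word, case-insensitive match on the man page name.
    pattern = re.compile(r"\b" + re.escape(man_name) + r"\b", re.IGNORECASE)
    # A set collapses several matching posts in one thread to a single thread id.
    thread_ids = {p["thread_id"] for p in posts if pattern.search(p["text"])}
    ranked = sorted(thread_ids, key=lambda tid: (-thread_thanks.get(tid, 0), tid))
    return ranked[:limit]
```

Matching on short, common names this way is crude, which is consistent with the concern raised earlier in the thread about plain post-text matches.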
We will see how many man page orphans find thread relatives in this step 4 manner.
Running... it looks like this query and update will take around a week or so, as I am pacing it so as not to overload the server.
...and so far, it looks like the orphans will only be reduced by a relatively small amount (less than 15% of the total remaining orphans, I guess... let's see).
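Spreading a long job out so it does not overload the server usually means working in small batches with a pause between them. The sketch below is a generic Python illustration of that pattern, not the cron job actually used here; the batch size, pause length, and the injectable `sleep` parameter are all assumptions.

```python
import time


def process_in_batches(items, handle, batch_size=100, pause_seconds=1.0,
                       sleep=time.sleep):
    """Run handle(item) over items in batches, pausing between batches.

    The `sleep` argument is injectable so the pacing can be tested or tuned.
    Returns the number of items processed.
    """
    processed = 0
    for start in range(0, len(items), batch_size):
        for item in items[start:start + batch_size]:
            handle(item)
            processed += 1
        # Pause only if another batch is still to come.
        if start + batch_size < len(items):
            sleep(pause_seconds)
    return processed
```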
Update:
I may stop the cron job (step 4) which processes similar threads for man pages using only the name of the man page and the text of posts.
I am not really getting enough "bang for the buck" from loading the server with these batch jobs in the background (but the server is now at its lowest load point of the year, so I will wait a few more days before deciding to stop this cron). We can see that approximately 15% of the queries result in a match:
That means, for now, I'm going to put this project on hold. Here are the intermediate results, showing 52% orphans, which is an improvement over the early 63% orphan stat:
Hi Neo,
my 2 cents:
You may already have done this, but if not: knowing the type of processing involved, I would, as you did, have chosen a calm period for the task. To avoid wasting processor time in the various caches, I would also try to optimize what I can, where I can. I'm not sure you can change the cache ratio of the FS or the underlying storage (I suppose that is more the provider's duty...), but since you have access to your RDBMS kernel, I would reduce its cache working storage to force reads of the true data. This is efficient for big batch processes when you know you are after data that is not often read (so there is no chance of finding it in the caches). Of course, it impacts ordinary online interactive work, but as you have fewer requests from online users at that time, it's acceptable... It should improve your step 4 a bit...
Thanks Victor,
Sounds good; but I don't want to put much more effort into this with all the other projects I have ongoing. If you have the exact PHP code for MySQL to do as you suggested, then that might take less of my time to implement. Right now I am quite busy on non-unix.com tags, working with a number of people and vendors on LoRa and NB-IoT networking gear, regulatory issues, chips, specifications, development boards, gateways, code, libs, etc.
OBTW, here is an example of this "similar thread for man pages" working (man nologin):
I think these "similar threads for man pages" combined with my earlier "similar man pages for man pages" will help with search engine indexing the man pages with thin content (SEO).
I am still considering a "Step 5" to deal with the orphans, which I am guessing will end up (after "Step 4") being around half of the total man page repo, as follows:
Split the names of man pages with underscores, dashes, colons, etc. in the name of the man page and search titles, tags and posts on one or more of those substrings.
Use the operating system of the man page query, and cross reference to the most popular threads in the corresponding forums (Linux, Solaris, AIX, etc).
Maybe something else... maybe not. Will sleep on this, since sleep always brings more ideas and solutions. I always get a lot of work done and new ideas when I sleep and dream.
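The name-splitting idea in "Step 5" could look roughly like the Python sketch below. It splits only on the separators named above (underscores, dashes, colons); the minimum substring length of 3 and the lowercasing are assumptions added to avoid searching on very short fragments.

```python
import re


def name_substrings(man_name, min_len=3):
    """Split a man page name on underscores, dashes and colons.

    Returns unique lowercase substrings of at least min_len characters,
    in their original order, as candidate search terms.
    """
    parts = re.split(r"[_\-:]+", man_name)
    seen = []
    for part in parts:
        part = part.lower()
        if len(part) >= min_len and part not in seen:
            seen.append(part)
    return seen
```

Each returned substring could then be fed to the same title, tag, or post matching used in steps 3 and 4.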
Can anyone supply me with the man pages for:
omnidatalist
omnibarlist
omnisap.exe
I prefer the source man pages in nroff format.
A clue about the software bundles which supply these man pages is fine as well.
OS: HP-UX
TIA (11 Replies)
Hi everyone,
I have a small query, in solaris the man pages get displayed on half of the terminal , can i get a full terminal or full screen display ?:) (2 Replies)
Hello sir,
I am using FEDORA 9.
I wanted to know why do we have ".1" extension in the archives of man pages. I know we are giving format.
I want to know the importance or purpose of this format.
Can you please tell me :confused: (2 Replies)
Hi,
I want to install man pages package from solaris 10.
Solaris 10 has already been installed on my server but I have to add the man pages packages. I searched the internet for a long time for this package but I didn't find a compatible one... So I downloaded Solaris 10 from Sun site to get this... (1 Reply)
can anybody explain me how to read unix
man pages?
for example when i want to get information about ps command
man ps gives me this output:
***********************************
Reformatting page. Please wait... completed
ps(1) ... (2 Replies)
When reading man pages, I notice that sometimes commands are followed by a number enclosed in parentheses, such as:
mkdir calls the mkdir(2) system call.
What exactly does this mean? (4 Replies)
Hi,
I've now written some man pages, but I don't know how to get 'man' to view them. Where do I have to put these files, and which directories are allowed??
THX Bensky (3 Replies)
Hello ,
I just installed openssh in my system . I actually tried to man sshd but it says no entry , though there is a man directory in the installation which have the man pages for sshd .
Can anyone tell me how should i install these man pages .
DP (2 Replies)