Similar Threads for Man Pages - In Development


 
# 1  
Old 12-29-2019

FYI,

I have been quietly updating the man page database, adding "similar threads" to each man page.

STEP 1: Full Text MySQL DB Search Matches

The first step, after creating the DB columns, was to process each of the roughly 350K man pages: run a MySQL full-text search against every post in the DB, score the matches, and keep the top 15 matching threadids (or fewer than 15, depending on the matches and scores).
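
As a rough illustration of the shape of step 1, here is a minimal Python sketch. It does not use MySQL's full-text engine; a simple count of shared terms stands in for the relevance score, and the names (`top_threads`, the `posts` pairs, `max_matches`) are hypothetical, not taken from the actual job.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase a string and split it into a set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def top_threads(man_text, posts, max_matches=15):
    """Return up to max_matches thread ids, best-scoring first.

    posts is an iterable of (threadid, post_text) pairs. The score is the
    number of terms a post shares with the man page; this is a stand-in
    for a full-text relevance score, not MySQL's actual ranking.
    """
    man_terms = tokenize(man_text)
    best = defaultdict(int)  # threadid -> best score of any post in the thread
    for threadid, post_text in posts:
        score = len(man_terms & tokenize(post_text))
        if score > best[threadid]:
            best[threadid] = score
    # Highest score first; ties broken by thread id for a stable order.
    ranked = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tid for tid, score in ranked if score > 0][:max_matches]
```

A thread is ranked by its best-matching post, so one strongly related post is enough to surface the thread.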

That process took a few days and left around one third of the man page entries with similar thread entries (I forgot to record the stats at that point).

STEP 2: Cross Reference Similar Man Pages in Thread DB Back to Man Page Entries

Then, for the remaining man pages with no entries from step 1, I took the similarman entries for each thread (the "similar man pages" lists created a number of weeks ago) and did a simple boolean match on the man page ids in those entries, producing a list of thread matches for each man page ordered by the thread reply count in the DB. That process will complete today (in about 3 hours, give or take), and there will still be a lot of man pages with no matches from steps 1 and 2.
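
The cross-reference in step 2 can be pictured as inverting the thread-to-man-page lists. The following Python sketch is only an illustration under assumed data shapes: the tuple layout `(threadid, replycount, similarman_ids)` and the function name are hypothetical, and the real job works against the DB rather than in-memory lists.

```python
from collections import defaultdict

def cross_reference(threads, orphan_ids, max_matches=15):
    """Map each orphan man page id to the threads that already list it.

    threads is an iterable of (threadid, replycount, similarman_ids).
    Matches are ordered by reply count, highest first, and capped at
    max_matches. Orphans that no thread lists are left out of the result.
    """
    wanted = set(orphan_ids)
    by_man = defaultdict(list)  # man page id -> [(replycount, threadid), ...]
    for threadid, replycount, similarman_ids in threads:
        for man_id in set(similarman_ids):
            if man_id in wanted:
                by_man[man_id].append((replycount, threadid))
    result = {}
    for man_id, hits in by_man.items():
        hits.sort(key=lambda h: (-h[0], h[1]))
        result[man_id] = [threadid for _, threadid in hits[:max_matches]]
    return result
```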

STEP 3: Boolean Matches Man Page Name with Thread Tags

Then, I will take the remaining man pages without any similar threads and repeat step 2, this time matching the name of the man page (only the query, for example 'sshd') against the tags for each thread, ordering the matches by thread reply count and keeping up to 15 matches, as before.
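
A minimal sketch of the step 3 lookup, using Python's built-in sqlite3 module so that it runs anywhere. The schema here (a `thread` table plus a one-row-per-tag `thread_tag` table) and the sample rows are assumptions made for the example; the forum's real tables and the MySQL query are not shown in this thread.

```python
import sqlite3

def build_demo_db():
    """Create a tiny in-memory DB with made-up threads and tags."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE thread (threadid INTEGER PRIMARY KEY, replycount INTEGER);
        CREATE TABLE thread_tag (threadid INTEGER, tag TEXT);
    """)
    conn.executemany("INSERT INTO thread VALUES (?, ?)",
                     [(1, 4), (2, 11), (3, 0)])
    conn.executemany("INSERT INTO thread_tag VALUES (?, ?)",
                     [(1, "sshd"), (1, "openssh"), (2, "SSHD"), (3, "cron")])
    return conn

def threads_for_name(conn, name, max_matches=15):
    """Return thread ids tagged with the man page name, most replies first."""
    rows = conn.execute(
        """
        SELECT DISTINCT t.threadid, t.replycount
        FROM thread AS t
        JOIN thread_tag AS g ON g.threadid = t.threadid
        WHERE lower(g.tag) = lower(?)
        ORDER BY t.replycount DESC, t.threadid
        LIMIT ?
        """,
        (name, max_matches),
    ).fetchall()
    return [row[0] for row in rows]
```

Calling `threads_for_name(build_demo_db(), "sshd")` returns the two tagged threads with the busier one first. An exact, case-insensitive tag comparison is used here; a looser match would find more threads at the cost of relevance.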

After that, I will look at the man pages that still have no matching threads and decide what match to try next.

The purpose of all this is to create more relevant content for each man page in the DB by providing users with a list of discussion threads related to the man page; hence the name "similar threads for man pages". In addition, this could help SEO, as Google currently includes only between 10 and 15% of our man page collection in its index. I would like to increase this percentage in 2020 to somewhere between 25 and 40%.

Currently, there are a few hours remaining for step 2:

Code:
1577593027 Time: 54 Inserts: 116 Floor: 6000 Limit: 300 ToDo: 64839 RemainingTime: 3.6 Hours QLoad: 1.06
1577593080 Time: 55 Inserts: 103 Floor: 6000 Limit: 300 ToDo: 64548 RemainingTime: 3.6 Hours QLoad: 1.17
1577593138 Time: 53 Inserts: 110 Floor: 6000 Limit: 300 ToDo: 64248 RemainingTime: 3.6 Hours QLoad: 1.27
1577593196 Time: 53 Inserts: 108 Floor: 6000 Limit: 300 ToDo: 63948 RemainingTime: 3.6 Hours QLoad: 1.23
1577593257 Time: 53 Inserts: 98 Floor: 6000 Limit: 300 ToDo: 63648 RemainingTime: 3.5 Hours QLoad: 1.04
1577593332 Time: 54 Inserts: 108 Floor: 6000 Limit: 300 ToDo: 63344 RemainingTime: 3.5 Hours QLoad: 1.01

After step 2 is done, I will start step 3 (but I will remember to record a few simple stats before I start step 3).
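
The RemainingTime column in the log above is consistent with a simple estimate: the ToDo count divided by the Limit gives the number of batches left, and each batch takes roughly one minute. The one-minute interval is an inference from the timestamps (which are 53 to 75 seconds apart), not something stated in the log, and the function name is made up for this sketch.

```python
def remaining_hours(todo, limit, batch_seconds=60):
    """Estimate hours left if `limit` pages are handled every batch_seconds."""
    batches = todo / limit
    return batches * batch_seconds / 3600

# First log line above: ToDo 64839, Limit 300
print(round(remaining_hours(64839, 300), 1))  # 3.6
```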
# 2  
Old 12-29-2019
OK.

Step 2 is done:

Code:
1577606166 Time: 53 Inserts: 74 Floor: 6000 Limit: 300 ToDo: 2250 RemainingTime: 0.1 Hours QLoad: 1.74
1577606242 Time: 56 Inserts: 85 Floor: 6000 Limit: 300 ToDo: 1950 RemainingTime: 0.1 Hours QLoad: 1.39
1577606278 Time: 55 Inserts: 97 Floor: 6000 Limit: 300 ToDo: 1750 RemainingTime: 0.1 Hours QLoad: 1.49
1577606342 Time: 53 Inserts: 97 Floor: 6000 Limit: 300 ToDo: 1450 RemainingTime: 0.1 Hours QLoad: 1.48
1577606398 Time: 52 Inserts: 75 Floor: 6000 Limit: 300 ToDo: 1150 RemainingTime: 0.1 Hours QLoad: 1.34
1577606470 Time: 53 Inserts: 75 Floor: 6000 Limit: 300 ToDo: 850 RemainingTime: 0.0 Hours QLoad: 1.26
1577606526 Time: 54 Inserts: 101 Floor: 6000 Limit: 300 ToDo: 550 RemainingTime: 0.0 Hours QLoad: 1.26
1577606589 Time: 55 Inserts: 60 Floor: 6000 Limit: 300 ToDo: 250 RemainingTime: 0.0 Hours QLoad: 1.17
1577606633 Time: 51 Inserts: 70 Floor: 6000 Limit: 300 ToDo:  RemainingTime: 0.0 Hours QLoad: 1.41
1577606650 Time: 1 Inserts: 0 Floor: 6000 Limit: 300 ToDo:  RemainingTime: 0.0 Hours QLoad: 1.25
1577606714 Time: 1 Inserts: 0 Floor: 6000 Limit: 300 ToDo:  RemainingTime: 0.0 Hours QLoad: 0.83
1577606764 Time: 1 Inserts: 0 Floor: 6000 Limit: 300 ToDo:  RemainingTime: 0.0 Hours QLoad: 0.48
1577606826 Time: 1 Inserts: 0 Floor: 6000 Limit: 300 ToDo:  RemainingTime: 0.0 Hours QLoad: 0.57
1577606884 Time: 0 Inserts: 0 Floor: 6000 Limit: 300 ToDo:  RemainingTime: 0.0 Hours QLoad: 0.40

But as you can see, even after processing all the orphans (man pages without matches) with MySQL full-text matches (step 1) and cross-referencing similarman in threads with the man pages (step 2), a whopping 63% of all man pages are still similar thread orphans:

Code:
mysql> select count(1) from neo_man_page_entry where similarthread ="none"; select count(1) from neo_man_page_entry;
+----------+
| count(1) |
+----------+
|   218444 |
+----------+
1 row in set (0.82 sec)

+----------+
| count(1) |
+----------+
|   347938 |
+----------+
1 row in set (0.00 sec)

mysql>
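
The 63% figure follows from the two counts above: 218444 / 347938 is about 62.8%. As a sketch, both counts and the percentage can also be fetched in a single pass with a conditional sum. The example below uses Python's sqlite3 module with a made-up, two-column stand-in for `neo_man_page_entry`; the real table's layout beyond the `similarthread` column is not shown in this thread, so treat the schema and sample rows as assumptions.

```python
import sqlite3

def orphan_share(conn, marker="none"):
    """Return (orphans, total, percent) for neo_man_page_entry in one query."""
    orphans, total = conn.execute(
        "SELECT SUM(CASE WHEN similarthread = ? THEN 1 ELSE 0 END), COUNT(*) "
        "FROM neo_man_page_entry",
        (marker,),
    ).fetchone()
    if not total:
        return 0, 0, 0.0  # empty table: nothing to report
    return orphans, total, 100.0 * orphans / total

def demo_connection():
    """Build a tiny in-memory stand-in for the real table (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE neo_man_page_entry (name TEXT, similarthread TEXT)")
    conn.executemany(
        "INSERT INTO neo_man_page_entry VALUES (?, ?)",
        [("sshd", "101,102"), ("nologin", "103"), ("foo", "none"),
         ("bar", "none"), ("baz", "none")],
    )
    return conn
```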

So, I'm on to step 3 now: boolean matches between the name of the man page and the tags (update: and the thread titles) for each thread, ordered by reply count. I changed the process (from my step 3 above) to match both thread tags and thread titles, to see if this speeds things along toward the goal of every man page having at least one similar thread entry.

Code:
1577609125 Time: 35 Inserts: 63 Floor: 6000 Limit: 300 ToDo: 218340 RemainingTime: 12.1 Hours QLoad: 0.54
1577609318 Time: 35 Inserts: 63 Floor: 6000 Limit: 300 ToDo: 218040 RemainingTime: 12.1 Hours QLoad: 0.41
1577609392 Time: 34 Inserts: 70 Floor: 6000 Limit: 300 ToDo: 217740 RemainingTime: 12.1 Hours QLoad: 0.66
1577609437 Time: 34 Inserts: 68 Floor: 6000 Limit: 300 ToDo: 217440 RemainingTime: 12.1 Hours QLoad: 0.97

Let's see what happens twelve hours from now after this batch processing finishes.

I may move to straightforward boolean matches of the man page name against the text of the posts for step 4, but that seems too crude, so I'll need to ponder and test this later. On the other hand, if I add the operating system to the match, that might be too refined and result in a very small number of matches, since we have always had quite a hard time getting people to describe their OS when they post a question!

If everyone posted system details when they asked a question, the matches would be a lot better; but they rarely do.
# 3  
Old 12-29-2019
Step 3 is done. The result is that the orphans have dropped from 63% to 53%.

Code:
mysql> select count(1) as count from neo_man_page_entry where similarthread = "notagsmatch"; select count(1) as count from neo_man_page_entry;
+--------+
| count  |
+--------+
| 185765 |
+--------+
1 row in set (0.93 sec)

+--------+
| count  |
+--------+
| 347938 |
+--------+
1 row in set (0.00 sec)

So, I now start step 4:

STEP 4: Boolean Matches Man Page Name with Post Text

I will take the remaining man pages without any similar threads and repeat step 3, but matching the name of the man page (only the query, for example 'sshd') against the page text of each post, getting the threadid from the post, ordering the matches by the number of times the thread was thanked by users, and keeping up to 15 matches, as before.
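
Here is a small Python sketch of the step 4 idea, under stated assumptions: posts are given as `(threadid, pagetext)` pairs, thanks counts come from a plain dict, and a whole-word regular expression stands in for the boolean match. None of these names come from the actual job.

```python
import re

def threads_mentioning(name, posts, thanks_by_thread, max_matches=15):
    """Return ids of threads with a post that mentions `name` as a whole word.

    posts is an iterable of (threadid, pagetext) pairs and thanks_by_thread
    maps threadid to its thanks count. Results are ordered by thanks,
    highest first, and capped at max_matches.
    """
    # The lookarounds stop 'ssh' from matching inside 'sshd'.
    pattern = re.compile(
        r"(?<!\w)" + re.escape(name) + r"(?!\w)", re.IGNORECASE)
    matched = {tid for tid, pagetext in posts if pattern.search(pagetext)}
    ranked = sorted(matched, key=lambda t: (-thanks_by_thread.get(t, 0), t))
    return ranked[:max_matches]
```

Matching on whole words keeps short names from hitting unrelated longer words, which is one way to make a plain name match a little less crude.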

We will see how many man page orphans find thread relatives in this step 4 manner.

Running... It looks like this query and update will take around a week or so, so as not to overload the server.

Code:
1577672816 Time: 54 Inserts: 6 Floor: 6000 Limit: 20 ToDo: 185205 RemainingTime: 154.3 Hours QLoad: 1.65
1577672876 Time: 52 Inserts: 4 Floor: 6000 Limit: 20 ToDo: 185185 RemainingTime: 154.3 Hours QLoad: 1.53
1577672942 Time: 54 Inserts: 1 Floor: 6000 Limit: 20 ToDo: 185165 RemainingTime: 154.3 Hours QLoad: 1.04
1577673000 Time: 53 Inserts: 0 Floor: 6000 Limit: 20 ToDo: 185145 RemainingTime: 154.3 Hours QLoad: 1.20

...and so far, it looks like the orphans will only be reduced by a relatively small amount (less than 15% of the total remaining orphans, I guess... let's see).
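
For readers who want to run a similar background job, one common way to avoid overloading a server is to process a small, fixed number of items per run and to skip a run entirely when the load average is high. The Python sketch below shows that pattern only; it is not the cron job used here, and `run_batch`, `max_load` and `load_fn` are hypothetical names.

```python
import os

def run_batch(pending, process, limit=20, max_load=2.0, load_fn=None):
    """Process up to `limit` items from `pending` unless the system is busy.

    Returns the number of items processed. load_fn returns the current
    load figure; by default it is the 1-minute load average from
    os.getloadavg(), which is available on Unix-like systems only.
    """
    if load_fn is None:
        load_fn = lambda: os.getloadavg()[0]
    if load_fn() > max_load:
        return 0  # too busy: skip this run and let the next one try again
    done = 0
    while pending and done < limit:
        process(pending.pop(0))
        done += 1
    return done
```

Scheduled once a minute, a function like this gives a steady, predictable drain on the queue, which is also what makes a remaining-time estimate easy to compute.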
# 4  
Old 12-31-2019
Update:

I may stop the cron job (step 4), which processes similar threads for man pages using only the name of the man page and the text of posts.

I'm not really getting enough "bang for the buck" from loading the server with these batch jobs in the background, where we can see that approximately 15% of the queries result in a match (but the server is now at its lowest load point of the year, so I will wait a few more days before deciding to stop this cron):

Code:
1577727419 Time: 57 Inserts: 4 Floor: 6000 Limit: 15 ToDo: 169355 RemainingTime: 188.2 Hours QLoad: 1.38
1577727480 Time: 57 Inserts: 4 Floor: 6000 Limit: 15 ToDo: 169340 RemainingTime: 188.2 Hours QLoad: 1.58
1577727540 Time: 57 Inserts: 3 Floor: 6000 Limit: 15 ToDo: 169325 RemainingTime: 188.1 Hours QLoad: 1.52
1577727598 Time: 56 Inserts: 2 Floor: 6000 Limit: 15 ToDo: 169310 RemainingTime: 188.1 Hours QLoad: 1.35
1577727664 Time: 56 Inserts: 3 Floor: 6000 Limit: 15 ToDo: 169295 RemainingTime: 188.1 Hours QLoad: 1.05
1577727724 Time: 55 Inserts: 2 Floor: 6000 Limit: 15 ToDo: 169280 RemainingTime: 188.1 Hours QLoad: 1.18
1577727780 Time: 54 Inserts: 3 Floor: 6000 Limit: 15 ToDo: 169265 RemainingTime: 188.1 Hours QLoad: 1.23
1577727835 Time: 53 Inserts: 1 Floor: 6000 Limit: 15 ToDo: 169250 RemainingTime: 188.1 Hours QLoad: 1.52
1577727896 Time: 55 Inserts: 2 Floor: 6000 Limit: 15 ToDo: 169235 RemainingTime: 188.0 Hours QLoad: 1.26
1577727958 Time: 55 Inserts: 3 Floor: 6000 Limit: 15 ToDo: 169220 RemainingTime: 188.0 Hours QLoad: 1.57
1577728021 Time: 55 Inserts: 2 Floor: 6000 Limit: 15 ToDo: 169205 RemainingTime: 188.0 Hours QLoad: 1.44
1577728075 Time: 53 Inserts: 3 Floor: 6000 Limit: 15 ToDo: 169190 RemainingTime: 188.0 Hours QLoad: 1.84
1577728136 Time: 55 Inserts: 1 Floor: 6000 Limit: 15 ToDo: 169175 RemainingTime: 188.0 Hours QLoad: 1.94
1577728198 Time: 57 Inserts: 1 Floor: 6000 Limit: 15 ToDo: 169160 RemainingTime: 188.0 Hours QLoad: 1.52
1577728262 Time: 57 Inserts: 0 Floor: 6000 Limit: 15 ToDo: 169145 RemainingTime: 187.9 Hours QLoad: 1.39
1577728321 Time: 58 Inserts: 1 Floor: 6000 Limit: 15 ToDo: 169130 RemainingTime: 187.9 Hours QLoad: 1.34
1577728379 Time: 55 Inserts: 0 Floor: 6000 Limit: 15 ToDo: 169115 RemainingTime: 187.9 Hours QLoad: 1.05
1577728443 Time: 54 Inserts: 2 Floor: 6000 Limit: 15 ToDo: 169100 RemainingTime: 187.9 Hours QLoad: 1.04
1577728500 Time: 58 Inserts: 0 Floor: 6000 Limit: 15 ToDo: 169086 RemainingTime: 187.9 Hours QLoad: 1.21
1577728560 Time: 57 Inserts: 0 Floor: 6000 Limit: 15 ToDo: 169071 RemainingTime: 187.9 Hours QLoad: 1.30
1577728623 Time: 57 Inserts: 1 Floor: 6000 Limit: 15 ToDo: 169056 RemainingTime: 187.8 Hours QLoad: 1.32
1577728682 Time: 57 Inserts: 1 Floor: 6000 Limit: 15 ToDo: 169041 RemainingTime: 187.8 Hours QLoad: 1.57
1577728738 Time: 55 Inserts: 0 Floor: 6000 Limit: 15 ToDo: 169026 RemainingTime: 187.8 Hours QLoad: 1.52

That means, for now, I'm going to put this project on hold. Here are the intermediate results, showing 52% orphans, which is an improvement over the earlier 63% orphan stat:

Code:
mysql> select count(1) as count from neo_man_page_entry where similarthread = "nopagetextmatch" or similarthread = "notagsmatch"; select count(1) as count from neo_man_page_entry;
+--------+
| count  |
+--------+
| 182585 |
+--------+
1 row in set (0.96 sec)

+--------+
| count  |
+--------+
| 347938 |
+--------+
1 row in set (0.00 sec)


For now, I'll let it run slowly in the background...
# 5  
Old 12-31-2019
Hi Neo,
my 2 cents:
You may already have done so, but if not: knowing the type of process involved, I would have chosen a calm period for the task, as you did. Then, so as not to waste processing time on the various caches, I would try to optimize what I can, where I can. I'm not sure you can change the cache ratio of the FS or the underlying storage (I suppose that is more the provider's duty...), but since you have access to your RDBMS kernel, I would reduce its cache working storage to force it to read the true data. This is efficient for big batch processes when you know you are after data that is not often read (so there is no chance of finding it in the caches). Of course, it impacts ordinary online interactive work, but as you have fewer requests from online users at the moment, that's acceptable... It should improve your step 4 a bit...
# 6  
Old 12-31-2019
Thanks Victor,

Sounds good, but I don't want to put much more effort into this with all the other projects I have ongoing. If you have the exact PHP code for MySQL to do as you suggested, that might take less of my time to implement. Right now I am quite busy on non-unix.com tasks, working with a number of people and vendors on LoRa and NB-IoT networking gear, regulatory issues, chips, specifications, development boards, gateways, code, libs, etc.

OBTW, here is an example of this "similar thread for man pages" working (man nologin):

Code:
https://www.unix.com/man-page/linux/5/nologin/

I think these "similar threads for man pages", combined with my earlier "similar man pages for man pages", will help search engines index the man pages with thin content (SEO).
# 7  
Old 01-01-2020
Update: Focusing on man pages with a string length of less than 4000, currently showing about 53% orphan man pages:


Code:
mysql> select count(1) as count from neo_man_page_entry where similarthread = "nopagetextmatch" or similarthread = "notagsmatch" and strlen < 4000; select count(1) as count from neo_man_page_entry where strlen < 4000;

+--------+
| count  |
+--------+
| 108115 |
+--------+
1 row in set (1.01 sec)

+--------+
| count  |
+--------+
| 204819 |
+--------+
1 row in set (0.04 sec)

mysql>
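
One thing to keep in mind when reading the first query above: in MySQL, as in standard SQL, AND binds more tightly than OR. Without parentheses, the `strlen < 4000` condition applies only to the "notagsmatch" rows, so the 108115 figure may include "nopagetextmatch" pages of any length. The Python sketch below demonstrates the difference with sqlite3, which follows the same precedence rule; the `entry` table and its five rows are made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (similarthread TEXT, strlen INTEGER)")
conn.executemany("INSERT INTO entry VALUES (?, ?)", [
    ("nopagetextmatch", 1000),
    ("nopagetextmatch", 9000),
    ("notagsmatch", 1000),
    ("notagsmatch", 9000),
    ("12345", 1000),
])

def count(where):
    """Count rows of the demo table that satisfy a WHERE clause."""
    sql = "SELECT COUNT(*) FROM entry WHERE " + where
    return conn.execute(sql).fetchone()[0]

# Parsed as: A OR (B AND strlen < 4000); the long 'nopagetextmatch' row counts.
unparenthesized = count(
    "similarthread = 'nopagetextmatch' "
    "OR similarthread = 'notagsmatch' AND strlen < 4000")

# Parsed as: (A OR B) AND strlen < 4000; both markers are length-limited.
parenthesized = count(
    "(similarthread = 'nopagetextmatch' "
    "OR similarthread = 'notagsmatch') AND strlen < 4000")
```

With this sample data the two counts differ by one row, the "nopagetextmatch" entry of length 9000. If the intent is to limit both markers to short pages, the parenthesized form is the one to use.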

Still focusing on man pages with strlen of less than 4000:

Code:
ubuntu# tail -f neo_simthread_for_man_pages_using_pagetext_timing.log
1577851988 Time: 56 Inserts: 9 Ceiling: 4000 Limit: 22 ToDo: 70540 RemainingTime: 53.4 Hours QLoad: 1.39
1577852043 Time: 57 Inserts: 8 Ceiling: 4000 Limit: 22 ToDo: 70519 RemainingTime: 53.4 Hours QLoad: 1.68
1577852104 Time: 56 Inserts: 5 Ceiling: 4000 Limit: 22 ToDo: 70497 RemainingTime: 53.4 Hours QLoad: 1.55
1577852169 Time: 58 Inserts: 2 Ceiling: 4000 Limit: 22 ToDo: 70475 RemainingTime: 53.4 Hours QLoad: 1.57
1577852221 Time: 57 Inserts: 0 Ceiling: 4000 Limit: 22 ToDo: 70455 RemainingTime: 53.4 Hours QLoad: 1.56
1577852285 Time: 57 Inserts: 1 Ceiling: 4000 Limit: 22 ToDo: 70433 RemainingTime: 53.4 Hours QLoad: 1.43
1577852336 Time: 55 Inserts: 5 Ceiling: 4000 Limit: 22 ToDo: 70413 RemainingTime: 53.3 Hours QLoad: 1.22
1577852410 Time: 55 Inserts: 1 Ceiling: 4000 Limit: 22 ToDo: 70391 RemainingTime: 53.3 Hours QLoad: 0.95
1577852477 Time: 56 Inserts: 2 Ceiling: 4000 Limit: 22 ToDo: 70369 RemainingTime: 53.3 Hours QLoad: 1.18
1577852519 Time: 58 Inserts: 2 Ceiling: 4000 Limit: 22 ToDo: 70353 RemainingTime: 53.3 Hours QLoad: 1.39

Let's see how the orphans under 4000 bytes are doing around this time tomorrow.

Either way, it is working and live on the site, as you can see from this example:

nologin(5) [linux man page]

Code:
https://www.unix.com/man-page/linux/5/nologin/

I am still considering a "Step 5" to deal with the orphans, which I am guessing will end up (after "Step 4") being around half of the total man page repo, as follows:
  • Split the names of man pages that contain underscores, dashes, colons, etc. and search titles, tags and posts for one or more of those substrings.
  • Use the operating system of the man page query, and cross reference to the most popular threads in the corresponding forums (Linux, Solaris, AIX, etc).
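
The name-splitting idea in the first bullet could look something like the Python sketch below. It splits only on the three separators named above (underscores, dashes and colons); the minimum substring length of 3 is an assumption added here to avoid searching on tiny fragments, and `name_parts` is a hypothetical helper.

```python
import re

def name_parts(man_name, min_len=3):
    """Split a man page name into candidate search substrings.

    Splits on runs of underscores, dashes and colons, drops parts shorter
    than min_len, and removes case-insensitive duplicates while keeping
    the original order.
    """
    parts = []
    seen = set()
    for part in re.split(r"[_\-:]+", man_name):
        key = part.lower()
        if len(part) >= min_len and key not in seen:
            seen.add(key)
            parts.append(part)
    return parts
```

For example, `name_parts("pthread_cond_timedwait")` yields three substrings that can each be tried against titles, tags and posts.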

Maybe something else... maybe not. I will sleep on this, since sleep always brings more ideas and solutions. I always get a lot of work done and get new ideas when I sleep and dream.