Rebuilding SMW when the job queue is growing out of control
Symptom
The job queue keeps accumulating jobs (“SMWUpdateJob”), and more than one job runner (“maintenance/runJobs.php”) stays running for long periods. When this happens, the extension needs to rebuild its data from scratch.
To check whether this applies, look at the statistics component of the MediaWiki API. During the last event, the queue held hundreds of thousands of jobs (e.g. jobs="318102").
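A minimal sketch of that check from the command line, assuming the API is queried with format=json through the standard siteinfo statistics query. The URL in the usage comment is a placeholder, and the helper name extract_jobs is made up for illustration.

```shell
# Extract the "jobs" counter from a siteinfo statistics JSON response on stdin.
extract_jobs() {
    sed -n 's/.*"jobs":[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Hypothetical usage; replace the URL with the wiki's own api.php endpoint:
#   curl -s 'https://wiki.example.org/w/api.php?action=query&meta=siteinfo&siprop=statistics&format=json' | extract_jobs
```

Running this a few minutes apart gives a quick sense of whether the queue is draining or still growing.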
On the database server (the master), read these values.
mysql> use wpwiki;
mysql> SELECT COUNT(*) FROM job WHERE job_cmd LIKE '%SMW%';
+----------+
| COUNT(*) |
+----------+
| 318223 |
+----------+
Note that this number keeps growing. Most, if not all, of these jobs are expected to be for Semantic MediaWiki.
Things to know
- If the database setup has one master and multiple slaves, run the commands on the master only; the slaves will follow. Doing the opposite might break the consistency of the database cluster.
- The cronjobs should ideally use the timeout utility in their scripts, with a maximum duration of an hour. This prevents multiple long-running tasks from slowly eating all CPU resources.
- The MediaWiki configuration file (see code below) shows which database is the master and which is read-only (“slave”). Only the first entry in $wgDBservers, here the one with load=0, can read and write; the other entries are read-only.
root@app5:/home/renoirb# head -n 40 /srv/webplatform/wiki/CurrentSettings.php
<?php
// ... truncated file notes ...
$wgDBservers = array(
array(
'host' => "master.db.wpdn", // < The salt states ensures
'dbname' => "wpwiki", // that the master database server
'user' => "HIDDENINFORMATION", // has this hostname among all VMs.
'password' => "HIDDENINFORMATION",
'type' => "mysql",
'flags' => DBO_DEFAULT,
'load' => 0 // < This means read AND write,
// it is specific to the master.
),
);
if ( !$wgCommandLineMode ) {
$wgDBservers[] = array(
'host' => "HIDDENINFORMATION.dho.wpdn",
'dbname' => "HIDDENINFORMATION",
'user' => "HIDDENINFORMATION",
'password' => "HIDDENINFORMATION",
'type' => "mysql",
'flags' => DBO_DEFAULT,
'load' => 1 // < This means read ONLY
);
}
- Not all app servers are used full time by the caching layer, Fastly (Varnish). You can see this in the Fastly admin panel, under “Hosts” within “Configure” for the appropriate service. In the current configuration, two app* VMs share the load in equal portions (200) and a 3rd VM is exposed as a backup (190, see B in the attached image). The 3rd is ready if the first two can’t serve all requests.
- The cronjobs are run from a 4th app server that is not exposed at all. In the app server VM listing below, it’s currently "app5".
- App server VM listing: to know which VMs are app* servers, run the following.
renoirb@deployment:~$ nova list | grep app
| ... | app3 | ACTIVE | None | Running | vmnet=10.0.0.32, 208.113.157.171 |
| ... | app4 | ACTIVE | None | Running | vmnet=10.0.0.18, 208.113.157.173 |
| ... | app5 | ACTIVE | None | Running | vmnet=10.0.0.2, 208.113.157.166 |
| ... | app6 | ACTIVE | None | Running | vmnet=10.0.0.14, 208.113.157.162 |
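As an illustration of the timeout recommendation above, here is a minimal sketch of a wrapper that a cron script could use. It assumes GNU coreutils timeout is installed; the function name run_capped is hypothetical, and the one-hour figure in the usage comment simply mirrors the suggested maximum.

```shell
# Run a command with a hard time limit (GNU coreutils timeout).
# When the limit is reached, timeout sends SIGTERM to the command
# and exits with status 124.
run_capped() {
    local limit_seconds="$1"
    shift
    timeout "$limit_seconds" "$@"
}

# Hypothetical cron script line, capping one runJobs.php pass at one hour:
#   run_capped 3600 /usr/bin/php /srv/webplatform/wiki/current/maintenance/runJobs.php
```

An exit status of 124 in the cron logs then indicates that a run hit the limit rather than finishing on its own.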
Steps
In summary
- Make sure the “job runner” will not run during the whole process (e.g. comment out all applicable crontab entries)
- Run the SMW rebuild command
- Truncate the job table
- Uncomment the “job runner” crontab entries
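The first and last steps can be sketched as two small text filters. This assumes the entries are plain crontab-style lines read from standard input; the function names are made up for illustration. In this setup the entries are managed through Salt states, so the durable change belongs there, and the filters only show the intended transformation.

```shell
# Comment out every line mentioning runJobs that is not already commented.
disable_runjobs() {
    sed '/runJobs/ s/^[^#]/#&/'
}

# Remove one leading "#" from every line mentioning runJobs.
enable_runjobs() {
    sed '/runJobs/ s/^#//'
}

# Hypothetical usage against a saved copy of a crontab:
#   disable_runjobs < crontab.txt > crontab.disabled.txt
```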
In detail
- Find the cron entries mentioning ‘runJobs.php’ and comment them out. They can be found by searching the salt state files (grep -rli 'runJob' /srv/salt). The jobs are assigned to the "www-data" user on the strongest app server (e.g. app4).
- Make sure no job is running. In the sample below, app1, app4 and app6 are clear, while app5 still has runJobs.php processes running.
root@deployment:~# salt 'app*' cmd.run 'ps aux | grep runJob'
app1:
root 32739 0.0 0.0 9220 1188 ? S 00:14 0:00 /bin/sh -c ps aux | grep unJob
root 32741 0.0 0.0 6176 672 ? S 00:14 0:00 grep unJob
app6:
root 10650 0.0 0.0 9220 1188 ? S 00:14 0:00 /bin/sh -c ps aux | grep unJob
root 10652 0.0 0.0 6180 704 ? S 00:14 0:00 grep unJob
app5:
www-data 23979 0.0 0.0 4112 580 ? Ss 14:44 0:00 /bin/sh -c /srv/webplatform/wiki/mediawiki-runJobs.sh #1st run
www-data 23980 0.0 0.0 9228 1332 ? S 14:44 0:00 /bin/bash -l /srv/webplatform/wiki/mediawiki-runJobs.sh
www-data 23983 0.0 0.0 3876 412 ? Ss 14:44 0:00 /usr/bin/timeout 3100 /usr/bin/php /srv/webplatform/wiki/current/maintenance/runJobs.php
www-data 23984 71.7 0.6 203932 50700 ? R 14:44 24:59 /usr/bin/php /srv/webplatform/wiki/current/maintenance/runJobs.php
www-data 24347 0.0 0.0 4112 584 ? Ss 15:16 0:00 /bin/sh -c /srv/webplatform/wiki/mediawiki-runJobs.sh #2nd run
www-data 24348 0.0 0.0 9228 1328 ? S 15:16 0:00 /bin/bash -l /srv/webplatform/wiki/mediawiki-runJobs.sh
www-data 24351 0.0 0.0 3876 412 ? Ss 15:16 0:00 /usr/bin/timeout 3100 /usr/bin/php /srv/webplatform/wiki/current/maintenance/runJobs.php
www-data 24352 69.2 0.5 202424 49160 ? S 15:16 1:59 /usr/bin/php /srv/webplatform/wiki/current/maintenance/runJobs.php
root 24395 0.0 0.0 3876 408 pts/1 S+ 15:18 0:00 /usr/bin/timeout 10800 /usr/bin/php /srv/webplatform/wiki/current/maintenance/runJobs.php
root 24396 65.6 0.5 202472 49164 pts/1 R+ 15:18 0:13 /usr/bin/php /srv/webplatform/wiki/current/maintenance/runJobs.php
root 24403 0.0 0.0 9220 1188 ? S 15:18 0:00 /bin/sh -c ps aux | grep unJob
root 24405 0.0 0.0 6180 708 ? S 15:18 0:00 grep unJob
app4:
root 4183 0.0 0.0 9220 1188 ? S 00:14 0:00 /bin/sh -c ps aux | grep unJob
root 4185 0.0 0.0 6176 672 ? S 00:14 0:00 grep unJob
- Note that in this sample, we can see the use of /usr/bin/timeout with various durations. The current salt configuration has the cronjob run tasks for a maximum of 3100 seconds. The other job, with a duration of 10800, was started manually in an attempt to empty the queue.
- Connect via SSH to the strongest app server with the lowest weight in the Fastly caching service. It is most likely the one that had crontabs with the 'runJobs.php' scheduled tasks.
- Kill all related processes on the server, and make sure they are not running anymore.
root@app5:/srv/webplatform/wiki/test/extensions/SemanticMediaWiki/maintenance# kill -9 24395 24351 23983
root@app5:/srv/webplatform/wiki/test/extensions/SemanticMediaWiki/maintenance# ps aux | grep unJob
root 32427 0.0 0.0 7640 916 pts/3 S+ 02:02 0:00 grep --color=auto unJob
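The check above always lists the grep process itself, which makes the output harder to read. A minimal sketch of a filter that avoids this, assuming standard ps aux column order (PID in the second column); the function names are hypothetical.

```shell
# Keep only lines that mention runJobs.php. The bracket in the pattern
# stops grep from matching its own command line in ps output.
filter_runjobs() {
    grep '[r]unJobs\.php'
}

# Print the PIDs (second column of ps aux output) of the matching lines.
runjobs_pids() {
    filter_runjobs | awk '{print $2}'
}

# Hypothetical usage on the app server:
#   ps aux | runjobs_pids
```

Empty output means no runJobs.php process (or timeout wrapper around one) is left.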
- Start a screen or tmux session and run the following from within it. That way, if your SSH connection dies, the script will continue to run.
- Go to the folder where MediaWiki is installed. We have more than one installation; in this situation the appropriate place is /srv/webplatform/wiki/current/. Always refer to the Salt states on the deployment server in /srv/salt.
- Run the Semantic MediaWiki refreshData script. It might take a while; expect to wait about 20 minutes.
cd /srv/webplatform/wiki/wpwiki/mediawiki
php extensions/SemanticMediaWiki/maintenance/SMW_refreshData.php -v
...
(29477) Processing ID 29478 ...
(29478) Processing ID 29479 ...
(29479) Processing ID 29480 ...
(29480) Processing ID 29481 ...
(29481) Processing ID 29482 ...
(29482) Processing ID 29483 ...
(29483) Processing ID 29484 ...
(29484) Processing ID 29485 ...
(29485) Processing ID 29486 ...
(29486) Processing ID 29487 ...
29487 IDs refreshed.
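Because the -v flag prints each ID as it is processed, saving that output makes an interrupted run easy to resume. A minimal sketch, assuming the output was saved to a file (the name refresh.log is hypothetical) and that the lines keep the "Processing ID N" shape shown above.

```shell
# Print the last ID announced in saved "-v" output read from stdin.
last_processed_id() {
    sed -n 's/.*Processing ID \([0-9][0-9]*\).*/\1/p' | tail -n 1
}

# Hypothetical usage, assuming the first run was piped through "tee refresh.log":
#   start_id=$(last_processed_id < refresh.log)
#   php extensions/SemanticMediaWiki/maintenance/SMW_refreshData.php -v -s "$start_id"
```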
- If the script dies, you can take the ID it died on and re-run SMW_refreshData.php with the -s option.
- When all is done, connect to the master database server and truncate the job table. Note that I first made sure the count was the same as the one I checked before running SMW_refreshData.php.
mysql> use wpwiki;
mysql> SELECT COUNT(*) FROM job WHERE job_cmd LIKE '%SMW%';
+----------+
| COUNT(*) |
+----------+
| 318223 |
+----------+
mysql> truncate job;
Query OK, 0 rows affected (0.34 sec)
mysql> SELECT COUNT(*) FROM job WHERE job_cmd LIKE '%SMW%';
+----------+
| COUNT(*) |
+----------+
| 0 |
+----------+
1 row in set (0.00 sec)
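The count comparison done by hand above could be scripted as a guard. This is only a sketch: it assumes the mysql client can reach the master with credentials from a local option file, and the function names are made up for illustration. Note that truncating removes every queued job, not only the SMW ones.

```shell
# Count queued SMW jobs in the wpwiki database (run against the master).
count_smw_jobs() {
    mysql --batch --skip-column-names \
        -e "SELECT COUNT(*) FROM job WHERE job_cmd LIKE '%SMW%'" wpwiki
}

# Succeed only when the two counts are numerically equal.
counts_match() {
    [ "$1" -eq "$2" ]
}

# Hypothetical usage, with the count noted before the rebuild:
#   before=318223
#   now=$(count_smw_jobs)
#   counts_match "$before" "$now" && mysql -e 'TRUNCATE TABLE job' wpwiki
```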
- All should be fine now!
- Re-enable the job runner in the appropriate cron jobs.