Linux Sysadmin Blog

MySQL Replication Problems

We are using a replication setup for our databases. The master database handles all the transactions, while the slave database serves as a hot standby and is used for making backups and running reports. Running mysqldump for backups locks the tables one by one (we're using the MyISAM storage engine), which was causing user interruptions on the master. Using the replication slave for this instead is quite effective, but it makes monitoring the replication lag very important: if replication stops, the backups fall behind and suddenly the hot standby is not so hot anymore. A second problem is that the reports end up reporting on stale data.
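As a sketch of the backup step (the dump path and options are our own choices, not from the original setup), the idea is to run the dump against the slave so the master keeps serving users:

# Pause the slave's SQL thread so the dump sees a consistent snapshot,
# dump with table locks (appropriate for MyISAM), then resume replication:
mysql -e "STOP SLAVE SQL_THREAD;"
mysqldump --lock-all-tables --all-databases > /backup/all-databases.sql
mysql -e "START SLAVE;"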

Recently our master database server decided it was time for a break and spontaneously rebooted. Since the server immediately came back up and started all services correctly, production was only interrupted for a couple of minutes. A little later, alerts started coming in that the replication lag was growing. It turned out the replication slave had gotten confused and was processing old replication logs, meaning that transactions from a month ago were being processed again. This could be observed in several ways. The lag showed a number of seconds equivalent to about a month, and the records the slave was trying to insert already existed, producing MySQL error 1062: the primary key already exists, so the new insert cannot complete. This stops replication dead in its tracks. A second way to see that old transactions are being processed is to look at when the duplicate records were originally inserted. This is not always possible, but in this case the records contained a timestamp.
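The relevant counters can be read straight off the slave; a quick check (the field names are standard SHOW SLAVE STATUS output, the values in the comments are what we would expect to see in this situation):

mysql -e "SHOW SLAVE STATUS\G" | egrep 'Seconds_Behind_Master|Last_Errno|Last_Error'
# Seconds_Behind_Master around 2.6 million seconds corresponds to about a month of lag
# Last_Errno: 1062 confirms the duplicate-key failures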

Our conclusion was that we needed to take two actions. First, we needed to rebuild the replication slave database and restart replication from scratch. To do this you have to lock the entire database while making a mysqldump (FLUSH TABLES WITH READ LOCK). This would lock out users for 10 to 15 minutes, which is unacceptable during the daytime, so we decided to execute it later in the evening when fewer users were online. The second action was to let the replication catch up so that we would have a reasonably consistent database in the meantime. This was done by adding the following line to the MySQL configuration file:

slave-skip-errors=1062

After that the replication slave server was restarted and the database caught up in two hours of processing. The reason we considered this an intermediate solution is that we don't delete many records; the applications execute mainly updates and inserts to modify data, so as long as the latest updates are executed last the data will be in decent shape. After that we rebuilt the complete slave database at a more convenient time and all was well again.
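For completeness, the rebuild followed the standard pattern; a sketch, where the binlog file name and position are placeholders to be taken from your own SHOW MASTER STATUS output:

-- On the master: block writes, record the binlog coordinates, dump, unlock.
FLUSH TABLES WITH READ LOCK;
SHOW MASTER STATUS;   -- note the File and Position values
-- (run mysqldump from a second session while this lock is held)
UNLOCK TABLES;

-- On the slave: load the dump, then point replication at the recorded coordinates.
CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000123', MASTER_LOG_POS=4;
START SLAVE;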

Calculating Videos on Site to Bandwidth or Aggregate Transfer to Cost of Video Played

A client asked us a very innocent question about bandwidth usage: "What will my average cost of video download be?" This question led me down a path of assumptions and translations between different bandwidth terminologies, aiming at an answer that makes sense to someone without having to learn binary conversions and video compression. This article is the first in an outline of how to get from a technical measurement to a non-technical answer.

Bitrate

The assumptions for calculating the bandwidth of the video are the following:

  • Size of average video, which depends on:

    • pixel height
    • pixel width
    • frame rate
    • compression (and amount of movement)
    • length of video (time)

We’ll assume the concept of a video bitstream is understood; there are lots of resources on the web that cover this topic, most notably Adobe itself, as well as some old-school sources.

Let’s assume that our video bitrate is 440 kbit/s and our audio bitrate is 128 kbit/s, making the whole stream 568 kbit/s.

For an explanation of kilobytes vs. kilobits vs. kibibytes vs. monkeybytes, see the Lyberty blog or one of the nifty calculators on the web.

Site bandwidth usage

Once we have our average video size, let’s consider the usage of the site. The videos are embedded in a website whose pages also incorporate images, JavaScript, CSS, etc., adding to the bandwidth requirements. So what we need to determine next is some baseline by which we can estimate the download of videos, considering pageviews, visitors, etc.

If our baseline is visitors to the site, then we need to assume the number of pages viewed and videos watched per visit. If it’s going to be determined by pageviews, then we need to know the number of videos watched per page. This is a big assumption, as we assume each video preloads.

Our assumptions here would therefore be:

  • Number of videos watched per visitor

    • average page size outside of video
    • average page views per visitor
    • average videos viewed per page

The easiest number to start with, as it requires no derivation but simply a guess, is:

  • Number of visitors

In our case, for simplicity of calculation, let’s assume our video length is 3600 seconds, and we’ll just add an overhead penalty of about 45% for the pages encapsulating the video.
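With those numbers the per-video transfer works out as follows; a quick sketch with bc, using decimal megabytes and 8 bits per byte:

# 568 kbit/s x 3600 s, /8 for bytes, /10^6 for MB, x1.45 for page overhead:
echo "568000 * 3600 / 8 / 10^6 * 1.45" | bc -l
# ~370.6 MB transferred per video view, overhead included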

Bandwidth Cost

We then must turn the calculation into something useful which can be applied to a cost formula for bandwidth. There are generally two ways in which bandwidth is sold: either by measured transfer, typically per month, or by average bandwidth usage, generally billed at the 95th percentile, meaning the top 5% of usage samples do not count. Either way, transfer and metered bandwidth costs are typically tiered; basically, the more you use, the less it costs per unit.

One of the better bandwidth explanations I found on the web is from Rackforce:

Understanding Bandwidth by Doug Alder

I am not taking into consideration the “unmetered” offers, which are in fact throttled or, in the worst case, oversaturated, which does not provide a good user experience.

The assumptions to make here are:

  • Cost of Bandwidth

    • based on tiered pricing that depends on usage
    • based on total data transfer
    • based on connection

Once we have our assumptions, the rest is relatively easy.  To calculate total monthly bandwidth use the following formula:

[video encoding rate] x [average video length] x [daily uniques] x [assumed views per unique] x [30 days] x [overhead bandwidth add-on]

Then back out your cost of bandwidth, either fixed or connection-based:

[total monthly bandwidth] x [usage cost]
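As a worked example, with all inputs assumed purely for illustration (10,000 daily uniques and 2 views per unique, plus the 568 kbit/s, 3600 s and 45% figures from above):

# rate x length x uniques x views x 30 days x overhead, converted to decimal terabytes:
echo "568000 * 3600 * 10000 * 2 * 30 * 1.45 / 8 / 10^12" | bc -l
# ~222.4 TB of transfer per month; multiply by your per-unit usage cost to get the bill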

This last calculation is simplified; we shall devote another article to the continuation.

Run-parts Not Running All the Scripts in /etc/cron.daily

Okay, I realize that this is a simple one, but it caught me by surprise nonetheless. Most of our scheduled scripts are placed in the /etc/cron.daily, /etc/cron.weekly and /etc/cron.monthly directories. After migrating some of these scripts to another server, it turned out they were not being executed. After some research it emerged that run-parts (or at least some implementations of it) was simply skipping the files that had a dot in their names. Our CentOS server was executing them perfectly, but the Debian server just skipped them. The man page mentions it by omission:

If the --lsbsysinit option is not given then the names must consist entirely of upper and lower case letters, digits, underscores, and hyphens.

To check which scripts will be executed, call run-parts with the --test option.
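To see this in practice (backup.sh is a hypothetical script name), compare the test output before and after renaming:

# Show which scripts run-parts would actually execute:
run-parts --test /etc/cron.daily

# On Debian a script with a dot in its name is silently skipped;
# renaming it without the extension makes it eligible again:
mv /etc/cron.daily/backup.sh /etc/cron.daily/backup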

Nagios Plugin: Check_hparray Error

We have the check_hparray plugin installed on all of our HP servers (running CentOS) for monitoring hardware RAID via the HP Array Configuration Utility CLI (hpacucli) tool, and we have NRPE installed as well so this check can be run from our remote Nagios server.

Yesterday we set up another HP server and installed NRPE, check_hparray, and hpacucli, using the same process as on previous installations. NRPE worked fine locally and from the Nagios server, and the local disk space check was configured properly, but when we tried check_hparray we got “check_hparray Error”. This error can have different causes, such as an invalid slot value or problems with permissions when executing the hpacucli command.

We reviewed our setup and installation and found the same settings as on the other servers.

When we ran check_hparray through NRPE we got the error:

[root@web161 nagios]# /usr/local/nagios/libexec/check_nrpe -H localhost -c check_raid
check_hparray Error.

but it worked fine when running the check_hparray command directly:

[root@web161 nagios]# /usr/local/nagios/libexec/check_hparray -s 1
RAID OK - (Smart Array P400 in Slot 1 array A logicaldrive 1 (546.8 GB, RAID 1+0, OK))

Both of the commands above were tested as the root and nagios users, with the same results. We then enabled the NRPE debug option to get details on the problem:

edit: /usr/local/nagios/etc/nrpe.cfg

# DEBUGGING OPTION
# This option determines whether or not debugging messages are logged to the
# syslog facility.
# Values: 0=debugging off, 1=debugging on

debug=1

Looking at the system logs, we saw the problem was with sudo:

Sep  10 04:11:35 hostname sudo:   root : TTY=pts/2 ; PWD=/var/log ; USER=root ; COMMAND=/usr/sbin/hpacucli controller slot=1 ld all show
Sep  10 04:12:55 hostname sudo:   nagios : sorry, you must have a tty to run sudo ; TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/sbin/hpacucli controller slot=1 ld all show

The solution was to comment out the following line in the /etc/sudoers file:

Defaults    requiretty
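Commenting out requiretty relaxes the rule for every account. A narrower alternative, using the standard per-user Defaults syntax of sudoers, is to exempt only the nagios user:

# In /etc/sudoers (edit with visudo): drop the tty requirement for nagios only
Defaults:nagios !requiretty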

When MySQL Starts Counting Sheep

We encountered a situation recently where the number of connections to our MySQL database started creeping up slowly but steadily. Strangely enough, all the connections were in sleep mode and the database was not being stressed. The number of connections first reached 30, where it normally stays below 20, and started triggering our monitoring scripts. After another hour the number of connections had reached 40, and this trend continued. All connections were coming from the same server, which was visible in the SHOW PROCESSLIST output:

+---------+------------+-------------------+-------------+
| Id      | User       | Host              | Command     |
+---------+------------+-------------------+-------------+
|      26 | repl       | 10.0.0.23:32795   | Binlog Dump |
| 1074631 | sfront     | 10.0.0.32:49906   | Sleep       |
| 1085681 | sfront     | 10.0.0.32:50823   | Sleep       |
| 1085932 | sfront     | 10.0.0.32:46255   | Sleep       |
| 1091130 | sfront     | 10.0.0.32:39273   | Sleep       |
| 1092442 | sfront     | 10.0.0.32:42023   | Sleep       |
| 1106425 | sfront     | 10.0.0.222:38971  | Query       |
+---------+------------+-------------------+-------------+

The MySQL documentation defines the Sleep state as:

The thread is waiting for the client to send a new statement to it.

The application running on the offending server was a Java application and luckily the only application on that server. A new module in this application was monitoring a setting in the database in a loop with 5-second breaks. The problem was caused by inefficient use of database connections: in each loop a new connection would be opened, a new prepared statement created and a new result set generated, and none of these were explicitly closed by the application. Now, because Java has a built-in garbage collector, these objects would be cleaned up automatically in time, but apparently the rate of creating new connections was just slightly higher than the rate of garbage collection, so the number of connections rose by about 10 per hour. Our maximum number of connections is set rather high, so this would not have caused a problem for another couple of days, but we would eventually have maxed out.
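A quick way to keep an eye on this from the shell (a sketch; both commands are standard MySQL client tools):

# Total threads currently connected, sleeping ones included:
mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_connected'"

# Count just the sleeping connections in the process list:
mysqladmin processlist | grep -c Sleep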

The solution was to have the developer nicely close all the database components and reuse them where possible. Since then the application has been consistently showing just one database connection.

Enabling and Using the Slow Query Log in MySQL

By special request, here is a post about the MySQL slow query log. MySQL has a wonderful feature that lets you keep track of all queries that took longer than a certain time to complete. To enable it, simply add the following line to your my.cnf file:

log-slow-queries = [path to the log file]

Second, it is very useful to specify the minimum amount of time a query should take before being considered a slow query. Again, simply add the following line to your my.cnf file:

long_query_time = [minimum time in seconds for query]

Unfortunately you do have to restart MySQL for this to take effect, but once you do you will have a very powerful and simple tool for optimizing your queries. After letting it run for a while with the application accessing the database, open the log file and you will see entries like this:

# Time: 080826 15:33:48
# User@Host: testuser[testuser] @  [10.0.0.1]
# Query_time: 2  Lock_time: 0  Rows_sent: 0  Rows_examined: 628951
select * from customers where customers_email_address = 'test@test.com' and customers_sites_id = '1';

As you can see, the limit has been set pretty low (1 second to be exact), because the Query_time shown is 2 seconds. Still, most simple queries should not last more than a fraction of a second. The complete query is shown, along with the number of rows that MySQL examined to get the result. The fact that this number is pretty high (628951) suggests that no index was used. The next step is to take this query and run the EXPLAIN command on it to verify whether an index was used or not. If that is the problem, then the simple solution is to add an index for this column, provided the query is used often.
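As a sketch, using the table and column names from the log entry above (the index name idx_email is our own invention):

EXPLAIN select * from customers
 where customers_email_address = 'test@test.com'
   and customers_sites_id = '1';
-- If the "key" column is NULL and "rows" is large, no index is being used.

ALTER TABLE customers ADD INDEX idx_email (customers_email_address);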

The second way to extract useful information from the slow query log is to look for repeated queries. This will mainly occur in web applications where the user can hit the refresh button to restart the query. If the first one did not complete quickly enough you can bet the second one won’t fare any better while the first one is still running. Seeing the same query appearing several times is a good sign that the query needs to be optimized.
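Rather than scanning the log by hand, the mysqldumpslow utility that ships with MySQL can group identical queries; for example (the log path is whatever you configured above):

# Sort by count (-s c) to surface the most repeated slow queries:
mysqldumpslow -s c /path/to/slow-query.log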

Tomcat Failure After Apache Rebuild in CPanel

After rebuilding Apache in CPanel using easyapache or WHM, there are many things that can go wrong, even if you use a stable branch of CPanel. For example, Tomcat installed from CPanel can fail after the build, and even though this might look unrelated to the Apache build, it happens quite frequently.

If you are running Tomcat on a CPanel server, here is what you can do to keep from losing your Tomcat instance:

  • Backup; newer CPanel versions back up Apache, configs, etc., but they will not care about Tomcat. This means it is your job to save the Tomcat files (see the sketch after this list). Here are some important locations you should save:

    • /usr/local/jdk - this is a link to the real JDK used on the system (jdk1.5.0_05 or jdk1.6.0_02, for example); save the real JDK also, just in case…
    • /usr/local/jakarta/tomcat - this will contain all your Tomcat configs, apps, logs, etc.
    • /usr/sbin/starttomcat and /usr/sbin/stoptomcat - scripts used to start and stop Tomcat (in case you made local changes, memory tunings, etc.)
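A minimal backup sketch before running easyapache (the archive path and name are our own choices):

# Archive the JDK, the Tomcat tree and the start/stop scripts before the rebuild;
# -h follows the /usr/local/jdk symlink so the real JDK is included too:
tar czhf /root/tomcat-backup-$(date +%F).tar.gz \
    /usr/local/jdk /usr/local/jakarta/tomcat \
    /usr/sbin/starttomcat /usr/sbin/stoptomcat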

The files above can be lost during the rebuild, and it is important to have them on hand to restore Tomcat if it fails to start after the Apache build. For example, the last time this happened to us we had to fix the /usr/local/jdk link to point to the proper JDK (we use 1.5 and CPanel changed the link to 1.6), and the startup scripts that contained various local customizations were, of course, overwritten :( .

Hopefully this information will be useful for other people, and hopefully you will see it before running easyapache :-) .

Drupal Filebrowser Configuration Issue

We have experienced a strange issue with our Drupal filebrowser module installation: we were getting repeated error messages about an open_basedir restriction being in effect.

# warning: file_exists() [function.file-exists]: open_basedir restriction in effect. File(/file-folder.png) is not within the allowed path(s): (/home/username:/usr/lib/php:/usr/local/lib/php:/tmp) in /home/username/public_html/modules/filebrowser/filebrowser.module on line 338.
# warning: file_exists() [function.file-exists]: open_basedir restriction in effect. File(/file-default.png) is not within the allowed path(s): (/home/username:/usr/lib/php:/usr/local/lib/php:/tmp) in /home/username/public_html/modules/filebrowser/filebrowser.module on line 338.

After several checks on our account setup, we found the issue was caused by our configuration of the filebrowser module. There are only two values to set for Filebrowser: “Root directory” and “Icons directory”. We had forgotten to set the value for “Icons directory” and left it empty, which caused the open_basedir errors: with an empty setting the module looks for the icon files at the filesystem root (hence File(/file-folder.png)), which is outside the allowed paths, so the failure surfaces as an open_basedir restriction warning.

Virtual Desktop in Gnome 2.22 / Fedora Core 9

Trying to figure out how to put your second monitor and dual-port graphics card to good use in FC9? How about creating a virtual desktop that spans the second monitor? Before rushing to edit xorg.conf, think about the layout and the position the second monitor will take. I like to have my 1600x1200 display to the right of the 1280x960 laptop display.

Since I want the virtual desktop to span both screens horizontally, I need to specify a virtual screen of 2880x1200. I arrived at this number by adding the pixel widths of both displays together (1280+1600) and taking the pixel height of the taller monitor (1200).

In xorg.conf I added "Virtual 2880 1200" to the Section "Screen":

Section "Screen"
    Identifier "Screen0"
    Device     "Videocard0"
    DefaultDepth     24
    SubSection "Display"
        Viewport   0 0
        Depth     24
        Virtual   2880 1200
    EndSubSection
EndSection

Now restart Xorg by pressing Ctrl-Alt-Backspace and run gnome-display-properties to adjust the position of the screen (left or right) and set the proper resolution. Be sure to uncheck "Mirror Screen".
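For reference, on drivers with RandR 1.2 support the same layout can also be set from a terminal with xrandr; a sketch, where the output names LVDS and VGA-0 are assumptions that vary per driver (check xrandr -q for yours):

# List the available outputs and their modes first:
xrandr -q

# Place the external 1600x1200 display to the right of the laptop panel:
xrandr --output LVDS --mode 1280x960 --output VGA-0 --mode 1600x1200 --right-of LVDS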