Redis troubleshooting: read error on connection

Overview

We’re facing a “read error on connection” exception while trying to send the “select” command (which selects the DB) to Redis. Ilya recently updated the exception message, and it now says “The redis database could not be selected.”

The exception appears spontaneously and persists for quite a while (e.g. a few hours), occurring a couple of times per minute (all of this is clearly visible in the reports on v185 and v186 for the LIVE ENV).

Ideas on the cause of the issue + TODOs

Ensure we use Phpredis flow and not “Standalone” flow

Simply confirm that the Live ENV uses the Phpredis flow (note: from local.xml we already know phpredis should be used, so this is about confirming that the code really uses it and doesn’t fall back to standalone).

We should use Phpredis because of multiple comments describing it as the more mature and reliable solution (comments like this one: https://github.com/colinmollenhour/Cm_Cache_Backend_Redis/issues/37#issuecomment-58007678 and a few more found).

Nevertheless, I have found that Guthy Renker (quite a solid app) has

<force_standalone>1</force_standalone>

in their local.xml.

So we might switch to the “standalone” flow at some point, just to give it a try, if we run out of other ideas.

Check the Redis error which gets returned

Hook into the exception we currently get (Ilya recently changed it to a more descriptive text) and add a call to the phpredis getLastError() method (see https://github.com/phpredis/phpredis#getlasterror for implementation details). Log the message carefully.
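The idea of attaching the last Redis error to the exception could look roughly like the sketch below. It is written in Python purely as a language-neutral illustration; the real change would live in the PHP code that wraps phpredis. FakeRedisClient, get_last_error() (a stand-in for phpredis getLastError()), select_db_with_diagnostics() and the sample error string are all hypothetical names introduced here.

```python
import logging

logger = logging.getLogger("redis_select")


class FakeRedisClient:
    """Hypothetical stand-in for the real client; select() always fails here."""

    def __init__(self, last_error):
        self._last_error = last_error

    def select(self, db_index):
        raise ConnectionError("read error on connection")

    def get_last_error(self):
        # Stand-in for phpredis getLastError(); may be None if nothing was recorded.
        return self._last_error


def select_db_with_diagnostics(client, db_index):
    """Select a DB; on failure, log and re-raise with the last Redis error attached."""
    try:
        client.select(db_index)
    except ConnectionError as error:
        last_error = client.get_last_error()
        message = (
            "The redis database could not be selected "
            f"(db={db_index}, exception={error}, last_error={last_error!r})"
        )
        logger.error(message)
        raise RuntimeError(message) from error
```

Keeping the original exception as the cause (the `from error` part) means the stack trace of the underlying read error is not lost when the more descriptive message is raised.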

Retry the failed Redis call or reinitiate the whole application once a Redis error has been caught

Once a Redis exception has been caught (meaning the end customer would get a report instead of the valid page), try to:

  1. Re-send the failed call, i.e. simply re-execute the failed call to Redis. This would allow execution to continue with no report shown. Together with what’s mentioned in the “Check the Redis error which gets returned” section, it gives us both: info collected on the error and seamless app execution.
  2. If the previous variant doesn’t work, use Mage::reset() and then Mage::run(…proper run type and code…) to reinitiate the application and make another attempt to deliver the valid page.
  3. If neither the 1st nor the 2nd variant works, terminate the app with a “Location: <current uri>” header so that the end customer’s browser repeats the request itself.

The first option is the preferred one, while the last one is the worst variant, both because of the monitoring systems attached and for the sake of customer experience. Note that we should repeat the reinitiation 3-4 times at most and then give up and show the report (if none of the rounds resulted in a valid page).
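The escalation described above (re-send first, then reinitiate a bounded number of times, then give up) can be sketched as follows. This is a Python illustration of the control flow only, not the actual Magento code: RedisReadError, call_with_retry() and deliver_page() are hypothetical names, and the attempt counts are assumptions picked to match the "3-4 times max" note.

```python
import logging
import time

logger = logging.getLogger("redis_retry")


class RedisReadError(Exception):
    """Hypothetical stand-in for the "read error on connection" exception."""


def call_with_retry(operation, max_attempts=4, delay_seconds=0.0):
    """Run operation(); retry on RedisReadError, re-raising after max_attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except RedisReadError as error:
            last_error = error
            logger.warning(
                "Redis call failed (attempt %d/%d): %s", attempt, max_attempts, error
            )
            if attempt < max_attempts and delay_seconds > 0:
                time.sleep(delay_seconds)
    raise last_error


def deliver_page(resend_call, reinitialize_app, max_reinit_rounds=3):
    """Try option 1 (re-send), then option 2 (reinitiate); None means both failed."""
    try:
        return call_with_retry(resend_call, max_attempts=2)
    except RedisReadError:
        logger.warning("Re-sending failed, reinitiating the application")
    try:
        return call_with_retry(reinitialize_app, max_attempts=max_reinit_rounds)
    except RedisReadError:
        # Caller falls back to the redirect (option 3) or shows the report.
        return None
```

Here resend_call would wrap the failed Redis call and reinitialize_app would wrap the reset-and-run step; a None result tells the caller to fall back to the redirect or to the report.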

Too much data was saved / appended into redis key or value

The idea comes from the following comment in phpredis issue #70: https://github.com/phpredis/phpredis/issues/70#issuecomment-6025598.

It might happen that we send too much data within one request to Redis.

The idea has already been partially confirmed by the overly long keys we detected. Ilya fixed this in Vaimo_Dyson_Model_Product::getCategoryByLevel() with https://bitbucket.org/vaimo/vaimo_dyson/pull-requests/5/dyson-1729-add-filter-by-store-specific/diff (see the changes to this method in app/code/local/Vaimo/Dyson/Model/Product.php).

The proposal is to log the length of the Redis key and the length of the value if they exceed some threshold.

Note, it is very likely that our previous idea about logging the last error would also provide this kind of info (if Redis got stuck because of too much data transferred).

@TODO: log all saves bigger than 1 MB
@TODO: test save / load operations with big chunks of data being transferred
@TODO: log all loads from Redis (in the format KEY – data size) in order to see whether there’s an issue with Redis value length (because the value gets appended to constantly and never gets flushed)
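A minimal sketch of the size logging proposed above, in Python as a language-neutral illustration (the real hook would sit in the PHP cache backend). The function names are hypothetical, the 1 MB value threshold comes from the TODO, and the 1 KB key threshold is an assumption made up for this example.

```python
import logging

logger = logging.getLogger("redis_size")

VALUE_THRESHOLD_BYTES = 1024 * 1024  # 1 MB, as in the TODO
KEY_THRESHOLD_BYTES = 1024           # assumed value, not from the original notes


def payload_sizes(key, value):
    """Return (key size, value size) in bytes; str inputs are measured as UTF-8."""
    key_bytes = key.encode("utf-8") if isinstance(key, str) else key
    value_bytes = value.encode("utf-8") if isinstance(value, str) else value
    return len(key_bytes), len(value_bytes)


def log_if_oversized(operation, key, value,
                     key_threshold=KEY_THRESHOLD_BYTES,
                     value_threshold=VALUE_THRESHOLD_BYTES):
    """Log "KEY – data size" when either size exceeds its threshold."""
    key_size, value_size = payload_sizes(key, value)
    oversized = key_size > key_threshold or value_size > value_threshold
    if oversized:
        logger.warning(
            "%s %r – key: %d bytes, value: %d bytes",
            operation, key, key_size, value_size,
        )
    return oversized
```

Calling log_if_oversized("save", key, data) before each save and log_if_oversized("load", key, data) after each load would cover both the first and the third TODO with one helper.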

Using old phpredis which might result in this error

We are using an old phpredis version, which might result in this error (“read error on connection”).

The latest phpredis version is 2.2.7.

Ours is 2.2.4, from 2013-09-02

(see https://pecl.php.net/package/redis)

This is what makes me think it might be a version-related issue:

https://github.com/phpredis/phpredis/pull/643 (Aug 5, 2015 – open). So let’s make sure our phpredis lib already includes this commit (https://github.com/phpredis/phpredis/pull/643/commits).

Issues for Credis lib

https://github.com/colinmollenhour/Cm_Cache_Backend_Redis/issues/37

(looks like ours – see the very beginning. Also check this https://github.com/colinmollenhour/Cm_Cache_Backend_Redis/issues/37#issuecomment-19012366)

Issues for Phpredis lib

https://github.com/phpredis/phpredis/pull/643 (Aug 5, 2015 – open)

https://github.com/phpredis/phpredis/issues/492 (Aug 3, 2014 – open)

https://github.com/phpredis/phpredis/issues/70 (open, very long, with comments just a few days old).

BTW, Colin himself writes here (see https://github.com/phpredis/phpredis/issues/70#issuecomment-4721338):

– the standalone PHP driver is used with no errors (while phpredis has errors)

– he’s not using persistent connections (for neither phpredis nor standalone mode)

Do this for debugging: https://github.com/phpredis/phpredis/issues/70#issuecomment-38945798

https://github.com/phpredis/phpredis/issues/668 (October 5 – open)

Useful info gathered (incl. well-known stuff)

Magento local.xml redis config explained:

https://github.com/colinmollenhour/Cm_Cache_Backend_Redis

SUNION with 180k sets

Investigate why SUNION calls sometimes take a long time.

In the observed case (from SLOWLOG), the SUNION command took just over 20 seconds when it was called with 180k tags.

Is this normal?

Why would Magento ever try to run SUNION with so many tags at one time?

Looking at the list of tags, why is it 180k tags long? Is it all the tags that exist? Even if so, why do we ever have that many tags in total?
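While those questions are open, one possible mitigation to experiment with is splitting a huge SUNION into smaller calls and merging the results on the client, so that no single command blocks Redis for 20 seconds. The Python sketch below only illustrates that idea; it is not what Magento or the cache backend currently does. sunion_in_chunks(), the sunion callable it receives and the chunk size of 500 are all assumptions.

```python
def chunked(items, size):
    """Yield consecutive slices of items, each at most size elements long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def sunion_in_chunks(sunion, keys, chunk_size=500):
    """Union many sets by issuing one SUNION per chunk of keys.

    sunion is any callable taking set keys as positional arguments and
    returning an iterable of members (for example a client's sunion method).
    """
    result = set()
    for chunk in chunked(list(keys), chunk_size):
        result |= set(sunion(*chunk))
    return result
```

The merged result is the same as a single SUNION over all keys, since set union is associative; the trade-off is more round trips in exchange for shorter individual commands.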
