Skip this if you know how apache routes traffics to puma and how puma dispatches work to threads.
The way our puma processes work and how you get timeouts in ssl_error.log
They have a pool of 5 threads (from config/puma.rb).
Apache sits in front and checks if it can make a socket connection on each puma process and marks them as good or bad. Note, it contacts the puma “worker” (process), not the thread that handles the request, what puma calls the handler.
Apache gets a request and gives it to a “good” puma process that’s available according to the balancer configuration, which doesn’t really matter here.
If 1 puma thread runs code that hits an infinite loop, is blocked by a database lock or a filesystem lock, or any other blocking operation, puma will send the next request to another thread handler. And another, and another, until it reaches the maximum of 5 (config/puma.rb).
At this point, apache still “thinks” the puma process is able to handle requests because it can make a socket connection to it. But, the next request that apache tries to route to the puma process will timeout (502 to the user) because puma can’t create new threads to handle the request.
For future me and others who are seeing 502/500 errors
It’s possible that the underlying puma worker (ui/webservice/web socket) thread is hung, meaning it
Started a request but never
Completed it. If you’re hitting these errors, and especially timeouts in the
ssl_error.log, this grep helps identify manageiq puma worker process thread ids that may be hung and from this, you can look to see what they last did:
Grep for unmatched pairs of Started/Completed lines in production logs:
zgrep -E 'Started|Completed' production.log* | grep -Eo "#[0-9]+[:a-f0-9]+" | sort | uniq -c
318 #11448:e10118 # the threads for this process are all even numbers, so they're ok
2181 #21413:113c01c # these threads are all odd numbers, so we need to look at each of these
Then, grep to find which log file has the unmatched pairs of Started/Completed for one of the above puma threads:
zgrep -cE "#21413:113d19c.+(Started|Completed)" production.log*
production.log-20180118.gz:107 # <---- Here it is
Finally, grep the thread id for what requests it made in that log to see what didn’t complete:
zgrep -E "#21413:113d19c" production.log-20180118.gz
Puma worker threads under rails will log a
Started, other things, then a
Completed line before starting on a new request.
Therefore, you’re looking for the last
Started. You want to note the request and any parameters passed to it.
You can then grep the same Process:threadid combination,
#21413:113d19c in this case, in the evm.log to see what happens to that worker.
At this point, you can open an issue with your findings.