502 Proxy Error after migrating from Fine to Gaprindashvili

gaprindashvili

#13

I think about rolling the vm back to fine, i am almost unable to go trough a new vm provision because this thing popups up several times trying to fill the values.


#14

if they are causing actual errors in the UI, I want to know about it.

That would presumably be the error modal.

We can make it ignore 502 errors, but … any user action which receives a 502 in response won’t do anything and the user won’t get notified anymore … does that really sound like a good idea?

Or should we be retrying those or something?


#15
diff --git a/app/assets/javascripts/miq_api.js b/app/assets/javascripts/miq_api.js
index ff4fae778..a3098c244 100644
--- a/app/assets/javascripts/miq_api.js
+++ b/app/assets/javascripts/miq_api.js
@@ -205,6 +205,10 @@
         return ret;
       }
 
+      if (response.status === 502) {
+        return ret;
+      }
+
       // true means skip all of them - no error modal at all
       if (options.skipErrors === true) {
         return ret;

would be how to prevent the error modal from appearing for 502 API requests. (But a similar thing would need to be made for angular $http requests and miqJqueryRequest.)


#16

That fix works for me.

Thank you.


#17

I don’t think we should blindly discard 502 errors. We would have to know what is failing in the apache and possibly evm.log/production.log. If they’re timeouts sending requests from the apache worker to our backend worker, it’s possible the UI/web service worker/web socket worker is hitting a bug in the code causing it to hang or spin at 100% cpu. We should investigate them as they happen.


#18

Skip this if you know how apache routes traffics to puma and how puma dispatches work to threads.

The way our puma processes work and how you get timeouts in ssl_error.log

They have a pool of 5 threads (from config/puma.rb).

Apache sits in front and checks if it can make a socket connection on each puma process and marks them as good or bad. Note, it contacts the puma “worker” (process), not the thread that handles the request, what puma calls the handler.

Apache gets a request and gives it to a “good” puma process that’s available according to the balancer configuration, which doesn’t really matter here.

If 1 puma thread runs code that hits an infinite loop, is blocked by a database lock or a filesystem lock, or any other blocking operation, puma will send the next request to another thread handler. And another, and another, until it reaches the maximum of 5 (config/puma.rb).

At this point, apache still “thinks” the puma process is able to handle requests because it can make a socket connection to it. But, the next request that apache tries to route to the puma process will timeout (502 to the user) because puma can’t create new threads to handle the request.

For future me and others who are seeing 502/500 errors

It’s possible that the underlying puma worker (ui/webservice/web socket) thread is hung, meaning it Started a request but never Completed it. If you’re hitting these errors, and especially timeouts in the ssl_error.log, this grep helps identify manageiq puma worker process thread ids that may be hung and from this, you can look to see what they last did:

From /var/www/miq/vmdb/log directory:

Grep for unmatched pairs of Started/Completed lines in production logs:

zgrep -E 'Started|Completed' production.log* | grep -Eo "#[0-9]+[:a-f0-9]+" | sort | uniq -c
 318 #11448:e10118 # the threads for this process are all even numbers, so they're ok
 334 #11448:e10334
 326 #11448:e10528
 326 #11448:e178c8
 318 #11448:e17c60
2181 #21413:113c01c # these threads are all odd numbers, so we need to look at each of these
2045 #21413:113d19c
2075 #21413:113c828
2159 #21413:113cc4c
2201 #21413:113d2c8

Then, grep to find which log file has the unmatched pairs of Started/Completed for one of the above puma threads:

zgrep -cE "#21413:113d19c.+(Started|Completed)" production.log*
production.log:0
production.log-20180115.gz:0
production.log-20180116.gz:470
production.log-20180117.gz:1468
production.log-20180118.gz:107  # <---- Here it is
production.log-20180129.gz:100
production.log:0
production.log-20180115.gz:0
production.log-20180116.gz:470
production.log-20180117.gz:1468
production.log-20180118.gz:107
production.log-20180129.gz:0

Finally, grep the thread id for what requests it made in that log to see what didn’t complete:

zgrep -E "#21413:113d19c" production.log-20180118.gz

Puma worker threads under rails will log a Started, other things, then a Completed line before starting on a new request.
Therefore, you’re looking for the last Started. You want to note the request and any parameters passed to it.

You can then grep the same Process:threadid combination, #21413:113d19c in this case, in the evm.log to see what happens to that worker.

At this point, you can open an issue with your findings.


#19

I just wanted to update this talk discussion to mention that our puma websocket worker, port 5000+, also will show up in production.log but instead of doing Started/Completed bookends, it does Started/Successfully. So the updated grep is this:

To see the PID/TID combinations and if they’re missing the Completed part, use this grep:

zgrep -E 'Started|(Completed|Successfully)' production.log | grep -Eo "#[0-9]+[:a-f0-9]+" | sort | uniq -c


#20

I did a reinstallation few weeks ago with Gaprindashvili-3.

Deployed OVA
Configured Network (IPv4+IPv6)
FreeIPA
SSL Certificiates
Added vSphere (Small Cluster with 3 Hosts)
Boom, after login in took about 1 Min. and i get hammered by 502 Proxy Errors. I got around 30 502 Popups in the process of deploying a template VM.

I checked the logs and didn’t find any odd lines. I found lots of socket upgrade errors in the log.

I am really the only one with this issue?


#21

@Sven_Jansen I believe apache times out a request after 2 minutes so grep your UI and web service worker(UI needs to hit /api to do things) PIDs from the production.log. If all puma threads (5 per ui/web service worker) are tied up or a single request takes more than 2 minutes, you’ll get 502 errors.


#22

Okay i just loggged in and use your updated grep command, i can find several entries just right after logging in…

zgrep -E ‘Started|(Completed|Successfully)’ production.log | grep -Eo “#[0-9]+[:a-f0-9]+” | sort | uniq -c
16 #2106:184c140
10 #2106:184c30c
14 #2106:184c99c
20 #2106:184cd98
12 #2106:184d1a8
6 #2153:4514024
6 #2153:4514808
4 #2153:4514a9c
4 #2153:4517580
6 #2153:4517a44


#23

@Sven_Jansen try a similar grep but look for the start and end and see if you have any long requests for those pids:

zgrep -E '(2106|2153)Started|(Completed|Successfully)' production.log


#24
[----] I, [2018-06-29T18:02:33.862614 #2106:184cd98]  INFO -- : Completed 200 OK in 14ms (Views: 7.6ms | ActiveRecord: 0.8ms)

[----] I, [2018-06-29T18:02:51.604546 #2106:184d1a8] INFO – : Completed 200 OK in 11ms (Views: 5.6ms | ActiveRecord: 0.6ms)
[----] I, [2018-06-29T18:03:11.076083 #2106:184c30c] INFO – : Completed 200 OK in 54ms (Views: 0.5ms | ActiveRecord: 10.8ms)
[----] I, [2018-06-29T18:03:11.407593 #2106:184c30c] INFO – : Completed 200 OK in 194ms (Views: 142.1ms | ActiveRecord: 24.7ms)
[----] I, [2018-06-29T18:03:12.314553 #2153:4517a44] INFO – : Successfully upgraded to WebSocket (REQUEST_METHOD: GET, HTTP_CONNECTION: Upgrade, HTTP_UPGRADE: WebSocket)
[----] I, [2018-06-29T18:03:12.318670 #2106:184c140] INFO – : Completed 200 OK in 1ms (Views: 0.7ms | ActiveRecord: 0.0ms)
[----] I, [2018-06-29T18:03:12.453171 #2106:184c99c] INFO – : Completed 200 OK in 1ms (Views: 0.5ms | ActiveRecord: 0.0ms)
[----] I, [2018-06-29T18:03:12.496238 #2106:184c140] INFO – : Completed 200 OK in 23ms (Views: 0.3ms | ActiveRecord: 3.0ms)
[----] I, [2018-06-29T18:03:12.510054 #2106:184cd98] INFO – : Completed 200 OK in 1ms (Views: 0.6ms | ActiveRecord: 0.0ms)
[----] I, [2018-06-29T18:03:12.569911 #2106:184cd98] INFO – : Completed 200 OK in 1ms (Views: 0.5ms | ActiveRecord: 0.0ms)
[----] I, [2018-06-29T18:03:12.589562 #2106:184c99c] INFO – : Completed 200 OK in 36ms (Views: 0.5ms | ActiveRecord: 19.7ms)
[----] I, [2018-06-29T18:03:12.593542 #2106:184c140] INFO – : Completed 200 OK in 40ms (Views: 0.4ms | ActiveRecord: 22.7ms)
[----] I, [2018-06-29T18:03:12.666187 #2106:184cd98] INFO – : Completed 200 OK in 28ms (Views: 0.4ms | ActiveRecord: 11.3ms)
[----] I, [2018-06-29T18:03:12.675919 #2106:184c99c] INFO – : Completed 200 OK in 17ms (Views: 16.8ms | ActiveRecord: 0.0ms)
[----] I, [2018-06-29T18:03:12.689404 #2106:184c140] INFO – : Completed 200 OK in 31ms (Views: 0.3ms | ActiveRecord: 5.8ms)
[----] I, [2018-06-29T18:03:12.689625 #2106:184d1a8] INFO – : Completed 200 OK in 12ms (Views: 11.8ms | ActiveRecord: 0.0ms)
[----] I, [2018-06-29T18:03:12.730478 #2106:184cd98] INFO – : Completed 200 OK in 1ms (Views: 0.5ms | ActiveRecord: 0.0ms)
[----] I, [2018-06-29T18:03:12.750139 #2106:184d1a8] INFO – : Completed 200 OK in 3ms (Views: 2.6ms | ActiveRecord: 0.0ms)
[----] I, [2018-06-29T18:03:12.772700 #2106:184c99c] INFO – : Completed 200 OK in 31ms (Views: 0.3ms | ActiveRecord: 11.4ms)
[----] I, [2018-06-29T18:03:12.794241 #2106:184c140] INFO – : Completed 200 OK in 35ms (Views: 0.4ms | ActiveRecord: 12.7ms)
[----] I, [2018-06-29T18:03:12.812284 #2106:184cd98] INFO – : Completed 200 OK in 20ms (Views: 0.8ms | ActiveRecord: 2.6ms)
[----] I, [2018-06-29T18:03:12.821941 #2106:184d1a8] INFO – : Completed 200 OK in 1ms (Views: 0.5ms | ActiveRecord: 0.0ms)
[----] I, [2018-06-29T18:03:12.852657 #2106:184c140] INFO – : Completed 200 OK in 1ms (Views: 0.8ms | ActiveRecord: 0.0ms)
[----] I, [2018-06-29T18:03:12.861915 #2106:184cd98] INFO – : Completed 200 OK in 1ms (Views: 0.7ms | ActiveRecord: 0.0ms)
[----] I, [2018-06-29T18:03:18.800490 #2106:184c30c] INFO – : Completed 200 OK in 221ms (Views: 132.6ms | ActiveRecord: 14.4ms)
[----] I, [2018-06-29T18:03:19.680920 #2106:184c99c] INFO – : Completed 200 OK in 1ms (Views: 0.6ms | ActiveRecord: 0.0ms)
[----] I, [2018-06-29T18:03:19.701942 #2153:4514024] INFO – : Successfully upgraded to WebSocket (REQUEST_METHOD: GET, HTTP_CONNECTION: Upgrade, HTTP_UPGRADE: WebSocket)
[----] I, [2018-06-29T18:03:20.464955 #2106:184c140] INFO – : Completed 200 OK in 346ms (Views: 7.4ms | ActiveRecord: 29.8ms)
[----] I, [2018-06-29T18:03:20.467957 #2106:184cd98] INFO – : Completed 200 OK in 3ms (Views: 2.5ms | ActiveRecord: 0.0ms)
[----] I, [2018-06-29T18:03:20.517662 #2106:184cd98] INFO – : Completed 200 OK in 1ms (Views: 0.5ms | ActiveRecord: 0.0ms)
[----] I, [2018-06-29T18:03:29.179953 #2153:4514808] INFO – : Successfully upgraded to WebSocket (REQUEST_METHOD: GET, HTTP_CONNECTION: Upgrade, HTTP_UPGRADE: WebSocket)
[----] I, [2018-06-29T18:03:41.263502 #2106:184c30c] INFO – : Completed 200 OK in 90ms (Views: 9.6ms | ActiveRecord: 4.5ms)
[----] I, [2018-06-29T18:03:41.638065 #2153:4517580] INFO – : Successfully upgraded to WebSocket (REQUEST_METHOD: GET, HTTP_CONNECTION: Upgrade, HTTP_UPGRADE: WebSocket)
[----] I, [2018-06-29T18:03:41.935035 #2106:184c99c] INFO – : Completed 200 OK in 433ms (Views: 7.8ms | ActiveRecord: 37.6ms)
[----] I, [2018-06-29T18:03:44.372358 #2106:184d1a8] INFO – : Completed 200 OK in 95ms (Views: 1.3ms | ActiveRecord: 13.8ms)
[----] I, [2018-06-29T18:03:46.503227 #2106:184c140] INFO – : Completed 200 OK in 12ms (Views: 0.5ms | ActiveRecord: 1.0ms)
[----] I, [2018-06-29T18:03:49.700941 #2153:4514a9c] INFO – : Successfully upgraded to WebSocket (REQUEST_METHOD: GET, HTTP_CONNECTION: Upgrade, HTTP_UPGRADE: WebSocket)
[----] I, [2018-06-29T18:03:50.477289 #2106:184cd98] INFO – : Completed 200 OK in 1822ms (Views: 1.1ms | ActiveRecord: 82.9ms)
[----] I, [2018-06-29T18:03:53.164469 #2106:184c30c] INFO – : Completed 200 OK in 281ms (Views: 98.9ms | ActiveRecord: 19.0ms)
[----] I, [2018-06-29T18:03:54.166204 #2106:184c99c] INFO – : Completed 200 OK in 259ms (Views: 120.8ms | ActiveRecord: 17.4ms)
[----] I, [2018-06-29T18:03:55.604659 #2106:184d1a8] INFO – : Completed 200 OK in 622ms (Views: 371.9ms | ActiveRecord: 19.4ms)
[----] I, [2018-06-29T18:03:58.667404 #2153:4517a44] INFO – : Successfully upgraded to WebSocket (REQUEST_METHOD: GET, HTTP_CONNECTION: Upgrade, HTTP_UPGRADE: WebSocket)
[----] I, [2018-06-29T18:04:08.401851 #2153:4514024] INFO – : Successfully upgraded to WebSocket (REQUEST_METHOD: GET, HTTP_CONNECTION: Upgrade, HTTP_UPGRADE: WebSocket)
[----] I, [2018-06-29T18:04:18.815553 #2153:4514808] INFO – : Successfully upgraded to WebSocket (REQUEST_METHOD: GET, HTTP_CONNECTION: Upgrade, HTTP_UPGRADE: WebSocket)
[----] I, [2018-06-29T18:04:29.808237 #2153:4517580] INFO – : Successfully upgraded to WebSocket (REQUEST_METHOD: GET, HTTP_CONNECTION: Upgrade, HTTP_UPGRADE: WebSocket)
[----] I, [2018-06-29T18:04:41.321950 #2153:4514a9c] INFO – : Successfully upgraded to WebSocket (REQUEST_METHOD: GET, HTTP_CONNECTION: Upgrade, HTTP_UPGRADE: WebSocket)
[----] I, [2018-06-29T18:04:53.330078 #2153:4517a44] INFO – : Successfully upgraded to WebSocket (REQUEST_METHOD: GET, HTTP_CONNECTION: Upgrade, HTTP_UPGRADE: WebSocket)
[----] I, [2018-06-29T18:05:05.762953 #2153:4514024] INFO – : Successfully upgraded to WebSocket (REQUEST_METHOD: GET, HTTP_CONNECTION: Upgrade, HTTP_UPGRADE: WebSocket)
[----] I, [2018-06-29T18:05:18.591062 #2153:4514808] INFO – : Successfully upgraded to WebSocket (REQUEST_METHOD: GET, HTTP_CONNECTION: Upgrade, HTTP_UPGRADE: WebSocket)


#25

Keep in mind, a UI login requires:

  1. web service worker running so the UI can do requests on /api for authentication and other things
  2. obviously, ui worker running
  3. memcached running if using session_store: cache in advanced settings
  4. database running and properly configured
  5. apache/httpd running (we start it automatically, it could fail to start if configuration is invalid)

    Maybe other things

#26

I didnt changed anything on the appliance except IP , FreeIPA Authentication, DNS Name, Changed SSL Certificates for the Web Interface and added my first Hypervisor which is VMware vSphere 6.0 with around 120 VMs. I put a little bit more memory in it because 8GB was not enough, currently running 12GB.


#27

@Sven_Jansen I don’t see anything in those log lines that are relevant. Maybe grep-i timeout in the ssl_error.log. This should live in /var/www/miq/vmdb/log/apache/ssl_error.log . From this log, get the ports and times that are seeing timeouts.

Then, grep the top_output.log for the failing port to find the process id that was using that port at that time. For example, if the timeout was on port 3000:

grep -E "3000|timesync|top" top_output.log

Once you know the process id from top, you can grep the production log for just that process id (12345 for example).

grep "12345" production.log


#28

I found that production.log ist quite small and easy to review because no one is working with it atm. the only strange thing i found is upgrade errors and unauthorized connections.

[----] E, [2018-06-29T18:04:41.322613 #2153:8678448] ERROR – : An unauthorized connection attempt was rejected
[----] E, [2018-06-29T18:04:41.322806 #2153:8678448] ERROR – : Failed to upgrade to WebSocket (REQUEST_METHOD: GET, HTTP_CONNECTION: Upgrade, HTTP_UPGRADE: WebSocket)

I click something, see that rendering stuff happing without issues and after sitting around that reject und upgrade errors appear in production log.

The other stuff that looks good is just rendering stuff like dashboard and provision window i opened for testing.


#29

Checked session_store: cache
Checked memcached process is runnig
Checked database… running, disk io/memory okay.
Checked httpd running

currently checking other logs for strange stuff, but so far production.log is not helpful except the two errors i mentioned.


#30

@Sven_Jansen we’ve seen a 502 error in the UI because the web service worker deadlocks. I’m not sure what versions are affected besides the master branch and gaprindashvili. Here is the proposed workaround for now: https://github.com/ManageIQ/manageiq-api/pull/416


#31

Hi Joe,

i applied your patch for “lib/api/environment.rb” and it solved my issue, i see no more deadlocks or 502 errors.

Thank you very much for your help!

Sven


#32

@Sven_Jansen that’s great to hear! I’ll try to update this thread when we figure out a solution instead of the workaround above. Either way, it should be transparent.