502 Proxy Error after migrating from Fine to Gaprindashvili


a few days ago i setup a new appliance with gaprindashvili GA and imported my database from my fine-4 appliance including transferring my GUID file. Importing the v2_key fails because appliance_console always crashes with decipher not possible so i just removed the key and let in regenerate. I ran the upgrade scripts without any issues and startet the services.

The ManageIQ run and provision, delete, metric everything works but i get proxy error 502 popup all over the place and i have no clue whats the issue.

Popup Message:
Status 502 Proxy Error
Content-Type text/html; charset=iso-8859-1
Data 502 Proxy Error

Proxy Error

The proxy server received an invalid
response from an upstream server.

The proxy server could not handle the request GET /api.

Reason: Error reading from remote server

With my Chrome Console i can see lots of 502 errors and errors about “Unable to retrieve a valid ws_token!”.

Any idea whats wrong here? is this a issue with my generated/not migrated key?

1 Like

update: It look like this is some sort of session issue, after login everything looks fine, after around 1-2 minutes the system behave like this and throws this error every few seconds. I tested this with Chrome, IE, Firefox and Edge and all Browser behave like this.

I’m experiencing this as well

Looks like this is deliberate, and has been this way for a while, but some recent change in the UI code base has started advertising Apache errors in the UI.

To reduce the frequency of seeing these messages I’ve bumped the number of UI workers on my UI appliances so more balancer ports will be occupied. Is there somewhere we can filter these messages?

Good to know that i am not the only one. I keep an eye on that bug and hope this get fixed.

This was one such cause of a 502 error in the UI:

If you’re getting 502 errors, you can use the troubleshooting steps listed here to troubleshoot.

Basically, you have to see if you’re getting “The timeout specified has expired” timeouts in the ssl_error.log, then you should note which port is timing out… “from remote server”… (3000 in that case).

Then, find which worker process had port 3000 at that time by grep’ing the top_output.log for that time period.

Finally, once you know the process id of that worker, you can grep what that process was doing around that time in the production.log and evm.log. See the troubleshooting steps above for examples.

1 Like

Note, the errors in the ssl_error.log that looks like this:

[Wed Jul 05 07:06:40.803674 2017] [proxy:error] [pid 12987] (111)Connection refused: AH00957: HTTP: attempt to connect to ( failed
[Wed Jul 05 07:06:40.803719 2017] [proxy:error] [pid 12987] AH00959: ap_proxy_connect_backend disabling worker for ( for 60s

are expected. Instead of restarting apache whenever we stop or start UI, web service, and web socket worker processes, we pre-populate the apache balancer pool with port ranges for these workers and allow apache to detect which ones are actually available and send traffic to them. These log lines are less than ideal but we’re more than open to finding a better way of doing this.

Note, apache has had issues with gracefully restarting itself with balancer configuration changes, with that long standing issue fixed, perhaps we could finally use the graceful restart option. I’m not sure if frequently gracefully restarting apache due to workers starting and stopping is going to cause other issues though.

yes i got a lot of these in ssl_error.log after login into ManageIQ

[Wed Feb 21 10:48:27.293987 2018] [proxy:error] [pid 3394] (111)Connection refused: AH00957: HTTP: attempt to connect to ( failed
[Wed Feb 21 10:48:27.294047 2018] [proxy:error] [pid 3394] AH00959: ap_proxy_connect_backend disabling worker for ( for 60s
[Wed Feb 21 10:48:27.294069 2018] [proxy_http:error] [pid 3394] [client 2001:xxx:xxx:xxxx:xxxx:xxxx:xxxx:60369] AH01114: HTTP: failed to make connection to backend: 0.0.0. 0, referer: https://miq.xxxx.com/

After just one login the logs fills up with connection attempts to localhost 3001-3009, 4001-4009,
5001… and so on. Looks like all ports get refused.

I hope this has nothing to do with ipv6 enabled on this host.

@Sven_Jansenar as long as you have worker processes listening on 3000, 4000, 5000, those errors should be fine albeit annoyingly verbose. We put the worker’s port in the proctitle and it should be visible in ps, top (with full command). It’s also visible in bin/rake evm:status in the queue name / URL column.

Make sure you have both the web services and UI role enabled and those workers running, then look to see what port the UI worker is using and see if you’re getting any errors in apache for that specific port. You can also look for that worker’s PID and grep the evm.log and production.log for that pid to see if there are errors.

I think our issue is less that these errors appear in the log, because as you said they are harmless, and more that they generate a big error box in the ui-classic.

@01100010011001010110 if they are causing actual errors in the UI, I want to know about it. They should not be causing problems. My concern is they are so verbose that they hide the actual error.

The Azure PaaS Postgres I moved my dev environment to has decided to crash; I will get you a screencap as soon as it’s back up. I’ve been chatting with support for several days, so it may be a while.

I think about rolling the vm back to fine, i am almost unable to go trough a new vm provision because this thing popups up several times trying to fill the values.

if they are causing actual errors in the UI, I want to know about it.

That would presumably be the error modal.

We can make it ignore 502 errors, but … any user action which receives a 502 in response won’t do anything and the user won’t get notified anymore … does that really sound like a good idea?

Or should we be retrying those or something?

diff --git a/app/assets/javascripts/miq_api.js b/app/assets/javascripts/miq_api.js
index ff4fae778..a3098c244 100644
--- a/app/assets/javascripts/miq_api.js
+++ b/app/assets/javascripts/miq_api.js
@@ -205,6 +205,10 @@
         return ret;
+      if (response.status === 502) {
+        return ret;
+      }
       // true means skip all of them - no error modal at all
       if (options.skipErrors === true) {
         return ret;

would be how to prevent the error modal from appearing for 502 API requests. (But a similar thing would need to be made for angular $http requests and miqJqueryRequest.)

That fix works for me.

Thank you.

I don’t think we should blindly discard 502 errors. We would have to know what is failing in the apache and possibly evm.log/production.log. If they’re timeouts sending requests from the apache worker to our backend worker, it’s possible the UI/web service worker/web socket worker is hitting a bug in the code causing it to hang or spin at 100% cpu. We should investigate them as they happen.

Skip this if you know how apache routes traffics to puma and how puma dispatches work to threads.

The way our puma processes work and how you get timeouts in ssl_error.log

They have a pool of 5 threads (from config/puma.rb).

Apache sits in front and checks if it can make a socket connection on each puma process and marks them as good or bad. Note, it contacts the puma “worker” (process), not the thread that handles the request, what puma calls the handler.

Apache gets a request and gives it to a “good” puma process that’s available according to the balancer configuration, which doesn’t really matter here.

If 1 puma thread runs code that hits an infinite loop, is blocked by a database lock or a filesystem lock, or any other blocking operation, puma will send the next request to another thread handler. And another, and another, until it reaches the maximum of 5 (config/puma.rb).

At this point, apache still “thinks” the puma process is able to handle requests because it can make a socket connection to it. But, the next request that apache tries to route to the puma process will timeout (502 to the user) because puma can’t create new threads to handle the request.

For future me and others who are seeing 502/500 errors

It’s possible that the underlying puma worker (ui/webservice/web socket) thread is hung, meaning it Started a request but never Completed it. If you’re hitting these errors, and especially timeouts in the ssl_error.log, this grep helps identify manageiq puma worker process thread ids that may be hung and from this, you can look to see what they last did:

From /var/www/miq/vmdb/log directory:

Grep for unmatched pairs of Started/Completed lines in production logs:

zgrep -E 'Started|Completed' production.log* | grep -Eo "#[0-9]+[:a-f0-9]+" | sort | uniq -c
 318 #11448:e10118 # the threads for this process are all even numbers, so they're ok
 334 #11448:e10334
 326 #11448:e10528
 326 #11448:e178c8
 318 #11448:e17c60
2181 #21413:113c01c # these threads are all odd numbers, so we need to look at each of these
2045 #21413:113d19c
2075 #21413:113c828
2159 #21413:113cc4c
2201 #21413:113d2c8

Then, grep to find which log file has the unmatched pairs of Started/Completed for one of the above puma threads:

zgrep -cE "#21413:113d19c.+(Started|Completed)" production.log*
production.log-20180118.gz:107  # <---- Here it is

Finally, grep the thread id for what requests it made in that log to see what didn’t complete:

zgrep -E "#21413:113d19c" production.log-20180118.gz

Puma worker threads under rails will log a Started, other things, then a Completed line before starting on a new request.
Therefore, you’re looking for the last Started. You want to note the request and any parameters passed to it.

You can then grep the same Process:threadid combination, #21413:113d19c in this case, in the evm.log to see what happens to that worker.

At this point, you can open an issue with your findings.

1 Like

I just wanted to update this talk discussion to mention that our puma websocket worker, port 5000+, also will show up in production.log but instead of doing Started/Completed bookends, it does Started/Successfully. So the updated grep is this:

To see the PID/TID combinations and if they’re missing the Completed part, use this grep:

zgrep -E 'Started|(Completed|Successfully)' production.log | grep -Eo "#[0-9]+[:a-f0-9]+" | sort | uniq -c

I did a reinstallation few weeks ago with Gaprindashvili-3.

Deployed OVA
Configured Network (IPv4+IPv6)
SSL Certificiates
Added vSphere (Small Cluster with 3 Hosts)
Boom, after login in took about 1 Min. and i get hammered by 502 Proxy Errors. I got around 30 502 Popups in the process of deploying a template VM.

I checked the logs and didn’t find any odd lines. I found lots of socket upgrade errors in the log.

I am really the only one with this issue?