I had an interesting system-down call with an existing Sametime 9.0.1 customer in the past week. The environment is over 18 months old and consists of a single instance of every server component, including ST Proxy, Meetings, ST Advanced and all the Media components. The Media components were added in Dec 2015 and everything had been fine since. The Meeting and Proxy servers each have their own WAS proxy in front of them to handle traffic over ports 80 / 443. Last week the Meeting node was restarted and its WAS Proxy stopped working. The proxy would load, but requests through it failed. The Meeting server was responding on its own application ports, so http(s)://hostname:9080 / 9443 both worked, but http(s)://hostname failed with
503 Service Unavailable
The WAS Proxy server showed as started. There were no errors in its logs or in those of the ST Meeting server. Not every WAS proxy was broken, because the one in front of the ST Proxy server still worked. In short, that error suggests the Meeting server is offline when we knew it wasn’t, and since there isn’t any real configuration for the WAS Proxy other than which node it points to, there was nothing to troubleshoot. I tried deleting and recreating the WAS Proxy a few times and I tried switching it to the alternate ports 81/444, but nothing would fix it.
It took a few days and some combined effort to find the cause. The WAS team wanted us to upgrade to WAS fixpack 5, but that would mean upgrading 8 working servers in the hope of fixing one WAS proxy. There was a suggestion that since the Meeting server was a single server, not a cluster, I could just change the Meeting server ports to use 80/443 instead of 9080/9443 and do away with the WAS proxy entirely. That would get rid of the problem but not fix it, just circumvent it. I wanted to fix it and find out why it happened.
I had checked the virtual hosts to make sure the hostname / port combination was in the stmeet host and wasn’t anywhere else, and discovered that new wildcard port entries had appeared in default_host for ports 80 and 443. I deleted those but that didn’t fix the problem. How did those port entries appear? I’ve seen this before: when you install new ST servers (as we did with Media in Dec) the install can sometimes write virtual host entries to the wrong places. In fact that was my first guess, but after I removed those entries from default_host and it still didn’t fix the problem I was out of ideas. Then Tony Payne from IBM spotted that the admin_host virtual host, which is only used by the SSC, had the ports 9080 and 9443 in it when it should only have 8700 and 8701. Again I assume these were added by the previous server installs, and of course I never went to look there because the Meeting server was specifically set to use the STMeet host.
I removed those extra ports from the admin_host virtual host definition and restarted the Meeting node and servers (clearing the temp directories first: \profilename\temp and \profilename\wstemp as well as \profilename\config\temp) and that fixed the problem.
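The check I was doing by eye (is the Meeting server's port list in its own virtual host and nowhere else?) can be sketched in a few lines of Python. This is only an illustration: the dictionary is a hand-typed, hypothetical snapshot of the host aliases using the ports mentioned in this post, the `find_duplicate_ports` helper is my own name, and the exact virtual host names in your cell may differ.

```python
def find_duplicate_ports(virtual_hosts, owner):
    """Return {other_host: [ports]} for ports of `owner` that also appear elsewhere."""
    owned = set(virtual_hosts[owner])
    duplicates = {}
    for host, ports in virtual_hosts.items():
        if host == owner:
            continue
        shared = owned & set(ports)
        if shared:
            duplicates[host] = sorted(shared)
    return duplicates


# Hypothetical snapshot of the broken state described above (typed in by hand,
# not read from WebSphere): host name -> ports listed in its host aliases.
broken = {
    "admin_host": {8700, 8701, 9080, 9443},   # 9080/9443 should not be here
    "default_host": {80, 443},                # stray wildcard entries
    "STMeet": {80, 443, 9080, 9443},
}

# The same snapshot after removing the stray entries.
fixed = {
    "admin_host": {8700, 8701},
    "default_host": set(),
    "STMeet": {80, 443, 9080, 9443},
}

print(find_duplicate_ports(broken, "STMeet"))
print(find_duplicate_ports(fixed, "STMeet"))
```

The point of checking every other virtual host, rather than just default_host, is that it would have flagged admin_host straight away, which is exactly the place I never thought to look.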
So why was the presence of those two ports 9080/9443 (used by the ST Meeting server) that were in a virtual host the ST Meeting server doesn’t even use causing the WAS Proxy to break? Why didn’t the Meeting server itself break and why didn’t the ST Proxy Server which also had a WAS proxy in front of it break?
Turns out that no matter what virtual host mapping you have in place for applications, in Sametime the code checks the admin_host and if a port appears there – it silently disables looking up any other host. The fact that the Meeting server ports appeared at all in the admin_host meant that the STMeet host was being ignored and the WAS Proxy had no way to direct the traffic.
Unfortunately none of that is visible in the logs or in the debug logs, which all reported the servers and services using the correct STMeet host, so there was nothing to see. It was a combination of Tony spotting the admin_host entries and me having had a previous call where a server install added ports to unwanted virtual hosts that allowed us to find it and fix it.
The ST Proxy server itself wasn’t affected because that server was running on 9082/9445 so its ports weren’t in admin_host and its virtual host therefore wasn’t ignored.
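To make that behaviour concrete, here is a rough Python model of the lookup as I understand it from the explanation above. It is not Sametime's actual code, just a sketch: `resolve_virtual_host` is a made-up function, and the "STProxy" host name and the dictionary contents are hypothetical stand-ins for the real configuration.

```python
ADMIN_HOST = "admin_host"


def resolve_virtual_host(port, mapped_host, virtual_hosts):
    """Model of the described behaviour: a port found in admin_host wins,
    and the virtual host the application is actually mapped to is ignored."""
    if port in virtual_hosts.get(ADMIN_HOST, set()):
        return ADMIN_HOST
    if port in virtual_hosts.get(mapped_host, set()):
        return mapped_host
    return None


# Hypothetical host aliases for the broken environment.
hosts = {
    "admin_host": {8700, 8701, 9080, 9443},
    "STMeet": {80, 443, 9080, 9443},
    "STProxy": {80, 443, 9082, 9445},
}

# Meeting server port: admin_host is picked and STMeet is never consulted.
print(resolve_virtual_host(9080, "STMeet", hosts))

# ST Proxy server port: not in admin_host, so its own host is used.
print(resolve_virtual_host(9082, "STProxy", hosts))

# After removing 9080/9443 from admin_host the Meeting server resolves correctly.
hosts["admin_host"] = {8700, 8701}
print(resolve_virtual_host(9080, "STMeet", hosts))
```

In this model the Meeting server's mapping is only honoured once its ports are gone from admin_host, while the ST Proxy server's ports were never there, which matches what we saw.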
Always good to have a problem fixed and learn a ton of stuff about application behaviour at the same time 🙂