Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.7.1
    • Fix Version/s: 1.8DR#2, 1.8
    • Component/s: Framework
    • Labels:
      None
    • Environment:
      ICEfaces

      Description

      The ability to "failover" to an alternate cluster node is desirable for many application deployments.

        Issue Links

          Activity

          Ted Goddard added a comment -

          Failover in Java enterprise clusters is typically implemented in two ways: by using a database that has its own internal (clustered) failover mechanism, and by supporting failover in the web tier through session replication.

          Database clustering is outside the scope of this discussion and is provided by existing third party implementations.

          The central interest in failover for ICEfaces then becomes support for session replication.

          A stock JSF application supports failover through persistence of component saved state in the session. This saved state is replicated to alternate cluster nodes. In the case of node failure, the "sticky session" is transferred to the alternate node and execution proceeds with minimal impact (since the session is the only user-specific state stored in the web tier).
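
          For replication to pick up this state, everything the application stores in the session, including JSF backing beans, must be serializable, and the web application must be marked distributable in its deployment descriptor. A minimal sketch, with an illustrative bean name:

              import java.io.Serializable;

              // Illustrative session-scoped backing bean: because it implements
              // Serializable, the container can replicate it, along with the rest
              // of the session, to the alternate cluster node.
              public class CounterBean implements Serializable {
                  private int clicks;

                  public int getClicks() { return clicks; }

                  public void increment() { clicks++; }
              }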

          The following is a strategy that should allow an ICEfaces application to failover to an alternate node:

          1. Support existing JSF state saving. This work is currently being undertaken to reduce server-side memory requirements (the persistent component tree becomes optional), but it will have important benefits for failover: the essential aspect is that it allows the application to continue execution with the component tree in the same state it was in before the failure occurred.
          2. Treat the case of a null "old DOM" as a full page refresh. Rather than suffer the performance cost of constantly transferring the DOM to the alternate node, simply refresh users' pages in the (rare) event of node failure. The important point here is that application functionality is not affected; a full-page refresh imparts only a usability impact and can occur under other circumstances as well.
          3. Analyze the servlet stack for "session-like" state (such as icefacesID and viewNumbers). It may be worthwhile to move some of this state into the session in a controlled manner (such as the single serializable session object sketched below).
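
          A minimal sketch of what point 3 could look like, assuming the state is gathered into a single session attribute; the class and field names are illustrative, not existing ICEfaces types:

              import java.io.Serializable;
              import java.util.HashSet;
              import java.util.Set;

              // Illustrative holder for the "session-like" servlet-stack state.
              // Storing it as one Serializable session attribute lets the
              // container replicate it to the other cluster nodes.
              public class ReplicableIceState implements Serializable {
                  private String icefacesId;
                  private final Set<String> viewNumbers = new HashSet<String>();

                  public String getIcefacesId() { return icefacesId; }
                  public void setIcefacesId(String id) { this.icefacesId = id; }
                  public Set<String> getViewNumbers() { return viewNumbers; }
              }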

          Ted Goddard added a comment -

          The biggest unknown in this is the "session-like" state that we have, but it will be easier to tackle this when we have state-saving in place and can see how ICEfaces breaks when failover occurs in a cluster.

          Ted Goddard added a comment -

          Will try to address this for 1.7.2.

          Deryk Sinotte added a comment -

          Assigning to Greg

          Greg Dick added a comment -

          I've had a closer look at the object relationships in the framework, and there's some good news and bad news. The good news is that the object structure is constructed in the service methods of the various objects, not in response to a sessionCreated event. For quick review, here's a high-level diagram of the object structure in question:

          MainServlet 1 ---- 1 PathDispatcher 1 ---- 1 SessionDispatcher

          These three objects are constructed monolithically, and exist on whatever clustered server we run. The SessionDispatcher maintains a map of sessionBoundServers keyed by session id.

          SessionDispatcher 1 ---- * MainSessionBoundServlets 1 ---- * Views

          View 1 ---- 1 PathDispatcherServer 1 ---- * ViewBoundServers (dispatched on request path)

          This object chain is constructed on the fly; the MainSessionBoundServlets are identified by the ice.session parameter, and the Views by ice.view parameters.

          Assume Node 2 has just taken over sole processing of requests after Node 1 halts, and a client makes a request of the server. I think what happens now is that the call in the SessionDispatcher to request.getSession(true) will be able to return a session object (due to session replication) for a given JSESSIONID without creating it from scratch. The bad news is that this session id will have no matching session-bound server in the map, so a new MainSessionBoundServlet will be created. This mints a new ice.session value in the constructor, which is problematic, since it won't match the one the client has.
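
          To make the failure mode concrete, here is a simplified model of that lookup; these are illustrative stand-ins, not the actual ICEfaces classes:

              import java.util.HashMap;
              import java.util.Map;
              import java.util.UUID;

              // Simplified model of the problem described above.
              class SessionBoundServer {
                  final String iceSession;

                  SessionBoundServer() {
                      // A fresh ice.session value is minted in the constructor; after
                      // failover it cannot match the value the client already holds.
                      iceSession = UUID.randomUUID().toString();
                  }
              }

              class DispatcherModel {
                  private final Map<String, SessionBoundServer> sessionBoundServers =
                      new HashMap<String, SessionBoundServer>();

                  SessionBoundServer lookup(String jsessionId) {
                      SessionBoundServer server = sessionBoundServers.get(jsessionId);
                      if (server == null) {
                          // On the failover node the replicated JSESSIONID is known to
                          // the container, but this in-memory map is empty, so a new
                          // server object (and ice.session value) is created here.
                          server = new SessionBoundServer();
                          sessionBoundServers.put(jsessionId, server);
                      }
                      return server;
                  }
              }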

          There's another potential difficulty with the ViewBoundServer instances, depending on the nature of the first request that arrives from the client. If the request is a GET and contains an "RVN" parameter, a View object will be constructed in response to the request even if it doesn't already exist. If the request is a POST and it contains an ice.view parameter that doesn't exist in the views Map, then the response will be that the session has expired.

          I know there are difficulties in the protocol between the client and server, and that the server can't just create views at will when receiving requests, because of the problems that arise when server objects are created where they aren't wanted. We have events for when a new icefacesId or view number is created and destroyed. I wonder if those could be broadcast to the other servers in the cluster using JMS? We could build a small piece of functionality that would keep each node apprised of the other nodes' session and view state.
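
          A sketch of that JMS idea, assuming a JMS 1.1 topic; the JNDI names and the message format are hypothetical:

              import javax.jms.Session;
              import javax.jms.TextMessage;
              import javax.jms.Topic;
              import javax.jms.TopicConnection;
              import javax.jms.TopicConnectionFactory;
              import javax.jms.TopicPublisher;
              import javax.jms.TopicSession;
              import javax.naming.InitialContext;

              // Hypothetical broadcaster: publishes icefacesId/view lifecycle
              // events to a topic so peer nodes can mirror the bookkeeping.
              public class ViewEventBroadcaster {
                  public void broadcast(String event, String icefacesId, String viewNumber)
                          throws Exception {
                      InitialContext ctx = new InitialContext();
                      TopicConnectionFactory factory = (TopicConnectionFactory)
                          ctx.lookup("jms/ConnectionFactory");               // hypothetical JNDI name
                      Topic topic = (Topic) ctx.lookup("jms/iceViewEvents"); // hypothetical topic
                      TopicConnection connection = factory.createTopicConnection();
                      try {
                          TopicSession session =
                              connection.createTopicSession(false, Session.AUTO_ACKNOWLEDGE);
                          TopicPublisher publisher = session.createPublisher(topic);
                          TextMessage message = session.createTextMessage(
                              event + ":" + icefacesId + ":" + viewNumber);
                          publisher.publish(message);
                      } finally {
                          connection.close();
                      }
                  }
              }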

          Greg Dick added a comment -

          Observations with Tomcat 6.0.14

          I can get session scoped backing bean values to be duplicated from one node to the other, but only under a very prescribed set of circumstances.

          Both nodes must be up and in good health when the application creates a session. It appears that Tomcat isn't very robust about duplicating sessions to a node member that comes back online if that node was down when the session was created. The nodes appear to fire session duplication events, which work if all nodes are receiving, but there is no session duplication strategy (persistent attempts at synchronization, retrograde updates, etc.) to handle the case where something changes while a node is down.

          Getting failover to occur is not that easy. Apache will load-balance between the nodes, but gracefully shutting down Tomcat on the primary node invalidates the sessions, which kills them on all nodes. The only way to achieve failover while leaving the application in the desired state is to terminate the app server (kill -9).

          Our active protocol between client and server is not helping. If the first interaction between client and server is any of the blocking requests (ping, receive-updated-views, receive-updates) rather than a manual page reload, the application gets the session-timeout message, because the blocking request verifies the request via the view number and the necessary object structure has not been set up for the user.

          All we ever get is one-way failover. A session on Node A is found on Node B, but once A is restarted there doesn't seem to be any way to fail back. I think this is on account of the lack of a coherent session duplication strategy in Tomcat. Also, I have never been able to get a session duplicated from lntest7 to lntest6 (lntest6 being the first node in Apache's load-balancing table, or Node A). I can't account for this; the configuration should not have any bias one way or the other.

          In the scenario that works, when I kill Tomcat on Node A I immediately get a 'service not available' display in the browser. I can reload the application, and the session-scoped values appear. When I kill Tomcat on Node B (lntest7) I don't get this interaction: the application just stays on the same page, and when the reload comes back it has always started a new session with empty backing bean values.

          Sticky-session mechanics between Apache and Tomcat

          Tomcat appends a string (defined in httpd.conf) to the session id, which is used to identify the intended target node. This approach puts the burden of identifying the target node on the client, since the string is visible to both the client and the server; the client passes it around rather than Apache keeping a hash of session ids pointing to their current target nodes. This means that sessions like CDXDC[...].node1 and CDXDC[...].node2 can be created for the same client. Tomcat manages to keep this straight and is able to duplicate the session information between these two independent sessions, but our SessionDispatcher HashMap cannot. This doesn't really affect the session failover work, but it does mean we will occasionally keep a MainSessionBoundServlet sub-object tree around for no reason. However, we need to see more of the servers' strategies for session affinity before we start looking for "." characters in the session id.
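
          If we do end up normalizing the ids, the check could be as small as the following sketch (a hypothetical helper; it assumes the jvmRoute suffix is everything after the first "."):

              // Hypothetical normalization: strip the jvmRoute suffix so that
              // CDXDC[...].node1 and CDXDC[...].node2 key the same dispatcher entry.
              public final class SessionIds {
                  private SessionIds() {}

                  public static String canonical(String sessionId) {
                      int dot = sessionId.indexOf('.');
                      return dot < 0 ? sessionId : sessionId.substring(0, dot);
                  }
              }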

          Greg Dick added a comment - - edited

          Several issues were identified and finally resolved.

          The above comment relates to having only a one-request grace period available to update the JSESSIONID cookie. When the new primary server takes over, the very first request it sees is 'rebranded' (Tomcat only): it is given essentially the primary session id with the new node id (from the jvmRoute parameter in server.xml) appended to it. A new sessionCreated event is generated and the request is handled.

          In ICEfaces, if this first request was a blocking request, the set-cookie header wasn't getting back to the client before the receive-updates request was sent, so that request went out with the wrong JSESSIONID, and this caused the SessionVerifier to report that the session had been invalidated. This was fixed by having the handler for the blocking request detect this case and immediately return a reload response.
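
          A sketch of the shape of that fix, with hypothetical class and payload names (the real handler lives in the blocking-request path):

              import java.io.IOException;
              import javax.servlet.http.HttpServletRequest;
              import javax.servlet.http.HttpServletResponse;

              // Hypothetical check inside the blocking-request handler: if the
              // client's JSESSIONID no longer matches the rebranded session id,
              // answer right away with a reload response instead of parking the
              // request.
              public class BlockingRequestGuard {
                  boolean answerWithReloadIfStale(HttpServletRequest request,
                                                  HttpServletResponse response)
                          throws IOException {
                      String requestedId = request.getRequestedSessionId();
                      String currentId = request.getSession(true).getId();
                      if (requestedId != null && !requestedId.equals(currentId)) {
                          response.setContentType("text/xml");
                          response.getWriter().write("<reload/>"); // hypothetical payload
                          return true;
                      }
                      return false; // otherwise park the request in the blocking queue
                  }
              }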

          Also, Apache 2.2.10 immediately rejects requests during the failover process with a 503 error, whereas 2.2.3 would leave some of the requests dangling. This would tie up IE completely for five minutes by default, since it was easy to generate two unanswered requests. This has also been fixed by getting the client to enter into a reload protocol, partly to help with failover but also to help with general AHS errors.

          Server-push rendering can work on failover if the user's code is kept in request-scoped beans and does a bit more gardening on failover. The Jira for server push is http://jira.icefaces.org/browse/ICE-3815

          Reloads are necessary in order to reconstitute the object plumbing in ICEfaces so that we can get the <pong> responses back to the blocking servlet queues.

          Greg Dick added a comment -

          Failover (with noted behaviour for Apache 2.2.10 and Tomcat 6.0.18) now works in synchronous and asynchronous modes.


            People

            • Assignee:
              Unassigned
            • Reporter:
              Ted Goddard
            • Votes:
              5
            • Watchers:
              3

              Dates

              • Created:
              • Updated:
              • Resolved: