ICEpush / PUSH-183

Clustered Cloud Push Out-of-Band Notification

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: EE 3.0.0
    • Fix Version/s: EE-3.2.0.GA
    • Component/s: Push Library, Push Server
    • Labels:
      None
    • Environment:
      ICEpush, EPS, clustered & non-clustered
    • Assignee Priority:
      P1

      Description

      ICEpush was designed to be as stateless as possible, even in the cluster case. A rough upper bound should be 50,000 clients per node since that is the maximum number of TCP connections per IP address (this is an astronomical number for JSF but is conceivable for ICEpush itself, so our data structures and intra-cluster traffic should keep this in mind).

      Group join/leave is broadcast to the cluster allowing each node to maintain a list of groups and PUSHID members.

      Each browser maintains a list of listening PUSHIDs and this is sent with the listen request. The browser is responsible for cleaning up PUSHIDs no longer active in any of its windows.

      A push is broadcast to the cluster with just a group name. Each node notifies the listening PUSHID members it has.
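      As an editorial illustration of this fan-out (not ICEpush source; class and method names here are hypothetical), a node could keep a group-to-members map that the join/leave broadcasts update, and notify only the member PUSHIDs it is serving locally when a push broadcast carrying just a group name arrives:

      {code:java}
      import java.util.Map;
      import java.util.Set;
      import java.util.concurrent.ConcurrentHashMap;
      import java.util.concurrent.CopyOnWriteArraySet;

      public class GroupRegistry {
          private final Map<String, Set<String>> membersByGroup = new ConcurrentHashMap<>();

          // Group join/leave broadcasts keep the per-node membership map current.
          public void onGroupJoin(String group, String pushID) {
              membersByGroup.computeIfAbsent(group, g -> new CopyOnWriteArraySet<>()).add(pushID);
          }

          public void onGroupLeave(String group, String pushID) {
              Set<String> members = membersByGroup.get(group);
              if (members != null) {
                  members.remove(pushID);
              }
          }

          // A push broadcast carries only the group name; each node notifies the
          // listening member PUSHIDs it happens to be serving.
          public void onPushBroadcast(String group, LocalNotifier notifier) {
              for (String pushID : membersByGroup.getOrDefault(group, Set.of())) {
                  if (notifier.isListeningLocally(pushID)) {
                      notifier.notify(pushID); // wakes the blocking listen.icepush connection
                  }
              }
          }

          // Stand-in for the node-local notification mechanism.
          public interface LocalNotifier {
              boolean isListeningLocally(String pushID);
              void notify(String pushID);
          }
      }
      {code}

      The point of the sketch is that the broadcast itself carries no member list; each node resolves the group against its own local state.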

      PUSHIDs are discarded when no listen request has occurred for a timeout period. (*)

      (*) This is a problem for the cluster case: a listen at Node A may go to Node B on the next request. Node A should not discard the PUSHID just because it has not seen it -- the PUSHID is still
      active in the cluster at Node B. There is no time-critical nature to discarding PUSHIDs, though, so we can reduce intra-cluster traffic with batch processing.

      Cloud Push adds some additional complications:

      A push for a PUSHID (actually just the BROWSERID of the BROWSERID:SUBID that makes up a PUSHID) that has not been acknowledged for a timeout period should be sent via Cloud Push. (*)

      (*) Again, this is a problem because a listen at Node A may next be sent to Node B. Node A should not send a Cloud Push if the PUSHID is still active in the cluster at Node B.

      At first, we thought that Cloud Push was time critical (it would certainly demo better if this was the case), but it turns out that the 3G network conditions where you need Cloud Push are already plagued by high latency. We have adaptive timeouts, and they frequently settle near 5 seconds. In other words, where Cloud Push appears to break the autonomy of our cluster nodes (requiring every node to be aware of every listen request across the cluster), the long timeouts involved allow us to use batch processing just as with the PUSHID cleanup.
      Not only that, all we need for both cleanup and Cloud Push is the active listener list.

      Each node could broadcast a cluster request: are these PUSHIDs active? This would occur constantly, however, and the responses would be lists of active PUSHIDs.

      Instead, as each second goes by (configurable) each node will broadcast its list of listening PUSHIDs (with sequence numbers) to allow every node in the cluster to maintain the active status of each PUSHID.
      (The sequence number allows a cluster node to determine if it has the most recent listen request, hence is the "master" of that PUSHID. All PUSHIDs listened for since the last broadcast are listed with the most recent sequence number from each.)
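      A minimal sketch of that periodic broadcast and the sequence-number bookkeeping, assuming a hypothetical ClusterBus transport (a real implementation would broadcast only the PUSHIDs listened for since the last interval rather than the whole map):

      {code:java}
      import java.util.HashMap;
      import java.util.Map;
      import java.util.concurrent.Executors;
      import java.util.concurrent.ConcurrentHashMap;
      import java.util.concurrent.ScheduledExecutorService;
      import java.util.concurrent.TimeUnit;

      public class ListenerBroadcaster {
          private final Map<String, Long> localSequences   = new ConcurrentHashMap<>(); // from local listen requests
          private final Map<String, Long> clusterSequences = new ConcurrentHashMap<>(); // from cluster broadcasts
          private final ClusterBus bus;                                                 // assumed transport

          public ListenerBroadcaster(ClusterBus bus) {
              this.bus = bus;
          }

          // A local listen request carries the PUSHID and its latest sequence number.
          public void onLocalListen(String pushID, long sequence) {
              localSequences.merge(pushID, sequence, Math::max);
              clusterSequences.merge(pushID, sequence, Math::max);
          }

          // Broadcasts from peer nodes keep the cluster-wide view of each PUSHID current.
          public void onClusterBroadcast(Map<String, Long> listeningPushIDs) {
              listeningPushIDs.forEach((pushID, seq) -> clusterSequences.merge(pushID, seq, Math::max));
          }

          // This node is the "master" of a PUSHID while no peer reported a newer listen.
          public boolean isMasterOf(String pushID) {
              long local = localSequences.getOrDefault(pushID, -1L);
              return local >= clusterSequences.getOrDefault(pushID, -1L);
          }

          // Once per interval (configurable) the node broadcasts its listening PUSHIDs.
          public void start(long intervalSeconds) {
              ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
              scheduler.scheduleAtFixedRate(
                  () -> bus.broadcastListeningPushIDs(new HashMap<>(localSequences)),
                  intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
          }

          public interface ClusterBus {
              void broadcastListeningPushIDs(Map<String, Long> pushIDsWithSequences);
          }
      }
      {code}

      A node considers itself the master of a PUSHID only while no peer has reported a more recent sequence number, which is exactly the condition needed to decide who may discard it or escalate to Cloud Push.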

      The main difference with the single-node case is that the listener list is maintained entirely via local listen requests.

      One other aspect of Cloud Push is that the current push may be the one that is never acknowledged by the browser (this is actually the root of the current bug that caused us to have two different code paths). Every client has a different adaptive timeout (in the future we may want to quantize these into one second batches for scalability, but we can handle thousands with individual timers, so this is not yet necessary -- but will be necessary to hit a goal of 50,000 per node). When a push is dispatched a timer should be started for each browser. It's OK to "block" the application thread due to CPU overhead or to write to an existing network connection, but not to wait for an indeterminate network event, such as push acknowledgement.

      Incoming listen requests can cancel these timers, but if a given timer elapses completely, then the Cloud Push provider should be used if the client supports cloud push (it's not strictly necessary to cancel the timer as long as the listen status is updated for when the timer wakes up).
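      A hedged sketch of that per-browser confirmation timer, using java.util.Timer and placeholder names (CloudPushFallback and CloudPushProvider are illustrative, not the actual ICEpush classes):

      {code:java}
      import java.util.Map;
      import java.util.Timer;
      import java.util.TimerTask;
      import java.util.concurrent.ConcurrentHashMap;

      public class CloudPushFallback {
          private final Timer timer = new Timer("cloud-push-fallback", true);
          private final Map<String, TimerTask> pending = new ConcurrentHashMap<>();
          private final CloudPushProvider provider; // assumed out-of-band notification service

          public CloudPushFallback(CloudPushProvider provider) {
              this.provider = provider;
          }

          // Dispatching a push arms a timer sized by this browser's adaptive timeout.
          public void onPushDispatched(final String browserID, long adaptiveTimeoutMillis,
                                       final boolean supportsCloudPush) {
              TimerTask task = new TimerTask() {
                  public void run() {
                      pending.remove(browserID);
                      if (supportsCloudPush) {
                          provider.send(browserID); // no acknowledgement in time: go out of band
                      }
                  }
              };
              TimerTask previous = pending.put(browserID, task);
              if (previous != null) {
                  previous.cancel(); // do not stack timers for the same browser
              }
              timer.schedule(task, adaptiveTimeoutMillis);
          }

          // A listen request implicitly acknowledges the previous push.
          public void onListen(String browserID) {
              TimerTask task = pending.remove(browserID);
              if (task != null) {
                  task.cancel();
              }
          }

          public interface CloudPushProvider {
              void send(String browserID);
          }
      }
      {code}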
      Attachments

      1. PUSH-183.patch
         63 kB
         Jack Van Ooststroom

        Activity

        Jack Van Ooststroom added a comment -

        Sending icepush-ee/core/src/main/java/org/icepush/BlockingConnectionServer.java
        Sending icepush-ee/core/src/main/java/org/icepush/LocalNotificationBroadcaster.java
        Sending icepush-ee/core/src/main/java/org/icepush/LocalPushGroupManager.java
        Sending icepush-ee/core/src/main/java/org/icepush/NoopPushGroupManager.java
        Sending icepush-ee/core/src/main/java/org/icepush/NotificationBroadcaster.java
        Sending icepush-ee/core/src/main/java/org/icepush/PushGroupManager.java
        Sending icepush-ee/core/src/main/java/org/icepush/PushGroupManagerFactory.java
        Sending icepush-ee/core/src/main/java/org/icepush/servlet/MainServlet.java
        Transmitting file data ........
        Committed revision 33034.

        Sending icepush-ee/core-ee/src/main/java/com/icesoft/icepush/MainServlet.java
        Transmitting file data .
        Committed revision 33282.

        Sending icepush-ee/eps/src/main/java/com/icesoft/push/DynamicPushGroupManager.java
        Sending icepush-ee/eps/src/main/java/com/icesoft/push/RemotePushGroupManager.java
        Sending icepush-ee/eps/src/main/java/com/icesoft/push/messaging/PushMessageService.java
        Sending icepush-ee/eps/src/main/java/com/icesoft/push/servlet/ICEpushServlet.java
        Transmitting file data ....
        Committed revision 33283.

        Jack Van Ooststroom added a comment -

        Both Push and Cloud Push now use a single non-blocking code path. The following scenarios have been successfully tested:

        • Stand-Alone without EPS and without Cloud Push
        • Stand-Alone without EPS but with Cloud Push
        • Stand-Alone with EPS but without Cloud Push
        • Stand-Alone with EPS and with Cloud Push
        • Cluster with EPS but without Cloud Push (including fail-over)

        Detailed logging has been added and existing logging has been improved in certain areas. The detailed logging is now available at Level.FINE instead of Level.FINEST.

        Non-blocking is achieved through TimerTasks. For each PushID there are two different TimerTasks (a sketch follows this list):

        1. The expiryTimeout TimerTask checks if a PushID needs to be expired. The default timeout is 30 minutes (specified in milliseconds). Whenever a Push Notification for a certain PushID occurs, an expiryTimeout TimerTask is scheduled on the Timer. This TimerTask is cancelled by the following listen.icepush request containing the same PushID; otherwise the TimerTask is executed and the PushID is expired.

        2. The confirmationTimeout TimerTask checks if a Cloud Push needs to be initiated for a PushID. The timeout is a dynamic value (in milliseconds). Whenever a Push Notification for a certain PushID occurs and the associated BlockingConnectionServer has a NotifyBackURI defined, a confirmationTimeout TimerTask is scheduled on the Timer. This TimerTask is cancelled by the following listen.icepush request containing the same PushID; otherwise the TimerTask is executed and the Cloud Push is initiated for the PushID.
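        A minimal sketch of the two timeouts just described, with placeholder callbacks standing in for the real expiry and Cloud Push logic (names and structure are illustrative, not the committed code):

        {code:java}
        import java.util.Timer;
        import java.util.TimerTask;

        public class PushIDTimeouts {
            private static final long EXPIRY_TIMEOUT_MILLIS = 30L * 60L * 1000L; // default: 30 minutes

            private final Timer timer = new Timer("pushid-timeouts", true);
            private TimerTask expiryTask;
            private TimerTask confirmationTask;

            // Called when a Push Notification goes out for this PushID.
            public synchronized void onPushNotification(long confirmationTimeoutMillis,
                                                        boolean hasNotifyBackURI,
                                                        final Runnable expire,
                                                        final Runnable cloudPush) {
                if (expiryTask == null) { // avoid double scheduling
                    expiryTask = new TimerTask() {
                        public void run() { clearExpiry(); expire.run(); }
                    };
                    timer.schedule(expiryTask, EXPIRY_TIMEOUT_MILLIS);
                }
                if (hasNotifyBackURI && confirmationTask == null) {
                    confirmationTask = new TimerTask() {
                        public void run() { clearConfirmation(); cloudPush.run(); }
                    };
                    timer.schedule(confirmationTask, confirmationTimeoutMillis);
                }
            }

            // The next listen.icepush request carrying this PushID cancels both timeouts.
            public synchronized void onListen() {
                if (expiryTask != null)       { expiryTask.cancel();       expiryTask = null; }
                if (confirmationTask != null) { confirmationTask.cancel(); confirmationTask = null; }
            }

            private synchronized void clearExpiry()       { expiryTask = null; }
            private synchronized void clearConfirmation() { confirmationTask = null; }
        }
        {code}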

        Marking this one as FIXED.

        The following scenarios remain as possible TODOs:

        • Test cluster with EPS and with Cloud Push
        • Test the previous scenario with fail-over
        Jack Van Ooststroom added a comment -

        Adding icepush-ee/eps/src/main/java/com/icesoft/push/LocalPushGroupManager.java
        Transmitting file data .
        Committed revision 33284.

        Jack Van Ooststroom added a comment -

        Forgot to add the new class LocalPushGroupManager

        Jack Van Ooststroom added a comment -

        Sending icepush-ee/core/src/main/java/org/icepush/BlockingConnectionServer.java
        Sending icepush-ee/core/src/main/java/org/icepush/LocalPushGroupManager.java
        Sending icepush-ee/core/src/main/java/org/icepush/NoopPushGroupManager.java
        Adding icepush-ee/core/src/main/java/org/icepush/NotifyBackURI.java
        Sending icepush-ee/core/src/main/java/org/icepush/PushGroupManager.java
        Transmitting file data .....
        Committed revision 33053.

        Sending icepush-ee/eps/src/main/java/com/icesoft/push/DynamicPushGroupManager.java
        Sending icepush-ee/eps/src/main/java/com/icesoft/push/RemotePushGroupManager.java
        Transmitting file data ..
        Committed revision 33299.

        Jack Van Ooststroom added a comment -

        Additional flood protection logic has been added to avoid flooding the OutOfBand services with Cloud Pushes. By default, no subsequent Cloud Push for a single NotifyBackURI will be requested within 10 seconds of the previous Cloud Push request. The default can be changed by setting org.icepush.minCloudPushInterval to the desired number of milliseconds.
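        A small sketch of how such an interval check could look; reading org.icepush.minCloudPushInterval as a system property here is an assumption made for the example, and the actual configuration mechanism may differ:

        {code:java}
        import java.util.Map;
        import java.util.concurrent.ConcurrentHashMap;

        public class CloudPushThrottle {
            // Default minimum interval between Cloud Pushes for one NotifyBackURI: 10 seconds.
            // NOTE: treating the setting as a system property is an assumption for this sketch.
            private final long minIntervalMillis =
                    Long.getLong("org.icepush.minCloudPushInterval", 10000L);
            private final Map<String, Long> lastPushMillis = new ConcurrentHashMap<>();

            // Returns true if a Cloud Push may be requested for this NotifyBackURI now.
            public boolean mayPush(String notifyBackURI) {
                long now = System.currentTimeMillis();
                Long previous = lastPushMillis.get(notifyBackURI);
                if (previous != null && now - previous < minIntervalMillis) {
                    return false; // too soon after the previous Cloud Push request
                }
                lastPushMillis.put(notifyBackURI, now);
                return true;
            }
        }
        {code}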

        Jack Van Ooststroom added a comment -

        To emphasize what has been done here: we now have a single non-blocking code path for all kinds of server push. Certain parts were designed with the Clustered Cloud Push scenario in mind as well, but Clustered Cloud Push has not been tested yet. However, this major change is a good step towards achieving that goal.

        Jack Van Ooststroom added a comment -

        During some longer test runs I noticed that it is possible to stop receiving updates altogether.

        Jack Van Ooststroom added a comment -

        After some investigation it seems that it is possible for two expiryTimeout TimerTasks to get scheduled for the same PushID. As only one reference is kept, the previous expiryTimeout fires and gets executed instead of being cancelled, which causes the PushID to be removed. I need to go back and determine how this expiryTimeout strategy differs from what we had before the changes contained in this JIRA.

        Jack Van Ooststroom added a comment -

        I moved the startExpiryTimeout(...) invocation in the BlockingConnectionServer to its proper spot. This should prevent the expiryTimeout from being scheduled multiple times for the same PushID. In addition, the expiryTimeout is also scheduled upon PushID creation, which conforms to the old code. Finally, I added some additional checks to avoid double scheduling of either the confirmationTimeout or the expiryTimeout, logging a message at level FINE if start is invoked while a previous timeout is still scheduled. Marking this one as FIXED again.
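        For illustration, a guard of that shape could look like the following sketch (TimeoutGuard and its methods are placeholders; only the FINE-level logging behaviour mirrors the comment above):

        {code:java}
        import java.util.Timer;
        import java.util.TimerTask;
        import java.util.logging.Level;
        import java.util.logging.Logger;

        public class TimeoutGuard {
            private static final Logger LOGGER = Logger.getLogger(TimeoutGuard.class.getName());

            private final Timer timer = new Timer(true);
            private TimerTask scheduledExpiryTimeout;

            // A start request is ignored, and logged at FINE, while a previous
            // expiryTimeout for the same PushID is still scheduled.
            public synchronized void startExpiryTimeout(String pushID, long delayMillis, final Runnable expire) {
                if (scheduledExpiryTimeout != null) {
                    LOGGER.log(Level.FINE, "expiryTimeout already scheduled for PushID " + pushID);
                    return;
                }
                scheduledExpiryTimeout = new TimerTask() {
                    public void run() { expire.run(); }
                };
                timer.schedule(scheduledExpiryTimeout, delayMillis);
            }

            public synchronized void cancelExpiryTimeout() {
                if (scheduledExpiryTimeout != null) {
                    scheduledExpiryTimeout.cancel();
                    scheduledExpiryTimeout = null;
                }
            }
        }
        {code}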

        Jack Van Ooststroom added a comment -

        Sending core/src/main/java/org/icepush/BlockingConnectionServer.java
        Sending core/src/main/java/org/icepush/LocalPushGroupManager.java
        Transmitting file data ..
        Committed revision 33099.

        Jack Van Ooststroom added a comment -

        There seem to be a couple of issues with the TimerTasks when running in a clustered environment:

        1. The TimerTasks for a particular PushID are not always being cancelled on the other nodes when they should be.
        2. The TimerTasks for a particular PushID are not always being scheduled on the other nodes when they should be.
        3. When scheduling an expiryTimeout TimerTask for a particular PushID on the other nodes, they are unable to decide whether the PushID is considered a Cloud PushID.

        Jack Van Ooststroom added a comment -

        Sending core/src/main/java/org/icepush/BlockingConnectionServer.java
        Sending core/src/main/java/org/icepush/LocalPushGroupManager.java
        Sending core/src/main/java/org/icepush/NoopPushGroupManager.java
        Sending core/src/main/java/org/icepush/PushGroupManager.java
        Transmitting file data ....
        Committed revision 33281.

        Jack Van Ooststroom added a comment -

        Sending eps/src/main/java/com/icesoft/push/DynamicPushGroupManager.java
        Sending eps/src/main/java/com/icesoft/push/LocalPushGroupManager.java
        Sending eps/src/main/java/com/icesoft/push/RemotePushGroupManager.java
        Sending eps/src/main/java/com/icesoft/push/messaging/MessagePayload.java
        Sending eps/src/main/java/com/icesoft/push/messaging/PushMessageService.java
        Transmitting file data .....
        Committed revision 33437.

        Jack Van Ooststroom added a comment -

        I applied the following fixes:

        1. The TimerTasks for a particular PushID are not always being cancelled on the other nodes when they should be.

        This seems to be due to the sequence number being set to 0 upon PushID initialization. It is now set to -1, so all PushID instances on all nodes are initialized with a sequence number of -1. The first listen.icepush request received by a node does not have a sequence number set yet, but through the ListeningPushIDs messages it should be communicated to the other nodes that the receiving node accepted the listen.icepush. Since -1 < 0, the TimerTasks can be cancelled appropriately on the other nodes (see the sketch after this comment).

        2. The TimerTasks for a particular PushID are not always being scheduled on the other nodes when they should be.

        On the other nodes the expiryTimeout TimerTasks must be started right after cancelling the confirmationTimeout and expiryTimeout TimerTasks upon receiving the ListeningPushIDs message. Receiving a ListeningPushIDs message on the other nodes should be considered the "same" event as receiving the listen.icepush on the handling node.

        3. When scheduling an expiryTimeout TimerTask for a particular PushID on the other nodes, they are unable to decide whether the PushID is considered a Cloud PushID.

        Whenever the NotifyBackURI for the participating PushIDs changes, for instance from null to an actual URI or from an actual URI to a different URI, this must be communicated to the other nodes within the cluster. As the NotifyBackURI shouldn't change often, this shouldn't be hard on performance.

        All the discovered deficiencies should now be resolved. Marking this one as FIXED again.
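        A toy illustration of the sequence-number comparison behind fix 1, using hypothetical names: initializing to -1 ensures the very first ListeningPushIDs broadcast (sequence 0 or higher) wins the comparison and lets the other nodes cancel their pending TimerTasks.

        {code:java}
        public class SequenceNumberCheck {
            private long knownSequence = -1; // was 0 before the fix, which masked the first listen

            // Returns true if the incoming ListeningPushIDs broadcast is newer and
            // should cancel/reschedule this node's TimerTasks for the PushID.
            public synchronized boolean accept(long broadcastSequence) {
                if (broadcastSequence > knownSequence) {
                    knownSequence = broadcastSequence;
                    return true;
                }
                return false;
            }

            public static void main(String[] args) {
                SequenceNumberCheck check = new SequenceNumberCheck();
                System.out.println(check.accept(0)); // true: -1 < 0, so the first listen is honoured
                System.out.println(check.accept(0)); // false: a replayed broadcast changes nothing
            }
        }
        {code}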


          People

          • Assignee:
            Jack Van Ooststroom
          • Reporter:
            Jack Van Ooststroom
          • Votes:
            0
          • Watchers:
            1

            Dates

            • Created:
            • Updated:
            • Resolved: