ICEpush / PUSH-183

Clustered Cloud Push Out-of-Band Notification

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: EE 3.0.0
    • Fix Version/s: EE-3.2.0.GA
    • Component/s: Push Library, Push Server
    • Labels:
      None
    • Environment:
      ICEpush, EPS, clustered & non-clustered
    • Assignee Priority:
      P1

      Description

      ICEpush was designed to be as stateless as possible, even in the cluster case. A rough upper bound should be 50,000 clients per node since that is the maximum number of TCP connections per IP address (this is an astronomical number for JSF but is conceivable for ICEpush itself, so our data structures and intra-cluster traffic should keep this in mind).

      Group join/leave is broadcast to the cluster allowing each node to maintain a list of groups and PUSHID members.

      Each browser maintains a list of listening PUSHIDs and this is sent with the listen request. The browser is responsible for cleaning up PUSHIDs no longer active in any of its windows.

      A push is broadcast to the cluster with just a group name. Each node notifies the listening PUSHID members it has.
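
      As an illustration of this fan-out, here is a minimal sketch; the class and method names are hypothetical, not the actual org.icepush API:

          // Hypothetical sketch: each node builds a group -> PUSHID map from the
          // join/leave broadcasts, and a push broadcast carries only the group
          // name; the node then notifies just the listeners it holds locally.
          import java.util.Map;
          import java.util.Set;
          import java.util.concurrent.ConcurrentHashMap;

          public class GroupFanOut {

              // group name -> member PUSHIDs, maintained from cluster broadcasts
              private final Map<String, Set<String>> members = new ConcurrentHashMap<>();

              public void onJoin(String group, String pushID) {
                  members.computeIfAbsent(group, g -> ConcurrentHashMap.newKeySet()).add(pushID);
              }

              public void onLeave(String group, String pushID) {
                  Set<String> set = members.get(group);
                  if (set != null) {
                      set.remove(pushID);
                  }
              }

              // Push broadcast: just a group name; notify the members this node holds.
              public void onPush(String group) {
                  for (String pushID : members.getOrDefault(group, Set.of())) {
                      notifyLocalListener(pushID);
                  }
              }

              private void notifyLocalListener(String pushID) {
                  // respond to this PUSHID's pending listen request, if it is parked here
              }
          }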

      PUSHIDs are discarded when no listen request has occurred for a timeout period. (*)

      (*) This is a problem for the cluster case: a listen at Node A may go to Node B on the next request. Node A should not discard the PUSHID just because it has not seen it -- the PUSHID is still
      active in the cluster at Node B. There is no time-critical nature to discarding PUSHIDs, though, so we can reduce intra-cluster traffic with batch processing.
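
      As a sketch of that batching (names illustrative, not the actual org.icepush types), each node can track the last listen activity it knows about, from local requests or cluster broadcasts, and discard PUSHIDs in a periodic sweep rather than per request:

          // Hypothetical sketch: batched PUSHID expiry based on last-seen times.
          import java.util.Iterator;
          import java.util.Map;
          import java.util.concurrent.ConcurrentHashMap;

          public class PushIDExpiry {

              private static final long EXPIRY_TIMEOUT_MS = 60 * 1000; // illustrative value

              // PUSHID -> last time a listen was seen, locally or via the cluster
              private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

              public void touch(String pushID) {
                  lastSeen.put(pushID, System.currentTimeMillis());
              }

              // Periodic batch sweep; no per-listen intra-cluster traffic required.
              public void sweep() {
                  long cutoff = System.currentTimeMillis() - EXPIRY_TIMEOUT_MS;
                  for (Iterator<Long> i = lastSeen.values().iterator(); i.hasNext(); ) {
                      if (i.next() < cutoff) {
                          i.remove(); // inactive cluster-wide: discard the PUSHID
                      }
                  }
              }
          }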

      Cloud Push adds some additional complications:

      A push for a PUSHID (actually just the BROWSERID of the BROWSERID:SUBID that makes up a PUSHID) that has not been acknowledged for a timeout period should be sent via Cloud Push. (*)

      (*) Again, this is a problem because a listen at Node A may next be sent to Node B. Node A should not send a Cloud Push if the PUSHID is still active in the cluster at Node B.

      At first, we thought that Cloud Push was time critical (it would certainly demo better if that were the case), but it turns out that the 3G network conditions where you need Cloud Push are already plagued by high latency. We have adaptive timeouts, and they frequently settle near 5 seconds. In other words, although Cloud Push appears to break the autonomy of our cluster nodes (requiring every node to be aware of every listen request across the cluster), the long timeouts involved allow us to use batch processing just as with the PUSHID cleanup.
      Not only that, all we need for both cleanup and Cloud Push is the active listener list.

      Each node could broadcast a cluster request: are these PUSHIDs active? This would occur constantly, however, and the responses would be lists of active PUSHIDs.

      Instead, every second (configurable), each node will broadcast its list of listening PUSHIDs (with sequence numbers), allowing every node in the cluster to maintain the active status of each PUSHID.
      (The sequence number allows a cluster node to determine whether it has the most recent listen request, and hence is the "master" of that PUSHID. All PUSHIDs listened for since the last broadcast are listed with the most recent sequence number from each.)
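
      For illustration, a sketch of the bookkeeping this broadcast implies; names are hypothetical, not the actual org.icepush API:

          // Hypothetical sketch: track the highest sequence number seen per PUSHID,
          // both cluster-wide and via local listen requests, so a node can tell
          // whether it is the "master" (saw the most recent listen) of a PUSHID.
          import java.util.Map;
          import java.util.concurrent.ConcurrentHashMap;

          public class ListenerRegistry {

              private final Map<String, Long> clusterSequence = new ConcurrentHashMap<>();
              private final Map<String, Long> localSequence = new ConcurrentHashMap<>();

              // Apply one entry of another node's periodic ListeningPushIDs broadcast.
              public void onBroadcastEntry(String pushID, long sequenceNumber) {
                  clusterSequence.merge(pushID, sequenceNumber, Math::max);
              }

              // Record a listen request handled by this node.
              public void onLocalListen(String pushID, long sequenceNumber) {
                  localSequence.merge(pushID, sequenceNumber, Math::max);
                  clusterSequence.merge(pushID, sequenceNumber, Math::max);
              }

              // Master of a PUSHID = this node saw its most recent listen request.
              public boolean isMaster(String pushID) {
                  long local = localSequence.getOrDefault(pushID, -1L);
                  return local >= 0 && local >= clusterSequence.getOrDefault(pushID, -1L);
              }
          }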

      The main difference with the single-node case is that the listener list is maintained entirely via local listen requests.

      One other aspect of Cloud Push is that the current push may be the one that is never acknowledged by the browser (this is actually the root of the current bug that caused us to have two different code paths). Every client has a different adaptive timeout (in the future we may want to quantize these into one-second batches for scalability, but we can handle thousands with individual timers, so this is not yet necessary -- it will be, though, to hit a goal of 50,000 per node). When a push is dispatched, a timer should be started for each browser. It's OK to "block" the application thread due to CPU overhead or to write to an existing network connection, but not to wait for an indeterminate network event, such as push acknowledgement.

      Incoming listen requests can cancel these timers, but if a given timer elapses completely, then the Cloud Push provider should be used, provided the client supports Cloud Push (it's not strictly necessary to cancel the timer as long as the listen status is updated for when the timer wakes up).
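
      A minimal sketch of this timer life cycle, assuming one java.util.Timer and an illustrative Cloud Push hand-off (names are hypothetical):

          // Hypothetical sketch: start a confirmation TimerTask per browser when a
          // push is dispatched; a subsequent listen request cancels it, and an
          // elapsed task falls back to the Cloud Push provider.
          import java.util.Map;
          import java.util.Timer;
          import java.util.TimerTask;
          import java.util.concurrent.ConcurrentHashMap;

          public class ConfirmationTimers {

              private final Timer timer = new Timer(true); // daemon timer thread
              private final Map<String, TimerTask> pending = new ConcurrentHashMap<>();

              // Called when a push is dispatched; the timeout is the browser's adaptive one.
              public void onPushDispatched(final String browserID, long adaptiveTimeoutMillis) {
                  TimerTask task = new TimerTask() {
                      public void run() {
                          pending.remove(browserID);
                          sendCloudPush(browserID); // no acknowledgement arrived in time
                      }
                  };
                  TimerTask previous = pending.put(browserID, task);
                  if (previous != null) {
                      previous.cancel(); // supersede an older confirmation timer
                  }
                  timer.schedule(task, adaptiveTimeoutMillis);
              }

              // Called when the browser's next listen request arrives (acknowledgement).
              public void onListen(String browserID) {
                  TimerTask task = pending.remove(browserID);
                  if (task != null) {
                      task.cancel();
                  }
              }

              private void sendCloudPush(String browserID) {
                  // hand off to the Cloud Push provider, if the browser registered one
              }
          }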
      Attachments

      1. PUSH-183.patch
        63 kB
        Jack Van Ooststroom

        Activity

        Jack Van Ooststroom added a comment -

        Sending core/src/main/java/org/icepush/BlockingConnectionServer.java
        Sending core/src/main/java/org/icepush/LocalPushGroupManager.java
        Transmitting file data ..
        Committed revision 33099.

        Jack Van Ooststroom added a comment -

        There seem to be a couple of issues with the TimerTasks when running in a clustered environment:

        1. The TimerTasks for a particular PushID are not always being cancelled on the other nodes when they should be.
        2. The TimerTasks for a particular PushID are not always being scheduled on the other nodes when they should be.
        3. When scheduling an expiryTimeout TimerTask for a particular PushID on the other nodes, the node is unable to decide whether the PushID is considered a Cloud PushID.

        Jack Van Ooststroom added a comment -

        Sending core/src/main/java/org/icepush/BlockingConnectionServer.java
        Sending core/src/main/java/org/icepush/LocalPushGroupManager.java
        Sending core/src/main/java/org/icepush/NoopPushGroupManager.java
        Sending core/src/main/java/org/icepush/PushGroupManager.java
        Transmitting file data ....
        Committed revision 33281.

        Jack Van Ooststroom added a comment -

        Sending eps/src/main/java/com/icesoft/push/DynamicPushGroupManager.java
        Sending eps/src/main/java/com/icesoft/push/LocalPushGroupManager.java
        Sending eps/src/main/java/com/icesoft/push/RemotePushGroupManager.java
        Sending eps/src/main/java/com/icesoft/push/messaging/MessagePayload.java
        Sending eps/src/main/java/com/icesoft/push/messaging/PushMessageService.java
        Transmitting file data .....
        Committed revision 33437.

        Jack Van Ooststroom added a comment -

        I applied the following fixes:

        1. The TimerTasks for a particular PushID are not always being cancelled on the other nodes when they should be.

        This seems to have been due to the sequence number being set to 0 upon PushID initialization. It is now set to -1, so all PushID instances on all nodes are initialized with a sequence number of -1. The first listen.icepush request received by a node does not have a sequence number set yet, but the ListeningPushIDs message communicates to the other nodes that the receiving node accepted the listen.icepush. Since -1 < 0, the TimerTasks can now be cancelled appropriately on the other nodes (see the sketch at the end of this comment).

        2. The TimerTasks for a particular PushID are not always being scheduled on the other nodes when they should be.

        On the other nodes, the expiryTimeout TimerTasks must be started right after cancelling the confirmationTimeout and expiryTimeout TimerTasks upon receiving the ListeningPushIDs message. Receiving a ListeningPushIDs message on the other nodes should be considered the "same" event as receiving the listen.icepush on the handling node.

        3. When scheduling an expiryTimeout TimerTask for a particular PushID on the other nodes, the node is unable to decide whether the PushID is considered a Cloud PushID.

        Whenever the NotifyBackURI for the participating PushIDs changes, for instance from null to an actual URI or from one URI to a different one, this must be communicated to the other nodes within the cluster. As the NotifyBackURI shouldn't change often, this should not noticeably affect performance.

        All the discovered deficiencies should now be resolved. Marking this one as FIXED again.
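
        For illustration, a minimal sketch of the sequence-number handling that fixes 1 and 2 describe; the class and method names here are hypothetical, not the actual org.icepush API:

            // Hypothetical sketch: the stored sequence number starts at -1 so the
            // first ListeningPushIDs broadcast (carrying sequence number 0) compares
            // as strictly newer, triggering the cancel/reschedule of the TimerTasks
            // exactly as a local listen.icepush would.
            public class PushIDState {

                private long sequenceNumber = -1; // previously 0, which masked the first broadcast

                // Apply one ListeningPushIDs entry received from another node.
                public synchronized void onListeningPushIDs(long incomingSequenceNumber) {
                    if (incomingSequenceNumber > sequenceNumber) { // -1 < 0 on the first message
                        sequenceNumber = incomingSequenceNumber;
                        cancelConfirmationTimeout();
                        cancelExpiryTimeout();
                        startExpiryTimeout(); // fix 2: treat this as the "same" event as a listen
                    }
                }

                private void cancelConfirmationTimeout() { /* cancel the pending TimerTask */ }
                private void cancelExpiryTimeout()       { /* cancel the pending TimerTask */ }
                private void startExpiryTimeout()        { /* schedule a fresh expiry TimerTask */ }
            }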


          People

          • Assignee:
            Jack Van Ooststroom
          • Reporter:
            Jack Van Ooststroom
          • Votes:
            0
          • Watchers:
            1
