ICEpush / PUSH-183

Clustered Cloud Push Out Of Bound Notification

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: EE 3.0.0
    • Fix Version/s: EE-3.2.0.GA
    • Component/s: Push Library, Push Server
    • Labels:
      None
    • Environment:
      ICEpush, EPS, clustered & non-clustered
    • Assignee Priority:
      P1

      Description

      ICEpush was designed to be as stateless as possible, even in the cluster case. A rough upper bound should be 50,000 clients per node since that is the maximum number of TCP connections per IP address (this is an astronomical number for JSF but is conceivable for ICEpush itself, so our data structures and intra-cluster traffic should keep this in mind).

      Group join/leave is broadcast to the cluster allowing each node to maintain a list of groups and PUSHID members.
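
      A minimal sketch of the per-node bookkeeping such a broadcast implies (the class and method names here are illustrative, not the actual ICEpush API):

      import java.util.Collections;
      import java.util.Map;
      import java.util.Set;
      import java.util.concurrent.ConcurrentHashMap;

      // Sketch only: each node applies join/leave broadcasts to a local
      // map of group name -> member PUSHIDs.
      public class GroupMembershipTable {
          private final Map<String, Set<String>> membersByGroup = new ConcurrentHashMap<>();

          public void onJoin(String group, String pushId) {
              membersByGroup
                  .computeIfAbsent(group, g -> ConcurrentHashMap.newKeySet())
                  .add(pushId);
          }

          public void onLeave(String group, String pushId) {
              Set<String> members = membersByGroup.get(group);
              if (members != null) {
                  members.remove(pushId);
                  if (members.isEmpty()) {
                      membersByGroup.remove(group, members);
                  }
              }
          }

          public Set<String> membersOf(String group) {
              return membersByGroup.getOrDefault(group, Collections.emptySet());
          }
      }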

      Each browser maintains a list of listening PUSHIDs and this is sent with the listen request. The browser is responsible for cleaning up PUSHIDs no longer active in any of its windows.

      A push is broadcast to the cluster with just a group name. Each node notifies the listening PUSHID members it has.
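
      In code terms, handling a cluster push might look roughly like this (building on the GroupMembershipTable sketch above; localListeners and the notify step are placeholders for the real mechanisms):

      import java.util.Set;

      // Sketch: a node reacts to a cluster-wide push carrying only a group
      // name by notifying only the listening members it knows about locally.
      public class PushDispatcher {
          private final GroupMembershipTable groups;    // from the sketch above
          private final Set<String> localListeners;     // PUSHIDs with a pending listen on this node

          public PushDispatcher(GroupMembershipTable groups, Set<String> localListeners) {
              this.groups = groups;
              this.localListeners = localListeners;
          }

          public void onClusterPush(String group) {
              for (String pushId : groups.membersOf(group)) {
                  if (localListeners.contains(pushId)) {
                      notify(pushId);
                  }
              }
          }

          private void notify(String pushId) {
              // In ICEpush this would complete the blocking connection held
              // by the browser behind this PUSHID; logging stands in here.
              System.out.println("notify " + pushId);
          }
      }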

      PUSHIDs are discarded when no listen request has occurred for a timeout period. (*)

      (*) This is a problem for the cluster case: a listen at Node A may go to Node B on the next request. Node A should not discard the PUSHID just because it has not seen it -- the PUSHID is still
      active in the cluster at Node B. There is no time-critical nature to discarding PUSHIDs, though, so we can reduce intra-cluster traffic with batch processing.
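
      As a hedged sketch (assuming a per-PUSHID last-seen timestamp, refreshed both by local listen requests and by the periodic cluster broadcasts described below), batched discard might look like this:

      import java.util.Map;
      import java.util.concurrent.ConcurrentHashMap;

      // Sketch: batch expiry sweep. lastSeen is refreshed by local listen
      // requests and by activity reported from other cluster nodes, so a
      // PUSHID that is alive anywhere in the cluster is never discarded here.
      public class PushIdExpiry {
          private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();
          private final long timeoutMillis;

          public PushIdExpiry(long timeoutMillis) {
              this.timeoutMillis = timeoutMillis;
          }

          public void touch(String pushId) {
              lastSeen.put(pushId, System.currentTimeMillis());
          }

          // Run periodically (e.g. once per second) instead of per request.
          public void sweep() {
              long cutoff = System.currentTimeMillis() - timeoutMillis;
              lastSeen.entrySet().removeIf(e -> e.getValue() < cutoff);
          }
      }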

      Cloud Push adds some additional complications:

      A push for a PUSHID (actually just the BROWSERID half of the BROWSERID:SUBID pair that makes up a PUSHID) that has not been acknowledged within a timeout period should be sent via Cloud Push. (*)

      (*) Again, this is a problem because a listen at Node A may next be sent to Node B. Node A should not send a Cloud Push if the PUSHID is still active in the cluster at Node B.

      At first, we thought that Cloud Push was time critical (it would certainly demo better if that were the case), but it turns out that the 3G network conditions where you need Cloud Push are already plagued by high latency. We have adaptive timeouts, and they frequently settle near 5 seconds. In other words, where Cloud Push appears to break the autonomy of our cluster nodes (requiring every node to be aware of every listen request across the cluster), the long timeouts involved allow us to use batch processing just as with the PUSHID cleanup.
      Not only that: all we need for both cleanup and Cloud Push is the active listener list.

      Each node could broadcast a cluster request asking: are these PUSHIDs active? However, such requests would occur constantly, and the responses would themselves be lists of active PUSHIDs.

      Instead, every second (configurable) each node broadcasts its list of listening PUSHIDs (with sequence numbers), allowing every node in the cluster to maintain the active status of each PUSHID.
      (The sequence number allows a cluster node to determine whether it has the most recent listen request, and hence is the "master" of that PUSHID. All PUSHIDs listened for since the last broadcast are listed with the most recent sequence number from each.)
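
      A rough sketch of that bookkeeping (all names illustrative): each node records the highest sequence number it has seen per PUSHID and treats itself as the master only while the most recent listen arrived locally.

      import java.util.Map;
      import java.util.concurrent.ConcurrentHashMap;

      // Sketch: per-PUSHID sequence tracking. A node is "master" of a PUSHID
      // if the highest sequence number it knows of came from a local listen.
      public class ListenerSequenceTable {
          private static final class Entry {
              final long sequence;
              final boolean local;
              Entry(long sequence, boolean local) { this.sequence = sequence; this.local = local; }
          }

          private final Map<String, Entry> entries = new ConcurrentHashMap<>();

          // Called for each local listen request.
          public void onLocalListen(String pushId, long sequence) {
              record(pushId, sequence, true);
          }

          // Called when another node's periodic broadcast is received.
          public void onBroadcast(String pushId, long sequence) {
              record(pushId, sequence, false);
          }

          public boolean isMaster(String pushId) {
              Entry e = entries.get(pushId);
              return e != null && e.local;
          }

          private void record(String pushId, long sequence, boolean local) {
              entries.merge(pushId, new Entry(sequence, local),
                  (old, neu) -> neu.sequence >= old.sequence ? neu : old);
          }
      }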

      The main difference in the single-node case is that the listener list is maintained entirely via local listen requests.

      One other aspect of Cloud Push is that the current push may be the one that is never acknowledged by the browser (this is actually the root of the current bug that caused us to have two different code paths). Every client has a different adaptive timeout (in the future we may want to quantize these into one-second batches for scalability, but we can handle thousands with individual timers, so this is not yet necessary -- it will be, though, to hit a goal of 50,000 per node). When a push is dispatched, a timer should be started for each browser. It's OK to "block" the application thread due to CPU overhead or to write to an existing network connection, but not to wait for an indeterminate network event, such as a push acknowledgement.

      Incoming listen requests can cancel these timers, but if a given timer elapses completely, then the Cloud Push provider should be used if the client supports Cloud Push (it is not strictly necessary to cancel the timer as long as the listen status is updated for when the timer wakes up).
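
      A minimal sketch of this timer scheme using a shared scheduler; CloudPushProvider is a hypothetical interface standing in for the real Cloud Push provider, and the cancel-on-reschedule check mirrors the "avoid firing multiple similar timeouts for the same PushID" fix noted in the commit log below:

      import java.util.Map;
      import java.util.concurrent.ConcurrentHashMap;
      import java.util.concurrent.Executors;
      import java.util.concurrent.ScheduledExecutorService;
      import java.util.concurrent.ScheduledFuture;
      import java.util.concurrent.TimeUnit;

      // Sketch: one timer per browser per dispatched push. A listen request
      // cancels the timer; if the timer fires first, Cloud Push is used.
      public class CloudPushFallback {
          public interface CloudPushProvider {          // hypothetical interface
              void send(String browserId);
          }

          private final ScheduledExecutorService scheduler =
              Executors.newSingleThreadScheduledExecutor();
          private final Map<String, ScheduledFuture<?>> pending = new ConcurrentHashMap<>();
          private final CloudPushProvider provider;

          public CloudPushFallback(CloudPushProvider provider) {
              this.provider = provider;
          }

          // Called when a push is dispatched; timeoutMillis is this browser's
          // adaptive timeout, so every browser may use a different value.
          public void onPushDispatched(String browserId, long timeoutMillis) {
              ScheduledFuture<?> timer = scheduler.schedule(
                  () -> { pending.remove(browserId); provider.send(browserId); },
                  timeoutMillis, TimeUnit.MILLISECONDS);
              ScheduledFuture<?> previous = pending.put(browserId, timer);
              if (previous != null) {
                  previous.cancel(false);  // avoid multiple timeouts for one browser
              }
          }

          // Called when a listen request (i.e. an acknowledgement) arrives.
          public void onListen(String browserId) {
              ScheduledFuture<?> timer = pending.remove(browserId);
              if (timer != null) {
                  timer.cancel(false);
              }
          }
      }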
      Attachments

      1. PUSH-183.patch (63 kB) by Jack Van Ooststroom

        Activity

        Repository: ICEsoft Public SVN Repository
        Revision: #33281
        Date: Thu Jan 31 04:40:23 MST 2013
        User: jack.van.ooststroom
        Message: Fixed JIRA PUSH-183 : Clustered Cloud Push Out Of Bound Notification
        Files Changed:
        MODIFY /icepush/trunk/icepush/core/src/main/java/org/icepush/PushGroupManager.java
        MODIFY /icepush/trunk/icepush/core/src/main/java/org/icepush/LocalPushGroupManager.java
        MODIFY /icepush/trunk/icepush/core/src/main/java/org/icepush/NoopPushGroupManager.java
        MODIFY /icepush/trunk/icepush/core/src/main/java/org/icepush/BlockingConnectionServer.java

        Repository: ICEsoft Public SVN Repository
        Revision: #33099
        Date: Thu Jan 17 13:18:53 MST 2013
        User: jack.van.ooststroom
        Message: Fixed JIRA PUSH-183 : Clustered Cloud Push Out Of Bound Notification; Moved the startExpiryTimeout invocation to the right spot and added checks to avoid firing multiple similar timeouts for the same PushID
        Files Changed:
        MODIFY /icepush/trunk/icepush/core/src/main/java/org/icepush/LocalPushGroupManager.java
        MODIFY /icepush/trunk/icepush/core/src/main/java/org/icepush/BlockingConnectionServer.java

          People

          • Assignee:
            Jack Van Ooststroom
          • Reporter:
            Jack Van Ooststroom
          • Votes:
            0
          • Watchers:
            1

            Dates

            • Created:
            • Updated:
            • Resolved: