ICEpush / PUSH-183

Clustered Cloud Push Out-of-Band Notification

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: EE 3.0.0
    • Fix Version/s: EE-3.2.0.GA
    • Component/s: Push Library, Push Server
    • Labels: None
    • Environment: ICEpush, EPS, clustered & non-clustered
    • Assignee Priority: P1

      Description

      ICEpush was designed to be as stateless as possible, even in the cluster case. A rough upper bound is 50,000 clients per node, since that is close to the maximum number of TCP connections a single IP address can carry (this is an astronomical number for JSF but is conceivable for ICEpush itself, so our data structures and intra-cluster traffic should keep it in mind).

      Group join/leave is broadcast to the cluster, allowing each node to maintain a list of groups and their PUSHID members.
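      As a concrete illustration of the bookkeeping this implies, each node can fold the join/leave broadcasts into a simple concurrent map. This is a minimal sketch with hypothetical names, not the actual ICEpush API, and the cluster transport is abstracted away:

      {code:java}
      import java.util.Map;
      import java.util.Set;
      import java.util.concurrent.ConcurrentHashMap;

      // Sketch of a per-node registry of group -> member PUSHIDs, kept in sync
      // by applying the join/leave broadcasts received from every cluster node.
      public class GroupRegistry {
          private final Map<String, Set<String>> members = new ConcurrentHashMap<>();

          // Invoked for local joins and for join broadcasts from peer nodes alike.
          public void onJoin(String group, String pushId) {
              members.computeIfAbsent(group, g -> ConcurrentHashMap.newKeySet())
                     .add(pushId);
          }

          public void onLeave(String group, String pushId) {
              Set<String> ids = members.get(group);
              if (ids != null) {
                  ids.remove(pushId);
                  if (ids.isEmpty()) {
                      members.remove(group, ids); // drop groups with no members left
                  }
              }
          }

          public Set<String> membersOf(String group) {
              return members.getOrDefault(group, Set.of());
          }
      }
      {code}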

      Each browser maintains a list of listening PUSHIDs and this is sent with the listen request. The browser is responsible for cleaning up PUSHIDs no longer active in any of its windows.

      A push is broadcast to the cluster with just a group name. Each node notifies the listening PUSHID members it has.
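      Continuing the sketch above (and reusing the hypothetical GroupRegistry), the fan-out on each node is just the intersection of the group's membership with that node's local listeners:

      {code:java}
      import java.util.Set;
      import java.util.concurrent.ConcurrentHashMap;

      // Sketch: a push broadcast carries only the group name; each node notifies
      // whichever member PUSHIDs are listening locally.
      public class PushFanOut {
          private final GroupRegistry registry = new GroupRegistry();
          private final Set<String> localListeners = ConcurrentHashMap.newKeySet();

          public void onListen(String pushId) {
              localListeners.add(pushId);
          }

          public void onPushBroadcast(String group) {
              for (String pushId : registry.membersOf(group)) {
                  if (localListeners.contains(pushId)) { // listening at this node?
                      notifyLocalListener(pushId);
                  }
              }
          }

          private void notifyLocalListener(String pushId) {
              // Stand-in for completing the parked (blocking) listen request
              // of a locally connected browser.
              System.out.println("notify " + pushId);
          }
      }
      {code}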

      PUSHIDs are discarded when no listen request has occurred for a timeout period. (*)

      (*) This is a problem for the cluster case: a listen at Node A may go to Node B on the next request. Node A should not discard the PUSHID just because it has not seen it -- the PUSHID is still active in the cluster at Node B. There is no time-critical nature to discarding PUSHIDs, though, so we can reduce intra-cluster traffic with batch processing.
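      Batched expiry could then look like the following sketch. It assumes a cluster-wide last-seen table; keeping that table current is exactly what the broadcast scheme described below provides. All names here are illustrative:

      {code:java}
      import java.util.Iterator;
      import java.util.Map;
      import java.util.concurrent.ConcurrentHashMap;

      // Sketch: a PUSHID is discarded only when *no* node in the cluster has
      // seen a listen for it within the timeout window, and expiry runs as a
      // periodic batch rather than per-PUSHID.
      public class PushIdReaper {
          private final Map<String, Long> lastSeen = new ConcurrentHashMap<>(); // PUSHID -> epoch millis
          private final long timeoutMillis;

          public PushIdReaper(long timeoutMillis) {
              this.timeoutMillis = timeoutMillis;
          }

          // Fed by local listen requests and by the periodic listener broadcasts
          // arriving from peer nodes.
          public void touch(String pushId, long whenMillis) {
              lastSeen.merge(pushId, whenMillis, Math::max);
          }

          // Run periodically; one pass discards every expired PUSHID in a batch.
          public void reap(long nowMillis) {
              Iterator<Map.Entry<String, Long>> it = lastSeen.entrySet().iterator();
              while (it.hasNext()) {
                  if (nowMillis - it.next().getValue() > timeoutMillis) {
                      it.remove(); // not heard from anywhere in the cluster
                  }
              }
          }
      }
      {code}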

      Cloud Push adds some additional complications:

      A push for a PUSHID (actually just for the BROWSERID of the BROWSERID:SUBID pair that makes up a PUSHID) that has not been acknowledged within a timeout period should be sent via Cloud Push. (*)

      (*) Again, this is a problem because a listen at Node A may next be sent to Node B. Node A should not send a Cloud Push if the PUSHID is still active in the cluster at Node B.

      At first, we thought that Cloud Push was time-critical (it would certainly demo better if it were), but it turns out that the 3G network conditions where you need Cloud Push are already plagued by high latency. We have adaptive timeouts, and they frequently settle near 5 seconds. In other words, where Cloud Push appears to break the autonomy of our cluster nodes (requiring every node to be aware of every listen request across the cluster), the long timeouts involved allow us to use batch processing, just as with the PUSHID cleanup. Better yet, all we need for both cleanup and Cloud Push is the active listener list.

      Each node could broadcast a cluster request: are these PUSHIDs active? This polling would occur constantly, however, and the responses would be full lists of active PUSHIDs.

      Instead, every second (the interval is configurable), each node will broadcast its list of listening PUSHIDs (with sequence numbers), allowing every node in the cluster to maintain the active status of each PUSHID.
      (The sequence number allows a cluster node to determine whether it has the most recent listen request, and hence is the "master" of that PUSHID. All PUSHIDs listened for since the last broadcast are listed, each with its most recent sequence number.)
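      A sketch of that sequence-number bookkeeping, with the transport and wire format left abstract (every name here is hypothetical):

      {code:java}
      import java.util.HashMap;
      import java.util.Map;

      // Sketch: receivers keep, per PUSHID, the highest sequence number seen and
      // the node that reported it. The node holding the freshest listen is that
      // PUSHID's "master" and is the one entitled to act on its behalf.
      public class ListenerTable {
          private static final class Entry {
              final long sequence;
              final String node;
              Entry(long sequence, String node) {
                  this.sequence = sequence;
                  this.node = node;
              }
          }

          private final Map<String, Entry> entries = new HashMap<>();

          // Apply one node's periodic broadcast: PUSHID -> its latest sequence number.
          public synchronized void onBroadcast(String fromNode, Map<String, Long> listened) {
              listened.forEach((pushId, seq) -> {
                  Entry e = entries.get(pushId);
                  if (e == null || seq > e.sequence) {
                      entries.put(pushId, new Entry(seq, fromNode));
                  }
              });
          }

          // True if localNode saw the most recent listen, i.e. is master for the PUSHID.
          public synchronized boolean isMaster(String localNode, String pushId) {
              Entry e = entries.get(pushId);
              return e != null && localNode.equals(e.node);
          }
      }
      {code}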

      The main difference with the single-node case is that the listener list is maintained entirely via local listen requests.

      One other aspect of Cloud Push is that the current push may be the one that is never acknowledged by the browser (this is actually the root of the current bug that caused us to have two different code paths). Every client has a different adaptive timeout (in the future we may want to quantize these into one-second batches for scalability, but we can handle thousands with individual timers, so this is not yet necessary -- it will be necessary, though, to hit a goal of 50,000 per node). When a push is dispatched, a timer should be started for each browser. It's OK to "block" the application thread due to CPU overhead or to write to an existing network connection, but not to wait for an indeterminate network event, such as a push acknowledgement.

      Incoming listen requests can cancel these timers, but if a given timer elapses completely, then the Cloud Push provider should be used, provided the client supports Cloud Push (it's not strictly necessary to cancel the timer, as long as the listen status is updated for when the timer wakes up).
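      A sketch of that per-browser confirmation timer, using a ScheduledExecutorService so the dispatching thread never parks waiting for the acknowledgement. The CloudPushProvider interface and the BROWSERID extraction helper are assumptions for illustration, not the actual ICEpush API:

      {code:java}
      import java.util.Map;
      import java.util.concurrent.*;

      // Sketch: when a push is dispatched, arm a timer sized by that browser's
      // adaptive timeout; a fresh listen request cancels it, and if it fires,
      // the notification is handed to the Cloud Push provider instead.
      public class CloudPushFallback {
          public interface CloudPushProvider { // assumed SPI, not the real one
              void send(String browserId);
          }

          private final ScheduledExecutorService timers = Executors.newScheduledThreadPool(1);
          private final Map<String, ScheduledFuture<?>> pending = new ConcurrentHashMap<>();
          private final CloudPushProvider provider;

          public CloudPushFallback(CloudPushProvider provider) {
              this.provider = provider;
          }

          // A PUSHID is BROWSERID:SUBID; acknowledgement is tracked per browser.
          static String browserIdOf(String pushId) {
              return pushId.substring(0, pushId.indexOf(':'));
          }

          public void onPushDispatched(String browserId, long adaptiveTimeoutMillis,
                                       boolean supportsCloudPush) {
              ScheduledFuture<?> timer = timers.schedule(() -> {
                  pending.remove(browserId);
                  if (supportsCloudPush) {
                      provider.send(browserId); // never acknowledged in time
                  }
              }, adaptiveTimeoutMillis, TimeUnit.MILLISECONDS);
              ScheduledFuture<?> previous = pending.put(browserId, timer);
              if (previous != null) {
                  previous.cancel(false);
              }
          }

          // An incoming listen proves the browser is alive; cancel the fallback.
          public void onListen(String pushId) {
              ScheduledFuture<?> timer = pending.remove(browserIdOf(pushId));
              if (timer != null) {
                  timer.cancel(false);
              }
          }
      }
      {code}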
        Attachments

        • PUSH-183.patch (63 kB, Jack Van Ooststroom)

        Activity

        Jack Van Ooststroom created issue -
        Jack Van Ooststroom made changes -
          Attachment: PUSH-183.patch [ 14789 ]
        Ken Fyten made changes -
          Salesforce Case: []
          Fix Version/s: EE 3.2.0 [ 10323 ]
        Ken Fyten made changes -
          Description: [edited; current text is shown in the Description section above]
        Ken Fyten made changes -
          Assignee: Jack Van Ooststroom [ jack.van.ooststroom ]
        Ken Fyten made changes -
          Assignee Priority: P1 [ 10010 ]
        Jack Van Ooststroom made changes -
          Status: Open [ 1 ] → Resolved [ 5 ]
          Resolution: Fixed [ 1 ]
        Jack Van Ooststroom made changes -
          Resolution: Fixed [ 1 ] → (cleared)
          Status: Resolved [ 5 ] → Reopened [ 4 ]
        Jack Van Ooststroom made changes -
          Status: Reopened [ 4 ] → In Progress [ 3 ]
        Jack Van Ooststroom made changes -
          Status: In Progress [ 3 ] → Resolved [ 5 ]
          Resolution: Fixed [ 1 ]
        Jack Van Ooststroom made changes -
          Resolution: Fixed [ 1 ] → (cleared)
          Status: Resolved [ 5 ] → Reopened [ 4 ]
        Jack Van Ooststroom made changes -
          Status: Reopened [ 4 ] → In Progress [ 3 ]
        Jack Van Ooststroom made changes -
          Status: In Progress [ 3 ] → Resolved [ 5 ]
          Resolution: Fixed [ 1 ]
        Ken Fyten made changes -
          Status: Resolved [ 5 ] → Closed [ 6 ]

          People

          • Assignee: Jack Van Ooststroom
          • Reporter: Jack Van Ooststroom
          • Votes: 0
          • Watchers: 1
