ICEpush / PUSH-183

Clustered Cloud Push Out-of-Band Notification

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: EE 3.0.0
    • Fix Version/s: EE-3.2.0.GA
    • Component/s: Push Library, Push Server
    • Labels: None
    • Environment: ICEpush, EPS, clustered & non-clustered
    • Assignee Priority: P1

      Description

      ICEpush was designed to be as stateless as possible, even in the cluster case. A rough upper bound is 50,000 clients per node, since that is close to the maximum number of TCP connections a single IP address can carry (this is an astronomical number for JSF but is conceivable for ICEpush itself, so our data structures and intra-cluster traffic should keep it in mind).

      Group join/leave is broadcast to the cluster, allowing each node to maintain a list of groups and their PUSHID members.
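      As a concrete illustration of the bookkeeping this implies, each node can fold the join/leave broadcasts into a simple concurrent map. This is a minimal sketch with hypothetical names, not the actual ICEpush API, and the cluster transport is abstracted away:

      {code:java}
      import java.util.Map;
      import java.util.Set;
      import java.util.concurrent.ConcurrentHashMap;

      // Sketch of a per-node registry of group -> member PUSHIDs, kept in sync
      // by applying the join/leave broadcasts received from every cluster node.
      public class GroupRegistry {
          private final Map<String, Set<String>> members = new ConcurrentHashMap<>();

          // Invoked for local joins and for join broadcasts from peer nodes alike.
          public void onJoin(String group, String pushId) {
              members.computeIfAbsent(group, g -> ConcurrentHashMap.newKeySet())
                     .add(pushId);
          }

          public void onLeave(String group, String pushId) {
              Set<String> ids = members.get(group);
              if (ids != null) {
                  ids.remove(pushId);
                  if (ids.isEmpty()) {
                      members.remove(group, ids); // drop groups with no members left
                  }
              }
          }

          public Set<String> membersOf(String group) {
              return members.getOrDefault(group, Set.of());
          }
      }
      {code}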

      Each browser maintains a list of listening PUSHIDs and this is sent with the listen request. The browser is responsible for cleaning up PUSHIDs no longer active in any of its windows.

      A push is broadcast to the cluster with just a group name. Each node notifies the listening PUSHID members it has.
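      Continuing the sketch above (and reusing the hypothetical GroupRegistry), the fan-out on each node is just the intersection of the group's membership with that node's local listeners:

      {code:java}
      import java.util.Set;
      import java.util.concurrent.ConcurrentHashMap;

      // Sketch: a push broadcast carries only the group name; each node notifies
      // whichever member PUSHIDs are listening locally.
      public class PushFanOut {
          private final GroupRegistry registry = new GroupRegistry();
          private final Set<String> localListeners = ConcurrentHashMap.newKeySet();

          public void onListen(String pushId) {
              localListeners.add(pushId);
          }

          public void onPushBroadcast(String group) {
              for (String pushId : registry.membersOf(group)) {
                  if (localListeners.contains(pushId)) { // listening at this node?
                      notifyLocalListener(pushId);
                  }
              }
          }

          private void notifyLocalListener(String pushId) {
              // Stand-in for completing the parked (blocking) listen request
              // of a locally connected browser.
              System.out.println("notify " + pushId);
          }
      }
      {code}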

      PUSHIDs are discarded when no listen request has occurred for a timeout period. (*)

      (*) This is a problem for the cluster case: a listen at Node A may go to Node B on the next request. Node A should not discard the PUSHID just because it has not seen it -- the PUSHID is still active in the cluster at Node B. There is no time-critical nature to discarding PUSHIDs, though, so we can reduce intra-cluster traffic with batch processing.
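      Batched expiry could then look like the following sketch. It assumes a cluster-wide last-seen table; keeping that table current is exactly what the broadcast scheme described below provides. All names here are illustrative:

      {code:java}
      import java.util.Iterator;
      import java.util.Map;
      import java.util.concurrent.ConcurrentHashMap;

      // Sketch: a PUSHID is discarded only when *no* node in the cluster has
      // seen a listen for it within the timeout window, and expiry runs as a
      // periodic batch rather than per-PUSHID.
      public class PushIdReaper {
          private final Map<String, Long> lastSeen = new ConcurrentHashMap<>(); // PUSHID -> epoch millis
          private final long timeoutMillis;

          public PushIdReaper(long timeoutMillis) {
              this.timeoutMillis = timeoutMillis;
          }

          // Fed by local listen requests and by the periodic listener broadcasts
          // arriving from peer nodes.
          public void touch(String pushId, long whenMillis) {
              lastSeen.merge(pushId, whenMillis, Math::max);
          }

          // Run periodically; one pass discards every expired PUSHID in a batch.
          public void reap(long nowMillis) {
              Iterator<Map.Entry<String, Long>> it = lastSeen.entrySet().iterator();
              while (it.hasNext()) {
                  if (nowMillis - it.next().getValue() > timeoutMillis) {
                      it.remove(); // not heard from anywhere in the cluster
                  }
              }
          }
      }
      {code}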

      Cloud Push adds some additional complications:

      A push for a PUSHID (actually just for the BROWSERID of the BROWSERID:SUBID pair that makes up a PUSHID) that has not been acknowledged within a timeout period should be sent via Cloud Push. (*)

      (*) Again, this is a problem because a listen at Node A may next be sent to Node B. Node A should not send a Cloud Push if the PUSHID is still active in the cluster at Node B.

      At first, we thought that Cloud Push was time-critical (it would certainly demo better if it were), but it turns out that the 3G network conditions where you need Cloud Push are already plagued by high latency. We have adaptive timeouts, and they frequently settle near 5 seconds. In other words, where Cloud Push appears to break the autonomy of our cluster nodes (requiring every node to be aware of every listen request across the cluster), the long timeouts involved allow us to use batch processing, just as with the PUSHID cleanup. Better yet, all we need for both cleanup and Cloud Push is the active listener list.

      Each node could broadcast a cluster request: are these PUSHIDs active? This polling would occur constantly, however, and the responses would be full lists of active PUSHIDs.

      Instead, every second (the interval is configurable), each node will broadcast its list of listening PUSHIDs (with sequence numbers), allowing every node in the cluster to maintain the active status of each PUSHID.
      (The sequence number allows a cluster node to determine whether it has the most recent listen request, and hence is the "master" of that PUSHID. All PUSHIDs listened for since the last broadcast are listed, each with its most recent sequence number.)
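      A sketch of that sequence-number bookkeeping, with the transport and wire format left abstract (every name here is hypothetical):

      {code:java}
      import java.util.HashMap;
      import java.util.Map;

      // Sketch: receivers keep, per PUSHID, the highest sequence number seen and
      // the node that reported it. The node holding the freshest listen is that
      // PUSHID's "master" and is the one entitled to act on its behalf.
      public class ListenerTable {
          private static final class Entry {
              final long sequence;
              final String node;
              Entry(long sequence, String node) {
                  this.sequence = sequence;
                  this.node = node;
              }
          }

          private final Map<String, Entry> entries = new HashMap<>();

          // Apply one node's periodic broadcast: PUSHID -> its latest sequence number.
          public synchronized void onBroadcast(String fromNode, Map<String, Long> listened) {
              listened.forEach((pushId, seq) -> {
                  Entry e = entries.get(pushId);
                  if (e == null || seq > e.sequence) {
                      entries.put(pushId, new Entry(seq, fromNode));
                  }
              });
          }

          // True if localNode saw the most recent listen, i.e. is master for the PUSHID.
          public synchronized boolean isMaster(String localNode, String pushId) {
              Entry e = entries.get(pushId);
              return e != null && localNode.equals(e.node);
          }
      }
      {code}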

      The main difference with the single-node case is that the listener list is maintained entirely via local listen requests.

      One other aspect of Cloud Push is that the current push may be the one that is never acknowledged by the browser (this is actually the root of the current bug that caused us to have two different code paths). Every client has a different adaptive timeout (in the future we may want to quantize these into one-second batches for scalability, but we can handle thousands with individual timers, so this is not yet necessary -- it will be necessary, though, to hit a goal of 50,000 per node). When a push is dispatched, a timer should be started for each browser. It's OK to "block" the application thread due to CPU overhead or to write to an existing network connection, but not to wait for an indeterminate network event, such as a push acknowledgement.

      Incoming listen requests can cancel these timers, but if a given timer elapses completely, then the Cloud Push provider should be used, provided the client supports Cloud Push (it's not strictly necessary to cancel the timer, as long as the listen status is updated for when the timer wakes up).
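      A sketch of that per-browser confirmation timer, using a ScheduledExecutorService so the dispatching thread never parks waiting for the acknowledgement. The CloudPushProvider interface and the BROWSERID extraction helper are assumptions for illustration, not the actual ICEpush API:

      {code:java}
      import java.util.Map;
      import java.util.concurrent.*;

      // Sketch: when a push is dispatched, arm a timer sized by that browser's
      // adaptive timeout; a fresh listen request cancels it, and if it fires,
      // the notification is handed to the Cloud Push provider instead.
      public class CloudPushFallback {
          public interface CloudPushProvider { // assumed SPI, not the real one
              void send(String browserId);
          }

          private final ScheduledExecutorService timers = Executors.newScheduledThreadPool(1);
          private final Map<String, ScheduledFuture<?>> pending = new ConcurrentHashMap<>();
          private final CloudPushProvider provider;

          public CloudPushFallback(CloudPushProvider provider) {
              this.provider = provider;
          }

          // A PUSHID is BROWSERID:SUBID; acknowledgement is tracked per browser.
          static String browserIdOf(String pushId) {
              return pushId.substring(0, pushId.indexOf(':'));
          }

          public void onPushDispatched(String browserId, long adaptiveTimeoutMillis,
                                       boolean supportsCloudPush) {
              ScheduledFuture<?> timer = timers.schedule(() -> {
                  pending.remove(browserId);
                  if (supportsCloudPush) {
                      provider.send(browserId); // never acknowledged in time
                  }
              }, adaptiveTimeoutMillis, TimeUnit.MILLISECONDS);
              ScheduledFuture<?> previous = pending.put(browserId, timer);
              if (previous != null) {
                  previous.cancel(false);
              }
          }

          // An incoming listen proves the browser is alive; cancel the fallback.
          public void onListen(String pushId) {
              ScheduledFuture<?> timer = pending.remove(browserIdOf(pushId));
              if (timer != null) {
                  timer.cancel(false);
              }
          }
      }
      {code}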
        Attachments

        • PUSH-183.patch (63 kB, Jack Van Ooststroom)

        Activity

        Jack Van Ooststroom created issue -
        Jack Van Ooststroom made changes -
          Attachment: PUSH-183.patch [ 14789 ]
        Ken Fyten made changes -
          Salesforce Case: []
          Fix Version/s: EE 3.2.0 [ 10323 ]
        Ken Fyten made changes -
          Description: [edited; current text is shown in the Description section above]
        Ken Fyten made changes -
          Assignee: Jack Van Ooststroom [ jack.van.ooststroom ]
        Ken Fyten made changes -
          Assignee Priority: P1 [ 10010 ]
        Jack Van Ooststroom made changes -
          Status: Open [ 1 ] → Resolved [ 5 ]
          Resolution: Fixed [ 1 ]
        Jack Van Ooststroom made changes -
          Resolution: Fixed [ 1 ] → (cleared)
          Status: Resolved [ 5 ] → Reopened [ 4 ]
        Jack Van Ooststroom made changes -
          Status: Reopened [ 4 ] → In Progress [ 3 ]
        Jack Van Ooststroom made changes -
          Status: In Progress [ 3 ] → Resolved [ 5 ]
          Resolution: Fixed [ 1 ]
        Jack Van Ooststroom made changes -
          Resolution: Fixed [ 1 ] → (cleared)
          Status: Resolved [ 5 ] → Reopened [ 4 ]
        Jack Van Ooststroom made changes -
          Status: Reopened [ 4 ] → In Progress [ 3 ]
        Jack Van Ooststroom made changes -
          Status: In Progress [ 3 ] → Resolved [ 5 ]
          Resolution: Fixed [ 1 ]
        Ken Fyten made changes -
          Status: Resolved [ 5 ] → Closed [ 6 ]

          People

          • Assignee: Jack Van Ooststroom
          • Reporter: Jack Van Ooststroom
          • Votes: 0
          • Watchers: 1
