[PUSH-296] Pushes stop working after awhile - ICEsoft JIRA Issue Tracker

Hide

Permalink

Jack Van Ooststroom added a comment - 19/Nov/13 7:30 AM - edited

This one was quite tricky to understand why pushes stop working after awhile. It only seems to happen when multiple Groups and multiple Push-IDs are involved, and here is why:

Assume two Groups with a single unique Push-ID in each:
- Browser-ID1 has Push-ID1 which is part of Group1
- Browser-ID2 has Push-ID2 which is part of Group2
Using the adaptive heartbeat let them sit there for awhile until the heartbeat interval reached 45000 ms.
Upon receiving a listen.icepush request the BlockingConnectionServer invokes the LocalPushGroupManager.scan(String[]) method giving it the current list of listening Push-IDs
This scan(String[]) method is responsible for touching the Groups of which the list of listening Push-IDs are a member of and to discard any Groups who haven't been touched within the last specified groupTimeout (which is 120000 by default)
To avoid the touch/discard logic getting run for each received listen.icepush request the method has a built-in mechanism causing it to only get executed up to a maximum of once every 3000 ms.

Herein lies the problem. If the heartbeat interval is up to 45000 ms a double miss of the touch/discard logic due to the built-in cap into the scan(String[]) method can result in the discard of a Group. Even though listen.icepush requests are still being received containing Push-IDs that are a member of that Group. Once this Group has been removed, pushes to that Group no longer have an effect.

We should reconsider the implemented capping mechanism for the touch/discard logic. For now, moving the touch logic outside of the capped mechanism should suffice for an immediate fix of this issue.

Show

Jack Van Ooststroom added a comment - 19/Nov/13 7:30 AM - edited This one was quite tricky to understand why pushes stop working after awhile. It only seems to happen when multiple Groups and multiple Push-IDs are involved, and here is why: Assume two Groups with a single unique Push-ID in each: Browser-ID1 has Push-ID1 which is part of Group1 Browser-ID2 has Push-ID2 which is part of Group2 Using the adaptive heartbeat let them sit there for awhile until the heartbeat interval reached 45000 ms. Upon receiving a listen.icepush request the BlockingConnectionServer invokes the LocalPushGroupManager.scan(String[]) method giving it the current list of listening Push-IDs This scan(String[]) method is responsible for touching the Groups of which the list of listening Push-IDs are a member of and to discard any Groups who haven't been touched within the last specified groupTimeout (which is 120000 by default) To avoid the touch/discard logic getting run for each received listen.icepush request the method has a built-in mechanism causing it to only get executed up to a maximum of once every 3000 ms. Herein lies the problem. If the heartbeat interval is up to 45000 ms a double miss of the touch/discard logic due to the built-in cap into the scan(String[]) method can result in the discard of a Group. Even though listen.icepush requests are still being received containing Push-IDs that are a member of that Group. Once this Group has been removed, pushes to that Group no longer have an effect. We should reconsider the implemented capping mechanism for the touch/discard logic. For now, moving the touch logic outside of the capped mechanism should suffice for an immediate fix of this issue.

Hide

Permalink

Jack Van Ooststroom added a comment - 19/Nov/13 7:47 AM

Sending core/src/main/java/org/icepush/Group.java
Sending core/src/main/java/org/icepush/LocalPushGroupManager.java
Transmitting file data ..
Committed revision 38817.

Show

Jack Van Ooststroom added a comment - 19/Nov/13 7:47 AM Sending core/src/main/java/org/icepush/Group.java Sending core/src/main/java/org/icepush/LocalPushGroupManager.java Transmitting file data .. Committed revision 38817.

Hide

Permalink

Jack Van Ooststroom added a comment - 19/Nov/13 8:43 AM

Some additional thoughts as there seem to be 2 mechanisms involved in the automatic clean-up of Groups and/or Push-IDs.

The first mechanism is as described above and focuses only on automatic Group clean-up. It is driven by receiving listen.icepush requests and invoking the scan(String[]) method for the listening Push-IDs. This leads to possible clean-up of a Group if no listen.icepush requests have been received with listening Push-IDs that are members of that Group.

The second mechanism focuses on automatic Push-ID clean-up and the possible subsequent Group clean-up. Upon creating a Push-ID a new Expiry Timeout is started for this Push-ID. This Expiry Timeout is cancelled upon receiving a listen.icepush request containing the Push-ID as a listening Push-ID. Subsequently, a new Expiry Timeout is started. If no listen.icepush request containing the Push-ID as a listening Push-ID was received within the Expiry Timeout, the Expire Timeout executes and removes the Push-ID as well as removing the Push-ID as a member of any Groups it was a member of. If in result a Group became empty because of this removal, the Group is removed as well.

Non-automatic clean-up of Groups happens through the removeMember(String, String) method. If the specified Group is empty after the removal of the specified Push-ID from it, the Group is also removed.

A Group is never created without a Push-ID. Only the addMember(String, String) method results in the creation of Group and Push-ID objects. (Please note that createPushID(...) doesn't actually result in the creation of a Push-ID object.)

Focusing on just the second mechanism for automatic clean-up of Groups and Push-IDs, here is a list of scenarios:

An add-group-member.icepush request is handled with a specified Group and Push-ID. However, no subsequent listen.icepush request is received with the Push-ID as a listening Push-ID. The Expiry Timeout gets executed after the specified timeout (default: 120000) and results in the removal of the Group if empty in result.
An add-group-member.icepush request is handled with a specified Group and Push-ID. Subsequent listen.icepush requests are received with the Push-ID as a listening Push-ID. An Expiry Timeout gets subsequently cancelled and created due to the incoming listen.icepush requests.
Take the previous scenario, but this time the listen.icepush requests with the Push-ID as a listening Push-ID stop coming in after awhile. The Expiry Timeout gets executed after the specified timeout (default: 120000) and results in the removal of the Group if empty in result.

The 2nd mechanism seems to cover both healthy (add-group-member.icepush with subsequent listen.icepush requests) and unhealthy (add-group-member.icepush without subsequent listen.icepush requests) scenarios well by automatic clean-up of Groups.

I think we should consider removing the 1st mechanism, unless I overlooked a scenario.

Show

Jack Van Ooststroom added a comment - 19/Nov/13 8:43 AM Some additional thoughts as there seem to be 2 mechanisms involved in the automatic clean-up of Groups and/or Push-IDs. The first mechanism is as described above and focuses only on automatic Group clean-up. It is driven by receiving listen.icepush requests and invoking the scan(String[]) method for the listening Push-IDs. This leads to possible clean-up of a Group if no listen.icepush requests have been received with listening Push-IDs that are members of that Group. The second mechanism focuses on automatic Push-ID clean-up and the possible subsequent Group clean-up. Upon creating a Push-ID a new Expiry Timeout is started for this Push-ID. This Expiry Timeout is cancelled upon receiving a listen.icepush request containing the Push-ID as a listening Push-ID. Subsequently, a new Expiry Timeout is started. If no listen.icepush request containing the Push-ID as a listening Push-ID was received within the Expiry Timeout, the Expire Timeout executes and removes the Push-ID as well as removing the Push-ID as a member of any Groups it was a member of. If in result a Group became empty because of this removal, the Group is removed as well. Non-automatic clean-up of Groups happens through the removeMember(String, String) method. If the specified Group is empty after the removal of the specified Push-ID from it, the Group is also removed. A Group is never created without a Push-ID. Only the addMember(String, String) method results in the creation of Group and Push-ID objects. (Please note that createPushID(...) doesn't actually result in the creation of a Push-ID object.) Focusing on just the second mechanism for automatic clean-up of Groups and Push-IDs, here is a list of scenarios: An add-group-member.icepush request is handled with a specified Group and Push-ID. However, no subsequent listen.icepush request is received with the Push-ID as a listening Push-ID. The Expiry Timeout gets executed after the specified timeout (default: 120000) and results in the removal of the Group if empty in result. An add-group-member.icepush request is handled with a specified Group and Push-ID. Subsequent listen.icepush requests are received with the Push-ID as a listening Push-ID. An Expiry Timeout gets subsequently cancelled and created due to the incoming listen.icepush requests. Take the previous scenario, but this time the listen.icepush requests with the Push-ID as a listening Push-ID stop coming in after awhile. The Expiry Timeout gets executed after the specified timeout (default: 120000) and results in the removal of the Group if empty in result. The 2nd mechanism seems to cover both healthy (add-group-member.icepush with subsequent listen.icepush requests) and unhealthy (add-group-member.icepush without subsequent listen.icepush requests) scenarios well by automatic clean-up of Groups. I think we should consider removing the 1st mechanism, unless I overlooked a scenario.

Hide

Permalink

Ted Goddard added a comment - 19/Nov/13 9:59 AM

Neither of these mechanisms seem ideal, but it does appear that the second mechanism would be sufficient. Since the current implementation is in-memory Java objects, some of this can likely be solved with WeakReference and related techniques.

Cloud Push needs to be considered carefully here as well. I've split this topic into PUSH-297.

A future implementation may require database backing or sharding and will certainly require a persistence capability across server restarts.

Show

Ted Goddard added a comment - 19/Nov/13 9:59 AM Neither of these mechanisms seem ideal, but it does appear that the second mechanism would be sufficient. Since the current implementation is in-memory Java objects, some of this can likely be solved with WeakReference and related techniques. Cloud Push needs to be considered carefully here as well. I've split this topic into PUSH-297. A future implementation may require database backing or sharding and will certainly require a persistence capability across server restarts.

Hide

Permalink

Jack Van Ooststroom added a comment - 20/Nov/13 8:50 AM

Today I did some more investigation into these mechanisms as we are still seeing pushes stop working. I'll keep referring to 1st and 2nd mechanism as described in earlier comments.

The 1st mechanism only gets a chance to run upon receiving a listen request. The 2nd mechanism is always in play as long as Push-ID instances exist.

Following is a list of the various default timeouts that are in play here:

groupTimeout = 2 minutes (120,000 ms)
pushIDTimeout = 2 minutes (120,000 ms)
cloudPushIDTimeout = 30 minutes (1,800,000 ms)

Following is a list of scenarios to show the workings of these mechanism and the associated timeouts:

Single User with a Single Group and a Single Push-ID: Group clean-up only occurs via Expiry Timeout, as the scan doesn't get invoked when the User goes away.
Two Users with a Single Group each and a Single Push-ID each: Group clean-up occurs when:
1. a single User goes away via scan(String[]) after groupTimeout
2. both Users go away before groupTimeout via Expiry Timeout
Two Users with a Single Group and a Single Push-ID each: Group clean-up only occurs via Expiry Timeout

I think we can do one of two things here:

Increase the groupTimeout to more than the cloudPushIDTimeout (maybe twice the cloudPushIDTimeout)
Get rid off the 1st mechanism as I believe the 2nd mechanism suffices

Show

Jack Van Ooststroom added a comment - 20/Nov/13 8:50 AM Today I did some more investigation into these mechanisms as we are still seeing pushes stop working. I'll keep referring to 1st and 2nd mechanism as described in earlier comments. The 1st mechanism only gets a chance to run upon receiving a listen request. The 2nd mechanism is always in play as long as Push-ID instances exist. Following is a list of the various default timeouts that are in play here: groupTimeout = 2 minutes (120,000 ms) pushIDTimeout = 2 minutes (120,000 ms) cloudPushIDTimeout = 30 minutes (1,800,000 ms) Following is a list of scenarios to show the workings of these mechanism and the associated timeouts: Single User with a Single Group and a Single Push-ID: Group clean-up only occurs via Expiry Timeout, as the scan doesn't get invoked when the User goes away. Two Users with a Single Group each and a Single Push-ID each: Group clean-up occurs when: a single User goes away via scan(String[]) after groupTimeout both Users go away before groupTimeout via Expiry Timeout Two Users with a Single Group and a Single Push-ID each: Group clean-up only occurs via Expiry Timeout I think we can do one of two things here: Increase the groupTimeout to more than the cloudPushIDTimeout (maybe twice the cloudPushIDTimeout) Get rid off the 1st mechanism as I believe the 2nd mechanism suffices

Hide

Permalink

Jack Van Ooststroom added a comment - 14/May/14 7:14 AM

Sending core/src/main/java/org/icepush/AbstractPushGroupManager.java
Sending core/src/main/java/org/icepush/BlockingConnectionServer.java
Sending core/src/main/java/org/icepush/Group.java
Sending core/src/main/java/org/icepush/LocalPushGroupManager.java
Sending core/src/main/java/org/icepush/NoopPushGroupManager.java
Sending core/src/main/java/org/icepush/PushGroupAdapter.java
Sending core/src/main/java/org/icepush/PushGroupListener.java
Sending core/src/main/java/org/icepush/PushGroupManager.java
Sending core/src/main/java/org/icepush/PushID.java
Transmitting file data .........
Committed revision 41158.

Sending eps/src/main/java/com/icesoft/push/DynamicPushGroupManager.java
Sending eps/src/main/java/com/icesoft/push/RemotePushGroupManager.java
Transmitting file data ..
Committed revision 38307.

Show

Jack Van Ooststroom added a comment - 14/May/14 7:14 AM Sending core/src/main/java/org/icepush/AbstractPushGroupManager.java Sending core/src/main/java/org/icepush/BlockingConnectionServer.java Sending core/src/main/java/org/icepush/Group.java Sending core/src/main/java/org/icepush/LocalPushGroupManager.java Sending core/src/main/java/org/icepush/NoopPushGroupManager.java Sending core/src/main/java/org/icepush/PushGroupAdapter.java Sending core/src/main/java/org/icepush/PushGroupListener.java Sending core/src/main/java/org/icepush/PushGroupManager.java Sending core/src/main/java/org/icepush/PushID.java Transmitting file data ......... Committed revision 41158. Sending eps/src/main/java/com/icesoft/push/DynamicPushGroupManager.java Sending eps/src/main/java/com/icesoft/push/RemotePushGroupManager.java Transmitting file data .. Committed revision 38307.

Hide

Permalink

Jack Van Ooststroom added a comment - 14/May/14 7:20 AM

After an analysis and careful consideration the following was decided and implemented:

A Group can only exist if Push-IDs are contained.
Push-ID's ExpiryTimeout mechanism is responsible for Push-ID clean-up upon expiry (either based on pushIdTimeout or cloudPushIdTimeout).
If a Cloud Push-ID expires it is unparked if parked and cleaned-up.
If in result a Group becomes empty, this Group is also clean-up.

The Group clean-up mechanism mentioned earlier as mechanism 1 has been removed as well as code snippets related to it.

In the future we might want to expand on this as mentioned in the follow-up JIRA.

Marking this one as FIXED.

Show

Jack Van Ooststroom added a comment - 14/May/14 7:20 AM After an analysis and careful consideration the following was decided and implemented: A Group can only exist if Push-IDs are contained. Push-ID's ExpiryTimeout mechanism is responsible for Push-ID clean-up upon expiry (either based on pushIdTimeout or cloudPushIdTimeout). If a Cloud Push-ID expires it is unparked if parked and cleaned-up. If in result a Group becomes empty, this Group is also clean-up. The Group clean-up mechanism mentioned earlier as mechanism 1 has been removed as well as code snippets related to it. In the future we might want to expand on this as mentioned in the follow-up JIRA. Marking this one as FIXED.

Pushes stop working after awhile

Details

Description

Activity

People

Dates