A sticky problem – Wi-Fi clients that won’t roam

I work in a very interesting industry. It seems like in no time, WLANs have gone from being a nice-to-have but definitely optional thing to something that everyone must have in order to operate their businesses. Part of the issue with this rapid change is that we are left with some decisions made in the past that have turned out to be not so great in the present day. One of those issues is caused by the decision to leave roaming decisions (whether and when a WLAN client moves to a new AP and BSSID) up to the client. At the time I’m sure this seemed like a good solution. Craig Mathias over at NetworkWorld gave a bit of history around this decision that I wasn’t aware of before. Apparently it was felt that ‘few sites would be purchasing access points, so it was assumed that most networks would be peer-based’.

We who deploy WLANs professionally know well the pain caused by buggy client drivers and the wide variance between different vendors in how they decide to do something as simple as roaming to a new AP. In comparison, cellular networks largely leave the roaming decision in the hands of the cell tower, which, in the main, results in much smoother transitions (such as not dropping your calls) for clients as they move about. Two factors are converging to make this issue into a bigger problem for Wi-Fi networks. The first is that speeds are increasing on wireless (with 802.11ac they will go much higher) and, because of this, people are using the WLAN for more and more critical applications such as voice and video. The second is that the sheer number of devices using Wi-Fi as their only means of accessing the network has exploded since the introduction of the iPad in 2010. As Wi-Fi access becomes something that is relied on to run businesses, people naturally expect to be able to access it from wherever they need to work without a lot of hassle. They are not aware of such issues as limited bandwidth and contention caused by the increase in clients. They only want to know why their video doesn’t run smoothly or their voice call was dropped.

Yesterday I attended Aruba Networks’ 802.11ac day (as part of Tech Field Day), where they announced their newest AP, the AP-220, and had several other things to talk about. Foremost amongst those was the announcement of their new ClientMatch technology to help with the roaming issues caused by sticky clients. What is happening here is that clients are not making a decision to roam when a much better connection is available to them, or when an AP is overloaded with other clients and a much more lightly loaded AP is nearby. We have, in fact, a standardized way of dealing with this issue in the 802.11k and 802.11v standards, but not all clients support these standards. Aruba’s solution to this is to build intelligence into the access point and controller to help ‘match’ the client with the best AP from the client’s point of view. This intelligence is happening on several levels. At layer 1, the link is being optimized by moving the clients either using 802.11k/v or by disassociating the client and only offering association to the ‘better’ AP. At layers 2-3 the load on the APs is taken into account, and at layers 4-7 what the application is doing is taken into account.

Interestingly enough, Aruba believes that what they are doing is unique and therefore patentable. They have submitted a patent application, US20130036188, which describes what they are doing for the clients that do not support 802.11k. Essentially they are creating the beacon reports used in the standard by collecting information from the APs when the client sends probe requests, authenticates or associates with the AP. What happens then is the AP takes the SNR information, does an adjustment and then sends the information upstream to either a controller, IAP or Airwave, and that is used to figure out which AP is best for the client. This is quite a clever indirect way of figuring out how the client sees the network. The information that could be gathered is SNR, MAC address (as a key), channel, band, timestamp, noise floors, channel loading, AP capabilities and more. This, combined with data the controller already has about applications, means a decision can be made to whitelist or blacklist a certain STA on a group of APs. I’ve been told that they discovered in testing that the actual SNR on the client was off from the readings they were getting from their algorithm by a constant amount, so they were able to just adjust to account for that.
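To make the idea of an AP-side "beacon report" a little more concrete, here is a minimal Python sketch of how I picture it. This is my own illustration, not Aruba's code: the names ApObservation, build_virtual_beacon_report and best_ap are made up, and the SNR_OFFSET_DB value is a placeholder, since I was not told what the real constant adjustment is.

```python
from dataclasses import dataclass

# Hypothetical constant correction between the SNR an AP measures for a
# client and the SNR the client itself would see. Placeholder value only.
SNR_OFFSET_DB = -3.0


@dataclass(frozen=True)
class ApObservation:
    """One AP hearing one client frame (probe, auth or association)."""
    client_mac: str
    ap_name: str
    snr_db: float      # SNR of the client's frame as measured by the AP
    channel: int
    timestamp: float   # seconds


def build_virtual_beacon_report(observations, max_age_s, now):
    """Return {client_mac: {ap_name: estimated client-side SNR in dB}}.

    Only the newest observation per (client, AP) pair is kept, and
    observations older than max_age_s are ignored as stale.
    """
    latest = {}
    for obs in observations:
        if now - obs.timestamp > max_age_s:
            continue
        key = (obs.client_mac, obs.ap_name)
        if key not in latest or obs.timestamp > latest[key].timestamp:
            latest[key] = obs

    report = {}
    for (mac, ap_name), obs in latest.items():
        report.setdefault(mac, {})[ap_name] = obs.snr_db + SNR_OFFSET_DB
    return report


def best_ap(report, client_mac):
    """Return the AP with the highest estimated SNR, or None if unknown."""
    candidates = report.get(client_mac, {})
    if not candidates:
        return None
    return max(candidates, key=candidates.get)
```

Feeding this a handful of observations for one client heard by several APs gives a per-client table keyed by MAC address, which is roughly the shape of data the patent application describes being sent upstream.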

Once you have built up this database of information, the next decision you have to make is how to use it. Aruba told me that they decided to push this back down to the APs in a distributed way, so that the decision to associate a client wouldn’t suffer from having to look back to a controller or other device. As was pointed out on Twitter, moving active voice or video streams off a poorly performing AP is not a decision to take lightly. Voice call quality is usually measured with a metric called the mean opinion score (MOS). Aruba is using, as much as they can, information from the call itself to figure out the MOS on a real-time basis and then move other, non-voice clients off the AP if they would interfere with the call quality. This was particularly highlighted by the Microsoft presenter, who spoke about how they had opened up an API so that Aruba could gather this type of information about the Lync call. It wasn’t discussed, but I think that the basic client roaming algorithm would move a voice client if it got too far away from an AP for its inbuilt MOS to be sufficiently high. In this case, the Aruba ClientMatch system would encourage it to pick a much better AP by only offering those that would be good for the client. One final point: Aruba was very careful to point out that they took a cautious approach with ClientMatch, so they were not moving clients with an 80% score just to get a 90% score. They wanted to make sure they concentrated on the bottom percentage of clients so that the improvement would be much larger and this would increase stability of the WLAN in general.
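The cautious approach described above can be sketched as a small decision function. Again, this is only my illustration in Python and not how ClientMatch is actually implemented: the ClientState type, the steering_target function and both thresholds (POOR_SNR_DB and MIN_GAIN_DB) are assumptions I made up to show the shape of the logic.

```python
from dataclasses import dataclass, field

# Hypothetical thresholds, chosen only for illustration.
POOR_SNR_DB = 15.0   # only clients below this are considered "sticky"
MIN_GAIN_DB = 10.0   # do not steer for a small improvement


@dataclass
class ClientState:
    mac: str
    current_ap: str
    snr_by_ap: dict = field(default_factory=dict)  # estimated SNR (dB) per AP
    on_voice_call: bool = False


def steering_target(client):
    """Return the AP this client should be steered to, or None to leave it."""
    # Never disturb a client with an active voice call.
    if client.on_voice_call:
        return None

    current = client.snr_by_ap.get(client.current_ap)
    # Leave clients alone if they are doing fine or we lack data on them.
    if current is None or current >= POOR_SNR_DB:
        return None

    others = {ap: snr for ap, snr in client.snr_by_ap.items()
              if ap != client.current_ap}
    if not others:
        return None

    best = max(others, key=others.get)
    # Only act when the improvement is large, not 80% to 90%.
    if others[best] - current < MIN_GAIN_DB:
        return None
    return best
```

The point of the two thresholds is that a client is only touched when it is both in poor shape and has a clearly better option, which is how I understood the "bottom percentage of clients" comment.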

There are some additional uses that can come from being able to track what is happening from the client point of view. These uses come in Aruba’s Airwave product, where the information is being used to add to the ability of Airwave to troubleshoot users’ connectivity. Aruba has created additional reports to give more visibility into client behavior, one of which is called the steering report. This report gives you information about which clients are being steered by ClientMatch and how often they have been steered. This gives you some clues as to which clients might have firmware or driver issues that should be looked at because they are constantly sticking to poor AP connections. VisualRF additionally shows the status of client connections, indicating in a nice red color when a client has a poor connection. All this builds your control over what is happening on your Wi-Fi network, and especially control over the client behavior, that you haven’t had before.
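As a rough picture of what a steering report boils down to, here is a short Python sketch that counts steer events per client and flags the ones steered most often. The event format, the steering_report function and the flag threshold are all hypothetical; I have not seen how Airwave stores or computes this.

```python
from collections import Counter


def steering_report(events, flag_threshold):
    """Summarize steer events per client.

    events: iterable of (client_mac, from_ap, to_ap) tuples (assumed format).
    Returns a list of (client_mac, steer_count, flagged) tuples, most
    frequently steered clients first. A client is flagged when its count
    reaches flag_threshold, hinting at a driver or firmware problem.
    """
    counts = Counter(mac for mac, _from_ap, _to_ap in events)
    return [(mac, n, n >= flag_threshold) for mac, n in counts.most_common()]
```

A client that shows up at the top of this list day after day is a good candidate for a driver update.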

Whew, this has definitely been one of my longer blog posts. There was quite a lot of information to get through and I wanted to make sure, as much as I was able, to have it correct. If anyone from Aruba reads this and sees I got something very wrong in my explanations of what’s going on, I’d love to hear from you so I can correct this. Indeed I would also love to hear from anyone else with an opinion on this technology. I think this could be the start of fixing something that has been very annoying for some time, the ‘client’ problem. I know I’m looking forward to doing some testing of my own of ClientMatch once it’s released.

21 Responses to “A sticky problem – Wi-Fi clients that won’t roam”

  1. Chris,

    Great summary, thanks for the coverage and thanks again for being a delegate at the 11ac TechFieldDay.

    I think one of the most important aspects of ClientMatch is that it does not require mobile devices to initiate the roaming / association event. It just does not wait for the mobile device to realize that there is a better radio close by.

    Another important point is the fact that even if the mobile device is static – not moving, not changing locations – ClientMatch might take action. For instance, consider increased load on a specific radio, with increasing interference from a neighboring network, while all mobile devices on the radio stay static. ClientMatch will start moving devices if there are “better signal quality” options around.

  2. Scott says:

    Nice write up, but I’m concerned from a client perspective.

    Anytime there is discussion of the controller or AP only authorizing authentications from a ‘better’ AP or a ‘better’ band I get nervous.

    Obviously the devil is in the details, but until the auth/assoc requests start coming from the APs it’s up to the client to connect to whichever AP it feels is best. I’m ok with the idea of providing improved data (802.11k) to the client or preferred APs to improve the client’s decision, but if the client has a BSSID list and it picks one, isn’t it only going to hurt the perceived performance if it is rejected unnecessarily?

    If the client issues a probe request and gets an answer from the AP, isn’t it an implicit contract that the AP will service the client if it chooses to connect in one of the detailed BSSIDs?

    I’m concerned that the clients that are the least sophisticated here are going to be impacted the most. If they were already implementing 802.11k chances are they aren’t in the bottom 10%-40% of sticky clients. So, aren’t these the ones that are going to be hurt the most by unexplained deauthentications/disassociations? Because in general, the next thing they will do is another probe request, listen to the responses, and pick the ‘best’ AP again, and there is a decent chance it was the one they were just connected to.

  3. Devin Akin says:

    Chris,

    This is an excellent overview. Thanks for the effort. Kudos to Aruba for a nice innovation.

    Devin

  4. Bob O'Hara says:

    Nice writeup, Chris!

    The Client Match technology is something that Airespace was doing in 2004. I read the Aruba patent application and can say that the dynamic load balancing of the Airespace controller/AP is prior art for both of the independent claims, since I designed it. We also introduced the exchange of AP reports between client and AP (essentially portions of 11k and 11v) in a proprietary extension with NEC. These were incorporated into Cisco’s Client Extensions (CCX) after the acquisition of Airespace in 2005 and, subsequently, introduced to the 802.11 working group by Tim Olsen (from Cisco).

    My guess at what they are doing for the clients that do not implement 11k or 11v is some or all of the following.
    1. Disassociation of sticky clients
    2. Ignoring probe requests at APs where they don’t want the client associated
    3. Ignoring or refusing association requests at APs where they don’t want the client associated

    All this, of course, assumes that these clients without 11k or 11v will react rationally to being refused by their top choice of AP. Unless things have changed with client vendors in the last five years or so, many of these clients will enter a death spiral, where they react properly to the disassociation or failed association, but then simply reevaluate the same information and come to the same decision, coming right back to that AP where Aruba doesn’t want them to associate. I hope client vendors have gotten smarter. But, I would not bet the farm on it.

    It is nice that Aruba has finally caught up.

  5. Andrew says:

    Chris,
    Great information on this technology. It sounds like Aruba has really thought through the implementation of ClientMatch.

    One question that is really minor in your article – did you do much research on the cellular roaming since you pretty definitively state that roaming is controlled by the cell towers? I did several hours of research on this topic when I wrote one of my blogs on Wi-Fi roaming analysis and it came up mixed. It seemed like the older cellular technologies relied on the cell towers, but newer ones like GSM and LTE were relying more on distributed roaming decisions by the handsets themselves but with key enhancements and information exchange with the cell towers that improved connection reliability. It seems that this mirrors where Wi-Fi is going with 802.11k/v. For reference: http://revolutionwifi.blogspot.be/2011/12/wi-fi-roaming-analysis-part-1.html.

    Thanks again!
    Andrew

  6. Will Jones says:

    Hi Chris,

    Great post and an interesting new feature from Aruba. I was thinking about 802.11k the other day in the context of Cisco WLCs, and it seems that in 7.4 code at least they also have some sort of feature to help with non-11k clients roaming, although not working quite as far up the stack as Aruba is:

    http://www.cisco.com/en/US/docs/wireless/controller/7.4/configuration/guides/wlan/config_wlan_chapter_0100011.html#task_E29231879BE540BD9A361774226A64B5

    I’ve yet to test this or enabling 11k on a WLC in the real world but it seems like a sensible idea on the surface.

    Will

  7. WildDev says:

    Andrew,
    No, I didn’t really research the cell tower info in depth like you did. I had taken this as a given based on other blog articles I’d seen. I see that it’s not always the case. Hopefully we’ll see a similar progression with Wi-Fi to improved information exchange between clients and APs.

  8. WildDev says:

    I see your concerns, but I don’t believe there is an implicit contract for the AP to answer a client. There is always the possibility that for one reason or another an AP, or a client for that matter, will refuse the connection.
    From what I understand, this has had some pretty significant real-world testing (I have personally seen it in two large networks) to make sure it was tweaked to work with the majority of clients that are out there. I expect that corner cases will come up where there may be some issues, but in those cases there is always the choice to turn it off if it’s making a significant impact.

  9. How about “roaming” the BSSID instead of the station?

    If only a few clients in the network do not support 11k, you can do the following for those specific clients:

    1) Ask the current AP-A modem to stop ACK’ing the sticky client
    2) Create a new virtual BSSID with AP-A’s MAC address (no beacons!) on AP-B and ask the modem to ignore all packets except the sticky client’s packets.

    0 packet loss

  10. Matt says:

    Chris,

    I am concerned by the disassociation that the AP will be doing for clients that do not support the 802.11k/v standards. For instance, if I’m using VoIP and the AP disassociates from my handset, could the re-connection time be considerably greater than what I would experience on a vendor’s network that has seamless roaming? All in all, how does this compare to a solution like Meru’s, or even Ubiquiti’s new software? They claim to be able to produce fully seamless roaming and load-balancing by taking the decision out of the client’s hands.

  11. Guy says:

    Didn’t Symbol (now Motorola) do this a long time ago with preemptive roaming?

  12. WildDev says:

    If a VoIP call is in progress, the system knows not to move the client. The VoIP handset would probably move itself sooner if its signal got low enough, as handsets aren’t very tolerant of low signal strength. This is nothing like Meru’s or Ubiquiti’s systems, as they do not roam at all due to the single-channel architecture (SCA). SCA means there is only one BSSID, so there is nothing to roam to. Their system does take the decision out of the client’s hands by not roaming at all! The issue with this is, I believe, a fundamental one. It doesn’t scale, because a system that keeps all APs on the same channel, like SCA, has an inherent co-channel interference problem that increases as the number of APs increases. Even Meru realized this and so designed a second architecture that basically layered a second and then a third channel on top to deal with the interference.

  13. Adrian S. says:

    Thanks for the post. I’d be curious to learn the technical details on how to force a non-11k/11v client to roam to a certain AP without being too disruptive. Certainly the current AP can deauthenticate or disassociate the client. At this time the client either already has a list of roaming candidates (built from prior scans) or will perform an immediate/emergency scan, typically active. It is not very clear whether an AP may freely ignore probe requests in order to steer the client towards the matched AP. Clause 10.1.4.3.2 (Sending a probe response) seems to use the “shall” language, e.g. “[…] An AP shall respond to all probe requests meeting the above criteria.” Also I wonder how accurately the infrastructure can figure out if a client is moving and in what direction, and eventually how fast, in order to predict the next best AP match.
    Thanks,
    Adrian

  14. You’re right, I didn’t consider that it would force the whole network to work on one channel.

    BTW, in cellular Wi-Fi offload there is a similar problem. The client connects to the AP from a fringe area where it would be better if it chose to connect to cellular data. Many vendors attack this issue with different approaches.

  15. agn says:

    “The client report contains at least a list of client MAC addresses and the associated SNRs or equivalent, and may include additional information such as timestamp, channel and/or band. The client report may also contain AP specific information such as channel noise floors, channel loading, AP capabilities, and the like”. The above is what is used to generate a report which is pushed to all APs to either blacklist or whitelist the stations. How does the system decide not to push voice clients that already have a better MOS to a new AP, as said in this article? I guess my point is that there is no application layer awareness built in and this is purely L2. I am not so sure, but the underlying “invention” is all about the AP not responding to clients in the blacklist, thereby forcing them to disconnect and rejoin a better AP. Hence I don’t think there is any need to discuss “roaming” here, as the technology is not fundamentally addressing that. Next we need to know if the algorithm is made available by default on all radios (I bet it won’t be) or can be selectively enabled on radios as a network “fine tuning” parameter.

  16. James says:

    Ruckus have been doing this for years, it just works ™, whereas when doing a PoC I had big problems with iPads being sticky, and Aruba’s old load balancing of only sending disassociate packets didn’t work half as well.

  17. Paul says:

    agn, I believe client match is a “fine tuning” parameter in the AP Radio config profile and is not enabled by default.

  18. JP says:

    Has anyone tested this feature in a real production environment? I just wonder how well it works with client devices that don’t have roaming features (like 802.11k or roaming sensitivity adjustment). If this relies totally on AP functionality, Aruba has hit the bullseye.

  19. Jon Foster says:

    I’ve been using ClientMatch on an academic campus now for some time. Critically, that means it’s BYOD by default, requiring support for anything and everything with respect to client devices. Generally it seems to work well, with a few exceptions: we had some issues with Samsung Galaxy Tabs which were ignoring the band-steers and simply continuing to re-auth to 2.4. The delay caused by the steer attempt caused software at higher layers to decide the network connection had dropped.

  20. flipdee says:

    Does anyone know if an Aruba 620 branch controller can manage a group of Aruba AP-105 or AP-125 access points with the full ClientMatch, seamless roaming configuration that the access points and a full mobility controller could muster?
    Also, what’s the minimum hardware requirement to achieve the same with an 802.11n configuration using Ruckus equipment?

    Thanks a lot.
    flipdee

  21. MKL says:

    We are using Aerohive. Does Aerohive have a technology similar to Aruba’s ClientMatch?
