Delivering SLA for Critical Applications with SD-WAN Policies
The Viptela Software-Defined WAN (SD-WAN) platform delivers an agile, cloud-ready network infrastructure. Viptela SD-WAN has been deployed by the largest banks, retailers, conglomerates, healthcare providers and insurance companies. The major benefit of the SD-WAN technology is a single overlay architecture with centralized policy & management. As more enterprises deploy SD-WAN at large scale, one important question is how network administrators can guarantee performance and resiliency of critical enterprise applications – even during outages and unpredictable behavior of the physical links.
- Hear about the Viptela SD-WAN architecture that implements a single, software overlay over MPLS, Broadband & LTE
- Understand how the SD-WAN architecture provides centralized policies for critical applications
- Learn how the network detects network quality changes and enforces traffic steering in real-time based on SLA requirements
- See a live demo of applications getting steered in real-time based on network outage and SLA
Lloyd: Good morning, everybody. Good morning, good evening, good night, depending on the location you are in. My name is Lloyd Marona, and I’m joined by David Klebanov. Today we’re going to talk to you about meeting SLA for critical applications using SD-WAN. The entire webinar is really quick. We’re just going to cover, potentially, a few major points.
The first point we’d like to cover essentially is, when you look at the applications within your enterprise, what are the various influences from a networking perspective that determine the SLA of those applications. Next we’re going to talk about, what mechanisms do you use today, using the traditional networking technologies to manage the SLA for these applications and how SD-WAN changes that when you move to a more unified hybrid WAN infrastructure. Lastly, we’re going to end up showing you a demo of one of the cases that we are talking about. We won’t have time to cover all the cases, but we’re more than happy to conduct one-on-one sessions or demos for any more cases that you’d like to investigate further.
So let me get started. Potentially, in your enterprise today, it’s very likely that you have more than one link infrastructure in your WAN. Essentially, you might have it using a traditional mechanism, a traditional WAN infrastructure where you have things that are separately managed and most likely in active standby mode, or it could be more in tune with an SD-WAN infrastructure, where everything is managed using a unified overlay, centrally managed, and everything is in active-active mode.
So in these scenarios, when you have multiple links, one of the elements that determines failure is essentially what happens when one of those links goes down. How do you handle the applications when one of those links goes down, especially the critical applications? Next, what happens when both your links go down? Do you have a circuit of last resort that kicks in in order to keep at least the critical applications alive? Number three, if you have a redundant CPE infrastructure for sites, what happens when one site goes down?
Again, if you have a traditional infrastructure, this will most likely be active-standby. If you have an SD-WAN based infrastructure, this is an active-active design, typically. So what happens to all the flows and all the traffic that’s going on on the device that has failed? How does that get switched over to the backup device? That’s a category of failure. What happens – either after link failure or under a normal mode of operation – when one of your links, especially the [unintelligible 00:03:10] link, is close to being oversubscribed? How do the applications get managed, and what policy kicks in at that time to detect the condition as well as steer traffic away from that link?
Next – and this is a major point – if you look at typical application problems, brownout is a big factor. A brownout is where a link degrades for no apparent reason. It degrades slowly, with increased jitter and increased loss. You end up in a situation where you’re getting erratic performance from an application. So how do you detect the situation, and how do you work around it?
The next big problem is related to path MTU changes. Now, as you know, applications themselves prefer an MTU that is as large as possible, because that delivers the best-case application performance. But as you transit from one carrier’s network to the other, the interchange points tend to have varying MTUs. What happens at those points is erratic behavior. Again, does the [unintelligible 00:04:22] message come back to the sending application to adjust its MTU? Many times it doesn’t. In addition to that, does the packet get fragmented and reach the other side, and will it be joined again before [unintelligible] to the application?
So there is some unpredictability when it comes to MTU, and the question is, can the WAN infrastructure, SD-WAN infrastructure, detect this proactively and manage, essentially, the entire traffic flow through the varying MTUs such that the application itself is unaffected.
Next, can we design topologies that are application specific? There is a category of applications that communicate directly site to site. There is a category of applications that communicate to the cloud, and a category of applications that communicate to the data center. So the question is, can topologies be defined in advance such that these applications perform optimally based on [unintelligible 00:05:30]? Typically, voice applications would require direct branch-to-branch connectivity. Cloud applications might require some kind of a regional exit. Financial transactions or POS transactions might require more of a hub-and-spoke network.
Now, moving on to the next category, that is cloud applications, what are the factors that determine cloud application performance? Again, this can be broken down into two. One is the portion of the network that resides within the enterprise, and the other is the portion of the network that resides outside the enterprise, that is, everything from your exit point to the cloud application. So this is a matter of optimizing both parts of the path to the application.
So essentially, these are the factors that you view collectively to determine how the applications in particular will perform under different network scenarios. Now, at this point I want to turn it over to David, who will walk us through a couple of the scenarios, how they’re handled today, and how SD-WAN handles them. Go ahead, David.
David: Right. Thank you very much, Lloyd. So as you guys have seen, the delivery of the application is not just a single or unidimensional view of the world. It takes quite a few features and functionalities and approaches and architectures that are brought together to be able to deliver the SLA for the critical applications successfully. SD-WAN, of course, with its innovative approach, excels at those tasks; and as you have seen, there are quite a few considerations in here.
Now, how are the things handled today? What are the different approaches that exist today? Some of these are SDN, and some of these are not SDN-related approaches. But when we observe the trends that exist today in the market, we basically see two larger categories of approaches. The first approach is a traditional approach that relies on a complex CLI bound provisioning.
This is the approach that had been predominant for quite some time, and frankly it doesn’t quite cut it anymore. It doesn’t live up to the expectations of agility. It doesn’t live up to the expectations of quick turnaround service delivery. Doing things in this fashion is not something a traditional enterprise or service provider can really live with, especially delivering the SLAs for the applications as we mentioned before. It’s a multidimensional thing. Handling those things in complex CLI provisioning is not feasible.
The other approach we see is a unidimensional approach to delivering an application SLA through a single feature or handful of features that address a specific segment of application delivery. They’re not architecturally sound solutions. These are solutions that provide a remedy for a specific situation that may or may not be present at all the deployments or the deployments of interest. So we want to make sure that we have a grasp of the landscape today before transitioning to SD-WAN, and specifically how Viptela views this set of problems or this set of challenges that needs to be solved.
Let me spend a few minutes now to walk you through the philosophy of delivering the critical applications in a holistic way; we believe that here at Viptela we have the solution to do that.
Lloyd: So I just want to interrupt you for a minute, David, and say, for folks that have questions or would like to chat about anything, we are taking questions during the webinar. So please ask your questions, and we’ll answer them during the webinar. We’ll also bring it up at the end, during the Q&A session [unintelligible 00:10:00].
David: Yes, right. Thank you, Lloyd. We definitely encourage an open conversation, so feel free to post your questions in a Q&A window, and we’ll take them as they come if they’re relevant to the discussion. If this is something more generic in nature, we’ll of course take them at the end.
So back to the philosophy, we want to look at three tiers of the solution, three elements or pinnacles of this holistic approach. The first and most fundamental one is building a transport independent fabric. It’s become a given fact in SD-WAN deployment today that you want to diversify your transports; because the transports have different characteristics, and they’re changing in nature. So you want to make sure that you don’t put all your eggs into the same basket, so to speak. You want to make sure that you diversify the capacity that you are getting and the capacity that you are delivering to your organization.
The way you deliver that is by selecting from the variety of different transports that are available today; be those broadband – cable, DSL – solutions, be those the traditional MPLS carriers, be those 3G, 4G, CDMA type of cellular solutions, and even satellite solutions, that provide you different capacities that you want to deliver to your organization. You make the decision to deliver the capacity based on the technology, based on the cost and features and benefits at each individual site you want to deliver that capacity to.
So building a transport independent fabric across multiple transport is important to deliver the critical applications and SLAs. In addition, you want to make sure that that transport fabric, or transport independent fabric, has two distinct characteristics of being highly secure and being highly scalable. The security comes inherent with running today’s businesses. Scalability comes to make sure that your business grows, and the growth of the business is not inhibited by the type of solution that you have decided to deploy.
It’s really the scalability and security of the transport independent fabric that lays the foundation for what you really want to do: to have a platform for an application delivery. Now, the application delivery is really, as we mentioned earlier, is a multidimensional approach that has several elements that you, as an organization that is walking down the path to deploy SD-WAN solutions, should really be paying attention to.
So this relates to what Lloyd was talking about earlier as far as the set of influencers on delivering the SLA for your applications successfully. Now, that was the set of problems. What we’re going to talk about now is the set of solutions. So, site survivability: There is really no way for you to deliver an SLA for an application if the site is not survivable. The survivability of the site has to be addressed for every possible failure scenario, as Lloyd mentioned earlier.
This is where you want to have redundant CPE devices or redundant devices that are installed in the locations where the secure SD-WAN fabric is extended to. The redundancy is provided through standards based routed and bridged interfaces. What I mean by routed and bridged interfaces: This is the full support for OSPF, BGP, [unintelligible 00:13:49] standard protocols, to accommodate every possible deployment scenario.
In case of loss of all the links, where you do not have any more connectivity through either MPLS or wired broadband, you want to make sure you have the capacity to deliver the connectivity service using cellular technology. That’s just 2G, 3G technologies available out there. Now, you want to make sure that technology is also integrated into the solution that you are looking to deploy, and it’s not an add-on through a different mechanism.
QoS is obviously a building block for successful application deployment, especially when you talk about delivering SLAs to that application deployment. But is QoS only something that is enacted at the network level? Absolutely not. The thing that we hold very important in this regard is the ability to deliver a comprehensive suite of distributed, per-device QoS features, yet keep the control centralized. Here we are talking about not only things like traffic marking, but also things like shaping the traffic based on the circuit bandwidth that you are getting from the service provider.
So you can eliminate any behavior in a service provider network where you are oversubscribing the bandwidth that you have been provided with. You want to make sure your device has a comprehensive set of shaping, policing, and queueing capabilities in addition to the traditional marking, to make sure that the SLAs are delivered.
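To make the shaping and policing idea concrete, here is a minimal Python sketch of a generic token-bucket rate limiter. It is not Viptela's implementation; the class name, the rate, and the burst size are all assumptions chosen for the example. The bucket only answers whether a packet fits within the contracted circuit bandwidth right now: a shaper would queue a packet that does not fit and retry later, while a policer would drop or re-mark it.

```python
class TokenBucket:
    """Hypothetical token-bucket limiter; names and values are illustrative only."""

    def __init__(self, rate_bps: float, burst_bytes: int):
        self.rate_bytes_per_s = rate_bps / 8.0   # contracted circuit rate, in bytes/s
        self.burst_bytes = burst_bytes           # largest burst the bucket can hold
        self.tokens = float(burst_bytes)         # start with a full bucket
        self.last_time = 0.0                     # timestamp of the previous check

    def allow(self, packet_bytes: int, now: float) -> bool:
        """Return True if the packet conforms to the configured rate at time `now`."""
        elapsed = max(0.0, now - self.last_time)
        self.last_time = now
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(float(self.burst_bytes),
                          self.tokens + elapsed * self.rate_bytes_per_s)
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True
        return False


# Example: an 8 kbit/s circuit (1000 bytes/s) with a 1500-byte burst allowance.
bucket = TokenBucket(rate_bps=8000, burst_bytes=1500)
first = bucket.allow(1500, now=0.0)    # uses the whole initial burst
second = bucket.allow(500, now=0.125)  # only 125 bytes refilled so far
third = bucket.allow(500, now=0.5)     # 500 bytes accumulated by now
```

Keeping the limit at or below the purchased circuit rate is what lets the edge device, rather than the provider network, decide which traffic waits when the link is busy.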
Now, when you go past the individual device QoS characteristics, this is when you start looking at network-wide QoS characteristics. Here at Viptela we call that set of functionalities application-aware routing, which means that you want to make sure that the routing of the application traffic in your fabric is done in a way that is sensitive to the changing conditions of the underlying transports. So we want to make sure that we abstract the transports through the transport independent fabric.
At the same time we want to ensure that we provide comprehensive monitoring functionality, too, for the fabric to also detect the changing QoS characteristics of those underlying transports. Once the SLAs that you have set for the applications of interest have been violated, you want to take an action to remediate that condition and continue delivering the SLA to the applications.
Not all the applications are delivered in house. Many times the applications are delivered in the cloud. So you want to make sure that the path to get to those applications that are in the cloud is the most optimal one. If you think, for example, of applications such as Office 365, which is becoming very interesting for organizations, and you try to answer the question of where Office 365 really lives: we know it’s in the cloud, but where in the cloud?
So the ability to detect the location of an application in the cloud and deliver the shortest possible path considering the SLA that you want for that application, is something that is inherent in our philosophy of delivering overall SLA for applications; be those applications, again, hosted in your own data centers or in the cloud.
Finally, the application specific acceleration features: At times the applications that are either hosted in your data centers or hosted in the cloud have a specific set of QoS characteristics, and you need assistance to make sure the performance is adequate. That is done either through built-in features in the solution, or through features such as service insertion, where you can insert third-party optimization solutions, which are application specific, into the SD-WAN fabric.
So these are the main care-abouts that you want to make sure the SD-WAN solution you are exploring has in order to be able to deliver a comprehensive set of critical application SLAs. Now once those are delivered, of course, you want to ensure that things are monitored. You want to make sure that you, at any given time, have the visibility into the actual performance of the applications as they’re being delivered. You want to make sure that you collect the analytics so that you have a long-term view of the performance of your fabric, and can identify trends that emerge over time that may not be visible through simple operational monitoring alone.
So both monitoring and analytics together provide you this comprehensive way of looking at, taking a close look at, your application delivery. Again, the ecosystem, the partnerships that we strike, expand the monitoring and analytics capabilities beyond what a single product can offer.
So this is the three-tiered approach that I mentioned earlier, about building fabric that serves as the foundation for an application delivery that has the monitoring and analytics capabilities to make sure that you have everything in check.
Before we go to the demo, are there any pressing questions that we see that …?
Lloyd: The major question that is asked now is, how do we detect link quality in real time and implement – how do we detect link quality and implement policy in real time.
David: Right, okay. It’s a very, very important question; because there’s lots of ambiguity as far as how that can be delivered. So our philosophy, as far as delivering link quality monitoring in real time is to perform that monitoring proactively, starting from the moment that the secure connection is being established, the secure IPsec tunnels are being established, between the two endpoints, based on the application aware topology that we talked about earlier.
So be it hub and spoke topology, be that a full mesh topology, partial mesh topology, star topology, whatever the case may be for a specific application, we support an environment where you can have multiple topologies coexisting all at the same time, completely segmented, so you can map your application to the best topology that gives you the best SLA performance.
So as those topologies are being established, the system also automatically, immediately and proactively, starts monitoring the performance characteristics of loss, latency, jitter, changes in path MTU, things of that nature; to make sure that, should you give the system an instruction to uphold a specific SLA for an application of interest, the system is always continuously aware of those characteristics, and is also aware of the changing nature of those characteristics; because in circuits that are misbehaving or having a brownout, be those direct circuits or indirect circuits, that condition can come and go. So you want to make sure that the monitoring of that condition is something that is done continuously.
One more thing that is important to consider is how sensitive you want to be to those changing conditions. It really comes with the flexibility of our architecture. When you think about different parts of the world where circuits are behaving differently, you can have parts of the world or regions where you have circuits that have a very low tolerance to QoS changes; because these circuits are misbehaving more frequently or having brownouts more frequently.
You don’t want to take a remediating action immediately, because you don’t want to create a [churn] in a system as far as dealing with those conditions. You want to make sure that you have the granularity to define how tolerant you want to be to those QoS changes, SLA changes, in the underlying transports.
Once you detect those brownout conditions or hard down conditions, that is when you want to take remediating action through either rerouting or other mechanisms. This is done through the centralized policies.
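The continuous monitoring and the tolerance setting David describes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how the Viptela fabric is implemented: the class names, the window of 5 probe samples, and the tolerance of 3 bad samples are all made up for the example. The point is that a path is flagged only after several samples in a sliding window breach the SLA, which avoids churn when a brownout comes and goes.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class SlaThreshold:
    """Acceptable limits for one SLA class (values are illustrative)."""
    max_loss_pct: float
    max_latency_ms: float
    max_jitter_ms: float


@dataclass(frozen=True)
class ProbeSample:
    """One measurement of a tunnel, as a probe might report it."""
    loss_pct: float
    latency_ms: float
    jitter_ms: float


class PathMonitor:
    """Hypothetical monitor: flags a path after `tolerance` bad samples in the window."""

    def __init__(self, threshold: SlaThreshold, window: int = 5, tolerance: int = 3):
        self.threshold = threshold
        self.tolerance = tolerance
        self.samples = deque(maxlen=window)  # oldest sample falls out automatically

    def record(self, sample: ProbeSample) -> None:
        self.samples.append(sample)

    def _violates(self, s: ProbeSample) -> bool:
        t = self.threshold
        return (s.loss_pct > t.max_loss_pct
                or s.latency_ms > t.max_latency_ms
                or s.jitter_ms > t.max_jitter_ms)

    def in_violation(self) -> bool:
        bad = sum(1 for s in self.samples if self._violates(s))
        return bad >= self.tolerance


good = ProbeSample(loss_pct=0.1, latency_ms=40.0, jitter_ms=5.0)
bad = ProbeSample(loss_pct=6.0, latency_ms=300.0, jitter_ms=80.0)

monitor = PathMonitor(SlaThreshold(max_loss_pct=1.0, max_latency_ms=150.0, max_jitter_ms=30.0))
for sample in (good, good, bad, bad):
    monitor.record(sample)
after_two_bad = monitor.in_violation()    # two bad samples: still tolerated
monitor.record(bad)
after_three_bad = monitor.in_violation()  # third bad sample crosses the tolerance
for sample in (good, good, good):
    monitor.record(sample)
after_recovery = monitor.in_violation()   # window now holds only two bad samples
```

A region with frequently misbehaving circuits could use a larger window or a higher tolerance, so that remediation is reserved for sustained degradation.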
All right. So, let’s move on to the demonstration. I will set the stage by explaining what you guys are going to be seeing. Again, as you’ve seen in the past couple of minutes, it’s a very comprehensive set of functionalities to be able to deliver the SLA assurances to the applications. So we won’t have the time to demonstrate all of them. As Lloyd mentioned in the beginning, we would love to have one-on-one sessions, where you can reach out to Lloyd, myself, or just any representative of Viptela. We can definitely schedule more comprehensive deep dives and show you more cases where delivering critical application SLA could be done in your environment.
In this specific case, we’re going to demonstrate the application-aware routing functionality. What we’re going to see is, we have two sites. Both sites have vEdge routers provisioned. The vEdge router is a CPE device that allows you to extend a secure virtual network to that location and uphold the centralized policies defined on the SDN controllers.
Now, in site number one we have the sender, and in site number two we have the receiver. Now, we have two video streams going between the two sites. Video stream number one uses traditional routing. That stream is configured to be sent over an MPLS network unconditionally, regardless of the behavior of the MPLS network from the QoS standpoint.
Video stream number two is actually configured to leverage application-aware routing, with MPLS being the preferred circuit. So as you can see, both streams – video number one and video number two – are going over the MPLS network in the normal condition. Now, a brownout can occur on either a direct or an indirect link. A direct case is, for example, bandwidth saturation of the link that is directly connected to the vEdge router. That could, in turn, result in increased latency and loss, which would trigger an SLA violation and remediating action in the system.
It could also be an indirect link where something is not in your control, and it’s happening in the MPLS network. You want to make sure that the end-to-end link quality measurement that we talked about earlier is able to detect this end-to-end condition. So when latency and loss are occurring on a path between vEdge one and vEdge two, video number one, which is unconditionally sent through the MPLS network, experiences severe performance degradation, because there is really no mechanism baked into traditional routing to take remediating action on that traffic.
Flow number two, or video number two, leveraging application-aware routing, is taking remediating action. The SLA is restored by rerouting the traffic very quickly from the MPLS to the internet, and the video quality is restored.
So let’s go into our demonstration environment. This is our vManage, our Viptela vManage tool, which is a single pane of glass for all the operational tasks that are performed on the Viptela SD-WAN fabric. Before we introduce loss and latency, let’s first start the two application streams. Let me just move one window aside so you can see. Hold on a second. I think we hadn’t cleaned up the environment after the previous demonstration. Apologies for that.
All right. These are the two video streams that you see that are going side by side, video number one and video number two that I mentioned earlier. Now if I go back to the vManage tool, what I’m going to do now is attach a template to the vEdge router in site one. What that is going to do: It’s going to force the traffic for both video streams, one and two, to go through a device that is introducing loss and latency into this path. The video stream that has traditional routing is not going to be able to take any remediating action, and it’s going to have degraded performance. The video stream that has application-aware routing configured for it is going to take remediating action, and it’s going to reroute the traffic to the alternative transport.
As you can see, the video on the left, which is video stream number one, just turned really bad in its quality. It’s been severely degraded. This is the traffic that goes over the MPLS circuit. It’s not able to take any remediating action. As I mentioned, it’s not able to reroute anything. This is traditional routing behavior.
What you see on the right is that, in fact, remediating action was taken. Application-aware routing has kicked in and rerouted the traffic from MPLS to the broadband circuit that was delivered to that individual site. If I go back to the vManage tool, and I go to monitoring, and I go to events, I can look in here, and I can see the different SLA changes that have occurred in the system. That is the signal to an operator that an SLA has been violated for an application; and then you can investigate further and see why that has happened.
But from the user experience standpoint, you really see that there is no noticeable degradation of that service for the users. As far as users are concerned, there was no impact, because application-aware routing took care of remediating that brownout condition. For the operations team, it’s a different story. They have to go and investigate why the brownout occurred. But there is no impact to the user population.
Now, what you’ve seen here is something that shows well, which is the video. But imagine the same type of capacity being expanded into thousands of applications through the deep packet inspection that is embedded into our system. So it classifies over 2500 applications as they’re going through the system and maps those to the SLA characteristics that you have defined. You can pick the applications of interest that you want to deliver an SLA to, and you can deliver those instructions in a centralized manner through the SDN controllers. Those are enforced on the entire fabric; be that a 10-site fabric, 100, 1000, or 10,000-site fabric.
This is the power of the SDN and, particularly, SD-WAN solution to roll out services like that to deliver the differentiated QoS services across an entire fabric with just centralized control through the policies.
So that concludes the demonstration. I want to make sure we address any questions that …
Lloyd: We have a few questions coming in through the Q&A window. One of the questions is, how do we implement policies based on loss and latency. Can you actually implement policy based on link loss and link latency?
David: Yes. As I mentioned, the policies, the QoS policies – and again, I sound like a broken record a little bit – delivering QoS to an application is really not a single feature that you can just turn on and say, I’m done, and QoS is all taken care of. It’s a collection. It’s a multidimensional approach to delivering QoS from a survivability perspective, from individual device QoS perspective, from an end to end path perspective, from delivering the best [unintelligible 00:32:40] cloud applications.
All of those things come together as an architectural approach to deliver the SLA to an application. Now, not all the applications in your environment require SLAs. You choose the applications of interest, that you want to deploy the SLAs for. You define the SLAs in terms of, what is the acceptable loss and latency that you want to uphold throughout your entire fabric. You deliver those instructions centrally. They get distributed across the entire fabric, and they get enforced in a distributive fashion across the entire fabric.
That gives the system a high level of resiliency and a high level of scale that you can drive through the system because of the centralized nature of policies and distributive nature of the enforcement. Loss and latency are the characteristics that you can factor in when you decide what are the acceptable SLA levels that you want for that specific application of interest.
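The idea of defining acceptable loss and latency per application, with a preferred circuit, can be illustrated with a short Python sketch. It is not the Viptela policy language; the names `AppPolicy`, `PathStats`, and `select_transport`, the transport labels, and the numbers are all assumptions made up for the example. Traffic stays on the preferred transport while it meets the SLA, moves to the best compliant alternative when it does not, and falls back to the preferred transport if nothing complies.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class AppPolicy:
    """Hypothetical per-application SLA, as it might be defined centrally."""
    max_loss_pct: float
    max_latency_ms: float
    preferred: str


@dataclass(frozen=True)
class PathStats:
    """Latest measured characteristics of one transport."""
    loss_pct: float
    latency_ms: float


def meets_sla(policy: AppPolicy, stats: PathStats) -> bool:
    return (stats.loss_pct <= policy.max_loss_pct
            and stats.latency_ms <= policy.max_latency_ms)


def select_transport(app: str,
                     policies: Dict[str, AppPolicy],
                     measurements: Dict[str, PathStats],
                     default: str = "mpls") -> str:
    """Pick a transport for `app`; apps without a policy use ordinary routing."""
    policy = policies.get(app)
    if policy is None:
        return default
    preferred = measurements.get(policy.preferred)
    if preferred is not None and meets_sla(policy, preferred):
        return policy.preferred
    # Preferred path is out of SLA: take the lowest-latency compliant alternative.
    compliant = [name for name, stats in measurements.items()
                 if name != policy.preferred and meets_sla(policy, stats)]
    if compliant:
        return min(compliant, key=lambda name: measurements[name].latency_ms)
    return policy.preferred  # nothing complies, so stay on the preferred circuit


policies = {"video": AppPolicy(max_loss_pct=2.0, max_latency_ms=100.0, preferred="mpls")}
healthy = {"mpls": PathStats(0.1, 40.0), "internet": PathStats(0.5, 60.0)}
brownout = {"mpls": PathStats(8.0, 300.0), "internet": PathStats(0.5, 60.0)}

normal_choice = select_transport("video", policies, healthy)
brownout_choice = select_transport("video", policies, brownout)
unmanaged_choice = select_transport("backup", policies, brownout)
```

In this sketch the policy table is the centrally defined part, and `select_transport` stands in for the decision each edge device would make locally from its own measurements.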
Lloyd: The next question is, can you give examples of how you implemented hub and spoke and full mesh architecture on the same WAN.
David: Yes. This is also a very important topic. We called this, earlier, an application aware topology. So what it actually means is, you have different applications that you would want to send through different paths in your network; things like unified communications such as voice and video. It’s something you would want to deliver in a full mesh fashion. You want to make sure that one location, one bank location, one store, one gas station, one construction site, is able to do voice-video, any sort of web sharing, any unified communications traffic, directly between the two sites to ensure that this is the shortest possible path to eliminate any adverse effects of increased latency and possible loss.
Because as you’re going to more of a service provider network infrastructure, the more chances for degraded service you get. So the shorter the path that you can plot between the two sites, the less likely you are to encounter QoS issues. So for this unified communications traffic, you would want the path to be direct or [unintelligible 00:35:11] between the two sites.
For other types of traffic, which are more security centered, for either compliance or security policy enforcement reasons, you would want the traffic to go through [unintelligible 00:35:23] applications. That could be your ERP traffic or HR and finance traffic, things of that nature. You want to make sure that unified communications traffic and this HR, finance, hub and spoke traffic, coexist at the same time on the network; and yet each one of those applications gets its own optimal QoS experience; while [unintelligible 00:35:50] also to the security angle, if that is required.
So then you also have the meet in between type of approach. You can build meshes, full meshes, but not on your entire fabric scale; something that you want to create on a regional scale. For example, you may want to mesh only specific portions of your network, such as bank branches in a specific metro area. You want the transactions between those offices to go on the shortest path possible, as long as they stay in the same metro area.
Once it leaves this metro area, you would want to enforce a hub and spoke topology on the traffic; for example, to make sure that anything that goes out of that area has a security enforcement point in either a regional facility or a centralized data center. So it’s not just the ability to deliver the full mesh or a hub and spoke topology. It’s the ability to deliver all of them, including regional mesh; at the same time, over the same fabric; while keeping them separated.
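As a rough illustration of a regional mesh combined with hub and spoke, the Python sketch below builds the set of tunnels and the path between two sites. It is a toy model, not the Viptela control policy: the site names, the region labels, the `dc` hub, and the function names are all assumptions made up for the example. Sites in the same region get direct tunnels; anything that leaves the region goes through the hub, where a security enforcement point could sit.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def build_tunnels(site_regions: Dict[str, str], hub: str) -> Set[FrozenSet[str]]:
    """Return the tunnels for a regional mesh plus hub-and-spoke (illustrative)."""
    tunnels: Set[FrozenSet[str]] = set()
    # Full mesh only between sites that share a region.
    for a, b in combinations(sorted(site_regions), 2):
        if site_regions[a] == site_regions[b]:
            tunnels.add(frozenset((a, b)))
    # Every site also keeps a tunnel to the hub.
    for site in site_regions:
        tunnels.add(frozenset((site, hub)))
    return tunnels


def path(src: str, dst: str, site_regions: Dict[str, str], hub: str) -> List[str]:
    """Direct path inside a region; via the hub when traffic leaves the region."""
    if site_regions[src] == site_regions[dst]:
        return [src, dst]
    return [src, hub, dst]


# Hypothetical branches: two in one metro area, one in another.
sites = {"sf-1": "bay-area", "sf-2": "bay-area", "nyc-1": "new-york"}
tunnels = build_tunnels(sites, hub="dc")
local_path = path("sf-1", "sf-2", sites, hub="dc")
remote_path = path("sf-1", "nyc-1", sites, hub="dc")
```

Running the same functions with a different region map per application is one way to picture several topologies coexisting over the same set of sites.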
So this is exactly the type of thing I mentioned earlier: making an architectural approach to this, and not just a feature by feature approach where an individual feature is not able to solve an entire challenge of delivering QoS.
Lloyd: Thank you, David. So the next question is, do we interoperate with WAN optimization devices.
David: Right. It’s a question that comes up very frequently. We do. There are several deployment models that we have that are in production right now with our customers that are using leading WAN optimization solutions. There are two predominant approaches. The first is an inline deployment, which is something that is actually preferred by the WAN optimization vendors – it simplifies the deployment of their appliances. So an inline deployment is something that is easy to operationalize from the WAN optimization standpoint; and also the operational practices that exist today many times follow this approach. So there’s no dramatic change, and sometimes no change at all, to how it’s being delivered. So this is the inline approach.
The other approach, which is more innovative, is to use a service insertion framework that the Viptela solution offers, to move away from a WAN optimization appliance per site and position those at regional locations where you can take the traffic of interest – again, everything is done in the context of an application of interest. You can take the application of interest that you would like to optimize and leverage the service insertion policies to forward that traffic to a regional facility or regional hub where the WAN optimization appliance is provisioned. So it’s not an inline deployment but rather a deployment where you leverage service insertion traffic steering to send this traffic of interest to the WAN optimization appliance.
Both approaches have been successfully validated and have been successfully deployed in the field. So it is definitely possible and being done.
Lloyd: Thank you, David. So we have a few more questions, but we are a little over our time. So we will respond to you all individually on your questions. We’ll also have the recording available to you. We’ll send you a link within a few days. So thank you guys for joining. We’re going to do this at least once a month or twice a month. We’re going to cover different topics. Most of the topics, including this one, are driven by what our customers are doing. As you know, we have 25 Fortune 500 customers right now, and about 1200 production sites. So a lot of these examples come from those production deployments, and we’ll be doing a webinar regularly. So stay tuned for more updates on this. With that, thank you very much.
David: Yes, thank you very much for listening, attending.
Lloyd: Great. Thank you.
David: All right. Have a great day.