People have been speaking of self-healing networks for years, and we can all retire if they ever reach the real world. In the meantime, network administrators are still hunting for tools that tell them what is going on in the network so they can solve problems themselves. One old standby that is often underutilized is NetFlow.
Cisco routers and switches have been reporting NetFlow data for over a decade. With the recent Internet Engineering Task Force (IETF) adoption of the IPFIX (IP Flow Information eXport) protocol, which is based on NetFlow v.9, even more vendors are offering flow reporting on their network equipment.
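NetFlow groups the packets passing through the ports on a network device into flows based on a set of five to seven attributes, typically source and destination IP address, source and destination port, Layer 3 protocol type, type of service (ToS) and input interface.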
That data can be used for much more than just locating bottlenecks. So, let's take a look at how network admins can use NetFlow data to solve problems on their networks.
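As a rough illustration of how packets are grouped into flows by their attributes, here is a minimal Python sketch. It uses the seven fields commonly cited as the NetFlow flow key; the packet records, field names and addresses are hypothetical and do not represent the output of any particular router or collector.

```python
from collections import defaultdict
from typing import NamedTuple


class FlowKey(NamedTuple):
    """The seven attributes commonly cited as the NetFlow flow key."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int   # IP protocol number (6 = TCP, 17 = UDP)
    tos: int        # type of service byte
    input_if: int   # index of the interface the packet arrived on


def aggregate_flows(packets):
    """Group packet records that share a key and total their packets and bytes."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for pkt in packets:
        key = FlowKey(
            pkt["src_ip"], pkt["dst_ip"], pkt["src_port"], pkt["dst_port"],
            pkt["protocol"], pkt["tos"], pkt["input_if"],
        )
        flows[key]["packets"] += 1
        flows[key]["bytes"] += pkt["length"]
    return dict(flows)


# Hypothetical sample: two packets of one web session plus one DNS query.
sample_packets = [
    {"src_ip": "10.0.0.5", "dst_ip": "10.0.1.9", "src_port": 40000,
     "dst_port": 80, "protocol": 6, "tos": 0, "input_if": 1, "length": 1500},
    {"src_ip": "10.0.0.5", "dst_ip": "10.0.1.9", "src_port": 40000,
     "dst_port": 80, "protocol": 6, "tos": 0, "input_if": 1, "length": 500},
    {"src_ip": "10.0.0.5", "dst_ip": "10.0.1.2", "src_port": 40001,
     "dst_port": 53, "protocol": 17, "tos": 0, "input_if": 1, "length": 80},
]

for key, totals in aggregate_flows(sample_packets).items():
    print(key.src_ip, "->", key.dst_ip, "port", key.dst_port, totals)
```

The two web packets collapse into a single flow record, which is why flow data is so much lighter to store and search than a full packet capture.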
VoIP QoS
Rolling out VoIP adds traffic to the network, but more importantly it requires constant monitoring of Quality of Service (QoS). NetFlow comes in handy there in two ways. First, it shows which applications are running on a link, so administrators can make sure the link doesn't get overloaded. Second, because type of service is one of the attributes NetFlow reports, it lets them examine QoS directly.
Insurance company American International Group (AIG) started using VoIP and is using NetFlow to ensure that QoS requirements are met.
"You have to keep voice at the top of your priority queues," says network management engineer Jose L. Alvarez.
AIG has an Avaya 8720 PBX at its headquarters in Wilmington, Del., with a backup at a disaster recovery site in Alpharetta, Ga. Both PBXes connect to the headquarters, the company's four call centers and the print shop. While Alvarez could see the traffic load on the E-3 (34 Mbps) connections between the sites, he couldn't see what was causing high utilization. Since he had Cisco 3800 Series routers and 6500 Series switches, he decided to start using NetFlow to access the hidden data.
He looked at several products including Scrutinizer from Plixer, Inc. of Sanford, Me. and software from NetIQ Corporation of Houston, Tex.
"When you get down to basics, both Scrutinizer and NetIQ will give you what you want: a way to identify your traffic," he says. "So, was it worth an extra $250,000 to get NetIQ's bells and whistles? The answer was 'no,' I couldn't justify that price."
He installed Scrutinizer and its MySQL database on a Fujitsu blade server. One of the first things he discovered was that his Avaya PBXes were using the H.323 protocol to send control characters, rather than UDP as expected. He then set up Scrutinizer to identify all the expected protocols on the network, as well as the parameters for key enterprise applications.
"Once you have the application identified, it will tell you all the traffic that fits within those parameters," he says. "You are able to break out the 20 percent of my traffic that is voice and can see the most common protocols such as HTTP or Telnet. We can also identify our policy systems and claims systems in Scrutinizer, separate from other types of traffic."
It is also easy to spot when some of the Avaya phones wind up on the wrong VLANs, since the NetFlow data shows VoIP running on the wrong port. When Plixer updated the software to include QoS reporting, he started using Scrutinizer to verify that phone traffic was receiving priority service.
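A check of this kind can be sketched in a few lines of Python. The sketch assumes voice is marked with DSCP 46 (Expedited Forwarding), a common convention but not something stated about AIG's network, and that the addresses of legitimate voice endpoints are known. The flow records and addresses are hypothetical, and this is not how Scrutinizer itself works.

```python
EF_DSCP = 46  # Expedited Forwarding; assumed here to be the voice marking


def dscp_from_tos(tos_byte):
    """The DSCP value is the upper six bits of the ToS byte."""
    return tos_byte >> 2


def unexpected_in_voice_class(flows, voice_sources):
    """Return flows marked EF whose source is not a known voice endpoint."""
    return [
        flow for flow in flows
        if dscp_from_tos(flow["tos"]) == EF_DSCP
        and flow["src_ip"] not in voice_sources
    ]


# Hypothetical flow records: ToS byte 184 (0xB8) carries DSCP 46.
sample_flows = [
    {"src_ip": "10.1.1.20", "dst_ip": "10.1.9.5", "tos": 184, "bytes": 64000},
    {"src_ip": "10.2.2.30", "dst_ip": "203.0.113.8", "tos": 184, "bytes": 900000},
    {"src_ip": "10.2.2.31", "dst_ip": "203.0.113.8", "tos": 0, "bytes": 500000},
]
known_phones = {"10.1.1.20"}

for flow in unexpected_in_voice_class(sample_flows, known_phones):
    print("Non-voice source in the voice class:", flow["src_ip"])
```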
"Now I can identify the traffic within my priority queues so I can see that my QoS is working," he says. "I can tell that no user going to YouTube is sneaking into my voice queues and I know my voice traffic is going to get priority over everything else."
And, just as voice is getting top billing on the network, NetFlow has taken over as the top tool for resolving network slowdowns.
"The minute we get a ticket saying a particular site or application is down, Scrutinizer is the first place we go," says Alvarez. "We have other monitoring tools, but with Scrutinizer we can see in an instant what kind of traffic is going through."
Capacity Planning
SNMP provides overall bandwidth statistics that can be used to predict when a link will run out of room and need a larger pipe. But just adding more bandwidth is a lazy (and costly) way to go. SNMP tells you the traffic is there, but not what it consists of or whether it is vital to the business. Adding more bandwidth so more employees can download videos is not a good use of IT's budget or payroll.
Derek Fink, System Vice President of Networks and Communications for Education Management Corporation (EDMC) in Pittsburgh, Pa., manages a WAN connecting 72 colleges in 24 states and two Canadian provinces. Each college has dual T-1 connections. The primary is a Multiprotocol Label Switching (MPLS) line and the secondary is either a VPN connection or an MPLS connection.
"There is a list of known critical applications, a few hundred of them, that go across that primary MPLS to our data center," says Fink. "Anything not listed on that is an Internet application and goes across that secondary link to the Internet collocation facility. This allows us to protect the performance of things we know are important vs. Internet traffic."
The school uses Cisco routers exclusively, and most of its switches are routers as well, making it easy to use NetFlow.
"We are able to use NetFlow data to see in realtime what is causing performance problems," he says. "Based on that, we are able to develop the application lists, direct traffic and do capacity planning on the primary link."
He uses two appliances from NetQoS, Inc. of Austin, Texas to collect and analyze the NetFlow data. Before adding more bandwidth, he first looks at the NetFlow data to see which applications are using the link and causing the enterprise applications to slow down.
"It is almost always someone sending an email attachment to everyone in the school," Fink says. "Then we can decide whether to implement more bandwidth or change the filtering policy."
But what about when the traffic is valid? Then an administrator can look at moving it to off-peak hours. Candidates include software updates (Patch Tuesday or antivirus signatures) and massive file transfers between servers. Rescheduling these can be a lot cheaper than building up the network to accommodate them.
Fink also uses NetFlow data before rolling out a new application on the production network. By looking at what is running on a port and calculating the per-user demand, he is able to see what impact switching one application for another, or adding a new one, would have on the network. Sometimes this shows that he does have to beef up a link to accommodate the new application, but it is nice to know this ahead of time rather than dealing with the flood of service calls after the rollout.
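The arithmetic behind a per-user estimate is simple enough to sketch in Python. All of the figures below (pilot traffic, user counts, existing load) are hypothetical and are not EDMC's numbers; only the T-1 line rate of 1.544 Mbps is a fixed value.

```python
T1_BPS = 1_544_000  # line rate of a T-1 link in bits per second


def per_user_bps(total_bytes, interval_seconds, active_users):
    """Average bits per second each user generated during the interval."""
    return total_bytes * 8 / interval_seconds / active_users


def projected_utilization(current_bps, app_bps_per_user, users, link_bps=T1_BPS):
    """Fraction of the link in use once the new application is rolled out."""
    return (current_bps + app_bps_per_user * users) / link_bps


# Hypothetical pilot: 10 users moved 45 MB of application traffic in one hour.
demand = per_user_bps(total_bytes=45_000_000, interval_seconds=3600, active_users=10)

# Hypothetical site: 900 kbps of existing peak load, rollout to 50 users.
utilization = projected_utilization(current_bps=900_000,
                                    app_bps_per_user=demand, users=50)

print(f"Per-user demand: {demand:.0f} bps")
print(f"Projected peak utilization: {utilization:.0%}")
```

A projection close to or above 100 percent suggests the link needs an upgrade before the rollout; averages hide bursts, so a real plan would leave headroom below that.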
Replacing Packet Sniffers
Packet sniffers are a good technology for troubleshooting buggy applications, but they are expensive and difficult to use on a routine basis. Dave Edgecomb, Manager of Global Technical Operations for watch manufacturer Timex, Inc., has been using the Observer suite from Minneapolis-based Network Instruments, LLC.
"It is very cumbersome to use, very administrator heavy," he says. "So we started playing around with using NetFlow natively off our Cisco routers to capture network bandwidth, top talkers, etc."
NetFlow was much simpler to use and provided the data he needed to find the cause of most network slowdowns. He gives the example of installing a new router in France and trying to configure it over the network.
"We couldn't even get to the router with a command prompt because the network was so slow," Edgecomb says.
Rather than giving up, Edgecomb told the person setting up the router to activate NetFlow on it. Within seconds he was able to discover the top talker.
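Finding a top talker amounts to totaling bytes per source address and sorting. Here is a minimal Python sketch of that idea, using hypothetical flow records and documentation-range addresses; it is not the report built into any router or collector.

```python
from collections import Counter


def top_talkers(flows, n=5):
    """Total the bytes sent by each source address and return the n largest."""
    bytes_by_source = Counter()
    for flow in flows:
        bytes_by_source[flow["src_ip"]] += flow["bytes"]
    return bytes_by_source.most_common(n)


# Hypothetical flow records exported over a short interval.
sample_flows = [
    {"src_ip": "192.0.2.10", "dst_ip": "198.51.100.7", "bytes": 9_000_000},
    {"src_ip": "192.0.2.10", "dst_ip": "198.51.100.8", "bytes": 6_000_000},
    {"src_ip": "192.0.2.22", "dst_ip": "198.51.100.7", "bytes": 400_000},
    {"src_ip": "192.0.2.31", "dst_ip": "198.51.100.9", "bytes": 150_000},
]

for address, total_bytes in top_talkers(sample_flows, n=3):
    print(address, total_bytes)
```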
"One IP address was flooding the network with traffic that wasn't business related," he says. He had the network provider block that IP address and "the traffic went to nothing so we were able to finish configuring the router."
That helped sell him on using NetFlow as a regular part of his tool set, and he now routinely uses it to keep his global network flowing. One unique challenge arose when the company was moving its data center from headquarters to a collocation facility in Singapore. While that would move it closer to the company's Asian factories and sales offices in the Eastern Hemisphere, the company didn't want the move to slow down service to the headquarters staff, who were used to having the servers in the basement.
Edgecomb was testing packet shapers from Cisco, Packeteer, Orbital Data (since bought by Citrix) and Riverbed. Riverbed seemed to be the best, but he couldn't see what was happening to the packets inside the tunnel on the WAN to verify that the appliance was doing what it should. So he decided to give NetFlow a try. By getting the NetFlow data from both the router and the Riverbed appliance, he was able to compare the non-optimized traffic on the LAN with the optimized traffic on the WAN. That gave him the information he needed.
"I don't really want to see the traffic as if it is not optimized on my WAN, because that is giving me a false reading," he says. "All the numbers that feed back through NetFlow are real numbers I can work with, and it definitely helps me manage the network far better."
The company went ahead and bought Riverbed appliances, and the data center move went through without the headquarters users experiencing any drop in service. Edgecomb says that NetFlow, used in conjunction with Scrutinizer, has now become his favorite tool for monitoring the network.
"You look at this dashboard - Riverbed traffic, network traffic, tunnel traffic; this is email; this is Oracle," he says. "Oh, who is this - port 80 - someone listening to music or pulling down a movie."
One thing he was surprised to find was that DNS lookups and Active Directory (AD) transfers were showing up as top talkers over the WAN. Since the environment doesn't change that much, this pointed to an error in the way the AD server caches were set up. Correcting that cut down on the network traffic.
He also uses the NetFlow data to keep his network providers in line. If the network is slow, but NetFlow doesn't show that he is using up the bandwidth, then it is over to the service providers to dig deeper.
"I just need to say I am getting 500 ms traffic and it is the provider's problem to dig down into the packets," Edgecomb says. "If I have to go that deep, then I'm not getting my money's worth out of my vendor."
That doesn't mean that he has completely dropped the use of packet sniffing. He does still use the Network Instruments software to dig down deeper when needed. But NetFlow analysis can generally provide the answer much faster. The AD problem, for example, could have been detected years earlier with Observer if anyone had dug into the data.
"I had Observer suite configured with mirrored ports for four years, and it's not that the data wasn't there, but no one was able to look at it and pull it out," he says. "It's like Beta vs. VHS; one may be better than the other, but the one that people actually use is more valuable."
Joe Zwers is an IT-specific freelance writer.