Managing Network Routers

For a good one hundred years or so, postal mail and phone services in Australia were provided by the Postmaster-General (PMG). The name covered both a person, who reported to the federal government, and the nationwide service itself.

The PMG was entirely modelled on the British version, even down to the arrangement of the keys on dial and touch phone pads.

The name was later simplified to Australia Post (AP), and at some point phone services were split off into Telecom, later Telstra.

Telstra was floated on the stock exchange a few years ago, as a combined wholesale-retail entity. These 2 components are probably going to be split next year, much to the delight of the competition.

Customers, though, will probably not see any meaningful change in service quality.

The most unhappy customers have long called themselves Cot (Casualties of Telecom) cases [1].

Professional Service

I was born in Tasmania, but from the age of 5 grew up in Mooroopna [2], in the heart of Victoria.

We were very poor, but eventually (in the early 60s) could afford a phone. It was one of those square, bakelite ones, with a couple of small cylindrical batteries nearby, behind the sitting (living) room door.

We were allocated number 28, which tells you just how small Mooroopna was.

Mooroopna is a (local) Aboriginal word meaning deep water hole, since the town is on the Goulburn River.

A week later a PMG technician called, and wanted to know why we never answered the phone.

My mother told him: Because it doesn't ring!

Yes it does. No it doesn't. Yes it does...

Uh, ok, I'll have a look at it.

He tipped it upside down, and unscrewed the plate which covered the lower surface.

Ummmm, we forgot to connect the bell.

My, how things have changed - Not!

Telstra

Over a long period, I've worked about 8 years full-time, in various stints, and about 2 years part-time, at Telstra, all at their HQ in Melbourne.

Here, I wish to talk about the last 3 years' effort.

Nationwide

One aspect of a company coming from a government monopoly is its automatic presence everywhere, since - in Australia - distances between cities and towns are measured from post office to post office. So, without a PO you're not even a one-horse town.

Anyway, the result in internet terms is that Telstra now has to manage about 15,000 network routers, with growth in some types of devices at 60% per annum (as of 2009-10-20).

Luckily, the Global Financial Crisis means things are slow right now. Yeah, right.

BigPond

I should mention Telstra operates an ISP called BigPond.

A local wit, who was quick off the mark, registered the email address small.fish@bigpond.com.

Groups of Routers

These routers fall into various broad categories:

Generic support - Core devices
Support for cable access to BigPond
Support for ADSL access to any ISP/phone service
Various other blocks of routers

ADSL - DSLAM

Asymmetric Digital Subscriber Line (ADSL) access is from houses and businesses into the Telstra-operated phone/communication network.

An ADSL [3] modem connects to a Digital Subscriber Line Access Multiplexer (DSLAM) [4].

So Telstra operates a bank of such devices servicing all users coming in via ADSL.

The way in which these are wired together is called topology.

Similar setups handle the other groups of routers mentioned above.

Traffic

The work I have been doing has nothing to do with the actual traffic flowing across the network, and everything to do with managing the devices themselves.

Devices

A small network router is like the 4-port one you could well have at home, costing say $50.

A large one costs (fully-configured) around $1,000,000, and is the size of 2 fat briefcases.

They are purchased from a small range of manufacturers, but with a much larger range of models, each model having different components and throughput capabilities.

They have names like:

Cisco 2700 [5]
Juniper MX960 [6]
Netscreen 5400 [7]
Tellabs 8840 [8]

These network routers can have a range of port types, e.g. ATM, FastEthernet and GigabitEthernet.

The topology, then, could specify that a certain port on one device is connected to a matching port on another device.

The topology (held in a manually-updated text file) will also include the street address of the (upstream) device, and many other attributes.
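
To make the idea of a topology record concrete, here is a minimal sketch. The comma-separated layout, the field names, the sample devices and the functions parse_link(), port_type() and ports_match() are all assumptions made up for this example; the real file's format is not reproduced in this article.

```perl
use strict;
use warnings;

# Hypothetical line format (not the real file layout):
#   device, port, upstream_device, upstream_port, upstream_address
sub parse_link
{
	my($line)  = @_;
	# Limit of 5 so that commas inside the street address are preserved.
	my(@field) = split /\s*,\s*/, $line, 5;

	die "Expected 5 fields: $line\n" if (@field != 5);

	my(%link);

	@link{qw/device port upstream_device upstream_port upstream_address/} = @field;

	return \%link;
}

# Assumes a port name such as GigabitEthernet0/1 starts with its type.
sub port_type
{
	my($port) = @_;

	return $port =~ /^([A-Za-z]+)/ ? $1 : '';
}

# True if both ends of the link use the same port type.
sub ports_match
{
	my($link) = @_;

	return port_type($link->{port}) eq port_type($link->{upstream_port});
}

my($link) = parse_link('dslam-01, GigabitEthernet0/1, core-07, GigabitEthernet2/3, 1 Example St, Melbourne');

print $link->{device}, ' -> ', $link->{upstream_device}, ': ', (ports_match($link) ? 'ports match' : 'port mismatch'), "\n";
```

Putting the address last means a plain split with a field limit is enough, which suits a file that people edit by hand.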

Device Self-monitoring

The thing directly related to my work is how these devices monitor their own activity, and accumulate statistics on traffic volume, error rates, CPU activity, disk activity, temperature, and so on.

The devices are polled regularly (4-hourly, daily), by software running on computers called pollers. These machines are all SunFires in this network, and run a package called eHealth to download (from the devices) the statistics into Oracle databases.

Demand

The demand for services obviously comes from people and businesses, but also involves the manufacturers pushing new devices on to the market.

The latter are tested until they pass what's called certification, and then go into service.

But because demand doesn't just grow, but also fluctuates, the devices are ceaselessly pulled off-line and reconfigured.

This reconfiguration can change the number and type of elements within the device, and it's these elements which determine the device's capabilities.

Fluctuations are due to things like office and apartment blocks opening up, which results in a sudden and localized increase in demand, or an old building being pulled down, with a similar localized drop in demand.

This means a central database (or several) needs to keep track of what devices are installed, whereas the polling process just mentioned returns a detailed list of exactly what each device's current configuration consists of.
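
Comparing what the central database expects with what polling discovers amounts to a set difference per device. The following is a minimal sketch of that comparison, assuming both sides can be reduced to a hash mapping a device name to a list of element names. The device names, element names and the compare_config() function are invented for illustration; none of this is Telstra's schema or part of eHealth.

```perl
use strict;
use warnings;

# Both arguments are hypothetical: device name => array ref of element names.
sub compare_config
{
	my($expected, $discovered) = @_;
	my(%report);

	for my $device (sort keys %$expected)
	{
		my(%wanted)  = map { $_ => 1 } @{ $expected->{$device} };
		my(%seen)    = map { $_ => 1 } @{ $discovered->{$device} || [] };
		my(@missing) = grep { ! $seen{$_} }   sort keys %wanted;
		my(@extra)   = grep { ! $wanted{$_} } sort keys %seen;

		# Only devices which differ appear in the report.
		$report{$device} = {missing => [@missing], extra => [@extra]} if (@missing || @extra);
	}

	return \%report;
}

my($report) = compare_config
(
	{'dslam-01' => ['atm-0/1', 'ge-0/2'], 'dslam-02' => ['ge-0/1']},
	{'dslam-01' => ['atm-0/1', 'ge-0/3'], 'dslam-02' => ['ge-0/1']},
);

for my $device (sort keys %$report)
{
	print $device, ': missing [', join(', ', @{ $report->{$device}{missing} }), '], extra [', join(', ', @{ $report->{$device}{extra} }), "]\n";
}
```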

Manual Labour

eHealth comes with a gui, which is used to interrogate the database on a poller, and as you know, guis are a type of manual labour.

eHealth also comes with a large set of command line scripts.

In fact, while working on this project, I was told never to write directly to the database, but only ever call these command line scripts to update data.

Of course, for reading out data, I hit the database all the time.

Those scripts are almost all shell scripts, which pass their parameters to compiled programs or Perl programs.

AutoDiscovery

Before me there were 2 programmers, one for a long time and one for a very short time.

The former wrote the first Perl script in the category of autodiscovery.

What this means is:

Process the Device List
Basically, loop over all devices, and group them by type within manufacturer.
Farm out Discovery Work
Use load-balancing across 3 pollers to drive command line scripts to download from all devices their current configuration.

This load-balancing depends on the work load of the machines, and not on the business rules which govern other phases in this whole process.
Load Discovery Results into Databases
Use another type of load-balancing to load all discovery results into the databases on the 3 pollers.

This load-balancing used just one business rule.
Combine Devices into Groups
Use more business rules to determine group membership, based on location of the device within the network, various categories into which devices are classified, and lastly device model.
Schedule Reports per Group
For those groups (i.e. devices) being leased to specific customers, statistical reports are scheduled which, when run, produce PDFs which are in turn emailed out to the customers.

This allows customers to compare what they've contracted for (i.e. are paying for) with what throughput they're getting.

Ad hoc reports are generated frequently by Telstra staff to track customer complaints.

For example, when the customer sees throughput fall, a Telstra technician will run reports until they can determine which device seems to be failing, and then generate a report on the CPUs etc of that device.

This might show that a CPU is using 100% of its capacity, i.e. it's locked up, which means just 1 CPU needs rebooting.
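
The first two phases listed above (grouping devices, then farming the discovery work out across pollers) can be sketched in a few lines. This is only an illustration under stated assumptions: the device records, the poller names p1 to p3, the 'cost' field standing in for machine work load, and the functions group_devices() and assign_to_pollers() are all hypothetical, and the original script's actual load-balancing method is not described here.

```perl
use strict;
use warnings;

# Phase 1: group device names by type within manufacturer.
sub group_devices
{
	my(@device) = @_;
	my(%group);

	push @{ $group{ $_->{manufacturer} }{ $_->{type} } }, $_->{name} for @device;

	return \%group;
}

# Phase 2: hand each device to whichever poller has the least work so far.
# 'cost' is an invented measure of how long a device takes to discover.
sub assign_to_pollers
{
	my($poller, @device) = @_;
	my(%load) = map { $_ => 0 } @$poller;
	my(%job);

	for my $device (@device)
	{
		# Least-loaded poller first; ties are broken by poller name.
		my($target) = sort { $load{$a} <=> $load{$b} || $a cmp $b } keys %load;

		push @{ $job{$target} }, $device->{name};

		$load{$target} += $device->{cost};
	}

	return \%job;
}

my(@inventory) =
(
	{name => 'r1', manufacturer => 'Cisco',   type => 'core',  cost => 3},
	{name => 'r2', manufacturer => 'Juniper', type => 'core',  cost => 1},
	{name => 'r3', manufacturer => 'Cisco',   type => 'dslam', cost => 1},
	{name => 'r4', manufacturer => 'Cisco',   type => 'core',  cost => 2},
);
my($group) = group_devices(@inventory);
my($job)   = assign_to_pollers([qw/p1 p2 p3/], @inventory);

print 'Cisco core devices: ', join(', ', @{ $group->{Cisco}{core} }), "\n";
print $_, ': ', join(', ', @{ $job->{$_} }), "\n" for sort keys %$job;
```

A greedy least-loaded assignment is the simplest scheme that reacts to work load rather than to business rules, which is why it was chosen for this sketch.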

Device-level and Element-level Grouping Rules

All work to this point used grouping rules which operated at the level of the devices themselves.

My first job was to design software which used element-level rules to group elements within the device, rather than the devices, on the DSLAM side of things.

This meant that a few elements from one device could be combined with a few elements from another device, to give a precise view of a specific customer's usage, when they were leasing just those elements from various devices.
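
Grouping at the element level, so that one group draws elements from several devices, might look something like the sketch below. The rule structure (a group name plus one regular expression each for the device and the element), the sample names and the group_elements() function are assumptions for illustration only; the real rules were richer than this.

```perl
use strict;
use warnings;

# Hypothetical rules: an element joins a group when both patterns match.
my(@rule) =
(
	{group => 'customer-a', device => qr/^dslam-0[12]$/, element => qr/^atm-/},
	{group => 'customer-b', device => qr/^dslam-02$/,    element => qr/^ge-/},
);

# Hypothetical discovery results: one record per element.
my(@element) =
(
	{device => 'dslam-01', element => 'atm-0/1'},
	{device => 'dslam-01', element => 'ge-0/2'},
	{device => 'dslam-02', element => 'atm-1/1'},
	{device => 'dslam-02', element => 'ge-0/1'},
);

sub group_elements
{
	my($rule, $element) = @_;
	my(%group);

	for my $item (@$element)
	{
		for my $r (@$rule)
		{
			next if ($item->{device}  !~ $r->{device});
			next if ($item->{element} !~ $r->{element});

			push @{ $group{ $r->{group} } }, join(':', $item->{device}, $item->{element});
		}
	}

	return \%group;
}

my($group) = group_elements(\@rule, \@element);

print $_, ': ', join(', ', @{ $group->{$_} }), "\n" for sort keys %$group;
```

Here customer-a ends up with one ATM element from each of two devices, which is exactly the kind of view that device-level grouping cannot give.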

This new code has been running without patches for about two and a half years now.

One problem Telstra did have is that they went overboard in specifying rules and groups, so that the system ran too slowly. To counter that we devised ways of running the programs on subsets of the rules.

Various Small Systems

Banks of network routers are all over the place, so progressively I adapted the above code to replace the manual grouping previously used.

For instance, the devices managing access to BigPond via cable were one of these systems.

Device Emulation

One remote system holds details of devices actually plugged in by technicians in the field, and that system feeds transactions daily in 3 separate streams to my code, as text files.

Unfortunately, there is no real data validation performed on the values entered by the technicians, which has led to much grief downstream.

Further, after months of mystifying data errors, I found out a Russian staff member was not actually giving me these exports from the remote system. Instead, he had written a program which extracted the data from that system and then massaged it. He eventually admitted patching his code up to 5 times a day.

No wonder I was bamboozled.

I used the same business rules as other code to pretend these text files have been derived from devices, and perform discovery of the devices in some cases, and do grouping and report scheduling in all cases.

One of these programs runs for 19 hours, but as yet we haven't determined exactly why it's so slow.

One Large System

Most recently, I've been doing a great deal of maintenance on the original programmer's code, which has to handle over 11,000 devices per run.

His code is in the classic pre-OO style of:

	my($global_1, $global_2);	# Some global variables.

	# Some initialization code.

	main();

	sub main
	{
		# Everything else happens in here.
	}

You get the idea :-(.

All my code is OO, and I've just had to graft it into his, while slowly eliminating the global variables wherever it was practicable.

This program takes up to 26 hours to run, so we're definitely not talking gui-style here.

Summary

It's been a marvellous job. I've had to design code I've never worked on before. This included inventing a new language to describe the grouping rules, and writing a parser for it, and lastly executing the rules to give effect to the (human) intention encoded in those rules.
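
For readers wondering what "a language, a parser and an executor" means on a small scale, here is a toy version. The syntax shown ('group NAME when FIELD ~ REGEX and ...') is invented for this sketch and is not the language used at Telstra; parse_rule() and rule_matches() are likewise hypothetical names.

```perl
use strict;
use warnings;

# Hypothetical syntax, one rule per line:
#   group <name> when <field> ~ <regex> [and <field> ~ <regex>]...
sub parse_rule
{
	my($line) = @_;

	my($name, $body) = $line =~ /^group\s+(\S+)\s+when\s+(.+)$/
		or die "Cannot parse rule: $line\n";

	my(@test);

	for my $clause (split /\s+and\s+/, $body)
	{
		my($field, $regex) = $clause =~ /^(\w+)\s*~\s*(\S+)$/
			or die "Cannot parse clause: $clause\n";

		push @test, [$field, qr/$regex/];
	}

	return {group => $name, test => [@test]};
}

# Executing a rule: every clause must match the item's corresponding field.
sub rule_matches
{
	my($rule, $item) = @_;

	for my $test (@{ $rule->{test} })
	{
		my($field, $regex) = @$test;

		return 0 if (! defined $item->{$field} || $item->{$field} !~ $regex);
	}

	return 1;
}

my($rule) = parse_rule('group customer-a when device ~ ^dslam-0[12]$ and element ~ ^atm-');

for my $item ({device => 'dslam-01', element => 'atm-0/1'}, {device => 'dslam-03', element => 'atm-0/1'})
{
	print $item->{device}, ' ', $item->{element}, ': ', (rule_matches($rule, $item) ? "in $rule->{group}" : 'no match'), "\n";
}
```

Keeping the parsed form as plain data (a group name plus a list of field/pattern pairs) separates the parsing step from the execution step, so rules can be checked once and then applied to many devices.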

I ended up writing about 105 command line scripts, a few CGI scripts (search engines) and around 17,000 lines of OO code.

For the first year it was full time, and for the last 2 years it's been part-time, and now it's just about wound down to nothing.

Early next year, perhaps, it'll all be replaced by a massive system, called InfoVista [9], written in (shudder) Java, from Alcatel.

About a year ago the prototype of InfoVista was operating at a failure rate of 100%, so who knows what will happen.

The current system runs on about 9 pollers, whereas the last I heard InfoVista needed 70 (sic) machines, so you'll understand my amazement.

Of course there isn't enough electricity near Telstra's HQ (in the center of Melbourne) for this load, so I expect it'll all be housed at Alcatel's HQ in the outer eastern suburbs.

Naturally such huge decisions are not technically based, so it's no real reflection on eHealth that it's going to be wiped out, at least within Telstra.

That doesn't mean Telstra were ecstatic about eHealth, but it did the job, and did it very well to my way of thinking.

References

[1] http://www.smh.com.au/articles/2002/11/06/1036308355108.html

[2] http://en.wikipedia.org/wiki/Mooroopna,_Victoria (yep, there's a comma in there)

[3] http://en.wikipedia.org/wiki/Adsl

[4] http://en.wikipedia.org/wiki/Dslam

[5] http://www.cisco.com/en/US/products/ps6398/

[6] http://www.juniper.net/us/en/products-services/routing/mx-series/

[7] http://www.juniper.net/us/en/products-services/security/netscreen/

[8] http://www.tellabs.com/products/8000/tellabs8840.shtml

[9] http://www.infovista.com

Author

Perl@Work#3 was written by Ron Savage in 2009.

Home page: http://savage.net.au/index.html

Copyright

	All Programs of mine are 'OSI Certified Open Source Software';
	you can redistribute them and/or modify them under the terms of
	The Artistic License, a copy of which is available at:
	http://www.opensource.org/licenses/index.html

Date

Written 2009-10-22.
