Adventures in Infrastructure Testing

Infrastructure Testing?

As a test manager on a waterfall project, there’s nothing more disheartening than finding out on day one of your scheduled testing that there are major problems. Your developers deployed the software, and started finding major problems right away.

At this point suspicion firmly turns to the environment in use.  Its infrastructure’s been set up for weeks but has it been built correctly?  Now everyone has to take a major step back and untangle the mess …

It is from such horror stories (which were all too common 10 years ago) that environment provisioning or infrastructure testing has grown.  My organisation provides and manages not just software, but all kinds of infrastructure on a massive scale within New Zealand.  So it’s important that we take leadership in this space to deliver the best, most reliable service we can.

For us, this testing is about confirming that the infrastructure components required for an environment are present and behaving as expected.  It’s about “testing early” as much as possible, putting each component through its paces as it’s delivered.

And as we’ll talk about later, you don’t need to have your final application ready to do this – although as we’ll discover it does allow you to test so much more.

A typical infrastructure

Let’s start by thinking about the standard attributes of any environment, whether used for testing or production.  Yours may vary, but it’s a good common starting point …

Dual sites

Most modern infrastructure is designed around the idea of robustness and continuity of service, typically called “high availability”.  Typically this means you need to have two sites which mirror each other, each of which can operate completely independently of the other.

The rule of thumb when I worked at EDS was that these sites should be a minimum of 30 miles from each other, so a calamity at one shouldn’t impact the other.

In New Zealand we tend to host one site in Wellington and one in Auckland, our two major cities, which are at opposite ends of the North Island – and I’ll use this naming of Wellington vs Auckland throughout this article.

Having two sites so far away allows continuity if Wellington is hit by a major earthquake or Auckland by an erupting volcano.  However much more likely is the scenario of “workmen accidentally cutting the power lines/cables to one of the data centres”, which I’ve heard of happening.

 

Server stack

A standard infrastructure stack contains the following layers,

Web layer – this handles the HTTP for web pages, creating pages for customers using the system through a browser, and handling any requests from the user’s web page when they commit an action.  It passes any request from the browser to the application layer below.

App layer – the application layer is where most of the business logic is coded up.  If you have a mobile application, then requests will most likely come directly to this layer through an API connection, not through the web servers.  It’s connected to the database below, the web server above.  If someone under 18 tried to buy alcohol online, this is the layer which would contain a rule for “this is not allowed”.

Database – for the application layer to work its magic, it needs to be fuelled by information.  Unless it’s a very simple website, all this information cannot be provided by the user.  In the buying alcohol scenario, the database would hold the date of birth the user provided at registration, so the App layer can determine whether this customer is allowed to purchase.

As we’ve alluded to, the networks are as important as the servers.  The outside world needs to connect to the web server (possibly the app server as well via mobile).  And much like the nursery rhyme “Dem Bones” that we used to sing at school, the web server’s connected to the app server, the app server’s connected to the database.  “Hear the word of dev lore!”

 

Replication

Finally, it’s important that we have a continuous experience no matter which site we use.  The site should be invisible to the user.

If the user registers an account whilst using the Auckland site, and they log back in 10 minutes later on Wellington, you need their account details to be available there.

This process is called replication, where data created on one site is copied over to a database on the other site.

Planning and pairing

Fundamentally testing infrastructure involves a lot of technical skills, and a lot of access permissions as well – and not surprisingly very few testers have all these.

We’ve found it’s best for a tester to pair up with someone to verify infrastructure – ideally the person delivering each piece of the system.  The UNIX/network/database admin will execute the commands, but the tester will be steering the session with questions like “what happens if…” and “show me when…”.

As a tester, ideally you’ve been through the technical specs in your architecture documents, highlighting all those areas you think should be testable.  We said earlier that no two sets of infrastructure are the same, but here are some of the typical “recipes” we like to use for verifying a new system.  They should get you started, and get you thinking, but don’t be afraid to go beyond these …

 

Monitoring

You have a whole host of servers, are they up or are they down?  Are they built to the right capability?

Like many organisations, we increasingly use virtual machines for our servers, which means you can’t even go to a physical machine and confirm that “the lights are on”.

This is where monitoring is vital.  It helps us to confirm the state and capacity of each machine.  We use a version of Nagios which we’ve customised to allow us to do this.  [It’s essentially a souped up version of Windows Task Manager that goes into much more detail]

It allows us to check for each machine,

  • IP address

  • Network traffic

  • Memory installed

  • Memory available

  • Number of CPU cores

  • CPU Load

  • Hard disk space

It even triggers alerts when we’re close to capacity to support operational staff, so it’s a useful tool for any tester to get access to when they need to ask “what is going on with my test environment today?”.
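
To make “built to the right capability” a little more concrete, here is a minimal Python sketch of the kind of comparison such a check performs.  It is not how Nagios itself is configured: the metric names, the example values and the 90% alert threshold are all assumptions made up for illustration.

```python
# Hypothetical sketch: compare what monitoring reports for a machine
# against what the architecture documents say it should have.
# Metric names and the 90% threshold are assumptions, not Nagios settings.

def check_machine(expected, observed, alert_ratio=0.9):
    """Return a list of human-readable problems (empty means all good)."""
    problems = []

    # Build checks: is the machine what the spec says it should be?
    for key in ("ip_address", "cpu_cores", "memory_installed_gb"):
        if observed.get(key) != expected.get(key):
            problems.append(
                f"{key}: expected {expected.get(key)!r}, found {observed.get(key)!r}"
            )

    # Capacity checks: are we close to running out of anything?
    memory_used = observed["memory_installed_gb"] - observed["memory_available_gb"]
    if memory_used / observed["memory_installed_gb"] >= alert_ratio:
        problems.append("memory: close to capacity")
    if observed["disk_used_gb"] / observed["disk_total_gb"] >= alert_ratio:
        problems.append("disk: close to capacity")

    return problems


# Made-up example values for one virtual machine.
expected = {"ip_address": "10.0.1.15", "cpu_cores": 4, "memory_installed_gb": 16}
observed = {
    "ip_address": "10.0.1.15",
    "cpu_cores": 2,  # built with too few cores
    "memory_installed_gb": 16,
    "memory_available_gb": 6,
    "disk_used_gb": 95,
    "disk_total_gb": 100,
}

for problem in check_machine(expected, observed):
    print(problem)
```

With these made-up values the check flags the wrong number of CPU cores and the nearly full disk, which are the two questions a tester would want answered before blaming the application.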

 

Backups

Each of your servers should be backing itself up nightly.  However, seeing that a backup file was saved last night isn’t the same as knowing your backup process can be used to rebuild a machine when needed.

We attempt to prove our backups are working by selecting part of the system, taking it down and rebuilding it using the backups. We then confirm the server is back up and working using monitoring.  This allows us to have a lot more confidence that the backups will work when they matter.

We try and do this once for each type of server – so if all we have is a mix of UNIX and Windows servers, we’ll aim to do this at least once for a UNIX server and once for a Windows server.

Databases are something else which gets backed up a lot, so we’ll do a similar process, turning off a database, deleting everything in it, and replacing it using the backup.  A good way to check the before and after state is to run queries in SQL which return

  • Total number of records

  • Total number of records beginning A*, D* etc

This allows you to reconcile the data before and after.  But note, you might lose some data in this manner.

If your system has dual databases and you restore one of them from an old snapshot, you will probably find that the two now differ.  [If you used last night’s backup, there’ll be no data from today]
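
The two queries above can be sketched in Python as follows.  This is only an illustration under assumptions: it uses an in-memory SQLite database as a stand-in for your real database, and the `customers` table and `surname` column are hypothetical names.

```python
import sqlite3


def snapshot_counts(conn, table="customers", column="surname"):
    """Total row count plus a count per first letter of `column`.

    Table and column names are hypothetical. They are interpolated into
    the SQL text, so only pass identifiers you trust.
    """
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    rows = conn.execute(
        f"SELECT UPPER(SUBSTR({column}, 1, 1)), COUNT(*) FROM {table} "
        f"GROUP BY UPPER(SUBSTR({column}, 1, 1))"
    ).fetchall()
    return {"total": total, "by_letter": dict(rows)}


def diff_counts(before, after):
    """Return {letter: (count_before, count_after)} for letters that differ."""
    letters = sorted(set(before["by_letter"]) | set(after["by_letter"]))
    changed = {}
    for letter in letters:
        pair = (before["by_letter"].get(letter, 0), after["by_letter"].get(letter, 0))
        if pair[0] != pair[1]:
            changed[letter] = pair
    return changed


def make_db(surnames):
    """Build a throwaway in-memory database for the demonstration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (surname TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?)", [(s,) for s in surnames])
    return conn


before = snapshot_counts(make_db(["Adams", "Anderson", "Davies", "Talks"]))
# Pretend the restore used last night's backup, so today's "Talks" is missing.
after = snapshot_counts(make_db(["Adams", "Anderson", "Davies"]))
print(diff_counts(before, after))
```

Breaking the totals down by first letter is a cheap way to see roughly where any missing records sit, rather than only knowing that the totals disagree.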

Your replication should eventually bring the second database back into sync – this is also a good thing to test.  More on this next …

Database replication

On the subject of databases, replication is a key component where any information on Wellington is replicated to Auckland and vice versa.

I’ve been able to test this by pairing up with a database admin.  I’ve noticed before when demo’ing software, it shares a lot with the kind of declarations made by a stage magician to his audience.  For us the script goes a little like this …

  • We log into both the Auckland and Wellington databases and search for a record with my name. There are clearly none in either database.

  • We will then create a record on Auckland, and as you can see, this record can now be seen.

  • If we go over to Wellington, as if by magic, the same record can be viewed.  Only it’s not “magic”, it’s replication.

We’ll go through some scenarios which go the other way Wellington -> Auckland.  We’ll turn off the replication, and confirm the record doesn’t exist in the opposite database until we put it back on again.

We’ll also go through and confirm we can delete records as well.
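
The “magic trick” script above could be sketched in Python along these lines.  Treat it as a sketch under assumptions rather than a description of our tooling: `FakeSite` is an in-memory stand-in for a real database client, its `get`/`put`/`delete` methods are hypothetical, and because replication is rarely instant the check polls until a timeout.

```python
import time


def wait_until(condition, timeout_s=30.0, interval_s=1.0):
    """Poll `condition()` until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


def check_replication(source, target, key, value, **wait_kwargs):
    """Create a record on `source`, confirm it reaches `target`, then
    delete it and confirm the delete is replicated as well."""
    if source.get(key) is not None or target.get(key) is not None:
        raise ValueError("test record already exists, pick another key")
    source.put(key, value)
    created = wait_until(lambda: target.get(key) == value, **wait_kwargs)
    source.delete(key)
    deleted = wait_until(lambda: target.get(key) is None, **wait_kwargs)
    return created and deleted


class FakeSite:
    """In-memory stand-in for one site's database (purely illustrative)."""

    def __init__(self):
        self.rows = {}
        self.peer = None
        self.replicating = True

    def get(self, key):
        return self.rows.get(key)

    def put(self, key, value):
        self.rows[key] = value
        if self.replicating and self.peer is not None:
            self.peer.rows[key] = value

    def delete(self, key):
        self.rows.pop(key, None)
        if self.replicating and self.peer is not None:
            self.peer.rows.pop(key, None)


auckland, wellington = FakeSite(), FakeSite()
auckland.peer, wellington.peer = wellington, auckland

print(check_replication(auckland, wellington, "mike", "test record",
                        timeout_s=0.1, interval_s=0.01))
```

Calling it again with the two sites swapped covers the Wellington -> Auckland direction, and setting `replicating = False` on the fake mimics the scenario where replication is switched off.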

Connectivity

Each step of the way we’ve managed to tick an item off our list.  So far we know,

√ Our servers are up

√ Database replication is happening

√ We can rebuild our servers

What we now need to do is a connectivity check – we need to check our web server talks to our app server which talks to our database server.  We have a couple of options here,

  • We can remotely log into a machine and use something like a telnet or netcat command from one server to another.  We expect them to ping a success back.

  • We can smoke test our application by installing it across these servers

  • If the application isn’t ready for smoke testing, we can create a test harness which mimics the protocols of the final application across the different servers

The telnet option is useful because it checks more than the visibility of one machine in your system to another.  You can also check that the security settings on your system don’t allow you to connect to “just anything”, so you should include some telnet checks that you expect to be refused.  For instance you might not want to be able to connect to any machines in your Auckland site from machines in your Wellington site.

That said, smoke testing can be highly useful too if your application is ready.  Ideally you focus on just trying to Create, Retrieve, Update and Delete a record (CRUD) using the front end (web) to check records in the database.
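
The same “should connect” and “should not connect” checks can be scripted instead of typed by hand.  Below is a minimal Python sketch that opens plain TCP connections, which is roughly what a telnet check does.  The host names and ports are hypothetical placeholders, and the script has to be run from the machine whose outbound connectivity you want to verify.

```python
import socket


def can_connect(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolved host names.
        return False


def check_rules(rules, probe=can_connect):
    """Compare actual connectivity with what the design says is allowed.

    `rules` is a list of (host, port, should_connect) tuples.  Returns the
    rules whose actual behaviour differs from the expectation.
    """
    failures = []
    for host, port, should_connect in rules:
        if probe(host, port) != should_connect:
            failures.append((host, port, should_connect))
    return failures


# Hypothetical rules as seen from a Wellington web server.
EXAMPLE_RULES = [
    ("app01.wellington.example", 8080, True),   # web -> app must work
    ("db01.wellington.example", 5432, False),   # web must not reach the database
    ("app01.auckland.example", 8080, False),    # no cross-site traffic
]
```

Running `check_rules(EXAMPLE_RULES)` on the source server returns an empty list when every rule behaves as designed.  Keeping the “must be refused” rules in the same list is the point: a connection that succeeds where it should not is reported just like one that fails where it should succeed.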

Overall, smoke testing is usually the better option (although I do like using the telnet commands to check where servers are not allowed to connect to).  It’s a good confidence check for later testing – what better proof of concept that the infrastructure supports the application?

It also allows you to check that the deployment process for the application to an environment will work, without incurring the cost of building a dedicated test harness.

But of course, the application might not be ready, or indeed your company might be providing the infrastructure whilst the application itself has been sourced elsewhere.  And there’s a general fear from some that we’ll get sidetracked whilst smoke testing, raising all bugs on process flows and spelling mistakes for an unfinished application.

 

Site autonomy

Finally we have site autonomy.

When we looked at our two sites Wellington and Auckland, we talked about the importance of one site being able to run whilst the other is down.

This is how we create a highly available system with good redundancy.

Unfortunately when we build an environment for the first time, we usually build Wellington first, and then Auckland is a copy of Wellington.

Occasionally what will happen is a little bit of human error – Auckland will be a little bit too much of a literal copy of Wellington.  Within its resources it will be pointing to services not at the Auckland site, but at the Wellington site.

The way to test this is relatively simple – you run smoke tests,

  • With Auckland up, but Wellington down

  • With Wellington up, but Auckland down

It’s surprising how often something isn’t quite built right like this on a first go!
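
The two runs above follow a simple pattern that can be sketched in Python.  This is a sketch under assumptions: `take_down`, `bring_up` and `smoke_test` are hypothetical hooks you would wire up to your own environment, and the toy environment at the bottom exists only to show the copy-and-paste mistake being caught.

```python
def check_site_autonomy(sites, take_down, bring_up, smoke_test):
    """Smoke test each site while every other site is down.

    `take_down(site)`, `bring_up(site)` and `smoke_test(site)` are
    hypothetical hooks; `smoke_test` returns True on success.
    Returns a dict of site -> whether it worked on its own.
    """
    results = {}
    for site in sites:
        others = [s for s in sites if s != site]
        for other in others:
            take_down(other)
        try:
            results[site] = bool(smoke_test(site))
        finally:
            # Always restore the environment, even if the smoke test blows up.
            for other in others:
                bring_up(other)
    return results


# Toy environment: Auckland was copied from Wellington and still points at
# Wellington's database, which is exactly the mistake this test should catch.
database_used_by = {"wellington": "wellington", "auckland": "wellington"}
up = {"wellington": True, "auckland": True}

results = check_site_autonomy(
    ["wellington", "auckland"],
    take_down=lambda site: up.__setitem__(site, False),
    bring_up=lambda site: up.__setitem__(site, True),
    smoke_test=lambda site: up[site] and up[database_used_by[site]],
)
print(results)
```

In the toy environment Wellington passes on its own, while Auckland fails as soon as Wellington is switched off, because it is quietly relying on the other site.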

 

 

Can’t this kind of testing be automated?

Good question … complicated answer.  In a nutshell, yes … and no.

Talking to people who’ve done infrastructure testing within my organisation, the first thing they’ll say is to leave a lot more time for the first instance you put together.

The first time you do anything, there are always challenges.  Testing in such a scenario is very much a discovery process, and almost always there are a few things which need to be set right.

We recently worked through the validation of infrastructure for a product where we provided several testing environments, a pre-production environment and the production environment.  By the last environment, things were going really smoothly.  This was about the point where we could have used automation to check and verify the system, because by then we knew what made a good set of tests for this configuration.

Within my organisation, we are increasingly doing more, not just with our own servers but using cloud hosting resources like AWS and Azure.  For one product, the servers are provided by AWS, and the team saves money by only running the test environment for 12 hours on weekdays.  At 7pm, the servers are all turned off and AWS reclaims the capacity.  At 7am, the whole environment is stood up again.  An important part of the “stand up process” is the automated validation to confirm that the environment is working, which includes some of the tests we’ve gone through on liveness, replication, and connectivity.  The environment has to be confirmed to be working, otherwise it creates a drag on any testing effort.
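
A “stand up” validation of this kind boils down to running a named list of checks and refusing to call the environment ready unless they all pass.  Here is a minimal Python sketch of that idea.  It is not the actual AWS tooling: the check names and the placeholder lambdas are assumptions for illustration only.

```python
def validate_environment(checks):
    """Run named checks and return (ready, report).

    `checks` maps a name to a zero-argument callable returning True/False.
    An exception counts as a failure, so one broken check cannot stop the
    remaining checks from running.
    """
    report = {}
    for name, check in checks.items():
        try:
            report[name] = "pass" if check() else "fail"
        except Exception as exc:
            report[name] = f"error: {exc}"
    ready = all(status == "pass" for status in report.values())
    return ready, report


# Placeholder checks; in practice each would call a real liveness,
# replication or connectivity test for the environment.
example_checks = {
    "liveness": lambda: True,
    "replication": lambda: True,
    "connectivity": lambda: False,  # pretend the app server is unreachable
}

ready, report = validate_environment(example_checks)
print("environment ready" if ready else "environment NOT ready", report)
```

Reporting every check rather than stopping at the first failure gives whoever picks up the problem at 7am the full picture in one run.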

 

Wrapping up

As we build more complicated, robust environments, there’s that much more to put under our testing microscope and analyse.  It’s very easy to be in awe of the technology being used, and forget there are basic levels of testing which can be applied inventively to confirm the system operates as expected.

Hopefully this article has opened your eyes a little to what can be done.  Every system will have its own rules, but some of the methods discussed here are great starters for looking at new systems.  Much as you’d approach any new feature in software, you need to keep asking about your infrastructure “how can we know this is working as designed?”, thinking about its testability.  Fundamentally it’s about taking leadership and even a dash of management of others to make sure the system is being put through its paces to find pain points before they impact other activities.

Most importantly, treat infrastructure as a testable component of your final system, and follow the well trod maxim of “test early” to find problems early.

 

More from Mike Talks ...

AgileTD Mondays Talk: 

Fun Times in Environment Provisioning Testing  [21:39]

About Mike Talks

Mike Talks is a test manager at Datacom, one of Australasia’s largest IT vendors. He’s forever trying his hand at things.  Come to his chatbot session at Agile Test Days where he’ll teach you to build your own chatbot …