See what you’re doing

My “day job” as a tech consultant brings me into contact with a lot of software companies, but occasionally I encounter an enterprise that is in the process of becoming a software company. This happens when the company determines that their traditional line of business not only needs to incorporate a software/online strategy, but actually needs to make this the new focus. Motivations range from survival in the face of tech-savvy competitors to expansion into a previously untapped market.

While I enjoy helping these companies explore technology options, it is equally rewarding to help them establish a development process, being a formal approach to the creation of their new products or services. The processes that we in the software world take for granted are often alien to these companies. To drop them into the middle of a full-blown Agile framework, for example, would have them run for the hills! Systems engineers will recognise this as the homeostatic nature of established processes, and it can be challenging to overcome. My preferred strategy is to ease slowly into a process that they can all adopt, gradually introducing select pieces of contemporary methodologies that have immediate and obvious benefits.

Documentation is key. See what you need to do, are doing and have done so that you can compare against your goals. Every member of the team should see what the other members are doing. Awareness of one’s role in meeting team objectives is important. Precision and measurement are highly valued. Effort and complexity often have greater significance than time spent. So what tools do I introduce to encourage all this?

Primarily I look to Agile principles and tools, and propose brief daily meetings (stand-ups) with their quick summaries and plans for the day ahead, the notion of the task backlog, the idea that the team should be allowed to get on with their work, while taking regular cues from customer/management feedback regarding the prioritisation of activity. I borrow from the Kanban approach because of the immediacy of the board and the way it communicates to the team and company at large. I focus on short iterations to refine the development, allowing the team to learn along the way. Tasks in the backlog are kept simple, with a title, one sentence description, completion criteria, an expression of complexity and an owner (or list of candidate owners if not yet active). I espouse the reduction of waste, team empowerment and willingness to drop an unproductive line of action, as promoted by the Lean approach. Scrum’s use of retrospectives allows the team to learn from recent work and use that to directly affect what happens next, so I hold a look-back session every two weeks. And there’s more where that came from.

This is rather piecemeal, certainly, but for companies that have no prior exposure to any of these approaches, this small handful of somewhat radical ideas can be inspiring. Each approach or tool that I select has immediate benefits that the entire team appreciates. Resistance is reduced when the value is easily understood. Typically we move through the company mission, to capturing the company goals that support that mission, then identifying the features of each goal and finally breaking down each feature into a set of ordered tasks for the backlog. Sometimes the goal has only one feature, achievable in just a few tasks in a certain order, but that’s often enough to establish a template for a development process. Sometimes there are multiple goals, and many features leading to hundreds of tasks (and many more being created as the team learns more about what needs to be done).

After a few iterations, the company has a process, and the team members have a taste for what the tools and techniques can do for them. They can either continue with their own process or, as they gain appreciation for the established processes from which theirs is derived, they could choose to adopt the whole of a particular process along with all its bells and whistles. They’ll probably need specialist input from that point, which gives me a chance to refocus on their tech.

Few companies that I encounter actually follow a development process 100%. Most have selected various bits that they believe are a good fit for their company and their people. By and large that appears to work. Regardless of which process (derivative) they use, one thing they all agree on is that it helps them see what they are doing. So when I’m tasked with helping a company become a software company, the first thing I try to do is help them see what they are doing, and then the journey begins.

Front doors

We’ve all heard of “back door” access. This refers to a situation where some kind of access to the system is available that does not go through the normal procedures, and is sometimes present during the early stages of development to provide convenient and efficient ways to interact with a partially complete system.

Obviously, it is essential that the final version of the solution is built without these back doors present, otherwise you have a major hole in your security.

Then there is the front door, and that will be present in the final version you put into the hands of your customers.

During development it is tempting to make the front door as “convenient” as the back door, just to speed things up. You could, for example, just take the lock off the front door, so long as you remember to restore it before your product ships.

Big companies with millions of customers would never ship a product with the front door unlocked, would they?

Sadly, Apple did just exactly that. Yesterday the Apple ecosystem woke up to discover that macOS High Sierra, released to a bazillion Mac users, allowed anyone to log in as root with no password! The problem had been spotted a few weeks ago, but has suddenly gone public thanks to social media. Apple quickly issued instructions on how to put the lock back on the door, but you can be sure that there will be many Macs exposed for a long time, at least until Apple’s next automatic update.

To make matters worse, the bug first needs to be “activated” by someone actually attempting the root login. Once activated, it makes it far easier for malware to exploit the newly exposed root access. Thanks to the social media storm and human nature, this bug is going to be activated over, and over, and over.

It is quite possible that while the door is open, some malware (or someone with brief physical access to your laptop) could take advantage of this total loss of security to go in and leave something inside that will permit access at a later date, even when Apple fix their blunder. So the repercussions will be long lasting.

No doubt this episode will be a case study in text books for years to come.

Data clouds

Migrating MySQL from local servers to AWS RDS has its ups and downs…

Several of my clients, and a few of my own projects, have local instances of relational databases, usually MySQL (or MariaDB/Percona derivatives) and there is growing interest in moving these into the cloud. Amazon’s almost-drop-in replacement is the obvious candidate. You can find plenty of detailed information online about AWS RDS, typically addressing matters that concern large database installations. Maybe such matters sound more exciting than moving a small database that has probably been ticking away nicely on that small box in the corner for years. So what about the small scale cases? Most of my consultancy clients would be using single database instances with just a few gig of data. Small fry. Would RDS suit them and what are the pros/cons?

There are several reasons why a small business with a modest database would be looking to move it to the cloud, which mainly come down to avoiding the need to worry about:

  • Maintaining hardware for the local database.
  • Checking for security patches or other software updates.
  • Making regular backups of the live database.
  • Restoring the database in case of disaster (e.g. server power failure).
  • Testing the restore process, rather than waiting for the disaster.
  • Tuning the database.
  • Adding space or additional processing capacity.
  • Protecting the database against intrusion.

This is a lot for a small business, especially those with only a few (or just one!) IT staff. While there are obvious costs in the above list, there are also obvious risks, so moving all this to a cloud service might not do much for costs but could do a lot of good for the risks.

The obvious cost savings come from:

  • No need to own or operate hardware to support the database.
  • No need to purchase or license software.
  • Avoiding time-consuming and costly maintenance work.

The obvious new costs are:

  • Ongoing fees for the replacement cloud service.
  • Training associated with the introduction of the service.
  • Increased Internet traffic.

At the time of writing, Amazon can offer a basic RD instance for about €$£50 per month giving you 2Gb RAM, 10Gb storage, low intensity network usage, enough to keep 2k of data on every single citizen of New Zealand; woman, man and child (but not sheep). A modest database in the cloud isn’t going to break the budget.

Migrating to the cloud can be a bit of a hit-and-miss affair. Let’s suppose you’re only maintaining 1Gb of data on your local MySQL instance, and you want to move it to a similar instance in RDS (also known as a homogeneous migration). You have two main options:

  • Shut down your server, do a mysqldump to create an SQL file representing your entire database, then pump that into RDS, connect your application to the database in RDS and go back online.
  • Use the AWS Data Migration Service to replicate your data to RDS without shutting down your current system, and when done just reconfigure your application to point to RDS instead.

I recently tried the mysqldump route. I loaded a compressed version of the dump onto an EC2 instance, and then piped the uncompressed record stream through the mysql tool to restore the database to a freshly minted RD instance. To my horror, this experiment with an image under 1Gb took over a day to complete! There’s no way I could be expected to shut down a system for a day while the data is moved.

The alternative is to move the data while it is still live, while making sure that any changes made to the live system during the migration are also copied over. The AWS DMS can do this. It will use a connection to your current live database and a connection to an RDS instance and gradually copy across. This approach might take a little longer since the data isn’t compressed (and presumably is also going via an encrypted connection) but the only downtime you will need is at the end when you have to switch your application from using the current local database to using the RDS instance in the cloud.

Once your data is in the cloud, you get additional features such as automatic backups, a hot standby server in case your main instance gets into trouble, or you can introduce data replication, clusters, geographic distribution and more. You scale according to your needs, up or down, and often just by making some configuration adjustments via a Web interface.

Are there any gotchas? Sure there are. Here’s a few you might want to consider, though for modest database situations these probably won’t be much to worry about:

  • If you need BLOBs over 25Mb, think again (suggestion: put them into S3 and replace the database fields with S3 references).
  • If you regularly use MySQL logs written to the file system, be prepared to survive without them.
  • If you need SUPER privileges then you will need to figure out alternative ways to achieve what you want. You need SUPER, for example, to kill an ongoing query, which can be a problem when you have some unintentionally compute-intensive select that just keeps going, and going, and going… (Note: Amazon knows about this particular limitation, which is why they offer an alternative via rds_kill_query.)
  • The underlying file system is case sensitive, so beware if you have been relying on case insensitivity (e.g. your current MySQL server has a Windows OS).
  • Backups to places outside AWS have to be done via mysqldump because you don’t have access to the underlying MySQL files. AWS snapshots are efficient, and can be automated to occur at whatever rate you want, but they stay in Amazon. If you are used to the idea of restoring a MySQL database by dropping the raw files into place, be advised that you can’t do this with the snapshots.
  • Only the InnoDB engine can be used if you want to take advantage of Amazon’s crash recovery.
  • There are plenty of other limitations (like the 6Tb size limit, the 10,000 table limit, the lack of Federated Storage Engine support etc.) but none of these are going to be of any interest to someone running a modest database.

If you want an encrypted RDS database, make sure to set it up as encrypted when creating it, before you put any data into it. The nice thing about using this approach is that your application doesn’t need to know that the database is encrypted. You just need to manage your access keys properly, and naturally there’s a cloud service for managing keys. Again, these are benefits that the staff-challenged small business might bear in mind.

So, why Amazon? In my case, partly because I’ve been using their cloud services for years, and have been impressed by how the service has constantly improved. Objectively, I note the Forrester report (PDF, download from Amazon) earlier this year (2017) that put AWS RDS on top of the pile.

Of course, if you are going to have your database in the cloud, you might as well put your applications there too, but that’s another story.

npm Hell

A few years back I blogged about the RPM Hell (TreeTops, Jan 2012) that came about because of ineffective dependency checking. Needless to say, this reminded a lot of people about the infamous DLL Hell from many years prior. DLL and RPM dependency issues have generally been overcome with a variety of approaches and improvements in the tools and operating environments. Pushing functionality onto the Web or into the Cloud can help move the problem away from your local machine. After all, if the functionality is now just some JavaScript on a Web page then this shouldn’t be affected by whatever mess of libraries are installed on your local machine.

That worked for a while. Eventually the JavaScript workload grew to the point that putting the scripts into reusable packages made sense, and we have seen a mushrooming of JavaScript packages in recent times. Followed by JavaScript package versioning, server-side JavaScript (i.e. Node) and JavaScript package managers (e.g. npm1).

The inevitable outcome of this evolution was yet another dependency hell. Initially, npm would incorporate specific dependencies within a package, so that each package could have a local copy of whatever it depended on. That led to a large, complex and deep dependency hierarchy and a lot of unnecessary duplications. Subsequent refinements led to automatic de-duplication processes so that commonly used packages could be shared (until the next time the dependencies change), thus flattening the hierarchy somewhat.

Unlike the days of DLL and RPM hell, the current trend in the JavaScript community is to version and release often. Very often, in fact. Seldom does a day go by that I don’t have to update multiple packages because one in the hierarchy has released an urgent fix for a bug, security hole, spec change or whatever. Unfortunately this often leads to two or more packages conflicting about the versions of dependencies that they will accept.

Such conflicts would not be a big problem were it not for two unfortunate consequences:

  1. If the packages are going to be used client-side then you have to ship multiple versions of these dependencies with your app (SPA or native embedded) and thus bloat the binary even though there’s massive functional overlap across the versions.
  2. Runtime conflicts when a package exposes a public interface to be used for app-wide functionality, but has version incompatibility.

Bloat is a major headache, but at least the application still works. Conflicts, on the other hand, are terminal. To overcome these you can try winding some package versions back, until you find a compatible combination. Then you cap (freeze) each of these versions until some future release of one or more of the conflicting packages enables a revised compatible combination. This risks you keeping a known security hole, or missing out on some fix or feature. Sadly, this juggling is still for the most part a manual process of guesswork.

Throw in the possibility (no, likelihood) that a recently released package will have critical bugs due to unforeseen clashes with your particular combination of package versions and you can imagine the nightmare.

Why do I mention this now? Because yesterday, for several hours, I was struggling with TypeScript issue 16593 (though it took me a long time to discover that I had encountered a problem that had been flagged a week earlier). TypeScript, the JavaScript superset under Microsoft’s stewardship, is a package that I would normally update without concern, but suddenly v2.4.1 was causing problems for rxjs (another package I would normally have no qualms about updating). The argument over who is responsible for the problem continues, with two options open to me: downgrade and fix at v2.4.0 or “upgrade” to the Alpha version of rxjs 6. The latter option doesn’t really appeal, not for something that I want in production pretty soon. I chose to fix TS at v2.4.0, and I am glad I did because along comes issue 16777 that alarmingly says that v2.4.1 is causing out of memory errors!

As I finished a late-night session I reflected on the fact that despite the passage of years we are still facing this kind of hell, with no solution on the horizon other than vigilance and an ability to juggle.


  1. The creators of npm really don’t like seeing it capitalized.

Sync or swim

The headline article in last month’s Communications of the ACM (“Attack of the Killer Microseconds”1) got me thinking. The article was drawing attention to the fact that computations are now happening so fast that we now must now even consider the time it takes data to move from one side of a server warehouse to the other at the speed of light. Multi-threading (a)synchronous programming methodologies need to be re-thought given that the overheads of supporting these methodologies are becoming more of a burden compared to the actual computation work being undertaken.

That is not what got me thinking.

What got me thinking was a point being made that despite our generally held belief that asynchronous models are better performers in demanding high-throughput situations, the authors claim that synchronous APIs lead to “code that is shorter, easier-to-understand, more maintainable, and potentially even more efficient.” This last claim is supported by saying that abstracting away an underlying microsecond-scale asynchronous device (e.g. document storage) requires support for a wait primitive and these incur overheads in OS support (specifically: tracking state via the heap instead of the thread stack). So, asynchronous callbacks are bad. Not only that, they’re also ugly.

Here’s a terse paraphrasing of the coding illustration presented by Borroso:

Input:   V
Program: X = Get1(V); Y = Get2(X); Result = fn(Y);

The two Get function are assumed to involve a microsecond-scale device, so the above program would incur two very brief delays before being able to calculate the final result using the synchronous “fn” function. During those delays, the state of the program is in the thread’s stack.

An asynchronous equivalent would invoke the Get1 and provide a callback (or “continuance”) for the Get1 to invoke when it has finished. The callback would then invoke Get2, again with a callback to carry on when Get2 completes. The second callback would invoke fn, synchronously, to compute the final result. This somewhat ugly program would look like this2:

Program:    X = AysncGet1(V,ProgramCB1);
ProgramCB1: Y = AsyncGet2(X,ProgramCB2);
ProgramCB2: Result = fn(Y);

To support the callback parameter, these AsyncGet functions now need to maintain the current state information, including references to the required callback programs, which occupies heap space because the context can switch away during the async operations. It’s this extra program state management overhead that leads to underperformance, while making the code harder to follow.

However, I’m not sure if it is right to expose the callback chain and heap objects to demonstrate unnecessary complexity, in order to “prove” that such code is longer, harder to understand and less maintainable. The authors presented an example in C++, which can be obscure at the best of times, and packed it with comments, which certainly made it look big and messy, but it feels contrived to be a bad example.

So I offer this:

Program: X = Get1(V); &{ Y = Get2(X); &{ Result = fn(Y); } }

In this I have introduced a simple notation “&{…}” to indicate the continuation following a function call. This works whether or not the Get functions are synchronous or asynchronous. Furthermore, a clever compiler (perhaps with some metadata hints) could easily discover the subsequent dependencies following a function call and automatically apply the continuation delimiters. This means that a compiler should be able to deal with automatically introducing the necessary async support even when the code is written like this without being explicit about continuations:

Program: X = Get1(V);    Y = Get2(X);    Result = fn(Y);

This is just the original program, with the compiler imbued with the ability to inject asynchronous support. I’m not saying that this solves the problems, as it’s obvious that even an implied (compiler discovered) continuation still needs to be translated into additional structures and heap operations, with the consequent performance overheads. However, appealing to the complexity of the program source by presenting a contrived piece of ugly C++, especially after admitting that “languages and libraries can make it somewhat easier” does their argument no service at all. The claim that the code “is arguably messier and more difficult to understand than [a] simple synchronous function-calling model” doesn’t hold up. Explicitly coding the solution in C++ is certainly “arguably messier”, but that’s more a problem of the language. Maintainable easy-to-understand asynchronous programs are achievable. Just maybe not in C++.

As for the real focus of the paper, the need to be aware of overheads down at the microsecond scale is clear. When the “housekeeping” surrounding the core operations is taking an increasing proportion of time, because those core operations are getting faster, then you have a problem. Perhaps a programming approach that has the productivity and maintainability of simple synchronous programming while automatically incorporating the best of asynchronous approaches could be part of the solution.

  1. CACM 04/2017 Vol.60 No.04, L. Barroso (Google) et al. DOI:10.1145/3015148
  2. For the C++ version, about 40 lines long, see the original paper.