Data clouds

Migrating MySQL from local servers to AWS RDS has its ups and downs…

Several of my clients, and a few of my own projects, have local instances of relational databases, usually MySQL (or MariaDB/Percona derivatives) and there is growing interest in moving these into the cloud. Amazon’s almost-drop-in replacement is the obvious candidate. You can find plenty of detailed information online about AWS RDS, typically addressing matters that concern large database installations. Maybe such matters sound more exciting than moving a small database that has probably been ticking away nicely on that small box in the corner for years. So what about the small scale cases? Most of my consultancy clients would be using single database instances with just a few gig of data. Small fry. Would RDS suit them and what are the pros/cons?

There are several reasons why a small business with a modest database would be looking to move it to the cloud, which mainly come down to avoiding the need to worry about:

  • Maintaining hardware for the local database.
  • Checking for security patches or other software updates.
  • Making regular backups of the live database.
  • Restoring the database in case of disaster (e.g. server power failure).
  • Testing the restore process, rather than waiting for the disaster.
  • Tuning the database.
  • Adding space or additional processing capacity.
  • Protecting the database against intrusion.

This is a lot for a small business, especially those with only a few (or just one!) IT staff. While there are obvious costs in the above list, there are also obvious risks, so moving all this to a cloud service might not do much for costs but could do a lot of good for the risks.

The obvious cost savings come from:

  • No need to own or operate hardware to support the database.
  • No need to purchase or license software.
  • Avoiding time-consuming and costly maintenance work.

The obvious new costs are:

  • Ongoing fees for the replacement cloud service.
  • Training associated with the introduction of the service.
  • Increased Internet traffic.

At the time of writing, Amazon can offer a basic RD instance for about €$£50 per month giving you 2Gb RAM, 10Gb storage, low intensity network usage, enough to keep 2k of data on every single citizen of New Zealand; woman, man and child (but not sheep). A modest database in the cloud isn’t going to break the budget.

Migrating to the cloud can be a bit of a hit-and-miss affair. Let’s suppose you’re only maintaining 1Gb of data on your local MySQL instance, and you want to move it to a similar instance in RDS (also known as a homogeneous migration). You have to main options:

  • Shut down your server, do a mysqldump to create an SQL file representing your entire database, then pump that into RDS, connect your application to the database in RDS and go back online.
  • Use the AWS Data Migration Service to replicate your data to RDS without shutting down your current system, and when done just reconfigure your application to point to RDS instead.

I recently tried the mysqldump route. I loaded a compressed version of the dump onto an EC2 instance, and then piped the uncompressed record stream through the mysql tool to restore the database to a freshly minted RD instance. To my horror, this experiment with an image under 1Gb took over a day to complete! There’s no way I could be expected to shut down a system for a day while the data is moved.

The alternative is to move the data while it is still live, while making sure that any changed made to the live system during the migration are also copied over. The AWS DMS can do this. It will use a connection to your current live database and a connection to an RDS instance and gradually copy across. This approach might take a little longer since the data isn’t compressed (and presumably is also going via an encrypted connection) but the only downtime you will need is at the end when you have to switch your application from using the current local database to using the RDS instance in the cloud.

Once your data is in the cloud, you get additional features such as automatic backups, a hot standby server in case your main instance gets into trouble, or you can introduce data replication, clusters, geographic distribution and more. You scale according to your needs, up or down, and often just by making some configuration adjustments via a Web interface.

Are there any gotchas? Sure there are. Here’s a few you might want to consider, though for modest database situations these probably won’t be much to worry about:

  • If you need BLOBs over 25Mb, think again (suggestion: put them into S3 and replace the database fields with S3 references).
  • If you regularly use MySQL logs written to the file system, be prepared to survive without them.
  • If you need SUPER privileges then you will need to figure out alternative ways to achieve what you want. You need SUPER, for example, to kill an ongoing query, which can be a problem when you have some unintentionally compute-intensive select that just keeps going, and going, and going… (Note: Amazon knows about this particular limitation, which is why they offer an alternative via rds_kill_query.)
  • The underlying file system is case sensitive, so beware if you have been relying on case insensitivity (e.g. your MySQL server has a Windows OS).
  • Backups to places outside AWS have to be done via mysqldump because you don’t have access to the underlying MySQL files. AWS snapshots are efficient, and can be automated to occur at whatever rate you want, but they stay in Amazon. If you are used to the idea of restoring a MySQL database by dropping the raw files into place, be advised that you can’t do this with the snapshots.
  • Only the InnoDB engine can be used if you want to take advantage of Amazon’s crash recovery.
  • There are plenty of other limitations (like the 6Tb size limit, the 10,000 table limit, the lack of Federated Storage Engine support etc.) but none of these are going to be of any interest to someone running a modest database.

If you want an encrypted RDS database, make sure to set it up as encrypted when creating it, before you put any data into it. The nice thing about using this approach is that your application doesn’t need to know that the database is encrypted. You just need to manage your access keys properly, and naturally there’s a cloud service for managing keys. Again, these are benefits that the staff-challenged small business might bear in mind.

So, why Amazon? In my case, partly because I’ve been using their cloud services for years, and have been impressed by how the service has constantly improved. Objectively, I note the Forrester report (PDF, download from Amazon) earlier this year (2017) that put AWS RDS on top of the pile.

Of course, if you are going to have your database in the cloud, you might as well put your applications there too, but that’s another story.

npm Hell

A few years back I blogged about the RPM Hell (TreeTops, Jan 2012) that came about because of ineffective dependency checking. Needless to say, this reminded a lot of people about the infamous DLL Hell from many years prior. DLL and RPM dependency issues have generally been overcome with a variety of approaches and improvements in the tools and operating environments. Pushing functionality onto the Web or into the Cloud can help move the problem away from your local machine. After all, if the functionality is now just some JavaScript on a Web page then this shouldn’t be affected by whatever mess of libraries are installed on your local machine.

That worked for a while. Eventually the JavaScript workload grew to the point that putting the scripts into reusable packages made sense, and we have seen a mushrooming of JavaScript packages in recent times. Followed by JavaScript package versioning, server-side JavaScript (i.e. Node) and JavaScript package managers (e.g. npm1).

The inevitable outcome of this evolution was yet another dependency hell. Initially, npm would incorporate specific dependencies within a package, so that each package could have a local copy of whatever it depended on. That led to a large, complex and deep dependency hierarchy and a lot of unnecessary duplications. Subsequent refinements led to automatic de-duplication processes so that commonly used packages could be shared (until the next time the dependencies change), thus flattening the hierarchy somewhat.

Unlike the days of DLL and RPM hell, the current trend in the JavaScript community is to version and release often. Very often, in fact. Seldom does a day go by that I don’t have to update multiple packages because one in the hierarchy has released an urgent fix for a bug, security hole, spec change or whatever. Unfortunately this often leads to two or more packages conflicting about the versions of dependencies that they will accept.

Such conflicts would not be a big problem were it not for two unfortunate consequences:

  1. If the packages are going to be used client-side then you have to ship multiple versions of these dependencies with your app (SPA or native embedded) and thus bloat the binary even though there’s massive functional overlap across the versions.
  2. Runtime conflicts when a package exposes a public interface to be used for app-wide functionality, but has version incompatibility.

Bloat is a major headache, but at least the application still works. Conflicts, on the other hand, are terminal. To overcome these you can try winding some package versions back, until you find a compatible combination. Then you cap (freeze) each of these versions until some future release of one or more of the conflicting packages enables a revised compatible combination. This risks you keeping a known security hole, or missing out on some fix or feature. Sadly, this juggling is still for the most part a manual process of guesswork.

Throw in the possibility (no, likelihood) that a recently released package will have critical bugs due to unforeseen clashes with your particular combination of package versions and you can imagine the nightmare.

Why do I mention this now? Because yesterday, for several hours, I was struggling with TypeScript issue 16593 (though it took me a long time to discover that I had encountered a problem that had been flagged a week earlier). TypeScript, the JavaScript superset under Microsoft’s stewardship, is a package that I would normally update without concern, but suddenly v2.4.1 was causing problems for rxjs (another package I would normally have no qualms about updating). The argument over who is responsible for the problem continues, with two options open to me: downgrade and fix at v2.4.0 or “upgrade” to the Alpha version of rxjs 6. The latter option doesn’t really appeal, not for something that I want in production pretty soon. I chose to fix TS at v2.4.0, and I am glad I did because along comes issue 16777 that alarmingly says that v2.4.1 is causing out of memory errors!

As I finished a late-night session I reflected on the fact that despite the passage of years we are still facing this kind of hell, with no solution on the horizon other than vigilance and an ability to juggle.

___

  1. The creators of npm really don’t like seeing it capitalized.

Sync or swim

The headline article in last month’s Communications of the ACM (“Attack of the Killer Microseconds”1) got me thinking. The article was drawing attention to the fact that computations are now happening so fast that we now must now even consider the time it takes data to move from one side of a server warehouse to the other at the speed of light. Multi-threading (a)synchronous programming methodologies need to be re-thought given that the overheads of supporting these methodologies are becoming more of a burden compared to the actual computation work being undertaken.

That is not what got me thinking.

What got me thinking was a point being made that despite our generally held belief that asynchronous models are better performers in demanding high-throughput situations, the authors claim that synchronous APIs lead to “code that is shorter, easier-to-understand, more maintainable, and potentially even more efficient.” This last claim is supported by saying that abstracting away an underlying microsecond-scale asynchronous device (e.g. document storage) requires support for a wait primitive and these incur overheads in OS support (specifically: tracking state via the heap instead of the thread stack). So, asynchronous callbacks are bad. Not only that, they’re also ugly.

Here’s a terse paraphrasing of the coding illustration presented by Borroso:

Input:   V
Program: X = Get1(V); Y = Get2(X); Result = fn(Y);

The two Get function are assumed to involve a microsecond-scale device, so the above program would incur two very brief delays before being able to calculate the final result using the synchronous “fn” function. During those delays, the state of the program is in the thread’s stack.

An asynchronous equivalent would invoke the Get1 and provide a callback (or “continuance”) for the Get1 to invoke when it has finished. The callback would then invoke Get2, again with a callback to carry on when Get2 completes. The second callback would invoke fn, synchronously, to compute the final result. This somewhat ugly program would look like this2:

Program:    X = AysncGet1(V,ProgramCB1);
ProgramCB1: Y = AsyncGet2(X,ProgramCB2);
ProgramCB2: Result = fn(Y);

To support the callback parameter, these AsyncGet functions now need to maintain the current state information, including references to the required callback programs, which occupies heap space because the context can switch away during the async operations. It’s this extra program state management overhead that leads to underperformance, while making the code harder to follow.

However, I’m not sure if it is right to expose the callback chain and heap objects to demonstrate unnecessary complexity, in order to “prove” that such code is longer, harder to understand and less maintainable. The authors presented an example in C++, which can be obscure at the best of times, and packed it with comments, which certainly made it look big and messy, but it feels contrived to be a bad example.

So I offer this:

Program: X = Get1(V); &{ Y = Get2(X); &{ Result = fn(Y); } }

In this I have introduced a simple notation “&{…}” to indicate the continuation following a function call. This works whether or not the Get functions are synchronous or asynchronous. Furthermore, a clever compiler (perhaps with some metadata hints) could easily discover the subsequent dependencies following a function call and automatically apply the continuation delimiters. This means that a compiler should be able to deal with automatically introducing the necessary async support even when the code is written like this without being explicit about continuations:

Program: X = Get1(V);    Y = Get2(X);    Result = fn(Y);

This is just the original program, with the compiler imbued with the ability to inject asynchronous support. I’m not saying that this solves the problems, as it’s obvious that even an implied (compiler discovered) continuation still needs to be translated into additional structures and heap operations, with the consequent performance overheads. However, appealing to the complexity of the program source by presenting a contrived piece of ugly C++, especially after admitting that “languages and libraries can make it somewhat easier” does their argument no service at all. The claim that the code “is arguably messier and more difficult to understand than [a] simple synchronous function-calling model” doesn’t hold up. Explicitly coding the solution in C++ is certainly “arguably messier”, but that’s more a problem of the language. Maintainable easy-to-understand asynchronous programs are achievable. Just maybe not in C++.

As for the real focus of the paper, the need to be aware of overheads down at the microsecond scale is clear. When the “housekeeping” surrounding the core operations is taking an increasing proportion of time, because those core operations are getting faster, then you have a problem. Perhaps a programming approach that has the productivity and maintainability of simple synchronous programming while automatically incorporating the best of asynchronous approaches could be part of the solution.

  1. CACM 04/2017 Vol.60 No.04, L. Barroso (Google) et al. DOI:10.1145/3015148
  2. For the C++ version, about 40 lines long, see the original paper.

Post-truth

As the kindling of 2017 takes hold, a new concept is slowly rolling across the land. The compound word that best captures it is “post-truth”, which was declared by Oxford Dictionaries to be the “Word of the Year, 2016“. It refers to the nascent contemporary period in which self-evident or provable truth is no longer the generally accepted truth by a growing proportion of society.

Why is this? What would make seemingly intelligent people say or do things that are more in keeping with Alice’s “up is down” Wonderland?

Let’s suppose that most intelligent people will accept the following: “if we are informed that A is true and we know that A being true implies that B is true, then we must accept that B is true.” This is just basic modus ponens from the study of propositional logic, more succinctly expressed as “A implies B; A; therefore B”. Each step in the logical argument is true. Yet we increasingly find that people don’t accept the third step. Indeed, many seem to reach “Not B” as their conclusion, completely contrary to any logical argument!

Similarly, people presented with contrary evidence (e.g. “Not B”) while knowing that “A implies B” still adhere to the belief that A remains true, in defiance of modus tollens. In other words, they maintain their beliefs despite contrary evidence.

You can apply this to many things, such as the arguments surrounding climate change, the origins of life, the causes of conflict, support for political positions, attendances at events, consequences of economic decisions and more. It has become so prevalent that many stories circulating in the various media merely state the conclusion of an argument without exploring the actual argument itself. Then, since what is written in black and white is often taken as fact, any subsequent arguments become disastrously off-track since they start from falsehoods and work ever downwards.

So what has gone wrong?

At first I believed the problem rests with a failure to properly affirm the antecedent property (A), or incorrectly identify A so that any conclusion regarding B is perceived as flawed. Since the argument is being presented to prove B and the argument fails then the – naturally human – response is to assume the opposite must be true. (This of course is another logical fallacy, as the failure of one argument does not mean that all other arguments must fail.)

This is what leads to one group claiming only a quarter of a million people attended an event while the organiser claims six times that number (or more) because if you have a photo showing people all the way to the back and you know the capacity of the auditorium then surely you know how many attended. A full picture means numbers were at capacity, here’s a picture, therefore we had full capacity. Then someone else comes along with a photograph from a different angle showing something different and suddenly the facts and arguments get muddied. In the end, people come away believing whatever they want to believe, regardless of the facts.

That was my first thought. But on reflection I think there may be something more serious going on. We are awash with facts, or things that look like facts, so we are hard pressed to determine what’s real and what’s not real. Of course, we could be presented with dozens of photographs from many angles in the auditorium and many of these photos would be packed with people, but all it takes is one photo from the right angle to give the right perspective. The film industry has long known about using camera angles to make a place look full. Our national sporting agencies have often only taken photos of the stadia where the spectators have been compressed to give the impression of full attendance, despite 80% of the seats being unoccupied. So why then does a counter-image not destroy the argument?  Scientists know that all it takes is one failure to destroy a theory, but they are guided by logic. They call it the scientific method. The general public, however, are swayed by rhetoric and the volume of “facts”. The one photograph showing a mostly empty auditorium from a good vantage point does not destroy the argument backed by many ground-level photographs and the oratory of a person who declares any counter-argument to be “false news”.

So, it is not just that people are convinced by the volume of “facts” and the volume of the orator, while suppressing any contrary information or sources, it appears also to be the case that logical reasoning has suffered. Rational thinking is falling by the wayside. Indeed, those who are ill-equipped to think rationally are doing their best to demonize anyone who does not share their world-view. There is a battle going on between logic and emotion, and it looks like emotion is starting to win. Belief is replacing proof, and even if the consequences of this madness result in disaster, those who are responsible will not accept that responsibility because it does not fit the narrative they want to hear.

Welcome to the post-truth world, where facts and rational thinking are optional.

Long, long, long paths

You may not be familiar with the MAX_PATH limitation in Windows, but if you are then you know how much of a nuisance it can be. Windows, unlike Unix, limits the full path of a file to 260 characters. People of a certain vintage who remember the 8.3 file name format will laugh at 260 characters being called a limitation but to Unix followers the notion of such a limit is crazy.

That 260 character limit is there, has been for a very long time, and is even present in the current incarnation: Windows 10.

Why is this a problem? Normally it is not and one can work away for years and not even be aware of it, but as luck would have it I hit that limit earlier today. You see, I was examining a Node.js project and it is quite normal for dependencies to be stored in a sub-directory (i.e. sub-folder) called node_modules, which in turn can have their own dependencies (in sub-sub-directories) and so on. These dependencies can run quite deep.

I had not created this project structure myself, but was examining it within a Zip file with the help of 7Zip. Then I decided to extract the entire hierarchy so that I could apply some tools to the collection. The extraction seemed to go OK, the tool extracted the entire collection into a temporary folder and then attempted to move it to its final destination. At 90% of the way an error message popped up, saying:

Destination Path Too Long
The file name(s) would be too long for the destination folder. You can
shorten the file name and try again, or try a location that has a shorter
path.

Nasty. Fortunately there is a little-known feature in Windows 10. Run the Group Policy Editor (gpedit.msc) and navigate to:

Local Computer Policy | Computer Configuration | Administrative Templates | System | Filesystem*

There you should find an entry called “Enable Win32 Long Paths”, which you should enable. This allows many Win32 programs, developed when MAX_PATH was 260 characters, to now create longer paths.

I made the change and rebooted. (That was probably unnecessary, but after a GP change it’s probably a good idea.) The result? Another failure. Once again the error message appeared when 7Zip was moving the extracted files to their final destination. This operation appears to be done by Windows itself, given the appearance of the dialog box that summarizes the progress of files during the move.

Although rather annoyed by this failure, I tried one more approach. Instead of dragging the root folder out of the 7Zip GUI to the target folder, I used the context menu (right-click) to cause an “extract here”. This bypasses the extraction to a temporary folder and dumps the extraction piece by piece into the target. This time it worked.

Out of curiosity, I then dragged the root of the newly extracted hierarchy to the same window, thus initiating a complete copy of the hierarchy in-situ using only Windows’ built-in facilities. It failed at exactly the same place.

What did I learn from this? It would appear that although the “Enable Win32 Long Paths” can be enabled in Windows 10, and tools like 7Zip appear to work with long file names, it seems that the change does not do anything for Windows Explorer as its own copy operation fails because of the old MAX_PATH limit. Where there’s one problem, others are sure to be found.

This is probably going to make my ongoing work in Node on Windows a little painful. Maybe I should stick to Unix.

* On some older versions of Win10 you may have to drill a step further down to “NTFS”.

PS In a post on the npm Blog two years ago they mentioned that npm3 was going to have flatter node_modules dependency hierarchies. Not a complete fix for the Windows limit, but a lot better than the bottomless pit that triggered the MAX_PATH issue. This doesn’t help me, however, because I’m trawling through legacy code that started life before the flatter approach appeared.