Tuesday, November 17, 2015

The Brick Wall of C++ Source Code Transformation

In 1992, I was responsible for organizing the Advanced Topics Workshop that accompanied the USENIX C++ Technical Conference. The call for workshop participation said:
The focus of this year's workshop will be support for C++ software development tools. Many people are beginning to experiment with the idea of having such tools work off a data structure that represents parsed C++, leaving the parsing task to a single specialized tool that generates the data structure. 
As the workshop approached, I envisioned great progress in source code analysis and transformation tools for C++. Better lints, deep architectural analysis tools, automatic code improvement utilities--all these things would soon be reality! I was very excited.

By the end of the day, my mood was different. Regardless of how we approached the problem of automated code comprehension, we ran into the same problem: the preprocessor. For tools to understand the semantics of source code, they had to examine the code after preprocessing, but to produce acceptable transformed source code, they had to modify what programmers work on: files with macros unexpanded and preprocessor directives intact. That means tools had to map from preprocessed source files back to unpreprocessed source files. That's challenging even at first glance, but when you look closer, the problem gets harder. I found out that some systems #include a header file, modify preprocessor symbols it uses, then #include the header again--possibly multiple times. Imagine back-mapping from preprocessed source files to unpreprocessed source files in such systems!

Dealing with real C++ source code means dealing with real uses of the preprocessor, and at that workshop nearly a quarter century ago, I learned that real uses of the preprocessor doomed most tools before they got off the drawing board. It was a sobering experience.

In the ensuing 23 years, little has changed. Tools that transform C++ source code still have to deal with the realities of the preprocessor, and that's still difficult. In my last blog post, I proposed that the C++ Standardization Committee take into account how source-to-source transformation tools could reduce the cost of migrating old code to new standards, thus permitting the Committee to be more aggressive about adopting breaking changes to the language. In this post, I simply want to acknowledge that preprocessor macros make the development of such tools harder than my last post implied.

Consider this very simple C++:
#define ZERO 0

auto x = ZERO;
int *p = ZERO;
In the initialization of x, ZERO means the int 0. In the initialization of p, ZERO means the null pointer. What should a source code transformation tool do with this code if its job is to replace all uses of 0 as the null pointer with nullptr? It can't change the definition of ZERO to nullptr, because that would change the semantics of the initialization of x. It could, I suppose, get rid of the macro ZERO and replace all uses with either the int 0 or nullptr, depending on context, but (1) that's really outside its purview (programmers should be the ones to determine if macros should be part of the source code, not tools whose job it is to nullptr-ify a code base), and (2) ZERO could be used inside other macros that are used inside other macros that are used inside other macros..., and especially in such cases, reducing the macro nesting could fill the transformed source code with redundancies and make it harder to maintain. (It'd be the moral equivalent of replacing all calls to inline functions with the bodies of those functions.)

I don't recall a lot of talk about templates at the workshop in 1992. At that time, few people had experience with them. (The first compiler to support them, cfront 3.0, was released in 1991.) Nevertheless, templates can give rise to the same kinds of problems as the preprocessor:
template<typename T>
void setToZero(T& obj) { obj = 0; }

int x;
setToZero(x);    // "0" in setToZero means the int

int *p;
setToZero(p);    // "0" in setToZero means the null pointer
I was curious about what clang-tidy did in these situations (one of its checks is modernize-use-nullptr), but I was unable to find a way to enable that check in the version of clang-tidy I downloaded (LLVM version 3.7.0svn-r234109). Not that it matters. The way that clang-tidy approaches the problem isn't the only way, and one of the reasons I propose a decade-long time frame to go from putting a language feature on a hit list to actually getting rid of it is that it's likely to take significant time to develop source-to-source translation tools that can handle production C++ code, macros and templates and all.

The fact that the problem is hard doesn't mean it's insurmountable. The existence of refactoring tools like Clang-tidy (far from the only example of such tools) demonstrates that industrial-strength C++ source transformation tools can be developed. It's nonetheless worth noting that such tools have to take the existence of templates and the preprocessor into account, and those are noteworthy complicating factors.

-- UPDATE --

A number of comments on this post include references to tools that chip away at the problems I describe here. I encourage you to pursue those references. As I said, the problem is hard, not insurmountable.

Friday, November 13, 2015

Breaking all the Eggs in C++

If you want to make an omelet, so the saying goes, you have to break a few eggs. Think of the omelet you could make if you broke not just a few eggs, but all of them! Then think of what it'd be like to not just break them, but to replace them with newer, better eggs. That's what this post is about: breaking all the eggs in C++, yet ending up with better eggs than you started with.

NULL, 0, and nullptr

NULL came from C. It interfered with type-safety (it depends on an implicit conversion from void* to typed pointers), so C++ introduced 0 as a better way to express null pointers. That led to problems of its own, because 0 isn't a pointer, it's an int. C++11 introduced nullptr, which embodies the idea of a null pointer better than NULL or 0. Yet NULL and 0-as-a-null-pointer remain valid. Why? If nullptr is better than both of them, why keep the inferior ways around?

Backward-compatibility, that's why. Eliminating NULL and 0-as-a-null-pointer would break existing programs. In fact, it would probably break every egg in C++'s basket. Nevertheless, I'm suggesting we get rid of NULL and 0-as-a-null-pointer, thus eliminating the confusion and redundancy inherent in having three ways to say the same thing (two of which we discourage people from using).

But read on.

Uninitialized Memory

If I declare a variable of a built-in type and I don't provide an initializer, the variable is sometimes automatically set to zero (null for pointers). The rules for when "zero initialization" takes place are well defined, but they're a pain to remember. Why not just zero-initialize all built-in types that aren't explicitly initialized, thus eliminating not only the pain of remembering the rules, but also the suffering associated with debugging problems stemming from uninitialized variables?

Because it can lead to unnecessary work at runtime. There's no reason to set a variable to zero if, for example, the first thing you do is pass it to a routine that assigns it a value.

So let's take a page out of D's book (in particular, page 30 of The D Programming Language) and zero-initialize built-ins by default, but specify that void as an initial value prevents initialization:
int x;              // always zero-initialized
int x = void;       // never zero-initialized
The only effect such a language extension would have on existing code would be to change the initial value of some variables from indeterminate (in cases where they currently would not be zero-initialized) to specified (they would be zero-initialized). That doesn't lead to any backward-compatibility problems in the traditional sense, but I can assure you that some people will still object. Default zero initialization could lead to a few more instructions being executed at runtime (even taking into account compilers' ability to optimize away dead stores), and who wants to tell  developers of a finely-tuned safety-critical realtime embedded system (e.g., a pacemaker) that their code might now execute some instructions they didn't plan on?

I do. Break those eggs!

This does not make me a crazy man. Keep reading.

std::list::remove and std::forward_list::remove

Ten standard containers offer a member function that eliminates all elements with a specified value (or, for map containers, a specified key): list, forward_list, set, multiset, map, multimap, unordered_set, unordered_multiset, unordered_map, unordered_multimap. In eight of these ten containers, the member function is named erase. In list and forward_list, it's named remove. This is inconsistent in two ways. First, different containers use different member function names to accomplish the same thing. Second, the meaning of "remove" as an algorithm is different from that as a container member function: the remove algorithm can't eliminate any container elements, but the remove member functions can.

Why do we put up with this inconsistency? Because getting rid of it would break code. Adding a new erase member function to list and forward_list would be easy enough, and it would eliminate the first form of inconsistency, but getting rid of the remove member functions would render code calling them invalid. I say scramble those eggs!

Hold your fire. I'm not done yet.

override

C++11's override specifier enables derived classes to make explicit which functions are meant to override virtual functions inherited from base classes. Using override makes it possible for compilers to diagnose a host of overriding-relating errors, and it makes derived classes easier for programmers to understand. I cover this in my trademark scintillating fashion (ahem) in Item 12 of Effective Modern C++, but in a blog post such as this, it seems tacky to refer to something not available online for free, and that Item isn't available for free--at least not legally. So kindly allow me to refer you to this article as well as this StackOverflow entry for details on how using override improves your code.

Given the plusses that override brings to C++, why do we allow overriding functions to be declared without it? Making it possible for compilers to check for overriding errors is nice, but why not require that they do it? It's not like we make type checking optional, n'est-ce pas?

You know where this is going. Requiring that overriding functions be declared override would cause umpty-gazillion lines of legacy C++ to stop compiling, even though all that code is perfectly correct. If it ain't broke, don't fix it, right? Wrong!, say I. Those old functions may work fine, but they aren't as clear to class maintainers as they could be, and they'll cause inconsistency in code bases as newer classes embrace the override lifestyle. I advocate cracking those eggs wide open.

Backward Compatibility 

Don't get me wrong. I'm on board with the importance of backward compatibility. Producing software that works is difficult and expensive, and changing it is time-consuming and error-prone. It can also be dangerous. There's a reason I mentioned pacemakers above: I've worked with companies who use C++ as part of pacemaker systems. Errors in that kind of code can kill people. If the Standardization Committee is going to make decisions that outlaw currently valid code (and that's what I'd like to see it do), it has to have a very good reason.

Or maybe not. Maybe a reason that's merely decent suffices as long as existing code can be brought into conformance with a revised C++ specification in a way that's automatic, fast, cheap, and reliable. If I have a magic wand that allows me to instantly and flawlessly take all code that uses NULL and 0 to specify null pointers and revises the code to use nullptr instead, where's the downside to getting rid of NULL and 0-as-a-null-pointer and revising C++ such that the only way to specify a null pointer is nullptr? Legacy code is easily updated (the magic wand works instantly and flawlessly), and we don't have to explain to new users why there are three ways to say the same thing, but they shouldn't use two of them. Similarly, why allow overriding functions without override if the magic wand can instantly and flawlessly add override to existing code that lacks it?

The eggs in C++ that I want to break are the old ways of doing things--the ones the community now acknowledges should be avoided. NULL and 0-as-a-null-pointer are eggs that should be broken. So should variables with implicit indeterminate values. list::remove and forward_list::remove need to go, as do overriding functions lacking override. The newer, better eggs are nullptr, variables with indeterminate values only when expressly requested, list::erase and forward_list::erase, and override. 

All we need is a magic wand that works instantly and flawlessly.

In general, that's a tall order, but I'm willing to settle for a wand with limited abilities. The flawless part is not up for negotiation. If the wand could break valid code, people could die. Under such conditions, it'd be irresponsible of the Standardization Committee to consider changing C++ without the above-mentioned very good reason. I want a wand that's so reliable, the Committee could responsibly consider changing the language for reasons that are merely decent.

I'm willing to give ground on instantaneousness. The flawless wand must certainly run quickly enough to be practical for industrial-sized code bases (hundreds of millions of lines or more), but as long as it's practical for such code bases, I'm a happy guy. When it comes to speed, faster is better, but for the speed of the magic wand, good enough is good enough.

The big concession I'm willing to make regards the wand's expressive power. It need not perform arbitrary changes to C++ code bases. For Wand 1.0, I'm willing to settle for the ability to make localized source code modifications that are easy to algorithmically specify. All the examples I discussed above satisfy this constraint:
  • The wand should replace all uses of NULL and of 0 as a null pointer with nullptr. (This alone won't make it possible to remove NULL from C++, because experience has shown that some code bases exhibit "creative" uses of NULL, e.g., "char c = (char) NULL;". Such code typically depends on undefined behavior, so it's hard to feel too sympathetic towards it, but that doesn't mean it doesn't exist.)
  • The wand should replace all variable definitions that lack explicit initializers and that are currently not zero-initialized with an explicit initializer of void. 
  • The wand should replace uses of list::remove and forward_list::remove with uses of list::erase and forward_list::erase. (Updating the container classes to support the new erase member functions would be done by humans, i.e. by STL implementers. That's not the wand's responsibility.)
  • The wand should add override to all overriding functions.
Each of the transformations above are semantics-preserving: the revised code would have exactly the same behavior under C++ with the revisions I've suggested as it currently does under C++11 and C++14.

Clang

The magic wand exists--or at least the tool needed to make it does. It's called Clang. All hail Clang! Clang parses and performs semantic analysis on C++ source code, thus making it possible to write tools that modify C++ programs. Two of the transformations I discussed above appear to be part of clang-tidy (the successor to clang-modernize): replacing NULL and 0 as null pointers with nullptr and adding override to overriding functions. That makes clang-tidy, if nothing else, a proof of concept. That has enormous consequences.

Revisiting Backward Compatibility 

In recent years, the Standardization Committee's approach to backward compatibility has been to preserve it at all costs unless (1) it could be demonstrated that only very little code would be broken and (2) the cost of the break was vastly overcompensated for by a feature enabled by the break. Hence the Committee's willingness to eliminate auto's traditional meaning in C and C++98 (thus making it possible to give it new meaning in C++11) and its C++11 adoption of the new keywords alignas, alignof, char16_t, char32_t, constexpr, decltype, noexcept, nullptr, static_assert, and thread_local.

Contrast this with the perpetual deprecation of setting bool variables to true by applying ++ to them. When C++14 was adopted, that construct had been deprecated for some 17 years, yet it remains part of C++. Given its lengthy stint on death row, it's hard to imagine that a lot of code still depends on it, but my guess is that the Committee sees nothing to be gained by actually getting rid of the "feature," so, failing part (2) of the break-backward-compatibility test, they leave it in.

Incidentally, code using ++ to set a bool to true is another example of the kind of thing that a tool like clang-tidy should be able to easily perform. (Just replace the use of ++ with an assignment from true.)

Clang makes it possible for the Standardization Committee to retain its understandable reluctance to break existing code without being quite so conservative about how they do it. Currently, the way to avoid breaking legacy software is to ensure that language revisions don't affect it. The sole tool in the backward-compatibility toolbox is stasis: change nothing that could affect old code. It's a tool that works, and make no mistake about it, that's important. The fact that old C++ code continues to be valid in modern C++ is a feature of great importance to many users. It's not just the pacemaker programmers who care about it.

Clang's contribution is to give the Committee another way to ensure backward compatibility: by recognizing that tools can be written to automatically modify old code to conform to revised language specifications without any change in semantics. Such tools, provided they can be shown to operate flawlessly (i.e., they never produce transformed programs that behave any differently from the code they're applied to) and at acceptable speed for industrial-sized code bases, give the Standardization Committee more room to get rid of the parts of C++ where there's consensus that we'd rather not have them in the language.

A Ten-Year Process

Here's how I envision this working:
  • Stage 1a: The Standardization Committee identifies features of the language and/or standard library that they'd like to get rid of and whose use they believe can be algorithmically transformed into valid and semantically equivalent code in the current version or a soon-to-be-adopted version of C++. They publish a list of these features somewhere. The Standard is probably not the place for this list. Perhaps a technical report would be a suitable avenue for this kind of thing. 
  • Stage 1b: Time passes, during which the community has the opportunity to develop tools like clang-tidy for the features identified in Stage 1a and to get experience with them on nontrivial code bases. As is the case with compilers and libraries, the community is responsible for implementing the tools, not the Committee.
  • Stage 2a: The Committee looks at the results of Stage 1b and reevaluates the desirability and feasibility of eliminating the features in question. For the features where they like what they see, they deprecate them in the next Standard.
  • Stage 2b: More time passes. The community gets more experience with the source code transformation tools needed to automatically convert bad eggs (old constructs) to good ones (the semantically equivalent new ones).
  • Stage 3: The Committee looks at the results of Stage 2b and again evaluates the desirability and feasibility of eliminating the features they deprecated in Stage 2a. Ideally, one of the things they find is that virtually all code that used to employ the old constructs has already been converted to use the new ones. If they deem it appropriate, they remove the deprecated features from C++. If they don't, they either keep them in a deprecated state (executing the moral equivalent of a goto to Stage 2b) or they eliminate their deprecated status. 
I figure that the process of getting rid of a feature will take about 10 years, where each stage takes about three years. That's based on the assumption that the Committee will continue releasing a new Standard about every three years.

Ten years may seem like a long time, but I'm not trying to optimize for speed. I'm simply trying to expand the leeway the Standardization Committee has in how they approach backward compatibility. Such compatibility has been an important factor in C++'s success, and it will continue to be so.

One Little Problem

The notion of algorithmically replacing one C++ construct with a different, but semantically equivalent, construct seems relatively straightforward, but that's only because I haven't considered the biggest, baddest, ruins-everythingest aspect of the C++-verse: macros. That's a subject for a post of its own, and I'll devote one to it in the coming days. [The post now exists here.] For now, I'm interested in your thoughts on the ideas above.

What do you think?