A Deeper Look at Expressions in C
Evaluation Semantics

A lot of C code consists of expressions. Expressions are everywhere, really. For example, consider the assignment

i = 7 + j;

The sum on the right-hand side (7 + j) is an expression. That is clear. The expression is defined by the addition operator (+) and its two operands, and the value of the expression is the value of j plus 7. All of this is pretty obvious. However, j could be considered an expression, and even 7, and i. And the assignment as a whole is also an expression, defined by the assignment operator (=) and its two operands: a left-hand-side expression that determines the object whose value will be assigned, and a right-hand-side expression that determines the value that will be assigned to the left-hand-side object.

We now take a closer look at expressions in order to better understand the semantics of the evaluation of expressions, which is a fundamental part of the semantics of the language.

General Structure

Expressions have a recursive structure. That is, an expression is a combination of sub-expressions. This “combination” is done by means of operators. For example, the addition operator (+) combines two sub-expressions that define the two operands of the addition. The semantics of an operator, meaning its concrete execution at run-time, usually involves some computation. For most operators, including the + operator in our example, the operands are the “input” of the computation, and its “output” is the value of that operator as an expression (or sub-expression). Specifically, the + operator in our example computes—you guessed it—the arithmetic sum of the operands.¹ This is it for the + operator. However, that’s not the end of the story for other operators.

In addition to taking input values and producing output values, some operators also cause side-effects. For example, consider the (sub)expression i++ in the the expression j = i++. The operator is ++ (“postfix increment”) takes one operand (i) and returns its value. But of course that’s not the end of it. As a side-effect, the operator also increments the value of its operand. Do you see any other operator in that whole expression (j = i++) that might have a side-effect? What are the input values? What is the output value? What are the side effects?

So, in essence, in order to understand expressions in C and C++, you must understand two things: the first one is individual operators, with the properties and constraints of their computations and side-effects. The second one is the recursive structure of an entire expression, that is, how multiple operators in a source expression are interconnected, with the output of some operators feeding into the input of others. For example, the expression i = 7 + j that we already discussed has the structure illustrated below:

The sub-expressions 7 and j feed into the + operator, whose result feeds into the = operator (assignment) as its right-hand side operand, while the sub-expression i feeds into the same = operator as its left-hand side operand. The = operator also has a result, but that goes nowhere, since the whole expression forms a single statement.

And here is our other example: j = i++, which has the following structure:

Here, the sub-expression i feeds into the ++ operator (postfix), whose result feeds into the = operator (assignment) as its right-hand side operand, while the sub-expression j feeds into the same = operator as its left-hand side operand.

Objects, L-values, and R-values

Recall that an object is a region of storage (in memory) that stores a value. Objects can of course be used in expressions. For example, in the expression i = 7 + j, we use objects i and j. In particular, i is the first (left-hand side) operand of the = operator while j is the second (right-hand side) operand of the + operator. Now, do you see a difference between those two cases?

Yeah, there is a difference: the = operator writes into i while the + operator reads from j. In other words, the = operator uses i as a destination object to store a value, while + uses j as a source value. That is, i is used as storage (an object) by the = operator—because the whole point of the = operator is store a value into an object. Instead, j is only used for its current value by the + operator. Of course, j also happens to be an object, but + doesn’t use it to store anything, so it might just as well be a temporary computed value that will never be stored anywhere.

And there is also the case of the i++ expression where the ++ operator uses its operand i both as an object, to store the incremented value, and as a value, to compute the incremented value.

These two ways of using operands and therefore subexpressions—as objects to store values, or as a pure values, with or without an object—are fundamental in understanding expressions. We therefore give them two different names. We call them l-values and r-values, respectively.

An l-value expression is an expression that refers to an object. An obvious example of an l-value expression, which we have already seen, is an identifier. For example, the left-hand side of the assignment i = 7 is the expression i, which is an l-value expression, since it refers to an object (i). However, single variables aren’t the only l-value expressions. Here is a slightly more complex example:

char s[100];

s[i] = 'a';

Here the expression s[i] is an l-value. That is, it is an expression that refers to an object, namely the char object at position i in the s array. Here’s another one, just a bit more complicated—but you got the idea, right?

struct S {
    int x;
    char s[10];
};

struct S * fun2(int);

(fun2(i + 2) + 5)->s[i] = '?';

The left-hand side of the last expression statement (above) refers an object: a particular char object in the s array of a struct S object located in some sequence of struct S objects, specifically 5 struct S objects down from one whose address is returned by the invocation of the fun2 function, which in fact returns pointers to struct S objects. Bottom line, it’s an l-value expression.

That’s no coincidence. It must be an l-value. The left-hand side of an assignment must be an l-value. An assignment stores a value into an object. The right-hand side of the assignment determines the value, and the left-hand side determines the object. In fact, that’s what the name comes from: an l-value is what you want on the left-hand side of the assignment. This concept might be grasped intuitively by considering the following two expression statements:

char s[100];

/* CORRECT: s[i + 1] is an l-value, that's the object being assigned. */
s[i + 1] = 'a'; 
/* INCORRECT:  What object are we assigning? Makes no sense. */
s[i] + 1 = 'a';

Did you try compiling this example? What does the compiler say?

So, just to reiterate for the fourth time, an l-value is what you need whenever you need to refer to an object. Like on the left-hand side of an assignment. However, the right-hand side of an assignment may also denote a value that does not correspond to an object. The most obvious example is a literal value:

i = 7;

But many other expressions evaluate to a value without a corresponding object. These are called r-values. Here’s another example:

i = (fun2(i + 2) + 5)->s[i] * i;

So, both l-values and r-values are expressions, but the main difference is that an l-value refers to an object, while an r-value refers to a value without a corresponding object. Of course, an object also has a value, so an l-value expression can be used in both cases, when you need an object or when you just need a value.

Another thing you can do with an l-value that you can not do with an r-value is to apply the address-of operator (&). Makes sense: the address-of operator returns the address of an object, so it must apply to an object.

/* CORRECT: the & operator applies to l-values */
int * p = &i;
char * c = &((fun2(i + 2) + 5)->s[i]);
/* INCORRECT: these are not l-values */
int * p_bad = &(i + 2);
char * c_bad = &(i > 0 ? 'x' : 'y');

Okay, but what is this all about? I can alredy read you mind: Why should one care about distinguishing l-value from r-value expressions? The compiler does that for us anyway. Why should you care? The reason is that these are fundamental concepts to understand the semantics of expressions in general. These concepts are particularly important to understand C++, where operators can be defined by the programmer, and where functions can return l-values, which in C++ are called references.

Each operator in the C language has specific requirements for its operands. As we have seen, the assignment operator (=) requires that its first operator (left-hand side) is an l-value; the address-of operator (&), which is a unary operator, requires that its operand be an l-value, and the same is true for the increment and decrement operators (++, --); instead, the addition operator (binary +) only requires that its operands be r-values.

Furthermore, the result of an expression consisting of a particular operator may be an l-value or an r-value. For example, the return of the addition operator (binary +) is an r-value, while the array-index operator [] returns an l-value expression. For example A[3] is an l-value.

Pop quiz 1: does the dereference operator (*) take an l-value or an r-value?

Pop quiz 2: does the dereference operator (*) return an l-value or an r-value?

Pop quiz 3: does the assignment operator (=) return an l-value or an r-value?

Pop quiz 4: does the prefix increment (++) return an l-value or an r-value?

Pop quiz 5: does the postfix increment (++) return an l-value or an r-value?

Operator Precedence and Associativity

See http://en.cppreference.com/w/c/language/operator_precedence

Operator precedence determines the exact semantics of an expression in terms of which operators are evaluated first, and therefore whether the output of one operator feeds into another one as its input, or vice-versa.

For example, consider the following program:

int i = 5;
int j = 10;

i = i + j * 2;

printf("i = %d\n", i);

Here the result is 25, not 30, since the multiplication operator (*) has a higher precedence than the addition operator (+). Of course, you can use parenthesized expressions to force a certain evaluation precedence. So, the output of the code below is 30, not 25:

int i = 5;
int j = 10;

i = (i + j) * 2;

printf("i = %d\n", i);

When operators have the same precedence, the semantics is defined by the associativity. For example, the + and - operators, when used as binary operators representing subtraction and addition, have the same precedence, and are left-associative:

int i = 10;
int j = 2;
i = i - j + 4;

So the above expression is equivalent to the following:

i = ((i - j) + 4); /* == 12 */

as opposed to this:

i = (i - (j + 4)); /* == 6 */

Other operators associate from right to left. For example:

int i = 3;
int j = 5;

i = j = 7;
printf("i = %d, j = %d\n", i, j);

where the assignment expression is equivalent to this:

i = (j = 7);

Pop quiz 1: what happens here? And why?

int i = 3;
int j = 5;

(i = j) = 7;
printf("i = %d, j = %d\n", i, j);

Pop quiz 2: what happens here? And why?

int i = 3;
int j = 5;

i = -j--;

printf("i = %d, j = %d\n", i, j);

…and now for super-hackers:

Pop quiz 3: what happens here? And why?

int i = 3;
int j = 5;

i = ---j;

printf("i = %d, j = %d\n", i, j);

Evaluation Order

The semantics of an expression, as determined by the precedence and associativity of operators, dictates how an expression is to be parsed. In other words, as we have seen, we can interpret an expression as a tree, where the nodes are operators and the edges connect each operator to the subextressions that make up its operands. In this view, the precedence and associativity of operators defines the shape of the tree. The lower-precedence operators will be at the top of the tree, while the higher-precedence operators will be at the bottom of the trees (closer to the leaves).

Another way to look at the associativity rules and therefore the parse tree of an expression is to consider it as a data flow diagram, where every sub-expression is a source of data, and data flows from sub-expressions into and out of operators to other operators up to determine the value of the highest-level expression. For example, in this interpretation of the expression i = j + k * 10, data flows from k and literal value 10 into the * operator; then the result of the * operator flows into + operator together with the data from j; then the result of the + operator goes into the = operator (right-hand side) together with i (left-hand side).

This is all very nice. It’s a good way to look at the semantics of expressions. In fact, you should take a piece of paper and practice drawing the parse and data-flow tree for many expressions. But this is not what this section is about. Here we talk about the order of evaluation of operands and subexpressions, which also includes the order in which their side-effects take place and therefore become visible to other expressions. This aspect of the semantics of expressions is crucial and yet it might be quite confusing. One might think that, once you understand the precedence and associativity of operators, and therefore once you have the parse tree, you can also figure out the order of evaluation. But that isn’t the case. We need to learn new rules for that.

To illustrate the question of the evaluation order, consider the following code:

i = f() + g() - h();

What is the order of invocation of the three functions?

Does the order change in the following expression? And if so, how?

i = f() + g() * h();

And what about this case:

A[f()] = B[g()] = C[h()];

What is the order of invocation of the three functions above?

The answer is, the order is unspecified! The order is in fact unspecified in the previous examples as well. Notice that “unspecified” here does not mean that the behavior of your program is undefined, like when you dereference an invalid pointer or something. I repeat, unspecified evaluation order is not undefined behavior. The behavior is well-defined, and corresponds to one of the possible orders. Six of them, to be precise. You just don’t know which one. But you know that your computer won’t suddenly catch fire—at least not because of the evaluation of those expressions.

Notice also that the evaluation of an expression (or subexpression) might include the computations of values, as well as the initiation of side-effects. Consider for example the expression in the following conditional statement (if-condition):

int A[10];
/* ... */
if (++i < 10 && A[i] == 0) {
    /* ... */
}

Does that look okay? Do you understand exactly what happens in each case?

I’m guessing you figured: you first increment \(i\), then check that the incremented value is less than 10 and the value of \(A[i]\) is zero, where \(i\) as the index in \(A\) is the incremented value. Right?

Well, good. That is correct.

But then, consider what happens here:

int A[10];
/* ... */
if (i < 9) {
    int x = A[i] + A[++i];
    /* ... */
}

Anything wrong here? You might think, first of all, \(i\) is less than 9, so it is safe to use A[i] and A[++i], and then \(x\) is computed as the sum of \(A[i]\) \(A[i+1]\). Easy, right?

Nope. The above code invokes undefined behavior, which is bad (in case you were wondering). The question is, why?

Sequencing and Sequence Points

The C and C++ languages do not define a precise evaluation order for all expressions. (Other languages such as Java do that.) This means that in C and C++ the evaluation order of expressions is sometimes unspecified. For example, the order in which the arguments of a function call are evaluated is unspecified. One reason for this design choice is to give maximum flexibility to the compiler to perform aggressive code optimization. On the other hand, this means that the programmer must make sure that the code is correct in all allowable evaluation orders. Notice that the order is sometimes unspecified. Other times, the order of evaluation is fully specified and therefore unambiguous. And in any case, the programmer can explicitly specify any evaluation order.

Let’s see how all of this works. We start by defining the fundamental notion of execution order. We consider the order between any two elementary evaluations \(A\) and \(B\). An elementary evaluation might represent the use of the value of an object; the computation of a value by the application of an operator; the side-effect of the application of an operator on a scalar value, such as the assignment or the increment of an int value; or the invocation of a function.

In some cases, an evaluation \(A\) is sequenced before another evaluation \(B\). In other cases, it’s the opposite. And yet in other cases, it is neither, that is, neither \(A\) is squenced before \(B\) nor \(B\) is sequenced before \(A\). In this latter case, we also say that \(A\) and \(B\) are unsequenced relative to each other.

Here is an example:

A[i] = A[i+1] * A[i+2];

Here the evaluations of the two operands of the multiplication operator are unsequenced. This means that the expressions A[i+1] and A[i+2] may be evaluated in any order. In fact, the same is true for the subexpressions that make up the array indexes. In fact, even the evaluation of the two operands of the assignment operator are unsequenced. This means that the determination of the destination object (l-value A[i]) and the determination of the value to be assigned (r-value A[i+1]*A[i+2]) are unsequenced. However, those two evaluations are sequenced before the side effect of storing the value of the right operand (r-value) into the object of the left operand (l-value). This is confusing. I know. So, just go back and read this paragraph once again from the beginnig. Trust me: do it!

Here is another, related example:

i = f(i) + g(i);

Here the evaluations of the left- and right-hand sides of the assignment, i (l-value) and f(i) + g(i), respectively, are unsequenced. However, all of them are sequenced before the side effect of the assignment. This means that the value of i used in evaluating f(i) and g(i) is the initial value of i, which is later replaced by the value of the expression f(i) + g(i).

Furthermore, as in the previous case for multiplication, the evaluation of the operands of the addition operator are unsequenced. So, neither f(i) is sequenced before g(i), nor g(i) is sequenced before f(i). However, in this case, both evaluations include the invocation of a function, f(i) and g(i) respectively, and since the executions of functions do not overlap, f(i) is sequenced either before or after g(i), but we do not know which. In this case we say that the two evaluations (function executions) are indeterminately sequenced.

Here is yet another example:

if (i > 0 || f(i) == 1 || g(i) == 2) {
    /* ... */
}

Here the evaluation is completely specified. In fact, the evaluation of i > 0 is sequenced before the evaluation of f(i) == 1, which is sequenced before g(i) == 2. Furthermore, since the evaluation of logical operators always uses a short-cut semantics, f(i) == 1 is evaluated only if i > 0 evaluates to false, and g(i) == 2 is evaluated only if f(i) == 1 also evaluates to false.

So, what determines the sequencing of evaluations? Many language contstructs do that. Some operators induce some sequencing. We have seen that an assignment (= operator) introduces a sequencing between, on the one hand the computations of its two operands, and on the other hand the side effect of assignment. But the assignment does not induce a complete sequencing between any other evaluations. In particular, if the computations of the operands of the assignment have side-effects, those are unsequenced relative to all other computations in the assignment expression and also relative to the side-effect of the assignment.

Other language constructs introduce complete sequencing between their sub-expressions or even more generally. These points in the program (flow) are what we call sequence points. Everything before a sequence point is sequenced before anything after that sequence point.

The most straightforward example of a sequence point is the end of a full expression statement. Syntactically, that is indicated by a semicolon (;):

i = g(j) + h(j);
j = f(i);

Here g(j) and h(j) are sequenced before f(i), although g(j) and h(j) are unsequenced. Also, the assignment to i in the first expression statement is sequenced before the evaluation of f(i) in the second expression statement.

There is also a sequence point between the evaluation of function parameters (as well as the function designator) and the actual call. So, for example

j = f(g(j) + h(j));

here the evaluation of the parameter to function f is sequenced before the actual call of f, so the invocations of g and h are both sequenced before the invocation of f. However, the invocations of g and h are indeterminately sequenced.

Other syntactic structures introduce sequence points. The details are intuitive in most cases.

So, what happens when two evaluations are unsequenced? There are good and bad cases. Good cases are perfectly fine expressions (indeed most non-trivial expressions) in which unsequenced expressions do not interfere with each other, and the semantics of the program is unambiguous anyway.

Bad cases are those in which unsequenced expressions have interrelated side-effects, which leads to ambiguities in the interpretation of the evaluation. For example:

j = i++ - i--;

What is the value of j? And what is the value of i?

And again:

int i = 0;
j = A[++i] + B[++i];

What’s the semantics here? Is it j == A[1] + B[2]? Or perhaps j == A[2] + B[1]? Or j == A[1] + B[1]?

The answer is, in all these cases the behavior is undefined. And it should be at least intuitively clear, because they are all ambiguous cases. However, notice that there are other, perhaps more subtle cases in which the behavior is also undefined. For example:

int i = 0;
i = ++i;

This statements might look innoquous, in the sense that it doesn’t look ambiguous. You might think that object i gets its own increment, which changes the value of i from 0 to 1. The increment is also changing the value of \(i\), but the final value is the same, so… no problem, right? There does not seem to be any other possible interpretation. But no, the assigment expression invokes undefined behavior. Now, I know what you’re thinking. Nobody in their right might should write that assignment anyway. And you’re right. If you want \(i\) to take the value \(i+1\), then just write i = i + 1. So yes, that example is contrived. But what about the following expression?

A[j] = j++;

This one seems totally reasonable, doesn’t it? And yet, it too is undefined behavior.

Why?

This is the precise wording in the definition of the C language:

If a side effect on a scalar object is unsequenced relative to either a different side effect on the same scalar object or a value computation using the value of the same scalar object, the behavior is undefined. If there are multiple allowable orderings of the subexpressions of an expression, the behavior is undefined if such an unsequenced side effect occurs in any of the orderings.

So, the key point is that a side effect on an object should never be unsequenced relative to another side effect on the same object, or relative to a use of the value of the same object. In the two examples above, that’s exactly what we have. In i = ++i there are two side-effects on the same object i, one from the increment operator and one from the assignment. And those side-effects are unsequenced. But okay, we already said that i = ++i is a silly if not perverse statement that no sane programmer would write. I concede that point of view. However, the second statement is not so absurd. Say you want to fill an array with the numbers between 0 and 9. A programmer—perhaps one who is inexperienced, but not totally insane—might well write j = 0; while (j < 10) A[j] = j++; thinking that there is no ambiguity. However, the side-effect on j for the increment, and the use of the same object j in the evaluation of the left-hand side expression, are unsequenced.

Are you still not convinced that this type of undefined behavior is really a problem? Fair enough. Test yourself with the follwing quizzes.

Pop quiz: do the following two statements invoke undefined behavior? Why?

A[i] = i;
i = i + 1;