# A Deeper Look at Expressions in C

Evaluation Semantics

A lot of C code consists of *expressions*. Expressions are
everywhere, really. For example, consider an assignment

i = 7 + j;

The sum on the right-hand side (`7 + j`

) is an expression. That is
clear. The expression is defined by the addition operator (`+`

) and
its two operands, and the value of the expression is the value of `j`

plus 7. All of this is pretty obvious. However, `j`

could be
considered an expression, and even `7`

, and `i`

. And the assignment
as a whole is also an expression.

In fact, an assignment expression is defined by the assignment
operator (`=`

) and its two operands (or arguments): a left-hand-side
expression that determines the object whose value will be assigned,
and a right-hand-side expression that determines the value that will
be assigned to the left-hand-side object. And the whole assignment is
itself an expression whose value (or “result”) is the value being
assigned, meaning the result of the right-hand-side expression.

We now take a closer look at expressions in order to better understand the semantics of the evaluation of expressions, which is a fundamental part of the semantics of the language.

## General Structure

Expressions have a recursive structure. That is, an expression is a
combination of sub-expressions. This “combination” is done by means
of *operators.* For example, the addition operator (`+`

) combines two
sub-expressions that define the two operands of the addition. The
semantics of this combination, meaning the its execution at run-time,
usually involves some computation and/or some side-effects. In the
case of the addition operator, the values of the two operands are the
“input” of the computation, and the “output” is—you guessed
it—their sum. The addition operator itself has no side-effects.

So, in essence, in order to understand expressions in C and C++, you
must understand two things: operators, with the properties and
constraints of their computations and side-effects; and the recursive
structure of an entire expression, that is, how multiple operators in
a source expression are interconnected, with the output of some
operators feeding into the input of others. For example, the
expression `i = 7 + j`

that we already discussed has the structure
illustrated in Figure 2:

Figure 1: Data-flow diagram for `i = 7 + j`

.

The sub-expressions `7`

and `j`

feed into the `+`

operator, whose result
feeds into the `=`

operator (assignment) as its right-hand side operand,
while the sub-expression `i`

feeds into the same `=`

operator as its
left-hand side operand. The `=`

operator also has a result, but that
goes nowhere, since the whole expression forms a single statement.

Another example is the expression `i = i - 7 + j`

, which has the
following structure:

Figure 2: Data-flow diagram for `i = 7 + j`

.

Here, the sub-expressions `7`

and `j`

feed into the `+`

operator, whose result
feeds into the `=`

operator (assignment) as its right-hand side operand,
while the sub-expression `i`

feeds into the same `=`

operator as its
left-hand side operand. The `=`

operator also has a result, but that
goes nowhere, since the whole expression forms a single statement.

## Objects, L-values, and R-values

An *object* is a region of storage, in memory. An object can
therefore store a value, which can be read and written. Examples are
a variable of a basic type such as this one:

int i;

This defines an `int`

object corresponding to variable `i`

. Notice
that object `i`

may have static storage or automatic storage, which
affects its lifetime. However, for the purpose of this discussion, we
are not concerned about the storage class and therefore the lifetime
of `i`

. All that matters at this point is that the object exists.

An *l-value* is an *expression* referring to an object. An obvious
example of an l-value expression is an identifier. For example, the
left-hand side of the assignment below is the expression `i`

, which is
an l-value expression, since it refers to an object (`i`

).

i = 7;

Single variables aren’t the only l-value expressions. Here is a slightly more complex example:

char s[100]; s[i] = 'a';

Here the expression `s[i]`

is an l-value. That is, it is an
expression that refers to an object, namely the `char`

object at
position `i`

in the `s`

array. Here’s another one, just a bit
more complicated—but you got the idea, right?

struct S { int x; char s[10]; }; struct S * fun2(int); (fun2(i + 2) + 5)->s[i] = '?';

The left-hand side of the last expression statement (above) refers an
object: a particular `char`

object in the `s`

array of a `struct S`

object located in some sequence of `struct S`

objects, specifically 5
`struct S`

objects down from one whose address is returned by the
invocation of the `fun2`

function, which in fact returns pointers to
`struct S`

objects. Bottom line, it’s an l-value expression.

That’s no coincidence. It *must* be an l-value. The left-hand side
of an assignment must be an l-value. An assignment stores a value
into an object. The right-hand side of the assignment determines the
value, and the left-hand side determines the object. In fact, that’s
what the name comes from: an *l*-value is what you want on the
*left*-hand side of the assignment. This concept might be grasped
intuitively by considering the following two expression statements:

char s[100]; /* CORRECT: s[i + 1] is an l-value, that's the object being assigned. */ s[i + 1] = 'a'; /* INCORRECT: What object are we assigning? Makes no sense. */ s[i] + 1 = 'a';

Did you try compiling this example? What does the compiler say?

So, an l-value is what you need whenever you need to refer to an
object. Like on the left-hand side of an assignment. However, the
right-hand side of an assignment may also denote a value that does not
correspond to an object. The most obvious example is a *literal
value*:

i = 7;

But many other expressions evaluate to a value without a corresponding
object. These are called *r-values*. Here’s another example:

i = (fun2(i + 2) + 5)->s[i] * i;

So, both l-values and r-values are *expressions*, but the main
difference is that an l-value refers to an *object*, while an r-value
refers to a value *without a corresponding object*. Of course, an
object also has a value, so an l-value expression can be used in both
cases, when you need an object or when you just need a value.

Another thing you can do with an l-value that you can not do with an
r-value is to apply the *address-of* operator (`&`

). Makes sense: the
address-of operator returns the address of an object, so it must apply
to an object.

/* CORRECT: the & operator applies to l-values */ int * p = &i; char * c = &((fun2(i + 2) + 5)->s[i]); /* INCORRECT: these are not l-values */ int * p_bad = &(i + 2); char * c_bad = &(i > 0 ? 'x' : 'y');

Okay, but what is this all about? Why should one care about
distinguishing l-value from r-value expressions? The compiler does
that for us anyway. Why should *you* care? The reason is that these
are fundamental concepts to understand the semantics of expressions in
general. These concepts are particularly important to understand C++,
where operators can be defined by the programmer, and where functions
can return l-values (references).

Each operator in the C language has specific requirements for its
operands. As we have seen, the assignment operator (`=`

) requires
that its first operator (left-hand side) is an l-value; the address-of
operator (`&`

), which is a unary operator, requires that its operand
be an l-value, and the same is true for the increment and decrement
operators (`++`

, `--`

); instead, the addition operator (binary `+`

)
only requires that its operands be r-values.

Furthermore, the *result* of an expression consisting of a particular
operator may be an l-value or an r-value. For example, the return of
the addition operator (binary `+`

) is an r-value, while the
array-index operator `[]`

returns an l-value expression. For example
`A[3]`

is an l-value.

**Pop quiz 1:** does the dereference operator (`*`

) take an l-value or
an r-value?

**Pop quiz 2:** does the dereference operator (`*`

) return an l-value or
an r-value?

**Pop quiz 3:** does the assignment operator (`=`

) return an l-value or
an r-value?

**Pop quiz 4:** does the prefix increment (`++`

) return an l-value or
an r-value?

**Pop quiz 5:** does the postfix increment (`++`

) return an l-value or
an r-value?

## Operator Precedence and Associativity

See http://en.cppreference.com/w/c/language/operator_precedence

Operator precedence determines the semantics—that is, the exact interpretation—of an expression. For example:

int i = 5; int j = 10; i = i + j * 2; printf("i = %d\n", i);

Here the result is 25, not 30, since the multiplication operator (`*`

)
has a higher precedence than the addition operator (`+`

). Of course,
you can use parenthesized expressions to force a certain evaluation
precedence. So, the output of the code below is 30, not 25:

int i = 5; int j = 10; i = (i + j) * 2; printf("i = %d\n", i);

When operators have the same precedence, the semantics is defined by
the associativity. For example, the `+`

and `-`

operators, when used
as binary operators representing subtraction and addition, have the
same precedence, and are left-associative:

int i = 10; int j = 2; i = i - j + 4;

So the above expression is equivalent to the following:

i = ((i - j) + 4); /* == 12 */

as opposed to this:

i = (i - (j + 4)); /* == 6 */

Other operators associate from right to left. For example:

int i = 3; int j = 5; i = j = 7; printf("i = %d, j = %d\n", i, j);

where the assignment expression is equivalent to this:

i = (j = 7);

**Pop quiz 1:** what happens here? And *why?*

int i = 3; int j = 5; (i = j) = 7; printf("i = %d, j = %d\n", i, j);

**Pop quiz 2:** what happens here? And *why?*

int i = 3; int j = 5; i = -j--; printf("i = %d, j = %d\n", i, j);

**Pop quiz 3:** what happens here? And *why?*

int i = 3; int j = 5; i = ---j; printf("i = %d, j = %d\n", i, j);

## Evaluation Order

The semantics of an expression, as determined by the precedence and
associativity of operators, dictates how an expression is to be
*parsed.* In other words, we can interpret an expression as a tree,
where the nodes are operators and the edges connect each operator to
the submextressions that make up its operands. In this view, the
precedence and associativity of operators defines the shape of the
tree. The lower-precedence operators will be at the top of the tree,
while the higher-precedence operators will be at the bottom of the
trees (closer to the leaves).

Another way to look at the associativity rules and therefore the parse
tree of an expression is to consider it as a data flow diagram, where
every sub-expression is a source of data, and data flows from
sub-expressions into and out of operators to other operators up to
determine the value of the highest-level expression. For example, in
this interpretation of the expression `i = j + k * 10`

, data flows
from `k`

and literal value `10`

into the `*`

operator; then the result
of the `*`

operator flows into `+`

operator together with the data
from `j`

; then the result of the `+`

operator goes into the `=`

operator (right-hand side) together with `i`

(left-hand side).

This is all very nice. It’s a good way to look at the semantics of
expressions. In fact, you should take a piece of paper and practice
drawing the parse and data-flow tree for many expressions. But this
is not what this section is about. Here we talk about the *order of
evaluation* of the operands and (subexpressions). And that has little
to do with the parse and data-flow tree. The same is true of the
side-effects of the various operators. These include the
“side-effect” of incrementing a variable with `++`

or of assigning one
with `=`

. Those are also not determined by precedence or
associativity rules. We need to learn new rules for that.

To illustrate the question of the evaluation order, consider the following code:

```
i = f() + g() - h();
```

What is the order of invocation of the three functions?

Does the order change in the following expression? And if so, how?

```
i = f() + g() * h();
```

And what about this case:

A[f()] = B[g()] = C[h()];

What is the order of invocation of the three functions?

The answer is, *unspecified!* The order is in fact unspecified in the
previous examples as well.

Notice that the *evaluation* of an expression (or subexpression) might
include the computations of values, as well as the initiation of
side-effects.

### Sequencing and Sequence Points

The C and C++ languages do not mandate a precise evaluation order for all expressions. (Other languages such as Java do that.) This means that in C and C++ the evaluation order of expressions is sometimes unspecified. For example, the order in which the arguments of a function call are evaluated is unspecified. One reason for this design choice is to give maximum flexibility to the compiler to perform aggressive code optimization. On the other hand, this means that the programmer must make sure that the code is correct in all allowable evaluation orders. And of course the programmer can explicitly determine any evaluation order. Also, in some cases the order of evaluation is fully specified and therefore unambiguous.

Let’s learn how all of this is done. We start by defining the
fundamental notion of execution order. We consider the order between
any two elementary evaluations \(A\) and \(B\). An elementary evaluation
might represent the use of the value of an object, the computation of
a value by the application of operator, the side-effect of the
application of an operator on scalar value, such as the assignment or
the increment of an `int`

value, or the invocation of a function.

In some cases, an evaluation \(A\) is *sequenced before* another
evaluation \(B\). In other cases, it’s the opposite. And yet in other
cases, it is *neither*, that is, neither \(A\) is squenced before \(B\)
nor \(B\) is sequenced before \(A\). In this latter case, we also say
that \(A\) and \(B\) are *unsequenced* relative to each other.

Here is an example:

A[i] = A[i+1] * A[i+2];

Here the evaluations of the two operands of the multiplication
operator are *unsequenced*. This means that the expressions `A[i+1]`

and `A[i+2]`

may be evaluated in any order. In fact, the same is true
for the subexpressions that make up the array indexes. In fact, even
the evaluation of the two operands of the assignment operator are
unsequenced. This means that the determination of the destination
object (l-value `A[i]`

) and the determination of the value to be
assigned (r-value `A[i+1]*A[i+2]`

) are unsequenced. However, those
two evaluations are *sequenced before* the side effect of storing the
value of the right operand (r-value) into the object of the left
operand (l-value). This is confusing. I know. So, just go back and
read this paragraph once again from the beginnig. Trust me: do it!

Here is another, related example:

```
i = f(i) + g(i);
```

Here the evaluations of the left- and right-hand sides of the
assignment, `i`

(l-value) and `f(i) + g(i)`

, respectively, are
unsequenced. However, all of them are *sequenced before* the side
effect of the assignment. This means that the value of `i`

used in
evaluating `f(i)`

and `g(i)`

is the initial value of `i`

, which is
*later* replaced by the value of the expression `f(i) + g(i)`

.

Furthermore, as in the previous case for multiplication, the
evaluation of the operands of the addition operator are unsequenced.
So, neither `f(i)`

is *sequenced before* `g(i)`

, nor `g(i)`

is
*sequenced before* `f(i)`

. However, in this case, both evaluations
include the invocation of a function, `f(i)`

and `g(i)`

respectively,
and since the executions of functions do not overlap, `f(i)`

is
sequenced either before or after `g(i)`

, but we do not know which. In
this case we say that the two evaluations (function executions) are
*indeterminately sequenced*.

Here is yet another example:

if (i > 0 || f(i) == 1 || g(i) == 2) { /* ... */ }

Here the evaluation is completely specified. In fact, the evaluation
of `i > 0`

is sequenced before the evaluation of `f(i) == 1`

, which is
sequenced before `g(i) == 2`

. Furthermore, since the evaluation of
logical operators *always* uses a short-cut semantics, `f(i) == 1`

is
evaluated *only if* `i > 0`

evaluates to *false*, and `g(i) == 2`

is
evaluated *only if* `f(i) == 1`

also evaluates to *false*.

So, what determines the sequencing of evaluations? Many language
contstructs do that. Some operators induce some sequencing. We have
seen that an assignment (`=`

operator) introduces a sequencing
between, on the one hand the computations of its two operands, and on
the other hand the side effect of assignment. But the assignment does
not induce a complete sequencing between any other evaluations. In
particular, if the computations of the operands of the assignment have
side-effects, those are unsequenced relative to all other computations
in the assignment expression and also relative to the side-effect of
the assignment.

Other language constructs introduce complete sequencing between their
sub-expressions or even more generally. These points in the program
(flow) are what we call *sequence points*. Everything before a
sequence point is *sequenced before* anything after that sequence
point.

The most straightforward example of a sequence point is the end of a
full expression statement. Syntactically, that is indicated by a
semicolon (`;`

):

i = g(j) + h(j); j = f(i);

Here `g(j)`

and `h(j)`

are *sequenced before* `f(i)`

, although `g(j)`

and `h(j)`

are unsequenced. Also, the assignment to `i`

in the first
expression statement is *sequenced before* the evaluation of `f(i)`

in
the second expression statement.

There is also a sequence point between the evaluation of function parameters (as well as the function designator) and the actual call. So, for example

```
j = f(g(j) + h(j));
```

here the evaluation of the parameter to function `f`

is *sequenced
before* the actual call of `f`

, so the invocations of `g`

and `h`

are
both *sequenced before* the invocation of `f`

. However, the
invocations of `g`

and `h`

are *indeterminately sequenced*.

Other syntactic structures introduce sequence points. The details are intuitive in most cases.

So, what happens when two evaluations are *unsequenced*? There are
good and bad cases. Good cases are perfectly fine expressions (indeed
most non-trivial expressions) in which unsequenced expressions do not
interfere with each other, and the semantics of the program is
unambiguous anyway.

Bad cases are those in which unsequenced expressions have interrelated side-effects, which leads to ambiguities in the interpretation of the evaluation. For example:

```
j = i++ - i--;
```

What is the value of `j`

? And what is the value of `i`

?

And again:

int i = 0; j = A[++i] + B[++i];

What’s the semantics here? Is it `j == A[1] + B[2]`

? Or perhaps `j == A[2] + B[1]`

?
Or `j == A[1] + B[1]`

?

The answer is, in all these cases *the behavior is undefined*. And it
should be at least intuitively clear, because they are all ambiguous cases.
However, notice that there are other, perhaps more subtle cases in
which the behavior is also undefined. For example:

int i = 0; int j = 1; i = ++i; A[j] = j++;

These two (latter) statements might look innoquous. In the first
assignment, you might think that object `i`

gets its own increment,
which changes the value of `i`

from `0`

to `1`

. There does not seem
to be any other possible interpretation. Similarly, the following
assignment statement seems to assign `A[1] = 1`

and then increment
`j`

. However, both these statements invoke *undefined behavior*.

Why?

This is the precise wording in the definition of the C language:

If a side effect on a scalar object is unsequenced relative to either a different side effect on the same scalar object or a value computation using the value of the same scalar object, the behavior is undefined. If there are multiple allowable orderings of the subexpressions of an expression, the behavior is undefined if such an unsequenced side effect occurs in any of the orderings.

So, the key point is that a side effect on an object should never be
unsequenced relative to another side effect on the same object, or
relative to a use of the value of the same object. In the two
examples above, that’s exactly what we have. In `i = ++i`

there are
two side-effects on the same object `i`

, one from the increment
operator and one from the assignment. And those side-effects are
unsequenced. But okay, you might think that `i = ++i`

is a silly if
not perverse statement that no sane programmer would write. I concede
that point of view. However, the second statement is not so absurd.
Say you want to fill an array with the numbers between 0 and 9. A
sane programmer—although perhaps an inexperienced one—might well
write `j = 0; while (j < 10) A[j] = j++;`

thinking that there is no
ambiguity. However, the side-effect on `j`

for the increment, and the
use of the same object `j`

in the evaluation of the left-hand side
expression, are unsequenced.

Are you still not convinced that this type of undefined behavior is really a problem? Fair enough. Test yourself with the follwing quizzes.

**Pop quiz:** do the following two statements invoke undefined behavior?
Why?

A[i] = i; i = i + 1;

**Pop quiz:** does the following statement invoke undefined behavior? Why? And if so, when?

A[i] = i++;

**Pop quiz:** does the following statement invoke undefined behavior? Why? And if so, when?

A[++i] = A[++j] + 1;

**Pop quiz:** does the following statement invoke undefined behavior? Why? And if so, when?

```
i = j++ + k++;
```

**Pop quiz:** does the following statement invoke undefined behavior? Why? And if so, when?

A[i] = A[j]++ + A[k]++;

**Pop quiz:** does the following statement invoke undefined behavior? Why? And if so, when?

int *p, *q; /* ... */ *p = *q++;

**Pop quiz:** does the following statement invoke undefined behavior? Why? And if so, when?

int *p, *q; /* ... */ *p = (*q)++;

**Pop quiz:** does the following code invoke undefined behavior? Why? And if so, when?

for (int i = 0; i < 100; ++i) A[i] = A[7];