ESCAPE CHARACTERS IN C
Escape sequences in C
From Mustapha habib
Escape sequences are used in the programming languages C and C++, and their design was copied in many other languages such as Java, PHP, C#, etc. An escape sequence is a sequence of characters that does not represent itself when used inside a character or string literal, but is translated into another character or a sequence of characters that may be difficult or impossible to represent directly.
In C, all escape sequences consist of two or more characters, the first of which is the backslash, \ (called the “Escape character”); the remaining characters determine the interpretation of the escape sequence. For example, \n is an escape sequence that denotes a newline character.
Contents
- 1 Motivation
- 2 Table of escape sequences
- 2.1 Notes
- 2.2 Non-standard escape sequences
- 2.3 Universal character names
- 3 See also
- 4 References
- 5 Further reading
Motivation
Suppose we want to print out Hello, on one line, followed by world! on the next line. One could attempt to represent the string to be printed as a single literal as follows:
#include <stdio.h>
int main() {
    printf("Hello,
world!");
}
This is not valid in C, since a string literal may not span multiple logical source lines. This can be worked around by printing the newline character using its numerical value (0x0A in ASCII):
#include <stdio.h>
int main() {
    printf("Hello,%cworld!", 0x0A);
}
This instructs the program to print Hello,, followed by the byte whose numerical value is 0x0A, followed by world!. While this works when the machine uses the ASCII encoding, it does not work on systems whose encoding assigns a different numerical value to the newline character. It is also not a good solution because it still does not allow a newline character to be represented inside a literal, and instead relies on the semantics of printf. In order to solve these problems and ensure maximum portability between systems, C interprets \n inside a literal as a newline character, whatever that may be on the target system:
#include <stdio.h>
int main() {
    printf("Hello,\nworld!");
}
In this code, the escape sequence \n does not stand for a backslash followed by the letter n, because the backslash causes an “escape” from the normal way characters are interpreted by the compiler. After seeing the backslash, the compiler expects another character to complete the escape sequence, and then translates the escape sequence into the bytes it is intended to represent. Thus, "Hello,\nworld!" represents a string with an embedded newline, regardless of whether it is used inside printf or anywhere else.
This raises the issue of how to represent an actual backslash inside a literal. This is done by using the escape sequence \\, as seen in the next section.
Some languages do not have escape sequences. Instead, a separate output command that appends a newline is used (in Pascal, for example, writeln appends a newline and write does not):
writeln('Hello,');
write('world!');
Table of escape sequences
The following escape sequences are defined in standard C. This table also shows the values they map to in ASCII. However, these escape sequences can be used on any system with a C compiler, and may map to different values if the system does not use a character encoding based on ASCII.
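| Escape sequence | Hex value in ASCII | Character represented |
|---|---|---|
| \a | 07 | Alert (Beep, Bell) (added in C89)[1] |
| \b | 08 | Backspace |
| \e [note 1] | 1B | Escape character |
| \f | 0C | Formfeed (page break) |
| \n | 0A | Newline (Line Feed); see notes below |
| \r | 0D | Carriage return |
| \t | 09 | Horizontal tab |
| \v | 0B | Vertical tab |
| \\ | 5C | Backslash |
| \' | 27 | Apostrophe or single quotation mark |
| \" | 22 | Double quotation mark |
| \? | 3F | Question mark (used to avoid trigraphs) |
| \nnn [note 2] | any | The byte whose numerical value is given by nnn interpreted as an octal number |
| \xhh... | any | The byte whose numerical value is given by hh... interpreted as a hexadecimal number |
| \uhhhh [note 3] | none | Unicode code point below 10000 hexadecimal (added in C99) |
| \Uhhhhhhhh [note 4] | none | Unicode code point, where h is a hexadecimal digit |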
Note 1. Common non-standard code; see the Notes section below.
Note 2. There may be one, two, or three octal numerals n present; see the Notes section below.
Note 3. \u takes 4 hexadecimal digits h; see the Notes section below.
Note 4. \U takes 8 hexadecimal digits h; see the Notes section below.
Non-standard escape sequences
A sequence such as \z is not a valid escape sequence according to the C standard, as it is not found in the table above. The C standard requires such “invalid” escape sequences to be diagnosed (i.e., the compiler must issue a diagnostic message, which may be a warning rather than an error). Notwithstanding this fact, some compilers define additional escape sequences with implementation-defined semantics. An example is the \e escape sequence, which has 1B as its hexadecimal value in ASCII, represents the escape character, and is supported in GCC,[2] clang and tcc. It was not, however, added to the C standard repertoire, because it has no meaningful equivalent in some character sets (such as EBCDIC).[1]
Universal character names
Since the C99 standard, C has also supported escape sequences that denote Unicode code points in string literals. Such escape sequences are called universal character names, and have the form \uhhhh or \Uhhhhhhhh, where h stands for a hex digit. Unlike the other escape sequences considered, a universal character name may expand into more than one code unit.
The sequence \uhhhh denotes the code point hhhh, interpreted as a hexadecimal number. The sequence \Uhhhhhhhh denotes the code point hhhhhhhh, interpreted as a hexadecimal number. (Therefore, code points located at U+10000 or higher must be denoted with the \U syntax, whereas lower code points may use \u or \U.) The code point is converted into a sequence of code units in the encoding of the destination type on the target system. For example, consider
char s1[] = "\xC0";
char s2[] = "\u00C1";
wchar_t s3[] = L"\xC0";
wchar_t s4[] = L"\u00C0";
The string s1 will contain a single byte (not counting the terminating null) whose numerical value, the actual value stored in memory, is in fact 0xC0. The string s2 will contain the character “Á”, U+00C1 LATIN CAPITAL LETTER A WITH ACUTE. On a system that uses the UTF-8 encoding, the string s2 will contain two bytes, 0xC3 0x81. The string s3 contains a single wchar_t, again with numerical value 0xC0. The string s4 contains the character “À” encoded into wchar_t; if the UTF-16 encoding is used, then s4 will also contain only a single wchar_t, 16 bits long, with numerical value 0x00C0. A universal character name such as \U0001F603 may be represented by a single wchar_t if the UTF-32 encoding is used, or by two if UTF-16 is used.
Importantly, the universal character name \u00C0 always denotes the character “À”, regardless of what kind of string literal it is used in, or the encoding in use. Again, \U0001F603 always denotes the character at code point U+1F603, regardless of context. On the other hand, octal and hex escape sequences always denote certain sequences of numerical values, regardless of encoding. Therefore, universal character names are complementary to octal and hex escape sequences; while octal and hex escape sequences represent “physical” code units, universal character names represent code points, which may be thought of as “logical” characters.