Special Interest Group on C++
This post discusses the use of C-strings in C++. It defines the terms C-string and NTBS (null-terminated byte string); discusses C-string literals and variables; outlines common patterns of C-string usage; and highlights a subtle technical difference between C-strings and NTBSs.
But first, some advice: Avoid using C-strings in C++, and instead use std::string where possible. There are situations where C-strings perform better than std::string, but make that choice on a case-by-case basis. Even when using a C-string, consider wrapping it in the lightweight std::string_view.
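As a minimal sketch of this advice, assuming a C++17 compiler, a function can take std::string_view so that C-strings, literals, and std::string objects are all accepted without copying. The names count_char and string_view_example are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: counts how often `target` occurs in `text`.
// std::string_view stores only a pointer and a length, so no characters
// are copied when a C-string or a std::string is passed in.
std::size_t count_char(std::string_view text, char target) {
    std::size_t count = 0;
    for (char c : text) {
        if (c == target) {
            ++count;
        }
    }
    return count;
}

// Usage sketch: the same function serves both kinds of string.
void string_view_example() {
    const char* cstr = "banana";           // a C-string
    std::string owned{"banana"};           // a std::string
    assert(count_char(cstr, 'a') == 3);    // const char* converts implicitly
    assert(count_char(owned, 'a') == 3);   // so does std::string
}
```

Constructing the view from a const char* scans for the null character once to find the length; after that, the view carries the length with it.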
In C++, the informal term C-string is used to mean the “string” data structure defined in the C programming language [7.1.1]:
A string is a contiguous sequence of characters terminated by and including the first null character.
Here, the term null character refers to the character whose integer code is 0. The literal '\0' is the null character. It is different from the literal '0', whose integer code is 48 in ASCII.
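The distinction can be checked at compile time. The sketch below assumes an ASCII-based execution character set (which is what gives '0' the code 48); is_null_char and null_char_example are hypothetical names used only here.

```cpp
#include <cassert>

// The null character has integer code 0.
static_assert('\0' == 0, "the null character has integer code 0");

// The digit character '0' is a different character. Its code is 48 in ASCII,
// which this sketch assumes is the basis of the execution character set.
static_assert('0' == 48, "assumes an ASCII-based character set");

// Hypothetical helper: reports whether c is the null character.
constexpr bool is_null_char(char c) {
    return c == '\0';
}

// Usage sketch: the array holds 'a', '0', and the automatic null character.
void null_char_example() {
    const char text[]{"a0"};
    assert(!is_null_char(text[1]));  // the digit '0' is not the null character
    assert(is_null_char(text[2]));   // the terminating null character
}
```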
In contrast to C, a string in C++ is an abstract sequence of characters without regard for how the sequence is represented or terminated. However, C++ defines the related term null-terminated byte string [byte.strings]:
A null-terminated byte string, or NTBS, is a character sequence whose highest-addressed element with defined content has the value zero (the terminating null character); no other element in the sequence has the value zero.
C++ also defines the length of an NTBS as the number of elements that precede the terminating null character. An NTBS of length 0 is called an “empty NTBS”. C similarly defines the length of a string as the number of bytes preceding the null character.
Note: When working with C++:
(std::string is a concrete implementation of this abstraction.)
A literal such as "hello" is a C-string literal (“static NTBS” in C++, to be precise). The compiler automatically places the null character after the last character inside the double quotes.
There is no data type called C-string. Instead, the type of a C-string is array of char with the appropriate bound, where bound is the declared number of array elements (aka “array size” or “array capacity”). The array type is const-qualified for literals. For example, the type of the literal "hello" is const char[6] because storing that literal requires six characters including the null character.
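The length of the literal "hello" is five, which is the number of characters preceding the null character. In general, the length of a C-string is at most one less than the bound of the character array that contains it: exactly one less when the only null character is the last element, and smaller when the first null character appears earlier in the array.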
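The type, bound, and length of a literal can be confirmed with a few checks. This is a sketch assuming C++17 (for std::is_same_v); bound_of and bound_and_length_example are hypothetical names introduced for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

// A string literal is an lvalue, so decltype yields a reference to its
// array type: const char[6] for "hello".
static_assert(std::is_same_v<decltype("hello"), const char (&)[6]>,
              "the literal \"hello\" has type const char[6]");
static_assert(sizeof("hello") == 6, "the bound counts the null character");

// Hypothetical helper: deduces the bound N of any character array.
template <std::size_t N>
constexpr std::size_t bound_of(const char (&)[N]) {
    return N;
}

// Usage sketch: the length never exceeds bound - 1.
void bound_and_length_example() {
    char padded[8]{"he"};  // bound 8, but only two characters before the null
    assert(bound_of("hello") == 6 && std::strlen("hello") == 5);
    assert(bound_of(padded) == 8 && std::strlen(padded) == 2);
}
```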
The following code fragment declares five character-array variables, only three of which hold C-strings. The end-of-line comments provide additional information for each array.
Note: As the examples show, every C-string is a character array, but not every character array is a C-string.
char s1[]{'h','e','\0'}; // C-string; bound 3; length 2; explicit null at position 3
char s2[]{'h','e','r'}; // not C-string; bound 3; no null char
char s3[7]; // not a C-string; bound 7; uninitialized, so no null char is guaranteed
char s4[]{"he"}; // C-string; bound 3; length 2; auto null at position 3
char s5[8]{"he"}; // C-string; bound 8; length 2; auto null at position 3
Being a C-string is just a property of a character array: it depends on whether the array contains the null character, and this property can change over time within a program. That is, a character array could be a C-string at one point in the program and not be a C-string at another point in the same program. The following code segment illustrates this possibility.
char s6[]{'h','e','\0'}; // C-string; null character at position 3
s6[2] = 'r'; // no longer a C-string: null character replaced
s6[2] = '\0'; // C-string again: null character restored
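One way to observe this property changing is to search the whole array for a null character. The sketch below is only an illustration: holds_c_string and property_example are hypothetical names, and the check is meaningful only for arrays whose elements have all been initialized.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: true if the array currently contains at least one
// null character, that is, if it currently holds a C-string.
// std::memchr never reads past the N elements of the array.
template <std::size_t N>
bool holds_c_string(const char (&arr)[N]) {
    return std::memchr(arr, '\0', N) != nullptr;
}

// Usage sketch mirroring the code segment above.
void property_example() {
    char s[]{'h', 'e', '\0'};
    assert(holds_c_string(s));   // null character present
    s[2] = 'r';
    assert(!holds_c_string(s));  // null character replaced
    s[2] = '\0';
    assert(holds_c_string(s));   // null character restored
}
```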
The C++ library includes many functions that operate on C-strings. These functions are defined in the header <cstring> [cstring.syn]. Example functions are: strlen to find length; strcpy to copy a C-string to another; and strcmp to compare a C-string with another. In addition, the insertion operator << on output streams is overloaded to output C-strings.
A function that operates on a C-string typically receives a parameter of type char* (or const char* if it only reads the characters), with the expectation that the caller passes a pointer to the first character of the C-string. It is not necessary to also receive the C-string’s length because the function can use the null character to detect the end of the data.
Here are the declarations for some common C-string functions:
std::size_t strlen(const char* s); // find C-string length
char* strcpy(char* dest, const char* src); // copy a C-string to another
int strcmp(const char *s1, const char *s2); // compare two C-strings
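To make the role of the null character concrete, here is a hedged sketch of how such functions can find the end of the data on their own. my_strlen and same_text are hypothetical re-implementations written for illustration; real code should call std::strlen and std::strcmp.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical re-implementation of strlen: walks forward from the first
// character until the null character is found.
std::size_t my_strlen(const char* s) {
    const char* p = s;
    while (*p != '\0') {
        ++p;
    }
    return static_cast<std::size_t>(p - s);
}

// Hypothetical helper: true if both C-strings hold the same characters.
// The loop stops at the first mismatch or at the end of the first string.
bool same_text(const char* a, const char* b) {
    while (*a != '\0' && *a == *b) {
        ++a;
        ++b;
    }
    return *a == *b;
}

// Usage sketch: no length argument is needed in either call.
void c_string_function_example() {
    assert(my_strlen("hello") == 5);
    assert(same_text("he", "he"));
    assert(!same_text("he", "her"));
}
```

Both functions assume the caller passes a valid C-string; if the array has no null character, the loops read past the end of the array, which is undefined behavior.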
For all practical purposes, a C-string is the same as an NTBS, but a careful examination of their respective definitions reveals a subtle difference.
From C’s definition of a string, it is clear that a C-string may include multiple null characters, but the string is deemed to have ended as soon as the first null character is seen. Thus, we could say a C-string is a character array with at least one occurrence of the null character.
A string is a contiguous sequence of characters terminated by and including the first null character.
However, reviewing the C++ definition of NTBS, it is clear that an NTBS must end with the null character and the null character may appear only once:
A null-terminated byte string, or NTBS, is a character sequence whose highest-addressed element with defined content has the value zero (the terminating null character); no other element in the sequence has the value zero.
In summary, C-strings and NTBSs are different with respect to the number of null character occurrences permitted and the required location of the null character. However, I have yet to encounter any situation where this subtlety causes an issue: If an array with multiple null characters or a misplaced null character is used where an NTBS is expected, only the portion of the array until the first occurrence of the null character is processed. That is, in practice, an NTBS is treated just as a C-string.
Conclusion: It is OK to use the terms C-string and NTBS interchangeably, but it is important to be aware of the subtle difference, especially when arguing a point, or when discussing specifics in a job interview or other such scenario.
The following program shows how one could create valid C-strings but invalid NTBSs. The program’s output also shows that an NTBS with multiple null chars or misplaced null char is treated just as a C-string.
#include <iostream>
#include <cstring>
int main() {
    char s7[]{'h','e','\0','r','\0'}; // two explicit null chars
    char s8[]{"hello\0World"}; // explicit and implicit null chars
    char s9[]{'h','e','r','\0','s'}; // misplaced null char
    std::cout << s7 << '\n'; // "he": ignores chars after position 2
    std::cout << std::strlen(s8) << '\n'; // 5: ignores chars after position 5
    std::cout << std::strlen(s9) << '\n'; // 3: ignores chars after position 3
}
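For contrast, the sketch below shows that the characters after the first null character are still in the array and can be reached when the length is supplied explicitly instead of being found by scanning. It assumes C++17; whole_array_view and embedded_null_example are hypothetical names, and the helper assumes the last array element is the terminating null character.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: views every element of the array except the last one
// (assumed to be the terminating null character), embedded nulls included.
template <std::size_t N>
std::string_view whole_array_view(const char (&arr)[N]) {
    return std::string_view{arr, N - 1};
}

// Usage sketch with the same array contents as s8 in the program above.
void embedded_null_example() {
    char s8[]{"hello\0World"};
    // Built from a pointer alone: stops at the first null character.
    assert(std::string(s8).size() == 5);
    // Built with an explicit length: keeps all eleven characters.
    assert(whole_array_view(s8).size() == 11);
    assert(whole_array_view(s8)[5] == '\0');
}
```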
Strictly following the definitions of C-string and NTBS:
(The empty initializer {} in an array declaration writes the number zero to all array elements.)
char s10[]{'h','e'};
char s11[]{"he\0"};
char s12[4]{};
char s13[1]{};
char* p1 = s12;
char* p2 = s13;
char* p3;
Submit solutions by DM on Twitter (only by DM, please) so as to avoid spoilers. Please provide Compiler Explorer links to code. We prefer textual answers in the form of GitHub gists, files in a repo, or other form where we can just follow a link and open the content in a browser.