Safely processing immutable text

07 Apr 2020

Safely processing immutable text

This is Part 2 of a 3-part series on std::string_view. This part focuses on the safety std::string_view provides over character arrays, and on the safety considerations to be made when using std::string_view.

Part 1 focuses on efficiency of std::string_view over std::string. Part 3 provides guidelines for using std::string_view.

1. String_view creation means

As stated in Part 1 of this series, std::string_view is simply a read-only wrapper around a character array. This statement is true regardless of the creation means used: from nothing, from a C-string, from a character array that is not a C-string, or from a std::string object.

Listing A shows creation of four string_view objects, each using a different creation means. The object created from nothing (variable sv1) is effectively as if it is created from the empty C-string "".

Listing A: creating string_view objects using four different means

std::string_view sv1;          // from nothing: approximates std::string_view sv1("");

std::string_view sv2{"hello"}; // from C-string

char a[]{'h','e','l','l','o'};
std::string_view sv3{a, 5};    // from character array that is not a C-string

std::string s{"hello"};
std::string_view sv4{s};       // from string: approx std::string_view sv4{s.c_str()};

A string_view created from a string object is the same as a string_view created using the C-string obtained via the c_str function on the string object. That is, the code associated with variable sv4 could be rewritten as: std::string_view sv4{s.c_str()};

Note: As discussed in Part 1, creating a string_view from a string object actually invokes an operator function in std::string.

2. Simplicity and safety over character arrays

It is much cleaner and safer to use the string_view wrapper instead of directly using a character array because string_view has functions and operators that abstract over low-level C-style functions. For example, we could simply test if sv1 > sv2, instead of testing if std::strcmp(z1, z2) > 0, where z1 and z2 represent C-strings, and sv1 and sv2 are corresponding string_view objects. Likewise, we can use the function member find instead of the low-level functions std::strchr, and std::strstr. (The string_view function find can find both single-character and multi-character texts.)

The string_view approach also has the advantage that it works with any character array, not just with C-strings. As a result, there is no need to resort to using functions such as std::memcmp to compare arrays and std::memchr to locate a character. Plus, with string_view, there is no need to explicitly carry around the size of every character array.

Also, because a string_view is immutable by design, it eliminates the many const qualifications that would be necessary to guarantee immutability of character arrays.

Lastly, the string_view approach is safer because the programmer does not have to worry about buffer overflow and other issues associated with low-level functions such as std::strcmp and std::strchr.

3. Inside a string_view

As stated in Part 1, just two internal data members facilitate the entire string_view functionality:

data_: a pointer to the first character of the array wrapped
size_: the number of characters of interest in the array

Using just these two data members, a string_view is able to become a wrapper to a character array regardless of the creation means used: In all four creation means shown in Listing A, the internal member data_ points to the first character in the array wrapped, and the size_ member contains the number of characters of interest.

String_view function members data and size provide access to the internal data members data_ and size_ respectively.

4. Safety issues with string_view

There are three major safety-related issues when using string_views:

assuming a string_view object makes a copy of the character array it wraps;
having a string_view object outlive the character array it wraps; and
assuming the data function member always returns a C-string.

String_view does not make a copy of the character array it wraps; nor does it own the array. Instead, the string_view object’s creator continues to own the character array and is responsible for the array’s management. Specifically, after a string_view object has served its purpose, the object owner should deallocate the array if the array was dynamically allocated.

The bottom line is that a string_view object should never outlive the character array it wraps. For example, a function should not return a string_view object that wraps a local array. Listing B shows examples of acceptable and unacceptable uses of string_view. The comments in the code are self-explanatory.

Note: Carefully study the differences between functions bad_idea and also_acceptable in Listing B.

Listing B: examples of acceptable and unacceptable uses of string_view

std::string_view bad_idea() {
    char z[]{"hello"};          // z is deallocated when this function ends
    return std::string_view{z}; // bad: returned object wraps deallocated array
}

std::string_view another_bad_idea() {
    std::string s{"hello"};     // s is deleted when this function ends
    return std::string_view{s}; // bad: returned obj wraps data in deleted string
}

std::string_view also_bad_idea() {
    char* p = new char[6]{};
    std::string_view sv{p};     // sv wraps dynamically allocated array
    delete[] p;                 // sv still wraps deallocated array
    return sv;                  // bad: returned object wraps deallocated array
}

void possibly_acceptable(std::string_view& sv) {
    std::cout << sv;            // safe only if sv wraps a legit array
}

void acceptable() {
    char z[]{"hello"};          // z is deallocated when this function ends
    std::string_view sv{z};
    possibly_acceptable(sv);    // OK: z lives until end of this function
}

std::string_view also_acceptable() {
    return std::string_view{"hello"}; // OK: "hello" has static storage
}

Another issue to be aware of when using string_view is that the function member data is not guaranteed to return a C-string. Specifically, that function simply returns a pointer to the first character in the array that was passed to it. (It could return a pointer to a later character in the array if the function remove_prefix was called earlier.)

Listing C illustrates safe and unsafe uses of the data function member. Part 3 discusses in detail, but briefly, it is better to avoid accessing the data function altogether. For example, insert a string_view directly to an output stream (as is done with sv6 in the last line of Listing C) instead of inserting the value returned from the data function.

Listing C: safe and unsafe uses of `data` function member

char z[]{"hello"};          // z is a C-string
std::string_view sv5{z};
std::cout << sv5.data();    // OK: sv5 wraps a C-string

char a[]{'h','e'};          // a is not a C-string
std::string_view sv6{a,2};
std::cout << sv6.data();    // unsafe: sv6 does not wrap a C-string

std::cout << sv6;           // OK: insertion operator is safely overloaded

5. Part-2 summary

Overall, std::string_view provides a cleaner and safer means to process immutable data than character arrays do. However, there are some safety concerns in using string_view, especially concerns related to object lifetime.

Listing D shows two versions of a function to count vowels in some text. The first version represents text as a C-string; the second represents text as a string_view. The listing aptly demonstrates that the string_view version is both simpler and safer:

No pointers
No need for const qualification: the C-string version needs const qualification; the string_view version does not need it, but that qualification is made as good practice. (In this case, there is a benefit to const qualifying the string_view objects. What is that benefit?)
Simpler code: the for-loop header and the test for vowel are both easier to comprehend (and thus to maintain) in the string_view version.
No undefined behavior: the C-string version has undefined behavior if the null character is missing. (This issue exists in two locations in the C-string version. What are those locations?)

Note: Be sure to read Part 3 of this series for guidelines on using string_view.

Listing D: counting vowels using C-string and string_view (run this code)

// using C-string
std::size_t vowel_count(const char* z) {
    const char vowels[]{"aeiouAEIOU"};

    std::size_t count{0};
    for (std::size_t i = 0; z[i] != '\0'; ++i)
        if (std::strchr(vowels, z[i]) != nullptr)
            ++count;

    return count;
}

// using std::string_view
std::size_t vowel_count(const std::string_view& sv) {
    const std::string_view vowels{"aeiouAEIOU"};

    std::size_t count{0};
    for (auto c : sv)
        if (vowels.find(c) != std::string_view::npos)
            ++count;

    return count;
}

6. Exercises

Answer the questions embedded in the bulleted list in the summary section.
Which of the two versions of function vowel_count in Listing D is faster? Which version is likely to use more run-time memory? Why?
- Use the code in Listing A of Part 1 as a model to measure wall times.
Which of the two versions vowel_count is better to count vowels in a character array that is not null-terminated? Why?
Using only the string_view version of vowel_count, write a main function to count vowels in a C-string and a character array that is not null-terminated. That is, make two calls to the string_view version of vowel_count, each time with a different argument. (For this part of the exercise, it might help to remove the C-string version of vowel_count from the program.)
Assuming the string_view abstraction (or something comparable) does not exist or cannot be used, write a function or functions to count vowels in a character array that may or may not be null-terminated.

It is OK to have two versions of the function if that approach seems better. (Listing D already shows the C-string version.) However, strive to reuse code as much as possible, but also strive to make code maintainable and “efficient”. Include comments that clearly explain the rationale for your choices.
Rewrite the string_view version of vowel_count using member function remove_prefix. There are three different approaches to this rewrite. Try all three approaches and outline the pros and cons of each approach. State which approach you prefer and include a rationale.
Rewrite the string_view version of vowel_count using member function find_first_of. Which version is “better”: the one in Listing D, or the rewritten one? Why?
Write a C-string version of the code in Listing B of Part 1.
Write a program to extract words from text, where words may be separated by space, comma, semi-colon, or period. Write both a C-string version and a string_view version.
- Do not use regular-expression, stream extraction, or other such approach that simplifies the task, but feel free to use any other standard-library facility.
- Break down the code into appropriate functions.
- const qualify all variables/parameters that represent immutable text. Meeting this requirement is quite important for this exercise.
- Hard-code the following immutable text in the program and use it in testing. Just for this exercise, do not read the text to process as user input at run time:
  
  The quality mantra: improve the process; the process improves you.
- Depending on the approach taken in the C-string version, hard-coding the text to process as a const qualified variable/parameter could pose a challenge. Yet, use a const qualified variable/parameter to represent the text to process exactly as required in the preceding bullet.

Discussion

Ask questions, give feedback, and discuss this post on Twitter. The Twitter link is specific to this post. We greatly appreciate all discussion on the post being only at the post-specific tweet.

Submit solutions by DM on Twitter (only by DM, please) so as to avoid spoilers. Please provide Compiler Explorer links to code. We prefer textual answers in the form of GitHub gists, files in a repo, or other form where we can just follow a link and open the content in a browser.