Optimizing string use in C++

String properties

Strings are dynamically allocated

Strings are values

Strings are copied a lot

String optimization

Basic optimization

Compound assignment operations avoid temporary strings

Allocate memory in advance

Use constant references to reduce temporary objects generated when passing parameters

Eliminate copying of the returned string

C-style optimization: use character arrays instead of strings

Algorithm optimization: remove instead of add

Use a better compiler

Use a better string library

Adopt a richer std::string library

Use std::stringstream to avoid value semantics


Note: The following content is derived from Chapter 4 of the C++ Performance Optimization Guide: “Optimizing String Usage: Case Study”

String properties

  • The behavior of std::string may be different between the C++98 and C++11 standards

Strings are dynamically allocated

  • C functions such as strcat() and strcpy() operate on fixed-length character arrays, while C++ strings are more convenient to use.
  • C++ strings are dynamically allocated, so their storage grows automatically when needed.
  • A newly initialized C++ string has a fixed capacity. When appended characters would exceed that capacity, the string implementation will:
  1. Request a larger block of memory (some implementations request twice the current capacity)
  2. Copy the contents of the existing string into the new block
  3. Point the string at the newly allocated buffer
  4. Release the originally allocated memory
  • Disadvantage: The longer the string, the more expensive a reallocating append becomes, because the whole string has to be copied. An implementation that doubles the capacity can also leave up to half of the buffer unused after a single character is appended, which wastes memory.
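
The reallocation steps above can be observed through capacity(). The following is a minimal sketch, not code from the book: capacity_steps is a hypothetical helper name, and the actual capacities it records are implementation-defined, so the growth factor will differ between standard libraries.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: append n characters one at a time and record every
// distinct capacity the string passes through. Each new entry after the
// first corresponds to one reallocation of the underlying buffer.
std::vector<std::size_t> capacity_steps(std::size_t n)
{
    std::string s;
    std::vector<std::size_t> steps;
    steps.push_back(s.capacity());          // initial capacity (implementation-defined)
    for (std::size_t i = 0; i < n; ++i)
    {
        s += 'x';
        if (s.capacity() != steps.back())
        {
            steps.push_back(s.capacity());  // the buffer was reallocated
        }
    }
    return steps;
}
```

Calling capacity_steps(1000) and printing the returned values shows how often, and by how much, a given implementation grows the buffer.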

Strings are values

  • In assignment statements and expressions, strings behave like values: assigning one string to another gives the target its own copy of the contents, which may require allocating memory
std::string s1,s2;
s1 = "hot"; //s1 is "hot"
s2 = s1; //s2 is "hot"
s1[0] = 'n'; //s1 is "not", s2 is still "hot"

When strings are concatenated, as in s1 = s2 + s3 + s4, temporary objects are created and destroyed, and the memory manager is called several times:

  1. The result of s2 + s3 is stored in a newly allocated temporary string, call it ss
  2. The result of ss + s4 is stored in another newly allocated temporary string, call it sss
  3. sss replaces the previous value of s1
  4. ss and the memory previously held by s1 are released
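
One way to sidestep the intermediate strings in the steps above is to append into a single destination whose size is known in advance. The sketch below assumes a hypothetical helper named concat3; it is an illustration of the idea, not a routine from the book.

```cpp
#include <string>

// Hypothetical helper: builds a + b + c with one allocation and no
// intermediate temporary strings.
std::string concat3(const std::string& a,
                    const std::string& b,
                    const std::string& c)
{
    std::string out;
    out.reserve(a.size() + b.size() + c.size());  // one allocation for the full result
    out += a;   // each append writes directly into out's buffer
    out += b;
    out += c;
    return out;
}
```

With this helper, s1 = concat3(s2, s3, s4) produces the same value as s1 = s2 + s3 + s4.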

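Strings are copied a lot

  • Because strings are values, a copy is made whenever a string is assigned or passed into a function by value. Some implementations reduce this cost with “copy on write” (COW): copies share one buffer until one of them is modified. (C++11 effectively rules COW out for std::string.)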
std::string s1,s2;
s1 = "hot"; //s1 is "hot"
s2 = s1; //s2 is "hot" (s1 and s2 point to the same memory)
s1[0] = 'n'; //s1 makes its own copy of the shared buffer before the write; s1 is "not", s2 is still "hot"
  • When a string is assigned or passed by value as a function argument, a new copy of it is created.
  • In C++11 and later, “rvalue references” and “move semantics” reduce the cost of copying strings. If a function takes an “rvalue reference” parameter and the argument is an rvalue expression, the string can be transferred with a lightweight pointer copy, which saves a full copy operation.
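
As a rough illustration of the rvalue-reference case, the sketch below uses a hypothetical function named store that takes its string by rvalue reference and moves it into a container. For a string whose characters live on the heap, the move typically transfers the buffer pointer instead of copying the characters; short strings may still be copied by implementations that store them inline.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: takes ownership of s and stores it in out.
// The rvalue-reference parameter only binds to temporaries or to
// arguments the caller has explicitly passed through std::move.
void store(std::vector<std::string>& out, std::string&& s)
{
    out.push_back(std::move(s));  // move-constructs the element, no deep copy needed
}
```

A caller would write store(v, std::string("tmp")) for a temporary, or store(v, std::move(big)) to give up a named string; after the second call, big is left in a valid but unspecified state.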

String optimization

  • The following function removes control characters from a string. Called with a 222-character string containing several control characters, on a PC (Intel i7 tablet) running Windows 8.1 and compiled with VS2010 (32-bit), it took an average of 24.8 microseconds per call.
std::string remove_ctrl(std::string s)
{
    std::string result;
    for (int i=0; i<s.length(); ++i)
    {
        if(s[i] >= 0x20)
        {
            result = result + s[i];
        }
    }
    return result;
}
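
The source does not show how the per-call times were measured. A minimal timing sketch, assuming a hypothetical harness named average_microseconds built on std::chrono, might look like this. It is not the book's test harness, and its numbers will not match the figures quoted here.

```cpp
#include <chrono>
#include <cstddef>
#include <string>

// Hypothetical harness: calls f(input) `iterations` times and returns the
// average wall-clock time per call in microseconds. f must return a string.
template <typename F>
double average_microseconds(F f, const std::string& input, int iterations)
{
    using clock = std::chrono::steady_clock;
    std::size_t sink = 0;                 // accumulate results so the calls have a visible effect
    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i)
    {
        sink += f(input).size();
    }
    const auto stop = clock::now();
    volatile std::size_t keep = sink;     // discourage the optimizer from dropping the loop
    (void)keep;
    const std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / iterations;
}
```

For example, average_microseconds(remove_ctrl, test_string, 100000) would time the function above, where test_string stands for whatever input is being measured.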

The following sections optimize it in several ways.

Basic optimization

Compound assignment operations avoid temporary strings

  • In result = result + s[i], the string is treated as a value: each concatenation calls the memory manager to construct a new temporary string object.
  • Constructing and destroying these temporaries allocates and frees memory. Using the compound assignment operator += instead avoids the temporary and the extra copying.
    std::string remove_ctrl_mutating(std::string s)
    {
        std::string result;
        for (int i=0; i<s.length(); ++i)
        {
            if(s[i] >= 0x20)
            {
                result += s[i];
            }
        }
        return result;
    }
    

After the change, each call takes an average of 1.72 microseconds, about 13 times faster than remove_ctrl().

Allocate memory in advance

  • The buffer of result in remove_ctrl_mutating() is dynamically allocated. Whenever an appended character would exceed the current capacity, the buffer is reallocated, and the final capacity may end up larger than s needs, which wastes memory.
  • Calling the reserve() member function of std::string up front pre-allocates enough memory and avoids these reallocations.
std::string remove_ctrl_reserve(std::string s)
{
    std::string result;
    result.reserve(s.length());
    for (int i=0; i<s.length(); ++i)
    {
        if(s[i] >= 0x20)
        {
            result += s[i];
        }
    }
    return result;
}

After the change, each call takes an average of 1.47 microseconds, about 17% faster than remove_ctrl_mutating().

Use constant references to reduce the generation of temporary objects when passing parameters

  • The argument of remove_ctrl_reserve() is passed by value, so the string is copied into a temporary object on every call.
  • Passing a const reference instead avoids that temporary.
    std::string remove_ctrl_ref_args(const std::string& s)
    {
        std::string result;
        result.reserve(s.length());
        for (int i=0; i<s.length(); ++i)
        {
            if(s[i] >= 0x20)
            {
                result += s[i];
            }
        }
        return result;
    }
  • References are typically implemented as pointers, so every use of s inside the function goes through a dereference, which may cost performance.
  • Using iterators eliminates the repeated dereference.
    std::string remove_ctrl_ref_args_it(const std::string& s)
    {
        std::string result;
        result.reserve(s.length());
        for (auto it=s.begin(),end=s.end(); it != end; ++it)
        {
            if (*it >= 0x20)
            {
                result += *it;
            }
        }
        return result;
    }

After the change, each call takes an average of 1.04 microseconds.

Eliminate copying of returned string

  • Returning result by value from remove_ctrl_ref_args_it() may also copy the string into a temporary object, which requires another memory allocation.
  • The result string can instead be passed in by reference as an output parameter.
void remove_ctrl_ref_result_it(std::string& result, const std::string& s)
{
    result.clear(); // start from an empty string, reusing the caller's buffer
    result.reserve(s.length());
    for (auto it=s.begin(),end=s.end(); it != end; ++it)
    {
        if (*it >= 0x20)
        {
            result += *it;
        }
    }
}

After the change, each call took an average of 1.02 microseconds

C-style optimization: use character arrays instead of strings

  • When performance requirements are extremely strict, the function can be hand-written against C-style character arrays.
  • With C-style strings, the caller must allocate and release the character buffers manually, which increases maintenance cost.
void remove_ctrl_cstrings(char* destp, char const* srcp, size_t size)
{
    for (size_t i=0; i<size; ++i)
    {
        if (srcp[i] >= 0x20)
        {
            *destp++ = srcp[i];
        }
    }
    *destp = 0;
}

In testing, each call to remove_ctrl_cstrings() takes 0.15 microseconds.
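
The buffer management that the C-style version pushes onto the caller can be sketched as follows. The wrapper remove_ctrl_into is a hypothetical name introduced here for illustration. It assumes a fixed-size stack array as the destination and refuses inputs that would not fit, since remove_ctrl_cstrings() itself performs no bounds checking and needs room for size characters plus the terminating null.

```cpp
#include <cstddef>
#include <cstring>

// Same routine as in the text, repeated so this sketch compiles on its own.
void remove_ctrl_cstrings(char* destp, char const* srcp, std::size_t size)
{
    for (std::size_t i = 0; i < size; ++i)
    {
        if (srcp[i] >= 0x20)
        {
            *destp++ = srcp[i];
        }
    }
    *destp = 0;
}

// Hypothetical wrapper: filters src into a caller-owned array of N chars.
// Returns false, leaving dest untouched, when src plus the terminator
// would not fit in the worst case (no characters removed).
template <std::size_t N>
bool remove_ctrl_into(char (&dest)[N], const char* src)
{
    const std::size_t len = std::strlen(src);
    if (len + 1 > N)
    {
        return false;
    }
    remove_ctrl_cstrings(dest, src, len);
    return true;
}
```

Because the destination lives on the stack, a call such as remove_ctrl_into(buf, "a\tb\nc") with char buf[16] involves no heap allocation at all.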

Algorithm optimization: remove instead of add

  • Use the erase() member function of std::string to remove the control characters from the string itself. No new string is built; the argument is modified in place and serves as the result.
void remove_ctrl_erase(std::string& s)
{
    for (size_t i = 0; i < s.length();)
    {
        if (s[i] < 0x20)
        {
            s.erase(i,1);
        }
        else
        {
            ++i;
        }
    }
}

  • s only ever gets shorter, so apart from any allocation needed to return a value, no memory is allocated.
  • The modified function performs very well: in testing, each call takes 0.81 microseconds.
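
Each erase(i,1) call shifts the remainder of the string down by one position. A different technique, the standard erase-remove idiom, compacts the string in a single pass and then trims the tail once. The sketch below is offered as a comparison point only: the name remove_ctrl_erase_remove is hypothetical, the idiom is not part of the summarized chapter, and no timing was measured for it.

```cpp
#include <algorithm>
#include <string>

// Hypothetical variant: same predicate as remove_ctrl_erase() (drop every
// char that compares less than 0x20), expressed with std::remove_if.
void remove_ctrl_erase_remove(std::string& s)
{
    s.erase(std::remove_if(s.begin(), s.end(),
                           [](char c) { return c < 0x20; }),  // move kept chars to the front
            s.end());                                         // cut off the leftover tail
}
```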

Use a better compiler

Use a compiler that supports the C++11 standard; move constructors and rvalue references let it eliminate some unnecessary copies.

Use a better string library

Use a richer std::string library

Boost String Library: Provides functions for segmenting strings by tokens, formatting strings, and other operations

C++ String Toolkit (StrTk): Excellent at parsing strings and segmenting strings by tokens, and compatible with std::string

Use std::stringstream to avoid value semantics

  • std::stringstream does for strings what std::ostream does for output files
  • Put differently, the std::stringstream class encapsulates a dynamically sized buffer (usually a std::string) to which data can be appended
  • If std::stringstream is implemented on top of std::string, it can never outperform std::string. Its advantage is that it discourages certain programming practices that hurt performance
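
As a small illustration of the appending style described above, the sketch below formats a message by streaming pieces into a std::ostringstream instead of chaining operator+ on strings. The function name describe_square is hypothetical, and the example says nothing about relative speed, which depends on the library implementation.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: builds the message by appending to the stream's
// internal buffer, so no temporary strings are created per fragment.
std::string describe_square(int i)
{
    std::ostringstream out;
    out << "The square of " << i << " is " << i * i;
    return out.str();   // copy the accumulated buffer out once at the end
}
```

The equivalent expression with strings would need std::to_string() calls and several operator+ temporaries to produce the same text.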