String Concatenation vs Memory Allocation
Over the years, string performance has been covered in plenty of articles (<a href="http://www.codeproject.com/csharp/StringPerf.asp" title="Performance considerations for strings in C#">Performance considerations for strings in C#</a>, <a href="http://www.heikniemi.net/hc/archives/000124.html" title=".net String vs. StringBuilder - concatenation performance">.net String vs. StringBuilder - concatenation performance</a>, <a href="http://www.yoda.arachsys.com/csharp/stringbuilder.html" title="Concatenating Strings Efficiently">Concatenating Strings Efficiently</a>), with lots of comparisons between String.Concat and StringBuilder. Today I decided to do some of my own research into the subject and contribute to the knowledge already out there. More specifically, I'll be taking a look at the memory usage for various concatenation methods and the compiler optimizations used to generate the IL.
The test scenario I defined consists of several methods, each returning the same string. The string is meant to resemble a real-life scenario. I identified five different ways of concatenating strings for my test. I will be looking at the numbers when calling each method once and when calling it inside a very small loop of 50 calls, which is another real-life number in my case.
Single line concatenation.
The easiest way of concatenating strings is simply putting a plus sign between them.
[csharp]
public string GetPlussedString()
{
    string myString = "SELECT column1,"
        + " column2,"
        + " column3,"
        + " column4,"
        + " column5,"
        + " column6,"
        + " FROM table1 t1"
        + " JOIN table2 t2"
        + " ON t1.column1 = t2.column1";
    return myString;
}
[/csharp]
Although it seems like we are creating 9 string instances, the compiler optimizes this into the following IL:
[code]
.method public hidebysig instance string GetPlussedString() cil managed
{
    .maxstack 1
    .locals init (
        [0] string myString)
    L_0000: ldstr "SELECT column1, column2, column3, column4, column5, column6, FROM table1 t1 JOIN table2 t2 ON t1.column1 = t2.column1"
    L_0005: stloc.0
    L_0006: ldloc.0
    L_0007: ret
}
[/code]
In reality, we created one string instance and returned it, which is about as efficient as it gets.
When profiling the test application, I couldn't even find a call to GetPlussedString in the profiler, which makes me believe the runtime optimized the call away entirely.
In total, our application created 113 string instances and barely used any memory.
Running this in the loop gives the following result:
It is important to note that we still have 113 string instances. This is because .NET applied string interning to my string and simply returns a reference to that same instance over and over.
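The effect of a single shared instance can be checked without a profiler. The following is a minimal sketch, assuming a shortened version of the query and a hypothetical InterningDemo class that is not part of my test solution. It only compares the references returned by two calls:

```csharp
using System;

public static class InterningDemo
{
    // Same shape as GetPlussedString: every operand is a compile-time
    // constant, so the compiler folds the expression into one literal.
    public static string GetPlussedString()
    {
        string myString = "SELECT column1,"
            + " column2,"
            + " FROM table1 t1";
        return myString;
    }

    public static void Main()
    {
        string first = GetPlussedString();
        string second = GetPlussedString();

        // Both calls load the same literal, so no second copy is created
        // and the two references point at the same object.
        Console.WriteLine(object.ReferenceEquals(first, second));
    }
}
```

If the method allocated a new string per call, the reference comparison would fail even though the contents are equal.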
Variable concatenation.
Another frequently used way of concatenating strings is by appending a variable with the += operator for each line.
[csharp]
public string GetPlussedVarString()
{
    string myString = "SELECT column1,";
    myString += " column2,";
    myString += " column3,";
    myString += " column4,";
    myString += " column5,";
    myString += " column6,";
    myString += " FROM table1 t1";
    myString += " JOIN table2 t2";
    myString += " ON t1.column1 = t2.column1";
    return myString;
}
[/csharp]
Things become messy here. Take a look at the generated IL for this code:
[code]
.method public hidebysig instance string GetPlussedVarString() cil managed
{
    .maxstack 2
    .locals init (
        [0] string myString)
    L_0000: ldstr "SELECT column1,"
    L_0005: stloc.0
    L_0006: ldloc.0
    L_0007: ldstr " column2,"
    L_000c: call string [mscorlib]System.String::Concat(string, string)
    L_0011: stloc.0
    L_0012: ldloc.0
    L_0013: ldstr " column3,"
    L_0018: call string [mscorlib]System.String::Concat(string, string)
    L_001d: stloc.0
    L_001e: ldloc.0
    L_001f: ldstr " column4,"
    L_0024: call string [mscorlib]System.String::Concat(string, string)
    L_0029: stloc.0
    L_002a: ldloc.0
    L_002b: ldstr " column5,"
    L_0030: call string [mscorlib]System.String::Concat(string, string)
    L_0035: stloc.0
    L_0036: ldloc.0
    L_0037: ldstr " column6,"
    L_003c: call string [mscorlib]System.String::Concat(string, string)
    L_0041: stloc.0
    L_0042: ldloc.0
    L_0043: ldstr " FROM table1 t1"
    L_0048: call string [mscorlib]System.String::Concat(string, string)
    L_004d: stloc.0
    L_004e: ldloc.0
    L_004f: ldstr " JOIN table2 t2"
    L_0054: call string [mscorlib]System.String::Concat(string, string)
    L_0059: stloc.0
    L_005a: ldloc.0
    L_005b: ldstr " ON t1.column1 = t2.column1"
    L_0060: call string [mscorlib]System.String::Concat(string, string)
    L_0065: stloc.0
    L_0066: ldloc.0
    L_0067: ret
}
[/code]
Every += operation translates into a call to String.Concat(), creating a new temporary string.
Looking at the profiler, we end up with 129 string instances, which is 16 more than our comparison base. These break down into 8 strings returned by the 8 calls to String.Concat and 8 additional string literals declared in code.
Calling this 50 times quickly shows the downside of this method. We end up with 408 additional string instances: 400 from the 50*8 calls to String.Concat, plus our original 8 extra literals, which were interned.
Note the explosion in memory size used for this simple example, 73kB vs 16kB.
I strongly discourage the use of the += operator for string concatenation in these scenarios.
String.Concat(array) concatenation.
A less used way of concatenating strings is by using one of the String.Concat overloads which accept a string array.
[csharp]
public string GetConcatedString()
{
    string[] pieces = new string[] {
        "SELECT column1,",
        " column2,",
        " column3,",
        " column4,",
        " column5,",
        " column6,",
        " FROM table1 t1",
        " JOIN table2 t2",
        " ON t1.column1 = t2.column1"
    };
    return String.Concat(pieces);
}
[/csharp]
Passing a string array explicitly is a more efficient way of using String.Concat, as can be seen in the following IL:
[code]
.method public hidebysig instance string GetConcatedString() cil managed
{
    .maxstack 3
    .locals init (
        [0] string[] pieces,
        [1] string[] CS$0$0000)
    L_0000: ldc.i4.s 9
    L_0002: newarr string
    L_0007: stloc.1
    L_0008: ldloc.1
    L_0009: ldc.i4.0
    L_000a: ldstr "SELECT column1,"
    L_000f: stelem.ref
    L_0010: ldloc.1
    L_0011: ldc.i4.1
    L_0012: ldstr " column2,"
    L_0017: stelem.ref
    L_0018: ldloc.1
    L_0019: ldc.i4.2
    L_001a: ldstr " column3,"
    L_001f: stelem.ref
    L_0020: ldloc.1
    L_0021: ldc.i4.3
    L_0022: ldstr " column4,"
    L_0027: stelem.ref
    L_0028: ldloc.1
    L_0029: ldc.i4.4
    L_002a: ldstr " column5,"
    L_002f: stelem.ref
    L_0030: ldloc.1
    L_0031: ldc.i4.5
    L_0032: ldstr " column6,"
    L_0037: stelem.ref
    L_0038: ldloc.1
    L_0039: ldc.i4.6
    L_003a: ldstr " FROM table1 t1"
    L_003f: stelem.ref
    L_0040: ldloc.1
    L_0041: ldc.i4.7
    L_0042: ldstr " JOIN table2 t2"
    L_0047: stelem.ref
    L_0048: ldloc.1
    L_0049: ldc.i4.8
    L_004a: ldstr " ON t1.column1 = t2.column1"
    L_004f: stelem.ref
    L_0050: ldloc.1
    L_0051: stloc.0
    L_0052: ldloc.0
    L_0053: call string [mscorlib]System.String::Concat(string[])
    L_0058: ret
}
[/code]
This method uses 9 more string instances than our base, which is already better than the 16 resulting from +=.
These 9 come from the 8 additional strings defined in the code and 1 from the single call to String.Concat().
Calling this 50 times results in 58 additional strings compared to our base, coming from 50 calls to String.Concat() and our 8 additional strings in code (again, interning at work).
Internally, the array overload of String.Concat() first adds up the length needed for the result and then allocates a single string of the correct length, whereas the previous method could not use this optimization since it made 8 separate calls.
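The count-first, allocate-once idea described above can be sketched as follows. This is only an illustration under my own assumptions: ConcatPieces and TwoPassConcat are hypothetical names, and the code is not the framework's actual implementation. It also uses a char array as an intermediate buffer, which the real String.Concat has no need for.

```csharp
using System;

public static class TwoPassConcat
{
    // Hypothetical helper that mimics the two-pass approach:
    // measure everything first, then copy into one buffer.
    public static string ConcatPieces(string[] pieces)
    {
        // Pass 1: add up the lengths so the final size is known up front.
        int total = 0;
        foreach (string piece in pieces)
        {
            if (piece != null) total += piece.Length;
        }

        // One buffer of exactly the right size, so it never has to grow.
        char[] buffer = new char[total];

        // Pass 2: copy each piece to its position in the buffer.
        int offset = 0;
        foreach (string piece in pieces)
        {
            if (piece == null) continue;
            piece.CopyTo(0, buffer, offset, piece.Length);
            offset += piece.Length;
        }

        return new string(buffer);
    }

    public static void Main()
    {
        string[] pieces = { "SELECT column1,", " column2,", " FROM table1 t1" };
        Console.WriteLine(ConcatPieces(pieces) == String.Concat(pieces));
    }
}
```

With += there is no point at which the total length is known, so every step has to allocate a string for the intermediate result.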
StringBuilder.Append() concatenation.
Method number four uses a StringBuilder to create a string, as demonstrated in plenty of tutorials.
[csharp]
public string GetBuildString()
{
    StringBuilder builder = new StringBuilder();
    builder.Append("SELECT column1,");
    builder.Append(" column2,");
    builder.Append(" column3,");
    builder.Append(" column4,");
    builder.Append(" column5,");
    builder.Append(" column6,");
    builder.Append(" FROM table1 t1");
    builder.Append(" JOIN table2 t2");
    builder.Append(" ON t1.column1 = t2.column1");
    return builder.ToString();
}
[/csharp]
The not so interesting IL for this method simply shows the creation of the object and several method calls.
[code]
.method public hidebysig instance string GetBuildString() cil managed
{
    .maxstack 2
    .locals init (
        [0] class [mscorlib]System.Text.StringBuilder builder)
    L_0000: newobj instance void [mscorlib]System.Text.StringBuilder::.ctor()
    L_0005: stloc.0
    L_0006: ldloc.0
    L_0007: ldstr "SELECT column1,"
    L_000c: callvirt instance class [mscorlib]System.Text.StringBuilder [mscorlib]System.Text.StringBuilder::Append(string)
    L_0011: pop
    L_0012: ldloc.0
    L_0013: ldstr " column2,"
    L_0018: callvirt instance class [mscorlib]System.Text.StringBuilder [mscorlib]System.Text.StringBuilder::Append(string)
    L_001d: pop
    L_001e: ldloc.0
    L_001f: ldstr " column3,"
    L_0024: callvirt instance class [mscorlib]System.Text.StringBuilder [mscorlib]System.Text.StringBuilder::Append(string)
    L_0029: pop
    L_002a: ldloc.0
    L_002b: ldstr " column4,"
    L_0030: callvirt instance class [mscorlib]System.Text.StringBuilder [mscorlib]System.Text.StringBuilder::Append(string)
    L_0035: pop
    L_0036: ldloc.0
    L_0037: ldstr " column5,"
    L_003c: callvirt instance class [mscorlib]System.Text.StringBuilder [mscorlib]System.Text.StringBuilder::Append(string)
    L_0041: pop
    L_0042: ldloc.0
    L_0043: ldstr " column6,"
    L_0048: callvirt instance class [mscorlib]System.Text.StringBuilder [mscorlib]System.Text.StringBuilder::Append(string)
    L_004d: pop
    L_004e: ldloc.0
    L_004f: ldstr " FROM table1 t1"
    L_0054: callvirt instance class [mscorlib]System.Text.StringBuilder [mscorlib]System.Text.StringBuilder::Append(string)
    L_0059: pop
    L_005a: ldloc.0
    L_005b: ldstr " JOIN table2 t2"
    L_0060: callvirt instance class [mscorlib]System.Text.StringBuilder [mscorlib]System.Text.StringBuilder::Append(string)
    L_0065: pop
    L_0066: ldloc.0
    L_0067: ldstr " ON t1.column1 = t2.column1"
    L_006c: callvirt instance class [mscorlib]System.Text.StringBuilder [mscorlib]System.Text.StringBuilder::Append(string)
    L_0071: pop
    L_0072: ldloc.0
    L_0073: callvirt instance string [mscorlib]System.Object::ToString()
    L_0078: ret
}
[/code]
Note that I am using a default StringBuilder, which starts with a capacity of 16 characters.
From the profiler we can see this approach created 13 more string instances than our base, of which 8 are again the extra strings in code, one comes from the final ToString() call and 4 come from the internals of the StringBuilder, since it had to allocate its internal buffer 4 times (at 16, 32, 64 and 128 characters).
Interesting to note here is the fact that the usage of a StringBuilder already uses less memory than concatenating with += when using 9 strings. Choosing a good estimate of the target size upon constructing the StringBuilder would have made the difference even bigger.
This becomes even more obvious when comparing the results from the loop:
With 258 more instances than our base (8 interned strings, 50 ToString() calls and 200 buffer increases inside the StringBuilder), we can clearly see the StringBuilder being more efficient than +=, even taking StringBuilder object creation into account. It is, however, not as efficient as the String.Concat(array) method.
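The earlier remark about estimating the target size can be illustrated with a pre-sized builder. This is a minimal sketch and was not part of my measurements: BuildQuery and PresizedBuilderDemo are hypothetical names, and 128 is simply an assumed estimate that covers the 117 characters of the finished query.

```csharp
using System;
using System.Text;

public static class PresizedBuilderDemo
{
    // Hypothetical variant of GetBuildString that takes the initial
    // capacity as a parameter and returns the builder for inspection.
    public static StringBuilder BuildQuery(int capacity)
    {
        StringBuilder builder = new StringBuilder(capacity);
        builder.Append("SELECT column1,");
        builder.Append(" column2,");
        builder.Append(" column3,");
        builder.Append(" column4,");
        builder.Append(" column5,");
        builder.Append(" column6,");
        builder.Append(" FROM table1 t1");
        builder.Append(" JOIN table2 t2");
        builder.Append(" ON t1.column1 = t2.column1");
        return builder;
    }

    public static void Main()
    {
        StringBuilder builder = BuildQuery(128);

        // The content fits in the initial buffer, so the capacity
        // reported afterwards is still the value passed in.
        Console.WriteLine(builder.Length);
        Console.WriteLine(builder.Capacity);
    }
}
```

Because the capacity never has to change, the intermediate buffer allocations seen with the default constructor should not occur here.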
StringBuilder.AppendFormat() concatenation.
And lastly, for my own personal curiosity, I wanted to see the effect of using AppendFormat() versus Append().
[csharp]
public string GetBuildFormatString()
{
    // AppendFormat will first parse your string to find {x} instances
    // and then fill them in. Afterwards it calls .Append
    // Better to simply call .Append several times.
    StringBuilder builder = new StringBuilder();
    builder.AppendFormat("SELECT {0},", "column1");
    builder.AppendFormat(" {0},", "column2");
    builder.AppendFormat(" {0},", "column3");
    builder.AppendFormat(" {0},", "column4");
    builder.AppendFormat(" {0},", "column5");
    builder.AppendFormat(" {0},", "column6");
    builder.AppendFormat(" FROM {0} t1", "table1");
    builder.AppendFormat(" JOIN {0} t2", "table2");
    builder.Append(" ON t1.column1 = t2.column1");
    return builder.ToString();
}
[/csharp]
This is the least efficient method of the pack. First, a look at the IL:
[code]
.method public hidebysig instance string GetBuildFormatString() cil managed
{
    .maxstack 3
    .locals init (
        [0] class [mscorlib]System.Text.StringBuilder builder)
    L_0000: newobj instance void [mscorlib]System.Text.StringBuilder::.ctor()
    L_0005: stloc.0
    L_0006: ldloc.0
    L_0007: ldstr "SELECT {0},"
    L_000c: ldstr "column1"
    L_0011: callvirt instance class [mscorlib]System.Text.StringBuilder [mscorlib]System.Text.StringBuilder::AppendFormat(string, object)
    L_0016: pop
    L_0017: ldloc.0
    L_0018: ldstr " {0},"
    L_001d: ldstr "column2"
    L_0022: callvirt instance class [mscorlib]System.Text.StringBuilder [mscorlib]System.Text.StringBuilder::AppendFormat(string, object)
    L_0027: pop
    L_0028: ldloc.0
    L_0029: ldstr " {0},"
    L_002e: ldstr "column3"
    L_0033: callvirt instance class [mscorlib]System.Text.StringBuilder [mscorlib]System.Text.StringBuilder::AppendFormat(string, object)
    L_0038: pop
    L_0039: ldloc.0
    L_003a: ldstr " {0},"
    L_003f: ldstr "column4"
    L_0044: callvirt instance class [mscorlib]System.Text.StringBuilder [mscorlib]System.Text.StringBuilder::AppendFormat(string, object)
    L_0049: pop
    L_004a: ldloc.0
    L_004b: ldstr " {0},"
    L_0050: ldstr "column5"
    L_0055: callvirt instance class [mscorlib]System.Text.StringBuilder [mscorlib]System.Text.StringBuilder::AppendFormat(string, object)
    L_005a: pop
    L_005b: ldloc.0
    L_005c: ldstr " {0},"
    L_0061: ldstr "column6"
    L_0066: callvirt instance class [mscorlib]System.Text.StringBuilder [mscorlib]System.Text.StringBuilder::AppendFormat(string, object)
    L_006b: pop
    L_006c: ldloc.0
    L_006d: ldstr " FROM {0} t1"
    L_0072: ldstr "table1"
    L_0077: callvirt instance class [mscorlib]System.Text.StringBuilder [mscorlib]System.Text.StringBuilder::AppendFormat(string, object)
    L_007c: pop
    L_007d: ldloc.0
    L_007e: ldstr " JOIN {0} t2"
    L_0083: ldstr "table2"
    L_0088: callvirt instance class [mscorlib]System.Text.StringBuilder [mscorlib]System.Text.StringBuilder::AppendFormat(string, object)
    L_008d: pop
    L_008e: ldloc.0
    L_008f: ldstr " ON t1.column1 = t2.column1"
    L_0094: callvirt instance class [mscorlib]System.Text.StringBuilder [mscorlib]System.Text.StringBuilder::Append(string)
    L_0099: pop
    L_009a: ldloc.0
    L_009b: callvirt instance string [mscorlib]System.Object::ToString()
    L_00a0: ret
}
[/code]
Inside the AppendFormat() calls is where the ugly stuff happens. The format is converted to a character array, which is then scanned for occurrences of {x}, and in the end the result is passed to StringBuilder.Append() anyway.
I didn't spend much time trying to draw a conclusion from the results of this test: it's bound to perform worse than the previous method, since it's the same logic with extra operations.
Interesting to note are the results inside the loop, which show it using even more memory than += concatenation.
Conclusion
The conclusions I made for myself and will be following in my future development are as follows:
- If you can avoid concatenating, do it! This is a no-brainer: if you don't have to concatenate but want your source code to look nice, use the first method. It will get optimized as if it were a single string.
- Don't ever use += concatenation. Too many changes take place behind the scenes, and they aren't obvious from the code in the first place. I advise using String.Concat() explicitly instead, with any overload (2 strings, 3 strings, string array). This clearly shows what your code does without any surprises, while allowing you to keep a check on the efficiency.
- Try to estimate the target size of a StringBuilder. The more accurately you can estimate the needed size, the fewer temporary strings the StringBuilder will have to create to increase its internal buffer.
- Do not use any Format() methods when performance is an issue. Too much overhead is involved in parsing the format, when you could construct an array out of pieces if all you are using are {x} replacements. Format() is good for readability, but it is one of the first things to go when you are squeezing every bit of performance out of your application.
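As an illustration of the second and fourth points, the sketch below builds one fragment of the query three ways. The class and method names are hypothetical and the table name is passed in purely for the example. All three methods are expected to return the same text, but only the Format version has to parse a format string first.

```csharp
using System;

public static class ExplicitConcatDemo
{
    // Readable, but the format string is scanned for {x} at run time.
    public static string WithFormat(string table)
    {
        return String.Format(" FROM {0} t1", table);
    }

    // Three-string overload: one explicit call, no format parsing.
    public static string WithConcat(string table)
    {
        return String.Concat(" FROM ", table, " t1");
    }

    // Array overload: the pieces are listed in order and joined in one call.
    public static string WithArray(string table)
    {
        string[] pieces = new string[] { " FROM ", table, " t1" };
        return String.Concat(pieces);
    }

    public static void Main()
    {
        Console.WriteLine(WithFormat("table1") == WithConcat("table1"));
        Console.WriteLine(WithConcat("table1") == WithArray("table1"));
    }
}
```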
One conclusion I am not 100 percent sure of is the difference between using String.Concat(array) and using a StringBuilder. It seems using an array incurs less memory overhead than using a StringBuilder, unless the cost of array creation is big, which I couldn't determine in my tests. I'd be more than interested to know if someone could provide more detail on this.
The guidelines in Jouni Heikniemi's article seem to be accurate when comparing between String.Concat(string, string) and StringBuilder, and will be the ones I'll be following, until I get a clear picture of the String.Concat(array) implementation.
Once again, I've uploaded the solution I used, as well as an additional HTML page displaying the results below each other.