Delegate in .NET
Delegates are an important .NET feature that enables indirect method invocation and a functional programming style.
Delegates have supported multicast since .NET Framework 1.0. With multicast, a single delegate call invokes a series of methods, without us having to maintain the list of methods ourselves.
Even today, multicast delegates still play a vital role in desktop development.
Let’s take a quick look at an example.

    delegate void FooDelegate(int v);

    class MyFoo
    {
        public FooDelegate? Foo { get; set; }

        public void Process()
        {
            Foo?.Invoke(42);
        }
    }

We simply define a delegate type with a single parameter v and invoke it in the method Process.
To use the code above, we need to add some targets to the delegate member Foo.

    var obj = new MyFoo();
    obj.Foo += v => Console.WriteLine(v);
    obj.Foo += v => Console.WriteLine(v + 1);
    obj.Foo += v => Console.WriteLine(v - 42);
    obj.Process();

Then we will get the expected output as below.

    42
    43
    0

But what’s going on behind the scenes?
In fact, the compiler automatically converts each lambda expression into a method and caches the created delegate in a static field, as shown below.

    [CompilerGenerated]
    internal class Program
    {
        [Serializable]
        [CompilerGenerated]
        private sealed class <>c
        {
            public static readonly <>c <>9 = new <>c();
            public static FooDelegate <>9__0_0;
            public static FooDelegate <>9__0_1;
            public static FooDelegate <>9__0_2;

            internal void <<Main>$>b__0_0(int v)
            {
                Console.WriteLine(v);
            }

            internal void <<Main>$>b__0_1(int v)
            {
                Console.WriteLine(v + 1);
            }

            internal void <<Main>$>b__0_2(int v)
            {
                Console.WriteLine(v - 42);
            }
        }

        private static void <Main>$(string[] args)
        {
            MyFoo myFoo = new MyFoo();
            myFoo.Foo = (FooDelegate)Delegate.Combine(myFoo.Foo, <>c.<>9__0_0 ?? (<>c.<>9__0_0 = new FooDelegate(<>c.<>9.<<Main>$>b__0_0)));
            myFoo.Foo = (FooDelegate)Delegate.Combine(myFoo.Foo, <>c.<>9__0_1 ?? (<>c.<>9__0_1 = new FooDelegate(<>c.<>9.<<Main>$>b__0_1)));
            myFoo.Foo = (FooDelegate)Delegate.Combine(myFoo.Foo, <>c.<>9__0_2 ?? (<>c.<>9__0_2 = new FooDelegate(<>c.<>9.<<Main>$>b__0_2)));
            myFoo.Process();
        }
    }

Each delegate is created and cached only the first time, so when the same code path runs again, the cached delegate is reused instead of a new one being allocated.
But notice the lines that call Delegate.Combine, which combine our three methods into a single delegate. In fact, every delegate type declared in C# derives from MulticastDelegate, which keeps an invocation list in addition to the method pointer and target object used when calling a single method. Delegates are immutable: Delegate.Combine never modifies its arguments and always returns a new delegate, so the call itself is thread-safe and can be used anywhere in the code. Keep in mind that the read, combine and write-back sequence behind += on a plain property is not atomic as a whole.
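To make the behaviour of Delegate.Combine more concrete, here is a minimal sketch that uses only the public Delegate.Combine and GetInvocationList APIs. The names CombineDemo, Run and NumberHandler are made up for this illustration and are not part of the original example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical delegate type, used only in this sketch.
delegate void NumberHandler(int v);

static class CombineDemo
{
    // Combines two single-target delegates and reports what the result contains.
    public static (int Count, bool FirstUnchanged, List<int> Outputs) Run()
    {
        var outputs = new List<int>();
        NumberHandler first = v => outputs.Add(v);
        NumberHandler second = v => outputs.Add(v + 1);

        // Delegate.Combine returns a new delegate; 'first' and 'second' are not modified.
        var combined = (NumberHandler)Delegate.Combine(first, second)!;

        // One call invokes both targets, in the order they were combined.
        combined(42);

        return (combined.GetInvocationList().Length,
                first.GetInvocationList().Length == 1,
                outputs);
    }
}
```

Calling CombineDemo.Run() should report an invocation list of length 2 for the combined delegate, while the first delegate still has a single entry.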
Convenience vs. complexity, and problems
In desktop development, this is a great convenience. However, C# also has another keyword: “event”.

    class MyFoo
    {
        private List<Delegate> funcs = new();

        public event FooDelegate Foo
        {
            add => funcs.Add(value);
            remove
            {
                if (funcs.IndexOf(value) is int v and not -1)
                    funcs.RemoveAt(v);
            }
        }
    }

Using the event keyword, we can determine how delegates are added or removed. For example, we can use a List<Delegate> to hold all delegates instead of using the delegate’s built-in multicast functionality.
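As a rough sketch of how such a list-backed event could be raised, the class below stores handlers in a list and invokes them one by one. ListBackedEvents, BarDelegate and Raise are hypothetical names chosen for this illustration, and the sketch makes no attempt at thread safety.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical delegate type, used only in this sketch.
delegate void BarDelegate(int v);

class ListBackedEvents
{
    private readonly List<BarDelegate> handlers = new();

    public event BarDelegate Bar
    {
        add => handlers.Add(value);
        remove => handlers.Remove(value); // removes the first equal handler, if any
    }

    // Invokes every stored handler in subscription order.
    public void Raise(int v)
    {
        // Iterate over a copy so that a handler may unsubscribe while we are raising.
        foreach (var handler in handlers.ToArray())
            handler(v);
    }
}
```

A possible usage: subscribe two handlers with +=, call Raise(21), then remove one handler with -= and call Raise again. Only the remaining handler runs the second time.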
However, even with the event keyword, the delegate’s multicast functionality doesn’t disappear. So why do we need to provide multicast functionality at the delegate level? Why not provide a thread-safe delegate collection type DelegateCollection and make the auto-implemented events use that type instead of making the delegate itself a multicast delegate?
To make matters worse, because a delegate may hold any number of targets, every call has to go through the delegate’s Invoke method, which must be prepared to iterate over the call targets. For this reason, the JIT compiler cannot convert a delegate call into a direct call, which also prevents it from inlining the target method.
This happens even in the simplest of delegate calls.

    int Foo() => 42;
    void Call(Func<int> f) => Console.WriteLine(f());

    Call(Foo);

Let’s see how this affects code generation.

    G_M24006_IG02:
           mov      rcx, 0xD1FFAB1E      ; System.Func`1[int]
           call     CORINFO_HELP_NEWSFAST
           mov      rsi, rax
           lea      rcx, bword ptr [rsi + 08H]
           mov      rdx, rsi
           call     CORINFO_HELP_ASSIGN_REF
           mov      rcx, 0xD1FFAB1E      ; function address
           mov      qword ptr [rsi + 18H], rcx
           mov      rcx, 0xD1FFAB1E      ; code for Program:<Main>g__Foo|0_0():int
           mov      qword ptr [rsi + 20H], rcx
           mov      rcx, gword ptr [rsi + 08H]
           call     [rsi + 18H]System.Func`1[int]:Invoke():int:this    ; <---- here
           mov      ecx, eax
           call     [System.Console:WriteLine(int)]
           nop

Although the method Call is inlined into its caller, the generated code still has to call System.Func<int>.Invoke to go through the invocation list and reach the callee. This is slower than a simple indirect method call (using a function pointer directly), and much slower than a direct method call (where the callee can be inlined).

    using System;
    using BenchmarkDotNet.Attributes;

    public unsafe class Benchmarks
    {
        private int Foo() => 42;
        private readonly Func<int> f;

        public Benchmarks() => f = Foo;

        [Benchmark]
        public int SumWithDelegate()
        {
            // Make a local copy of f, because f may be modified by other methods
            // at any time, which prevents some optimizations.
            var lf = this.f;
            var sum = 0;
            for (var i = 0; i < 42; i++)
                sum += lf();
            return sum;
        }

        [Benchmark]
        public int SumWithDirectCall()
        {
            var sum = 0;
            for (var i = 0; i < 42; i++)
                sum += Foo();
            return sum;
        }
    }

Benchmark results:
| Method | Mean | Error | StdDev |
|---|---|---|---|
| SumWithDelegate | 60.21 ns | 0.725 ns | 0.678 ns |
| SumWithDirectCall | 10.52 ns | 0.155 ns | 0.145 ns |
Here the delegate version takes about 5.7 times as long as the direct version (60.21 ns versus 10.52 ns). We can explain the gap simply by looking at the assembly code that the JIT generates for each method’s loop body:

    ; Method SumWithDelegate
    G_M41830_IG03:
           mov      rax, gword ptr [rsi + 08H]
           mov      rcx, gword ptr [rax + 08H]
           call     [rax + 18H]System.Func`1[int]:Invoke():int:this
           add      edi, eax
           inc      ebx
           cmp      ebx, 42
           jl       SHORT G_M41830_IG03

    ; Method SumWithDirectCall
    G_M33206_IG03:
           add      eax, 42
           inc      edx
           cmp      edx, 42
           jl       SHORT G_M33206_IG03

Answers to life, the universe and everything
Before .NET 7, we had to accept the shortcomings of delegate performance, but fortunately, the whole game has changed since .NET 7.
Now I want to introduce two concepts: PGO (Profile-Guided Optimization) and GDV (Guarded Devirtualization).
PGO is an optimization technique that consists of two parts: one is to instrument the program and collect profile data at run time, and the other is to feed the collected data back to the compiler so that it can generate better code.
GDV is a guarded version of devirtualization. Sometimes, due to polymorphism, we can’t simply devirtualize a call, but we can test the type first, which acts as a guard, and then devirtualize the call behind the guard:

    // Before: a plain virtual call
    void Foo(Base obj)
    {
        obj.VirtualCall(); // here we cannot devirtualize the call
    }

    // After: guarded devirtualization
    void Foo(Base obj)
    {
        if (obj is Derived2) // add a guard
            ((Derived2)obj).VirtualCall(); // now the call can be devirtualized
        else
            obj.VirtualCall(); // otherwise, fall back to the standard virtual call
    }

But how does the compiler determine which type to test? This is where profile data enters the compilation process. For example, if the compiler sees that most calls to VirtualCall dispatch to the Derived2 type, it can emit a guard on Derived2 and devirtualize the call under the guard, making that the fast path, while falling back to a standard virtual call when the type is not Derived2.
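The following hand-written sketch imitates what a guarded devirtualization could look like for a concrete hierarchy. Shape, Triangle, Square and GdvSketch are hypothetical names, and the choice of Square as the guarded type is an assumption standing in for what a real profile would report. The real transformation happens inside the JIT, not in source code.

```csharp
using System;

abstract class Shape { public abstract int Sides(); }
sealed class Triangle : Shape { public override int Sides() => 3; }
sealed class Square : Shape { public override int Sides() => 4; }

static class GdvSketch
{
    // What the source code says: a plain virtual call.
    public static int Plain(Shape s) => s.Sides();

    // Roughly what the optimized code does if the profile says "almost always Square".
    public static int Guarded(Shape s)
    {
        if (s.GetType() == typeof(Square)) // guard: cheap exact-type check
            return 4;                      // body of Square.Sides() inlined by hand
        return s.Sides();                  // fallback: standard virtual call
    }
}
```

Both methods return the same result for every input; only the cost of the common case differs.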
In .NET 7, we also have a similar optimization for delegate calls, achieved by collecting method histograms.
Now let’s enable dynamic PGO in .NET 7 and see what happens.
To enable dynamic PGO, we need to set <TieredPGO>true</TieredPGO> in the csproj file. This time, we obtain the following benchmark results:
| Method | Mean | Error | StdDev | Code Size |
|---|---|---|---|---|
| SumWithDelegate | 15.95 ns | 0.320 ns | 0.299 ns | 69 B |
| SumWithDirectCall | 10.25 ns | 0.112 ns | 0.105 ns | 15 B |
A great performance boost! This time the method that goes through the delegate comes much closer to the direct call (15.95 ns versus 10.25 ns). Let’s look at the disassembly, which I have annotated with comments to explain what’s going on.

    ; Method SumWithDelegate
    ...
    G_M000_IG03:
           mov      rdx, qword ptr [rcx + 18H]
           mov      rax, 0x7FFED3C041C8    ; code for Benchmarks:Foo():int:this
           cmp      rdx, rax               ; test whether the delegate's target is Foo
           jne      SHORT G_M000_IG07      ; if not, fall back to the indirect call
           mov      eax, 42                ; otherwise the call is devirtualized and inlined,
                                           ; so the return value 42 of Foo is used directly
    G_M000_IG04:                           ; without actually calling Foo
           add      edi, eax               ; just like in SumWithDirectCall
           inc      ebx
           cmp      ebx, 42
           jl       SHORT G_M000_IG03
    ...
    G_M000_IG07:                           ; slow path: indirect call through the delegate
           mov      rcx, gword ptr [rcx + 08H]
           call     rdx
           jmp      SHORT G_M000_IG04

    ; Method SumWithDirectCall
    ...
                                           ; callee inlined,
    G_M000_IG03:                           ; so the return value 42 of Foo is added directly
           add      eax, 42                ; without actually calling Foo
           inc      edx
           cmp      edx, 42
           jl       SHORT G_M000_IG03

The disassembly shows that, with dynamic PGO, the compiler has inlined the method called through the delegate. It does so by applying guarded devirtualization, which produces an optimized code path similar to a direct call.
Specifically, the profile collected at run time records which methods the delegate call actually invoked. If one target dominates, the compiler emits a guard that compares the delegate’s method pointer with that target, then devirtualizes and inlines the call behind the guard, generating a code path similar to a direct call. If the delegate points to any other method, the slow path performs the ordinary delegate call.
In the final benchmark results, the delegate call performs close to the direct call, which shows that PGO and GDV can greatly improve the performance of delegate calls.
Can this be improved further?
We can now see that the delegate’s target method is tested on every iteration of the loop. Why not hoist the check out of the loop so that a single check covers the entire loop?
Thankfully, recent work for .NET 8 addresses this, and the improvement is already visible in nightly builds. The disassembly of the SumWithDelegate method now looks like this:

    ...
    G_M41830_IG02:
           mov      rsi, gword ptr [rcx + 08H]
           xor      edi, edi
           xor      ebx, ebx
           test     rsi, rsi
           je       SHORT G_M41830_IG05
           mov      rax, qword ptr [rsi + 18H]
           mov      rcx, 0xD1FFAB1E      ; code for Benchmarks:Foo():int:this
           cmp      rax, rcx             ; test whether the delegate's target is Foo
           jne      SHORT G_M41830_IG05  ; if not, jump to IG05, which tests the target on each iteration
    G_M41830_IG03:                       ; otherwise take the fastest path, identical to SumWithDirectCall
           mov      eax, 42
           add      edi, eax
           inc      ebx
           cmp      ebx, 42
           jl       SHORT G_M41830_IG03
    ...
    G_M41830_IG05:
           mov      rax, qword ptr [rsi + 18H]
           mov      rcx, 0xD1FFAB1E      ; code for Benchmarks:Foo():int:this
           cmp      rax, rcx             ; test whether the delegate's target is Foo
           jne      SHORT G_M41830_IG09  ; if not, jump to IG09, the slow path with an indirect call
           mov      eax, 42              ; otherwise the callee is devirtualized and inlined
    G_M41830_IG06:
           add      edi, eax
           inc      ebx
           cmp      ebx, 42
           jl       SHORT G_M41830_IG05
    ...
    G_M41830_IG09:
           mov      rcx, gword ptr [rsi + 08H]
           call     [rsi + 18H]System.Func`1[int]:Invoke():int:this
           jmp      SHORT G_M41830_IG06

In the common case, the generated code tests once, before the loop, whether the delegate’s target is the expected method. If it is, the loop runs on the fast path (IG03) with the target inlined; otherwise it runs on the slower path (IG05 and IG09), which repeats the test on every iteration and falls back to an indirect call through the delegate when the test fails.
This optimization can make a delegate call perform as well as calling the method directly.
In effect, the code is optimized to:

    // Conceptual sketch: "f == Foo" stands for the method-pointer check in the assembly above.
    var sum = 0;
    if (f == Foo)
        for (var i = 0; i < 42; i++)
            sum += 42;
    else
        for (var i = 0; i < 42; i++)
            if (f == Foo)
                sum += 42;
            else
                sum += f();
    return sum;

Now under normal circumstances, the performance of a delegate call is exactly the same as a direct method call.
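For readers who want to experiment, the hoisted guard can be imitated by hand in ordinary C#. This is only a sketch: HoistedGuardSketch and Expected are hypothetical names, delegate equality stands in for the method-pointer comparison the JIT emits, and the fallback loop is simplified compared with the real generated code, which repeats the check on each iteration.

```csharp
using System;

static class HoistedGuardSketch
{
    public static int Foo() => 42;

    // The target we "expect", standing in for what the profile would report.
    private static readonly Func<int> Expected = Foo;

    public static int Sum(Func<int> f)
    {
        var sum = 0;
        if (f == Expected) // delegate equality compares target object and method
        {
            // Fast path: one check for the whole loop, body of Foo inlined by hand.
            for (var i = 0; i < 42; i++)
                sum += 42;
        }
        else
        {
            // Slow path: ordinary delegate call on every iteration.
            for (var i = 0; i < 42; i++)
                sum += f();
        }
        return sum;
    }
}
```

Passing HoistedGuardSketch.Foo takes the fast path, while passing any other delegate, such as () => 1, takes the slow path. Both paths compute the same sum the original loop would.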
End
While .NET made some questionable design decisions about delegates in the past, since .NET 7 it has managed to solve the resulting performance issues.
Happy coding!
Reproduced with authorization.
Author: hez2010
Translator: InCerry
Original link: https://medium.com/@skyake/the-evolution-of-delegate-performance-in-net-c8f23572b8b1