The Evolution of Delegate Performance in .NET

Delegates in .NET

Delegates are an important .NET feature that enables indirect method invocation and functional programming.

Delegates have supported multicast since .NET Framework 1.0. With multicasting, we can invoke a series of methods in a single delegate call without maintaining the list of methods ourselves.

Even today, multicast delegates still play a vital role in desktop development.

Let’s take a quick look with an example.

delegate void FooDelegate(int v);

class MyFoo
{
    public FooDelegate? Foo { get; set; }

    public void Process()
    {
        Foo?.Invoke(42);
    }
}

We simply define a delegate with a single parameter v and call this delegate in the method Process.

To use the code above, we need to add some targets to the delegate member Foo.

var obj = new MyFoo();
obj.Foo += v => Console.WriteLine(v);
obj.Foo += v => Console.WriteLine(v + 1);
obj.Foo += v => Console.WriteLine(v - 42);
obj.Process();

Then we will get the expected output as below.

42
43
0

But what’s going on behind the scenes?

In fact, the compiler automatically converts each lambda expression into a method and caches the created delegate in a static field, as shown below.

[CompilerGenerated]
internal class Program
{
    [Serializable]
    [CompilerGenerated]
    private sealed class <>c
    {
        public static readonly <>c <>9 = new <>c();

        public static FooDelegate <>9__0_0;

        public static FooDelegate <>9__0_1;

        public static FooDelegate <>9__0_2;

        internal void <<Main>$>b__0_0(int v)
        {
            Console.WriteLine(v);
        }

        internal void <<Main>$>b__0_1(int v)
        {
            Console.WriteLine(v + 1);
        }

        internal void <<Main>$>b__0_2(int v)
        {
            Console.WriteLine(v - 42);
        }
    }

    private static void <Main>$(string[] args)
    {
        MyFoo myFoo = new MyFoo();
        myFoo.Foo = (FooDelegate)Delegate.Combine(myFoo.Foo, <>c.<>9__0_0 ?? (<>c.<>9__0_0 = new FooDelegate(<>c.<>9.<<Main>$>b__0_0)));
        myFoo.Foo = (FooDelegate)Delegate.Combine(myFoo.Foo, <>c.<>9__0_1 ?? (<>c.<>9__0_1 = new FooDelegate(<>c.<>9.<<Main>$>b__0_1)));
        myFoo.Foo = (FooDelegate)Delegate.Combine(myFoo.Foo, <>c.<>9__0_2 ?? (<>c.<>9__0_2 = new FooDelegate(<>c.<>9.<<Main>$>b__0_2)));
        myFoo.Process();
    }
}

Each delegate is created and cached only the first time it is needed. When execution reaches the same lambda expression again, the cached delegate is reused instead of a new one being allocated.
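The caching behavior can be observed from ordinary code. The following is a minimal sketch, not taken from the original article: the class and method names (LambdaCacheDemo, Make, MakeCapturing) are hypothetical, and the caching itself is an implementation detail of the current C# compiler rather than a language guarantee.

```csharp
using System;

public static class LambdaCacheDemo
{
    // A non-capturing lambda: it uses no locals and no instance state,
    // so the compiler can cache a single delegate in a static field.
    private static Action<int> Make() => v => Console.WriteLine(v);

    // A capturing lambda: every call needs its own closure object,
    // so a new delegate is created each time.
    private static Action<int> MakeCapturing(int offset) => v => Console.WriteLine(v + offset);

    // Compares the delegate instances returned by two separate calls.
    public static bool SameInstance() => ReferenceEquals(Make(), Make());

    public static bool SameCapturingInstance() =>
        ReferenceEquals(MakeCapturing(1), MakeCapturing(1));

    public static void Main()
    {
        Console.WriteLine(SameInstance());          // cached delegate reused
        Console.WriteLine(SameCapturingInstance()); // fresh delegate per call
    }
}
```

With the caching shown above, the first comparison should report the same instance and the second should not, which is why non-capturing lambdas are the cheaper choice on hot paths.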

But notice the lines in the generated code that call Delegate.Combine, which effectively combine our three methods into a single delegate. In fact, every delegate in .NET inherits from MulticastDelegate, which keeps an invocation list that stores the method pointer and target object for each method to call. Delegate.Combine is thread-safe because delegates are immutable and the call returns a new delegate instead of modifying its arguments, so we can safely use it anywhere in our code.
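To make the invocation list concrete, here is a small sketch that combines two delegates and inspects the result through the public GetInvocationList method. The CombineDemo class is a hypothetical name used only for illustration; FooDelegate mirrors the delegate declared earlier.

```csharp
using System;

public delegate void FooDelegate(int v);

public static class CombineDemo
{
    public static int Run()
    {
        FooDelegate a = v => Console.WriteLine(v);
        FooDelegate b = v => Console.WriteLine(v + 1);

        // Delegate.Combine returns a new delegate; a and b are left unchanged.
        var combined = (FooDelegate)Delegate.Combine(a, b)!;

        // Invoking the combined delegate calls a, then b, in that order.
        combined(42);

        // a still has a single target, while combined has one entry per target.
        Console.WriteLine(a.GetInvocationList().Length);
        return combined.GetInvocationList().Length;
    }

    public static void Main() => Console.WriteLine(Run());
}
```

Because the inputs are never mutated, a reader holding a reference to a or b keeps seeing a single-target delegate regardless of what other code combines them with.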

Convenience vs. complexity, and problems

In desktop development, this really provides us with great convenience. At the same time, however, there is another keyword in C# called “event”.

class MyFoo
{
    private List<Delegate> funcs = new();
    public event FooDelegate Foo
    {
        add => funcs.Add(value);
        remove
        {
            if (funcs.IndexOf(value) is int v and not -1) funcs.RemoveAt(v);
        }
    }
}

Using the event keyword, we can determine how delegates are added or removed. For example, we can use List to hold all delegates instead of using the delegate’s built-in multicast functionality.
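The list-backed event can be exercised end to end with a sketch like the one below. It is an illustration under assumptions: the Process method and the EventDemo class are additions that do not appear in the snippet above, the list is typed as List<FooDelegate> for simplicity, and, unlike a compiler-generated event, this version is not thread-safe.

```csharp
using System;
using System.Collections.Generic;

public delegate void FooDelegate(int v);

public class MyFoo
{
    private readonly List<FooDelegate> funcs = new();

    // Custom accessors: handlers are stored in a list, not in a multicast delegate.
    public event FooDelegate Foo
    {
        add => funcs.Add(value);
        remove => funcs.Remove(value);
    }

    public void Process()
    {
        // Iterate over a snapshot so that handlers may subscribe or unsubscribe while running.
        foreach (var f in funcs.ToArray()) f(42);
    }
}

public static class EventDemo
{
    public static List<int> Run()
    {
        var seen = new List<int>();
        var obj = new MyFoo();
        FooDelegate plusOne = v => seen.Add(v + 1);

        obj.Foo += v => seen.Add(v);
        obj.Foo += plusOne;
        obj.Foo -= plusOne; // removed again, so it never runs
        obj.Process();
        return seen;
    }

    public static void Main() => Console.WriteLine(string.Join(",", Run()));
}
```

Callers still use += and -=, so the storage strategy stays an implementation detail of the class.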

However, even with the event keyword, the delegate’s multicast functionality doesn’t disappear. So why do we need to provide multicast functionality at the delegate level? Why not provide a thread-safe delegate collection type DelegateCollection and make the auto-implemented events use that type instead of making the delegate itself a multicast delegate?

To make matters worse, the runtime needs to iterate over the call targets each time the delegate is called. For this reason, the JIT compiler cannot convert a delegate call to a direct call, preventing the JIT from inlining the target method.

This happens even in the simplest of delegate calls.

int Foo() => 42;
void Call(Func<int> f) => Console.WriteLine(f());

Call(Foo);

Let’s see how this affects code generation.

G_M24006_IG02:
       mov rcx, 0xD1FFAB1E ; System.Func`1[int]
       call CORINFO_HELP_NEWSFAST
       mov rsi, rax
       lea rcx, bword ptr [rsi + 08H]
       mov rdx, rsi
       call CORINFO_HELP_ASSIGN_REF
       mov rcx, 0xD1FFAB1E ; function address
       mov qword ptr [rsi + 18H], rcx
       mov rcx, 0xD1FFAB1E ; code for Program:<Main>g__Foo|0_0():int
       mov qword ptr [rsi + 20H], rcx
       mov rcx, gword ptr [rsi + 08H]
       call [rsi + 18H]System.Func`1[int]:Invoke():int:this ; <---- here
       mov ecx, eax
       call [System.Console:WriteLine(int)]
       nop

Although the method Call is inlined into its caller, the code still has to call System.Func`1.Invoke, which iterates through the invocation list and calls each target one by one. This is slower than a simple indirect call (using a function pointer directly), and much slower than a direct method call (where the callee can be inlined).
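For comparison, the "simple indirect call" mentioned above can be written with a C# 9 function pointer. This is a sketch only: FunctionPointerDemo and its methods are hypothetical names, and the project is assumed to be compiled with unsafe code enabled (AllowUnsafeBlocks).

```csharp
using System;

public static unsafe class FunctionPointerDemo
{
    private static int Foo() => 42;

    // Indirect call through a raw function pointer:
    // no delegate object is allocated and no Invoke method is involved.
    private static int CallPointer(delegate*<int> f) => f();

    // The same call through a delegate, for comparison.
    private static int CallDelegate(Func<int> f) => f();

    public static int Run() => CallPointer(&Foo) + CallDelegate(Foo);

    public static void Main() => Console.WriteLine(Run());
}
```

Function pointers give up the conveniences of delegates (multicast, bound targets, type safety outside unsafe code), which is why they are mostly used in interop and low-level code.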

using BenchmarkDotNet.Attributes;

public unsafe class Benchmarks
{
    private int Foo() => 42;
    private readonly Func<int> f;
    public Benchmarks() => f = Foo;

    [Benchmark]
    public int SumWithDelegate()
    {
        // Make a local copy of f: the field could be modified by other methods
        // at any time, which prevents some optimizations.
        var lf = this.f;
        var sum = 0;
        for (var i = 0; i < 42; i++) sum += lf();
        return sum;
    }

    [Benchmark]
    public int SumWithDirectCall()
    {
        var sum = 0;
        for (var i = 0; i < 42; i++) sum += Foo();
        return sum;
    }
}

Benchmark results:

Method	Mean	Error	StdDev
SumWithDelegate	60.21 ns	0.725 ns	0.678 ns
SumWithDirectCall	10.52 ns	0.155 ns	0.145 ns

Delegate calls take roughly 5.7 times as long as direct calls (60.21 ns versus 10.52 ns). We can explain this simply by looking at the assembly code that the JIT generates for each method’s loop body:

; Method SumWithDelegate

G_M41830_IG03:
       mov rax, gword ptr [rsi + 08H]
       mov rcx, gword ptr [rax + 08H]
       call [rax + 18H]System.Func`1[int]:Invoke():int:this
       add edi, eax
       inc ebx
       cmp ebx, 42
       jl SHORT G_M41830_IG03


; Method SumWithDirectCall

G_M33206_IG03:
       add eax, 42
       inc edx
       cmp edx, 42
       jl SHORT G_M33206_IG03

Answers to life, the universe and everything

Before .NET 7, we had to accept the shortcomings of delegate performance, but fortunately, the whole game has changed since .NET 7.

Now I want to introduce two concepts: PGO (Profile-Guided Optimization) and GDV (Guarded Devirtualization).

PGO is an optimization technique that consists of two parts: one is to instrument the program and collect profile data at run time, and the other is to feed the collected data back to the compiler so that it can generate better code.

GDV is a guarded version of devirtualization. Sometimes, due to polymorphism, we can’t simply devirtualize a method call, but we can test the type first, which acts as a guard, and then devirtualize the call under that guard:

// Before: the call cannot be devirtualized, because obj may be any subtype of Base
void Foo(Base obj)
{
    obj.VirtualCall();
}

// After: a guard makes devirtualization possible
void Foo(Base obj)
{
    if (obj is Derived2) // guard: test the type first
        ((Derived2)obj).VirtualCall(); // under the guard, the call can be devirtualized
    else obj.VirtualCall(); // otherwise, fall back to a standard virtual call
}

But how does the compiler determine which type to test? This is where profile data enters the compilation process. For example, if the profile shows that most calls to VirtualCall dispatch to Derived2, the compiler can emit a guard on Derived2 and devirtualize the call under that guard, making it the fast path, while falling back to a standard virtual call when the type is not Derived2.
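The guard can also be written by hand, which makes the shape of the transformation easier to see. The sketch below is illustrative only: Base, Derived1 and Derived2 are hypothetical types, and it uses an exact type comparison against a sealed class so that the call target under the guard is known statically. The JIT performs the equivalent step itself once it has profile data.

```csharp
using System;

public class Base { public virtual int VirtualCall() => 0; }
public class Derived1 : Base { public override int VirtualCall() => 1; }
public sealed class Derived2 : Base { public override int VirtualCall() => 2; }

public static class GdvDemo
{
    public static int Foo(Base obj)
    {
        // Guard: an exact type check against the type assumed to be the most common.
        if (obj.GetType() == typeof(Derived2))
        {
            // Derived2 is sealed, so the target is known and the call can be
            // made directly (and inlined) instead of dispatched virtually.
            return ((Derived2)obj).VirtualCall();
        }

        // Fallback: a standard virtual call for every other type.
        return obj.VirtualCall();
    }

    public static void Main()
    {
        Console.WriteLine(Foo(new Derived2())); // fast path
        Console.WriteLine(Foo(new Derived1())); // fallback path
    }
}
```

Both paths return the same result a plain virtual call would; the guard only changes how the common case is reached.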

In .NET 7, we also have a similar optimization for delegate calls, achieved by collecting method histograms.

Now I’m going to enable dynamic PGO in .NET 7 and see what happens.

To enable dynamic PGO, we need to set <TieredPGO>true</TieredPGO> in the csproj file. This time, we obtain the following benchmark results:

Method	Mean	Error	StdDev	Code Size
SumWithDelegate	15.95 ns	0.320 ns	0.299 ns	69 B
SumWithDirectCall	10.25 ns	0.112 ns	0.105 ns	15 B

Great performance boost! This time the method that goes through the delegate performs almost comparably to the method that makes the direct call. Let’s look at the disassembled code. I’ve added some comments to the disassembly to explain what’s going on.

; Method SumWithDelegate

...
G_M000_IG03:
       mov rdx, qword ptr [rcx + 18H]
       mov rax, 0x7FFED3C041C8 ; code for Benchmarks:Foo():int:this
       cmp rdx, rax ; guard: is the delegate's target the Foo method?
       jne SHORT G_M000_IG07 ; if not, fall back to the indirect call
       mov eax, 42 ; otherwise, the call is devirtualized and Foo is inlined,
                   ; so its return value 42 can be used directly
G_M000_IG04:
       add edi, eax ; add 42 to the sum without calling Foo, just as in SumWithDirectCall
       inc ebx
       cmp ebx, 42
       jl SHORT G_M000_IG03
...
G_M000_IG07: ; slow path: indirect call through the delegate
       mov rcx, gword ptr [rcx + 08H]
       call rdx
       jmp SHORT G_M000_IG04


; Method SumWithDirectCall

... ; the callee is inlined,
G_M000_IG03: ; so the return value 42 of Foo is added directly to the sum
       add eax, 42 ; without actually calling Foo
       inc edx
       cmp edx, 42
       jl SHORT G_M000_IG03

The disassembly shows that, with dynamic PGO, the compiler has also inlined the method called through the delegate. Guarded devirtualization is what makes this possible: it produces an optimized code path similar to a direct call.

Specifically, in the code generated for the method that invokes the delegate, the compiler uses the method histogram collected during profiling to determine which method the delegate points to in most cases. If there is such a dominant target, the compiler emits a guard that checks the delegate's target, then devirtualizes and inlines that target under the guard, generating a code path similar to a direct call. If the guard fails at run time, the slow delegate call path is executed instead.

In the final benchmark results, the delegate-based method performs close to the direct-call method, which means that PGO and GDV can greatly improve the performance of delegate calls.

Can this be improved further?

We can now see that on each iteration of the loop we are testing the delegate’s target method. Why not move the check earlier outside the loop so that only one check is needed for the entire loop?

Thankfully, recent work for .NET 8 has delivered this improvement, and it can already be seen in nightly builds. The disassembly of the SumWithDelegate method now looks like this:

...
G_M41830_IG02:
       mov rsi, gword ptr [rcx + 08H]
       xor edi, edi
       xor ebx, ebx
       test rsi, rsi
       je SHORT G_M41830_IG05
       mov rax, qword ptr [rsi + 18H]
       mov rcx, 0xD1FFAB1E ; code for Benchmarks:Foo():int:this
       cmp rax, rcx ; guard: is the delegate's target the Foo method?
       jne SHORT G_M41830_IG05 ; if not, jump to G_M41830_IG05, which re-checks the target on every iteration
G_M41830_IG03: ; otherwise, take the fastest path, whose loop is equivalent to SumWithDirectCall
       mov eax, 42
       add edi, eax
       inc ebx
       cmp ebx, 42
       jl SHORT G_M41830_IG03
...
G_M41830_IG05:
       mov rax, qword ptr [rsi + 18H]
       mov rcx, 0xD1FFAB1E ; code for Benchmarks:Foo():int:this
       cmp rax, rcx ; guard: is the delegate's target the Foo method?
       jne SHORT G_M41830_IG09 ; if not, jump to G_M41830_IG09, the slow path that calls through the delegate
       mov eax, 42 ; otherwise, the callee is devirtualized and inlined
G_M41830_IG06:
       add edi, eax
       inc ebx
       cmp ebx, 42
       jl SHORT G_M41830_IG05
...
G_M41830_IG09:
       mov rcx, gword ptr [rsi + 08H]
       call [rsi + 18H]System.Func`1[int]:Invoke():int:this
       jmp SHORT G_M41830_IG06

Under normal circumstances, .NET tests once, before the loop, whether the delegate's target is the expected method. If it is, the fast path (IG03) runs with the target inlined, so no call is made at all. Otherwise the slow path (IG05 and IG09) runs: it checks the target again on every iteration and, when the check fails, invokes the target indirectly through the delegate.

This optimization can make the performance of the delegate call equal to the performance of calling the method directly.

In effect, the code is optimized into the following:

var sum = 0;
if (f == Foo)
    for (var i = 0; i < 42; i++) sum += 42;
else
    for (var i = 0; i < 42; i++)
        if (f == Foo) sum += 42;
        else sum += f();
return sum;

Now under normal circumstances, the performance of a delegate call is exactly the same as a direct method call.
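The hoisted guard can be reproduced by hand to check that both loops compute the same result. The sketch below makes some assumptions: LoopCloningDemo, Expected and Sum are hypothetical names, and Delegate.Equals stands in for the raw method-pointer comparison that the JIT emits.

```csharp
using System;

public static class LoopCloningDemo
{
    private static int Foo() => 42;

    // The target that profiling is assumed to have identified as dominant.
    private static readonly Func<int> Expected = Foo;

    public static int Sum(Func<int> f)
    {
        var sum = 0;
        if (f.Equals(Expected))
        {
            // Fast loop: the guard was checked once, and Foo is "inlined" as the constant 42.
            for (var i = 0; i < 42; i++) sum += 42;
        }
        else
        {
            // Slow loop: every iteration calls through the delegate.
            for (var i = 0; i < 42; i++) sum += f();
        }
        return sum;
    }

    public static void Main()
    {
        Console.WriteLine(Sum(Foo));     // takes the fast loop
        Console.WriteLine(Sum(() => 1)); // takes the slow loop
    }
}
```

Writing this by hand is rarely worthwhile, since it hard-codes a guess about the target; the JIT derives the same structure from observed behavior and can fall back when the guess turns out to be wrong.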

End

While .NET made some poor design decisions around delegates in the past, since .NET 7 it has managed to solve their performance issues.

Happy coding!

Published with authorization.

Author: hez2010

Translator: InCerry

Original link: https://medium.com/@skyake/the-evolution-of-delegate-performance-in-net-c8f23572b8b1