
A hang analysis of a .NET network edge computing system

author:opendotnet

One: Background

1. Tell a story

I had heard of network edge computing for a long time, but this time I actually ran into it. It seemed rather interesting, so I asked ChatGPT what it is for:

Edge computing is a computing model that moves computing power and data storage locations from traditional centralized data centers to user devices, sensors, and other IoT devices at the edge of the network. The purpose of this model is to provide faster computing and data processing power close to the source of data generation, thereby reducing data transmission delays and improving quality of service. Edge computing makes it possible to process and make decisions locally on the device, while also helping to offload network traffic and load on the central data center.

It is quite gratifying that .NET is used in a scenario like this. So let's analyze what this dump is all about.

Two: WinDbg analysis

1. Why does it get stuck?

Different kinds of programs call for different ways of analyzing a hang, so first identify the type of program and the call stack of the main thread, as follows:

0:000> !eeversion
5.0.721.25508
5.0.721.25508 @Commit: 556582d964cc21b82a88d7154e915076f6f9008e
Server mode with 64 gc heaps
SOS Version: 8.0.10.10501 retail build

0:000> k
# Child-SP RetAddr Call Site
00 0000ffff`e0dddac0 0000fffd`c194c30c libpthread_2_28!pthread_cond_wait+0x238
...
18 (Inline Function) --------`-------- libcoreclr!RunMain::$_0::operator()::{lambda(Param *)#1}::operator()+0x14c [/__w/1/s/src/coreclr/src/vm/assembly.cpp @ 1536] 
19 (Inline Function) --------`-------- libcoreclr!RunMain::$_0::operator()+0x188 [/__w/1/s/src/coreclr/src/vm/assembly.cpp @ 1538] 
1a 0000ffff`e0dde600 0000fffd`c153e860 libcoreclr!RunMain+0x298 [/__w/1/s/src/coreclr/src/vm/assembly.cpp @ 1538] 
...
20 0000ffff`e0dded10 0000fffd`c1bf7800 libhostpolicy!corehost_main+0xc0 [/root/runtime/src/installer/corehost/cli/hostpolicy/hostpolicy.cpp @ 409] 
21 (Inline Function) --------`-------- libhostfxr!execute_app+0x2c0 [/root/runtime/src/installer/corehost/cli/fxr/fx_muxer.cpp @ 146] 
22 (Inline Function) --------`-------- libhostfxr!<unnamed-namespace>::read_config_and_execute+0x3b4 [/root/runtime/src/installer/corehost/cli/fxr/fx_muxer.cpp @ 520] 
23 0000ffff`e0ddeeb0 0000fffd`c1bf6840 libhostfxr!fx_muxer_t::handle_exec_host_command+0x57c [/root/runtime/src/installer/corehost/cli/fxr/fx_muxer.cpp @ 1001] 
24 0000ffff`e0ddf000 0000fffd`c1bf4090 libhostfxr!fx_muxer_t::execute+0x2ec
25 0000ffff`e0ddf130 0000aaad`c9e1d22c libhostfxr!hostfxr_main_startupinfo+0xa0 [/root/runtime/src/installer/corehost/cli/fxr/hostfxr.cpp @ 50] 
26 0000ffff`e0ddf200 0000aaad`c9e1d468 dotnet!exe_start+0x36c [/root/runtime/src/installer/corehost/corehost.cpp @ 239] 
27 0000ffff`e0ddf370 0000fffd`c1c63fe0 dotnet!main+0x90 [/root/runtime/src/installer/corehost/corehost.cpp @ 302] 
28 0000ffff`e0ddf3b0 0000aaad`c9e13adc libc_2_28!_libc_start_main+0xe0
29 0000ffff`e0ddf4e0 00000000`00000000 dotnet!start+0x34

           

Judging from the output, this is a web site deployed on Linux. Since the site is hung, we need to look at what each thread is doing.

2. What are threads doing?

In my years of analysis experience, the vast majority of such hangs are caused by thread starvation, in other words thread pool exhaustion. So let's look at the thread pool first.

0:000> !t
ThreadCount: 365
UnstartedThread: 0
BackgroundThread: 354
PendingThread: 0
DeadThread: 10
Hosted Runtime: no
 Lock 
 DBG ID OSID ThreadOBJ State GC Mode GC Alloc Context Domain Count Apt Exception
0 1 31eaf 0000AAADF267C600 2020020 Preemptive 0000000000000000:0000000000000000 0000aaadf26634b0 -00001 Ukn 
...
423 363 36d30 0000FFDDB4000B20 1020220 Preemptive 0000000000000000:0000000000000000 0000aaadf26634b0 -00001 Ukn (Threadpool Worker) 
424 364 36d31 0000FFDDA8000B20 1020220 Preemptive 0000000000000000:0000000000000000 0000aaadf26634b0 -00001 Ukn (Threadpool Worker) 
425 365 36d32 0000FFDDAC000B20 1020220 Preemptive 0000000000000000:0000000000000000 0000aaadf26634b0 -00001 Ukn (Threadpool Worker) 

0:000> !tp
Using the Portable thread pool.

CPU utilization: 9%
Workers Total: 252
Workers Running: 236
Workers Idle: 13
Worker Min Limit: 64
Worker Max Limit: 32767

Completion Total: 0
Completion Free: 0
Completion MaxFree: 128
Completion Current Limit: 0
Completion Min Limit: 64
Completion Max Limit: 1000

           

Judging from the output, there are currently 365 managed threads. Is that a lot? For a 64-core machine this count is actually normal; readers from the training camp will know that with Server GC the GC threads alone number 64*2=128.

The next indicator is whether there is currently a backlog of tasks. You can check with the !ext tpq command; the output is as follows:

0:000> !ext tpq
global work item queue________________________________

local per thread work items_____________________________________

           

Judging from the output, there is currently no backlog of tasks, which runs somewhat counter to experience.
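The indicators that !tp and !ext tpq read from a dump can also be sampled from inside a live process. The following is a minimal sketch, assuming .NET Core 3.0 or later (where the ThreadPool.ThreadCount, PendingWorkItemCount and CompletedWorkItemCount properties exist); the ThreadPoolSnapshot class and its Capture method are hypothetical names chosen for illustration.

```csharp
using System;
using System.Threading;

public static class ThreadPoolSnapshot
{
    // Hypothetical helper: formats the current thread pool counters on one line.
    public static string Capture()
    {
        ThreadPool.GetMinThreads(out int minWorkers, out _);
        ThreadPool.GetMaxThreads(out int maxWorkers, out _);

        return $"threads={ThreadPool.ThreadCount} " +
               $"pending={ThreadPool.PendingWorkItemCount} " +
               $"completed={ThreadPool.CompletedWorkItemCount} " +
               $"minWorkers={minWorkers} maxWorkers={maxWorkers}";
    }

    public static void Main()
    {
        Console.WriteLine(Capture());
    }
}
```

A pending count that stays near zero while the thread count keeps climbing would match what this dump shows: threads are being consumed, but work is not queuing up.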

3. Is it really not thread starvation?

The last resort is more thorough: look at what every thread's stack is doing, which can be done with the ~*e !clrstack command.


You don't know until you look, and the result was a shock: 193 threads were waiting on Task.Result. This is a real classic. Walking up the call stack leads to UIUpdateTimer_Elapsed, so the waits appear to be triggered by a timer. The next question is how this code is written.


After analyzing that code, I found that it interacts with a Linux shell window through commands. For some unknown reason the shell did not respond, so the code got stuck here.
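The pattern described above can be reproduced in a few lines. This is only a sketch under stated assumptions, not the application's actual code: the handler name UIUpdateTimer_Elapsed is taken from the call stack, but the shell command is replaced by a hypothetical TaskCompletionSource that never completes, and the 200 ms interval is chosen only to make the demo quick.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class BlockedTimerDemo
{
    private static int _blocked; // number of handlers that have entered the blocking wait

    // Hypothetical stand-in for a shell command that never answers.
    private static readonly TaskCompletionSource<string> NeverAnswers =
        new TaskCompletionSource<string>();

    private static void UIUpdateTimer_Elapsed(object sender, System.Timers.ElapsedEventArgs e)
    {
        Interlocked.Increment(ref _blocked);
        string output = NeverAnswers.Task.Result; // blocks this thread pool thread forever
        Console.WriteLine(output);                // never reached
    }

    // Runs the timer for the given duration and returns how many handlers got stuck.
    public static int Run(TimeSpan duration, double intervalMs)
    {
        using var timer = new System.Timers.Timer(intervalMs) { AutoReset = true };
        timer.Elapsed += UIUpdateTimer_Elapsed;
        timer.Start();
        Thread.Sleep(duration);
        timer.Stop();
        return Volatile.Read(ref _blocked);
    }

    public static void Main()
    {
        int blocked = Run(TimeSpan.FromSeconds(3), 200);
        Console.WriteLine($"Handlers blocked in Task.Result: {blocked}");
    }
}
```

Because AutoReset is true, the timer keeps firing while earlier handlers are still blocked, so every tick parks one more thread pool thread. The blocked threads are background threads, so the demo process still exits when Main returns.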

4. Why doesn't the thread pool have a backlog?

I believe many readers are curious about this counterintuitive result: why are requests not backing up in the thread pool? This really tests your understanding of the internals of PortableThreadPool, so I will explain it briefly here.

  1. There is a GateThread thread in the ThreadPool that is dedicated to dynamically injecting threads into the thread pool, and the reference code is as follows:
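private static class GateThread
{
    private static void GateThreadStart()
    {
        // Excerpt: the gate thread loops forever, sleeping until the next delay
        // expires or until it is woken early through DelayEvent.
        while (true)
        {
            bool wasSignaledToWake = DelayEvent.WaitOne((int)delayHelper.GetNextDelay(tickCount));

            // Add a worker thread if the pool needs one.
            WorkerThread.MaybeAddWorkingWorker(threadPoolInstance);
        }
    }
}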

           
  2. Once someone calls Task.Result, the DelayEvent is signaled internally, telling GateThread to inject a new thread through the MaybeAddWorkingWorker method. The reference code (abridged) is as follows:
private bool SpinThenBlockingWait(int millisecondsTimeout, CancellationToken cancellationToken)
{
    // ... (excerpt) before blocking, the wait notifies the thread pool
    bool flag3 = ThreadPool.NotifyThreadBlocked();
    // ...
}

internal static bool NotifyThreadBlocked()
{
    if (UsePortableThreadPool)
    {
        return PortableThreadPool.ThreadPoolInstance.NotifyThreadBlocked();
    }
    return false;
}

public bool NotifyThreadBlocked()
{
    // ... (excerpt) wakes the gate thread so it can consider adding a worker
    GateThread.Wake(this);
    // ...
}

           

This active wake-up mechanism is an optimization in the C# PortableThreadPool to alleviate thread starvation. One point is important here: it can only alleviate the problem. In other words, if the upstream load is heavy enough, requests will still back up. So why is there no backlog here? Evidently the upstream is not aggressive. How can we verify that? By checking the timer's interval: pick the timer object off the current thread's stack and dump it.

0:417> !DumpObj /d 0000ffee380757f8
Name: System.Timers.Timer
MethodTable: 0000fffd4ab24030
EEClass: 0000fffd4ad6e140
Size: 88(0x58) bytes
File: /home/user/env/dotnet/shared/Microsoft.NETCore.App/5.0.7/System.ComponentModel.TypeConverter.dll
Fields:
 MT Field Offset Type VT Attr Value Name
0000fffd4c947498 400001c 8 ...ponentModel.ISite 0 instance 0000000000000000 _site
0000000000000000 400001d 10 ....EventHandlerList 0 instance 0000000000000000 _events
0000fffd479195d8 400001b 98 System.Object 0 static 0000000000000000 s_eventDisposed
0000fffd47926f60 400000e 40 System.Double 1 instance 3000.000000 _interval
0000fffd4791fb10 400000f 48 System.Boolean 1 instance 1 _enabled
0000fffd4791fb10 4000010 49 System.Boolean 1 instance 0 _initializing
0000fffd4791fb10 4000011 4a System.Boolean 1 instance 0 _delayedEnable
0000fffd4ab241d8 4000012 18 ...apsedEventHandler 0 instance 0000ffee3807aae8 _onIntervalElapsed
0000fffd4791fb10 4000013 4b System.Boolean 1 instance 1 _autoReset
0000fffd4c944ea0 4000014 20 ...SynchronizeInvoke 0 instance 0000000000000000 _synchronizingObject
0000fffd4791fb10 4000015 4c System.Boolean 1 instance 0 _disposed
0000fffd49963e28 4000016 28 ...m.Threading.Timer 0 instance 0000ffee38098dc8 _timer
0000fffd48b90a30 4000017 30 ...ing.TimerCallback 0 instance 0000ffee3807aaa8 _callback
0000fffd479195d8 4000018 38 System.Object 0 instance 0000ffee38098db0 _cookie

           

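Judging from the output, the timer's interval is 3000 ms. This 3-second period explains why the thread pool has no backlog: at most one new handler is queued every 3 seconds, a rate that the thread pool's thread injection can keep up with.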
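The effect can be observed in a small experiment: block somewhat more thread pool workers than the configured minimum, then queue one probe work item and see whether it still runs. This is a sketch under assumptions, not a measurement of the application in the dump; the InjectionDemo class, the choice of 4 extra blocked workers and the 60-second timeout are all hypothetical, and how quickly threads are injected depends on the runtime version.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class InjectionDemo
{
    public static (int Before, int After, int Blocked, bool ProbeRan) Run()
    {
        ThreadPool.GetMinThreads(out int minWorkers, out _);
        int before = ThreadPool.ThreadCount;

        // Never completed until the end, so every Wait() below blocks its worker.
        var gate = new TaskCompletionSource<bool>();

        // Block a few more workers than the pool creates on demand without delay.
        int toBlock = minWorkers + 4;
        for (int i = 0; i < toBlock; i++)
        {
            ThreadPool.QueueUserWorkItem(_ => gate.Task.Wait());
        }

        // The probe is queued behind the blockers; it runs only if the pool adds threads.
        using var probe = new ManualResetEventSlim(false);
        ThreadPool.QueueUserWorkItem(_ => probe.Set());
        bool probeRan = probe.Wait(TimeSpan.FromSeconds(60));

        int after = ThreadPool.ThreadCount;
        gate.SetResult(true); // release the blocked workers
        return (before, after, toBlock, probeRan);
    }

    public static void Main()
    {
        var result = Run();
        Console.WriteLine($"threads before={result.Before}, after={result.After}, " +
                          $"blocked={result.Blocked}, probe ran={result.ProbeRan}");
    }
}
```

If the probe runs, the queue has drained even though many workers are parked in blocking waits. That is the same picture as in this dump: a large number of threads stuck in Task.Result, and an empty work item queue.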

Three: Summary

This hang is fairly easy to solve. With some experience you could also pin it down directly with dotnet-counters. The point is that this is a Linux dump and, at the same time, a very interesting .NET scenario, so I am sharing it.
