Diagnose Memory Leaks in .NET Applications with WinDbg: A Hands-on Approach

Introduction

In this article, I will share what I learned about .NET memory management by troubleshooting a memory leak in a .NET service running in a real production environment. After reading this post, you will understand more about the following topics:

  • .NET Memory Management and .NET Garbage Collection
  • Diagnose memory usage with WinDbg

I previously published articles about the memory allocation of native applications written in C on Linux systems. I highly recommend reading both articles and comparing the differences; this can deepen your understanding of memory-related techniques.

Background

Firstly, let me explain the background: the problematic .NET service runs on Azure Service Fabric, which consists of a cluster of Windows Server machines. If you are not familiar with Service Fabric, please refer to my previous article, which explains how to develop microservices based on it. The problem is that the memory usage of this .NET service reaches roughly 1.5GB, about four times the average usage. Generally speaking, on the cloud more resource usage means higher cost. Next, let's diagnose what is happening here. But before we roll up our sleeves to debug this issue, let's first review how .NET memory works at a theoretical level.

.NET Memory Management

.NET Runtime

Unlike native software written directly against the operating system's APIs, programs written with .NET are called managed applications because they depend on a runtime and framework that manages many of their vital tasks and ensures a basic, safe operating environment. In .NET, the runtime refers to the Common Language Runtime (CLR), and the framework is the .NET Framework or .NET Core. A .NET application is built on top of them.

The CLR works as a virtual execution engine for .NET applications. It is a very complex topic that is out of the scope of this article, so we can't dive into the details, but one of the crucial functionalities it provides is exactly that: memory management.

.NET GC

When you're doing systems programming in C or C++, you need to manage memory yourself by calling standard library APIs like malloc and free to allocate and release it. On .NET, your life is easier, since the .NET memory manager does this task for you. When the program creates a new object, the memory manager allocates the memory for you. Easy, right? But that's only the trivial part of the memory manager; the complicated task is deciding how and when to collect the memory. Because of this, the .NET memory manager has another name: the .NET Garbage Collector.

The .NET GC uses two Win32 functions, VirtualAlloc and VirtualFree, to allocate segments of memory and release segments back to the operating system, respectively. The collection process can be divided into the following three phases:

  • Marking Phase: A list of live objects is created during the marking phase. This is done by starting from a set of references called the GC roots. The GC marks these root objects as live, then looks at any object they reference and marks those as live too, continuing the search transitively. This process is also called reachability analysis. All of the objects that are not on the reachable list are candidates to be released from memory.
  • Relocating Phase: The references of all the reachable objects are updated so that they point to the new location. The purpose of the relocating phase is to optimize memory usage by compacting the live objects closer together. This helps reduce memory fragmentation and improves memory allocation efficiency.

  • Compacting Phase: The memory space occupied by the dead objects is released and the live objects are moved to the new location.

The GC algorithm is a complex topic, and the above description only scratches the surface; you can explore more on your own. In the following sections, I will add extra information when necessary.
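
To make the marking phase concrete, here is a minimal C# sketch (illustrative only, not code from the problematic service) showing that memory held by objects with no GC root is reclaimed once a collection runs:

using System;

class GcReachabilityDemo
{
    static void Main()
    {
        long before = GC.GetTotalMemory(forceFullCollection: true);

        // Allocate ~8 MB of objects that immediately become unreachable
        // (no GC root keeps them alive after each loop iteration).
        for (int i = 0; i < 1024; i++)
        {
            var garbage = new byte[8 * 1024];
        }

        long afterAllocation = GC.GetTotalMemory(forceFullCollection: false);
        long afterCollection = GC.GetTotalMemory(forceFullCollection: true); // forces a blocking collection first

        Console.WriteLine($"before={before}, after allocation={afterAllocation}, after collection={afterCollection}");
    }
}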

WinDbg

To diagnose memory issues in .NET applications, there are many modern tools, like dotMemory. But in this article, I will show you a low-level tool: WinDbg. WinDbg is a multipurpose debugger for the Windows OS, which can be used to debug both user mode and kernel mode. Originally WinDbg was used to debug native applications, but it allows loading extensions that augment the debugger's supported commands. For example, to debug .NET applications running on the CLR, we need the SOS extension.

Now that we have a powerful debugger in hand, the next question is how to use it. Generally speaking, you can attach the debugger to the problematic service, but since the target service is running in the production environment, we can't do that easily. So we need to troubleshoot this issue offline, as follows:

We use ProcDump to dump the process's memory into a file, called a dump file, and then use WinDbg to analyze that dump file. I will not cover the detailed usage of ProcDump in this article and leave it to you.
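
As a rough sketch of that offline workflow (the PID and file names below are placeholders), you would capture a full memory dump with ProcDump, open it in WinDbg, and load the SOS extension:

procdump -ma <PID> service.dmp
windbg -z service.dmp
0:000> .loadby sos clr

The -ma flag tells ProcDump to write a full memory dump, -z tells WinDbg to open a dump file, and .loadby sos clr loads the SOS extension from the directory of the loaded CLR (for .NET Core processes, the module is coreclr, so the command becomes .loadby sos coreclr).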

Next, let’s play with the dump file, all right? Let’s go!

Diagnose the memory leak

First, let’s use the WinDbg !address extension to display information about the memory that the target process or target computer uses as follows:

0:000> !address -summary

--- Usage Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
Free 338 7ffe`d06bb000 ( 127.995 TB) 100.00%
<unknown> 4715 0`feb0f000 ( 3.980 GB) 83.90% 0.00%
Stack 276 0`16880000 ( 360.500 MB) 7.42% 0.00%
Image 1241 0`0e989000 ( 233.535 MB) 4.81% 0.00%
Heap 519 0`0b997000 ( 185.590 MB) 3.82% 0.00%
Other 15 0`001cd000 ( 1.801 MB) 0.04% 0.00%
TEB 92 0`000b8000 ( 736.000 kB) 0.01% 0.00%
PEB 1 0`00001000 ( 4.000 kB) 0.00% 0.00%

--- Type Summary (for busy) ------ RgnCount ----------- Total Size -------- %ofBusy %ofTotal
MEM_PRIVATE 5051 1`1e9d4000 ( 4.478 GB) 94.41% 0.00%
MEM_IMAGE 1753 0`1013b000 ( 257.230 MB) 5.30% 0.00%
MEM_MAPPED 55 0`00e26000 ( 14.148 MB) 0.29% 0.00%

--- State Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
MEM_FREE 338 7ffe`d06bb000 ( 127.995 TB) 100.00%
MEM_RESERVE 2503 0`cdc3c000 ( 3.215 GB) 67.78% 0.00%
MEM_COMMIT 4356 0`61cf9000 ( 1.528 GB) 32.22% 0.00%

Let’s examine the output, which provides so much information!

The first section shows the memory Usage Summary:

  • Free: the entire virtual address space that can still be claimed from the operating system. This may be backed by swap space, not only physical RAM. For a 64-bit process, the virtual address space is 128TB.
  • Heap: the memory that was allocated through the Windows Heap Manager; we generally call it the native heap.
  • Unknown: other heap managers implement their own memory management. Basically, they all work similarly: they get large blocks of memory from VirtualAlloc and then try to do better management of the small blocks within those large blocks. Since WinDbg doesn't know about any of these memory managers, that memory is reported as Unknown. It includes, but is not limited to, the managed heap of .NET. In my case, the value is 3.98GB, which is much bigger than the 1.5GB memory usage reported by the monitoring tool. I will explain why in the following sections.
  • Image: the memory space allocated for the binary assemblies.
  • Stack: straightforward. If you don't know about it, please refer to my previous article.

The second section shows the usage by memory Type: MEM_PRIVATE, MEM_IMAGE and MEM_MAPPED. MEM_PRIVATE is private to one process and MEM_IMAGE and MEM_MAPPED can be shared among multiple processes.

The third section shows the usage by memory State.

  • MEM_COMMIT: committed memory refers to the portion of the virtual address space that has been allocated and backed by physical memory or the paging file. Committed memory is actively used and accessible by the process. It includes memory regions that have been allocated and are in use, such as code, data, and heap allocations.
  • MEM_RESERVE: corresponds to reserved memory. Reserved memory refers to memory that has been reserved in the virtual address space but has not yet been committed. When memory is reserved, the address space is allocated, but physical memory or paging file resources are not immediately allocated. The process reserves the address space to ensure that it will be available when needed. Reserved memory can later be committed to make it usable by the process.
  • MEM_FREE: represents free memory. This category includes all memory that has not been reserved or committed.

In my case, the committed memory is 1.528GB, which matches the monitored memory usage almost exactly. This explains the confusing point mentioned above about the 3.98GB unknown memory segment: it turns out the majority of that memory is only reserved. So the next step is to get the actual committed memory usage for each type. How can we achieve that?

Thanks to the powerful !address extension, I can do that by adding some filters like this:

!address -f:MEM_COMMIT -c:".echo %1 %3 %5 %7"

This command outputs all the committed memory regions, and for each region prints its base address (%1), region size (%3), state (%5), and type (%7). The output goes as follows:

0x5f3a0000 0x1000 MEM_COMMIT Image
0x5f3a1000 0x2ab000 MEM_COMMIT Image
0x5f64c000 0xb000 MEM_COMMIT Image
0x5f657000 0x1b000 MEM_COMMIT Image
0x5f672000 0x2000 MEM_COMMIT Image
0x5f674000 0x2d000 MEM_COMMIT Image
0x7ffe0000 0x1000 MEM_COMMIT Other
0x4580ca7000 0x5000 MEM_COMMIT Stack
...thousands of lines omitted...

I wrote a simple script to parse the above output and finally got the following result:

Memory by Type:
type Image: 233.52344MB
type Other: 1.8007812MB
type Stack: 3.796875MB
type TEB : 0.71875MB
type PEB : 0.00390625MB
type Heap : 84.88281MB
type <unknown> : 1240.2461MB
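
My original parsing script isn't included here, but a minimal C# sketch of the aggregation (assuming the filtered !address output above was saved to a file named regions.txt, a name made up for this illustration) could look like this:

using System;
using System.IO;
using System.Linq;

class CommittedMemoryByType
{
    static void Main()
    {
        var totals = File.ReadLines("regions.txt")
            .Select(line => line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
            .Where(parts => parts.Length == 4)          // base address, size, state, type
            .GroupBy(parts => parts[3])                 // group regions by type
            .Select(group => new
            {
                Type = group.Key,
                // region sizes are hex strings such as 0x2ab000
                MB = group.Sum(parts => Convert.ToInt64(parts[1], 16)) / (1024.0 * 1024.0)
            });

        foreach (var total in totals)
            Console.WriteLine($"type {total.Type}: {total.MB:F2}MB");
    }
}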

Based on the aggregated result above, you can see the majority of the committed memory comes from the unknown segment, which contains the .NET managed heap. This is where I suspect the memory leak occurs. Various WinDbg commands can help diagnose memory leak problems, and in this specific case I found the command !finalizequeue super helpful!

Before I show you the output of the command, let's examine what the finalizer queue is. In C#, the finalizer (also referred to as a destructor) is used to perform any necessary final clean-up when a class instance is being collected by the GC.

class Car
{
    ~Car() // finalizer
    {
        // cleanup statements...
    }
}

When your application encapsulates unmanaged resources, such as windows, files and network connections, you should use the finalizer to free those resources.

Objects of classes with finalizers can’t be removed immediately: they instead go to the finalizer queue and are removed from memory once the finalizer has been run.
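
This is also why types that wrap unmanaged resources usually implement IDisposable in addition to a finalizer: when the caller disposes the object deterministically, GC.SuppressFinalize removes it from the finalization list so it doesn't linger in the finalizer queue. A minimal sketch of the pattern (illustrative, not code from the service in question):

using System;
using System.Runtime.InteropServices;

class NativeBuffer : IDisposable
{
    private IntPtr _buffer = Marshal.AllocHGlobal(1024); // an unmanaged resource
    private bool _released;

    public void Dispose()
    {
        Release();
        GC.SuppressFinalize(this); // the object no longer needs to go through the finalizer queue
    }

    ~NativeBuffer() // finalizer: a safety net in case Dispose was never called
    {
        Release();
    }

    private void Release()
    {
        if (_released) return;
        Marshal.FreeHGlobal(_buffer);
        _released = true;
    }
}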

Based on this, let’s examine the output of command !finalizequeue:

0:000> !FinalizeQueue
SyncBlocks to be cleaned up: 0
Free-Threaded Interfaces to be released: 0
MTA Interfaces to be released: 0
STA Interfaces to be released: 0
----------------------------------
generation 0 has 66 finalizable objects (0000021c729c47b0->0000021c729c49c0)
generation 1 has 4 finalizable objects (0000021c729c4790->0000021c729c47b0)
generation 2 has 4372 finalizable objects (0000021c729bbef0->0000021c729c4790)
Ready for finalization 0 objects (0000021c729c49c0->0000021c729c49c0)
Statistics for all finalizable objects (including all objects ready for finalization):
MT Count TotalSize Class Name
00007ffcde443f78 1 32 Microsoft.Win32.SafeHandles.SafePEFileHandle
00007ffcde4407f8 1 32 System.Security.Cryptography.X509Certificates.SafeCertContextHandle

--- many lines of output omitted ---

00007ffcde447db0 97 3104 Microsoft.Win32.SafeHandles.SafeWaitHandle
00007ffcdd272468 28 4928 System.Diagnostics.PerformanceCounter
00007ffcde45d6a8 64 6144 System.Threading.Thread
00007ffcde4297c0 3639 262008 System.Reflection.Emit.DynamicResolver
Total 4442 objects

Notice that there are 3,639 objects of type System.Reflection.Emit.DynamicResolver. As we know, objects in the finalizer queue can't be removed until their finalizers run. This also means that any object they reference, any object referenced by those, and so on, has to be kept in memory.
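
To dig further into what keeps such objects alive, the SOS extension also provides the !dumpheap and !gcroot commands. A typical (illustrative) sequence looks like this, where the method table value comes from the !finalizequeue statistics above and the object address comes from your own dump:

0:000> !dumpheap -mt 00007ffcde4297c0
0:000> !gcroot <object address>

!dumpheap -mt lists the instances of a given method table, and !gcroot prints the chain of references from a GC root down to the chosen object.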

This is the likely cause of the memory leak. System.Reflection.Emit is a low-level library used to generate Microsoft Intermediate Language (MSIL) dynamically, and my application does not rely on it directly. It turned out the issue came from the Service Fabric SDK, and upgrading to a newer version fixed it.

Summary

In this article, we examined how .NET GC works and how to diagnose the memory leak issue with WinDbg.

Linux Foundation Scholarship

After graduating 10 years ago, I didn't expect that I would still be eligible for a scholarship, especially considering that I wasn't a high-achieving student at that time. But recently, I joined a program from the Linux Foundation: the Shubhra Kar Linux Foundation Training (LiFT) Scholarship Program. And very luckily, I got awarded.

I plan and hope to build a bright new career in the open-source software field; that's my motivation for joining the program!

Thanks to the award from the Linux Foundation, I got the opportunity to select one training course and certification exam for free. Big thanks.

The one I chose is CKAD (Certified Kubernetes Application Developer). Stay tuned; I will share updates later.

Scalability Lessons Learned from Amazon's Return to the Monolith

Background

Recently, the engineering team at Amazon Prime Video posted an article that became super popular in the software community. Many discussions arose concerning how to design a scalable application in the modern cloud computing era.

Scalability is one critical metric for designing and developing an online application. Technically it is a challenging task, but the cloud makes it easier, because public cloud service providers like AWS and Azure do the heavy lifting for you. Virtual machines, containers, and serverless: you have many powerful technologies to scale your application and business.

But the problem is that you need to choose the one suitable for your application. So in this article, let's examine why Amazon refactored the application and how they did it.

Since Amazon doesn’t expose the code implementation of this project, our analysis is based on the original post and the technical understanding of AWS. All right?

Case study

The product refactored is called Amazon Prime Video Monitoring Service, which monitors the quality of thousands of live video streams. This monitoring tool runs the real-time analysis of the streams and detects quality issues. The entire process consists of 3 steps: media converter, defect detector and real-time notification.

  • media converter: converting input audio/video streams to frames.
  • defect detector: executing machine learning algorithms that analyze frames to detect defects.
  • real-time notification: sending real-time notifications whenever a defect is found.

The old architecture goes as follows:

In this microservices architecture, each step was implemented as an AWS Lambda serverless function (I assume you already know what serverless is; if not, please refer to other online documents). And the entire workflow was orchestrated by AWS Step Functions, which is a serverless orchestration service.

AWS Step Functions State Transition

The first bottleneck comes from AWS Step Functions itself. But before discussing the issues of this architecture, we need to understand what AWS Step Functions is and how it basically works. This knowledge is critical for understanding the performance bottlenecks later.

AWS Step Functions is a serverless service that allows you to coordinate and orchestrate multiple AWS serverless functions using a state machine. You can define the workflow of serverless applications as a state machine, which represents the different states and transitions of the application’s execution.

A state machine can be thought of as a directed graph, where each node represents a state and each edge represents a transition between states; it is used to model complex systems. The topic of state machines isn't in the scope of this article; I will write a post about it in the future. So far, you only need to know that each state in the state machine represents a specific step in the workflow, while transitions represent the events that trigger moving from one step to another. You can find many examples of AWS Step Functions in this repo; please take a look at what problems you can solve with it.

In theory, this serverless-based microservices architecture can scale out easily. However, as the original post mentioned, it "hit a hard scaling limit at around 5% of the expected load. Also, the overall cost of all the building blocks was too high to accept the solution at a large scale."

So the bottleneck is the cost! Because AWS Step Functions charges users per state transition, and in this monitoring service's case it performed multiple state transitions for every second of the video stream, it quickly hit the account limits!
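
To get a rough feel for the numbers (my own back-of-the-envelope estimate, not figures from the original post): AWS Step Functions Standard Workflows are priced per state transition, on the order of $0.025 per 1,000 transitions at the time of writing. A single stream driving just 5 transitions per second produces about 432,000 transitions per day, roughly $11 per stream per day before Lambda and storage costs are even counted, and the service monitors thousands of streams.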

As a software engineer working in a team, maybe you can just focus on writing great code, but when you need to select the tech stack, you have to consider the cost, especially when your application is running on the cloud.

AWS S3 Buckets Tier-1 requests

As mentioned above, the media converter service splits videos into frames, and the defect detector service loads and analyzes the frames later. So in the original architecture, the frame images are stored in an Amazon S3 bucket. As the original post mentioned, "Defect detectors then download images and processed them concurrently using AWS Lambda. However, the high number of Tier-1 calls to the S3 bucket was expensive."

So the second bottleneck is also a cost issue. But what is a Tier-1 call or request in the context of AWS S3?

A Tier-1 request is the class of S3 API requests that is billed at the higher rate. AWS S3 API requests fall into two pricing categories:

  • Tier-1 requests: PUT, COPY, POST, and LIST requests.
  • Tier-2 (standard) requests: GET, SELECT, and all other requests, which are billed at a lower rate.

Tier-1 requests are more expensive because they write objects into, or enumerate the contents of, an S3 bucket, which is more resource-intensive on the S3 side; on top of the request charges, every retrieved object also incurs data transfer over the network. In this architecture, every frame had to be written to the bucket by the media converter and then fetched by the defect detectors, so the sheer per-frame request volume quickly added up on the bill.
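
Again as a back-of-the-envelope illustration (my own estimate, not numbers from the original post): a single stream converted at 30 frames per second generates about 2.6 million frame uploads per day; at roughly $0.005 per 1,000 Tier-1 requests, that is already around $13 per stream per day just for the PUTs, before the detectors' GET requests and data transfer are counted, multiplied across thousands of monitored streams.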

Monolith Architecture

Based on these two bottlenecks, they refactored this monitoring tool and returned to the monolith architecture as follows:

In the new design, everything runs inside a single process hosted in Amazon Elastic Container Service (ECS). In this monolith architecture, the frames are stored in memory instead of S3 buckets, and no serverless orchestration service is needed either.

How does this new architecture run at a high scale? They directly scale out and partition the Amazon ECS cluster. In this way, they get a scalable monitoring application with a 90% cost reduction.

Summary

There is no perfect application architecture that fits all cases. In the cloud computing era, you need to understand the services you use better than before and make wise decisions.

Build Microservices with Service Fabric: A Hands-on Approach

Background

In this article, I want to introduce a distributed systems platform: Service Fabric, which is used to build microservices.

First things first, what is Service Fabric? In the software world, Service Fabric is an orchestrator, so it is a competitor of Kubernetes, Docker Swarm, etc.

I know what you're thinking: at the time of writing, Kubernetes has already won the competition. Why am I still writing about Service Fabric? My motivations are as follows:

  • Firstly, I once used Service Fabric to build microservices and learned something valuable about it. I want to summarize all my learnings and share them with you here!
  • Secondly, we can (briefly) compare Service Fabric and Kubernetes to understand why the cloud-native solution is better and what problems it solves.
  • Finally, Service Fabric is still widely used in some enterprise applications; for details, you can refer here. So the skills you learn about Service Fabric are valuable.

Hands-on Project

To learn the Service Fabric platform, we will examine an open-source project: HealthMetrics. This project was originally developed by Microsoft to demonstrate the power of Service Fabric, and I added some enhancements to it; for details, please refer to this GitHub repo.

At the business requirement level, this application goes like this: each patient keeps reporting heart rate data to the doctor via a wearable device (the band in this case). The doctor aggregates the collected health data and reports it to the county service. Each county service instance runs some statistics on the doctors' data and sends the report to the national service. Finally, you get something as follows:

Before we explore the detailed implementation, you can pause here for a few minutes. How would you design the architecture if this application were assigned to you? How would you make it both reliable and scalable?

Service Fabric Programming Model

Service Fabric provides multiple ways to write and manage your services:

  • Reliable Services: the most popular choice in the context of Service Fabric; we'll examine it in much more detail later.
  • Containers: Service Fabric can also deploy services in containers. But if you choose to use containers, why not directly use Kubernetes? I'll not cover it in this post.
  • Guest executables: you can run any type of code or script, such as Node.js, as a guest executable in Service Fabric. Note that the Service Fabric platform doesn't include the runtime for guest executables; instead, the operating system hosting Service Fabric must provide the corresponding runtime for the given language. It provides a flexible way to run legacy applications within microservices. If you're interested in it, please refer to this article; I will not examine the details in this post.

Next, let’s examine what are Reliable Services.

Reliable Services

As I mentioned above, Service Fabric is an orchestrator that provides infrastructure-level functionality like cluster management, service discovery, and scaling, as Kubernetes does. But Service Fabric goes further by also providing a reliable services model to guide development at the application level; this is a distinctive point compared with Kubernetes. Simply speaking, the application code can call the Service Fabric runtime APIs to query the system and build reliable applications. In the following sections, I'll show you what that means with some code blocks.

Reliable Services can be classified into two types as follows:

  • Stateless service: when we say stateless service in the context of Service Fabric, we are not saying that the service doesn't have any state to store; it means that the service doesn't store its data inside the Service Fabric cluster. Indeed, a stateless service can store state in an external database. The name stateless is relative to the cluster.
  • Stateful service: similarly, a stateful service keeps its state locally in the Service Fabric cluster, which means the data and the service are on the same virtual or physical machine. This local state is stored in Reliable Collections. Compared with external data storage, Reliable Collections imply lower latency and higher performance. In the following section, I will introduce more about Reliable Collections, which is a very interesting topic. Please hold on!

Besides stateless and stateful services listed above, there is the third type of service provided by Service Fabric: Actor Service.

  • Actor Service: a special type of stateful service that applies the actor model. The actor model is a general computer science model for handling concurrent computation. It is a very big topic worthy of a separate article; I'll not cover the details in this post. Instead, let's go through this model in the context of Service Fabric.

The actor model was proposed back in 1973 to simplify the process of writing concurrent programs, and it offers the following advantages:

  • Encapsulation: in the actor model, each actor is a self-contained unit that encapsulates its own state and behavior. Each actor processes one message at a time on a single thread, so you don't have to worry about complex multi-threading issues. But how does it support high concurrency? Just allocate more actor instances as the load goes up.
  • Message passing: unlike traditional multi-threading techniques, actors do not share memory; instead, they communicate through asynchronous message passing, which reduces the complexity of concurrent programs.
  • Location transparency: as a self-contained computation unit, each actor can be located on a different machine, which makes the model a perfect fit for building distributed applications, as Service Fabric does.

In the future, I will write an article on the actor model to examine more details about it. Next, let’s take a look at the demo app: HealthMetrics and analyze how it was built based on the programming models we discussed above.

Architecture of HealthMetrics

There are several services included in this HealthMetrics demo application. They are:

  • BandCreationService: this stateless service reads the input CSV file and creates the individual band actors and doctor actors. For example, as the snapshot above shows, the application creates roughly 300 band actors and doctor actors.
  • BandActor: this actor service is the host for the band actors. Each BandActor represents an individual wearable device, which generates heart rate data and then sends it to the designated DoctorActor every 5 seconds.
  • DoctorActor: the doctor actor service aggregates all of the data it receives from each band actor and generates an overall view for that doctor and then pushes it into the CountyService every 5 seconds.
  • CountyService: this stateful service aggregates the information provided by the doctor actor further and also pushes the data into the NationalService.
  • NationalService: this stateful service maintains the total aggregated data for the entire country and is used to serve data requested by the WebService.
  • WebService: this stateless service just hosts a simple web API to query information from NationalService and render the data on the web UI.

As you can see, this demo application consists of all three types of services provided by service fabric, which is a perfect case to learn service fabric, right?

In the next sections, let’s examine what kinds of techniques of service fabric are applied to build this highly scalable and available microservices application.

Naming Perspective

Naming plays an important role in distributed systems. Names are used to share resources, to uniquely identify entities, to refer to locations, and so on. In a distributed system, each name must be resolved to the entity it refers to, and to resolve names, it is necessary to implement a naming system.

In Service Fabric, naming is used to identify and locate services and actors within a cluster. And Service Fabric uses a structured naming system, where services and actors are identified using a hierarchical naming scheme. The naming system is based on the concept of a namespace. A namespace is a logical grouping of related applications and services. In the Service Fabric namespace, the first level is the application, which represents a logical grouping of services, and the second level is the service, which represents a specific unit of functionality that is part of the application. And the default namespace is fabric:/, so the service URI in the Service Fabric cluster is in the following format:

fabric:/applicationName/ServiceName

And in our HealthMetrics demo app, the URL will be something like fabric:/HealthMetrics/HealthMetrics.BandActor or fabric:/HealthMetrics/HealthMetrics.CountyService(the application name is HealthMetrics and the service name is HealthMetrics.BandActor or HealthMetrics.CountyService).

In this demo app, we build a helper class ServiceUriBuilder to generate URI as follows:

public Uri ToUri()
{
    string applicationInstance = this.ApplicationInstance;

    if (String.IsNullOrEmpty(applicationInstance))
    {
        try
        {
            // the ApplicationName property here automatically prepends "fabric:/" for us
            applicationInstance = FabricRuntime.GetActivationContext().ApplicationName.Replace("fabric:/", String.Empty);
        }
        catch (InvalidOperationException)
        {
            // FabricRuntime is not available.
            // This indicates that this is being called from somewhere outside the Service Fabric cluster.
        }
    }

    return new Uri("fabric:/" + applicationInstance + "/" + this.ServiceInstance);
}

For details, please refer to this source code file.

Since the behavior of this name resolver is quite similar to an internet domain name resolver, the Service Fabric Naming Service can be regarded as an internal DNS service. This kind of internal DNS service is a common requirement for all distributed systems, including Kubernetes. In the modern cloud-native ecosystem, the popular choice is CoreDNS, which runs inside Kubernetes. In the future, I will write an article about DNS; please keep watching my blog for updates.

Besides this default DNS-style solution, you can use other service registry and service discovery solutions. For example, in a previous article I examined how to do this with Consul. Imagine your large-scale application consists of hundreds of microservices; with the default DNS solution you would have to hardcode many names in each service. That's exactly where Consul can help. I won't repeat the details here; please refer to my previous article!

Now that we understand how the naming system works, let’s examine how to do inter-service communication based on that.

Communication Perspective

Inter-service communication is at the heart of all distributed systems. On a non-distributed platform, communication between processes can be done by sharing memory, but communication between processes on different machines has always been based on the low-level message passing offered by the underlying network. There are two widely used models for communication in distributed systems: Remote Procedure Call (RPC) and Message-Oriented Middleware (MOM).

  • Remote Procedure Call (RPC): ideal for client-server applications, aiming to hide message-passing details from the client. The HTTP protocol follows a typical client-server model, so HTTP requests can be thought of as a type of RPC. In the client-server model, the client and server must be online at the same time to exchange information, so this is often called synchronous communication.

  • Message-Oriented Middleware (MOM): suitable for distributed applications where communication does not follow the strict pattern of a client-server model. In this model, three parties are involved: the message producer, the message consumer, and the message broker. When the message producer publishes a message, the message consumer doesn't have to be online; the message broker stores the message until the consumer becomes available to receive it, so this model is also called asynchronous communication.

Service Fabric supports RPC out of the box. In our HealthMetrics demo app, you can find the RPC calls and HTTP requests easily. For example, when the BandActor and the DoctorActor are created, an RPC call is used in the BandCreationService as follows:

private async Task CreateBandActorTask(BandActorGenerator bag, CancellationToken cancellationToken)
{
    // omit some code for simplicity
    ActorId bandActorId;
    ActorId doctorActorId;
    bandActorId = new ActorId(Guid.NewGuid());
    doctorActorId = new ActorId(bandActorInfo.DoctorId);
    IDoctorActor docActor = ActorProxy.Create<IDoctorActor>(doctorActorId, this.DoctorServiceUri);
    await docActor.NewAsync(doctorName, randomCountyRecord);
    IBandActor bandActor = ActorProxy.Create<IBandActor>(bandActorId, this.ActorServiceUri);
    await bandActor.NewAsync(bandActorInfo);
    // omit some code for simplicity
}

You can see that an ActorProxy instance is created as the RPC client, and the NewAsync method of the BandActor and DoctorActor services is called. For example, the NewAsync method of BandActor goes like this:

public async Task NewAsync(BandInfo info)
{
    await this.StateManager.SetStateAsync<CountyRecord>("CountyInfo", info.CountyInfo);
    await this.StateManager.SetStateAsync<Guid>("DoctorId", info.DoctorId);
    await this.StateManager.SetStateAsync<HealthIndex>("HealthIndex", info.HealthIndex);
    await this.StateManager.SetStateAsync<string>("PatientName", info.PersonName);
    await this.StateManager.SetStateAsync<List<HeartRateRecord>>("HeartRateRecords", new List<HeartRateRecord>()); // initially the heart rate records are empty list
    await this.RegisterReminders();

    ActorEventSource.Current.ActorMessage(this, "Band created. ID: {0}, Name: {1}, Doctor ID: {2}", this.Id, info.PersonName, info.DoctorId);
}

You can ignore the detailed content of this method for now; I will explain it in a later section.

And when the DoctorActor service reports its status, it calls the CountyService endpoint with HTTP requests:

public async Task SendHealthReportToCountyAsync()
{
    // omit some code
    await servicePartitionClient.InvokeWithRetryAsync(
        client =>
        {
            Uri serviceAddress = new Uri(
                client.BaseAddress,
                string.Format(
                    "county/health/{0}/{1}",
                    partitionKey.Value.ToString(),
                    id));

            HttpWebRequest request = WebRequest.CreateHttp(serviceAddress);
            request.Method = "POST";
            request.ContentType = "application/json";
            request.KeepAlive = false;
            request.Timeout = (int) client.OperationTimeout.TotalMilliseconds;
            request.ReadWriteTimeout = (int) client.ReadWriteTimeout.TotalMilliseconds;

            using (Stream requestStream = request.GetRequestStream())
            {
                using (BufferedStream buffer = new BufferedStream(requestStream))
                {
                    using (StreamWriter writer = new StreamWriter(buffer))
                    {
                        JsonSerializer serializer = new JsonSerializer();
                        serializer.Serialize(writer, payload);
                        buffer.Flush();
                    }

                    using (HttpWebResponse response = (HttpWebResponse) request.GetResponse())
                    {
                        ActorEventSource.Current.Message("Doctor Sent Data to County: {0}", serviceAddress);
                        return Task.FromResult(true);
                    }
                }
            }
        }
    );
    // omit some code
}

In the demo app, no asynchronous communication is used. But you can easily integrate message broker middleware with Service Fabric. This kind of microservice design is called Event-Driven Architecture. In the future, I will write another article about it; please keep watching my blog!

Scalability Perspective

Scalability has become one of the most important design goals for developers of distributed systems. A system is scalable if we can easily add more users and resources to it without any noticeable loss of performance. In most cases, scalability problems in distributed systems appear as performance problems caused by the limited capacity of servers and networks.

Simply speaking, there are two types of scaling techniques: scaling up and scaling out:

  • scaling up: the process by which a machine is equipped with more, and often more powerful, resources (e.g., by increasing memory, upgrading CPUs, or replacing network modules) so that it can better accommodate performance-demanding applications. However, there are limits to how much you can scale up a single machine, and at some point it may become more cost-effective to scale out.
  • scaling out: extending a networked computer system by adding more computers and subsequently distributing workloads across the extended set of machines. There are basically only three techniques we can apply: asynchronous communication, replication, and partitioning.

We already examined asynchronous communication in the last section. Next, let’s take a deep look at the other two:

  • Replication: a technique that replicates components or resources across a distributed system. Replication not only increases availability but also helps balance the load between components, leading to better performance. In Service Fabric, no matter whether a service is stateless or stateful, you can replicate it across multiple nodes. As the workload of your application increases, Service Fabric will automatically distribute the load. But replication can only boost performance for read requests (which don't change data); if you need to optimize the performance of write requests (which change data), you need partitioning.

  • Partitioning: an important scaling technique that involves taking a component or other resource, splitting it into smaller parts, and subsequently spreading those parts across the system. Each partition contains only a subset of the entire dataset. This reduces the amount of data that needs to be processed and accessed by each partition, which can lead to faster processing times and improved performance. In addition to reducing the size of the data set, partitioning can also improve concurrency and reduce contention.

In Service Fabric, each partition consists of a replica set with a single primary replica and multiple active secondary replicas. Service Fabric makes sure to distribute replicas of partitions across nodes so that secondary replicas of a partition do not end up on the same node as the primary replica, which can increase the availability.

The difference between a primary replica and a secondary replica is that the primary replica can handle both read and write requests, while a secondary replica can only handle read requests. Moreover, by default read requests are only handled by the primary replica; if you want to spread read requests across the secondary replicas, you need to set ListenOnSecondary at the application code level, as the sketch below shows. You can set the number of replicas with MinReplicaSetSize and TargetReplicaSetSize.
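
As a rough sketch (the service and listener classes here are placeholders I made up, not code from HealthMetrics), opting a stateful service's communication listener in on secondary replicas looks roughly like this:

using System.Collections.Generic;
using System.Fabric;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Services.Communication.Runtime;
using Microsoft.ServiceFabric.Services.Runtime;

// Placeholder listener: a real service would host HTTP or remoting here.
internal sealed class NoopListener : ICommunicationListener
{
    public Task<string> OpenAsync(CancellationToken cancellationToken) => Task.FromResult("noop://listener");
    public Task CloseAsync(CancellationToken cancellationToken) => Task.CompletedTask;
    public void Abort() { }
}

internal sealed class CountyStatsService : StatefulService
{
    public CountyStatsService(StatefulServiceContext context) : base(context) { }

    protected override IEnumerable<ServiceReplicaListener> CreateServiceReplicaListeners()
    {
        // listenOnSecondary: true lets secondary replicas accept (read) requests as well
        yield return new ServiceReplicaListener(
            context => new NoopListener(),
            name: "ReadListener",
            listenOnSecondary: true);
    }
}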

When a client needs to call a service with multiple partitions, the client needs to generate a ServicePartitionKey and send requests to the service with this partition key. For example, the DoctorActor sends the health report to the partitioned county service as follows:

public async Task SendHealthReportToCountyAsync() {
    // omit some code
    ServicePartitionKey partitionKey = new ServicePartitionKey(countyRecord.CountyId);
    ServicePartitionClient<HttpCommunicationClient> servicePartitionClient =
        new ServicePartitionClient<HttpCommunicationClient>(
            this.clientFactory,
            this.countyServiceInstanceUri,
            partitionKey);
    await servicePartitionClient.InvokeWithRetryAsync();
    // omit some code
}

You can see the partition key is generated based on the county id field. Service Fabric provides several different partitioning schemes; in this case, ranged partitioning is used, where the partition key is an integer.

The county service specifies an integer range with a Low Key (set to 0) and a High Key (set to 57000). It also defines the number of partitions with PartitionCount (set to 3). All the integer keys are evenly distributed among the partitions, so the partitions of the county service look as follows:

As I mentioned above, the county id is unique. We can generate a hash code based on the id field and take it modulo the key range to get the partition key. The Service Fabric runtime will then direct the request to the target node based on that partition key.
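
In HealthMetrics the county id is already an integer in range, but as an illustrative sketch of that hash-then-modulo idea for an arbitrary identifier (the class and constants below are made up for this example), the mapping could look like this:

using System;
using Microsoft.ServiceFabric.Services.Client;

static class CountyPartition
{
    private const long LowKey = 0;
    private const long HighKey = 57000;

    // Map an arbitrary county identifier onto the ranged partition key space [LowKey, HighKey].
    public static ServicePartitionKey ToPartitionKey(string countyId)
    {
        long hash = Math.Abs((long)countyId.GetHashCode());
        return new ServicePartitionKey(LowKey + hash % (HighKey - LowKey + 1));
    }
}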

Summary

In this post, we quickly examined some interesting topics about Service Fabric. I have to admit that Service Fabric is a big project; what I examined here covers only a small portion of the entire system. Feel free to explore more!

Understand Red Black Tree: part one - background

Introduction

In this series of articles, I want to examine an important data structure: the red-black tree. The red-black tree is an advanced data structure that is difficult to fully understand. Maybe you have some questions and confusion about it, such as:

  • What is the meaning of red and black here?
  • The red-black tree is known as a self-balancing binary search tree. But what’s the difference between it and others?
  • Where is it used or applied?

If you want to know the answer to the above questions, then this article is just for you. I will cover the following topics:

  • Why do we need the red-black tree and what is its advantage?
  • How does a red-black tree work?
  • How to write a red-black tree from scratch?

Background of Red-Black Tree

In this section, let’s review the history of the red-black tree. During this process, I will show you why it was invented and what kind of advantages it can provide compared with other tree data structures.

First things first, let's define the red-black tree as follows:

  • Red-Black Tree is a self-balancing binary search tree.

Let’s analyze this definition step by step:

  • Tree: A tree is a nonlinear data structure, in contrast to arrays, linked lists, stacks and queues, which are linear data structures. A tree consists of one node called the root and zero or more subtrees. One disadvantage of linear data structures is that the time required to search a linear list is proportional to the size of the data set, which means that the time complexity is O(n). That's why more efficient data structures like trees were invented to store and search data.

  • General Tree: A general tree is a tree where each node may have zero or more children.

  • Binary Tree: A binary tree is a specialized case of a general tree, where each node can have no more than two children.

  • Binary Search Tree: A binary tree satisfying the binary search property is called a binary search tree (BST). In a BST, keys greater than a given node's key are stored in its right subtree, and keys equal to or less than it are stored in its left subtree.

The average search time complexity of a BST is O(logN), but in the worst case it degrades to O(N). This happens when we insert the nodes one by one in sorted order. For example, if we insert the elements of the array [1, 2, 3, 4, 5, 6] into a BST in order, what we get is as follows:
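
Each new key is larger than every key already in the tree, so it always ends up as the right child of the previous node: 1 becomes the root, 2 its right child, then 3, and so on, producing a right-leaning chain of 6 nodes. Searching for 6 now takes 6 comparisons instead of the roughly log2(6) ≈ 3 we would expect from a balanced tree.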

This kind of unbalanced BST degrades into a singly linked list. So the BST needs to stay balanced during node insertions and deletions; that's where the self-balancing binary search tree comes in.

  • Self-Balancing Binary Search Tree: it’s a binary search tree that automatically keeps its height small in the face of arbitrary item insertions and deletions. And the red-black tree is just one type of it.

Summary

As the first post of this series, I examined the background of the red-black tree bit by bit. I hope you now understand what a self-balancing binary search tree is and why we need it. In the next post, I will start to examine the behavior and properties of the red-black tree.

Deletion operation in Binary Search Tree: successor or predecessor

Introduction

Today, I want to examine one concrete topic: how to delete a node from a binary search tree. This question is important for understanding other tree data structures, like the AVL tree and the red-black tree.

As is common with many data structures, deletion is always the hardest operation. For example, to delete a specific element from an array, we need to shift all the elements after it by one index. And to delete one node from a linked list, we need to reset the pointer of the previous node (if it's a doubly linked list, we need to reset more pointers, which is a more complex case). The same is true for the binary search tree, as we'll see in the following sections.

Note: the idea of this article is inspired by the book Data Structures and Algorithm Analysis in C written by Mark Allen Weiss. The demo code shown in the following sections is from this book, with some changes I made on top of it. I highly recommend this great book to readers.

Deletion of the binary search tree

Before we can examine the deletion operation in depth, let’s quickly review the concept of the binary search tree as follows:

  • Binary search tree: a binary tree with the property that for every node X in the tree, the values of all the keys in its left subtree are smaller than the key value of X, and the values of all the keys in its right subtree are larger than the key value of X.

Simply speaking, the binary search tree is a binary tree satisfying the binary search property. So each node of a binary search tree can have two children subtrees at most. When we need to delete one node from the binary search tree, then it’s necessary to consider the following 3 cases.

If the node is a leaf, then it can be deleted immediately. For example, to delete node 1 from the following binary search tree:

I will show you how to implement it at the code level later, which summarizes all the cases.

If the node has only one child subtree, the node can be deleted after we reset its parent’s pointer to bypass the node and point to its child node. For example, to delete node 4 as follows:

In the above example, node 4 has only one child, the left child node 3. Now consider the symmetric scenario: what happens when the node has only a right child? The answer is that it doesn't change how we handle the deleted node, and the result is analogous. I will not draw the diagram here and leave it for you to explore.

The complicated case is how to deal with a node with two children. Before we introduce the solution, let’s clarify one concept about the binary search tree: successor:

  • Successor: is the node with the minimum value in the right subtree of any node.

Let's take node 16 in the above binary search tree as an example: its successor is node 17. All right! Based on this concept, the solution to delete a node with two children is straightforward:

  • Replace the data of the deleted node with the value of the successor and recursively delete the successor from the right subtree.

Let's analyze this solution. Firstly, replacing the data of the node with the value of its successor keeps the binary search property after the deletion. Secondly, since the successor holds the minimum value of the right subtree, it cannot have a left child. So the successor is either a leaf node without any children or a node with only a right child, and recursively deleting the successor reduces to the two simple cases we discussed above. Perfect, right? For instance, the deletion of node 16 goes as follows:

As we mentioned above, node 16's successor is node 17. So we replace the value with 17 and delete node 17 from the right subtree, where node 17 is just a leaf node. Next, let's implement the node deletion operation.

Code implementation

In this article, I will only show the code related to the deletion operation rather than a complete implementation of a binary search tree. If you want to know how to write a BST from scratch, please refer to this GitHub repo.

Firstly, let’s examine the header file, which contains the data type definitions and function declarations.

#include "utility.h"
#include <stdio.h>
#include <stdlib.h>

#ifndef _BST_H
#define _BST_H

typedef int ET;

struct TreeNode;
typedef struct TreeNode *Position;
typedef Position BST;

BST delete(ET, BST);
Position findMinBST(BST);

// code omitted here
#endif

You can notice that besides the delete function, we also define a helper function findMinBST which is used to find the successor node.

Next, let’s examine the function definitions as follows:

BST
delete(ET elem, BST T)
{
    Position tmpCell;

    if (T == NULL)
        fatal("Element not found");
    else if (elem < T->Element)
        T->Left = delete(elem, T->Left);
    else if (elem > T->Element)
        T->Right = delete(elem, T->Right);
    else // we found the element to be deleted
        if (T->Left != NULL && T->Right != NULL) // two children
        {
            tmpCell = findMinBST(T->Right);
            T->Element = tmpCell->Element;
            T->Right = delete(T->Element, T->Right);
        }
        else // one or zero children, change the pointer T to point to the new address (NULL or the child node)
        {
            tmpCell = T;
            // for a leaf node, T will be reset to NULL
            if (T->Right == NULL)
                T = T->Left;
            else if (T->Left == NULL)
                T = T->Right;
            free(tmpCell);
        }
    return T;
}

Position
findMinBST(BST T)
{
    if (T == NULL)
        return NULL;
    else if (T->Left == NULL)
        return T;
    else
        return findMinBST(T->Left);
}

I added some comments in the above code block to help you understand this recursive algorithm. Please go ahead and think it through.

In the next section, we'll have an open discussion about this solution: whether there are other solutions, and what the potential issues of the current one are.

Open discussion

successor vs predecessor

Firstly, in the above solution, we delete a node with two children based on the successor. There is another, symmetric concept called the predecessor:

  • Predecessor: is the node with the maximum value in the left subtree of any node.

So similarly, the alternative solution to delete the node with two children is:

  • Replace the data of the deleted node with the value of the predecessor and recursively delete the predecessor from the left subtree.

We can do this by writing another helper function findMaxBST as follows:

Position
findMaxBST(BST T)
{
    if (T != NULL)
        while (T->Right != NULL)
            T = T->Right;
    return T;
}

Does it work? The answer is yes. In fact, the two approaches are symmetric: just as the successor (the minimum of the right subtree) can never have a left child, the predecessor (the maximum of the left subtree) can never have a right child, so deleting either of them always falls into the simple zero-child or one-child cases, and the cost of either choice is bounded by the height of the subtree. The more interesting difference shows up over many operations, as discussed next.

balanced vs unbalanced

Although the above solution works, it exposes a serious performance issue. The reason people invented binary search trees is to get O(logN) performance for search operations. But imagine what happens if we keep inserting and deleting nodes in the binary search tree in the way described here: the tree gradually becomes unbalanced. The left subtree grows deeper than the right subtree because we always replace a deleted node with its successor (which comes from the right subtree, right?).

The following image is borrowed from Mark's great book mentioned above, which clearly shows that the tree becomes unbalanced after many rounds of insertion and deletion. If you want to understand it in theory, please refer to the book.

For an unbalanced BST, the worst-case time complexity degrades to O(n). To keep the desired performance, people invented more advanced data structures, the self-balancing binary search trees, such as the AVL tree and the red-black tree. I will write about them in coming articles; please keep watching my blog.

Summary

In this article, we examined various solutions to delete a node from the binary search tree and evaluated their performance. We also discussed some open questions about BST, which prove why we need more advanced data structures like the red-black tree.

External Mergesort: part two

Introduction

In the last article, we examined how external mergesort works in theory; now let's implement it.

First, you can find all the source code of this implementation in this GitHub repo. In this repo, I implemented both the two-way and multi-way solutions, which are tracked in different branches; please be aware of this. For the sake of simplicity, I will focus on the generalized multi-way mergesort solution in this article.

Data Preparation

Before diving into the code, let's define the problem we need to solve. I will generate an input file containing several million seven-digit random numbers, from 1,000,000 to 9,999,999. The numbers can be duplicated, and each number is stored on its own line of the input file. The input file can be prepared with the following Bash script, which calls GNU shuf:

#!/usr/bin/bash 

# Author: Chris Bao
# Generate millions of seven digits random integers
# based on shuf utility

shuf -i 1000000-9999999 -n 7777777 > ./input.txt

The generated input file is roughly 60 MB in size. For modern computers, it can be loaded into memory easily. But since we are working on an external-memory algorithm, let's assume we are running it on an old computer that has only 100,000 bytes of memory. Under this constraint, we need to sort the numbers in the input file and save the result in a new file.
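
A quick sanity check of what to expect: with a 100,000-byte buffer and 4 bytes per record, each in-memory chunk holds at most 25,000 numbers, so sorting 7,777,777 numbers should produce ceil(7,777,777 / 25,000) = 312 sorted runs in the sort phase. We will verify this below.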

Implementation

Let’s define some global constants in this algorithm:

#ifndef CONSTANT_H
#define CONSTANT_H

#define MEMORY_LIMIT 100000
#define RECORD_SIZE 4
#define MULTI_WAY_NUMBER 2

#endif

MEMORY_LIMIT denotes the 100,000-byte memory limit. In C, we can use the type unsigned int to store an integer in the range from 1,000,000 to 9,999,999, so RECORD_SIZE means each record (integer) takes up 4 bytes of memory.

And by default, the algorithm will use the two-way merge, but the user can pass an argument to run a multi-way merge too.

Sort phase

The sort phase is implemented inside the separationSort function as follows:

/*
 * Goes through a given file and separates that file into sorted 1MB files using (internal) mergeSort algorithm
 * */
void separationSort(FILE *input) {
    FILE *fp;
    unsigned int *buffer = malloc(sizeof(unsigned int)*(MEMORY_LIMIT/RECORD_SIZE));
    char *line = NULL;
    size_t len = 0;
    ssize_t nread;
    int count = 0;
    printf("Sort phase start.\n");
    while((nread = getline(&line, &len, input)) != -1) {
        if (count < MEMORY_LIMIT/RECORD_SIZE) {
            buffer[count++] = (unsigned int)strtoul(line, NULL, 10);
        } else {
            mergeSort(buffer, count); // sort records

            // output sorted to file
            if (fileNum == 1) { // create the dir
                int status;
                // create tmp directory
                if ((status = mkdir("./tmp", S_IRWXU | S_IRWXU | S_IROTH | S_IXOTH)) == -1) {
                    fprintf(stderr, "Failed to create tmp directory.\n");
                    exit(EXIT_FAILURE);
                }
                // create pass0 directory for sort phase
                if ((status = mkdir("./tmp/pass0", S_IRWXU | S_IRWXU | S_IROTH | S_IXOTH)) == -1) {
                    fprintf(stderr, "Failed to create pass0 directory.\n");
                    exit(EXIT_FAILURE);
                }

            }

            char fileName[20];
            sprintf(fileName, "./tmp/pass0/%d.txt", fileNum);
            if ((fp = fopen(fileName, "w+")) == NULL) {
                fprintf(stderr, "Failed to create file: %s.\n", fileName);
                exit(EXIT_FAILURE);
            }

            outputToFile(buffer, count, fp);

            // Reset memory buffer(zero-out the entire array)
            memset(buffer, 0, sizeof(unsigned int)*(MEMORY_LIMIT/RECORD_SIZE));
            count = 0;
            fileNum++;
            buffer[count++] = (unsigned int)strtoul(line, NULL, 10); // add the current record into new buffer's as first element

        }
    }

    // sort the last and final file
    mergeSort(buffer, count);
    char fileName[20];
    sprintf(fileName, "./tmp/pass0/%d.txt", fileNum);
    if ((fp = fopen(fileName, "w+")) == NULL) {
        fprintf(stderr, "Failed to create file: %s.\n", fileName);
        exit(EXIT_FAILURE);
    }
    outputToFile(buffer, count, fp);

    free(buffer);
    free(line);
    printf("Sort phase done. %d tmp sorted files are produced.\n", fileNum);
}

The logic is not difficult. The function takes the input file stream as a parameter and reads each line (via the getline function) in a loop until it reaches the end of the file. The numbers are read into the memory buffer until the memory limit is hit. When the memory buffer is full (100,000 bytes), the numbers are sorted with the function mergeSort.

The function mergeSort is defined in the file internal_sort.c, which implements the classic in-memory merge sort algorithm. I will not cover it in this article, since you can find many documents about it online; if you don't know it, please spend some time learning about it. Of course, you could replace it with another sorting algorithm like quicksort too. I leave this to the readers.

After sorting, the numbers are saved in temporary files under the directory ./tmp/pass0. The filename is simply the run number.

/*
 * Output sorted records to the given file (of)
 * */
void outputToFile(unsigned int *buffer, int size, FILE *of) {
    int i;
    for (i = 0; i < size; i++) {
        fprintf(of, "%u\n", buffer[i]);
    }
    fclose(of);
}

We can verify the result of the sort phase as follows:

You can see that each file contains up to 25,000 numbers (equal to MEMORY_LIMIT/RECORD_SIZE) and that 312 files are created in pass0.

Note that I will not go into the details of creating directories or opening files for reading and writing; you can study these Linux file I/O concepts on your own.

Merge phase

The exMerge function controls the passes in the merge phase, starting from pass 1.

void exMerge() {
    /* some code omitted ... */
    int pass = 1;
    while (fileNum > 1) {
        exMergeSort(pass, fileNum);
        int remainder = fileNum % ways;
        fileNum = fileNum / ways;
        if (remainder > 0) {
            fileNum++; // the leftover files form one extra run
        }
        pass++;
    }
    /* some code omitted ... */
}

The variable fileNum stores the number of runs in the current pass, and the variable ways denotes the number of ways used in the merge. Thus, the number of runs in the next pass is fileNum / ways, rounded up when there is a remainder. For example, the 312 runs produced by the sort phase shrink to 156, 78, 39, 20, 10, 5, 3, 2 and finally 1 run after nine two-way merge passes, as the small check below shows.
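This tiny standalone program (not part of the repository's code) reproduces just the pass arithmetic, starting from the 312 runs of the sort phase with a two-way merge:

#include <stdio.h>

int main(void) {
    int fileNum = 312; // runs produced by the sort phase
    int ways = 2;      // two-way merge
    int pass = 1;
    while (fileNum > 1) {
        int remainder = fileNum % ways;
        fileNum = fileNum / ways + (remainder > 0 ? 1 : 0);
        printf("after pass %d: %d run(s) left\n", pass, fileNum);
        pass++;
    }
    return 0;
}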

The detailed merging logic lives in the function exMergeSort, which takes two parameters: pass is the current pass number (starting from 1), and nums is the number of runs (or sub-files) from the previous pass that need to be merged.

void exMergeSort(int pass, int nums) {
    /* some code omitted ... */
    int inputFileNum = 0;
    int run = 1;
    for (; inputFileNum < nums;) {

        // create the directory for the current pass
        if (inputFileNum == 0) {
            int status;
            char dirName[20];
            sprintf(dirName, "./tmp/pass%d", pass);
            if ((status = mkdir(dirName, S_IRWXU | S_IROTH | S_IXOTH)) == -1) {
                fprintf(stderr, "Failed to create tmp directory %s.\n", dirName);
                exit(EXIT_FAILURE);
            }
        }
        // open a new output file for each run
        FILE *fm;
        char mergedFileName[20];
        sprintf(mergedFileName, "./tmp/pass%d/%d.txt", pass, run);
        if ((fm = fopen(mergedFileName, "w+")) == NULL) {
            fprintf(stderr, "%s\n", strerror(errno));
            fprintf(stderr, "merged file %s: can't create or open.\n", mergedFileName);
        }
        run++;
        /* some code omitted ... */
    }
}

The above code creates a temp directory for each pass and a temp file for each run.

Next, we create an array of input files for each run; the input files live in the temp directory of the previous pass. Each run merges multiple files inside the for loop. The only tricky part is that the last run may have fewer remaining files than the declared number of ways, and we need to handle that properly (the way_numbers calculation at the top of the code block below).

// Rewind the sorted files from the previous pass; each run merges way_numbers files.
// Merge the sorted files with multiple ways in N runs:
// in the first N - 1 runs, each run merges `ways` files;
// in the last run, merge the remaining files.
int way_numbers = run * ways <= nums ? ways : nums - inputFileNum;
FILE *fiarr[way_numbers];
for (int i = 0; i < way_numbers; i++) {
    char inputFileName[20];
    inputFileNum++; // runs from 0 up to nums
    sprintf(inputFileName, "./tmp/pass%d/%d.txt", pass - 1, inputFileNum);
    if ((fiarr[i] = fopen(inputFileName, "r")) == NULL) {
        fprintf(stderr, "%s\n", strerror(errno));
        fprintf(stderr, "input file %s: can't create or open.\n", inputFileName);
    }
    rewind(fiarr[i]);
}

Next, we read one number from every input file, find the smallest one, and write it to the run's output file, repeating until only one input file still has records. For that last remaining file, remember to copy its remaining numbers into the output file too.

// get and compare records until the files run out of records
char *records[way_numbers];
for (int i = 0; i < way_numbers; i++) {
    records[i] = getRecord(fiarr[i]);
}
// loop until only one file still has records
while (validateRecords(records, way_numbers)) {
    int index = getMinRecordIndex(records, way_numbers);
    fprintf(fm, "%s", records[index]); // write the record to the new merged file
    free(records[index]); // free the memory allocated by getline in getRecord
    records[index] = getRecord(fiarr[index]); // get a new record from the same file
}

// copy the remaining records of the last file into the new file
int lastIndex = getLastRemainRecordIndex(records, way_numbers);
while (records[lastIndex]) {
    fprintf(fm, "%s", records[lastIndex]);
    free(records[lastIndex]);
    records[lastIndex] = getRecord(fiarr[lastIndex]);
}

The above code block relies on several helper functions, getRecord, validateRecords, getMinRecordIndex and getLastRemainRecordIndex, shown below; they are easy to understand.

/*
 * Returns a copy of the next record from the file, or NULL at end of file
 * */
char* getRecord(FILE *ifp) {
    char *line = NULL;
    size_t len = 0;
    if (getline(&line, &len, ifp) != -1) {
        return line;
    }
    free(line); // getline may allocate a buffer even when it fails
    return NULL;
}

/*
 * Validate whether at least two records are non-NULL
 * */
bool validateRecords(char **records, int size) {
    int count = 0;
    for (int i = 0; i < size; i++) {
        if (records[i] != NULL) {
            count++;
        }
    }
    return count >= 2;
}

/*
 * Get the index of the smallest valid record
 * */
int getMinRecordIndex(char **records, int size) {
    int index = 0;
    unsigned long min = ULONG_MAX; // requires <limits.h>
    for (int i = 0; i < size; i++) {
        if (records[i] == NULL) { // skip files that have run out of records
            continue;
        }
        unsigned long value = strtoul(records[i], NULL, 10);
        if (value < min) {
            min = value;
            index = i;
        }
    }
    return index;
}

/*
 * Get the index of the last remaining (non-NULL) record
 * */
int getLastRemainRecordIndex(char **records, int size) {
    for (int i = 0; i < size; i++) {
        if (records[i] != NULL) {
            return i;
        }
    }
    return -1; // no record left
}

For the full details, you can refer to the source code in this GitHub repo. Next, let's evaluate the performance of this algorithm by tuning the number of ways used for merging.

Performance Evaluation

We’ll use the Linux time utility to measure the running time of this algorithm.

The result of the two-way mergesort is:

while the six-way mergesort completes with a shorter runtime.

External Mergesort: part one

Introduction

This article examines an interesting question I came across while reading the book Programming Pearls. The question goes like this: how do you sort a large disk file whose data cannot all fit into main memory? I thought about this question for a while and noticed that the classic sorting algorithms, like quick sort and merge sort, can't solve it directly. Then I did some research, and I will share what I learned in this article. I believe you will be able to solve this problem as well after reading it.

Background of External Algorithm

Traditionally, computer scientists analyze the running time of an algorithm by counting the number of executed instructions, usually expressed as a function of the input size n, as in the well-known Big O notation. This kind of complexity analysis is based on the Random Access Machine (RAM) model, which makes the following assumptions:

  • A machine with an unbounded amount of available memory;
  • Any desired memory location can be accessed in unit time;
  • Every instruction takes the same amount of time.

This model works well when the amount of memory that the algorithm needs to use is smaller than the amount of memory in the computer running the code.

But in the real world, some applications need to process data that is too large to fit into a computer's main memory at once. The algorithms used to solve such problems are called external memory algorithms, or disk-based algorithms, since the input data is stored on an external storage device.

Instead of the RAM model, external memory algorithms are analyzed with the external memory model. The model consists of a CPU with an internal memory of bounded size, connected to an unbounded external memory. One I/O operation moves a block of contiguous elements between external and internal memory (in practice, the kernel's page cache performs this kind of block transfer).

Compared with main memory, external memory is slower to access by several orders of magnitude, even with modern storage such as SSDs. Thus, for external algorithms the performance bottleneck is disk I/O rather than CPU cycles.

In this section, we examined the computational model of the external memory algorithms. Next, let’s review one typical type of external memory algorithm: external sorting.

External Mergesort

Similar to the traditional internal sorting algorithms, several different solutions are available for external sorting. In this article, I will focus on external mergesort.

External mergesort is a straightforward generalization of internal mergesort. The algorithm starts by repeatedly loading M input items into memory (the memory buffer can hold only M items at once), sorting them, and writing them back out to disk. This step divides the input file into some (or many, if the input file is very large) sorted runs of M items each. Then the algorithm repeatedly merges multiple sorted runs until only a single sorted run containing all the input data remains.

Let’s use the following example model to analyze this algorithm. First, assume the memory buffer is limited to one page, and the input file is 8 pages long. External mergesort can be divided into two phases: the sort phase and the merge phase.

Sort phase:

  • Divide the entire input file into 8 groups, each group size is one page(memory buffer capacity).
  • Load each page into the memory, sort the items in the memory buffer(with an internal sorting algorithm), and save the sorted items in a temporary sub-file. Each sub-file is called a run.

At the end of the sort phase, 8 temporary sorted 1-page runs will be created. This step can be marked as pass 0.

Merge phase:

The 8 sorted runs in pass 0 will be merged into a single sorted file with 3 more passes.

  • pass 1: Perform 4 runs for the merge.
    • Run 1: Merge the first two small 1-page runs into a big 2-page run. This merging step goes as follows:
      • Read the first two sorted sub-files (one item from each file).
      • Find the smaller item, write it to the new sub-file, and advance the corresponding input sub-file. Repeat this cycle until both input sub-files are exhausted. This routine’s logic is the same as in the internal mergesort algorithm.
    • Run 2: Merge the next two 1-page runs into a 2-page run.
    • Run 3 and 4: follow the same process.
    • At the end of pass 1, 4 temporary sorted 2-page runs will be created.
  • pass 2: Perform 2 runs for the merge.
    • At the end of pass 2, 2 temporary sorted 4-page runs will be created.
  • pass 3: Perform 1 run for the merge.
    • At the end of pass 3, the final 8-page run containing all the sorted items will be created.

Note: the above process may seem complicated at first sight, but the logic is nearly the same as internal merge sort. The only difference is that internal merging works on memory buffers, while external merging works on disk files and needs to read items from disk into memory.

Since we keep merging two small sub-files into a big one with doubled size, the above algorithm can be called two-way external merge sorting. We can generalize the idea to multi-way external merge sorting, which merges M runs into one.

Next, let’s analyze its complexity. Suppose the input file has N items, each page holds B items, and M denotes the number of ways used in the merge phase. The sort phase produces N/B sorted runs, so the number of passes is log_M(N/B) + 1, where the extra one accounts for the sort phase itself. In each pass, every item is read from and written to disk once, so the total number of item-level disk I/Os is 2N * (log_M(N/B) + 1). For example, in the 8-page example above with M = 2, log_2(8/1) = 3 merge passes plus the sort pass give 4 passes in total, each of which reads and writes all N items once.

Summary

In this article, we examined the abstract computational model of external memory algorithms and analyzed the details of the external mergesort algorithm. In the next article, we will implement the code and evaluate the performance.

Understand userland heap memory allocation: part three - free chunk

Introduction

In the last article, we investigated how the allocated chunks are aligned and stored in the heap. This article continues to examine how to free a chunk of memory and how the freed chunks are stored in the heap.

Hands-on demo

Let’s continue debugging the demo code shown in the last article:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <malloc.h>

int main(int argc, char *argv[]) {
    char *a = (char*)malloc(100);
    strcpy(a, "AAAABBBBCCCCDDDD");
    free(a);
    char *b = (char*)malloc(100);
    free(b);
    return 0;
}

Previously, we allocated a chunk of memory and put data in it. The next line will free this chunk. Before we run the instruction and show the demo result, let’s discuss the theory first.

The freed chunk is not returned to the kernel immediately after free is called. Instead, the heap allocator keeps track of freed chunks in linked list data structures, so they can be reused when the application requests a new allocation. This reduces performance overhead by avoiding frequent system calls. You can observe the reuse directly, as in the small demo below.
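For a quick illustration outside of gdb, this standalone snippet prints the two pointers; on a typical glibc system the second malloc of the same size usually returns the address that was just freed (expected behavior in practice, though not a guarantee of the API):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    char *a = malloc(100);
    printf("first  allocation: %p\n", (void *)a);
    free(a);

    char *b = malloc(100); // same size as the freed chunk
    printf("second allocation: %p\n", (void *)b); // usually the same address on glibc
    free(b);
    return 0;
}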

The allocator could store all the freed chunks together in one long linked list; this would work, but performance would suffer. Instead, glibc maintains a series of free lists called bins, which speed up allocations and frees. We will examine how bins work later.

It is worth noting that each free chunk needs to store pointers to other chunks to form the linked list. That is what we discussed in the last article: the malloc_chunk structure has two pointers, fd and bk. Since the user data region of a freed chunk is no longer needed by the application, the allocator repurposes it to store these pointers.

Based on the above description, the following picture illustrates the exact structure of a freed chunk:

    chunk-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |             Size of previous chunk, if freed                  |
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |             Size of chunk, in bytes                     |A|M|P|
      mem-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |             Pointer to the next freed chunk (fd)              |
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |             Pointer to the previous freed chunk (bk)          |
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            .                                                               .
            .             ......                                            .
nextchunk-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |             Size of chunk                                     |
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Now step over one line in gdb and check chunks in the heap as follows:

You can see the changes: the chunk is now marked as a Free chunk (tcache) and the fd pointer is set, which indicates this freed chunk has been inserted into a linked list.

The tcache is one kind of bin provided by glibc. The pwndbg plugin allows you to check the content of the bins by running the bins command as follows:

Note that the freed chunk (at 0x5555555592a0) is inserted into the tcache bins as the linked list head.

Note that there are five types of bins: small bins, large bins, unsorted bins, fast bins and tcache bins. If you are not familiar with them, don't worry; I will examine them in the following sections.

According to the definition, after the second malloc(100) is called, the allocator should reuse the freed chunk in the bins. The following image can prove this:

The freed chunk at 0x555555559290 is in use again and all bins are empty after the chunk is removed from the linked list. All right!

Recycling memory with bins

Next, I want to spend a little time examining why we need bins and how bins optimize chunk allocation and freeing.

If the allocator kept track of all the freed chunks in one long linked list, it would take O(N) time to find a free chunk of a fitting size by traversing from head to tail, and keeping the chunks ordered by size would require at least O(NlogN) time for sorting. This slowness would hurt the overall performance of programs, and that is why bins are used to optimize the process. In summary, the optimization covers the following two aspects:

  • High-performance data structure
  • Per-thread cache without lock contention

High-performance data structure

Take the small bins and large bins as an example; they are defined as follows:

#define NBINS             128

typedef struct malloc_chunk* mchunkptr;

mchunkptr bins[NBINS * 2 - 2];

They are defined together in one array of linked lists, and each linked list (or bin) holds chunks of a particular size class: each small bin stores chunks of one fixed size, while each large bin covers a range of sizes. bins[2] through bins[63] are the small bins, which track freed chunks smaller than 1024 bytes, while the large bins hold bigger chunks. Small bins and large bins are doubly linked lists, as shown below:

glibc provides helpers to compute the index of the corresponding small (or large) bin from the requested size, and indexing into the array takes O(1) time. Moreover, since each small bin contains chunks of a single size, inserting or removing a chunk also takes O(1) time. As a result, the allocation path is optimized to roughly O(1). A simplified sketch of the small-bin index calculation follows.
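This is a simplified sketch, assuming the 64-bit layout where small-bin chunk sizes advance in 16-byte steps; the real glibc macros (such as smallbin_index) include corrections and configuration options omitted here:

#include <stdio.h>

// Simplified small-bin index: 16-byte spacing on 64-bit,
// so 32 -> bin 2, 48 -> bin 3, ..., 1008 -> bin 63.
static unsigned small_bin_index(unsigned chunk_size) {
    return chunk_size >> 4;
}

int main(void) {
    printf("chunk size 0x70 (112) maps to bin index %u\n", small_bin_index(0x70));
    return 0;
}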

The bins behave like stacks or queues depending on their type: fast bins and tcache bins are LIFO (the most recently freed chunk is reused first), while small and large bins are served in FIFO order. The insert and remove operations can be illustrated as follows:

Moreover, for small bins and large bins, if the neighbors of the chunk being freed are also free, they are merged (coalesced) into a larger chunk. That is why a doubly linked list is needed: it allows fast traversal both forward and backward, so a neighboring chunk can be unlinked from its bin quickly.
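Conceptually, the unlink step is just the standard O(1) removal from a doubly linked list; glibc's real unlink code performs the same relinking plus several integrity checks that this sketch omits:

#include <stddef.h>

struct malloc_chunk {
    size_t prev_size;
    size_t size;
    struct malloc_chunk *fd;
    struct malloc_chunk *bk;
};

// Remove chunk p from its doubly linked bin in O(1):
// simply relink its two neighbors to each other.
static void unlink_chunk(struct malloc_chunk *p) {
    struct malloc_chunk *fwd = p->fd;
    struct malloc_chunk *bck = p->bk;
    fwd->bk = bck;
    bck->fd = fwd;
}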

Unlike small bins and large bins, chunks in fast bins and tcache bins are never merged with their neighbors. In practice, the glibc allocator does not clear the P (PREV_INUSE) flag in the following chunk when such a chunk is freed, so the chunk still looks "in use". This avoids the overhead of merging, and the freed chunk can be reused immediately when a chunk of the same size is requested. Moreover, since fast bin and tcache chunks are never merged, these bins are implemented as singly linked lists.

This can be verified by running the second free call in the demo code and checking the chunks in the heap as follows:

First, the top chunk's size is still 0x20d01 rather than 0x20d00, which indicates the P bit is still 1. Second, the free chunk has only one pointer, fd; if it were in a doubly linked list, both fd and bk would point to valid addresses.

Per-thread cache without lock contention

The letter t in tcache stands for thread; tcache bins are used to optimize the performance of multi-threaded programs. In multi-threaded programming, the most common way to prevent race conditions is to use a lock or mutex, and glibc likewise maintains a lock in the data structure of each heap. But this design comes with a performance cost: lock contention, which happens when one thread tries to acquire a lock held by another thread and is blocked until the lock is released.

tcache bins are per-thread bins. This means that if a thread has a suitable chunk in its tcache, it can serve the allocation without waiting for the heap lock! A simplified view of the per-thread structures is sketched below.
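For reference, the per-thread cache is described by structures along these lines, simplified from older glibc sources; newer releases add fields such as a key used for double-free detection and use wider counters:

#define TCACHE_MAX_BINS 64

// Each freed chunk in a tcache bin just points to the next one (singly linked).
typedef struct tcache_entry {
    struct tcache_entry *next;
} tcache_entry;

// One of these exists per thread: a counter and a list head per bin size.
typedef struct tcache_perthread_struct {
    char counts[TCACHE_MAX_BINS];
    tcache_entry *entries[TCACHE_MAX_BINS];
} tcache_perthread_struct;

// Declared per thread, so tcache lookups need no heap-wide lock.
static __thread tcache_perthread_struct *tcache = NULL;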

Summary

In this article, we examined how the userland heap allocator works by debugging the heap memory with gdb. The discussion is entirely based on the glibc implementation. The design and behavior of the glibc heap allocator are complex but interesting; what we covered here only touches the tip of the iceberg. You can explore more on your own.

Moreover, I plan to write a simple heap allocator for learning and teaching purposes. Please keep watching my blog!

Understand userland heap memory allocation: part two - allocate chunk

Introduction

The previous article gave a general overview of memory management. The story goes on: in this article, let's break into the heap memory to see how it basically works.

Memory allocator

We need to first understand some terminology in the memory management field:

  • mutator: the program that modifies the objects in the heap, which is simply the user application; I will use the term mutator throughout this article.
  • allocator: the mutator doesn't allocate memory by itself; it delegates this generic job to the allocator. At the code level, the allocator is usually implemented as a library, and the detailed allocation behavior is determined by the implementation. In this article I will focus on the memory allocator in glibc.

The relationship between the mutator and allocator is shown in the following diagram:

There is a third component in the memory management field: the garbage collector (GC), which reclaims memory automatically. Since this article is about manual heap memory allocation in system programming, we will ignore GC for now. GC is a very interesting technical challenge; I will examine it in the future. Please keep watching my blog!

Hands-on demo

We will use gdb with pwndbg (a gdb plugin) to break into the heap memory and see how it works. gdb can be extended via Python plugins, and pwndbg is one of the most widely used.

The demo code is as follows:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <malloc.h>

int main(int argc, char *argv[]) {
    char *a = (char*)malloc(100);
    strcpy(a, "AAAABBBBCCCCDDDD");
    free(a);
    char *b = (char*)malloc(100);
    free(b);
    return 0;
}

The demo code above simply allocates some memory, sets its content, and releases it later; then it allocates another chunk of memory. Very simple, right?

First, set a breakpoint at line 7 (the first malloc call) and run the program in gdb. Then run the vmmap command from pwndbg, which shows the process memory layout as follows:

Note that there is no heap segment yet before the first malloc call is made. After stepping over one line in gdb, check the layout again:

Now the heap segment is created with a size of 132 KB (0x21000 bytes). As described above, the kernel maps 132 KB of physical memory into this process's virtual memory and marks that block of physical memory as used, isolating it from other processes. This mapping is done via system calls such as brk, sbrk and mmap; please investigate these system calls yourself.
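For a quick look at this outside of gdb (a small experiment of mine, not part of the article's debugging session), the following snippet prints the program break before and after the first malloc; on glibc the break typically jumps by about 132 KB at the first allocation:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    void *before = sbrk(0);   // current program break
    char *a = malloc(100);    // first allocation triggers heap creation
    void *after = sbrk(0);
    printf("break before: %p\n", before);
    printf("break after : %p\n", after); // typically moved by ~132 KB on glibc
    free(a);
    return 0;
}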

132 KB is much bigger than the 100 bytes passed to malloc. This behavior answers one question raised at the beginning of this article: system calls do not have to be triggered every time malloc is called, a design aimed at reducing performance overhead. The 132 KB of heap memory is now maintained by the allocator, and the next time the application calls malloc, the allocator serves the request from this region, as the snippet below illustrates.
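As another small experiment (not part of the original demo), this snippet allocates several small chunks and prints their addresses; on glibc they typically sit close together inside the same heap segment, showing that no new kernel mapping is requested for each call:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    // Allocate a few small chunks; the allocator carves them
    // out of the heap segment it already obtained from the kernel.
    for (int i = 0; i < 4; i++) {
        void *p = malloc(100);
        printf("allocation %d: %p\n", i, p);
    }
    return 0;
}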

Next, step one more line in gdb to copy the value ("AAAABBBBCCCCDDDD") into the allocated block. Let's check the content of this 132 KB heap segment with the heap command as follows:

There are 3 chunks. Let’s examine these chunks one by one.

The top chunk contains all the remaining memory that has not been allocated yet. In our case, the kernel mapped 132 KB to this process and about 100 bytes were allocated by calling malloc(100), so the rest belongs to the top chunk. The top chunk sits at the border of the heap segment, and it can grow and shrink as the process allocates more memory or releases unused memory.

Then let's look at the chunk with size 0x291. The allocator uses this chunk to store heap management structures; it is not important for our analysis, so just skip it.

What we care about is the chunk in the middle with a size of 0x71. It should be the block we requested and contains the string “AAAABBBBCCCCDDDD”. We can verify this point by checking its content:

gdb's x command displays the memory contents at a given address in a specified format. x/40wx 0x555555559290 prints 40 words (each word is 32 bits) of memory starting from 0x555555559290 in hexadecimal format.

We can see that the string "AAAABBBBCCCCDDDD" is there, so our guess is correct. But why is the size of this chunk 0x71? To understand this, we first need to analyze how the allocator stores a chunk. A chunk of memory is represented by the following structure:

struct malloc_chunk {
    INTERNAL_SIZE_T prev_size;  /* Size of previous chunk (only if free). */
    INTERNAL_SIZE_T size;       /* Size in bytes, including overhead. */
    struct malloc_chunk* fd;    /* double links -- used only if free. */
    struct malloc_chunk* bk;    /* double links -- used only if free. */
};

typedef struct malloc_chunk* mchunkptr;

  • prev_size: the size of the previous chunk, but only when the previous chunk is free; when the previous chunk is in use, this field holds the previous chunk's user data instead.
  • size: the size of the current chunk.
  • fd: pointer to the next free chunk, but only when the current chunk is free; when the current chunk is in use, this field holds user data.
  • bk: pointer to the previous free chunk; behaves the same way as fd.

Based on the above description, the following picture illustrates the exact structure of an allocated chunk:

    chunk-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |             Size of previous chunk, if freed                  |
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |             Size of chunk, in bytes                     |A|M|P|
      mem-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |             User data starts here...                          .
            .                                                               .
            .                                                               .
            .                                                               |
nextchunk-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |             Size of chunk                                     |
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  • chunk: the real starting address of the chunk in the heap memory.
  • mem: the address returned by malloc to the application.

The memory in between is reserved for the metadata mentioned above: prev_size and size. On a 64-bit system, each of them (of type INTERNAL_SIZE_T) is 8 bytes long.

For the size field, it is worth noting:

  • It includes both the size of metadata and the size of the actual user data.
  • It is usually aligned to a multiple of 16 bytes. You can investigate the purpose of memory alignment by yourself.
  • It stores three special flags (A|M|P) in its three least significant bits. We can ignore the first two for now; the last bit, P, indicates whether the previous chunk is in use (1) or free (0).

According to this, let’s review the content of this chunk again:

I added marks on the image to help you understand. Let's do some simple calculations: 100 + 8 = 108, where 100 is the size we requested and 8 is the size of the metadata (the size field). Then 108 is rounded up to 112, a multiple of 16 bytes. Finally, since the special flag P is set to 1, we get 112 + 1 = 113 (0x71), which is why the reported size is 0x71 instead of 0x70. The same arithmetic can be checked with the small snippet below.
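A minimal sketch of this size calculation, assuming the 64-bit layout described above (8-byte size field, 16-byte alignment); it is illustrative arithmetic rather than glibc's exact request2size macro. malloc_usable_size is a real glibc function and on this layout typically reports 104 usable bytes (112 minus the 8-byte size field):

#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>

// Round a request up to a chunk size: add the 8-byte size field,
// then align to 16 bytes (illustrative, not glibc's exact macro).
static size_t chunk_size(size_t request) {
    size_t size = request + 8;
    return (size + 15) & ~(size_t)15;
}

int main(void) {
    size_t sz = chunk_size(100);
    printf("chunk size for malloc(100): %zu (0x%zx), with P bit: 0x%zx\n",
           sz, sz, sz | 1);

    char *a = malloc(100);
    printf("malloc_usable_size: %zu\n", malloc_usable_size(a)); // typically 104 on 64-bit glibc
    free(a);
    return 0;
}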

In this article, we broke into the heap segment and saw how an allocated chunk works. Next, we'll check how a chunk is freed.