Understand userland heap memory allocation: part three - free chunk

Introduction

In the last article, we investigated how the allocated chunks are aligned and stored in the heap. This article continues to examine how to free a chunk of memory and how the freed chunks are stored in the heap.

Hands-on demo

Let’s continue debugging the demo code shown in the last article:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <malloc.h>

int main(int argc, char *argv[]) {
    char *a = (char*)malloc(100);
    strcpy(a, "AAAABBBBCCCCDDDD");
    free(a);
    char *b = (char*)malloc(100);
    free(b);
    return 0;
}

Previously, we allocated a chunk of memory and put data in it. The next line will free this chunk. Before we step over that line and show the result, let's discuss the theory first.

The freed chunk is not returned to the kernel immediately after free is called. Instead, the heap allocator keeps track of the freed chunks in a linked list data structure, so that the freed chunks can be reused when the application requests a new allocation. This reduces performance overhead by avoiding unnecessary system calls.

The allocator could store all the freed chunks together in one long linked list; this would work, but performance would be slow. Instead, glibc maintains a series of lists of freed chunks called bins, which speed up allocations and frees. We will examine how bins work later.

It is worth noting that each free chunk needs to store pointers to other chunks to form the linked list. As we discussed in the last section, there are two pointers in the malloc_chunk structure: fd and bk. Since the user data region of a freed chunk is no longer needed by the application, the allocator repurposes it as the place to store these pointers.
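
To make this concrete, here is a minimal sketch (not the actual glibc code; the mem2chunk name mirrors glibc's internal macro) of how the pointer returned by malloc relates to the chunk header, and how the old user-data area of a freed chunk can be reinterpreted as list pointers:

/* A minimal sketch, assuming a 64-bit system; not the actual glibc code.
 * The chunk header starts 2 * sizeof(size_t) bytes before the pointer
 * that malloc returns, and fd/bk overlay the old user-data area. */
struct malloc_chunk {
    size_t prev_size;            /* only meaningful if the previous chunk is free */
    size_t size;                 /* chunk size plus the A|M|P flag bits */
    struct malloc_chunk *fd;     /* next free chunk (only when this chunk is free) */
    struct malloc_chunk *bk;     /* previous free chunk (only when this chunk is free) */
};

/* Convert the user pointer (mem) back to its chunk header, like glibc's mem2chunk. */
#define mem2chunk(mem) ((struct malloc_chunk *)((char *)(mem) - 2 * sizeof(size_t)))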

Based on the above description, the following picture illustrates the exact structure of a freed chunk:

    chunk-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |             Size of previous chunk, if freed                  |
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |             Size of chunk, in bytes                     |A|M|P|
      mem-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |             pointer to the next freed chunk                   |
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |             pointer to the previous freed chunk               |
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            .                                                               .
            .                            ......                             .
nextchunk-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |             Size of chunk                                     |
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Now step over one line in gdb and check chunks in the heap as follows:

You can see the changes: the allocated chunk is now marked as a Free chunk (tcache) and the pointer fd is set (which indicates this freed chunk has been inserted into a linked list).

The tcache is one kind of bin provided by glibc. The pwndbg plugin allows you to check the content of the bins by running the bins command as follows:

Note that the freed chunk (at 0x5555555592a0) is inserted into the tcache bins as the linked list head.

Note that there are 5 types of bins: small bins, large bins, unsorted bins, fast bins and tcache bins. If you don't know them yet, don't worry; I will examine them in the following sections.

According to the definition, after the second malloc(100) is called, the allocator should reuse the freed chunk in the bins. The following image can prove this:

The freed chunk at 0x555555559290 is in use again and all bins are empty after the chunk is removed from the linked list. All right!

Recycling memory with bins

Next, I want to spend a little bit of time examining why we need bins and how bins optimize chunk allocation and freeing.

If the allocator kept track of all the freed chunks in one long linked list, it would take O(N) time to find a freed chunk of fitting size by traversing from head to tail. If the allocator wanted to keep the chunks in order, at least O(N log N) time would be needed to sort the list by size. This slow process would hurt the overall performance of programs. That's the reason why we need bins to optimize this process. In summary, the optimization covers the following two aspects:

  • High-performance data structure
  • Per-thread cache without lock contention

High-performance data structure

Take the small bins and large bins as an example; they are defined as follows:

#define NBINS             128

typedef struct malloc_chunk* mchunkptr;

mchunkptr bins[NBINS * 2 - 2];

They are defined together in one array of linked lists. Each small bin stores chunks of one fixed size, while each large bin covers a range of sizes. From bins[2] to bins[63] are the small bins, which track freed chunks smaller than 1024 bytes, while the large bins are for bigger chunks. Small bins and large bins are doubly linked lists, as shown below:

glibc provides a function to calculate the index of the corresponding small (or large) bin in the array based on the requested size, and indexing into the array takes O(1) time. Moreover, because each bin contains chunks of the same size, it also takes O(1) time to insert or remove a chunk. As a result, the allocation path is optimized to O(1).
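
As a rough sketch (simplified; the real glibc macros handle more cases and build configurations), the mapping from a chunk size to a small-bin index on a 64-bit system can look like this:

/* Simplified sketch of small-bin indexing on a 64-bit system.
 * Small bins are spaced 16 bytes apart, so the index is just a shift;
 * the real glibc smallbin_index macro has extra corrections. */
#define SMALLBIN_WIDTH        16
#define in_smallbin_range(sz) ((sz) < 1024)
#define smallbin_index(sz)    ((unsigned)(sz) >> 4)   /* e.g. size 112 -> index 7 */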

Bins are LIFO (Last In, First Out) data structures. The insert and remove operations can be illustrated as follows:

Moreover, for small bins and large bins, if the neighbors of the current chunk are free, they are merged into a larger one. That's the reason we need a doubly linked list: it allows a chunk to be unlinked quickly from anywhere in the list, traversing both forward and backward.
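
The following is a simplified version of that unlink step (modeled on glibc's unlink macro, with the integrity checks omitted); it shows why the fd and bk pointers make removing a chunk from the middle of a bin an O(1) operation:

/* Simplified unlink of chunk P from its doubly linked bin
 * (glibc's real unlink macro adds corruption checks). */
static void unlink_chunk(struct malloc_chunk *P)
{
    struct malloc_chunk *FD = P->fd;
    struct malloc_chunk *BK = P->bk;

    FD->bk = BK;   /* the next chunk now points back to P's predecessor */
    BK->fd = FD;   /* the previous chunk now points forward to P's successor */
}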

Unlike small bins and large bins, fast bin and tcache bin chunks are never merged with their neighbors. In practice, the glibc allocator doesn't clear the P flag of the next chunk, so the freed chunk still looks "in use". This avoids the overhead of merging chunks, and the freed chunk can be immediately reused if a chunk of the same size is requested. Moreover, since fast bin and tcache bin chunks are never merged, these bins are implemented as singly linked lists.

This can be verified by running the second free call in the demo code and checking the chunks in the heap as follows:

First, the top chunk's size is still 0x20d01 rather than 0x20d00, which indicates that the P bit is still 1. Second, the free chunk only has one pointer: fd. If it were in a doubly linked list, both fd and bk would point to valid addresses.

Per-thread cache without lock contention

The letter t in tcache stands for thread; tcache bins are used to optimize the performance of multi-threaded programs. In multi-threaded programming, the most common solution to prevent race conditions is using a lock or mutex. Similarly, glibc maintains a lock in the data structure for each heap. But this design comes with a performance cost: lock contention, which happens when one thread attempts to acquire a lock held by another thread and has to wait, doing no useful work.

tcache bins are per-thread bins. This means that if a thread has a chunk in its tcache bins, it can serve the allocation without waiting for the heap lock!
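
For reference, the per-thread structures look roughly like the following (simplified from glibc's malloc.c; the exact field types vary between glibc versions):

#include <stdint.h>

/* Simplified from glibc malloc.c; exact field types differ across versions. */
#define TCACHE_MAX_BINS 64

typedef struct tcache_entry
{
    struct tcache_entry *next;              /* singly linked: only a forward pointer */
} tcache_entry;

typedef struct tcache_perthread_struct
{
    uint16_t counts[TCACHE_MAX_BINS];       /* number of chunks cached in each bin */
    tcache_entry *entries[TCACHE_MAX_BINS];
} tcache_perthread_struct;

/* One instance per thread, so no lock is needed to touch it. */
static __thread tcache_perthread_struct *tcache = NULL;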

Summary

In this article, we examined how the userland heap allocator works by debugging the heap memory with gdb. The discussion is fully based on the glibc implementation. The design and behavior of the glibc heap allocator are complex but interesting; what we covered here only touches the tip of the iceberg. You can explore more by yourself.

Moreover, I plan to write a simple version of a heap allocator for learning and teaching purposes. Please keep watching my blog!

Understand userland heap memory allocation: part two - allocate chunk

Introduction

The previous article gave a general overview of memory management. The story goes on. In this article, let's break into the heap memory to see how it basically works.

Memory allocator

We need to first understand some terminology in the memory management field:

  • mutator: the program that modifies the objects in the heap, which is simply the user application; I will use the term mutator in this article.
  • allocator: the mutator doesn’t allocate memory by itself, it delegates this generic job to the allocator. At the code level, the allocator is generally implemented as a library. The detailed allocation behavior is fully determined by the implementations, in this article I will focus on the memory allocator in the library of glibc.

The relationship between the mutator and allocator is shown in the following diagram:

There is a third component in the memory management field: the garbage collector(GC). GC reclaims memories automatically. Since this article is talking about manual heap memory allocation in system programming, we will ignore GC for now. GC is a very interesting technical challenge, I will examine it in the future. Please keep watching my blog!

Hands-on demo

We will use gdb and pwndbg (a gdb plugin) to break into the heap memory and see how it works. gdb can be extended via Python plugins, and pwndbg is one of the most widely used.

The demo code is as follows:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <malloc.h>

int main(int argc, char *argv[]) {
    char *a = (char*)malloc(100);
    strcpy(a, "AAAABBBBCCCCDDDD");
    free(a);
    char *b = (char*)malloc(100);
    free(b);
    return 0;
}

The demo code above just allocates some memory, sets its content, and releases it later. Then it allocates another chunk of memory. Very simple, right?

First, set a breakpoint at line 7 (the first malloc call) and run the program in gdb. Then run the vmmap command from pwndbg, which prints the process memory layout as follows:

Note that there is no heap segment before the first malloc call is made. After stepping over one line in gdb, check the layout again:

Now the heap segment is created with a size of 132KB (0x21000 in hexadecimal). As described above, the kernel maps 132KB of physical memory into this process's virtual memory and marks that 132KB block of physical memory as used, isolating it from other processes. This mapping is done via system calls like brk, sbrk and mmap. Please investigate these system calls yourself.
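
If you want to observe this yourself without gdb, here is a small user-space experiment (a sketch, assuming glibc on Linux; the exact numbers will vary) that prints the program break before and after the first malloc:

#define _DEFAULT_SOURCE   /* for sbrk() */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    void *before = sbrk(0);   /* current program break */
    void *p = malloc(100);    /* the first allocation creates the heap */
    void *after = sbrk(0);

    printf("break before: %p\n", before);
    printf("break after : %p\n", after);  /* typically moved by ~132KB */
    free(p);
    return 0;
}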

132KB is much bigger than 100B (the size passed to malloc). This behavior answers one of the questions raised at the beginning of this series: system calls aren't necessarily triggered each time malloc is called. This design is aimed at decreasing performance overhead. Now the 132KB of heap memory is maintained by the allocator; the next time the application calls malloc, the allocator will serve the request from it.

Next, step one more line in gdb to assign the value ("AAAABBBBCCCCDDDD") to the allocated block. Let's check the content of this 132KB heap segment with the heap command as follows:

There are 3 chunks. Let’s examine these chunks one by one.

The top chunk contains all the remaining memory which has not been allocated yet. In our case, the kernel maps 132KB of physical memory to this process, and 100B is allocated by calling malloc(100), so the rest is in the top chunk. The top chunk stays at the border of the heap segment, and it can grow and shrink as the process allocates more memory or releases unused memory.

Then let's look at the chunk with a size of 0x291. The allocator uses this chunk to store heap management structures. It is not important for our analysis, so just skip it.

What we care about is the chunk in the middle with a size of 0x71. It should be the block we requested and contains the string “AAAABBBBCCCCDDDD”. We can verify this point by checking its content:

gdb's x command displays the memory contents at a given address using the specified format. x/40wx 0x555555559290 prints 40 words (each word is 32 bits) of memory starting from 0x555555559290 in hexadecimal format.

We can see that the string "AAAABBBBCCCCDDDD" is there, so our guess is correct. But the question is why the size of this chunk is 0x71. To understand this, we first need to analyze how the allocator stores a chunk. A chunk of memory is represented by the following structure:

struct malloc_chunk {
    INTERNAL_SIZE_T      prev_size;  /* Size of previous chunk (only if free). */
    INTERNAL_SIZE_T      size;       /* Size in bytes, including overhead. */
    struct malloc_chunk* fd;         /* double links -- used only if free. */
    struct malloc_chunk* bk;         /* double links -- used only if free. */
};

typedef struct malloc_chunk* mchunkptr;

  • prev_size: the size of the previous chunk, but only when the previous chunk is free; when the previous chunk is in use, this field holds the previous chunk's user data instead.
  • size: the size of the current chunk.
  • fd: a pointer to the next free chunk, but only when the current chunk is free; when the current chunk is in use, this field holds user data.
  • bk: a pointer to the previous free chunk; it behaves in the same way as fd.

Based on the above description, the following picture illustrates the exact structure of an allocated chunk:

    chunk-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |             Size of previous chunk, if freed                  |
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |             Size of chunk, in bytes                     |A|M|P|
      mem-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |             User data starts here...                          .
            .                                                               .
            .                                                               .
            .                                                               |
nextchunk-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |             Size of chunk                                     |
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  • chunk: indicates the real starting address of the object in the heap memory.
  • mem: indicates the address returned by malloc.

The memory in between is reserved for the metadata mentioned above: prev_size and size. On a 64-bit system they're of type INTERNAL_SIZE_T, which is 8 bytes long.

For the size field, it is worth noting:

  • It includes both the size of metadata and the size of the actual user data.
  • It is usually aligned to a multiple of 16 bytes. You can investigate the purpose of memory alignment by yourself.
  • It contains three special flags (A|M|P) in the three least significant bits. We can ignore the other two bits for now, but the last bit indicates whether the previous chunk is in use (set to 1) or not (set to 0).

According to this, let’s review the content of this chunk again:

I added marks on the image to help you understand. Let's do some simple calculations: 100 + 8 = 108, where 100 is the size of memory we requested and 8 is the size of the metadata (the size field). Then 108 is aligned up to 112, a multiple of 16 bytes. Finally, since the special flag P is set to 1, we get 112 + 1 = 113 (0x71). That's the reason why the size is 0x71 instead of 0x70.
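
The same arithmetic can be written out as a small sketch (modeled on glibc's request2size and chunksize macros, simplified and assuming a 64-bit system):

#include <stdio.h>

/* Simplified versions of glibc's request2size and chunksize macros
 * for a 64-bit system (SIZE_SZ = 8, alignment = 16); the real macros
 * also enforce a minimum chunk size. */
#define SIZE_SZ           8
#define MALLOC_ALIGN_MASK 15
#define PREV_INUSE        0x1

#define request2size(req) (((req) + SIZE_SZ + MALLOC_ALIGN_MASK) & ~MALLOC_ALIGN_MASK)
#define chunksize(sz)     ((sz) & ~0x7)   /* strip the A|M|P flag bits */

int main(void) {
    size_t field = request2size(100) | PREV_INUSE;  /* what the size field stores */
    printf("size field = 0x%zx\n", field);          /* prints 0x71 */
    printf("real size  = %zu\n", chunksize(field)); /* prints 112 */
    return 0;
}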

In this article, we broke into the heap segment and saw how an allocated chunk works. Next, we'll check how to free a chunk.

Understand userland heap memory allocation: part one - overview

Introduction

In my eyes, compared with developing applications in high-level programming languages, one of the biggest differences in system programming with low-level languages like C and C++ is that you have to manage the memory by yourself. You call APIs like malloc and free to allocate memory based on your needs and release it when it is no longer needed. Manual memory management is not only one of the most frequent causes of bugs in system programming; it can also lead to many security issues.

It's not difficult to understand the correct usage of APIs like malloc and free. But have you ever wondered how they work? For example:

  • When you call malloc, does it trigger system calls and delegate the task to the kernel, or are there some other mechanisms?
  • When you call malloc(10) and try to allocate 10 bytes of heap memory, how many bytes of memory do you get? 10 bytes or more?
  • When the memory is allocated, where exactly are the heap objects located?
  • When you call free, is the memory directly returned to the kernel?

This article will try to answer these questions.

Note that memory is a super complex topic, so I can't cover everything about it in one article (in fact, what is covered here is very limited). This article will focus on userland (heap) memory allocation.

Process memory management overview

Process virtual memory

Every time we start a program, a memory area for that program is reserved, and that’s process virtual memory as shown in the following image:

You can note that each process has one invisible memory segment containing kernel code and data structures. This invisible memory segment is important, since it's directly related to virtual memory, which is employed by the kernel for memory management. Before we dive into the other segments, let's understand virtual memory first.

Virtual memory technique

Why do we need virtual memory? Virtual memory is a service provided by the kernel in the form of abstraction. Without virtual memory, applications would need to manage their physical memory space, coordinating with every other process running on the computer. Virtual memory leaves that management to the kernel by creating the maps that allow translation between virtual and physical memory. The kernel creates an illusion that each process occupies the entire physical memory space. We can also realize process isolation based on virtual memory to enhance security.

Virtual memory is out of this article’s scope, if you’re interested, please take a look at the core techniques: paging and swapping.

Static vs Dynamic memory allocation

Next, let's take a closer look at the process memory layout above and understand where these segments come from. Generally speaking, there are two ways memory can be allocated for storing data: static and dynamic. Static memory allocation happens at compile time, while dynamic memory allocation occurs at runtime.

When a program starts, the executable file (on Linux, an ELF file) is loaded into memory as a process image. The ELF file contains the following segments:

  • .TEXT: contains the executable part of the program with all the machine code.
  • .DATA: contains initialized static and global variables.
  • .BSS: short for block started by symbol, contains uninitialized static and global variables.

The ELF file is loaded by the kernel to create a process image, and this static data is mapped into the corresponding segments of the virtual memory. The ELF loader is also an interesting topic; I will write another article about it in the future. Please keep watching my blog!

The memory-mapped region segment is used for storing the shared libraries.

Finally, the stack and heap segments are produced dynamically at runtime; they are used to store and operate on temporary data during the execution of the program. I previously wrote an article about the stack; please refer to it if you want to know the details.

The only remaining segment we didn’t mention yet is the heap, which is this article’s focus!

You can check the memory layout of one process by examining this file /proc/{pid}/maps as below:

Note that the above investigation doesn't consider multiple threads. The memory layout of a multi-threaded process is more complex; please refer to other online documents.

In this section, we had a rough overview of memory management from top to bottom. Hope you can see the big picture and know where we are. Next, let’s dig into the heap segment and see how it works.

cPacketSniffer

Background

In this post, I want to introduce my new project: cPacketSniffer, which I worked on for the past two months. I finally finished it and feel very proud to share it here!

Motivation

Simply speaking, I want to sharpen my skills in network programming and Linux system programming. Both of these topics lead you toward the bottom of computers and software. Feynman said "There is plenty of room at the bottom"; I think this insight applies to software as well.

Acknowledgement

I was very lucky to come across the site "Network programming in Linux", which develops a network packet capturing tool in C++. After confirming that the documents and source code on this site are complete and clear, I decided to rewrite it in C. That's the starting point of my project cPacketSniffer.

Features

As a network packets sniffer, cPacketSniffer provides the following features:

  • Integrate with libpcap to support: filtering captured packets, capturing packets offline, capturing packets on specific devices and capturing packets in promiscuous mode.
  • Analyze network packets at the low layers of the TCP/IP stack, including Ethernet, ARP, ICMP, IP (IPv4), TCP, UDP, etc. Also one protocol in the application layer: TFTP.
  • Detect network security attacks:
    • ARP spoofing detection.
    • Ping flood detection.
  • Analyze and track network traffic:
    • TCP session tracking and traffic analysis.
    • TFTP session tracking and traffic analysis.

The following images demonstrate some typical usages of cPacketSniffer:

Packet Analysis:

ARP Spoofing Detection:

PING Flood Detection:

TCP Session Tracking:

Besides the above network programming-related functionalities, it also covers the following points:

  • Develop a generic data structure in C.
  • Error handling in C.
  • Data encapsulation (object-oriented style programming) in C.
  • Manual memory management in C.
  • etc.

This article will not cover these points in detail, I will write articles on these topics separately in the future. Please keep watching my blog!

Future work

Now cPacketSniffer works as a network packet sniffer based on this design. Moreover, it can also serve as a testbed for experimental features. As the next step, I plan to try the following ideas:

  • Implement the network intrusion detection function.
  • Improve the performance with advanced data structures, like binary search trees.
  • Memory and cache performance tuning.
  • Automatic memory management by Garbage Collection.
  • Integrate ncurses for Text-based user interface.

Write a Linux firewall from scratch based on Netfilter: part three - Netfilter module

Background

In the previous article, we examined how to write a Kernel module and load it dynamically into a running Linux system. Based on this understanding, let’s continue our journey to write a Netfilter module as our mini-firewall.

Netfilter architecture.

Basics of Netfilter hooks

The Netfilter framework provides a bunch of hooks in the Linux kernel. As network packets pass through the protocol stack in the kernel, they will traverse these hooks as well. And Netfilter allows you to write modules and register callback functions with these hooks. When the hooks are triggered, the callback functions will be called. This is the basic idea behind Netfilter architecture. Not difficult to understand, right?

Currently, Netfilter provides the following 5 hooks for IPv4:

  • NF_INET_PRE_ROUTING: is triggered right after the packet has been received on a network card, before the routing decision is made. The kernel then determines whether this packet is destined for the current host or not; based on that, one of the following two hooks will be triggered.
  • NF_INET_LOCAL_IN: is triggered for network packets that are destined for the current host.
  • NF_INET_FORWARD: is triggered for network packets that should be forwarded.
  • NF_INET_POST_ROUTING: is triggered for network packets that have been routed and before being sent out to the network card.
  • NF_INET_LOCAL_OUT: is triggered for network packets generated by the processes on the current host.

The hook function you define in the module can mangle or filter the packets, but it eventually must return a status code to Netfilter. There are several possible values for the code, but for now, you only need to understand two of them:

  • NF_ACCEPT: this means the hook function accepts the packet and it can go on the network stack trip.
  • NF_DROP: this means the packet is dropped and no further parts of the network stack will be traversed.

Netfilter allows you to register multiple callback functions on the same hook with different priorities. If the first hook function accepts the packet, the packet is passed on to the functions with lower priority. If the packet is dropped by one callback function, the remaining functions (if any) will not be traversed.

As you can see, Netfilter has a big scope and I can't cover every detail in these articles. The mini-firewall developed here will work on the NF_INET_PRE_ROUTING hook, which means it controls inbound network traffic. But the way of registering the hook and handling the packet applies to all other hooks as well.

Note: there is another question worth mentioning: what's the difference between Netfilter and eBPF? If you don't know eBPF, please refer to my previous article. Both of them are important network features in the Linux kernel. The important thing is that Netfilter and eBPF hooks are located in different layers of the kernel. As I drew in the above diagram, eBPF is located in a lower layer.

Kernel code of Netfilter hooks

To have a clear understanding of how the Netfilter framework is implemented inside the protocol stack, let's dig a little bit deeper and take a look at the kernel source code (don't worry, we will only look at a few simple functions). Let's use the hook NF_INET_PRE_ROUTING as an example, since the mini-firewall will be written based on it.

When an IPv4 packet is received, its handler function ip_rcv will be called as follows:

//In source code file /kernel-src/net/ipv4/ip_input.c
/*
 * IP receive entry point
 */
int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
           struct net_device *orig_dev)
{
    struct net *net = dev_net(dev);

    skb = ip_rcv_core(skb, net);
    if (skb == NULL)
        return NET_RX_DROP;
    // run Netfilter NF_INET_PRE_ROUTING hook's callback function
    return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING,
                   net, NULL, skb, dev, NULL,
                   ip_rcv_finish);
}

In this handler function, you can see the hook is passed to the function NF_HOOK. Based on the name NF_HOOK, you can guess that it is for triggering the Netfilter hooks. Right? Let’s continue to examine how NF_HOOK is implemented as follows:

//In source code file /kernel-src/include/linux/netfilter.h
static inline int
NF_HOOK(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk, struct sk_buff *skb,
        struct net_device *in, struct net_device *out,
        int (*okfn)(struct net *, struct sock *, struct sk_buff *))
{
    int ret = nf_hook(pf, hook, net, sk, skb, in, out, okfn);
    if (ret == 1)
        ret = okfn(net, sk, skb); // in our case: okfn is ip_rcv_finish
    return ret;
}

/**
 * nf_hook - call a netfilter hook
 *
 * Returns 1 if the hook has allowed the packet to pass. The function
 * okfn must be invoked by the caller in this case. Any other return
 * value indicates the packet has been consumed by the hook.
 */
static inline int nf_hook(u_int8_t pf, unsigned int hook, struct net *net,
                          struct sock *sk, struct sk_buff *skb,
                          struct net_device *indev, struct net_device *outdev,
                          int (*okfn)(struct net *, struct sock *, struct sk_buff *))
{
    // code omitted ...
}

The function NF_HOOK contains two steps:

  • First, it runs the hook's callback functions by calling the underlying function nf_hook.
  • Second, it invokes the function okfn (passed to NF_HOOK as an argument) if the packet passes through the hook functions and isn't dropped.

For the hook NF_INET_PRE_ROUTING, the function ip_rcv_finish will be invoked after the hook functions pass. Its job is to pass the packet on to the next stage of the protocol stack (eventually reaching the TCP or UDP handler) to continue its journey!

The other 4 hooks all use the same function NF_HOOK to trigger the callback functions. The following table shows where the hooks are embedded in the kernel; I leave exploring them to the readers.

Hook                    File                                Function
NF_INET_PRE_ROUTING     /kernel-src/net/ipv4/ip_input.c     ip_rcv()
NF_INET_LOCAL_IN        /kernel-src/net/ipv4/ip_input.c     ip_local_deliver()
NF_INET_FORWARD         /kernel-src/net/ipv4/ip_forward.c   ip_forward()
NF_INET_POST_ROUTING    /kernel-src/net/ipv4/ip_output.c    ip_build_and_send_pkt()
NF_INET_LOCAL_OUT       /kernel-src/net/ipv4/ip_output.c    ip_output()

Next, let's review Netfilter's APIs to create and register a hook function.

Netfilter API

It’s straightforward to create a Netfilter module, which involves three steps:

  • Define the hook function.
  • Register the hook function in the kernel module initialization process.
  • Unregister the hook function in the kernel module clean-up process.

Let’s go through them quickly one by one.

Define a hook function

The hook function name can be whatever you want, but it must follow the signature below:

//In source code file /kernel-src/include/linux/netfilter.h
typedef unsigned int nf_hookfn(void *priv,
                               struct sk_buff *skb,
                               const struct nf_hook_state *state);

The hook function can mangle or filter the packet whose data is stored in the sk_buff structure (we can ignore the other two parameters, since we don't use them in our mini-firewall). As we mentioned above, the callback function must return a Netfilter status code, which is an integer. For instance, the accepted and dropped statuses are defined as follows:

// In source code file /kernel-src/include/uapi/linux/netfilter.h
/* Responses from hook functions. */
#define NF_DROP 0
#define NF_ACCEPT 1

Register and unregister a hook function

To register a hook function, we wrap the defined hook function, together with related information such as which hook to bind to, the protocol family and the priority, into a struct nf_hook_ops and pass it to the function nf_register_net_hook.

//In source code file /kernel-src/include/linux/netfilter.h
struct nf_hook_ops {
    /* User fills in from here down. */
    nf_hookfn          *hook;      // callback function
    struct net_device  *dev;       // network device interface
    void               *priv;
    u_int8_t           pf;         // protocol
    unsigned int       hooknum;    // Netfilter hook enum
    /* Hooks are ordered in ascending priority. */
    int                priority;   // priority of callback function
};

Most of the fields are straightforward to understand. The one that needs emphasis is the field hooknum, which is just one of the Netfilter hooks discussed above. They are defined as an enum as follows:

// In source code file /kernel-src/include/uapi/linux/netfilter.h
enum nf_inet_hooks {
    NF_INET_PRE_ROUTING,
    NF_INET_LOCAL_IN,
    NF_INET_FORWARD,
    NF_INET_LOCAL_OUT,
    NF_INET_POST_ROUTING,
    NF_INET_NUMHOOKS,
    NF_INET_INGRESS = NF_INET_NUMHOOKS,
};

Next, let's take a look at the functions to register and unregister hook functions:

//In source code file /kernel-src/include/linux/netfilter.h
/* Function to register/unregister hook points. */
int nf_register_net_hook(struct net *net, const struct nf_hook_ops *ops);
void nf_unregister_net_hook(struct net *net, const struct nf_hook_ops *ops);

The first parameter, struct net, is related to the network namespace; we can ignore it for now and use the default value.

Next, let’s implement our mini-firewall based on these APIs. All right?

Implement mini-firewall

First, we need to clarify the requirements for our mini-firewall. We’ll implement two network traffic control rules in the mini-firewall as follows:

  • Network protocol rule: drops the ICMP protocol packets.
  • IP address rule: drops the packets from one specific IP address.

The complete code implementation is in this GitHub repo.

Drop ICMP protocol packets

ICMP is a network protocol widely used in the real world. Popular diagnostic tools like ping and traceroute rely on the ICMP protocol. We can filter out ICMP packets based on the protocol type in the IP header with the following hook function:

// In mini-firewall.c
static unsigned int nf_blockicmppkt_handler(void *priv, struct sk_buff *skb, const struct nf_hook_state *state)
{
    struct iphdr *iph;   // IP header
    struct udphdr *udph; // UDP header

    if (!skb)
        return NF_ACCEPT;

    iph = ip_hdr(skb); // retrieve the IP header from the packet
    if (iph->protocol == IPPROTO_UDP) {
        udph = udp_hdr(skb);
        if (ntohs(udph->dest) == 53) {
            return NF_ACCEPT; // accept UDP packet (DNS)
        }
    }
    else if (iph->protocol == IPPROTO_TCP) {
        return NF_ACCEPT; // accept TCP packet
    }
    else if (iph->protocol == IPPROTO_ICMP) {
        printk(KERN_INFO "Drop ICMP packet \n");
        return NF_DROP; // drop ICMP packet
    }
    return NF_ACCEPT;
}

The logic in the above hook function is easy to understand. First, we retrieve the IP header from the network packet. Then, according to the protocol type field in the header, we decide to accept TCP and UDP packets but drop ICMP packets. The only technique we need to pay attention to is the function ip_hdr, which is a kernel function defined as follows:

//In source code file /kernel-src/include/linux/ip.h
static inline struct iphdr *ip_hdr(const struct sk_buff *skb)
{
    return (struct iphdr *)skb_network_header(skb);
}

// In source code file /kernel-src/include/linux/skbuff.h
static inline unsigned char *skb_network_header(const struct sk_buff *skb)
{
    return skb->head + skb->network_header;
}

The function ip_hdr delegates the task to the function skb_network_header, which locates the IP header based on the following two fields:

  • head: is the pointer to the packet;
  • network_header: is the offset from the start of the packet data to the network layer protocol header. For details, you can refer to this document.

Next, we can register the above hook function as follows:

// In mini-firewall.c
static struct nf_hook_ops *nf_blockicmppkt_ops = NULL;

static int __init nf_minifirewall_init(void) {
    nf_blockicmppkt_ops = (struct nf_hook_ops*)kcalloc(1, sizeof(struct nf_hook_ops), GFP_KERNEL);
    if (nf_blockicmppkt_ops != NULL) {
        nf_blockicmppkt_ops->hook = (nf_hookfn*)nf_blockicmppkt_handler;
        nf_blockicmppkt_ops->hooknum = NF_INET_PRE_ROUTING;
        nf_blockicmppkt_ops->pf = NFPROTO_IPV4;
        nf_blockicmppkt_ops->priority = NF_IP_PRI_FIRST; // set the priority

        nf_register_net_hook(&init_net, nf_blockicmppkt_ops);
    }
    return 0;
}

static void __exit nf_minifirewall_exit(void) {
    if (nf_blockicmppkt_ops != NULL) {
        nf_unregister_net_hook(&init_net, nf_blockicmppkt_ops);
        kfree(nf_blockicmppkt_ops);
    }
    printk(KERN_INFO "Exit");
}

module_init(nf_minifirewall_init);
module_exit(nf_minifirewall_exit);

The above logic is self-explanatory, so I will not spend too much time on it here.

Next, it’s time to demo how our mini-firewall works.

Demo time

Before we load the mini-firewall module, the ping command can work as expected:

chrisbao@CN0005DOU18129:~$ lsmod | grep mini_firewall
chrisbao@CN0005DOU18129:~$ ping www.google.com
PING www.google.com (142.250.4.103) 56(84) bytes of data.
64 bytes from sm-in-f103.1e100.net (142.250.4.103): icmp_seq=1 ttl=104 time=71.9 ms
64 bytes from sm-in-f103.1e100.net (142.250.4.103): icmp_seq=2 ttl=104 time=71.8 ms
64 bytes from sm-in-f103.1e100.net (142.250.4.103): icmp_seq=3 ttl=104 time=71.9 ms
64 bytes from sm-in-f103.1e100.net (142.250.4.103): icmp_seq=4 ttl=104 time=71.8 ms
^C
--- www.google.com ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3005ms
rtt min/avg/max/mdev = 71.857/71.902/71.961/0.193 ms

In contrast, after the mini-firewall module is built and loaded (based on the commands we discussed previously):

chrisbao@CN0005DOU18129:~$ lsmod | grep mini_firewall
mini_firewall 16384 0
chrisbao@CN0005DOU18129:~$ ping www.google.com
PING www.google.com (142.250.4.105) 56(84) bytes of data.
^C
--- www.google.com ping statistics ---
6 packets transmitted, 0 received, 100% packet loss, time 5097ms

You can see that all the packets are lost, because they are dropped by our mini-firewall. We can verify this by running the command dmesg:

chrisbao@CN0005DOU18129:~$ dmesg | tail -n 5
[ 1260.184712] Drop ICMP packet
[ 1261.208637] Drop ICMP packet
[ 1262.232669] Drop ICMP packet
[ 1263.256757] Drop ICMP packet
[ 1264.280733] Drop ICMP packet

But packets of other protocols can still pass through the firewall. For instance, the command wget 142.250.4.103 returns normally as follows:

chrisbao@CN0005DOU18129:~$ wget 142.250.4.103
--2022-06-25 10:12:39-- http://142.250.4.103/
Connecting to 142.250.4.103:80... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: http://142.250.4.103:6080/php/urlblock.php?args=AAAAfQAAABAjFEC0HSM7xhfO~a53FMMaAAAAEILI_eaKvZQ2xBfgKEgDtwsAAABNAAAATRPNhqoqFgHJ0ggbKLKcdinR4UvnlhgAR4~YyrY4tAnroOFkE_IsHsOg9~RFPc7nEoj6YdiDgqZImAmb_xw9ZuFLvF91P2HzP5tlu1WX&url=http://142.250.4.103%2f [following]
--2022-06-25 10:12:39-- http://142.250.4.103:6080/php/urlblock.php?args=AAAAfQAAABAjFEC0HSM7xhfO~a53FMMaAAAAEILI_eaKvZQ2xBfgKEgDtwsAAABNAAAATRPNhqoqFgHJ0ggbKLKcdinR4UvnlhgAR4~YyrY4tAnroOFkE_IsHsOg9~RFPc7nEoj6YdiDgqZImAmb_xw9ZuFLvF91P2HzP5tlu1WX&url=http://142.250.4.103%2f
Connecting to 142.250.4.103:6080... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3248 (3.2K) [text/html]
Saving to: ‘index.html’

index.html 100%[===================================================================================================================>] 3.17K --.-KB/s in 0s

2022-06-25 10:12:39 (332 MB/s) - ‘index.html’ saved [3248/3248]

Next, let’s try to ban the traffic from this IP address.

Drop packets sourced from one specific IP address

As we mentioned above, multiple callback functions are allowed to be registered on the same Netfilter hook, so we will define a second hook function with a different priority. The logic of this hook function goes like this: we get the source IP address from the IP header and make the drop or accept decision according to it. The code goes as follows:

// In mini-firewall.c
#define IPADDRESS(addr) \
    ((unsigned char *)&addr)[3], \
    ((unsigned char *)&addr)[2], \
    ((unsigned char *)&addr)[1], \
    ((unsigned char *)&addr)[0]

static char *ip_addr_rule = "142.250.4.103";

static unsigned int nf_blockipaddr_handler(void *priv, struct sk_buff *skb, const struct nf_hook_state *state)
{
    if (!skb) {
        return NF_ACCEPT;
    } else {
        char *str = (char *)kmalloc(16, GFP_KERNEL);
        u32 sip;
        struct sk_buff *sb = NULL;
        struct iphdr *iph;
        unsigned int verdict;

        sb = skb;
        iph = ip_hdr(sb);
        sip = ntohl(iph->saddr); // get source ip address

        sprintf(str, "%u.%u.%u.%u", IPADDRESS(sip)); // convert to standard IP address format
        if (!strcmp(str, ip_addr_rule)) {
            printk(KERN_INFO "Drop packet from %s\n", str); // produces the dmesg output shown below
            verdict = NF_DROP;
        } else {
            verdict = NF_ACCEPT;
        }
        kfree(str); // release the temporary buffer to avoid a memory leak
        return verdict;
    }
}

This hook function uses two interesting techniques:

  • ntohl: is a kernel function used to convert a value from network byte order to host byte order. Byte order is related to the concept of endianness, which defines the order of the bytes of a word of digital data in computer memory. A big-endian system stores the most significant byte of a word at the smallest memory address, while a little-endian system stores the least significant byte at the smallest address. Network protocols use big-endian order, but different OSes and platforms use different endianness, so such a conversion may be needed depending on the host machine.

  • IPADDRESS: is a macro which generates the standard IP address format (four 8-bit fields separated by periods) from a 32-bit integer; a small worked example follows this list. It uses the equivalence of arrays and pointers in C. I will write another article to examine what it is and how it works. Please keep watching my updates!
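
As a quick illustration of how these two techniques work together (hypothetical values; this assumes a little-endian host such as x86-64):

/* Hypothetical illustration, assuming a little-endian host (e.g. x86-64).
 * 142.250.4.103 becomes the host-order value 0x8EFA0467 after ntohl(),
 * stored in memory as the bytes 67 04 FA 8E, so indexing from [3] down
 * to [0] recovers the dotted-decimal order. */
u32 sip = 0x8EFA0467;                              /* ntohl(iph->saddr) for 142.250.4.103 */
printk(KERN_INFO "%u.%u.%u.%u\n", IPADDRESS(sip)); /* prints 142.250.4.103 */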

Next, we can register this hook function in the same way as discussed above. The only notable point is that this callback function should have a different priority, as follows:

static int __init nf_minifirewall_init(void) {
    <-omit code->
    nf_blockipaddr_ops = (struct nf_hook_ops*)kcalloc(1, sizeof(struct nf_hook_ops), GFP_KERNEL);
    if (nf_blockipaddr_ops != NULL) {
        nf_blockipaddr_ops->hook = (nf_hookfn*)nf_blockipaddr_handler;
        nf_blockipaddr_ops->hooknum = NF_INET_PRE_ROUTING; // register to the same hook
        nf_blockipaddr_ops->pf = NFPROTO_IPV4;
        nf_blockipaddr_ops->priority = NF_IP_PRI_FIRST + 1; // set a different priority (runs after the first hook function)

        nf_register_net_hook(&init_net, nf_blockipaddr_ops);
    }
    <-omit code->
}

Let’s see how it works with a demo.

Demo time

After rebuilding and reloading the module, we get:

chrisbao@CN0005DOU18129:~$ wget 142.250.4.103
--2022-06-25 10:20:07-- http://142.250.4.103/
Connecting to 142.250.4.103:80... failed: Connection timed out.
Retrying.

The wget 142.250.4.103 command can't get a response, because its packets are dropped by our mini-firewall. Great!

chrisbao@CN0005DOU18129:~$ dmesg | tail -n 5
[ 3162.064284] Drop packet from 142.250.4.103
[ 3166.089466] Drop packet from 142.250.4.103
[ 3166.288603] Drop packet from 142.250.4.103
[ 3174.345463] Drop packet from 142.250.4.103
[ 3174.480123] Drop packet from 142.250.4.103

More space to expand

You can find the full code implementation here. But I have to say, our mini-firewall only scratches the surface of what Netfilter can provide. You can keep expanding its functionality. For example, the rules are currently hardcoded; why not make it possible to configure them dynamically? There are many cool ideas worth trying. I leave them to the readers.

Summary

In this article, we implemented the mini-firewall step by step and examined many detailed techniques. We not only wrote the code, but also verified the behavior of the mini-firewall by running real demos.

Write a Linux firewall from scratch based on Netfilter: part two - hello world module

Background

In the last article, we examined the basics of Netfilter and Linux kernel modules in theory. Starting from this article, we will get our hands dirty and start implementing our mini-firewall, walking through the whole process step by step. In this article, let's write our first Linux kernel module using a simple hello world demo. Then let's learn how to build the module (which is very different from compiling an application in user space) and how to load it into the kernel. After understanding how to write a module, in the next article we'll write the initial version of our mini-firewall module using Netfilter's hook architecture. All right, let's start the journey.

Make the first Kernel module

First, I have to admit that Linux kernel module development is a large and complex topic, and there are many great online resources about it. This series of articles focuses on developing the mini-firewall based on Netfilter, so we can't cover all aspects of the kernel module itself. In future articles, I'll examine kernel modules in more depth.

Write the module

You can write the hello world Kernel module with a single C source code file hello.c as follows:

#include <linux/init.h>   /* Needed for the macros */
#include <linux/kernel.h>
#include <linux/module.h> /* Needed by all modules */

static int __init hello_init(void)
{
    printk(KERN_INFO "Hello, world\n");
    return 0;
}

static void __exit hello_exit(void)
{
    printk(KERN_INFO "Goodbye, world\n");
}

module_init(hello_init);
module_exit(hello_exit);

MODULE_LICENSE("GPL");

We can write a Kernel module in such an easy and simple way because the Linux Kernel does the magic for you. Remember the design philosophy of Linux(Unix): Design for simplicity; add complexity only where you must.

Let's examine several technical points worth remarking on:

First, kernel modules must have at least two functions: a "start" function which is called when the module is loaded into the kernel, and an "end" function which is called just before it is removed from the kernel. Before kernel 2.3.13, the names of these two functions were hardcoded as init_module() and cleanup_module(). In newer versions, you can use whatever names you like for the start and end functions of a module by using the module_init and module_exit macros. These macros are defined in include/linux/module.h and include/linux/init.h; you can refer to them for detailed information.

Typically, module_init either registers a handler for something with the kernel (for example, the mini-firewall developed in this article), or it replaces one of the kernel functions with its own code (usually code to do something and then call the original function). The module_exit function is supposed to undo whatever module_init did, so the module can be unloaded safely.

Second, the printk function behaves similarly to printf; it accepts a format string as its first argument. The printk function prototype goes as follows:

int printk(const char *fmt, ...);

The printk function allows a caller to specify a log level to indicate the type and importance of the message being sent to the kernel message log. For example, in the above code the log level KERN_INFO is specified by prepending it to the format string. In C, this syntax is called string literal concatenation (in other high-level programming languages, string concatenation is generally done with a + operator). For the function printk and the log levels, you can find more information in include/linux/kern_levels.h and include/linux/printk.h.
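
To illustrate the string literal concatenation (a small sketch; KERN_INFO's exact expansion depends on the kernel version):

/* Adjacent string literals are merged by the compiler at compile time,
 * so KERN_INFO (itself a string literal) simply becomes a prefix of the
 * format string. The following two calls are therefore equivalent. */
printk(KERN_INFO "Hello, world\n");
printk(KERN_INFO "Hello, " "world\n");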

Note: the path to the header files for Linux kernel module development is different from the one you usually use for application development. Don't look for the header files inside /usr/include/linux; instead, use the path /lib/modules/`uname -r`/build/include/linux (the uname -r command returns your kernel version).

Next, let’s build this hello-world kernel module.

Build the module

The way to build a kernel module is a little different from how to build a user-space application. The efficient solution for building the kernel image and its modules is the kernel build system (Kbuild).

Kbuild is a complex topic and I won't explain it in too much detail here. Simply speaking, Kbuild allows you to create highly customized kernel binary images and modules. Technically, each subdirectory contains a Makefile compiling only the source code files in its directory, and a top-level Makefile recursively executes each subdirectory's Makefile to generate the binary objects. You can control which subdirectories are included by defining config files. For details, you can refer to other documents.

The following is the Makefile for the hello world module:

obj-m += hello.o
PWD := $(CURDIR)

all:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

clean:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean

The make -C dir command changes to directory dir before reading the makefiles or doing anything else. The top-level Makefile in /lib/modules/$(shell uname -r)/build will be used. You can see that the command make M=dir modules is used to build all modules in the specified dir.

In the module-level Makefile, the obj-m syntax tells the Kbuild system to build module_name.o from module_name.c, which after linking results in the kernel module module_name.ko. In our case, the module name is hello.

The build process goes as follows:

chrisbao:~/develop/kernel/hello-1$ sudo make
make -C /lib/modules/4.15.0-176-generic/build M=/home/DIR/jbao6/develop/kernel/hello-1 modules
make[1]: Entering directory '/usr/src/linux-headers-4.15.0-176-generic'
CC [M] /home/DIR/jbao6/develop/kernel/hello-1/hello.o
Building modules, stage 2.
MODPOST 1 modules
CC /home/DIR/jbao6/develop/kernel/hello-1/hello.mod.o
LD [M] /home/DIR/jbao6/develop/kernel/hello-1/hello.ko
make[1]: Leaving directory '/usr/src/linux-headers-4.15.0-176-generic'

After the build, you can get several new files in the same directory:

chrisbao:~/develop/kernel/hello-1$ ls
hello.c hello.ko hello.mod.c hello.mod.o hello.o Makefile modules.order Module.symvers

The file ending with .ko is the kernel module. You can ignore the other files for now; I will write another article later with a deeper discussion of the kernel module system.

Load the module

With the file command, you can note that the kernel module is an ELF(Executable and Linkable Format) format file. ELF files are typically the output of a compiler or linker and are a binary format.

chrisba:~/develop/kernel/hello-1$ file hello.ko
hello.ko: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), BuildID[sha1]=f0da99c757751e7e9f9c4e55f527fb034a0a4253, not stripped

Next step, let’s try to install and remove the module dynamically. You need to know the following three commands:

  • lsmod: shows the list of kernel modules currently loaded.
  • insmod: inserts a module into the Linux Kernel by running sudo insmod module_name.ko
  • rmmod: removes a module from the Linux Kernel by running sudo rmmod module_name

Since the hello world module is quite simple, you can easily install and remove the module as you wish. I will not show the detailed commands here and leave it to the readers.

Note: It doesn’t mean that you can easily install and remove any kernel module without any issues. If the module you are loading has bugs, the entire system can crash.

Debug the module

Next step, let’s prove that the hello world module is installed and removed as expected. We will use dmesg command. dmesg (diagnostic messages) can print the messages in the kernel ring buffer.

First, a ring buffer is a data structure that uses a single, fixed-size buffer as if it were connected end-to-end. The kernel ring buffer is a ring buffer that records messages related to the operation of the kernel. As we mentioned above, the kernel logs printed by the printk function will be sent to the kernel ring buffer.

We can find the messages produced by our module with command dmesg | grep world as follows:

chrisbao:~$ dmesg | grep world

[2147137.177254] Hello, world
[3281962.445169] Goodbye, world
[3282008.037591] Hello, world
[3282054.921824] Goodbye, world

Now you can see that the hello world module is loaded into the kernel correctly, and it can be removed dynamically as well. Great.

Summary

In this article, we examined how to write a kernel module, how to build it, and how to install it into the kernel dynamically. In the next article, we will work on the mini-firewall as a Netfilter module.

Write a Linux firewall from scratch based on Netfilter: part one - Netfilter and Kernel Modules

Background

Firewalls are an important tool that can be configured to protect your servers and infrastructure. Firewalls’ main functionalities are filtering data, redirecting traffic, and protecting against network attacks. There are both hardware-based firewalls and software-based firewalls. I will not discuss too much about the background here, since you can find many online documents about it.

Have you ever thought of implementing a simple firewall from scratch? Sounds crazy? But with the power of Linux, you can do that. After you read this series of articles, you will find that actually, it is quite simple.

You may have used various firewalls on Linux such as iptables, nftables, UFW, etc. All of these firewall tools are user-space utility programs, and they all rely on Netfilter. Netfilter is the Linux kernel subsystem that allows various networking-related operations to be implemented. Netfilter allows you to develop your own firewall as a Linux kernel module. If you don't know techniques such as Linux kernel modules and Netfilter, don't worry. In this series, let's write a Linux firewall from scratch based on Netfilter. You can learn the following interesting points:

  • Linux kernel module development.
  • Linux kernel network programming.
  • Netfilter module development.

Netfilter and Kernel modules

Basics of Netfilter

Netfilter can be considered the third generation of firewall on Linux. Before Netfilter was introduced in Linux kernel 2.4, there were two older generations of firewalls on Linux:

  • The first generation was a port of an early version of BSD UNIX’s ipfw to Linux 1.1.
  • The second generation was ipchains developed in the 2.2 series of Linux Kernel.

As we mentioned above, Netfilter was designed to provide the infrastructure inside the Linux kernel for various networking operations. So the firewall is just one of multiple functionalities provided by Netfilter:

  • Packet filtering: is in charge of filtering the packets based on the rules. It is also the topic of this article.
  • NAT (Network address translation): is in charge of translating the IP address of network packets. NAT is an important protocol, which has become a popular and essential tool in conserving global address space in the face of IPv4 address exhaustion. If you don’t know NAT protocol, you can refer to other documents. I will examine it in other future articles.
  • Packet mangling: is in charge of modifying the packet content (in fact, NAT is one kind of packet mangling, which modifies the source or destination IP address). For example, the MSS (Maximum Segment Size) value of TCP SYN packets can be altered to allow large-size packets to be transported over the network.

Note: this article will focus on building a simple firewall to filter packets based on Netfilter. So the NAT and Packet Mangling parts are not in the scope of this article.

Packet filtering can only be done inside the Linux kernel (Netfilter's code lives in the kernel as well), so if we want to write a mini-firewall, it has to run in kernel space. Right? Does that mean we need to add our code to the kernel and recompile it? Imagine having to recompile the kernel each time you want to add a new packet filtering rule; that's a bad idea. The good news is that Netfilter allows you to add extensions using Linux kernel modules.

Basics of Linux Kernel modules

Although Linux is a monolithic kernel, it can be extended using kernel modules. Modules can be inserted into the kernel and removed on demand. Linux isolates the kernel but allows you to add specific functionality on the fly through modules. In this way, Linux keeps a balance between stability and usability.

I want to examine one confusing point about kernel modules here: what is the difference between a driver and a module?

  • A driver is a bit of code that runs in the kernel to talk to some hardware device. It drives the hardware. Standard practice is to build drivers as kernel modules where possible, rather than link them statically into the kernel, since that gives more flexibility.
  • A kernel module may not be a device driver at all.

Summary

In the first post of this series, we examined the basics of Netfilter and Linux kernel modules. In the next post, let's start implementing the mini-firewall.

Write a Linux packet sniffer from scratch: part two - BPF

Introduction

In the previous article, we examined how to develop a network sniffer with a PF_PACKET socket on the Linux platform. The sniffer developed there captures all network packets. But a powerful network sniffer like tcpdump should also provide packet filtering functionality. For instance, the sniffer might capture only TCP segments (and skip UDP), or only packets from a specific source IP address. In this article, let's continue to explore how to do that.

Background of BPF

Berkeley Packet Filter (BPF) is the essential underlying technology for packet capture in Unix-like operating systems. Search for BPF online, and the results are very confusing. It turns out that BPF keeps evolving, and there are several associated concepts such as BPF, cBPF, eBPF and LSF. So let us examine those concepts along the timeline:

  • In 1992, BPF was first introduced to the BSD Unix system for filtering unwanted network packets. The proposal of BPF was from researchers in Lawrence Berkeley Laboratory, who also developed the libpcap and tcpdump.

  • In 1997, Linux Socket Filter(LSF) was developed based on BPF and introduced in Linux kernel version 2.1.75. Note that LSF and BPF have some distinct differences, but in the Linux context, when we speak of BPF or LSF, we mean the same packet filtering mechanism in the Linux kernel. We’ll examine the detailed theory and design of BPF in the following sections.

  • Originally, BPF was designed as a network packet filter. But in 2013, BPF was widely extended, and it can be used for non-networking purposes such as performance analysis and troubleshooting. Nowadays, the extended BPF is called eBPF, and the original and obsolete version is renamed to classic BPF (cBPF). Note that what we examine in this article is cBPF, and eBPF is not inside the scope of this article. eBPF is the hottest technology in today’s software world, and I’ll talk about it in the future.

Where to place BPF

The first question to answer is where we should place the filter. The last article examined the path of a received packet, as follows:

The best answer is to put the filter as early as possible in the path. Copying a large amount of data from kernel space to user space produces a huge overhead, which hurts system performance. So BPF is a kernel feature: the filter should be triggered as soon as a packet is received at the network interface. As the original BPF paper puts it: "To minimize memory traffic, the major bottleneck in most modern systems, the packet should be filtered 'in place' (e.g., where the network interface DMA engine put it) rather than copied to some other kernel buffer before filtering."
Let's verify this behavior by examining the kernel source code as follows (note that the kernel code shown in this article is based on version 2.6, which contains the cBPF implementation):

/* source code file of net/packet/af_packet.c */
/* packet_create: create socket */
static int packet_create(struct net *net, struct socket *sock, int protocol)
{
    /* some code omitted ... */
    po = pkt_sk(sk);
    sk->sk_family = PF_PACKET;
    po->num = proto;

    spin_lock_init(&po->bind_lock);
    po->prot_hook.func = packet_rcv;          // attach hook function to socket

    if (sock->type == SOCK_PACKET)
        po->prot_hook.func = packet_rcv_spkt; // attach hook function to socket

    if (proto) {
        po->prot_hook.type = proto;
        dev_add_pack(&po->prot_hook);
        sock_hold(sk);
        po->running = 1;
    }
}

The packet_create function handles socket creation when the application calls the socket system call. The two assignments to po->prot_hook.func attach the hook function (packet_rcv, or packet_rcv_spkt for SOCK_PACKET sockets) to the socket. The hook function executes when a packet is received.

The following code block shows the hook function packet_rcv:

/* hook function packet_rcv is triggered when the packet is received */
static int packet_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev)
{
    /* some code omitted ... */
    sk = pt->af_packet_priv;
    snaplen = skb->len;
    res = run_filter(skb, sk, snaplen);           // filter logic
    if (!res)
        goto drop_n_restore;                      // drop the packet

    __skb_queue_tail(&sk->sk_receive_queue, skb); // put the packet into the receive queue
}

The packet_rcv function calls run_filter, which is exactly the BPF logic part (for now, you can regard it as a black box; we'll examine the details in the next section). Based on the return value of run_filter, the packet is either filtered out or put into the receive queue.

So far, you can see that BPF (the packet filtering) works inside kernel space. But the packet sniffer is a user-space application. The next question is how to pass the filtering rules defined in user space to the filtering handler in kernel space.

To answer this question, we have to understand BPF itself. It's the right time to examine this great piece of work.

BPF machine

As I mentioned above, BPF was introduced in the original paper written by researchers from Berkeley. Based on my own experience, I strongly recommend you read it. At first I found the paper intimidating, so I read other related documents instead and tried to understand BPF from them. But most documents cover only one portion of the entire system, so it was difficult to piece all the information together. In the end, I read the original paper and everything clicked into place. As the saying goes, sometimes taking the long way is actually the shortcut.

Virtual CPU

A packet filter is simply a boolean-valued function on a packet. If the value of the function is true the kernel copies the packet for the application; if it is false the packet is ignored.

In order to be as flexible as possible and not to limit the application to a set of predefined conditions, BPF is actually implemented as a register-based virtual machine (for the difference between stack-based and register-based virtual machines, you can refer to this article) running a user-defined program.

You can regard BPF as a virtual CPU. It consists of an accumulator, an index register (x), a scratch memory store, and an implicit program counter. If you're not familiar with these concepts, here are some brief explanations:

  • An accumulator is a type of register included in a CPU. It acts as a temporary storage location holding an intermediate value in mathematical and logical calculations. For example, in the operation of “1+2+3”, the accumulator would hold the value 1, then the value 3, then the value 6. The benefit of an accumulator is that it does not need to be explicitly referenced.
  • An index register in a computer’s CPU is a processor register or assigned memory location used for modifying operand addresses during the run of a program.
  • A program counter is a CPU register in the computer processor which has the address of the next instruction to be executed from memory.

In the BPF machine, the accumulator is used for arithmetic operations, while the index register provides offsets into the packet or the scratch memory areas.

Instruction set and addressing modes

Like a physical CPU, BPF provides a small set of arithmetic, logical and jump instructions, listed below; these instructions run on the BPF virtual machine (or virtual CPU):

The first column, opcodes, lists the BPF instructions written in an assembly-language style. For example, ld, ldh and ldb copy the indicated value into the accumulator; ldx copies the indicated value into the index register; jeq jumps to the target instruction if the accumulator equals the indicated value; ret returns the indicated value. You can check the functionality of the instruction set in detail in the paper.

This assembly-like style is more readable for humans. But when we develop an application (like the sniffer written in this article), we pass the BPF instructions directly in binary form. This binary format is called BPF bytecode. I'll examine the way to convert the assembly language into bytecode later.

The second column addr modes lists the addressing modes allowed for each instruction. The semantics of the addressing modes are listed in the following table:

For instance, [k] means the data at byte offset k in the packet, and #k means the literal value k. You can read the paper to check the meaning of the other addressing modes.

Example BPF program

Now let’s try to understand the following small BPF program based on the knowledge above:

(000) ldh      [12]
(001) jeq #0x800 jt 2 jf 3
(002) ret #262144
(003) ret #0

The BPF program consists of an array of BPF instructions. For example, the above BPF program contains four instructions.

The first instruction ldh loads a half-word (16-bit) value into the accumulator from offset 12 in the Ethernet packet. According to the Ethernet frame format shown below, the value is just the Ethernet type field. The Ethernet type is used to indicate which protocol is encapsulated in the frame's payload (for example, 0x0806 for ARP, 0x0800 for IPv4, and 0x86DD for IPv6).

The second instruction jeq compares the accumulator (which currently holds the Ethernet type field) with 0x800 (the value for IPv4). If the comparison fails, zero is returned and the packet is rejected; if it succeeds, a non-zero value is returned and the packet is accepted. So this small BPF program filters for, and accepts, all IP packets. You can find other BPF programs in the original paper. Go read it, and you can feel the flexibility of BPF as well as the beauty of the design.
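
For illustration, here is a sketch of how the same four instructions can be hand-assembled in C using the BPF_STMT and BPF_JUMP helper macros from <linux/filter.h> (the array name ip_filter is just a placeholder):

#include <linux/filter.h>
#include <linux/if_ether.h>   /* ETH_P_IP = 0x0800 */

/* accept IPv4 packets, reject everything else */
struct sock_filter ip_filter[] = {
    /* (000) ldh [12]    : load the 16-bit Ethernet type field */
    BPF_STMT(BPF_LD  | BPF_H   | BPF_ABS, 12),
    /* (001) jeq #0x800  : fall through if IPv4, otherwise skip one instruction */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, ETH_P_IP, 0, 1),
    /* (002) ret #262144 : accept, return the snapshot length */
    BPF_STMT(BPF_RET | BPF_K, 262144),
    /* (003) ret #0      : reject */
    BPF_STMT(BPF_RET | BPF_K, 0),
};

The jump offsets in BPF_JUMP are relative to the next instruction, which is why jt is 0 and jf is 1 here. The opcode values these macros produce (0x28, 0x15, 0x6) are exactly the bytecode numbers you will see tcpdump emit later in this article.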

Kernel implementation of BPF

Next, let's examine how the kernel implements BPF. As mentioned above, the hook function packet_rcv calls run_filter to handle the filtering logic. run_filter is defined as follows:

/* Copied from net/packet/af_packet.c */
/* function run_filter is called in packet_rcv */
static inline unsigned int run_filter(struct sk_buff *skb, struct sock *sk,
                                      unsigned int res)
{
    struct sk_filter *filter;

    rcu_read_lock_bh();
    filter = rcu_dereference(sk->sk_filter);                   // get the filter bound to the socket
    if (filter != NULL)
        res = sk_run_filter(skb, filter->insns, filter->len);  // the filtering logic is inside sk_run_filter
    rcu_read_unlock_bh();

    return res;
}

You can find that the real filtering logic is inside sk_run_filter:

unsigned int sk_run_filter(struct sk_buff *skb, struct sock_filter *filter, int flen)
{
    struct sock_filter *fentry;  /* We walk down these */
    void *ptr;
    u32 A = 0;                   /* Accumulator */
    u32 X = 0;                   /* Index Register */
    u32 mem[BPF_MEMWORDS];       /* Scratch Memory Store */
    u32 tmp;
    int k;
    int pc;

    /*
     * Process array of filter instructions.
     */
    for (pc = 0; pc < flen; pc++) {
        fentry = &filter[pc];

        switch (fentry->code) {
        case BPF_ALU|BPF_ADD|BPF_X:
            A += X;
            continue;
        case BPF_ALU|BPF_ADD|BPF_K:
            A += fentry->k;
            continue;
        case BPF_ALU|BPF_SUB|BPF_X:
            A -= X;
            continue;
        case BPF_ALU|BPF_SUB|BPF_K:
            A -= fentry->k;
            continue;
        case BPF_ALU|BPF_MUL|BPF_X:
            A *= X;
            continue;
        /* some code omitted ... */
        case BPF_RET|BPF_K:
            return fentry->k;
        case BPF_RET|BPF_A:
            return A;
        case BPF_ST:
            mem[fentry->k] = A;
            continue;
        case BPF_STX:
            mem[fentry->k] = X;
            continue;
        default:
            WARN_ON(1);
            return 0;
        }
    }

    return 0;
}

As mentioned earlier, sk_run_filter is simply a boolean-valued function on a packet. It maintains the accumulator, the index register and the scratch memory as local variables, and processes the array of BPF filter instructions in a for loop. Each instruction updates the local variables; in this way, it simulates a virtual CPU. Interesting, right?

BPF JIT

Since each network packet must go through the filtering function, that function can become a performance bottleneck for the entire system.

A just-in-time (JIT) compiler was introduced into the kernel in 2011 to speed up BPF bytecode execution.

  • What is a JIT compiler? A JIT compiler runs after the program has started and compiles the code (usually bytecode or some type of VM instructions) on the fly (or just in time) into a form that's usually faster, typically the host CPU's native instruction set. This is in contrast to a traditional compiler that compiles all the code to machine language before the program is first run.

In the BPF case, the JIT compiler translates BPF bytecode directly into the host system's native instructions, which improves performance a lot. I won't show the details of the JIT in this article; you can refer to the kernel code.

Set BPF in sniffer

Next, let's add BPF to our packet sniffer. As mentioned above, at the application level the BPF instructions are expressed in bytecode format, using the following data structure:

struct sock_filter {    /* Filter block */
    __u16 code;         /* Actual filter code */
    __u8  jt;           /* Jump true */
    __u8  jf;           /* Jump false */
    __u32 k;            /* Generic multiuse field */
};

How can we convert the BPF assembly language into bytecode? There are two solutions. First, there is a small helper tool called bpf_asm (provided along with the Linux kernel source), which you can regard as a BPF assembly language interpreter. But it is not recommended for application developers.

Second, we can use tcpdump, which provides this conversion functionality. You can find the following options in the tcpdump man page:

  • -d: Dump the compiled packet-matching code in a human-readable form to standard output and stop.

  • -dd: Dump packet-matching code as a C program fragment.

  • -ddd: Dump packet-matching code as decimal numbers (preceded with a count).

tcpdump ip means we want to capture all the IP packets. With the options -d, -dd and -ddd, the output is as follows:

baoqger@ubuntu:~$ sudo tcpdump -d ip
[sudo] password for baoqger:
(000) ldh [12]
(001) jeq #0x800 jt 2 jf 3
(002) ret #262144
(003) ret #0

baoqger@SLB-C8JWZH3:~$ sudo tcpdump -dd ip
{ 0x28, 0, 0, 0x0000000c },
{ 0x15, 0, 1, 0x00000800 },
{ 0x6, 0, 0, 0x00040000 },
{ 0x6, 0, 0, 0x00000000 },

baoqger@SLB-C8JWZH3:~$ sudo tcpdump -ddd ip
4
40 0 0 12
21 0 1 2048
6 0 0 262144
6 0 0 0

Option -d prints the BPF instructions in assembly language (the same as the example BPF program shown above). Option -dd prints the bytecode as a C program fragment. So tcpdump is the most convenient tool when you want to get the BPF bytecode.

The BPF filter bytecode (wrapped in the structure sock_fprog) can be passed to the kernel through setsockopt system call as follows:

// attach the filter to the socket
// the filter bytecode is generated by running: tcpdump -dd ip
struct sock_filter BPF_code[] = {
    { 0x28, 0, 0, 0x0000000c },
    { 0x15, 0, 1, 0x00000800 },
    { 0x6,  0, 0, 0x00040000 },
    { 0x6,  0, 0, 0x00000000 }
};
struct sock_fprog Filter;
// the .len field must be consistent with the real length of the filter code array
Filter.len = sizeof(BPF_code)/sizeof(BPF_code[0]);
Filter.filter = BPF_code;

if (setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &Filter, sizeof(Filter)) < 0) {
    perror("setsockopt attach filter");
    close(sock);
    exit(1);
}

The setsockopt system call triggers two kernel functions: sock_setsockopt and sk_attach_filter (I won't show the details of these two functions), which bind the filter to the socket. Then the run_filter kernel function (mentioned above) gets the filter from the socket and executes it on the packet.

So far, every piece is connected, and the puzzle of BPF is solved. The BPF machine allows user-space applications to inject customized BPF programs straight into the kernel. Once loaded and verified, a BPF program executes in kernel context and has access to the internal kernel state made available to it; for the cBPF machine, that state is the network packet data. This power has been extended with eBPF, which is used for many other, more varied applications. As someone said, in some way eBPF does to the kernel what JavaScript does to websites: it allows all sorts of new applications to be created. In the future, I plan to examine eBPF in depth.

Process the packet

We examined the BPF filtering theory at the kernel level in the sections above. For our tiny sniffer, the last step is to process the network packets.

  • First, the recvfrom system call reads a packet from the socket. We put the call in a while loop to keep reading incoming packets.

  • Then, we print the source and destination MAC addresses of the packet (the packet we got is a raw Ethernet frame at Layer 2, right?). And if the Ethernet frame contains an IPv4 packet, we also print the source and destination IP addresses. To understand more, you can study the header formats of the various network protocols; I will not cover the details here.

while (1) {
    printf("-----------\n");
    n = recvfrom(sock, buffer, 2048, 0, NULL, NULL);
    printf("%d bytes read\n", n);

    /* Check to see if the packet contains at least
     * complete Ethernet (14), IP (20) and TCP/UDP
     * (8) headers.
     */
    if (n < 42) {
        perror("recvfrom():");
        printf("Incomplete packet (errno is %d)\n", errno);
        close(sock);
        exit(0);
    }

    ethhead = buffer;
    /* in an Ethernet frame the destination MAC comes first (bytes 0-5),
     * followed by the source MAC (bytes 6-11) */
    printf("Destination MAC address: %.2x:%.2x:%.2x:%.2x:%.2x:%.2x\n",
           ethhead[0], ethhead[1], ethhead[2], ethhead[3], ethhead[4], ethhead[5]);
    printf("Source MAC address: %.2x:%.2x:%.2x:%.2x:%.2x:%.2x\n",
           ethhead[6], ethhead[7], ethhead[8], ethhead[9], ethhead[10], ethhead[11]);

    iphead = buffer + 14;

    if (*iphead == 0x45) { /* double check for IPv4
                            * and no options present */
        printf("Source host %d.%d.%d.%d\n",
               iphead[12], iphead[13], iphead[14], iphead[15]);
        printf("Dest host %d.%d.%d.%d\n",
               iphead[16], iphead[17], iphead[18], iphead[19]);
        printf("Source,Dest ports %d,%d\n",
               (iphead[20]<<8)+iphead[21],
               (iphead[22]<<8)+iphead[23]);
        printf("Layer-4 protocol %s\n", transport_protocol(iphead[9]));
    }
}

You can find the complete source code of the sniffer in this Github repo.

Summary

In this article, we examined how to add filters to our sniffer. First, we analyzed why the filter should run inside kernel space instead of application space. Then we examined the BPF machine design and implementation in detail, based on the paper, and reviewed the kernel source code to understand how the BPF virtual machine is implemented. As mentioned above, the original BPF (cBPF) has since been extended into eBPF, but understanding the cBPF virtual machine is very helpful for eBPF as well.

Write a Linux packet sniffer from scratch: part one- PF_PACKET socket and promiscuous mode

Background

When we refer to a network packet sniffer, some famous and popular tools come to mind, like tcpdump. I have shown how to capture network packets with such tools in my previous articles. But have you ever thought about writing a packet sniffer from scratch, without depending on any third-party libraries? We need to dig deep into the operating system and find the weapons needed to build this tool. Sounds complex, right? In this article, let us do it. After reading it, you will find that it is not as difficult as you might think.

Note that different operating system kernels have different internal network implementations. This article will focus on the Linux platform.

Introduction

Firstly, we need to review how tcpdump is implemented. According to the official documentation, tcpdump is built on the library libpcap, which was developed based on the remarkable research from Berkeley; for details you can refer to this paper.

As you know, different operating systems have different internal implementations of network stacks. libpcap covers all of these differences and provides the system-independent interface for user-level packet capture. I want to focus on the Linux platform, so how does libpcap work on the Linux system? According to some documents, it turns out that libpcap uses the PF_PACKET socket to capture packets on a network interface.

So the next question is: what is the PF_PACKET socket?

PF_PACKET socket

In my previous article, we mentioned that the socket interface is TCP/IP’s window on the world. In most modern systems incorporating TCP/IP, the socket interface is the only way applications can use the TCP/IP suite of protocols.

It is correct. This time, let's dig deeper into sockets by examining the system call executed when we create a new one:

int socket(int domain, int type, int protocol);

When you want to create a socket with the above system call, you have to specify which domain (or protocol family) you want to use as the first argument. The most commonly used family is PF_INET, which is for communication based on IPv4 protocols (when you create a TCP server, you use this family). Moreover, you have to specify a type for your socket as the second argument, and the possible values depend on the family you specified. For example, when dealing with the PF_INET family, the values for type include SOCK_STREAM (for TCP) and SOCK_DGRAM (for UDP). For other detailed information about the socket system call, you can refer to the socket(2) man page.
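
As a quick illustration (a minimal sketch, not taken from the sniffer code), this is how the domain and type arguments pair up for the familiar PF_INET family:

#include <sys/socket.h>
#include <sys/types.h>

int main(void) {
    /* PF_INET + SOCK_STREAM: a TCP socket, the kernel runs the TCP/IP stack for you */
    int tcp_sock = socket(PF_INET, SOCK_STREAM, 0);

    /* PF_INET + SOCK_DGRAM: a UDP socket */
    int udp_sock = socket(PF_INET, SOCK_DGRAM, 0);

    /* later in this article we will use a different family: PF_PACKET with SOCK_RAW */
    return 0;
}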

You can find one potential value for the domain argument as follows:

AF_PACKET    Low-level packet interface

Note: AF_PACKET and PF_PACKET are the same. Historically it was called PF_PACKET and was later renamed AF_PACKET. PF means protocol family, and AF means address family. In this article, I use PF_PACKET.

This is different from a PF_INET socket, where the kernel's TCP/IP stack processes the packet for you. With a PF_PACKET socket, you get the raw Ethernet frame, bypassing the usual upper-layer handling of the TCP/IP stack. It might sound a little crazy, but that is exactly what happens: any packet received is passed directly to the application.

For a better understanding of PF_PACKET socket, let us go deeper and roughly examine the path of a received packet from the network interface to the application level.

(As shown in the image above) When the network interface card (NIC) receives a packet, it is handled by the driver. The driver maintains a structure called the ring buffer internally and writes the packet into kernel memory (pre-allocated for the ring buffer) using direct memory access (DMA). The packet is placed inside a structure called sk_buff (one of the most important structures in the kernel network subsystem).

After entering the kernel space, the packet goes through protocol stack handling layer by layer, such as IP processing and TCP/UDP processing. And the packet goes into applications via the socket interface. You already understand this familiar path very well.

But for the PF_PACKET socket, the packet in sk_buff is cloned; it then skips the protocol stack and goes directly to the application. The kernel needs the clone operation because one copy is consumed by the PF_PACKET socket, and the other goes through the usual protocol stack.

In future articles, I’ll demonstrate more about Linux kernel network internals.

Next, let us see how to create a PF_PACKET socket at the code level. For brevity, I omit some code and only show the essential part. You can refer to this Github repo for the details.

if ((sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_IP))) < 0) {
    perror("socket");
    exit(1);
}

Please make sure to include the necessary system header files: <sys/socket.h>, <sys/types.h>, <arpa/inet.h> (for htons) and <linux/if_ether.h> (for ETH_P_IP).

Bind to one network interface

Without additional settings, the sniffer captures all the packets received on all the network devices. As the next step, let us try to bind the sniffer to a specific network device.

Firstly, you can use the ifconfig command to list all the available network interfaces on your machine. A network interface is a software interface to the networking hardware.

For example, the following output shows the information of the network interface eth0:

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.230.49  netmask 255.255.240.0  broadcast 192.168.239.255
        inet6 fe80::215:5dff:fefb:e31f  prefixlen 64  scopeid 0x20<link>
        ether 00:15:5d:fb:e3:1f  txqueuelen 1000  (Ethernet)
        RX packets 260  bytes 87732 (87.7 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 178  bytes 29393 (29.3 KB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

Let’s bind the sniffer to eth0 as follows:

// bind to eth0 interface only
const char *opt;
opt = "eth0";
if (setsockopt(sock, SOL_SOCKET, SO_BINDTODEVICE, opt, strlen(opt) + 1) < 0) {
    perror("setsockopt bind device");
    close(sock);
    exit(1);
}

We do it by calling the setsockopt system call. I leave the detailed usage of it to you.

Now the sniffer only captures network packets received on the specified network card.

Non-promiscuous and promiscuous mode

By default, each network card minds its own business and reads only the frames directed to it. This means that the network card discards all the packets that do not carry its own MAC address, which is called non-promiscuous mode.

Next, let us make the sniffer work in promiscuous mode so that it retrieves all the packets it sees, even the ones that are not addressed to its host.

To set a network interface to promiscuous mode, all we have to do is issue the ioctl() system call to an open socket on that interface.

/* set the network card in promiscuous mode */
// An ioctl() request has encoded in it whether the argument is an in parameter or out parameter
// SIOCGIFFLAGS 0x8913 /* get flags */
// SIOCSIFFLAGS 0x8914 /* set flags */
struct ifreq ethreq;
strncpy(ethreq.ifr_name, "eth0", IF_NAMESIZE);
if (ioctl(sock, SIOCGIFFLAGS, &ethreq) == -1) {
    perror("ioctl");
    close(sock);
    exit(1);
}
ethreq.ifr_flags |= IFF_PROMISC;
if (ioctl(sock, SIOCSIFFLAGS, &ethreq) == -1) {
    perror("ioctl");
    close(sock);
    exit(1);
}

ioctl stands for I/O control; it manipulates the underlying device parameters of special files. ioctl takes three arguments:

  • The first argument must be an open file descriptor. We use the socket file descriptor bound to the network interface in our case.
  • The second argument is a device-dependent request code. You can see we called ioctl twice. The first call uses request code SIOCGIFFLAGS to get flags, and the second call uses request code SIOCSIFFLAGS to set flags. Do not be fooled by these two constant values, which are spelled alike.
  • The third argument is for returning information to the requesting process.

Now the sniffer can retrieve all the data packets received on the network card, no matter to which host the packets are addressed.
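
One practical detail worth adding (a small sketch, not part of the code above): IFF_PROMISC changes the interface state for the whole system, so it is polite to clear the flag again with the same SIOCGIFFLAGS/SIOCSIFFLAGS pair when the sniffer exits:

/* restore the interface to non-promiscuous mode before exiting */
if (ioctl(sock, SIOCGIFFLAGS, &ethreq) != -1) {
    ethreq.ifr_flags &= ~IFF_PROMISC;   /* clear the promiscuous flag */
    ioctl(sock, SIOCSIFFLAGS, &ethreq);
}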

Summary

This article examined what the PF_PACKET socket is, how it works, and why the application can get raw Ethernet frames through it. Furthermore, we discussed how to bind the sniffer to one specific network interface and how to make the sniffer work in promiscuous mode. The next article will examine how to implement the packet filtering functionality, which is very useful for a network sniffer.