Understanding of BPF

Question

When I need to capture some packets using tcpdump, I use a command like:

tcpdump -i eth0 "dst host 192.168.1.0"

I always thought the dst host 192.168.1.0 part was something called BPF, the Berkeley Packet Filter. To me, it's a simple language for filtering network packets. But today my roommate told me that BPF can be used to capture performance info. According to his description, it's like the perfmon tool on Windows. Is that true? Is it the same BPF as the one I mentioned at the beginning of the question?

forest · Accepted Answer · 2025-10-07 22:30:43Z

What is BPF?

BPF (or more commonly, the extended version, eBPF) is a language that was originally used exclusively for filtering packets, but it is capable of quite a lot more. On Linux, it can be used for many other things, including system call filters for security, choosing processes to kill when the system runs out of memory, and sophisticated performance monitoring, as you pointed out. While Windows did add eBPF support, that is not what Windows' perfmon utility uses. Windows only added support for compatibility with non-Windows utilities that rely on OS support for eBPF.

The eBPF programs are not executed in userspace. Instead, the application creates an eBPF program and sends it to the kernel, which executes it. The program is machine code for a virtual processor that the kernel implements as an interpreter, although the kernel can also JIT-compile it to native code, which improves performance considerably. The program has access to some basic interfaces in the kernel, including those related to performance and networking. It then hands its result back to the kernel, which acts on it (for example, by dropping a packet).

Restrictions on eBPF programs

To protect against denial-of-service attacks and accidental crashes, the kernel verifies the code before it is JIT-compiled or run. The verifier applies several important checks:

The program consists of no more than 4096 instructions in total for unprivileged users.
Backwards jumps cannot occur, with the exception of bounded loops and function calls.
There are no instructions that are always unreachable.

The upshot is that the verifier must be able to prove that the eBPF program halts. It hasn't found a solution to the halting problem, of course, which is why it only accepts programs that it knows will halt. To do this, it represents the program as a directed acyclic graph. In addition to this, it tries to prevent information leaks and out-of-bounds memory access by preventing the actual value of a pointer from being revealed while still allowing limited operations to be performed on it:

Pointers cannot be compared, stored, or returned as a value that can be examined.
Pointer arithmetic can only be done against a scalar (a value not derived from a pointer).
No pointer arithmetic can result in pointing outside the designated memory map.

The verifier is rather complex and does far more, although it has itself been the source of serious security bugs, at least when the bpf(2) syscall is not disabled for unprivileged users.

Viewing the code

The dst host 192.168.1.0 component of the command is not BPF. That is just syntax which is used by tcpdump. However, the command you give it is used to generate a BPF program which is then sent to the kernel. Note that it is not eBPF which is used in this case, but the older cBPF. There are several important differences between the two (although the kernel internally converts cBPF into eBPF). The -d flag can be used to see the cBPF code that is to be sent to the kernel:

# tcpdump -i eth0 "dst host 192.168.1.0" -d
(000) ldh      [12]
(001) jeq      #0x800           jt 2    jf 4
(002) ld       [30]
(003) jeq      #0xc0a80100      jt 8    jf 9
(004) jeq      #0x806           jt 6    jf 5
(005) jeq      #0x8035          jt 6    jf 9
(006) ld       [38]
(007) jeq      #0xc0a80100      jt 8    jf 9
(008) ret      #262144
(009) ret      #0

More complicated filters result in more complicated bytecode. Try some of the examples in the manpage and append the -d flag to see what bytecode would be loaded into the kernel. In order to understand how to read the disassembly, review the BPF filter documentation. If you're reading an eBPF program, you should take a look at the eBPF instruction set for the virtual CPU.

Understanding the code

For simplicity, I'll assume you specified a destination IP of 192.168.1.1 instead of 192.168.1.0 and wanted to match IPv4 only, which shrinks the code quite a bit as it no longer has to handle ARP (0x806) and RARP (0x8035):

# tcpdump -i eth0 "dst host 192.168.1.1 and ip" -d
(000) ldh      [12]
(001) jeq      #0x800           jt 2    jf 5
(002) ld       [30]
(003) jeq      #0xc0a80101      jt 4    jf 5
(004) ret      #262144
(005) ret      #0

Let's walk through what the above bytecode actually does. Each time a packet is received on the interface specified, the BPF bytecode is run. The packet contents (including the Ethernet header, if applicable) are put in a buffer that the BPF code has access to. If the packet matches the filter, the code returns the number of bytes of the packet to keep, which tcpdump sets to its snapshot length (262144 bytes by default); otherwise it returns 0.

Let's assume you are running this filter and it receives a packet carrying an ICMP message with an empty payload from 192.168.1.142 to 192.168.1.1. The destination MAC is aa:aa:aa:aa:aa:aa and the source MAC is bb:bb:bb:bb:bb:bb (the destination address comes first in an Ethernet header). The contents of the Ethernet frame, in hexadecimal, are:

aa aa aa aa aa aa bb bb bb bb bb bb 08 00 45 00
00 1c 77 71 40 00 40 01 3f 92 c0 a8 01 8e c0 a8
01 01 08 00 c1 c0 36 0e 00 01

Note that the 7 byte Ethernet preamble and single byte start frame delimiter are excluded.

Instruction 0 is ldh [12]. This loads a half-word (two bytes) located at an offset of 12 bytes into the packet into the A register. In an Ethernet header, this is the location of the EtherType field, which specifies the protocol that is being encapsulated. In our case, this is the value 0x0800 (remember that multi-byte fields in network headers are big-endian). The value 0x0800 indicates the IPv4 protocol (other possible values include 0x86dd for IPv6, 0x0806 for ARP, and 0x88e5 for MACsec).

Instruction 1 is jeq #0x800, which will compare an immediate with the value in the A register. If they are equal, it will jump to instruction 2, otherwise 5. The value 0x800 at that offset in the Ethernet frame specifies the IPv4 protocol. Because the comparison evaluates true, the code now jumps to instruction 2. If the payload was not IPv4, it would have jumped to 5.

Instruction 2 is ld [30]. This loads an entire 4-byte word at an offset of 30 into the A register. In our Ethernet frame, this is 0xc0a80101. The Ethernet header that was passed to us is 14 bytes (two 6-byte MAC addresses and a 2-byte EtherType field), so offset 30 into our Ethernet frame is offset 16 into the IPv4 header, which is the 4-byte destination address in big-endian format.

Instruction 3, jeq #0xc0a80101, will compare an immediate against the contents of the A register and will jump to 4 if true, otherwise 5. This value is the destination address (0xc0a80101 is the big-endian representation of 192.168.1.1, as 0xc0 is 192, 0xa8 is 168, 0x01 is 1, and 0x01 is 1). The values do indeed match, so the program counter is now set to 4.
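
The offsets and values used so far can be checked outside the kernel. The following is a minimal Python sketch, not anything tcpdump itself runs: it reads the same two fields from the example frame with the standard struct module. The names FRAME and ETH_HEADER_LEN are made up for this illustration.

```python
import ipaddress
import struct

# The example Ethernet frame from the walkthrough (42 bytes).
FRAME = bytes.fromhex(
    "aaaaaaaaaaaa bbbbbbbbbbbb 0800"
    "4500 001c 7771 4000 4001 3f92 c0a8018e c0a80101"
    "0800 c1c0 360e 0001"
)

ETH_HEADER_LEN = 14  # two 6-byte MAC addresses plus the 2-byte EtherType

# "!" selects network (big-endian) byte order; H is 2 bytes, I is 4 bytes.
(ethertype,) = struct.unpack_from("!H", FRAME, 12)               # like ldh [12]
(dst_ip,) = struct.unpack_from("!I", FRAME, ETH_HEADER_LEN + 16)  # like ld [30]

print(hex(ethertype))                  # EtherType field
print(hex(dst_ip))                     # destination address as one 32-bit word
print(ipaddress.IPv4Address(dst_ip))   # the same word in dotted-quad form
print(hex(int(ipaddress.IPv4Address("192.168.1.1"))))  # and back again
```

The last line is a quick way to find the immediate that tcpdump will compare against for any IPv4 address.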

Instruction 4 is ret #262144. This terminates the BPF program and returns the integer 262144. This tells the kernel that the program that loaded the filter, tcpdump in this case, would like the packet. The kernel passes it to the program which parses it, writing the information to your terminal or saving the contents of the packet to a pcap file. Performance (and security!) is vastly improved because tcpdump is not forced to tediously parse each and every packet that is received.

Instruction 5 is ret #0 and is not reached for our Ethernet frame. If the destination address did not match what the filter was looking for or the protocol type was not IPv4, the code would have jumped here instead, returning 0 to signify there was no match.

This is all just a way to return 262144 if the half-word at offset 12 into the packet is 0x800 AND the word at offset 30 is 0xc0a80101, and return 0 otherwise. Because this is all done in the kernel (optionally after being converted into native machine code by the JIT engine), no expensive context switches or passing buffers between kernelspace and userspace are required, so the filter is fast. This repeats with each packet that the kernel handles, although the filter can be installed only for ingress, egress, only on a certain interface, etc. In the above example, it is installed for both ingress and egress on eth0, and the filter will be executed on each packet passing through it.
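
To make the control flow concrete, here is a toy interpreter in Python for just the four opcodes this filter uses. It is a sketch for illustration only, under a few assumptions: the tuple layout and the names PROGRAM and run_filter are invented here, jump targets are the absolute instruction numbers shown by tcpdump -d (real bytecode stores relative offsets), and an out-of-bounds load is treated as "no match".

```python
import struct

SNAPLEN = 262144

# Each instruction is (opcode, k, jt, jf), mirroring the tcpdump -d listing.
# jt/jf are absolute instruction numbers here, as in the listing.
PROGRAM = [
    ("ldh", 12, None, None),
    ("jeq", 0x800, 2, 5),
    ("ld", 30, None, None),
    ("jeq", 0xC0A80101, 4, 5),
    ("ret", SNAPLEN, None, None),
    ("ret", 0, None, None),
]

FRAME = bytes.fromhex(
    "aaaaaaaaaaaa bbbbbbbbbbbb 0800"
    "4500 001c 7771 4000 4001 3f92 c0a8018e c0a80101"
    "0800 c1c0 360e 0001"
)


def run_filter(program, packet):
    """Interpret a tiny subset of cBPF (ldh, ld, jeq, ret) over one packet."""
    a = 0   # the A register (accumulator)
    pc = 0  # program counter
    while True:
        op, k, jt, jf = program[pc]
        if op in ("ldh", "ld"):
            size, fmt = (2, "!H") if op == "ldh" else (4, "!I")
            if k + size > len(packet):
                return 0  # load past the end of the packet: no match
            (a,) = struct.unpack_from(fmt, packet, k)
            pc += 1
        elif op == "jeq":
            pc = jt if a == k else jf
        elif op == "ret":
            return k
        else:
            raise ValueError(f"unsupported opcode: {op}")


# Same frame, but with the destination address changed to 192.168.1.2.
other_host = FRAME[:33] + b"\x02" + FRAME[34:]

print(run_filter(PROGRAM, FRAME))       # the frame from the walkthrough
print(run_filter(PROGRAM, other_host))  # a frame for a different host
```

Because every jump goes forward, the program counter only ever increases, so the loop is guaranteed to reach a ret instruction.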

More advanced examples

All filters generated by tcpdump work like this, with varying levels of complexity. Even complex filters are nothing more than a series of instructions checking various values at various offsets into the packet. For example, this is a filter that captures only ECN packets on the 192.168.1.0/24 subnet, excluding 192.168.1.100, with a size larger than 1000 bytes going to or from port 80 or 443:

tcpdump -i eth0 'ip and tcp port (80 or 443) and net 192.168.1/24 and not host 192.168.1.100 and tcp[tcpflags] & (tcp-ece|tcp-cwr) != 0 and len > 1000'

This results in the following assembly listing (annotated by me):

; Check EtherType
(000) ldh      [12]                              ; Load 2 byte EtherType from offset 12
(001) jeq      #0x800           jt 2    jf 28    ; Is it IPv4? If yes, jump to next instruction, else jump to 28 (drop)

; Check IP protocol
(002) ldb      [23]                              ; Load a single byte from IP header offset 9 (protocol field)
(003) jeq      #0x6             jt 4    jf 28    ; Is it TCP (6)? If yes, jump to next instruction, else jump to 28 (drop)

; Skip fragmented IP datagrams
(004) ldh      [20]                              ; Load 2 bytes from IP header offset 6 (flags and fragment offset)
(005) jset     #0x1fff          jt 28   jf 6     ; Is the 13-bit fragment offset non-zero? If yes, this is a later fragment with no TCP header; jump to 28 (drop), else jump to next instruction

; Calculate TCP header offset
(006) ldxb     4*([14]&0xf)                      ; Load the IP header length field (low nibble of the byte at offset 14), multiply by 4 and save the result (IP header length in bytes) in X; the TCP header starts at X + 14

; Check source and destination port
(007) ldh      [x + 14]                          ; Load TCP source port (offset 0 of the TCP header; 14 is the Ethernet header length)
(008) jeq      #0x50            jt 13   jf 9     ; Is it 80 (0x50)? If yes, jump past dest port checks and source port check for 443, else jump to next instruction
(009) jeq      #0x1bb           jt 13   jf 10    ; Is it 443 (0x1bb)? If yes, jump past dest port checks, else jump to next instruction
(010) ldh      [x + 16]                          ; Load TCP dest port (offset 2 of the TCP header)
(011) jeq      #0x50            jt 13   jf 12    ; Is it 80? If yes, jump past the check for port 443, else jump to next instruction
(012) jeq      #0x1bb           jt 13   jf 28    ; Is it 443? If yes, jump to next instruction, else jump to 28 (drop)

; Check source and destination address
(013) ld       [26]                              ; Load 4 byte source IP from offset 26
(014) and      #0xffffff00                       ; Mask last byte to isolate /24 subnet
(015) jeq      #0xc0a80100      jt 19   jf 16    ; Is it in 192.168.1.0/24? If yes, jump past source/dest IP check, else jump to next instruction
(016) ld       [30]                              ; Load 4 byte dest IP from offset 30
(017) and      #0xffffff00                       ; Mask last byte to isolate /24 subnet
(018) jeq      #0xc0a80100      jt 19   jf 28    ; Is it in 192.168.1.0/24? If yes, jump to next instruction, else jump to 28 (drop)

; Exclude host 192.168.1.100
(019) ld       [26]                              ; Load source IP again
(020) jeq      #0xc0a80164      jt 28   jf 21    ; Is it 192.168.1.100? If yes, jump to 28 (drop), else jump to next instruction
(021) ld       [30]                              ; Load dest IP again
(022) jeq      #0xc0a80164      jt 28   jf 23    ; Is it 192.168.1.100? If yes, jump to 28 (drop), else jump to next instruction

; Check ECN flags (ECE or CWR)
(023) ldb      [x + 27]                          ; Load 1 byte TCP flags (offset 13 of the TCP header; 27 = 14 + 13)
(024) jset     #0xc0            jt 25   jf 28    ; Is the ECE (0x40) or CWR (0x80) flag set? If yes, jump to next instruction, else jump to 28 (drop)

; Check packet length
(025) ld       #pktlen                           ; Load 4 byte packet length
(026) jgt      #0x3e8           jt 27   jf 28    ; Is it >1000 (0x3e8)? If yes, jump to next instruction (match), else jump to 28 (drop)

; Final decision
(027) ret      #262144                           ; Return 262144 (match)
(028) ret      #0                                ; Return 0 (drop)
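
For comparison, the same decision can be written as an ordinary Python predicate. This is only a sketch of the logic in the listing, not how tcpdump or the kernel evaluates it. The names advanced_filter and build_packet are hypothetical, the packet builder fills in just the fields the filter inspects, and the sketch assumes the whole packet is in the buffer so that its length stands in for #pktlen.

```python
import struct

ETH_LEN = 14
SNAPLEN = 262144
NET = 0xC0A80100       # 192.168.1.0/24
EXCLUDED = 0xC0A80164  # 192.168.1.100


def advanced_filter(pkt: bytes) -> int:
    """Return SNAPLEN on a match and 0 otherwise, like instructions 000-028."""
    # 025-026, checked first here so every later index is in bounds. The
    # result is the same because all of the conditions must hold anyway.
    if len(pkt) <= 1000:
        return 0
    if struct.unpack_from("!H", pkt, 12)[0] != 0x0800:   # 000-001: IPv4?
        return 0
    if pkt[23] != 6:                                     # 002-003: TCP?
        return 0
    if struct.unpack_from("!H", pkt, 20)[0] & 0x1FFF:    # 004-005: later fragment?
        return 0
    x = 4 * (pkt[14] & 0x0F)                             # 006: IP header length
    tcp = ETH_LEN + x                                    # start of the TCP header
    sport, dport = struct.unpack_from("!HH", pkt, tcp)   # 007-012: ports
    if not ({sport, dport} & {80, 443}):
        return 0
    src, dst = struct.unpack_from("!II", pkt, 26)        # 013-018: subnet
    if (src & 0xFFFFFF00) != NET and (dst & 0xFFFFFF00) != NET:
        return 0
    if EXCLUDED in (src, dst):                           # 019-022: excluded host
        return 0
    if not (pkt[tcp + 13] & 0xC0):                       # 023-024: ECE or CWR
        return 0
    return SNAPLEN                                       # 027: match


def build_packet(src, dst, sport, dport, flags, total=1100, frag=0x4000):
    """Hypothetical helper: a zero-padded Ethernet/IPv4/TCP frame for testing."""
    eth = bytes(12) + b"\x08\x00"
    ip = struct.pack("!BBHHHBBHII", 0x45, 0, total - ETH_LEN, 0, frag, 64, 6, 0, src, dst)
    tcp = struct.pack("!HHIIBBHHH", sport, dport, 0, 0, 0x50, flags, 0, 0, 0)
    pkt = eth + ip + tcp
    return pkt + bytes(total - len(pkt))


# 192.168.1.5 -> 8.8.8.8, destination port 443, ECE set, 1100 bytes long.
print(advanced_filter(build_packet(0xC0A80105, 0x08080808, 50000, 443, 0x40)))
# Same packet with only ACK set: no ECN flags, so it is dropped.
print(advanced_filter(build_packet(0xC0A80105, 0x08080808, 50000, 443, 0x10)))
```

Reading the two side by side shows what the bytecode buys: the Python version is easier to follow, but the BPF version can run in the kernel before the packet is ever copied to userspace.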

The BPF code is not limited to being used by tcpdump. A number of other utilities can use it. You can even create an iptables rule with a BPF filter by using the xt_bpf module! However, you have to be careful when generating the bytecode with tcpdump -ddd because the generated filter expects the packet to start with a layer 2 header, whereas the packet that iptables hands to the filter does not. To make them compatible, you have to adjust the offsets.

Furthermore, a number of auxiliary extensions expose information that can't be obtained by reading the raw packet contents, such as the packet length, the payload start offset, the CPU the packet was received on, the NetFilter mark, etc. From the filter documentation:

The Linux kernel also has a couple of BPF extensions that are used along with the class of load instructions by “overloading” the k argument with a negative offset + a particular extension offset. The result of such BPF extensions are loaded into A.

The supported BPF extensions are:

Extension	Description
len	skb->len
proto	skb->protocol
type	skb->pkt_type
poff	Payload start offset
ifidx	skb->dev->ifindex
nla	Netlink attribute of type X with offset A
nlan	Nested Netlink attribute of type X with offset A
mark	skb->mark
queue	skb->queue_mapping
hatype	skb->dev->type
rxhash	skb->hash
cpu	raw_smp_processor_id()
vlan_tci	skb_vlan_tag_get(skb)
vlan_avail	skb_vlan_tag_present(skb)
vlan_tpid	skb->vlan_proto
rand	prandom_u32()

For example, to match all packets that are received on CPU 3, you could do:

    ld #cpu
    jneq #3, drop
    ret #262144
drop:
    ret #0

Note that this is using BPF assembly syntax compatible with bpf_asm, whereas the other assembly listings here are using tcpdump syntax. The main difference is that the former's syntax uses named labels whereas the latter's BPF syntax labels each instruction with a line number. This assembly translates to the following bytecode (commas delimit instructions after the first integer, which specifies the number of instructions in the bytecode, in this case 4):

4,32 0 0 4294963236,21 0 1 3,6 0 0 262144,6 0 0 0,

This can then be used with iptables using the xt_bpf module:

iptables -A INPUT -m bpf --bytecode "4,32 0 0 4294963236,21 0 1 3,6 0 0 262144,6 0 0 0," -j CPU3

This will jump to target chain CPU3 for any packets received on that CPU.
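
The numbers in that bytecode string are not arbitrary. As a rough sketch of where they come from, the Python below assembles the same four instructions by hand. The opcode and SKF_AD_* constants are the values from the kernel's UAPI headers (linux/bpf_common.h and linux/filter.h); the function name cpu_match_bytecode is made up for this example, and in practice bpf_asm does this job.

```python
# Opcode building blocks (values from <linux/bpf_common.h>).
BPF_LD, BPF_JMP, BPF_RET = 0x00, 0x05, 0x06
BPF_W, BPF_ABS = 0x00, 0x20
BPF_JEQ, BPF_K = 0x10, 0x00

# Extension addressing (values from <linux/filter.h>).
SKF_AD_OFF = -0x1000  # negative base offset that selects the extensions
SKF_AD_CPU = 36       # extension slot for raw_smp_processor_id()


def cpu_match_bytecode(cpu: int) -> str:
    """Build the decimal 'count,code jt jf k,...' string for a CPU match."""
    insns = [
        # ld #cpu: an absolute word load whose k is the "overloaded" offset,
        # stored as an unsigned 32-bit value.
        (BPF_LD | BPF_W | BPF_ABS, 0, 0, (SKF_AD_OFF + SKF_AD_CPU) & 0xFFFFFFFF),
        # jneq #cpu, drop: encoded as jeq with the branches swapped. jt and jf
        # are relative here: 0 falls through, 1 skips one instruction.
        (BPF_JMP | BPF_JEQ | BPF_K, 0, 1, cpu),
        # ret #262144
        (BPF_RET | BPF_K, 0, 0, 262144),
        # drop: ret #0
        (BPF_RET | BPF_K, 0, 0, 0),
    ]
    body = ",".join(f"{code} {jt} {jf} {k}" for code, jt, jf, k in insns)
    return f"{len(insns)},{body},"


print(cpu_match_bytecode(3))
```

The CPU number being matched is the last field of the second instruction, so that is the only value that changes when a different CPU is wanted.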

If this seems powerful, remember that this is all cBPF. Although cBPF is translated into eBPF internally, all this is nothing compared to what raw eBPF can do!

For more information

I highly recommend you read this article to understand how tcpdump uses cBPF.

After reading that, read this explanation of how tcpdump turns expressions into bytecode.

If you want to learn everything else about it, you can always check out the source code!

The 4096-instruction limit is no longer relevant for privileged users; instead there are limits on complexity (in particular, a max of 1M insns that the verifier can check in total, when going through the different branches of the DAG). Function calls can also result in backward jumps. This page can give an overview of projects relying on eBPF. For performance tracing, I'd suggest having a look at BCC or bpftrace. — Qeole, Commented Apr 25, 2022 at 8:36
@Qeole Perhaps if you're using eBPF directly through the bpf() syscall, but I believe tcpdump uses only traditional cBPF with old-school cBPF semantics (including no backwards jumps, etc.), even if it is compiled to eBPF internally, which does not have this limit. — forest, Commented Aug 7, 2022 at 0:59
Yes, the 4096 limit still applies to cBPF (and tcpdump). In your answer, you mentioned it for eBPF specifically, and that's what my comment was based on. — Qeole, Commented Aug 7, 2022 at 12:40

Qeole · Accepted Answer · 2022-04-25 08:54:38Z

The eBPF programs are not executed in userspace. Instead, the application creates and sends an eBPF program to the kernel, which executes it.

To complement @forest's good answer, we can maybe elaborate a little on how those programs are executed.

cBPF, as used by tcpdump, has few hooks: programs can be attached to sockets, in order to run when a packet arrives (this is what tcpdump does, to filter packets received on the socket and pass only the desired ones to userspace), or they can be attached to the seccomp hooks, so as to do some filtering on system calls and their arguments.

One of the important features of eBPF is that it can be attached to a wider selection of hooks in the kernel (although it doesn't do seccomp). For networking, there are sockets, but also TC (traffic control) hooks, XDP (driver-level hooks for fast networking), or a few others. With regard to your question: programs can also be attached to tracepoints in the kernel (pre-defined hooks on some specific functions, e.g. syscalls or “important” functions in the kernel), or on kernel probes (kprobes), making them able to trace any function in the kernel (provided it was not inlined at compilation time). Then other types exist, for example LSM for security use cases.

Tracing usually relies on tracepoints or kprobes to attach an eBPF program to a function and run it every time this function is called in the kernel. The program can access the arguments of the function or (if it's attached at the exit) the return value. Through the use of maps, special kernel memory areas such as arrays or hash maps dedicated to sharing data between eBPF programs and/or user space, the programs can collect metrics or share state between consecutive runs.

For example, opensnoop from BCC will attach to the tracepoints at the entry and the exit of the open() and openat() syscalls. At the entry, it collects the path of the file being opened, and the PID of the process opening it, and stores it in a hash map. When the syscall exits, the second probe collects the return value and, based on the PID, updates the relevant entry in the hash map. Then user space can collect and dump all entries from the hash map to show what files have been opened by what processes, and what the return values were.
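
The entry/exit pattern described above can be imitated in plain Python, with a dict standing in for the hash map. To be clear, this is a simulation of the idea and not eBPF or BCC code: the class OpenTracer and its method names are invented for the illustration, and the real tool's keys and data layout may differ.

```python
class OpenTracer:
    """Toy stand-in for the two probes and the hash map they share."""

    def __init__(self):
        # pid -> {"path": ..., "ret": ...}; plays the role of the hash map.
        self.opens = {}

    def on_enter(self, pid, path):
        # Entry probe: record which process is opening which file.
        self.opens[pid] = {"path": path, "ret": None}

    def on_exit(self, pid, ret):
        # Exit probe: find the entry by PID and attach the return value.
        entry = self.opens.get(pid)
        if entry is not None:
            entry["ret"] = ret

    def dump(self):
        # "User space" side: read every entry back out of the map.
        return [(pid, e["path"], e["ret"]) for pid, e in sorted(self.opens.items())]


tracer = OpenTracer()
tracer.on_enter(100, "/etc/hostname")
tracer.on_exit(100, 3)    # success: the syscall returned a file descriptor
tracer.on_enter(200, "/no/such/file")
tracer.on_exit(200, -2)   # failure: -ENOENT
for row in tracer.dump():
    print(row)
```

The map is needed because the two probes are separate program runs: the path is only visible at entry and the return value only at exit, so some shared state has to connect them.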

https://ebpf.io/ is a nice place to get started with eBPF.
