What is BPF?
BPF (or more commonly, the extended version, eBPF) is a language that was originally used exclusively for filtering packets, but it is capable of quite a lot more. On Linux, it can be used for many other things, including system call filters for security, choosing processes to kill when the system runs out of memory, and sophisticated performance monitoring, as you pointed out. While Windows did add eBPF support, that is not what Windows' perfmon utility uses. Windows only added support for compatibility with non-Windows utilities that rely on OS support for eBPF.
The eBPF programs are not executed in userspace. Instead, the application creates and sends an eBPF program to the kernel, which executes it. It is actually machine code for a virtual processor that is implemented in the form of an interpreter in the kernel, although it can also use JIT compilation to enhance performance considerably. The program has access to some basic interfaces in the kernel, including those related to performance and networking. The eBPF program then communicates with the kernel to provide it the computational results (such as dropping a packet).
Restrictions on eBPF programs
In order to protect against denial-of-service attacks or accidental crashes, the kernel verifies the code before it is compiled or run. The code is subject to several important checks:
The program consists of no more than 4096 instructions in total for unprivileged users.
Backwards jumps cannot occur, with the exception of bounded loops and function calls.
There are no instructions that are always unreachable.
The upshot is that the verifier must be able to prove that the eBPF program halts. It hasn't found a solution to the halting problem, of course, which is why it only accepts programs that it knows will halt. To do this, it represents the program as a directed acyclic graph. In addition to this, it tries to prevent information leaks and out-of-bounds memory access by preventing the actual value of a pointer from being revealed while still allowing limited operations to be performed on it:
Pointers cannot be compared, stored, or returned as a value that can be examined.
Pointer arithmetic can only be done against a scalar (a value not derived from a pointer).
No pointer arithmetic can result in pointing outside the designated memory map.
The verifier is rather complex and does far more, although it has itself been the source of serious security bugs, at least when the bpf(2) syscall is not disabled for unprivileged users.
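To make the first group of checks concrete, here is a toy model in Python. It is only a sketch and not how the kernel's verifier is written: the names `toy_verify` and `MAX_INSNS` and the `(opcode, jump_targets)` instruction format are made up for illustration, and it ignores everything the real verifier does about bounded loops, function calls, and pointer tracking. It only shows why "size limit, forward jumps only, no dead code" is enough to know that a program halts.

```python
MAX_INSNS = 4096  # limit quoted above for unprivileged users


def toy_verify(program):
    """Return "ok" or a short reason why the toy checks reject the program.

    Each instruction is (opcode, jump_targets). jump_targets holds absolute
    instruction indices; an empty list means "fall through to the next one".
    """
    if not 0 < len(program) <= MAX_INSNS:
        return "program size not allowed"
    reachable = {0}
    for index, (opcode, targets) in enumerate(program):
        if index not in reachable:
            return f"instruction {index} is unreachable"
        if opcode == "ret":
            continue  # a return ends this path
        for target in (targets or [index + 1]):
            if target <= index:
                return f"backward jump at instruction {index}"
            if target >= len(program):
                return f"instruction {index} leaves the program"
            reachable.add(target)
    return "ok"


# Forward-only control flow: accepted.
print(toy_verify([("ld", []), ("jeq", [2, 3]), ("ret", []), ("ret", [])]))
# A jump back to instruction 0 could loop forever: rejected.
print(toy_verify([("ld", []), ("jeq", [0, 2]), ("ret", [])]))
```

Because every edge points forward, the control flow graph is acyclic, so each path visits every instruction at most once and must end.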
Viewing the code
The dst host 192.168.1.0 component of the command is not BPF. That is just syntax which is used by tcpdump. However, the command you give it is used to generate a BPF program which is then sent to the kernel. Note that it is not eBPF which is used in this case, but the older cBPF. There are several important differences between the two (although the kernel internally converts cBPF into eBPF). The -d flag can be used to see the cBPF code that is to be sent to the kernel:
# tcpdump -i eth0 "dst host 192.168.1.0" -d
(000) ldh [12]
(001) jeq #0x800 jt 2 jf 4
(002) ld [30]
(003) jeq #0xc0a80100 jt 8 jf 9
(004) jeq #0x806 jt 6 jf 5
(005) jeq #0x8035 jt 6 jf 9
(006) ld [38]
(007) jeq #0xc0a80100 jt 8 jf 9
(008) ret #262144
(009) ret #0
More complicated filters result in more complicated bytecode. Try some of the examples in the manpage and append the -d flag to see what bytecode would be loaded into the kernel. In order to understand how to read the disassembly, review the BPF filter documentation. If you're reading an eBPF program, you should take a look at the eBPF instruction set for the virtual CPU.
Understanding the code
For simplicity, I'll assume you specified a destination IP of 192.168.1.1 instead of 192.168.1.0 and wanted to match IPv4 only, which shrinks the code quite a bit as it no longer has to handle ARP (0x806) and RARP (0x8035):
# tcpdump -i eth0 "dst host 192.168.1.1 and ip" -d
(000) ldh [12]
(001) jeq #0x800 jt 2 jf 5
(002) ld [30]
(003) jeq #0xc0a80101 jt 4 jf 5
(004) ret #262144
(005) ret #0
Let's walk through what the above bytecode actually does. Each time a packet is received on the interface specified, the BPF bytecode is run. The packet contents (including the Ethernet header, if applicable) are put in a buffer that the BPF code has access to. If the packet matches the filter, the code returns the number of bytes of the packet to keep, which is tcpdump's snapshot length (262144 bytes by default); otherwise it returns 0.
Let's assume you are running this filter and it receives a packet sending an ICMP message with an empty payload from 192.168.1.142 to 192.168.1.1. The destination MAC is aa:aa:aa:aa:aa:aa and the source MAC is bb:bb:bb:bb:bb:bb (the destination address comes first in an Ethernet header). The contents of the Ethernet frame, in hexadecimal, are:
aa aa aa aa aa aa bb bb bb bb bb bb 08 00 45 00
00 1c 77 71 40 00 40 01 3f 92 c0 a8 01 8e c0 a8
01 01 08 00 c1 c0 36 0e 00 01
Note that the 7 byte Ethernet preamble and single byte start frame delimiter are excluded.
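If you want to check the offsets used in the walkthrough yourself, the frame can be picked apart with a few lines of Python. This is just a sketch using the standard library; the constant name `ETH_HLEN` is my own, and it assumes an untagged Ethernet frame (no VLAN header).

```python
import ipaddress
import struct

# The example frame from above, without preamble and start frame delimiter.
FRAME = bytes.fromhex(
    "aa aa aa aa aa aa bb bb bb bb bb bb 08 00 45 00 "
    "00 1c 77 71 40 00 40 01 3f 92 c0 a8 01 8e c0 a8 "
    "01 01 08 00 c1 c0 36 0e 00 01 "
)

ETH_HLEN = 14  # two 6-byte MAC addresses plus the 2-byte EtherType

# "!" selects network (big-endian) byte order, "H" is an unsigned half-word.
ethertype = struct.unpack_from("!H", FRAME, 12)[0]
ip_header_len = 4 * (FRAME[ETH_HLEN] & 0x0F)  # IHL nibble counts 32-bit words
protocol = FRAME[ETH_HLEN + 9]                # IPv4 protocol field
src_ip = ipaddress.IPv4Address(FRAME[ETH_HLEN + 12:ETH_HLEN + 16])
dst_ip = ipaddress.IPv4Address(FRAME[ETH_HLEN + 16:ETH_HLEN + 20])

print(hex(ethertype))   # 0x800, IPv4
print(ip_header_len)    # 20
print(protocol)         # 1, ICMP
print(src_ip, dst_ip)   # 192.168.1.142 192.168.1.1
print(hex(int(dst_ip))) # 0xc0a80101, the immediate the filter compares against
```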
Instruction 0 is ldh [12]. This loads a half-word (two bytes) located at an offset of 12 bytes into the packet into the A register. In an Ethernet header, this is the location of the EtherType field, which specifies the protocol that is being encapsulated. In our case, this is the value 0x0800 (remember that network data is always big-endian). The value 0x0800 indicates the IPv4 protocol (other possible values include 0x86dd for IPv6, 0x0806 for ARP, and 0x88e5 for MACsec).
Instruction 1 is jeq #0x800, which will compare an immediate with the value in the A register. If they are equal, it will jump to instruction 2, otherwise 5. The value 0x800 at that offset in the Ethernet frame specifies the IPv4 protocol. Because the comparison evaluates true, the code now jumps to instruction 2. If the payload was not IPv4, it would have jumped to 5.
Instruction 2 is ld [30]. This loads an entire 4-byte word at an offset of 30 into the A register. In our Ethernet frame, this is 0xc0a80101. The Ethernet header that was passed to us is 14 bytes (two 6-byte MAC addresses and a 2-byte EtherType field), so offset 30 into our Ethernet frame is offset 16 into the IPv4 header, which is the 4-byte destination address in big-endian format.
Instruction 3, jeq #0xc0a80101, will compare an immediate against the contents of the A register and will jump to 4 if true, otherwise 5. This value is the destination address (0xc0a80101 is the big-endian representation of 192.168.1.1, as 0xc0 is 192, 0xa8 is 168, 0x01 is 1, and 0x01 is 1). The values do indeed match, so the program counter is now set to 4.
Instruction 4 is ret #262144. This terminates the BPF program and returns the integer 262144. This tells the kernel that the program that loaded the filter, tcpdump in this case, would like the packet. The kernel passes it to the program which parses it, writing the information to your terminal or saving the contents of the packet to a pcap file. Performance (and security!) is vastly improved because tcpdump is not forced to tediously parse each and every packet that is received.
Instruction 5 is ret #0 and is not reached for our Ethernet frame. If the destination address did not match what the filter was looking for or the protocol type was not IPv4, the code would have jumped here instead, returning 0 to signify there was no match.
This is all just a way to return 262144 if the half-word at offset 12 into the packet is 0x800 AND the word at offset 30 is 0xc0a80101, and return 0 otherwise. Because this is all done in the kernel (optionally after being converted into native machine code by the JIT engine), no expensive context switches or passing buffers between kernelspace and userspace are required, so the filter is fast. This repeats with each packet that the kernel handles, although the filter can be installed only for ingress, egress, only on a certain interface, etc. In the above example, it is installed for both ingress and egress on eth0, and the filter will be executed on each packet passing through it.
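To see the walkthrough in motion, here is a minimal sketch of an interpreter that understands only the four mnemonics used in this listing. It is a Python illustration, not the kernel's implementation: `run_filter` and the tuple format are my own invention, the jump targets are the absolute indices that tcpdump prints (the real encoding stores relative offsets), and I am assuming the classic BPF behaviour that a load past the end of the packet ends the program with 0.

```python
import struct

FRAME = bytes.fromhex(
    "aa aa aa aa aa aa bb bb bb bb bb bb 08 00 45 00 "
    "00 1c 77 71 40 00 40 01 3f 92 c0 a8 01 8e c0 a8 "
    "01 01 08 00 c1 c0 36 0e 00 01 "
)

# (mnemonic, operand k, jump-if-true, jump-if-false), as printed by tcpdump -d
PROGRAM = [
    ("ldh", 12, None, None),
    ("jeq", 0x800, 2, 5),
    ("ld", 30, None, None),
    ("jeq", 0xC0A80101, 4, 5),
    ("ret", 262144, None, None),
    ("ret", 0, None, None),
]


def run_filter(program, packet):
    """Run the toy program and return the number of bytes to keep (0 = drop)."""
    a = 0   # the A register (accumulator)
    pc = 0  # program counter
    while True:
        op, k, jt, jf = program[pc]
        if op in ("ldh", "ld"):
            size, fmt = (2, "!H") if op == "ldh" else (4, "!I")
            if k + size > len(packet):
                return 0  # out-of-bounds load: treat as "no match"
            a = struct.unpack_from(fmt, packet, k)[0]  # big-endian load
            pc += 1
        elif op == "jeq":
            pc = jt if a == k else jf
        elif op == "ret":
            return k
        else:
            raise ValueError(f"unsupported instruction: {op}")


print(run_filter(PROGRAM, FRAME))  # 262144: the example frame matches

# Same frame, but addressed to 192.168.1.2 instead: dropped.
other = FRAME[:33] + b"\x02" + FRAME[34:]
print(run_filter(PROGRAM, other))  # 0
```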
More advanced examples
All filters generated by tcpdump work like this, with varying levels of complexity. Even complex filters are nothing more than a series of instructions checking various values at various offsets into the packet. For example, this is a filter that captures only ECN packets on the 192.168.1.0/24 subnet, excluding 192.168.1.100, with a size larger than 1000 bytes going to or from port 80 or 443:
tcpdump -i eth0 'ip and tcp port (80 or 443) and net 192.168.1/24 and not host 192.168.1.100 and tcp[tcpflags] & (tcp-ece|tcp-cwr) != 0 and len > 1000'
This results in the following assembly listing (annotated by me):
; Check EtherType
(000) ldh [12] ; Load 2 byte EtherType from offset 12
(001) jeq #0x800 jt 2 jf 28 ; Is it IPv4? If yes, jump to next instruction, else jump to 28 (drop)
; Check IP protocol
(002) ldb [23] ; Load a single byte from IP header offset 9 (protocol field)
(003) jeq #0x6 jt 4 jf 28 ; Is it TCP (6)? If yes, jump to next instruction, else jump to 28 (drop)
; Skip fragmented IP datagrams
(004) ldh [20] ; Load 2 bytes from IP header offset 6 (flags and fragment offset)
(005) jset #0x1fff jt 28 jf 6 ; Is the 13-bit fragment offset nonzero? If yes, this is a later fragment without a TCP header; jump to 28 (drop), else jump to next instruction
; Calculate IP header length
(006) ldxb 4*([14]&0xf) ; Load the IP header length field from offset 14, multiply by 4 and save the result (IP header length in bytes) in X
; Check source and destination port
(007) ldh [x + 14] ; Load TCP source port (offset 0 of the TCP header, which starts at x + 14)
(008) jeq #0x50 jt 13 jf 9 ; Is it 80 (0x50)? If yes, jump past dest port checks and source port check for 443, else jump to next instruction
(009) jeq #0x1bb jt 13 jf 10 ; Is it 443 (0x1bb)? If yes, jump past dest port checks, else jump to next instruction
(010) ldh [x + 16] ; Load TCP dest port (2 bytes into the TCP header)
(011) jeq #0x50 jt 13 jf 12 ; Is it 80? If yes, jump past the check for port 443, else jump to next instruction
(012) jeq #0x1bb jt 13 jf 28 ; Is it 443? If yes, jump to next instruction, else jump to 28 (drop)
; Check source and destination address
(013) ld [26] ; Load 4 byte source IP from offset 26
(014) and #0xffffff00 ; Mask last byte to isolate /24 subnet
(015) jeq #0xc0a80100 jt 19 jf 16 ; Is it in 192.168.1.0/24? If yes, jump past source/dest IP check, else jump to next instruction
(016) ld [30] ; Load 4 byte dest IP from offset 30
(017) and #0xffffff00 ; Mask last byte to isolate /24 subnet
(018) jeq #0xc0a80100 jt 19 jf 28 ; Is it in 192.168.1.0/24? If yes, jump to next instruction, else jump to 28 (drop)
; Exclude host 192.168.1.100
(019) ld [26] ; Load source IP again
(020) jeq #0xc0a80164 jt 28 jf 21 ; Is it 192.168.1.100? If yes, jump to 28 (drop), else jump to next instruction
(021) ld [30] ; Load dest IP again
(022) jeq #0xc0a80164 jt 28 jf 23 ; Is it 192.168.1.100? If yes, jump to 28 (drop), else jump to next instruction
; Check ECN flags (ECE or CWR)
(023) ldb [x + 27] ; Load 1 byte TCP flags (13 bytes into the TCP header)
(024) jset #0xc0 jt 25 jf 28 ; Is the ECE (0x40) or CWR (0x80) flag set? If yes, jump to next instruction, else jump to 28 (drop)
; Check packet length
(025) ld #pktlen ; Load 4 byte packet length
(026) jgt #0x3e8 jt 27 jf 28 ; Is it >1000 (0x3e8)? If yes, jump to next instruction (match), else jump to 28 (drop)
; Final decision
(027) ret #262144 ; Return 262144 (match)
(028) ret #0 ; Return 0 (drop)
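It can help to see the same decision written as ordinary code. The following Python function is my own hand-written mirror of the listing, not something tcpdump generates. The names `matches` and `build_frame` are hypothetical, `build_frame` fills in only the header fields the filter looks at, and the sketch assumes the bytes it is given are the whole packet (the real `#pktlen` is the packet length as seen by the kernel, not the captured length).

```python
import ipaddress
import struct


def matches(frame):
    """Mirror of the filter above, using the same offsets into the frame."""
    if len(frame) < 34:                                  # Ethernet + minimal IPv4
        return False
    if struct.unpack_from("!H", frame, 12)[0] != 0x0800:  # (000)-(001) IPv4?
        return False
    if frame[23] != 6:                                   # (002)-(003) TCP?
        return False
    if struct.unpack_from("!H", frame, 20)[0] & 0x1FFF:  # (004)-(005) later fragment
        return False
    x = 4 * (frame[14] & 0x0F)                           # (006) IP header length
    if len(frame) < x + 28:                              # need bytes up to x + 27
        return False
    sport, dport = struct.unpack_from("!HH", frame, x + 14)  # (007)-(012)
    if not {sport, dport} & {80, 443}:
        return False
    src, dst = struct.unpack_from("!II", frame, 26)      # (013)-(018)
    if 0xC0A80100 not in (src & 0xFFFFFF00, dst & 0xFFFFFF00):
        return False
    if 0xC0A80164 in (src, dst):                         # (019)-(022)
        return False
    if not frame[x + 27] & 0xC0:                         # (023)-(024) ECE or CWR
        return False
    return len(frame) > 1000                             # (025)-(026)


def build_frame(src, dst, sport, dport, flags, payload_len):
    """Hypothetical test helper: Ethernet + 20-byte IPv4 + 20-byte TCP + zeros."""
    eth = bytes(12) + struct.pack("!H", 0x0800)
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40 + payload_len, 0, 0, 64, 6, 0,
                     ipaddress.IPv4Address(src).packed,
                     ipaddress.IPv4Address(dst).packed)
    tcp = struct.pack("!HHIIBBHHH", sport, dport, 0, 0, 0x50, flags, 0, 0, 0)
    return eth + ip + tcp + bytes(payload_len)


# ACK (0x10) plus ECE (0x40) from the subnet to port 443 with a 1000-byte payload.
print(matches(build_frame("192.168.1.5", "93.184.216.34", 50000, 443, 0x50, 1000)))
# Same packet from the excluded host.
print(matches(build_frame("192.168.1.100", "93.184.216.34", 50000, 443, 0x50, 1000)))
```

The first call prints True and the second prints False. Note how the three variable offsets all come from X plus the fixed 14-byte Ethernet header: x + 14 and x + 16 are the two ports at the start of the TCP header, and x + 27 is the flags byte 13 bytes into it.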
The BPF code is not limited to being used by tcpdump. A number of other utilities can use it. You can even create an iptables rule with a BPF filter by using the xt_bpf module! However, you have to be careful when generating the bytecode with tcpdump -ddd because it expects to consume a layer 2 header, whereas iptables does not. To make them compatible, you have to adjust the offsets.
Furthermore, a number of auxiliary extensions expose information that can't be obtained by reading the raw packet contents, such as the packet length, the payload start offset, the CPU the packet was received on, the NetFilter mark, etc. From the filter documentation:
The Linux kernel also has a couple of BPF extensions that are used along with the class of load instructions by “overloading” the k argument with a negative offset + a particular extension offset. The result of such BPF extensions are loaded into A.
The supported BPF extensions are:
Extension  | Description
len        | skb->len
proto      | skb->protocol
type       | skb->pkt_type
poff       | Payload start offset
ifidx      | skb->dev->ifindex
nla        | Netlink attribute of type X with offset A
nlan       | Nested Netlink attribute of type X with offset A
mark       | skb->mark
queue      | skb->queue_mapping
hatype     | skb->dev->type
rxhash     | skb->hash
cpu        | raw_smp_processor_id()
vlan_tci   | skb_vlan_tag_get(skb)
vlan_avail | skb_vlan_tag_present(skb)
vlan_tpid  | skb->vlan_proto
rand       | prandom_u32()
For example, to match all packets that are received on CPU 3, you could do:
ld #cpu
jneq #3, drop
ret #262144
drop:
ret #0
Note that this is using BPF assembly syntax compatible with bpf_asm, whereas the other assembly listings here are using tcpdump syntax. The main difference is that the former's syntax uses named labels whereas the latter's BPF syntax labels each instruction with a line number. This assembly translates to the following bytecode (commas delimit instructions after the first integer, which specifies the number of instructions in the bytecode, in this case 4):
4,32 0 0 4294963236,21 0 1 3,6 0 0 262144,6 0 0 0,
This can then be used with iptables using the xt_bpf module:
iptables -A INPUT -m bpf --bytecode "4,32 0 0 4294963236,21 0 1 3,6 0 0 262144,6 0 0 0," -j CPU3
This will jump to target chain CPU3 for any packets received on that CPU.
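The odd-looking 4294963236 in the first instruction is the "negative offset + extension offset" trick from the documentation quote. Here is a small Python sketch that recomputes it and splits a bytecode string into its fields. The function names are my own; the constants `SKF_AD_OFF` (-0x1000) and `SKF_AD_CPU` (36) are taken from the kernel's linux/filter.h as I remember it, so double-check them against your headers.

```python
SKF_AD_OFF = -0x1000  # base of the ancillary "address space"
SKF_AD_CPU = 36       # extension offset for the cpu extension

# Opcode values for the three instruction kinds used in the example.
OPCODES = {
    0x20: "ld (word, absolute)",  # BPF_LD | BPF_W | BPF_ABS
    0x15: "jeq (immediate)",      # BPF_JMP | BPF_JEQ | BPF_K
    0x06: "ret (immediate)",      # BPF_RET | BPF_K
}


def ancillary_k(extension_offset):
    """Return the k operand as the unsigned 32-bit value printed in bytecode."""
    return (SKF_AD_OFF + extension_offset) & 0xFFFFFFFF


def parse_bytecode(text):
    """Split 'count,code jt jf k,...' into a list of (code, jt, jf, k) tuples."""
    fields = [field for field in text.split(",") if field.strip()]
    count = int(fields[0])
    insns = [tuple(int(number) for number in field.split()) for field in fields[1:]]
    if count != len(insns) or any(len(insn) != 4 for insn in insns):
        raise ValueError("malformed bytecode string")
    return insns


print(ancillary_k(SKF_AD_CPU))  # 4294963236

# A two-instruction example: load the CPU number, then drop everything.
for code, jt, jf, k in parse_bytecode("2,32 0 0 4294963236,6 0 0 0,"):
    print(OPCODES.get(code, hex(code)), jt, jf, k)
```

Unlike the tcpdump listings, jt and jf in this format are relative: they count how many instructions to skip after the jump.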
If this seems powerful, remember that this is all cBPF. Although cBPF is translated into eBPF internally, all this is nothing compared to what raw eBPF can do!
For more information
I highly recommend you read this article to understand how tcpdump uses cBPF.
After reading that, read this explanation of how tcpdump turns expressions into bytecode.
If you want to learn everything else about it, you can always check out the source code!