Debugging Linux issues with eBPF
One incident from start to finish with dynamic tracing applied
Ivan Babrou
Performance @ Cloudflare
What does Cloudflare do
● CDN: moving content physically closer to visitors with our CDN
○ Intelligent caching
○ Unlimited DDoS mitigation
○ Unlimited bandwidth at flat pricing with free plans
○ Edge access control
○ IPFS gateway
○ Onion service
● Website Optimization: making the web fast and up to date for everyone
○ TLS 1.3 (with 0-RTT)
○ HTTP/2 + QUIC
○ Server push
○ AMP
○ Origin load-balancing
○ Smart routing
○ Serverless / Edge Workers
○ Post-quantum crypto
● DNS: Cloudflare is the fastest managed DNS provider in the world
○ 1.1.1.1
○ 2606:4700:4700::1111
○ DNS over TLS
Cloudflare’s anycast network
● 160+ data centers globally
● 4.5M+ DNS requests/s across authoritative, recursive and internal
● 10% of Internet requests every day
● 10M+ HTTP requests/second
● 10M+ websites, apps & APIs in 150 countries
● 20Tbps network capacity
Cloudflare’s anycast network (daily ironic numbers)
● 350B+ DNS requests/day across authoritative, recursive and internal
● 800B+ HTTP requests/day
● 1.73Ebpd network capacity (20 Tbps × 86,400 s/day ≈ 1.73 exabits per day)
Link to slides with speaker notes
Slideshare doesn’t allow links on the first 3 slides
Cloudflare is a Debian shop
● All machines were running Debian Jessie on bare metal
● OS boots over PXE into memory, packages and configs are ephemeral
● The kernel can be swapped as easily as the OS
● New stable (Stretch) came out, and we wanted to keep up
● Very easy to upgrade:
○ Build all packages for both distributions
○ Upgrade machines in groups, look at metrics, fix issues, repeat
○ Gradually phase out Jessie
○ Pop a bottle of champagne and celebrate
Cloudflare core Kafka platform at the time
● Kafka is a distributed log with multiple producers and consumers
● 3 clusters: 2 small (dns + logs) with 9 nodes, 1 big (http) with 106 nodes
● 2 x 10C Intel Xeon E5-2630 v4 @ 2.2GHz (40 logical CPUs), 128GB RAM
● 12 x 800GB SSD in RAID0
● 2 x 10G bonded NIC
● Mostly network bound at ~100Gbps ingress and ~700Gbps egress
● Check out our blog post on Kafka compression
● We also blogged about our Gen 9 edge machines recently
Small clusters went OK, the big one did not
One node upgraded to Stretch
Perf to the rescue: “perf top -F 99”
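perf top samples on-CPU functions live. Sampling at 99 Hz rather than a round 100 Hz avoids sampling in lockstep with periodic kernel activity; a minimal sketch of the invocation:

# Sample on-CPU functions across all CPUs at 99 Hz and show the hottest ones live
$ sudo perf top -F 99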
RCU stalls in dmesg
[ 4923.462841] INFO: rcu_sched self-detected stall on CPU
[ 4923.462843] 13-...: (2 GPs behind) idle=ea7/140000000000001/0 softirq=1/2 fqs=4198
[ 4923.462845] (t=8403 jiffies g=110722 c=110721 q=6440)
Error logging issues
Aug 15 21:51:35 myhost kernel: INFO: rcu_sched detected stalls on CPUs/tasks:
Aug 15 21:51:35 myhost kernel: 26-...: (1881 ticks this GP) idle=76f/140000000000000/0 softirq=8/8 fqs=365
Aug 15 21:51:35 myhost kernel: (detected by 0, t=2102 jiffies, g=1837293, c=1837292, q=262)
Aug 15 21:51:35 myhost kernel: Task dump for CPU 26:
Aug 15 21:51:35 myhost kernel: java R running task 13488 1714 1513 0x00080188
Aug 15 21:51:35 myhost kernel: ffffc9000d1f7898 ffffffff814ee977 ffff88103f410400 000000000000000a
Aug 15 21:51:35 myhost kernel: 0000000000000041 ffffffff82203142 ffffc9000d1f78c0 ffffffff814eea10
Aug 15 21:51:35 myhost kernel: 0000000000000041 ffffffff82203142 ffff88103f410400 ffffc9000d1f7920
Aug 15 21:51:35 myhost kernel: Call Trace:
Aug 15 21:51:35 myhost kernel: [<ffffffff814ee977>] ? scrup+0x147/0x160
Aug 15 21:51:35 myhost kernel: [<ffffffff814eea10>] ? lf+0x80/0x90
Aug 15 21:51:35 myhost kernel: [<ffffffff814eecb5>] ? vt_console_print+0x295/0x3c0
Page allocation failures
Aug 16 01:14:51 myhost systemd-journald[13812]: Missed 17171 kernel messages
Aug 16 01:14:51 myhost kernel: [<ffffffff81171754>] shrink_inactive_list+0x1f4/0x4f0
Aug 16 01:14:51 myhost kernel: [<ffffffff8117234b>] shrink_node_memcg+0x5bb/0x780
Aug 16 01:14:51 myhost kernel: [<ffffffff811725e2>] shrink_node+0xd2/0x2f0
Aug 16 01:14:51 myhost kernel: [<ffffffff811728ef>] do_try_to_free_pages+0xef/0x310
Aug 16 01:14:51 myhost kernel: [<ffffffff81172be5>] try_to_free_pages+0xd5/0x180
Aug 16 01:14:51 myhost kernel: [<ffffffff811632db>] __alloc_pages_slowpath+0x31b/0xb80
...
[78991.546088] systemd-network: page allocation stalls for 287000ms, order:0, mode:0x24200ca(GFP_HIGHUSER_MOVABLE)
Downgrade and investigate
● System CPU was up, so it must be the kernel upgrade
● Downgrade Stretch to Jessie
● Downgrade Linux 4.9 to 4.4 (known good, but no allocation stall logging)
● Investigate without affecting customers
● Bisection pointed at OS upgrade, kernel was not responsible
Make a flamegraph with perf
#!/bin/sh -e

# flamegraph-perf [perf args here] > flamegraph.svg
# Explicitly setting output and input to perf.data is needed to make perf work over ssh without TTY.
perf record -o perf.data "$@"

# Fetch JVM stack maps if possible, this requires -XX:+PreserveFramePointer
export JAVA_HOME=/usr/lib/jvm/oracle-java8-jdk-amd64 AGENT_HOME=/usr/local/perf-map-agent
/usr/local/flamegraph/jmaps 1>&2

# Filter out idle CPU stacks so they don't dominate the flamegraph
IDLE_REGEXPS="^swapper;.*(cpuidle|cpu_idle|cpu_bringup_and_idle|native_safe_halt|xen_hypercall_sched_op|xen_hypercall_vcpu_op)"

perf script -i perf.data | /usr/local/flamegraph/stackcollapse-perf.pl --all | grep -E -v "$IDLE_REGEXPS" | /usr/local/flamegraph/flamegraph.pl --colors=java --hash --title=$(hostname)
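Assuming the script above is installed as flamegraph-perf, a hypothetical invocation to profile the whole system for a minute might look like this (the arguments are passed straight through to perf record):

# All CPUs (-a), with call graphs (-g), sampled at 99 Hz for 60 seconds
$ sudo flamegraph-perf -F 99 -a -g -- sleep 60 > flamegraph.svg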
Full system flamegraphs point at sendfile
[flamegraphs: Jessie vs Stretch side by side, with sendfile standing out on Stretch]
Enhance
[zoomed flamegraph: sendfile on Stretch spending its time in spinlocks]
eBPF and BCC tools
Latency of sendfile on Jessie: mostly under 31us
$ sudo /usr/share/bcc/tools/funclatency -uTi 1 do_sendfile
Tracing 1 functions for "do_sendfile"... Hit Ctrl-C to end.
23:27:25
usecs : count distribution
0 -> 1 : 9 | |
2 -> 3 : 47 |**** |
4 -> 7 : 53 |***** |
8 -> 15 : 379 |****************************************|
16 -> 31 : 329 |********************************** |
32 -> 63 : 101 |********** |
64 -> 127 : 23 |** |
128 -> 255 : 50 |***** |
256 -> 511 : 7 | |
Latency of sendfile on Stretch: mostly under 511us, with a tail out to 8ms
usecs : count distribution
0 -> 1 : 1 | |
2 -> 3 : 20 |*** |
4 -> 7 : 46 |******* |
8 -> 15 : 56 |******** |
16 -> 31 : 65 |********** |
32 -> 63 : 75 |*********** |
64 -> 127 : 75 |*********** |
128 -> 255 : 258 |****************************************|
256 -> 511 : 144 |********************** |
512 -> 1023 : 24 |*** |
1024 -> 2047 : 27 |**** |
2048 -> 4095 : 28 |**** |
4096 -> 8191 : 35 |***** |
Number of mod_timer runs
# Jessie
$ sudo /usr/share/bcc/tools/funccount -T -i 1 mod_timer
Tracing 1 functions for "mod_timer"... Hit Ctrl-C to end.
00:33:36
FUNC COUNT
mod_timer 60482
00:33:37
FUNC COUNT
mod_timer 58263
00:33:38
FUNC COUNT
mod_timer 54626

# Stretch
$ sudo /usr/share/bcc/tools/funccount -T -i 1 mod_timer
Tracing 1 functions for "mod_timer"... Hit Ctrl-C to end.
00:33:28
FUNC COUNT
mod_timer 149068
00:33:29
FUNC COUNT
mod_timer 155994
00:33:30
FUNC COUNT
mod_timer 160688
Number of lock_timer_base runs
# Jessie
$ sudo /usr/share/bcc/tools/funccount -T -i 1 lock_timer_base
Tracing 1 functions for "lock_timer_base"... Hit Ctrl-C to end.
00:32:36
FUNC COUNT
lock_timer_base 15962
00:32:37
FUNC COUNT
lock_timer_base 16261
00:32:38
FUNC COUNT
lock_timer_base 15806

# Stretch
$ sudo /usr/share/bcc/tools/funccount -T -i 1 lock_timer_base
Tracing 1 functions for "lock_timer_base"... Hit Ctrl-C to end.
00:32:32
FUNC COUNT
lock_timer_base 119189
00:32:33
FUNC COUNT
lock_timer_base 196895
00:32:34
FUNC COUNT
lock_timer_base 140085
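To see where all those extra lock_timer_base calls come from, one could also aggregate kernel stacks directly with bcc’s stackcount; a sketch (the next slides use the timer tracepoints instead):

# Aggregate kernel stacks leading to lock_timer_base, printed every 10 seconds
$ sudo /usr/share/bcc/tools/stackcount -i 10 lock_timer_base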
We can trace timer tracepoints with perf
$ sudo perf list | fgrep timer:
timer:hrtimer_cancel [Tracepoint event]
timer:hrtimer_expire_entry [Tracepoint event]
timer:hrtimer_expire_exit [Tracepoint event]
timer:hrtimer_init [Tracepoint event]
timer:hrtimer_start [Tracepoint event]
timer:itimer_expire [Tracepoint event]
timer:itimer_state [Tracepoint event]
timer:tick_stop [Tracepoint event]
timer:timer_cancel [Tracepoint event]
timer:timer_expire_entry [Tracepoint event]
timer:timer_expire_exit [Tracepoint event]
timer:timer_init [Tracepoint event]
timer:timer_start [Tracepoint event]
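Before recording individual events, a cheap aggregate view is to count all timer tracepoints at once; a sketch:

# Count every timer:* tracepoint system-wide for 10 seconds
$ sudo perf stat -e 'timer:*' -a sleep 10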
Number of timers per function
# Jessie
$ sudo perf record -e timer:timer_start -p 23485 -- sleep 10 && \
  sudo perf script | sed 's/.*function=//g' | awk '{ print $1 }' | sort | uniq -c
[ perf record: Woken up 54 times to write data ]
[ perf record: Captured and wrote 17.778 MB perf.data (173520 samples) ]
2 clocksource_watchdog
5 cursor_timer_handler
2 dev_watchdog
10 garp_join_timer
2 ixgbe_service_timer
4769 tcp_delack_timer
171 tcp_keepalive_timer
168512 tcp_write_timer

# Stretch
$ sudo perf record -e timer:timer_start -p 3416 -- sleep 10 && \
  sudo perf script | sed 's/.*function=//g' | awk '{ print $1 }' | sort | uniq -c
[ perf record: Woken up 671 times to write data ]
[ perf record: Captured and wrote 198.273 MB perf.data (1988650 samples) ]
6 clocksource_watchdog
12 cursor_timer_handler
2 dev_watchdog
18 garp_join_timer
4 ixgbe_service_timer
4622 tcp_delack_timer
1 tcp_keepalive_timer
1983978 tcp_write_timer
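The same question can be answered in-kernel with bcc: stackcount also attaches to tracepoints (with the t: prefix), so one could aggregate the stacks that arm timers without writing perf.data to disk; a sketch:

# Count kernel stacks leading to timer:timer_start, printed every 10 seconds
$ sudo /usr/share/bcc/tools/stackcount -i 10 t:timer:timer_start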
Timer flamegraphs comparison
[timer flamegraphs: Jessie vs Stretch, with tcp_push_one dominating on Stretch]
Number of calls for hot functions
# Jessie
$ sudo /usr/share/bcc/tools/funccount -T -i 1 tcp_sendmsg
Tracing 1 functions for "tcp_sendmsg"... Hit Ctrl-C to end.
03:33:33
FUNC COUNT
tcp_sendmsg 21166

$ sudo /usr/share/bcc/tools/funccount -T -i 1 tcp_push_one
Tracing 1 functions for "tcp_push_one"... Hit Ctrl-C to end.
03:37:14
FUNC COUNT
tcp_push_one 496

# Stretch
$ sudo /usr/share/bcc/tools/funccount -T -i 1 tcp_sendmsg
Tracing 1 functions for "tcp_sendmsg"... Hit Ctrl-C to end.
03:33:30
FUNC COUNT
tcp_sendmsg 53834

$ sudo /usr/share/bcc/tools/funccount -T -i 1 tcp_push_one
Tracing 1 functions for "tcp_push_one"... Hit Ctrl-C to end.
03:37:10
FUNC COUNT
tcp_push_one 64483
Count stacks leading to tcp_push_one
$ sudo stackcount -i 10 tcp_push_one
Stacks for tcp_push_one (stackcount)
tcp_push_one
inet_sendpage
kernel_sendpage
sock_sendpage
pipe_to_sendpage
__splice_from_pipe
splice_from_pipe
generic_splice_sendpage
direct_splice_actor
splice_direct_to_actor
do_splice_direct
do_sendfile
sys_sendfile64
do_syscall_64
return_from_SYSCALL_64
4950
tcp_push_one
inet_sendmsg
sock_sendmsg
kernel_sendmsg
sock_no_sendpage
tcp_sendpage
inet_sendpage
kernel_sendpage
sock_sendpage
pipe_to_sendpage
__splice_from_pipe
splice_from_pipe
generic_splice_sendpage
...
return_from_SYSCALL_64
735110
Diff of the most popular stack
--- jessie.txt 2017-08-16 21:14:13.000000000 -0700
+++ stretch.txt 2017-08-16 21:14:20.000000000 -0700
@@ -1,4 +1,9 @@
tcp_push_one
+inet_sendmsg
+sock_sendmsg
+kernel_sendmsg
+sock_no_sendpage
+tcp_sendpage
inet_sendpage
kernel_sendpage
sock_sendpage
Let’s look at tcp_sendpage
int tcp_sendpage(struct sock *sk, struct page *page, int offset, size_t size, int flags)
{
    ssize_t res;

    if (!(sk->sk_route_caps & NETIF_F_SG) ||
        !sk_check_csum_caps(sk))
        return sock_no_sendpage(sk->sk_socket, page, offset, size, flags);

    lock_sock(sk);
    tcp_rate_check_app_limited(sk); /* is sending application-limited? */
    res = do_tcp_sendpages(sk, page, offset, size, flags);
    release_sock(sk);
    return res;
}
sock_no_sendpage is what we see on the stack, which means the NETIF_F_SG check (scatter-gather, a prerequisite for segmentation offload) is failing.
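If this fallback really is the problem, sock_no_sendpage should be hot on Stretch and quiet on Jessie; one way to confirm is the same funccount tool as before (a sketch):

# Count sock_no_sendpage calls per second
$ sudo /usr/share/bcc/tools/funccount -T -i 1 sock_no_sendpage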
Cloudflare network setup

eth2 -->|              |--> vlan10
        |---> bond0 -->|
eth3 -->|              |--> vlan100
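One way to inspect this topology on a host is ip with details enabled; a sketch:

# Show bonding and vlan details for the involved interfaces
$ ip -d link show bond0
$ ip -d link show vlan10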
Missing offload settings

eth2 -->|              |--> vlan10
        |---> bond0 -->|
eth3 -->|              |--> vlan100

(offloads were enabled on eth2, eth3 and bond0, but not on the vlan interfaces on top of the bond)
Compare ethtool -k settings on vlan10
-tx-checksumming: off
+tx-checksumming: on
- tx-checksum-ip-generic: off
+ tx-checksum-ip-generic: on
-scatter-gather: off
- tx-scatter-gather: off
+scatter-gather: on
+ tx-scatter-gather: on
-tcp-segmentation-offload: off
- tx-tcp-segmentation: off [requested on]
- tx-tcp-ecn-segmentation: off [requested on]
- tx-tcp-mangleid-segmentation: off [requested on]
- tx-tcp6-segmentation: off [requested on]
-udp-fragmentation-offload: off [requested on]
-generic-segmentation-offload: off [requested on]
+tcp-segmentation-offload: on
+ tx-tcp-segmentation: on
+ tx-tcp-ecn-segmentation: on
+ tx-tcp-mangleid-segmentation: on
+ tx-tcp6-segmentation: on
+udp-fragmentation-offload: on
+generic-segmentation-offload: on
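A diff like the one above can be produced with process substitution; the hostnames here are hypothetical:

# Compare offload settings between a Jessie node and a Stretch node
$ diff -u <(ssh kafka-jessie ethtool -k vlan10) <(ssh kafka-stretch ethtool -k vlan10)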
Ha! Easy fix, let’s just enable it:
$ sudo ethtool -K vlan10 sg on
Actual changes:
tx-checksumming: on
tx-checksum-ip-generic: on
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: on
tx-tcp-mangleid-segmentation: on
tx-tcp6-segmentation: on
udp-fragmentation-offload: on
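The same fix applies to the second vlan interface, and the result is easy to verify; a sketch:

# Enable scatter-gather on the other vlan and check which features stuck
$ sudo ethtool -K vlan100 sg on
$ ethtool -k vlan10 | grep -E 'scatter-gather|segmentation-offload|checksumming'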
R in SRE stands for Reboot
Kafka restarted
It was a bug in systemd all along
Logs cluster effect
[graph annotated with “Stretch upgrade” and “Offload fixed” markers]
DNS cluster effect
[graph annotated with “Stretch upgrade” and “Offload fixed” markers]
Lessons learned
● It’s important to pay close attention to seemingly unrelated metrics
● The Linux kernel can be easily traced with perf and bcc tools
○ Tools work out of the box
○ You don’t have to be a developer
● TCP offload is incredibly important and applies to vlan interfaces
● Switching OS on reboot proved to be useful
But really it was just an excuse
● Internal blog post about this is from Aug 2017
● External blog post on the Cloudflare blog is from May 2018
● All to show where ebpf_exporter can be useful
○ Our tool to export hidden kernel metrics with eBPF
○ Can trace any kernel function and hardware counters
○ IO latency histograms, timer counters, TCP retransmits, etc.
○ Exports data in Prometheus (OpenMetrics) format
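A sketch of consuming the exported metrics by hand, assuming the exporter is running with a timer-counting config on its default port (9435 per the Prometheus port allocations; both details are assumptions here):

# Scrape ebpf_exporter and look at timer-related counters
$ curl -s http://localhost:9435/metrics | grep -i timer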
Can be nicely visualized with new Grafana
[Grafana heatmap: disk upgrade in production]
Thank you
● Blog post this talk is based on
● Github for ebpf_exporter: https://github.com/cloudflare/ebpf_exporter
● Slides for ebpf_exporter talk with presenter notes (and a blog post)
○ Disclaimer: contains statistical dinosaur gifs
● Training on ebpf_exporter with Alexander Huynh
○ Look for “Hidden Linux Metrics with Prometheus eBPF Exporter”
○ Wednesday, Oct 31st, 11:45 - 12:30, Cumberland room 3-4
● We’re hiring
Ivan on twitter: @ibobrik
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Markus Eisele
 
We keep hearing that “integration” is old news, with modern architectures and platforms promising frictionless connectivity. So, is enterprise integration really dead? Not exactly! In this session, we’ll talk about how AI-infused applications and tool-calling agents are redefining the concept of integration, especially when combined with the power of Apache Camel. We will discuss the the role of enterprise integration in an era where Large Language Models (LLMs) and agent-driven automation can interpret business needs, handle routing, and invoke Camel endpoints with minimal developer intervention. You will see how these AI-enabled systems help weave business data, applications, and services together giving us flexibility and freeing us from hardcoding boilerplate of integration flows. You’ll walk away with: An updated perspective on the future of “integration” in a world driven by AI, LLMs, and intelligent agents. Real-world examples of how tool-calling functionality can transform Camel routes into dynamic, adaptive workflows. Code examples how to merge AI capabilities with Apache Camel to deliver flexible, event-driven architectures at scale. Roadmap strategies for integrating LLM-powered agents into your enterprise, orchestrating services that previously demanded complex, rigid solutions. Join us to see why rumours of integration’s relevancy have been greatly exaggerated—and see first hand how Camel, powered by AI, is quietly reinventing how we connect the enterprise.
Dark Dynamism: drones, dark factories and deurbanization
Dark Dynamism: drones, dark factories and deurbanizationDark Dynamism: drones, dark factories and deurbanization
Dark Dynamism: drones, dark factories and deurbanization
Jakub Šimek
 
Startup villages are the next frontier on the road to network states. This book aims to serve as a practical guide to bootstrap a desired future that is both definite and optimistic, to quote Peter Thiel’s framework. Dark Dynamism is my second book, a kind of sequel to Bespoke Balajisms I published on Kindle in 2024. The first book was about 90 ideas of Balaji Srinivasan and 10 of my own concepts, I built on top of his thinking. In Dark Dynamism, I focus on my ideas I played with over the last 8 years, inspired by Balaji Srinivasan, Alexander Bard and many people from the Game B and IDW scenes.
Zilliz Cloud Monthly Technical Review: May 2025
Zilliz Cloud Monthly Technical Review: May 2025Zilliz Cloud Monthly Technical Review: May 2025
Zilliz Cloud Monthly Technical Review: May 2025
Zilliz
 
About this webinar Join our monthly demo for a technical overview of Zilliz Cloud, a highly scalable and performant vector database service for AI applications Topics covered - Zilliz Cloud's scalable architecture - Key features of the developer-friendly UI - Security best practices and data privacy - Highlights from recent product releases This webinar is an excellent opportunity for developers to learn about Zilliz Cloud's capabilities and how it can support their AI projects. Register now to join our community and stay up-to-date with the latest vector database technology.
Com fer un pla de gestió de dades amb l'eiNa DMP (en anglès)
Com fer un pla de gestió de dades amb l'eiNa DMP (en anglès)Com fer un pla de gestió de dades amb l'eiNa DMP (en anglès)
Com fer un pla de gestió de dades amb l'eiNa DMP (en anglès)
CSUC - Consorci de Serveis Universitaris de Catalunya
 
Presentació de la formació "How to write a data management plan with eiNa DMP?"
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Mike Mingos
 
In an era where ships are floating data centers and cybercriminals sail the digital seas, the maritime industry faces unprecedented cyber risks. This presentation, delivered by Mike Mingos during the launch ceremony of Optima Cyber, brings clarity to the evolving threat landscape in shipping — and presents a simple, powerful message: cybersecurity is not optional, it’s strategic. Optima Cyber is a joint venture between: • Optima Shipping Services, led by shipowner Dimitris Koukas, • The Crime Lab, founded by former cybercrime head Manolis Sfakianakis, • Panagiotis Pierros, security consultant and expert, • and Tictac Cyber Security, led by Mike Mingos, providing the technical backbone and operational execution. The event was honored by the presence of Greece’s Minister of Development, Mr. Takis Theodorikakos, signaling the importance of cybersecurity in national maritime competitiveness. 🎯 Key topics covered in the talk: • Why cyberattacks are now the #1 non-physical threat to maritime operations • How ransomware and downtime are costing the shipping industry millions • The 3 essential pillars of maritime protection: Backup, Monitoring (EDR), and Compliance • The role of managed services in ensuring 24/7 vigilance and recovery • A real-world promise: “With us, the worst that can happen… is a one-hour delay” Using a storytelling style inspired by Steve Jobs, the presentation avoids technical jargon and instead focuses on risk, continuity, and the peace of mind every shipping company deserves. 🌊 Whether you’re a shipowner, CIO, fleet operator, or maritime stakeholder, this talk will leave you with: • A clear understanding of the stakes • A simple roadmap to protect your fleet • And a partner who understands your business 📌 Visit: https://optima-cyber.com https://tictac.gr https://mikemingos.gr
Shoehorning dependency injection into a FP language, what does it take?
Shoehorning dependency injection into a FP language, what does it take?Shoehorning dependency injection into a FP language, what does it take?
Shoehorning dependency injection into a FP language, what does it take?
Eric Torreborre
 
This talks shows why dependency injection is important and how to support it in a functional programming language like Unison where the only abstraction available is its effect system.
AI 3-in-1: Agents, RAG, and Local Models - Brent Laster
AI 3-in-1: Agents, RAG, and Local Models - Brent LasterAI 3-in-1: Agents, RAG, and Local Models - Brent Laster
AI 3-in-1: Agents, RAG, and Local Models - Brent Laster
All Things Open
 
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
Lorenzo Miniero
 
Top 5 Benefits of Using Molybdenum Rods in Industrial Applications.pptx
Top 5 Benefits of Using Molybdenum Rods in Industrial Applications.pptxTop 5 Benefits of Using Molybdenum Rods in Industrial Applications.pptx
Top 5 Benefits of Using Molybdenum Rods in Industrial Applications.pptx
mkubeusa
 
AI x Accessibility UXPA by Stew Smith and Olivier Vroom
AI x Accessibility UXPA by Stew Smith and Olivier VroomAI x Accessibility UXPA by Stew Smith and Olivier Vroom
AI x Accessibility UXPA by Stew Smith and Olivier Vroom
UXPA Boston
 
Building the Customer Identity Community, Together.pdf
Building the Customer Identity Community, Together.pdfBuilding the Customer Identity Community, Together.pdf
Building the Customer Identity Community, Together.pdf
Cheryl Hung
 
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Christian Folini
 
On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...
On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...
On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...
Ivano Malavolta
 
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptxReimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
John Moore
 
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Maarten Verwaest
 
Q1 2025 Dropbox Earnings and Investor Presentation
Q1 2025 Dropbox Earnings and Investor PresentationQ1 2025 Dropbox Earnings and Investor Presentation
Q1 2025 Dropbox Earnings and Investor Presentation
Dropbox
 
Unlocking Generative AI in your Web Apps
Unlocking Generative AI in your Web AppsUnlocking Generative AI in your Web Apps
Unlocking Generative AI in your Web Apps
Maximiliano Firtman
 
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdfKit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Wonjun Hwang
 
Smart Investments Leveraging Agentic AI for Real Estate Success.pptx
Smart Investments Leveraging Agentic AI for Real Estate Success.pptxSmart Investments Leveraging Agentic AI for Real Estate Success.pptx
Smart Investments Leveraging Agentic AI for Real Estate Success.pptx
Seasia Infotech
 
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Markus Eisele
 
Dark Dynamism: drones, dark factories and deurbanization
Dark Dynamism: drones, dark factories and deurbanizationDark Dynamism: drones, dark factories and deurbanization
Dark Dynamism: drones, dark factories and deurbanization
Jakub Šimek
 
Zilliz Cloud Monthly Technical Review: May 2025
Zilliz Cloud Monthly Technical Review: May 2025Zilliz Cloud Monthly Technical Review: May 2025
Zilliz Cloud Monthly Technical Review: May 2025
Zilliz
 
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Mike Mingos
 
Shoehorning dependency injection into a FP language, what does it take?
Shoehorning dependency injection into a FP language, what does it take?Shoehorning dependency injection into a FP language, what does it take?
Shoehorning dependency injection into a FP language, what does it take?
Eric Torreborre
 
Ad

Debugging Linux issues with eBPF

  • 13. Page allocation failures
Aug 16 01:14:51 myhost systemd-journald[13812]: Missed 17171 kernel messages
Aug 16 01:14:51 myhost kernel: [<ffffffff81171754>] shrink_inactive_list+0x1f4/0x4f0
Aug 16 01:14:51 myhost kernel: [<ffffffff8117234b>] shrink_node_memcg+0x5bb/0x780
Aug 16 01:14:51 myhost kernel: [<ffffffff811725e2>] shrink_node+0xd2/0x2f0
Aug 16 01:14:51 myhost kernel: [<ffffffff811728ef>] do_try_to_free_pages+0xef/0x310
Aug 16 01:14:51 myhost kernel: [<ffffffff81172be5>] try_to_free_pages+0xd5/0x180
Aug 16 01:14:51 myhost kernel: [<ffffffff811632db>] __alloc_pages_slowpath+0x31b/0xb80
...
[78991.546088] systemd-network: page allocation stalls for 287000ms, order:0, mode:0x24200ca(GFP_HIGHUSER_MOVABLE)
  • 14. Downgrade and investigate
● System CPU was up, so it must be the kernel upgrade
● Downgrade Stretch to Jessie
● Downgrade Linux 4.9 to 4.4 (known good, but no allocation stall logging)
● Investigate without affecting customers
● Bisection pointed at OS upgrade, kernel was not responsible
  • 15. Make a flamegraph with perf
#!/bin/sh -e
# flamegraph-perf [perf args here] > flamegraph.svg
# Explicitly setting output and input to perf.data is needed to make perf work over ssh without TTY.
perf record -o perf.data "$@"

# Fetch JVM stack maps if possible, this requires -XX:+PreserveFramePointer
export JAVA_HOME=/usr/lib/jvm/oracle-java8-jdk-amd64 AGENT_HOME=/usr/local/perf-map-agent
/usr/local/flamegraph/jmaps 1>&2

IDLE_REGEXPS="^swapper;.*(cpuidle|cpu_idle|cpu_bringup_and_idle|native_safe_halt|xen_hypercall_sched_op|xen_hypercall_vcpu_op)"

perf script -i perf.data |
  /usr/local/flamegraph/stackcollapse-perf.pl --all |
  grep -E -v "$IDLE_REGEXPS" |
  /usr/local/flamegraph/flamegraph.pl --colors=java --hash --title=$(hostname)
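To make the wrapper concrete, here is a plausible invocation (the wrapper name comes from the comment in the script itself; the sampling frequency, duration, and call-graph flags are illustrative assumptions, not necessarily what we ran):
# Sample all CPUs at 99 Hz with call graphs for 60 seconds,
# then write the SVG rendered by the pipeline above.
sudo ./flamegraph-perf -F 99 -a -g -- sleep 60 > flamegraph.svg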
  • 16. Full system flamegraphs point at sendfile (flamegraphs: Jessie vs. Stretch, sendfile highlighted)
  • 19. eBPF and BCC tools
  • 20. Latency of sendfile on Jessie: < 31us
$ sudo /usr/share/bcc/tools/funclatency -uTi 1 do_sendfile
Tracing 1 functions for "do_sendfile"... Hit Ctrl-C to end.
23:27:25
     usecs       : count    distribution
       0 -> 1    : 9        |                                        |
       2 -> 3    : 47       |****                                    |
       4 -> 7    : 53       |*****                                   |
       8 -> 15   : 379      |****************************************|
      16 -> 31   : 329      |**********************************      |
      32 -> 63   : 101      |**********                              |
      64 -> 127  : 23       |**                                      |
     128 -> 255  : 50       |*****                                   |
     256 -> 511  : 7        |                                        |
  • 21. Latency of sendfile on Stretch: < 511us
     usecs       : count    distribution
       0 -> 1    : 1        |                                        |
       2 -> 3    : 20       |***                                     |
       4 -> 7    : 46       |*******                                 |
       8 -> 15   : 56       |********                                |
      16 -> 31   : 65       |**********                              |
      32 -> 63   : 75       |***********                             |
      64 -> 127  : 75       |***********                             |
     128 -> 255  : 258      |****************************************|
     256 -> 511  : 144      |**********************                  |
     512 -> 1023 : 24       |***                                     |
    1024 -> 2047 : 27       |****                                    |
    2048 -> 4095 : 28       |****                                    |
    4096 -> 8191 : 35       |*****                                   |
  • 22. Number of mod_timer runs
# Jessie
$ sudo /usr/share/bcc/tools/funccount -T -i 1 mod_timer
Tracing 1 functions for "mod_timer"... Hit Ctrl-C to end.
00:33:36
FUNC        COUNT
mod_timer   60482
00:33:37
FUNC        COUNT
mod_timer   58263
00:33:38
FUNC        COUNT
mod_timer   54626

# Stretch
$ sudo /usr/share/bcc/tools/funccount -T -i 1 mod_timer
Tracing 1 functions for "mod_timer"... Hit Ctrl-C to end.
00:33:28
FUNC        COUNT
mod_timer   149068
00:33:29
FUNC        COUNT
mod_timer   155994
00:33:30
FUNC        COUNT
mod_timer   160688
  • 23. Number of lock_timer_base runs
# Jessie
$ sudo /usr/share/bcc/tools/funccount -T -i 1 lock_timer_base
Tracing 1 functions for "lock_timer_base"... Hit Ctrl-C to end.
00:32:36
FUNC              COUNT
lock_timer_base   15962
00:32:37
FUNC              COUNT
lock_timer_base   16261
00:32:38
FUNC              COUNT
lock_timer_base   15806

# Stretch
$ sudo /usr/share/bcc/tools/funccount -T -i 1 lock_timer_base
Tracing 1 functions for "lock_timer_base"... Hit Ctrl-C to end.
00:32:32
FUNC              COUNT
lock_timer_base   119189
00:32:33
FUNC              COUNT
lock_timer_base   196895
00:32:34
FUNC              COUNT
lock_timer_base   140085
  • 24. We can trace timer tracepoints with perf
$ sudo perf list | fgrep timer:
  timer:hrtimer_cancel        [Tracepoint event]
  timer:hrtimer_expire_entry  [Tracepoint event]
  timer:hrtimer_expire_exit   [Tracepoint event]
  timer:hrtimer_init          [Tracepoint event]
  timer:hrtimer_start         [Tracepoint event]
  timer:itimer_expire         [Tracepoint event]
  timer:itimer_state          [Tracepoint event]
  timer:tick_stop             [Tracepoint event]
  timer:timer_cancel          [Tracepoint event]
  timer:timer_expire_entry    [Tracepoint event]
  timer:timer_expire_exit     [Tracepoint event]
  timer:timer_init            [Tracepoint event]
  timer:timer_start           [Tracepoint event]
  • 25. Number of timers per function
# Jessie
$ sudo perf record -e timer:timer_start -p 23485 -- sleep 10 && \
  sudo perf script | sed 's/.* function=//g' | awk '{ print $1 }' | sort | uniq -c
[ perf record: Woken up 54 times to write data ]
[ perf record: Captured and wrote 17.778 MB perf.data (173520 samples) ]
      2 clocksource_watchdog
      5 cursor_timer_handler
      2 dev_watchdog
     10 garp_join_timer
      2 ixgbe_service_timer
   4769 tcp_delack_timer
    171 tcp_keepalive_timer
 168512 tcp_write_timer

# Stretch
$ sudo perf record -e timer:timer_start -p 3416 -- sleep 10 && \
  sudo perf script | sed 's/.* function=//g' | awk '{ print $1 }' | sort | uniq -c
[ perf record: Woken up 671 times to write data ]
[ perf record: Captured and wrote 198.273 MB perf.data (1988650 samples) ]
      6 clocksource_watchdog
     12 cursor_timer_handler
      2 dev_watchdog
     18 garp_join_timer
      4 ixgbe_service_timer
   4622 tcp_delack_timer
      1 tcp_keepalive_timer
1983978 tcp_write_timer
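As an aside, if you only need the total event count rather than the per-function breakdown above, perf stat is a lighter-weight alternative (a sketch we did not use in the talk, since we wanted perf script output to aggregate by callback):
# Count timer:timer_start events system-wide for 10 seconds; no perf.data is written.
sudo perf stat -e timer:timer_start -a -- sleep 10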
  • 27. Number of calls for hot functions
# Jessie
$ sudo /usr/share/bcc/tools/funccount -T -i 1 tcp_sendmsg
Tracing 1 functions for "tcp_sendmsg"... Hit Ctrl-C to end.
03:33:33
FUNC          COUNT
tcp_sendmsg   21166

$ sudo /usr/share/bcc/tools/funccount -T -i 1 tcp_push_one
Tracing 1 functions for "tcp_push_one"... Hit Ctrl-C to end.
03:37:14
FUNC           COUNT
tcp_push_one   496

# Stretch
$ sudo /usr/share/bcc/tools/funccount -T -i 1 tcp_sendmsg
Tracing 1 functions for "tcp_sendmsg"... Hit Ctrl-C to end.
03:33:30
FUNC          COUNT
tcp_sendmsg   53834

$ sudo /usr/share/bcc/tools/funccount -T -i 1 tcp_push_one
Tracing 1 functions for "tcp_push_one"... Hit Ctrl-C to end.
03:37:10
FUNC           COUNT
tcp_push_one   64483
  • 28. Count stacks leading to tcp_push_one
$ sudo stackcount -i 10 tcp_push_one
  • 29. Stacks for tcp_push_one (stackcount)
tcp_push_one
inet_sendpage
kernel_sendpage
sock_sendpage
pipe_to_sendpage
__splice_from_pipe
splice_from_pipe
generic_splice_sendpage
direct_splice_actor
splice_direct_to_actor
do_splice_direct
do_sendfile
sys_sendfile64
do_syscall_64
return_from_SYSCALL_64
  4950

tcp_push_one
inet_sendmsg
sock_sendmsg
kernel_sendmsg
sock_no_sendpage
tcp_sendpage
inet_sendpage
kernel_sendpage
sock_sendpage
pipe_to_sendpage
__splice_from_pipe
splice_from_pipe
generic_splice_sendpage
...
return_from_SYSCALL_64
  735110
  • 30. Diff of the most popular stack
--- jessie.txt  2017-08-16 21:14:13.000000000 -0700
+++ stretch.txt 2017-08-16 21:14:20.000000000 -0700
@@ -1,4 +1,9 @@
 tcp_push_one
+inet_sendmsg
+sock_sendmsg
+kernel_sendmsg
+sock_no_sendpage
+tcp_sendpage
 inet_sendpage
 kernel_sendpage
 sock_sendpage
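For reference, a diff like this can be produced by capturing stackcount output on each host and diffing the saved files (a minimal sketch; the file names match the diff headers above, but the exact capture commands are an assumption):
# On the Jessie node:
sudo /usr/share/bcc/tools/stackcount -i 10 tcp_push_one > jessie.txt
# On the Stretch node:
sudo /usr/share/bcc/tools/stackcount -i 10 tcp_push_one > stretch.txt
# Then, with both files on one machine:
diff -u jessie.txt stretch.txt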
  • 31. Let’s look at tcp_sendpage
int tcp_sendpage(struct sock *sk, struct page *page, int offset,
                 size_t size, int flags)
{
        ssize_t res;

        if (!(sk->sk_route_caps & NETIF_F_SG) ||        /* slide callout: segmentation offload */
            !sk_check_csum_caps(sk))
                return sock_no_sendpage(sk->sk_socket,  /* slide callout: what we see on the stack */
                                        page, offset, size, flags);

        lock_sock(sk);

        tcp_rate_check_app_limited(sk); /* is sending application-limited? */

        res = do_tcp_sendpages(sk, page, offset, size, flags);
        release_sock(sk);

        return res;
}
  • 32. Cloudflare network setup
eth2 -->|              |--> vlan10
        |---> bond0 -->|
eth3 -->|              |--> vlan100
  • 33. Missing offload settings
eth2 -->|              |--> vlan10
        |---> bond0 -->|
eth3 -->|              |--> vlan100
  • 34. Compare ethtool -k settings on vlan10
-tx-checksumming: off
+tx-checksumming: on
-    tx-checksum-ip-generic: off
+    tx-checksum-ip-generic: on
-scatter-gather: off
-    tx-scatter-gather: off
+scatter-gather: on
+    tx-scatter-gather: on
-tcp-segmentation-offload: off
-    tx-tcp-segmentation: off [requested on]
-    tx-tcp-ecn-segmentation: off [requested on]
-    tx-tcp-mangleid-segmentation: off [requested on]
-    tx-tcp6-segmentation: off [requested on]
-udp-fragmentation-offload: off [requested on]
-generic-segmentation-offload: off [requested on]
+tcp-segmentation-offload: on
+    tx-tcp-segmentation: on
+    tx-tcp-ecn-segmentation: on
+    tx-tcp-mangleid-segmentation: on
+    tx-tcp6-segmentation: on
+udp-fragmentation-offload: on
+generic-segmentation-offload: on
  • 35. Ha! Easy fix, let’s just enable it:
$ sudo ethtool -K vlan10 sg on
Actual changes:
tx-checksumming: on
    tx-checksum-ip-generic: on
tcp-segmentation-offload: on
    tx-tcp-segmentation: on
    tx-tcp-ecn-segmentation: on
    tx-tcp-mangleid-segmentation: on
    tx-tcp6-segmentation: on
udp-fragmentation-offload: on
  • 36. R in SRE stands for Reboot (chart annotation: Kafka restarted)
  • 37. It was a bug in systemd all along
  • 38. Logs cluster effect (chart annotations: Stretch upgrade, offload fixed)
  • 39. DNS cluster effect (chart annotations: Stretch upgrade, offload fixed)
  • 40. Lessons learned
● It’s important to pay closer attention to seemingly unrelated metrics
● Linux kernel can be easily traced with perf and bcc tools
  ○ Tools work out of the box
  ○ You don’t have to be a developer
● TCP offload is incredibly important and applies to vlan interfaces
● Switching OS on reboot proved to be useful
  • 41. But really it was just an excuse
● Internal blog post about this is from Aug 2017
● External blog post on the Cloudflare blog is from May 2018
● All to show where ebpf_exporter can be useful
  ○ Our tool to export hidden kernel metrics with eBPF
  ○ Can trace any kernel function and hardware counters
  ○ IO latency histograms, timer counters, TCP retransmits, etc.
  ○ Exports data in Prometheus (OpenMetrics) format
  • 42. Can be nicely visualized with new Grafana (heatmap: disk upgrade in production)
  • 43. Thank you
● Blog post this talk is based on
● GitHub for ebpf_exporter: https://github.com/cloudflare/ebpf_exporter
● Slides for the ebpf_exporter talk with presenter notes (and a blog post)
  ○ Disclaimer: contains statistical dinosaur gifs
● Training on ebpf_exporter with Alexander Huynh
  ○ Look for “Hidden Linux Metrics with Prometheus eBPF Exporter”
  ○ Wednesday, Oct 31st, 11:45 - 12:30, Cumberland room 3-4
● We’re hiring
Ivan on twitter: @ibobrik

Editor's Notes

  • #2: Hello! Today we’re going to go through one production issue from start to finish and see how we can apply dynamic tracing to get to the bottom of the problem.
  • #3: My name is Ivan and I work for a company called Cloudflare, where I focus on performance and efficiency of our products.
  • #4: To give you some context, these are some key areas Cloudflare specializes in. In addition to being a good old CDN service with free unlimited DDOS protection, we try to be at the forefront of innovation with technologies like TLS v1.3, QUIC and edge workers, making the internet faster and more secure for end users and website owners. We’re also the fastest authoritative and recursive DNS provider. Our resolver 1.1.1.1 is privacy oriented and supports things like DNS over TLS, stopping intermediaries from knowing your DNS requests, not to mention DNSSEC. If you have a website of any size, you should totally put it behind Cloudflare.
  • #5: Here are some numbers to give you an idea of the scale we operate on. We have 160 datacenters around the world and plan to grow to at least 200 next year. At peak these datacenters process more than 10 million HTTP requests per second. At the same time the very same datacenters serve 4.5 million DNS requests per second across internal and external DNS. That’s a lot of data to analyze and we collect logs into core datacenters for processing and analytics.
  • #6: I often get frustrated when people show numbers that are not scaled to seconds. I figured I cannot beat them, so I may as well join them. Here you see numbers per day. My favorite one is network capacity, which is 1.73 exabytes per day. As you can see, these numbers make no sense. It gets even weirder when different metrics are scaled to different time units. Please don’t use this as a reference, always scale down to seconds.
  • #8: Now to set the scene for this talk specifically, it makes sense to say a little about our hardware and software stack. All machines serving traffic and doing backend analytics are bare metal servers running Debian; at that point in time we were running Jessie. We’re big fans of ephemeral stuff, and not a single machine has an OS installed on persistent storage. Instead, we boot a minimal immutable initramfs from the network and install all packages and configs on top of that into ramfs with a configuration management system. This means that on reboot every machine is clean, and the OS and kernel can be swapped with just a reboot. The story starts with my personal desire to update Debian to the latest Stable release, which was Stretch at that time. Our plan for this upgrade was quite simple because of our setup: build all necessary packages for both distributions, switch a group of machines to Stretch, fix what’s broken, and carry on to the next group. No need to wipe disks, reinstall anything or deal with dependency issues. We only needed to build one OS image, as opposed to one image per workload. On the edge every machine is the same, so that part was trivial. In core datacenters, where backend out-of-band processing happens, we have different machines doing different workloads, which means a more diverse set of metrics to look at, but we can also switch some groups completely faster.
  • #9: One such group was the set of our Kafka clusters. If you’re not familiar with Kafka, it’s basically a distributed log system: multiple producers append messages to topics and multiple consumers read those logs. For the most part we’re using it as a queue with a large on-disk buffer that can buy us time to fix issues in consumers without losing data. We have three major clusters: DNS and Logs are small with just 9 nodes each, and HTTP is massive with 106 nodes. You can see the specs for the HTTP cluster at that time on the slides: 128GB of RAM and two Broadwell Xeon CPUs in a NUMA setup with 40 logical CPUs. We opted for 12 SSDs in RAID0 to prevent IO thrashing from consumers falling out of page cache. Disk-level redundancy is absent in favor of larger usable disk space and higher throughput; we rely on 3x replication instead. In terms of network we had 2x10G NICs in a bonded setup for maximum network throughput. It was not intended to provide any redundancy. We used to have a lot of issues with being network bound, but in the end that was solved by aggressive compression with zstd. Funnily enough, we later opted for 2x25G NICs, just because they are cheaper, even though we are not network bound anymore. Check out our blog post about Kafka compression or a recent one about Gen 9 edge servers if you want to learn more.
  • #10: So we did our upgrade on the small Kafka clusters and it went pretty well; at least nobody said anything and user-facing metrics looked good. If you were listening to talks yesterday, that’s what apparently should be alerted on, so no alerts fired. On the big HTTP cluster, however, we started seeing issues with consumers timing out and lagging, so we looked closer at the metrics we had. And this is what we saw: one upgraded node was using a lot more CPU than before, 5x more in fact. By itself this is not as big of an issue; you can see that we’re not stressing the CPUs that much. Typical Kafka CPU usage before this upgrade was around 3 logical CPUs out of 40, which leaves a lot of room. Still, 5x CPU usage was definitely an unexpected outcome. For control datapoints, we compared the problematic machine to another machine where no upgrade happened, and to an intermediary node that received a full software stack upgrade on reboot, but not the OS upgrade, which we optimistically bundled with a minor kernel upgrade. Neither of these two nodes experienced the same CPU saturation issues, even though their setups were practically identical.
  • #11: For debugging CPU saturation issues, we depend on the linux perf command to find the cause. It’s included with the kernel, and on end-user distributions you can install it with a package like linux-base. The first question that comes to mind when we see CPU saturation issues is what is using the CPU. In tools like top we can see which processes occupy the CPU, but with perf you can see which functions inside these processes sit on the CPU the most. This covers kernel and user space for well-behaved programs that have a way to decode stacks; that includes C/C++ with frame pointers, and Go. Here you can see top-like output from perf with the most expensive functions in terms of CPU time. Sorting is a bit confusing, because it sorts by inclusive time, but we’re mostly interested in the “self” column, which shows how often the very tip of the stack is on the CPU. In this case most of the time is taken by some spinlock slowpath. Spinlocks in the kernel exist to protect critical sections from concurrent access. There are two reasons to use them: the critical section is small and not contended, or the lock owner cannot sleep (interrupts, for example, cannot). If a spinlock cannot be acquired, the caller burns CPU until it can get hold of the lock. While it may sound like a questionable idea at first, there are legitimate uses for this mechanism. In our situation it seems like the spinlock is really contended and half of the CPU cycles are not doing useful work. We don’t know from this output which lock is causing this, however. There were also other symptoms, so let’s look at them first.
  • #12: If anything bad happens in production, it’s always a good idea to have a look at dmesg. Messages there can be cryptic, but they can at least point you in the right direction; fixing an issue is 95% knowing where to find it. In this particular case we saw RCU stalls, where RCU stands for read-copy-update. I’m not exactly an expert in this, but it sounds like another synchronization mechanism, and it can be affected by the spinlocks we saw before. We’ve seen rare RCU stalls before, and our (suboptimal) solution was to reboot the machine if no other issues could be found. 99% of the time a reboot fixed the issue for a long time. However, one can only handle so many reboots before the problem becomes severe enough to warrant a deep dive. In this case we had other clues.
  • #13: While looking deeper into dmesg, we noticed issues around writing messages to the console. This suggested that we were logging too many errors, and the actual failure may be earlier in the process. Armed with this knowledge, we looked at the very beginning of the message chain.
  • #14: And this is what we saw. If you work with NUMA machines, you may immediately see “shrink_node” and have a minor PTSD episode. What you should be looking at is the number of missed kernel messages: there were so many errors, journald wasn’t able to keep up. We have console access to work around that, and that’s where we saw the page allocation stalls in the second log excerpt. You don’t want your page allocations to stall for 5 minutes, especially when it’s an order-zero allocation, the smallest allocation of one 4 KiB page.
  • #15: Comparing to our control nodes, the only two possible explanations were a minor kernel upgrade and the switch from Debian Jessie to Debian Stretch. We suspected the former, since the CPU usage implied a kernel issue. Just to be safe, we rolled the kernel back from 4.9 to a known-good 4.4 and downgraded the affected nodes back to Debian Jessie. This was a reasonable compromise, since we needed to minimize downtime on production nodes. Then we proceeded to look into the issue in isolation. To our surprise, after some bisecting we found that the OS upgrade alone was responsible for our issues; the kernel was off the hook. Now all that remained was to find out what exactly was going on.
  • #16: Flamegraphs are a great way to visualize stacks that cause CPU usage in the system. We have a wrapper around Brendan Gregg’s flamegraph scripts that removes idle time and enables JVM stacks out of the box. This gives us a way to get an overview of CPU usage in one command.
  • #17: And this is what full system flamegraphs look like. We have Jessie in the background on the left and Stretch in the foreground on the right. This may be hard to see, but the idea is that each bar is a stack frame, and width corresponds to the frequency of this stack’s appearance, which is a proxy for CPU usage. You can see a fat column of frames on Stretch that’s not present on Jessie. We can see it’s the sendfile syscall, and it’s highlighted in purple. It’s also present and highlighted on Jessie, but it’s tiny and quite hard to see. Flamegraphs allow you to click on a frame, which zooms into stacks containing that frame, generating a sort of sub-flamegraph.
  • #18: So let’s click on sendfile on Stretch and see what’s going on.
  • #19: This is what we saw. For somebody who’s not a kernel developer this just looks like a bunch of TCP stuff, which is exactly what I saw. Some colleagues suggested that the differences in the graphs may be due to TCP offload being disabled, but upon checking our NIC settings, we found that the feature flags were identical. You can also see some spinlocks at the tip of the flamegraph, which reinforces our initial findings with perf top. Let’s see what else we can figure out from here.
  • #20: To find out what’s going on with the system, we’ll be using bcc tools. The Linux kernel has a VM that allows us to attach lightweight and safe probes to trace the kernel. eBPF itself is a hot topic and there are talks that explore it in great detail; slides for this talk link to them if you are interested. To clarify, VM here is more like the JVM that provides a runtime, not like KVM that provides hardware virtualization. You can compile code down to this VM from any language, so don’t look surprised when one day you’ll see javascript running in the kernel. I warned you. For the sake of brevity let’s just say that there’s a collection of readily available utilities that can help you debug various parts of the kernel and underlying hardware. That collection is called BCC tools and we’re going to use some of them to get to the bottom of our issue. On this slide you can see how different subsystems can be traced with different tools.
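If you want to follow along, the tools used in the next slides live under /usr/share/bcc/tools when bcc is installed from upstream. A hedged note on packaging: Debian and Ubuntu also ship a packaged variant, where the package is called bpfcc-tools and the tools gain a -bpfcc suffix; exact names vary by release.
# List the available tools from an upstream bcc install:
ls /usr/share/bcc/tools
# Or install the distribution package (Debian/Ubuntu):
sudo apt-get install bpfcc-tools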
  • #21: To trace latency distributions of sendfile syscalls between Jessie and Stretch, we’re going to use funclatency. It takes a function name and prints an exponential latency histogram for the function’s calls. Here we print a latency histogram for do_sendfile, which is the sendfile syscall function, in microseconds, every second. You can see that most of the calls on Jessie hover between 8 and 31 microseconds. Is that good or bad? I don’t know, but a good way to find out is to compare against another system.
  • #22: Now let’s look at what’s going on with Stretch. I had to cut some parts, because the histogram was not fitting on the slide. If on Jessie we saw most of the calls complete in under 31 microseconds, here that number is 511 microseconds: a whopping 16x jump in latency.
  • #23: In the flamegraphs, you can see timers being set at the tip (the mod_timer function is responsible for that), with these timers taking locks. We can count the number of function calls instead of measuring their latency, and this is where the funccount tool comes in. Feeding mod_timer as an argument to it, we can see how many calls there were every second. Here we have Jessie on the left and Stretch on the right. On Stretch we installed 3x more timers than on Jessie. That’s not a 16x difference, but still something.
  • #24: If we look at the number of locks taken for these timers by running funccount on lock_timer_base function, we can see an even bigger difference, around 10x this time. To sum up: on Stretch we installed 3x more timers, resulting in 10x the amount of contention. It definitely seems like we’re onto something.
  • #25: We can look at the kernel source code to figure out which timers are being scheduled based on the flamegraph, but that seems like a tedious task. Instead, we can use perf tool again to gather some stats on this for us. There’s a bunch of tracepoints in the kernel that provide insight into timer subsystem. We’re going to use timer_start for our needs.
  • #26: Here we record all timers started for 10s and then print the function names they were triggering, with respective counts. On Stretch we install 12x more tcp_write_timer timers; that sounds like something that can cause issues. Remember: we are on a bandwidth-bound workload where the interface is 20G, that’s a lot of bytes to move.
  • #27: Taking specific flamegraphs of the timers revealed the differences in their operation. It’s probably hard to see, but tcp_push_one really stands out on Stretch. Let’s dig in.
  • #28: The traces showed huge variation in tcp_sendmsg and tcp_push_one counts within sendfile, which is expected given the flamegraphs before.
  • #29: To introspect further, we leveraged a kernel feature available since 4.9: the ability to count and aggregate stacks in the kernel. BCC tools include the stackcount tool that does exactly that, so let’s take advantage of it.
  • #30: The most popular Jessie stack is on the left and the most popular Stretch stack is on the right. There were a few much less popular stacks too, but there’s only so much one can fit on the slides. The Stretch stack was too long; “…” stands for the same frames as the highlighted section of the Jessie stack. These are mostly the same, and it’s not exactly fun to spot the difference, so let’s just look at the diff on the next slide.
  • #31: We see 5 extra functions in the middle of the stack, starting with tcp_sendpage. Time to look at the source code. Usually I just google the function name and it gives me a result on elixir.bootlin.com, where I swap “latest” for my kernel version. The source code there lets you click on identifiers and jump around the code to navigate.
  • #32: This is what the tcp_sendpage function looks like; I pasted it verbatim from the kernel source. From tcp_sendpage our stack jumps into sock_no_sendpage. If you look up what NETIF_F_SG means, you’ll find it’s scatter-gather, the capability segmentation offload relies on. Segmentation offload is a technique where the kernel doesn’t split the TCP stream into packets itself, but instead offloads this job to the NIC. This makes a big difference when you want to send large chunks of data over high-speed links. That’s exactly what we are doing, and we definitely want offload enabled.
  • #33: Let’s take a pause and see how we configure the network on our machines. Our 2x10G NICs provide eth2 and eth3, which we then bond into the bond0 interface. On top of that bond0 we create two vlan interfaces, one for the public internet and one for the internal network.
  • #34: It turned out that we had segmentation offload enabled for only a few of our NICs: eth2, eth3, and bond0. When we checked NIC settings for offload earlier, we only checked the physical interfaces and the bonded one, but ignored the vlan interfaces, where offload was indeed missing.
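A quick way to catch this class of problem is to check the offload flags on every interface in the chain, not just the physical ones (a minimal sketch; the interface names match our setup above):
# Compare offload-related flags across physical, bond, and vlan interfaces.
for iface in eth2 eth3 bond0 vlan10 vlan100; do
  echo "== $iface"
  ethtool -k "$iface" | grep -E 'scatter-gather:|segmentation-offload:|tx-checksumming:'
done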
  • #35: We compared ethtool output for the vlan interface, and there was our issue in plain sight.
  • #36: We can just enable TCP offload by enabling scatter-gather (which is what “sg” stands for) and be done with it. Easy, right? Imagine our disappointment when this did not work. So much work, a clear indication that this was the cause, and yet the fix did not work.
  • #37: The last missing piece we found was that offload changes are applied only during connection initiation. We turned Kafka off and back on again to start offloading and immediately saw positive effects, which is the green line. This is not the 5x change I mentioned at the beginning, because we were experimenting on a lightly loaded node to avoid disruptions.
  • #38: Our network interfaces are managed by systemd-networkd, so the missing offload settings turned out to be a bug in systemd in the end. It’s not clear whether upstream or Debian patches are responsible for it, however. In the meantime, we work around the upstream issue by enabling offload features automatically on boot if they are disabled on vlan interfaces.
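The workaround itself isn’t shown in the slides; a minimal sketch of what such a boot-time fixup could look like (the interface names and flag list are assumptions based on the ethtool diff earlier, not our actual script):
#!/bin/sh -e
# Hypothetical boot-time fixup: re-enable offloads on vlan interfaces
# if they came up with scatter-gather disabled.
for iface in vlan10 vlan100; do
  if ethtool -k "$iface" | grep -q '^scatter-gather: off'; then
    ethtool -K "$iface" sg on tso on gso on
  fi
done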
  • #39: With the fix enabled, we rebooted our Logs Kafka cluster to upgrade to the latest kernel, and on the 5-day CPU usage history you can see clearly positive results.
  • #40: On the DNS cluster the results were more dramatic because of the higher load. On this screenshot only one node is fixed, but you can see how much better it behaves compared to the rest.
  • #41: The first lesson here is to pay closer attention to metrics during major upgrades. We did not see major CPU changes on a moderately loaded cluster and did not expect to see any effects on fully loaded machines. In the end we were not upgrading Kafka, which was the main consumer of user CPU, or the kernel, which was consuming system CPU. The second lesson is how useful perf and bcc tools were at pointing us to where the issue was. These tools work out of the box, they are safe, and they do not require any third-party kernel modules. More importantly, they do not require the operator to be a kernel expert; you just need some basic understanding of the concepts. Another lesson is how important TCP offload is and how its importance grows non-linearly with traffic. It was unexpected that supposedly purely virtual vlan interfaces could be affected by offload, but it turned out they were. Challenge your assumptions often, I guess. Lastly, we used our ability to swap the OS and kernel on reboot to the fullest. Having no need to install an OS meant we didn’t have to reinstall it and could iterate quickly.
  • #42: The internal blog post about this incident was published in August 2017; a heavily truncated external post went out on the Cloudflare blog in May 2018. That external blog post is what this talk is based on. All of it to illustrate how the tool we wrote can be used. During debugging we used bcc tools ad hoc to count timers firing in the kernel; if we’d had a metric for this instead, we could have noticed the issue sooner just by seeing an increase on a graph. This is what ebpf_exporter allows you to have: you can trace any function in the kernel (and in userspace) at very low overhead and create metrics in Prometheus format from it. For example, you can have a latency histogram for disk IO as a metric, which is not normally possible with procfs or anything else.
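For flavor, this is roughly what such a timer metric could look like in the Prometheus exposition format (the metric name and labels are hypothetical illustrations, not ebpf_exporter’s actual output; the counts echo the perf numbers from slide 25):
# HELP timer_starts_total Kernel timer_start events by callback function (hypothetical).
# TYPE timer_starts_total counter
timer_starts_total{function="tcp_write_timer"} 1983978
timer_starts_total{function="tcp_delack_timer"} 4622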
  • #43: Here’s a slide from my presentation on ebpf_exporter, which shows the level of detail you can get. On the left you can see IO wait time from /proc/diskstats, which is what Linux provides, and on the right you can see a heatmap of IO latency, which is what ebpf_exporter enables. With the histograms you can see how many IOs landed in a particular bucket, and things like multimodal distributions become visible. You can also see how many IOs went above some threshold, allowing you to alert on this. The same goes for timers: the kernel does not keep a count of what is firing anywhere for collection.
  • #44: That’s all I had to talk about today. On the slides you have some links on the topic. Slides with speaker notes will be available on the LISA18 website and I’ll also tweet the link. I encourage you to look at my talk on ebpf_exporter itself, which goes into details about why histograms are so great. It involves dinosaur gifs in a very scientific way you probably do not expect, so make sure to check that out. My colleague Alex will be doing a training on ebpf_exporter tomorrow if you want to learn more about that, please come and talk to us. Slides have the information on time and location. If you want to learn more about eBPF itself, you can find Brendan Gregg around and ask him as well as myself.