A Linux debugging and observation tools overview

Effective debugging and monitoring are crucial for performance optimization and troubleshooting in Linux systems. This guide dives into an array of essential tools that will improve your observation and debugging capabilities. 

This article was originally inspired by a copy of Brendan Gregg's work that I found on Mastodon some time ago. The social post had “only” a simple picture with a lot of arrows and utility names, but the value of that picture was overwhelming.

In this blog post we present an updated picture (we added a few programs that we’ve found useful). Since many of the applications below might be unknown to readers, we have added a few words of description for each of them. Whether you're a seasoned DevOps engineer or system administrator or just starting out, this guide will equip you with the tools to better understand and optimize your Linux environment.

The diagram overview

So without further ado let's start with the diagram itself:

Fig.1: Linux observability and debugging tools

Note: This blog post is also available as a single-page cheat sheet for your convenience. You can download the file at the end of the article.

Linux observability and debugging tools list

Here is the list of programs in alphabetical order:

atop - is a top-like utility that concentrates more on system resources such as I/O, network, and CPU.

BCC - is a toolkit for compiling and running eBPF programs that can be attached to kprobes (kernel probes make it possible to dynamically break into almost any kernel routine and collect debugging and performance information non-disruptively). With BCC, the kernel-side program is written in C and the user-space front end in Python or Lua. It includes several utilities as examples, many of which appear later in this list.

biolatency - summarizes block device I/O latency as a histogram. This tool is included with BCC.

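
To make the idea of a latency histogram concrete, here is a minimal Python sketch that groups latency samples into power-of-two buckets. It only illustrates the presentation; it is not how biolatency collects its data (that happens in the kernel via eBPF), and the function names and sample values are made up for this example.

```python
from collections import Counter


def log2_histogram(latencies_us):
    """Count samples per power-of-two bucket.

    Bucket k holds values in [2**k, 2**(k+1)); values below 2 land in bucket 0.
    """
    buckets = Counter()
    for value in latencies_us:
        # bit_length() - 1 equals floor(log2(value)) for positive integers.
        bucket = max(int(value).bit_length() - 1, 0)
        buckets[bucket] += 1
    return dict(sorted(buckets.items()))


def render(histogram):
    """Format the buckets as text rows with a simple bar per bucket."""
    rows = []
    for bucket, count in histogram.items():
        low = 0 if bucket == 0 else 2 ** bucket
        high = 2 ** (bucket + 1) - 1
        rows.append(f"{low:>6} -> {high:<6} : {count:<4} {'*' * count}")
    return "\n".join(rows)


# Hypothetical latencies in microseconds, not real measurements.
samples = [1, 3, 5, 9, 10, 700]
print(render(log2_histogram(samples)))
```

A wide spread across buckets, or a second cluster far from the first, is the kind of pattern a histogram makes visible and a plain average hides.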

biosnoop - traces block device I/O and prints details, including the issuing PID. This tool is included with BCC.

biotop - is a top-like utility for I/O usage. This tool is included with BCC.

blktrace - generates traces of the I/O traffic on block devices.

bpftrace - is a high-level tracing language for Linux systems. Like BCC, it is built on eBPF.

bridge - is part of iproute2, concentrating on L2 traffic that passes through a Linux bridge. Aside from listing bridges, it can also show the FDB and MDB (unicast and multicast forwarding databases) or VLAN/VXLAN data.

cpupower - is a collection of tools to examine and tune power saving settings (such as frequency or Intel Turbo Boost status).

criticalstat - reports long atomic critical sections in the kernel, with stack traces showing their origins. This tool is included with BCC.

dropwatch - shows why a packet was dropped, making it possible to pin the problem down to the NIC, firewall, routing, or another component.

ethtool - shows hardware counters, configuration, and other data (such as the inserted SFP module) that can be extracted from a NIC. It can also be used to configure the NIC at the hardware level, including hashing, queues, and other knobs that can enhance performance.

execsnoop - traces the usage of exec() system calls. This utility is ideal for the monitoring of short-lived processes that would be easily missed in top/ps. This tool is included with BCC.

ext4dist - summarizes ext4 operation latency. This tool is included with BCC. Similar tools for XFS and BTRFS exist as well.

ext4slower - traces slow ext4 file operations, with per-event details. This tool is included with BCC. Similar tools for XFS and BTRFS exist as well.

fatrace - reports file access events (from all processes). Its main purpose is to help find out why an HDD is not going to sleep.

filelife - uses eBPF to trace short-lived files for performance analysis. This tool is included with BCC.

free - displays the amount of free and used memory in the system.

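
The free utility takes its numbers from /proc/meminfo. As a rough sketch of what that data looks like, the Python snippet below parses text in the /proc/meminfo format. The sample text and the helper name are assumptions made for illustration; how free derives its "used" column differs between versions and is not reproduced here.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field_name: integer_value}.

    Most fields are reported in kB; a few (such as HugePages_Total)
    are plain counts and carry no unit suffix.
    """
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            values[key.strip()] = int(fields[0])
    return values


# Hypothetical sample, not taken from a real machine.
SAMPLE = """MemTotal:       16384000 kB
MemFree:         2048000 kB
MemAvailable:    8192000 kB
HugePages_Total:       0
"""

info = parse_meminfo(SAMPLE)
for name in ("MemTotal", "MemFree", "MemAvailable"):
    print(f"{name:<13} {info[name] / 1024:>10.1f} MiB")
```

On a Linux machine the same function can be fed the real file contents, for example parse_meminfo(open("/proc/meminfo").read()).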

ftrace - is the kernel's internal tracer, designed to help developers and system designers find out what is going on inside the kernel. It’s especially useful for analyzing latencies and performance issues affecting user space.

gethostlatency - is a utility designed to help diagnose issues with host name resolution. It shows latency for getaddrinfo/gethostbyname calls. This tool is included with BCC.

hardirqs - summarizes the time spent servicing hard interrupts (IRQs raised by physical hardware) and shows this time as either totals or histogram distributions. This tool is included with BCC.

hdparm - reads and writes hardware disk registers such as performance/power mode, encryption, and spinning status.

htop - provides similar information to the top command but with much more detail, such as per-CPU usage. It allows the user to quickly glance at what is happening on the system.

iostat - is used for monitoring system input/output device loading by observing the time the devices are active in relation to their average transfer rates.

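
The "time the devices are active" can be illustrated with a small Python sketch that derives a utilization percentage from two snapshots in the /proc/diskstats format. It assumes the field layout described in the kernel's I/O statistics documentation, where the tenth statistic after the device name is the time spent doing I/O in milliseconds. The helper names and sample numbers are hypothetical, and this is not a reimplementation of iostat.

```python
def parse_io_ticks(diskstats_text):
    """Return {device: milliseconds spent doing I/O} from diskstats-style text."""
    ticks = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) < 13:
            continue  # not a full statistics line
        # Layout: major, minor, device name, then the per-device statistics.
        # Index 12 is the 10th statistic: time spent doing I/O, in ms.
        ticks[fields[2]] = int(fields[12])
    return ticks


def utilization_percent(before, after, interval_ms):
    """Share of the interval during which each device was busy, capped at 100."""
    result = {}
    for device, new_ticks in after.items():
        if device in before:
            busy_ms = new_ticks - before[device]
            result[device] = min(100.0 * busy_ms / interval_ms, 100.0)
    return result


# Two hypothetical snapshots taken 1000 ms apart.
BEFORE = "   8       0 sda 100 0 800 50 200 0 1600 120 0 150 170"
AFTER = "   8       0 sda 150 0 1200 70 260 0 2080 160 0 400 230"

print(utilization_percent(parse_io_ticks(BEFORE), parse_io_ticks(AFTER), 1000))
```

Here the device was busy for 250 ms out of 1000 ms, so the sketch reports 25% utilization for sda.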

ip - is part of iproute2, covering several layers of the OSI model, from interface/link statistics through L3 routing up to L4 traffic policy.

lldptool - reads and interprets LLDP packets sent by the attached switch(es). It makes it easy to identify which switch port the Linux host is connected to and advertises the host’s presence to neighboring switch(es).

lsblk - lists all block devices (physical and virtual) and their topology in the underlying system.

lsof - outputs a list of currently open files (hence the name). A file can be a regular file, a directory, a socket, or the network connection/port on which a process is listening.

lstopo - shows system hardware configuration along with device NUMA assignment in a nice graphical output.

ltrace - is similar to strace but focuses on the dynamic library calls made by the executed process and the signals received by that process. It can also intercept and print the system calls executed by the program.

LTTng - the Linux Trace Toolkit: next generation (LTTng for short) is an open-source software toolkit that one can use to trace the Linux kernel, user applications, and user libraries concurrently.

lurk - is similar to strace with some optimizations made for readability.

mdflush - traces flush events issued by md, the Linux multiple device driver (used for software RAID). This tool is included with BCC.

mpstat - prints CPU usage statistics divided per CPU in SMP (symmetric multiprocessing) systems.

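
Per-CPU usage figures of this kind can be derived from the cumulative counters in /proc/stat. The Python sketch below is a simplified illustration, assuming the column order documented in proc(5) (user, nice, system, idle, iowait, irq, softirq, steal). The function names and the two sample snapshots are hypothetical, and the real mpstat reports a far more detailed breakdown.

```python
def parse_cpu_times(stat_text):
    """Return {cpu_name: [user, nice, system, idle, iowait, irq, softirq, steal]}."""
    cpus = {}
    for line in stat_text.splitlines():
        fields = line.split()
        # Keep per-CPU rows ("cpu0", "cpu1", ...) and skip the aggregate "cpu" row.
        if fields and fields[0].startswith("cpu") and fields[0] != "cpu":
            cpus[fields[0]] = [int(x) for x in fields[1:9]]
    return cpus


def busy_percent(before, after):
    """Percentage of non-idle time per CPU between two snapshots."""
    usage = {}
    for name, new in after.items():
        if name not in before:
            continue
        deltas = [n - o for n, o in zip(new, before[name])]
        total = sum(deltas)
        idle = deltas[3] + deltas[4]  # idle + iowait
        usage[name] = 100.0 * (total - idle) / total if total else 0.0
    return usage


# Two hypothetical snapshots of /proc/stat taken a moment apart.
BEFORE = """cpu  300 0 150 1400 150 0 0 0 0 0
cpu0 100 0 50 800 50 0 0 0 0 0
cpu1 200 0 100 600 100 0 0 0 0 0
intr 12345
"""
AFTER = """cpu  430 0 170 1540 160 0 0 0 0 0
cpu0 130 0 70 940 60 0 0 0 0 0
cpu1 300 0 100 600 100 0 0 0 0 0
intr 12400
"""

print(busy_percent(parse_cpu_times(BEFORE), parse_cpu_times(AFTER)))
```

In this made-up sample cpu0 is 25% busy while cpu1 is fully busy, the sort of imbalance that a system-wide average would hide.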

nicstat - prints out network statistics for all network cards including PPS, throughput, packet size, etc.

nstat - is a simple tool designed to monitor kernel SNMP counters and network interface stats.

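
The kernel exposes its SNMP counters in /proc/net/snmp as pairs of lines: a header line with counter names followed by a line with the values. The Python sketch below parses that format and computes the change between two snapshots. It is only an illustration with hypothetical helper names and made-up sample data, not a description of how nstat is implemented.

```python
def parse_snmp(text):
    """Parse /proc/net/snmp-style text into {"<Protocol><Counter>": value}."""
    counters = {}
    lines = [line for line in text.splitlines() if line.strip()]
    # Lines come in pairs: "Tcp: Name1 Name2 ..." then "Tcp: 1 2 ...".
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, names = header.split(":", 1)
        _, numbers = values.split(":", 1)
        for name, number in zip(names.split(), numbers.split()):
            counters[prefix.strip() + name] = int(number)
    return counters


def counter_deltas(before, after):
    """Return only the counters that changed, with the size of the change."""
    return {
        name: value - before[name]
        for name, value in after.items()
        if name in before and value != before[name]
    }


# Hypothetical snapshots with a reduced set of counters.
BEFORE = """Ip: Forwarding DefaultTTL InReceives
Ip: 1 64 1200
Tcp: ActiveOpens PassiveOpens RetransSegs
Tcp: 10 4 2
"""
AFTER = """Ip: Forwarding DefaultTTL InReceives
Ip: 1 64 1350
Tcp: ActiveOpens PassiveOpens RetransSegs
Tcp: 12 4 5
"""

print(counter_deltas(parse_snmp(BEFORE), parse_snmp(AFTER)))
```

Looking at deltas rather than absolute values makes it easier to see which counters, such as TCP retransmissions, are moving while a problem is being reproduced.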

numastat - shows per-NUMA-node memory statistics for processes and the operating system.

nvme - is an NVMe storage command-line utility. Among other uses, it can read S.M.A.R.T. data and NAND and PCIe statistics, and send custom commands to the underlying devices.

offcputime - summarizes off-CPU time (time spent waiting while blocked on I/O, locks, timers, paging/swapping, etc.) by kernel stack trace. This tool is included with BCC.

opensnoop - traces open() syscalls, showing the file name (pathname) and returned file descriptor number (or -1, for error).

pcstat - gets page cache statistics for files in order to provide an answer as to whether Linux is caching data or not.

perf - is a performance analysis tool in Linux. It’s a user-space utility, run from the command line, that provides a number of subcommands such as stat, top, record, and report. It supports hardware performance counters, tracepoints, software performance counters, and dynamic probes.

pidstat - monitors individual tasks currently being managed by the Linux kernel. It can monitor every task on the system, including the child tasks of any task, and reports details such as CPU usage or disk I/O.

ps - shows information about current processes. Although not the most sophisticated tool, it's always available on Linux.

rdmsr - reads CPU model-specific registers (MSR). MSRs are control registers provided by the processor implementation so that system software can interact with a variety of features, including performance monitoring, checking processor status, etc.

runqlen - summarizes scheduler run queue length as a histogram. It can be used to identify imbalances, such as processes occupying a CPU and causing others to queue. This tool is included with BCC.

sar - collects, reports, or saves system statistics. Aside from network statistics, it can also monitor other devices, such as disks.

slabtop - displays kernel slab cache information in real time. A slab is a set of one or more contiguous pages of memory while a slab cache is a “container” of multiple slabs of the same type.

smartctl - reads S.M.A.R.T. data from the underlying disk device. The data includes sector remapping, errors, rereads, logs, and other statistics reflecting the health and performance of storage devices. Older versions of smartctl were unable to access NVMe devices. In such cases, the nvme utility should be used.

softirqs - summarizes the time spent servicing soft IRQs (soft interrupts), and can show this time as either totals or histogram distributions. This tool is included with BCC.

ss - is the tool that replaced the deprecated netstat utility. It shows port bindings on the running system along with process names and connection states. The CLI is almost identical to its predecessor.

stapprobes.udp - is a part of the SystemTap (stap for short) utility used for gathering information about the running Linux system. Among other things, it provides probe points for UDP activity.

strace - runs the specified command until it exits. It intercepts and records the system calls which are called by a process and the signals which are received by a process.

tc - is part of iproute2, concentrating on traffic control settings. Aside from displaying QoS counters, it can also deal with traffic offloading (tc flower).

tcpdump - is a network sniffer that can display traffic on selected (or all) interfaces. Because it is a CLI tool, it provides invaluable information about traffic on systems where a graphical environment is not an option.

tcplife - traces TCP sessions in systems and summarizes their lifespan. This tool is included with BCC.

tcpretrans - shows possible issues with TCP connections by displaying retransmits and other details. 

tiptop - displays hardware performance counters for Linux tasks. It's similar to the top utility but enriched by hardware counters.

turbostat - shows CPU topology, temperature, frequency and idle statistics.

vmstat - reports information about processes, memory, paging, block I/O, traps, disks, and CPU activity at defined intervals.

wireshark - is a graphical tool complementing tcpdump in many ways. Thanks to its bundled subtools and the vast number of supported protocols, it makes network debugging easier than with tcpdump alone.

Summary

While the picture and list are far from being comprehensive, this is a good place to start. If you think that there are tools missing, please contact us and we will update this blog post.

Linux observability and debugging tools cheat sheet to download

For those who prefer a printable version, we have provided the cheat sheet in PDF format here.

Adam Kułagowski

Principal Network Engineer

Adam is a seasoned Principal Network Engineer with nearly two decades of experience in the realm of networking. Passionate about the intricacies of data transmission, he constantly strives to optimize network performance, pushing the boundaries of speed and efficiency. With a strong foundation in networking...
