Unpacking Elixir: Observability

The following article explains how the Erlang/Elixir VM (BEAM) supports observability out of the box: tracing, telemetry, logging.

https://underjord.io/unpacking-elixir-observability.html

Elixir supports the usual suspects of observability: Open Telemetry (OTel), log handlers, capturing metrics. And it does it well. This post will mostly focus on the observability you have on the BEAM that is either incredibly rare to see elsewhere or possibly entirely unique.

The previous posts on concurrency and resilience might give useful context around how processes work and how supervision trees are structured. I will try not to lean too heavily on them but if you feel the need to understand more, consider reading them.

I don’t know if my understanding is accurate to the truth of the system, I don’t read BEAM bytecode, but I’ll take a stab. Elixir and Erlang applications do not change their shape very much in compilation. I imagine the VM-level code still has an understanding of the fundamental parts of the language we see. Modules. Functions. It certainly knows about processes, messages, mailboxes and all that.

Why would someone build it this way? It has to be fundamentally inefficient compared to, say, C++ or Rust. Your compiled C or C++ program will be very different from the code you wrote. Who knows what the compiler will feel like doing. It will be semantically equivalent: the meaning of how the code executes will be retained. There is a ton of additional debugging information required if you want to retain the human-readable meaning and ideas of the original code during execution of the end product. And that might be enough to give you a nice stack trace.

Elixir aside, other high-level dynamic languages also seem to keep more of their general structure. If the program can mutate itself significantly during runtime it limits how much the compiler can be allowed to mangle things.

Erlang was built to provide hot code updates. I think this is the fundamental reason it is so introspectable at runtime. Hot code updates, while they can be done rigorously and should be treated seriously for production systems, are essentially the biggest monkey patch facility imaginable. I guess any language that can patch itself at runtime could start implementing hot code updates. But Erlang is designed to allow it and do it gracefully. This significantly limits how much the compiler can be allowed to boil things away.
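To make the mechanism concrete, here is a minimal sketch of loading a new version of a module into a running VM while a process that depends on it keeps running. It is not how a production release upgrade is done. HotSketch and VersionServerSketch are hypothetical names made up for the illustration.

```elixir
# Redefining a module is the point here, so silence the conflict warning.
Code.compiler_options(ignore_module_conflict: true)

# Version 1 of a hypothetical module, compiled and loaded at runtime.
Code.compile_string("""
defmodule HotSketch do
  def version, do: 1
end
""")

defmodule VersionServerSketch do
  # A long-running process that calls into HotSketch on every request.
  def loop do
    receive do
      {:version, from} ->
        send(from, {:version, HotSketch.version()})
        loop()
    end
  end
end

pid = spawn(&VersionServerSketch.loop/0)

# Small helper: ask the server which version of HotSketch it currently sees.
ask = fn ->
  send(pid, {:version, self()})

  receive do
    {:version, v} -> v
  after
    1_000 -> :timeout
  end
end

v1 = ask.()

# Load version 2 while the server process keeps running.
Code.compile_string("""
defmodule HotSketch do
  def version, do: 2
end
""")

v2 = ask.()
IO.inspect({v1, v2}, label: "before/after")
```

The server process is never restarted. Its call to HotSketch.version/0 is a fully qualified call, so it lands in whichever version of the module is currently loaded.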

Let’s start with the tool that Elixir devs tend to use daily. The REPL, the shell, IEx. It is a great part of the developer workflow as is. More importantly though, if you package an Erlang Release of your project you also end up with my_app remote. This command pops an iex shell, connects to your app over Erlang distribution and lets you poke your application. It lets you operate your application. All your modules and functions are intact and you can just make function calls, send messages and generally poke and prod around in your cluster. Whatever you need, anything that’s available in Erlang and Elixir plus any new code you are willing to type or paste into the interpreter.
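As a rough idea of what that poking can look like, here is a small sketch of things you might type into such a shell. It uses the standard library String module as a stand-in for one of your own modules, which is my assumption for keeping the example self-contained.

```elixir
# Which node are we on, and which other nodes is it connected to?
IO.inspect({node(), Node.list()}, label: "node and peers")

# Modules still exist as modules: check that one is loaded and list its exports.
loaded = Code.ensure_loaded?(String)
exports = String.__info__(:functions)
IO.inspect({loaded, {:upcase, 1} in exports}, label: "String loaded / exports upcase/1")

# Functions can be looked up and called by name at runtime.
result = apply(String, :upcase, ["beam"])
IO.inspect(result, label: "apply/3 result")
```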

What facilities do we have to actually pull information from the system? Well. The Erlang :sys.get_state/1 function lets you pull the state of an “Actor”-style process (GenServer, GenStage, gen_event, gen_statem, etc.), which is usually all of them. If a process takes part in Erlang’s supervision trees by implementing those protocols, it should also be able to give you the state it is holding. So you can inspect the running state of your application down to the studs.
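A minimal sketch of what that looks like, assuming a toy GenServer. CounterSketch is a hypothetical module invented for this example, not something from the article.

```elixir
defmodule CounterSketch do
  # Hypothetical GenServer used only for illustration.
  use GenServer

  def start_link(initial), do: GenServer.start_link(__MODULE__, initial)
  def increment(pid), do: GenServer.cast(pid, :increment)

  @impl true
  def init(initial), do: {:ok, %{count: initial}}

  @impl true
  def handle_cast(:increment, state) do
    {:noreply, %{state | count: state.count + 1}}
  end
end

{:ok, pid} = CounterSketch.start_link(0)
CounterSketch.increment(pid)
CounterSketch.increment(pid)

# The module exposes no "get" function at all. :sys.get_state/1 asks the
# process for its state through the system message protocol that every
# OTP-compliant process understands.
state = :sys.get_state(pid)
IO.inspect(state, label: "counter state")
```

This is best treated as a debugging tool. The call is synchronous and copies the whole state to the caller, so it is probably not something to put in a hot path.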

Elixir offers Process.list/0 to get a plain list of all the process IDs (PIDs) on the local node. With each PID you can call Process.info/1 to find out what is going on and investigate things like memory usage, reduction count (roughly the number of function calls), initial function call and much more. All things that are part of the Erlang protocol for a well-behaved process.
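Here is a small sketch of hunting for memory-hungry processes by hand. It uses Process.info/2, the variant that takes a list of keys, so only the fields of interest are fetched. Picking the top five is an arbitrary choice for the example.

```elixir
# Rank local processes by memory use, highest first.
top_by_memory =
  Process.list()
  |> Enum.map(fn pid ->
    {pid, Process.info(pid, [:memory, :reductions, :registered_name, :initial_call])}
  end)
  # Process.info/2 returns nil if the process exited after it was listed.
  |> Enum.reject(fn {_pid, info} -> is_nil(info) end)
  |> Enum.sort_by(fn {_pid, info} -> info[:memory] end, :desc)
  |> Enum.take(5)

for {pid, info} <- top_by_memory do
  IO.inspect(
    {pid, info[:registered_name], info[:memory], info[:reductions], info[:initial_call]},
    label: "process"
  )
end
```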

This is all underpinnings for higher-level tools such as Erlang’s fun desktop UI app observer which can show you a graph of your supervision tree. It also has an activity monitor for your processes (order by memory used descending, oh there’s the memory leak). You can kill processes, you can get system-level stats. It has a bit of everything.

Of course with LiveView coming to Elixir this was pushed a bit further. A default Phoenix app has a few lines to uncomment to enable LiveDashboard which is a view that gives you much of the same information as observer. Plus it collects some telemetry for Phoenix and Ecto. You also get system-level information (disk, RAM, CPU), BEAM-level information (memory allocation types, resource usage, schedulers running) and a bit more. You also get a Process listing here in a nice and neat web view. If you are new in Elixir and want to make some waves I think pushing LiveDashboard further would be a good place to poke around. It is already a very cool start.

I believe LiveDashboard established this pattern for libraries with web UI for Phoenix. It provides a plug. Meaning you can shove it in any part of your router that you like. Usually behind an admin access check. I’ve since seen this done with Oban (job processing library) web UI and I believe the same thing is done with Orion.
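For reference, mounting LiveDashboard behind an access check looks roughly like the fragment below. It belongs inside a Phoenix router module and is not runnable on its own. MyAppWeb is a placeholder application name and the :require_admin pipeline is hypothetical, something you would define for your own authentication.

```elixir
# Inside a Phoenix router module such as MyAppWeb.Router (placeholder name).
import Phoenix.LiveDashboard.Router

scope "/admin" do
  # :require_admin is a hypothetical pipeline holding your own access check.
  pipe_through [:browser, :require_admin]

  live_dashboard "/dashboard", metrics: MyAppWeb.Telemetry
end
```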

Let’s talk about Orion because it feeds right into this story. It is a recent development by Thomas Depierre. Dubbed a Dynamic Distributed Profiler, it does something that requires a lot of instrumentation to do in any ecosystem and which is probably impossible to do fully in some. It was entirely achieved with existing Erlang facilities. You can enter a module name, a function name and an arity, hit Run. It will start to capture the performance of that function being called across your entire cluster. It then graphs that and gives you data and statistics on how that function is performing. There are many cool directions to extend this tool, but let’s get into what it was built on. Tracing.

Tracing in Erlang is a mechanism for capturing information around the execution of a function. It is not limited to performance numbers. I’ve used raw Erlang dbg, the convenience library recon and a little bit of recon_ex to do tracing on production systems when I needed to figure something out. I don’t know of any other runtime or language that makes this possible. Maybe I’m missing a world of tools in different ecosystems. Let me know. But essentially you formulate a type of pattern match for which invocations of the function you want to capture and in what way. Typically I want the inputs and the outputs along with execution time.
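To give a feel for it, here is a minimal sketch that captures the arguments, return value and rough timing of one function call. It uses the raw :erlang.trace/3 and :erlang.trace_pattern/3 primitives rather than dbg or recon, which is my choice to keep the example free of dependencies. TraceSketch is a hypothetical module standing in for real application code. On a production system you would more likely reach for recon, which adds safety limits.

```elixir
defmodule TraceSketch do
  # Hypothetical function standing in for the one you want to observe.
  def price_with_tax(amount, rate), do: amount + amount * rate
end

# A worker that waits for :go, so tracing can be enabled before the call happens.
worker =
  spawn(fn ->
    receive do
      :go -> TraceSketch.price_with_tax(100, 0.25)
    end
  end)

# Match spec: match any arguments and also request a return_from message.
match_spec = [{:_, [], [{:return_trace}]}]
:erlang.trace_pattern({TraceSketch, :price_with_tax, 2}, match_spec, [:local])

# Trace calls made by the worker. Trace messages are sent to the calling
# process, and the :timestamp flag adds a timestamp to each of them.
:erlang.trace(worker, true, [:call, :timestamp])
send(worker, :go)

{args, t_call} =
  receive do
    {:trace_ts, ^worker, :call, {TraceSketch, :price_with_tax, args}, ts} -> {args, ts}
  after
    1_000 -> raise "no call trace received"
  end

{result, t_return} =
  receive do
    {:trace_ts, ^worker, :return_from, {TraceSketch, :price_with_tax, 2}, value, ts} ->
      {value, ts}
  after
    1_000 -> raise "no return trace received"
  end

elapsed_us = :timer.now_diff(t_return, t_call)

IO.inspect(args, label: "called with")
IO.inspect(result, label: "returned")
IO.inspect(elapsed_us, label: "microseconds between call and return")

# Switch the trace pattern off again when done.
:erlang.trace_pattern({TraceSketch, :price_with_tax, 2}, false, [:local])
```

Note that TraceSketch itself contains no instrumentation. The match spec is where you would narrow things down, for example to only calls where the first argument is nil.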

Think about this for a moment. If you have a function that seems to end up getting called with the wrong value, probably nil, for some unclear reason. And of course only in production. You can, in a reasonable manner, set up a trace and either wait for the thing to happen or trigger the behavior. Then you just watch the answers come in.

Oh, you need to know what happens one function deeper? Set another trace. There are limits and considerations for how much tracing you should do at once but you have a lot of room to play. Most of the tools built on top of the Erlang primitives try to protect you a bit from overloading your system with trace messages.

There are so many tools that haven’t even been built on top of this yet. I don’t think most Elixir developers are even aware that it is possible. There is no particular reason you couldn’t trigger an automatic trace after a new type of error surfaces and try to capture more info on the next run, ship that to your devs. Or build a UI for picking modules, functions and args to match so that you can capture these traces. If you want you could probably adapt the results to go into your Open Telemetry trace storage. There is a world to explore here.

And to some it won’t be available at all. If you run your stuff on Heroku and don’t add some tool for accessing a shell through the web or something you might have no way of reaching your server to pop the shell.

I should have asked for a Fly sponsorship here. This stuff is why I’m excited about their platform. Wireguard private networking by default makes clustering much easier. It also makes connecting to servers simpler, which makes getting at an iex shell straightforward.

You could replicate most of it with Tailscale or if you really want to work for it, custom wireguard stuff. Either way, Fly is a very good match for Elixir. It makes sense that they anchor their presence in the Elixir ecosystem by funding Chris McCord’s work on LiveView. Their infra offering just fits so well. A bunch of asterisks on the maturity of the offering still remain and they’ve owned up to that. Feature-wise I really like it.

A regular VPS or dedicated server is also perfectly convenient to shell into.

This capability, especially with Wireguard networking, also allows you to connect a Livebook (collaborative code notebook for Elixir) to a running system. This lets you build a recipe-book of things you might want to do in your system. Some would call them playbooks or runbooks. Rather than writing ad-hoc code that will get lost in your terminal history you build up a toolset. I haven’t put this into practice but it should be perfectly feasible.

To revisit. The BEAM retains the shape of your application. Your modules, your functions, they still exist. Your processes are not just abstractions that are flattened out by the compiler. They are real and exist, comparatively speaking. Erlang and OTP have protocols and systems in place that make it possible to observe the running system at a high level for an overview as well as dive in and inspect any particular part.

The unique nature of these incredibly dynamic systems warranted unusual solutions and that gave us the Erlang tracing facilities. It is not a wildly monkey-patching library from some APM provider. It is not a hack to interject into the operation of the software, it is a fundamental facility of the runtime.

Playing nice with existing ecosystems and standard practices such as Open Telemetry is important. Elixir and Erlang are used for serious stuff and in mixed environments. It can’t all be special, unique and quirky. And I think the telemetry handling, metrics libraries like PromEx and the OTel implementation make great use of the BEAM and do not require external tools to operate aside from where to store the data.

Then when you look at the stuff that really is special, unique and quirky there is immense potential. We can go beyond what is feasible in other runtimes and languages. I think this is a big space for innovation on top of Elixir, Erlang and the runtime. The tools I’ve seen in this area are still fairly simple and there is so much potential.

This is what you get working with a higher level of abstraction that had a deeper purpose. Here the abstraction is not just about convenience and syntax but a deeply worked design that serves larger objectives. Python is a high level of abstraction primarily to make a sleek and convenient language. I am not convinced that the design of the language and runtime had clearer objectives more elevated than nice and convenient syntax. That’s fine but it has long-term consequences.

I think observability and introspectability, of arbitrary parts of the system, at runtime, is one of those things that people don’t know or really think about with Erlang and Elixir. But the magic of the BEAM is in the runtime and at runtime. It always was.

#reads #lars wikman #erlang #elixir #beam #observability