3D printing and observability: How Prusa Research monitors its huge printer farm with Grafana
If you’ve ever been to a Grafana Labs event, you may have seen one of the company’s two 3D printers on site churning out Grafana coins and other novelties.
Many Grafanistas have their own 3D printers at home, too, and building things with them is such a popular hobby that one of the company’s busiest non-work Slack channels is devoted to 3D printing.
Among the enthusiasts is Tom Wilkie, Grafana Labs CTO and co-host of “Grafana’s Big Tent.” In the Season 2 finale of the podcast, he and co-host Mat Ryer, Grafana Labs Engineering Director, chat about the intersection of observability and 3D printing. Their guests are 3D printing hobbyist and Grafana Labs Director of Community Richard “RichiH” Hartmann, and Pavel Strobl, a DevOps engineer at Prusa Research. Pavel’s company not only makes many of the printers that are popular among Grafanistas, but it also uses Grafana internally for web development and observability.
You can read some of the show’s highlights below, but listen to the full episode to hear more about the history of Prusa Research (and its similarities with Grafana Labs), and to find out who is printing wind tunnels, plant pots, and . . . teeth.
Note: The following are highlights from episode 10, season 2 of “Grafana’s Big Tent” podcast. The transcript below has been edited for length and clarity.
An intersection of interests
Tom Wilkie: This is an observability podcast. So why are we doing an episode on 3D printing?
Richard “RichiH” Hartmann: Several reasons. We have a lot of engineers [at Grafana Labs], and most of them are software engineers. If you work on software all day, every day, you don’t really see any results. But still, we like tinkering, so we have a large maker community. It’s roughly split between woodworking, 3D printing, and electronics.
Using the technology you work on to understand the real world and derive more information from what’s actually happening with things you can touch — that’s super-interesting and an intersection of interests.
The Prusa Research and Grafana connections
Tom: Pavel, you’re using Grafana all over the place at Prusa Research, aren’t you?
Pavel Strobl: Yeah. In the DevOps department, it’s mostly used for web development. We are observing web services that are running on our Kubernetes cluster, etc. (pretty usual stuff), but we also use Grafana for our development department that is developing 3D printers. Their use is different because they have to develop the printer, so they have different kinds of data than we do and use different technologies. Even though I’m not in charge of this department’s observability, I know a lot of stuff about it, and it’s even accessible on our GitHub.
Tom: So what kind of things are the people designing these 3D printers using Grafana for?
Pavel: Basically, when you are optimizing the printer, you have to test it, and you can see metrics from the load cell, for example during the probing of the heatbed. You can also see the value of the filament sensor, which is analog. You see that the values change when you are printing PLA, or PETG, or flexible filaments. You can get this data and analyze it and implement stuck filament detection. There’s also all the other data that’s interesting for the users, for example current and voltage, so you can calculate power consumption.
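As a rough illustration of the power calculation Pavel mentions, here is a minimal Python sketch. It assumes the printer reports voltage and current samples with a time offset; the `Sample` type, its field names, and the sample values are hypothetical and not taken from Prusa firmware.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One hypothetical telemetry reading (field names are assumptions)."""
    t_s: float    # seconds since the start of the print
    volts: float  # supply voltage in volts
    amps: float   # supply current in amperes


def power_w(sample: Sample) -> float:
    """Instantaneous electrical power: P = V * I."""
    return sample.volts * sample.amps


def energy_wh(samples: list[Sample]) -> float:
    """Energy in watt-hours, using trapezoidal integration of power over time."""
    joules = 0.0
    for a, b in zip(samples, samples[1:]):
        joules += 0.5 * (power_w(a) + power_w(b)) * (b.t_s - a.t_s)
    return joules / 3600.0


# Made-up readings, one per minute, on a 24 V supply.
samples = [
    Sample(0.0, 24.0, 2.0),
    Sample(60.0, 24.0, 4.0),
    Sample(120.0, 24.0, 3.0),
]
print(f"{energy_wh(samples):.2f} Wh")
```

In a Grafana setup the same integration would more likely be done in the query layer; the sketch only shows the arithmetic.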
Tom: I really appreciate the open source nature of Prusa Research and their things. During the pandemic, it was impossible to get hold of Prusa Minis because they were sold out, so I ended up making my own. I downloaded all your designs and 3D-printed them. One of the things I really like about Prusa Research, and one of the reasons why I have your printers, is because I know I can hack on them and modify them.
Richi: The vast majority of everything which Prusa ever produced was enabled by 3D printing. And when you visit [their campus in Prague], there are hundreds and hundreds and hundreds of 3D printers running 24/7, just churning out parts for new 3D printers — which is part of why they are so reliable, because they use them for their own production. It’s not just a product they shove out the door and it’s your problem now. They actually run long-term. Tom and I saw printers which had been running for 400 hours, with basically no maintenance, and an empty error log.
Tom: Another thing that I think Grafana Labs and Prusa Research have in common is that we use a lot of our own tooling at Grafana Labs, and our staging cluster is our main operations cluster. We use our own Grafana Cloud products — [powered by] Mimir, Loki, Tempo, Grafana — to monitor our Grafana Cloud SaaS, and that’s what drives the reliability. Because ideally, if there is a problem with any of our software, we’re the ones to discover it first in our ops cluster, and so we’re constantly pushing out changes, and testing them, and using them in anger, and then only then giving them to customers.
That dogfooding is something that I think really drives a quality product, and I think it’s fantastic how Prusa Research does that.
Mat: Print your own dogfood.
Pavel: Yeah, you can say that our print farm is like our staging cluster, because we are running over 700 printers, I think. Most of them are MK4s. Then a few XLs for the details that are printable only on XL. Thanks to the 3D printing farm, we actually increased the reliability of our printers, and we found a few issues before the printers were released.
The input shaper was tested at the farm as well. The farm is a crucial part of the company. Without the farm, we cannot produce any printer. Having a test environment like this is actually useful.
Monitoring life at the farm
Mat: The print farm sounds amazing — 700 printers. That really makes it clear why Grafana is going to help there with observability. What did you do before you had Grafana? How did you manage something like that?
Pavel: Before we had Grafana, as far as I know, we had nothing — at least for the 3D printing part of the company. Because when MK3 was in development, there was a need for monitoring the printers. You had to go and print, take the prints out, etc., and notice that there’s some issue with the printer.
Datadog was used for web development, but I joined Prusa Research when Grafana was embraced, thankfully, and we have been using Grafana since then. For DevOps, we are cutting edge and running on 10.4.1, and we are preparing to migrate to version 11. For the development part, it’s a little bit older.
Tom: You’re going to really like 11. The new Explore Metrics and Explore Logs experiences in there are game-changing, in my opinion. It makes it so much easier for more people in an organization to access their Prometheus metrics and their Loki logs.
But you’re not using Prometheus for a lot of things, are you? You use something else.
Pavel: In DevOps we use Prometheus, but for the printers, InfluxDB is used. The main reason why is because of a need for nanosecond precision. That’s not possible with Prometheus. When you are developing a printer, you really have to know when something occurred, and nanosecond precision is a must.
Tom: Richi, why don’t we have nanosecond precision in Prometheus?
Richi: Because it’s always an engineering trade-off: how much precision versus how much data usage. For most computer things, milliseconds are more than enough, and that’s why we use them.
In my somewhat strong opinion, for everything where you actually need the super-high precision, you tend to not have super-long running things. Pretty much all the established industries use metrics with whatever precision for long-running stuff. And when you have your microburst on the network, or you have your jitter on the power supply or whatever, you use logs or events with super-high precision for a very short time, and only the really important stuff gets emitted. But I don’t think the firmware of Prusa is there yet.
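To make the precision trade-off concrete, here is a small Python sketch. It assumes a store that keeps timestamps as integer milliseconds and shows how events recorded with nanosecond timestamps can become indistinguishable once truncated. The event values are invented for illustration.

```python
NS_PER_MS = 1_000_000


def to_millis(ts_ns: int) -> int:
    """Truncate a nanosecond Unix timestamp to whole milliseconds."""
    return ts_ns // NS_PER_MS


# Three made-up events that all happen within the same millisecond.
events_ns = [
    1_700_000_000_000_000_000,
    1_700_000_000_000_250_000,  # 250 microseconds later
    1_700_000_000_000_900_000,  # 900 microseconds later
]

distinct_at_ns = len(set(events_ns))
distinct_at_ms = len({to_millis(t) for t in events_ns})
print(distinct_at_ns, distinct_at_ms)
```

At nanosecond resolution the three events keep their order and spacing; at millisecond resolution they share a single timestamp, which is the kind of loss that matters when debugging firmware timing.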
Pavel: You can actually get metrics into Prometheus from the printers thanks to Prusa Exporter, which I wrote. What the printer sends is almost the line protocol for Influx, but it’s not the line protocol, because in the line protocol you have timestamps. There is no timestamp because the printer doesn’t have enough memory to send timestamps, so it sends delta, and it sends for how long it’s running.
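The idea of rebuilding absolute timestamps from deltas can be sketched in Python as follows. This is not the actual printer wire format or the Prusa Exporter implementation: the input shape (`name key=value delta_ms`), the millisecond delta unit, and the function names are all assumptions made for illustration. The receiver supplies a base time, for example the moment the message arrived, and adds each delta to it.

```python
NS_PER_MS = 1_000_000


def parse_point(line: str) -> tuple[str, dict[str, float], int]:
    """Parse a hypothetical 'name key=val[,key=val] delta_ms' line."""
    name, fields_part, delta_part = line.split(" ")
    fields: dict[str, float] = {}
    for pair in fields_part.split(","):
        key, value = pair.split("=")
        fields[key] = float(value)
    return name, fields, int(delta_part)


def to_line_protocol(lines: list[str], base_ns: int) -> list[str]:
    """Turn delta-stamped points into InfluxDB-style line protocol.

    base_ns is the receiver-side reference time in nanoseconds; each
    point's delta (assumed to be in milliseconds) is added to it.
    """
    out = []
    for line in lines:
        name, fields, delta_ms = parse_point(line)
        ts_ns = base_ns + delta_ms * NS_PER_MS
        field_str = ",".join(f"{k}={v}" for k, v in fields.items())
        out.append(f"{name} {field_str} {ts_ns}")
    return out


received = ["loadcell value=12.5 0", "loadcell value=13.0 4"]
for point in to_line_protocol(received, base_ns=1_700_000_000_000_000_000):
    print(point)
```

Moving the timestamp arithmetic to the receiver keeps the messages small on a memory-constrained device, at the cost of depending on the receiver's clock for the base time.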
Tom: Richi takes a 3D printer to almost every Grafana Labs event and shows that work.
Richi: We print swag, but the main thing is we can display all the data the printers are emitting in Grafana, and this gives you a little bit more of a tangible thing. We can put up all the demos and all the recorded videos and everything we want, but to really drive home the point of the versatility of the whole platform, we — or myself — came up with the idea of showing something people can actually touch, as opposed to just yet another Kubernetes cluster where you can’t really see what’s happening.
So you see something in the real world, you can touch and even get the product, and you can see how it’s made on the Grafana dashboards.
Powerful visuals
Tom: Pavel, now that you’re gathering all this data, what is the most interesting thing you’ve discovered that you previously didn’t know?
Pavel: I cannot speak for the whole company, but for me it’s how a printer works with power supply and how effective it is while it’s printing.
For 3D printers, it was probably the use of load cell data. Basically, if you have any 3D object in reality, you can use the load cell sensor on the nozzle to probe the model or anything you have on the printer to get the shape of the model in 3D space. It’s not easy, obviously. You have to calculate a lot, and get the data from the printer, but you can actually get a 3D model from the load cell sensor. It’s like a 3D scanner.
As a load cell sensor, it’s interesting that you can actually print on any surface you want, and we recently did this video about printing on different surfaces. There was a PlayStation 4 used as a printing sheet, or you can print on T-shirts. It works pretty well.
A move to Mimir
Tom: Pavel, when we were chatting before the podcast, you said that you’re thinking of adopting Mimir for your metrics.
Pavel: Yeah, exactly. We are using Prometheus as our backend, but for federation, we have multiple instances. We use Thanos at this moment, and we are experimenting with Mimir because we are going to move our DevOps cluster to a different region. This is a really good opportunity to migrate the data because mostly we are going to scrap it. Using Grafana Mimir for my purposes — for Prusa Exporter — it works pretty well exporting data for everything I want. For me, it’s more scalable. And this is the future of the DevOps department for now.
Tom: This is the Big Tent podcast, so we embrace all of the metrics backends — that’s totally cool. But why are you moving off of Thanos to Mimir? What’s motivating that move?
Pavel: For us, it’s slow queries. We have over two years of data, and if you query over 14 days, it takes a lot of time to get the data from Thanos. That’s basically the main reason. The store gateway is starting too slowly, etc.
Tom: I wrote the early versions of Mimir when it was called Cortex. One of the things we designed it to do is lots of parallel queries, and really be able to scale in that dimension. We did it for fun, and at the time, it might not have been the thing people actually needed. Most people needed what you’ve just described, which is querying over months or years of data. The thing I’m most excited about is we’ve finally found a use case for being able to handle high concurrent query load, and it’s this new Explore Metrics.
In Explore Metrics, when you click on a metric, it’s gonna go off and do a query for every single label behind that metric to see which labels have outliers, and what the distributions are. Clicking on a metric issues, like, 15 queries instantaneously, and it puts a lot of load on the Mimir cluster, but it’s something Mimir handles very well.
It’s really interesting, because we’ve had a lot of users tell that story where, “Oh, I’m moving off of Thanos, because I started using the Explore Metrics and it was killing my cluster. And so I’m starting to move to Mimir, or to Grafana Cloud.” We run Grafana Cloud Metrics with many, many replicas of the querier to deal with this exact use case. It kind of was a solution in search of a problem for a while, but we’ve found one, finally.
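The fan-out pattern Tom describes, one query per label issued at the same time, can be sketched in Python like this. The PromQL shape, the metric and label names, and the `run_query` callable are assumptions for illustration; this is not how Explore Metrics is actually implemented.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def breakdown_query(metric: str, label: str) -> str:
    """Build a hypothetical per-label breakdown query for one metric."""
    return f"sum by ({label}) (rate({metric}[5m]))"


def fan_out(
    metric: str,
    labels: list[str],
    run_query: Callable[[str], str],
    max_workers: int = 16,
) -> dict[str, str]:
    """Issue one query per label concurrently and collect results by label."""
    queries = {label: breakdown_query(metric, label) for label in labels}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {label: pool.submit(run_query, q) for label, q in queries.items()}
        return {label: future.result() for label, future in futures.items()}


def fake_run_query(query: str) -> str:
    """Stand-in for an HTTP call to a metrics backend."""
    return f"result of {query}"


results = fan_out("http_requests_total", ["job", "instance", "status"], fake_run_query)
for label, result in results.items():
    print(label, "->", result)
```

Each click produces as many simultaneous queries as the metric has labels, so the backend's ability to run queries in parallel determines how responsive the view feels.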
Mat: It’s another version of dogfooding, really. We basically DDoS ourselves, and then we have to make sure that works and keeps working. It’s quite fun.
Tom: I remember when the Explore Metrics team was like, “Oh, we’ve got to be careful with how many queries we issue to the metrics backend,” and I’m like, “No, no, no. You’re just giving the metrics team a challenge, giving them something to work on.”
Mat: And we’ve done that for logs, profiles, and traces, of course, as well.
Top 3D printing projects
Mat: What is the best thing you have 3D-printed?
Tom: We’re all holding up a little 3D-printed Grot, the Grafana Labs mascot that’s a little like a juvenile dinosaur. I found someone who would turn our art department’s 2D images into a 3D STL. I posted it up on Printables, and Pavel downloaded it and colorized it.
Richi: Honestly, mine is probably the first thing which I printed, which is a lamp holder. I whipped it up with no skills within, like, half an hour. It’s still going strong, years later, and it still holds up the lamp.
Pavel: Atlantis, the city from Stargate Atlantis, was one of my first prints and I still have it somewhere. It’s silver PLA. That’s probably the best print I ever did.
Tom: You can download your GitHub timeline — the green dots — as an STL to print, where each one’s like a different height. Mat, I’ll print yours for you and send it over.
Mat: Oh, let me do some coding first. I could probably do it at the moment with a 2D printer.
Tom: Yeah, it’s just a blank sheet of paper for Mat.
Mat: I’m genuinely considering getting a 3D printer now, just for that!
“Grafana’s Big Tent” podcast wants to hear from you. If you have a great story to share, want to join the conversation, or have any feedback, please contact the Big Tent team at bigtent@grafana.com. You can also catch up on the first and second season of “Grafana’s Big Tent” on Apple Podcasts and Spotify.