Hachyderm Blog
Announcements
Updating Domain Blocks
Today we are removing x0f.org from our list of suspended instances. Hachyderm will begin federating with x0f.org immediately.
Reason for suspending
We believe the original suspension was related to moderation actions taken early in 2022, before Hachyderm had a process or policy in place to communicate and provide reasoning for suspensions.
Reason for removing suspension
According to our records, we have no reports on file that would warrant a suspension of this domain. The domain was brought to our attention as likely flagged by mistake. After review, we have determined that there is no reason to suspend this domain.
A Note On Suspensions
It is important to us to protect Hachyderm’s community and our users. We may not always get this right, and we will sometimes make mistakes. Thank you to our dedicated users for surfacing this domain (and the other 13 domains) that we have removed from our suspension list. Thank you to the broader fediverse for being patient with us as we continue to iterate on our processes in this unprecedented space.
Opening Hachyderm Registrations
Yesterday I made the decision to temporarily close user registrations for the main site: hachyderm.io.
Today I am making the decision to re-open user registrations for Hachyderm.
Reason for Closing
The primary reason for closing user registrations yesterday was related to the DDoS Security Threat that occurred the morning after our Leaving the Basement migration.
The primary vector leveraging Hachyderm infrastructure for perceived malicious use was the creation of spam/bot accounts on our system. Out of an abundance of caution, we closed signups for roughly 24 hours.
Reason for Opening
Today, Hachyderm does not have a targeted growth or capacity number in mind.
However, what we have observed is that user adoption has dropped substantially compared to November. I believe that we will see substantially less adoption in December than we did in November.
We will be watching closely to validate this hypothesis, and will leverage this announcement page as an official source of truth if our posture changes.
For now, we have provided more detail on growth, registrations, and sustainability in our Growth and Sustainability blog post.
Posts
A Minute from the Moderators
We’ve been working hard to build out more of the Community Documentation to help everyone to create a wonderful experience on Hachyderm. For the past month, we’ve focused most heavily on our new How to Hachyderm section. The docs in this section are:
When you are looking at these sections, please be aware that the docs under the How to Hachyderm section cover the socialized norms around each topic and the subset of those norms that we moderate. Documentation on how to implement the features lives both under our Mastodon docs section and in the main Mastodon docs. This is particularly relevant to our Content Warning sections: How To Hachyderm Content Warnings is about how content warnings are used here and on the Fediverse, whereas Mastodon User Interface Content Warnings is about where in the post composition UI you click to create a content warning.
Preserving your mental health
In our new Mental Health doc, we focus on ways that you can use the Mastodon tools for constraining content and other information. We structured the doc to answer two specific questions:
- How can people be empowered to set and maintain their own boundaries in a public space (the Fediverse)?
- What are the ways that people can toggle the default “opt-in”?
By default, social media like Mastodon and the broader Fediverse opts users in to all federating content. This includes posts, likes, and boosts. Depending on your needs, you may want to opt out of some subsets of that content, either on a case-by-case basis, by topic, by source, or by type. Remember:
You can opt out of any content for any reason.
For example, you may want to opt out of displaying media by default because it is a frequent trigger. Perhaps the specific content warnings you need aren’t well socialized. Maybe you are sensitive to animated or moving media. That said, perhaps media isn’t a trigger - you just don’t like it. Regardless of your reason, you can change this setting (outlined in the doc) whenever you wish and as often as meets your needs.
Hashtags and Content Warnings
Our Hashtags and Content Warnings docs are intended to help Hachydermians better understand both what these features are and the social expectations around them. In both cases, there are some aspects of the feature that people have encountered before: hashtags in particular are very common in social media, and content warnings mirror other features that obscure underlying text on sites like Reddit (depending on the subreddit) and tools like Discord.
Both of these features have nuance to how they’re used on the Fediverse that might be new for some. On the Fediverse, and on Hachyderm, there are “reserved hashtags”. These are hashtags that are intended only for a specific, narrow, use. The ones we moderate on Hachyderm are FediBlock, FediHire, and HachyBots. For more about this, please see the doc.
Content warnings are possibly less new in concept. The content warning doc focuses heavily on how to write an effective content warning. Effective content warnings are important as you are creating a situation for someone else to opt in to your content. This requires consent, specifically informed consent. A well written content warning should inform people of the difference between “spoilers”, “Doctor Who spoilers”, and “Doctor Who New Year’s Special Spoilers”. The art of crafting an effective content warning is balancing what information to include while also not making the content warning so transparent that the content warning is the post.
Notably, effective content warnings feature heavily in our Accessible Posting doc.
Accessible Posting
Our Accessible Posting doc is an introductory guide to different ways to improve inclusion. It is important to recognize there are two main constraints for this guide:
- It is an introductory guide
- It focuses on the tools Mastodon provides
As an introductory guide, it does not cover every topic of accessibility. As a guide that focuses on Mastodon, it discusses the current Mastodon tools and how to fully utilize them.
As an introductory guide, our Accessibility doc primarily seeks to help users develop more situational awareness for why there are certain socialized patterns for hashtags, content warnings, and posting media. We, as moderators of Hachyderm, do not expect anyone to be an expert on any issue that the doc covers. Rather, we want to help inspire you to continue to learn about others unlike yourself and see ways that you can be an active participant in creating and maintaining a healthy, accessible, space on the Fediverse.
Content warnings feature heavily in this doc. The reason for this is that Mastodon is a very visual platform, so the main way you connect with others who do not have the same experience of visual content is by supplying relevant information.
There will always be more to learn and more, and better, ways to build software. For those interested in improving the accessibility features of Mastodon, we recommend reviewing Mastodon’s CONTRIBUTING document.
More to come
We are always adding more docs! Please check the docs pages frequently for information that may be useful to you. If you have an idea for the docs, or wish to submit a PR for the docs, please do so on our Community repo on GitHub.
April will mark one month since we launched the Nivenly Foundation, Hachyderm’s parent org. Nivenly’s website is continuing to be updated with information about how to sponsor or become a member. For more information about Nivenly, please see Nivenly’s Hello World blog post.
The creation of Nivenly also allowed us to start taking donations for Hachyderm and sell swag. If you are interested in donating, please use either our GitHub Sponsors or one of the other methods that we outline on our Thank You doc. For Hachyderm swag, please check out Nivenly’s swag store.
Decaf Ko-Fi: Launching GitHub Sponsors et al
Since our massive growth at the end of last year, many of you have asked about ways to donate beyond Nóva’s Ko-Fi. There were a few limitations there, notably the need to create an account in order to donate. There were a few milestones we needed to hit before we could do this properly, notably we needed to have an EIN in order to properly receive donations and pay for services (as an entity).
Well that time has come! Read on to learn about how you can support Hachyderm either directly or via Hachyderm’s parent organization, the Nivenly Foundation.
First things first: GitHub Sponsors
Actual Octocat from our approval email
As of today the Hachyderm GitHub Sponsors page is up and accepting donations! Using GitHub Sponsors you can add a custom amount and donate either once or monthly. There are a couple of donation tiers that you can choose from as well if you are interested in shoutouts / thank yous either on Hachyderm or on our Funding and Thank You page. In both cases we’d use your GitHub handle for the shoutout.
The shoutouts and Thank You page
#ThankYouThursday is a hashtag we’re creating today to thank users for their contributions. Most posts for #ThankYouThursday happen on Hachyderm’s Hachyderm account, but higher donations will be eligible for shoutouts on Kris Nóva’s Hachyderm account.
- $7/mo. and higher
  - Get a sponsor badge on your GitHub profile
- $25/mo. and higher or $100 one-time and higher
  - Get a sponsor badge on your GitHub profile
  - Get a shoutout on the Hachyderm account’s quarterly #ThankYouThursday
- $50/mo. and higher or $300 one-time and higher
  - Get a sponsor badge on your GitHub profile
  - Get a shoutout on Kris Nóva’s account’s quarterly #ThankYouThursday
- $1000 one-time and higher
  - Get a sponsor badge on your GitHub profile
  - Get a shoutout on the Hachyderm account’s quarterly #ThankYouThursday
  - Be added to the Thank You List on our Funding page
- $2500 one-time and higher
  - Get a sponsor badge on your GitHub profile
  - Get a shoutout on Kris Nóva’s quarterly #ThankYouThursday
(All above pricing in USD.)
A couple of important things about the above:
- All public announcements are optional. You can choose to opt-out by having your donation set to private.
- By default we’ll use your GitHub handle for shoutouts. This is easier than reconciling GitHub and Hachyderm handles.
- We may adjust the tiers to make the Thank Yous more frequent.
Right now the above tiers are our best guess, but we may edit the #ThankYouThursday thresholds in particular so that we can keep a sustainable cadence. Thank you for your patience and understanding with this ❤️
And now an update for the Nivenly Foundation
For those who don’t know: the Nivenly Foundation is the non-profit co-op that we’re founding for Hachyderm and other open source projects like Aurae. The big milestone we reached here is that 1) we’re an official non-profit with the State of Washington and 2) we have a nice, shiny EIN which allowed us to start accepting donations to both the Nivenly Foundation as well as its two projects: Aurae and Hachyderm. For visibility, here are all the GitHub sponsor links in one place:
It is also possible to give a custom one-time donation to Nivenly via Stripe:
Right now only donations are open for Nivenly, Aurae, and Hachyderm. After we finalize Nivenly’s launch, Nivenly memberships will also be available for individuals, maintainers, and what we call trade memberships for companies, businesses, and business-like entities.
What do Nivenly Memberships mean for donations?
Right now, donations and memberships are separate. That means that you can donate to Hachyderm and, once available, join Nivenly as two separate steps. Since Hachyderm is Nivenly’s largest project, providing governance and funding for Hachyderm uses almost all of Nivenly’s donations. As we grow and include more projects this is likely to shift over time. As such, we are spinning up an Open Collective page for Nivenly that will manage the memberships and also provide a way for us to be transparent about our budget as we grow. Our next two big milestones:
- What you’ve all been waiting for: the public release of the governance model (almost complete)
- What we definitely need: the finalization of our 501(c)(3) paperwork with the IRS (in progress)
As we grow we’ll continue to post updates. Thank you all so much for your patience and participation 💕
P.S. and update: What’s happening with Ko-fi?
We are currently moving away from Kris Nóva’s Ko-fi as a funding source for Nivenly and Hachyderm et al. We’ve created a new Ko-fi account for the Nivenly Foundation itself:
Kris Nóva’s Ko-fi is still live to give people time to migrate Nivenly-specific donations (including those for Hachyderm and Aurae) from her Ko-fi to either GitHub Sponsors, Nivenly’s Ko-fi, Stripe, or a Nivenly co-op general membership via Nivenly’s Open Collective page as those become ready (which should be soon). We’ll still be using Nivenly-specific funds from her Ko-fi for Nivenly for the next 30-60 days and will follow up with an update as we wind down that (manual 😅) process.
Growth and Sustainability
Thank you to everyone who has been patient with Hachyderm as we have had to make some adjustments to how we do things. Finding ourselves launched into scale has impacted our people more than it has impacted our systems.
I wanted to provide some visibility into our intentions with Hachyderm, our priorities, and immediate initiatives.
Transparency Reports
We intend to offer transparency reports similar to the November Transparency Report from SFBA Social. It will take us several weeks before we will be able to publish our first one.
The immediate numbers from the administration dashboard are below.
Donations
On January 1st, 2023 we will be changing our financial model.
Hachyderm has been operating successfully since April of 2022 by funding our infrastructure from the proceeds of Kris Nóva’s Twitch presence.
In January 2023 we will be rolling out a new financial model intended to be sustainable and transparent for our users. We will be looking into donation and subscription models such as Patreon at that time.
From now until the end of the year, Hachyderm will continue to operate using the proceeds of Kris Nóva’s Twitch streams, and our donations through the ko-fi donation page.
Governing Body
We are considering forming a legal entity to control Hachyderm in January 2023.
At this time we are not considering a for-profit corporation for Hachyderm.
The exact details of our decision will be announced as we come to conviction and seek legal advice.
User Registration
At this time we do not have any plans to “cap” or limit user registration for Hachyderm.
There is a small chance we might temporarily close registration for small limited periods of time during events such as the DDoS Security Threat.
To be clear, we do not plan on rolling out a formal registration closure for any substantial or planned period of time. Any closure will be as short as possible, and will be opened up as soon as it is safe to do so.
We will be reevaluating this decision continuously. If at any point Hachyderm becomes bloated or unreasonably large we will likely change our decision.
User Registration and Performance
At this time we do not believe that user registration will have an immediate or noticeable impact on the performance of our systems. We do not believe that closing registration will somehow “make Hachyderm faster” or “make the service more reliable”.
We will be reevaluating this decision continuously. If at any point the growth patterns of Hachyderm change, we will likely revisit our decision.
Call for Volunteers
We will be onboarding new moderators and operators in January to help with our service. To help with that, we have created a short Typeform to consolidate all the volunteer offers so it is easier for us to reach back out to you when we’re ready:
The existing teams will be spending the rest of December cleaning up documentation and building out this community resource in a way that makes it easy for newcomers to be self-sufficient with our services.
As moderators and infrastructure teams reach a point of sustainability, each will announce the path forward for volunteers when they feel the time is right.
The announcements page on this website will be the source of truth.
Our Promise to Our Users
Hachyderm has signed The Mastodon Server Covenant which means we have given our commitment to give users at least 3 months of advance warning in case of shutting down.
My personal promise is that I will do everything in my power to support our users any way I can that does not jeopardize the safety of other users or myself.
We will be forming a broader set of governance and expectation setting for our users as we mature our services and documentation.
Sustainability
I wanted to share a few thoughts on sustainability with Hachyderm.
Part of creating a sustainable service for our users will involve participation from everyone. We are asking that all Hachydermians remind themselves that time, patience, and empathy are some of the most effective tools for creating sustainable services.
There will be some situations where we will have to make difficult decisions with regard to priority. Oftentimes the reason we aren’t immediately responding to an issue isn’t because we are ignoring it or are oblivious to it. It is because we have to spend our time and effort wisely in order to keep a sustainable posture for the service. We ask for patience, as it will sometimes take days or weeks to respond to issues, especially during production infrastructure issues.
We ask that everyone reminds themselves that pressuring our teams is likely counter productive to creating a sustainable environment.
Leaving the Basement
This post has been several weeks in the making. My hope is that it captures the vast majority of questions people have been asking recently with regard to Hachyderm.
To begin, I would like to introduce the state of Hachyderm before the migration, as well as the problems we were experiencing. Next, I will cover the root causes of the problems and how we found them. Finally, I will discuss the migration strategy, the problems we experienced, what we got right, and what can be better. I will end with an accurate depiction of how Hachyderm exists today.

Alice, our main on-premise server with her 8 SSDs. A 48 port Unifi Gigabit switch.
Photo: Kris Nóva
State of Hachyderm: Before
Hachyderm obtained roughly 30,000 users in 30 days; or roughly 1 new user every 1.5 minutes for the duration of the month of November.
I wrote 3 Medium articles during the month, each with the assumption that it would be my last for the month.
- November 3rd, 720 users Operating Mastodon, Privacy, and Content
- November 13th, 6,000 users Hachyderm Infrastructure
- November 25th, 25,000 users Experimenting with Federation and Migrating Accounts
Here are the servers that were hosting Hachyderm in the rack in my basement, which later became known as “The Watertower”.
| | Alice | Yakko | Wakko | Dot |
|---|---|---|---|---|
| Hardware | DELL PowerEdge R630 2x Intel Xeon E5-2680 v3 | DELL PowerEdge R620 2x Intel Xeon E5-2670 | DELL PowerEdge R620 2x Intel Xeon E5-2670 | DELL PowerEdge R620 2x Intel Xeon E5-2670 |
| Compute | 48 Cores (each 12 cores, 24 threads) | 32 Cores (each 8 cores, 16 threads) | 32 Cores (each 8 cores, 16 threads) | 32 Cores (each 8 cores, 16 threads) |
| Memory | 128 GB RAM | 64 GB RAM | 64 GB RAM | 64 GB RAM |
| Network | 4x 10Gbps Base-T 2x | 4x 1Gbps Base-T (intel I350) | 4x 1Gbps Base-T (intel I350) | 4x 1Gbps Base-T (intel I350) |
| SSDs | 238 GiB (sda/sdb), 4x 931 GiB (sdc/sdd/sde/sdf), 2x 1.86 TiB (sdg/sdh) | 558 GiB Harddrive (sda/sdb) | 558 GiB Harddrive (sda/sdb) | 558 GiB Harddrive (sda/sdb) |
It is important to note that all of the servers are used hardware, and all of the drives are SSDs.
“The Watertower” sat behind a few pieces of network hardware, including a large business fiber connection in Seattle, WA. Here are the traffic patterns we measured during November, and the advertised limitations from our ISP.
| Egress Advertised | Egress in Practice | Ingress Advertised | Ingress in Practice |
|---|---|---|---|
| 200 Mbps | 217 Mbps | 1 Gbps | 112 Mbps |
Our busiest traffic day was 11/21/22, when we processed 999.80 GiB in RX/TX traffic in a single day. During the month of November we averaged 36.86 Mbps in traffic, with samples taken every hour.
The server service layout is detailed below.
Problems in Production
For the vast majority of November, Hachyderm had been stable. Most users reported excellent experience, and our systems remained relatively healthy.
On November 27th, I filed the 1st of what would become 21 changelogs for our production infrastructure.
The initial report was failing images in production. The initial investigation led our team to discover that our NFS clients were behaving unreasonably slowly.
We were able to prove that NFS was “slow” by trying to navigate to a mounted directory and list files. In the best cases results would come back in less than a second. In the worst cases results would take 10-20 seconds. In some cases the server would lock up and a new shell would need to be established; NFS would never return.
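As a rough illustration, the spot check was essentially a timed directory listing against the NFS mount (the path below is illustrative, not necessarily the exact mount point):
# Time a directory listing on the NFS-mounted media path (illustrative path)
time ls -la /var/lib/mastodon/public/system | head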
I filed a changelog and mutated production. This became the first minor change in a week-long crisis to evacuate the basement.
We were unable to fix the perceived NFS slowness with my first change.
However, in the process of investigating NFS we did determine that our compute nodes had become heavily overloaded. Load averages on Yakko, Wakko, and Dot were well above 1,000 at this time.
Yakko, Wakko, and Dot were each housing multiple systemd units for our ingress, default, push, pull, and mailing queues – as well as the Puma web server hosting Mastodon itself.
At this point Alice was serving our media over NFS, as well as running Postgres, Redis, and a lightweight Nginx proxy to load balance across the Animaniacs (Yakko, Wakko, and Dot).
The problems began to cascade the night of the 27th, and continued to grow worse by the hour into the night.
- HTTP(s) response times began to degrade.
- Postgres response times began to degrade.
- NFS was still measurably slow on the client side.
The main observation was that the service would “flap”, almost as if it was deliberately toying with our psychology and our hope.
We would see long periods of “acceptable” performance when the site would “settle down”. Then, without warning, our alerts would begin to go off.
Hachyderm hosts a network of edge or point of presence (PoP) nodes that serve as a frontend caching mechanism in front of core.
During the “spikes” of failure, the edge Nginx logs began to record “Connection refused” messages.
The trend of “flapping” availability continued into the night. The service would recover and level out, then a spike in 5XX level responses, and then ultimately a complete outage on the edge.
This continued for several days.
A Note on Empathy
It is important to note that Hachyderm had grown organically over the month of November. Every log that was being captured, every graph that was consuming data, every secret, every config file, every bash script – all – were a consequence of reacting to the “problem” of growth and adoption.
I call this out, because this is very akin to most of the production systems I deal with in my career. It is important to have empathy for the systems and the people who work on them. Every large production is a consequence of luck. This means that something happened that caused human beings to flock to your service.
I am a firm believer that no system is ever “designed” for the consequences of high adoption. This is especially true with regard to Mastodon, as most of our team has never operated a production Mastodon instance before. To be candid, it would appear that most of the internet is in a similar situation.
We are all experimenting here. Hachyderm was just “lucky” to see adoption.
There is no such thing as a system that is both mechanistic and highly adopted. All systems that are a consequence of growth will be organic, and prone to the symptoms of reactive operations.
In other words, every ugly system is also a successful system. Every beautiful system, has never seen spontaneous adoption.
Finding Root Causes
By the 3rd day we had roughly 20 changelogs filed.
Each changelog capturing the story of a highly motivated and extremely hopeful member of the team believing they had once and for all identified the bottleneck. Each, ultimately failing to stop the flapping of Hachyderm.
I cannot say enough good things about the team who worked around the clock on Hachyderm. In many cases we were sleeping for 4 hours a night, and bringing our laptops to bed with us.
- @Quintessence wins the “Universe’s best incident commander” award.
- @Taniwha wins the “Best late night hacker and cyber detective” award.
- @hazelweakly wins the “Extreme research and googling cyberhacker” award.
- @malte wins the “Best architect and most likely to remain calm in a crisis” award.
- @dma wins the “Best scientist and graph enthusiast” award.
After all of our research, science, and detection work, we had narrowed our problem down to 2 disks on Alice.
/dev/sdg # 2Tb "new" drive
/dev/sdh # 2Tb "new" drive
The IOPS on these two particular drives would max out to 100% a few moments before the cascading failure in the rack would begin. We had successfully identified the “root cause” of our production problems.
Here is a graphic that captures the moment well. Screenshot taken from 2am Pacific on November 30th, roughly 3 days after production began to intermittently fail.
It is important to note that our entire production system was dependent on these 2 disks, as well as the ZFS pool which was managing the data on the disks:
[novix@alice]: ~>$ df -h
Filesystem Size Used Avail Use% Mounted on
dev 63G 0 63G 0% /dev
run 63G 1.7G 62G 3% /run
/dev/sda3 228G 149G 68G 69% /
tmpfs 63G 808K 63G 1% /dev/shm
tmpfs 63G 11G 53G 16% /tmp
/dev/sdb1 234G 4.6G 218G 3% /home
/dev/sda1 1022M 288K 1022M 1% /boot/EFI
data/novix 482G 6.5G 475G 2% /home/novix
data 477G 1.5G 475G 1% /data
data/mastodon-home 643G 168G 475G 27% /var/lib/mastodon
data/mastodon-postgresql 568G 93G 475G 17% /var/lib/postgres/data
data/mastodon-storage 1.4T 929G 475G 67% /var/lib/mastodon/public/system
tmpfs 10G 7.5G 2.6G 75% /var/log
Both our main media block storage and our main Postgres database were housed on ZFS. The more we explored the theory, the more we could correlate slow disks to slow database responses and slow media storage. Eventually our compute servers and web servers would max out our connection pool against the database and time out. Eventually our web servers would overload the media server and time out.
The timeouts would cascade out to the edge nodes and eventually cause:
- 5XX responses in production.
- Users hitting the “submit” button as our HTTP(s) servers would hang “incomplete” resulting in duplicate posts.
- Connection refused errors for every hop in our systems.
We had found the root cause. Our disks on Alice were failing.
Migration 1: Digital Ocean
We had made the decision to evacuate The Watertower and migrate to Hetzner weeks prior to the incident. However it was becoming obvious that our “slow and steady” approach to setting up picture-perfect infrastructure in Hetzner wasn’t going to happen.
We needed off Alice, and we needed off now.
A few notable caveats about leaving The Watertower.
- Transferring data off The Watertower was going to take several days with the current available speed of the disks.
- We were fairly confident that shutting down production for several days wasn’t an option.
- Our main problem was getting data off the disks.
Unexpectedly I received a phone call from an old colleague of mine @Gabe Monroy at Digital Ocean. Gabe offered to support Hachyderm altruistically and was able to offer the solution of moving our block storage to Digital Ocean Spaces for object storage.
Thank you to Gabe Monroy, Ado Kukic, and Daniel Hix for helping us with this path forward! Hachyderm will forever be grateful for your support!
There was one concern: how were we going to transfer over 1Tb of data to Digital Ocean on already failing disks?
One of our infrastructure volunteers @malte had helped us come up with an extremely clever solution to the problem.
We could leverage Hachyderm’s users to help us perform the most meaningful work first.
Solution: NGINX try_files
Malte’s model was simple:
- We begin writing data that is cached in our edge nodes directly to the object store instead of back to Alice.
- As users access data, we can ensure that it will be taken off Alice and delivered to the user.
- We can then leverage Mastodon’s S3 feature to write the “hot” data directly back to Digital Ocean using a reverse Nginx proxy.
We can point the try_files directive back to Alice, and only serve the files from Alice once, as they would be written back to S3 by the edge node accessing the files. Read the try_files documentation.
In other words, the more that our users accessed Hachyderm, the faster our data would replicate to Digital Ocean. Conveniently this also meant that we would copy the data that was being immediately used first.
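As a rough sketch, the edge behavior being described looks something like the Nginx fragment below. This is illustrative only: the hostnames, paths, and location names are placeholders, not our actual production configuration.
# Illustrative sketch only: placeholders, not production config.
# Serve the file from the local/object-store-backed root if it already exists,
# otherwise fall back to fetching it from Alice (which only needs to happen once,
# since new writes land in object storage).
location /system/ {
    root /var/cache/mastodon-media;
    try_files $uri @alice;
}

location @alice {
    proxy_pass https://alice.internal;   # placeholder upstream for Alice
}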
We could additionally run a slow rclone for the remaining data; that copy is still running 2+ days later as I write this blog post.
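That background copy was roughly of this shape (a hedged sketch: the rclone remote name, bucket, and limits here are assumptions, not the exact command we ran):
# Hedged sketch: remote name ("spaces"), bucket, and limits are assumptions.
# Slowly copy the remaining media to DigitalOcean Spaces without saturating
# the already-failing disks or the uplink.
rclone copy /var/lib/mastodon/public/system spaces:hachyderm-media \
  --transfers 4 --bwlimit 10M --progress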
This was the most impressive solution I have seen to a crisis problem in my history of operating distributed systems. Our users were able to help us transfer our data to Digital Ocean just by leveraging the service. The more they used Hachyderm, the more we migrated off Alice’s bad disks.
Migration 2: Hetzner
By the time the change had been in production for a few hours, we all had noticed a substantial increase in our performance. We were able to remove NFS from the system, and shuffle around our Puma servers, and sidekiq queues to reduce load on Postgres.
Alice was serving files from the bad disks, however all of our writes were now going to Digital Ocean.
While our system’s performance did “improve”, it was still far from perfect. HTTP(s) requests were still very slow, and in some cases would time out and flap.
At this point it was easy to determine that Postgres (and its relationship to the bad disks) was the next bottleneck in the system.
Note: We still have an outstanding theory that ZFS, specifically the unbalanced mirrors, is also a contributing factor. We will not be able to validate this theory until the service is completely off Alice.
It would be slightly more challenging coming up with a clever solution to get Postgres off Alice.
On the morning of December 1st we finished replicating our Postgres data across the Atlantic onto our new fleet of servers in Hetzner.
- Nixie (Alice replacement)
- Freud (Yakko)
- Fritz (Wakko)
- Franz (Dot)
We will be publishing a detailed architecture of the current system in Hetzner as we have time to finalize it.
Our team made an announcement that we were shutting production down, and scheduled a live stream to perform the work.
The video of the cutover is available to watch directly on Twitch.
NodeJS and Mastodon
The migration would not be complete without calling out that I was unable to build the Mastodon code base on our new primary Puma HTTP server.
After what felt like an eternity we discovered that we needed to recompile the NodeJS assets.
cd /var/lib/mastodon
NODE_OPTIONS=--openssl-legacy-provider RAILS_ENV=production bundle exec rails assets:precompile
Eventually we were able to build and bring up the Puma server which was connected to the new postgres server.
We moved our worker queues over to the new servers in Hetzner.
The migration was complete.
State of Hachyderm: After
To be candid, Hachyderm “just works” now and we are serving our core content within the EU in Germany.
There is an ever-shrinking amount of traffic still routing through Alice as our users access more and more obscure files.
Today we have roughly 700Gb of our 1.2Tb of data transferred to Digital Ocean.
We will be destroying the ZFS server in Alice, and replacing the disks as soon as we can completely take The Watertower offline.
On our list of items to cover moving forward:
- Offer a detailed public resource of our architecture in Hetzner complete with Mastodon specific service breakdowns.
- Build a blog and community resource such that we can begin documenting our community and bringing new volunteers on board.
- Take a break, and install better monitoring on our systems.
- Migrate to NixOS or Kubernetes depending on the needs of the system.
- Get back to working on Aurae, now with a lot more product requirements than we had before.
Conclusion
We suffered from pretty common pitfalls in our system. Our main operational problems stemmed from scaling humans, and not our knowledge of how to build effective distributed systems. We have observability, security, and infrastructure experts from across Silicon Valley working on Hachyderm and we were still SSHing into production and sharing passwords.
In other words, our main limitations to scale were managing people, processes, and organizational challenges. Even determining who was responsible for what was a problem in itself.
We had a team of experts without any formal precedent working together, and no legal structure or corporate organization to glue us together. We defaulted back to some bad habits in a pinch, and also uncovered some exciting new patterns that were only made possible because of the constraints of the fediverse.
Ultimately, I and the entire team are convinced that the future of the internet and social media is going to be in large collaborative operational systems that operate in a decentralized network.
We made some decisions during the process, such as keeping registrations open, that I agree with; I think I would make the same decisions again. Our limiting factor in Hachyderm had almost nothing to do with the amount of users accessing the system, and almost everything to do with the amount of data we were federating. Our system would have flapped whether we had 100 users or 1,000,000 users. We were nowhere close to hitting limits of DB size, storage size, or network capacity. We just had bad disks.
I think the biggest risk we took was onboarding new people to a slow/unresponsive service for a few days. I am willing to accept that as a good decision as we are openly experimenting with this entire process.
I have said it before, and I will say it again. I believe in Hachyderm. I believe we can build a beautiful and effective social media service for the tech industry.
The key to success will be how well we experiment. This is the time for good old-fashioned computer science, complete with thoughtful hypotheses and detailed observability to validate them.
Incidents
TLS Expires: media.hachyderm.io
On February 28th, 2023 at approximately 01:55 UTC Hachyderm experienced a service degradation in which images failed to load in production.
We were able to quickly identify the root cause as expired TLS certificates in production for media.hachyderm.io.
Context
Hachyderm TLS certificates are still managed manually, and are very clearly sprawling out of control due to our rapid growth. There are many certificates on various servers that have had config copied from one server to another as we grew into our current architecture.
The alert notification was missed, and the media.hachyderm.io TLS privkey.pem and fullchain.pem material expired, causing the service degradation.
Timeline
- Feb 28th 01:52 @quintessence First report of media outages
- Feb 28th 01:54 @nova Confirms media is broken from remote proxy in EU
- Feb 28th 01:56 @nova Appoints @quintessence as incident commander
- Feb 28th 01:57 @nova Confirms TLS expired on media.hachyderm.io
- Feb 28th 02:30 @nova Live streaming fixing TLS
Shortly after starting the stream we discovered that the Acme challenge was not working because the media.hachyderm.io DNS record was pointed to CNAME hachyderm.io and the proxy was not configured to manage the request. In the past we have worked around this by editing the CDN on the East coast, which is where the Acme challenge will resolve.
In this case we changed the media.hachyderm.io DNS record to point to A <ip-of-fritz>, which is where the core web server was running.
We re-ran the renew process and it worked!
sudo -E certbot renew
We then re-pointed media.hachyderm.io back to CNAME hachyderm.io.
Next came the scp command to move the new cert material out to the various CDN nodes and restart nginx.
# Copy TLS from fritz -> CDN host
scp /etc/letsencrypt/archive/media.hachyderm.io/* root@<host>:/etc/letsencrypt/archive/media.hachyderm.io/
# Access root on the CDN host
ssh root@<host>
# Private key (on CDN host)
rm -f /etc/letsencrypt/live/media.hachyderm.io/privkey.pem
ln -s /etc/letsencrypt/archive/media.hachyderm.io/privkey3.pem /etc/letsencrypt/live/media.hachyderm.io/privkey.pem
# Fullchain (on CDN host)
rm -f /etc/letsencrypt/live/media.hachyderm.io/fullchain.pem
ln -s /etc/letsencrypt/archive/media.hachyderm.io/fullchain3.pem /etc/letsencrypt/live/media.hachyderm.io/fullchain.pem
The full list of CDN hosts:
- cdn-frankfurt-1
- cdn-fremont-1
- sally
- esme
Restarting nginx on each of the CDN hosts was able to fix the problem.
# On a CDN host
nginx -t # Test the config
systemctl reload nginx # Reload the service
# On your local machine
emacs /etc/hosts # Point "hachyderm.io" and "media.hachyderm.io" to IP of CDN host
# Check your browser for working images
Impact
- Full image outage across the site in all regions.
- A stressful situation interrupting dinner and impacting the family.
- Even more chaos and confusion with certificate material.
Lessons Learned
- We still have outstanding legacy certificate management problems.
Things that went well
- We had a quick report, and the mean time to resolution was <60 mins.
Things that went poorly
- The certs are in an even more chaotic state.
- There was no alerting that the images broke.
- There was a high-stress situation that impacted our personal lives.
Where we got lucky
- I still had access to the servers, and was able to remedy the situation from existing knowledge.
Action items
- We need to destroy the vast majority of nginx configurations and domains in production
- We need to destroy all TLS certs and re-create them with a cohesive strategy
- We need a better way to perform the Acme challenge that doesn’t involve changing DNS around the globe
- Nóva to send list of domains to Discord to destroy
Fritz Timeouts
On January 7th, 2023 at approximately 22:26 UTC Hachyderm experienced a spike in HTTP response times as well as a spike in 504 Timeouts across the CDN.
Working backwards from the CDN to fritz, we discovered another cascading failure.
Context
There is a fleet of CDN nodes around the world, commonly referred to as “POP” servers (Point of Presence) or even just “The CDN”. These servers reverse proxy over dedicated connections back to our core infrastructure.
These CDN servers began serving content timeouts at roughly 22:20:00 UTC.
These CDN servers depend on the mastodon-streaming service to offer websocket connections.
Impact
- Total streaming server outage reported in Discord (Uptime Robot)
- Slow/Timeouts reported by users in Twitch chat
- Nóva noticed slow/timeouts on her phone
- HTTP response times measured > 3s
Background
We received some valuable insight from @ThisIsMissEm who has experience with both node.js websocket servers and the mastodon codebase, which can be read here in HackMD.
An important takeaway from this knowledge is that the mastodon-streaming service and the mastodon-web service will not rate limit if they are communicating over localhost.
In other words, you should be scheduling mastodon-streaming on the same node you are running mastodon-web.
We believe that, because of the way the streaming API works, a “large event” such as a post going out from a highly followed account can cause a cascading effect on everyone connected via the streaming API.
A good metric to track would actually be the percentage of connections that a single write is going to. If the mastodon server has one highly followed user, a post by them, especially in a “busy” timezone for the instance, will result in unbalanced write behaviours, where one message posted will result in iterating over a heap more connections than others (one per follower who’s connected to streaming), so you can end up doing 40,000 network writes very easily, locking up node.js temporarily from processing disconnections correctly.
We believe that the streaming API began to drop connections, which cascaded out to the CDN nodes via the mastodon-web service.
We can correlate this theory by connecting observed log lines to the Mastodon code base.
Logs from mastodon-streaming on Fritz
06-4afe-a449-a42f861855b2 Tried writing to closed socket
33-414d-9143-6a5080bd6254 Tried writing to closed socket
33-414d-9143-6a5080bd6254 Tried writing to closed socket
06-4afe-a449-a42f861855b2 Tried writing to closed socket
06-4afe-a449-a42f861855b2 Tried writing to closed socket
06-4afe-a449-a42f861855b2 Tried writing to closed socket
06-4afe-a449-a42f861855b2 Tried writing to closed socket
06-4afe-a449-a42f861855b2 Tried writing to closed socket
06-4afe-a449-a42f861855b2 Tried writing to closed socket
06-4afe-a449-a42f861855b2 Tried writing to closed socket
06-4afe-a449-a42f861855b2 Tried writing to closed socket
06-4afe-a449-a42f861855b2 Tried writing to closed socket
41-4385-9762-c5c1d829ba27 Tried writing to closed socket
0f-4eb4-9751-b5ac7e21c648 Tried writing to closed socket
06-4afe-a449-a42f861855b2 Tried writing to closed socket
06-4afe-a449-a42f861855b2 Tried writing to closed socket
60-40d1-99b4-349f03610b36 Tried writing to closed socket
60-40d1-99b4-349f03610b36 Tried writing to closed socket
33-414d-9143-6a5080bd6254 Tried writing to closed socket
06-4afe-a449-a42f861855b2 Tried writing to closed socket
Code from Mastodon main
const streamToWs = (req, ws, streamName) => (event, payload) => {
if (ws.readyState !== ws.OPEN) {
log.error(req.requestId, 'Tried writing to closed socket');
return;
}
Found in mastodon/streaming/index.js
Logs (correlation) from mastodon-web on Fritz
This is where we are suspecting that we are hitting the “Rack Attack” rate limit in the streaming service.
-4589-97ed-b67c66eb8c38] Rate limit hit (throttle): 98.114.90.221 GET /api/v1/timelines/home?since_id=109>
Working Theory (root cause)
We are maxing out the streaming service on Fritz, and it is rate limiting the mastodon web (puma) service. The “maxing out” can be described in the write-up by @ThisIsMissEm where NodeJS struggles to process/drop the connections that are potentially a result of a “Large Event”.
As the websocket count increases, there is a cascading failure that starts on Fritz and works its way out to the nodes.
Eventually the code that is executing (looping) over the large amounts of websockets will “break” and there is a large release where a spike in network traffic can be observed.
We see a relatively enormous number of events occur during the second of 22:17:30 on Fritz, which we suspect is the “release” of the execution path.
As the streaming service recovers, the rest of Hachyderm slowly stabilizes.
Lessons Learned
Websockets are a big deal, and will likely be the next area of our service we need to start observing.
We will need to start monitoring the relationship between the streaming service and the main mastodon web service pretty closely.
Things that went well
We found some great help on Twitch, and we ended up discovering an unrelated (but potentially disastrous) problem with Nietzsche (the main database server).
We have a path forward for debugging the streaming issues.
Things that went poorly
Nóva was short on Twitch again and struggled to deal with a lot of “noise/distractions” while debugging production.
In general there isn’t much more we can do operationally other than keep a closer eye on things. The code base is gonna’ do what the code base is gonna’ do until we decide to fork it or wait for improvements from the community.
Where we got lucky
Seriously the Nietzsche discovery was huge, and had nothing to do with the streaming “hiccups”. We got extremely lucky here.
In the course of that discovery, Nóva fixed the problem on Nietzsche: our main database NVMe disk was at 98% capacity.
- We did NOT receive storage alerts in Discord (I believe we should have?)
- Nóva could NOT find an existing cron job on the server to clean the archive.
- Nóva scheduled the cron job (using sudo crontab -e)
The directory (archive) that was full:
/var/lib/postgres/data/archive
Nietzsche is now back down to ~30%
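For reference, a root cron entry for this kind of cleanup could look like the sketch below (the schedule and retention window are assumptions, not the exact job that was added):
# Sketch only: schedule and retention window are assumptions.
# Every night at 03:00, delete archived WAL files older than 7 days.
0 3 * * * find /var/lib/postgres/data/archive -type f -mtime +7 -delete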
Action items
1) Set up websocket observability on CDN nodes (clients) and Fritz (server)
We want to see how many “writes” we have on the client side and how many socket connections they are mapped to if possible. We might need to PR a log entry for this to the Mastodon code base.
2) Verify cron is running on Nietzsche
We need to make sure the cron is running and the archive is emptying
3) Debug why we didn’t receive Nietzsche alerts
I think we should have seen these, but I am not sure?
4) We likely need a bigger “Fritz”
Sounds like we need donations and a bigger server (it will be hard to move streaming off of the same machine as web).
Fritz on the fritz
On January 3rd, 2023 at approximately 12:30 UTC, Hachyderm experienced a spike in response times. This appeared to be due to a certificate that had not been renewed on fritz, which runs the Mastodon Puma and Streaming services. The service appeared to recover until approximately 15:00 UTC, when another spike in response times was observed.
Alerts were firing in Discord, alerting us to the issue.
Background
fritz runs mastodon-web and mastodon-streaming, and all other web nodes proxy to fritz.
mastodon-web was configured with 16 processes, each having 20 threads.
mastodon-streaming was configured with 16 processes.
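These numbers map onto Mastodon's standard environment variables; a rough sketch of the corresponding settings follows (file locations and layout are assumptions, the values mirror the text above):
# Puma (mastodon-web)
WEB_CONCURRENCY=16         # Puma worker processes
MAX_THREADS=20             # threads per Puma worker
# Streaming (mastodon-streaming)
STREAMING_CLUSTER_NUM=16   # node.js streaming worker processes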
Impact
p90 response times grew from ~400ms to >2s. 502 responses increased to >1,000 per minute.
Root causes and trigger
Organic growth in users and traffic, coupled with the US returning from vacation, caused the streaming and puma processes on fritz to use more CPU. CPU load hit >90% consistently on fritz. This in turn caused responses to fail to be returned to the upstream web frontends.
Lessons Learned
Response times are very sensitive to puma threads (reducing from 20 to 16 threads per process doubled GET response times).
The site functions well with fewer streaming processes.
Things that went well
We had the core CPU load on the public dashboard.
Things that went poorly
In an attempt to get things under control, both mastodon-streaming and mastodon-web were changed. Puma was then reverted, as we had over-corrected and response times were getting quite bad.
No CPU load alerts were configured for fritz specifically.
Where we got lucky
@dma was already keyed in to fritz thanks to an earlier issue where certs hadn’t been renewed.
Action items
1) Streaming processes reduced @dma [repair]
Reduced the number of streaming processes on fritz from 16 to 12.
2) Better alerting on CPU load @dma [detect]
We should implement better CPU load alerting on every host to detect these issues and be able to respond even more quickly.
3) Postmortem documented @dma
This blog post and a hackmd postmortem doc.
The Queues ☃️ down in Queueville
Every Queue down in Queueville liked ActivityPub a lot. But John Mastodon, who lived just north of Queueville, did not! John Mastodon hated ActivityPub, the whole Activity season! Now please don’t ask why. No one quite knows the reason.
It could be, perhaps, that his WEB_CONCURRENCY was too tight.
It could be his MAX_THREADS wasn’t screwed on just right.
But I think that the most likely reason of all
May have been that his CPU was two sizes too small.
But, whatever the reason, his WEB_CONCURRENCY or CPUs,
He stood there on Activity Eve hating the Queues…
Staring down from his cave with systemd hacks
At the warm buzzing servers below in their racks
For he knew every Queue down in Queueville beneath Was busy now hanging an Activity-Wreath. “And they’re posting their statuses,” he snarled with a sneer. “Tomorrow is Activity-Mas! It’s practically here!”
Then he growled, with John Mastodon fingers nervously drumming, “I must find some way to keep the statuses from coming”!
For, tomorrow, I know all the Queues and the “they”s and the “them”s Will wake bright and early for ActivitySeason to begin!
And then! Oh, the noise! Oh, the noise! Noise! Noise! Noise! There’s one thing John Mastodon hates: All the NOISE! NOISE! NOISE! NOISE!
And they’ll shriek squeaks and squeals, racing ‘round on their hosts. They’ll update with jingtinglers tied onto their posts! They’ll toot their floofloovers. They’ll tag their tartookas. They’ll share their whohoopers. They’ll follow their #caturday-ookas. They’ll spin their #hashtags. They’ll boost their slooslunkas. They’ll defederate their blumbloopas. But complain about their whowonkas.
And they’ll play noisy games like post a cat on #caturday, An ActivityPub type of all the queers and the gays! And then they’ll make ear-splitting noises galooks On their great big postgres whocarnio ruby monolith flooks!
Then the Queues, young and old, will sit down to a feast. And they’ll feast! And they’ll feast! And they’ll FEAST! FEAST! FEAST! FEAST!
They’ll feast on Queue-pudding, and rare Queue-roast-beast, Ingress Queue roast beast is a feast I can’t stand in the least!
And then they’ll do something I hate most of all! Every Queue down in Queueville, the tall and the small,
They’ll stand close together, with UptimeRobot bells ringing. They’ll stand hand-in-hand, and those Queues will start singing!
And they’ll sing! And they’ll sing! And they’d SING! SING! SING! SING! And the more John Mastodon thought of this Queue Activity Sing, The more John Mastodon thought, “I must stop this whole thing!”
Why for fifty-three days I’ve put up with it now! I must stop ActivityPub from coming! But how?
Timeline
All events are documented in UTC time.
- 13:00 @dma Noticed the ingress queue was backing up
- 16:45 @quintessence Noticed the ingress queue was still lagging
- 17:00 @nova Declared an incident
- 17:30 @hazelweakly Noticed CPU at 100% on Freud and Franz
- 17:34 @hazelweakly Worked with @dma to rebalance queues across Freud, Franz, and Nietzsche
- 17:37 @dma Notices CPU on Nietzsche is not changing
- 17:45 @hazelweakly Changes MAX_THREADS from 5 to 20 on Nietzsche
ActivityEve
“I know just what to do!” John Mastodon laughed in his throat. “I’ll max out the CPU, and cause the network to bloat.”
And he chuckled, and clucked, “What a great John Mastodon trick! With this CPU and network lag, I’ll cause the latency to stick!”
“All I need is a denial of service.” John Mastodon looked around. But since denial of services are scarce, there was none to be found.
Did that stop John Mastodon? Hah! John Mastodon simply said, “If I can’t find a denial of service, I’ll make one instead!”
So he took his dog MAX, and he took some more EMPTY_THREADS.
And he tied big WEB_CONCURRENCY on top of his head.
Then he loaded some cores and some old empty racks.
On a ramshackle sleigh and he whistled for MAX.
Then John Mastodon said “Giddyap!” and the sleigh started down Toward the homes where the Queues lay a-snooze in their town.
All their graphs were dark. No one knew he was there. All the Queues were all dreaming sweet dreams without care. When he came to the first little house of the square.
“This is stop number one,” John Mastodon hissed, As he climbed up load average, empty cores in his fist.
Then he slid down the ingress, a rather tight bond. But if a denial of service could do it, then so could John Mastodon.
The queues drained only once, for a minute or two. Then he stuck his posts out in front of the ingress queue!
Where the little Queue messages hung all in a row. “These messages,” he grinched, “are the first things to go!”
Then he slithered and slunk, with a smile most unpleasant, Around the whole server, and he took every message!
Cat pics, and updates, artwork, and birdsite plea’s! Holiday cheer, Hanukkah, Kwanza and holiday trees!
And he stuffed them in memory. John Mastodon very nimbly, Stuffed all the posts, one by one, up the chimney.
Then he slunk to the default queues. He took the queues’ feast! He took the queue pudding! He took the roast beast!
He cleaned out that /inbox as quick as a flash.
Why, John Mastodon even took the last can of queue hash!
Then he stuffed all the queues up the chimney with glee. “Now,” grinned John Mastodon, “I will stuff up the whole process tree!”
As John Mastodon took the process tree, as he started to shove, He heard a small sound like the coo of a dove…
He turned around fast, and he saw a small Queue! Little Cindy-Lou Queue, who was no more than two.
She stared at John Mastodon and said, “our statuses, why? Why are you filling our queues? Why?”
But, you know, John Mastodon was so smart and so slick, He thought up a lie, and he thought it up quick!
“Why, my sweet little tot,” John Mastodon lied,
“There’s a status on this /inbox that won’t light on one side.
So I’m taking it home to my workshop, my dear. I’ll fix it up there, then I’ll bring it back here.”
And his fib fooled the child. Then he patted her head, And he got her a drink, and he sent her to bed.
And when Cindy-Lou Queue was in bed with her cup, He crupt to the chimney and stuffed the ingress queues up!
Then he went up the chimney himself, the old liar.
And the last thing he took was /var/log for their fire.
On their .bash_history he left nothing but hooks and some wire.
And the one speck of content that he left in the house Was a crumb that was even too small for a mouse.
Then he did the same thing to the other Queues’ houses, Leaving crumbs much too small for the other Queues’ mouses!
Timeline
All events are documented in UTC time.
- 17:58 @dma Notices we are no longer bottlenecked on Ingress after @hazelweakly makes changes
- 18:03 @dma Provides update on priority of systemd flags
- 18:10 @dma Provides spreadsheet for us to calculate connections to database
ActivityMorn
It was quarter of dawn. All the Queues still a-bed, All the Queues still a-snooze, when he packed up his sled,
Packed it up with their statuses, their posts, their wrappings, Their posts and their hashtags, their trendings and trappings!
Ten thousand feet up, up the side of Mount Crumpet, He rode with his load average to the tiptop to dump it!
“Pooh-pooh to the Queues!” he was John Mastodon humming. “They’re finding out now that no ActivityPub messages are coming!
They’re just waking up! I know just what they’ll do! Their mouths will hang open a minute or two Then the Queues down in Queueville will all cry boo-hoo!
That’s a noise,” grinned John Mastodon, “that I simply must hear!” He paused, and John Mastodon put a hand to his ear.
And he did hear a sound rising over the snow. It started in low, then it started to grow.
But this sound wasn’t sad! Why, this sound sounded glad!
Every Queue down in Queueville, the tall and the small, Was singing without any ActivityPub messages at all!
He hadn’t stopped ActivityPub messages from coming! They came! Somehow or other, they came just the same!
And John Mastodon, with his feet ice-cold in the snow, Stood puzzling and puzzling. “How could it be so?”
Posts came without #hashtags! It came without tags! It came without content warnings or bags!
He puzzled and puzzled till his puzzler was sore. Then John Mastodon thought of something he hadn’t before.
Maybe ActivityPub, he thought, doesn’t come from a database store. Maybe ActivityPub, perhaps, means a little bit more!
And what happened then? Well, in Queueville they say That John Mastodon’s small heart grew three sizes that day!
And then the true meaning of ActivityPub came through, And John Mastodon found the strength of ten John Mastodon’s, plus two!
And now that his heart didn’t feel quite so tight, He whizzed with his load average through the bright morning light!
With a smile to his soul, he descended Mount Crumpet Cheerily blowing “Queue! Queue!” aloud on his trumpet.
He rode into Queueville. He brought back their joys. He brought back their #caturday images to the Queue girls and boys!
He brought back their status and their pictures and tags, Brought back their posts, their content and #hashtags.
He brought everything back, all the CPU for the feast! And he, he himself, John Mastodon carved the roast beast!
Welcome ActivityPub. Bring your cheer, Cheer to all Queues, far and near.
ActivityDay is in our grasp So long as we have friends’ statuses to grasp.
ActivityDay will always be Just as long as we have we.
Welcome ActivityPub while we stand Heart to heart and hand in hand.
Timeline
All events are documented in UTC time.
- 18:10 @hazelweakly Provides update that queues are now balancing and load is coming down
- 18:18 @nova Confirms queues are draining and systems are stabilizing
Root Cause
John Mastodon took the queue hash, and up the chimney he stuck it. The Hachyderm crew was too tired to fill out the report and said “fuck it”.
Nietzsche:
- 4 default queues (unchanged)
- 32 ingress queues (changed)
Franz:
- 6 default queues (unchanged)
- 1 ingress queue (changed)
- 5 pull queues (unchanged)
- 5 push queues (unchanged)
Freud:
- 3 default queues (unchanged)
- 2 ingress queues (changed)
- 2 pull queues (changed)
- 2 push queues (changed)
Changes:
Because the database connection count per ingress queue process changed, queue amounts are clarified in terms of database connections where necessary. A sketch of the resulting ingress worker configuration follows this list.
- Moved 2 ingress queues (40 DB connections) from franz to nietzsche
- Moved 2 ingress queues (40 DB connections) from freud to nietzsche
- Changed DB_POOL on ingress queues from 20 to 5, as they're heavily CPU bound.
- Changed sidekiq concurrency (-c) on ingress queues from 20 to 5 for the same reason.
- Scaled Nietzsche up from 8 ingress queues to 32 to keep the amount of total database connections the same.
- Restarted the one ingress queue remaining on franz (this lowered ingress DB connections from 20 to 5).
- Restarted the two ingress queues remaining on freud (this lowered ingress DB connections from 40 to 10).
- Removed a "pushpull" systemd service on Freud and replaced it with independent push and pull sidekiq processes (neutral db connection change).
Degraded Service: Media Caching and Queue Latency
On Saturday, December 17th, 2022 at roughly 12:43 UTC Hachyderm received our first report of media failures, which started a 2-day investigation of our systems by @hazelweakly, @quintessence, @dma, and @nova. The investigation coincided with a well-anticipated spike in growth, which simultaneously and unexpectedly degraded our systems as well.
The first degradation was unplanned media failures, typically avatar and profile icons intermittently failing to load on the service. We saw an increase in 4XX-level responses due to misconfigured cache settings in our CDN. We believe the Western US was the only region impacted by this degradation.
The second degradation was an unplanned increase in queue latency, presumably from the increase in usage that followed the Twitter mass exodus. We saw backlogs in our push and pull queues, as well as a short period of default queue latency.
Timeline
All events are documented in UTC time.
- Dec 16th 12:43 @arjenpdevries First report of media cache misses #217
- Dec 17th 08:21 @blueturtleai 2nd report, and first confirmation of media cache misses #218
- Dec 17th 21:43 @quintessence 3rd report of media cache misses
- Dec 17th 21:44 @nova False remediation of cmd+shift+r cache refresh
- Dec 17th 22:XX More reports of cache failures, multiple Discord channels, and posts
- Dec 17th 23:XX More reports of cache failures, multiple Discord channels, and posts
- Dec 17th 24:XX Still assuming “cache problems” will just fix themselves
- Dec 18th 14:45 @dma Nginx audit and location{} rewrite on fritz; no results
- Dec 18th 14:45 @dma No success debugging various CDN nodes and cache strategies
- Dec 18th 15:16 @dma Check mastodon-web logs on CDNs; /system GETs with 404s
- Dec 18th 20:32 @hazelweakly Discovered .env.production misconfiguration on cdn-frankfurt-1 and franz
- Dec 18th 20:41 @quintessence Confirms queues are backing up
- Dec 18th 20:45 @hazelweakly Confirms actively reloading services to drain queues
- Dec 18th 21:17 @malte_j Appears from vacation, and is told to go back to relaxing
- Dec 18th 21:23 @hazelweakly Continues to “tweak and tune” the queues
- Dec 18th 21:32 @hazelweakly Claims we are growing at <1 user per minute
- Dec 18th 21:45 @dma Reminder to only focus on ingress and default queues
- Dec 18th 21:47 @hazelweakly Identifies queue priority fix using systemd units
- Dec 18th 21:47 @hazelweakly Suggests moving queues to CDN nodes
- Dec 18th 21:59 @dma Suggests migrating DB from freud -> nietzsche
- Dec 18th 22:15 @hazelweakly Summary confirms sidekiq running on CDNs
- Dec 18th 22:18 @nova Identifies conversation in Discord, and begins report
Root Cause
The cause of the caching 4XX responses and broken avatars was a misconfigured .env.production file on cdn-fremont-1 and franz.
S3_ENABLED=FALSE # Should be true
3_BUCKET=".." # Should be S3_BUCKET
The cause of the queue latency is suspected to be the increase in usage from Twitter, combined with the queue priority configuration described in the official Mastodon scaling up documentation.
ExecStart=/usr/bin/bundle exec sidekiq -c 10 -q default
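As a rough illustration of the priority problem (the flags and weights below are illustrative, not our exact units): a unit like the one above serves only the default queue, and the scaling guidance is, roughly, to either list queues with weights on one process or run dedicated processes per queue.

# One process serving several queues with weights (weights here are illustrative)
ExecStart=/usr/bin/bundle exec sidekiq -c 10 -q default,8 -q ingress,4 -q push,2 -q pull,1
# ...or a dedicated worker per queue, e.g. an ingress-only process
ExecStart=/usr/bin/bundle exec sidekiq -c 10 -q ingress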
Things that went well
The media cache is fixed, and we were alerted to a high-risk concern early, giving the team enough time to respond.
Things that went poorly
An outage was never declared for this incident, and therefore it was not handled as well as it could have been. Various members of the team were mutating production with reckless working habits:
- Documenting informally in private infrastructure GitHub repository
- Discord used as documentation
- Not documenting at all, just “tinkering” alone
- Documenting after the fact
- Not using descriptive language, e.g. “Tweaked the CDNs” instead of stating what was changed, on which system, from which value to which value.
The state of production was unknown after the incident: we were unsure which services were running where, and who had what expectations for which services.
The configuration roll-out had clearly failed at some point, indicating a stronger need for config management on our servers.
We lost track of where the incident started and stopped and where improvements and action items began. We began suggesting next steps before we were sure of the current state of the systems and before we had a plan in place.
Opportunities
Config management should be a top priority.
Auditing and migrating sidekiq services off of CDN nodes should be a top priority.
Migrating the database from freud -> nietzsche should be a priority.
We shouldn’t be planning or discussing future improvements until the systems are restored to stability. Incidents are not a venue for decision-making.
Resulting Action
1) Plan for Postgres migration
@nova and @hazelweakly are planning a live stream to migrate the production database and free up more compute for the sidekiq queues.
2) TODO Configuration Management
We need to identify a configuration management pattern for our systems sooner rather than later. This is perhaps an opportunity for a new volunteer.
3) TODO Discord Bot Incident Command
We need to identify ways of managing, starting, and stopping incidents using Discord. In the future we may be able to run “live operating room” incidents where folks can watch read-only during the action.
Global Outage: 504 Timeouts
On Tuesday, December 13th, 2022 at roughly 18:52 UTC Hachyderm experienced a 7-minute cascading failure that impacted our users around the globe, resulting in unresponsive HTTP(S) requests and 5XX-level responses. The service did not experience any data loss. We believe this was a total service outage.
Impacted users experienced 504 timeout responses from https://hachyderm.io
in all regions of the world.
Timeline
All events are documented in UTC time.
- 18:53 @nova First report of slow response times in Discord
- 18:55 @dma First confirmation, and first report of 5XX responses globally
- 18:56 @dma Check of Mastodon web services, no immediate concerns
- 18:56 @nova Check of CDN proxy services, no immediate concerns
- 18:57 @nova First observed 504 timeout
- 18:58 @dma status.hachyderm.io updated acknowledging the outage
- 18:59 @nova First observed redis error, unable to persist to disk
Dec 13 18:59:01 fritz bundle[588687]: [2eae54f0-292d-488e-8fdd-5c35873676c0] Redis::CommandError (MISCONF Redis is configured to save RDB snapshots, but it's currently unable to persist to disk. Commands that may modify the data set are disabled, because this instance is configured to report errors during writes if RDB snapshotting fails (stop-writes-on-bgsave-error option). Please check the Redis logs for details about the RDB error.):
- 19:01 @UptimeRobot First alert received: "Monitor is DOWN: hachyderm streaming ( https://hachyderm.io/api/v1/streaming/health ) - Reason: HTTP 502 - Bad Gateway"
- 19:02 @nova Root cause detected. The root filesystem is full on our primary database server.
- 19:04 @nova Identified postgres archive /var/lib/postgres/archive data exceeds 400Gb of history
- 19:05 @malte_j Request to destroy archive
- 19:06 @malte_j Confirmed archive has been destroyed
- 19:06 @malte_j Confirmed 187Gb of space has been recovered
- 19:06 @dma status.hachyderm.io updated acknowledging the root cause
- 19:07 @nova Begin drafting postmortem notes
- 19:16 @nova Official announcement posted to Hachyderm
Root Cause
A full root filesystem on the primary database server resulted in a cascading failure that first impacted Redis’s ability to persist to disk, and later resulted in 5XX responses at the edge.
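For future responders, the sketch below shows a quick way to confirm this particular failure mode from a shell; these are standard df and redis-cli invocations rather than anything Hachyderm-specific.

# Is the database server's root filesystem actually full?
df -h /
# Did the last Redis background save fail, and is Redis refusing writes because of it?
redis-cli info persistence | grep -E 'rdb_last_bgsave_status|aof_last_write_status'
redis-cli config get stop-writes-on-bgsave-error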
Things that went well
We had a place to organize, and folks on standby to respond to the incident.
We were able to respond and recover in less than 10 minutes.
We were able to document and move forward in less than 60 minutes.
Things that went poorly
There was confusion about who had access to update status.hachyderm.io, and this is still unclear.
There was confusion about where redis lived, and which systems were interdependent upon redis in the stack.
The Novix installer is still our largest problem and is responsible for a lot of confusion. We do not have a better way forward to manage packages and configs in production. We need to decide on Nix and our path forward as soon as possible.
Opportunities
We need to harden our credential management and account management processes, and ensure the team has the access to our systems that it needs.
We need a global architecture view, ideally observed from the systems themselves rather than maintained as a static diagram.
When an announcement is resolved, it removes the status entirely from UptimeRobot. We can likely improve this.
Resulting Action
1) Cron cleanup scheduled @malte_j
Cron scheduled to remove postgres archive files older than 5 days.
#!/bin/bash
set -e
cd /var/lib/postgres/data/archive
# Find archive files older than 5 days, keep the newest of them as the cutoff,
# and let pg_archivecleanup remove every WAL segment older than that file.
find * -type f -mtime +5 -print0 | sort -z | tail -z -n 1 | xargs -r0 pg_archivecleanup /var/lib/postgres/data/archive
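A daily cron entry for this cleanup might look like the following; the schedule, script path, and log location are illustrative, since the actual cron configuration is not included in this post.

# /etc/cron.d/pg-archive-cleanup (illustrative)
# m h dom mon dow user command
15 3 * * * postgres /usr/local/bin/pg-archive-cleanup.sh >> /var/log/pg-archive-cleanup.log 2>&1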
2) Alerts configured @dma
Alerts scheduled for >90% filesystem usage on database nodes.
Postmortem template created for future incidents.
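As a sketch of the kind of check that could back such an alert (the actual alerting pipeline is not described here, so the notification step is only a placeholder):

#!/bin/bash
# Minimal disk-usage check: warn when the root filesystem crosses 90% (illustrative)
THRESHOLD=90
USAGE=$(df --output=pcent / | tail -n 1 | tr -dc '0-9')
if [ "$USAGE" -ge "$THRESHOLD" ]; then
  echo "Root filesystem at ${USAGE}% on $(hostname)" # replace with a real notification hook
fi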
3) Postmortem documented @nova
This blog post as well as a small discussion in Discord.