<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hachyderm Community – Incidents</title><link>https://community.hachyderm.io/blog/incidents/</link><description>Recent content in Incidents on Hachyderm Community</description><generator>Hugo -- gohugo.io</generator><language>en</language><atom:link href="https://community.hachyderm.io/blog/incidents/index.xml" rel="self" type="application/rss+xml"/><item><title>Blog: Security incident: Redis cache exposed to public internet</title><link>https://community.hachyderm.io/blog/2023/07/16/security-incident-redis-cache-exposed-to-public-internet/</link><pubDate>Sun, 16 Jul 2023 00:00:00 +0000</pubDate><guid>https://community.hachyderm.io/blog/2023/07/16/security-incident-redis-cache-exposed-to-public-internet/</guid><description>
&lt;h2 id="background">Background&lt;/h2>
&lt;p>Between July 9 and July 16, 2023, one of Hachyderm’s Redis cache servers was exposed to the public internet. On July 16, 2023, the Hachyderm Infrastructure team identified a misconfiguration of our firewall on the cache server which allowed access to the Redis interface from the public Internet. After a routine system update, the nftables firewall service was not brought up automatically after a restart, which exposed the Redis cache to the internet for a period of seven days.&lt;/p>
&lt;p>During the exposure, an unknown third party attempted to reconfigure our Redis server to act as a replica for a Redis server they controlled. Due to this change, a read-only mode was enabled on Hachyderm Redis and no further data was written.&lt;/p>
&lt;p>Normally, Hachyderm servers run nftables to block all except necessary traffic from the Internet. We leverage Tailscale for server-to-server communication and only expose ports to the Internet as needed to run Mastodon and administer the systems.&lt;/p>
&lt;p>As of July 16, 2023 11:17 UTC, the Hachyderm team has corrected the configuration on our systems and blocked external actors from accessing this Redis instance. While we do not have any direct evidence the information in the cache was deliberately exfiltrated, because this was exposed to the public internet, we assume the data was compromised.&lt;/p>
&lt;h2 id="impact">Impact&lt;/h2>
&lt;p>Highly sensitive information like passwords, private keys, and private posts were &lt;strong>NOT&lt;/strong> exposed as part of this incident. No action is required from the user.&lt;/p>
&lt;p>The affected Redis cache stored the following types of information with a 10 minute time-to-live before getting deleted:&lt;/p>
&lt;ul>
&lt;li>Logins using &lt;code>/auth/sign_in&lt;/code> will cache inputted email addresses and used IP addresses for login throttling. Note that if you had a cached session during this period, no IPs or email-addresses would have been included.&lt;/li>
&lt;li>Rails-generated HTML content.&lt;/li>
&lt;li>Some UI-related settings for individual users (examples being toggles for reducing motion, auto play gifs, and the selected UI font).&lt;/li>
&lt;li>Public posts rendered by Mastodon in the affected period.&lt;/li>
&lt;li>Other non-critical information like emojis, blocked IPs, status counts, and other normally public information of the instance.&lt;/li>
&lt;/ul>
&lt;p>We do not have sufficient monitoring to confirm precisely when the compromise occurred. We also have no confirmation that any of the above data was actually sent to a third party, but since the information was available to them, we assume the data was compromised. However, because the adversary turned on read-replica mode, which disabled writes, and because Mastodon&amp;rsquo;s cache has a time-to-live of 10 minutes, the amount of information leaked in this period would have been severely limited.&lt;/p>
&lt;h2 id="timeline">Timeline&lt;/h2>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;strong>Date/Time (UTC)&lt;/strong>&lt;/th>
&lt;th>&lt;strong>Event&lt;/strong>&lt;/th>
&lt;th>&lt;strong>Phase&lt;/strong>&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>2023-07-09 19:16&lt;/td>
&lt;td>Fritz was updated, and part of the upgrade process required a restart of the system. Nftables, our system firewall, was not reenabled across reboots on this server. Fritz acted normally throughout our monitoring period.&lt;/td>
&lt;td>Before&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Between 2023-07-10 and 2023-07-16&lt;/td>
&lt;td>The adversary adds a non-Hachyderm host as the fritz Redis write primary, causing Fritz to go into read-only mode. Sometime thereafter, Fritz’s Redis encounters a synchronization error, causing it to not synchronize further with the non-Hachyderm host.&lt;/td>
&lt;td>Before&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>2023-07-16 10:45&lt;/td>
&lt;td>While attempting to run a standard administrative task, Hachyderm Infra Engineer (HIE1) sees that Mastodon logs are alerting with: RedisCacheStore: write_entry failed, returned false: Redis::CommandError: READONLY You can&amp;rsquo;t write against a read only replica.&lt;/td>
&lt;td>Identify&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>2023-07-16 10:56&lt;/td>
&lt;td>After reviewing the system configuration further, HIE1 identifies that the Redis cache on fritz is targeting an unknown host IP as a write primary; overall Redis is in a degraded state.&lt;/td>
&lt;td>Identify&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>2023-07-16 11:10&lt;/td>
&lt;td>HIE1 confirms the unknown host IP is not a Hachyderm host and that nftables on fritz is not enabled as expected.&lt;/td>
&lt;td>Identify&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>2023-07-16 11:17&lt;/td>
&lt;td>HIE1 stops Redis and re-enables nftables on fritz, closing unbounded communication from the Internet.&lt;/td>
&lt;td>Remediate&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>2023-07-16 13:02&lt;/td>
&lt;td>HIE1 confirms the type of information stored in the cache, including e-mail + IP address for logins.&lt;/td>
&lt;td>Investigate&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="analysis">Analysis&lt;/h2>
&lt;h2 id="what-went-well--where-did-we-get-lucky">What went well &amp;amp; where did we get lucky&lt;/h2>
&lt;ul>
&lt;li>We got lucky that it was the caching redis server, which primarily holds Rails generated HTML content, UI related settings per user, and rack based login throttling.&lt;/li>
&lt;li>No user data was potentially shared other than a subset of IP addresses and email addresses from people who used the login form during the compromised period.&lt;/li>
&lt;li>Redis had a TTL of ten (10) minutes on any data in this cache.&lt;/li>
&lt;li>Redis was put into a READONLY mode when the compromise occurred, so it is likely no data was pushed to the adversary after the timestamp of the compromise. This, coupled with the ten minute cache, caused the cache itself to empty fairly quickly.&lt;/li>
&lt;/ul>
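&lt;p>The READONLY state described above is reported by Redis itself: the output of &lt;code>INFO replication&lt;/code> includes a &lt;code>role:&lt;/code> line. As a minimal sketch, assuming a cache that should always be a standalone primary, a check along these lines could flag an unexpected role. The function names are hypothetical and are not part of Hachyderm&amp;rsquo;s tooling.&lt;/p>

```shell
#!/bin/sh
# Hypothetical helper: read `redis-cli INFO replication` output on stdin
# and print the value of its "role:" line (Redis reports master or slave).
redis_role() {
    # INFO output uses CRLF line endings, so strip the carriage returns first.
    tr -d '\r' | sed -n 's/^role://p'
}

# Hypothetical helper: return 0 when the role is "master", 1 otherwise.
# Assumption: this cache is never expected to run as a replica.
check_redis_role() {
    role=$(redis_role)
    if [ "$role" = "master" ]; then
        echo "OK: redis role is master"
        return 0
    fi
    echo "ALERT: redis role is '${role:-unknown}', expected master"
    return 1
}
```

&lt;p>On a host with &lt;code>redis-cli&lt;/code> installed, this could be run as &lt;code>redis-cli INFO replication | check_redis_role&lt;/code> from a timer or a monitoring agent.&lt;/p>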
&lt;h2 id="what-didnt-go-well">What didn’t go well&lt;/h2>
&lt;ul>
&lt;li>Process: While we have a standard process for updating Mastodon, and while our servers are version-controlled into git, they are individually unique creations, which makes it challenging to understand if a server is configured correctly because each server can be just a &lt;em>little&lt;/em> different.&lt;/li>
&lt;li>System: We don’t leverage authentication or restrictive IP block binds on our Redis server, so once the firewall was down, Redis would become available on the Internet and trivial to connect to and see what data it contained.&lt;/li>
&lt;li>It took us a long time to identify the issue:
&lt;ul>
&lt;li>Observability: We don&amp;rsquo;t currently have a detective control to alert us when a critical configuration on a server is not set correctly.&lt;/li>
&lt;li>Observability: Furthermore, we didn’t have any outlier detection or redis alerts set up to notify us that Redis had gone into read only mode.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Observability: Due to how journalctl had been set up to rotate logs by size, and the explosive amount of RedisCacheStore: write_entry failed entries generated per successful page load, we quickly lost the ability to look back on our log history to see the exact date the access happened.&lt;/li>
&lt;/ul>
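&lt;p>The missing detective control for the firewall could start as something very small. The sketch below assumes the firewall runs as a systemd unit named &lt;code>nftables&lt;/code> and compares the strings printed by &lt;code>systemctl is-enabled&lt;/code> and &lt;code>systemctl is-active&lt;/code> against the expected values. The function name is hypothetical, and the alerting itself (where the message is sent) is left out.&lt;/p>

```shell
#!/bin/sh
# Hypothetical helper: decide whether the firewall unit is in the expected state.
#   $1: output of `systemctl is-enabled nftables` (for example enabled, disabled)
#   $2: output of `systemctl is-active nftables`  (for example active, inactive, failed)
# Returns 0 only when the unit is both enabled at boot and currently active.
firewall_state_ok() {
    enabled_state=$1
    active_state=$2
    status=0
    if [ "$enabled_state" != "enabled" ]; then
        echo "ALERT: nftables is '$enabled_state' at boot, expected 'enabled'"
        status=1
    fi
    if [ "$active_state" != "active" ]; then
        echo "ALERT: nftables is '$active_state' now, expected 'active'"
        status=1
    fi
    if [ "$status" -eq 0 ]; then
        echo "OK: nftables is enabled and active"
    fi
    return "$status"
}
```

&lt;p>A periodic job could call it as &lt;code>firewall_state_ok "$(systemctl is-enabled nftables)" "$(systemctl is-active nftables)"&lt;/code> and page someone on a non-zero exit status. Checking the enabled state separately matters here, because a firewall that is active now but not enabled will silently disappear at the next reboot.&lt;/p>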
&lt;h2 id="corrective-actions">Corrective Actions&lt;/h2>
&lt;p>The Hachyderm infrastructure team is taking/will take the following actions to mitigate the impact of this incident:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Action&lt;/th>
&lt;th>Expected Date&lt;/th>
&lt;th>Status&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Enable nftables on fritz and ensure it will re-enable upon system or service restart on all systems&lt;/td>
&lt;td>Jul 16, 2023&lt;/td>
&lt;td>Done&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Perform system audit to identify potential additional compromise beyond Redis&lt;/td>
&lt;td>Jul 16, 2023&lt;/td>
&lt;td>Done&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Update system update runbooks to include validating that nftables is running as expected after restarts&lt;/td>
&lt;td>Jul 17, 2023&lt;/td>
&lt;td>To Do&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Bind Redis only to expected IP blocks for Hachyderm’s servers&lt;/td>
&lt;td>Jul 17, 2023&lt;/td>
&lt;td>To Do&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Publish full causal analysis graph and update corrective actions based on findings&lt;/td>
&lt;td>Jul 21, 2023&lt;/td>
&lt;td>To Do&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Identify tooling to keep logs for a defined time period that would not be affected by large log files.&lt;/td>
&lt;td>Jul 28, 2023&lt;/td>
&lt;td>To Do&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Identify plan to require authentication on Redis instances&lt;/td>
&lt;td>Jul 28, 2023&lt;/td>
&lt;td>To Do&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Identify mechanism for detective controls to alert if critical services are not running on servers &amp;amp; create plan to implement&lt;/td>
&lt;td>Jul 28, 2023&lt;/td>
&lt;td>To Do&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Explore possibility of using cloud-based firewall rules as an extra layer of protection &amp;amp; plan to implement&lt;/td>
&lt;td>Jul 28, 2023&lt;/td>
&lt;td>To Do&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table></description></item><item><title>Blog: Moderation Postmortem</title><link>https://community.hachyderm.io/blog/2023/05/02/moderation-postmortem/</link><pubDate>Tue, 02 May 2023 00:00:00 +0000</pubDate><guid>https://community.hachyderm.io/blog/2023/05/02/moderation-postmortem/</guid><description>
&lt;p>Hello Hachydermians! There has been a lot of confusion this week, so we’re writing
up this blog post to be both a postmortem of sorts and a
&lt;a href="https://en.wikipedia.org/wiki/Single_source_of_truth">single source of truth&lt;/a>.
This is partly to combat some of the problems generated by hearsay: hearsay generates
more Things To Respond To than Things That Actually Happened. As a result, this
post is a little longer than our norm.&lt;/p>
&lt;h2 id="moderation-incident">Moderation Incident&lt;/h2>
&lt;p>(A note to the broader, non-tech industry, members of our community: “Incident”
here carries similar context and meaning to an
“&lt;a href="https://en.wikipedia.org/wiki/Computer_security_incident_management">IT Incident&lt;/a>”
as we are a tech-oriented instance. Postmortems for traditional IT incidents are also in
this section.)&lt;/p>
&lt;h3 id="a-short-confusing-timeline">A Short, Confusing Timeline&lt;/h3>
&lt;p>On 24 April 2023 the Hachyderm Moderation team
&lt;a href="https://github.com/hachyderm/community/issues/401">received a request to review our Fundraising Policy via a GitHub Issue&lt;/a>.
The reason for the request was to
ensure there was a well understood distinction between Mutual Aid and
Fundraising. Although our Head Moderator responded to the thread with the
constraints we use when developing new rules, some of the hearsay we started
to see in the thread the user linked raised some flags that something else was
happening.&lt;/p>
&lt;p>In order to determine what happened, we needed to dive into various commentary
before arriving at a potential root cause (and we eventually determined this
was indeed the correct root cause). While a few of our moderators were working on
this, our founder and now-former admin Kris Nóva was requested, either directly
or indirectly, to make statements on transgender genocide (she is herself
openly transgender) and classism.&lt;/p>
&lt;p>We want to be unequivocally clear: these issues are important. Kris Nóva
is transgender and has been open about her experiences with homelessness
and receiving mutual aid. The Hachyderm teams are also populated, intentionally,
with a variety of marginalized individuals that bring their own lived experiences
to our ability to manage Hachyderm’s moderation and infrastructure teams. That
said: it was not immediately apparent that these requests were initially connected
to the originating problem. In fact, it caused additional resources to be used
trying to determine if there was a secondary problem to address. This resulted
in a delay in actual remediation.&lt;/p>
&lt;h3 id="the-error-itself">The Error Itself&lt;/h3>
&lt;p>On 27 Mar 2023 the Hachyderm Moderation Team received a report that indicated
that an account may have been spamming the platform. When the posts were reviewed
at the time of the report, it did trigger our spam policy. When we receive
reports of accounts seeking funds, we try to validate the posts to check for
common issues like phishing and so forth, as well as checking post volume and pattern
to determine if the account is posting in a bot-like way. At the
time the report was moderated, the result was that the posting type and/or pattern
was incorrectly flagged as spam and we requested the account stop posting that
type of post. Once we became aware of the situation, the Hachyderm Moderation
team followed up with our Hachydermian to ensure that they knew that they could
post their requests for Mutual Aid, apologized for the error, and did our best
to let them know we were here if there was anything else we could do to help
them feel warm and welcome on our instance.&lt;/p>
&lt;h2 id="lack-of-public-statement">Lack of public statement&lt;/h2>
&lt;p>There have been some questions around the lack of public statement regarding the
above. There are two reasons for this. First and foremost, this is because situations
involving moderation are between the moderation team and the impacted person.
Secondly, we must always take active steps to protect against negative consequences
that can come with all the benefits of being a larger instance.&lt;/p>
&lt;h3 id="how-errors-in-moderation-are-handled-at-hachyderm">How Errors in Moderation are Handled at Hachyderm&lt;/h3>
&lt;p>Depending on the error, one or both of the following occurs:&lt;/p>
&lt;ul>
&lt;li>Follow up with the user to rectify the situation&lt;/li>
&lt;li>Review policy to ensure it doesn’t happen again&lt;/li>
&lt;/ul>
&lt;p>This is because of two enforced opinions of the Hachyderm Moderation team:&lt;/p>
&lt;ul>
&lt;li>Moderation reports filed by users are reports of harm done.
&lt;ul>
&lt;li>To put it another way: if a user needs to file a report of hateful content,
then they have already seen that content to report it.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>All moderation mistakes are also harm done.&lt;/li>
&lt;/ul>
&lt;p>For the latter, we only follow up with the user if 1) they request and/or 2)
we have reason to believe it would not increase the harm done. All Hachyderm
Moderator and Hachydermian interactions are centered on harm mitigation.
Rectifying mistakes in moderation is about the impacted person and not ego on
the part of the specific moderator who made the mistake.&lt;/p>
&lt;p>To put it another way, the Hachyderm Moderation team exists to serve the Hachyderm
Community. This means that we will apologize to the user as part of harm mitigation.
We will not apologize as part of “needing to have our apology accepted” or to “be seen as
apologizing”. These latter two go against the ethos of Hachyderm Moderation Strategy.&lt;/p>
&lt;h2 id="inter-instance-communication-and-hachyderm">Inter-instance Communication and Hachyderm&lt;/h2>
&lt;p>Back when we took on the Twitter Migration in Nov 2022, we started to overcommunicate
with Hachydermians that we were going to start using email and GitHub Issues
(in addition to moderation reports) to accommodate our team scaling. As part of the
changes we made, we also started making changes to reflect this in the documentation,
as well as including how other instances can reach out to us if needed. It was our own
pattern that if we needed to reach out to another instance, we used the email address
they listed on their instance page. This is because we didn’t want to assume “the name
on the instance” was “the” person to talk to. There are likely other instances like ours
that have multiple people involved. This is reinforced by the fact that the “name” is
actually populated by default by the person who originally installed the software on
the server(s).&lt;/p>
&lt;p>That said, we recently made a connection with someone who has grown their instance in
the Fediverse for quite some time and has been helping to make us aware of the pre-existing
cultural and communication norms in this space. In the same way it was natural for us
to check for the existence of instance documentation to find their preferred way to
communicate, it was unnatural for some of the existing instances to do so. Instead, it
seems there are pre-existing cultural norms in the space we weren’t aware of. The
impact of this is that some instances knew how to reach us and others did not.&lt;/p>
&lt;p>We have been taking feedback from the person we connected with so that we can balance
the needs of the existing communication patterns and norms of the Fediverse, while
not accidentally creating a situation where a communication method ends up either
too silo’ed or not appropriately visible due to our team structures.&lt;/p></description></item><item><title>Blog: TLS Expires: media.hachyderm.io</title><link>https://community.hachyderm.io/blog/2023/02/27/tls-expires-media.hachyderm.io/</link><pubDate>Mon, 27 Feb 2023 00:00:00 +0000</pubDate><guid>https://community.hachyderm.io/blog/2023/02/27/tls-expires-media.hachyderm.io/</guid><description>
&lt;p>On February 28th, 2023 at approximately 01:55 UTC Hachyderm experienced a service degradation in which images failed to load in production.&lt;/p>
&lt;p>We were able to quickly identify the root cause as expired TLS certificates in production for &lt;code>media.hachyderm.io&lt;/code>&lt;/p>
&lt;p>&lt;img src="https://community.hachyderm.io/blog/2023/02/27/tls-expires-media.hachyderm.io/img.png" alt="img.png">&lt;/p>
&lt;h1 id="context">Context&lt;/h1>
&lt;p>Hachyderm TLS certificates are still managed manually, and are very clearly sprawling out of control due to our rapid growth. There are many certificates on various servers that have had config copied from one server to another as we grew into our current architecture.&lt;/p>
&lt;p>The alert notification was missed, and the &lt;code>media.hachyderm.io&lt;/code> TLS &lt;code>privkey.pem&lt;/code> and &lt;code>fullchain.pem&lt;/code> material expired causing the service degradation.&lt;/p>
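&lt;p>Because the expiry alert was missed, a second, independent check can help. The sketch below is one possible shape for it, not the process Hachyderm uses: it turns an expiry timestamp into days remaining and fails when that drops below a threshold. The function names and the 14-day threshold are assumptions for illustration, and the commented usage relies on GNU &lt;code>date -d&lt;/code>.&lt;/p>

```shell
#!/bin/sh
# Hypothetical helper: whole days between "now" and the certificate expiry.
#   $1: expiry time in seconds since the epoch
#   $2: current time in seconds since the epoch
# Integer division truncates toward zero, so a partial day counts as 0.
cert_days_left() {
    echo $(( ($1 - $2) / 86400 ))
}

# Hypothetical helper: return 1 and print a warning when fewer than $3 days remain.
check_cert_expiry() {
    days=$(cert_days_left "$1" "$2")
    if [ "$days" -lt "$3" ]; then
        echo "ALERT: certificate expires in $days day(s)"
        return 1
    fi
    echo "OK: certificate valid for $days more day(s)"
    return 0
}

# Possible usage on a host that holds the certificate (GNU date assumed):
#   not_after=$(openssl x509 -enddate -noout -in fullchain.pem | cut -d= -f2)
#   check_cert_expiry "$(date -d "$not_after" +%s)" "$(date +%s)" 14
```

&lt;p>Keeping the arithmetic separate from the &lt;code>openssl&lt;/code> call makes the threshold logic easy to test without a real certificate. For a simple yes/no answer, &lt;code>openssl x509 -checkend&lt;/code> with a number of seconds is an alternative.&lt;/p>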
&lt;h3 id="timeline">Timeline&lt;/h3>
&lt;ul>
&lt;li>Feb 28th &lt;strong>01:52&lt;/strong> &lt;code>@quintessence&lt;/code> First report of media outages&lt;/li>
&lt;li>Feb 28th &lt;strong>01:54&lt;/strong> &lt;code>@nova&lt;/code> Confirms media is broken from remote proxy in EU&lt;/li>
&lt;li>Feb 28th &lt;strong>01:56&lt;/strong> &lt;code>@nova&lt;/code> Appoints &lt;code>@quintessence&lt;/code> as incident commander&lt;/li>
&lt;li>Feb 28th &lt;strong>01:57&lt;/strong> &lt;code>@nova&lt;/code> Confirms TLS expired on &lt;code>media.hachyderm.io&lt;/code>&lt;/li>
&lt;li>Feb 28th &lt;strong>02:30&lt;/strong> &lt;code>@nova&lt;/code> Live streaming &lt;a href="https://youtu.be/kMf0KOlSNdk?t=1561">fixing TLS&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Shortly after starting the stream we discovered that the Acme challenge was not working because the &lt;code>media.hachyderm.io&lt;/code> DNS record was pointed to &lt;code>CNAME hachyderm.io&lt;/code> and the proxy was not configured to manage the request. In the past we have worked around this by editing the CDN on the East coast which is where the Acme challenge will resolve.&lt;/p>
&lt;p>In this case we changed the &lt;code>media.hachyderm.io&lt;/code> DNS record to point to &lt;code>A &amp;lt;ip-of-fritz&amp;gt;&lt;/code> which is where the core web server was running.&lt;/p>
&lt;p>We re-ran the renew process and it worked!&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo -E certbot renew
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>We then re-pointed &lt;code>media.hachyderm.io&lt;/code> back to &lt;code>CNAME hachyderm.io&lt;/code>.&lt;/p>
&lt;p>Next came the &lt;code>scp&lt;/code> command to move the new cert material out to the various CDN nodes and restart nginx.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#8f5902;font-style:italic"># Copy TLS from fritz -&amp;gt; CDN host&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>scp /etc/letsencrypt/archive/media.hachyderm.io/* root@&amp;lt;host&amp;gt;:/etc/letsencrypt/archive/media.hachyderm.io/
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#8f5902;font-style:italic"># Access root on the CDN host&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>ssh root@&amp;lt;host&amp;gt;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#8f5902;font-style:italic"># Private key (on CDN host)&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>rm -f /etc/letsencrypt/live/media.hachyderm.io/privkey.pem
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>ln -s /etc/letsencrypt/archive/media.hachyderm.io/privkey3.pem /etc/letsencrypt/live/media.hachyderm.io/privkey.pem
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#8f5902;font-style:italic"># Fullchain (on CDN host)&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>rm -f /etc/letsencrypt/live/media.hachyderm.io/fullchain.pem
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>ln -s /etc/letsencrypt/archive/media.hachyderm.io/fullchain3.pem /etc/letsencrypt/live/media.hachyderm.io/fullchain.pem
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The full list of CDN hosts:&lt;/p>
&lt;ul>
&lt;li>cdn-frankfurt-1&lt;/li>
&lt;li>cdn-fremont-1&lt;/li>
&lt;li>sally&lt;/li>
&lt;li>esme&lt;/li>
&lt;/ul>
&lt;p>Restarting &lt;code>nginx&lt;/code> on each of the CDN hosts was able to fix the problem.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#8f5902;font-style:italic"># On a CDN host&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>nginx -t &lt;span style="color:#8f5902;font-style:italic"># Test the config&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>systemctl reload nginx &lt;span style="color:#8f5902;font-style:italic"># Reload the service&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#8f5902;font-style:italic"># On your local machine &lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>emacs /etc/hosts &lt;span style="color:#8f5902;font-style:italic"># Point &amp;#34;hachyderm.io&amp;#34; and &amp;#34;media.hachyderm.io&amp;#34; to IP of CDN host&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#8f5902;font-style:italic"># Check your browser for working images&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="impact">Impact&lt;/h3>
&lt;ul>
&lt;li>Full image outage across the site in all regions.&lt;/li>
&lt;li>A stressful situation interrupting dinner and impacting the family.&lt;/li>
&lt;li>Even more chaos and confusion with certificate material.&lt;/li>
&lt;/ul>
&lt;h3 id="lessons-learned">Lessons Learned&lt;/h3>
&lt;ul>
&lt;li>We still have outstanding legacy certificate management problems.&lt;/li>
&lt;/ul>
&lt;h3 id="things-that-went-well">Things that went well&lt;/h3>
&lt;ul>
&lt;li>We had a quick report, and the mean time to resolution was &amp;lt;60 mins.&lt;/li>
&lt;/ul>
&lt;h3 id="things-that-went-poorly">Things that went poorly&lt;/h3>
&lt;ul>
&lt;li>The certs are in an even more chaotic state.&lt;/li>
&lt;li>There was no alerting that the images broke.&lt;/li>
&lt;li>There was a high stress situation that impacted our personal lives.&lt;/li>
&lt;/ul>
&lt;h3 id="where-we-got-lucky">Where we got lucky&lt;/h3>
&lt;ul>
&lt;li>I still had access to the servers, and was able to remedy the situation from existing knowledge.&lt;/li>
&lt;/ul>
&lt;h1 id="action-items">Action items&lt;/h1>
&lt;ul>
&lt;li>We need to destroy the vast majority of nginx configurations and domains in production&lt;/li>
&lt;li>We need to destroy all TLS certs and re-create them with a cohesive strategy&lt;/li>
&lt;li>We need a better way to perform the Acme challenge that doesn&amp;rsquo;t involve changing DNS around the globe&lt;/li>
&lt;li>Nóva to send list of domains to discord to destroy&lt;/li>
&lt;/ul></description></item><item><title>Blog: Fritz Timeouts</title><link>https://community.hachyderm.io/blog/2023/01/07/fritz-timeouts/</link><pubDate>Sat, 07 Jan 2023 00:00:00 +0000</pubDate><guid>https://community.hachyderm.io/blog/2023/01/07/fritz-timeouts/</guid><description>
&lt;p>On January 7th, 2023 at approximately 22:26 UTC Hachyderm experienced a spike in HTTP response times as well as a spike in 504 Timeouts across the CDN.&lt;/p>
&lt;p>&lt;img src="https://community.hachyderm.io/blog/2023/01/07/fritz-timeouts/img.png" alt="img.png">&lt;/p>
&lt;p>&lt;img src="https://community.hachyderm.io/blog/2023/01/07/fritz-timeouts/img_1.png" alt="img_1.png">&lt;/p>
&lt;p>Working backwards from the CDN to &lt;code>fritz&lt;/code> we discovered another cascading failure.&lt;/p>
&lt;h1 id="context">Context&lt;/h1>
&lt;p>There is a fleet of CDN nodes around the world, commonly referred to as &amp;ldquo;POP&amp;rdquo; servers (Point of Presence) or even just &amp;ldquo;The CDN&amp;rdquo;. These servers reverse proxy over dedicated connections back to our core infrastructure.&lt;/p>
&lt;p>These CDN servers served content timeouts at roughly &lt;strong>22:20:00&lt;/strong> UTC.&lt;/p>
&lt;p>These CDN servers depend on the &lt;code>mastodon-streaming&lt;/code> service to offer websocket connections.&lt;/p>
&lt;h3 id="impact">Impact&lt;/h3>
&lt;ul>
&lt;li>Total streaming server outage reported in Discord (Uptime Robot)&lt;/li>
&lt;li>Slow/Timeouts reported by users in Twitch chat&lt;/li>
&lt;li>Nóva noticed slow/timeouts on her phone&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>HTTP Response Times&lt;/strong> measured &amp;gt; 3s&lt;/p>
&lt;h1 id="background">Background&lt;/h1>
&lt;p>We received some valuable insight from &lt;a href="https://github.com/ThisIsMissEm">@ThisIsMissEm&lt;/a> who has experience with both node.js websocket servers and the mastodon codebase, which can be read &lt;a href="https://hackmd.io/8bhI7IWcTvSJvRhu9M45nQ">here in HackMD&lt;/a>.&lt;/p>
&lt;p>An important takeaway from this knowledge is that the &lt;code>mastodon-streaming&lt;/code> service and the &lt;code>mastodon-web&lt;/code> service will not rate limit if they are communicating over &lt;code>localhost&lt;/code>.&lt;/p>
&lt;p>In other words, you should be scheduling &lt;code>mastodon-streaming&lt;/code> on the same node you are running &lt;code>mastodon-web&lt;/code>.&lt;/p>
&lt;p>We believe that, because of the way the streaming API works, a &lt;strong>&amp;ldquo;large event&amp;rdquo;&lt;/strong> such as a post from an account with many followers can have a cascading effect on everyone connected via the streaming API.&lt;/p>
&lt;blockquote>
&lt;p>A good metric to track would actually be the percentage of connections that a single write is going to. If the mastodon server has one highly followed user, a post by them, especially in a &amp;ldquo;busy&amp;rdquo; timezone for the instance, will result in unbalanced write behaviours, where one message posted will result in iterating over a heap more connections than others (one per follower who&amp;rsquo;s connected to streaming), so you can end up doing 40,000 network writes very easily, locking up node.js temporarily from processing disconnections correctly.&lt;/p>
&lt;/blockquote>
&lt;p>We believe that the streaming API began to drop connections which cascaded out to the CDN nodes via the &lt;code>mastodon-web&lt;/code> service.&lt;/p>
&lt;p>We can correlate this theory by connecting observed log lines to the Mastodon code base.&lt;/p>
&lt;h5 id="logs-from-mastodon-streaming-on-fritz">Logs from &lt;code>mastodon-streaming&lt;/code> on &lt;strong>Fritz&lt;/strong>&lt;/h5>
&lt;pre tabindex="0">&lt;code>06-4afe-a449-a42f861855b2 Tried writing to closed socket
33-414d-9143-6a5080bd6254 Tried writing to closed socket
33-414d-9143-6a5080bd6254 Tried writing to closed socket
06-4afe-a449-a42f861855b2 Tried writing to closed socket
06-4afe-a449-a42f861855b2 Tried writing to closed socket
06-4afe-a449-a42f861855b2 Tried writing to closed socket
06-4afe-a449-a42f861855b2 Tried writing to closed socket
06-4afe-a449-a42f861855b2 Tried writing to closed socket
06-4afe-a449-a42f861855b2 Tried writing to closed socket
06-4afe-a449-a42f861855b2 Tried writing to closed socket
06-4afe-a449-a42f861855b2 Tried writing to closed socket
06-4afe-a449-a42f861855b2 Tried writing to closed socket
41-4385-9762-c5c1d829ba27 Tried writing to closed socket
0f-4eb4-9751-b5ac7e21c648 Tried writing to closed socket
06-4afe-a449-a42f861855b2 Tried writing to closed socket
06-4afe-a449-a42f861855b2 Tried writing to closed socket
60-40d1-99b4-349f03610b36 Tried writing to closed socket
60-40d1-99b4-349f03610b36 Tried writing to closed socket
33-414d-9143-6a5080bd6254 Tried writing to closed socket
06-4afe-a449-a42f861855b2 Tried writing to closed socket
&lt;/code>&lt;/pre>&lt;h5 id="code-from-mastodon-main">Code from Mastodon main&lt;/h5>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-javascript" data-lang="javascript">&lt;span style="display:flex;">&lt;span> &lt;span style="color:#204a87;font-weight:bold">const&lt;/span> &lt;span style="color:#000">streamToWs&lt;/span> &lt;span style="color:#ce5c00;font-weight:bold">=&lt;/span> &lt;span style="color:#000;font-weight:bold">(&lt;/span>&lt;span style="color:#000">req&lt;/span>&lt;span style="color:#000;font-weight:bold">,&lt;/span> &lt;span style="color:#000">ws&lt;/span>&lt;span style="color:#000;font-weight:bold">,&lt;/span> &lt;span style="color:#000">streamName&lt;/span>&lt;span style="color:#000;font-weight:bold">)&lt;/span> &lt;span style="color:#000;font-weight:bold">=&amp;gt;&lt;/span> &lt;span style="color:#000;font-weight:bold">(&lt;/span>&lt;span style="color:#000">event&lt;/span>&lt;span style="color:#000;font-weight:bold">,&lt;/span> &lt;span style="color:#000">payload&lt;/span>&lt;span style="color:#000;font-weight:bold">)&lt;/span> &lt;span style="color:#000;font-weight:bold">=&amp;gt;&lt;/span> &lt;span style="color:#000;font-weight:bold">{&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#204a87;font-weight:bold">if&lt;/span> &lt;span style="color:#000;font-weight:bold">(&lt;/span>&lt;span style="color:#000">ws&lt;/span>&lt;span style="color:#000;font-weight:bold">.&lt;/span>&lt;span style="color:#000">readyState&lt;/span> &lt;span style="color:#ce5c00;font-weight:bold">!==&lt;/span> &lt;span style="color:#000">ws&lt;/span>&lt;span style="color:#000;font-weight:bold">.&lt;/span>&lt;span style="color:#000">OPEN&lt;/span>&lt;span style="color:#000;font-weight:bold">)&lt;/span> &lt;span style="color:#000;font-weight:bold">{&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#000">log&lt;/span>&lt;span style="color:#000;font-weight:bold">.&lt;/span>&lt;span style="color:#000">error&lt;/span>&lt;span style="color:#000;font-weight:bold">(&lt;/span>&lt;span style="color:#000">req&lt;/span>&lt;span style="color:#000;font-weight:bold">.&lt;/span>&lt;span style="color:#000">requestId&lt;/span>&lt;span style="color:#000;font-weight:bold">,&lt;/span> &lt;span style="color:#4e9a06">&amp;#39;Tried writing to closed socket&amp;#39;&lt;/span>&lt;span style="color:#000;font-weight:bold">);&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#204a87;font-weight:bold">return&lt;/span>&lt;span style="color:#000;font-weight:bold">;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#000;font-weight:bold">}&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Found in &lt;a href="https://github.com/mastodon/mastodon/blob/main/streaming/index.js#L827-L831">mastodon/streaming/index.js&lt;/a>&lt;/p>
&lt;h5 id="logs-correlation-from-mastodon-web-on-fritz">Logs (correlation) from &lt;code>mastodon-web&lt;/code> on &lt;strong>Fritz&lt;/strong>&lt;/h5>
&lt;p>This is where we suspect we are hitting the &amp;ldquo;Rack Attack&amp;rdquo; rate limit in the streaming service.&lt;/p>
&lt;pre tabindex="0">&lt;code>-4589-97ed-b67c66eb8c38] Rate limit hit (throttle): 98.114.90.221 GET /api/v1/timelines/home?since_id=109&amp;gt;
&lt;/code>&lt;/pre>&lt;h1 id="working-theory-root-cause">Working Theory (root cause)&lt;/h1>
&lt;p>We are maxing out the streaming service on &lt;strong>Fritz&lt;/strong>, and it is rate limiting the mastodon web (puma) service.
The &amp;ldquo;maxing out&amp;rdquo; is described in the write-up by &lt;a href="https://github.com/ThisIsMissEm">@ThisIsMissEm&lt;/a>: NodeJS struggles to process or drop connections, potentially as a result of a &lt;strong>&amp;ldquo;Large Event&amp;rdquo;&lt;/strong>.&lt;/p>
&lt;p>As the websocket count increases there is a cascading failure that starts on &lt;strong>Fritz&lt;/strong> and works its way out to the nodes.&lt;/p>
&lt;p>Eventually the code that is executing (looping) over the large number of websockets will &lt;strong>&amp;ldquo;break&amp;rdquo;&lt;/strong> and there is a large release where a spike in network traffic can be observed.&lt;/p>
&lt;p>We see a relatively enormous number of events occur during the second of &lt;strong>22:17:30&lt;/strong> on &lt;strong>Fritz&lt;/strong>, which we suspect is the &amp;ldquo;release&amp;rdquo; of the execution path.&lt;/p>
&lt;p>As the streaming service recovers, the rest of Hachyderm slowly stabilizes.&lt;/p>
&lt;h3 id="lessons-learned">Lessons Learned&lt;/h3>
&lt;p>Websockets are a big deal, and will likely be the next area of our service we need to start observing.&lt;/p>
&lt;p>We will need to start monitoring the relationship between the streaming service and the main mastodon web service pretty closely.&lt;/p>
&lt;h3 id="things-that-went-well">Things that went well&lt;/h3>
&lt;p>We found some great help on Twitch, and we ended up discovering an unrelated (but potentially disastrous) problem with &lt;strong>Nietzsche&lt;/strong> (the main database server).&lt;/p>
&lt;p>We have a path forward for debugging the streaming issues.&lt;/p>
&lt;h3 id="things-that-went-poorly">Things that went poorly&lt;/h3>
&lt;p>Nóva was short on Twitch again and struggles to deal with a lot of &amp;ldquo;noise/distractions&amp;rdquo; while she is debugging production.&lt;/p>
&lt;p>In general there isn&amp;rsquo;t much more we can do operationally other than keep a closer eye on things. The code base is gonna&amp;rsquo; do what the code base is gonna&amp;rsquo; do until we decide to fork it or wait for improvements from the community.&lt;/p>
&lt;h3 id="where-we-got-lucky">Where we got lucky&lt;/h3>
&lt;p>Seriously the &lt;strong>Nietzsche&lt;/strong> discovery was huge, and had nothing to do with the streaming &amp;ldquo;hiccups&amp;rdquo;. We got extremely lucky here.&lt;/p>
&lt;p>Consequently, Nóva fixed the problem on &lt;strong>Nietzsche&lt;/strong> which was that our main database NVMe disk was at 98% capacity.&lt;/p>
&lt;ul>
&lt;li>We did NOT receive storage alerts in Discord (I believe we should have?)&lt;/li>
&lt;li>Nóva could NOT find an existing cron job on the server to clean the archive.&lt;/li>
&lt;li>Nóva scheduled the cron job (Using &lt;code>sudo crontab -e&lt;/code>)&lt;/li>
&lt;/ul>
&lt;p>The directory (archive) that was full:&lt;/p>
&lt;pre tabindex="0">&lt;code>/var/lib/postgres/data/archive
&lt;/code>&lt;/pre>&lt;p>&lt;strong>Nietzsche&lt;/strong> is now back down to ~30%&lt;/p>
&lt;p>&lt;img src="https://community.hachyderm.io/blog/2023/01/07/fritz-timeouts/img_2.png" alt="img_2.png">&lt;/p>
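&lt;p>A storage alert of the kind we expected to receive can be as simple as a threshold check. The following is a minimal sketch, not the check we actually run: the function name &lt;code>storageAlert&lt;/code>, the 90% threshold, and the example volume sizes are all assumptions made for illustration.&lt;/p>

```javascript
// Hypothetical helper: decide whether a volume should raise a storage alert.
// totalBytes and freeBytes are assumed to be gathered elsewhere (for example
// from `df` output); the 90% default threshold is an arbitrary example value.
function storageAlert(totalBytes, freeBytes, thresholdPercent = 90) {
  const usedPercent = ((totalBytes - freeBytes) / totalBytes) * 100;
  return {
    // Round to one decimal place for readable alert messages.
    usedPercent: Math.round(usedPercent * 10) / 10,
    alert: usedPercent >= thresholdPercent,
  };
}

// Example figures only: a volume that is 98% full, then one that is 30% full.
const GB = 1024 ** 3;
console.log(storageAlert(1000 * GB, 20 * GB));
console.log(storageAlert(1000 * GB, 700 * GB));
```

&lt;p>In practice the result would be forwarded to wherever alerts are already delivered, so that a disk nearing capacity is noticed before it becomes an incident.&lt;/p>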
&lt;h1 id="action-items">Action items&lt;/h1>
&lt;h3 id="1-set-up-websocket-observability-on-cdn-nodes-clients-and-fritz-server">1) Set up websocket observability on CDN nodes (clients) and Fritz (server)&lt;/h3>
&lt;p>We want to see how many &amp;ldquo;writes&amp;rdquo; we have on the client side and how many socket connections they are mapped to if possible. We might need to PR a log entry for this to the Mastodon code base.&lt;/p>
&lt;h3 id="2-verify-cron-is-running-on-nietzsche">2) Verify cron is running on Nietzsche&lt;/h3>
&lt;p>We need to make sure the cron job is running and the archive is emptying.&lt;/p>
&lt;h3 id="3-debug-why-we-didnt-receive-nietzsche-alerts">3) Debug why we didn&amp;rsquo;t receive Nietzsche alerts&lt;/h3>
&lt;p>I think we should have seen these, but I am not sure?&lt;/p>
&lt;h3 id="4-we-likely-need-a-bigger-fritz">4) We likely need a bigger &amp;ldquo;Fritz&amp;rdquo;&lt;/h3>
&lt;p>Sounds like we need donations and a bigger server (it will be hard to move streaming off of the same machine as web).&lt;/p></description></item><item><title>Blog: Fritz on the fritz</title><link>https://community.hachyderm.io/blog/2023/01/03/fritz-on-the-fritz/</link><pubDate>Tue, 03 Jan 2023 00:00:00 +0000</pubDate><guid>https://community.hachyderm.io/blog/2023/01/03/fritz-on-the-fritz/</guid><description>
&lt;p>On January 3rd, 2023 at approximately 12:30 UTC Hachyderm experienced a spike in
response times. This appeared to be due to a certificate that had not been
renewed on &lt;code>fritz&lt;/code>, which runs the Mastodon Puma and Streaming services. The
service appeared to recover until approximately 15:00 UTC when another spike in
response times was observed.&lt;/p>
&lt;p>&lt;img src="https://community.hachyderm.io/blog/2023/01/03/fritz-on-the-fritz/high_response_times.png" alt="high_response_times.png">&lt;/p>
&lt;p>Alerts fired in Discord, notifying us of the issue.&lt;/p>
&lt;h1 id="background">Background&lt;/h1>
&lt;p>&lt;code>fritz&lt;/code> runs mastodon-web and mastodon-streaming and all other web nodes proxy
to &lt;code>fritz&lt;/code>.&lt;/p>
&lt;p>mastodon-web was configured with 16 processes each having 20 threads.&lt;/p>
&lt;p>mastodon-streaming was configured with 16 processes.&lt;/p>
&lt;h1 id="impact">Impact&lt;/h1>
&lt;p>p90 response times grew from ~400ms to &amp;gt;2s.
increase of 502 responses to &amp;gt;1000 per minute.&lt;/p>
&lt;h1 id="root-causes-and-trigger">Root causes and trigger&lt;/h1>
&lt;p>organic growth in users and traffic coupled with the return from vacation of
the US caused the streaming and puma processes on &lt;code>fritz&lt;/code> to use more CPU. CPU
load hit &amp;gt;90% consistently on &lt;code>fritz&lt;/code>. this in turn caused responses to fail to
be returned to the upstream web frontends.&lt;/p>
&lt;p>&lt;img src="https://community.hachyderm.io/blog/2023/01/03/fritz-on-the-fritz/7d_cpu_core.png" alt="7d_cpu_core.png">&lt;/p>
&lt;h3 id="lessons-learned">Lessons Learned&lt;/h3>
&lt;p>response times are very sensitive to puma threads (reducing from 20 to 16 threads
per process doubled GET response times).&lt;/p>
&lt;p>the site functions well with fewer streaming processes.&lt;/p>
&lt;h3 id="things-that-went-well">Things that went well&lt;/h3>
&lt;p>we had the core CPU load on the public dashboard.&lt;/p>
&lt;h3 id="things-that-went-poorly">Things that went poorly&lt;/h3>
&lt;p>in an attempt to get things under control both mastodon-streaming and
mastodon-web were changed. puma was then reverted as we had
over-corrected and response times were getting quite bad.&lt;/p>
&lt;p>no CPU load alerts were configured for &lt;code>fritz&lt;/code> specifically.&lt;/p>
&lt;h3 id="where-we-got-lucky">Where we got lucky&lt;/h3>
&lt;p>&lt;code>@dma&lt;/code> was already keyed in to fritz thanks to an earlier issue where
certs hadn&amp;rsquo;t been renewed.&lt;/p>
&lt;h1 id="action-items">Action items&lt;/h1>
&lt;h3 id="1-streaming-processes-reduces-dma-repair">1) Streaming processes reduced &lt;code>@dma&lt;/code> [repair]&lt;/h3>
&lt;p>Reduced the number of streaming processes on &lt;code>fritz&lt;/code> from 16 to 12.&lt;/p>
&lt;h3 id="2-better-alerting-on-cpu-load-dma-detect">2) Better alerting on CPU load &lt;code>@dma&lt;/code> [detect]&lt;/h3>
&lt;p>We should implement better CPU load alerting on every host to detect these
issues and be able to respond even more quickly.&lt;/p>
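&lt;p>Because the problem here was CPU load staying above 90% rather than a single spike, one way to alert on it is to require several consecutive high samples. This is only a sketch under assumptions: the function name, the three-sample window, and the sample values are invented for illustration, and the samples are assumed to be utilisation percentages supplied by an existing metrics agent.&lt;/p>

```javascript
// Hypothetical check: flag a host whose CPU utilisation samples stay above a
// threshold for several consecutive readings, so short spikes do not page anyone.
function sustainedCpuAlert(samples, thresholdPercent = 90, minConsecutive = 3) {
  let run = 0; // length of the current streak of high samples
  for (const sample of samples) {
    run = sample > thresholdPercent ? run + 1 : 0;
    if (run >= minConsecutive) return true;
  }
  return false;
}

// Example values only: a brief spike, then a sustained plateau.
console.log(sustainedCpuAlert([40, 95, 60, 92, 50]));  // spike only
console.log(sustainedCpuAlert([85, 91, 93, 96, 94]));  // sustained load
```

&lt;p>The window length trades alert speed against noise; a shorter window responds faster but fires on brief bursts.&lt;/p>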
&lt;h3 id="3-postmortem-documented-dma">3) Postmortem documented &lt;code>@dma&lt;/code>&lt;/h3>
&lt;p>This blog post and a hackmd postmortem doc.&lt;/p></description></item><item><title>Blog: The Queues ☃️ down in Queueville</title><link>https://community.hachyderm.io/blog/2022/12/20/the-queues-%EF%B8%8F-down-in-queueville/</link><pubDate>Tue, 20 Dec 2022 00:00:00 +0000</pubDate><guid>https://community.hachyderm.io/blog/2022/12/20/the-queues-%EF%B8%8F-down-in-queueville/</guid><description>
&lt;p>Every Queue down in Queueville liked ActivityPub a lot.
But John Mastodon, who lived just north of Queueville, did not!
John Mastodon hated ActivityPub, the whole Activity season!
Now please don&amp;rsquo;t ask why. No one quite knows the reason.&lt;/p>
&lt;p>It could be, perhaps, that his &lt;code>WEB_CONCURRENCY&lt;/code> was too tight.
It could be his &lt;code>MAX_THREADS&lt;/code> wasn&amp;rsquo;t screwed on just right.
But I think that the most likely reason of all
May have been that his &lt;code>CPU&lt;/code> was two sizes too small.&lt;/p>
&lt;p>But, whatever the reason, his &lt;code>WEB_CONCURRENCY&lt;/code> or &lt;code>CPU&lt;/code>s,
He stood there on Activity Eve hating the Queues&amp;hellip;
Staring down from his cave with systemd hacks
At the warm buzzing servers below in their racks&lt;/p>
&lt;p>For he knew every Queue down in Queueville beneath
Was busy now hanging an Activity-Wreath.
&amp;ldquo;And they&amp;rsquo;re posting their statuses,&amp;rdquo; he snarled with a sneer.
&amp;ldquo;Tomorrow is Activity-Mas! It&amp;rsquo;s practically here!&amp;rdquo;&lt;/p>
&lt;p>Then he growled, with John Mastodon fingers nervously drumming,
&amp;ldquo;I must find some way to keep the statuses from coming&amp;rdquo;!&lt;/p>
&lt;p>For, tomorrow, I know all the Queues and the &amp;ldquo;they&amp;rdquo;s and the &amp;ldquo;them&amp;rdquo;s
Will wake bright and early for ActivitySeason to begin!&lt;/p>
&lt;p>And then! Oh, the noise! Oh, the noise! Noise! Noise! Noise!
There&amp;rsquo;s one thing John Mastodon hates: All the NOISE! NOISE! NOISE! NOISE!&lt;/p>
&lt;p>And they&amp;rsquo;ll shriek squeaks and squeals, racing &amp;lsquo;round on their hosts.
They&amp;rsquo;ll update with jingtinglers tied onto their posts!
They&amp;rsquo;ll toot their floofloovers. They&amp;rsquo;ll tag their tartookas.
They&amp;rsquo;ll share their whohoopers. They&amp;rsquo;ll follow their #caturday-ookas.
They&amp;rsquo;ll spin their #hashtags. They&amp;rsquo;ll boost their slooslunkas.
They&amp;rsquo;ll defederate their blumbloopas. But complain about their whowonkas.&lt;/p>
&lt;p>And they&amp;rsquo;ll play noisy games like post a cat on #caturday,
An ActivityPub type of all the queers and the gays!
And then they&amp;rsquo;ll make ear-splitting noises galooks
On their great big postgres whocarnio ruby monolith flooks!&lt;/p>
&lt;p>Then the Queues, young and old, will sit down to a feast.
And they&amp;rsquo;ll feast! And they&amp;rsquo;ll feast! And they&amp;rsquo;ll FEAST! FEAST! FEAST! FEAST!&lt;/p>
&lt;p>&lt;img src="https://community.hachyderm.io/blog/2022/12/20/the-queues-%EF%B8%8F-down-in-queueville/img.png" alt="img.png">&lt;/p>
&lt;p>They&amp;rsquo;ll feast on Queue-pudding, and rare Queue-roast-beast,
Ingress Queue roast beast is a feast I can&amp;rsquo;t stand in the least!&lt;/p>
&lt;p>And then they&amp;rsquo;ll do something I hate most of all!
Every Queue down in Queueville, the tall and the small,&lt;/p>
&lt;p>They&amp;rsquo;ll stand close together, with UptimeRobot bells ringing.
They&amp;rsquo;ll stand hand-in-hand, and those Queues will start singing!&lt;/p>
&lt;p>And they&amp;rsquo;ll sing! And they&amp;rsquo;ll sing! And they&amp;rsquo;d SING! SING! SING! SING!
And the more John Mastodon thought of this Queue Activity Sing,
The more John Mastodon thought, &amp;ldquo;I must stop this whole thing!&amp;rdquo;&lt;/p>
&lt;p>Why for fifty-three days I&amp;rsquo;ve put up with it now!
I must stop ActivityPub from coming! But how?&lt;/p>
&lt;h1 id="timeline">Timeline&lt;/h1>
&lt;p>All events are documented in &lt;a href="https://en.wikipedia.org/wiki/Coordinated_Universal_Time">UTC&lt;/a> time.&lt;/p>
&lt;ul>
&lt;li>13:00 &lt;code>@dma&lt;/code> Noticed the ingress queue was backing up&lt;/li>
&lt;li>16:45 &lt;code>@quintessence&lt;/code> Noticed the ingress queue was still lagging&lt;/li>
&lt;li>17:00 &lt;code>@nova&lt;/code> Declared an incident&lt;/li>
&lt;li>17:30 &lt;code>@hazelweakly&lt;/code> Noticed CPU at 100% on Freud and Franz&lt;/li>
&lt;li>17:34 &lt;code>@hazelweakly&lt;/code> Worked with &lt;code>@dma&lt;/code> to rebalance queues across Freud, Franz, and Nietzsche&lt;/li>
&lt;li>17:37 &lt;code>@dma&lt;/code> Notices CPU on Nietzsche is not changing&lt;/li>
&lt;li>17:45 &lt;code>@hazelweakly&lt;/code> Changes 5 &lt;code>MAX_THREADS&lt;/code> to 20 &lt;code>MAX_THREADS&lt;/code> on Nietzsche&lt;/li>
&lt;/ul>
&lt;h1 id="activityeve">ActivityEve&lt;/h1>
&lt;p>&amp;ldquo;I know just what to do!&amp;rdquo; John Mastodon laughed in his throat.
&amp;ldquo;I&amp;rsquo;ll max out the CPU, and cause the network to bloat.&amp;rdquo;&lt;/p>
&lt;p>And he chuckled, and clucked, &amp;ldquo;What a great John Mastodon trick!
With this CPU and network lag, I&amp;rsquo;ll cause the latency to stick!&amp;rdquo;&lt;/p>
&lt;p>&amp;ldquo;All I need is a denial of service.&amp;rdquo; John Mastodon looked around.
But since denial of services are scarce, there was none to be found.&lt;/p>
&lt;p>Did that stop John Mastodon? Hah! John Mastodon simply said,
&amp;ldquo;If I can&amp;rsquo;t find a denial of service, I&amp;rsquo;ll make one instead!&amp;rdquo;&lt;/p>
&lt;p>So he took his dog &lt;code>MAX&lt;/code>, and he took some more &lt;code>EMPTY_THREADS&lt;/code>.
And he tied big &lt;code>WEB_CONCURRENCY&lt;/code> on top of his head.
Then he loaded some cores and some old empty racks.
On a ramshackle sleigh and he whistled for &lt;code>MAX&lt;/code>.&lt;/p>
&lt;p>Then John Mastodon said &amp;ldquo;Giddyap!&amp;rdquo; and the sleigh started down
Toward the homes where the Queues lay a-snooze in their town.&lt;/p>
&lt;p>All their graphs were dark. No one knew he was there.
All the Queues were all dreaming sweet dreams without care.
When he came to the first little house of the square.&lt;/p>
&lt;p>&amp;ldquo;This is stop number one,&amp;rdquo; John Mastodon hissed,
As he climbed up load average, empty cores in his fist.&lt;/p>
&lt;p>Then he slid down the ingress, a rather tight bond.
But if a denial of service could do it, then so could John Mastodon.&lt;/p>
&lt;p>The queues drained only once, for a minute or two.
Then he stuck his posts out in front of the ingress queue!&lt;/p>
&lt;p>Where the little Queue messages hung all in a row.
&amp;ldquo;These messages,&amp;rdquo; he grinched, &amp;ldquo;are the first things to go!&amp;rdquo;&lt;/p>
&lt;p>Then he slithered and slunk, with a smile most unpleasant,
Around the whole server, and he took every message!&lt;/p>
&lt;p>Cat pics, and updates, artwork, and birdsite pleas!
Holiday cheer, Hanukkah, Kwanzaa and holiday trees!&lt;/p>
&lt;p>And he stuffed them in memory. John Mastodon very nimbly,
Stuffed all the posts, one by one, up the chimney.&lt;/p>
&lt;p>Then he slunk to the default queues. He took the queues&amp;rsquo; feast!
He took the queue pudding! He took the roast beast!&lt;/p>
&lt;p>He cleaned out that &lt;code>/inbox&lt;/code> as quick as a flash.
Why, John Mastodon even took the last can of queue hash!&lt;/p>
&lt;p>Then he stuffed all the queues up the chimney with glee.
&amp;ldquo;Now,&amp;rdquo; grinned John Mastodon, &amp;ldquo;I will stuff up the whole process tree!&amp;rdquo;&lt;/p>
&lt;p>As John Mastodon took the process tree, as he started to shove,
He heard a small sound like the coo of a dove&amp;hellip;&lt;/p>
&lt;p>He turned around fast, and he saw a small Queue!
Little Cindy-Lou Queue, who was no more than two.&lt;/p>
&lt;p>She stared at John Mastodon and said, &amp;ldquo;our statuses, why?
Why are you filling our queues? Why?&amp;rdquo;&lt;/p>
&lt;p>But, you know, John Mastodon was so smart and so slick,
He thought up a lie, and he thought it up quick!&lt;/p>
&lt;p>&amp;ldquo;Why, my sweet little tot,&amp;rdquo; John Mastodon lied,
&amp;ldquo;There&amp;rsquo;s a status on this &lt;code>/inbox&lt;/code> that won&amp;rsquo;t light on one side.&lt;/p>
&lt;p>So I&amp;rsquo;m taking it home to my workshop, my dear.
I&amp;rsquo;ll fix it up there, then I&amp;rsquo;ll bring it back here.&amp;rdquo;&lt;/p>
&lt;p>And his fib fooled the child. Then he patted her head,
And he got her a drink, and he sent her to bed.&lt;/p>
&lt;p>And when Cindy-Lou Queue was in bed with her cup,
He crupt to the chimney and stuffed the ingress queues up!&lt;/p>
&lt;p>Then he went up the chimney himself, the old liar.
And the last thing he took was &lt;code>/var/log&lt;/code> for their fire.
On their &lt;code>.bash_history&lt;/code> he left nothing but hooks and some wire.&lt;/p>
&lt;p>And the one speck of content that he left in the house
Was a crumb that was even too small for a mouse.&lt;/p>
&lt;p>Then he did the same thing to the other Queues&amp;rsquo; houses,
Leaving crumbs much too small for the other Queues&amp;rsquo; mouses!&lt;/p>
&lt;h1 id="timeline-1">Timeline&lt;/h1>
&lt;p>All events are documented in &lt;a href="https://en.wikipedia.org/wiki/Coordinated_Universal_Time">UTC&lt;/a> time.&lt;/p>
&lt;ul>
&lt;li>17:58 &lt;code>@dma&lt;/code> Notices we are no longer bottlenecked on Ingress after &lt;code>@hazelweakly&lt;/code> makes changes&lt;/li>
&lt;li>18:03 &lt;code>@dma&lt;/code> Provides update on priority of systemd flags&lt;/li>
&lt;li>18:10 &lt;code>@dma&lt;/code> Provides spreadsheet for us to calculate connections to database&lt;/li>
&lt;/ul>
&lt;h1 id="activitymorn">ActivityMorn&lt;/h1>
&lt;p>It was quarter of dawn. All the Queues still a-bed,
All the Queues still a-snooze, when he packed up his sled,&lt;/p>
&lt;p>Packed it up with their statuses, their posts, their wrappings,
Their posts and their hashtags, their trendings and trappings!&lt;/p>
&lt;p>Ten thousand feet up, up the side of Mount Crumpet,
He rode with his load average to the tiptop to dump it!&lt;/p>
&lt;p>&amp;ldquo;Pooh-pooh to the Queues!&amp;rdquo; he was John Mastodon humming.
&amp;ldquo;They&amp;rsquo;re finding out now that no ActivityPub messages are coming!&lt;/p>
&lt;p>&lt;img src="https://community.hachyderm.io/blog/2022/12/20/the-queues-%EF%B8%8F-down-in-queueville/img_1.png" alt="img_1.png">&lt;/p>
&lt;p>They&amp;rsquo;re just waking up! I know just what they&amp;rsquo;ll do!
Their mouths will hang open a minute or two
Then the Queues down in Queueville will all cry boo-hoo!&lt;/p>
&lt;p>That&amp;rsquo;s a noise,&amp;rdquo; grinned John Mastodon, &amp;ldquo;that I simply must hear!&amp;rdquo;
He paused, and John Mastodon put a hand to his ear.&lt;/p>
&lt;p>And he did hear a sound rising over the snow.
It started in low, then it started to grow.&lt;/p>
&lt;p>But this sound wasn&amp;rsquo;t sad!
Why, this sound sounded glad!&lt;/p>
&lt;p>Every Queue down in Queueville, the tall and the small,
Was singing without any ActivityPub messages at all!&lt;/p>
&lt;p>He hadn&amp;rsquo;t stopped ActivityPub messages from coming! They came!
Somehow or other, they came just the same!&lt;/p>
&lt;p>And John Mastodon, with his feet ice-cold in the snow,
Stood puzzling and puzzling. &amp;ldquo;How could it be so?&amp;rdquo;&lt;/p>
&lt;p>Posts came without #hashtags! It came without tags!
It came without content warnings or bags!&lt;/p>
&lt;p>He puzzled and puzzled till his puzzler was sore.
Then John Mastodon thought of something he hadn&amp;rsquo;t before.&lt;/p>
&lt;p>Maybe ActivityPub, he thought, doesn&amp;rsquo;t come from a database store.
Maybe ActivityPub, perhaps, means a little bit more!&lt;/p>
&lt;p>And what happened then? Well, in Queueville they say
That John Mastodon&amp;rsquo;s small heart grew three sizes that day!&lt;/p>
&lt;p>And then the true meaning of ActivityPub came through,
And John Mastodon found the strength of ten John Mastodons, plus two!&lt;/p>
&lt;p>And now that his heart didn&amp;rsquo;t feel quite so tight,
He whizzed with his load average through the bright morning light!&lt;/p>
&lt;p>With a smile to his soul, he descended Mount Crumpet
Cheerily blowing &amp;ldquo;Queue! Queue!&amp;rdquo; aloud on his trumpet.&lt;/p>
&lt;p>He rode into Queueville. He brought back their joys.
He brought back their #caturday images to the Queue girls and boys!&lt;/p>
&lt;p>He brought back their status and their pictures and tags,
Brought back their posts, their content and #hashtags.&lt;/p>
&lt;p>&lt;img src="https://community.hachyderm.io/blog/2022/12/20/the-queues-%EF%B8%8F-down-in-queueville/img_2.png" alt="img_2.png">&lt;/p>
&lt;p>He brought everything back, all the CPU for the feast!
And he, he himself, John Mastodon carved the roast beast!&lt;/p>
&lt;p>Welcome ActivityPub. Bring your cheer,
Cheer to all Queues, far and near.&lt;/p>
&lt;p>ActivityDay is in our grasp
So long as we have friends&amp;rsquo; statuses to grasp.&lt;/p>
&lt;p>ActivityDay will always be
Just as long as we have we.&lt;/p>
&lt;p>Welcome ActivityPub while we stand
Heart to heart and hand in hand.&lt;/p>
&lt;h1 id="timeline-2">Timeline&lt;/h1>
&lt;p>All events are documented in &lt;a href="https://en.wikipedia.org/wiki/Coordinated_Universal_Time">UTC&lt;/a> time.&lt;/p>
&lt;ul>
&lt;li>18:10 &lt;code>@hazelweakly&lt;/code> Provides update that queues are now balancing and load is coming down&lt;/li>
&lt;li>18:18 &lt;code>@nova&lt;/code> Confirms queues are draining and systems are stabilizing&lt;/li>
&lt;/ul>
&lt;h1 id="root-cause">Root Cause&lt;/h1>
&lt;p>John Mastodon took the queue hash, and up the chimney he stuck it.
The Hachyderm crew was too tired to fill out the report and said &amp;ldquo;fuck it&amp;rdquo;.&lt;/p>
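&lt;p>The ingress changes recorded in this report keep the database connection count on Nietzsche constant while lowering it on Franz and Freud. A small sketch of that arithmetic follows, using only the process counts and pool sizes stated in the change list; the function and variable names are made up for illustration.&lt;/p>

```javascript
// Connections held by ingress sidekiq processes on one host:
// number of processes multiplied by the per-process pool size (DB_POOL).
function ingressConnections(processes, dbPool) {
  return processes * dbPool;
}

// After the moves to Nietzsche, but before DB_POOL was lowered from 20 to 5.
const before = {
  nietzsche: ingressConnections(8, 20),
  franz: ingressConnections(1, 20),
  freud: ingressConnections(2, 20),
};

// After DB_POOL was lowered to 5 and Nietzsche was scaled to 32 processes.
const after = {
  nietzsche: ingressConnections(32, 5),
  franz: ingressConnections(1, 5),
  freud: ingressConnections(2, 5),
};

for (const host of Object.keys(before)) {
  console.log(`${host}: ${before[host]} -> ${after[host]} ingress DB connections`);
}
// nietzsche: 160 -> 160, franz: 20 -> 5, freud: 40 -> 10
```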
&lt;pre tabindex="0">&lt;code>Nietzsche:
- 4 default queues (unchanged)
- 32 ingress queues (changed)
Franz:
- 6 default queues (unchanged)
- 1 ingress queue (changed)
- 5 pull queues (unchanged)
- 5 push queues (unchanged)
Freud:
- 3 default queues (unchanged)
- 2 ingress queues (changed)
- 2 pull queues (changed)
- 2 push queues (changed)
Changes:
Because the database connection count per ingress queue process changed, when necessary, I will clarify queue amounts in terms of database connections.
- Moved 2 ingress queues (40 DB connections) from franz to nietzsche
- Moved 2 ingress queues (40 DB connections) from freud to nietzsche
- Changed DB_POOL on ingress queues from 20 to 5 as they&amp;#39;re heavily CPU bound.
- Changed -c on ingress queues from 20 to 5 as they&amp;#39;re heavily CPU bound.
- Scaled Nietzsche up from 8 ingress queues to 32 to keep the amount of total database connections the same.
- Restarted the one ingress queue remaining on franz (this lowered ingress DB connections from 20 to 5).
- Restarted the two ingress queues remaining on freud (this lowered ingress DB connections from 40 to 10).
- Removed a &amp;#34;pushpull&amp;#34; systemd service on Freud and replaced it with independent push and pull sidekiq processes (neutral db connection change).
&lt;/code>&lt;/pre></description></item><item><title>Blog: Degraded Service: Media Caching and Queue Latency</title><link>https://community.hachyderm.io/blog/2022/12/18/degraded-service-media-caching-and-queue-latency/</link><pubDate>Sun, 18 Dec 2022 00:00:00 +0000</pubDate><guid>https://community.hachyderm.io/blog/2022/12/18/degraded-service-media-caching-and-queue-latency/</guid><description>
&lt;p>On Saturday, December 17th, 2022 at roughly 12:43 UTC Hachyderm received our &lt;a href="https://github.com/hachyderm/community/issues/217">first report of media failures&lt;/a>, which started a 2-day-long investigation of our systems by &lt;code>@hazelweakly&lt;/code>, &lt;code>@quintessence&lt;/code>, &lt;code>@dma&lt;/code>, and &lt;code>@nova&lt;/code>. The investigation coincided with an anticipated spike in growth, which unexpectedly degraded our systems at the same time.&lt;/p>
&lt;p>The first degradation was unplanned media failures, typically avatar and profile icons intermittently failing to load on the service. We had an increase in 4XX level responses due to misconfigured cache settings in our CDN. We believe the Western US to be the only region impacted by this degradation.&lt;/p>
&lt;p>&lt;img src="https://community.hachyderm.io/blog/2022/12/18/degraded-service-media-caching-and-queue-latency/img.png" alt="img.png">&lt;/p>
&lt;p>The second degradation was an unplanned increase in queue latency, presumably from the increase in usage due to the fallout of the Twitter mass exodus. We experienced increased latency in our &lt;code>push&lt;/code> and &lt;code>pull&lt;/code> queues, as well as a short period of &lt;code>default&lt;/code> latency.&lt;/p>
&lt;p>&lt;img src="https://community.hachyderm.io/blog/2022/12/18/degraded-service-media-caching-and-queue-latency/img-2.png" alt="img-2.png">&lt;/p>
&lt;h1 id="timeline">Timeline&lt;/h1>
&lt;p>All events are documented in &lt;a href="https://en.wikipedia.org/wiki/Coordinated_Universal_Time">UTC&lt;/a> time.&lt;/p>
&lt;ul>
&lt;li>Dec 16th &lt;strong>12:43&lt;/strong> &lt;code>@arjenpdevries&lt;/code> First report of media cache misses &lt;a href="https://github.com/hachyderm/community/issues/217">#217&lt;/a>&lt;/li>
&lt;li>Dec 17th &lt;strong>08:21&lt;/strong> &lt;code>@blueturtleai&lt;/code> 2nd Report, and first confirmation of media cache misses &lt;a href="https://github.com/hachyderm/community/issues/218">#218&lt;/a>&lt;/li>
&lt;li>Dec 17th &lt;strong>21:43&lt;/strong> &lt;code>@quintessence&lt;/code> 3rd Report of media cache misses&lt;/li>
&lt;li>Dec 17th &lt;strong>21:44&lt;/strong> &lt;code>@nova&lt;/code> False remediation of &lt;code>cmd+shift+r&lt;/code> cache refresh&lt;/li>
&lt;li>Dec 17th &lt;strong>22:XX&lt;/strong> More reports of cache failures, multiple Discord channels, and posts&lt;/li>
&lt;li>Dec 17th &lt;strong>23:XX&lt;/strong> More reports of cache failures, multiple Discord channels, and posts&lt;/li>
&lt;li>Dec 18th &lt;strong>00:XX&lt;/strong> Still assuming &amp;ldquo;cache problems&amp;rdquo; will just fix themselves&lt;/li>
&lt;li>Dec 18th &lt;strong>14:45&lt;/strong> &lt;code>@dma&lt;/code> Nginx audit and &lt;code>location{}&lt;/code> rewrite on &lt;code>fritz&lt;/code>; no results&lt;/li>
&lt;li>Dec 18th &lt;strong>14:45&lt;/strong> &lt;code>@dma&lt;/code> No success debugging various CDN nodes and cache strategies&lt;/li>
&lt;li>Dec 18th &lt;strong>15:16&lt;/strong> &lt;code>@dma&lt;/code> Check mastodon-web logs on CDNs; /system GETs with 404s&lt;/li>
&lt;li>Dec 18th &lt;strong>20:32&lt;/strong> &lt;code>@hazelweakly&lt;/code> Discovered &lt;code>.env.production&lt;/code> misconfiguration cdn-frankfurt-1, franz&lt;/li>
&lt;li>Dec 18th &lt;strong>20:41&lt;/strong> &lt;code>@quintessence&lt;/code> Confirms queues are backing up&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://community.hachyderm.io/blog/2022/12/18/degraded-service-media-caching-and-queue-latency/img_1.png" alt="img_1.png">&lt;/p>
&lt;p>&lt;img src="https://community.hachyderm.io/blog/2022/12/18/degraded-service-media-caching-and-queue-latency/img_2.png" alt="img_2.png">&lt;/p>
&lt;ul>
&lt;li>Dec 18th &lt;strong>20:45&lt;/strong> &lt;code>@hazelweakly&lt;/code> Confirms actively reloading services to drain queues&lt;/li>
&lt;li>Dec 18th &lt;strong>21:17&lt;/strong> &lt;code>@malte_j&lt;/code> Appears from vacation, and is told to go back to relaxing&lt;/li>
&lt;li>Dec 18th &lt;strong>21:23&lt;/strong> &lt;code>@hazelweakly&lt;/code> Continues to &amp;ldquo;tweak and tune&amp;rdquo; the queues&lt;/li>
&lt;li>Dec 18th &lt;strong>21:32&lt;/strong> &lt;code>@hazelweakly&lt;/code> Claims we are growing at &amp;lt;1 user per minute&lt;/li>
&lt;li>Dec 18th &lt;strong>21:45&lt;/strong> &lt;code>@dma&lt;/code> Reminder to only focus on &lt;code>ingress&lt;/code> and &lt;code>default&lt;/code> queues&lt;/li>
&lt;li>Dec 18th &lt;strong>21:47&lt;/strong> &lt;code>@hazelweakly&lt;/code> Identifies queue priority fix using systemd units&lt;/li>
&lt;li>Dec 18th &lt;strong>21:47&lt;/strong> &lt;code>@hazelweakly&lt;/code> Suggests moving queues to CDN nodes&lt;/li>
&lt;li>Dec 18th &lt;strong>21:59&lt;/strong> &lt;code>@dma&lt;/code> Suggests migrating DB from &lt;code>freud&lt;/code> -&amp;gt; &lt;code>nietzsche&lt;/code>&lt;/li>
&lt;li>Dec 18th &lt;strong>22:15&lt;/strong> &lt;code>@hazelweakly&lt;/code> Summary confirms sidekiq running on CDNs&lt;/li>
&lt;li>Dec 18th &lt;strong>22:18&lt;/strong> &lt;code>@nova&lt;/code> Identifies conversation in Discord, and begins report&lt;/li>
&lt;/ul>
&lt;h1 id="root-cause">Root Cause&lt;/h1>
&lt;p>The cause of the caching 4XX responses and broken avatars was a misconfigured &lt;code>.env.production&lt;/code> file on &lt;code>cdn-fremont-1&lt;/code> and &lt;code>franz&lt;/code>.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#000">S3_ENABLED&lt;/span>&lt;span style="color:#ce5c00;font-weight:bold">=&lt;/span>FALSE &lt;span style="color:#8f5902;font-style:italic"># Should be true&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#000">3_BUCKET&lt;/span>&lt;span style="color:#ce5c00;font-weight:bold">=&lt;/span>&lt;span style="color:#4e9a06">&amp;#34;..&amp;#34;&lt;/span> &lt;span style="color:#8f5902;font-style:italic"># Should be S3_BUCKET&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The cause of the queue latency is suspected to be the increase in usage from Twitter, as well as the queue priority documented &lt;a href="https://docs.joinmastodon.org/admin/scaling/#sidekiq-queues">here in the official Mastodon scaling up documentation&lt;/a>.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#000">ExecStart&lt;/span>&lt;span style="color:#ce5c00;font-weight:bold">=&lt;/span>/usr/bin/bundle &lt;span style="color:#204a87">exec&lt;/span> sidekiq -c &lt;span style="color:#0000cf;font-weight:bold">10&lt;/span> -q default
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="things-that-went-well">Things that went well&lt;/h3>
&lt;p>The cached media is fixed, and we were alerted to a high-risk concern early, giving the team enough time to respond.&lt;/p>
&lt;h3 id="things-that-went-poorly">Things that went poorly&lt;/h3>
&lt;p>An outage was never declared for this incident, and therefore it was not handled as well as it could have been.
Various members of the team were mutating production with reckless working habits:&lt;/p>
&lt;ul>
&lt;li>Documenting informally in private infrastructure GitHub repository&lt;/li>
&lt;li>Discord used as documentation&lt;/li>
&lt;li>Not documenting at all, just &amp;ldquo;tinkering&amp;rdquo; alone&lt;/li>
&lt;li>Documenting after the fact&lt;/li>
&lt;li>Not using descriptive language, e.g. &amp;ldquo;Tweaked the CDNs&amp;rdquo; instead of &amp;ldquo;changed &amp;lt;this file&amp;gt; on &amp;lt;this server&amp;gt; from &amp;lt;this value&amp;gt; to &amp;lt;that value&amp;gt;&amp;rdquo;.&lt;/li>
&lt;/ul>
&lt;p>Unknown state of production after the incident. Unsure which services are running where, and who has what expectations for which services.&lt;/p>
&lt;p>The configuration roll-out had evidently failed at some point, indicating a stronger need for config management on our servers.&lt;/p>
&lt;p>We lost track of where the incident started and stopped, and where improvements and action items began. We made suggestions about next steps before we were sure of the current state of the systems and before we had a plan in place.&lt;/p>
&lt;h3 id="opportunities">Opportunities&lt;/h3>
&lt;p>Config management should be a top priority.&lt;/p>
&lt;p>Auditing and migrating sidekiq services off of CDN nodes should be a top priority.&lt;/p>
&lt;p>Migrating the database from &lt;code>freud&lt;/code> -&amp;gt; &lt;code>nietzsche&lt;/code> should be a priority.&lt;/p>
&lt;p>We shouldn&amp;rsquo;t be planning or discussing future improvements until the systems are restored to stability. An incident is not a venue for decision-making.&lt;/p>
&lt;h1 id="resulting-action">Resulting Action&lt;/h1>
&lt;h4 id="1-plan-for-postgres-migration">1) Plan for Postgres migration&lt;/h4>
&lt;p>&lt;code>@nova&lt;/code> and &lt;code>@hazelweakly&lt;/code> planning live stream to migrate production database and clear up more compute power for sidekiq queues&lt;/p>
&lt;h4 id="2-todo-configuration-management">2) TODO Configuration Management&lt;/h4>
&lt;p>We need to identify a configuration management pattern for our systems sooner rather than later. Perhaps an opportunity for a new volunteer.&lt;/p>
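&lt;p>Until a pattern is chosen, even a small pre-deploy check might catch the kind of misspelled key seen in this incident. The following is a minimal sketch, not something the team runs: the function name &lt;code>require_keys&lt;/code> is hypothetical, and it only verifies that each key is present as a &lt;code>KEY=&lt;/code> line, not that its value is correct (so it would flag the missing &lt;code>S3_BUCKET&lt;/code> but not &lt;code>S3_ENABLED=FALSE&lt;/code>).&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0">&lt;code class="language-bash" data-lang="bash">#!/bin/bash

# require_keys FILE KEY...
# Prints one line for every KEY that has no "KEY=" line in FILE.
# Returns 0 when all keys are present and 1 when any key is missing.
require_keys() {
  local file="$1" key missing=0
  shift
  for key in "$@"; do
    if ! grep -Eq "^${key}=" "$file"; then
      echo "missing ${key} in ${file}"
      missing=1
    fi
  done
  return "$missing"
}
&lt;/code>&lt;/pre>&lt;/div>
&lt;p>For example, &lt;code>require_keys .env.production S3_ENABLED S3_BUCKET&lt;/code> could run on each node before services are restarted, with a non-zero exit status blocking the roll-out.&lt;/p>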
&lt;h4 id="3-todo-discord-bot-incident-command">3) TODO Discord Bot Incident Command&lt;/h4>
&lt;p>We need to identify ways of managing and starting and stopping incidents using Discord. Maybe in the future we can have &amp;ldquo;live operating room&amp;rdquo; incidents where folks can watch read-only during the action.&lt;/p></description></item><item><title>Blog: Global Outage: 504 Timeouts</title><link>https://community.hachyderm.io/blog/2022/12/13/global-outage-504-timeouts/</link><pubDate>Tue, 13 Dec 2022 00:00:00 +0000</pubDate><guid>https://community.hachyderm.io/blog/2022/12/13/global-outage-504-timeouts/</guid><description>
&lt;p>On Tuesday, December 13th, 2022 at roughly 18:52 UTC, Hachyderm experienced a 7-minute cascading failure that impacted users around the globe, resulting in unresponsive HTTP(S) requests and 5XX-level responses. The service did not experience any data loss. We believe this was a total service outage.&lt;/p>
&lt;p>Impacted users experienced 504 timeout responses from &lt;code>https://hachyderm.io&lt;/code> in all regions of the world.&lt;/p>
&lt;h1 id="timeline">Timeline&lt;/h1>
&lt;p>All events are documented in &lt;a href="https://en.wikipedia.org/wiki/Coordinated_Universal_Time">UTC&lt;/a> time.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>18:53&lt;/strong> &lt;code>@nova&lt;/code> First report of slow response times in Discord&lt;/li>
&lt;li>&lt;strong>18:55&lt;/strong> &lt;code>@dma&lt;/code> First confirmation, and first report of 5XX responses globally&lt;/li>
&lt;li>&lt;strong>18:56&lt;/strong> &lt;code>@dma&lt;/code> Check of Mastodon web services, no immediate concerns&lt;/li>
&lt;li>&lt;strong>18:56&lt;/strong> &lt;code>@nova&lt;/code> Check of CDN proxy services, no immediate concerns&lt;/li>
&lt;li>&lt;strong>18:57&lt;/strong> &lt;code>@nova&lt;/code> First observed 504 timeout&lt;/li>
&lt;li>&lt;strong>18:58&lt;/strong> &lt;code>@dma&lt;/code> &lt;a href="https://status.hachyderm.io">status.hachyderm.io&lt;/a> updated acknowledging the outage&lt;/li>
&lt;li>&lt;strong>18:59&lt;/strong> &lt;code>@nova&lt;/code> First observed redis error, unable to persist to disk&lt;/li>
&lt;/ul>
&lt;pre tabindex="0">&lt;code>Dec 13 18:59:01 fritz bundle[588687]: [2eae54f0-292d-488e-8fdd-5c35873676c0] Redis::CommandError (MISCONF Redis is configured to save RDB snapshots, but it&amp;#39;s currently unable to persist to disk. Commands that may modify the data set are disabled, because this instance is configured to report errors during writes if RDB snapshotting fails (stop-writes-on-bgsave-error option). Please check the Redis logs for details about the RDB error.):
&lt;/code>&lt;/pre>&lt;ul>
&lt;li>&lt;strong>19:01&lt;/strong> &lt;code>@UptimeRobot&lt;/code> First alert received&lt;/li>
&lt;/ul>
&lt;pre tabindex="0">&lt;code>Monitor is DOWN: hachyderm streaming
( https://hachyderm.io/api/v1/streaming/health ) - Reason: HTTP 502 - Bad Gateway
&lt;/code>&lt;/pre>&lt;ul>
&lt;li>&lt;strong>19:02&lt;/strong> &lt;code>@nova&lt;/code> Root cause detected. The root filesystem is full on our primary database server.&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://community.hachyderm.io/blog/2022/12/13/global-outage-504-timeouts/img.png" alt="img.png">&lt;/p>
&lt;ul>
&lt;li>&lt;strong>19:04&lt;/strong> &lt;code>@nova&lt;/code> Identified that postgres archive &lt;code>/var/lib/postgres/archive&lt;/code> data exceeds 400GB of history&lt;/li>
&lt;li>&lt;strong>19:05&lt;/strong> &lt;code>@malte_j&lt;/code> Request to destroy archive&lt;/li>
&lt;li>&lt;strong>19:06&lt;/strong> &lt;code>@malte_j&lt;/code> Confirmed archive has been destroyed&lt;/li>
&lt;li>&lt;strong>19:06&lt;/strong> &lt;code>@malte_j&lt;/code> Confirmed 187GB of space has been recovered&lt;/li>
&lt;li>&lt;strong>19:06&lt;/strong> &lt;code>@dma&lt;/code> &lt;a href="https://status.hachyderm.io">status.hachyderm.io&lt;/a> updated acknowledging the root cause&lt;/li>
&lt;li>&lt;strong>19:07&lt;/strong> &lt;code>@nova&lt;/code> Begin drafting postmortem notes&lt;/li>
&lt;li>&lt;strong>19:16&lt;/strong> &lt;code>@nova&lt;/code> Official announcement posted to Hachyderm&lt;/li>
&lt;/ul>
&lt;h1 id="root-cause">Root Cause&lt;/h1>
&lt;p>A full root filesystem on the primary database server caused a cascading failure: it first blocked Redis from persisting to disk, which later resulted in 5XX responses at the edge.&lt;/p>
&lt;h3 id="things-that-went-well">Things that went well&lt;/h3>
&lt;p>We had a place to organize, and folks on standby to respond to the incident.&lt;/p>
&lt;p>We were able to respond and recover in less than 10 minutes.&lt;/p>
&lt;p>We were able to document and move forward in less than 60 minutes.&lt;/p>
&lt;h3 id="things-that-went-poorly">Things that went poorly&lt;/h3>
&lt;p>There was confusion about who had access to update &lt;code>status.hachyderm.io&lt;/code> and this is still unclear.&lt;/p>
&lt;p>There was confusion about where Redis lived, and which systems in the stack were dependent on Redis.&lt;/p>
&lt;p>The Novix installer is still our largest problem and is responsible for a lot of confusion. We do not have a better way forward to manage packages and configs in production. We need to decide on &lt;code>Nix&lt;/code> and our path forward as soon as possible.&lt;/p>
&lt;h3 id="opportunities">Opportunities&lt;/h3>
&lt;p>We need to harden our credential management process, and account management. We need to have access to our systems.&lt;/p>
&lt;p>We need a global view of our architecture, ideally observed from the systems themselves rather than drawn in a diagram.&lt;/p>
&lt;p>When an announcement is resolved, it removes the status entirely from UptimeRobot. We can likely improve this.&lt;/p>
&lt;h1 id="resulting-action">Resulting Action&lt;/h1>
&lt;h4 id="1-cron-cleanup-scheduled-malte_j">1) Cron cleanup scheduled &lt;code>@malte_j&lt;/code>&lt;/h4>
&lt;p>Cron job scheduled to remove postgres archive files older than 5 days.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#8f5902;font-style:italic">#!/bin/bash
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#8f5902;font-style:italic">&lt;/span>&lt;span style="color:#204a87">set&lt;/span> -e
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#204a87">cd&lt;/span> /var/lib/postgres/data/archive
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>find * -type f -mtime &lt;span style="color:#0000cf;font-weight:bold">5&lt;/span> -print0 &lt;span style="color:#000;font-weight:bold">|&lt;/span> sort -z &lt;span style="color:#000;font-weight:bold">|&lt;/span> tail -z -n &lt;span style="color:#0000cf;font-weight:bold">1&lt;/span> &lt;span style="color:#000;font-weight:bold">|&lt;/span> xargs -r0 pg_archivecleanup /var/lib/postgres/data/archive
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="2-alerts-configured-dma">2) Alerts configured &lt;code>@dma&lt;/code>&lt;/h4>
&lt;p>Alerts scheduled for &lt;code>&amp;gt;90%&lt;/code> filesystem storage on database nodes.&lt;/p>
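&lt;p>As a rough illustration only (this is not the alert that was actually configured), a threshold check of this kind could be sketched in bash as follows. The function names &lt;code>usage_percent&lt;/code>, &lt;code>over_threshold&lt;/code> and &lt;code>check_path&lt;/code> are hypothetical, the parsing assumes the POSIX output format of &lt;code>df -P&lt;/code> and a mount point without spaces, and the script only prints a message where a real alert would notify someone.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0">&lt;code class="language-bash" data-lang="bash">#!/bin/bash

# usage_percent PATH
# Prints the used-space percentage (digits only) of the filesystem holding PATH.
usage_percent() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# over_threshold USED LIMIT
# Succeeds only when USED is strictly greater than LIMIT.
over_threshold() {
  [ "$1" -gt "$2" ]
}

# check_path PATH LIMIT
# Prints an alert line when the filesystem holding PATH is over LIMIT percent.
check_path() {
  local path="$1" limit="$2" used
  used="$(usage_percent "$path")"
  if over_threshold "$used" "$limit"; then
    echo "ALERT: ${path} is ${used}% full (threshold ${limit}%)"
  fi
}

# Assumed example: watch the root filesystem at the 90% threshold named above.
check_path / 90
&lt;/code>&lt;/pre>&lt;/div>
&lt;p>Run from cron or a systemd timer, a check like this would have had to fire before the root filesystem filled for it to help, so the threshold and the check interval both matter.&lt;/p>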
&lt;p>&lt;a href="https://hackmd.io/9WtCp6MgQ_al1eKGvqAWkg">Postmortem template&lt;/a> created for future incidents.&lt;/p>
&lt;h4 id="3-postmortem-documented-nova">3) Postmortem documented &lt;code>@nova&lt;/code>&lt;/h4>
&lt;p>This blog post as well as a small discussion in Discord.&lt;/p></description></item></channel></rss>