Over this past weekend (October 4 through October 7, 2013), you may have noticed that CellarTracker experienced some site outages. The root cause was a catastrophic failure of one of the flash cards in our new storage array, which required a full restore of the site from our backups in order to get things running again. As a result, we have unfortunately suffered some minor data loss between the time of our last recoverable backups and the time of the total failure (around 6:30 AM PDT on October 6, 2013). Specifically:
- Any activity occurring AFTER 6:38 PM PDT on October 5 (9:38 PM EDT on October 5 / 1:38 AM GMT on October 6) has unfortunately been lost.
- All forum activity occurring AFTER September 29, 2013 has also been lost.
- All label images uploaded AFTER September 29, 2013 are currently in limbo. We may be able to recover these, but further investigation is required to determine whether that is feasible.
We understand the trust you place in us with your data, and we consider data protection and security of the utmost importance. Unfortunately, even with our redundant systems in place, the nature of the failure was such that we had no choice but to revert to our backups. We sincerely apologize for the inconvenience this outage may have caused you, and you can be assured we will be doing our best to ensure nothing of this nature ever happens again. CellarTracker is a labor of love, and it pains us immensely to even have to post this message. We truly appreciate your support and patience as we work through the final stages of bringing the site back to 100%.
Please note: in order to replace our failed component, we are expecting another 10-15 minute window of downtime sometime between 2:45pm and 3:15pm PDT on Monday, October 7. We will do our best to minimize the inconvenience.
As always, if you have any questions or concerns, please contact us. Additional information about the failure, the root cause, and our post-mortem will be available below, or you can follow the discussion in the forum thread about the outage.
Eric LeVine, Dan Polivy, and Andrew Hall
Last Updated: October 7, 2013 - 7 am PDT
Our focus at the moment is restoring the site to 100% availability, and a full post-mortem will be forthcoming. In the meantime, in an effort to offer full transparency, the following is the summarized timeline of events leading up to, and including, the outage:
- At the end of August, 2013, we made significant upgrades to our hardware infrastructure, including the addition of a solid-state storage array. The solid-state storage array provided blazing-fast storage and extremely low latencies, ultimately resulting in better site performance.
- October 4, 2013: Our main SQL database began to show CRC errors, indicating some sort of data corruption. Tracing the errors down the stack, we found low-level errors occurring on one of the cards in the storage array itself.
- October 5, 2013, Morning: We initiated a fail-over from the bad card to the hot spare. After a few minutes of work, that process failed and took the storage array offline completely, resulting in a site outage of a few hours on the morning of Friday October 5 (PDT).
- October 5, 2013, Noon: A reboot of the storage array appeared to clear the failure, and while it was still running on the bad card, the device re-initiated a fail-over to the hot spare. During this time, the storage was accessible, and no further errors were seen, so we decided to bring the site back up, while in parallel running additional backup jobs on our most important assets.
- October 6, 2013, 6:50 am PDT: The fail-over on the storage array completed automatically, and, unfortunately, this was the event that took down the entire site. The data copied from the bad card to the hot spare was corrupted and, after much investigation, was deemed fully unrecoverable. Unfortunately, our most recent backups were still in the process of being copied from the storage array to an alternate device at the time of the failure.
- October 6, 2013: We began a full restore of all of our servers from the most recently available backups. We were able to recover all machines.
- October 7, 2013: We applied our most recent database backup and ran a full consistency check to verify data integrity. This process completed successfully around 5am PDT, and we slowly brought the site back online.
- October 7, 2013, 7am PDT: As of 7am, the site is back online. We are still working to restore our secondary database servers to bring our capacity back to 100%. We are also working with our vendor to replace the failed flash card, which will require a few minutes of downtime. Additional updates will be posted as available.