Rolling Restart Tue/Wed/Thu July 8-10: Please Test Server Version 1.23 on the Preview Grid Now!
Thursday, July 3rd, 2008
For more information, please see this post on the public blog.
For more information, please see this post on our main blog.
[Update 2008-05-01 08:02am] The rolling restart to deploy 1.21.1 to the rest of the grid began at about 5:00am this morning. It is now complete.
[Update 2008-04-30 09:35am] The rolling restart to deploy 1.21.1 to the first half of the grid began at about 6:15am, and is now complete. The rest of the grid will receive 1.21.1 tomorrow morning.
[Update 2008-04-29 5:30pm] We pushed another pilot roll to the same 3 racks as yesterday, at 5pm today. The roll is complete. The schedule below has been updated to reflect this.
[Update 2008-04-29 9:15am] Just to confirm the earlier update - we’re officially rescheduling the rolling restart to Wednesday/Thursday. The schedule below has been updated to reflect this.
[Update 2008-04-29 6:00am] Because of the ongoing network problems that we are struggling to resolve, the rolling restart has not begun yet this morning. It will almost certainly be postponed; the rolling restart is likely to happen Wednesday and Thursday mornings instead of today and tomorrow. More information will be posted here as it becomes available.
One of the changes that went out in the 1.21 Server codebase enables us to alleviate database load caused by “spare” simulators - processes waiting to pick up regions after a restart. Unfortunately, a bug was found that prevents us from enabling the service. The bug did not hold up the 1.21 Server deploy significantly since it affected hosts in only one of our co-location facilities, and the new service was disabled within a few minutes of this being noticed for those hosts.
To send out a fix and reap the benefits of lower database load we need to do a follow-up rolling restart to 1.21.1 Server. (We’re as thrilled as you are.) There are no behavior changes. No new viewer is required. Each region will be given a 5 minute warning and then restarted.
Schedule:
[Updated Saturday @ 09:10am] The rolling restart of the rest of the grid is now complete.
[Updated Saturday @ 8:40am] The rolling restart of the rest of the grid is now in progress. It began at 5:10am, and is now 93% complete. As usual, each region will be down for ~5 minutes. If your region is down for more than 20 minutes, please contact support.
[Updated Saturday @ 7:06am] The rolling restart of the rest of the grid is now in progress. It began at 5:10am, and is now 46% complete. As usual, each region will be down for ~5 minutes. If your region is down for more than 20 minutes, please contact support.
[Updated Saturday @ 6:05am] The rolling restart of the rest of the grid is now in progress. It began at 5:10am, and is now 16% complete. As usual, each region will be down for ~5 minutes. If your region is down for more than 20 minutes, please contact support.
[Updated Saturday @ 5:10am] The rolling restart of the rest of the grid is now in progress. It began at 5:10am; we will post hourly updates with a percentage completed. As usual, each region will be down for ~5 minutes. If your region is down for more than 20 minutes, please contact support.
[Updated Friday @ 8:39am] The rolling restart to half of the grid is now complete but for 7 hosts that needed to be manually updated; those will be completed within a few minutes. The rest of the grid will be updated tomorrow morning.
[Updated Thursday @ 7:10pm] We have completed the deploy of 1.21 to 3 racks (632 regions). Here is a list of regions that as of now are on version 1.21.0.85745.
[Updated Thursday at 12:47pm] We have deployed 1.21 to 1 rack (about 170 regions) again. If all goes well, we will continue with the tentative timeline listed in the Wednesday at 8:10pm update below.
[Update Wednesday @ 9:15pm] A slight and subtle wrinkle during the deploy left some object-to-object emails non-functional. The responsible systems have gotten a stern talking to, and this service should be operational again.
[Update Wednesday @ 8:10pm] Another bug was found after we rolled out to one rack. That bug has been found and fixed. We will evaluate exactly what we’re going to do with this deploy after testing tomorrow, but it will likely shift the timeline forward by one day. Meanwhile, we are rolling back the 170 regions that had previously received a 1.21 deploy so that all simulators are once again running on version 1.20.1 of the server code.
The central updates to 1.21 are complete and things seem “nominal” at the moment, but of course we’ll be watching closely.
[Update Wednesday @ 10:25am]
The bug in the 1.21 Server code identified last night during an initial rollout to 1 rack has been found, fixed, and verified. We’re planning to proceed with the rollout to avoid delaying the code update another week. On the table for today are the central services updates and limited rolling restarts.
What’s Changed in 1.21 Server
The most notable fixes will be physics-related, and have been in testing in the Beta Preview for several days. No new viewer is required.
Read on for more information…
[Update 2008-04-16 21:10] Several of the regions that received version 1.21 are showing problems, so we are going to revert them to 1.20. Many of the regions remain down; they will be back up within half an hour.
[Update 2008-04-16 20:30] The deploy to 490 regions will begin momentarily.
[Update 2008-04-16 17:00] We are in the middle of updating the central servers. Note that if you watch the concurrency plots, you will see dips as we restart the servers that report concurrency numbers. These dips do not mean that people are getting kicked offline; they are just a reset of the data collection. The deploy to 500 regions will begin later tonight.
We will be doing a rolling restart this Wednesday and Thursday to roll out the patches to the server that were to be rolled out with last week’s cancelled rolling restart. Changes include security patches, performance improvements for Havok4 (including the issue that “openspace” or “void” sims have with Havok4), and code designed to mitigate the load on the central database systems.
We will do this with a usual 3-stage deploy:
There will be no viewer updates required as a result of this deploy. All regions will receive warnings beginning five minutes before they are shut down. During the rolling restart, regions should be back 5-10 minutes after they are stopped. If your region stays down more than 20 minutes, please contact support.