Rolling Restart for 1.21 Server Deploy Wed/Thu/Fri

[Updated Friday @ 8:39am] The rolling restart to half of the grid is now complete but for 7 hosts that needed to be manually updated; those will be completed within a few minutes. The rest of the grid will be updated tomorrow morning.

[Updated Thursday @ 7:10pm] We have completed the deploy of 1.21 to 3 racks (632 regions). Here is a list of regions that as of now are on version 1.21.0.85745.

[Updated Thursday at 12:47pm] We have deployed 1.21 to 1 rack (about 170 regions) again. If all goes well, we will continue with the tentative timeline listed in the Wednesday at 8:10pm update below.

[Update Wednesday @ 9:15pm] A slight and subtle wrinkle during the deploy left some object-to-object emails non-functional. The responsible systems have gotten a stern talking to, and this service should be operational again.

[Update Wednesday @ 8:10pm] Another bug was found after we rolled out to one rack. That bug has been found and fixed. We will evaluate exactly what we’re going to do with this deploy after testing tomorrow, but it will likely delay the timeline by one day. Meanwhile, we are rolling back the 170 regions that had previously received a 1.21 deploy so that all simulators are once again running on version 1.20.1 of the server code.

The central updates to 1.21 are complete and things seem “nominal” at the moment, but of course we’ll be watching closely.

  • Wednesday 4/23 @ 11am - deploy to 1 rack [DONE] [REVERTED]
  • Wednesday 4/23 - update central systems throughout the day [COMPLETE]
  • Thursday 4/24 @ 6pm - deploy to 3 racks [COMPLETE]
  • Friday 4/25 @ 5am-11am - deploy to half of remaining servers
  • Saturday 4/26 @ 5am-11am - deploy to remaining servers

[Update Wednesday @ 10:25am]

The bug in the 1.21 Server code identified last night during an initial rollout to 1 rack has been found, fixed, and verified. We’re planning to proceed with the rollout to avoid delaying the code update another week. On the table for today are the central services updates and limited rolling restarts.

What’s Changed in 1.21 Server

The most notable fixes will be physics-related, and have been in testing in the Beta Preview for several days. No new viewer is required.

Read on for more information…

More Details

A “rack” is a physical set of about 40 sim hosts, so about 160 regions, give or take spares. This is also a handy-sized unit for initial rollouts. We’ve started doing restarts spread across several days to catch any configuration or scaling issues before they affect the whole service, and also because the service is now so large (10 times as many hosts as when I started, I believe) that we need to do it in pieces.
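To make the phase sizes concrete, here is a rough sketch of how a rollout like this one could be split up. It is an illustration only, not our deploy tooling: the figure of 4 regions per host is inferred from the numbers above (160 regions / 40 hosts), the function names are made up, the phases are assumed not to overlap, and the total rack count in the example is hypothetical.

```python
# Illustrative only: these names are hypothetical, not part of any real deploy tool.
HOSTS_PER_RACK = 40    # "about 40 sim hosts" per rack, from the post
REGIONS_PER_HOST = 4   # assumption: inferred from 160 regions / 40 hosts


def plan_phases(total_racks):
    """Split racks into four phases: 1 rack, 3 racks, half of the rest, the rest."""
    phases = []
    remaining = total_racks
    # Small initial phases catch problems before most of the grid is touched.
    for size in (1, 3):
        take = min(size, remaining)
        phases.append(take)
        remaining -= take
    # The remainder is split across two mornings.
    half = remaining // 2
    phases.append(half)
    phases.append(remaining - half)
    return phases


def regions_in(racks):
    """Approximate number of regions hosted on the given number of racks."""
    return racks * HOSTS_PER_RACK * REGIONS_PER_HOST


if __name__ == "__main__":
    # 24 racks is a made-up grid size used purely for the example.
    for number, racks in enumerate(plan_phases(24), start=1):
        print(f"phase {number}: {racks} racks, ~{regions_in(racks)} regions")
```

Real racks vary ("give or take spares"), which is why the region counts reported in the updates do not match these round numbers exactly.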

During the central system updates we expect brief disruptions of some services (less than 2 minutes). For example, it may be necessary to re-join group chats, the reported “residents online” numbers may drop, and logins may not function briefly as the processes responsible for those services are stopped/started. These activities are partitioned by agent and group - for example, a particular resident may not be able to log in while one of the hosts is restarted, while other residents are able to log in. This usually takes less than a minute for each of 16 hosts.
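The partitioning described above can be pictured with a small sketch. This is an assumption-laden illustration, not the actual scheme used by the central systems: we assume each agent is mapped to one of the 16 hosts by hashing its ID, and the names `host_for` and `affected_agents` are invented for the example.

```python
import hashlib

NUM_HOSTS = 16  # "each of 16 hosts", from the post


def host_for(agent_id, num_hosts=NUM_HOSTS):
    """Map an agent ID to a host index (assumed hash partitioning)."""
    digest = hashlib.sha256(agent_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_hosts


def affected_agents(agent_ids, restarting_host):
    """Return only the agents whose assumed host is the one being restarted."""
    return [a for a in agent_ids if host_for(a) == restarting_host]


if __name__ == "__main__":
    # Made-up agent IDs for the example.
    agents = [f"agent-{n}" for n in range(1000)]
    blocked = affected_agents(agents, restarting_host=0)
    print(f"{len(blocked)} of {len(agents)} agents wait while host 0 restarts")
```

Under a scheme like this, restarting one host at a time briefly blocks only the residents assigned to that host, while everyone else can still log in.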

1.21 Rollout History

The 1.21 Server deploy was initially scheduled for April 16th/17th. During the rollout, some problems were encountered which caused us to roll back, review the code, make some fixes, and proceed cautiously. In detail:

  • a component swap intended to reduce the disruption caused by central updates exhibited poor performance in production; this was reverted in favor of a different architectural change that is already planned but waiting for a future update
  • a data migration intended to alleviate database load during login and other actions turned out to have a subtle bug that required reversion; after further review, the updated code is going out in a multi-phase approach starting with 1.21.
  • during an initial “3-rack roll”, several services on the targeted hosts were not started correctly; further investigation could not determine if this was due to a momentary network glitch, database hiccup, problem with the deploy tools, or a problem with the code that was not caught during testing. Subsequent testing was unable to reproduce the problem. Since this is easily detected and recovered from, we’re proceeding cautiously. So far, no issues have been seen.

[Initial Post Details from April 22nd]

We’re ready to initiate the update of the Second Life servers to the 1.21 version of the code.

The deploy is going to happen in several phases:

  • Tuesday 4/22 @ 1pm - deploy to 1 rack - [DONE]
  • Tuesday 4/22 @ 7pm (originally 5pm) - deploy to 3 racks
    (delayed a bit due to a wrinkle discovered during the previous step)
  • Wednesday 4/23 @ 6am - deploy to 10 racks
  • Wednesday 4/23 - update central systems throughout the day
  • Thursday 4/24 @ 5am-11am - deploy to half of remaining servers
  • Friday 4/25 @ 5am-11am - deploy to remaining servers

Should problems be encountered with the 1.21 rollout, we will likely proceed with deploying a subset of the changes focused on physics-related fixes, as we did with last week’s 1.20 patch rollout.

[Update Tuesday April 22nd @ 7:55pm]

After an initial re-rollout to 1 rack, reports came in of attachment failures. The rack is currently being reverted to the previous (1.20.1) simulator version. After some quick tests, we believe we’ve narrowed down the changes responsible (tests for rez permission appear to be checking a remote parcel incorrectly), but a fix is unlikely until tomorrow at the earliest. The issue also affects the “backup plan” for a smaller patch deploy mentioned at the end of the post (which hints at the source of the bug).

[Sorry for making updates at both the top and bottom of this post; I want it to remain understandable for residents who are reading it for the first time, yet retain the history of the post for later comments to remain sensible. -- Joshua Linden]