Here’s what happened.
At about 4am PDT, a defect in the Azure AD tenant provisioning system updated a DNS CNAME record to point to the Access Control 2.0 service front ends in the North Central US region instead of to our global load balancer. Because the North Central US region services ACS namespaces and does not service Azure AD tenants, authentication requests for Azure AD tenants worldwide started to fail with 404 “Not found”. Due to a separate defect in our synthetic monitoring system, our synthetic monitors continued to successfully probe Azure AD because they had cached connections to the correct datacenters that service Azure AD tenants.
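The monitoring gap described above can be illustrated with a minimal sketch of a probe that does not depend on long-lived connections. This is not the monitoring system used by the service; the hostnames, the `cname_target_ok` and `probe_once` helpers, and the expected suffix are all hypothetical. The idea is to check that the canonical name behind a host still matches the expected load balancer, and to open a fresh connection for each HTTP probe so that a stale, already-open connection cannot mask a bad DNS record. Note that `socket.gethostbyname_ex` goes through the operating system resolver, which may itself cache answers for up to the record's TTL.

```python
import http.client
import socket
from typing import Callable, List, Tuple

# Signature of socket.gethostbyname_ex: (canonical name, aliases, addresses).
Resolver = Callable[[str], Tuple[str, List[str], List[str]]]


def cname_target_ok(hostname: str, expected_suffix: str,
                    resolve: Resolver = socket.gethostbyname_ex) -> bool:
    """Return True if the canonical name for hostname is expected_suffix
    or a subdomain of it. The resolver is injectable for testing."""
    canonical, _aliases, _addresses = resolve(hostname)
    canonical = canonical.rstrip(".").lower()
    suffix = expected_suffix.rstrip(".").lower()
    return canonical == suffix or canonical.endswith("." + suffix)


def probe_once(host: str, path: str = "/", timeout: float = 5.0) -> int:
    """Issue one GET over a brand-new HTTPS connection and return the status.
    Closing the connection each time forces the next probe to resolve and
    connect again instead of reusing a cached connection."""
    conn = http.client.HTTPSConnection(host, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status
    finally:
        conn.close()
```

A monitor built this way would raise an alert when `cname_target_ok` returns `False` or when `probe_once` returns a status such as 404, rather than relying only on requests sent over connections opened before the DNS change.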
At about 6am PDT we received the first reports of a problem and started our investigation, and by about 9am PDT we had corrected the CNAME entry and recovery started. The Time To Live (TTL) on the CNAME record in question is 60 minutes, so recovery took about 60 minutes as DNS resolvers around the world refreshed their caches. For the same reason, customers experienced the start of the problem at different times, as their DNS resolvers refreshed their caches and picked up the erroneous CNAME record.
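The staggered onset and recovery follow from simple TTL arithmetic, sketched below under the assumption that every resolver honors the 60-minute TTL exactly and refetches the record only when its cached copy expires. The function name and parameters are illustrative, not part of any real tool.

```python
def minutes_until_refresh(cache_age_minutes: int, ttl_minutes: int = 60) -> int:
    """Minutes a resolver keeps serving its cached record after a DNS change,
    given how old the cached copy was at the moment of the change."""
    if not 0 <= cache_age_minutes < ttl_minutes:
        raise ValueError("cache age must be within [0, ttl)")
    return ttl_minutes - cache_age_minutes


# A resolver that cached the bad record just before the fix keeps it
# for the full TTL; one whose copy was about to expire recovers quickly.
for age in (0, 30, 59):
    print(f"cached {age:2d} min before fix -> recovers after "
          f"{minutes_until_refresh(age)} min")
```

Under these assumptions the last resolver recovers at most one TTL after the fix, which matches the roughly 60-minute recovery window, and the same calculation explains why the failure appeared at different times for different customers.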
We have since taken action to mitigate the defect in the tenant provisioning system and have started work to update our synthetic monitoring system. During the incident there was no impact to Access Control 2.0 service availability; only Azure Active Directory tenants were affected. Affected scenarios included sign-on to web applications using WS-Federation and SAML 2.0, requesting access tokens for calling the Graph API using OAuth 2.0, and certain Office 365 server-to-server functionality that uses OAuth 2.0. Some clients using OAuth 2.0 were unaffected because they held cached access tokens with 12-hour lifetimes that were obtained before the incident and did not need to be refreshed during it.
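Why some OAuth 2.0 clients rode through the outage can be shown with a small sketch of a client-side token cache. The `CachedToken` class, the `survives_outage` helper, and the dates in the usage example are hypothetical; only the 12-hour lifetime comes from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class CachedToken:
    value: str
    issued_at: datetime
    lifetime: timedelta = timedelta(hours=12)  # lifetime stated in the report

    def expires_at(self) -> datetime:
        return self.issued_at + self.lifetime

    def is_valid(self, now: datetime) -> bool:
        """True while the token has not yet expired."""
        return now < self.expires_at()


def survives_outage(token: CachedToken,
                    outage_start: datetime,
                    outage_end: datetime) -> bool:
    """True if the token was obtained before the outage and stays valid
    until it ends, so the client never has to request a new one."""
    return token.issued_at < outage_start and token.is_valid(outage_end)
```

For example, with a placeholder outage window of 04:00 to 10:00, a token issued at 03:00 expires at 15:00 and survives, while a token issued at 21:00 the previous evening expires at 09:00 and forces a refresh that would have failed during the incident.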
The incident was unrelated to the recent changes in Azure AD to update the claim set in security tokens and update the names of the service endpoints.
This service is in Preview, and during this period we are continually testing it so that we can identify and fix issues early, before the service reaches general availability. We apologize for any inconvenience this has caused our customers.