[Micronet] Friday Calnet and www.berkeley.edu outages?

[Micronet] Friday Calnet and www.berkeley.edu outages?

Christopher Brooks
Any word on the root cause of the CalNet and www.berkeley.edu outages on Friday evening and Saturday?

http://systemstatus.berkeley.edu/ (below) mentions a fire.

On Friday evening, it seemed like things were a bit quiet, so I sent mail to my eecs.berkeley.edu account, which worked.

Is there a plan to add some redundancy to some of these systems? 

We were fortunate that it happened over the weekend.

Many thanks to the people who put in time over the weekend bringing this back up.

_Christopher

Outage Type: UNSCHEDULED OUTAGE
Date Submitted: Friday, September 18 – Monday, September 21, 2015
Outage Start/End Time: 1930 – TBD
Equipment: Campus Network

Description: Monday, 09/21/2015 1040:

The Service Desk is receiving reports of some residual connection problems:

  • Campus DHCP registration is unavailable

Sunday, 09/20/2015 1330:

  • ETS – AWS intermittent issues with LMS have been resolved.
  • SAIT – is reporting all systems are operational. Financial aid completed, still working on Drop Date Deadline change.  Smaller issues will be addressed via normal channels.
  • DARS – vendor will be on-site late this afternoon to resolve remaining issues.
  • QA/Dev environments have been prioritized by High/Low and date.  This will be used once the team is ready to begin turning these back on.  The start of this work is still dependent on the final fix for the Dell / Brocade issues on Chassis 20.
  • Communication to Leadership was sent this morning from Lyle. A final all-campus communication will be sent once approved by Lyle.
  • This will be the final update on this recovery effort. Any issues surfacing at this point should be reported via a ticket and normal reporting processes.

Sunday, 09/20/2015 0900:

  • bCourses is experiencing an unrelated vendor outage due to AWS issues. All users are experiencing intermittent login issues. Details of this outage can be found on the Instructure Canvas System Status page: http://status.instructure.com/
  • DARS – is still experiencing issues, team is actively working with vendor to resolve today.
  • EMS Grad2, and Summer Sessions, are up and working.
  • LMS – as of 7:00 a.m. there have been intermittent issues.  Team is actively working with the vendor to resolve them.  This is unrelated to the Data Center issues.  Team will communicate directly with users on the status of this.
  • Dev/QA environments – ETA to begin restoring is sometime this afternoon
    • There is a dependency on a couple of hardware issues with switches that must be resolved prior to bringing the remaining VMs live.
    • Teams should start prioritizing environments and identifying what can wait when we do go live.
    • Karen Kato will start a Google sheet to track these priorities.  Each lead needs to input this info.
  • All Clear was called at this morning's check-in – teams will continue working through normal channels to resolve minor issues.

Next Status will be after a 1:30pm status call.

Saturday, 09/19/2015 1730:

  • All production databases are up and running – any additional issues that are found should have a ticket opened and will be dealt with as soon as possible.
  • Phones/voicemail, as reported earlier, are restored.
  • Applications – some systems are continuing to run batch jobs and will be opened up tomorrow for user/functional testing.
  • Ironweed server continues to be worked on by IT and the service provider. All indications are that it should be OK and restored over the next few days.
  • An extended catch-up process will take place for jobs that should have run in the prior 24 hours. Targeting an “all clear” on these by tomorrow afternoon.
  • Student systems are all in pretty good shape. Once all partners have been contacted and have verified their individual tools, we will be able to consider this clear.
    • DARS in particular has a team actively working to ensure successful return to service.
    • EMS – consortium of multiple areas. Rec Sports is the main service provider and will be contacted for current status.
    • Grad Dept and Disabled Student outreach will be done this evening to verify how things look on their side.
  • All clear will be posted on the Berkeley news site as well as a campus wide email. ETA – tomorrow.
  • Service leads are actively reaching out to their business partners to verify status.

This will be the last update for tonight. Next update will be posted Sunday morning.

Saturday 09/19/2015 16:53: All applications identified as critical have been restored.

  • All phone and voicemail issues caused by the outage have been resolved
  • Network is fully operational
  • Some non-critical blades, chassis and servers are still in progress.
  • Applications – Bairs, BFS, CalAnswers, CalPlanning are all confirmed up.
    • BFS –  running batch jobs
    • HCM – running batch jobs
    • Caltime – available to users, but problem with some HTML servers
  • Please note applications are all up, but batch jobs will need to complete before the applications are opened to functional partners for validation/testing.
  • Additional info from SAIT is being gathered and will be in the next update.
  • QA and dev systems will be deferred until tomorrow.

Saturday 09/19/2015 13:30:

  • CalMessages is now up and available
  • GoAnywhere – in progress
  • Smaller issues with infrastructure hardware will be addressed as soon as critical applications are restored
  • Pharmacy database, Footprints, GoAnywhere.
  • Citrix and VMware are up

Saturday 09/19/2015 12:11

  • Progress continues to be made to bring applications back online
  • www.berkeley.edu is now available
  • CalMail lists and all legacy CalMail services are operating normally. Some messages that were sent during the outage were kept in a queue and have now been delivered.

Saturday 09/19/2015 11:18:

  • CalNet is up and running.  Access to Google, Box, bCourses, ServiceNow and other cloud-hosted systems has been restored
  • VPN services are now up and available
  • Databases and Webfarm are being brought up to enable www.berkeley.edu (11:30 ETA)
  • Work has started to restore service to campus phones that were affected by the outage

Saturday 09/19/2015 10:28:

  • Still working to get CalNet up and running
  • After CalNet is up, database team will be bringing databases up systematically
  • Once databases are up, applications can begin to be brought online
  • www.berkeley.edu will be brought back up as soon as possible to aid in campuswide communications
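
The restore sequence described in these updates (CalNet first, then databases, then applications and the web site) is a dependency ordering. A minimal sketch of how such an order could be computed, assuming a hypothetical dependency map; the service labels and the `restore_order` helper are illustrative and are not taken from any campus tooling:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service lists what must be up first.
# The chain loosely mirrors the status updates; the labels are assumptions.
DEPENDS_ON = {
    "power": [],
    "network": ["power"],
    "calnet": ["network"],
    "databases": ["calnet"],
    "applications": ["databases"],
    "www.berkeley.edu": ["databases"],
}

def restore_order(depends_on):
    """Return services in an order where every dependency comes first."""
    return list(TopologicalSorter(depends_on).static_order())

if __name__ == "__main__":
    for step, service in enumerate(restore_order(DEPENDS_ON), start=1):
        print(f"{step}. {service}")
```

Services with no dependency between them (here, applications and the web site) may come out in either order; only the "dependency first" property holds.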

Saturday 09/19/2015 09:24: The following progress has been made:

  • Power has been restored
  • Management Systems are up.
  • Network is up and running
  • Working to get CalNet up and running
  • Wifi still unavailable due to database dependency
  • Database systems being brought online in a very methodical way
  • Storage systems still dependent on other systems

Next update will be shortly after 10:00 a.m.

Saturday 09/19/2015 07:45: We are currently restoring systems and bringing applications back online as they become available.

Please rest assured your email is still working.  Even if you are unable to access your mail at this time, it will be there waiting for you when CalNet authentication is restored. Your phone and external email clients can still be used to access your account.

Once our systems are available, instructors are being asked to provide students with appropriate accommodations for possible missed assignments or other issues related to the outage.

Friday 09/18/2015 21:44:  Most IT systems are currently down due to an overheating issue in the data center. Any system requiring CalNet authentication is also unavailable. We are working quickly to assess the impact of this event. The assessment is expected to begin at 7:00 a.m. on Saturday 9/19. Once our assessment is complete we will post additional information on the restoration of all systems.

Friday 09/18/2015 21:18: There was a fire in the data center. Extent is not known at this time. Steve Aguirre is on-site and actively working on the issue(s). ETA unknown at this time.

The CSS-IT Service Desk is receiving reports that internet connectivity has been interrupted for multiple applications, users, and locations across campus.

IST is working to identify the root cause and resolve the issue.

No ETA is available at this time.

CMR: 4078


-- 
Christopher Brooks, PMP                       University of California
Academic Program Manager & Software Engineer  US Mail: 337 Cory Hall
CHESS/iCyPhy/Ptolemy/TerraSwarm               Berkeley, CA 94720-1774
[hidden email], 707.332.0670           (Office: 545Q Cory)

 
-------------------------------------------------------------------------
The following was automatically added to this message by the list server:

To learn more about Micronet, including how to subscribe to or unsubscribe from its mailing list and how to find out about upcoming meetings, please visit the Micronet Web site:

http://micronet.berkeley.edu

Messages you send to this mailing list are public and world-viewable, and the list's archives can be browsed and searched on the Internet.  This means these messages can be viewed by (among others) your bosses, prospective employers, and people who have known you in the past.

ANNOUNCEMENTS: To send announcements to the Micronet list, please use the [hidden email] list.

Re: [Micronet] Friday Calnet and www.berkeley.edu outages?

Graham Patterson

I'd wait for the analysis of the event, which is under way.

bMail was working fine if you used a device set up with a Google Key,
though not everyone is set up that way, and those who aren't might not
be reachable.

I am glad the LHS museum is closed this month for renovations, as we
would have had difficulties with some of our data inaccessible.


Graham

On 9/21/15 11:55 AM, Christopher Brooks wrote:

> Any word on the root cause of the CalNet and www.berkeley.edu outages on
> Friday evening and Saturday?
>
> http://systemstatus.berkeley.edu/ (below) mentions a fire.
>
> On Friday evening, it seemed like things were a bit quiet, so I sent
> mail to my eecs.berkeley.edu account, which worked.
>
> Is there a plan to add some redundancy to some of these systems?
>
> We were fortunate that it happened over the weekend.
>
> Many thanks to the people who put in time over the weekend bringing this
> back up.
>
> _Christopher
>


--
Graham Patterson, Systems Administrator
Rm 111, Lawrence Hall of Science, UC Berkeley   510-643-1984
"...past the iguana, the tyrannosaurus, the mastodon, the mathematical
puzzles, and the meteorite..." - used to be the directions to my office.

 

Re: [Micronet] Friday [Sept. 18] Calnet and www.berkeley.edu outages?

Christopher Brooks
In reply to this post by Christopher Brooks
Since we are discussing the data center, was there ever any word on the cause of the outage in September?

I looked at http://ucbsystems.org/2015/09/ and there was nothing about it. 

http://technology.berkeley.edu/news/campus-it-systems-and-applications-fully-restored states

As is our normal protocol, we will be conducting a full post mortem first thing Monday morning to review the incident and our practices and procedures for emergency response to situations like this.

Are there lessons learned about how we can avoid this problem in the future?

Are there plans to add geographical redundancy to CalNet?

Something may have appeared in the various campus-wide spam^H^H^H^H mailing lists, but I filter those out. :-)

_Christopher

On 9/21/15 11:55 AM, Christopher Brooks wrote:
Any word on the root cause of the CalNet and www.berkeley.edu outages on Friday evening and Saturday?

http://systemstatus.berkeley.edu/ (below) mentions a fire.

On Friday evening, it seemed like things were a bit quiet, so I sent mail to my eecs.berkeley.edu account, which worked.

Is there a plan to add some redundancy to some of these systems? 

We were fortunate that it happened over the weekend.

Many thanks to the people who put in time over the weekend bringing this back up.

_Christopher

-- 
Christopher Brooks, PMP                       University of California
Academic Program Manager & Software Engineer  US Mail: 337 Cory Hall
CHESS/iCyPhy/Ptolemy/TerraSwarm               Berkeley, CA 94720-1774
[hidden email], 707.332.0670           (Office: 545Q Cory)

 
-------------------------------------------------------------------------
The following was automatically added to this message by the list server:

To learn more about Micronet, including how to subscribe to or unsubscribe from its mailing list and how to find out about upcoming meetings, please visit the Micronet Web site:

http://micronet.berkeley.edu

Messages you send to this mailing list are public and world-viewable, and the list's archives can be browsed and searched on the Internet.  This means these messages can be viewed by (among others) your bosses, prospective employers, and people who have known you in the past.

ANNOUNCEMENTS: To send announcements to the Micronet list, please use the [hidden email] list.

Re: [Micronet] Friday [Sept. 18] Calnet and www.berkeley.edu outages?

Jack Shnell
Christopher,

The root cause of this DC shutdown was a fire caused by an exploding capacitor in a co-located server.  The fire suppression system was activated, which, as part of the protocol, also shuts down all power in the DC.

Amusingly enough, the culprit server belonged to EECS, along with several others of the same custom design.  Of course, the remainder were immediately removed from service.

Jack


On Mon, Dec 7, 2015 at 2:27 PM, Christopher Brooks <[hidden email]> wrote:
Since we are discussing the data center, was there ever any word on the cause of the outage in September?

I looked at http://ucbsystems.org/2015/09/ and there was nothing about it. 

http://technology.berkeley.edu/news/campus-it-systems-and-applications-fully-restored states

As is our normal protocol, we will be conducting a full post mortem first thing Monday morning to review the incident and our practices and procedures for emergency response to situations like this.

Are there lessons learned about how we can avoid this problem in the future?

Are there plans to add geographical redundancy to CalNet?

Something may have appeared in the various campus-wide spam^H^H^H^H mailing lists, but I filter those out. :-)

_Christopher




 

Re: [Micronet] Friday [Sept. 18] Calnet and www.berkeley.edu outages?

Christopher Brooks
Writing as someone from EECS, it figures that an EECS server would have a component-level failure.

What about geographical redundancy for CalNet authentication?  Our business continuity plans presumably handle this somehow, but what is the expected recovery time?

We were lucky that the outage happened late on Friday, which gave a couple of days to bring things back up.

_Christopher





-- 
Christopher Brooks, PMP                       University of California
Academic Program Manager & Software Engineer  US Mail: 337 Cory Hall
CHESS/iCyPhy/Ptolemy/TerraSwarm               Berkeley, CA 94720-1774
[hidden email], 707.332.0670           (Office: 545Q Cory)

 

Re: [Micronet] Friday [Sept. 18] Calnet and www.berkeley.edu outages?

Jack Shnell
I've been told it was a management call during the incident to wait for CalNet to come back up at UCB rather than fail over to our DR site at SDSC, since the DNS change required for this application would have taken just as long anyway.  However, I believe this is a question perhaps better answered by the Platform Services Manager, Joey Curtis, or Dave Browne, the Director of Information Services, who were both directly involved in the initial recovery effort that Friday night.

I do know that, because this was a worst-case scenario in some respects for a DC outage, it contributed to the current, significant expansion of IST resources committed to the continuing improvement of our DR capabilities.
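
For readers who want to reason about the wait-versus-fail-over trade-off described above, here is a minimal sketch in Python. It is not IST's decision process. The class, the function names, and every figure are hypothetical; the only idea it encodes is that clients keep using cached DNS records until their TTL expires, so a DNS-based failover cannot complete faster than the change time plus the TTL.

```python
from dataclasses import dataclass


@dataclass
class FailoverEstimate:
    """Hypothetical planning inputs, in minutes. None of these come from the thread."""
    dns_change_minutes: float   # time to approve and publish the new DNS record
    dns_ttl_minutes: float      # worst-case time for cached records to expire
    dr_warmup_minutes: float    # time to bring the DR instance into service


def failover_minutes(est: FailoverEstimate) -> float:
    """Worst-case time until clients reach the DR site.

    Assumes DR warm-up runs in parallel with the DNS change, and that cached
    records only start to expire once the new record has been published.
    """
    dns_path = est.dns_change_minutes + est.dns_ttl_minutes
    return max(dns_path, est.dr_warmup_minutes)


def should_fail_over(est: FailoverEstimate, local_restore_minutes: float) -> bool:
    """Fail over only if it is expected to beat restoring service in place."""
    return failover_minutes(est) < local_restore_minutes
```

With a long TTL the DNS path dominates, which is one way a failover can end up taking about as long as waiting for the primary site; shortening the TTL ahead of time is the usual lever for changing that outcome.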

  

On Tue, Dec 8, 2015 at 7:23 AM, Christopher Brooks <[hidden email]> wrote:
Writing as someone from EECS, it figures that an EECS server would have a component level failure. 

What about geographical redundancy for CalNet authentication?  Our business continuity plans presumably handle this some how, but what is the expected recovery time?

We were lucky that the outage happened late on Friday, which gave a couple of days to bring things back up.

_Christopher



On 12/7/15 2:50 PM, Jack M. SHNELL wrote:
Christopher,

The root cause of this DC shutdown was a fire caused by an exploding capacitor in a co-located server.  The fire suppression system was activated, which as part of the protocol also shuts down all power in the DC.

Amusingly enough, the culprit server belonged to EECS, along with several others of the same custom design.  Of course the remainder were immediately removed from service.

Jack


On Mon, Dec 7, 2015 at 2:27 PM, Christopher Brooks <[hidden email][hidden email]> wrote:
Since we are discussing the data center, was there ever any word on the cause of the outage in September?

I looked at http://ucbsystems.org/2015/09/ and there was nothing about it. 

http://technology.berkeley.edu/news/campus-it-systems-and-applications-fully-restored states

As is our normal protocol, we will be conducting a full post mortem first thing Monday morning to review the incident and our practices and procedures for emergency response to situations like this.

Are there lessons learned about how we can avoid this problem in the future?

Are there plans to add geographical redundancy to CalNet?

Something may have appeared in the various campus-wide spam^H^H^H^H mailing lists, but I filter those out. :-)

_Christopher

On 9/21/15 11:55 AM, Christopher Brooks wrote:
Any word on the root cause of the CalNet and www.berkeley.edu outages on Friday evening and Saturday?

http://systemstatus.berkeley.edu/ (below) mentions a fire.

On Friday evening, it seemed like things were a bit quiet, so I sent my eecs.berkeley.edu account, which worked.

Is there a plan to add some redundancy to some of these systems? 

We were fortunate that it happened over the weekend.

Many thanks to the people who put in time over the weekend bringing this back up.

_Christopher

Outage Type: UNSCHEDULED OUTAGE
Date Submitted: Friday, September 18, – Monday, 21, 2015
Outage Start/End Time: 1930 – TBD
Equipment: Campus Network

Description: Monday, 09/20/2015 1040:

The Service Desk is receiving reports of some residual connection problems:

  • Campus DHCP registration is unavailable

Sunday, 09/20/2015 1330:

  • ETS – AWS intermittent issues with LMS have been resolved.
  • SAIT – is reporting all systems are operational. Financial aid completed, still working on Drop Date Deadline change.  Smaller issues will be addressed via normal channels.
  • DARS – vendor will be on-site late this afternoon to resolve remaining issues.
  • QA/Dev environments have been prioritized by High/Low and date.  This will be used once the team is ready to begin turning these back on.  The start of this work is still dependent on the final fix for the  Dell / Brocade issues on Chassis 20
  • Communication to Leadership was sent this morning  from Lyle. A final all campus communication will be sent once approved by Lyle.
  • This will be the final update on this recovery effort. Any issues surfacing at this point should be reported via a ticket and normal reporting processes.

Sunday, 09/20/2015 0900:

  • bCourses is experiencing an unrelated vendor outage due to AWS issues. All users are experiencing intermittent login issues. Details of this outage can be found on the Instructure Canvas System Status page.
  • DARS – is still experiencing issues, team is actively working with vendor to resolve today.
  • EMS Grad2, and Summer Sessions, are up and working.
  • LMS – as of 7:00 a.m there have been intermittent issues.  Team is actively working with Vendor to solve.  This is unrelated to the Data Center issues.  Team will communicate directly with users on the status of this.
  • Dev/QA environments – ETA to begin restoring is sometime this afternoon
    • there is a dependency on a couple of hardware issues with switches that must be resolved prior to bringing the remaining VM’s live.
    • Teams should start prioritizing environments and identifying what can wait when we do go live.
    • Karen Kato will start a Google sheet to track these priorities.  Each lead need to input this info.
  • All Clear was called at this mornings check in – teams will continue working through normal channels to resolve minor issues.

Next Status will be after a 1:30pm status call.

Saturday, 09/19/2015 1730:

  • All production databases are up and running – any additional issues that are found should have a ticket opened and will be dealt with as soon as possible.
  • Phones/voicemail, as reported earlier, are restored.
  • Applications – some systems are continuing to run batch jobs and will be opened up by tomorrow for user/functional testing tomorrow.
  • Ironweed server continues to be worked by IT and the service provider. All indications point to it should be ok and restored over the next few days.
  • Currently an extended catch up process will take place on jobs that should have run in the prior 24 hours. Targeting an “all clear” on these by tomorrow afternoon.
  • Student systems are all in pretty good shape. Once all partners have been contacted and verify their individual tools have been verified we will be able to consider this clear.
    • Dars in particular has a team actively working to ensure successful return to service.
    • EMS – consortium of multiple areas. Rec Sports is the main service provider and will be contacted for current status.
    • Grad Dept and Disabled Student outreach will be done this evening to verify how things look on their side.
  • All clear will be posted on the Berkeley news site as well as a campus wide email. ETA – tomorrow.
  • Service leads are actively reaching out to their business partners to verify status.

This will be the last update for tonight. Next update will be posted Sunday morning.

Saturday 09/19/2015 16:53: All applications identified as critical have been restored.

  • All phone and voicemail issues caused by the outage have been resolved
  • Network is fully operational
  • Some non-critical blades, chassis and servers are still in progress.
  • Applications – Bairs, BFS, CalAnswers, CalPlanning are all confirmed up.
    • BFS –  running batch jobs
    • HCM – running batch jobs
    • Caltime – available to users, but problem with some HTML servers
  • Please note applications are all up, but batch jobs will need to complete before the applications are opened to functional partners for validation/testing
  • Additional info from SAIT is being gathered and will be in the next update.
  • QA and dev systems will be deferred until tomorrow.

Saturday 09/19/2015 13:30:

  • CalMessages is now up and available
  • GoAnywhere – in progress
  • Smaller issues with infrastructure hardware will be addressed as soon as critical applications are restored
  • Pharmacy database, Footprints, GoAnywhere.
  • Citrix and VMWare are up

Saturday 09/19/2015 12:11

  • Progress continues to be made to bring applications back online
  • www.berkeley.edu is now available
  • CalMail lists and all legacy CalMail services are operating normally. Some messages that were sent during the outage were kept in a queue and have now been delivered.

Saturday 09/19/2015 11:18:

  • CalNet is up and running.  Access to Google, Box, bCourses, Service Now and other cloud hosted systems has been restored
  • VPN services are now up and available
  • Databases and Webfarm are being brought up to enable www.berkeley.edu (11:30 ETA)
  • Work has started to restore service to campus phones that were affected by the outage

Saturday 09/19/2015 10:28:

  • Still working to get CalNet up and running
  • After CalNet is up, database team will be bringing databases up systematically
  • Once databases are up, applications can begin to be brought online
  • www.berkeley.edu will be brought back up as soon as possible to aid in campuswide communications
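
The restoration order in these updates (CalNet first, then databases, then applications) amounts to a dependency ordering. A minimal sketch in Python of how such an order could be derived, assuming a simplified and partly hypothetical dependency map; the real dependency graph is not given in the status log:

```python
from graphlib import TopologicalSorter

# Hypothetical, simplified dependency map (an assumption for illustration):
# each service maps to the set of services that must be up before it starts.
DEPENDENCIES = {
    "power": set(),
    "network": {"power"},
    "calnet": {"network"},
    "databases": {"calnet"},
    "applications": {"databases"},
    "wifi": {"databases"},
    "www.berkeley.edu": {"databases"},
}


def restoration_order(deps):
    """Return one valid start-up order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    for step, service in enumerate(restoration_order(DEPENDENCIES), start=1):
        print(step, service)
```

Services with no ordering constraint between them (here wifi and applications) may come out in either order; only the stated dependencies are guaranteed.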

Saturday 09/19/2015 09:24: The following progress has been made:

  • Power has been restored
  • Management Systems are up.
  • Network is up and running
  • Working to get CalNet up and running
  • Wifi still unavailable due to database dependency
  • Database systems being brought online in a very methodical way
  • Storage systems still dependent on other systems

Next update will be shortly after 10:00 am.

Saturday 09/19/2015 07:45: We are currently restoring systems and bringing applications back online as they become available.

Please rest assured your email is still working.  Even if you are unable to access your mail at this time, it will be there waiting for you when CalNet authentication is restored. Your phone and external email clients can still be used to access your account.

Once our systems are available, instructors are being asked to provide students with appropriate accommodations for possible missed assignments or other issues related to the outage.

Friday 09/18/2015 21:44:  Most IT systems are currently down due to an overheating issue in the data center. Any system requiring CalNet authentication is also unavailable. We are working quickly to assess the impact of this event. The assessment is expected to begin at 7:00 a.m. on Saturday 9/19. Once our assessment is complete we will post additional information on the restoration of all systems.

Friday 09/18/2015 21:18: There was a fire in the data center. Extent is not known at this time. Steve Aguirre is on-site and actively working on the issue(s). ETA unknown at this time.

The CSS-IT Service Desk is receiving reports that internet connectivity has been interrupted for multiple applications, users, and locations across campus.

IST is working to identify the root cause and resolve the issue.

No ETA is available at this time.

CMR: 4078


-- 
Christopher Brooks, PMP                       University of California
Academic Program Manager & Software Engineer  US Mail: 337 Cory Hall
CHESS/iCyPhy/Ptolemy/TerraSwarm               Berkeley, CA 94720-1774
[hidden email], 707.332.0670           (Office: 545Q Cory)



-------------------------------------------------------------------------
The following was automatically added to this message by the list server:

To learn more about Micronet, including how to subscribe to or unsubscribe from its mailing list and how to find out about upcoming meetings, please visit the Micronet Web site:

http://micronet.berkeley.edu

Messages you send to this mailing list are public and world-viewable, and the list's archives can be browsed and searched on the Internet.  This means these messages can be viewed by (among others) your bosses, prospective employers, and people who have known you in the past.

ANNOUNCEMENTS: To send announcements to the Micronet list, please use the [hidden email] list.




Re: [Micronet] Friday [Sept. 18] Calnet and www.berkeley.edu outages?

Isaac Orr
Jack's basically correct here.  In fact, the cutover via DNS to SDSC
may have taken longer than the restoration of EWH because network
systems that are required to make DNS changes were also impacted by
the outage.

Today our SDSC DR facility is really engineered around providing
continuity in the event of a catastrophic event that will leave the
Berkeley Data Center out of service for longer than 48 hours.

One of the biggest issues with bringing systems up in alternate
locations has, in the past, been our DNS management infrastructure.
DNS changes required several hours to take effect, and the system used
to make those changes was aging and increasingly brittle.

Prior to the September Data Center problem, the networking group was
working on replacing that infrastructure, and we completed this in
October.  We're now in a much better state than we were at the time of
the data center fire.  If a similar issue were to occur today, we
would still have full control over our DNS infrastructure, and the
ability to make DNS changes to direct traffic to systems outside the
datacenter where available, with relatively little time needed to make
those changes.
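
The trade-off described here (fail over via DNS only if the cutover would finish before the primary site is restored) can be sketched as simple arithmetic. This is an illustrative sketch, not the process IST actually uses; the function names are hypothetical and the numbers are made up rather than taken from the incident:

```python
from datetime import timedelta


def dns_cutover_time(change_lead_time, record_ttl):
    """Worst-case time until clients reach the alternate site.

    change_lead_time: time needed to get the DNS change published.
    record_ttl: how long resolvers may keep serving the old cached answer.
    """
    return change_lead_time + record_ttl


def should_fail_over(change_lead_time, record_ttl, estimated_local_restore):
    """Fail over only if the DNS cutover would finish before local restoration."""
    return dns_cutover_time(change_lead_time, record_ttl) < estimated_local_restore


# Illustrative numbers only; none of these figures come from the incident.
slow_dns = should_fail_over(
    timedelta(hours=4), timedelta(hours=1), timedelta(hours=3)
)
fast_dns = should_fail_over(
    timedelta(minutes=10), timedelta(minutes=5), timedelta(hours=3)
)
print(slow_dns, fast_dns)
```

With a slow change process the cutover loses to local restoration, while a fast one makes failover worthwhile for the same outage.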

There's another infrastructure change that needs to be made to enable
services like CalNet to have geographic diversity.  Our existing load
balancer service is not really up to meeting this type of need.  We
had also identified that problem prior to the fire, and had a project
underway to improve the service.  We expect that to be completed early
in the new year.

Once that work is done, critical systems like CalNet should be able to
take advantage of the new infrastructure to provide improved
resiliency.  There are still significant issues around application
architecture when you start looking at running with geographic
diversity, but I believe that the CalNet folks already have that side
of things under control.

iso


On Tue, Dec 8, 2015 at 12:21 PM, Jack M. SHNELL <[hidden email]> wrote:

> I've been told it was a management call during the incident to wait for
> CalNet to come back up at UCB rather than fail over to our DR site at SDSC,
> since the DNS change required for this application would have taken just as
> long anyway.  However, I believe this is a question perhaps better answered
> by the Platform Services Manager, Joey Curtis, or Dave Browne, the Director
> of Information Services, who were both directly involved in the initial
> recovery effort that Friday night.
>
> I do know that, because this was a worst-case scenario in some respects for
> a DC outage, it contributed to the current, significant expansion of IST
> resources committed to the continuing improvement of our DR capabilities.
>
>
>
> On Tue, Dec 8, 2015 at 7:23 AM, Christopher Brooks <[hidden email]>
> wrote:
>>
>> Writing as someone from EECS, it figures that an EECS server would have a
>> component level failure.
>>
>> What about geographical redundancy for CalNet authentication?  Our
>> business continuity plans presumably handle this somehow, but what is the
>> expected recovery time?
>>
>> We were lucky that the outage happened late on Friday, which gave a couple
>> of days to bring things back up.
>>
>> _Christopher
>>
>>
>>
>> On 12/7/15 2:50 PM, Jack M. SHNELL wrote:
>>
>> Christopher,
>>
>> The root cause of this DC shutdown was a fire caused by an exploding
>> capacitor in a co-located server.  The fire suppression system was
>> activated, which as part of the protocol also shuts down all power in the
>> DC.
>>
>> Amusingly enough, the culprit server belonged to EECS, along with several
>> others of the same custom design.  Of course the remainder were immediately
>> removed from service.
>>
>> Jack
>>
>>
>> On Mon, Dec 7, 2015 at 2:27 PM, Christopher Brooks <[hidden email]>
>> wrote:
>>>
>>> Since we are discussing the data center, was there ever any word on the
>>> cause of the outage in September?
>>>
>>> I looked at http://ucbsystems.org/2015/09/ and there was nothing about
>>> it.
>>>
>>>
>>> http://technology.berkeley.edu/news/campus-it-systems-and-applications-fully-restored
>>> states
>>>
>>> As is our normal protocol, we will be conducting a full post mortem first
>>> thing Monday morning to review the incident and our practices and procedures
>>> for emergency response to situations like this.
>>>
>>>
>>> Are there lessons learned about how we can avoid this problem in the
>>> future?
>>>
>>> Are there plans to add geographical redundancy to CalNet?
>>>
>>> Something may have appeared in the various campus-wide spam^H^H^H^H
>>> mailing lists, but I filter those out. :-)
>>>
>>> _Christopher
>>>
>>> On 9/21/15 11:55 AM, Christopher Brooks wrote:
>>>
>>> Any word on the root cause of the CalNet and www.berkeley.edu outages on
>>> Friday evening and Saturday?
>>>
>>> http://systemstatus.berkeley.edu/ (below) mentions a fire.
>>>
>>> On Friday evening, it seemed like things were a bit quiet, so I sent my
>>> eecs.berkeley.edu account, which worked.
>>>
>>> Is there a plan to add some redundancy to some of these systems?
>>>
>>> We were fortunate that it happened over the weekend.
>>>
>>> Many thanks to the people who put in time over the weekend bringing this
>>> back up.
>>>
>>> _Christopher



--
Isaac Simon Orr
Manager, Network Operations and Services
IST Telecommunications, UC Berkeley
P: +1 510 643 9837 C: +1 510 517 9408 E: [hidden email]

 

Re: [Micronet] Friday [Sept. 18] Calnet and www.berkeley.edu outages?

Jeremy Rosenberg
Thanks Isaac

I was going to add much of this to the conversation, but I’m out of the office with limited time today.  To help clarify, all of the pieces of the CalNet stack are up and running at the SDSC facility, and they were available during the outage.  Much of our infrastructure is configured to fail over automatically; for example, if CAS is unable to connect to the on-site LDAP or Kerberos servers, it will fail over to San Diego.  It has done so on rare occasions, and the campus would have been none the wiser since CAS authentication continued to work.
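
The automatic failover described above follows a common pattern: try the primary backend, and fall back to the next one if the connection fails. A minimal sketch in Python of that pattern, not CAS's actual implementation; the host names and the `fake_connect` helper are hypothetical and exist only for illustration:

```python
def first_available(endpoints, connect):
    """Try each endpoint in order; return (endpoint, connection) for the first success.

    endpoints: ordered list of hosts, primary first.
    connect: callable that returns a connection or raises OSError.
    """
    errors = []
    for endpoint in endpoints:
        try:
            return endpoint, connect(endpoint)
        except OSError as exc:
            # Remember the failure and move on to the next endpoint.
            errors.append((endpoint, exc))
    raise ConnectionError(f"all endpoints failed: {errors}")


# Hypothetical host names, used only for illustration.
LDAP_ENDPOINTS = ["ldap.onsite.example.edu", "ldap.sdsc.example.edu"]


def fake_connect(host):
    """Stand-in for a real client connection; simulates the on-site server being down."""
    if "onsite" in host:
        raise OSError("connection refused")
    return f"connected to {host}"


endpoint, conn = first_available(LDAP_ENDPOINTS, fake_connect)
print(endpoint)
```

This only helps when the caller itself is still running, which is why an outage of the CAS server needs a different redirect mechanism.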

But when the CAS server itself is unavailable, the only way to redirect traffic to the SDSC CAS instance is to either modify the DNS or modify the configuration of the application that is calling CAS (or Shibboleth).  We briefly considered modifying the configuration of the Google Apps integration to start using the SDSC instance of Shibboleth during that outage to get email service back sooner.  In the end, the infrastructure and CalNet teams moved quickly enough to get the primary CalNet stack back up that it was not worth the additional risks.

This is a good discussion to have, and it has inspired us to do a better job of publishing both our disaster recovery strategy and documentation on how application owners can be proactive with their own configurations to leverage our offsite instances.  The CalNet website is getting a major makeover, and you can expect to see this new information in the new year.

I would also like to reassure everyone that each time we experience any kind of degraded performance, CalNet does a full review and implements mitigation strategies to reduce the likelihood of a repeat event.

Jeremy


====================================================
Jeremy Rosenberg
Manager, CalNet Identity and Access Management 
UC Berkeley





On Dec 8, 2015, at 12:44 PM, Isaac Orr <[hidden email]> wrote:

Jack's basically correct here.  In fact, the cutover via DNS to SDSC
may have taken longer than the restoration of EWH because network
systems that are required to make DNS changes were also impacted by
the outage.

Today our SDSC DR facility is really engineered around providing
continuity in the even of a catastrophic event that will leave the
Berkeley Data Center out of service for longer than 48 hours.

One of the biggest issues with bringing systems up in alternate
locations has, in the past, been our DNS management infrastructure.
DNS changes required several hours to take effect, and the system used
to make those changes was aging and increasingly brittle.

Prior to the September Data Center problem, the networking group was
working on replacing that infrastructure, and we completed this in
October.  We're now in a much better state than we were at the time of
the data center fire.  If a similar issue were to occur today, we
would still have full control over our DNS infrastructure, and the
ability to make DNS changes to direct traffic to systems outside the
datacenter where available, with relatively little time needed to make
those changes.

There's another infrastructure change that needs to be made to enable
services like CalNet to have geographic diversity.  Our existing load
balancer service is not really up to meeting this type of need.  We
had also identified that problem prior to the fire, and had a project
underway to improve the service.  We expect that to be completed early
in the new year.

Once that work is done, critical systems like CalNet should be able to
take advantage of the new infrastructure to provide improved
resiliency.  There's still significant issues around application
architecture when you start looking at running with geographic
diversity, but I believe that the CalNet folks already have that side
of things under control.

iso


On Tue, Dec 8, 2015 at 12:21 PM, Jack M. SHNELL <[hidden email]> wrote:
I've been told it was a management call during the incident to wait for
CalNet to come back up at UCB rather than fail over to our DR site at SDSC,
since the DNS change required for this application would have taken just as
long anyway.  However, I believe this is a question perhaps better answered
by the Platform Services Manager, Joey Curtis, or Dave Browne, the Director
of Information Services, who were both directly involved in the initial
recovery effort that Friday night.

I do know that, because this was a worst-case scenario in some respects for
a DC outage, it contributed to the current, significant expansion of IST
resources committed to the continuing improvement of our DR capabilities.



On Tue, Dec 8, 2015 at 7:23 AM, Christopher Brooks <[hidden email]>
wrote:

Writing as someone from EECS, it figures that an EECS server would have a
component level failure.

What about geographical redundancy for CalNet authentication?  Our
business continuity plans presumably handle this some how, but what is the
expected recovery time?

We were lucky that the outage happened late on Friday, which gave a couple
of days to bring things back up.

_Christopher



On 12/7/15 2:50 PM, Jack M. SHNELL wrote:

Christopher,

The root cause of this DC shutdown was a fire caused by an exploding
capacitor in a co-located server.  The fire suppression system was
activated, which as part of the protocol also shuts down all power in the
DC.

Amusingly enough, the culprit server belonged to EECS, along with several
others of the same custom design.  Of course the remainder were immediately
removed from service.

Jack


On Mon, Dec 7, 2015 at 2:27 PM, Christopher Brooks <[hidden email]>
wrote:

Since we are discussing the data center, was there ever any word on the
cause of the outage in September?

I looked at http://ucbsystems.org/2015/09/ and there was nothing about
it.


http://technology.berkeley.edu/news/campus-it-systems-and-applications-fully-restored
states

As is our normal protocol, we will be conducting a full post mortem first
thing Monday morning to review the incident and our practices and procedures
for emergency response to situations like this.


Are there lessons learned about how we can avoid this problem in the
future?

Are there plans to add geographical redundancy to CalNet?

Something may have appeared in the various campus-wide spam^H^H^H^H
mailing lists, but I filter those out. :-)

_Christopher

On 9/21/15 11:55 AM, Christopher Brooks wrote:

Any word on the root cause of the CalNet and www.berkeley.edu outages on
Friday evening and Saturday?

http://systemstatus.berkeley.edu/ (below) mentions a fire.

On Friday evening, it seemed like things were a bit quiet, so I sent my
eecs.berkeley.edu account, which worked.

Is there a plan to add some redundancy to some of these systems?

We were fortunate that it happened over the weekend.

Many thanks to the people who put in time over the weekend bringing this
back up.

_Christopher

Outage Type: UNSCHEDULED OUTAGE
Date Submitted: Friday, September 18, – Monday, 21, 2015
Outage Start/End Time: 1930 – TBD
Equipment: Campus Network

Description: Monday, 09/20/2015 1040:

The Service Desk is receiving reports of some residual connection
problems:

Campus DHCP registration is unavailable

Sunday, 09/20/2015 1330:

ETS – AWS intermittent issues with LMS have been resolved.
SAIT – is reporting all systems are operational. Financial aid completed,
still working on Drop Date Deadline change.  Smaller issues will be
addressed via normal channels.
DARS – vendor will be on-site late this afternoon to resolve remaining
issues.
QA/Dev environments have been prioritized by High/Low and date.  This
will be used once the team is ready to begin turning these back on.  The
start of this work is still dependent on the final fix for the  Dell /
Brocade issues on Chassis 20
Communication to Leadership was sent this morning  from Lyle. A final all
campus communication will be sent once approved by Lyle.
This will be the final update on this recovery effort. Any issues
surfacing at this point should be reported via a ticket and normal reporting
processes.

Sunday, 09/20/2015 0900:

bCourses is experiencing an unrelated vendor outage due to AWS issues.
All users are experiencing intermittent login issues. Details of this outage
can be found on the Instructure Canvas System Status page.
DARS – is still experiencing issues, team is actively working with vendor
to resolve today.
EMS Grad2, and Summer Sessions, are up and working.
LMS – as of 7:00 a.m there have been intermittent issues.  Team is
actively working with Vendor to solve.  This is unrelated to the Data Center
issues.  Team will communicate directly with users on the status of this.
Dev/QA environments – ETA to begin restoring is sometime this afternoon.

There is a dependency on a couple of hardware issues with switches that
must be resolved prior to bringing the remaining VMs live.
Teams should start prioritizing environments and identifying what can
wait when we do go live.
Karen Kato will start a Google sheet to track these priorities.  Each
lead needs to input this info.

All Clear was called at this morning's check-in – teams will continue
working through normal channels to resolve minor issues.

Next Status will be after a 1:30pm status call.

Saturday, 09/19/2015 1730:

All production databases are up and running – any additional issues that
are found should have a ticket opened and will be dealt with as soon as
possible.
Phones/voicemail, as reported earlier, are restored.
Applications – some systems are continuing to run batch jobs and will be
opened up tomorrow for user/functional testing.
Ironweed server continues to be worked on by IT and the service provider.
All indications are that it should be OK and restored over the next few
days.
Currently an extended catch up process will take place on jobs that
should have run in the prior 24 hours. Targeting an “all clear” on these by
tomorrow afternoon.
Student systems are all in pretty good shape. Once all partners have been
contacted and have verified their individual tools, we will be able to
consider this clear.

Dars in particular has a team actively working to ensure successful
return to service.
EMS – consortium of multiple areas. Rec Sports is the main service
provider and will be contacted for current status.
Grad Dept and Disabled Student outreach will be done this evening to
verify how things look on their side.

All clear will be posted on the Berkeley news site as well as a campus
wide email. ETA – tomorrow.
Service leads are actively reaching out to their business partners to
verify status.

This will be the last update for tonight. Next update will be posted
Sunday morning.

Saturday 09/19/2015 16:53: All applications identified as critical have
been restored.

All phone and voicemail issues caused by the outage have been resolved
Network is fully operational
Some non-critical blades, chassis and servers are still in progress.
Applications – Bairs, BFS, CalAnswers, CalPlanning are all confirmed up.

BFS –  running batch jobs
HCM – running batch jobs
Caltime – available to users, but problem with some HTML servers

Please note applications are all up, but batch jobs will need to complete
before they are opened to functional partners for validation/testing.
Additional info from SAIT is being gathered and will be in the next
update.
QA and dev systems will be deferred until tomorrow.

Saturday 09/19/2015 13:30:

CalMessages is now up and available
Go anywhere – in progress
Smaller issues with infrastructure hardware will be addressed as soon as
critical applications are restored
Pharmacy database, Footprints, Goanywhere.
Citrix and VMWare are up

Saturday 09/19/2015 12:11

Progress continues to be made to bring applications back online
www.berkeley.edu is now available
CalMail lists and all legacy CalMail services are operating normally.
Some messages that were sent during the outage were kept in a queue and have
now been delivered.

Saturday 09/19/2015 11:18:

CalNet is up and running.  Access to Google, Box, bCourses, Service Now
and other cloud hosted systems has been restored
VPN services are now up and available
Databases and Webfarm are being brought up to enable www.berkeley.edu
(11:30 ETA)
Work has started to restore service to campus phones that were affected
by the outage

Saturday 09/19/2015 10:28:

Still working to get CalNet up and running
After CalNet is up, database team will be bringing databases up
systematically
Once databases are up, applications can begin to be brought online
www.berkeley.edu will be brought back up as soon as possible to aid in
campuswide communications
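
The bring-up sequence in these updates (power, then network, then CalNet, then databases, then applications) is essentially a dependency ordering. A minimal Python sketch, assuming a simplified dependency map inferred from the updates in this log rather than IST's actual runbook; the service names and edges are illustrative only:

```python
from graphlib import TopologicalSorter

# Hypothetical, simplified map: each service lists what must be up before it.
DEPENDS_ON = {
    "power": set(),
    "network": {"power"},
    "calnet": {"network"},
    "databases": {"calnet"},
    "wifi": {"databases"},
    "applications": {"databases"},
    "www.berkeley.edu": {"databases"},
}

def restore_order(depends_on):
    """Return one valid bring-up order where dependencies precede dependents."""
    return list(TopologicalSorter(depends_on).static_order())

order = restore_order(DEPENDS_ON)
print(order)
```

Any order this returns starts with power and brings databases up before wifi, applications, and the web site; services with no ordering between them may appear in either order.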

Saturday 09/19/2015 09:24: The following progress has been made:

Power has been restored
Management Systems are up.
Network is up and running
Working to get CalNet up and running
Wifi still unavailable due to database dependency
Database systems being brought online in a very methodical way
Storage systems still dependent on other systems

Next update will be shortly after 10:00 am.

Saturday 09/19/2015 07:45: We are currently restoring systems and
bringing applications back online as they become available.

Please rest assured your email is still working.  Even if you are unable
to access your mail at this time, it will be there waiting for you when
CalNet authentication is restored. Your phone and external email clients can
still be used to access your account.

Once our systems are available, instructors are being asked to provide
students with appropriate accommodations for possible missed assignments or
other issues related to the outage.

Friday 09/18/2015 21:44:  Most IT systems are currently down due to an
overheating issue in the data center. Any system requiring CalNet
authentication is also unavailable. We are working quickly to assess the
impact of this event. The assessment is expected to begin at 7:00 a.m. on
Saturday 9/19. Once our assessment is complete we will post additional
information on the restoration of all systems.

Friday 09/18/2015 21:18: There was a fire in the data center. Extent is
not known at this time. Steve Aguirre is on-site and actively working on the
issue(s). ETA unknown at this time.

The CSS-IT Service Desk is receiving reports that internet connectivity
has been interrupted for multiple applications, users, and locations across
campus.

IST is working to identify the root cause and resolve the issue.

No ETA is available at this time.

CMR: 4078


--
Christopher Brooks, PMP                       University of California
Academic Program Manager & Software Engineer  US Mail: 337 Cory Hall
CHESS/iCyPhy/Ptolemy/TerraSwarm               Berkeley, CA 94720-1774
[hidden email], 707.332.0670           (Office: 545Q Cory)





-------------------------------------------------------------------------
The following was automatically added to this message by the list server:

To learn more about Micronet, including how to subscribe to or
unsubscribe from its mailing list and how to find out about upcoming
meetings, please visit the Micronet Web site:

http://micronet.berkeley.edu

Messages you send to this mailing list are public and world-viewable, and
the list's archives can be browsed and searched on the Internet.  This means
these messages can be viewed by (among others) your bosses, prospective
employers, and people who have known you in the past.

ANNOUNCEMENTS: To send announcements to the Micronet list, please use the
[hidden email] list.







--
Isaac Simon Orr
Manager, Network Operations and Services
IST Telecommunications, UC Berkeley
P: +1 510 643 9837 C: +1 510 517 9408 E: [hidden email]



Re: [Micronet] Friday [Sept. 18] Calnet and www.berkeley.edu outages?

Christopher Brooks
Thank you all, this is the sort of information that I was looking for.

In my experience, there are plenty of plans for handling rare, very severe incidents (building destroyed, major earthquake).  Recovery from less severe incidents, which are more common, does not seem to be as well planned.  I think this is just human nature: during less severe incidents, it is sometimes hard to make the risky decision to invoke more of the recovery plan.

For this outage, because it happened on a Friday, there was time to bring things back up.  I suspect that if the problem had occurred on a Monday, then taking the more aggressive and possibly risky course of changing DNS or reconfiguring applications would have been more appealing.

If we were designing the DC again, would it make sense to have one room with fewer machines that are important to the campus and another room that has client-supplied equipment that is important to individual departments?

What about the machine failure?  Are there any lessons learned that would benefit people who are selecting hardware?  What should we do differently?

_Christopher

On 12/8/15 2:59 PM, Jeremy Rosenberg wrote:
Thanks Isaac

I was going to add much of this to the conversation but I’m out of the office with limited time today.  But to help clarify, all of the pieces of the CalNet stack are up and running at the SDSC facility. And they were available during the outage.  Much of our infrastructure is configured to fail over automatically, for example, if CAS is unable to connect to the on-site LDAP or Kerberos servers, it will fail over to San Diego.  It has done so on rare occasions and the campus would have been none the wiser since CAS authentication continued to work.
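
The automatic failover described above (try the on-site directory first, then fall back to the off-site one) can be sketched in Python. This is a minimal illustration, not CalNet's actual configuration: the host names are placeholders, and `fake_connect` is a hypothetical stand-in for a real LDAP or Kerberos connection.

```python
from typing import Callable, Sequence

# Placeholder endpoint names for illustration; not the real CalNet hosts.
LDAP_ENDPOINTS = ["ldap.onsite.example.edu", "ldap.sdsc.example.edu"]

def first_available(endpoints: Sequence[str], connect: Callable[[str], object]):
    """Try each endpoint in order; return (endpoint, connection) for the first that works."""
    errors = []
    for endpoint in endpoints:
        try:
            return endpoint, connect(endpoint)
        except ConnectionError as exc:
            errors.append(f"{endpoint}: {exc}")
    raise ConnectionError("all endpoints failed: " + "; ".join(errors))

def fake_connect(endpoint: str) -> str:
    """Hypothetical connector that simulates the on-site server being down."""
    if "onsite" in endpoint:
        raise ConnectionError("unreachable")
    return f"connected to {endpoint}"

endpoint, conn = first_available(LDAP_ENDPOINTS, fake_connect)
print(endpoint)
```

This kind of client-side fallback only helps while the client doing the fallback is itself running, which is why an outage of the CAS server needs a different remedy.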

But when the CAS server itself is unavailable, the only way to redirect traffic to the SDSC CAS instance is to either modify the DNS or modify the configuration of the application that is calling CAS (or Shibboleth).  We briefly considered modifying the configuration of the Google Apps integration to start using the SDSC instance of Shibboleth during that outage to get email service back sooner.  In the end, the infrastructure and CalNet teams moved quickly enough to get the primary CalNet stack back up that it was not worth the additional risks.

This is a good discussion to have and it has inspired us to do a better job of publishing both our disaster recovery strategy and documentation on how application owners can be proactive with their own configurations to leverage our offsite instances.  The CalNet website is getting a major makeover, and you can expect to see this new information in the new year.

I would also like to reassure everyone that each time we experience any kind of degraded performance, CalNet does a full review and implements mitigation strategies to reduce the likelihood of a repeat event.

Jeremy


====================================================
Jeremy Rosenberg
Manager, CalNet Identity and Access Management 
UC Berkeley





On Dec 8, 2015, at 12:44 PM, Isaac Orr <[hidden email]> wrote:

Jack's basically correct here.  In fact, the cutover via DNS to SDSC
may have taken longer than the restoration of EWH because network
systems that are required to make DNS changes were also impacted by
the outage.

Today our SDSC DR facility is really engineered around providing
continuity in the event of a catastrophic event that will leave the
Berkeley Data Center out of service for longer than 48 hours.

One of the biggest issues with bringing systems up in alternate
locations has, in the past, been our DNS management infrastructure.
DNS changes required several hours to take effect, and the system used
to make those changes was aging and increasingly brittle.

Prior to the September Data Center problem, the networking group was
working on replacing that infrastructure, and we completed this in
October.  We're now in a much better state than we were at the time of
the data center fire.  If a similar issue were to occur today, we
would still have full control over our DNS infrastructure, and the
ability to make DNS changes to direct traffic to systems outside the
datacenter where available, with relatively little time needed to make
those changes.
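
The trade-off discussed in this thread (fail over via DNS only if that is faster than restoring the primary site) can be written as a simple rule. A rough Python sketch: the function name, the use of record TTL as the propagation bound, and all durations are assumptions for illustration, not measured values from the incident.

```python
from datetime import timedelta

def should_fail_over(estimated_restore: timedelta,
                     change_lead_time: timedelta,
                     record_ttl: timedelta) -> bool:
    """Fail over only if clients would reach the DR site before the primary is restored."""
    # Worst case: time to get the DNS change published, plus time for cached records to expire.
    worst_case_cutover = change_lead_time + record_ttl
    return worst_case_cutover < estimated_restore

# Hypothetical numbers: a slow DNS change process versus a faster one.
slow_dns = should_fail_over(timedelta(hours=3), timedelta(hours=4), timedelta(hours=1))
fast_dns = should_fail_over(timedelta(hours=3), timedelta(minutes=10), timedelta(minutes=5))
print(slow_dns, fast_dns)
```

With a change process that takes hours, waiting for the primary wins; with a fast change process, the same three-hour restore estimate favors cutting over.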

There's another infrastructure change that needs to be made to enable
services like CalNet to have geographic diversity.  Our existing load
balancer service is not really up to meeting this type of need.  We
had also identified that problem prior to the fire, and had a project
underway to improve the service.  We expect that to be completed early
in the new year.

Once that work is done, critical systems like CalNet should be able to
take advantage of the new infrastructure to provide improved
resiliency.  There are still significant issues around application
architecture when you start looking at running with geographic
diversity, but I believe that the CalNet folks already have that side
of things under control.

iso


On Tue, Dec 8, 2015 at 12:21 PM, Jack M. SHNELL <[hidden email]> wrote:
I've been told it was a management call during the incident to wait for
CalNet to come back up at UCB rather than fail over to our DR site at SDSC,
since the DNS change required for this application would have taken just as
long anyway.  However, I believe this is a question perhaps better answered
by the Platform Services Manager, Joey Curtis, or Dave Browne, the Director
of Information Services, who were both directly involved in the initial
recovery effort that Friday night.

I do know that, because this was a worst-case scenario in some respects for
a DC outage, it contributed to the current, significant expansion of IST
resources committed to the continuing improvement of our DR capabilities.



On Tue, Dec 8, 2015 at 7:23 AM, Christopher Brooks <[hidden email]>
wrote:

Writing as someone from EECS, it figures that an EECS server would have a
component level failure.

What about geographical redundancy for CalNet authentication?  Our
business continuity plans presumably handle this somehow, but what is the
expected recovery time?

We were lucky that the outage happened late on Friday, which gave a couple
of days to bring things back up.

_Christopher



On 12/7/15 2:50 PM, Jack M. SHNELL wrote:

Christopher,

The root cause of this DC shutdown was a fire caused by an exploding
capacitor in a co-located server.  The fire suppression system was
activated, which as part of the protocol also shuts down all power in the
DC.

Amusingly enough, the culprit server belonged to EECS, along with several
others of the same custom design.  Of course the remainder were immediately
removed from service.

Jack


On Mon, Dec 7, 2015 at 2:27 PM, Christopher Brooks <[hidden email]>
wrote:

Since we are discussing the data center, was there ever any word on the
cause of the outage in September?

I looked at http://ucbsystems.org/2015/09/ and there was nothing about
it.


http://technology.berkeley.edu/news/campus-it-systems-and-applications-fully-restored
states

As is our normal protocol, we will be conducting a full post mortem first
thing Monday morning to review the incident and our practices and procedures
for emergency response to situations like this.


Are there lessons learned about how we can avoid this problem in the
future?

Are there plans to add geographical redundancy to CalNet?

Something may have appeared in the various campus-wide spam^H^H^H^H
mailing lists, but I filter those out. :-)

_Christopher

On 9/21/15 11:55 AM, Christopher Brooks wrote:

Any word on the root cause of the CalNet and www.berkeley.edu outages on
Friday evening and Saturday?

http://systemstatus.berkeley.edu/ (below) mentions a fire.

On Friday evening, it seemed like things were a bit quiet, so I sent my
eecs.berkeley.edu account, which worked.

Is there a plan to add some redundancy to some of these systems?

We were fortunate that it happened over the weekend.

Many thanks to the people who put in time over the weekend bringing this
back up.

_Christopher

-- 
Christopher Brooks, PMP                       University of California
Academic Program Manager & Software Engineer  US Mail: 337 Cory Hall
CHESS/iCyPhy/Ptolemy/TerraSwarm               Berkeley, CA 94720-1774
[hidden email], 707.332.0670           (Office: 545Q Cory)

 