Thread 'News on Project Outages'

Message boards : Projects : News on Project Outages

Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5129
United Kingdom
Message 104975 - Posted: 6 Aug 2021, 19:27:52 UTC - in response to Message 104973.  

Two tasks reported, four tasks allocated, four sets of downloads failed.

You'd have thought they knew about that one by now :-(
ID: 104975
Dave
Help desk expert

Joined: 28 Jun 10
Posts: 2704
United Kingdom
Message 104978 - Posted: 7 Aug 2021, 8:29:58 UTC - in response to Message 104975.  

Got this from Andy.

Hi Dave,

Thanks. Engineering IT Support have partially restored networking to a number of machines, but a number of key machines still have no networking access following the switch work on Tuesday. I have submitted a ticket to them for the other machines and I will follow this up on Monday with them.

Best regards,

Andy
ID: 104978
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5129
United Kingdom
Message 104979 - Posted: 7 Aug 2021, 8:46:44 UTC - in response to Message 104978.  

I'm not very impressed by the Oxford University Engineering IT Support team. They will have scheduled this work for the summer vacation, when the undergraduate demand is low: but university postgrad and faculty research continues 52 weeks of the year. This is also a very busy time of year for university administration, dealing with applications from next year's intake of new students.

Letting a planned infrastructure upgrade over-run by a week is bad management, to say the least.
ID: 104979
Bryn Mawr
Help desk expert

Joined: 31 Dec 18
Posts: 296
United Kingdom
Message 104982 - Posted: 7 Aug 2021, 15:04:01 UTC - in response to Message 104979.  

I'm not very impressed by the Oxford University Engineering IT Support team. They will have scheduled this work for the summer vacation, when the undergraduate demand is low: but university postgrad and faculty research continues 52 weeks of the year. This is also a very busy time of year for university administration, dealing with applications from next year's intake of new students.

Letting a planned infrastructure upgrade over-run by a week is bad management, to say the least.


And then not working the weekend to clear the problem - I know I’d never have got away with that.
ID: 104982
Dave
Help desk expert

Joined: 28 Jun 10
Posts: 2704
United Kingdom
Message 104993 - Posted: 9 Aug 2021, 19:38:51 UTC - in response to Message 104982.  

And then not working the weekend to clear the problem - I know I’d never have got away with that.


Sadly, when I worked in the NHS they were as bad or worse about sorting out problems after "upgrades." However, having had a normal work day to sort things out, and with no signs of progress, I am beginning to despair of them.
ID: 104993
Bryn Mawr
Help desk expert

Joined: 31 Dec 18
Posts: 296
United Kingdom
Message 104997 - Posted: 10 Aug 2021, 7:51:49 UTC - in response to Message 104993.  

And then not working the weekend to clear the problem - I know I’d never have got away with that.


Sadly, when I worked in the NHS they were as bad or worse about sorting out problems after "upgrades." However, having had a normal work day to sort things out, and with no signs of progress, I am beginning to despair of them.


When I was supporting system upgrades you worked until the system worked - either fix forward or pull the upgrade and fall back to the starting position. You did not break the system then go home.
ID: 104997
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5129
United Kingdom
Message 104998 - Posted: 10 Aug 2021, 8:23:06 UTC

So, does anyone know whether the CPDN download servers have been re-connected to the internet yet? I'm on Andy Bowery's email distribution list, and I haven't seen anything yet - and I've completed upgrading my machines to Linux Mint v20.2.

Memo to project staff: the project shouldn't be restarted after maintenance until all components are tested and working.
ID: 104998
Les Bayliss
Help desk expert

Joined: 25 Nov 05
Posts: 1654
Australia
Message 105000 - Posted: 10 Aug 2021, 9:50:24 UTC - in response to Message 104998.  

It doesn't appear to be.
And Andy is probably "in a mood" by now, so I'm staying well away from it.

If Oxford IT hired external workers to do this, the air in the place has probably turned blue by now. :)
ID: 105000
Dave
Help desk expert

Joined: 28 Jun 10
Posts: 2704
United Kingdom
Message 105001 - Posted: 10 Aug 2021, 10:08:01 UTC - in response to Message 104998.  

I am enabling internet access if I have an upload or two ready to go. One is almost uploading at the moment. I am going to suspend it again when it has finished, as there is no movement on the downloads.

If it were possible to just suspend uploads or downloads I could leave internet access on and just check once a day to see whether the download server problem was fixed.
ID: 105001
Richard Haselgrove
Volunteer tester
Help desk expert

Send message
Joined: 5 Oct 06
Posts: 5129
United Kingdom
Message 105002 - Posted: 10 Aug 2021, 10:29:35 UTC - in response to Message 105001.  

CPDN needs to be aware that BOINC is designed to manage multiple projects in parallel, and that many of us use it that way. There was once a proposal by, I think, user 'Thyme Lawn' to allow/suspend transfers by project: he coded it for precisely this scenario, but it was rejected by the gatekeepers.

For that reason, I can't follow your example: all my recent tasks have declared their download errors to be permanent and have reported their task status as 'download failed'. I've set 'no new tasks' until I receive positive confirmation that the network is operating properly again.
ID: 105002
Dave
Help desk expert

Joined: 28 Jun 10
Posts: 2704
United Kingdom
Message 105003 - Posted: 10 Aug 2021, 10:35:19 UTC - in response to Message 105002.  
Last modified: 10 Aug 2021, 10:59:26 UTC

CPDN needs to be aware that BOINC is designed to manage multiple projects in parallel, and that many of us use it that way. There was once a proposal by, I think, user 'Thyme Lawn' to allow/suspend transfers by project: he coded it for precisely this scenario, but it was rejected by the gatekeepers.

For that reason, I can't follow your example: all my recent tasks have declared their download errors to be permanent and have reported their task status as 'download failed'. I've set 'no new tasks' until I receive positive confirmation that the network is operating properly again.


My downloads are now shifting - two 10MB atmos.gz files have downloaded. The slow speed is, I think, my broadband rather than the servers getting hammered, though I guess that is probably happening as well.

Edit: the trickle server isn't running again yet though.

Edit2: Well, the server status page says that, at least. I will know in about ten minutes whether trickles are going through as well. One task has finished downloading, so that side seems to have been fixed.

Edit3: Does suspending the project stop the uploads/downloads?
ID: 105003
Dave
Help desk expert

Joined: 28 Jun 10
Posts: 2704
United Kingdom
Message 105005 - Posted: 10 Aug 2021, 12:08:10 UTC

Trickle server still showing as down after last update to server status page.
ID: 105005
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5129
United Kingdom
Message 105008 - Posted: 10 Aug 2021, 13:45:42 UTC - in response to Message 105003.  
Last modified: 10 Aug 2021, 13:46:09 UTC

Edit3: Does suspending the project stop the uploads/downloads?
I think not.
ID: 105008
Les Bayliss
Help desk expert

Joined: 25 Nov 05
Posts: 1654
Australia
Message 105009 - Posted: 10 Aug 2021, 15:37:59 UTC

I turned my net access back on a few hours ago, and the four that I had from before downloaded while I was sleeping.
ID: 105009
Dave
Help desk expert

Joined: 28 Jun 10
Posts: 2704
United Kingdom
Message 105010 - Posted: 10 Aug 2021, 17:26:48 UTC - in response to Message 105008.  

Edit3: Does suspending the project stop the uploads/downloads?
I think not.


A shame. That would be a simple solution.
ID: 105010
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5129
United Kingdom
Message 105048 - Posted: 12 Aug 2021, 16:48:32 UTC
Last modified: 12 Aug 2021, 17:22:14 UTC

At 15:24 UTC on 12 Aug 2021 Andy Bowery wrote:
All services have been restored now to climateprediction.net infrastructure. The Department of Engineering IT Support decided to roll back the changes they made to the networking. This has allowed us to restore all the CPDN services.
Edit: Yes, I can confirm that all files for new tasks are being downloaded cleanly.
ID: 105048
Dave
Help desk expert

Joined: 28 Jun 10
Posts: 2704
United Kingdom
Message 105051 - Posted: 12 Aug 2021, 17:27:43 UTC - in response to Message 105048.  

But their having to roll back the network changes on top of the time taken over and above that scheduled has me joining your verdict on the IT staff at Oxford.
ID: 105051
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5129
United Kingdom
Message 105053 - Posted: 12 Aug 2021, 17:29:52 UTC - in response to Message 105051.  

Staff, or outside contractors?
ID: 105053
Dave
Help desk expert

Joined: 28 Jun 10
Posts: 2704
United Kingdom
Message 105055 - Posted: 12 Aug 2021, 18:39:41 UTC - in response to Message 105053.  

Staff, or outside contractors?


I have no idea but the level of incompetence is the same.
ID: 105055
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5129
United Kingdom
Message 105126 - Posted: 14 Aug 2021, 11:16:06 UTC

Andy reports that a new problem - tentatively identified as a hardware failure - has been observed on the CPDN 'dev' (test) servers. He has shut down "the project" to minimise data loss, and fears that this closure may last for several days.

The main, production, CPDN server is also reporting 'shut down for maintenance'. I have asked Andy to clarify whether both versions of the project need to be shut down because of the single hardware failure, and am awaiting his reply.
ID: 105126

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.