Thread 'News on Project Outages'

Dave
Help desk expert
Joined: 28 Jun 10
Posts: 2636
United Kingdom
Message 111229 - Posted: 8 Mar 2023, 18:00:23 UTC - in response to Message 111228.  

Dennis currently telling me it has no work available.

It has un-sent units currently (was 0 before but many in progress).

Paul.
Yep. Got loads now.

Grumpy Swede
Joined: 30 Mar 20
Posts: 392
Sweden
Message 111231 - Posted: 9 Mar 2023, 0:33:04 UTC
Last modified: 9 Mar 2023, 0:33:47 UTC

And another working day has ended in Toronto. No results this day either. WCG still dead as a doornail.
It's as if they're trying to land on Mars or something, and not just restart/reconnect a storage system.

[CSF] Aleksey Belkov
Joined: 3 Mar 23
Posts: 14
Russia
Message 111232 - Posted: 9 Mar 2023, 0:56:45 UTC - in response to Message 111231.  

just restart/reconnect a storage system.

This can be a completely different level of problem.
It's time to calm down and just wait for the result.
Blaming them every day can't help this situation at all : )

Robokapp
Joined: 8 Mar 23
Posts: 10
Message 111233 - Posted: 9 Mar 2023, 2:57:37 UTC

I wonder, if a hospital's network goes down, whether it takes 8 days and counting to fix it...

Dave
Help desk expert
Joined: 28 Jun 10
Posts: 2636
United Kingdom
Message 111234 - Posted: 9 Mar 2023, 5:46:44 UTC - in response to Message 111233.  

I wonder, if a hospital's network goes down, whether it takes 8 days and counting to fix it...
It did at the one I worked at once, after a Citrix "upgrade."

Phillip Spencer
Joined: 3 Mar 23
Posts: 10
France
Message 111236 - Posted: 9 Mar 2023, 8:33:54 UTC - in response to Message 111232.  

just restart/reconnect a storage system.

This can be a completely different level of problem.
It's time to calm down and just wait for the result.
Blaming them every day can't help this situation at all : )

Unfortunately, this is not the first issue since restart (well into double digits now, I suspect, even ignoring the length of time migration took).
I believe there is a combination of issues at play (to name a few off the top of my head):
- Jurisica lab not understanding the complexity of what they were taking on from IBM and biting off more than they can chew;
- Krembil's IT having other priorities (it would be interesting to see their service level agreement for this!);
- insufficient resources invested to avoid single point of failure outages;
- old/repurposed devices pressed into action as temporary fixes;
- unanticipated consequences from moving off IBM technology to alternative software.
In this particular instance I would like to understand better how they use RAID for storage back-up and mirroring. The early reference to having tape back-ups was disconcerting, as I would have assumed these were overnight and not real-time.
If Jurisica Lab's communication was significantly better / faster that might "calm down" everyone who, like me, has invested time and effort into supporting WCG over the years.
Cheers
Phillip

Bryn Mawr
Help desk expert
Joined: 31 Dec 18
Posts: 293
United Kingdom
Message 111237 - Posted: 9 Mar 2023, 11:46:40 UTC - in response to Message 111232.  

just restart/reconnect a storage system.

This can be a completely different level of problem.
It's time to calm down and just wait for the result.
Blaming them every day can't help this situation at all : )


Agreed - and this is not a problem to park at Krembil’s door; it’s totally the responsibility of the data centre.

[CSF] Aleksey Belkov
Joined: 3 Mar 23
Posts: 14
Russia
Message 111238 - Posted: 9 Mar 2023, 11:53:18 UTC - in response to Message 111236.  


Unfortunately, this is not the first issue since restart
....
everyone who, like me, has invested time and effort into supporting WCG over the years.

We can lament the current situation and problems of the project as much as we like, but in truth, this does not give any of us the right to demand or dictate anything to the project, regardless of how much we have invested in this project over the years of participation in it.
This is still a voluntary computing project.

- Jurisica lab not understanding the complexity of what they were taking on from IBM and biting off more than they can chew;
- Krembil's IT having other priorities (it would be interesting to see their service level agreement for this!);
- insufficient resources invested to avoid single point of failure outages;
- old/repurposed devices pressed into action as temporary fixes;
- unanticipated consequences from moving off IBM technology to alternative software.

I do not want to remove responsibility from anyone, but it cannot be excluded that there was simply no other way out at that moment (or at the current one).
Anyway, we don't know the ins and outs of the agreement on the transition of WCG from IBM to Krembil, or which of them was interested in what (or was forced into it).

Grumpy Swede
Joined: 30 Mar 20
Posts: 392
Sweden
Message 111241 - Posted: 9 Mar 2023, 18:02:48 UTC

New update, 30 minutes ago:

Update #4: The "new" system did recognize the data hardware RAIDs. All have been rebuilt,
and the data center is attempting to repair the OS drives/RAID.

Dr Who Fan
Joined: 10 May 07
Posts: 1418
United States
Message 111242 - Posted: 9 Mar 2023, 23:11:40 UTC - in response to Message 111241.  

Not going to get my hopes up that WCG will be resuscitated until I see it.

Grumpy Swede
Joined: 30 Mar 20
Posts: 392
Sweden
Message 111243 - Posted: 9 Mar 2023, 23:34:32 UTC - in response to Message 111242.  
Last modified: 9 Mar 2023, 23:43:14 UTC

Not going to get my hopes up that WCG will be resuscitated until I see it.
Same here, and since I do not have any interest in any other project, I have shut down all my crunching computers.

They really should concentrate on getting the BOINC part of the system up and running. The WCG website,
bells and whistles, can wait until the BOINC part is running normally.

Keith Myers
Volunteer tester
Help desk expert
Joined: 17 Nov 16
Posts: 884
United States
Message 111244 - Posted: 10 Mar 2023, 0:52:07 UTC

Asteroids@home is offline due to SSL certificate expiration


Robokapp
Joined: 8 Mar 23
Posts: 10
Message 111245 - Posted: 10 Mar 2023, 2:46:12 UTC - in response to Message 111243.  

Tomorrow is Friday.

If WCG doesn't fix it by tomorrow afternoon... next chance is Monday.

Grumpy Swede
Joined: 30 Mar 20
Posts: 392
Sweden
Message 111246 - Posted: 10 Mar 2023, 4:51:50 UTC - in response to Message 111245.  
Last modified: 10 Mar 2023, 5:18:43 UTC

Tomorrow is Friday.

If WCG doesn't fix it by tomorrow afternoon... next chance is Monday.
That is true, because as we have learned by now, SHARCNET (Shared Hierarchical Academic Research Computing Network)
does not help its customers (at least not WCG) during weekends, evenings, and nights. At least that's what we learned from one of the WCG
updates, when they wrote "Unfortunately, data center staff will not be able to help us over the weekend."

Very strange data center. They only seem to work during office hours. After office hours, customer systems (or only WCG) obviously will be allowed to crash and burn.

Below is an interesting comment from a WCG user, on their FB account:

Eric Pohlke writes:

I'm lost here David. I've tried to reach out to Krembil several times with no answer. I get responses from their funding partners and Service Providers, but nothing from the WCG. 
A problem like this would have been solved and a solution applied within 2 days or I would have fired a few people. Building a top end Grid and Rack Server would have taken less time.

Swap the controller box out and put a spare in. All data centres that the University Health Network uses have triple redundancy backup. They can't afford to lose their link and data stream
with their clients (doctors, specialists, research engineers, etc.). The WCG volunteers should be given the same consideration and respect. There were once over 1,720,000 WCG volunteers,
and way before any smartphone app or console. All on their PCs. IBM would share things, seek advice, etc. with their volunteers.

Have Mr. J. Bains consider allocating many more resources to the World Community Grid Project. Donald Weaver, the former Director, invited the World Community Grid to Krembil and
was so excited about its achievements and the energy its volunteers put forth. Joseph Bains needs to understand the importance and potential of this project.

Back in the day, most PC users only had a single core, low yield CPU and Windows 98 to use. Today, the home gaming rig has the power of a mini-supercomputer. As for commercial hardware,
any decent computer tech can build one and knows where to get the parts without taking out a second mortgage. We need this energy back in Krembil.

Bob Harder
Joined: 11 Oct 10
Posts: 13
United States
Message 111253 - Posted: 10 Mar 2023, 16:00:08 UTC - in response to Message 111246.  

The SHARCNET website clearly states (look under Support) that they are a 0900 to 1700 EST (Canada), 5 day a week operation. No support on weekends. Subtract lunch breaks, staff meetings, etc., and maybe there is an average of 6 hours of support per day.

With this work schedule, it is no surprise it took forever to get WCG up and running. And no surprise it takes forever to get anything fixed.

SHARCNET has other users. We (WCG users) have no idea where WCG lies on the priority list for having issues resolved. Maybe WCG is at the bottom of the list.

All of this had to be known to Krembil from day 1.

So, it is what it is. Just find other projects to use your computer time. No use complaining. Nothing is going to change.

Grumpy Swede
Joined: 30 Mar 20
Posts: 392
Sweden
Message 111254 - Posted: 10 Mar 2023, 16:46:56 UTC
Last modified: 10 Mar 2023, 16:50:15 UTC

Yeah, no support after business hours. Incredible.

To the question "How long should I expect to wait for support?" on this page: https://helpwiki.sharcnet.ca/wiki/FAQ
the answer is:

"Unfortunately Compute Canada/SHARCNET does not have adequate funding to provide support 24 hours a day, 7 days a week.
User support and system monitoring is limited to regular business hours: there is no official support on weekends or holidays,
or outside 9:00 - 17:00 EST.

Please note that this includes monitoring of our systems and operations, so typically when there are problems overnight or on
weekends/holidays system notices will not be posted until the next business day."


So, no wonder then that everything, including the migration from IBM, takes such a long time, compared to when WCG was run by IBM.
That state of affairs is not going to work in the long run. If there's no support outside of business hours, WCG will slowly fade away.

Grumpy Swede
Joined: 30 Mar 20
Posts: 392
Sweden
Message 111255 - Posted: 10 Mar 2023, 19:35:56 UTC
Last modified: 10 Mar 2023, 19:47:25 UTC

WCG New update, 15 minutes ago:

"Update #5: The storage server was revived yesterday late afternoon. Both database filesystems mounted as before,
but the science filesystem did not. It needs a repair; erasing the old log first."

[CSF] Aleksey Belkov
Joined: 3 Mar 23
Posts: 14
Russia
Message 111256 - Posted: 10 Mar 2023, 23:59:17 UTC - in response to Message 111253.  

Just find other projects to use your computer time. No use complaining. Nothing is going to change.

"Came, offended, left." (=

Perhaps, before inflating further hysteria that "everything is lost", we should wait for this story to end and only THEN draw any conclusions (especially calls to abandon the project)?

Contact
Joined: 29 Aug 05
Posts: 74
Canada
Message 111259 - Posted: 11 Mar 2023, 16:12:36 UTC - in response to Message 111246.  

as we have learned by now, SHARCNET (Shared Hierarchical Academic Research Computing Network),
does not help their customers, (at least not WCG) during Weekends, evenings, and nights.
SHARCNET has free access to Compute Canada for academic research.
https://youtu.be/hWkWAaNBILs?t=146

Free makes sense. I don't see a flow of cash to the project. Limited service makes sense from a free service. It's actually amazing to have any service at all for no charge! After all, somebody (Canadian taxpayer) is paying for replacement parts and labour and delivery etc...
It also makes sense that this system is now overburdened by World Community Grid. It was not set up with the intention to host anything like a huge BOINC project.
Good on these people for still trying to help us.
They are relentless :)

Keith Myers
Volunteer tester
Help desk expert
Joined: 17 Nov 16
Posts: 884
United States
Message 111261 - Posted: 11 Mar 2023, 21:19:47 UTC

Asteroids@home is back online.



Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.