Thread 'News on Project Outages'

Author	Message
BarryAZ Send message Joined: 4 Sep 09 Posts: 381	Message 65325 - Posted: 9 Nov 2015, 1:38:07 UTC - in response to Message 65324. Over the past few weeks, the outages have been relatively short -- say four to eight hours. This one is now at about 15 hours.... ID: 65325 · Reply Quote

BarryAZ Send message Joined: 4 Sep 09 Posts: 381	Message 65326 - Posted: 9 Nov 2015, 2:14:32 UTC - in response to Message 65325. That the Collatz database server has been offline this long over the weekend raises the possibility that the admin there (it's a one person effort, running it out of his home I believe), may not be around this weekend to do a server restart. ID: 65326 · Reply Quote

BarryAZ Send message Joined: 4 Sep 09 Posts: 381	Message 65355 - Posted: 10 Nov 2015, 21:31:53 UTC - in response to Message 65326. Collatz database server was back and running Monday morning -- it is offline again as of 1PM PST today. ID: 65355 · Reply Quote

Jord Volunteer tester Help desk expert Send message Joined: 29 Aug 05 Posts: 15563	Message 65357 - Posted: 11 Nov 2015, 0:30:21 UTC Seti is going to be out of work for a day or more. Matt Lebofsky wrote: BUT ALSO we needed to update some fields in the current science database schema to also make the database itself telescope agnostic. Just a few "alter table" commands to lengthen the tape name fields beyond 20 characters. We thought these alters would take a few hours (and completed before the end of today's Tuesday outage). Now it looks like it might take a day. We can't split/assimilate any new work until the alters are finished. Oh well. We're going to run out of work tonight, but should have fresh work sometime tomorrow morning. It is a holiday tomorrow, so cut us some slack, if it's later than tomorrow morning :). Source. ID: 65357 · Reply Quote

Blurf Send message Joined: 18 Jul 11 Posts: 217	Message 65362 - Posted: 11 Nov 2015, 17:00:44 UTC Not getting any work from WCG Fight AIDS Phase 2 ID: 65362 · Reply Quote

Blurf Send message Joined: 18 Jul 11 Posts: 217	Message 65367 - Posted: 11 Nov 2015, 20:48:59 UTC - in response to Message 65362. Blurf wrote: "Not getting any work from WCG Fight AIDS Phase 2" Received work. ID: 65367 · Reply Quote

BarryAZ Send message Joined: 4 Sep 09 Posts: 381	Message 65405 - Posted: 13 Nov 2015, 16:40:52 UTC - in response to Message 65355. As is frequently the case with the higher workload presented by the shorter work units, the Collatz database server is offline (this now happens 3 or more times a week). The outages typically run from 4 hours to 24 hours at which time the database server gets recycled and the clock on the next outage is restarted. ID: 65405 · Reply Quote

BarryAZ Send message Joined: 4 Sep 09 Posts: 381	Message 65423 - Posted: 13 Nov 2015, 20:36:54 UTC - in response to Message 65405. Collatz is back up -- outage this time was about 8 hours. ID: 65423 · Reply Quote

BarryAZ Send message Joined: 4 Sep 09 Posts: 381	Message 65568 - Posted: 23 Nov 2015, 17:07:56 UTC Collatz appears to be in a bit of a yo-yo mode. The database was offline yesterday (11/22) for about 8 hours. Back up around noon yesterday. It was up this morning about 8AM, and is offline again as of 9AM. One thing of note, the problem there (which is clearly chronic) doesn't extend to uploads -- they seem to go through. That means that when the database server is rebooted (and it appears that is all that is being done at the moment when it crashes) it validates a very large collected set of uploaded work units, then processes and validates the large set of new reports that arrive once the database server is alive. Then it sends out new work. Until the next time it crashes. With the very much increased workload from the sieve units (which complete in a much shorter time than the previous work units), the database server has been crashing quite a bit more frequently of late -- about 10 times in the past month. ID: 65568 · Reply Quote

BarryAZ Send message Joined: 4 Sep 09 Posts: 381	Message 65577 - Posted: 24 Nov 2015, 3:11:34 UTC - in response to Message 65568. The Collatz database remains offline -- this outage is a bit longer than the average (and frequent) outage for Collatz -- it is now at about 12 hours. ID: 65577 · Reply Quote

BarryAZ Send message Joined: 4 Sep 09 Posts: 381	Message 65633 - Posted: 26 Nov 2015, 0:25:52 UTC - in response to Message 65577. Collatz was back online for almost an entire day before it crashed yet again about 2 hours ago (2:30PM PST). Given the regularity with which it crashes and the seemingly undefined problem which has persisted for a VERY long time, perhaps some effort could be put toward developing a shut down / restart script which could be run automatically on a daily basis.... ID: 65633 · Reply Quote

BarryAZ Send message Joined: 4 Sep 09 Posts: 381	Message 65635 - Posted: 26 Nov 2015, 6:58:52 UTC - in response to Message 65633. Last modified: 26 Nov 2015, 6:59:24 UTC It seems that the frequency of Collatz database crashes is increasing while the return to operation cycle is getting longer.... I am not sure that anyone else is noticing this.... ID: 65635 · Reply Quote

BarryAZ Send message Joined: 4 Sep 09 Posts: 381	Message 65643 - Posted: 26 Nov 2015, 16:49:13 UTC - in response to Message 65635. Collatz is currently back up and running... ID: 65643 · Reply Quote

Alexander Send message Joined: 28 May 10 Posts: 52	Message 65646 - Posted: 26 Nov 2015, 19:32:52 UTC Has anyone information about Citizen Science Grid (DNA@home and Subset Sum)? Most servers are down and questions on message board are not answered. ID: 65646 · Reply Quote

BarryAZ Send message Joined: 4 Sep 09 Posts: 381	Message 65648 - Posted: 27 Nov 2015, 16:21:16 UTC Collatz is down yet again -- went down about 4AM this morning. It is still down. This has become an almost daily event for Collatz -- followed by a downtime which runs between 4 and 24 hours at which time the database server has been manually rebooted. At this stage, the down time for Collatz is running equal to its up time. Given that there has been no discussion over at the project regarding the problems, it seems that folks may do as I'm about to do and consign Collatz to secondary project status -- to be run only when its status can be closely watched externally by the users. For me that means, before I go to sleep, I'll suspend Collatz as I have some faith that the database server will crash overnight. Then, during days when I can check Collatz status regularly, I'll let the project process. Then when (not if) it goes offline during the day, I'll suspend it again until the next moment when it is running. I realize that Collatz is a one person enterprise, I just wish it was somewhat less unreliable these days. ID: 65648 · Reply Quote

Gary Charpentier Send message Joined: 23 Feb 08 Posts: 2493	Message 65649 - Posted: 27 Nov 2015, 18:06:17 UTC - in response to Message 65648. Last modified: 27 Nov 2015, 18:06:29 UTC BarryAZ wrote: "Collatz is down yet again -- went down about 4AM this morning. It is still down. This has become an almost daily event for Collatz -- followed by a downtime which runs between 4 and 24 hours at which time the database server has been manually rebooted. At this stage, the down time for Collatz is running equal to its up time. Given that there has been no discussion over at the project regarding the problems, it seems that folks may do as I'm about to do and consign Collatz to secondary project status -- to be run only when its status can be closely watched externally by the users. For me that means, before I go to sleep, I'll suspend Collatz as I have some faith that the database server will crash overnight. Then, during days when I can check Collatz status regularly, I'll let the project process. Then when (not if) it goes offline during the day, I'll suspend it again until the next moment when it is running. I realize that Collatz is a one person enterprise, I just wish it was somewhat less unreliable these days." The reality is it suspends itself when it becomes unreachable and your BOINC scheduler will automatically grab work from other projects. There isn't that much data traffic in sending out a few packets that go unanswered while waiting for a timeout. BOINC keeps on with other projects and work units. But if you want the aggravation of manually starting and stopping a project, you can choose to give yourself this headache. ID: 65649 · Reply Quote

BarryAZ Send message Joined: 4 Sep 09 Posts: 381	Message 65650 - Posted: 27 Nov 2015, 19:08:52 UTC - in response to Message 65649. Last modified: 27 Nov 2015, 19:09:25 UTC Gary, thanks for the reply. Yeah, I get that Gary, but I had shifted it to a primary on multiple systems -- which I had also done years ago as well. Over the past few years I had shifted to Milkyway, MooWrap, and GPUGrid. With Collatz shifting to the sieve work units I shifted back to Collatz as the primary project. With GPUGrid though, given its long run work units -- it is either primary or the work units can time out. So that's the major change. I just don't like seeing those 10 minute completed work units stack up with Collatz. Though as long as Collatz has a 50 unit limit and I have the other projects in a receive work mode, I suppose (aside from the systems running GPUGrid) that would work out. It is rather unfortunate that Collatz breaks down so regularly these days. Today's daily outage is at 9 hours or more... ID: 65650 · Reply Quote

BarryAZ Send message Joined: 4 Sep 09 Posts: 381	Message 65653 - Posted: 27 Nov 2015, 23:18:28 UTC - in response to Message 65650. As I noted before, it was almost inevitable that with the short run sieve work units, the stability problems that have manifested themselves with the Collatz project would become far more frequent. To me, this has the feel of some form of memory leak where the database server simply runs out of available memory and collapses under the strain. Not knowing how one would deal with it, my simplistic thinking would be to run a script or scripts. One would take the database server down 'gracefully' based on a clock. One would then restart the database server as part of a planned server reset to restore the memory. But that's just me and I am likely to be rather clueless about the issue and a work around. ID: 65653 · Reply Quote

BarryAZ Send message Joined: 4 Sep 09 Posts: 381	Message 65659 - Posted: 28 Nov 2015, 16:12:21 UTC - in response to Message 65653. Collatz is running again -- at the moment. I'm wondering with the increasingly frequent database server crashes whether something might be done to make them planned instead of unplanned and thus much shorter in duration. I get it that with the very short sieve units, the processing load has increased a lot. My own suspicion, perhaps ill-informed, is that the database server is encountering some memory leak (as I suspect it always had), which is made worse by the higher volume processing. Since it appears that resolving the actual problem is not an option for whatever reason, how about pre-empting it? My (admittedly novice) suggestion would be a pair of scripts. One would take down the database server gracefully at a programmed time of day (perhaps every day). The other would restart the database server about 10 minutes later. Perhaps something along these lines would restore the server to a 'memory clean slate' each cycle. Just a thought from one of the users. ID: 65659 · Reply Quote
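
For anyone curious what the pair of scripts suggested above might look like, here is a rough Python sketch. It is only an illustration under stated assumptions: nothing in this thread says how the Collatz server is actually set up, so the 4:00 shutdown time, the service name "mysql", and the use of systemctl are all guesses, and the function names are made up for this example. The 10 minute gap is the one figure taken from the post. The sketch only prints the plan (a dry run) and does not touch any service.

```python
from datetime import datetime, time, timedelta

# Assumed values, not taken from the Collatz project:
STOP_AT = time(hour=4, minute=0)        # hypothetical daily shutdown time
RESTART_DELAY = timedelta(minutes=10)   # "about 10 minutes later" from the post
SERVICE = "mysql"                       # hypothetical database service name


def next_restart_window(now, stop_at=STOP_AT, delay=RESTART_DELAY):
    """Return (stop, start) datetimes for the next planned restart after `now`."""
    stop = datetime.combine(now.date(), stop_at)
    if stop <= now:
        # Today's slot has already passed, so plan for tomorrow.
        stop += timedelta(days=1)
    return stop, stop + delay


def restart_commands(service=SERVICE):
    """Build the graceful stop and start commands as argument lists."""
    return [["systemctl", "stop", service], ["systemctl", "start", service]]


if __name__ == "__main__":
    stop, start = next_restart_window(datetime.now())
    stop_cmd, start_cmd = restart_commands()
    # Dry run only: print the plan instead of touching any service.
    print(f"{stop:%Y-%m-%d %H:%M}  {' '.join(stop_cmd)}")
    print(f"{start:%Y-%m-%d %H:%M}  {' '.join(start_cmd)}")
```

In practice the two steps would more likely be handed to the operating system's own scheduler than to a long-running script, and a planned restart like this would only mask a memory leak (if that is indeed the cause), not fix it.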

Skivelitis2 Send message Joined: 8 Nov 14 Posts: 11	Message 65738 - Posted: 3 Dec 2015, 14:28:29 UTC Anyone have any word on QCN? ID: 65738 · Reply Quote

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.