bug? computation error on restarting based on time of day

Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 91866 - Posted: 17 Jun 2019, 14:31:48 UTC

Due to summer heat and several systems in an uncooled garage, I was testing running only at night (23:00-06:00) on a system with 6 GPUs. I have never used this feature before. I had two tasks running on a pair of RX560s when I enabled the 23:00-06:00 filter, and both paused just fine. After a few minutes I set the time filter to 23-23 to let these last two finish, as there were no others queued up. Unaccountably, one reported a compute error instantly, while the other GPU continued to run just fine. If this is a one-off occurrence then no problem. I allowed more work to make sure the problem was not the device and got 150 tasks before I could stop the downloads. All 6 GPUs are working, and I assume this was a random bug in the resume feature and that at most a single task would be lost per device, if any.
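
For readers unfamiliar with overnight time filters, here is a minimal Python sketch of how a window such as 23:00-06:00 that wraps past midnight can be evaluated. It is an illustration only, not BOINC's actual code. The function name `in_run_window` is hypothetical, and treating equal start and end times (the 23-23 setting above) as "no restriction" is an assumption based on the tasks resuming in this report.

```python
from datetime import time


def in_run_window(now: time, start: time, end: time) -> bool:
    """Return True if `now` falls inside the daily run window [start, end)."""
    if start == end:
        # Assumption: equal start and end means "no restriction",
        # matching the 23-23 setting that let tasks resume in this thread.
        return True
    if start < end:
        # Ordinary same-day window, e.g. 09:00-17:00.
        return start <= now < end
    # Window wraps past midnight, e.g. 23:00-06:00.
    return now >= start or now < end


if __name__ == "__main__":
    for probe in (time(23, 30), time(5, 59), time(6, 0), time(12, 0)):
        print(probe, in_run_window(probe, time(23, 0), time(6, 0)))
```

The wrap-around branch is the part that is easy to get wrong: a plain `start <= now < end` comparison is never true when the start time is later than the end time.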
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5080
United Kingdom
Message 91867 - Posted: 17 Jun 2019, 14:45:39 UTC - in response to Message 91866.  

It might be worth randomly suspending single tasks and allowing them to restart later in their own time.

GPU tasks are always removed from GPU memory when suspended for any reason (unlike CPU tasks, which can be left in memory if your preferences permit). That means that on restart, they have to be reloaded from the checkpoint file.

Over the years, some projects have from time to time had problems with their checkpointing code. You wouldn't notice any problems when suspending, but you can get a crash when the app tries to read back a bad file. The same thing can happen if your hard disk is flaky.

Start with the project that failed on restart, but test other projects too. Only rely on unattended stops/starts when you're certain that all your active projects have reliable checkpoint/restart code.
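
To illustrate the failure mode described above, here is a minimal Python sketch of one common way an application can make checkpoints robust: write to a temporary file, rename it into place atomically, and verify a checksum before trusting the file on restart. This is not the code of any BOINC project. The function names, the JSON payload, and the file layout (checksum line followed by data) are all assumptions made for the example.

```python
import hashlib
import json
import os
import tempfile


def write_checkpoint(path, state):
    """Write `state` so that a crash mid-write never leaves a half-written file at `path`."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest().encode("ascii")
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the final rename stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(digest + b"\n" + payload)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # atomic swap of old checkpoint for new
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def read_checkpoint(path):
    """Return the saved state, or None if the file is missing or fails its checksum."""
    try:
        with open(path, "rb") as f:
            digest, _, payload = f.read().partition(b"\n")
    except FileNotFoundError:
        return None
    if hashlib.sha256(payload).hexdigest().encode("ascii") != digest:
        return None  # corrupt or truncated: caller should restart from scratch
    return json.loads(payload)
```

An application that reads its checkpoint this way can fall back to starting the task over when the file is bad, rather than crashing on restart.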
Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 91888 - Posted: 18 Jun 2019, 15:54:22 UTC

I ran some tests to see how often this problem occurs. It does happen, but rarely, and it seems the problem is the project's checkpoint handling, as you suggested.

I tried a few manual suspensions and resumes. This did not cause an error, so I was thinking the problem is that the suspension is issued "all at once" when the time of day hits the stop deadline.

I then repeatedly set the deadline and resumed (by setting start and end to the same time of day). After about the 4th attempt I managed to get a single SETI GPU task to go bad.

I then tried stopping the service (Ubuntu: sudo ./boinc-client restart) without suspending any tasks, which pretty much interrupts the processing with little or no warning. All GPU (SETI) tasks started up just fine, but two WCG CPU tasks generated computation errors.


The above just shows that the projects have difficulty with their checkpoint implementation.

I did notice the following, which concerns me. I have two systems that I would like to run only at night when it is cool. The Ubuntu system does not shut down the fans on the six RX560 GPUs. The Windows system shuts down the fans on its RX570s. I have replaced fans on a number of occasions. It is always a PITA trying to locate the exact replacement, more often than not from mainland China. Usually the entire heat sink needs to be removed. Both of these systems use about 115 watts or so with no load, whether the fans are turning or not. I think I now want to shut them down when they are not being used. I suspect there is a way in the BIOS to turn the system on at a certain time, and I suspect there is a network management or remote procedure call I can make to shut them down. Some programs like Nero and Acronis have a shutdown procedure, but AFAICT neither BOINC nor its manager has that option.
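
One possible approach on the Ubuntu machine, sketched here in Python under several assumptions, is the `rtcwake` utility from util-linux, which can power the system off and program the hardware clock to wake it later. The sketch only computes the delay until the next 23:00 start and builds the command; it does not run it. The helper names are hypothetical, and whether `-m off` actually wakes a given motherboard depends on the hardware and BIOS settings, so it would need testing. This does not cover the Windows system.

```python
from datetime import datetime, time, timedelta


def next_start(now: datetime, start: time) -> datetime:
    """Return the next occurrence of the daily start time strictly after `now`."""
    candidate = datetime.combine(now.date(), start)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


def rtcwake_command(now: datetime, start: time) -> list:
    """Build (but do not run) an rtcwake command that powers off until `start`."""
    seconds = int((next_start(now, start) - now).total_seconds())
    # -m off: power the machine off; -s N: wake N seconds from now.
    return ["sudo", "rtcwake", "-m", "off", "-s", str(seconds)]


if __name__ == "__main__":
    # Example: called at 06:00 when the run window closes, wake again at 23:00.
    print(" ".join(rtcwake_command(datetime.now(), time(23, 0))))
```

Stopping or suspending BOINC cleanly before powering off would still matter here, given the checkpoint problems discussed earlier in the thread.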
Dave
Help desk expert

Joined: 28 Jun 10
Posts: 2533
United Kingdom
Message 91895 - Posted: 18 Jun 2019, 17:57:51 UTC - in response to Message 91888.  

With CPDN, stopping the client without suspending computation first has a very high computation error rate for Linux tasks. Much less so with Windows ones, at least running under WINE. It is too many years since I ran anything under Windows natively to have a clue about that.

There are a couple of new model types around now: one has been run on the main site, the other is still only running on the test site. I don't know how these behave in this regard yet.

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.