Stop switching between WU

Eduardo Bicudo Dreyfuss

Joined: 26 Jun 09
Posts: 8
Brazil
Message 25750 - Posted: 28 Jun 2009, 3:06:49 UTC - in response to Message 25733.  

I re-downloaded 6.6.36 and reset the project, and it's running normally, i.e., running two CPU WUs and just one GPU task. Three hours have passed and I've already finished three CUDA WUs.
Eduardo Bicudo Dreyfuss

Joined: 26 Jun 09
Posts: 8
Brazil
Message 25773 - Posted: 29 Jun 2009, 1:53:30 UTC - in response to Message 25750.  

24 hours and still running well (only one CUDA task) after re-downloading 6.6.36 and resetting the SETI project. I set the connect/cache preferences to 1/3, and it has already fetched more work for both CPU and GPU; this hasn't disturbed the processing sequence so far, and the numbers are already back on a rising curve. My setup includes a one-hour outage every day, and the outage was also handled without any trouble.
Fred - efmer.com
Joined: 8 Aug 08
Posts: 570
Netherlands
Message 25775 - Posted: 29 Jun 2009, 5:41:44 UTC - in response to Message 25773.  

I got more problems, not less.
Now I have the problem on a computer with less than half a day of WUs left (<400).
It had about 30 CUDA WUs waiting and 7 in memory, which caused one of them to go into fallback mode. I had to reboot.
The WUs that are still in waiting mode finish and report OK, when they eventually get done, that is.

I am going to fight this. I've got two tactics:
1) Check the GPU temperature: if it goes below 58 C, the system will reboot, solving the problem.
2) Check for running CUDA programs: if there are more than allowed (in this case 2), the system will reboot, solving the problem.
I've written all of this into a program (a rough sketch of the idea follows at the end of this post); hopefully this solves things. This is only for unattended systems, of course...
For the other system I send an email instead of rebooting, so I can decide what to do.

Because when the system reboots and runs again everything is fine for a long long time.
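For anyone who wants to try the same trick without waiting for TThrottle, here is a minimal watchdog sketch of those two tactics. It is not Fred's actual program; it assumes a Windows machine with an NVIDIA GPU, nvidia-smi on the PATH, and the psutil Python package, and the process-name match ("cuda") is a guess at how the CUDA science app shows up in the process list:

    # Minimal watchdog sketch, not Fred's actual TThrottle code.
    # Assumes: Windows, NVIDIA GPU, nvidia-smi on the PATH, psutil installed.
    import os
    import subprocess
    import time

    import psutil

    TEMP_FLOOR_C = 58    # tactic 1: a GPU this cold has stopped crunching
    MAX_CUDA_TASKS = 2   # tactic 2: more CUDA apps than allowed means trouble

    def gpu_temperature_c():
        # Query the GPU core temperature through nvidia-smi.
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=temperature.gpu",
             "--format=csv,noheader,nounits"],
            text=True)
        return int(out.strip().splitlines()[0])

    def cuda_task_count():
        # Count running processes whose name looks like a CUDA science app.
        return sum(1 for p in psutil.process_iter(["name"])
                   if "cuda" in (p.info["name"] or "").lower())

    while True:
        if gpu_temperature_c() < TEMP_FLOOR_C or cuda_task_count() > MAX_CUDA_TASKS:
            os.system("shutdown /r /t 0")  # reboot; unattended systems only
        time.sleep(60)

Swapping the reboot for an email, as on the attended machine, is just a matter of replacing the os.system call with an smtplib send.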
Eduardo Bicudo Dreyfuss

Joined: 26 Jun 09
Posts: 8
Brazil
Message 25786 - Posted: 29 Jun 2009, 11:18:55 UTC - in response to Message 25773.  

And then, without any interference, BOINC jumped from the first GPU task to a second one, leaving the first at 20 s from finishing. Now I have two in progress, which means BOINC is already in the failure mode.
It happened 2 seconds after BOINC started a new CPU task, and it left a message about starting the second task BUT NO MESSAGE about the interruption of the first. This second WU was NOT recently loaded; it had been in the list for 6 hours already, with a deadline of Jul 05th, 5:48:31 PM, against Jul 05th, 6:20:00 PM for the first one.
AND right now, while I was observing it, when the second one reached 14 s from finishing, BOINC jumped back to the first one, leaving NO MESSAGE AT ALL (my hands were off; I was just watching, and no other operation was in progress). On restarting the first one, the remaining time jumped back to 1'30" from finish. And 20 s after this, just 1 s after a new CPU task started, BOINC left the first one at 1'10" from finish and started a third one, leaving a message about the start of this third one (which had already been in the list for hours and has a deadline of Jul 21st!) and no message about the interruption. It's as if the GPU followed the CPU. NOW I HAVE 3 GPU tasks in progress.
The first and second CUDA units are as short as 20 minutes; the third is 1 h 07 min. This third one finished and BOINC started a fourth one, also due Jul 21st; I still have 3 in progress.
This is the failure mode, or one of the failure modes at least.
Hope the description helps,
Eduardo
Eduardo Bicudo Dreyfuss

Joined: 26 Jun 09
Posts: 8
Brazil
Message 25809 - Posted: 30 Jun 2009, 10:09:56 UTC - in response to Message 25786.  

Left alone all day long, BOINC has somehow recovered: it is running just one CUDA WU at a time, and the day's performance was just fine.
Fred - efmer.com
Joined: 8 Aug 08
Posts: 570
Netherlands
Message 25812 - Posted: 30 Jun 2009, 12:47:57 UTC - in response to Message 25811.  
Last modified: 30 Jun 2009, 12:48:47 UTC

Whatever it does, it does. Poking through the check-in list occasionally, you find that the GPU scheduling modifications are not finished yet:

David 26 June 2009
- client: when suspending a GPU job, always remove it from memory, even if it hasn't checkpointed. Otherwise we'll typically run another GPU job right away, and it will bomb out or revert to CPU mode because it can't allocate video RAM
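
Loosely paraphrased in Python (the real client is C++, and these names are illustrative only), that check-in says a suspended GPU job is always evicted from memory, checkpoint or not:

    # Schematic paraphrase of the check-in above; illustrative names, not
    # the real BOINC client code.
    def suspend_task(task):
        task.stop()
        if task.uses_gpu:
            # Always remove a suspended GPU job from memory, even if it has
            # not checkpointed; otherwise the next GPU job starts while the
            # video RAM is still held and bombs out or falls back to CPU.
            task.remove_from_memory()
        elif task.checkpointed:
            # CPU jobs can be evicted safely once they have checkpointed.
            task.remove_from_memory()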

TThrottle: for those interested, TThrottle can alert you, send an email, or restart the system when this happens. In Rules, add: if gpu number > 2, email.
In the Programs tab, Active must be checked!
Claggy

Joined: 23 Apr 07
Posts: 1112
United Kingdom
Message 25818 - Posted: 30 Jun 2009, 17:47:46 UTC - in response to Message 25811.  

There's an Echo in here.

Claggy
Eduardo Bicudo Dreyfuss

Joined: 26 Jun 09
Posts: 8
Brazil
Message 25932 - Posted: 10 Jul 2009, 1:18:50 UTC - in response to Message 25809.  

After a week, BOINC went back to having problems, running more than one CUDA unit. It's been happening for more than a week now, and it's not just an impression; I spent many hours observing BOINC's behaviour and found that this behaviour starts when it is processing a "BAD UNIT", i.e., a unit BOINC is not capable of processing properly for some reason. This causes BOINC to start jumping from one unit to another, starting new ones, and so on. As soon as you abort the bad unit, it returns to normal behaviour (sometimes there is more than one bad unit that needs to be killed).
An additional side effect is that while the trouble with the bad unit persists on the GPU, the computer is almost paralysed, reducing CPU performance on the CPU units and on other user requests as well.
The last CUDA unit I killed was 14dc08ab.29925.2526.8.8.212_2. It had reached just 1.4% after more than an hour of processing, and it had paralysed BOINC.
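
Aborting a stuck unit like that can also be done from the command line with BOINC's boinccmd tool, which is handy when the Manager itself is crawling. A minimal sketch (it assumes boinccmd is on the PATH and can reach the local client; the task name is the one above):

    # Abort a stuck CUDA task by name via boinccmd, equivalent to:
    #   boinccmd --task <project_url> <task_name> abort
    import subprocess

    PROJECT_URL = "http://setiathome.berkeley.edu/"
    TASK_NAME = "14dc08ab.29925.2526.8.8.212_2"  # the stuck unit named above

    subprocess.run(["boinccmd", "--task", PROJECT_URL, TASK_NAME, "abort"],
                   check=True)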
Eduardo Bicudo Dreyfuss

Joined: 26 Jun 09
Posts: 8
Brazil
Message 25933 - Posted: 10 Jul 2009, 10:30:47 UTC - in response to Message 25932.  

And one piece of information I left out: during these periods of near-paralysis, Explorer uses up to 50% of the CPU time.
Fred - efmer.com
Joined: 8 Aug 08
Posts: 570
Netherlands
Message 25935 - Posted: 10 Jul 2009, 10:47:19 UTC - in response to Message 25934.  

At the risk of echoing... I think they did something with 6.6.37 and GPU crunching... it's alpha... well, 6.6.36 was really alpha, but that's probably an echoed statement too ;>)
I'm testing that one now, but there is not enough work yet to get these kinds of problems.