The Seti is Slumbering Cafe

Author	Message
Jimbocous Joined: 1 Oct 15 Posts: 391	Message 95069 - Posted: 15 Jan 2020, 1:02:06 UTC - in response to Message 95066. Yikes. This is to the point we may have to find a virgin and throw them into a volcano! I hear there is one - a volcano - available now. I had one newer cruncher that was virgin to Einstein@home, which has now been sacrificed. Given the near-volcanic heat produced by GPUs on Einstein, I'm hoping this will perhaps suffice. ID: 95069

Joseph Stateson Volunteer tester Joined: 27 Jun 08 Posts: 641	Message 95072 - Posted: 15 Jan 2020, 2:11:58 UTC The extended outage allowed me to notice that a 4 core (8 thread) CPU cannot feed 9 GPUs running Einstein. I had to configure for 4 concurrent Einstein and 5 concurrent Milkyway and in addition had to scrap the "64" spoofed GPUs as that fetched too many Einstein tasks. I had resources set to 0 but got way more than 64 work units. Should have gotten 1 for each GPU but I am looking at 110 on one mining system and 241 on another. Resource share on both for Einstein was 0 so something is not right. ID: 95072

Dr Who Fan Joined: 10 May 07 Posts: 1348	Message 95073 - Posted: 15 Jan 2020, 2:17:07 UTC 13 plus hours of outrage makes things a DOUBLE OUTRAGE! Time to break out the heavy stuff in celebration of the DOUBLE OUTRAGE: Linie Aquavit from the old country. Anyone care for a shot or two? ID: 95073

Jimbocous Joined: 1 Oct 15 Posts: 391	Message 95074 - Posted: 15 Jan 2020, 2:20:34 UTC - in response to Message 95072. The extended outage allowed me to notice that a 4 core (8 thread) CPU cannot feed 9 GPUs running Einstein. I had to configure for 4 concurrent Einstein and 5 concurrent Milkyway and in addition had to scrap the "64" spoofed GPUs as that fetched too many Einstein tasks. I had resources set to 0 but got way more than 64 work units. Should have gotten 1 for each GPU but I am looking at 110 on one mining system and 241 on another. Resource share on both for Einstein was 0 so something is not right. Agreed. Something got really broken in 7.16.3 on resource sharing and scheduling. Been fighting this for a while. Best example is a case where I'm trying to clear out the Einstein queue after SETI resumes, using a resource share of 1, set NNT, and max concurrent set to less than all physical GPUs. What now happens, and didn't on 7.14.2, is that when max_concurrent # of GPUs are engaged, the other GPUs will sit idle rather than process SETI, apparently because of a resource share debt. My contention is that GPUs should never sit idle, regardless of any perceived debt. Apparently, the software feels otherwise. I'd be interested to see if you experience anything like this. ID: 95074
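
The contention in the post above, that a GPU should never sit idle while any project has runnable work, can be illustrated with a toy "work-conserving" assignment in C++. This is a minimal sketch only, not BOINC's scheduler and not a description of what 7.16.3 or 7.14.2 actually do. The `Project` struct, the `assign_gpus` function, and the rule that a `max_concurrent` of 0 or less means "no cap" are all assumptions made for illustration, and resource share debt is deliberately left out.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Toy model only: NOT BOINC's scheduler. All names here are hypothetical.
struct Project {
    std::string name;
    int queued_tasks;    // tasks ready to run
    int max_concurrent;  // cap on simultaneous tasks; <= 0 means no cap (assumption)
};

// Work-conserving assignment: walk the projects in priority order and give
// each one as many GPUs as its cap and queue allow. Any GPU still free is
// offered to the next project instead of being left idle.
std::vector<std::string> assign_gpus(const std::vector<Project>& projects,
                                     int gpu_count) {
    assert(gpu_count >= 0);
    std::vector<std::string> slots;  // one entry per busy GPU
    for (const Project& p : projects) {
        int free_gpus = gpu_count - static_cast<int>(slots.size());
        if (free_gpus <= 0) break;
        int cap = p.max_concurrent > 0 ? p.max_concurrent : free_gpus;
        int n = std::min({free_gpus, cap, p.queued_tasks});
        for (int i = 0; i < n; ++i) slots.push_back(p.name);
    }
    return slots;
}
```

Called with 9 GPUs, an "Einstein" project holding 110 queued tasks and a cap of 4, and a "SETI" project holding 50 tasks with no cap, this sketch fills 4 slots with Einstein and the remaining 5 with SETI, leaving none idle. That is the outcome the post argues for, as opposed to the idle GPUs it reports.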

Gary Charpentier Joined: 23 Feb 08 Posts: 2464	Message 95077 - Posted: 15 Jan 2020, 2:31:14 UTC - in response to Message 95071. Close to 13 hours downtime now. Plenty of active volcanoes around, but virgins.....hmm.... ID: 95077

juan BFP Joined: 2 Jan 18 Posts: 170	Message 95078 - Posted: 15 Jan 2020, 2:32:19 UTC Last modified: 15 Jan 2020, 2:35:38 UTC I do not know what method or program you use to spoof the GPU count, but I can tell you for sure that max concurrent and the scheduler work totally differently (not broken) in BOINC 7.16 than in the previous versions. That is why we do not use it with the spoofed client we use. Instead, we manage the number of active cores/threads with CPU usage. BTW I will remain at the outrage pub for about 1/2 hour, need to work early tomorrow, hope that will be enough to satisfy the SETI Gods and bring the servers back to life. Tried to find a virgin here to sacrifice at the volcano and that was impossible. ID: 95078

betreger Volunteer tester Help desk expert Joined: 18 Oct 14 Posts: 1472	Message 95079 - Posted: 15 Jan 2020, 2:44:51 UTC - in response to Message 95077. If things are not fixed soon Einstein here I come. ID: 95079

Joseph Stateson Volunteer tester Joined: 27 Jun 08 Posts: 641	Message 95080 - Posted: 15 Jan 2020, 3:00:27 UTC - in response to Message 95074. Last modified: 15 Jan 2020, 3:02:02 UTC My contention is that GPUs should never sit idle, regardless of any perceived debt. Apparently, the software feels otherwise. I'd be interested to see if you experience anything like this. Exactly what I have been looking at in the last 2 hours and trying to figure out. I had 4 GPUs idle that should have been running Einstein and the other 5 GPUs are running Milkyway. This system normally runs SETI and GPUgrid at 100% and Einstein at 0%. I added Milkyway at 0 and after a while the Einstein GPUs went idle. The work count in excess of 64 seems to be "lost work units" and I am guessing that number is not used when checking the GPU count. Both mining systems had a lot of "lost work units". However, I cannot account for something like 300 lost units. I only run Einstein when SETI is offline. I clicked on Einstein's "www host schedule log", which duplicates the info shown in the event viewer: "...lost tasks..." However, I also saw a strange message "..[CRITCAL] … two instances of the scheduler running.." or something to that wording. I am not running two instances of BOINC. The so-called "schedule" is an Einstein app that (my understanding) arranges to download database items, not just project work units. There is no reason for the 4 GPUs to be idle. I aborted the Milkyway tasks as I didn't want them stopping Einstein from running. Einstein then started up and, !INCREDIBLY! I got 3 GPUgrid work units. Probably been a week or more since any showed up. 7 of the 9 GPUs are at 100% utilization but I have 2 idle due to the CPU not having enough threads. ID: 95080

arkayn Joined: 21 Mar 09 Posts: 33	Message 95081 - Posted: 15 Jan 2020, 3:02:57 UTC - in response to Message 95079. If things are not fixed soon Einstein here I come. I am running some Collatz for now. ID: 95081

Jimbocous Joined: 1 Oct 15 Posts: 391	Message 95083 - Posted: 15 Jan 2020, 3:16:54 UTC - in response to Message 95080. Exactly what I have been looking at in the last 2 hours and trying to figure out. I had 4 GPUs idle that should have been running Einstein and the other 5 GPUs are running Milkyway. ... There is no reason for the 4 GPUs to be idle. Agreed. Well, at least now you know it isn't just you. I would fall back to 7.14.2, but the "finish file present too long" error was becoming annoying, even ignoring other factors. Thanks for the confirmation. ID: 95083

Joseph Stateson Volunteer tester Joined: 27 Jun 08 Posts: 641	Message 95084 - Posted: 15 Jan 2020, 3:18:53 UTC - in response to Message 95078. Last modified: 15 Jan 2020, 3:19:37 UTC I do not know what method or program you use to spoof the GPU count, but I can tell you for sure that max concurrent and the scheduler work totally differently (not broken) in BOINC 7.16 than in the previous versions. That is why we do not use it with the spoofed client we use. Instead, we manage the number of active cores/threads with CPU usage. BTW I will remain at the outrage pub for about 1/2 hour, need to work early tomorrow, hope that will be enough to satisfy the SETI Gods and bring the servers back to life. Tried to find a virgin here to sacrifice at the volcano and that was impossible. I made a change to my program as I had been applying the 64 to all projects. I am now using the project app_config and setting the number of GPUs depending on the project. Since this system has 9 GPUs, the config below just limits the count to 4 instead of 9. SETI still has 64 to get through the off-line time. However, the 4000 limit I use did not get me through the 13+ hours.
    root@h110btc:/var/lib/boinc/projects/einstein.phys.uwm.edu# cat app_config.xml
    <app_config>
      <app>
        <name>einstein_O2MDF</name>
        <max_concurrent>4</max_concurrent>
      </app>
      <spoofedgpus>4</spoofedgpus>
    </app_config>
I set the value in cs_scheduler:
    // update hardware info, and write host info
    //
    host_info.get_host_info(false);
    set_ncpus();
    iGPU = (gstate.spoof_gpus == -1) ? 0 : gstate.spoof_gpus;
    if(p->app_configs.spoofedgpus > 0) iGPU = p->app_configs.spoofedgpus;
    host_info.write(mf, !cc_config.suppress_net_info, false, iGPU);
ID: 95084
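
The override logic in the cs_scheduler fragment above can be isolated into a small standalone C++ function, which makes the precedence easier to see: a positive per-project `spoofedgpus` value wins, otherwise the global spoof value is used, with -1 mapped to 0. This is a sketch under stated assumptions, not the poster's actual patch. `ClientState`, `ProjectConfig`, and `reported_gpu_count` are hypothetical stand-ins for the modified client's structures, and what a value of 0 means to `host_info.write` is not shown in the post.

```cpp
#include <cassert>

// Hypothetical stand-ins for the state referenced in the post;
// these are not the real BOINC structures.
struct ClientState {
    int spoof_gpus;   // global spoofed GPU count; -1 is treated as "not set"
};
struct ProjectConfig {
    int spoofedgpus;  // per-project value from app_config.xml; 0 means no override
};

// Returns the GPU count handed to the host-info writer for one project:
// the per-project override when it is positive, otherwise the global
// spoof value, with the -1 sentinel mapped to 0.
int reported_gpu_count(const ClientState& gstate, const ProjectConfig& cfg) {
    assert(gstate.spoof_gpus >= -1);  // assumption: -1 is the only negative value used
    int count = (gstate.spoof_gpus == -1) ? 0 : gstate.spoof_gpus;
    if (cfg.spoofedgpus > 0) count = cfg.spoofedgpus;
    return count;
}
```

With a global value of 64 and the Einstein app_config shown in the post (`spoofedgpus` of 4), the function returns 4. A project with no override would still get 64, which matches the post's description of SETI keeping 64.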

Jimbocous Joined: 1 Oct 15 Posts: 391	Message 95085 - Posted: 15 Jan 2020, 3:19:55 UTC - in response to Message 95078. Last modified: 15 Jan 2020, 3:20:19 UTC ...(not broken)... I would suggest any situation where GPUs will sit idle when there is work they could be performing simply because it's not "the right work" indicates "broke". :) ID: 95085

Jimbocous Joined: 1 Oct 15 Posts: 391	Message 95087 - Posted: 15 Jan 2020, 3:25:48 UTC - in response to Message 95086. 15 hours. A fantastic outrage :-) Actually, it's closer to 21 hours now, if you consider the fact that it basically quit handing out work ~6 hrs before maintenance began ~0500 PST. ID: 95087

Keith Myers Volunteer tester Help desk expert Joined: 17 Nov 16 Posts: 868	Message 95088 - Posted: 15 Jan 2020, 3:29:21 UTC - in response to Message 95080. However, I also saw a strange message "..[CRITCAL] … two instances of the scheduler running.." or something to that wording. I am not running two instances of BOINC. The so-called "schedule" is an Einstein app that (my understanding) arranges to download database items, not just project work units. That would be the output of the Einstein "locality scheduling". They run very old server software that uses very different (from current BOINC) schedulers. The log output from Einstein can run to pages, reporting what is possible and not possible for various work units and the size of your cache. So it really messes up other projects' scheduling based on REC. ID: 95088

Keith Myers Volunteer tester Help desk expert Joined: 17 Nov 16 Posts: 868	Message 95089 - Posted: 15 Jan 2020, 3:32:22 UTC - in response to Message 95085. I would suggest any situation where GPUs will sit idle when there is work they could be performing simply because it's not "the right work" indicates "broke". :) But that doesn't fit David Anderson's definition of what is idle. This is caused by the changes in 7.16.3 that fixed the issue with max_concurrent and exclude_gpu. ID: 95089

betreger Volunteer tester Help desk expert Joined: 18 Oct 14 Posts: 1472	Message 95090 - Posted: 15 Jan 2020, 3:45:26 UTC - in response to Message 95081. Collatz has always made me feel stupid. ID: 95090

arkayn Joined: 21 Mar 09 Posts: 33	Message 95091 - Posted: 15 Jan 2020, 4:12:04 UTC - in response to Message 95090. Collatz has always made me feel stupid. 41 valid tasks and an RAC of 114k or so. ID: 95091

Joseph Stateson Volunteer tester Joined: 27 Jun 08 Posts: 641	Message 95092 - Posted: 15 Jan 2020, 4:26:56 UTC - in response to Message 95091. Collatz has always made me feel stupid. 41 valid tasks and an RAC of 114k or so. How about 320,000 credits every 5 and 1/2 seconds? http://www.ukboincteam.org.uk/newforum/viewtopic.php?t=6221 The project is good for credit points only and ranks up there with Bitcoin Utopia. No scientific value whatsoever, but that is just my honest opinion, worth about 2c. I did run up a lot of points on it and also on Bitcoin Utopia but could have been finding solutions for medical problems over at WCG or doing other more useful work. Again, just IMHO, but I didn't know better. ID: 95092

betreger Volunteer tester Help desk expert Joined: 18 Oct 14 Posts: 1472	Message 95093 - Posted: 15 Jan 2020, 5:11:07 UTC - in response to Message 95092. +1 ID: 95093

Dr Who Fan Joined: 10 May 07 Posts: 1348	Message 95094 - Posted: 15 Jan 2020, 5:29:27 UTC Anyone still sober and/or still awake? It is almost 2130 in the evening at Berkeley and SETI is still down! This Super Duper Massive Grand Mal Outrage is going to be one to remember for the recent history books. ID: 95094

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.