nanoHUB_at_home has failed every single task -- work time exceeded

Toombra
Joined: 29 Dec 08
Posts: 14
Canada
Message 99949 - Posted: 14 Jul 2020, 18:10:50 UTC

Science United has recently started sending nanoHUB_at_home work units to my computer. Unfortunately, not a single one of them completes. All show an estimated time of 3 min 47 sec, and all are aborted after 1 h 33 min of crunching time. Each work unit produces log entries similar to the following:

2020-07-10 02:27:29 AM | nanoHUB_at_home | Aborting task 07839998_45_0: exceeded elapsed time limit 5576.75 (86400.00G/15.81G)
2020-07-10 02:27:31 AM | nanoHUB_at_home | Aborting task 07839998_37_0: exceeded elapsed time limit 5576.75 (86400.00G/15.81G)
2020-07-10 02:27:46 AM | nanoHUB_at_home | Computation for task 07839998_45_0 finished
2020-07-10 02:27:46 AM | nanoHUB_at_home | Output file 07839998_45_0_r1958071825_0 for task 07839998_45_0 absent
2020-07-10 02:27:46 AM | nanoHUB_at_home | Computation for task 07839998_37_0 finished
2020-07-10 02:27:46 AM | nanoHUB_at_home | Output file 07839998_37_0_r1960253673_0 for task 07839998_37_0 absent
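
The time limit in the "Aborting task" lines can be checked against the 1 h 33 min mentioned above. The following is a minimal sketch that parses one such line with a regular expression and converts the limit to hours, minutes and seconds. The helper names (parse_abort, hms) are hypothetical, and reading the parenthesised pair as a numerator and denominator is an assumption: their quotient comes out at roughly 5465 s, close to but not equal to the printed limit of 5576.75 s.

```python
import re

# One of the abort lines from the log above, copied verbatim.
LINE = ("2020-07-10 02:27:29 AM | nanoHUB_at_home | Aborting task 07839998_45_0: "
        "exceeded elapsed time limit 5576.75 (86400.00G/15.81G)")

PATTERN = re.compile(
    r"Aborting task (?P<task>\S+): exceeded elapsed time limit "
    r"(?P<limit>[\d.]+) \((?P<num>[\d.]+)G/(?P<den>[\d.]+)G\)"
)

def parse_abort(line):
    """Return the task name and printed figures, or None if the line does not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return {
        "task": m["task"],
        "limit_s": float(m["limit"]),
        # Assumption: the bracketed values are a numerator and a denominator.
        "numerator_g": float(m["num"]),
        "denominator_g": float(m["den"]),
    }

def hms(seconds):
    """Format a duration in seconds, rounded to the nearest whole second."""
    total = round(seconds)
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    return f"{h}h {m:02d}min {s:02d}sec"

rec = parse_abort(LINE)
quotient = rec["numerator_g"] / rec["denominator_g"]
print(rec["task"], hms(rec["limit_s"]))
print(f"quotient of bracketed values: {quotient:.2f} s")
```

The limit of 5576.75 s rounds to 1 h 33 min, which matches the point at which the tasks are reported as timing out.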


https://imgur.com/a/fDPBmi4

While the computer is in use, BOINC can only process about 4 WUs at a time; the rest show "Waiting for memory", as per my preferences.

Specs are listed below for reference:

2020-07-10 02:40:08 AM |  | Starting BOINC client version 7.16.7 for windows_x86_64
2020-07-10 02:40:08 AM |  | log flags: file_xfer, sched_ops, task
2020-07-10 02:40:08 AM |  | Libraries: libcurl/7.47.1 OpenSSL/1.0.2s zlib/1.2.8
2020-07-10 02:40:09 AM |  | CUDA: NVIDIA GPU 0: GeForce GTX 1650 (driver version 451.48, CUDA version 11.0, compute capability 7.5, 4096MB, 3327MB available, 3037 GFLOPS peak)
2020-07-10 02:40:09 AM |  | OpenCL: NVIDIA GPU 0: GeForce GTX 1650 (driver version 451.48, device version OpenCL 1.2 CUDA, 4096MB, 3327MB available, 3037 GFLOPS peak)
2020-07-10 02:40:09 AM |  | Windows processor group 0: 12 processors
2020-07-10 02:40:09 AM |  | Processor: 12 AuthenticAMD AMD Ryzen 5 2600 Six-Core Processor [Family 23 Model 8 Stepping 2]
2020-07-10 02:40:09 AM |  | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 htt pni ssse3 fma cx16 sse4_1 sse4_2 movebe popcnt aes f16c rdrandsyscall nx lm avx avx2 svm sse4a osvw skinit wdt tce topx page1gb rdtscp fsgsbase bmi1 smep bmi2
2020-07-10 02:40:09 AM |  | OS: Microsoft Windows 10: Core x64 Edition, (10.00.18363.00)
2020-07-10 02:40:09 AM |  | Memory: 15.93 GB physical, 21.18 GB virtual
2020-07-10 02:40:09 AM |  | Disk: 931.51 GB total, 587.71 GB free
2020-07-10 02:40:09 AM |  | Local time is UTC -6 hours
2020-07-10 02:40:09 AM |  | No WSL found.
2020-07-10 02:40:09 AM |  | VirtualBox version: 6.0.22
2020-07-10 02:40:09 AM |  | General prefs: from https://scienceunited.org/ (last modified 26-Jun-2020 02:52:09)
2020-07-10 02:40:09 AM |  | Host location: none
2020-07-10 02:40:09 AM |  | General prefs: using your defaults
2020-07-10 02:40:09 AM |  | Reading preferences override file
2020-07-10 02:40:09 AM |  | Preferences:
2020-07-10 02:40:09 AM |  | max memory usage when active: 8974.23 MB
2020-07-10 02:40:09 AM |  | max memory usage when idle: 14685.10 MB
2020-07-10 02:40:09 AM |  | max disk usage: 592.37 GB
2020-07-10 02:40:09 AM |  | max CPUs used: 10
2020-07-10 02:40:09 AM |  | don't use GPU while active
2020-07-10 02:40:09 AM |  | suspend work if non-BOINC CPU load exceeds 35%
2020-07-10 02:40:09 AM |  | (to change preferences, visit a project web site or select Preferences in the Manager)
2020-07-10 02:40:09 AM |  | Setting up project and slot directories
2020-07-10 02:40:09 AM |  | Checking active tasks
2020-07-10 02:40:09 AM |  | Using account manager Science United
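
As a rough check of the memory figures in the startup log above, the sketch below relates the two memory limits to the physical memory reported. It assumes that "GB" in the log means 1024 MB, which the log does not state. The per-task figure is only an upper bound inferred from "about 4 WUs at a time", not a measured value.

```python
# Figures copied from the startup log above.
PHYSICAL_GB = 15.93          # "Memory: 15.93 GB physical"
MAX_ACTIVE_MB = 8974.23      # "max memory usage when active"
MAX_IDLE_MB = 14685.10       # "max memory usage when idle"
TASKS_WHILE_IN_USE = 4       # "about 4 WUs at a time" (approximate)

# Assumption: 1 GB in the log is 1024 MB.
physical_mb = PHYSICAL_GB * 1024

active_pct = 100 * MAX_ACTIVE_MB / physical_mb
idle_pct = 100 * MAX_IDLE_MB / physical_mb

# Upper bound on memory per task if four tasks fit within the in-use limit.
per_task_mb = MAX_ACTIVE_MB / TASKS_WHILE_IN_USE

print(f"in-use limit: {active_pct:.1f}% of physical memory")
print(f"idle limit:   {idle_pct:.1f}% of physical memory")
print(f"per-task budget while in use: at most about {per_task_mb:.0f} MB")
```

Under that assumption the limits work out to about 55% and 90% of physical memory, with at most roughly 2.2 GB available per task while the computer is in use.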


Since writing the above post, I've had more WUs downloaded with exactly the same issue. I'm manually aborting hundreds of them, as they would just waste processing time.
ID: 99949
Jim1348

Joined: 8 Nov 10
Posts: 310
United States
Message 99950 - Posted: 14 Jul 2020, 19:48:30 UTC - in response to Message 99949.  

The "Output file absent" error sometimes means that the disk drive is too slow to access that file. I don't see the problem on my SSDs. A disk write cache (PrimoCache) would also help.

Sometimes the newer versions of VirtualBox have similar problems.
I always use VBox 5.2.x.
https://www.virtualbox.org/wiki/Download_Old_Builds_5_2
ID: 99950
Nick Name

Joined: 14 Aug 19
Posts: 55
United States
Message 99955 - Posted: 14 Jul 2020, 21:16:43 UTC

First, I suggest you dump Science United and use the standard BOINC manager. That way you will have full control over what's running on your machines. You could then detach from the project permanently or set the project to No New Tasks until the problem is solved.

This sounds to me like the work is never actually running. If you are successfully running other VirtualBox work then this problem needs to be reported to the project. I'd at least check their forums for reports of problems / solutions.

If you aren't running other VB work, you should check the excellent LHC guide for VB work to make sure your machine is set up correctly. The info about specific VB versions is a bit dated, but the information overall is good. I also like to enable the VB window when I have a problem; I can usually see what it is from the log that's shown there. To do that, open your cc_config file and set vbox_window to 1:

<vbox_window>1</vbox_window>
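
For anyone unsure where that line goes, here is a minimal sketch that sets the flag in the text of a cc_config.xml file using Python's standard library. It assumes the flag belongs inside the <options> element under the <cc_config> root, which is worth checking against the BOINC client configuration documentation. The function name enable_vbox_window and the sample input are hypothetical.

```python
import xml.etree.ElementTree as ET

def enable_vbox_window(config_xml: str) -> str:
    """Return the config text with <vbox_window>1</vbox_window> set.

    Assumption: the flag lives under <cc_config><options>.
    """
    root = ET.fromstring(config_xml)
    if root.tag != "cc_config":
        raise ValueError("expected a <cc_config> root element")

    # Create <options> if the file does not have one yet.
    options = root.find("options")
    if options is None:
        options = ET.SubElement(root, "options")

    # Create the flag if it is missing, then set it to 1.
    flag = options.find("vbox_window")
    if flag is None:
        flag = ET.SubElement(options, "vbox_window")
    flag.text = "1"

    return ET.tostring(root, encoding="unicode")

# Hypothetical starting file with an empty <options> section.
example = "<cc_config><options></options></cc_config>"
print(enable_vbox_window(example))
```

The sketch only transforms a string, so the actual file still has to be read and written back, and the client has to re-read its config files or be restarted before the change takes effect.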

ID: 99955
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 99956 - Posted: 14 Jul 2020, 21:26:59 UTC - in response to Message 99950.  

The "Output file absent" error simply means that the science application finished - either by failure, or by forced closure - before it had time to write out the scientific result. I've never seen the minuscule difference in timing between a classic hard disk and an SSD cause this error - either the scientific data is present, or it isn't.

More likely, the nanoHUB application isn't making any real progress. BOINC invents a 'pseudoprogress' to reassure the user that something is happening, when in reality the science has got stuck. That possibility should be checked, but (for me) not at this time of night. I'll take a look in the morning.
ID: 99956
ProDigit

Joined: 8 Nov 19
Posts: 718
United States
Message 99960 - Posted: 14 Jul 2020, 23:56:10 UTC

WUs that need a lot of RAM usually run via Docker or a VM.
If your OS is Windows, it may have to run Linux through an emulation layer.
If you don't have enough RAM and the system is reading from swap, the WU may process very slowly.
ID: 99960

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.