Why am I being forced into two consecutive 24-hour delays with no work returned yet?

Message boards : Questions and problems : Why am I being forced into two consecutive 24-hour delays with no work returned yet?
Keith Myers
Joined: 17 Nov 16
Posts: 205
United States
Message 91261 - Posted: 29 Apr 2019, 16:52:00 UTC

Why am I being forced into two consecutive 24-hour delays with no work returned yet? I am trying to deploy a new app, and after returning all the sent work as errors, I was forced into a 24-hour delay.

However, after the first 24-hour delay expired and I requested work again, I was forced into another 24-hour delay. WHY?

I just got the same "reached daily quota of 11 tasks" message.

This is the host: https://einsteinathome.org/host/12775352
ID: 91261
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 3297
United Kingdom
Message 91263 - Posted: 29 Apr 2019, 17:31:26 UTC - in response to Message 91261.  
Last modified: 29 Apr 2019, 17:32:46 UTC

Einstein says you've exceeded your daily quota:

2019-04-29 01:11:56.3583 [PID=29900]    [send] stopping work search - daily quota exceeded (24>=11)
2019-04-29 01:11:56.3622 [PID=29900]    Sending reply to [HOST#12775352]: 0 results, delay req 85052.00
It's "until tomorrow", not quite the full 24 hours.

Your quota is low because you've errored every one of the 43 attempts so far. Try to use the remaining time to fix the error; then the quota will start increasing each time you return a successful completion.

Edit - you're missing a library:

../../projects/einstein.phys.uwm.edu/einsteinbinary_cuda64: error while loading shared libraries: libcufft.so.8.0: cannot open shared object file: No such file or directory
ID: 91263
Keith Myers
Joined: 17 Nov 16
Posts: 205
United States
Message 91265 - Posted: 29 Apr 2019, 18:50:24 UTC - in response to Message 91263.  
Last modified: 29 Apr 2019, 19:03:08 UTC

I think I fixed the issue of the missing libraries a couple of days ago by using the trick of putting the libcufft and libcudart libraries directly in the project directory, like we did with the CUDA80 SETI MB app.

None of the normal and supposedly proper ways to link the system's stock libcudart and libcufft libraries to the names required by the app worked. Neither did exporting LD_LIBRARY_PATH.
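For reference, a minimal sketch of that LD_LIBRARY_PATH attempt (the CUDA path is taken from the ldd output later in the thread and may differ on your system):

```shell
# Sketch: prepend the CUDA library directory to the dynamic loader's
# search path before starting the client from the same shell.
# /usr/local/cuda/lib64 is an assumption; adjust for your install.
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
echo "$LD_LIBRARY_PATH"
# ...then start the client from this same shell, e.g.:  ./boinc
```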

But without any work to test with I still don't know if putting the files directly in the same directory as the application will work. The app is working for the developer, I just haven't figured out why it won't work on my system yet.

I didn't realize that the total of failed work units was what counted against the daily quota limit.

Question: what could I possibly do if I had downloaded 500 tasks the first time and instantly errored them all out? Would I have had to wait 45 days before getting work from the project again? I only have my normal 0.5 days of work cache for this host, the same as all my hosts.

[Edit] I just reduced the host's venue down to 0.1 day of cache. Hope that only retrieves 1 or 2 tasks in case they fail again.
ID: 91265
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 3297
United Kingdom
Message 91266 - Posted: 29 Apr 2019, 19:22:24 UTC - in response to Message 91265.  

By default, the daily quota reduces to a minimum of one task per day - that gives you an escape route. I imagine the same applies to Einstein, although they have heavily customised their code.
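A toy model of that mechanism, as a sketch only (stock BOINC roughly lowers the quota per errored result and raises it per valid one, clamped between 1 and the project's configured maximum; the exact factors, and Einstein's heavily customised server code, may differ):

```shell
# Toy model of a daily-quota scheme (assumed behaviour for illustration,
# not the actual Einstein server code): errors halve the quota down to a
# floor of 1; successes double it up to the project maximum.
max_quota=32
quota=$max_quota
for result in error error error error error error; do
    if [ "$result" = success ]; then
        quota=$(( quota * 2 ))
        [ "$quota" -gt "$max_quota" ] && quota=$max_quota
    else
        quota=$(( quota / 2 ))
        [ "$quota" -lt 1 ] && quota=1
    fi
done
echo "$quota"   # six straight errors drive the quota to its floor of 1
```

The floor of one task per day is what provides the escape route: a single successful return starts the quota climbing again.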

Is this the machine you are running under Anonymous Platform (app_info.xml)? If so, ensure that any extra file you put in the project directory is properly declared and referenced - otherwise it may not be available when the app is actually run from the slot directory.
ID: 91266
Keith Myers
Joined: 17 Nov 16
Posts: 205
United States
Message 91269 - Posted: 29 Apr 2019, 22:05:28 UTC - in response to Message 91266.  

And a big thank you for that - I had forgotten to do it. I am updating the app_info right now to add those file references to the existing ones. At least I did this before the next attempt at running the app, when I can get new work again.
ID: 91269
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 3297
United Kingdom
Message 91272 - Posted: 29 Apr 2019, 22:15:35 UTC - in response to Message 91269.  

When I was preparing the AIstub files under Windows, I used Dependency Walker to see which libraries were needed. I don't know the name of the equivalent tool under Linux, but there must be one.

The results of not checking can still be seen at SETI Beta message 39386
ID: 91272
Keith Myers
Joined: 17 Nov 16
Posts: 205
United States
Message 91273 - Posted: 30 Apr 2019, 3:05:13 UTC

Help, Richard! Maybe you can tell me why I am still unable to run any GPU tasks. I can start them, but they error out with "disk limit exceeded", even though I have now increased the Home venue disk usage twice. The amount needed didn't change.

keith@Nano:~/boinc$ ./boinc
29-Apr-2019 19:51:32 [---] Starting BOINC client version 7.9.3 for aarch64-unknown-linux-gnu
29-Apr-2019 19:51:32 [---] log flags: file_xfer, sched_ops, task, cpu_sched, sched_op_debug
29-Apr-2019 19:51:32 [---] Libraries: libcurl/7.58.0 OpenSSL/1.1.0g zlib/1.2.11 libidn2/2.0.4 libpsl/0.19.1 (+libidn2/2.0.4) nghttp2/1.30.0 librtmp/2.3
29-Apr-2019 19:51:32 [---] Data directory: /home/keith/boinc
29-Apr-2019 19:51:32 [---] CUDA: NVIDIA GPU 0: NVIDIA Tegra X1 (driver version unknown, CUDA version 10.0, compute capability 5.3, 3957MB, 2179MB available, 236 GFLOPS peak)
29-Apr-2019 19:51:32 [Einstein@Home] Found app_info.xml; using anonymous platform
29-Apr-2019 19:51:33 [---] [libc detection] gathered: 2.27, Ubuntu GLIBC 2.27-3ubuntu1
29-Apr-2019 19:51:33 [---] Host name: Nano
29-Apr-2019 19:51:33 [---] Processor: 4 ARM ARMv8 Processor rev 1 (v8l) [Impl 0x41 Arch 8 Variant 0x1 Part 0xd07 Rev 1]
29-Apr-2019 19:51:33 [---] Processor features: fp asimd evtstrm aes pmull sha1 sha2 crc32
29-Apr-2019 19:51:33 [---] OS: Linux Ubuntu: Ubuntu 18.04.2 LTS [4.9.140-tegra|libc 2.27 (Ubuntu GLIBC 2.27-3ubuntu1)]
29-Apr-2019 19:51:33 [---] Memory: 3.86 GB physical, 0 bytes virtual
29-Apr-2019 19:51:33 [---] Disk: 29.21 GB total, 17.65 GB free
29-Apr-2019 19:51:33 [---] Local time is UTC -7 hours
29-Apr-2019 19:51:33 [---] Config: GUI RPC allowed from any host
29-Apr-2019 19:51:33 [---] Config: GUI RPCs allowed from:
29-Apr-2019 19:51:33 [---] 192.168.2.34
29-Apr-2019 19:51:33 [---] Config: report completed tasks immediately
29-Apr-2019 19:51:33 [Einstein@Home] URL http://einstein.phys.uwm.edu/; Computer ID 12775352; resource share 25
29-Apr-2019 19:51:33 [SETI@home] URL http://setiathome.berkeley.edu/; Computer ID 8707387; resource share 75
29-Apr-2019 19:51:33 [Einstein@Home] General prefs: from Einstein@Home (last modified ---)
29-Apr-2019 19:51:33 [Einstein@Home] Computer location: home
29-Apr-2019 19:51:33 [---] General prefs: using separate prefs for home
29-Apr-2019 19:51:33 [---] Reading preferences override file
29-Apr-2019 19:51:33 [---] Preferences:
29-Apr-2019 19:51:33 [---] max memory usage when active: 1978.28 MB
29-Apr-2019 19:51:33 [---] max memory usage when idle: 3560.91 MB
29-Apr-2019 19:51:33 [---] max disk usage: 10.00 GB
29-Apr-2019 19:51:33 [---] max CPUs used: 2
29-Apr-2019 19:51:33 [---] suspend work if non-BOINC CPU load exceeds 25%
29-Apr-2019 19:51:33 [---] (to change preferences, visit a project web site or select Preferences in the Manager)
29-Apr-2019 19:51:33 [---] Setting up project and slot directories
29-Apr-2019 19:51:33 [---] Checking active tasks
29-Apr-2019 19:51:33 [---] Setting up GUI RPC socket
29-Apr-2019 19:51:33 [---] gui_rpc_auth.cfg is empty - no GUI RPC password protection
29-Apr-2019 19:51:33 [---] Checking presence of 63 project files
29-Apr-2019 19:51:33 Initialization completed
29-Apr-2019 19:51:33 [Einstein@Home] [sched_op] Starting scheduler request
29-Apr-2019 19:51:33 [Einstein@Home] Sending scheduler request: To report completed tasks.
29-Apr-2019 19:51:33 [Einstein@Home] Reporting 5 completed tasks
29-Apr-2019 19:51:33 [Einstein@Home] Requesting new tasks for NVIDIA GPU
29-Apr-2019 19:51:33 [Einstein@Home] [sched_op] CPU work request: 0.00 seconds; 0.00 devices
29-Apr-2019 19:51:33 [Einstein@Home] [sched_op] NVIDIA GPU work request: 9504.00 seconds; 1.00 devices
29-Apr-2019 19:51:37 [Einstein@Home] Scheduler request completed: got 6 new tasks
29-Apr-2019 19:51:37 [Einstein@Home] [sched_op] Server version 611
29-Apr-2019 19:51:37 [Einstein@Home] Project requested delay of 60 seconds
29-Apr-2019 19:51:37 [Einstein@Home] New computer location: home
29-Apr-2019 19:51:37 [Einstein@Home] General prefs: from Einstein@Home (last modified ---)
29-Apr-2019 19:51:37 [Einstein@Home] Computer location: home
29-Apr-2019 19:51:37 [---] General prefs: using separate prefs for home
29-Apr-2019 19:51:37 [---] Reading preferences override file
29-Apr-2019 19:51:37 [---] Preferences:
29-Apr-2019 19:51:37 [---] max memory usage when active: 1978.28 MB
29-Apr-2019 19:51:37 [---] max memory usage when idle: 3560.91 MB
29-Apr-2019 19:51:37 [---] max disk usage: 10.00 GB
29-Apr-2019 19:51:37 [---] Number of usable CPUs has changed from 2 to 3.
29-Apr-2019 19:51:37 [---] max CPUs used: 3
29-Apr-2019 19:51:37 [---] suspend work if non-BOINC CPU load exceeds 25%
29-Apr-2019 19:51:37 [---] (to change preferences, visit a project web site or select Preferences in the Manager)
29-Apr-2019 19:51:37 [Einstein@Home] [sched_op] estimated total CPU task duration: 0 seconds
29-Apr-2019 19:51:37 [Einstein@Home] [sched_op] estimated total NVIDIA GPU task duration: 7473 seconds
29-Apr-2019 19:51:37 [Einstein@Home] [sched_op] handle_scheduler_reply(): got ack for task p2030.20170414.G44.61-02.33.N.b6s0g0.00000_1388_0
29-Apr-2019 19:51:37 [Einstein@Home] [sched_op] handle_scheduler_reply(): got ack for task p2030.20170414.G44.61-02.33.N.b6s0g0.00000_1389_0
29-Apr-2019 19:51:37 [Einstein@Home] [sched_op] handle_scheduler_reply(): got ack for task p2030.20170414.G44.61-02.33.N.b6s0g0.00000_1387_1
29-Apr-2019 19:51:37 [Einstein@Home] [sched_op] handle_scheduler_reply(): got ack for task p2030.20170414.G44.61-02.33.N.b6s0g0.00000_1391_0
29-Apr-2019 19:51:37 [Einstein@Home] [sched_op] handle_scheduler_reply(): got ack for task p2030.20170414.G44.61-02.33.N.b6s0g0.00000_1390_0
29-Apr-2019 19:51:37 [Einstein@Home] [sched_op] Deferring communication for 00:01:00
29-Apr-2019 19:51:37 [Einstein@Home] [sched_op] Reason: requested by project
29-Apr-2019 19:51:39 [Einstein@Home] Started download of p2030.20170414.G44.77-01.19.C.b0s0g0.00000_1646.bin4
29-Apr-2019 19:51:39 [Einstein@Home] Started download of p2030.20170414.G44.77-01.19.C.b0s0g0.00000.zap
29-Apr-2019 19:51:42 [Einstein@Home] Finished download of p2030.20170414.G44.77-01.19.C.b0s0g0.00000.zap
29-Apr-2019 19:51:42 [Einstein@Home] Started download of p2030.20170414.G44.77-01.19.C.b0s0g0.00000_1647.bin4
29-Apr-2019 19:51:44 [Einstein@Home] Finished download of p2030.20170414.G44.77-01.19.C.b0s0g0.00000_1646.bin4
29-Apr-2019 19:51:44 [Einstein@Home] Started download of p2030.20170414.G44.61-02.33.N.b5s0g0.00000_99.bin4
29-Apr-2019 19:51:44 [Einstein@Home] Starting task p2030.20170414.G44.77-01.19.C.b0s0g0.00000_1646_1
29-Apr-2019 19:51:44 [Einstein@Home] [cpu_sched] Starting task p2030.20170414.G44.77-01.19.C.b0s0g0.00000_1646_1 using einsteinbinary_BRP4 version 999 in slot 3
29-Apr-2019 19:51:46 [Einstein@Home] Aborting task p2030.20170414.G44.77-01.19.C.b0s0g0.00000_1646_1: exceeded disk limit: 127.11MB > 19.07MB
29-Apr-2019 19:51:46 [Einstein@Home] [sched_op] Deferring communication for 00:01:28
29-Apr-2019 19:51:46 [Einstein@Home] [sched_op] Reason: Unrecoverable error for task p2030.20170414.G44.77-01.19.C.b0s0g0.00000_1646_1
29-Apr-2019 19:51:47 [Einstein@Home] Finished download of p2030.20170414.G44.61-02.33.N.b5s0g0.00000_99.bin4
29-Apr-2019 19:51:47 [Einstein@Home] Started download of p2030.20170414.G44.61-02.33.N.b5s0g0.00000.zap
29-Apr-2019 19:51:47 [Einstein@Home] Computation for task p2030.20170414.G44.77-01.19.C.b0s0g0.00000_1646_1 finished
29-Apr-2019 19:51:47 [Einstein@Home] Output file p2030.20170414.G44.77-01.19.C.b0s0g0.00000_1646_1_0 for task p2030.20170414.G44.77-01.19.C.b0s0g0.00000_1646_1 absent
29-Apr-2019 19:51:48 [Einstein@Home] Finished download of p2030.20170414.G44.61-02.33.N.b5s0g0.00000.zap
29-Apr-2019 19:51:48 [Einstein@Home] Started download of p2030.20170414.G44.61-02.33.N.b5s0g0.00000_100.bin4
29-Apr-2019 19:51:48 [Einstein@Home] Starting task p2030.20170414.G44.61-02.33.N.b5s0g0.00000_99_0
29-Apr-2019 19:51:48 [Einstein@Home] [cpu_sched] Starting task p2030.20170414.G44.61-02.33.N.b5s0g0.00000_99_0 using einsteinbinary_BRP4 version 999 in slot 3
29-Apr-2019 19:51:50 [Einstein@Home] Aborting task p2030.20170414.G44.61-02.33.N.b5s0g0.00000_99_0: exceeded disk limit: 127.11MB > 19.07MB
29-Apr-2019 19:51:50 [Einstein@Home] [sched_op] Deferring communication for 00:03:55
29-Apr-2019 19:51:50 [Einstein@Home] [sched_op] Reason: Unrecoverable error for task p2030.20170414.G44.61-02.33.N.b5s0g0.00000_99_0
29-Apr-2019 19:51:51 [Einstein@Home] Finished download of p2030.20170414.G44.77-01.19.C.b0s0g0.00000_1647.bin4
29-Apr-2019 19:51:51 [Einstein@Home] Started download of p2030.20170414.G44.61-02.33.N.b5s0g0.00000_101.bin4
29-Apr-2019 19:51:51 [Einstein@Home] Computation for task p2030.20170414.G44.61-02.33.N.b5s0g0.00000_99_0 finished
29-Apr-2019 19:51:51 [Einstein@Home] Output file p2030.20170414.G44.61-02.33.N.b5s0g0.00000_99_0_0 for task p2030.20170414.G44.61-02.33.N.b5s0g0.00000_99_0 absent
29-Apr-2019 19:51:51 [Einstein@Home] Starting task p2030.20170414.G44.77-01.19.C.b0s0g0.00000_1647_0
29-Apr-2019 19:51:51 [Einstein@Home] [cpu_sched] Starting task p2030.20170414.G44.77-01.19.C.b0s0g0.00000_1647_0 using einsteinbinary_BRP4 version 999 in slot 3
29-Apr-2019 19:51:52 [Einstein@Home] Aborting task p2030.20170414.G44.77-01.19.C.b0s0g0.00000_1647_0: exceeded disk limit: 127.11MB > 19.07MB
29-Apr-2019 19:51:52 [Einstein@Home] [sched_op] Deferring communication for 00:07:33
29-Apr-2019 19:51:52 [Einstein@Home] [sched_op] Reason: Unrecoverable error for task p2030.20170414.G44.77-01.19.C.b0s0g0.00000_1647_0
29-Apr-2019 19:51:52 [Einstein@Home] Finished download of p2030.20170414.G44.61-02.33.N.b5s0g0.00000_100.bin4
29-Apr-2019 19:51:52 [Einstein@Home] Started download of p2030.20170414.G44.61-02.33.N.b5s0g0.00000_102.bin4
29-Apr-2019 19:51:52 [Einstein@Home] Starting task p2030.20170414.G44.61-02.33.N.b5s0g0.00000_100_0
29-Apr-2019 19:51:52 [Einstein@Home] [cpu_sched] Starting task p2030.20170414.G44.61-02.33.N.b5s0g0.00000_100_0 using einsteinbinary_BRP4 version 999 in slot 4
29-Apr-2019 19:52:01 [Einstein@Home] Aborting task p2030.20170414.G44.61-02.33.N.b5s0g0.00000_100_0: exceeded disk limit: 127.11MB > 19.07MB
29-Apr-2019 19:52:01 [Einstein@Home] [sched_op] Deferring communication for 00:13:53
29-Apr-2019 19:52:01 [Einstein@Home] [sched_op] Reason: Unrecoverable error for task p2030.20170414.G44.61-02.33.N.b5s0g0.00000_100_0
29-Apr-2019 19:52:01 [Einstein@Home] Computation for task p2030.20170414.G44.77-01.19.C.b0s0g0.00000_1647_0 finished
29-Apr-2019 19:52:01 [Einstein@Home] Output file p2030.20170414.G44.77-01.19.C.b0s0g0.00000_1647_0_0 for task p2030.20170414.G44.77-01.19.C.b0s0g0.00000_1647_0 absent
29-Apr-2019 19:52:13 [Einstein@Home] Computation for task p2030.20170414.G44.61-02.33.N.b5s0g0.00000_100_0 finished
29-Apr-2019 19:52:13 [Einstein@Home] Output file p2030.20170414.G44.61-02.33.N.b5s0g0.00000_100_0_0 for task p2030.20170414.G44.61-02.33.N.b5s0g0.00000_100_0 absent
29-Apr-2019 19:52:14 [Einstein@Home] Finished download of p2030.20170414.G44.61-02.33.N.b5s0g0.00000_101.bin4
29-Apr-2019 19:52:14 [Einstein@Home] Finished download of p2030.20170414.G44.61-02.33.N.b5s0g0.00000_102.bin4
29-Apr-2019 19:52:14 [Einstein@Home] Starting task p2030.20170414.G44.61-02.33.N.b5s0g0.00000_101_0
29-Apr-2019 19:52:14 [Einstein@Home] [cpu_sched] Starting task p2030.20170414.G44.61-02.33.N.b5s0g0.00000_101_0 using einsteinbinary_BRP4 version 999 in slot 3
29-Apr-2019 19:52:15 [Einstein@Home] Aborting task p2030.20170414.G44.61-02.33.N.b5s0g0.00000_101_0: exceeded disk limit: 127.11MB > 19.07MB
29-Apr-2019 19:52:15 [Einstein@Home] [sched_op] Deferring communication for 00:19:53
29-Apr-2019 19:52:15 [Einstein@Home] [sched_op] Reason: Unrecoverable error for task p2030.20170414.G44.61-02.33.N.b5s0g0.00000_101_0
29-Apr-2019 19:52:16 [Einstein@Home] Computation for task p2030.20170414.G44.61-02.33.N.b5s0g0.00000_101_0 finished
29-Apr-2019 19:52:16 [Einstein@Home] Output file p2030.20170414.G44.61-02.33.N.b5s0g0.00000_101_0_0 for task p2030.20170414.G44.61-02.33.N.b5s0g0.00000_101_0 absent
29-Apr-2019 19:52:16 [Einstein@Home] Starting task p2030.20170414.G44.61-02.33.N.b5s0g0.00000_102_0
29-Apr-2019 19:52:16 [Einstein@Home] [cpu_sched] Starting task p2030.20170414.G44.61-02.33.N.b5s0g0.00000_102_0 using einsteinbinary_BRP4 version 999 in slot 3
29-Apr-2019 19:52:17 [Einstein@Home] Aborting task p2030.20170414.G44.61-02.33.N.b5s0g0.00000_102_0: exceeded disk limit: 127.11MB > 19.07MB
29-Apr-2019 19:52:17 [Einstein@Home] [sched_op] Deferring communication for 00:48:42
29-Apr-2019 19:52:17 [Einstein@Home] [sched_op] Reason: Unrecoverable error for task p2030.20170414.G44.61-02.33.N.b5s0g0.00000_102_0
29-Apr-2019 19:52:19 [Einstein@Home] Computation for task p2030.20170414.G44.61-02.33.N.b5s0g0.00000_102_0 finished
29-Apr-2019 19:52:19 [Einstein@Home] Output file p2030.20170414.G44.61-02.33.N.b5s0g0.00000_102_0_0 for task p2030.20170414.G44.61-02.33.N.b5s0g0.00000_102_0 absent
^C29-Apr-2019 19:52:44 [---] Received signal 2
29-Apr-2019 19:52:45 [---] Exiting
keith@Nano:~/boinc$
ID: 91273
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 3297
United Kingdom
Message 91274 - Posted: 30 Apr 2019, 8:46:14 UTC - in response to Message 91273.  

You mean lines like

29-Apr-2019 19:51:46 [Einstein@Home] Aborting task p2030.20170414.G44.77-01.19.C.b0s0g0.00000_1646_1: exceeded disk limit: 127.11MB > 19.07MB
I think that will be an individual limit on the task - you wouldn't have a limit below 20 MB on any global or system disk usage. Looking on my machine, I can see:

<workunit>
    <name>p2030.20170414.G44.61-02.33.N.b5s0g0.00000_1438</name>
    <app_name>einsteinbinary_BRP4</app_name>
    <version_num>134</version_num>
    <rsc_fpops_est>17500000000000.000000</rsc_fpops_est>
    <rsc_fpops_bound>350000000000000.000000</rsc_fpops_bound>
    <rsc_memory_bound>260000000.000000</rsc_memory_bound>
    <rsc_disk_bound>20000000.000000</rsc_disk_bound>
...
That last line - <rsc_disk_bound> of 20,000,000 bytes - is the culprit: it translates to 19.07 binary MiB.
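The arithmetic, for anyone checking (1 MiB = 2^20 = 1,048,576 bytes):

```shell
# <rsc_disk_bound> is given in bytes; the client reports the limit
# in binary mebibytes (MiB).
awk 'BEGIN { printf "%.2f MiB\n", 20000000 / 2^20 }'
# prints: 19.07 MiB
```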

Why such big disk usage? My guess is that you've put a <copy_file/> on that big cuFFT library, so the whole darn thing is copied into the slot folder and counts towards disk usage. Try just allowing BOINC to create a softlink as usual (remove <copy_file/>, leave the rest of app_info unchanged) and run a test task. I think Linux should be able to follow softlinks, but I don't know for sure where libraries are concerned.

If it doesn't work, you'll have to find a way of installing the FFT library in such a way that Linux finds it by whatever it uses as a 'PATH' equivalent. My Beta task from ten years ago actually succeeded where I'd expected it to fail, because I'd installed NVidia's developer SDK and the system had a way of finding the library in that package. Drivers don't install FFT support, but the developer tools do - and it has to be exactly the right version, which changes with every CUDA release.
ID: 91274
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 3297
United Kingdom
Message 91275 - Posted: 30 Apr 2019, 9:04:03 UTC

The Windows and Linux processes are different. BOINC started by using Linux conventions, so you may be in luck, but I remember having to introduce David Anderson to

https://docs.microsoft.com/en-us/windows/desktop/dlls/dynamic-link-library-search-order#search-order-for-desktop-applications

when the Linux techniques failed under Windows. You may have to reverse the process.
ID: 91275
Keith Myers
Joined: 17 Nov 16
Posts: 205
United States
Message 91284 - Posted: 30 Apr 2019, 14:16:43 UTC - in response to Message 91274.  

I tried for the first two days to make soft links and to export PATH and LD_LIBRARY_PATH, with no success. That is why all my first tasks had the missing libcufft and libcudart library errors.
I finally decided to try just copying the system CUDA10 libraries to the names the application needed and referencing them directly in app_info.

The system already comes with all the necessary libraries pre-installed. This is a developer-kit system image made for developing apps, so CUDA 10, cuDNN, GCC, and the C/C++ toolchains are already there. I shouldn't have to install another version of CUDA.

Since I no longer had any app_info version that used CUDA file references, I just patterned the CUDA references after the references for the .dev files in the original app_info. That included the file copies.

Thanks for explaining that <copy_file/> was the culprit, since I didn't know or understand what it did. I have removed the file copies from app_info. Now I just have to wait out the penalty box again.
ID: 91284
Keith Myers
Joined: 17 Nov 16
Posts: 205
United States
Message 91288 - Posted: 30 Apr 2019, 14:28:29 UTC - in response to Message 91275.  

The Windows and Linux processes are different. BOINC started by using Linux conventions, so you may be in luck, but I remember having to introduce David Anderson to

https://docs.microsoft.com/en-us/windows/desktop/dlls/dynamic-link-library-search-order#search-order-for-desktop-applications

when the Linux techniques failed under Windows. You may have to reverse the process.


I think the problem comes from having to use the repo version of BOINC, with its stranglehold on group and user ownership. I checked every time to make sure the executable had all its dependencies satisfied, and they were, in every location where the application was loaded. But when it came to actually running the client, it always failed to find the CUDA8 libraries.

I should have just compiled a new BOINC for the aarch64 platform on my own and placed it in /home, like I do with all my x86_64 hosts. That makes it so much easier to use BOINC and to edit and move files.

I finally had enough, stripped out the main BOINC files, moved them to a new boinc folder in /home, and removed all the init files and dynamic links scattered over the system that referenced the old repo locations.

I can finally run BOINC from /home, as is normal for me. After I am able to process a task correctly and get rid of my low daily allowance, I will try once more to make soft (symbolic) links to the CUDA8 libraries. With BOINC in /home now, and /home owned by $USER, I think the symbolic links will probably work as expected.
ID: 91288
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 3297
United Kingdom
Message 91289 - Posted: 30 Apr 2019, 15:15:38 UTC

Problems that require <copy_file/> to solve also arise when multiple different versions exist and have to be stored under different names in the filing system, but programmers are told to build their apps expecting a simple, generic, name. It's such a bloomin' nuisance that it has its own special name under Windows - 'DLL Hell'.

NVidia were guilty of it in the earliest days - recycling the generic cudart.dll and cufft.dll over several (incompatible) generations. Under Windows, they've learned their lesson, and now use a strongly versioned name for each new release.

It can be necessary to do a <copy_file/> with rename (<file_name> / <open_name>) to unambiguously resolve the confusion, but let's hope not. It's another reason to use Dependency Walker, to find out exactly which version of the filename has been embedded in the binary. Far more robust than using a Hex editor for the same purpose...
ID: 91289
Keith Myers
Joined: 17 Nov 16
Posts: 205
United States
Message 91295 - Posted: 30 Apr 2019, 16:52:08 UTC - in response to Message 91289.  
Last modified: 30 Apr 2019, 16:57:25 UTC

Well, since this is Linux and not Windows, Dependency Walker is of no use. To find the dependencies of any executable in Linux, all you need to do is run
ldd <application name>
in a terminal.

keith@Nano:~/boinc/projects/einstein.phys.uwm.edu$ ldd einsteinbinary_cuda64
	linux-vdso.so.1 (0x0000007f86d1f000)
	libcufft.so.8.0 => /usr/local/cuda/lib64/libcufft.so.8.0 (0x0000007f7ede5000)
	libcuda.so.1 => /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1 (0x0000007f7de9f000)
	libcudart.so.8.0 => /usr/local/cuda/lib64/libcudart.so.8.0 (0x0000007f7de2e000)
	libpthread.so.0 => /lib/aarch64-linux-gnu/libpthread.so.0 (0x0000007f7de02000)
	libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000007f7dca9000)
	/lib/ld-linux-aarch64.so.1 (0x0000007f86cf4000)
	libstdc++.so.6 => /usr/lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000007f7db14000)
	libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000007f7da5a000)
	libdl.so.2 => /lib/aarch64-linux-gnu/libdl.so.2 (0x0000007f7da45000)
	librt.so.1 => /lib/aarch64-linux-gnu/librt.so.1 (0x0000007f7da2e000)
	libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000007f7da0a000)
	libnvrm_gpu.so => /usr/lib/aarch64-linux-gnu/tegra/libnvrm_gpu.so (0x0000007f7d9c7000)
	libnvrm.so => /usr/lib/aarch64-linux-gnu/tegra/libnvrm.so (0x0000007f7d985000)
	libnvrm_graphics.so => /usr/lib/aarch64-linux-gnu/tegra/libnvrm_graphics.so (0x0000007f7d966000)
	libnvidia-fatbinaryloader.so.32.1.0 => /usr/lib/aarch64-linux-gnu/tegra/libnvidia-fatbinaryloader.so.32.1.0 (0x0000007f7d908000)
	libnvos.so => /usr/lib/aarch64-linux-gnu/tegra/libnvos.so (0x0000007f7d8ea000)


These are the symbolic links for the CUDA libraries the app needs:

keith@Nano:/usr/local/cuda/lib64$ ls -l
lrwxrwxrwx 1 root root        21 Apr 26 19:15 libcudart.so.8.0 -> libcudart.so.10.0.166
lrwxrwxrwx 1 root root        20 Apr 26 19:15 libcufft.so.8.0 -> libcufft.so.10.0.166
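For anyone reproducing this, links with those compatibility names can be created as below. This sketch uses a scratch directory with stand-in files; on the real system the targets are the installed 10.x libraries, and creating links under /usr/local/cuda/lib64 needs root:

```shell
# Demonstration in a scratch directory: give a 10.x-named library an
# 8.0-named symlink, so an app linked against libcufft.so.8.0 resolves it.
dir=$(mktemp -d)
cd "$dir"
touch libcufft.so.10.0.166 libcudart.so.10.0.166   # stand-ins for the real libraries
ln -sf libcufft.so.10.0.166 libcufft.so.8.0
ln -sf libcudart.so.10.0.166 libcudart.so.8.0
ls -l libcufft.so.8.0 libcudart.so.8.0
```

Whether the dynamic loader accepts the renamed library depends on the two CUDA versions actually being ABI-compatible, which is not guaranteed across major releases.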
ID: 91295
Keith Myers
Joined: 17 Nov 16
Posts: 205
United States
Message 91296 - Posted: 30 Apr 2019, 17:11:52 UTC

This is the app_info I am going to attempt to use:

<app_info>

  <app>
    <name>einsteinbinary_BRP4</name>
  </app>
  <file_info>
    <name>einsteinbinary_cuda64</name>
    <executable/>
  </file_info>
  <file_info>
    <name>einsteinbinary_cuda-db.dev</name>
  </file_info>
  <file_info>
    <name>einsteinbinary_cuda-dbhs.dev</name>
  </file_info>
  <file_info>
    <name>libcufft.so.8.0</name>
  </file_info>
  <file_info>
    <name>libcudart.so.8.0</name>
  </file_info>
  <app_version>
    <app_name>einsteinbinary_BRP4</app_name>
    <version_num>999</version_num>
    <api_version>7.2.2</api_version>
    <coproc>
      <type>CUDA</type>
      <count>1.0</count>
    </coproc>
    <file_ref>
      <file_name>einsteinbinary_cuda64</file_name>
      <main_program/>
    </file_ref>
    <file_ref>
      <file_name>einsteinbinary_cuda-db.dev</file_name>
      <open_name>db.dev</open_name>
      <copy_file/>
    </file_ref>
    <file_ref>
      <file_name>einsteinbinary_cuda-dbhs.dev</file_name>
      <open_name>dbhs.dev</open_name>
      <copy_file/>
    </file_ref>
    <file_ref>
      <file_name>libcufft.so.8.0</file_name>
    </file_ref>
    <file_ref>
      <file_name>libcudart.so.8.0</file_name>
    </file_ref>
  </app_version>

</app_info>


ID: 91296
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 3297
United Kingdom
Message 91297 - Posted: 30 Apr 2019, 17:38:45 UTC - in response to Message 91296.  

Fingers crossed. It might be wise to mark those library files as

<executable/>

as well - they certainly contain binary code which is going to be executed.
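That would make the library declarations look like this (a sketch of the suggestion, not a tested configuration):

```xml
  <file_info>
    <name>libcufft.so.8.0</name>
    <executable/>
  </file_info>
  <file_info>
    <name>libcudart.so.8.0</name>
    <executable/>
  </file_info>
```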
ID: 91297
Keith Myers
Joined: 17 Nov 16
Posts: 205
United States
Message 91302 - Posted: 30 Apr 2019, 22:31:32 UTC - in response to Message 91297.  

Fingers crossed. It might be wise to mark those library files as

<executable/>

as well - they certainly contain binary code which is going to be executed.

I don't think they work that way. At least, they weren't marked executable in the CUDA90 special-app app_info, which I managed to find lying around on a forgotten disk. I looked at how the CUDA90 libcufft and libcudart libraries were referenced there and used that app_info as a pattern. They certainly worked well for that app.

Agreed, fingers crossed. I think it will work this time, once I can get work again.
ID: 91302
Keith Myers
Joined: 17 Nov 16
Posts: 205
United States
Message 91305 - Posted: 1 May 2019, 0:50:52 UTC

Well, my 24-hour delay expired, and then BOINC set another 24-hour delay. I am still unable to get any work for testing.
ID: 91305
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 3297
United Kingdom
Message 91312 - Posted: 1 May 2019, 7:41:05 UTC - in response to Message 91305.  
Last modified: 1 May 2019, 7:42:22 UTC

You returned the 'disk limit exceeded' errors at 30 Apr 2019 3:58:40 UTC, and requested new work at 1 May 2019 0:44:29 UTC - that's less than 24 hours.

I'm not sure exactly when Einstein resets its 'daily' clock in relation to your time zone, but it might be worth a manual update when you wake up in the morning. (In this context, it's the Einstein server which is setting the delays.)
ID: 91312
Keith Myers
Joined: 17 Nov 16
Posts: 205
United States
Message 91328 - Posted: 1 May 2019, 17:28:54 UTC - in response to Message 91312.  

You returned the 'disk limit exceeded' errors at 30 Apr 2019 3:58:40 UTC, and requested new work at 1 May 2019 0:44:29 UTC - that's less than 24 hours.

I'm not sure exactly when Einstein resets its 'daily' clock in relation to your time zone, but it might be worth a manual update when you wake up in the morning. (In this context, it's the Einstein server which is setting the delays.)

I had a feeling that Einstein bases its delay on the calendar day and not on UTC time. I still had 9 hours on the delay timer this morning, but I went ahead and did an update and got work again.

This time I am able to process tasks without errors. None have validated yet, but I expect they will. So I have finally configured the host and the app_info correctly.

Thank you, Richard, for pointing out the real cause of the disk-limit-exceeded errors and the likely reason. The <copy_file/> was the culprit.
ID: 91328

Copyright © 2019 University of California. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.