WCG OPNG sans OPN1

Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 4925
United Kingdom
Message 105891 - Posted: 29 Oct 2021, 18:38:20 UTC

Tried one on a Linux (Mint v20.2) machine. It starts...

29/10/2021 19:24:35 | World Community Grid | Started upload of OPNG_0098436_00093_1_r329524617_0
29/10/2021 19:24:35 | World Community Grid | Started upload of OPNG_0098436_00093_1_r329524617_1
29/10/2021 19:24:35 | World Community Grid | [http] [ID#26650] Info: Found bundle for host upload.worldcommunitygrid.org: 0x55cc8cb3a400 [serially]
29/10/2021 19:24:35 | World Community Grid | [http] [ID#26650] Info: Server doesn't support multiplex (yet)
29/10/2021 19:24:35 | World Community Grid | [http] [ID#26649] Info: Trying 169.47.63.74:443...
- I won't bore you with the rest.

Seems like the multiplexing might be your problem, but I don't know Linux well enough to go much further.
ID: 105891
Aurum
Joined: 13 Sep 17
Posts: 23
United States
Message 105892 - Posted: 29 Oct 2021, 18:42:18 UTC

This is from my cc_config:
<fetch_minimal_work>0</fetch_minimal_work>
<fetch_on_update>1</fetch_on_update>
<force_auth>basic</force_auth>
<http_transfer_timeout>3000</http_transfer_timeout>
<http_transfer_timeout_bps>10</http_transfer_timeout_bps>
<http_1_0>0</http_1_0>
Should fetch on update be set???
I wish there were better explanations for all these options.
ID: 105892
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 4925
United Kingdom
Message 105893 - Posted: 29 Oct 2021, 18:49:05 UTC - in response to Message 105892.  

> Should fetch on update be set???
> I wish there were better explanations for all these options.
I don't have it set, but it doesn't sound like it would matter.

The explanations, such as they are, are here.

Actually,

<fetch_on_update>0|1</fetch_on_update>
When updating a project, request work even if not highest priority project.
sounds good - I might try that myself.
ID: 105893
Aurum
Joined: 13 Sep 17
Posts: 23
United States
Message 105894 - Posted: 29 Oct 2021, 18:51:26 UTC

After rebooting the files are ULing one or two at a time. Sometimes it says it can multiplex:
2472	World Community Grid	10/29/2021 11:48:15 AM	[http] [ID#74] Received header from server: HTTP/2 200 
2473	World Community Grid	10/29/2021 11:48:15 AM	[http] [ID#74] Received header from server: date: Fri, 29 Oct 2021 18:48:15 GMT
2474	World Community Grid	10/29/2021 11:48:15 AM	[http] [ID#74] Received header from server: server: Apache
2475	World Community Grid	10/29/2021 11:48:15 AM	[http] [ID#74] Received header from server: vary: Accept-Encoding	
2476	World Community Grid	10/29/2021 11:48:15 AM	[http] [ID#74] Received header from server: content-encoding: gzip	
2477	World Community Grid	10/29/2021 11:48:15 AM	[http] [ID#74] Received header from server: content-length: 75	
2478	World Community Grid	10/29/2021 11:48:15 AM	[http] [ID#74] Received header from server: content-type: text/plain; charset=UTF-8	
2479	World Community Grid	10/29/2021 11:48:15 AM	[http] [ID#74] Received header from server: 	
2480	World Community Grid	10/29/2021 11:48:15 AM	[http] [ID#74] Received header from server: ‹	
2481	World Community Grid	10/29/2021 11:48:15 AM	[http] [ID#74] Info:  Connection #0 to host upload.worldcommunitygrid.org left intact	
2482	World Community Grid	10/29/2021 11:48:15 AM	[http] [ID#74] Info:  Found bundle for host upload.worldcommunitygrid.org: 0x557ee1c94df0 [can multiplex]	
2483	World Community Grid	10/29/2021 11:48:15 AM	[http] [ID#74] Info:  Re-using existing connection! (#0) with host upload.worldcommunitygrid.org	
2484	World Community Grid	10/29/2021 11:48:15 AM	[http] [ID#74] Info:  Connected to upload.worldcommunitygrid.org (169.47.63.74) port 443 (#0)	
2485	World Community Grid	10/29/2021 11:48:15 AM	[http] [ID#74] Info:  Using Stream ID: d7 (easy handle 0x557ee21e5c30)	
2486	World Community Grid	10/29/2021 11:48:15 AM	[http] [ID#74] Sent header to server: POST /boinc/wcg_cgi/file_upload_handler HTTP/2
2487	World Community Grid	10/29/2021 11:48:15 AM	[http] [ID#74] Sent header to server: Host: upload.worldcommunitygrid.org	
2488	World Community Grid	10/29/2021 11:48:15 AM	[http] [ID#74] Sent header to server: user-agent: BOINC client (x86_64-pc-linux-gnu 7.16.6)	
2489	World Community Grid	10/29/2021 11:48:15 AM	[http] [ID#74] Sent header to server: accept: */*
2490	World Community Grid	10/29/2021 11:48:15 AM	[http] [ID#74] Sent header to server: accept-encoding: deflate, gzip, br
2491	World Community Grid	10/29/2021 11:48:15 AM	[http] [ID#74] Sent header to server: accept-language: en_US
2492	World Community Grid	10/29/2021 11:48:15 AM	[http] [ID#74] Sent header to server: content-length: 16656866
2493	World Community Grid	10/29/2021 11:48:15 AM	[http] [ID#74] Sent header to server: content-type: application/x-www-form-urlencoded
2494	World Community Grid	10/29/2021 11:48:15 AM	[http] [ID#74] Sent header to server: 
2495	World Community Grid	10/29/2021 11:48:42 AM	[http] [ID#74] Info:  We are completely uploaded and fine	
2496	World Community Grid	10/29/2021 11:48:42 AM	[http] [ID#74] Received header from server: HTTP/2 200 
2497	World Community Grid	10/29/2021 11:48:42 AM	[http] [ID#74] Received header from server: date: Fri, 29 Oct 2021 18:48:16 GMT
2498	World Community Grid	10/29/2021 11:48:42 AM	[http] [ID#74] Received header from server: server: Apache
2499	World Community Grid	10/29/2021 11:48:42 AM	[http] [ID#74] Received header from server: content-length: 64
2500	World Community Grid	10/29/2021 11:48:42 AM	[http] [ID#74] Received header from server: content-type: text/plain; charset=UTF-8
2501	World Community Grid	10/29/2021 11:48:42 AM	[http] [ID#74] Received header from server: 
2502	World Community Grid	10/29/2021 11:48:42 AM	[http] [ID#74] Received header from server: <data_server_reply>	
2503	World Community Grid	10/29/2021 11:48:42 AM	[http] [ID#74] Received header from server:     <status>0</status>	
2504	World Community Grid	10/29/2021 11:48:42 AM	[http] [ID#74] Received header from server: </data_server_reply>	
2505	World Community Grid	10/29/2021 11:48:42 AM	[http] [ID#74] Info:  Connection #0 to host upload.worldcommunitygrid.org left intact	
2506	World Community Grid	10/29/2021 11:48:43 AM	Finished upload of ARP1_0001451_100_1_r1391244310_3	
2507	World Community Grid	10/29/2021 11:48:44 AM	Sending scheduler request: To report completed tasks.	
2508	World Community Grid	10/29/2021 11:48:44 AM	Reporting 5 completed tasks	
2509	World Community Grid	10/29/2021 11:48:44 AM	Not requesting tasks: "no new tasks" requested via Manager	
2510	World Community Grid	10/29/2021 11:48:44 AM	[http] HTTP_OP::init_post(): https://scheduler.worldcommunitygrid.org/boinc/wcg_cgi/fcgi	
2511	World Community Grid	10/29/2021 11:48:45 AM	[http] [ID#1] Info:  Too old connection (141 seconds), disconnect it	
2512	World Community Grid	10/29/2021 11:48:45 AM	[http] [ID#1] Info:  Connection 18 seems to be dead!	
2513	World Community Grid	10/29/2021 11:48:45 AM	[http] [ID#1] Info:  Closing connection 18	
2514	World Community Grid	10/29/2021 11:48:45 AM	[http] [ID#1] Info:  TLSv1.2 (OUT), TLS alert, close notify (256):
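
As a rough check on the log above: the request header reports content-length: 16656866, and the "completely uploaded" line appears 27 seconds after the headers were sent (11:48:15 to 11:48:42). A minimal sketch of the arithmetic, assuming those two timestamps bracket the whole transfer; the variable names are illustrative and not part of BOINC:

```python
from datetime import datetime

# Values copied from the event log above.
content_length_bytes = 16_656_866                  # "content-length" request header
sent = datetime.strptime("11:48:15", "%H:%M:%S")   # request headers sent
done = datetime.strptime("11:48:42", "%H:%M:%S")   # "We are completely uploaded and fine"

elapsed_s = (done - sent).total_seconds()
rate_bytes_per_s = content_length_bytes / elapsed_s

print(f"elapsed: {elapsed_s:.0f} s")
print(f"rate: {rate_bytes_per_s / 1e6:.2f} MB/s "
      f"({rate_bytes_per_s * 8 / 1e6:.1f} Mbit/s)")
```

That works out to roughly 0.62 MB/s for this one file, which is only a single sample and says nothing about why other uploads stall.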
ID: 105894
Aurum
Joined: 13 Sep 17
Posts: 23
United States
Message 105895 - Posted: 29 Oct 2021, 18:58:39 UTC - in response to Message 105893.  

> The explanations, such as they are, are here.
I know. I've read them many times and they're still clear as mud :-)
ID: 105895
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 4925
United Kingdom
Message 105896 - Posted: 29 Oct 2021, 19:01:22 UTC

Well, that's got me completely confused. :-(

But if they're uploading, that's the main thing. I think it's getting a bit late to take this further tonight - I'll try and read up on multiplexing over the weekend. I think WCG uses multiple servers to share the load - maybe the configuration isn't quite identical across the cluster?
ID: 105896
Aurum
Joined: 13 Sep 17
Posts: 23
United States
Message 105897 - Posted: 29 Oct 2021, 19:03:45 UTC

I go back and forth on this one. I had it set to 256 and just changed it to 16. Is there a safe 'n sane setting???
<max_file_xfers>16</max_file_xfers>
<max_file_xfers_per_project>16</max_file_xfers_per_project>
ID: 105897
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 4925
United Kingdom
Message 105898 - Posted: 29 Oct 2021, 19:07:43 UTC - in response to Message 105897.  

I run smaller machines (max 2 GPUs), so I haven't been tempted to change the defaults. Anyone?
ID: 105898
Jim1348
Joined: 8 Nov 10
Posts: 310
United States
Message 105899 - Posted: 29 Oct 2021, 22:02:10 UTC - in response to Message 105897.  

> I go back and forth on this one. I had it set to 256 and just changed it to 16. Is there a safe 'n sane setting???
> <max_file_xfers>16</max_file_xfers>
> <max_file_xfers_per_project>16</max_file_xfers_per_project>

I use this:
<max_file_xfers>8</max_file_xfers>
<max_file_xfers_per_project>4</max_file_xfers_per_project>

The problem with setting it too high is that the transfers seem to interfere with each other. Also, when first attaching, I want at least one work unit to start up early. If you allow too many simultaneous transfers, you have to wait longer for the first one to finish downloading. But my largest machine has 32 virtual cores; I don't know about anything bigger than that.
ID: 105899
Aurum
Joined: 13 Sep 17
Posts: 23
United States
Message 105901 - Posted: 30 Oct 2021, 15:31:14 UTC
Last modified: 30 Oct 2021, 15:46:06 UTC

The problem is spreading. After rebooting Rig-44 and then getting it to clear the ULs it's doing it again. Also Rig-13 is unable to UL anything for no apparent reason.
I'm convinced this is entirely due to something set wrong with ARP.

Note: If I could edit the title of this thread I'd call it WCG Troubleshooting.
ID: 105901
Aurum
Joined: 13 Sep 17
Posts: 23
United States
Message 105902 - Posted: 30 Oct 2021, 16:26:50 UTC
Last modified: 30 Oct 2021, 17:23:27 UTC

I may have found a trick to get these ARPs uploading.
<max_file_xfers>4</max_file_xfers>
<max_file_xfers_per_project>1</max_file_xfers_per_project>
Setting these parameters and reading your config files won't restart the uploads. It requires a reboot.
Edit: If all uploads are stopped and are not attempting to upload then reading config files and restarting one WU works. If you have a large backlog and any of them are trying to upload then it requires a reboot.

The problem with WCG is that this global setting treats all of WCG as a single project. So if you have ARP, HST & OPN WUs it won't UL one of each for a total of 3; it will only upload one at a time.
It used to UL 4 ARP WUs at a time when I had this set higher, but then everything log-jammed and WCG uploads seized up.
Beware: This limits downloads as well.
I'll try two on the next computer I can reboot. Edit: 2 works fine.
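
To illustrate why the per-project cap bites harder on WCG, here is a toy model. It is not BOINC's actual transfer scheduler: it simply assumes transfers are admitted in queue order, subject to a global cap and a per-project cap, and that every WCG sub-project (ARP, HST, OPN) counts against the same "World Community Grid" project. The function name and data layout are hypothetical.

```python
from collections import Counter

def admit_transfers(queue, max_file_xfers, max_file_xfers_per_project):
    """Return the queued transfers that may run at the same time.

    queue is a list of (project, app) pairs in queue order. Toy model only:
    the caps are applied per project, never per app.
    """
    running = []
    per_project = Counter()
    for project, app in queue:
        if len(running) >= max_file_xfers:
            break  # global cap reached
        if per_project[project] >= max_file_xfers_per_project:
            continue  # this project already uses its share
        running.append((project, app))
        per_project[project] += 1
    return running

WCG = "World Community Grid"
queue = [(WCG, "ARP"), (WCG, "HST"), (WCG, "OPN")]

# One slot per project: only the first WCG transfer runs.
print(admit_transfers(queue, max_file_xfers=4, max_file_xfers_per_project=1))
# Two slots per project: two WCG transfers run, regardless of sub-project.
print(admit_transfers(queue, max_file_xfers=4, max_file_xfers_per_project=2))
```

Under these assumptions the sub-project never enters the decision, which matches the behaviour described above.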
ID: 105902
Aurum
Joined: 13 Sep 17
Posts: 23
United States
Message 105915 - Posted: 1 Nov 2021, 0:01:22 UTC

This morning it seemed like this was working, but not now. I have 100s of WUs backlogged. Once the transient error appears it seems the only way to restart uploads is by reloading. A real pain considering ARP may not checkpoint for 9 hours.
Sure would be nice if someone would tell them the ARP server is broken.
ID: 105915
Aurum
Joined: 13 Sep 17
Posts: 23
United States
Message 105917 - Posted: 1 Nov 2021, 14:47:04 UTC

Saw a new message today I've never seen before:
335	World Community Grid	11/1/2021 5:03:54 AM	Not requesting tasks: too many uploads in progress	
336	World Community Grid	11/1/2021 5:03:55 AM	Scheduler request completed
Ever since I saw knreed say he'd changed server settings because someone had opened numerous idle connections, uploading ARP WUs has become virtually impossible. It's a real shame, because I can do less than 5% of what I'm capable of doing for WCG's ARP or ClimatePrediction.
I think this is a multifaceted problem but trying to put a tiny bandaid on it using <max_file_xfers_per_project>2</max_file_xfers_per_project> is no solution at all.
I'm moving this systemic BOINC problem to GitHub issues.
ID: 105917
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 4925
United Kingdom
Message 105918 - Posted: 1 Nov 2021, 14:54:42 UTC - in response to Message 105917.  

Before you do - 'too many uploads' is a client decision. Irrespective of how many GPUs you may have, the limit will kick in when the number of uploads on the system - whether ready to upload, or backed off following some previous problem - goes above twice the number of CPUs visible to BOINC in the system.

It's designed to protect project servers. If there's a problem blocking uploads - we've all seen 'server storage full', and even complete server breakdowns - there's no point in downloading more and more work, which might (for all BOINC knows) never be uploaded or reported. That would be a waste of energy to compute.
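
The rule as described above can be sketched in a couple of lines. This is an assumption-laden paraphrase of the description in this post, not code taken from the BOINC client; the function name is hypothetical, and the real client may count uploads differently.

```python
def too_many_uploads(pending_uploads: int, ncpus: int) -> bool:
    """Paraphrase of the rule described above (not BOINC source code).

    Work fetch is suppressed once the number of pending uploads, whether
    ready or backed off, goes above twice the CPU count BOINC sees.
    """
    return pending_uploads > 2 * ncpus

# Example with a hypothetical 8-CPU host: the threshold is 16.
print(too_many_uploads(16, ncpus=8))  # at the threshold, still fetching
print(too_many_uploads(17, ncpus=8))  # above it, work fetch is suppressed
```

Note that in this sketch the number of GPUs does not appear at all, which is the point made above.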
ID: 105918
Aurum
Joined: 13 Sep 17
Posts: 23
United States
Message 105922 - Posted: 1 Nov 2021, 17:32:13 UTC - in response to Message 105918.  
Last modified: 1 Nov 2021, 17:35:00 UTC

> Before you do - 'too many uploads' is a client decision. Irrespective of how many GPUs you may have, the limit will kick in when the number of uploads on the system - whether ready to upload, or backed off following some previous problem - goes above twice the number of CPUs visible to BOINC in the system.
> It's designed to protect project servers. If there's a problem blocking uploads - we've all seen 'server storage full', and even complete server breakdowns - there's no point in downloading more and more work, which might (for all BOINC knows) never be uploaded or reported. That would be a waste of energy to compute.
Completely irrelevant to my problem. I never download too much work. I do NOT bunker. The backlog is caused by failures between the BOINC client and the server.
I question your explanation of twice the number of CPUs. Are they logical or physical cores? I'll watch and see if that's true.
Contributing to the problem may be the fact that big projects return multiple files per WU. E.g., a completed ARP WU returns seven 17 MB files. Anecdotally, it seems that the problem starts when four ARP WUs complete and are trying to upload. The CPUs running are i9-9980XE with 18c/36t. So maybe you've hit on something. Four ARPs finishing need to send 28 files up. So are you saying that if I have 7 other uploads pending of any kind on that computer it'll seize up??? That's a major design flaw that significantly curtails one's ability to contribute. An ARP WU typically needs 1 GB of RAM. My computers have 32 GB so I could run 32 ARPs plus OPNs. I've been trying to run 16 and it seizes every computer I've tried it on. It takes up to two days for me to clear it, short of dumping days' worth of uncheckpointed running WUs. I could be submitting over 2,000 ARP WUs a day if they'd actually upload.
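
The numbers in this paragraph can be laid out explicitly. The sketch below only does the arithmetic; it assumes, without confirmation, that the "twice the number of CPUs" limit counts individual files, and it shows both readings of "CPUs" (18 physical cores or 36 logical CPUs) since that is the open question.

```python
FILES_PER_ARP_WU = 7   # "seven 17 MB files" per completed ARP WU
FILE_SIZE_MB = 17
finished_wus = 4

files = finished_wus * FILES_PER_ARP_WU
print(f"{files} files, about {files * FILE_SIZE_MB} MB to upload")

# Assumption: the limit is 2 x CPUs and counts files, not tasks.
for label, ncpus in (("physical cores", 18), ("logical CPUs", 36)):
    limit = 2 * ncpus
    print(f"{label}: limit {limit}, room for {limit - files} more files")
```

If the client counts tasks rather than files, or uses logical CPUs, the headroom is much larger than the physical-core reading suggests.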
ID: 105922
Aurum
Joined: 13 Sep 17
Posts: 23
United States
Message 105923 - Posted: 1 Nov 2021, 17:39:26 UTC - in response to Message 105918.  
Last modified: 1 Nov 2021, 17:45:30 UTC

> ...there's no point in downloading more and more work, which might (for all BOINC knows) never be uploaded or reported. That would be a waste of energy to compute.
How does blocking uploads of completed work help the server not download excess files???
Start by separating the control of uploads from downloads. Then make <max_file_xfers_per_project>1</max_file_xfers_per_project> an app_config option and not a global option in cc_config. Further, it needs to be an app-level option that applies to a specific sub-project, e.g. with WCG be able to apply it only to ARP1. E.g.,
<app_config>
<!-- i9-10980XE   18c36t   32 GB   L3 Cache = 24.75 MB  -->
    <app>
        <name>opng</name>
        <plan_class>opencl_nvidia_102</plan_class>
        <gpu_versions>
            <cpu_usage>1.0</cpu_usage>
            <gpu_usage>1.0</gpu_usage>
        </gpu_versions>
    </app>
    <app>
        <name>mcm1</name>
        <max_concurrent>4</max_concurrent>
    </app>
    <app>
        <name>opn1</name>
        <max_concurrent>34</max_concurrent>
    </app>
    <app>
        <name>arp1</name>
        <!-- needs 1 GB RAM per arp1 WU -->
        <max_concurrent>18</max_concurrent>
        <!-- proposed per-app option from the text above; not an existing app_config setting -->
        <max_file_xfers_per_project>2</max_file_xfers_per_project>
    </app>
</app_config>

Still, that's a tiny bandaid and not a bona fide fix of the root cause.
ID: 105923
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 4925
United Kingdom
Message 105924 - Posted: 1 Nov 2021, 18:15:37 UTC - in response to Message 105922.  

> Completely irrelevant to my problem.
Agreed. But don't shoot the messenger.

I just wanted to be sure that you knew where the pushback was going to come from, so that you could respond pre-emptively in your proposed issue.
ID: 105924
Aurum
Joined: 13 Sep 17
Posts: 23
United States
Message 105927 - Posted: 1 Nov 2021, 19:02:33 UTC

Sorry Richard, didn't mean to shoot you :-) I always appreciate your help.
The real pushback is going to come from the fact that very few folks even run into this problem. Who wants to fix a problem that only affects 5 people on Earth? Forget about the fact that the lion's share of the work is done by a couple dozen large compute nodes. Just do a cumulative plot of WCG stats to see that.
ID: 105927
Aurum
Joined: 13 Sep 17
Posts: 23
United States
Message 105988 - Posted: 5 Nov 2021, 14:31:43 UTC

Still plagued by transient HTTP errors. Is this a BOINC issue or a WCG issue???
ID: 105988
Jim1348
Joined: 8 Nov 10
Posts: 310
United States
Message 105989 - Posted: 5 Nov 2021, 17:20:53 UTC - in response to Message 105988.  

You have too many machines. I have two Ryzen 3600s on WCG, one on MCM and the other on ARP. They never have that problem.

I am beginning to wonder though if it is like GPUGrid. They always have an unofficial block on too many connections at once. No one knows what it is, but it is speculated that it is a DDoS protection measure, probably instituted at the university network level that the project can't do anything about. I am not sure what the equivalent is here. But you are overloading something.
ID: 105989

Copyright © 2022 University of California. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.