Message boards : Projects : WCG OPNG sans OPN1
Author | Message |
---|---|
Send message Joined: 5 Oct 06 Posts: 5118 |
"Should fetch on update be set???" I don't have it set, but it doesn't sound like it would matter. The explanations, such as they are, are here. Actually, <fetch_on_update>0|1</fetch_on_update> sounds good - I might try that myself. |
Send message Joined: 13 Sep 17 Posts: 26 |
After rebooting the files are ULing one or two at a time. Sometimes it says it can multiplex:
2472 World Community Grid 10/29/2021 11:48:15 AM [http] [ID#74] Received header from server: HTTP/2 200
2473 World Community Grid 10/29/2021 11:48:15 AM [http] [ID#74] Received header from server: date: Fri, 29 Oct 2021 18:48:15 GMT
2474 World Community Grid 10/29/2021 11:48:15 AM [http] [ID#74] Received header from server: server: Apache
2475 World Community Grid 10/29/2021 11:48:15 AM [http] [ID#74] Received header from server: vary: Accept-Encoding
2476 World Community Grid 10/29/2021 11:48:15 AM [http] [ID#74] Received header from server: content-encoding: gzip
2477 World Community Grid 10/29/2021 11:48:15 AM [http] [ID#74] Received header from server: content-length: 75
2478 World Community Grid 10/29/2021 11:48:15 AM [http] [ID#74] Received header from server: content-type: text/plain; charset=UTF-8
2479 World Community Grid 10/29/2021 11:48:15 AM [http] [ID#74] Received header from server:
2480 World Community Grid 10/29/2021 11:48:15 AM [http] [ID#74] Received header from server: ‹
2481 World Community Grid 10/29/2021 11:48:15 AM [http] [ID#74] Info: Connection #0 to host upload.worldcommunitygrid.org left intact
2482 World Community Grid 10/29/2021 11:48:15 AM [http] [ID#74] Info: Found bundle for host upload.worldcommunitygrid.org: 0x557ee1c94df0 [can multiplex]
2483 World Community Grid 10/29/2021 11:48:15 AM [http] [ID#74] Info: Re-using existing connection! (#0) with host upload.worldcommunitygrid.org
2484 World Community Grid 10/29/2021 11:48:15 AM [http] [ID#74] Info: Connected to upload.worldcommunitygrid.org (169.47.63.74) port 443 (#0)
2485 World Community Grid 10/29/2021 11:48:15 AM [http] [ID#74] Info: Using Stream ID: d7 (easy handle 0x557ee21e5c30)
2486 World Community Grid 10/29/2021 11:48:15 AM [http] [ID#74] Sent header to server: POST /boinc/wcg_cgi/file_upload_handler HTTP/2
2487 World Community Grid 10/29/2021 11:48:15 AM [http] [ID#74] Sent header to server: Host: upload.worldcommunitygrid.org
2488 World Community Grid 10/29/2021 11:48:15 AM [http] [ID#74] Sent header to server: user-agent: BOINC client (x86_64-pc-linux-gnu 7.16.6)
2489 World Community Grid 10/29/2021 11:48:15 AM [http] [ID#74] Sent header to server: accept: */*
2490 World Community Grid 10/29/2021 11:48:15 AM [http] [ID#74] Sent header to server: accept-encoding: deflate, gzip, br
2491 World Community Grid 10/29/2021 11:48:15 AM [http] [ID#74] Sent header to server: accept-language: en_US
2492 World Community Grid 10/29/2021 11:48:15 AM [http] [ID#74] Sent header to server: content-length: 16656866
2493 World Community Grid 10/29/2021 11:48:15 AM [http] [ID#74] Sent header to server: content-type: application/x-www-form-urlencoded
2494 World Community Grid 10/29/2021 11:48:15 AM [http] [ID#74] Sent header to server:
2495 World Community Grid 10/29/2021 11:48:42 AM [http] [ID#74] Info: We are completely uploaded and fine
2496 World Community Grid 10/29/2021 11:48:42 AM [http] [ID#74] Received header from server: HTTP/2 200
2497 World Community Grid 10/29/2021 11:48:42 AM [http] [ID#74] Received header from server: date: Fri, 29 Oct 2021 18:48:16 GMT
2498 World Community Grid 10/29/2021 11:48:42 AM [http] [ID#74] Received header from server: server: Apache
2499 World Community Grid 10/29/2021 11:48:42 AM [http] [ID#74] Received header from server: content-length: 64
2500 World Community Grid 10/29/2021 11:48:42 AM [http] [ID#74] Received header from server: content-type: text/plain; charset=UTF-8
2501 World Community Grid 10/29/2021 11:48:42 AM [http] [ID#74] Received header from server:
2502 World Community Grid 10/29/2021 11:48:42 AM [http] [ID#74] Received header from server: <data_server_reply>
2503 World Community Grid 10/29/2021 11:48:42 AM [http] [ID#74] Received header from server: <status>0</status>
2504 World Community Grid 10/29/2021 11:48:42 AM [http] [ID#74] Received header from server: </data_server_reply>
2505 World Community Grid 10/29/2021 11:48:42 AM [http] [ID#74] Info: Connection #0 to host upload.worldcommunitygrid.org left intact
2506 World Community Grid 10/29/2021 11:48:43 AM Finished upload of ARP1_0001451_100_1_r1391244310_3
2507 World Community Grid 10/29/2021 11:48:44 AM Sending scheduler request: To report completed tasks.
2508 World Community Grid 10/29/2021 11:48:44 AM Reporting 5 completed tasks
2509 World Community Grid 10/29/2021 11:48:44 AM Not requesting tasks: "no new tasks" requested via Manager
2510 World Community Grid 10/29/2021 11:48:44 AM [http] HTTP_OP::init_post(): https://scheduler.worldcommunitygrid.org/boinc/wcg_cgi/fcgi
2511 World Community Grid 10/29/2021 11:48:45 AM [http] [ID#1] Info: Too old connection (141 seconds), disconnect it
2512 World Community Grid 10/29/2021 11:48:45 AM [http] [ID#1] Info: Connection 18 seems to be dead!
2513 World Community Grid 10/29/2021 11:48:45 AM [http] [ID#1] Info: Closing connection 18
2514 World Community Grid 10/29/2021 11:48:45 AM [http] [ID#1] Info: TLSv1.2 (OUT), TLS alert, close notify (256): |
Send message Joined: 13 Sep 17 Posts: 26 |
"The explanations, such as they are, are here." I know. I've read them many times and they're still clear as mud :-) |
Send message Joined: 5 Oct 06 Posts: 5118 |
Well, that's got me completely confused. :-( But if they're uploading, that's the main thing. I think it's getting a bit late to take this further tonight - I'll try and read up on multiplexing over the weekend. I think WCG uses multiple servers to share the load - maybe the configuration isn't quite identical across the cluster? |
Send message Joined: 13 Sep 17 Posts: 26 |
I go back and forth on this one. I had it set to 256 and just changed it to 16. Is there a safe 'n sane setting??? <max_file_xfers>16</max_file_xfers> <max_file_xfers_per_project>16</max_file_xfers_per_project> |
Send message Joined: 5 Oct 06 Posts: 5118 |
I run smaller machines (max 2 GPUs), so I haven't been tempted to change the defaults. Anyone? |
Send message Joined: 8 Nov 10 Posts: 310 |
"I go back and forth on this one. I had it set to 256 and just changed it to 16. Is there a safe 'n sane setting???" I use this: <max_file_xfers>8</max_file_xfers> The problem with setting it too large is that they seem to interfere with each other. Also, when first attaching, I want at least one work unit to start up early. If you set the transfers too large, you have to wait longer for the first one to finish downloading. But my maximum machine has 32 virtual cores. I don't know about more than that. |
Send message Joined: 13 Sep 17 Posts: 26 |
The problem is spreading. After rebooting Rig-44 and then getting it to clear the ULs it's doing it again. Also Rig-13 is unable to UL anything for no apparent reason. I'm convinced this is entirely due to something set wrong with ARP. Note: If I could edit the title of this thread I'd call it WCG Troubleshooting. |
Send message Joined: 13 Sep 17 Posts: 26 |
I may have found a trick to get these ARPs uploading. <max_file_xfers>4</max_file_xfers> <max_file_xfers_per_project>1</max_file_xfers_per_project> Setting these parameters and reading your config files won't restart the uploads. It requires a reboot. Edit: If all uploads are stopped and are not attempting to upload then reading config files and restarting one WU works. If you have a large backlog and any of them are trying to upload then it requires a reboot. Problem with WCG is that this global command considers WCG a project. So if you have ARP, HST & OPN WUs it won't UL one of each for a total of 3 but will only upload one at a time. It used to UL 4 ARP WUs at a time when I had this set higher but then everything log-jammed and WCG uploads seized up. Beware: This limits downloads as well. I'll try two on the next computer I can reboot. Edit: 2 works fine. |
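For readers following along: both transfer limits mentioned in this thread belong inside the <options> element of cc_config.xml. The following is a minimal Python sketch that generates such a file. The helper name build_cc_config is hypothetical, and the values 4 and 2 are simply the ones tried in the posts, not a recommendation.

```python
import xml.etree.ElementTree as ET


def build_cc_config(max_file_xfers, max_file_xfers_per_project):
    """Return cc_config.xml text with the two transfer limits set.

    build_cc_config is a hypothetical helper, for illustration only.
    """
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    ET.SubElement(options, "max_file_xfers").text = str(max_file_xfers)
    ET.SubElement(options, "max_file_xfers_per_project").text = str(
        max_file_xfers_per_project
    )
    # Pretty-print when available (ET.indent was added in Python 3.9).
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Values taken from the posts above; adjust for your own machine.
    print(build_cc_config(4, 2))
```

The output would still need to be saved as cc_config.xml in the BOINC data directory and picked up by the client; as the post above notes, re-reading the config files may not be enough to unstick transfers that are already backed off.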
Send message Joined: 13 Sep 17 Posts: 26 |
This morning it seemed like this was working, but not now. I have 100s of WUs backlogged. Once the transient error appears it seems the only way to restart uploads is by reloading. A real pain considering ARP may not checkpoint for 9 hours. Sure would be nice if someone would tell them the ARP server is broken. |
Send message Joined: 13 Sep 17 Posts: 26 |
Saw a new message today I've never seen before:
335 World Community Grid 11/1/2021 5:03:54 AM Not requesting tasks: too many uploads in progress
336 World Community Grid 11/1/2021 5:03:55 AM Scheduler request completed
Ever since knreed said he'd changed server settings because someone had made numerous idle connections, the ability to upload ARP WUs has collapsed to virtually impossible. It's a real shame, because I can do less than 5% of what I'm capable of doing for WCG's ARP or ClimatePrediction. I think this is a multifaceted problem, but trying to put a tiny bandaid on it using <max_file_xfers_per_project>2</max_file_xfers_per_project> is no solution at all. I'm moving this systemic BOINC problem to GitHub issues. |
Send message Joined: 5 Oct 06 Posts: 5118 |
Before you do - 'too many uploads' is a client decision. Irrespective of how many GPUs you may have, the limit will kick in when the number of uploads on the system - whether ready to upload, or backed off following some previous problem - goes above twice the number of CPUs visible to BOINC in the system. It's designed to protect project servers. If there's a problem blocking uploads - we've all seen 'server storage full', and even complete server breakdowns - there's no point in downloading more and more work, which might (for all BOINC knows) never be uploaded or reported. That would be a waste of energy to compute. |
Send message Joined: 13 Sep 17 Posts: 26 |
"Before you do - 'too many uploads' is a client decision. Irrespective of how many GPUs you may have, the limit will kick in when the number of uploads on the system - whether ready to upload, or backed off following some previous problem - goes above twice the number of CPUs visible to BOINC in the system." Completely irrelevant to my problem. I never download too much work. I do NOT bunker. The backlog is caused by failures between the BOINC client and the server. I question your explanation of twice the number of CPUs. Are they logical or physical cores? I'll watch and see if that's true. Contributing to the problem may be the fact that big projects return multiple files per WU. E.g., a completed ARP WU returns seven 17 MB files. Anecdotally it seems that the problem starts when four ARP WUs complete and are trying to upload. The CPUs running are i9-9980XE with 18c/36t. So maybe you've hit on something. Four ARPs finishing need to send 28 files up. So are you saying that if I have 7 other uploads pending of any kind on that computer it'll seize up??? That's a major design flaw that significantly curtails one's ability to contribute. An ARP WU typically needs 1 GB of RAM. My computers have 32 GB so I could run 32 ARPs plus OPNs. I've been trying to run 16 and it seizes every computer I've tried it on. It takes up to two days for me to clear it, short of dumping days' worth of uncheckpointed running WUs. I could be submitting over 2,000 ARP WUs a day if they'd actually upload. |
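The arithmetic in this exchange can be checked with a small Python sketch. It encodes only the rule as worded in the thread (work fetch stops once pending uploads go above twice the CPU count). Whether the client counts individual files or whole tasks, and whether "CPUs" means physical cores or logical threads, is not settled here, so both are treated as assumptions and both CPU counts are tried. The function names are hypothetical.

```python
def upload_limit(ncpus):
    """Threshold from the rule quoted in the thread: twice the CPU count."""
    return 2 * ncpus


def too_many_uploads(pending_uploads, ncpus):
    """True once pending uploads go above (strictly exceed) the threshold."""
    return pending_uploads > upload_limit(ncpus)


if __name__ == "__main__":
    files_per_arp = 7   # files returned per ARP WU, per the post
    finished_arps = 4   # WUs finishing at about the same time
    other_pending = 7   # the "7 other uploads" asked about in the post
    pending = files_per_arp * finished_arps + other_pending  # 35

    # Assumption: try both readings of "CPUs" (18 cores vs. 36 threads).
    for ncpus in (18, 36):
        print(ncpus, upload_limit(ncpus), too_many_uploads(pending, ncpus))
```

Under the rule as quoted, 35 pending uploads would not trip the limit on either reading (thresholds of 36 and 72), which suggests the "too many uploads" message appears only once a larger backlog has built up.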
Send message Joined: 13 Sep 17 Posts: 26 |
"...there's no point in downloading more and more work, which might (for all BOINC knows) never be uploaded or reported. That would be a waste of energy to compute." How does blocking uploads of completed work help the server not download excess files??? Start by separating the control of uploads from downloads. Then make <max_file_xfers_per_project>1</max_file_xfers_per_project> an app_config command and not a global command in the cc_config. Further it needs to be an app command that applies to a specific project, e.g. with WCG be able to apply it only to ARP1. E.g.,
<app_config>
  <!-- i9-10980XE 18c36t 32 GB L3 Cache = 24.75 MB -->
  <app>
    <name>opng</name>
    <plan_class>opencl_nvidia_102</plan_class>
    <gpu_versions>
      <cpu_usage>1.0</cpu_usage>
      <gpu_usage>1.0</gpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>mcm1</name>
    <max_concurrent>4</max_concurrent>
  </app>
  <app>
    <name>opn1</name>
    <max_concurrent>34</max_concurrent>
  </app>
  <app>
    <name>arp1</name>
    <!-- needs 1 GB RAM per arp1 WU -->
    <max_concurrent>18</max_concurrent>
    <max_file_xfers_per_project>2</max_file_xfers_per_project>
  </app>
</app_config>
Still that's a tiny bandaid and not a bona fide fix of a root cause problem. |
Send message Joined: 5 Oct 06 Posts: 5118 |
"Completely irrelevant to my problem." Agreed. But don't shoot the messenger. I just wanted to be sure that you knew where the pushback was going to come from, so that you could respond pre-emptively in your proposed issue. |
Send message Joined: 13 Sep 17 Posts: 26 |
Sorry Richard, didn't mean to shoot you :-) I always appreciate your help. The real pushback is going to come from the fact that very few folks even run into this problem. Who wants to fix a problem that only affects 5 people on Earth? Forget about the fact that the lion's share of the work is done by a couple dozen large compute nodes. Just do a cumulative plot of WCG stats to see that. |
Send message Joined: 13 Sep 17 Posts: 26 |
Still plagued by transient HTTP errors. Is this a BOINC issue or a WCG issue??? |
Send message Joined: 8 Nov 10 Posts: 310 |
You have too many machines. I have two Ryzen 3600's on WCG, one on MCM and the other on ARP. They never have that problem. I am beginning to wonder though if it is like GPUGrid. They always have an unofficial block on too many connections at once. No one knows what it is, but it is speculated that it is a DDoS protection measure, probably instituted at the university network level that the project can't do anything about. I am not sure what the equivalent is here. But you are overloading something. |
Send message Joined: 13 Sep 17 Posts: 26 |
It started after knreed made this post: https://www.worldcommunitygrid.org/forums/wcg/viewthread_thread,43839_offset,0#667954 Today I've had 12 computers seize up due to transient http errors. If it's me then how did I do the first 100 years of ARP without a problem??? They broke the system and I wish they'd put it back. |
Send message Joined: 13 Sep 17 Posts: 26 |
I suspect this is a systemic problem in BOINC. It affects the 3 projects that upload the largest files: WCG ARP, ClimatePrediction and GPUgrid. I can find nothing on my side that I can do to prevent this failure and keep work flowing continuously. Once the transient http error appears it takes a reboot or maybe a restart to get uploads going again. I've even seen it fail again after a reboot before all the stalled uploads have cleared requiring a second reboot. I bet there's a clue in the client_state file but I don't understand what I'm reading. Here's the first completed ARP upload today to stall and fail to upload. When ARP finishes a WU it uploads 7 files. Note that max_nbytes changes size twice for the same WU. Also, <num_retries>1</num_retries> sounds like we get one try to upload and then stall.
<file>
  <name>ARP1_0028085_102_1_r1835721786_0</name>
  <nbytes>15923553.000000</nbytes>
  <max_nbytes>104857600.000000</max_nbytes>
  <md5_cksum>5072fc387af5ae4d884ec0a22044c364</md5_cksum>
  <status>1</status>
  <upload_url>https://upload.worldcommunitygrid.org/boinc/wcg_cgi/file_upload_handler</upload_url>
  <persistent_file_xfer>
    <num_retries>1</num_retries>
    <first_request_time>1636203793.649186</first_request_time>
    <next_request_time>0.000000</next_request_time>
    <time_so_far>300.559970</time_so_far>
    <last_bytes_xferred>0.000000</last_bytes_xferred>
    <is_upload>1</is_upload>
  </persistent_file_xfer>
</file>
<file>
  <name>ARP1_0028085_102_1_r1835721786_1</name>
  <nbytes>16475866.000000</nbytes>
  <max_nbytes>104857600.000000</max_nbytes>
  <md5_cksum>677e411747706ad3b16c6bd86742a649</md5_cksum>
  <status>1</status>
  <upload_url>https://upload.worldcommunitygrid.org/boinc/wcg_cgi/file_upload_handler</upload_url>
  <persistent_file_xfer>
    <num_retries>1</num_retries>
    <first_request_time>1636203793.649186</first_request_time>
    <next_request_time>0.000000</next_request_time>
    <time_so_far>300.588016</time_so_far>
    <last_bytes_xferred>0.000000</last_bytes_xferred>
    <is_upload>1</is_upload>
  </persistent_file_xfer>
</file>
<file>
  <name>ARP1_0028085_102_1_r1835721786_2</name>
  <nbytes>15998165.000000</nbytes>
  <max_nbytes>104857600.000000</max_nbytes>
  <md5_cksum>71446531f56153380749450507dc3767</md5_cksum>
  <status>1</status>
  <upload_url>https://upload.worldcommunitygrid.org/boinc/wcg_cgi/file_upload_handler</upload_url>
  <persistent_file_xfer>
    <num_retries>1</num_retries>
    <first_request_time>1636203793.649186</first_request_time>
    <next_request_time>0.000000</next_request_time>
    <time_so_far>300.568259</time_so_far>
    <last_bytes_xferred>0.000000</last_bytes_xferred>
    <is_upload>1</is_upload>
  </persistent_file_xfer>
</file>
<file>
  <name>ARP1_0028085_102_1_r1835721786_3</name>
  <nbytes>18614248.000000</nbytes>
  <max_nbytes>31457280.000000</max_nbytes>
  <md5_cksum>5ce5b3b488ac687b549bdc97170ecbbf</md5_cksum>
  <status>1</status>
  <upload_url>https://upload.worldcommunitygrid.org/boinc/wcg_cgi/file_upload_handler</upload_url>
  <persistent_file_xfer>
    <num_retries>1</num_retries>
    <first_request_time>1636203793.649186</first_request_time>
    <next_request_time>0.000000</next_request_time>
    <time_so_far>300.568259</time_so_far>
    <last_bytes_xferred>0.000000</last_bytes_xferred>
    <is_upload>1</is_upload>
  </persistent_file_xfer>
</file>
<file>
  <name>ARP1_0028085_102_1_r1835721786_4</name>
  <nbytes>16084763.000000</nbytes>
  <max_nbytes>31457280.000000</max_nbytes>
  <md5_cksum>78e54c61957a45ee11c70e167b2a0b00</md5_cksum>
  <status>1</status>
  <upload_url>https://upload.worldcommunitygrid.org/boinc/wcg_cgi/file_upload_handler</upload_url>
  <persistent_file_xfer>
    <num_retries>1</num_retries>
    <first_request_time>1636203793.649186</first_request_time>
    <next_request_time>0.000000</next_request_time>
    <time_so_far>300.554810</time_so_far>
    <last_bytes_xferred>0.000000</last_bytes_xferred>
    <is_upload>1</is_upload>
  </persistent_file_xfer>
</file>
<file>
  <name>ARP1_0028085_102_1_r1835721786_5</name>
  <nbytes>15363410.000000</nbytes>
  <max_nbytes>31457280.000000</max_nbytes>
  <md5_cksum>ff167aca654c19fe7de5760bc2b245d6</md5_cksum>
  <status>1</status>
  <upload_url>https://upload.worldcommunitygrid.org/boinc/wcg_cgi/file_upload_handler</upload_url>
  <persistent_file_xfer>
    <num_retries>1</num_retries>
    <first_request_time>1636203793.649186</first_request_time>
    <next_request_time>0.000000</next_request_time>
    <time_so_far>300.531569</time_so_far>
    <last_bytes_xferred>0.000000</last_bytes_xferred>
    <is_upload>1</is_upload>
  </persistent_file_xfer>
</file>
<file>
  <name>ARP1_0028085_102_1_r1835721786_6</name>
  <nbytes>132.000000</nbytes>
  <max_nbytes>10240.000000</max_nbytes>
  <md5_cksum>e16122bf2611e311bdb0ea8b8d826897</md5_cksum>
  <status>1</status>
  <upload_url>https://upload.worldcommunitygrid.org/boinc/wcg_cgi/file_upload_handler</upload_url>
  <persistent_file_xfer>
    <num_retries>1</num_retries>
    <first_request_time>1636203793.649186</first_request_time>
    <next_request_time>1636204842.009948</next_request_time>
    <time_so_far>300.530445</time_so_far>
    <last_bytes_xferred>0.000000</last_bytes_xferred>
    <is_upload>1</is_upload>
  </persistent_file_xfer>
</file> |
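For anyone trying to read a dump like the one above, here is a minimal Python sketch that summarises the pending uploads in a client_state.xml fragment. It assumes the text parses as well-formed XML, which a live client_state.xml may not always do. The SAMPLE string is trimmed from two of the entries posted above, and pending_uploads is a hypothetical helper name. A zero next_request_time is reported as None rather than interpreted, since the thread does not establish what the client does in that case.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of two <file> entries from the post, wrapped in a root element.
SAMPLE = """
<client_state>
<file>
    <name>ARP1_0028085_102_1_r1835721786_0</name>
    <nbytes>15923553.000000</nbytes>
    <persistent_file_xfer>
        <num_retries>1</num_retries>
        <first_request_time>1636203793.649186</first_request_time>
        <next_request_time>0.000000</next_request_time>
        <is_upload>1</is_upload>
    </persistent_file_xfer>
</file>
<file>
    <name>ARP1_0028085_102_1_r1835721786_6</name>
    <nbytes>132.000000</nbytes>
    <persistent_file_xfer>
        <num_retries>1</num_retries>
        <first_request_time>1636203793.649186</first_request_time>
        <next_request_time>1636204842.009948</next_request_time>
        <is_upload>1</is_upload>
    </persistent_file_xfer>
</file>
</client_state>
"""


def pending_uploads(xml_text):
    """List files that carry a persistent upload transfer record."""
    root = ET.fromstring(xml_text)
    rows = []
    for f in root.iter("file"):
        xfer = f.find("persistent_file_xfer")
        if xfer is None or xfer.findtext("is_upload") != "1":
            continue  # not an upload in progress
        first = float(xfer.findtext("first_request_time", "0"))
        nxt = float(xfer.findtext("next_request_time", "0"))
        rows.append({
            "name": f.findtext("name"),
            "nbytes": float(f.findtext("nbytes", "0")),
            "num_retries": int(xfer.findtext("num_retries", "0")),
            # Seconds from first request to next scheduled retry, if any.
            "wait_seconds": (nxt - first) if nxt > 0 else None,
        })
    return rows


if __name__ == "__main__":
    for row in pending_uploads(SAMPLE):
        print(row)
```

On the sample, only the small seventh file has a retry scheduled, roughly 1048 seconds after the first request; the six large files show a next_request_time of zero.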
Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.