WCG OPNG sans OPN1

Message boards : Projects : WCG OPNG sans OPN1

Aurum
Joined: 13 Sep 17
Posts: 23
United States
Message 105995 - Posted: 6 Nov 2021, 1:50:27 UTC

It started after knreed made this post: https://www.worldcommunitygrid.org/forums/wcg/viewthread_thread,43839_offset,0#667954
Today I've had 12 computers seize up due to transient HTTP errors. If it's me, then how did I do the first 100 years of ARP without a problem?
They broke the system and I wish they'd put it back.
ID: 105995
Aurum
Joined: 13 Sep 17
Posts: 23
United States
Message 105998 - Posted: 6 Nov 2021, 13:46:00 UTC
Last modified: 6 Nov 2021, 13:47:34 UTC

I suspect this is a systemic problem in BOINC. It affects the three projects that upload the largest files: WCG ARP, ClimatePrediction and GPUGrid.
I can find nothing on my side that I can do to prevent this failure and keep work flowing continuously. Once the transient HTTP error appears, it takes a reboot or maybe a restart to get uploads going again. I've even seen it fail again after a reboot, before all the stalled uploads had cleared, requiring a second reboot.
I bet there's a clue in the client_state file, but I don't understand what I'm reading. Here's the first completed ARP upload today to stall and fail to upload. When ARP finishes a WU it uploads 7 files. Note that max_nbytes changes twice across the files of the same WU (104857600, then 31457280, then 10240). Also, <num_retries>1</num_retries> sounds like we get one try to upload and then stall.
<file>
    <name>ARP1_0028085_102_1_r1835721786_0</name>
    <nbytes>15923553.000000</nbytes>
    <max_nbytes>104857600.000000</max_nbytes>
    <md5_cksum>5072fc387af5ae4d884ec0a22044c364</md5_cksum>
    <status>1</status>
    <upload_url>https://upload.worldcommunitygrid.org/boinc/wcg_cgi/file_upload_handler</upload_url>
    <persistent_file_xfer>
        <num_retries>1</num_retries>
        <first_request_time>1636203793.649186</first_request_time>
        <next_request_time>0.000000</next_request_time>
        <time_so_far>300.559970</time_so_far>
        <last_bytes_xferred>0.000000</last_bytes_xferred>
        <is_upload>1</is_upload>
    </persistent_file_xfer>
</file>
<file>
    <name>ARP1_0028085_102_1_r1835721786_1</name>
    <nbytes>16475866.000000</nbytes>
    <max_nbytes>104857600.000000</max_nbytes>
    <md5_cksum>677e411747706ad3b16c6bd86742a649</md5_cksum>
    <status>1</status>
    <upload_url>https://upload.worldcommunitygrid.org/boinc/wcg_cgi/file_upload_handler</upload_url>
    <persistent_file_xfer>
        <num_retries>1</num_retries>
        <first_request_time>1636203793.649186</first_request_time>
        <next_request_time>0.000000</next_request_time>
        <time_so_far>300.588016</time_so_far>
        <last_bytes_xferred>0.000000</last_bytes_xferred>
        <is_upload>1</is_upload>
    </persistent_file_xfer>
</file>
<file>
    <name>ARP1_0028085_102_1_r1835721786_2</name>
    <nbytes>15998165.000000</nbytes>
    <max_nbytes>104857600.000000</max_nbytes>
    <md5_cksum>71446531f56153380749450507dc3767</md5_cksum>
    <status>1</status>
    <upload_url>https://upload.worldcommunitygrid.org/boinc/wcg_cgi/file_upload_handler</upload_url>
    <persistent_file_xfer>
        <num_retries>1</num_retries>
        <first_request_time>1636203793.649186</first_request_time>
        <next_request_time>0.000000</next_request_time>
        <time_so_far>300.568259</time_so_far>
        <last_bytes_xferred>0.000000</last_bytes_xferred>
        <is_upload>1</is_upload>
    </persistent_file_xfer>
</file>
<file>
    <name>ARP1_0028085_102_1_r1835721786_3</name>
    <nbytes>18614248.000000</nbytes>
    <max_nbytes>31457280.000000</max_nbytes>
    <md5_cksum>5ce5b3b488ac687b549bdc97170ecbbf</md5_cksum>
    <status>1</status>
    <upload_url>https://upload.worldcommunitygrid.org/boinc/wcg_cgi/file_upload_handler</upload_url>
    <persistent_file_xfer>
        <num_retries>1</num_retries>
        <first_request_time>1636203793.649186</first_request_time>
        <next_request_time>0.000000</next_request_time>
        <time_so_far>300.568259</time_so_far>
        <last_bytes_xferred>0.000000</last_bytes_xferred>
        <is_upload>1</is_upload>
    </persistent_file_xfer>
</file>
<file>
    <name>ARP1_0028085_102_1_r1835721786_4</name>
    <nbytes>16084763.000000</nbytes>
    <max_nbytes>31457280.000000</max_nbytes>
    <md5_cksum>78e54c61957a45ee11c70e167b2a0b00</md5_cksum>
    <status>1</status>
    <upload_url>https://upload.worldcommunitygrid.org/boinc/wcg_cgi/file_upload_handler</upload_url>
    <persistent_file_xfer>
        <num_retries>1</num_retries>
        <first_request_time>1636203793.649186</first_request_time>
        <next_request_time>0.000000</next_request_time>
        <time_so_far>300.554810</time_so_far>
        <last_bytes_xferred>0.000000</last_bytes_xferred>
        <is_upload>1</is_upload>
    </persistent_file_xfer>
</file>
<file>
    <name>ARP1_0028085_102_1_r1835721786_5</name>
    <nbytes>15363410.000000</nbytes>
    <max_nbytes>31457280.000000</max_nbytes>
    <md5_cksum>ff167aca654c19fe7de5760bc2b245d6</md5_cksum>
    <status>1</status>
    <upload_url>https://upload.worldcommunitygrid.org/boinc/wcg_cgi/file_upload_handler</upload_url>
    <persistent_file_xfer>
        <num_retries>1</num_retries>
        <first_request_time>1636203793.649186</first_request_time>
        <next_request_time>0.000000</next_request_time>
        <time_so_far>300.531569</time_so_far>
        <last_bytes_xferred>0.000000</last_bytes_xferred>
        <is_upload>1</is_upload>
    </persistent_file_xfer>
</file>
<file>
    <name>ARP1_0028085_102_1_r1835721786_6</name>
    <nbytes>132.000000</nbytes>
    <max_nbytes>10240.000000</max_nbytes>
    <md5_cksum>e16122bf2611e311bdb0ea8b8d826897</md5_cksum>
    <status>1</status>
    <upload_url>https://upload.worldcommunitygrid.org/boinc/wcg_cgi/file_upload_handler</upload_url>
    <persistent_file_xfer>
        <num_retries>1</num_retries>
        <first_request_time>1636203793.649186</first_request_time>
        <next_request_time>1636204842.009948</next_request_time>
        <time_so_far>300.530445</time_so_far>
        <last_bytes_xferred>0.000000</last_bytes_xferred>
        <is_upload>1</is_upload>
    </persistent_file_xfer>
</file>
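For anyone who wants to read these entries without squinting at raw XML, here is a minimal Python sketch that pulls the upload fields out of `<file>` blocks like the ones above and converts the epoch timestamps to UTC. It is only a sketch under assumptions: `summarize_uploads` and `to_utc` are hypothetical helper names, the sample is one entry copied from this post, and whether a complete client_state file parses cleanly with a strict XML parser is not something this has been checked against.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# One <file> entry copied from the excerpt above (md5 and upload_url omitted).
SAMPLE = """
<file>
    <name>ARP1_0028085_102_1_r1835721786_6</name>
    <nbytes>132.000000</nbytes>
    <max_nbytes>10240.000000</max_nbytes>
    <status>1</status>
    <persistent_file_xfer>
        <num_retries>1</num_retries>
        <first_request_time>1636203793.649186</first_request_time>
        <next_request_time>1636204842.009948</next_request_time>
        <time_so_far>300.530445</time_so_far>
        <last_bytes_xferred>0.000000</last_bytes_xferred>
        <is_upload>1</is_upload>
    </persistent_file_xfer>
</file>
"""


def to_utc(epoch):
    """Return an ISO 8601 UTC string for a Unix epoch, or None when the field is 0."""
    if epoch <= 0:
        return None
    return datetime.fromtimestamp(epoch, tz=timezone.utc).isoformat(timespec="seconds")


def summarize_uploads(xml_text):
    """Return one dict per <file> element that has a pending upload transfer."""
    # Wrap in a dummy root so a bare run of <file> elements parses as one document.
    root = ET.fromstring("<root>" + xml_text + "</root>")
    rows = []
    for f in root.iter("file"):
        xfer = f.find("persistent_file_xfer")
        if xfer is None or xfer.findtext("is_upload") != "1":
            continue  # not a pending upload
        rows.append({
            "name": f.findtext("name"),
            "nbytes": float(f.findtext("nbytes", default="0")),
            "max_nbytes": float(f.findtext("max_nbytes", default="0")),
            "num_retries": int(xfer.findtext("num_retries", default="0")),
            "time_so_far": float(xfer.findtext("time_so_far", default="0")),
            "bytes_xferred": float(xfer.findtext("last_bytes_xferred", default="0")),
            "first_request_utc": to_utc(float(xfer.findtext("first_request_time", default="0"))),
            "next_request_utc": to_utc(float(xfer.findtext("next_request_time", default="0"))),
        })
    return rows


if __name__ == "__main__":
    for row in summarize_uploads(SAMPLE):
        print(row)
```

Read this way, every entry in the excerpt shows about 300 seconds in time_so_far with 0 bytes in last_bytes_xferred, and only the last file has a non-zero next_request_time. What those fields mean to the client is not confirmed here; the sketch only makes the numbers easier to compare.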
ID: 105998
Aurum
Joined: 13 Sep 17
Posts: 23
United States
Message 106000 - Posted: 6 Nov 2021, 18:24:42 UTC

See: Uploads Stopping for Projects with Large Files #4572
https://github.com/BOINC/boinc/issues/4572
ID: 106000
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 4945
United Kingdom
Message 106001 - Posted: 6 Nov 2021, 20:43:43 UTC - in response to Message 106000.  

I still think it would be helpful to identify exactly which of the many possible 'transient upload errors' is being encountered in your use case. You referred to my #3778, but that was concerned with the problems which occur with the BOINC backoffs when files fail to upload. The cause of that at GPUGrid is well known: a DDoS block on the connection when multiple computers try to connect to the server from the same IP address in too short a time interval.

Your issue #4572 gives an example of a simple 'transient HTTP error', but the extended http_debug log shows an error-free 'Finished upload of ...' log, which takes us no further forward.

I also run WCG, and I've seen examples of an error "no server is available to handle your request" - which I don't think I've seen at any other BOINC project.

You also mention that you think that the errors are connected with the uploading of 'large' files - can you quantify that? I've recently handled problems caused by attempting to upload files larger than 150 megabytes at CPDN, and 500 megabytes at GPUGrid. Are your files in that sort of size range? Also, in my experience, once a connection has been established and data has started to flow, in general uploads complete: the exception being projects which transfer uploaded files to backing storage, which sometimes accept the whole file and then generate an http gateway error because the backing store is full or offline. Does your error occur at the start of the transfer (in which case the size probably isn't implicated), or does it occur part-way through, or at the end?
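To put numbers on the size question, the nbytes values quoted earlier in this thread for one ARP work unit can be compared against the 150 and 500 megabyte figures mentioned above. This is a small Python sketch, not a diagnosis: the constant names are made up for illustration, and it assumes "megabyte" means 1,000,000 bytes (the picture does not change with 1,048,576).

```python
# nbytes of the seven result files of ARP1_0028085_102_1, as posted in this thread.
NBYTES = [15923553, 16475866, 15998165, 18614248, 16084763, 15363410, 132]

# Problem sizes quoted in the post above (hypothetical constant names).
CPDN_PROBLEM_MB = 150
GPUGRID_PROBLEM_MB = 500


def to_mb(nbytes):
    """Convert bytes to decimal megabytes (assumption: 1 MB = 1,000,000 bytes)."""
    return nbytes / 1_000_000


largest_mb = to_mb(max(NBYTES))
total_mb = to_mb(sum(NBYTES))

print(f"largest file: {largest_mb:.1f} MB")
print(f"whole work unit: {total_mb:.1f} MB")
print("largest file below CPDN figure:", largest_mb < CPDN_PROBLEM_MB)
print("whole work unit below CPDN figure:", total_mb < CPDN_PROBLEM_MB)
```

On these figures the largest single file is under 19 MB and all seven together come to just under 100 MB, so neither reaches the sizes that caused trouble at CPDN or GPUGrid.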
ID: 106001

Copyright © 2023 University of California. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.