I need a way to fine-tune download/upload back-off period

Nim1
Joined: 29 Mar 13
Posts: 6
Message 48383 - Posted: 29 Mar 2013, 7:48:46 UTC
Last modified: 29 Mar 2013, 7:51:15 UTC

Hi

I'm on a rather unstable network and am running a project that uploads a large amount of data for each of its tasks. The server supports resume, so if an upload fails midway, it can continue from where it left off. However, each time the connection gets interrupted, the BOINC software automatically sets a back-off period that gets bigger and bigger over time. I need to change that so it backs off for, let's say, only 1 minute no matter what.

Any idea how I could achieve that?
ID: 48383

SekeRob2
Joined: 6 Jul 10
Posts: 585
Italy
Message 48384 - Posted: 29 Mar 2013, 8:57:39 UTC - in response to Message 48383.  

What you are saying is that the back-off keeps getting bigger, meaning the upload failures are consecutive. Doubt there's a BOINC way of overriding the increments if there's nothing in the cc_config.xml manual. Take the code, it's open source, and doctor the back-off code. Can't remember exactly, but I seem to remember there was also some back-off counter that considers the upload to have permanently failed at 100(?) tries. You see the upload counter in the Transfers view. With what you want, that could mean your upload is aborted after 100 minutes. Anyway, that's a not readily documented part [no current Google hits, but at CPDN a mod wrote the same in 2010].
Coelum Non Animum Mutant, Qui Trans Mare Currunt
ID: 48384

mo.v
Joined: 13 Aug 06
Posts: 778
United Kingdom
Message 48386 - Posted: 29 Mar 2013, 9:27:33 UTC

When the upload time limit for files was 14 days I opened a Trac ticket asking for more time:

http://boinc.berkeley.edu/trac/ticket/919#comment:2

David Anderson replied saying he'd changed the limit to I think three months. I was astonished he'd made such a generous limit. He wrote that on the ticket but everything except my initial request has now been removed and the link at the bottom saying r18845 doesn't work. So I don't know where this is now documented. I can only say that I read David's response more than once because it was my ticket.

I think Jorden mentioned the 100 upload attempts allowed. There must be a limit.
ID: 48386

Nim1
Joined: 29 Mar 13
Posts: 6
Message 48387 - Posted: 29 Mar 2013, 10:59:23 UTC
Last modified: 29 Mar 2013, 11:07:11 UTC

Thank you guys for your answers.

"Doubt there's a BOINC way of overriding the increments if there's nothing in the cc_config.xml manual. Take the code, it's open source, and doctor the back-off code."

Yes, that was my thought as well, but first I wanted to make sure that I wasn't missing anything. Thank you.

"but I seem to remember there was also some back-off counter that considers the upload to have permanently failed at 100(?) tries"

"I think Jorden mentioned the 100 upload attempts allowed. There must be a limit."

Oh, I did not know that. Thank you for pointing it out.

"You see the upload counter in the Transfers view"

Unfortunately, I'm unable to see such a counter there. Project, File, Progress, Size, Elapsed Time, Speed and Status are the only columns I can see. Given that seeing such a value could really help me find the right code, I'm interested in knowing why I cannot see that counter. Am I missing anything?

"When the upload time limit for files was 14 days ... he'd changed the limit to I think three months"

I did not know about such a limit either. Though I suppose it wouldn't have been a problem for me even if it was 14 days, if I could disable the code responsible for increasing the back-off period. But I should probably take a look at the code to make sure. Thank you.
ID: 48387

mo.v
Joined: 13 Aug 06
Posts: 778
United Kingdom
Message 48389 - Posted: 29 Mar 2013, 12:36:50 UTC

None of us can see the counter that adds the number of upload attempts. It isn't visible in BOINC Manager; it's somewhere in the BOINC code, hidden.

Perhaps someone can think of an extra imaginative and creative instruction that could be added to the configuration file. But you only want to disable the increasing time delays between backoffs. You still want to keep a fixed, shorter time delay.
ID: 48389

SekeRob2
Joined: 6 Jul 10
Posts: 585
Italy
Message 48390 - Posted: 29 Mar 2013, 13:52:08 UTC - in response to Message 48389.  

I can see it, so I must be special ;>)... it says something like [tried n]. Let me force the hand and pull the internet cable from the router, won't disconnect from router as that kind of immediately crashes running tasks on my Linux box [how old is that bug, I can't remember].

The 14-day limit also rings a bell. There's an issue with the number of stuck uploads and not fetching work, so I would be loath to set such a value to 3 months. Work fetch is stopped for a project if its pending uploads exceed 2 times the number of cores in a host. I recently saw it when I had about 20 results waiting to upload on an 8-core machine and the client was dying to get work. That point of too many was reached in 12 hours.
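
A rough sketch of that rule of thumb in Python, just to make the arithmetic concrete. The function name and the exact comparison are assumptions taken from the description above, not from the client's actual code:

```python
def upload_backlog_blocks_fetch(pending_uploads, cpu_cores, factor=2):
    """Return True if the upload backlog would stop work fetch.

    Assumption: fetch stops once pending uploads exceed
    factor * cpu_cores, as described in this post.
    """
    return pending_uploads > factor * cpu_cores


if __name__ == "__main__":
    # The case from this post: about 20 results waiting on an 8-core host.
    print(upload_backlog_blocks_fetch(20, 8))
```

With 20 pending uploads on 8 cores the threshold is 16, so the check comes out true, which matches what I saw.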

Maybe I'm [again] confusing the 100 with the number of restarts allowed for tasks... those abort after 100 zero-status exits. Then there's the "Too Many Exits" logged.
ID: 48390

BobCat13
Joined: 6 Dec 06
Posts: 118
United States
Message 48391 - Posted: 29 Mar 2013, 14:55:21 UTC - in response to Message 48390.  

Apparently there is no 100-retry limit, as this upload info for a Cels@Home file, pulled from client_state.xml, shows:

<persistent_file_xfer>
<num_retries>105</num_retries>

As for the day limit, I believe, like mo.v, that it is set at 90. I have 4 uploads for Cels@Home and 18 uploads for UCT Malaria whose persistent status I clear on the first day of every odd-numbered month. This has been working for several years to keep the uploads from expiring. Also, update the report_deadline to keep the results from going into deadline warning.
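
For anyone who wants to read the counter without opening the file by hand, here is a minimal Python sketch. Only the `persistent_file_xfer` and `num_retries` tags come from the snippet above; the surrounding `client_state`/`file_info`/`name` wrapper in the sample is a hypothetical stand-in, and a real client_state.xml may need more forgiving parsing:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: only <persistent_file_xfer> and <num_retries>
# are taken from the post; the wrapper elements are assumed.
SAMPLE = """
<client_state>
  <file_info>
    <name>example_upload_0</name>
    <persistent_file_xfer>
      <num_retries>105</num_retries>
    </persistent_file_xfer>
  </file_info>
</client_state>
"""


def retry_counts(xml_text):
    """Collect every <num_retries> value found under <persistent_file_xfer>."""
    root = ET.fromstring(xml_text.strip())
    counts = []
    for xfer in root.iter("persistent_file_xfer"):
        text = xfer.findtext("num_retries")
        if text is not None:
            counts.append(int(text))
    return counts


if __name__ == "__main__":
    print(retry_counts(SAMPLE))
```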
ID: 48391

Claggy
Joined: 23 Apr 07
Posts: 1112
United Kingdom
Message 48392 - Posted: 29 Mar 2013, 15:13:02 UTC - in response to Message 48390.  

"I can see it, so I must be special ;>)... it says something like [tried n]. Let me force the hand and pull the internet cable from the router, won't disconnect from router as that kind of immediately crashes running tasks on my Linux box [how old is that bug, I can't remember]."

Especially a problem on my Android devices (with NativeBoinc), where network connectivity isn't always there, either because my mobile doesn't have 3G coverage or because my tablet doesn't have a WiFi connection.
WUs still error out even with the network suspended; I wonder if NativeBoinc is trying to do a News update at that point.

Claggy
ID: 48392

Nim1
Joined: 29 Mar 13
Posts: 6
Message 48393 - Posted: 29 Mar 2013, 15:41:52 UTC

@BobCat13:
Thank you for looking that up.

And thank you all for the tips and suggestions. I will try to look at the code in my spare time and will post the results.
ID: 48393

Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 5082
United Kingdom
Message 48395 - Posted: 29 Mar 2013, 17:37:17 UTC - in response to Message 48386.  

"When the upload time limit for files was 14 days I opened a Trac ticket asking for more time:

"http://boinc.berkeley.edu/trac/ticket/919#comment:2

"David Anderson replied saying he'd changed the limit to I think three months. I was astonished he'd made such a generous limit. He wrote that on the ticket but everything except my initial request has now been removed and the link at the bottom saying r18845 doesn't work. So I don't know where this is now documented. I can only say that I read David's response more than once because it was my ticket."

Your r18845 became

Revision: 511637c9b6402b9275af6f86e83f0323fa5b893d
Author: David Anderson <davea@ssl.berkeley.edu>
Date: 14/08/2009 20:00:29
Message:
- client: in the final stage of CPU scheduling,
    give preference to multi-threaded jobs.
    Avoid running N-1 1-thread jobs and 1 N-thread job on N CPUs
- client: change file transfer giveup time from 14 to 90 days

svn path=/trunk/boinc/; revision=18845

but it seems to have vanished again with all the messing around from SVN to GIT and now GIT-v2.
ID: 48395

SekeRob2
Joined: 6 Jul 10
Posts: 585
Italy
Message 48408 - Posted: 30 Mar 2013, 16:22:16 UTC - in response to Message 48390.  

Snip quote
"I can see it, so I must be special ;>)... it says something like [tried n]. Let me force the hand and pull the internet cable from the router, won't disconnect from router as that kind of immediately crashes running tasks on my Linux box [how old is that bug, I can't remember]."

Yes, I saw it, but my mind's eye had it recorded differently [:red cheeks smiley]... it's the downloads that print the retry count; uploads just give the incremental back-off times.

DSFL_00070-16_0000055_0471_DSFL_00070-16_0000055_0471.job 0.000 3.56 K 00:00:00 - 01:43:29 0.00 Kbps Download pending (Retry in: 00:24:16), retried: 4
DSFL_00070-16_0000055_0471_DSFL_00070-16_0000055_0471.zip 0.000 5.45 K 00:00:00 - 02:18:04 0.00 Kbps Download pending (Retry in: 00:58:51), retried: 5
GFAM_x1QNG_PfCypA_0086491_0050_0_0 0.000 231.54 K 00:00:00 - 00:21:05 0.00 Kbps Upload pending (Project backoff: 00:10:58)
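
If you want to pull the wait time and the retry count out of status text like the lines above, here is a small Python sketch. The regular expression is fitted only to these three sample lines, so treat the pattern and the function name as assumptions rather than a description of every status string the Manager can show:

```python
import re

# Pattern fitted to the three sample lines in this post (assumption:
# other status strings may look different and will simply not match).
STATUS_RE = re.compile(
    r"(?P<kind>Download|Upload) pending "
    r"\((?:Retry in|Project backoff): (?P<wait>\d{2}:\d{2}:\d{2})\)"
    r"(?:, retried: (?P<retried>\d+))?"
)


def parse_status(line):
    """Return transfer kind, wait in seconds and retry count (or None)."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    hours, minutes, seconds = (int(p) for p in match.group("wait").split(":"))
    retried = match.group("retried")
    return {
        "kind": match.group("kind"),
        "wait_seconds": hours * 3600 + minutes * 60 + seconds,
        "retried": int(retried) if retried is not None else None,
    }


if __name__ == "__main__":
    print(parse_status("Download pending (Retry in: 00:24:16), retried: 4"))
    print(parse_status("Upload pending (Project backoff: 00:10:58)"))
```

The upload line yields no retry count, which is exactly the difference described above.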

Internet works, but not WCG :(
ID: 48408

SekeRob2
Joined: 6 Jul 10
Posts: 585
Italy
Message 48410 - Posted: 30 Mar 2013, 16:30:23 UTC - in response to Message 48392.  

"I can see it, so I must be special ;>)... it says something like [tried n]. Let me force the hand and pull the internet cable from the router, won't disconnect from router as that kind of immediately crashes running tasks on my Linux box [how old is that bug, I can't remember].

"Especially a problem on my Android devices (with NativeBoinc), where network connectivity isn't always there, either because my mobile doesn't have 3G coverage or because my tablet doesn't have a WiFi connection.
WUs still error out even with the network suspended; I wonder if NativeBoinc is trying to do a News update at that point.

"Claggy"

Well, the internet was up and WiFi was working, but both Linux boxes, trying in vain to connect to WCG, went into vomit mode. Since I live close to Francesco's quarters, maybe I should ask him to make a special plea to get this fixed, for the good of humanity.

24827 World Community Grid 30-03-2013 16:46 [sched_op] Starting scheduler request
24828 World Community Grid 30-03-2013 16:46 Sending scheduler request: To fetch work.
24829 World Community Grid 30-03-2013 16:46 Requesting new tasks for CPU
24830 World Community Grid 30-03-2013 16:46 [sched_op] CPU work request: 22.80 seconds; 0.00 devices
24831 World Community Grid 30-03-2013 16:47 [sched_op] Deferring communication for 1 min 12 sec
24832 World Community Grid 30-03-2013 16:47 [sched_op] Reason: Unrecoverable error for task E212505_359_C.32.C28H15NS2Se.00927237.4.set1d06_0
24833 World Community Grid 30-03-2013 16:47 Computation for task E212505_359_C.32.C28H15NS2Se.00927237.4.set1d06_0 finished
24834 World Community Grid 30-03-2013 16:47 Output file E212505_359_C.32.C28H15NS2Se.00927237.4.set1d06_0_0 for task E212505_359_C.32.C28H15NS2Se.00927237.4.set1d06_0 absent
24835 World Community Grid 30-03-2013 16:47 Output file E212505_359_C.32.C28H15NS2Se.00927237.4.set1d06_0_1 for task E212505_359_C.32.C28H15NS2Se.00927237.4.set1d06_0 absent
24836 World Community Grid 30-03-2013 16:47 Output file E212505_359_C.32.C28H15NS2Se.00927237.4.set1d06_0_2 for task E212505_359_C.32.C28H15NS2Se.00927237.4.set1d06_0 absent
24837 World Community Grid 30-03-2013 16:47 Starting task X0900126080715201104131446_0 using hcc1 version 705 in slot 2
24838 World Community Grid 30-03-2013 16:47 Scheduler request failed: Couldn't resolve host name
24839 World Community Grid 30-03-2013 16:47 [sched_op] Deferring communication for 3 min 5 sec
24840 World Community Grid 30-03-2013 16:47 [sched_op] Reason: Scheduler request failed
24841 World Community Grid 30-03-2013 16:47 Task X0900126080807201104131445_0 exited with zero status but no 'finished' file
24842 World Community Grid 30-03-2013 16:47 If this happens repeatedly you may need to reset the project.
24843 World Community Grid 30-03-2013 16:47 Restarting task X0900126080807201104131445_0 using hcc1 version 705 in slot 1
24844 World Community Grid 30-03-2013 16:47 Task X0900126080735201104131445_0 exited with zero status but no 'finished' file
24845 World Community Grid 30-03-2013 16:47 If this happens repeatedly you may need to reset the project.
24846 World Community Grid 30-03-2013 16:47 Started upload of E212505_359_C.32.C28H15NS2Se.00927237.4.set1d06_0_3
24847 World Community Grid 30-03-2013 16:47 Restarting task X0900126080735201104131445_0 using hcc1 version 705 in slot 0
24848 World Community Grid 30-03-2013 16:47 Task X0900126080803201104131445_0 exited with zero status but no 'finished' file
24849 World Community Grid 30-03-2013 16:47 If this happens repeatedly you may need to reset the project.
24850 30-03-2013 16:47 Project communication failed: attempting access to reference site
24851 World Community Grid 30-03-2013 16:47 Temporarily failed upload of E212505_359_C.32.C28H15NS2Se.00927237.4.set1d06_0_3: can't resolve hostname
24852 World Community Grid 30-03-2013 16:47 Backing off 3 min 7 sec on upload of E212505_359_C.32.C28H15NS2Se.00927237.4.set1d06_0_3
24853 World Community Grid 30-03-2013 16:47 Restarting task X0900126080803201104131445_0 using hcc1 version 705 in slot 3
24854 World Community Grid 30-03-2013 16:48 Task X0900126080715201104131446_0 exited with zero status but no 'finished' file
24855 World Community Grid 30-03-2013 16:48 If this happens repeatedly you may need to reset the project.
24856 30-03-2013 16:48 BOINC can't access Internet - check network connection or proxy configuration.

This had been going on for a good hour before I discovered it. I saw the WCG forums, and now they're gone too. I've set the clients to suspended network; that has never caused these failures to occur [yet].
ID: 48410

Nim1
Joined: 29 Mar 13
Posts: 6
Message 48415 - Posted: 30 Mar 2013, 17:28:13 UTC
Last modified: 30 Mar 2013, 17:41:59 UTC

I took a quick look at the code and found this:
http://boinc.berkeley.edu/trac/browser/boinc-v2/client/cs_cmdline.cpp

What is this? It seems to be a command-line interface for debugging, and there are lots of interesting options there, including:

--pers_giveup N :            giveup time for persistent file xfer
--pers_retry_delay_max N :   max for file xfer exponential backoff
--pers_retry_delay_min N :   min for file xfer exponential backoff
--retry_cap N :              exponential backoff limit
--sched_retry_delay_max N :  max for RPC exponential backoff
--sched_retry_delay_min N :  min for RPC exponential backoff
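
To see what a min and a max mean for an exponential back-off, here is a generic Python sketch. It uses plain doubling clamped between the two bounds, which is an assumption for illustration and not BOINC's actual formula (the client may well add randomness); the numeric values are also just examples:

```python
def backoff_delay(num_retries, delay_min, delay_max):
    """Generic clamped exponential back-off (not BOINC's exact formula).

    The delay starts at delay_min, doubles with each consecutive
    failure, and never exceeds delay_max.
    """
    if num_retries < 1:
        raise ValueError("num_retries must be at least 1")
    return min(delay_max, delay_min * 2 ** (num_retries - 1))


if __name__ == "__main__":
    # Illustrative bounds in seconds: 1 minute minimum, 6 hours maximum.
    for n in (1, 2, 5, 10):
        print(n, backoff_delay(n, 60, 21600))
    # With min == max the delay is constant, whatever the retry count.
    print(backoff_delay(10, 60, 60))
```

The last call is the interesting one for this thread: setting both bounds to the same value flattens the growth into a fixed delay.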


Does anyone know more about this command-line interface? It does not seem to be included with the default BOINC installation.
ID: 48415

Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15483
Netherlands
Message 48416 - Posted: 30 Mar 2013, 17:47:55 UTC - in response to Message 48415.  
Last modified: 30 Mar 2013, 18:12:19 UTC

"Does anyone know more about this command-line interface? It does not seem to be included with the default BOINC installation."

You can run the client (boinc.exe) from a command line, with those commands. You can then command the running client further with the BOINCCMD tool (BOINC Command).

BOINC Manager can be run from the command line with arguments as well. Do a boincmgr --help for which ones those are.

I am not going to help out on whether it's possible, or how, to circumvent the back-off period. It's there for a reason: so as not to put out a massive DDoS on project servers.
ID: 48416

Nim1
Joined: 29 Mar 13
Posts: 6
Message 48417 - Posted: 30 Mar 2013, 17:59:54 UTC
Last modified: 30 Mar 2013, 18:05:06 UTC

"You can run the client (boinc.exe) from a command line, with those commands. You can then command the running client further with the BOINCCMD tool (BOINC Command)."

Oh, right, they are part of the boinc.exe arguments. Thank you.

"I am not going to help out on whether it's possible, or how, to circumvent the back-off period. It's there for a reason: so as not to put out a massive DDoS on project servers."

I do understand that, but my case is a rather special one. My client needs to upload more than 20 tasks each day, and each of those is more than 35 MB in size. My internet connection is far too unstable for that to go smoothly, so if I don't do anything about those growing back-off periods, I simply will not be able to keep up with the uploads. I probably only need to change the pers_retry_delay_max value, which by default is 6 hours. I think I have enough info to pull this off now, so thanks again for all the info you provided.
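
What I have in mind is roughly the following, written as a small Python helper that only assembles the command line and does not start anything. The option names are the ones from cs_cmdline.cpp; that N is in seconds, that the executable is called `boinc`, and the helper name itself are my assumptions and still need checking against the source:

```python
def build_client_command(executable="boinc", max_delay_seconds=60):
    """Assemble a client command line that caps the transfer back-off.

    Assumptions: the option takes seconds, and "boinc" is the client
    binary (boinc.exe on Windows).
    """
    return [
        executable,
        "--pers_retry_delay_max", str(int(max_delay_seconds)),
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(" ".join(build_client_command()))
```

This would only affect my own client's transfer retries for a single host, which I hope keeps the load on the project server negligible.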
ID: 48417

SekeRob2
Joined: 6 Jul 10
Posts: 585
Italy
Message 48418 - Posted: 30 Mar 2013, 18:48:45 UTC

35 MB sounds suspiciously like the CEP2 uploads to Harvard of result file sub-part _4 [the work comes from WCG]. Harvard restricted upload speed in some way. At least, it never went any faster than 85 KB/s for me, whereas I get much more when sending results to WCG. Usually sub-part _4 is done in 7 minutes.

Once upon a time, the project would not even allow you to get work unless you had at least 128 KB/s of BOINC-measured speed... then many complained of not being able to participate, and a user override was created, set in the device profile, which kicks in when more than the default of 1 per host is set. What I'm saying is, an upload speed limit plus flaky internet makes for a bad experience.

But I'm speculating.
ID: 48418

Nim1
Joined: 29 Mar 13
Posts: 6
Message 48419 - Posted: 31 Mar 2013, 5:47:47 UTC

@SekeRob2

Yes, it is CEP2. My upload speed can probably reach 50 KB/s on a good day. The speed is not a worry for me, however; it's the connection being dropped every couple of minutes, combined with the back-off functionality in BOINC, that makes it a pain. At least I can try to make this work before giving up on the project.
ID: 48419

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.