I need a way to fine-tune download/upload back-off period
Author | Message |
---|---|
Joined: 29 Mar 13 Posts: 6 |
Hi, I'm on a rather unstable network and running a project that uploads a large amount of data for each of its tasks. The server supports resume, so if an upload fails in the middle, it can continue from where it left off. However, each time the connection gets interrupted, the BOINC software automatically sets a back-off period that gets bigger and bigger over time. I need to change that and set it to, let's say, back off only 1 minute no matter what. Any idea how I could achieve that? |
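The growing delay described in this post is what an exponential back-off with a lower and an upper bound would produce. The following is a minimal sketch of that idea, not BOINC's actual code: the function name `backoff_delay`, the doubling factor, and the 60 s and 6 h bounds are all assumptions chosen for illustration, and real clients usually add random jitter, which is omitted here.

```python
def backoff_delay(n_failures: int, min_delay: float = 60.0,
                  max_delay: float = 21600.0) -> float:
    """Seconds to wait before the next retry after n_failures consecutive failures.

    Hypothetical helper: the delay doubles with each failure and is clamped
    to the range [min_delay, max_delay].
    """
    if n_failures < 1:
        return 0.0
    # Cap the exponent so very long failure streaks cannot overflow a float.
    exponent = min(n_failures - 1, 32)
    delay = min_delay * (2 ** exponent)
    return max(min_delay, min(delay, max_delay))


# Delay after each of the first ten consecutive failures, with growing back-off.
schedule = [backoff_delay(n) for n in range(1, 11)]

# Setting the upper bound equal to the lower bound yields the fixed
# one-minute delay asked for in the post.
fixed = [backoff_delay(n, max_delay=60.0) for n in range(1, 11)]
```

With the bounds equal, every retry waits the same time, which is the behaviour requested here; whether the client exposes such bounds is a separate question.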
Joined: 6 Jul 10 Posts: 585 |
What you are saying is that the back-off keeps getting bigger, meaning the upload failures are consecutive. I doubt there's a BOINC way of overriding the increments if there's nothing in the cc_config.xml manual. Take the code, which is open source, and doctor the back-off code. I can't remember exactly, but I seem to recall there was also some back-off counter that considers the upload to have permanently failed at 100(?) tries. You can see the upload counter in the Transfer view. With what you want, that could mean your upload is aborted after 100 minutes. Anyway, that part is not readily documented [no current Google hits, but a mod at CPDN wrote the same in 2010]. Coelum Non Animum Mutant, Qui Trans Mare Currunt |
Joined: 13 Aug 06 Posts: 778 |
When the upload time limit for files was 14 days, I opened a Trac ticket asking for more time: http://boinc.berkeley.edu/trac/ticket/919#comment:2 David Anderson replied saying he'd changed the limit to, I think, three months. I was astonished he'd made such a generous limit. He wrote that on the ticket, but everything except my initial request has now been removed, and the link at the bottom saying r18845 doesn't work. So I don't know where this is now documented. I can only say that I read David's response more than once because it was my ticket. I think Jorden mentioned the 100 upload attempts allowed. There must be a limit. |
Joined: 29 Mar 13 Posts: 6 |
Thank you guys for your answers. "Doubt there's a BOINC way of overriding the increments, if there's nothing in the cc_config.xml manual. Take the code, open source, and doctor the back-off code." Yes, that was my thought as well, but first I wanted to make sure that I wasn't missing anything. Thank you. "...there was also some back-off counter that considers the upload to have permanently failed at 100(?) tries" and "I think Jorden mentioned the 100 upload attempts allowed. There must be a limit." Oh, I did not know that. Thank you for pointing it out. "You see the upload counter in the Transfer view." Unfortunately, I'm unable to see such a counter there. Project, File, Progress, Size, Elapsed Time, Speed and Status are the only columns I can see. Given that seeing such a value could really help me find the right code, I'm interested in knowing why I cannot see that counter. Am I missing anything? "When the upload time limit for files was 14 days ... he'd changed the limit to I think three months." I did not know about such a limit either. I suppose it wouldn't have been a problem for me even at 14 days, if I could disable the code responsible for increasing the back-off period. But I should probably take a look at the code to confirm that. Thank you. |
Joined: 13 Aug 06 Posts: 778 |
None of us can see the counter that tallies the number of upload attempts. It isn't visible in BOINC Manager; it's hidden somewhere in the BOINC code. Perhaps someone can think of an especially imaginative and creative instruction that could be added to the configuration file. But you only want to disable the increasing time delays between retries. You still want to keep a fixed, shorter time delay. |
Joined: 6 Jul 10 Posts: 585 |
I can see it, so I must be special ;>)... it says something like [tried n]. Let me force its hand and pull the internet cable from the router; I won't disconnect from the router itself, as that kind of immediately crashes running tasks on my Linux box [how old is that bug? I can't remember]. The 14-day limit also rings a bell. There's an issue with the number of stuck uploads and not fetching work, so I would be loath to set such a value to 3 months. Work fetch is stopped for a project if its pending uploads exceed 2 times the number of cores in a host. I recently saw it when I had about 20 results waiting to upload on an 8-core host and the client was dying to get work. That point of too many was reached in 12 hours. Maybe I'm [again] confusing the 100 with the number that is allowed on restarts of tasks... those abort with 100 zero-status conditions. Then there's the "Too Many Exits" logged. |
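The work-fetch rule mentioned in this post (no new work for a project once its pending uploads exceed twice the number of cores) can be written down as a one-line check. This is only a sketch of the rule as stated in the thread, not the client's real logic: the function name `work_fetch_blocked` is hypothetical, and reading "exceeds" as strictly greater than is an assumption.

```python
def work_fetch_blocked(pending_uploads: int, n_cores: int, factor: int = 2) -> bool:
    """True if pending uploads exceed factor * n_cores (rule as described in the thread)."""
    return pending_uploads > factor * n_cores


# The case from the post: about 20 results waiting to upload on an 8-core host.
blocked = work_fetch_blocked(20, 8)    # 20 > 16
# Exactly at the threshold, work fetch would still be allowed under this reading.
at_limit = work_fetch_blocked(16, 8)   # 16 > 16 is false
```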
Joined: 6 Dec 06 Posts: 118 |
Apparently there is no 100-retry limit, as this is from the upload info of a Cels@Home file, pulled from client_state.xml: <persistent_file_xfer> <num_retries>105</num_retries> As for the day limit, I believe, like mo.v, that it is set at 90. I have 4 uploads for Cels@Home and 18 uploads for UCT Malaria for which I clear the persistent status on the first day of every odd-numbered month. This has been working for several years to keep the uploads from expiring. Also, update the report_deadline to keep the results from being in deadline warning. |
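For anyone wanting to check their own retry counts the same way, here is a small sketch that reads `num_retries` values out of XML shaped like the fragment quoted above. Only the `<persistent_file_xfer>` and `<num_retries>` tags come from the thread; the surrounding `<client_state>`, `<file>` and `<name>` elements and the file name in the sample are assumptions, so the real file layout may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: only persistent_file_xfer/num_retries are taken from the thread.
SAMPLE = """
<client_state>
  <file>
    <name>example_result_0</name>
    <persistent_file_xfer>
      <num_retries>105</num_retries>
    </persistent_file_xfer>
  </file>
</client_state>
"""


def retry_counts(xml_text: str) -> list:
    """Return every num_retries value found under a persistent_file_xfer element."""
    root = ET.fromstring(xml_text.strip())
    counts = []
    for xfer in root.iter("persistent_file_xfer"):
        text = xfer.findtext("num_retries")
        if text is not None:
            counts.append(int(text))
    return counts


counts = retry_counts(SAMPLE)
```

Reading the file is harmless; editing it while the client is running is a different matter and is not covered by this sketch.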
Joined: 23 Apr 07 Posts: 1112 |
"I can see it, so I must be special ;>)... it says something like [tried n]. Let me force the hand and pull the internet cable from the router, won't disconnect from router as that kind of immediately crashes running tasks on my Linux box [how old is that bug, I can't remember]." That is especially a problem on my Android devices (with NativeBoinc), where network connectivity isn't always there, either because my mobile doesn't have 3G coverage or my tablet doesn't have a WiFi connection. WUs still error out even with the network suspended; I wonder if NativeBoinc is trying to do a News update at that point. Claggy |
Joined: 29 Mar 13 Posts: 6 |
@BobCat13: Thank you for looking that up. And thank you all for the tips and suggestions. I will try to look at the code in my spare time and will post the result. |
Joined: 5 Oct 06 Posts: 5082 |
"When the upload time limit for files was 14 days I opened a Trac ticket asking for more time:" Your r18845 became Revision: 511637c9b6402b9275af6f86e83f0323fa5b893d; Author: David Anderson <davea@ssl.berkeley.edu>; Date: 14/08/2009 20:00:29; Message: "- client: in the final stage of CPU scheduling, give preference to multi-threaded jobs. Avoid running N-1 1-thread jobs and 1 N-thread job on N CPUs - client: change file transfer giveup time from 14 to 90 days svn path=/trunk/boinc/; revision=18845" but it seems to have vanished again with all the messing around from SVN to GIT and now GIT-v2. |
Joined: 6 Jul 10 Posts: 585 |
[Quote snipped] "I can see it, so I must be special ;>)... it says something like [tried n]. Let me force the hand and pull the internet cable from the router, won't disconnect from router as that kind of immediately crashes running tasks on my Linux box [how old is that bug, I can't remember]." Yes, I saw it, but my mind's eye had it recorded differently [:red cheeks smiley]... it's the downloads that print the count; uploads just give the incremental back-off times. Three entries from the Transfers list: DSFL_00070-16_0000055_0471_DSFL_00070-16_0000055_0471.job 0.000 3.56 K 00:00:00 - 01:43:29 0.00 Kbps Download pending (Retry in: 00:24:16), retried: 4; DSFL_00070-16_0000055_0471_DSFL_00070-16_0000055_0471.zip 0.000 5.45 K 00:00:00 - 02:18:04 0.00 Kbps Download pending (Retry in: 00:58:51), retried: 5; GFAM_x1QNG_PfCypA_0086491_0050_0_0 0.000 231.54 K 00:00:00 - 00:21:05 0.00 Kbps Upload pending (Project backoff: 00:10:58). Internet works, but not WCG :( |
Joined: 6 Jul 10 Posts: 585 |
"I can see it, so I must be special ;>)... it says something like [tried n]. Let me force the hand and pull the internet cable from the router, won't disconnect from router as that kind of immediately crashes running tasks on my Linux box [how old is that bug, I can't remember]." Well, the internet was up and WiFi was working, but both Linux boxes, trying in vain to connect to WCG, went into vomit mode. Since you live close to Francesco's quarters, ask him if maybe he can make a special plea to get this fixed, for the good of humanity.
24827 World Community Grid 30-03-2013 16:46 [sched_op] Starting scheduler request
24828 World Community Grid 30-03-2013 16:46 Sending scheduler request: To fetch work.
24829 World Community Grid 30-03-2013 16:46 Requesting new tasks for CPU
24830 World Community Grid 30-03-2013 16:46 [sched_op] CPU work request: 22.80 seconds; 0.00 devices
24831 World Community Grid 30-03-2013 16:47 [sched_op] Deferring communication for 1 min 12 sec
24832 World Community Grid 30-03-2013 16:47 [sched_op] Reason: Unrecoverable error for task E212505_359_C.32.C28H15NS2Se.00927237.4.set1d06_0
24833 World Community Grid 30-03-2013 16:47 Computation for task E212505_359_C.32.C28H15NS2Se.00927237.4.set1d06_0 finished
24834 World Community Grid 30-03-2013 16:47 Output file E212505_359_C.32.C28H15NS2Se.00927237.4.set1d06_0_0 for task E212505_359_C.32.C28H15NS2Se.00927237.4.set1d06_0 absent
24835 World Community Grid 30-03-2013 16:47 Output file E212505_359_C.32.C28H15NS2Se.00927237.4.set1d06_0_1 for task E212505_359_C.32.C28H15NS2Se.00927237.4.set1d06_0 absent
24836 World Community Grid 30-03-2013 16:47 Output file E212505_359_C.32.C28H15NS2Se.00927237.4.set1d06_0_2 for task E212505_359_C.32.C28H15NS2Se.00927237.4.set1d06_0 absent
24837 World Community Grid 30-03-2013 16:47 Starting task X0900126080715201104131446_0 using hcc1 version 705 in slot 2
24838 World Community Grid 30-03-2013 16:47 Scheduler request failed: Couldn't resolve host name
24839 World Community Grid 30-03-2013 16:47 [sched_op] Deferring communication for 3 min 5 sec
24840 World Community Grid 30-03-2013 16:47 [sched_op] Reason: Scheduler request failed
24841 World Community Grid 30-03-2013 16:47 Task X0900126080807201104131445_0 exited with zero status but no 'finished' file
24842 World Community Grid 30-03-2013 16:47 If this happens repeatedly you may need to reset the project.
24843 World Community Grid 30-03-2013 16:47 Restarting task X0900126080807201104131445_0 using hcc1 version 705 in slot 1
24844 World Community Grid 30-03-2013 16:47 Task X0900126080735201104131445_0 exited with zero status but no 'finished' file
24845 World Community Grid 30-03-2013 16:47 If this happens repeatedly you may need to reset the project.
24846 World Community Grid 30-03-2013 16:47 Started upload of E212505_359_C.32.C28H15NS2Se.00927237.4.set1d06_0_3
24847 World Community Grid 30-03-2013 16:47 Restarting task X0900126080735201104131445_0 using hcc1 version 705 in slot 0
24848 World Community Grid 30-03-2013 16:47 Task X0900126080803201104131445_0 exited with zero status but no 'finished' file
24849 World Community Grid 30-03-2013 16:47 If this happens repeatedly you may need to reset the project.
24850 30-03-2013 16:47 Project communication failed: attempting access to reference site
24851 World Community Grid 30-03-2013 16:47 Temporarily failed upload of E212505_359_C.32.C28H15NS2Se.00927237.4.set1d06_0_3: can't resolve hostname
24852 World Community Grid 30-03-2013 16:47 Backing off 3 min 7 sec on upload of E212505_359_C.32.C28H15NS2Se.00927237.4.set1d06_0_3
24853 World Community Grid 30-03-2013 16:47 Restarting task X0900126080803201104131445_0 using hcc1 version 705 in slot 3
24854 World Community Grid 30-03-2013 16:48 Task X0900126080715201104131446_0 exited with zero status but no 'finished' file
24855 World Community Grid 30-03-2013 16:48 If this happens repeatedly you may need to reset the project.
24856 30-03-2013 16:48 BOINC can't access Internet - check network connection or proxy configuration.
This went on for a good hour before I discovered it. I saw the WCG forums, and now they're gone too. I've set the clients to suspended network; that has never caused these failures [yet]. |
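To track how the back-off actually grows on a given host, the "Backing off ... on upload of ..." lines in a log like the one above can be picked out with a regular expression. This is a rough sketch based only on the message wording visible in this post; the helper name `parse_backoff` is hypothetical, and support for an "hr" component or for downloads is an assumption about messages not shown here.

```python
import re

# Matches e.g. "Backing off 3 min 7 sec on upload of <file>".
# The optional "hr" part and the "download" alternative are assumptions.
BACKOFF_RE = re.compile(
    r"Backing off (?:(\d+) hr )?(?:(\d+) min )?(?:(\d+) sec )?"
    r"on (upload|download) of (\S+)"
)


def parse_backoff(line: str):
    """Return (direction, file_name, delay_seconds), or None if the line does not match."""
    match = BACKOFF_RE.search(line)
    if match is None:
        return None
    hours, minutes, seconds, direction, name = match.groups()
    total = int(hours or 0) * 3600 + int(minutes or 0) * 60 + int(seconds or 0)
    return direction, name, total


line = ("24852 World Community Grid 30-03-2013 16:47 Backing off 3 min 7 sec "
        "on upload of E212505_359_C.32.C28H15NS2Se.00927237.4.set1d06_0_3")
result = parse_backoff(line)
```

Collecting the delay for the same file across successive log entries would show whether the back-off is growing or holding steady.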
Joined: 29 Mar 13 Posts: 6 |
I took a quick look at the code and found this: http://boinc.berkeley.edu/trac/browser/boinc-v2/client/cs_cmdline.cpp What is this? It seems to be a command-line interface for debugging, and there are lots of interesting options there, including: --pers_giveup N : giveup time for persistent file xfer; --pers_retry_delay_max N : max for file xfer exponential backoff; --pers_retry_delay_min N : min for file xfer exponential backoff; --retry_cap N : exponential backoff limit; --sched_retry_delay_max N : max for RPC exponential backoff; --sched_retry_delay_min N : min for RPC exponential backoff. Does anyone know more about this command-line interface? It does not seem to be included with the default BOINC installation. |
Joined: 29 Aug 05 Posts: 15483 |
"Does anyone know more about this command-line interface? It does not seem to be included with the default BOINC installation." You can run the client (boinc.exe) from a command line with those options. You can then control the running client further with the BOINCCMD tool (BOINC Command). BOINC Manager can be run from the command line with arguments as well; run boincmgr --help to see which ones those are. I am not going to help with whether and how it's possible to circumvent the back-off period. It's there for a reason: so as not to unleash a massive DDoS on project servers. |
Joined: 29 Mar 13 Posts: 6 |
"You can run the client (boinc.exe) from a command line, with those commands. You can then command the running client further with the BOINCCMD tool (BOINC Command)." Oh, right, they are arguments to boinc.exe. Thank you. "I am not going to help out if it's possible and how to circumvent the back-off period. It's there for a reason, so not to put out a massive DDoS on project servers." I do understand that, but my case is a rather special one. My client needs to upload more than 20 tasks each day, and each of those is more than 35 MB in size, and my internet connection is far too unstable for that to go smoothly. So if I don't do anything about those increasing back-off periods, I simply will not be able to keep up with the uploads. I probably only need to change the pers_retry_delay_max value, which by default is 6 hours. I think I have enough info to pull this off now, so thanks again for all the info you provided. |
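As a sketch of what passing those options might look like, the snippet below only builds an argument list; it does not start the client. The option names are the ones quoted from cs_cmdline.cpp earlier in the thread. Everything else is an assumption: the executable name `boinc`, the helper name `client_argv`, the value 60, and the idea that N is given in seconds, none of which the thread confirms. Whether a given client version still accepts these options is also unverified, and the back-off exists to protect project servers, as pointed out above.

```python
def client_argv(executable: str = "boinc", **options) -> list:
    """Build an argument list such as ['boinc', '--pers_retry_delay_max', '60'].

    Hypothetical helper: it formats option names and values and does nothing else.
    """
    argv = [executable]
    for name, value in options.items():
        argv += [f"--{name}", str(value)]
    return argv


# Assumed values: cap the file-transfer back-off at 60 (units assumed to be seconds).
argv = client_argv(pers_retry_delay_min=60, pers_retry_delay_max=60)
```

The resulting list could be handed to `subprocess.run` once the option semantics have been checked against the client source for the version in use.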
Joined: 6 Jul 10 Posts: 585 |
35 MB sounds suspiciously like CEP2 uploads to Harvard of the result file sub-part _4 [work comes from WCG]. Harvard restricted upload speed in some way. At least, it never went any faster than 85 KB/s for me, while I get much more when sending results to WCG. Usually sub-part _4 is done in 7 minutes. Once upon a time, the project would not even allow you to get work unless you had at least 128 KB/s BOINC-measured speed... then many complained of not being able to participate and a user override was created, set in the device profile, which kicks in when more than the default of 1 per host is set. What I'm saying is that an upload speed limit and flaky internet make for a bad experience. But I'm speculating. |
Joined: 29 Mar 13 Posts: 6 |
@SekeRob2 Yes, it is CEP2. My upload speed could probably go up to 50 KB/s on a good day. The speed is not a worry for me, however; it's the connection being dropped every couple of minutes, combined with the back-off functionality in BOINC, that makes it a pain. At least I can try to make this work before giving up on the project. |
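The figures quoted in this thread (more than 20 uploads a day, more than 35 MB each, about 50 KB/s on a good day, 85 KB/s for the other poster) allow a rough estimate of the pure transfer time involved. The sketch below assumes 1 MB = 1024 KB and ignores interruptions, resume overhead and back-off waits, so the results are lower bounds; the function name `upload_seconds` is made up for illustration.

```python
def upload_seconds(size_mb: float, speed_kb_per_s: float) -> float:
    """Seconds needed to send size_mb megabytes at speed_kb_per_s, assuming 1 MB = 1024 KB."""
    return size_mb * 1024 / speed_kb_per_s


# One 35 MB file at 50 KB/s: about 717 s, i.e. roughly 12 minutes.
per_file_50 = upload_seconds(35, 50)

# Twenty such files per day: roughly 4 hours of uninterrupted transfer.
daily_hours_50 = 20 * per_file_50 / 3600

# The same file at 85 KB/s takes about 7 minutes, which matches the
# "done in 7 minutes" figure mentioned for sub-part _4.
per_file_85_minutes = upload_seconds(35, 85) / 60
```

With about four hours of transfer needed per day even in the best case, long back-off waits after each dropped connection leave little slack, which is consistent with the concern raised here.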
Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.