Message boards : Questions and problems : Please, help - too many downloaded jobs/too many jobs in progress
Author | Message |
---|---|
Joined: 11 May 20 Posts: 13 |
Hi BOINC experts, I need your help. Before writing this message I spent long hours trying to find a solution in the BOINC documentation, forums etc. Unfortunately, I have not been able to find anything that solves my problem, so I would like to ask you for help. I joined Rosetta with my computer cluster. There are 11 powerful Xeon computers which are idle at the moment due to the current situation, so I decided to use them for Rosetta. It worked OK for 2 weeks, but now I can see a problem with some of them. I use BOINC 7.9.3 on Linux without a GUI, just text mode (command line). I set the target CPU run time to 30h (1 day and 6 hours) to minimise network connections. I set 100% CPU use and time and 90% memory, but apart from those I use the default settings. 8 computers work absolutely fine, but 3 of them download many more tasks than they can compute :( For instance, I have 16 cores in each machine and the deadline is 72h, so each machine should hold at most 32 work units (2 per core) to be able to complete them. Unfortunately I have more than 60 (sometimes even up to 100), so I miss deadlines or I return tasks which have already been completed by somebody else. I would like to stress that all computers are exactly the same: the same CPU, RAM, HDD, OS and its configuration, BOINC version etc., and they have achieved very similar average credit within the last month; differences are within 10%. The problem is just on 3 machines; the rest keep around 24 tasks, so 16 in progress and 8 waiting to be processed, which is fine. I have read the BOINC documentation on how to configure a client and a project. I created config.xml in the Rosetta project folder with max_wus_in_progress limited to 2 per core and I limited max_ncpus to 16: no change at all. Then I created config_aux.xml in the Rosetta folder with a total job limit: again nothing. I read the event log and I cannot find any errors. 
By the way, it would be great if the BOINC documentation provided some example config files, or at least used a different font style to show what is a command and what is a parameter; sometimes it is not obvious. At the moment I control it by manually stopping and starting the allowance of new tasks, but I cannot do that during weekends, so after my return I always need to abort tens of jobs. Could somebody help me, please? Thanks in advance. |
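The capacity estimate in the post above (16 cores, 72h deadline, 30h run time, so 32 work units) can be checked with a short calculation. This is only a sketch: the function name `max_tasks_before_deadline` is made up for illustration, and it assumes that every task runs for exactly the target run time and that each core works on one task at a time.

```python
def max_tasks_before_deadline(cores, deadline_h, run_time_h):
    """Upper bound on tasks one machine can finish before the deadline.

    Assumption (not from BOINC itself): each task takes exactly
    run_time_h hours and each core runs one task at a time.
    """
    tasks_per_core = int(deadline_h // run_time_h)  # only whole tasks count
    return cores * tasks_per_core


# Figures from the post: 16 cores, 72 h deadline, 30 h target run time.
print(max_tasks_before_deadline(16, 72, 30))  # 32
```

With 60 to 100 tasks on hand, a machine is therefore holding roughly two to three times what it can return in time under these assumptions.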
Joined: 25 Nov 05 Posts: 1654 |
In your Manager's menu, Options > Computing preferences > Computing, there are two settings which control the amount of work that computer gets. Store at least n days and Store up to an additional n days. These define how much of a buffer you want to keep. Set these to a very low number, maybe 0.1 days, and you should get a more reasonable amount of work. And make sure you set that to use local preferences. (At the top.) |
Joined: 11 May 20 Posts: 13 |
Hi Les, thanks for your extremely quick answer, I am impressed :) At the moment my settings are 0.1 and 1, so it should still be fine, but of course I will try to change them tomorrow and check if it works. I do not use the GUI, but I know in which file I need to change the options. However, all the options are exactly the same on all computers, so why do 3 of them take many more tasks? Could you tell me why the xml files did not work in that case, please? Thanks again! |
Joined: 25 Nov 05 Posts: 1654 |
Oops. When I got stuck into the answer, I forgot about "no gui". These settings are also on your Account page at each project, under Preferences, but that affects all computers. As to why some of your computers are OK, it may be the work application that was sent. Different apps have different requirements. No idea about the xml files. Someone else may post about that. |
Joined: 5 Mar 08 Posts: 272 |
If you want to stick with the CLI you can use BOINCtui to visually see what they are doing, one machine at a time. With multiple machines it's best to use BOINCtasks on one machine that you trust (usually the one you use to remote into the others). It doesn't need to run BOINC at all. I'm using a Windows laptop to look after my cluster. If you don't have a Windows machine you can run it under Wine. You can download it from https://efmer.com/ by clicking on the BoincTasks option at the top of the screen. If you need help with configuring it just ask. It really makes managing a fleet of machines so much easier when you can see all of them on one screen. If you can, update your boinc-client to a later version. To start with I would set all of the machines to no new tasks and let them run off their current work. Rosetta have stated they start looking at results 48 hours after they send a batch out, so setting the default run time as high as yours is a waste of effort. Set a zero cache. After it has finished off all work I would set Rosetta closer to the default 8 hour target run time, maybe go for 12 hours. Adjust the target run time and tell the BOINC client to update so it picks up the latest setting from the project before requesting any new work. As for the xml files, BOINC doesn't use config_aux. It uses an app_config.xml file, which goes in the project-specific folder. Under Debian and Ubuntu it would be: /var/lib/boinc-client/projects/boinc.bakerlab.org_rosetta/app_config.xml If you want to limit it to half the available cores then you would use something like this in it: <app_config> <project_max_concurrent>6</project_max_concurrent> </app_config> That will limit it to running 6 at a time. Adjust the number as you want. Once you've returned at least 11 and they've validated then you can adjust your cache setting. I run 0.1 days with no extra. That gives the project a fast turnaround and you don't get overloaded with tasks. 
I'd suggest you have all of the machines on the same settings, seeing as they have the same hardware config. The project doesn't recommend more than a 1 day cache due to the short deadlines (which are 3 days). MarkJ |
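Since the machines in this thread are managed by scripts, the app_config.xml contents quoted above could also be generated rather than typed by hand. The following is a minimal sketch: the function name `build_app_config` is hypothetical, and only the `app_config` and `project_max_concurrent` element names come from the post. Writing the result into the project folder is left out on purpose.

```python
import xml.etree.ElementTree as ET


def build_app_config(max_concurrent):
    """Return app_config.xml text limiting concurrent tasks for one project.

    Hypothetical helper; the element names are the ones quoted in the post.
    """
    root = ET.Element("app_config")
    root.text = "\n  "  # newline and indent before the child element
    limit = ET.SubElement(root, "project_max_concurrent")
    limit.text = str(max_concurrent)
    limit.tail = "\n"  # newline before the closing </app_config>
    return ET.tostring(root, encoding="unicode")


print(build_app_config(6))
```

The printed text has the same structure as the example in the post, spread over three lines.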
Joined: 25 May 09 Posts: 1301 |
Better to store 1 and an additional 0.1 |
Joined: 5 Oct 06 Posts: 5129 |
Even better to store 0.1 plus 0.1, or even less. Normally I would suggest looking in the Event Log, to work out exactly what work was requested, and how much was received in response. That could be done for a sample machine, using boinccmd - though refer also to the command line help, which is slightly different: --get_message_count (show largest message seqno); --get_messages [ seqno ] (show messages > seqno). You can control the level and nature of the detail shown with client configuration. |
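For a scripted setup, the boinccmd call mentioned above could be wrapped as follows. This is a sketch under assumptions: the helper names `get_messages_command` and `fetch_messages` are invented here, only the `--get_messages [ seqno ]` option quoted in the post is used, and boinccmd is assumed to be on the PATH and able to reach the local client.

```python
import subprocess


def get_messages_command(seqno=0):
    """Build the argument list for 'boinccmd --get_messages seqno'."""
    return ["boinccmd", "--get_messages", str(seqno)]


def fetch_messages(seqno=0):
    """Run boinccmd and return its output as text.

    Assumes boinccmd is installed and can reach the local client;
    raises subprocess.CalledProcessError if the command fails.
    """
    result = subprocess.run(
        get_messages_command(seqno),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `fetch_messages(0)` on one of the misbehaving machines would return the event log messages with a sequence number above 0, which can then be searched for the work requests and replies.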
Joined: 5 Mar 08 Posts: 272 |
"Better to be store 1 and an additional 0.1" Rosetta is a bit different. They have a target run time. It defaults to 8 hours if not set by the user, but Mippi said he had it much higher. Work units will generally run for that length of time. They have allowed a +10 hour leeway where their watchdog timer will allow it to run over the target (it used to be 4 hours). They are using 3 day deadlines. With a target run time of 12 hours, assuming he reduces it to that, it doesn’t leave much time to return them in. It would be better to cache a minimum of work in these circumstances. The project doesn’t seem to have a shortage of tasks at the moment. Mippi also asked about app_config, so assuming he sets that to half the available threads, he gets half of them running at once. He is running an older version of BOINC which doesn’t have the work fetch/max concurrent logic that 7.16 has, so it gets tasks for all available threads even though they can’t all run at once. He may only need to reserve a couple of threads for the system, in which case setting the % of processors available in BOINC would be better than using an app_config, as it wouldn’t try to fetch more work than there are available threads. MarkJ |
Joined: 11 May 20 Posts: 13 |
Hi MarkJ, thanks very much for your reply. "If you want to stick with the CLI you can use BOINCtui to visually see what they are doing, one machine at a time." I did learn about BOINCtui and it looks really nice, but I wrote some bash scripts to control all the machines, so it was not very useful for me, especially as you need to connect to each machine separately. I did not know about BOINCtasks and I will certainly test it, thanks for your advice. It looks like a piece of software I really need. "If you can update your boinc-client to a later version." I would like to update BOINC, but it seems that for my Linux version this is the newest version. However, I will update the OS on all my stations quite soon, so then I will install a newer version. I do not understand what you stated: "Rosetta have stated they start looking at results 48 hours after they send a batch out so setting the default run time as high as yours is a waste of effort." Could you explain it in more detail, please? I have read many pages on setting target CPU times and all of them said that it does not matter, as all models need to be tested anyway, so the longer the time you set, the more models you can calculate, which makes the process more efficient. However, that could be wrong, so why do you think it is a waste of time? I set 29h just because I want to reduce network use. My computers work 24/7 anyway, so I can use more of their time. "As for the xml files BOINC doesn't use config_aux. It uses an app_config.xml file, It goes in the project specific folder. Under Debian and Ubuntu it would be:" I am very surprised by the fact that BOINC ignores config_aux.xml; it is clearly described here: https://boinc.berkeley.edu/trac/wiki/ProjectConfigFile So, how can I know what is used and what is ignored by the software? Is there any documentation which provides a correct configuration description? Let me know, please. At the moment I do not want to limit my cores, I want to use the full power of my computers. 
Maybe I will use it in the future, but not now. While I have the opportunity, I would like to ask whether it is possible to configure BOINC so that, for instance, 1 core is dedicated to one project, 2 cores are fixed to another project and the rest of the cores go to a third project? "Once you've returned at least 11 and they've validated then you can adjust your cache setting. I run 0.1 days with no extra. That gives the project a fast turn around and you don't get overloaded with tasks. I'd suggest you have all of the machines with the same settings seeing as they are the same hardware config. The project don't recommend more than a 1 day cache due to the short deadlines (which are 3 days)." I have already changed the value to 0.1 as you suggested and I will see the effect within the next few days. Thanks :) |
Joined: 11 May 20 Posts: 13 |
Hi Richard, thanks for your reply, I will test it tomorrow, I did not know the options you mentioned. Thanks! |
Joined: 29 Aug 05 Posts: 15569 |
config_aux.xml is a file on the server; it's used for configuration of the BOINC back-end on the project's server. It's not a file you can use on the client. BOINC for users only has app_info.xml, app_config.xml and cc_config.xml. "As for the xml files BOINC doesn't use config_aux. It uses an app_config.xml file, It goes in the project specific folder. Under Debian and Ubuntu it would be:" You're in the wrong wiki, the user manual starts here: https://boinc.berkeley.edu/wiki/User_manual |
Joined: 11 May 20 Posts: 13 |
Thanks very much indeed, it is clear now :) |
Joined: 5 Mar 08 Posts: 272 |
"I do not understand what you stated: Rosetta have stated they start looking at results 48 hours after they send a batch out so setting the default run time as high as yours is a waste of effort. Could you explain it in more detail, please? I have read many pages on setting target CPU times and all of them said that it does not matter, as all models need to be tested anyway, so the longer the time you set, the more models you can calculate, which makes the process more efficient. However, that could be wrong, so why do you think it is a waste of time? I set 29h just because I want to reduce network use. My computers work 24/7 anyway, so I can use more of their time." It was a post by one of the moderators in their Number Crunching forum. As such I would suggest a target run time of 12 hours. If you end up setting the cache back to one day then that is potentially 36 hours of work on your system (one day of cache plus your target run time for the work currently running). If you set the cache to 0.5 days it would be 24 hours. "At the moment I do not want to limit my cores, I want to use the full power of my computers. Maybe I will use it in the future, but not now." You could use the app_config to limit how many cores each project can use at a time, but if one project runs out of work you’ll end up with idle cores. Also, I mentioned before that the 7.16 client has a fix for work fetch when used with max_concurrent; the older BOINC client you are running doesn’t have that fix, so it will fetch more work than it can run. BOINC is designed to share resources based upon your project weighting. It should balance things out over time. MarkJ |
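The cache arithmetic in the post above (one day of cache plus a 12 hour target run time gives 36 hours) can be written out as a small calculation. This is an illustrative sketch only: the function name `hours_of_work_on_hand` is made up, and it treats the cache setting and the target run time as simply additive, as the post does.

```python
def hours_of_work_on_hand(cache_days, target_run_time_h):
    """Hours of work potentially held: cached work plus the running task.

    Simplified model taken from the post: cache (in days) converted to
    hours, plus the target run time of the work currently running.
    """
    return cache_days * 24 + target_run_time_h


DEADLINE_H = 72  # Rosetta's 3 day deadline mentioned in the thread

for cache_days in (1.0, 0.5, 0.1):
    total = hours_of_work_on_hand(cache_days, 12)
    print(f"cache {cache_days} d -> {total:.1f} h of the {DEADLINE_H} h deadline")
```

The smaller the cache, the more of the 72 hour deadline is left as a safety margin for tasks that overrun their target.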
Joined: 8 Nov 19 Posts: 718 |
100 tasks are ok for your Xeon to finish in 72 hours, provided each task takes a few hours to finish. Remember that each core finishes 1 task. So if it takes 8 hours per task, each core does 9 tasks in the 72 hour timeframe, resulting in 144 tasks being done on a 16 core machine. Quite often, tasks use less than 8 hours (some use more too). Boinc will learn from how fast you crunch through tasks, and adjust accordingly. It won't retrieve more tasks, if it sees that it can't finish the tasks on time. Some tasks may be very short too! I've had tasks ranging from 5 minutes, to 3 days. Though the 3 day tasks usually have a deadline of several weeks, not 72 hours. |
Joined: 11 May 20 Posts: 13 |
Hi ProDigit, If you read my e-mail carefully, I set processing time to 29h, just to save network use, therefore you need to divide your calculations by 3.5 to see how many jobs I can complete. Anyway, I think I solved the problem, so thanks for everyone's help and input. |
Joined: 11 May 20 Posts: 13 |
Thanks again, I will install a newer kernel and a newer BOINC next week, which should solve a lot of problems. |
Joined: 11 May 20 Posts: 13 |
Hi Jord, I have one more question for you: is there any method to assign a project to a GPU? I have got two cards in each station; one is very old and slow and the second one is much faster. They are called Device 0 and Device 1 in my log. At the moment I can see that jobs are directed to the GPUs at random. Unfortunately some projects do not need many resources and some need a lot, so I would like to assign the more demanding projects to the more powerful card. Is there any method to do that? Thanks in advance for your help! |
Joined: 29 Aug 05 Posts: 15569 |
There are several options, but I am short on time for now, so I will answer you tonight my time. A quick solution is to remove the use_all_gpus option from cc_config.xml and restart the client. Then only the better GPU is used. But there is plenty of fine tuning possible, which I will give tips on tonight, if someone else doesn't chip in first. |
Joined: 8 Nov 19 Posts: 718 |
Hi ProDigit, If all you need is disable the network at some part of the day, it can be done too; while at the same time allow Boinc to continue crunching without accessing the network. My network often goes down, and in this time, it'll just crunch on the data it has downloaded. One of the main reasons I'm crunching on boinc, and not on FAH which needs access to a network at any point in time. |
Joined: 11 May 20 Posts: 13 |
Thanks in advance and looking forward to having your solutions. Just to be clear, I would like to use both cards, just the weaker one for a less demanding project and the better one for a more demanding project and I would like to be sure that jobs from each project will be directed correctly. |
Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.