Client detached errors and no credit

Tim Porter

Joined: 8 Mar 09
Posts: 17
Message 23540 - Posted: 8 Mar 2009, 18:10:46 UTC

Hi,

I have a problem.

I have a cluster of machines that are always giving me BOINC problems. I have finally given up trying to figure it out myself and will ask here.

About the only project that works on them is SpinHenge, but I want to get other projects working on them as well.

If I attach to other projects (I have tried SETI@home, Cosmology@Home, MilkyWay@home, I think WCG, and others that I forget), they download units, work on them and appear to upload results. Usually the first machine to report any results for a project will get some credit for it, but the rest won't. Once one of the other machines has reported, no further credit is granted for any machine, including the first (generally).

In the task list on the project websites it gives the status as "Client Detached".

I have several other machines that all have no problems; only the cluster gives me issues. The cluster is twenty diskless machines, but each one boots Linux (actually a customised version of Knoppix) over the network, and each one gets its own network share on a machine elsewhere on the network with lots of disk space. In the folder for each machine are all the executables and project files. This is the only difference between the cluster machines and the rest of the machines that work.

I don't know what the problem is and I don't know what info is relevant. I could write for hours about my setup, so if you have any questions just let me know.

http://www.allprojectstats.com/showuser.php?projekt=0&id=1201531
ID: 23540 · Report as offensive
Tim Porter

Joined: 8 Mar 09
Posts: 17
Message 23542 - Posted: 8 Mar 2009, 18:39:19 UTC

No, each machine has its own space on the file server. Each machine downloaded its own copy of the BOINC client (yeah, I hit the servers for twenty downloads at once, sorry about that) and installed it to a subfolder in its shared space.

I know what you mean about making copies of one installation, but that is not what I have done. I admit the symptoms are similar, though.

I am not using an account manager, just ./run_manager on occasion but generally just the command line ./boinc.
ID: 23542 · Report as offensive
Tim Porter

Joined: 8 Mar 09
Posts: 17
Message 23543 - Posted: 8 Mar 2009, 18:44:29 UTC - in response to Message 23542.  

Oh, one thing that crossed my mind: when BOINC is running it won't know that the machine is diskless, and anything that it writes to places outside of the folder I installed it to will appear to work but will in fact just be held in RAM and won't be there on the next reboot.

The cluster is not up 100% of the time and is not running BOINC all the time that it is up. Rebooting happens once or twice a day.
ID: 23543 · Report as offensive
Tim Porter

Joined: 8 Mar 09
Posts: 17
Message 23545 - Posted: 8 Mar 2009, 19:17:03 UTC - in response to Message 23544.  

Yeah I say shared to mean that the folder is on the file server and is shared between the file server and the diskless box. The file server has twenty such folders, one for each diskless box.

On each box I mount them like this...

cd /mnt/
mkdir small
# "small" is the name of the fileserver
mount -t nfs -o nolock 192.168.1.20:/tera2/cluster-homes/door3 small
# 192.168.1.20 is the IP of small, and /tera2/cluster-homes is the path on its filesystem to all the shared folders.
# "door3" changes for each diskless box: it is essentially the position of the box in the room plus a number saying which position in the stack it is.
# There are two stacks, door and window. I don't think this is important :)

cd small/
ls -l

total 4272
drwxr-xr-x 5 knoppix knoppix 4096 Mar 8 15:13 BOINC
-rwxr-xr-x 1 root root 4354474 Aug 4 2008 boinc_6.2.15_i686-pc-linux-gnu.sh

cd BOINC/
ls -l

total 9708
-rw-r--r-- 1 root root 247 Dec 22 09:55 account_boinc.bio.wzw.tum.de_boincsimap.xml
-rw-r--r-- 1 root root 2859 Mar 7 08:12 account_setiathome.berkeley.edu.xml
-rw-r--r-- 1 root root 1843 Oct 1 18:28 account_spin.fh-bielefeld.de.xml
-rw-r--r-- 1 root root 22585 Feb 28 09:38 all_projects_list.xml
-rw-r--r-- 1 knoppix knoppix 219 Aug 4 2008 binstall.sh
-rwxr-xr-x 1 knoppix knoppix 1729592 Aug 4 2008 boinc
-rwxr-xr-x 1 knoppix knoppix 208832 Jul 9 2008 boinc_cmd
-rwxr-xr-x 1 knoppix knoppix 208832 Aug 4 2008 boinccmd
-rwxr-xr-x 1 knoppix knoppix 6315324 Aug 4 2008 boincmgr
-rw-r--r-- 1 knoppix knoppix 815 May 6 2008 boincmgr.16x16.png
-rw-r--r-- 1 knoppix knoppix 2395 May 6 2008 boincmgr.32x32.png
-rw-r--r-- 1 knoppix knoppix 5570 May 6 2008 boincmgr.48x48.png
-rw-r--r-- 1 knoppix knoppix 238100 May 6 2008 ca-bundle.crt
-rw-r--r-- 1 root root 79862 Mar 8 15:13 client_state.xml
-rw-r--r-- 1 root root 79862 Mar 8 15:13 client_state_prev.xml
-rw-r--r-- 1 root root 143 Oct 1 18:11 create_account.xml
-rw-r--r-- 1 root root 8677 Feb 28 09:44 get_current_version.xml
-rw-r--r-- 1 root root 1461 Mar 5 15:40 get_project_config.xml
-rw-r--r-- 1 root root 1258 Dec 22 17:07 global_prefs.xml
-rw-r--r-- 1 root root 1235 Dec 15 17:08 global_prefs_override.xml
-rw------- 1 root root 32 Sep 29 17:32 gui_rpc_auth.cfg
-rw-r--r-- 1 root root 282 Dec 24 10:56 job_log_boinc.bio.wzw.tum.de_boincsimap.txt
-rw-r--r-- 1 root root 226 Dec 22 14:44 job_log_milkyway.cs.rpi.edu_milkyway.txt
-rw-r--r-- 1 root root 2280 Mar 6 21:23 job_log_setiathome.berkeley.edu.txt
-rw-r--r-- 1 root root 24474 Mar 8 14:25 job_log_spin.fh-bielefeld.de.txt
-rw-r--r-- 1 root root 303 Dec 20 13:50 job_log_www.cosmologyathome.org.txt
-rw-r--r-- 1 root root 115 Nov 7 18:28 job_log_www.worldcommunitygrid.org.txt
drwxr-xr-x 38 knoppix knoppix 4096 May 6 2008 locale
-rw-r--r-- 1 root root 0 Sep 29 17:32 lockfile
-rw-r--r-- 1 root root 139 Mar 5 15:41 lookup_account.xml
-rw-r--r-- 1 root root 5873 Dec 17 09:41 lookup_website.html
-rw-r--r-- 1 root root 10244 Dec 27 07:56 master_boinc.bio.wzw.tum.de_boincsimap.xml
-rw-r--r-- 1 root root 151 Mar 5 15:41 master_setiathome.berkeley.edu.xml
-rw-r--r-- 1 root root 13063 Oct 1 18:27 master_spin.fh-bielefeld.de.xml
drwxrwx--x 5 root root 4096 Mar 5 15:41 projects
-rwxr-xr-x 1 root root 41 Sep 29 17:31 run_client
-rwxr-xr-x 1 root root 44 Sep 29 17:31 run_manager
-rw-r--r-- 1 root root 754 Mar 5 15:43 sched_reply_boinc.bio.wzw.tum.de_boincsimap.xml
-rw-r--r-- 1 root root 5066 Mar 7 08:12 sched_reply_setiathome.berkeley.edu.xml
-rw-r--r-- 1 root root 4580 Mar 8 14:46 sched_reply_spin.fh-bielefeld.de.xml
-rw-r--r-- 1 root root 6110 Mar 5 15:43 sched_request_boinc.bio.wzw.tum.de_boincsimap.xml
-rw-r--r-- 1 root root 11686 Mar 7 08:12 sched_request_setiathome.berkeley.edu.xml
-rw-r--r-- 1 root root 12296 Mar 8 14:46 sched_request_spin.fh-bielefeld.de.xml
drwxrwx--x 4 root root 4096 Mar 7 08:08 slots
-rw-r--r-- 1 root root 424 Mar 5 15:43 statistics_boinc.bio.wzw.tum.de_boincsimap.xml
-rw-r--r-- 1 root root 1064 Mar 7 08:12 statistics_setiathome.berkeley.edu.xml
-rw-r--r-- 1 root root 4283 Mar 8 14:46 statistics_spin.fh-bielefeld.de.xml
-rw-r--r-- 1 root root 0 Oct 11 07:58 stderrdae.txt
-rw-r--r-- 1 root root 741065 Mar 3 20:32 stdoutdae.txt
-rw-r--r-- 1 root root 8460 Mar 8 07:03 time_stats_log


ID: 23545 · Report as offensive
Profile Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15487
Netherlands
Message 23549 - Posted: 8 Mar 2009, 23:34:04 UTC - in response to Message 23545.  

How many actual hosts are identified on the projects that you run? I have the feeling that since it is essentially all one host (isn't it?) that you get detached on all of them as they all have the same hostID... which it can't have under BOINC.
ID: 23549 · Report as offensive
Tim Porter

Joined: 8 Mar 09
Posts: 17
Message 23551 - Posted: 9 Mar 2009, 1:01:05 UTC

No, it should be twenty different hosts, except that instead of having a disk each one has its own NFS share on a file server. I can't see how that could be the issue.

SpinHenge works just fine and shows all the different hosts. SETI only shows a couple of them, despite the fact that most of them have at some stage been attached to SETI (I think the projects only list hosts that have credit).

The host ID gets stored in a file in the BOINC folder, right? Because if it's somewhere else then that might be the issue.
ID: 23551 · Report as offensive
Les Bayliss
Help desk expert

Joined: 25 Nov 05
Posts: 1654
Australia
Message 23553 - Posted: 9 Mar 2009, 3:59:03 UTC

The hostid is stored in client_state.xml, under <project> (not far from the top).
The first line in this section will be the name of the project.

Each 'computer' will have one of these client_state.xml files, and the host ID will be different for each computer (xml file) on a given project.

If 20 computers, then:
20 client_state.xml files, each with
1 hostID per attached project
If all 20 computers are attached to SETI, then you'll be able to find 20 different
host IDs for SETI.

So, how many of the client_state.xml files have a hostid for SETI?
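
A minimal sketch of that check, meant to be run from the file server. It assumes each <project> block in client_state.xml carries its <master_url> and <hostid> elements on their own lines; `list_hostids` is a hypothetical helper name, not something BOINC ships.

```shell
#!/bin/sh
# list_hostids FILE...
# For every <project> block in each client_state.xml, print one line:
#   <file> TAB <master_url> TAB <hostid>
# Assumption: <master_url> and <hostid> each sit on a single line
# inside the <project> block.
list_hostids() {
    for f in "$@"; do
        awk -v file="$f" '
            /<project>/ { in_project = 1; url = ""; id = "" }
            in_project && /<master_url>/ {
                url = $0
                gsub(/.*<master_url>|<\/master_url>.*/, "", url)
            }
            in_project && /<hostid>/ {
                id = $0
                gsub(/.*<hostid>|<\/hostid>.*/, "", id)
            }
            /<\/project>/ {
                if (in_project) printf "%s\t%s\t%s\n", file, url, id
                in_project = 0
            }
        ' "$f"
    done
}
```

From the cluster-homes directory, something like `list_hostids */BOINC/client_state.xml | grep -i seti` would then show one line per box that is attached to SETI, with the host ID in the last column. Two boxes showing the same ID for the same project would be worth a closer look.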

ID: 23553 · Report as offensive
Tim Porter

Joined: 8 Mar 09
Posts: 17
Message 23556 - Posted: 9 Mar 2009, 9:19:13 UTC

Hi Les,

I detached or suspended the projects that seemed to be failing in order to stop wasting time on them. I guess that would alter the results? Should I get SETI going on them for 24 hours first? Here are the results of a bit of grepping anyway (I did this from the fileserver)

[root@small cluster-homes]# grep -ir "host_cpid" * | grep -i "seti"

door1/BOINC/sched_request_setiathome.berkeley.edu.xml: <host_cpid>5ca17c3b912611aee8ca38f27c2cc9fc</host_cpid>
door2/BOINC/sched_request_setiathome.berkeley.edu.xml: <host_cpid>b0156bc7a12805b4dc3fe6193aacd135</host_cpid>
door3/BOINC/sched_request_setiathome.berkeley.edu.xml: <host_cpid>28eb410c39c6e8abec77e1e6573e479e</host_cpid>
door4/BOINC/sched_request_setiathome.berkeley.edu.xml: <host_cpid>a907f9728cdae861dd9b2aabfe3b3d8b</host_cpid>
door5/BOINC/sched_request_setiathome.berkeley.edu.xml: <host_cpid>9d89577178b7ee2018d4fc090186c6a0</host_cpid>
Binary file door5/BOINC/projects/setiathome.berkeley.edu_beta/astropulse_5.01_i686-pc-linux-gnu matches
Binary file door5/BOINC/projects/setiathome.berkeley.edu_beta/ap_graphics_5.01_i686-pc-linux-gnu matches
door5/BOINC/sched_request_setiathome.berkeley.edu_beta.xml: <host_cpid>51622ed4ffc476c976ad8782ae89e757</host_cpid>
door6/BOINC/sched_request_setiathome.berkeley.edu.xml: <host_cpid>fd4b6f670f63d92bfc00795fef75b6c8</host_cpid>
window11/BOINC/sched_request_setiathome.berkeley.edu.xml: <host_cpid>d0ac91f3d29d77e55ae3780256076f45</host_cpid>
window12/BOINC/sched_request_setiweb.ssl.berkeley.edu_beta.xml: <host_cpid>a69b29f386f9046366202d5c02e600b8</host_cpid>
window2/BOINC/sched_request_setiathome.berkeley.edu.xml: <host_cpid>80e22a26570c4346e5c2941922fe79ee</host_cpid>
window3/BOINC/sched_request_setiathome.berkeley.edu.xml: <host_cpid>a2805b8672373881be62ddde3be965ef</host_cpid>
window5/BOINC/sched_request_setiathome.berkeley.edu.xml: <host_cpid>8918d0ac418d211fbc0b61e6daa6d99a</host_cpid>

[root@small cluster-homes]# grep -ir "host_cpid" * | grep -i "client_state.xml"

biglaptop/BOINC/client_state.xml: <host_cpid>b59d280ae98836f20b890cfe8d12610a</host_cpid>
door1/BOINC/client_state.xml: <host_cpid>5ca17c3b912611aee8ca38f27c2cc9fc</host_cpid>
door10/BOINC/client_state.xml: <host_cpid>0d832156fc04d565f9bcdc0f9a8d527b</host_cpid>
door11/BOINC/client_state.xml: <host_cpid>fc82bab4445d412b07ccbd41001083ca</host_cpid>
door12/BOINC/client_state.xml: <host_cpid>f42880bc1e7195267c686bbd069b4e14</host_cpid>
door2/BOINC/client_state.xml: <host_cpid>b0156bc7a12805b4dc3fe6193aacd135</host_cpid>
door3/BOINC/client_state.xml: <host_cpid>28eb410c39c6e8abec77e1e6573e479e</host_cpid>
door4/BOINC/client_state.xml: <host_cpid>a907f9728cdae861dd9b2aabfe3b3d8b</host_cpid>
door5/BOINC/client_state.xml: <host_cpid>51622ed4ffc476c976ad8782ae89e757</host_cpid>
door6/BOINC/client_state.xml: <host_cpid>fd4b6f670f63d92bfc00795fef75b6c8</host_cpid>
door7/BOINC/client_state.xml: <host_cpid>bf3a5b4eb94bb44d170da6b0674b1c8e</host_cpid>
door8/BOINC/client_state.xml: <host_cpid>4ad3c8960f88bc3525f1eee1bff26d67</host_cpid>
door9/BOINC/client_state.xml: <host_cpid>601cebc9e4db1b9879d910939a0599bb</host_cpid>
window1/BOINC/client_state.xml: <host_cpid>58acc2d73ad43627375b087aa31a9955</host_cpid>
window11/BOINC/client_state.xml: <host_cpid>d0ac91f3d29d77e55ae3780256076f45</host_cpid>
window12/BOINC/client_state.xml: <host_cpid>68cd08ceb1bff0607ad7dc81c332b22a</host_cpid>
window2/BOINC/client_state.xml: <host_cpid>80e22a26570c4346e5c2941922fe79ee</host_cpid>
window3/BOINC/client_state.xml: <host_cpid>15d6b6ffe80351114c975febcd85fc00</host_cpid>
window4/BOINC/client_state.xml: <host_cpid>e2f85a635d4a85b36ae395031584f179</host_cpid>
window5/BOINC/client_state.xml: <host_cpid>7d0af5b223b0afbb3c96c4367abedc84</host_cpid>
window6/BOINC/client_state.xml: <host_cpid>d5ab342a473a59d69142d9562f48e5f6</host_cpid>
window8/BOINC/client_state.xml: <host_cpid>393b955150dce098000a821447006ee6</host_cpid>
window9/BOINC/client_state.xml: <host_cpid>28f9f0301f5a032a5885da32190897ca</host_cpid>

There may appear to be 24 different hosts, but there are really only 20: four of the boxes are completely dead (failed PSU, dead RAM, etc.).
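
Checking a listing like this by eye is error-prone, so here is a small sketch that reports any host_cpid shared by more than one file. The helper names `first_cpid` and `duplicate_cpids` are made up for illustration, and the sketch assumes file paths without spaces.

```shell
#!/bin/sh
# first_cpid FILE: print the first <host_cpid> value found in FILE.
first_cpid() {
    sed -n 's:.*<host_cpid>\([0-9a-f]*\)</host_cpid>.*:\1:p' "$1" | head -n 1
}

# duplicate_cpids FILE...: print "cpid: file file ..." for every
# host_cpid that appears in more than one of the given files.
# Files without a host_cpid are skipped.
duplicate_cpids() {
    for f in "$@"; do
        printf '%s %s\n' "$(first_cpid "$f")" "$f"
    done | awk '
        NF >= 2 { files[$1] = files[$1] " " $2; count[$1]++ }
        END     { for (c in count) if (count[c] > 1) print c ":" files[c] }
    '
}
```

Run as `duplicate_cpids */BOINC/client_state.xml` from cluster-homes. The client_state.xml values listed above are all distinct, so for that data it would print nothing.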
ID: 23556 · Report as offensive
Profile Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15487
Netherlands
Message 23557 - Posted: 9 Mar 2009, 9:57:50 UTC - in response to Message 23556.  

All BOINC installations need their own Data directory. Is that the case here, or do they all have their own client directory while pointing to the same data directory?

Since Knoppix is based on Debian, I can point you to these installation steps. Did you follow something similar to install these 20 BOINC installations? For if you did, you may well have only one Data directory, which is constantly overwritten.
ID: 23557 · Report as offensive
Tim Porter

Joined: 8 Mar 09
Posts: 17
Message 23558 - Posted: 9 Mar 2009, 10:30:31 UTC - in response to Message 23557.  

Is the data directory actually a folder called "data"? (I don't seem to have that anywhere; where should it be?) Or do you just mean that each install has to be in its own folder? (That is the situation.)

The installation process was no more complex than this.

For each host...

Make a new folder on the fileserver with a name that identifies that host.
Mount that folder on the host under /mnt/small
CD to it.
wget <the URL for boinc_6.2.15_i686-pc-linux-gnu.sh>
./boinc_6.2.15_i686-pc-linux-gnu.sh

That created a folder called BOINC in the mounted folder.

cd BOINC/

Do whatever with ./run_manager. I generally use VNC to get that going so I can do it graphically, and now, if I'm just going to be running projects instead of attaching/detaching etc., I just use ./boinc

Whenever the machines are turned on I have to manually mount the right folder from the fileserver on each host again; cluster-ssh is my friend. I'm looking at automating this, but that's a whole different issue.
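
One way the manual mount could be automated, as a rough sketch only. It assumes a hypothetical lookup file that maps each node's IP address to its share name (lines like `192.168.1.103 door3`; those node IPs are invented for the example). The mount options are the ones used earlier in the thread.

```shell
#!/bin/sh
# share_for_ip IP MAPFILE
# Look up the share name for a node in a hypothetical map file whose
# lines look like "<node IP> <share name>". Prints nothing if not found.
share_for_ip() {
    awk -v ip="$1" '$1 == ip { print $2; exit }' "$2"
}

# mount_node_share SHARE
# Create the mount point and mount the node's share with the same
# options as the manual command in this thread. Needs root.
mount_node_share() {
    mkdir -p /mnt/small &&
    mount -t nfs -o nolock "192.168.1.20:/tera2/cluster-homes/$1" /mnt/small
}
```

A boot script could then call `mount_node_share "$(share_for_ip "$my_ip" /path/to/map)"`, where both `$my_ip` and the map path are placeholders. How the node learns its own IP at that point is left open here.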

Does the installer try to do anything outside of the directory that the installer is located in? Do ./boinc or ./run_manager try to do anything outside of their directory? If so it will be lost on reboot as those changes are only held in RAM.
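
A quick way to test that worry, sketched under the assumption that anything written elsewhere shows up as a file with a fresh modification time. `changed_since` is a hypothetical helper, and which directories to scan is a guess.

```shell
#!/bin/sh
# changed_since MARKER DIR...
# List regular files under each DIR that were modified after MARKER,
# without crossing into other mounted filesystems (-xdev).
changed_since() {
    marker=$1
    shift
    find "$@" -xdev -type f -newer "$marker" 2>/dev/null
}
```

For example: `touch /tmp/before`, then run the installer or ./boinc for a while, then `changed_since /tmp/before /etc /var /home`. Anything listed outside /mnt/small would be a candidate for state that is lost on reboot.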

Ageless, I didn't follow those instructions, and I can't say if what I did was 'like' that because I don't know what the important parts of it are. :S I noticed that it referred to directories that are probably hard-coded, like /var/lib, /etc/ and the like. If the boinc_6.2.15_i686-pc-linux-gnu.sh script does the same then there might be issues; see the previous paragraph.

Everyone, I really appreciate all your help on this so far! This has been bugging me for ages.
ID: 23558 · Report as offensive
Profile Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15487
Netherlands
Message 23559 - Posted: 9 Mar 2009, 10:41:52 UTC - in response to Message 23558.  

BOINC uses sandboxing, as far as I know also on Linux. This separates (at least) the Boinc binary directory from the data directory. The Boinc binary directory only holds the Boinc binaries, such as boinc_client, boinc_manager, boinccmd etc.

The Data directory holds the client_state.xml file, the projects binaries and their data files (tasks, work units, whatever you call them), scheduler messages, log files, etc.

If you have 20 different Boinc binary directories, but only one Boinc data directory that all those 20 Boinc binary directories point to, then the last one standing will have constantly overwritten all data in that directory.

Each of your installations is recognized as one host.
Each host can be attached to multiple projects.
Each host needs its separate Boinc binary and Boinc Data directory, so you should have 40 directories in all.
ID: 23559 · Report as offensive
Tim Porter

Joined: 8 Mar 09
Posts: 17
Message 23560 - Posted: 9 Mar 2009, 11:05:17 UTC - in response to Message 23559.  

Whatever is going on, there is no filespace in my setup that can be written to or read from by more than one machine at once.

In http://boinc.berkeley.edu/dev/forum_thread.php?id=3708&nowrap=true#23545 there is an `ls -l` from one of my BOINC folders, it seems that for a specific host the client_state.xml files and the boinc binaries are all in the same folder.

If you like I can go into some more detail about the setup that I have, I thought I had covered the necessary bits but maybe I missed an important point somewhere along the line? (I could even do a video?)

ID: 23560 · Report as offensive
Profile Jord
Volunteer tester
Help desk expert
Avatar

Joined: 29 Aug 05
Posts: 15487
Netherlands
Message 23561 - Posted: 9 Mar 2009, 11:17:24 UTC - in response to Message 23560.  

I'm sure there are people around here who would like to actually see your system (video), but it isn't necessary. If you can give a complete description of it, it is always useful, especially for when the real Linux people wake up.
ID: 23561 · Report as offensive
Tim Porter

Joined: 8 Mar 09
Posts: 17
Message 23562 - Posted: 9 Mar 2009, 11:27:10 UTC - in response to Message 23561.  

OK, I'll write up properly what I've got this evening. It's 11:20am here, so give it ten hours or so and I'll have that up.
ID: 23562 · Report as offensive
Tim Porter

Joined: 8 Mar 09
Posts: 17
Message 23564 - Posted: 9 Mar 2009, 16:27:47 UTC - in response to Message 23563.  

Sekerob, it feels like that's exactly what's going on, but I can't see how, as there is nowhere on the file server that all of the machines can access at once (well, there is, but I'd have to mount it each time they boot and I don't do that).

Which log file should I be looking at? I'll post it up here. In the meantime I'll assume you mean stdoutdae.txt.

On one machine...

root@Knoppix:/mnt/small/BOINC# head -n 5 stdoutdae.txt

11-Oct-2008 12:57:34 [---] Starting BOINC client version 6.2.15 for i686-pc-linux-gnu
11-Oct-2008 12:57:34 [---] log flags: task, file_xfer, sched_ops
11-Oct-2008 12:57:34 [---] Libraries: libcurl/7.18.0 OpenSSL/0.9.8g zlib/1.2.3.3 c-ares/1.5.1
11-Oct-2008 12:57:34 [---] Data directory: /mnt/small/BOINC
11-Oct-2008 12:57:34 [---] Processor: 2 GenuineIntel Intel(R) Pentium(R) III CPU family 1266MHz [Family 6 Model 11 Stepping 1]

root@Knoppix:/mnt/small/BOINC# df -h

Filesystem Size Used Avail Use% Mounted on
/dev/root 8.3M 1.7M 6.7M 20% /
/dev/cloop 1.8G 1.8G 0 100% /KNOPPIX
192.168.1.20:/media/knoppix-cd
660M 660M 0 100% /cdrom
/ramdisk 809M 4.9M 804M 1% /ramdisk
/UNIONFS 809M 4.9M 804M 1% /UNIONFS
192.168.1.20:/media/knoppix-cd
660M 660M 0 100% /cdrom
/dev/cloop 1.8G 1.8G 0 100% /KNOPPIX
192.168.1.20:/tera2/cluster-homes/door2
917G 384G 524G 43% /mnt/small


And on a different machine...



root@Knoppix:/mnt/small/BOINC# head -n 5 stdoutdae.txt

28-Sep-2008 18:44:26 [---] Starting BOINC client version 6.2.15 for i686-pc-linux-gnu
28-Sep-2008 18:44:26 [---] log flags: task, file_xfer, sched_ops
28-Sep-2008 18:44:26 [---] Libraries: libcurl/7.18.0 OpenSSL/0.9.8g zlib/1.2.3.3 c-ares/1.5.1
28-Sep-2008 18:44:26 [---] Data directory: /mnt/small/BOINC
28-Sep-2008 18:44:26 [---] Processor: 2 GenuineIntel Intel(R) Pentium(R) III CPU family 1266MHz [Family 6 Model 11 Stepping 1]

root@Knoppix:/mnt/small/BOINC# df -h

Filesystem Size Used Avail Use% Mounted on
/dev/root 8.3M 1.7M 6.7M 20% /
/dev/cloop 1.8G 1.8G 0 100% /KNOPPIX
192.168.1.20:/media/knoppix-cd
660M 660M 0 100% /cdrom
/ramdisk 809M 4.8M 804M 1% /ramdisk
/UNIONFS 809M 4.8M 804M 1% /UNIONFS
192.168.1.20:/media/knoppix-cd
660M 660M 0 100% /cdrom
/dev/cloop 1.8G 1.8G 0 100% /KNOPPIX
192.168.1.20:/tera2/cluster-homes/door3
917G 384G 524G 43% /mnt/small
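
Both logs report the same data directory path, /mnt/small/BOINC, while the df output shows that path backed by a different export on each box (door2 and door3). A small sketch for pulling the path out of the log, assuming the "Data directory:" line format shown above; `data_dir_from_log` is a made-up helper name.

```shell
#!/bin/sh
# data_dir_from_log LOGFILE
# Print the data directory from the most recent "Data directory:" line
# in a BOINC stdoutdae.txt, assuming the line format shown in this post.
data_dir_from_log() {
    sed -n 's/^.*\] Data directory: //p' "$1" | tail -n 1
}
```

Running `df -h "$(data_dir_from_log stdoutdae.txt)"` on each box would then show which export really sits behind that path.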
ID: 23564 · Report as offensive
Profile Gundolf Jahn

Joined: 20 Dec 07
Posts: 1069
Germany
Message 23565 - Posted: 9 Mar 2009, 18:57:34 UTC - in response to Message 23556.  

...
[root@small cluster-homes]# grep -ir "host_cpid" * | grep -i "client_state.xml"
...

You should have done a
grep -ir "hostid" client_state.xml
because that's the number under which you can see the host at the server (if it's still attached):
http://setiathome.berkeley.edu/show_host_detail.php?hostid=7927
(that's one of mine :-)

Regards,
Gundolf
Computers aren't everything in life. (Just kidding)
ID: 23565 · Report as offensive
Tim Porter

Joined: 8 Mar 09
Posts: 17
Message 23567 - Posted: 9 Mar 2009, 22:59:30 UTC

Ok here is more info about my setup.

I do have some other machines as well, but for the sake of simplicity I shall pretend they don't exist for a moment.

There are twenty diskless machines and one file server. I had some difficulty getting the machines to net-boot in the first place, but eventually I got there. The distribution that I have chosen for the diskless machines is a remastered version of Knoppix. I don't know if I would make the same choice again, but it does the job.

I remastered a Knoppix CD. I didn't add very much; mainly I removed a lot of stuff that wasn't needed (the games, OpenOffice, many KDE-related things, etc.). Stuff added includes PHP-cli and MySQL clients.

Instead of burning twenty copies of the CD and having it run hellishly slow from the CD-ROM drive on each box, I got it to netboot by just making the ISO file of the CD and mounting it on the fileserver. To do that I use a command like this...

mount -o loop -t iso9660 /tera2/knoppix.iso /media/knoppix-cd/

This means that I can now change dir to /media/knoppix-cd/ and access it as though it were a CD-ROM. It is of course read-only. Next, I opened an NFS share to /media/knoppix-cd/ so that anyone or anything can read it over the network.

To get the netbooting to work I enabled PXE on the network card in each diskless machine and set up a DHCP server on the fileserver. Any one diskless box will, when it boots up, be told by the DHCP server its IP details and the location of PXE linux. This is a tiny cut-down Linux (I'm not sure of the ins and outs of it) that is served from the fileserver and contains enough cleverness (presumably an NFS client and a bootloader of sorts) to get the machine to fully boot from the /media/knoppix-cd/ directory on the fileserver.

That's really about all there is to it. This gets all of the diskless boxes up and running, and I can SSH to them and run my own tasks on them. (I don't have twenty boxes lying around JUST to run BOINC :) )

Caveat: Knoppix is cool in that even though you run it from CD it still shows a full Linux file tree with all the usual folders and files. You can even write to them and read your changes back; it does this by just storing the changes in RAM. The same applies to all the diskless boxes: any changes made on a box will not be there the next time it is booted. If I want to make something permanent then I have to remaster the original ISO again.

Now for BOINC,

When I decided I wanted to install BOINC on these boxes I created some more folders on the fileserver, one folder for each machine. I created them in a folder called cluster-homes. This might be a bit misleading, as I'm not sure if this setup technically is a cluster and I'm not using the directories as a /home/ directory. I shared each new folder via NFS. Because these are real folders, as opposed to mount points that point to a read-only ISO file, any writes are real writes.
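
For readers reproducing this, the per-machine shares could be written out as /etc/exports entries along these lines. This is a sketch, not the configuration actually used here: the client IPs are invented, and `no_root_squash` is only a guess based on the root-owned files in the earlier listing. Restricting each export to a single client would also enforce that no share is writable by more than one box.

```shell
#!/bin/sh
# exports_line SHARE CLIENT_IP
# Print an /etc/exports entry that exports one per-node folder to
# exactly one client. The option list is an assumption, not taken
# from the setup described in this thread.
exports_line() {
    printf '/tera2/cluster-homes/%s %s(rw,sync,no_root_squash,no_subtree_check)\n' "$1" "$2"
}
```

The output of, say, `exports_line door3 192.168.1.103` would be appended to /etc/exports on the fileserver, followed by `exportfs -ra` to reload the export table.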

On each diskless box I logged in and created a folder called /mnt/small/ (small is the name of the fileserver) and then mounted one of the twenty new NFS shares into it. Then I CD'd into that folder, fetched the boinc_6.2.15_i686-pc-linux-gnu.sh file with wget and ran it. This creates a folder called BOINC in the folder that belongs to that box, which contains the BOINC binaries. It would appear these binaries also use the folder that they reside in as a data directory (i.e., /mnt/small/BOINC/, which is an NFS mount of the fileserver at 192.168.1.20:/tera2/cluster-homes/<machinePosition><machineNumber>/BOINC ).

When the machines power off and back on I have to create the /mnt/small/ folder again and mount the NFS share for that machine manually each time, but once I've done that it seems to run from where it left off.

I am tempted to wipe out all the BOINC directories and see if I can run SETI (one of the projects that keeps giving me issues) on a single box for a few days and get some credits with few or no failures. Once I have confirmed that is working, I would bring up another box and check the logs to see what happens. I've set SETI going on one box and detached from all the rest (I didn't wipe any directories though). I'll see how it goes, but I need to go to sleep now.

I was wondering: all these machines have different local IPs but the same hostname. Is that possibly any kind of issue? I would never address any of them by hostname anyway. Any project server will only see my external IP and the hostname. Might this confuse the project into thinking the machines were all the same? I'll have a go at setting something up to change all the hostnames at some point, but that's likely to be into next week.

Gundolf, I'll try that next time I have more than one attached to seti.

Thanks again everyone! I'm not going to be around much the next few days but I'll do what I can.
ID: 23567 · Report as offensive
Profile Leopoldo

Joined: 11 Feb 09
Posts: 8
Russia
Message 23576 - Posted: 10 Mar 2009, 8:20:58 UTC - in response to Message 23567.  

I was wondering: all these machines have different local IPs but the same hostname. Is that possibly any kind of issue? I would never address any of them by hostname anyway. Any project server will only see my external IP and the hostname. Might this confuse the project into thinking the machines were all the same?


IMHO, you are right about the hostname issue.

BOINC sends the SETI@home server the file "sched_request_setiathome.berkeley.edu.xml", which contains a "<host_info>" section, and inside it we can see "<domain_name>" (in my case, on Win2003, it is the name of my computer, not the workgroup name) and "<ip_addr>" (in my case the IP is from my local LAN).

At least, the website page about my computer shows both my "domain name" and my local IP, so both of them are used somehow by the project. Maybe the server's database identification process uses both, IMHO...
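
To see what each box actually reports, the two fields can be pulled out of every sched_request file. This is a rough sketch that assumes <domain_name> and <ip_addr> each appear on a single line; `reported_identity` is a hypothetical helper name.

```shell
#!/bin/sh
# reported_identity FILE...
# For each sched_request_*.xml, print:
#   <file> TAB <domain_name> TAB <ip_addr>
# using the first occurrence of each element in the file.
reported_identity() {
    for f in "$@"; do
        name=$(sed -n 's:.*<domain_name>\(.*\)</domain_name>.*:\1:p' "$f" | head -n 1)
        ip=$(sed -n 's:.*<ip_addr>\(.*\)</ip_addr>.*:\1:p' "$f" | head -n 1)
        printf '%s\t%s\t%s\n' "$f" "$name" "$ip"
    done
}
```

On the fileserver, `reported_identity */BOINC/sched_request_setiathome.berkeley.edu.xml` would show at a glance whether every box sends the same name.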
ID: 23576 · Report as offensive
Tim Porter

Send message
Joined: 8 Mar 09
Posts: 17
Message 23580 - Posted: 10 Mar 2009, 8:42:18 UTC - in response to Message 23576.  

Making them all boot up with different hostnames is no small task, but I've been meaning to do it for a while so it can't hurt (I never got round to it because I don't really have a need for it). I'll give it a go once I've figured out how on earth I'm going to do it. I'll let you know how it goes.
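
One possible shortcut, since each box already mounts a uniquely named share: derive the hostname from the share name. This is a sketch under that assumption; the helper names are invented, and whether a unique hostname cures the detach problem is not established in this thread.

```shell
#!/bin/sh
# share_name SOURCE
# Reduce an NFS source such as
#   192.168.1.20:/tera2/cluster-homes/door3
# to its last path component ("door3").
share_name() {
    printf '%s\n' "${1##*/}"
}

# mounted_source DIR
# Print the device or NFS source that backs DIR, as reported by df.
mounted_source() {
    df -P "$1" | awk 'NR == 2 { print $1 }'
}
```

After the share is mounted, and before starting ./boinc, root could run `hostname "$(share_name "$(mounted_source /mnt/small)")"` so that each box comes up as door3, window5 and so on.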
ID: 23580 · Report as offensive
Profile Gundolf Jahn

Joined: 20 Dec 07
Posts: 1069
Germany
Message 23581 - Posted: 10 Mar 2009, 9:23:10 UTC - in response to Message 23580.  

Making them all boot up with different hostnames is no small task, but I've been meaning to do it for a while so it can't hurt (I never got round to it because I don't really have a need for it). I'll give it a go once I've figured out how on earth I'm going to do it. I'll let you know how it goes.

If there were an /etc/hosts file for each node, it would be simple :-)

Regards,
Gundolf
ID: 23581 · Report as offensive

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.