Error on file upload: Socket Read incomplete

Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15483
Netherlands
Message 8517 - Posted: 4 Mar 2007, 15:00:43 UTC

ERR_UPLOAD_TRANSIENT -127

First, an explanation of what transient means. Transient refers to a module that, once loaded into main memory, is expected to remain in memory for a short time.

This is a server error.
The file you are trying to upload is locked on the server. The file_upload_handler put an advisory lock on the file, to prevent other file upload handlers from writing to the file.
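
For readers who have not met advisory locks before, here is a minimal sketch of how one behaves on a Unix-like system. It uses Python's `fcntl.flock`, which is an assumption made for illustration and not necessarily the call the BOINC file_upload_handler uses; the helper names `try_lock` and `demo` are hypothetical.

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Open path and try to take an exclusive advisory lock without blocking.

    Returns the open file object on success (keep it open to hold the lock),
    or None if the lock is already held through another open of the file.
    """
    f = open(path, "ab")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        f.close()
        return None
    return f


def demo():
    """Simulate two upload handlers competing for the same file."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        first = try_lock(path)      # first "handler" gets the lock
        second = try_lock(path)     # second "handler" is refused
        first_ok, second_ok = first is not None, second is not None
        first.close()               # closing the file releases the lock
        third = try_lock(path)      # the lock is available again
        third_ok = third is not None
        third.close()
        return first_ok, second_ok, third_ok
    finally:
        os.remove(path)
```

The lock is only advisory: a process that never asks for it can still write to the file, so it protects cooperating handlers from each other and nothing more.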

This can only be fixed by the project.

(This explanation has been in the BOINC FAQs for a long time.)

Resetting BOINC won't fix it.
Aborting the result before the deadline won't fix it.

It has to be fixed on the server, by the admins.
ID: 8517 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5082
United Kingdom
Message 8518 - Posted: 4 Mar 2007, 15:52:32 UTC

Ageless,

Thanks for dropping by to visit this thread, and for your helpful, as always, reminder about the FAQs.

But I don't think it can be the explanation for this class of events.

1) The general sequence of events that I'm seeing is:

WU finishes computation.
Same WU starts computation again.
BOINC attempts upload of results file.

- so the initial cause of the problem happens on the local PC, before any communication takes place with the project servers. Of course, all subsequent attempts at uploading fail, but that seems to be the result of the initial problem, not its cause.

2) Whenever I've had this problem, I've found that aborting the upload transfer clears the problem - the WU is put into a 'ready to report' state, and is cleared at the next scheduler contact. Of course, there's no credit for the WU, but the local BOINC can continue crunching error-free. No reset, not even a restart, and certainly no need of attention by project admins.

I think we must be talking about different problems.

FWIW, I have not yet experienced this problem since upgrading to BOINC 5.8.15, nor have I picked up on any other reports on BOINC later than 5.8.11. We did have another one reported by Ulrich Metzner on SETI Main this morning, but he's come back to say he's using 5.7.5, so it doesn't add much to the diagnosis.
ID: 8518 · Report as offensive
Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15483
Netherlands
Message 8519 - Posted: 4 Mar 2007, 16:15:16 UTC - in response to Message 8518.  
Last modified: 4 Mar 2007, 16:22:11 UTC

But I don't think it can be the explanation for this class of events.

The explanation I put in the FAQs comes directly from the BOINC source code.
It is a server-side error:

[error] Error on file upload: socket read incomplete: asked for 12226, got 8963: No such file or directory
[file_xfer] Temporarily failed upload of 11no03aa.4732.8626.304816.3.23_0_0: transient upload error

This is an error you are getting back from the server. It's only BOINC telling you what it got back from communicating with the project server.

From the source code:
// read from socket, write to file
// ALWAYS returns an HTML reply

- so the initial cause of the problem happens on the local PC, before any communication takes place with the project servers.

Where do you see this in your messages?

2) Whenever I've had this problem, I've found that aborting the upload transfer clears the problem

Sure, aborting "fixes" all "problems" on your side.
For example, I have had a protein result trying to upload for 3 days now. If I abort it, I "fix" that it is trying to upload to a server that is down. But does it really fix the problem? Or does it lull me into feeling I have solved it?

I think we must be talking about different problems.

Not according to the BOINC source code I used at the time. Although I see that the 5.8 branch has a lot of extra possibilities now. I'll dig into it and add to that FAQ over the next couple of days.
ID: 8519 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5082
United Kingdom
Message 8537 - Posted: 5 Mar 2007, 12:32:03 UTC

Jord, I think the difference between us is that you're in 'explanation' mode, while I'm in 'debug' mode, trying to use the explanations to get back to the original cause of the problem.

[file_xfer] Temporarily failed upload of 11no03aa.4732.8626.304816.3.23_0_0: transient upload error

Obviously, I accept that the words 'transient upload error' are in the source code, and will be output to the scheduler response and hence the local message log under failure conditions.

So what was the failure?

[error] Error on file upload: socket read incomplete: asked for 12226, got 8963


Ah, the server asked for 12226 bytes, but the upload stopped uploading at 8963 bytes - some sort of premature end-of-file signal.
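
As a small aid when scanning message logs for this error, the two figures can be pulled out of the line and the shortfall computed. This is only a sketch: the pattern is written against the exact log line quoted above, and the function name `shortfall` is hypothetical.

```python
import re

# Matches the tail of the error line quoted in this thread.
PATTERN = re.compile(r"socket read incomplete: asked for (\d+), got (\d+)")


def shortfall(line):
    """Return (asked, got, missing) parsed from a log line, or None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    asked, got = int(m.group(1)), int(m.group(2))
    return asked, got, asked - got


line = "[error] Error on file upload: socket read incomplete: asked for 12226, got 8963"
print(shortfall(line))  # prints (12226, 8963, 3263)
```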

Where did those figures come from? How did the server know to expect 12226 bytes?

I'm assuming that the initial contact comes from the client PC: something along the lines of "Hey, I've finished some work and I've got something for you - it'll be xxx bytes, and called yyy - let me know when you're ready for it." (May be a single message, or may be broken into sub-messages - doesn't really matter at this level). The server then sets up the storage space, and requests the upload when it's ready.

BOINC then uploads the file, but it's a different size to the one it originally signalled. Why?

[The next clue comes from Josef Segur in a BOINC Alpha email. Thanks to Byron for quoting it on SETI Beta at message 14952] I've seen the problem most often with SETI WUs. The SETI upload file has the structure {Copy of WU header}{one or more, or no, data of interest}{final summary with 'best of' information}. The undersize uploads (I've preserved some) have the first two sections, but the final summary is missing: this is the bit that Joe couldn't understand.

My logs earlier in this thread - I won't bother quoting them again - provide the answer to Joe's question. The WUs finish computation (and presumably BOINC stores the result file size at this point): then the same WU starts computation all over again, overwriting the result file with a new copy of the header and, possibly, some of the data - but crucially, no final summary, because the second computation hasn't finished yet. Meanwhile, BOINC is continuing to process the first 'exit - finished' event, and offers the file for upload.
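
The sequence described above can be reproduced in miniature. The sketch below is not BOINC code: the section sizes are made up, and `simulate_restart_race` is a hypothetical name. It only shows how a size recorded when the task first finishes can disagree with what is on disk by the time the upload runs.

```python
import os
import tempfile


def simulate_restart_race():
    """Declare a size, let a restarted task rewrite the file, then 'upload' it."""
    header = b"H" * 100    # copy of the WU header
    data = b"D" * 50       # data of interest
    summary = b"S" * 30    # final 'best of' summary

    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # 1. First computation finishes and writes the complete result file.
        with open(path, "wb") as f:
            f.write(header + data + summary)
        declared = os.path.getsize(path)   # size the client announces

        # 2. The same WU restarts and overwrites the file.
        #    The second run has not reached the summary yet.
        with open(path, "wb") as f:
            f.write(header + data)

        # 3. The upload sends whatever is on disk now.
        with open(path, "rb") as f:
            sent = len(f.read())
    finally:
        os.remove(path)
    return declared, sent


declared, sent = simulate_restart_race()
print("asked for %d, got %d" % (declared, sent))  # prints asked for 180, got 150
```

The missing 30 bytes correspond to the summary section, which matches the shape of the undersized uploads described above.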

So my question is "Why - and under what circumstances - would BOINC re-start computation on a WU which has just exited with a normal 'finished' status?". My observation is that two results from the same project finishing in quick succession (no more than 1 second apart) is a necessary but not sufficient event trigger.

Nicolas's reply to Pepo and myself earlier in this thread was 'David Anderson says it's fixed in BOINC 5.8.11', but he couldn't or wouldn't point to a specific changelog entry which addressed these symptoms: the two entries cited in this thread refer to different abend situations (though the 2 Feb entry might cover this one too).

In the meantime, I've posted two counter-examples which show that the problem was not completely fixed in BOINC 5.8.11 - which is why I keep banging on about it! (sorry!)

I've not seen any problems in .15 (or 16, or whatever we're up to now), but I'll keep monitoring and post anything I see. In the meantime, if you come across anything while you're reading the new code for documentation purposes, please let us know. Maybe we can meet in the middle and nail this thing once and for all.
ID: 8537 · Report as offensive
Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15483
Netherlands
Message 8544 - Posted: 5 Mar 2007, 13:48:51 UTC
Last modified: 5 Mar 2007, 13:53:42 UTC

[error] Error on file upload: socket read incomplete: asked for 12226, got 8963


Ah, the server asked for 12226 bytes, but the upload stopped uploading at 8963 bytes - some sort of premature end-of-file signal.

Where did those figures come from? How did the server know to expect 12226 bytes?

When your BOINC is communicating with the CGI scheduler, it tells the server the size of the file that it wants to upload.

The key error here is "socket read incomplete". Now, you can argue that it's a socket on the host PC, but reading the file_upload_handler.c file, I read it as the Server socket API.

From /docs/upload.php:
"Transient error. The client should try another data server, or try this one later." "In the error cases, the <file_size> element is omitted and the <message> element gives an explanation."
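
To illustrate what a "socket read incomplete" check can look like on the receiving side, here is a minimal sketch of a loop that reads a declared number of bytes from a stream and reports a shortfall. It is written in Python as an assumption-laden illustration, not taken from the BOINC handler; `read_exactly` and `chunk_size` are hypothetical names, and the message text simply mirrors the log line quoted in this thread.

```python
import io


def read_exactly(stream, nbytes, chunk_size=4096):
    """Read nbytes from stream in chunks.

    Returns (data, None) on success, or (partial_data, error_message)
    if the stream ends before nbytes have arrived.
    """
    buf = bytearray()
    while len(buf) < nbytes:
        chunk = stream.read(min(chunk_size, nbytes - len(buf)))
        if not chunk:  # end of stream before the declared size arrived
            msg = "socket read incomplete: asked for %d, got %d" % (nbytes, len(buf))
            return bytes(buf), msg
        buf.extend(chunk)
    return bytes(buf), None


# The sender declared 12226 bytes but only 8963 are available.
data, err = read_exactly(io.BytesIO(b"x" * 8963), 12226)
print(err)  # prints socket read incomplete: asked for 12226, got 8963
```

Seen this way, the server is only reporting that fewer bytes arrived than were announced. It cannot tell whether the connection dropped or the sender's file shrank.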

(best take the code out... read it yourself on file_upload_scheduler.c
ID: 8544 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5082
United Kingdom
Message 8553 - Posted: 5 Mar 2007, 15:52:47 UTC - in response to Message 8544.  

(best take the code out... read it yourself on file_upload_scheduler.c

Yep, that's where it happens - I think we hit the return error at around line 243.

But we have to debug on from that point: why does the user's PC upload a smaller number of bytes than it originally declared that it wanted to upload?

Answer: because the file itself (on the user's PC) has been modified between the CGI request, and the actual upload: because something - a still unknown something, in my book - has triggered the science application to re-start computation on a WU that it should know it's finished with.
ID: 8553 · Report as offensive
Pepo
Joined: 3 Apr 06
Posts: 547
Slovakia
Message 8557 - Posted: 6 Mar 2007, 0:37:53 UTC - in response to Message 8537.  

Nicolas's reply [...] was 'David Anderson says it's fixed in BOINC 5.8.11', but he couldn't or wouldn't point to a specific changelog entry which addressed these symptoms: the two entries cited in this thread refer to different abend situations (though the 2 Feb entry might cover this one too).

I've not seen any problems in .15 (or 16, or whatever we're up to now), but I'll keep monitoring and post anything I see...

Richard, Jord, I still believe the issue was fixed (possibly completely) by a few (2-4) changes in the source code. They may seem unrelated to one another, but they all concern the class ACTIVE_TASK and scheduling. I would not take the named version number 5.8.11, or any date, that strictly, because the changes appeared on various days (initially in various branches, if I understood it correctly, and were put together later).

(The later transient upload error messages about missing bytes and mismatched file sizes are self-explanatory to me, given the log context. I'd recommend not losing more time on them ;-)

Peter
ID: 8557 · Report as offensive
Nicolas

Joined: 19 Jan 07
Posts: 1179
Argentina
Message 8602 - Posted: 7 Mar 2007, 18:48:52 UTC - in response to Message 8430.  
Last modified: 7 Mar 2007, 18:55:26 UTC


Nicolas, could you give us a little bit more on the context in which David said it was fixed in 5.8.11? Could the fix possibly have been delayed even beyond that release?

Chat via Skype:

[22/02/2007 18:07:25] Nicolas: somebody on BOINC dev forums is reporting a quite weird problem: http://boinc.berkeley.edu/dev/forum_thread.php?id=1575
[22/02/2007 18:07:59] David Anderson: this is fixed in 5.8.11
[22/02/2007 18:08:24] Nicolas: great - thanks

And that was all. I may point him here again...
ID: 8602 · Report as offensive
Batman

Joined: 21 Mar 07
Posts: 1
Message 8970 - Posted: 21 Mar 2007, 21:12:00 UTC - in response to Message 8553.  

(best take the code out... read it yourself on file_upload_scheduler.c

Yep, that's where it happens - I think we hit the return error at around line 243.

But we have to debug on from that point: why does the user's PC upload a smaller number of bytes than it originally declared that it wanted to upload?

Answer: because the file itself (on the user's PC) has been modified between the CGI request, and the actual upload: because something - a still unknown something, in my book - has triggered the science application to re-start computation on a WU that it should know it's finished with.


OK, I have an account here, and I've posted a message in the other thread about this problem with 5.8.15 and C2Q CPUs and WUs not wanting to upload.
ID: 8970 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5082
United Kingdom
Message 8985 - Posted: 22 Mar 2007, 9:34:07 UTC

Batman's report of the error turns out to be in a Crunch3r optimisation (BOINC v5.9.0.32 dated 08-Jan-2007 15:08), so it rather pre-dates this discussion.

Anyone seen any signs of it happening in BOINC v5.8.15 (19-Feb-2007 18:19) or
BOINC v5.8.16 (01-Mar-2007 08:30)?

(or on any platform other than Windows?)


ID: 8985 · Report as offensive

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.