Author |
Message |
JerWA Prince


Joined: 01 Jan 2007 Posts: 1497 Location: WA, USA
|
Posted: Sun May 13, 2007 8:16 am Post subject: SETI = borked again |
|
|
Well, you may or may not have noticed that SETI came back online last night, with new work generated and sent out for the first time in a while. Yay! Unfortunately, around midnight something broke and everything came to a screeching halt (see the Berkeley network activity link below). No word yet on what's up, but hopefully it's just a small hiccup (e.g. a machine that needs to be smacked) and we'll be rolling again soon.
http://fragment1.berkeley.edu/newcricket/grapher.cgi?target=%2Frouter-interfaces%2Finr-250%2Fgigabitethernet2_3;view=Octets
In the meantime, it might be wise to set your clients not to download new work and/or to disable network activity if SETI is your only project. Past experience has shown us that pestering the upload server with results tends to invalidate them. _________________
Stats: [BOINC Synergy] - [Free-DC] - [MundayWeb] - [Netsoft] - [All Project Stats] |
|
|
 |
mohrorless Mail Order Goat Bride


Joined: 09 Oct 2006 Posts: 11206 Location: NYC
|
Posted: Sun May 13, 2007 12:05 pm Post subject: |
|
|
Maybe the new server got overworked too soon and quit.  _________________ Fetch me the Holy Hand Grenade!
Keeper of the Unending keg of PGGBs
Taunter in Training
Campaign Manager for Sir Shrubbery
Plus
 |
|
|
 |
JerWA Prince


Joined: 01 Jan 2007 Posts: 1497 Location: WA, USA
|
Posted: Sun May 13, 2007 1:56 pm Post subject: |
|
|
Dunno. It does this on a pretty regular basis (enough so that all the users there have these network activity links, hehe), so I don't think it's anything special caused by the new server. _________________
Stats: [BOINC Synergy] - [Free-DC] - [MundayWeb] - [Netsoft] - [All Project Stats] |
|
|
 |
Quixote Duke


Joined: 06 Nov 2006 Posts: 355 Location: Aaaargh!
|
Posted: Sun May 13, 2007 4:42 pm Post subject: |
|
|
Actually, they seem to have been working on it quite heroically through the weekend - let's see what the "Monday blues" bring. I'm leaving the settings just as they are, this thing um here is cranking out "climate" and "little green men" only, for now.
By the way - does "cricket" work with Winders XP? _________________ tilting windmills, rescuing damsels,etc |
|
|
 |
mohrorless Mail Order Goat Bride


Joined: 09 Oct 2006 Posts: 11206 Location: NYC
|
Posted: Sun May 13, 2007 6:23 pm Post subject: |
|
|
I seem to have several WUs with a status of "Downloading" and 1 "Uploading"...  _________________ Fetch me the Holy Hand Grenade!
Keeper of the Unending keg of PGGBs
Taunter in Training
Campaign Manager for Sir Shrubbery
Plus
 |
|
|
 |
JerWA Prince


Joined: 01 Jan 2007 Posts: 1497 Location: WA, USA
|
Posted: Mon May 14, 2007 10:07 am Post subject: |
|
|
If you have work showing as queued for download, you can kind of help that along (to keep work going if you need it) by following the instructions in this post:
http://setiathome.berkeley.edu/forum_thread.php?id=39438
Essentially, change your HTTP proxy in BOINC Manager to 128.32.18.173. Then go to your tasks list, find the tasks that are downloading and note their names (usually the last 3 digits are enough), then go to the transfers window and hit "retry now" for those specific files. They will download immediately. Once finished, remove the HTTP proxy setting, as it will prevent uploads from working once the server is finally back online.
I did this and cleared all the pending SETI downloads from my clients, which should keep them busy for a few more hours. Some have also reported that you may actually get MORE pending downloads once you clear them out, so you can keep repeating this cycle to get work if you need it. _________________
Stats: [BOINC Synergy] - [Free-DC] - [MundayWeb] - [Netsoft] - [All Project Stats] |
|
|
 |
mohrorless Mail Order Goat Bride


Joined: 09 Oct 2006 Posts: 11206 Location: NYC
|
Posted: Mon May 14, 2007 10:53 am Post subject: |
|
|
Thanks JerWA,
It worked great! _________________ Fetch me the Holy Hand Grenade!
Keeper of the Unending keg of PGGBs
Taunter in Training
Campaign Manager for Sir Shrubbery
Plus
 |
|
|
 |
JerWA Prince


Joined: 01 Jan 2007 Posts: 1497 Location: WA, USA
|
Posted: Tue May 15, 2007 6:15 pm Post subject: |
|
|
They're still working on it, just posted another update:
Matt Lebofsky wrote: | We had the usual outage today which was mostly a success. The database compressed and was backed up in just over an hour. Normally this takes almost twice as long but the result table has significantly shrunk over the past two weeks (wonder why?). After that we put the new thumper in the closet (we being me, Eric, Jeff, and Kevin - it's a heavy machine). We also rebooted bruno to cleanly pick up a new disk (replacing a failed disk from yesterday). And I rebooted penguin to attach koloth's old tape drive to it (so it could read the classic data tapes for splitting).
That all went well. We also updated all the BOINC-side code to bring the SETI@home project in line with the current BOINC source tree and a few things broke, namely our validators and assimilators. These aren't project critical for the time being, so we're postponing dealing with these until we deal with the real problem at hand: getting people to connect to our data servers.
I think this is the longest outage we've ever had (even though it wasn't a "complete" outage - just no work was available) and we're in a whole new network configuration since the last major outage (new OS, new servers, new ISP, new switches, new router). In short, we're being clobbered by the returning flood of work requests. The major bottleneck is somewhere in the direction of our Hurricane router or bruno. Or at least that's the way it seems right now and there's no guarantee that when we break that dam a new bottleneck won't arise. I don't have the time to spell out what is broken and what we tried and what failed and what yielded unexpected results. Just know we're working on it and we understand most connections are being dropped.
- Matt |
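For anyone curious why "the returning flood of work requests" is such a problem: if every client retries on the same fixed schedule, they all hit the server at the same moments. The usual remedy is exponential backoff with random jitter. Below is a minimal Python sketch of that idea. It is only an illustration, not the actual BOINC client code, and the function name, base delay, and cap are all made-up assumptions.

```python
import random


def backoff_delays(base=60.0, cap=4 * 3600.0, attempts=6, rng=None):
    """Yield retry delays (seconds) using exponential backoff with full jitter.

    base, cap and attempts are illustrative values, not BOINC settings.
    """
    rng = rng or random.Random()
    for attempt in range(attempts):
        # The ceiling doubles on every failed attempt, up to the cap.
        ceiling = min(cap, base * 2 ** attempt)
        # Picking a random point below the ceiling spreads clients out
        # so they do not all retry at the same instant.
        yield rng.uniform(0, ceiling)


if __name__ == "__main__":
    for n, delay in enumerate(backoff_delays(), start=1):
        print(f"retry {n}: wait {delay:.0f} s")
```

The randomness is the important part: doubling alone still leaves clients that failed together retrying together.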
_________________
Stats: [BOINC Synergy] - [Free-DC] - [MundayWeb] - [Netsoft] - [All Project Stats] |
|
|
 |
cozycat Squire

Joined: 14 Nov 2006 Posts: 3
|
Posted: Tue May 15, 2007 7:40 pm Post subject: |
|
|
I'm just going to continue letting my machine do its thing while I keep pestering my wife's primary professor to run S@H on the lab machines. Any good ideas on how to get a positive response? So far I have gotten threats about bad things being done to my person if I keep asking, lol. |
|
|
 |
JerWA Prince


Joined: 01 Jan 2007 Posts: 1497 Location: WA, USA
|
Posted: Wed May 16, 2007 9:07 pm Post subject: Fast One (May 16 2007) |
|
|
Matt Lebofsky wrote: | Quick note as I gotta catch a bus..
Wow - what a mess. I think we're in the middle of our biggest outage recovery to date, and it's breaking everything. The good news is we're coming into some newer hardware which we'll get on line to help somehow.
See Eric's thread in the Staff Blog. He's been working overtime getting a new frankenstein machine together to act as another upload/download server and reduce the load on bruno. The scheduling server (galileo) has been choking - I just now moved all that over to bruno as well. So we may retire galileo soon, too. Jeff has been going nuts trying to track down errors in validator/assimilator code so we can get those on line as well. And our old friend "slow feeder query" is back, probably just being aggravated by the heavy load.
Gotta go..
- Matt |
And the referenced post...
Eric Korpela wrote: | This one could probably go in the technical news, but since I haven't blogged in a while, I decided to jot it down here.
Following the large outage, bruno's been having some problems keeping up. Lots of dropped connections. I guess most of you noticed that. It's not a lack of hardware this time, just an over-abundance of connection attempts.
Some of the dropped connections were local file-server connections, which causes some of the http processes to wait around which causes more dropped connections. Changing some of the TCP tuning parameters helped, but didn't solve the problem.
We did some brainstorming before the outage and have come up with some tactics to combat these issues.
We're setting up our router to proxy the SYN/ACK handshakes. That way if we are flooded, the connections will be dropped before they get to bruno. That'll in turn prevent the NFS connections from getting dropped.
We're also getting rid of some configuration remnants from earlier BOINC server code. Currently bruno handles all of the incoming connections and forwards them to other machines when appropriate for uploads and downloads. We can designate other machines as upload or download handlers so that bruno won't have to touch those connections at all.
If that's not enough, we'll set up web servers on some of the other machines and get back to round robin DNS for the upload and download servers.
Well, that's enough typing for now. This weekend, one of my fingers had an unfortunate meeting with the leading edge of a 120mm fan blade inside a server case. Fortunately the fan blade broke and it doesn't look like I'll lose the fingernail. I've learned my lesson, always approach case fans from the trailing edge.
--
Eric |
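In case the "round robin DNS" bit in Eric's post is unfamiliar: the idea is that one hostname maps to several server addresses, and the name server hands them out in rotating order so that successive clients land on different machines. Here is a toy Python sketch of that rotation. The class name is invented for illustration and the addresses are placeholders from the documentation range, not the project's real servers or setup.

```python
from collections import deque


class RoundRobinZone:
    """Toy stand-in for a name server that rotates its A records per query."""

    def __init__(self, addresses):
        self._records = deque(addresses)

    def resolve(self):
        # Return the records in their current order, then rotate so the
        # next query sees a different address first.
        answer = list(self._records)
        self._records.rotate(-1)
        return answer


if __name__ == "__main__":
    # Placeholder addresses (192.0.2.0/24 is reserved for documentation).
    zone = RoundRobinZone(["192.0.2.10", "192.0.2.11", "192.0.2.12"])
    for _ in range(4):
        # Most clients simply connect to the first address returned.
        print(zone.resolve()[0])
```

Since most clients just take the first address in the answer, the load ends up spread roughly evenly, though caching by resolvers makes the real-world split less tidy than this.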
_________________
Stats: [BOINC Synergy] - [Free-DC] - [MundayWeb] - [Netsoft] - [All Project Stats] |
|
|
 |
|