KWSN Orbiting Fortress Forum Index KWSN Orbiting Fortress
KWSN Distributed Computing Teams forum
 
 FAQFAQ   SearchSearch   MemberlistMemberlist   UsergroupsUsergroups   RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 

Einstein@Home Project News February 25, 2007

 
Post new topic   Reply to topic    KWSN Orbiting Fortress Forum Index -> DC News
View previous topic :: View next topic  
Author Message
KWSN - News
Prince
Prince


Joined: 01 Aug 2006
Posts: 3853

PostPosted: Sun Feb 25, 2007 12:15 am    Post subject: Einstein@Home Project News February 25, 2007 Reply with quote



The project performance problems and database load problems have been resolved. Details can be found in this message board post. Thank you for your patience while we sorted this out.

Read more...

Source: Einstein@Home
BOINC project Einstein@Home: Main page News
_________________
KWSN DC News reporter reporting from inside the server.
Back to top
View user's profile Send private message
Al Dente
Prince
Prince


Joined: 23 Feb 2006
Posts: 3228
Location: Leodis, the jewel at the end of the yellow brick road (or M1)

PostPosted: Sun Feb 25, 2007 3:20 am    Post subject: Reply with quote

This is Bruce Allen's message board post:

Message 64707 - Posted 24 Feb 2007 20:10:34 UTC
Last modified: 24 Feb 2007 20:13:01 UTC

Dear Einstein@Home volunteers and contributors,

I thought I would post a description of what went wrong and how it was fixed.

(1) Project performance problems. These were due to our database getting overloaded. It was processing an average of 950 queries per second, with peaks of up to about 3000 queries per second. Ultimately, these were due to the way that the BOINC locality scheduler works and the fact that our new analysis run did not have many low-frequency workunits. Einstein@Home is the only project that uses the locality scheduler, which is designed to send many workunits for the same data file, only sending a new data file when there is no work left for the previous data file. What happened was that many hosts that had low frequency files (because they were slower than the majority of hosts) requested work for these files, or NEW workunits also for low frequency files. When the project ran out of work for these files, the locality scheduler would then perform an extremely database intensive 'crawl' through the database looking for more work. So the slowest 20% of hosts were generating very large numbers of database queries looking for non-existent low frequency workunits. I fixed this by modifying the algorithm that searches for new work. Anyone interested in the details can look at BOINC CVS next week when I check in the modified code.

The database is now averaging about 60 to 80 queries per second, and the database server and project servers are once again snappy and responsive.

(2) File server problems. Our project uses three file servers, each of which has about 8TB of RAID-6 disk space. The file servers use Areca 24-port SATA controller cards, and Western Digital WD4000YR disks. For a number of months we have been experiencing problems in which a disk would apparently drop from the array and then reappear a few seconds later, prompting a RAID array rebuild. In the end we sent one of our server boxes (approximately 80 kg, worth about 10kUSD) by express mail to Taiwan, and the Areca engineers looked at it more closely. (Many thanks to these engineers, who have given us first-rate support!) It turned out that our problems were due to a hardware problem with the WD4000YR drives. They have a SATA interface chip which (in some revisions of the WD4000YR) is incompatible with an interface chip used on the Areca RAID controller. This incompatibility is only triggered by issuing NCQ commands. So by disabling NCQ on the RAID controller, the problem was fixed. Our two remaining file servers have now been working without issues for more than two weeks.

These things were further exacerbated by my move to Germany with my family (our kids are 2.5 and 6 years old) which meant that I couldn't give these issues enough attention until now.

Hopefully these problems are behind us! I am grateful to everyone for their patience, and apologize for how long it took to track these things down and deal with them.

Cheers,
Bruce Allen
_________________
Creationists believe they never evolved; I agree with them.

. . My Milestones . . My Full BOINC list
Back to top
View user's profile Send private message
Display posts from previous:   
Post new topic   Reply to topic    KWSN Orbiting Fortress Forum Index -> DC News All times are GMT - 5 Hours
Page 1 of 1

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © 2001, 2005 phpBB Group
Optimized Seti@Home App | BOINC Stats