KWSN Orbiting Fortress
KWSN Distributed Computing Teams forum
 

NAN and EUE problems? This might help...

 
Putting_things_on_top
Duke


Joined: 14 Oct 2009
Posts: 435
Location: Frostbite Falls, Minnesota, USA

Posted: Wed Jan 20, 2010 10:32 pm    Post subject: NAN and EUE problems? This might help...

I'm not sure if this will work for everyone, but it really has helped me maintain a more consistent daily output.
I have been plagued by NAN/EUE (Early Unit End) errors on F@H GPU folding, and after 5 or 6 successive EUEs the client suspends itself for 24 hours.
This really cranked me off for the past few months!!!

But NOW, I have found a solution that works (at least for me!)
  • I added the -oneunit startup switch/flag (amongst others) to my GPU client configs
  • I set up a schedule in Windows Task Scheduler that repeats indefinitely and starts each GPU client every 5 minutes (unless it is already running)
    NOTE: You must start the GPU clients only thru Task Scheduler so that it can manage successive restarts correctly
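For anyone who would rather script the schedule than click through the Task Scheduler GUI, here is a rough sketch of the schtasks arguments that should give a "repeat every 5 minutes" task. The task name and the client path are hypothetical placeholders (only the -oneunit flag comes from the post), and the "unless already running" part relies on the task's setting not to start a second instance, which is worth checking in the task's properties.

```python
def build_schtasks_args(task_name, client_command, every_minutes=5):
    """Build the argument list for a repeating Windows Task Scheduler job.

    task_name and client_command are placeholders; substitute your own.
    """
    if every_minutes < 1:
        raise ValueError("every_minutes must be at least 1")
    return [
        "schtasks", "/Create",
        "/TN", task_name,           # name shown in Task Scheduler
        "/TR", client_command,      # command line the task runs
        "/SC", "MINUTE",            # schedule type: minute-based
        "/MO", str(every_minutes),  # modifier: run every N minutes
    ]


# Hypothetical install path; -oneunit makes the client exit after each WU.
args = build_schtasks_args("FAH-GPU0", r"C:\FAH\GPU0\fah-gpu.exe -oneunit")
print(args)
```

On a Windows box the list could be handed to subprocess.run(args) to create the task; one task per GPU client.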

This set-up has some advantages:
  1. Every WU completion -or- error causes the client to stop cleanly (so the successive-EUE counter never builds up)
  2. Each GPU can get up to 5 minutes to cool down between WUs
  3. I get to sleep better at night without feeling compelled to restart failed clients in the middle of the night
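The "up to 5 minutes" figure is just the wait until the next scheduler tick. A small sketch of that arithmetic, assuming the ticks fire at 0, 5, 10, ... minutes and that a client stopping exactly on a tick is picked up by that same tick (both are simplifying assumptions, not something Task Scheduler guarantees):

```python
def minutes_until_restart(stop_minute, interval=5):
    """Minutes a stopped client waits for the next scheduler tick.

    stop_minute is measured from the moment the schedule started.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    # Distance from stop_minute up to the next multiple of interval.
    return (-stop_minute) % interval


for stop in (1.0, 4.5, 12.25):
    print(stop, "->", minutes_until_restart(stop))
```

Whatever the stop time, the wait stays below the 5-minute interval.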

Since I set this up (almost two weeks ago), I have been averaging a bit over 20,000 points per day.
Before, I think I was averaging 13,000 points (and with a lot of manual intervention on my part).
Yes, I still get NAN errors, but they only stop a GPU for - at most - 5 minutes!!!
Hope this helps others.
Last edited by Putting_things_on_top on Tue Jan 26, 2010 9:53 pm; edited 6 times in total
Idan
UN-Smitten
Prince


Joined: 07 Dec 2005
Posts: 2993
Location: Tel-Aviv, Israel

Posted: Thu Jan 21, 2010 6:13 am

That's a really good tip!

Thanks a bunch!
Putting_things_on_top
Duke


Joined: 14 Oct 2009
Posts: 435
Location: Frostbite Falls, Minnesota, USA

Posted: Thu Jan 21, 2010 8:43 am    Post subject: What IS a NAN (you might ask)...

What IS a NAN (you might ask)...

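It really means "Not A Number". At least in terms of GPU folding, it generally turns up when a double-precision floating-point calculation produces values that are either too large or too small for the hardware to handle with any reasonable accuracy: the values overflow to infinity (or underflow to zero), and later operations on them, such as infinity minus infinity or zero times infinity, have no defined numeric result.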
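Here is a tiny illustration of how a value that grows too large can end up as a NAN. It uses ordinary Python floats (IEEE 754 doubles on the CPU), so it only shows the general floating-point behaviour, not what the F@H GPU cores actually do internally.

```python
import math

big = 1e308                          # close to the largest double
overflowed = big * 10                # too large to represent: becomes inf
nan_value = overflowed - overflowed  # inf - inf has no defined result

print(overflowed)                # inf
print(nan_value)                 # nan
print(math.isnan(nan_value))     # True
print(nan_value == nan_value)    # False: a NaN is not even equal to itself
```

Once a NAN appears it spreads through every later calculation that touches it, which is why a single bad value can sink a whole WU.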

This is why we almost always see an "UNSTABLE_MACHINE" return code in the logfile. And when F@H refers to a "machine", it doesn't necessarily mean that the hardware is malfunctioning. As I understand it, some of these folding applications use a quasi finite-state-machine style of programming (too complicated to explain here), so they are really talking about the virtual machine that was created within the app.

Almost all Nvidia GPUs are incapable of handling double-precision operations to the requisite degree of accuracy; a few of the ATI cards are designed to natively process double-precision, but these are a limited set of models (not necessarily the high-end ones, either). As for Nvidia, they claim to have DPFPO architected into their (soon-to-be-released) Fermi series.

Additionally, some of the F@H projects are notoriously dependent on DPFPO (the 576x project series, for example), and it gets to be a crap-shoot as to whether or not a WU will succeed on a GPU. I have had WUs go to 85% and then crap-out...and I don't even get partial credit!

We generally do not see NAN errors returned from CPU clients because all 'modern' CPUs already have DPFPO built-in (much older CPUs relied on a separate "math-coprocessor" chip to achieve this functionality).

Oh, DPFPO = double precision floating point operations
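To give a feel for the gap between single and double precision, here is a short sketch that rounds a normal Python float (a double) down to single precision with the standard struct module. The helper name to_float32 is made up for this example, and it runs on the CPU; it says nothing about how any particular GPU or F@H core rounds.

```python
import struct


def to_float32(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


value = 16777217.0            # 2**24 + 1, exact as a double
print(to_float32(value))      # 16777216.0: the +1 is lost in single precision
print(to_float32(0.1) == 0.1) # False: 0.1 rounds differently in each format
print(to_float32(0.5) == 0.5) # True: 0.5 is exact in both
```

Single precision carries roughly 7 significant decimal digits against roughly 16 for double, so small differences between large numbers vanish much sooner.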
All times are GMT - 5 Hours