KWSN Orbiting Fortress KWSN Distributed Computing Teams forum
Putting_things_on_top Duke
Joined: 14 Oct 2009 Posts: 435 Location: Frostbite Falls, Minnesota, USA
Posted: Wed Jan 20, 2010 10:32 pm Post subject: NAN and EUE problems? This might help...
I'm not sure if this will work for everyone, but it really has helped me maintain a more consistent daily output.
I have been plagued by NAN/EUE (Early Unit End) errors on F@H GPU folding...and after 5 or 6 successive EUEs, the client suspends itself for 24 hours.
This really cranked me off for the past few months!!!
But NOW, I have found a solution that works (at least for me!)
- I added the -oneunit startup switch/flag (among others) to my GPU client configs
- I set up an open-ended schedule in Windows Task Scheduler to start each GPU client every 5 minutes (unless it is already running)
NOTE: You must start GPU clients thru Task Scheduler only in order for it to manage successive restarts correctly
This setup has some advantages:
- Every WU completion -or- error causes the client to stop cleanly (no EUE counter)
- Each GPU can get up to 5 minutes to cool down between WU's
- I get to sleep better at night without feeling compelled to restart failed clients in the middle of the night
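For anyone who would rather script the setup described above than click through the Task Scheduler GUI, here is a rough sketch. It only builds (and prints) a `schtasks /Create` command for a task that repeats every 5 minutes and launches the client with -oneunit. The task name and the client path are made-up placeholders, not anything from my real config, and I have not verified this against every Windows version. It also relies on Task Scheduler skipping a launch while the previous instance is still running, so check that setting on the created task.

```python
def build_schtasks_command(task_name, client_exe, interval_minutes=5):
    """Return the argument list for a repeating Windows scheduled task.

    task_name and client_exe are hypothetical placeholders; substitute
    the name and path of your own GPU client.
    """
    # -oneunit (the switch mentioned in the post) makes the client exit
    # after a single work unit, whether it finished or errored out.
    task_run = f'"{client_exe}" -oneunit'
    return [
        "schtasks", "/Create",
        "/TN", task_name,              # name shown in Task Scheduler
        "/TR", task_run,               # command line the task runs
        "/SC", "MINUTE",               # schedule type: every N minutes
        "/MO", str(interval_minutes),  # N, here 5 by default
        "/F",                          # overwrite a task with the same name
    ]


if __name__ == "__main__":
    # Example values only: one task per GPU client.
    cmd = build_schtasks_command("FAH GPU0", r"C:\FAH\GPU0\gpu_client.exe")
    # Print the command for review instead of running it.
    print(" ".join(cmd))
```

On a Windows box the same list could be handed to `subprocess.run(cmd, check=True)` to actually register the task. You would create one task per GPU client, each with its own name and path.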
Since I set this up (almost two weeks ago), I have been averaging 20,000+ points per day.
Before, I think I was averaging 13,000 points (and with a lot of manual intervention on my part).
Yes, I still get NAN errors, but they only stop a GPU for - at most - 5 minutes!!!
Hope this helps others

Last edited by Putting_things_on_top on Tue Jan 26, 2010 9:53 pm; edited 6 times in total

Idan UN-Smitten
Joined: 07 Dec 2005 Posts: 2993 Location: Tel-Aviv, Israel
Posted: Thu Jan 21, 2010 6:13 am
That's a really good tip!
Thanks a bunch!

Putting_things_on_top Duke
Joined: 14 Oct 2009 Posts: 435 Location: Frostbite Falls, Minnesota, USA
Posted: Thu Jan 21, 2010 8:43 am Post subject: What IS a NAN (you might ask)...
What IS a NAN (you might ask)...
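It really means "Not A Number", and (at least in terms of GPU folding) it is what comes back from a double-precision floating-point calculation once its values have become either too large or too small to be handled with any reasonable accuracy by the hardware. Strictly speaking, a result that is merely too large overflows to infinity; the NAN appears when a later step is undefined, such as infinity minus infinity or zero divided by zero.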
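To make that concrete, here is a tiny sketch in plain Python running on the CPU. It is not the F@H GPU core, and `blow_up` is just an illustrative helper I made up. It shows a value growing until it overflows to infinity, and an undefined follow-up operation turning that into a NAN that then poisons everything computed from it.

```python
import math


def blow_up(x, steps):
    """Square x repeatedly, mimicking a value that grows out of control."""
    for _ in range(steps):
        x = x * x
    return x


# 1e10 squared a few times exceeds the largest double (about 1.8e308),
# so the multiplication overflows to infinity.
huge = blow_up(1e10, 6)

# infinity minus infinity has no defined value, so the result is NaN.
nan = huge - huge

# Once a NaN is in the calculation, everything derived from it is NaN too.
poisoned = nan * 0.0 + 1.0

print(math.isinf(huge))      # True
print(math.isnan(nan))       # True
print(nan == nan)            # False: NaN never compares equal, even to itself
print(math.isnan(poisoned))  # True
```

That last point is why a single bad value can sink a whole WU: once one coordinate or force goes NAN, every later step that touches it is NAN as well.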
This is why we almost always see an "UNSTABLE_MACHINE" return code in the logfile. And when F@H refers to a "machine", it doesn't necessarily mean that the hardware is malfunctioning. Some of these folding applications use a kind of quasi finite-state-machine style of programming (too complicated to explain here)...so they are really talking about the virtual machine that was created within the app.
Almost all Nvidia GPUs are incapable of handling double-precision operations to the requisite degree of accuracy. A few of the ATI cards are designed to process double precision natively, but these are a limited set of models (not necessarily the high-end ones, either). As for Nvidia, they claim to have DPFPO architected into their (soon-to-be-released) Fermi series.
Additionally, some of the F@H projects are notoriously dependent on DPFPO (the 576x project series, for example), and it becomes a crapshoot as to whether or not a WU will succeed on a GPU. I have had WUs go to 85% and then crap out...and I don't even get partial credit!
We generally do not see NAN errors returned from CPU clients because all 'modern' CPUs already have DPFPO built-in (much older CPUs relied on a separate "math-coprocessor" chip to achieve this functionality).
Oh, DPFPO = double-precision floating-point operations