KWSN Orbiting Fortress
KWSN Distributed Computing Teams forum

Plagued by Computational Errors

 
lvanst
Baron

Joined: 31 Jan 2013
Posts: 152
Location: Phoenix, AZ (yes, it's hot here)

Posted: Wed Mar 13, 2013 12:21 am    Post subject: Plagued by Computational Errors

My hotrod is unstable...

Several of you helped me get it dialed in, and then I pushed too far? :oops:

At first it was just Einstein and Seti GPU failures, but then it spread to non-GPU work (Milkyway, Rosetta, MindModeling)...

So, I've throttled it back to stock:

1) Use 87% of CPUs (leave 1 available)
2) Use 100% of CPU time
3) No overdrive on the Radeon 7750 GPU (880 MHz GPU, 1125 MHz memory)
4) No overdrive on the FX-8350 (4 GHz)
5) Uninstalled/reinstalled video drivers (not sure if I was current, and AMD Vision Center kept detonating, so I rebuilt)
6) Locked BitDefender out of the project work areas and BOINC
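
Editor's aside on item 1: the percentage needed to leave a given number of cores free is simple arithmetic. The sketch below is illustrative only. The function names are made up, and whether the BOINC client truncates or rounds a fractional core count is an assumption to check against your own client, so both cases are shown.

```python
import math


def pct_to_leave_free(total_cores: int, free_cores: int) -> float:
    """Exact 'use at most X% of CPUs' value that leaves free_cores idle."""
    if not 0 <= free_cores < total_cores:
        raise ValueError("free_cores must be between 0 and total_cores - 1")
    return 100.0 * (total_cores - free_cores) / total_cores


def cores_used(total_cores: int, pct: float, truncate: bool = True) -> int:
    """Cores used for a given percentage, never fewer than one.

    truncate=True drops the fractional core; truncate=False rounds to
    the nearest whole core. Which one a real client does is an assumption.
    """
    raw = total_cores * pct / 100.0
    n = math.floor(raw) if truncate else round(raw)
    return max(1, n)


print(pct_to_leave_free(8, 1))          # 87.5
print(cores_used(8, 87, truncate=True))   # 6
print(cores_used(8, 87, truncate=False))  # 7
```

On an 8-core FX-8350, 87% only leaves exactly one core free if the fraction is rounded up; entering 87.5 (or 88) avoids depending on that.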

Any other ideas?
_________________
Putting_things_on_top
Duke

Joined: 14 Oct 2009
Posts: 435
Location: Frostbite Falls, Minnesota, USA

Posted: Wed Mar 13, 2013 2:47 am

Without knowing more details, I would be inclined to suspect heat problems.

Running GPUs over a nominal 80C threshold for extended periods of time will cause performance degradation (increasing errors).
Running consistently above 88C will rapidly begin to cause permanent damage to 'cores' or on-board memory.
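
Editor's aside: the two thresholds above can be turned into a quick check on logged temperatures. This is a minimal sketch, assuming the 80C and 88C figures from this post as rules of thumb (they are not vendor specifications); the function name and the sample readings are hypothetical, and reading temperatures from the hardware is left out.

```python
NOMINAL_LIMIT_C = 80.0  # extended running above this: more errors (per the post)
DAMAGE_LIMIT_C = 88.0   # consistent running above this: risk of damage (per the post)


def classify_gpu_temp(temp_c: float) -> str:
    """Map one temperature reading to a coarse status label."""
    if temp_c > DAMAGE_LIMIT_C:
        return "danger"
    if temp_c > NOMINAL_LIMIT_C:
        return "degraded"
    return "ok"


# Made-up readings in degrees C, for illustration only.
readings = [39.0, 62.0, 81.0, 90.0]
statuses = [classify_gpu_temp(t) for t in readings]
print(statuses)  # ['ok', 'ok', 'degraded', 'danger']
```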

If you're comfortable with the potential risk of voiding your warranty, you may want to consider re-thermaling your GPUs.
I make a habit of cleaning & re-thermaling my GPUs about every 6-9 months (after I've had them for at least 6 months).
Even though I have NVidia cards, the principle is the same.
Video-card manufacturers seem to employ blind stone masons who trowel on [what really doesn't pass for] thermal compounds.
They invariably use the wrong kind of compound, and they apply it obscenely thick!
And don't use Arctic Silver - it is a 'curing' formulation...it dries-out over time. (I fried 2 cards using that garbage!)
Instead, use a real thermal compound (aka grease) from either Antec or SiiG.
Once you've carefully disassembled your card, scraped out that horrid gunk, and have thoroughly cleaned the major surfaces & components...
...apply the thermal compound - as thinly as humanly possible, to one surface only! (either the chip or the heatsink, not both)...
...then carefully re-assemble your card.
Done, and good to go!

For cleaning your components (either GPU or CPU), it's best [and cheaper] to get these 3 items at your local drug-store:
  • a bottle of 100% acetone (for fine detail cleaning)
  • a bottle of 97% isopropyl alcohol (for rough cleaning) [never use this on acrylic surfaces!!!!]
  • a package of 100% cotton, lint-free make-up removal pads (usually round or oval)
These items are absolutely indispensable for properly cleaning your electronic components.

I have an FX-8350 myself, and I have it running near full-tilt (6/8-cores @100%) with a VERY nice aftermarket cooler...it has yet to exceed 50C!!!
That cooler is the CoolerMaster X6 Elite.
But I did a little customized touch of replacing its stock-fan with a Scythe Ultra-Kaze (120x120x38), and reverse positioned it to "pull" the air thru (instead of the normal "push").
My cooler-fan now blows directly into the rear chassis fan (also an Ultra-Kaze).
Amped-up, that Scythe fan will haul 133 CFM!!!!
That combination gives astonishing cooling!

Oh yeah - BTW - if you're running Windoze (like me), there is a minor, known compatibility issue for most AMD processors.
Running it O/C can cause [what is called] race conditions to occur *FAR* too frequently.
Your best option is to run it at the normal stock-speed 4.013GHz.
Race conditions can still happen; but at stock-speed, the likelihood is greatly reduced.

Other tech-sites that I peruse claim that the FX-8350 will run O/C quite happily (and more efficiently) with no known problems under a stable Linux distribution (like Ubuntu, RedHat, or Mint).
The downside to Linux, however, is support for video-card drivers & GPUs is not as good as it should be.
Food for thought...


_________________
lvanst
Baron

Joined: 31 Jan 2013
Posts: 152
Location: Phoenix, AZ (yes, it's hot here)

Posted: Wed Mar 13, 2013 3:17 pm

Thank you for the guidance PTOT...

I run a Corsair H100i, which kept the CPU between 45C and 55C at full load while overclocked to 4.8 GHz. I've seen rare peaks of 62C...

The 7750 GPU is always between 37C and 39C, and rarely shows utilization above 60%...

I'll definitely redo the thermal compound on the GPU. I'm planning on revving it up further once the platform is stable.

It ran stable last night, so that race condition is a strong possibility!

My previous platform was Ubuntu, and the OS alone gave me a significant performance lift over the initial Win7 build. I will flip the new platform to Ubuntu, once I'm confident in the hardware burn-in/config...


_________________
Putting_things_on_top
Duke

Joined: 14 Oct 2009
Posts: 435
Location: Frostbite Falls, Minnesota, USA

Posted: Wed Mar 13, 2013 10:22 pm

Seems that you've already got the concept of "thermal management" well in-hand.

And considering that your GPU utilization is around 60% (and very impressive temps), I would postpone the re-thermaling effort until the temps start to drift upward.

So, in the long run, it seems to come down to the O/S...


_________________
lvanst
Baron

Joined: 31 Jan 2013
Posts: 152
Location: Phoenix, AZ (yes, it's hot here)

Posted: Thu Mar 14, 2013 12:51 am

LOL! Amen to that...

I did have six more comp errs on Einstein, so I removed the GPU app_config settings for Einstein and Seti. Now it only runs one GPU job at a time. Let's see how that goes...

#Microwave
_________________
lvanst
Baron

Joined: 31 Jan 2013
Posts: 152
Location: Phoenix, AZ (yes, it's hot here)

Posted: Fri Mar 15, 2013 10:44 pm

...and it ran error-free, until I enabled overclocking... :(

The errors returned shortly after using the "AMD Vision Engine Control Center" to auto-tune and then overdrive the CPUs and GPU.

I've disabled overclocking, again. Let's see if it cleans up over the weekend.
_________________
branjo
Prince

Joined: 05 Jan 2006
Posts: 746
Location: Slovakia

Posted: Sat Mar 16, 2013 9:05 am

Still using app_config?
_________________



lvanst
Baron

Joined: 31 Jan 2013
Posts: 152
Location: Phoenix, AZ (yes, it's hot here)

Posted: Sun Mar 17, 2013 10:33 am

Branjo,

After it ran stable for a few days I reintroduced the app_config, allowing only 4 concurrent GPU WUs (Einstein & Seti).
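
Editor's aside: for readers who haven't used it, app_config.xml is the per-project BOINC file that controls how many tasks share a GPU. The sketch below generates one in Python. It assumes "4 concurrent GPU WUs" means four tasks sharing one GPU (gpu_usage of 0.25); the app name "example_gpu_app" is a hypothetical placeholder (real names have to come from your own client), and the cpu_usage value is likewise an assumption, not a setting taken from this thread.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_names, tasks_per_gpu, cpu_per_task=0.5):
    """Return app_config.xml text that lets tasks_per_gpu tasks share a GPU."""
    if tasks_per_gpu < 1:
        raise ValueError("tasks_per_gpu must be at least 1")
    root = ET.Element("app_config")
    for name in app_names:
        app = ET.SubElement(root, "app")
        ET.SubElement(app, "name").text = name
        gpu = ET.SubElement(app, "gpu_versions")
        # Each task claims 1/N of a GPU, so N tasks run at once.
        ET.SubElement(gpu, "gpu_usage").text = f"{1.0 / tasks_per_gpu:g}"
        ET.SubElement(gpu, "cpu_usage").text = f"{cpu_per_task:g}"
    return ET.tostring(root, encoding="unicode")


# Hypothetical app name; four tasks per GPU as one reading of the post.
xml_text = build_app_config(["example_gpu_app"], tasks_per_gpu=4)
print(xml_text)
```

As far as I know, each project reads its own app_config.xml from its project directory, so Einstein and Seti would each need a separate file with their own app names.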

No failures yet :)

It's looking like the AMD overclocking tool was causing the instabilities.

For now I have the "Optimized defaults" loaded in BIOS, with XMP enabled. This does OC slightly using a feature AMD calls turbo-mode (4.0 - 4.2 GHz)...

The AMD OC tool was pushing it to 4.8 GHz, though I used that very same tool to lower the clock to stock (4.0 GHz) and still saw failures.

It seems the AMD OC tool is interacting directly with the CPUs in a way that causes the computation errors, which to me just seems highly unlikely... :roll:
_________________
branjo
Prince

Joined: 05 Jan 2006
Posts: 746
Location: Slovakia

Posted: Sun Mar 17, 2013 2:17 pm

I am using AMD "Catalyst Control Center" to OC my 7750 (my non-OC'able CPU runs on factory settings :) ), and the OC'ed GPU shrubs WCG/HCC1 w/o errors. But every project is more or less sensitive to overclocking, so it is good you found a way to get rid of the errors on EAH and SAH.

Cheers and good luck #ni-1
_________________



Putting_things_on_top
Duke

Joined: 14 Oct 2009
Posts: 435
Location: Frostbite Falls, Minnesota, USA

Posted: Sun Mar 17, 2013 10:00 pm

What it seems to boil down to (IMHO) is the sprint-vs-marathon or tortoise-vs-hare comparison.
I tend to go with the marathon approach - a slower, measured pace yields consistent long-term endurance.


_________________
lvanst
Baron

Joined: 31 Jan 2013
Posts: 152
Location: Phoenix, AZ (yes, it's hot here)

Posted: Sun Mar 17, 2013 10:34 pm

Yes PTOT, I'm starting to learn that, but the mad scientist in my head keeps screaming "More Power!!!"

Branjo, it's clocking at 880 MHz GPU/1125 MHz memory with the AMD Overdrive turned off. It doesn't seem to clock any faster than that with it turned on... Do you have any idea what the differences are between the two modes? I see no visible difference in behavior, other than that I can manually set the GPU and memory to higher levels with AMD Overdrive turned on...



I think I'll rethermal that GPU to take it down a few degrees though...
_________________
Putting_things_on_top
Duke

Joined: 14 Oct 2009
Posts: 435
Location: Frostbite Falls, Minnesota, USA

Posted: Sun Mar 17, 2013 11:27 pm

lvanst wrote:
Yes PTOT, I'm starting to learn that, but the mad scientist in my head keeps screaming "More Power!!!"


#ni-1
_________________
branjo
Prince

Joined: 05 Jan 2006
Posts: 746
Location: Slovakia

Posted: Sat Mar 23, 2013 2:46 pm

lvanst wrote:
Yes PTOT, I'm starting to learn that, but the mad scientist in my head keeps screaming "More Power!!!"

Branjo, it's clocking at 880 MHz GPU/1125 MHz memory with the AMD Overdrive turned off. It doesn't seem to clock any faster than that with it turned on... Do you have any idea what the differences are between the two modes? I see no visible difference in behavior, other than that I can manually set the GPU and memory to higher levels with AMD Overdrive turned on...



I think I'll rethermal that GPU to take it down a few degrees though...


I am shrubbing with the GPU clock at the OverDrive maximum of 900 MHz (this is important AFAIK), memory clock at 1,300 MHz (this is not important AFAIK), "Manual fan control" disabled (running at 10% speed), and 10% Power control, w/o any problems. Temperature is 46-47C at 98-99% GPU utilization for 12 concurrent WCG/HCC1 GPU tasks on 3 CPU threads (the temperature was the same when I ran 32 concurrent WCG/HCC1 GPU tasks on all 8 threads).

#ni-1
_________________



Gemjunkie
Prince

Joined: 03 Jul 2010
Posts: 3519
Location: Earth, lately

Posted: Sat Mar 23, 2013 5:12 pm

You can drop the memory clock for MilkyWay to reduce power/heat a bit; it's not memory-intensive.
All times are GMT - 5 Hours