Author |
Message |
lvanst Baron


Joined: 31 Jan 2013 Posts: 152 Location: Phoenix, AZ (yes, it's hot here)
|
Posted: Wed Mar 13, 2013 12:21 am Post subject: Plagued by Computational Errors |
|
|
My hotrod is unstable...
Several helped me get it dialed in, and then I pushed too far?
At first it was just Einstein and Seti GPU fails, but then it expanded to non-GPU work... (Milkyway, Rosetta, MindModeling)
So, I've throttled it back to stock:
1) Use 87% of cpus (leave 1 available)
2) Use 100% of cpu time
3) No overdrive on the Radeon 7750 GPU (880 MHz core, 1125 MHz memory)
4) No overdrive on the FX-8350 (4 GHz)
5) Uninstalled/reinstalled video drivers (not sure if I was current, and AMD Vision Center kept detonating, so rebuilt)
6) Locked BitDefender out of the project work areas and BOINC
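As a side note on setting 1, the percentage can be sanity-checked with a little arithmetic. The Python sketch below assumes the BOINC client multiplies the core count by the percentage and truncates to a whole number of cores (with a minimum of one); that rounding rule is an assumption here and worth confirming against your client version. The function names are made up for illustration.

```python
import math


def usable_cores(total_cores, pct):
    """Cores the client would use, ASSUMING it truncates total * pct / 100 (minimum 1)."""
    return max(1, int(total_cores * pct / 100))


def pct_to_leave_free(total_cores, free_cores):
    """Smallest percentage (rounded up to 2 decimals) that keeps `free_cores` idle."""
    wanted = total_cores - free_cores
    return math.ceil(wanted / total_cores * 10000) / 100


if __name__ == "__main__":
    for pct in (87, 87.5, 100):
        print(f"{pct}% of 8 cores -> {usable_cores(8, pct)} in use")
    print("to leave 1 of 8 free:", pct_to_leave_free(8, 1), "%")
```

If the truncation assumption holds, 87% on an 8-core FX-8350 works out to 6 cores in use rather than 7, so 87.5% would be the value that leaves exactly one core free.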
Any other ideas? _________________
 |
|
|
 |
Putting_things_on_top Duke


Joined: 14 Oct 2009 Posts: 435 Location: Frostbite Falls, Minnesota, USA
|
Posted: Wed Mar 13, 2013 2:47 am Post subject: |
|
|
Without knowing more details, I would be inclined to suspect heat problems.
Running GPUs over a nominal 80c threshold for extended periods of time will cause performance degradation (increasing errors).
Running consistently above 88c will rapidly begin to cause permanent damage to 'cores' or on-board memory.
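To check a logged run against those two figures, something like the following Python sketch would do. The 80C and 88C limits are simply the numbers quoted above, not vendor specifications, and the function names and sample readings are invented for illustration.

```python
from collections import Counter

# Thresholds taken from the post above (rule-of-thumb values, not a vendor spec).
DEGRADE_ABOVE_C = 80
DAMAGE_ABOVE_C = 88


def classify_gpu_temp(temp_c):
    """Bucket a single GPU temperature reading in degrees Celsius."""
    if temp_c > DAMAGE_ABOVE_C:
        return "danger"
    if temp_c > DEGRADE_ABOVE_C:
        return "hot"
    return "ok"


def summarize(readings_c):
    """Count how many readings fall into each bucket."""
    return Counter(classify_gpu_temp(t) for t in readings_c)


if __name__ == "__main__":
    sample = [37, 39, 62, 81, 85, 90]  # made-up readings, not measured data
    print(dict(summarize(sample)))
```

A log that sits mostly in the "ok" bucket would point away from heat as the cause of the errors.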
If you're comfortable with the potential risk of voiding your warranty, you may want to consider re-thermaling your GPUs.
I make a habit of cleaning & re-thermaling my GPUs about every 6-9 months (after I've had them for at least 6 months).
Even though I have NVidia cards, the principle is the same.
Video-card manufacturers seem to employ blind stone masons who trowel on [what really doesn't pass for] thermal compounds.
They invariably use the wrong kind of compound, and they apply it obscenely thick!
And don't use Arctic Silver - it is a 'curing' formulation...it dries-out over time. (I fried 2 cards using that garbage!)
Instead, use a real thermal compound (aka grease) from either Antec or SiiG.
Once you've carefully disassembled your card, scraped out that horrid gunk, and have thoroughly cleaned the major surfaces & components...
...apply the thermal compound - as thinly as humanly possible, to one surface only! (either the chip or the heatsink, not both)...
...then carefully re-assemble your card.
Done, and good to go!
For cleaning your components (either GPU or CPU), it's best [and cheaper] to get these 3 items at your local drug-store:
- a bottle of 100% acetone (for fine detail cleaning)
- a bottle of 97% isopropyl alcohol (for rough cleaning) [never use this on acrylic surfaces!!!!]
- a package of 100% cotton, lint-free make-up removal pads (usually round or oval)
These items are absolutely indispensable for properly cleaning your electronic components.
I have an FX-8350 myself, and I have it running near full-tilt (6/8-cores @100%) with a VERY nice aftermarket cooler...it has yet to exceed 50c!!!
That cooler is the CoolerMaster X6 Elite.
But I did a little customized touch of replacing its stock-fan with a Scythe Ultra-Kaze (120x120x38), and reverse positioned it to "pull" the air thru (instead of the normal "push").
My cooler-fan now blows directly into the rear chassis fan (also an Ultra-Kaze).
Amped-up, that Scythe fan will haul 133 CFM!!!!
That combination gives astonishing cooling!
Oh yeah - BTW - if you're running Windoze (like me), there is a minor, known compatibility issue for most AMD processors.
Running it O/C can cause [what is called] race conditions to occur *FAR* too frequently.
Your best option is to run it at the normal stock-speed 4.013GHz.
Race conditions can still happen; but at stock-speed, the likelihood is greatly reduced.
Other tech-sites that I peruse claim that the FX-8350 will run O/C quite happily (and more efficiently) with no known problems under a stable Linux distribution (like Ubuntu, RedHat, or Mint).
The downside to Linux, however, is support for video-card drivers & GPUs is not as good as it should be.
Food for thought...
 _________________ Click here for...KWSN F@H team summary at EOC
Or here for...KWSN F@H team overtake at EOC
 |
|
|
 |
lvanst Baron


Joined: 31 Jan 2013 Posts: 152 Location: Phoenix, AZ (yes, it's hot here)
|
Posted: Wed Mar 13, 2013 3:17 pm Post subject: |
|
|
Thank you for the guidance PTOT...
I run a Corsair H100i which kept the CPU between 45C and 55C at full load while overclocking at 4.8 GHz. I've seen rare peaks of 62C...
The 7750 GPU is always between 37C and 39C, and rarely shows utilization above 60%...
I'll definitely redo the thermal on the GPU. I'm planning on revving it up further, once the platform is stable.
It ran stable last night, so that race condition is a strong possibility!
My previous platform was Ubuntu, and the OS alone gave me a significant performance lift over the initial Win7 build. I will flip the new platform to Ubuntu, once I'm confident in the hardware burn-in/config...
 _________________
 |
|
|
 |
Putting_things_on_top Duke


Joined: 14 Oct 2009 Posts: 435 Location: Frostbite Falls, Minnesota, USA
|
Posted: Wed Mar 13, 2013 10:22 pm Post subject: |
|
|
Seems that you've already got the concept of "thermal management" well in-hand.
And considering that your GPU utilization is around 60% (and very impressive temps), I would postpone the re-thermaling effort until the temps start to drift upward.
So, in the long run, it seems to come down to the O/S...
 _________________
 |
|
|
 |
lvanst Baron


Joined: 31 Jan 2013 Posts: 152 Location: Phoenix, AZ (yes, it's hot here)
|
Posted: Thu Mar 14, 2013 12:51 am Post subject: |
|
|
LOL! Amen to that...
I did have six more comp errs on Einstein, so I removed the GPU app_config settings for Einstein and Seti. Now it only runs one GPU job at a time. Let's see how that goes...
 _________________
 |
|
|
 |
lvanst Baron


Joined: 31 Jan 2013 Posts: 152 Location: Phoenix, AZ (yes, it's hot here)
|
Posted: Fri Mar 15, 2013 10:44 pm Post subject: |
|
|
...and it ran error free, until I enabled overclocking...
The errors returned shortly after using the "AMD Vision Engine Control Center" to auto-tune and then overdrive the CPUs and GPU.
I've disabled overclocking, again. Let's see if it cleans up over the weekend. _________________
 |
|
|
 |
branjo Prince


Joined: 05 Jan 2006 Posts: 746 Location: Slovakia
|
Posted: Sat Mar 16, 2013 9:05 am Post subject: |
|
|
Still using app_config? _________________
  
 |
|
|
 |
lvanst Baron


Joined: 31 Jan 2013 Posts: 152 Location: Phoenix, AZ (yes, it's hot here)
|
Posted: Sun Mar 17, 2013 10:33 am Post subject: |
|
|
Branjo,
After it ran stable for a few days I reintroduced the app_config, allowing only 4 concurrent GPU WUs (Einstein & Seti).
No failures yet.
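For reference, a per-project app_config.xml that caps concurrent GPU work might be generated along these lines. This is a Python sketch under a few assumptions: the app name is a placeholder (the real short names come from each project), "4 concurrent" is read as four tasks sharing a single GPU, and the cpu_usage value is a guess rather than the setting actually used here.

```python
import math
import xml.etree.ElementTree as ET


def build_app_config(app_name, tasks_per_gpu, cpu_per_task=1.0):
    """Return app_config.xml text letting `tasks_per_gpu` tasks share one GPU."""
    # Truncate (never round up) so tasks_per_gpu * gpu_usage cannot exceed 1.0.
    gpu_share = math.floor(1.0 / tasks_per_gpu * 10000) / 10000

    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    # Assumes a single GPU, so the overall cap equals the per-GPU task count.
    ET.SubElement(app, "max_concurrent").text = str(tasks_per_gpu)
    gpu = ET.SubElement(app, "gpu_versions")
    ET.SubElement(gpu, "gpu_usage").text = f"{gpu_share:.4f}"
    ET.SubElement(gpu, "cpu_usage").text = f"{cpu_per_task:.2f}"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # "example_gpu_app" is a hypothetical name, not a real project app.
    print(build_app_config("example_gpu_app", 4))
```

The resulting file would go in that project's folder under the BOINC data directory, and the client has to re-read its config files (or be restarted) before the limits take effect.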
It's looking like the AMD overclocking tool was causing the instabilities.
For now I have the "Optimized defaults" loaded in BIOS, with XMP enabled. This does OC slightly using a feature AMD calls turbo-mode (4.0 - 4.2 GHz)...
The AMD OC tool was pushing it to 4.8 GHz, though I used that very same tool to lower the clocking to stock (4.0 GHz) and still saw failures.
It seems the AMD OC tool is interacting directly with the CPUs in a way that causes the computation errors, which to me just seems highly unlikely... _________________
 |
|
|
 |
branjo Prince


Joined: 05 Jan 2006 Posts: 746 Location: Slovakia
|
Posted: Sun Mar 17, 2013 2:17 pm Post subject: |
|
|
I am using AMD "Catalyst Control Center" to OC my 7750 (running my non-OC'able CPU on factory settings), and the OC'ed GPU shrubs WCG/HCC1 w/o errors. But every project is more or less sensitive to overclocking, so it is good you found a way to get rid of the errors on EAH and SAH.
Cheers and good luck  _________________
  
 |
|
|
 |
Putting_things_on_top Duke


Joined: 14 Oct 2009 Posts: 435 Location: Frostbite Falls, Minnesota, USA
|
Posted: Sun Mar 17, 2013 10:00 pm Post subject: |
|
|
What it seems to boil down to (IMHO) is the sprint-vs-marathon or tortoise-vs-hare comparison.
I tend to go with the marathon approach - a slower, measured pace yields consistent long-term endurance.
 _________________
 |
|
|
 |
lvanst Baron


Joined: 31 Jan 2013 Posts: 152 Location: Phoenix, AZ (yes, it's hot here)
|
Posted: Sun Mar 17, 2013 10:34 pm Post subject: |
|
|
Yes PTOT, I'm starting to learn that, but the mad scientist in my head keeps screaming "More Power!!!"
Branjo, it's clocking at 880 MHz GPU/1125 MHz memory with AMD Overdrive turned off. It doesn't seem to clock any faster than that with it turned on... Do you have any idea what the differences are between the two modes? I see no visible difference in behavior, other than that I can manually set the GPU and memory to higher levels with AMD Overdrive turned on...
I think I'll rethermal that GPU to take it down a few degrees though... _________________
 |
|
|
 |
Putting_things_on_top Duke


Joined: 14 Oct 2009 Posts: 435 Location: Frostbite Falls, Minnesota, USA
|
|
|
 |
branjo Prince


Joined: 05 Jan 2006 Posts: 746 Location: Slovakia
|
Posted: Sat Mar 23, 2013 2:46 pm Post subject: |
|
|
lvanst wrote: | Yes PTOT, I'm starting to learn that, but the mad scientist in my head keeps screaming "More Power!!!"
Branjo, it's clocking at 880 MHz GPU/1125 MHz memory with AMD Overdrive turned off. It doesn't seem to clock any faster than that with it turned on... Do you have any idea what the differences are between the two modes? I see no visible difference in behavior, other than that I can manually set the GPU and memory to higher levels with AMD Overdrive turned on...
I think I'll rethermal that GPU to take it down a few degrees though... |
I am shrubbing with max OverDriven GPU Clock 900 MHz (this is important AFAIK), Memory clock 1,300 MHz (this is not important AFAIK), disabled "Manual fan control" (running at 10% speed), 10% Power control, w/o any problem. Temperature is 46 - 47 degrees with 98 - 99% GPU utilization for 12 concurrent WCG/HCC1 GPU tasks on 3 CPU threads (and the temperature was the same when I ran 32 concurrent WCG/HCC1 GPU tasks on all 8 threads).
 _________________
  
 |
|
|
 |
Gemjunkie Prince


Joined: 03 Jul 2010 Posts: 3519 Location: Earth, lately
|
Posted: Sat Mar 23, 2013 5:12 pm Post subject: |
|
|
You can drop the memory clock for MilkyWay to reduce power/heat a bit; it's not memory intensive. _________________
 |
|
|
 |
|