Message boards :
Number crunching :
Tasks hanging -
Author | Message |
---|---|
Joined: 12 Feb 23 Posts: 1 Credit: 4,850,676 RAC: 6,724 |
Have several systems running sidock, all on long tasks 0.2.0. Unaccountably, 7 tasks seem to be hung on a dual Xeon, 24-thread system running Linux. Their time to complete is several weeks past the deadline. The other 11 tasks do not show any problem. None of the other systems have this problem, but they run Windows.

jstateson@dual-linux:~$ free -l
              total        used        free      shared  buff/cache   available
Mem:       12271956     5256616     5031304       27584     1984036     6622240
Low:       12271956     7240652     5031304
High:             0           0           0
Swap:       2097148           0     2097148

HTOP shows 100% usage, but temperatures are strange on one CPU. Both CPUs are liquid cooled with separate closed-loop systems, so there could be some differences.

jstateson@dual-linux:~$ sensors
coretemp-isa-0000
Adapter: ISA adapter
Core 0: +31.0°C (high = +80.0°C, crit = +96.0°C)
Core 1: +35.0°C (high = +80.0°C, crit = +96.0°C)
Core 2: +34.0°C (high = +80.0°C, crit = +96.0°C)
Core 8: +36.0°C (high = +80.0°C, crit = +96.0°C)
Core 9: +32.0°C (high = +80.0°C, crit = +96.0°C)
Core 10: +34.0°C (high = +80.0°C, crit = +96.0°C)

amdgpu-pci-0400
Adapter: PCI adapter
vddgfx: 1000.00 mV
fan1: 2991 RPM (min = 0 RPM, max = 3700 RPM)
edge: +45.0°C (crit = +94.0°C, hyst = -273.1°C)
power1: 89.03 W (cap = 90.00 W)

coretemp-isa-0001
Adapter: ISA adapter
Core 0: +27.0°C (high = +80.0°C, crit = +96.0°C)
Core 1: +26.0°C (high = +80.0°C, crit = +96.0°C)
Core 2: +29.0°C (high = +80.0°C, crit = +96.0°C)
Core 8: +28.0°C (high = +80.0°C, crit = +96.0°C)
Core 9: +22.0°C (high = +80.0°C, crit = +96.0°C)
Core 10: +24.0°C (high = +80.0°C, crit = +96.0°C) |
Joined: 24 Oct 20 Posts: 21 Credit: 10,159,102 RAC: 0 |
Can we multi-thread these tasks, or are they single-core only? |
Joined: 24 Oct 20 Posts: 21 Credit: 10,159,102 RAC: 0 |
The newer 2.02 units seem to be completing OK. One puzzle that remains is the inconsistent granted credit: in some cases it varies by up to 25% on the same computer. That's what happens when credit is computed by an algorithm instead of being a fixed amount. |
Joined: 11 Oct 20 Posts: 337 Credit: 25,678,308 RAC: 9,173 |
"Have several systems running sidock, all long tasks 0.2.0"

Hello! Would you post the task names and current run times? Thank you! |
Joined: 28 Dec 20 Posts: 13 Credit: 8,957,185 RAC: 0 |
Hung task on Intel Atom N270, 32-bit. Manually compiled. With "leave non-GPU tasks in memory while suspended" turned off, pausing and resuming can get the task unstuck.

Task: corona_RdRp_v2_sidock-s_98_00014708_r1_s-20_0
https://www.sidock.si/sidock/result.php?resultid=79096387

With gdb, I got some information:

RbtChromDihedralElement::StandardisedValue (dihedralAngle=2.8241407014048006e+17)
at src/lib/RbtChromDihedralElement.cxx:152

That is a very large angle, about 282414070140480060. The function StandardisedValue repeatedly subtracts 360. At this magnitude a 64-bit float (a double) has so little precision left that each subtraction is rounded, and the loop would need roughly 8e14 iterations to bring the value back into range. For even larger values the subtraction is rounded away entirely. Either way the result is an effectively endless loop and an endless task.

A possible source code fix is to use remainder or fmod in the StandardisedValue function. The remainder function is probably the best choice, and it avoids the need for loops.

dihedralAngle = remainder(dihedralAngle, 360);
// OR
// dihedralAngle = fmod(dihedralAngle, 360);

(gdb) bt
#0 0x083144e8 in RbtChromDihedralElement::StandardisedValue (dihedralAngle=2.8241407014048006e+17) at ../src/lib/RbtChromDihedralElement.cxx:152
#1 0x08314890 in RbtChromDihedralElement::Mutate (this=0xd537120, relStepSize=16331239353195370) at ../src/lib/RbtChromDihedralElement.cxx:73
#2 0x08311c58 in RbtChrom::Mutate (relStepSize=16331239353195370, this=<optimized out>) at ../src/lib/RbtChrom.cxx:56
#3 RbtChrom::Mutate (this=<optimized out>, relStepSize=16331239353195370) at ../src/lib/RbtChrom.cxx:56
#4 0x082e4140 in RbtPopulation::GAstep (this=0xb514750, nReplicates=nReplicates@entry=1100, relStepSize=relStepSize@entry=1, equalityThreshold=equalityThreshold@entry=0.10000000000000001, pcross=pcross@entry=0.40000000000000002, xovermut=xovermut@entry=true, cmutate=cmutate@entry=false) at ../src/lib/RbtPopulation.cxx:105
#5 0x082806f8 in RbtGATransform::Execute (this=<optimized out>) at ../include/RbtSmartPointer.h:131
#6 0x0830e73d in RbtTransformAgg::Execute (this=0x85c3960) at ../src/lib/RbtTransformAgg.cxx:165
#7 0x081dfcb3 in RbtWorkSpace::Run (this=0x85b02b0) at ../src/lib/RbtWorkSpace.cxx:170
#8 0x080955d0 in main (argc=<optimized out>, argv=<optimized out>) at ../src/exe/cmdock.cxx:775 |
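The difference between the loop and the suggested remainder call can be sketched as below. This is a minimal illustration, not the cmdock source: the function names, the [-180, 180) loop bounds, and the maxSteps safety cap are all assumptions made for the example, since the post only states that the function repeatedly subtracts 360.

```cpp
#include <cmath>

// Hypothetical stand-in for the loop-based approach described in the post:
// wrap an angle by repeated +/-360 steps. The [-180, 180) bounds are an
// assumption. maxSteps is an illustrative safety cap (not part of cmdock)
// so that this sketch can never hang; gaveUp reports that the cap was hit.
double standardiseByLoop(double angle, long maxSteps, bool& gaveUp) {
    gaveUp = false;
    long steps = 0;
    while (angle >= 180.0) {
        if (++steps > maxSteps) { gaveUp = true; return angle; }
        angle -= 360.0;
    }
    while (angle < -180.0) {
        if (++steps > maxSteps) { gaveUp = true; return angle; }
        angle += 360.0;
    }
    return angle;
}

// Loop-free alternative along the lines suggested in the post.
// std::remainder returns a value in [-180, 180] for a finite input and
// runs in constant time regardless of how large the input is.
double standardiseByRemainder(double angle) {
    return std::remainder(angle, 360.0);
}
```

For ordinary inputs such as 370 both functions return 10. For the angle from the backtrace, the loop version hits any reasonable step cap, while the remainder version returns immediately. Note that std::remainder can return either -180 or +180 at the boundary, so code that relies on a half-open range would need one extra adjustment.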
Joined: 28 Dec 20 Posts: 13 Credit: 8,957,185 RAC: 0 |
Update: I now suspect this may be caused by faulty hardware producing random errors. The more I look at where the number (relStepSize=16331239353195370) could have come from, the more I think my hardware is making random wrong calculations. It could be faulty RAM or a faulty CPU in the N270 HP Mini 110-1000, a very old computer. I had some difficulty just getting this computer to start: often the display just stays blank with no boot, and I have to power cycle it a few times. Once the computer randomly lost power while all other devices stayed on. I guess it may fail soon. Oh well, I have some other computers I can use. The software could include checks for implausible numbers to reduce the chance of getting stuck in an endless loop; PrimeGrid is an example of a project with lots of checks for possible errors. |
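A check of the kind mentioned above might look like the following. This is only a sketch under stated assumptions: isPlausibleValue, checkedValue, and the maxAbs limit are hypothetical names invented for illustration, and nothing here is taken from cmdock or PrimeGrid.

```cpp
#include <cmath>
#include <stdexcept>

// Hypothetical guard: a value is accepted only if it is finite
// (not NaN or infinity) and its magnitude does not exceed maxAbs.
bool isPlausibleValue(double x, double maxAbs) {
    return std::isfinite(x) && std::fabs(x) <= maxAbs;
}

// Hypothetical wrapper: pass the value through unchanged, or fail loudly
// instead of letting an implausible number reach a wrapping loop.
double checkedValue(double x, double maxAbs) {
    if (!isPlausibleValue(x, maxAbs)) {
        throw std::range_error("value outside plausible range");
    }
    return x;
}
```

With a limit of, say, 1e6, the relStepSize from the backtrace would be rejected before it could reach the angle-wrapping code. Whether to throw, clamp, or redraw the random number is a design decision for the project.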
Joined: 11 Oct 20 Posts: 337 Credit: 25,678,308 RAC: 9,173 |
sam6861, thank you for the interesting report! |
Joined: 28 Dec 20 Posts: 13 Credit: 8,957,185 RAC: 0 |
Another update: I looked at the random number code in the source. I may have been wrong about hardware errors; this is more likely a software bug, and the randomizer just makes the problem happen by random chance. I believe I found a bug in the function RbtRand::generate_cauchy.

https://gitlab.com/Jukic/cmdock/-/blob/v0.2.0/src/lib/RbtRand.cxx

At line 216, val is a random decimal number between -0.5 and 0.5. The problem is the use of tan(pi * val), computed in radians. tan(pi*0.4999999999) in Linux Qalculate is 3183098861.837907 (about 3.18e9), a huge number, and the result keeps growing as val gets closer to 0.5.

Call chain:
RbtRand::generate_cauchy in src/lib/RbtRand.cxx line 220
RbtRand::GetCauchyRandom in src/lib/RbtRand.cxx line 69
RbtChromElement::CauchyMutate in src/lib/RbtChromElement.cxx line 86

The next function in the chain after CauchyMutate is RbtChrom::Mutate, which is where relStepSize=16331239353195370 ended up before getting stuck in RbtChromDihedralElement::StandardisedValue. I am not sure how to solve this problem in RbtRand::generate_cauchy; I guess it is mostly up to other people to find a good fix for these random huge numbers. |
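The blow-up of tan(pi * val) near val = 0.5 can be reproduced in a few lines. The sketch below is not the cmdock implementation: cauchyFromUniform, boundedCauchyFromUniform, and the maxDeviation parameter are hypothetical names, and clamping is only one possible mitigation, shown here as an assumption. It is worth noting that on typical IEEE 754 platforms tan evaluated at the double nearest pi/2 is about 1.63e16, which is consistent with the relStepSize value in the backtrace and suggests val was at or extremely close to 0.5.

```cpp
#include <algorithm>
#include <cmath>

// Cauchy-distributed deviate from a uniform u in [-0.5, 0.5], using the
// tan(pi * u) form described in the post. mean and width are illustrative
// parameter names, not taken from the cmdock source.
double cauchyFromUniform(double u, double mean, double width) {
    const double pi = std::acos(-1.0);
    return mean + width * std::tan(pi * u);
}

// Hypothetical mitigation: limit the deviate to mean +/- maxDeviation * width.
// This truncates the heavy tails of the distribution, so it changes the
// sampling behaviour; whether that is acceptable is a project decision.
double boundedCauchyFromUniform(double u, double mean, double width,
                                double maxDeviation) {
    const double raw = cauchyFromUniform(u, mean, width);
    const double limit = std::fabs(width) * maxDeviation;
    return std::max(mean - limit, std::min(raw, mean + limit));
}
```

Calling cauchyFromUniform(0.4999999999, 0.0, 1.0) gives a value of about 3.18e9, and u = 0.5 gives a value above 1e16, while the bounded variant with maxDeviation = 100 returns 100 in both cases. An alternative that keeps the distribution shape would be to redraw u whenever the result is out of range.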
©2024 SiDock@home Team