lots of tasks error out suddenly
Author | Message |
---|---|
Joined: 31 Dec 20 Posts: 3 Credit: 2,816,167 RAC: 0 |
I have been running nonstop since 11/22 with no issues, and suddenly 40 tasks failed all at once. Bad batch? Not much to see in the error log: `<core_client_version>7.16.6</core_client_version>` `<![CDATA[` `<message> process exited with code 195 (0xc3, -61)</message>` `<stderr_txt>` `17:58:05 (36465): wrapper (7.17.26016): starting` `17:58:06 (36465): wrapper (7.17.26016): starting` `17:58:06 (36465): wrapper: running cmdock (-c -j 1 -r target.prm -p "/var/lib/boinc-client/slots/14/data/scripts/dock.prm" -f htvs.ptc -i ligands.sdf -o docking_out)` `18:01:16 (36465): cmdock exited; CPU time 189.741309` `18:01:16 (36465): app exit status: 0x8b` `18:01:16 (36465): called boinc_finish(195)` |
Joined: 9 Oct 20 Posts: 185 Credit: 2,782,517 RAC: 50 |
Could you specify the computer, please? I see several computers belonging to you in the database. Some of them have error results, but some of those seem to have been resolved. |
Joined: 24 Oct 20 Posts: 23 Credit: 9,020 RAC: 0 |
Looks like it's this one - https://www.sidock.si/sidock/show_host_detail.php?hostid=23025 Enough RAM available? You could try a project reset and see if that helps. |
Joined: 31 Dec 20 Posts: 3 Credit: 2,816,167 RAC: 0 |
Yes, computer 23025 is the one I am questioning. It is a 12c/24t Xeon with 32 GB of memory, and it was not running anything else demanding at the time, so I doubt memory was an issue, although I did add memory recently. It continued to run some tasks successfully even after the 40 that failed, which all failed within a very short time period. I moved it to another project and it is running fine there, which is why I was questioning whether there was something up with the tasks that failed. If it is some system issue, I am not sure what to look for. I have not restarted the OS or BOINC since this occurred. I will try moving work back to it and see what happens. |
Joined: 30 Nov 21 Posts: 2 Credit: 1,245,009 RAC: 0 |
If you're running Linux, you should check the kernel logs (`journalctl -k` / `dmesg`) for any errors. |
Joined: 31 Dec 20 Posts: 3 Credit: 2,816,167 RAC: 0 |
Don't know if it helps, but I found a number of these in my system log from when these all failed: `cmdock[36125]: segfault at 5634063e8400 ip 00007f1fa00126d6 sp 00007ffe7b743ba0 error 4 in libcmdock.so.0[7f1f9fd52000+45b000]` `cmdock[36541]: segfault at 55adf716bb30 ip 00007fb85fe95539 sp 00007fffc47c7d00 error 4 in libcmdock.so.0[7fb85fb46000+45b000]` `cmdock[36441]: segfault at 557dcd3dace0 ip 00007f99ecbb66d9 sp 00007ffc3a8657a0 error 4 in libcmdock.so.0[7f99ec8f6000+45b000]` I won't paste all 40 of them here, but they all end the same way: "error 4 in libcmdock.so.0[************+45b000]" |
Joined: 30 Nov 21 Posts: 2 Credit: 1,245,009 RAC: 0 |
If it's not a typical C/C++ programming error such as a use-after-free, then it's most likely memory corruption caused by unstable memory or an unstable CPU. It could also be a cosmic ray, however unlikely. |
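
For anyone hitting the same symptoms, the numbers quoted in this thread can be decoded mechanically. The sketch below is only an illustration, not part of BOINC or cmdock, and all function names in it are made up for the example. It assumes the wrapper's `app exit status: 0x8b` follows the common "128 + signal number" convention (reading it as a raw wait status with the core-dump bit set would give the same signal number here), and it assumes the x86 meaning of the page-fault error code bits in the kernel's segfault lines. As far as I know, the 195 passed to `boinc_finish` is BOINC's generic "child failed" code, so the wrapped application's own status is the more informative one.

```python
import re
import signal

# Format of the kernel's segfault lines as quoted in the thread (x86 Linux).
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<lib>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_exit_status(status):
    """Return the signal number implied by a 128+N exit status, else None.

    Assumption: the status uses the shell-style 128+N convention.
    """
    return status - 128 if 128 < status < 256 else None


def describe_fault(err):
    """Decode the low three bits of an x86 page-fault error code."""
    cause = "protection violation" if err & 1 else "page not present"
    access = "write" if err & 2 else "read"
    mode = "user" if err & 4 else "kernel"
    return f"{mode}-mode {access}, {cause}"


def parse_segfaults(log_text):
    """Yield one dict per segfault line, with the offset of ip inside the library."""
    for m in SEGFAULT_RE.finditer(log_text):
        ip, base = int(m["ip"], 16), int(m["base"], 16)
        yield {
            "pid": int(m["pid"]),
            "lib": m["lib"],
            "offset": hex(ip - base),  # instruction pointer relative to the mapping
            "fault": describe_fault(int(m["err"], 16)),
        }


# The three kernel log lines quoted in the thread, verbatim.
SAMPLE = """\
cmdock[36125]: segfault at 5634063e8400 ip 00007f1fa00126d6 sp 00007ffe7b743ba0 error 4 in libcmdock.so.0[7f1f9fd52000+45b000]
cmdock[36541]: segfault at 55adf716bb30 ip 00007fb85fe95539 sp 00007fffc47c7d00 error 4 in libcmdock.so.0[7fb85fb46000+45b000]
cmdock[36441]: segfault at 557dcd3dace0 ip 00007f99ecbb66d9 sp 00007ffc3a8657a0 error 4 in libcmdock.so.0[7f99ec8f6000+45b000]
"""

sig = decode_exit_status(0x8B)
print(sig, signal.Signals(sig).name)
for fault in parse_segfaults(SAMPLE):
    print(fault)
```

Under those assumptions, 0x8b decodes to signal 11 (SIGSEGV on Linux), which matches the kernel log. "error 4" reads as a user-mode read of a page that is not present. The three quoted lines give offsets 0x2c06d6, 0x34f539 and 0x2c06d9 inside `libcmdock.so.0`, so two of the three crashes land within a few bytes of each other. What that implies about the root cause is not something this sketch can settle.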