Tasks hanging
Author | Message |
---|---|
Send message Joined: 16 Aug 21 Posts: 40 Credit: 17,532,587 RAC: 28,959 |
Has anyone else had a problem with the 0.2.0 long tasks hanging? I aborted https://www.sidock.si/sidock/result.php?resultid=76865427 when it was in high priority mode with 4 days remaining and a 3 day deadline as the progress was not moving and the outstanding time was still rising. I now have 2 more in the same state, for example corona_Sprot_delta_v1_RM_sidock_00411620_r1_s1000.0_0 at 51.004% after nearly 15 hours. |
Send message Joined: 23 Oct 20 Posts: 5 Credit: 3,559,933 RAC: 3,017 |
I have 2 of these hanging tasks ongoing - should I abort them? Both are using a full core of CPU time. In both cases, the only files in the slot directory that are being updated are boinc_mmap_file and init_data.xml |
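The check described in the post above (seeing which files in a slot directory are still being written) can be sketched roughly as follows. This is only an illustration: the function name `recently_modified` and the 5-minute window are assumptions, and the location of the slots directory depends on the BOINC installation.

```python
import time
from pathlib import Path


def recently_modified(slot_dir, window_seconds=300, now=None):
    """Return the names of regular files in slot_dir modified within the window."""
    now = time.time() if now is None else now
    recent = []
    for entry in sorted(Path(slot_dir).iterdir()):
        # Compare each file's last-modified time against the cut-off.
        if entry.is_file() and now - entry.stat().st_mtime <= window_seconds:
            recent.append(entry.name)
    return recent
```

Calling it with the path of one slot directory (for example a hypothetical `slots/0` under the BOINC data directory) returns the file names touched in the last five minutes. If only boinc_mmap_file and init_data.xml appear, that matches the symptom reported here.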
Send message Joined: 16 Aug 21 Posts: 40 Credit: 17,532,587 RAC: 28,959 |
"I have 2 of these hanging tasks ongoing - should I abort them?" I've suspended the 3 I have in progress until the admins have a chance to review them. |
Send message Joined: 11 Oct 20 Posts: 337 Credit: 25,677,525 RAC: 9,176 |
Hello! Yes, if the progress counter stays frozen for an unusually long time, you can try suspending and resuming the task, or stopping BOINC and starting it again. The second option is more reliable. |
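For anyone who prefers the command line to BOINC Manager, the suspend-and-resume step can be sketched with `boinccmd`, whose `--task URL task_name op` form accepts `suspend`, `resume` and `abort`. The helper names below are hypothetical, and the project URL is an assumption taken from the result links in this thread; it has to match the URL the client is attached with.

```python
import subprocess
import time

# Assumption: project URL as it appears in the result links in this thread.
PROJECT_URL = "https://www.sidock.si/sidock/"


def task_command(task_name, op, project_url=PROJECT_URL):
    """Build the boinccmd argument list for one task operation."""
    if op not in ("suspend", "resume", "abort"):
        raise ValueError(f"unsupported operation: {op}")
    return ["boinccmd", "--task", project_url, task_name, op]


def suspend_then_resume(task_name, pause_seconds=10, run=subprocess.run):
    """Suspend a task, wait briefly, then resume it."""
    run(task_command(task_name, "suspend"), check=True)
    time.sleep(pause_seconds)
    run(task_command(task_name, "resume"), check=True)
```

The `run` parameter exists only so the function can be tried without a BOINC client present. Whether suspending actually unloads the application depends on the client's "leave in memory" preference.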
Send message Joined: 23 Oct 20 Posts: 5 Credit: 3,559,933 RAC: 3,017 |
Suspending the task (without the "leave in memory" option set) looks like it did the trick. What sort of time threshold are we looking at before we suspect a hung task and dig into the slots dir? 4 hours? More? |
Send message Joined: 16 Aug 21 Posts: 40 Credit: 17,532,587 RAC: 28,959 |
I'd treat a task as hung if the percentage is static and the time remaining keeps increasing over a 5-minute period. |
Send message Joined: 11 Oct 20 Posts: 337 Credit: 25,677,525 RAC: 9,176 |
I agree: 5 to 10 minutes. For RPi, 5 to 30 minutes. |
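The rule of thumb from the last few posts (progress static and remaining time rising over a 5 to 10 minute window) could be written down as a small check. This is a sketch under assumptions: the `Sample` type and `looks_hung` function are hypothetical, and the readings are presumed to have been noted by hand or collected by some other tool, oldest first.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    t: float              # time of the reading, in seconds
    fraction_done: float  # reported progress, 0.0 to 1.0
    remaining: float      # estimated seconds remaining


def looks_hung(samples, window_seconds=300):
    """True if progress was static and remaining time rose across the window."""
    if len(samples) < 2:
        return False
    latest = samples[-1]
    # Readings that are at least window_seconds older than the latest one.
    older = [s for s in samples if latest.t - s.t >= window_seconds]
    if not older:
        return False  # not enough history yet
    start = older[-1]  # most recent reading that is old enough
    window = [s for s in samples if s.t >= start.t]
    static = all(s.fraction_done == start.fraction_done for s in window)
    rising = latest.remaining > start.remaining
    return static and rising
```

Following the suggestion above, `window_seconds` would be 300 to 600 on a desktop and up to 1800 on an RPi.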
Send message Joined: 16 Aug 21 Posts: 40 Credit: 17,532,587 RAC: 28,959 |
Confirmed, this issue was caused by low memory. Running 3 CPDN OpenIFS apps and 21 mixed SiDock, TN-Grid and WCG tasks in 16 GB was always going to be tight, but since upgrading to 64 GB the hanging tasks have disappeared. |
Send message Joined: 27 Dec 21 Posts: 19 Credit: 16,650,286 RAC: 16,136 |
I also see some of these "stuck" tasks with the latest app (I never saw this behavior with the previous version). The CPU core is still fully used, but actual progress stops. To make it worse, the application seems to have no "watchdog" timer (or its settings are inadequate). Normal tasks complete in 1.5-3 hours each on a single core on my computers (depending on the CPU; I have a few different ones), but a bad one can occupy a processor for a day or two and never end until I cancel or restart it. In that time, 10-20 other tasks could have completed on the same core. If you cannot find and eliminate the root cause of the failures, I would recommend adding a guard timer, and preferably not for the entire task (WU, BOINC work unit): each one actually contains many separate micro-tasks (modeling attempts; judging by the logs, about 500 are packed into each "long" task by default). If an individual micro-task has not finished after 10-15 minutes (normal run times on relatively modern CPUs are under 1 minute), it will never finish and should be restarted or canceled. |
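To illustrate the guard-timer idea from the post above, here is a minimal sketch that assumes each micro-task can be launched as its own process. That is an assumption made for the example only; nothing in this thread says how the SiDock application runs its modeling attempts internally. The function name `run_with_watchdog` is hypothetical, and the 900-second default follows the 10-15 minute limit suggested above.

```python
import subprocess


def run_with_watchdog(argv, timeout_seconds=900, max_attempts=2):
    """Run one step; kill and retry it if it exceeds the time limit.

    Returns the number of the attempt that finished, or None if every
    attempt timed out.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            subprocess.run(argv, timeout=timeout_seconds, check=True)
            return attempt
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child process before raising this.
            continue
    return None
```

The point of the design is that one stuck attempt costs at most `timeout_seconds * max_attempts` instead of a day or two of a core. A step that exits with an error still raises `CalledProcessError`, so genuine failures are not hidden.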
Send message Joined: 11 Oct 20 Posts: 337 Credit: 25,677,525 RAC: 9,176 |
Hello Max! Thank you for the report! Good idea. |
Send message Joined: 23 Dec 20 Posts: 20 Credit: 1,360,768 RAC: 0 |
The task below ran to completion after a restart (I had restarted earlier because of another stuck task). https://www.sidock.si/sidock/result.php?resultid=77388891 Paul. |
Send message Joined: 27 Nov 22 Posts: 20 Credit: 4,452,454 RAC: 23,175 |
I have had 12 tasks hang across 4 different computers over the past few days. The lost time represents roughly 116 tasks which could have been processed. |
Send message Joined: 11 Oct 20 Posts: 337 Credit: 25,677,525 RAC: 9,176 |
Hi folks! I caught one of them and am now trying to reproduce the problem. If that succeeds, it will help greatly. Thank you for the reports! |
Send message Joined: 16 Aug 21 Posts: 40 Credit: 17,532,587 RAC: 28,959 |
"Hi folks! I caught one of them and am now trying to reproduce the problem. If that succeeds, it will help greatly." I note that all the tasks giving me problems are RdRp_v2_sample, whereas all of the successful tasks are Sprot_delta_v1_RM_sidock. Are the sample jobs a faulty batch, and should I abort them on sight? |
Send message Joined: 23 Dec 20 Posts: 20 Credit: 1,360,768 RAC: 0 |
"Hi folks! I caught one of them and am now trying to reproduce the problem. If that succeeds, it will help greatly." Not so for me; I have not got to the RdRp tasks yet. The ones I have had hang are Sprot. The current one is https://www.sidock.si/sidock/workunit.php?wuid=49502557 Paul. |
Send message Joined: 16 Aug 21 Posts: 40 Credit: 17,532,587 RAC: 28,959 |
"Hi folks! I caught one of them and am now trying to reproduce the problem. If that succeeds, it will help greatly." Having reset the 3 that appeared to be hanging, they’re running OK for now with an apparent 17-hour expected run time. |
Send message Joined: 27 Nov 22 Posts: 20 Credit: 4,452,454 RAC: 23,175 |
I have reset the project and set it to no new work units because of the number of them hanging. On to another project until this issue is resolved. |
Send message Joined: 24 Oct 20 Posts: 21 Credit: 10,159,102 RAC: 0 |
"Project reset with no new work units instructed based on the number of them hanging." Early in the past week I aborted more than a dozen tasks across several PCs, but over the last few days I haven't had to abort any. I don't know if I've been lucky or if they are being worked through. |
Send message Joined: 18 Oct 22 Posts: 9 Credit: 16,528,549 RAC: 255,235 |
Teams, Crunchers, There are still lots of buggy workunits around. I consider those a waste of energy and computing power. I am fully aware that this is a voluntary effort everyone is contributing to here. However, for the sake of the scientists as well as the volunteers, issues should be fixed. I feel there is either no attention to or no progress on fixing anything. I will now start aborting all suspicious tasks, hoping an increase in aborted items will catch someone's attention. Finally, if there is no visible improvement, I will dedicate my computing capacity to other projects. Very sad to talk like this, but I do not see any other option. Cheers |
Send message Joined: 11 Oct 20 Posts: 337 Credit: 25,677,525 RAC: 9,176 |
Tasks like "corona_RdRp_v2_*" are much longer than "corona_Sprot_*". The estimated time on Ryzen CPUs is 16 hours or more, with 30 to 600 seconds per step (this depends on the "luck" of the ligand: modeling takes longer for the lucky ones). |
©2024 SiDock@home Team