Message boards :
Number crunching :
Tasks hanging -
Author | Message |
---|---|
Send message Joined: 22 Nov 20 Posts: 10 Credit: 13,167,232 RAC: 127 |
I'm really not liking these super long tasks. Time to consider moving to another project. |
Send message Joined: 23 Sep 22 Posts: 6 Credit: 555,953 RAC: 0 |
I'm having the same issue right around now as well. Over half my threads seem to be bogged down. Previously I've only had to kill the rare hanging task after a system reboot (which points to a possible inability of a task to resume from a checkpoint). This time around I have a batch full of tasks that are either going really slowly or are apparently hanging. They do peg the CPU at about 100%, so they're definitely using the CPU, but the code may be stuck in a loop. One possible cause is heavy system load. I have only 8 threads (8c/8t AM3 processor) and my system load is often quite a bit above 10. However, I have boinc set to keep computing regardless of system load. This, afaik, should rule out the possibility that the program is unable to properly suspend and resume a compute thread (i.e. that errors arise from this step). Another possible factor is that I've had steam running, which is chromium/google based. (Google regularly has questionable code, so chromium derivatives often end up being vulnerable to exploits from memory safety issues and other matters of open sores, which are then exploited by malware dished out by real time bidding ads); there might be some possibility this chrome derivative is basically borking the system to the point of corrupting other processes. Given the other reports though, it seems like the likeliest bet is some bugs or suboptimal programming of the work units. One last possibility, is there any chance these work units depend on results of other work units? I did manually change the order the compute units would normally start (reason being I like to avoid high system loads during gaming sessions, so I have all unstarted tasks on suspend, and manually start them off when system load is low). |
Send message Joined: 23 Sep 22 Posts: 6 Credit: 555,953 RAC: 0 |
I don't mind long tasks, as long as you can be sure they complete in a reasonable or known time limit. |
Send message Joined: 11 Oct 20 Posts: 337 Credit: 25,679,261 RAC: 9,148 |
Hi! "One last possibility, is there any chance these work units depend on results of other work units? I did manually change the order the compute units would normally start (reason being I like to avoid high system loads during gaming sessions, so I have all unstarted tasks on suspend, and manually start them off when system load is low)." Tasks are completely independent of each other. You can freely stop and start them. If you are interested, you can see how the calculations are performed: go into the appropriate task slot directory and view the docking_out.log file. Inside it you can see how long it took to simulate the interaction of each processed ligand with the target. Seeing how this time varies may clarify the situation. |
Send message Joined: 23 Sep 22 Posts: 6 Credit: 555,953 RAC: 0 |
Thank you for the clarifications and helpful tip regarding the logs. I checked on one task which was about 87% complete. It seems to have hung on one of the records. On record 432 there is the last estimate made for time of completion, with 69 records remaining.

RECORD #432
NAME: ZINC001026963444
RNG seed:std::random_device
Numer of docking runs done: 20 (0 errors)
Ligand docking duration: 22.0796 second(s)
Average duration per ligand: 24.7868 second(s)
Approximately 69 record(s) remaining, will finish Sat Jan 14 07:49:39 2023
**************************************************
RECORD #433
NAME: ZINC001026963481
RNG seed:std::random_device
Numer of docking runs done: 20 (0 errors)
Ligand docking duration: 23.5675 second(s)
**************************************************
RECORD #434
NAME: ZINC001026963482
[blah.... truncated for brevity]
**************************************************
RECORD #435
NAME: ZINC001026963521
[blah.... truncated for brevity]
**************************************************
RECORD #436
[blah.... truncated for brevity]
**************************************************
RECORD #437
[blah.... truncated for brevity]
**************************************************
RECORD #438
NAME: ZINC001026963524
RNG seed:std::random_device
Numer of docking runs done: 20 (0 errors)
Ligand docking duration: 23.1778 second(s)
**************************************************
RECORD #439
NAME: ZINC001026963525
RNG seed:std::random_device

The log file ends with Record #439. The last complete record logged without missing information was the previous record #438. For what reason this ZINC001026963525 simulation seems to have hung I don't know. However, I think I will kill the task, as well as other tasks that seem to be hanging. I'll make a zip file of the logs in slot 4 in case there's further info in these to shed some light on the cause. The directory listing indicates that the only file being updated in the slot 4 directory is some sort of mmap file. All the other files have stopped changing for hours:

root@mars2:/var/lib/boinc-client/slots/4# date; ls -alt
Mon 16 Jan 2023 09:29:08 PM EST
total 12012
-rw-r--r-- 1 boinc boinc 8192 Jan 16 21:28 boinc_mmap_file
drwxrwx--x 4 boinc boinc 4096 Jan 15 06:34 .
-rw-r--r-- 1 boinc boinc 6358 Jan 15 06:34 init_data.xml
-rw-r--r-- 1 boinc boinc 517 Jan 14 07:23 boinc_task_state.xml
-rw-r--r-- 1 boinc boinc 28 Jan 14 07:23 wrapper_checkpoint.txt
-rw-r--r-- 1 boinc boinc 107863 Jan 14 07:23 docking_out.log
-rw-r--r-- 1 boinc boinc 47151 Jan 14 07:23 docking_log
-rw-r--r-- 1 boinc boinc 1617067 Jan 14 07:23 docking_out
-rw-r--r-- 1 boinc boinc 255 Jan 14 07:23 docking_out.chk
-rw-r--r-- 1 boinc boinc 8 Jan 14 07:23 docking_out.progress
-rw-r--r-- 1 boinc boinc 0 Jan 14 04:22 boinc_lockfile
-rw-r--r-- 1 boinc boinc 274 Jan 14 04:22 stderr.txt
-rw-r--r-- 1 boinc boinc 56 Jan 14 04:22 htvs.ptc
-rw-r--r-- 1 boinc boinc 180365 Jan 14 04:22 target.mol2
-rw-r--r-- 1 boinc boinc 7856840 Jan 14 04:22 target.as
-rwxr-xr-x 1 boinc boinc 408352 Jan 14 04:22 cmdock
-rw-r--r-- 1 boinc boinc 100 Jan 14 04:22 cmdock-l_wrapper_2.0_x86_64-pc-linux-gnu
-rw-r--r-- 1 boinc boinc 721 Jan 14 04:22 job.xml
-rw-r--r-- 1 boinc boinc 1033 Jan 14 04:22 target.prm
drwxr-xr-x 13 boinc boinc 4096 Jan 12 02:27 ..
drwxr-xr-x 6 boinc boinc 4096 Dec 21 03:53 data
drwxr-xr-x 2 boinc boinc 4096 Dec 21 03:53 lib
-rw-rw-r-- 1 boinc boinc 1983385 Jan 25 2022 ligands.sdf

Boincmgr under the task properties displays the following for the properties:

Application CurieMarieDock 0.2.0 long tasks 2.00
Name corona_Sprot_delta_v1_RM_sidock_00518854_r4_s1000.0
State Running
Received Sat 14 Jan 2023 12:27:51 AM EST
Report deadline Thu 19 Jan 2023 12:27:51 AM EST
Estimated computation size 30,000 GFLOPs
CPU time 2d 15:20:50
CPU time since checkpoint 2d 12:21:11
Elapsed time 2d 16:10:37
Estimated time remaining 09:05:03
Fraction done 87.600%
Virtual memory size 141.95 MB
Working set size 2.73 MB
Directory slots/4
Process ID 275748
Progress rate 1.440% per hour
Executable cmdock-l_wrapper_2.0_x86_64-pc-linux-gnu

Maybe putting a timeout break point in the code and adding some debugging code will shed light on why some simulations seem to hang. |
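The manual check in this post (finding the last record that was written out completely) can be scripted. The Python sketch below assumes the log layout quoted in the post, where every finished record contains a `Ligand docking duration:` entry; the function name `find_unfinished_records` and the embedded sample are illustrative only and are not part of cmdock or BOINC.

```python
import re

# Each record in the quoted log starts with "RECORD #<number>".
RECORD_RE = re.compile(r"RECORD #(\d+)")


def find_unfinished_records(log_text):
    """Return the numbers of records that have no docking duration entry."""
    # With a capturing group, split() yields
    # [preamble, number1, body1, number2, body2, ...].
    parts = RECORD_RE.split(log_text)
    unfinished = []
    for number, body in zip(parts[1::2], parts[2::2]):
        if "Ligand docking duration:" not in body:
            unfinished.append(int(number))
    return unfinished


# Last two records from the post (assumed one field per line in the real file).
SAMPLE = """\
RECORD #438
NAME: ZINC001026963524
RNG seed:std::random_device
Numer of docking runs done: 20 (0 errors)
Ligand docking duration: 23.1778 second(s)
**************************************************
RECORD #439
NAME: ZINC001026963525
RNG seed:std::random_device
"""

print(find_unfinished_records(SAMPLE))  # [439]
```

On a task that is still progressing normally, the last record will also show up as unfinished for a short while, so the result is only meaningful together with the file timestamps: here docking_out.log had not changed for more than two days.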
Send message Joined: 11 Oct 20 Posts: 337 Credit: 25,679,261 RAC: 9,148 |
Yes, the task is hung: ... Name corona_Sprot_delta_v1_RM_sidock_00518854_r4_s1000.0 ... CPU time 2d 15:20:50, CPU time since checkpoint 2d 12:21:11, Elapsed time 2d 16:10:37 ... In cases like this, I stop BOINC and start it again. As I understand it, the computer simulation reproduces a chaotic process using random numbers, so in some cases the simulation may go into an infinite loop; but if we restart it from the last checkpoint (the last processed ligand), new random numbers will be used and the computation will continue successfully. At the moment this is only my hypothesis, and I am now trying to reproduce the problem on my computer with one of the hung tasks. Thank you for the report! |
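The hypothesis above, combined with the earlier suggestion of a timeout in the code, amounts to a step budget plus a retry with a fresh random seed. The Python sketch below only illustrates that idea under stated assumptions: cmdock's internals are not shown in this thread, and `run_with_reseed`, `flaky_ligand` and the `new_seed` parameter are hypothetical names.

```python
import random


def run_with_reseed(simulate, max_steps=10_000, max_attempts=3, new_seed=None):
    """Call simulate(rng, max_steps) until it returns a result.

    `simulate` must return None when it uses up its step budget, which
    stands in for a docking run that would otherwise loop forever.
    Every attempt gets a freshly seeded random number generator.
    """
    if new_seed is None:
        # Assumption: seeds are drawn from the operating system's entropy pool.
        new_seed = lambda: random.SystemRandom().getrandbits(64)
    for attempt in range(1, max_attempts + 1):
        rng = random.Random(new_seed())
        result = simulate(rng, max_steps)
        if result is not None:
            return attempt, result
    raise TimeoutError(f"no result after {max_attempts} attempts")


# Stand-in for a ligand whose first run never converges.
calls = []


def flaky_ligand(rng, max_steps):
    calls.append(max_steps)
    return None if len(calls) == 1 else rng.random()


attempt, score = run_with_reseed(flaky_ligand)
print(attempt)  # 2
```

Stopping and restarting BOINC by hand does the same thing from the outside: the task resumes from the last processed ligand and, per the hypothesis, draws new random numbers.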
Send message Joined: 23 Sep 22 Posts: 6 Credit: 555,953 RAC: 0 |
Thanks very much. I'm still getting quite a few slow tasks. I'll do a system reboot soon, and that should restart boinc. If that doesn't improve things, then one more thing to look into is whether the code for the simulation has changed over the past week (as opposed to just new data). In my experience things were going super smoothly just a week ago, before these symptoms. |
Send message Joined: 10 Dec 20 Posts: 24 Credit: 10,767,590 RAC: 0 |
I haven't looked at the logs but most all my WU's are showing 2d+ left till completion and the returned credit at Free-DC for this project has taken a sharp nosedive today implying it's a systemic problem in the WU's. |
Send message Joined: 16 Aug 21 Posts: 40 Credit: 17,538,731 RAC: 29,008 |
"I haven't looked at the logs but most all my WU's are showing 2d+ left till completion and the returned credit at Free-DC for this project has taken a sharp nosedive today implying it's a systemic problem in the WU's." No, they’ve moved from tasks taking an hour or so to tasks taking a day or so. Credits will pick up when the long tasks finish and the average 1,500 credits per task kick in. |
Send message Joined: 10 Dec 20 Posts: 24 Credit: 10,767,590 RAC: 0 |
"I haven't looked at the logs but most all my WU's are showing 2d+ left till completion and the returned credit at Free-DC for this project has taken a sharp nosedive today implying it's a systemic problem in the WU's." Except for two fake results like this one from my machines: 77522270 49580149 15 Jan 2023, 3:32:33 UTC 17 Jan 2023, 22:39:26 UTC Completed and validated 241,613.00 3,915,919.00 1,647.69 CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64 This reported runtime is IMPOSSIBLE because it was running on a single thread and the machine returned 48 other WUs in the same time period. Something is wrong with these work units. There were 15 that ended in error states yesterday and 7 that couldn't validate. Given that 91 completed, that's a 19.5% failure rate. Also, the deadline is too close. Our local electric company forced new rate programs and meters on us. 6-8am and 6-8pm are 31 cents per kWh; the rest of the day is 4 cents. BOINC doesn't support 2 pause periods, so I moved to dual installs. One runs 10 hours in the day, the other 10 hours at night. These new peak/off-peak programs are a paradigm shift in USA electric power companies, so others crunching BOINC will have to face this soon. An 8th gen laptop should be able to complete one of these before a deadline, but with 4 hours lost per day to the rate plan, and it showing 3 days 4 hours till a Jan 21 deadline, it looks unlikely to finish. We'll need longer deadlines or a switch to multi-thread these WU's. |
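The two figures in this post can be checked with a little arithmetic. The Python sketch below assumes that the 91 completed results do not include the 15 errored and 7 invalid ones (which is what reproduces the quoted 19.5%), that "3 days 4 hours" is remaining compute time, and that the machine crunches 20 hours per day; the function names are illustrative.

```python
def failure_rate(errors, invalid, valid):
    """Share of returned results that errored out or failed validation."""
    failed = errors + invalid
    return failed / (failed + valid)


def wall_days_needed(compute_hours, hours_per_day):
    """Calendar days needed when the host only computes part of each day."""
    return compute_hours / hours_per_day


rate = failure_rate(errors=15, invalid=7, valid=91)
print(f"{rate:.1%}")  # 19.5%

remaining_hours = 3 * 24 + 4  # "3 days 4 hours" of computing left
days = wall_days_needed(remaining_hours, hours_per_day=24 - 4)
print(f"{days:.1f} days")  # 3.8 days
```

So 76 hours of remaining work stretches to about 3.8 calendar days on a 20-hour duty cycle, and any extra slowdown on the laptop eats directly into the margin before the deadline.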
Send message Joined: 27 Dec 21 Posts: 19 Credit: 16,652,694 RAC: 15,965 |
No. I only saw hung tasks in Sprot_delta. RdRp_v2_sample runs OK (at least I have never come across a hung task from this series). It's just that these tasks take MUCH (about 10-20 times) longer to compute than the previous ones from the Sprot_delta series. And calculation times exceeding a day (and on weak computers, more than 2 days of non-stop computing) are a NORMAL situation for these tasks and not a failure! Although such long tasks can be a problem in themselves: the admins need to at least increase the BOINC deadline setting for them, because weaker computers (or modern ones that do not work 24/7, but only a few hours a day) simply will not have enough time to finish all calculations before the deadline. |
Send message Joined: 27 Dec 21 Posts: 19 Credit: 16,652,694 RAC: 15,965 |
It loses (resets to zero) the CPU time stats after each restart (a full restart, without leaving the app in RAM). So only the CPU/elapsed time since the last app restart is counted. Looks like another bug... I already posted about it in detail in another thread before I saw your message: https://www.sidock.si/sidock/forum_thread.php?id=225&postid=1866#1866 |
Send message Joined: 10 Dec 20 Posts: 24 Credit: 10,767,590 RAC: 0 |
Did you see any results reporting 3,915,919 seconds? (Oh no! All my valid results have been purged! There was another and I was trying to check if it was an identical 3,915,919. It was over 3.9 million seconds). So, I have to go and edit all my machines BOINC settings to retain apps in RAM.... :sigh:
Agreed and I made that point several times on several messages |
Send message Joined: 3 Mar 22 Posts: 4 Credit: 8,334,432 RAC: 0 |
Lately, I've been interrupting these tasks due to the fact that their execution time was beyond reasonable limits. Perhaps this will help in finding the hang problem. https://www.sidock.si/sidock/result.php?resultid=77562352 https://www.sidock.si/sidock/result.php?resultid=77561851 https://www.sidock.si/sidock/result.php?resultid=77559687 https://www.sidock.si/sidock/result.php?resultid=77541537 https://www.sidock.si/sidock/result.php?resultid=77527163 https://www.sidock.si/sidock/result.php?resultid=77442808 https://www.sidock.si/sidock/result.php?resultid=77431479 https://www.sidock.si/sidock/result.php?resultid=77420159 https://www.sidock.si/sidock/result.php?resultid=77319386 |
Send message Joined: 11 Oct 20 Posts: 337 Credit: 25,679,261 RAC: 9,148 |
"77522270 49580149 15 Jan 2023, 3:32:33 UTC 17 Jan 2023, 22:39:26 UTC Completed and validated 241,613.00 3,915,919.00 1,647.69 CurieMarieDock 0.2.0 long tasks v2.00" I found this result in the archive, along with its files. It was sent to the machine at 2023-01-15 03:32:33 and received at 2023-01-17 22:39:26 (UTC). The returned files look correct at first glance. The actual computing time is "5 hours, 25 minutes, 13.132 seconds". I don't know why Xeon 2660 + Windows 8.1 + BOINC reported this incorrect time (1.5 months), but obviously it is a mistake in the task description data. The mistake can occur at different stages: on the machine, during sending, or during processing on the server. The known anomalies usually relate to Windows hosts. Maybe some influence of antivirus, defenders, etc. is at play. In any case, the "problem" with this result isn't related to the hangs. It looks like a mistake in the task metadata only. |
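The implausibility of the reported time can be checked from the quoted row alone. The Python sketch below assumes the usual column order of a BOINC results table, so that 241,613.00 is the run time and 3,915,919.00 the CPU time, both in seconds; the helper name `max_cpu_seconds` is illustrative.

```python
from datetime import datetime, timezone

SENT = datetime(2023, 1, 15, 3, 32, 33, tzinfo=timezone.utc)
REPORTED = datetime(2023, 1, 17, 22, 39, 26, tzinfo=timezone.utc)
RUN_TIME_S = 241_613.00    # assumed to be the "Run time" column
CPU_TIME_S = 3_915_919.00  # assumed to be the "CPU time" column


def max_cpu_seconds(sent, reported, threads=1):
    """Upper bound on the CPU time a task can accumulate between two timestamps."""
    return (reported - sent).total_seconds() * threads


window = max_cpu_seconds(SENT, REPORTED)
print(window)                         # 241613.0
print(CPU_TIME_S <= window)           # False
print(round(CPU_TIME_S / 86_400, 1))  # 45.3
```

A single-threaded task cannot accumulate about 45 days of CPU time inside a window of under 3 days, which fits the conclusion that only the metadata is wrong. Note also that the 241,613 figure equals the whole sent-to-reported interval to the second, far more than the 5 hours 25 minutes of actual computing.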
Send message Joined: 11 Oct 20 Posts: 337 Credit: 25,679,261 RAC: 9,148 |
"It's just that these tasks take MUCH (about 10-20 times) longer to compute than the previous ones from the Sprot_delta series. And calculation times exceeding a day (and on weak computers, more than 2 days of non-stop computing) are a NORMAL situation for these tasks and not a failure!" Yes, absolutely. "Although such long tasks can be a problem in themselves: the admins need to at least increase the BOINC deadline setting for them, because weaker computers (or modern ones that do not work 24/7, but only a few hours a day) simply will not have enough time to finish all calculations before the deadline." We are currently processing a sample batch of workunits that is needed to gather some statistics before switching to the main dataset. Increasing the deadline is necessary, I agree. It is now 6 days; 8 or 10 days would be more appropriate, I think. |
Send message Joined: 10 Dec 20 Posts: 24 Credit: 10,767,590 RAC: 0 |
"77522270 49580149 15 Jan 2023, 3:32:33 UTC 17 Jan 2023, 22:39:26 UTC Completed and validated 241,613.00 3,915,919.00 1,647.69 CurieMarieDock 0.2.0 long tasks v2.00" "The mistake can occur at different stages: on the machine, during sending, or during processing on the server. The known anomalies usually relate to Windows hosts. Maybe some influence of antivirus, defenders, etc. is at play." This machine is dedicated to BOINC 20/7 and any service or 3rd party app that can drain resources is disabled. No anti-virus, no task schedules, no workstation, no local DNS server, only basic IP 4 packeting. The 3rd party task scheduler was only added yesterday and couldn't be the cause. I am very careful to examine all new projects' WUs and have never seen this kind of run time reporting. If the machine's local clock was off 5 minutes on a WU that ran only 5 minutes, maybe the negative time reporting would be interpreted as 3,915,919.00 sec, but the local clock isn't off by 235k seconds... |
Send message Joined: 9 Feb 21 Posts: 7 Credit: 7,473,030 RAC: 2,438 |
Very annoying to have to pause the hanging tasks almost every 10 minutes (sometimes less!). I think I have to suspend sidock (hopefully for a short period). |
Send message Joined: 22 Nov 20 Posts: 10 Credit: 13,167,232 RAC: 127 |
Yes, come and join the PrimeGrid challenge. |
Send message Joined: 27 Nov 22 Posts: 20 Credit: 4,472,616 RAC: 24,658 |
The newer 2.02 units seem to be completing ok, one puzzle that remains is inconsistent granted credit, in some cases a variation of up to 25% for the same computer. https://www.sidock.si/sidock/workunit.php?wuid=49667156 https://www.sidock.si/sidock/workunit.php?wuid=49667239 |
©2024 SiDock@home Team