CmDock "long" and "short" tasks applications
Author | Message |
---|---|
Joined: 3 Mar 22 Posts: 4 Credit: 8,334,432 RAC: 0 |
Since I started receiving long tasks, some of them have overrun the deadline by several days: they have already been processing for several days, although a task usually takes about 20 hours. Do I need to do anything about this? |
Joined: 11 Oct 20 Posts: 333 Credit: 25,500,407 RAC: 6,598 |
"After I started receiving long assignments, I began to encounter assignments that cross the deadline by several days, as their processing time has already taken several days, although usually it is about 20 hours." Some tasks can hang. You can check for this. I am now trying to reproduce it with one of the workunits. |
Joined: 19 Dec 22 Posts: 10 Credit: 1,781,350 RAC: 0 |
New version of BOINC did not help -- likely because I was already using the latest? Where would I find the CA-Bundle.crt file? |
Joined: 19 Dec 22 Posts: 10 Credit: 1,781,350 RAC: 0 |
Oooops... The CA-Bundle.crt file was right in front of me... Now to replace it as suggested. |
Joined: 19 Dec 22 Posts: 10 Credit: 1,781,350 RAC: 0 |
THAT fixed it (replacing CA-Bundle.crt). Thanks to all for your assistance! |
Joined: 27 Dec 21 Posts: 19 Credit: 16,331,360 RAC: 9,917 |
This new app loses its CPU/elapsed time stats if it is restarted (a full restart, without being left in memory), and so it loses points/credits as well. At the same time, actual progress is NOT lost; that is, checkpoints are working. After the app restarts (a BOINC restart, or BOINC Manager simply switching to another project without the "leave in memory while suspended" option enabled), calculations continue from the last checkpoint as intended, but all the counters reported to BOINC (elapsed time, CPU time and time since the last checkpoint) are reset to zero.
Examples of such tasks:
https://www.sidock.si/sidock/result.php?resultid=77568221 - 13,613.24/13,521.96 sec of elapsed/CPU time
https://www.sidock.si/sidock/result.php?resultid=77568222 - 13,731.34/13,642.55 sec of elapsed/CPU time and 543.41 credits
https://www.sidock.si/sidock/result.php?resultid=77568208 - 1,302/1,280 sec of elapsed/CPU time and 50.30 credits
https://www.sidock.si/sidock/result.php?resultid=77568215 - 11,880/11,686 sec of elapsed/CPU time and 431.07 credits
The actual run times were about 40,000 - 60,000 sec for all of these tasks (~100 sec per ligand on average in docking_out.log, and there are 500 ligands in each task). I just restarted the computer a few times during the computation. After each restart the tasks continued from the checkpoint successfully, but all the time counters were reset to zero each time.
The problem that users have recently complained about in other topics (a very small amount of credit granted for some tasks) is probably related to this as well: if a task was restarted often during the calculation, only the calculation time since the last restart is taken into account and credited, since credit calculation appears to be based on the CPU time used by the task as reported by BOINC.
P.S. The BOINC progress bar (% of task completed) also resets to zero after each restart, but it returns to the correct value after some time (usually a few minutes). The time counters do not recover. |
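
The gap described in the post above can be put into numbers. The following is only a rough sketch: it takes the poster's estimates (about 100 seconds per ligand, 500 ligands per task) at face value and compares them with the elapsed times reported for the four example results. The function name `reported_share` is made up for this illustration.

```python
# Figures quoted in the post above. The ~100 s per ligand average and the
# 500 ligands per task are the poster's own estimates, not measured here.
LIGANDS_PER_TASK = 500
SECONDS_PER_LIGAND = 100.0


def reported_share(reported_elapsed_s: float,
                   ligands: int = LIGANDS_PER_TASK,
                   seconds_per_ligand: float = SECONDS_PER_LIGAND) -> float:
    """Fraction of the estimated real run time that BOINC actually reported."""
    estimated_s = ligands * seconds_per_ligand
    return reported_elapsed_s / estimated_s


# Elapsed times (seconds) reported for the four example results.
reported = {
    77568221: 13613.24,
    77568222: 13731.34,
    77568208: 1302.0,
    77568215: 11880.0,
}

for result_id, elapsed_s in reported.items():
    share = reported_share(elapsed_s)
    print(f"{result_id}: {share:.1%} of the estimated run time was reported")
```

Under these assumptions, the first example reports a little over a quarter of the estimated 50,000 seconds, and the third reports under 3%, which would be consistent with credit being based only on the time since the last restart.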
Joined: 10 Dec 20 Posts: 24 Credit: 10,767,590 RAC: 0 |
"This new app looses CPU/elapsed time stats if restarted (full restarts,without leaving in memory). And so loose points/credits as well." So we are in a catch-22. Hoar Frost says we need to restart the stuck tasks to get them to work, but if we do, we lose all the credit earned so far. I have to pause BOINC from 6-8am and 6-8pm every weekday because the electric company charges 9x the normal rate during those periods. The SiDock WUs have been removed from RAM 2x per day. I have had 15 WUs fail and 7 not validate out of 113 total: a 19.5% failure rate. These WUs are not ready for prime time... Too many unresolved issues. Use the SiDock test server for this and let the issues get worked out by people who know it's a beta test. |
Joined: 10 Dec 20 Posts: 24 Credit: 10,767,590 RAC: 0 |
Look at this; this machine ran the hottest-temperature LLR SRBase WUs for weeks without a single failure.
77561448 49614522 44886 15 Jan 2023, 18:16:18 UTC 18 Jan 2023, 2:09:25 UTC Error while computing 95,553.32 85,968.58 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77558574 49611655 44886 15 Jan 2023, 16:39:03 UTC 17 Jan 2023, 14:04:34 UTC Error while computing 91,007.37 80,447.98 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
That's an entire wasted day for each of those cores. Some of these WUs are going to run 2-3 days and end in errors without any hardware cause? This is not acceptable. And the results are being purged way too quickly, so we cannot evaluate the run results and find the issues.
6 tasks all report less than 5% complete after running for 12-15 hours, and somehow they are going to complete in under 3 days (according to BOINC, which is using past WU run time data)? 15 hours for 5% works out to about 12.5 days in total, and for the one at 1.67% after 13 hours the total would be about 32 days. And the failure rate was 19% per day on my 5 machines. Maybe this is the result of the percentages not being accurate after a BOINC restart, as Mad Max found; but Mad Max reported that the displayed percentages corrected themselves from checkpointing. I'm concerned. |
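
The completion estimate in the post above is a straight-line extrapolation from elapsed time and reported progress. A minimal sketch of that arithmetic follows, assuming progress is linear and the reported percentage is accurate (which, per other posts in this thread, may not hold right after a restart). The function name `extrapolate_days` is hypothetical.

```python
def extrapolate_days(elapsed_hours: float, fraction_done: float) -> tuple[float, float]:
    """Linearly extrapolate (total_days, remaining_days) from progress so far.

    fraction_done is the reported progress as a fraction, e.g. 0.05 for 5%.
    """
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in the range (0, 1]")
    total_hours = elapsed_hours / fraction_done
    remaining_hours = total_hours - elapsed_hours
    return total_hours / 24.0, remaining_hours / 24.0


# The two cases mentioned in the post: 5% after 15 hours, 1.67% after 13 hours.
for hours, fraction in ((15.0, 0.05), (13.0, 0.0167)):
    total, remaining = extrapolate_days(hours, fraction)
    print(f"{fraction:.2%} after {hours:.0f} h: "
          f"{total:.1f} days in total, {remaining:.1f} days remaining")
```

For 5% after 15 hours this gives 12.5 days in total (a little under 12 days remaining), and for 1.67% after 13 hours a total between 32 and 33 days, far beyond the roughly 3 days BOINC was predicting.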
Joined: 11 Oct 20 Posts: 333 Credit: 25,500,407 RAC: 6,598 |
After a restart, time counts from the last successful checkpoint (as for any other project). People who faced the credit problem say that the issue was an inconsistency between time and credit, for example a long run time and low credit: the time is saved, but the credit is low. That problem has other roots. If the computation in a slot hangs and no checkpoint is written, the time lost can be large. Can time fail to be recorded in some cases? I think yes, it can; this needs investigation. Errors 0xc2 and 0xc3 look interesting, thank you for the report. Just for fun, I made a simple chain of bash commands for viewing the last modification time of docking_out.log in each slot directory: BOINC_SLOTS=~/Computing/BOINC/slots ; for directory in $(ls -1 $BOINC_SLOTS) ; do ls -g --no-group $BOINC_SLOTS/$directory | grep docking_out.log; done That saves a lot of time. :D For Windows, I think something similar can be made. |
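
One way to do something similar on Windows (or any other platform) is a short Python script. This is only a sketch of the same idea as the bash one-liner above, not something provided by the project: the function name `slot_log_times` is made up, the slots path is the one from the post, and on Windows the slots directory normally sits inside the BOINC data directory, so the path has to be adjusted to the local installation.

```python
from datetime import datetime
from pathlib import Path


def slot_log_times(slots_dir, log_name="docking_out.log"):
    """Return (slot name, last-modified time) for each slot containing log_name."""
    results = []
    for slot in sorted(Path(slots_dir).iterdir()):
        log_file = slot / log_name
        if log_file.is_file():
            modified = datetime.fromtimestamp(log_file.stat().st_mtime)
            results.append((slot.name, modified))
    return results


if __name__ == "__main__":
    # Path taken from the post above; change it to match your installation.
    slots = Path.home() / "Computing" / "BOINC" / "slots"
    if slots.is_dir():
        for name, modified in slot_log_times(slots):
            print(f"slot {name}: docking_out.log last modified {modified:%Y-%m-%d %H:%M:%S}")
    else:
        print(f"No slots directory found at {slots}")
```

A log whose modification time is hours old while the task still shows as running is a candidate for a hung task.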
Joined: 27 Dec 21 Posts: 19 Credit: 16,331,360 RAC: 9,917 |
"After restart time counts from last successful checkpoint (as for any other project)." Yes, it works this way for any OTHER project, but not for SiDock: for SiDock it resets to zero after a restart! Maybe it is due to the use of a two-level wrapped app (the app launched by BOINC is not the actual app but just a wrapper, which launches the actual app that does all the computation); I do not run any other projects with wrapped apps to compare.
I just reproduced it again. There were 4 SiDock WUs running, and a few WUs from another project (WCG this time, but it also works OK with Rosetta@Home, Einstein@Home and MilkyWay@Home). I restarted BOINC. For the WCG WUs, the CPU/elapsed time counters and progress bars were immediately restored to values close to those before the restart (from the latest checkpoint, I guess). But for all the SiDock WUs, all time counters and progress bars were reset to zero. After 5-7 minutes of computation the progress bars recovered to near their pre-restart values, but the CPU/elapsed times kept counting from the moment of the restart. Suspending/resuming WUs (with the "leave in RAM" option turned off) also kills the time counters, but preserves the progress bar % if BOINC Manager is not restarted.
.... Oh, it looks like I have just found the problem (or at least part of it): BOINC does not see any checkpoints from SiDock at all:
CPU time 01:05:37
CPU time since checkpoint 01:05:34
Elapsed time 01:05:46
Does SiDock use its own implementation of checkpoints? Or does it not report to BOINC properly after a checkpoint is saved? I know the checkpoints actually work fine, but BOINC does not see/know about them. Anyway, BOINC thinks no checkpoint has been made for the WU, and that is why it resets the CPU/elapsed time counters.
Also, SiDock does not report progress % to BOINC properly. In the working slot directories, in the boinc_task_state.xml files of all running SiDock WUs, I see <fraction_done>0.000000</fraction_done> while in the BOINC GUI I see correct values. It probably reports progress via the API (app-to-app communication on the fly) but does not write the same info to the state file as it should? That could explain the strange progress bar behavior after a restart: BOINC always reads the files first, sees fraction_done = 0 and so reverts the progress bar in the GUI to zero too, but later gets the actual progress % some other way and corrects the progress bar.
P.S. I use the latest BOINC (v7.20.2) on x64 Windows. Maybe on *nix it works differently... |
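
The check described in the post above (looking at `fraction_done` in each slot's boinc_task_state.xml) can be repeated across all slots with a few lines of Python. This is a sketch under stated assumptions: only the file name and the `<fraction_done>` tag come from the post, the helper names are made up, and a regular expression is used instead of an XML parser purely to keep the example tolerant of whatever else the file contains.

```python
import re
from pathlib import Path

# Matches the tag quoted in the post, e.g. <fraction_done>0.000000</fraction_done>
FRACTION_RE = re.compile(r"<fraction_done>\s*([0-9.eE+-]+)\s*</fraction_done>")


def read_fraction_done(text):
    """Return the fraction_done value found in text, or None if the tag is absent."""
    match = FRACTION_RE.search(text)
    return float(match.group(1)) if match else None


def slots_fraction_done(slots_dir):
    """Map slot name -> fraction_done read from that slot's boinc_task_state.xml."""
    slots = Path(slots_dir)
    values = {}
    if not slots.is_dir():
        return values
    for state_file in sorted(slots.glob("*/boinc_task_state.xml")):
        text = state_file.read_text(errors="replace")
        values[state_file.parent.name] = read_fraction_done(text)
    return values


if __name__ == "__main__":
    # Hypothetical location; point this at the slots directory of your BOINC data folder.
    for slot, fraction in slots_fraction_done(Path.home() / "BOINC" / "slots").items():
        print(f"slot {slot}: fraction_done = {fraction}")
```

If a task that has been running for hours still shows 0.0 here while the BOINC GUI shows real progress, that matches the behavior reported in the post.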
Joined: 27 Dec 21 Posts: 19 Credit: 16,331,360 RAC: 9,917 |
These additional errors (all of them only on the computer with host ID 25851, named Cruiser-2) you can ignore. I know the exact reason, and it is not related to SiDock or BOINC. It was a buggy third-party (non-BOINC) app with a nasty memory leak running on the same computer. Yesterday it ate up all the memory (including the virtual memory/swap file, about 24 GB in total), and other programs started crashing because they ran out of RAM. After I noticed and restarted it to free the leaked RAM, all these errors stopped immediately. But this has nothing to do with the problem of the time counters, progress bar and credit calculations being reset, which I see on all of my computers. |
Joined: 11 Oct 20 Posts: 333 Credit: 25,500,407 RAC: 6,598 |
On Linux with BOINC 7.4.22, the time up to the last checkpoint is preserved after a BOINC restart. With more recent versions of BOINC it should work as well, I think. I waited until a couple of hours were left before the end of the tasks for workunits 49617538 and 49617309 and restarted BOINC. After about 2 and 4 more hours of computation these tasks completed, and the time was preserved:
77564558 4 15 Jan 2023, 20:42:57 UTC 19 Jan 2023, 0:55:57 UTC Completed and validated 86,440.24 86,196.93 1,418.98 CurieMarieDock 0.2.0 long tasks v2.00 x86_64-pc-linux-gnu
77564279 4 15 Jan 2023, 20:33:39 UTC 19 Jan 2023, 2:52:41 UTC Completed and validated 105,178.79 104,859.20 1,721.03 CurieMarieDock 0.2.0 long tasks v2.00 x86_64-pc-linux-gnu
I should try it under Windows. :) |
Joined: 19 Dec 22 Posts: 10 Credit: 1,781,350 RAC: 0 |
Now that I have SiDock working... Some 32 instances of CurieMarieDock 0.2.0 LONG failed after ONE SECOND recently. Most show as corona_sprot... ONE task shows 15+ hours of elapsed time with 1d5h plus to go. This one shows as corona_RdRp_v2_sample, although three of these failed immediately as well. This project is definitely not ready for prime time... The log simply shows them as having been downloaded and then finished.
--------------
It appears that at least some (all?) of these are being blocked by PC Matic's Super Shield. (I forced an update to SiDock and watched PC Matic reject them.) I do not have this problem with any other project. (And I will NOT be bypassing PC Matic, which works off a "white list" rather than a "black list" like most AV programs.) For the moment I plan to block new updates and try again in a few days. THOUGHTS? |
Joined: 19 Dec 22 Posts: 10 Credit: 1,781,350 RAC: 0 |
An update to my post above... While PC Matic is definitely preventing some tasks from running, others run fine. I am only seeing one running on my laptop, but my other computers are definitely running one or more SiDock tasks. They, too, are also rejecting some (many?). This is also not unique to SiDock. The LHC project is also seeing some tasks blocked on a regular basis. On RARE occasions some WCG (World Community Grid) projects are being blocked as well, generally when something new runs. |
Joined: 3 Mar 22 Posts: 4 Credit: 8,334,432 RAC: 0 |
At the moment, I have stopped processing this project's tasks so as not to waste resources. I will resume once there is word that these problems have been fixed. |
Joined: 10 Dec 20 Posts: 24 Credit: 10,767,590 RAC: 0 |
This has been a bad SiDock day for me. 31 tasks completed successfully across 9 computers. On the one server that runs dual BOINC installations (10 hours in the morning / 10 hours at night, to avoid peak electric rate hours), 27 of 32 WUs on the daytime installation are increasing their estimated remaining time by 3 seconds every second. I paused them, but they are set to NOT leave RAM because that loses credit. They already have 35+ hours of runtime and are looking at another 50+ hours (some said 5-9 days left) and can't beat their deadline, so I aborted them. The nighttime BOINC install appears to be working, but another 5 appear unable to meet the deadline. *Why did 29 of 32 SiDock WUs stop advancing on a machine that pauses 2x a day for 2 hours each period?* I am switching to a single BOINC install, but it will still need to pause 2x a day with a cron job: boinccmd --set_run_mode never 7320 So I need assurances that these WUs won't keep stalling out because they paused. Also, I am still getting a few ending in error after very long runs. So today I'm at 31 successes and 23 failures. A 42% failure rate is abominable! 
Columns: task ID, work unit ID, computer ID, sent, time reported, status, run time (sec), CPU time (sec), credit, application
77279883 49343723 44965 11 Jan 2023, 23:44:08 UTC 18 Jan 2023, 16:16:43 UTC Aborted 493391.9 486944.8 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77337873 49399835 44888 12 Jan 2023, 19:26:44 UTC 16 Jan 2023, 17:47:08 UTC Error while computing 13412.96 13299.98 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77337845 49399809 44888 12 Jan 2023, 19:26:44 UTC 16 Jan 2023, 17:16:13 UTC Error while computing 11594.1 11437.84 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77398977 49459298 44888 13 Jan 2023, 15:36:16 UTC 16 Jan 2023, 16:29:53 UTC Error while computing 8851.1 8775.33 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77398985 49459314 44888 13 Jan 2023, 15:36:16 UTC 16 Jan 2023, 16:53:02 UTC Error while computing 10180.91 10069.2 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77399065 49459325 44888 13 Jan 2023, 15:38:10 UTC 16 Jan 2023, 17:00:55 UTC Error while computing 10682.11 10552.55 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77440386 49499316 44898 14 Jan 2023, 3:44:08 UTC 19 Jan 2023, 2:10:12 UTC Aborted 180865.64 161821.8 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77440796 49499732 44900 14 Jan 2023, 3:50:29 UTC 15 Jan 2023, 12:52:36 UTC Error while computing 238.91 209.97 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77446164 49505088 44965 14 Jan 2023, 5:27:09 UTC 18 Jan 2023, 16:16:43 UTC Aborted 331879.66 327318.9 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77557978 49611032 44887 15 Jan 2023, 16:19:56 UTC 20 Jan 2023, 13:30:06 UTC Aborted 278329.64 268281.6 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77558574 49611655 44886 15 Jan 2023, 16:39:03 UTC 17 Jan 2023, 14:04:34 UTC Error while computing 91007.37 80447.98 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77561448 49614522 44886 15 Jan 2023, 18:16:18 UTC 18 Jan 2023, 2:09:25 UTC Error while computing 95553.32 85968.58 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77561513 49614582 44886 15 Jan 2023, 18:17:09 UTC 20 Jan 2023, 2:23:25 UTC Aborted 140384.21 122982.7 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77575916 49628894 44888 16 Jan 2023, 14:00:54 UTC 20 Jan 2023, 14:54:31 UTC Aborted 160172.45 70500.13 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77575917 49628899 44888 16 Jan 2023, 14:00:54 UTC 20 Jan 2023, 14:54:31 UTC Aborted 164651.93 74874.86 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77575927 49628909 44888 16 Jan 2023, 14:00:55 UTC 17 Jan 2023, 18:37:29 UTC Error while computing 4871.7 4784.16 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77575864 49628849 44888 16 Jan 2023, 14:01:56 UTC 20 Jan 2023, 14:46:51 UTC Aborted 140432.11 35749.41 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77576808 49629786 44888 16 Jan 2023, 16:29:53 UTC 20 Jan 2023, 14:54:31 UTC Aborted 142096.72 37286.89 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77576812 49629790 44888 16 Jan 2023, 16:29:53 UTC 20 Jan 2023, 14:54:31 UTC Aborted 152900.31 73801.61 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77576971 49629949 44888 16 Jan 2023, 16:53:02 UTC 20 Jan 2023, 14:54:31 UTC Aborted 151144.96 70712.69 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77576972 49629952 44888 16 Jan 2023, 16:53:02 UTC 20 Jan 2023, 14:46:51 UTC Aborted 152046.29 71703.56 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77576991 49629969 44888 16 Jan 2023, 17:00:55 UTC 20 Jan 2023, 14:54:31 UTC Aborted 148436.82 68058.72 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77576993 49629971 44888 16 Jan 2023, 17:00:55 UTC 20 Jan 2023, 14:46:51 UTC Aborted 149894.3 45036 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77577026 49630004 44888 16 Jan 2023, 17:16:13 UTC 20 Jan 2023, 14:54:31 UTC Aborted 146550.13 66754.64 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77577052 49630030 44888 16 Jan 2023, 17:16:13 UTC 20 Jan 2023, 14:54:31 UTC Aborted 144242.85 65149.5 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77577165 49630145 44888 16 Jan 2023, 17:39:47 UTC 20 Jan 2023, 14:54:31 UTC Aborted 146132.22 65806.09 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77577227 49630209 44888 16 Jan 2023, 17:47:08 UTC 20 Jan 2023, 14:54:31 UTC Aborted 145234.13 64319.44 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77577289 49630271 44888 16 Jan 2023, 17:55:21 UTC 20 Jan 2023, 14:55:11 UTC Aborted 144869.21 64534.8 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77577296 49630270 44888 16 Jan 2023, 17:55:21 UTC 20 Jan 2023, 14:55:11 UTC Aborted 144408.69 63322.3 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77577308 49630289 44888 16 Jan 2023, 17:58:26 UTC 20 Jan 2023, 14:54:31 UTC Aborted 144100.1 62643.67 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77577263 49630238 44888 16 Jan 2023, 17:59:20 UTC 20 Jan 2023, 14:54:31 UTC Aborted 144041.11 63361.22 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77577321 49630296 44888 16 Jan 2023, 18:00:14 UTC 20 Jan 2023, 14:54:31 UTC Aborted 143210.83 63137.41 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77585195 49638144 13277 17 Jan 2023, 17:47:15 UTC 19 Jan 2023, 2:09:39 UTC Aborted 10668.42 10534.66 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77585527 49638492 44888 17 Jan 2023, 18:37:29 UTC 20 Jan 2023, 14:54:31 UTC Aborted 142162.91 52501.22 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77580116 49633069 44893 17 Jan 2023, 3:07:02 UTC 20 Jan 2023, 14:29:05 UTC Aborted 192815.89 179682.8 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77580105 49633104 44895 17 Jan 2023, 3:09:24 UTC 20 Jan 2023, 14:29:18 UTC Aborted 185638.03 180320.3 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77581948 49634901 13277 17 Jan 2023, 8:53:04 UTC 19 Jan 2023, 2:09:39 UTC Aborted 28511.6 26879.02 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77588447 49641410 44886 18 Jan 2023, 4:06:39 UTC 20 Jan 2023, 4:28:35 UTC Error while computing 128506.93 112809.9 --- CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64 |
Joined: 10 Dec 20 Posts: 24 Credit: 10,767,590 RAC: 0 |
You could all make management of these huge jobs easier if you multi-threaded the problem and set run limits. Look at the Amicable Numbers user settings: we get to choose the number of threads and the run time length. I've spent about 12 hours over the last 3 days babysitting these WUs, and I foresee more management hours to come. |
Joined: 27 Nov 22 Posts: 20 Credit: 4,172,323 RAC: 27,044 |
You have been very vocal in your disappointment as seen over a number of different threads. I suggest you move on to a different project (like I and others) or simply stop until the issues are resolved. It's not worth getting bent out of shape over. |
Joined: 22 Nov 20 Posts: 10 Credit: 13,165,036 RAC: 0 |
Yes, I did that. The long tasks are too long. |
Joined: 18 Oct 22 Posts: 9 Credit: 11,097,652 RAC: 58,978 |
Dear Admins / Software Developers, Please address the reported issues asap. I already see top contributors leaving this project, and I will shortly follow if there is no improvement. Take care. Chris |
©2024 SiDock@home Team