Tasks hanging -

Message boards : Number crunching : Tasks hanging -
Profile vaughan

Send message
Joined: 22 Nov 20
Posts: 10
Credit: 13,167,232
RAC: 127
Message 1832 - Posted: 16 Jan 2023, 20:16:26 UTC

I'm really not liking these super long tasks. Time to consider moving to another project.
ID: 1832 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile threatripper

Send message
Joined: 23 Sep 22
Posts: 6
Credit: 555,953
RAC: 0
Message 1833 - Posted: 16 Jan 2023, 20:17:30 UTC - in response to Message 1829.  

I'm having the same issue right around now as well. Over half my threads seem to be bogged down.

Previously I've only had to kill the rare hanging task after a system reboot (which points to a possible inability of a task to resume from a checkpoint).

This time around I have a batch full of tasks that are either going really slowly or are apparently hanging. They do peg the CPU at about 100%, so they're definitely using the CPU, but the code may be stuck in a loop.

Circumstances for this might be heavy system load. I have only 8 threads (8c/8t AM3 processor) and many times my system load is quite a bit above 10. However, I have boinc set to keep computing regardless of system load. This, afaik, should rule out the possibility that the program is unable to properly suspend and resume a compute thread (i.e. errors arising from that step). Another possible factor is that I've had steam running, which is chromium/google based. Google regularly has questionable code, so chromium derivatives often end up vulnerable to exploits from memory safety issues and other matters of open sores (which are then exploited by malware dished out by real time bidding ads); there might be some possibility this chrome derivative is basically borking the system to the point of corrupting other processes.

Given the other reports though, it seems like the likeliest bet is some bugs or suboptimal programming of the work units.

One last possibility, is there any chance these work units depend on results of other work units? I did manually change the order the compute units would normally start (reason being I like to avoid high system loads during gaming sessions, so I have all unstarted tasks on suspend, and manually start them off when system load is low).
ID: 1833 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile threatripper

Send message
Joined: 23 Sep 22
Posts: 6
Credit: 555,953
RAC: 0
Message 1835 - Posted: 16 Jan 2023, 20:20:34 UTC - in response to Message 1832.  

I don't mind long tasks, as long as you can be sure they complete in a reasonable or known time limit.
ID: 1835 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
hoarfrost
Volunteer moderator
Project administrator
Project developer

Send message
Joined: 11 Oct 20
Posts: 337
Credit: 25,678,308
RAC: 9,173
Message 1836 - Posted: 16 Jan 2023, 20:44:22 UTC - in response to Message 1833.  

Hi!
One last possibility, is there any chance these work units depend on results of other work units? I did manually change the order the compute units would normally start (reason being I like to avoid high system loads during gaming sessions, so I have all unstarted tasks on suspend, and manually start them off when system load is low).

Tasks are completely independent from each other. You can freely stop and start them.
If you are interested, you can see how the calculations are performed: go into the appropriate task slot directory and view the docking_out.log file. Inside it you can see how long it took to simulate the interaction of each processed ligand with the target. You can see how this time varies, which may clarify the situation.
ID: 1836 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile threatripper

Send message
Joined: 23 Sep 22
Posts: 6
Credit: 555,953
RAC: 0
Message 1843 - Posted: 17 Jan 2023, 2:40:53 UTC - in response to Message 1836.  

Thank you for the clarifications and helpful tip regarding the logs.

I checked on one task which was about 87% complete. It seems to have hung on one of the records. On record 432 there is the last estimate made for time of completion, with 69 records remaining.

RECORD #432
NAME:   ZINC001026963444
                     RNG seed:std::random_device
Numer of docking runs done:   20 (0 errors)
Ligand docking duration:      22.0796 second(s)

Average duration per ligand:  24.7868 second(s)
Approximately 69 record(s) remaining, will finish Sat Jan 14 07:49:39 2023

**************************************************
RECORD #433
NAME:   ZINC001026963481
                     RNG seed:std::random_device
Numer of docking runs done:   20 (0 errors)
Ligand docking duration:      23.5675 second(s)

**************************************************
RECORD #434
NAME:   ZINC001026963482
 [blah.... truncated for brevity]
**************************************************
RECORD #435
NAME:   ZINC001026963521 
 [blah.... truncated for brevity]
**************************************************
RECORD #436
 [blah.... truncated for brevity]
**************************************************
RECORD #437   [blah.... truncated for brevity]
**************************************************
RECORD #438
NAME:   ZINC001026963524
                     RNG seed:std::random_device
Numer of docking runs done:   20 (0 errors)
Ligand docking duration:      23.1778 second(s)

**************************************************
RECORD #439
NAME:   ZINC001026963525
                     RNG seed:std::random_device



The log file ends with Record #439. The last complete record logged without missing information was the previous record #438.
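The check described above (find the last record that never logged a duration) can be sketched in a few lines of Python. This is only an illustration: it assumes the log keeps the line layout quoted in this post (`RECORD #N` followed, once the ligand is done, by `Ligand docking duration: X second(s)`), and the function name `scan_docking_log` is made up for this example.

```python
import re

# Patterns taken from the log excerpt quoted in this post; the real file
# format may differ between application versions.
RECORD_RE = re.compile(r"^RECORD #(\d+)")
DURATION_RE = re.compile(r"^Ligand docking duration:\s+([\d.]+) second")

def scan_docking_log(text):
    """Return (durations, unfinished).

    durations maps record number -> docking time in seconds;
    unfinished lists record numbers that have no duration line.
    """
    durations = {}
    seen = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        record = RECORD_RE.match(line)
        if record:
            current = int(record.group(1))
            seen.append(current)
            continue
        duration = DURATION_RE.match(line)
        if duration and current is not None:
            durations[current] = float(duration.group(1))
    unfinished = [number for number in seen if number not in durations]
    return durations, unfinished

sample = """\
RECORD #438
NAME:   ZINC001026963524
                     RNG seed:std::random_device
Numer of docking runs done:   20 (0 errors)
Ligand docking duration:      23.1778 second(s)

**************************************************
RECORD #439
NAME:   ZINC001026963525
                     RNG seed:std::random_device
"""

durations, unfinished = scan_docking_log(sample)
print(durations)   # {438: 23.1778}
print(unfinished)  # [439]
```

On a live slot the same function could be fed the contents of docking_out.log; a record that stays in `unfinished` far longer than the typical duration is a candidate for a hang.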

For what reason this ZINC001026963525 simulation seems to have hung I don't know.

However, I think I will kill the task, as well as other tasks that seem to be hanging. I'll make a zip file of the logs in slot 4 in case there's further info in these to shed some light on the cause.

The directory listing indicates that the only file being updated in the slot 4 directory is some sort of mmap file. All the other files have stopped changing for hours:


root@mars2:/var/lib/boinc-client/slots/4# date; ls -alt
Mon 16 Jan 2023 09:29:08 PM EST
total 12012
-rw-r--r--  1 boinc boinc    8192 Jan 16 21:28 boinc_mmap_file
drwxrwx--x  4 boinc boinc    4096 Jan 15 06:34 .
-rw-r--r--  1 boinc boinc    6358 Jan 15 06:34 init_data.xml
-rw-r--r--  1 boinc boinc     517 Jan 14 07:23 boinc_task_state.xml
-rw-r--r--  1 boinc boinc      28 Jan 14 07:23 wrapper_checkpoint.txt
-rw-r--r--  1 boinc boinc  107863 Jan 14 07:23 docking_out.log
-rw-r--r--  1 boinc boinc   47151 Jan 14 07:23 docking_log
-rw-r--r--  1 boinc boinc 1617067 Jan 14 07:23 docking_out
-rw-r--r--  1 boinc boinc     255 Jan 14 07:23 docking_out.chk
-rw-r--r--  1 boinc boinc       8 Jan 14 07:23 docking_out.progress
-rw-r--r--  1 boinc boinc       0 Jan 14 04:22 boinc_lockfile
-rw-r--r--  1 boinc boinc     274 Jan 14 04:22 stderr.txt
-rw-r--r--  1 boinc boinc      56 Jan 14 04:22 htvs.ptc
-rw-r--r--  1 boinc boinc  180365 Jan 14 04:22 target.mol2
-rw-r--r--  1 boinc boinc 7856840 Jan 14 04:22 target.as
-rwxr-xr-x  1 boinc boinc  408352 Jan 14 04:22 cmdock
-rw-r--r--  1 boinc boinc     100 Jan 14 04:22 cmdock-l_wrapper_2.0_x86_64-pc-linux-gnu
-rw-r--r--  1 boinc boinc     721 Jan 14 04:22 job.xml
-rw-r--r--  1 boinc boinc    1033 Jan 14 04:22 target.prm
drwxr-xr-x 13 boinc boinc    4096 Jan 12 02:27 ..
drwxr-xr-x  6 boinc boinc    4096 Dec 21 03:53 data
drwxr-xr-x  2 boinc boinc    4096 Dec 21 03:53 lib
-rw-rw-r--  1 boinc boinc 1983385 Jan 25  2022 ligands.sdf
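In the listing above, docking_out.log was last written on Jan 14 at 07:23 while `date` shows Jan 16 at 21:29, a gap of roughly 62 hours. A small Python sketch of the same staleness check follows. The helper names `file_ages` and `stale_files` and the one-hour threshold are assumptions for illustration, and the demo uses a temporary directory rather than a real BOINC slot.

```python
import os
import tempfile
import time

def file_ages(directory, now=None):
    """Return [(name, age_in_seconds)] for regular files, newest first."""
    now = time.time() if now is None else now
    ages = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                ages.append((entry.name, now - entry.stat().st_mtime))
    return sorted(ages, key=lambda item: item[1])

def stale_files(directory, max_age_seconds, now=None):
    """Names of files not modified within max_age_seconds."""
    return [name for name, age in file_ages(directory, now)
            if age > max_age_seconds]

# Demo on a throwaway directory standing in for a slot directory.
with tempfile.TemporaryDirectory() as slot:
    fresh = os.path.join(slot, "boinc_mmap_file")
    old = os.path.join(slot, "docking_out.log")
    for path in (fresh, old):
        open(path, "w").close()
    now = time.time()
    # Pretend the log was last written three hours ago.
    os.utime(old, (now - 3 * 3600, now - 3 * 3600))
    demo_stale = stale_files(slot, max_age_seconds=3600, now=now)
    print(demo_stale)
```

Pointing `stale_files` at an actual slot directory (for example the slots/4 path shown above) would need read permission for the boinc user's files.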


Boincmgr under the task properties displays the following for the properties:

Application CurieMarieDock 0.2.0 long tasks 2.00 
Name corona_Sprot_delta_v1_RM_sidock_00518854_r4_s1000.0
State Running
Received Sat 14 Jan 2023 12:27:51 AM EST
Report deadline Thu 19 Jan 2023 12:27:51 AM EST
Estimated computation size 30,000 GFLOPs
CPU time 2d 15:20:50
CPU time since checkpoint 2d 12:21:11
Elapsed time 2d 16:10:37
Estimated time remaining 09:05:03 
Fraction done 87.600%
Virtual memory size 141.95 MB
Working set size 2.73 MB
Directory slots/4
Process ID 275748
Progress rate 1.440% per hour
Executable cmdock-l_wrapper_2.0_x86_64-pc-linux-gnu
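To put the numbers above in proportion, the "CPU time since checkpoint" value can be compared with the average per-ligand duration from the log. The sketch below assumes a checkpoint is written after each processed ligand; the helper `parse_boinc_duration` and the threshold factor of 100 are hypothetical choices, not anything BOINC or the application provides.

```python
def parse_boinc_duration(text):
    """Convert 'Nd HH:MM:SS' or 'HH:MM:SS' to whole seconds."""
    days = 0
    text = text.strip()
    if "d" in text:
        day_part, text = text.split("d", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in text.strip().split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

def looks_hung(since_checkpoint_s, avg_unit_s, factor=100):
    """Flag a task whose time since checkpoint dwarfs the usual unit time."""
    return since_checkpoint_s > factor * avg_unit_s

cpu_since_checkpoint = parse_boinc_duration("2d 12:21:11")  # from the properties above
avg_ligand_seconds = 24.7868                                # from docking_out.log

ratio = cpu_since_checkpoint / avg_ligand_seconds
print(cpu_since_checkpoint)          # 217271
print(round(ratio))                  # several thousand average ligand times
print(looks_hung(cpu_since_checkpoint, avg_ligand_seconds))  # True
```

With these figures the task has spent thousands of average ligand durations without reaching a new checkpoint, which is hard to explain by one unusually slow ligand.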


Maybe putting a timeout break point in the code and adding some debugging code will shed light on why some simulations seem to hang.
ID: 1843 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
hoarfrost
Volunteer moderator
Project administrator
Project developer

Send message
Joined: 11 Oct 20
Posts: 337
Credit: 25,678,308
RAC: 9,173
Message 1844 - Posted: 17 Jan 2023, 8:30:29 UTC - in response to Message 1843.  

Yes, the task is hung:
...
Name corona_Sprot_delta_v1_RM_sidock_00518854_r4_s1000.0
...
CPU time 2d 15:20:50
CPU time since checkpoint 2d 12:21:11
Elapsed time 2d 16:10:37
...

In cases like this, I stop BOINC and start it again. As I understand it, the simulation reproduces a chaotic process using random numbers. In some cases the simulation may go into an infinite loop, but if we restart it from the last checkpoint (the last processed ligand), new random numbers will be used and the computation will continue successfully. At the moment this is only my hypothesis, and I am now trying to reproduce the problem on my computer with one of the hung tasks.
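The idea behind this hypothesis, that a stochastic search stuck on one random sequence may get through on another, can be illustrated with a toy Python sketch. Nothing here is the project's code: `toy_docking_run` is a hypothetical stand-in (a simple random walk with a step budget), and the seeds are explicit only to keep the example repeatable, whereas the log quoted earlier shows the real application seeding from `std::random_device`.

```python
import random

def toy_docking_run(seed, max_steps):
    """Hypothetical stand-in for one stochastic search.

    Returns the number of steps needed to reach the goal, or None if the
    step budget ran out (the analogue of a run that appears to hang).
    """
    rng = random.Random(seed)
    position = 0
    for step in range(1, max_steps + 1):
        position += rng.choice((-1, 1))
        if position >= 10:
            return step
    return None

def run_with_restarts(first_seed, max_steps, max_attempts):
    """Retry with a fresh seed each time the step budget is exhausted."""
    for attempt in range(max_attempts):
        seed = first_seed + attempt
        steps = toy_docking_run(seed, max_steps)
        if steps is not None:
            return seed, steps
    return None

result = run_with_restarts(first_seed=1, max_steps=10_000, max_attempts=20)
print(result)  # (seed that succeeded, steps it took)
```

The step budget plays the role of a timeout: without it, a run that never reaches the goal would keep the CPU busy indefinitely, which matches the symptom reported in this thread.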
Thank you for the report!
ID: 1844 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile threatripper

Send message
Joined: 23 Sep 22
Posts: 6
Credit: 555,953
RAC: 0
Message 1852 - Posted: 17 Jan 2023, 17:50:02 UTC - in response to Message 1844.  

Thanks very much. I'm still getting quite a few slow tasks. I'll do a system reboot soon, and that should restart boinc.

If that doesn't improve things, then one more thing to look into is whether the code for the simulation has changed over the past week (as opposed to just new data). In my experience things were going super smooth just a week ago, before these symptoms.
ID: 1852 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
marmot

Send message
Joined: 10 Dec 20
Posts: 24
Credit: 10,767,590
RAC: 0
Message 1859 - Posted: 18 Jan 2023, 1:24:39 UTC

I haven't looked at the logs but most all my WU's are showing 2d+ left till completion and the returned credit at Free-DC for this project has taken a sharp nosedive today implying it's a systemic problem in the WU's.
ID: 1859 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Bryn Mawr

Send message
Joined: 16 Aug 21
Posts: 40
Credit: 17,535,663
RAC: 28,966
Message 1860 - Posted: 18 Jan 2023, 1:31:29 UTC - in response to Message 1859.  

I haven't looked at the logs but most all my WU's are showing 2d+ left till completion and the returned credit at Free-DC for this project has taken a sharp nosedive today implying it's a systemic problem in the WU's.


No, they’ve moved from tasks taking an hour or so to tasks taking a day or so. Credits will pick up when the long tasks finish and the average 1,500 credits per task kick in.
ID: 1860 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
marmot

Send message
Joined: 10 Dec 20
Posts: 24
Credit: 10,767,590
RAC: 0
Message 1862 - Posted: 18 Jan 2023, 1:55:45 UTC - in response to Message 1860.  
Last modified: 18 Jan 2023, 2:53:19 UTC

I haven't looked at the logs but most all my WU's are showing 2d+ left till completion and the returned credit at Free-DC for this project has taken a sharp nosedive today implying it's a systemic problem in the WU's.


No, they’ve moved from tasks taking an hour or so to tasks taking a day or so. Credits will pick up when the long tasks finish and the average 1,500 credits per task kick in.


Except for two fake results like this one from my machines:

77522270 49580149 15 Jan 2023, 3:32:33 UTC 17 Jan 2023, 22:39:26 UTC Completed and validated 241,613.00 3,915,919.00 1,647.69 CurieMarieDock 0.2.0 long tasks v2.00
windows_x86_64

This reported runtime is IMPOSSIBLE because it was running on a single thread and the machine returned 48 other WU in the same time period..

Something is wrong with these work units.
There were 15 that ended in error states yesterday.
7 that couldn't validate.
Given that 91 completed, that's a 19.5% failure rate.
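For reference, the 19.5% figure matches dividing the 22 failures by the total of failures plus completed tasks. A quick Python check, assuming the 91 completed tasks do not already include the 22 failures:

```python
errors = 15      # ended in error states
invalid = 7      # could not validate
completed = 91   # completed successfully (assumed to exclude the failures)

failed = errors + invalid
rate_of_all = failed / (completed + failed)   # failures as a share of all results
rate_vs_completed = failed / completed        # failures relative to successes

print(f"{rate_of_all:.1%}")        # 19.5%
print(f"{rate_vs_completed:.1%}")  # 24.2%
```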

Also, the deadline is too close. Our local electric company forced new rate programs and meters on us. 6-8am and 6-8pm are 31 cents per kWh; the rest of the day is 4 cents.
BOINC doesn't support 2 pause periods, so I moved to dual installs.
One runs 10 hours in the day, the other 10 hours at night.
These new peak/off-peak programs are a paradigm shift in USA electric power companies, so others crunching BOINC will have to face this soon.

An 8th gen laptop should be able to complete one of these before a deadline, but with 4 hours lost per day to the rate plan, and with it showing 3 days 4 hours remaining until a Jan 21 deadline, it looks unlikely to finish.
We'll need longer deadlines or a switch to multi-threading for these WUs.
ID: 1862 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Mad_Max

Send message
Joined: 27 Dec 21
Posts: 19
Credit: 16,651,665
RAC: 16,049
Message 1869 - Posted: 18 Jan 2023, 3:26:40 UTC - in response to Message 1820.  


I note that all the tasks giving me problems are RdRp_v2_sample whereas all of the successful tasks are Sprot_delta_v1_RM_sidock

Are the sample jobs a faulty batch and should I abort them on sight?

No. I have only seen hung tasks in Sprot_delta.
RdRp_v2_sample runs OK (at least I have never come across a hung task from this series).

It's just that these tasks are MUCH (about 10-20 times) longer than the previous ones from the Sprot_delta series. And calculation times exceeding a day (and on weak computers, more than 2 days of non-stop computing) are a NORMAL situation for these tasks and not a failure!

Although such long tasks can be a problem in themselves - admins need to at least increase the BOINC deadline setting for them, because weaker computers (or modern but not working 24/7, but only a few hours a day) simply will not have enough time to finish all calculations before the deadline.
ID: 1869 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Mad_Max

Send message
Joined: 27 Dec 21
Posts: 19
Credit: 16,651,665
RAC: 16,049
Message 1870 - Posted: 18 Jan 2023, 3:34:37 UTC - in response to Message 1862.  


This reported runtime is IMPOSSIBLE because it was running on a single thread and the machine returned 48 other WU in the same time period..

Something is wrong with these work units.

It loses (resets to zero) the CPU time stats after each restart (a full restart, without leaving the app in RAM). So only the CPU/elapsed time since the last app restart is counted. Looks like another bug...
I already posted about it in detail in another thread before I saw your message: https://www.sidock.si/sidock/forum_thread.php?id=225&postid=1866#1866
ID: 1870 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
marmot

Send message
Joined: 10 Dec 20
Posts: 24
Credit: 10,767,590
RAC: 0
Message 1871 - Posted: 18 Jan 2023, 3:39:44 UTC - in response to Message 1870.  
Last modified: 18 Jan 2023, 3:44:48 UTC


This reported runtime is IMPOSSIBLE because it was running on a single thread and the machine returned 48 other WU in the same time period..

Something is wrong with these work units.

It loses (resets to zero) the CPU time stats after each restart (a full restart, without leaving the app in RAM). So only the CPU/elapsed time since the last app restart is counted. Looks like another bug...
I already posted about it in detail in another thread before I saw your message: https://www.sidock.si/sidock/forum_thread.php?id=225&postid=1866#1866


Did you see any results reporting 3,915,919 seconds?

(Oh no! All my valid results have been purged! There was another and I was trying to check if it was an identical 3,915,919. It was over 3.9 million seconds).

So, I have to go and edit all my machines' BOINC settings to retain apps in RAM.... :sigh:


Although such long tasks can be a problem in themselves - admins need to at least increase the BOINC deadline setting for them, because weaker computers (or modern but not working 24/7, but only a few hours a day) simply will not have enough time to finish all calculations before the deadline.


Agreed, and I made that point several times in several messages.
ID: 1871 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
3man001

Send message
Joined: 3 Mar 22
Posts: 4
Credit: 8,334,432
RAC: 0
Message 1872 - Posted: 18 Jan 2023, 4:52:02 UTC

Lately, I've been interrupting these tasks due to the fact that their execution time was beyond reasonable limits.
Perhaps this will help in finding the hang problem.

https://www.sidock.si/sidock/result.php?resultid=77562352
https://www.sidock.si/sidock/result.php?resultid=77561851
https://www.sidock.si/sidock/result.php?resultid=77559687
https://www.sidock.si/sidock/result.php?resultid=77541537
https://www.sidock.si/sidock/result.php?resultid=77527163
https://www.sidock.si/sidock/result.php?resultid=77442808
https://www.sidock.si/sidock/result.php?resultid=77431479
https://www.sidock.si/sidock/result.php?resultid=77420159
https://www.sidock.si/sidock/result.php?resultid=77319386
ID: 1872 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
hoarfrost
Volunteer moderator
Project administrator
Project developer

Send message
Joined: 11 Oct 20
Posts: 337
Credit: 25,678,308
RAC: 9,173
Message 1876 - Posted: 18 Jan 2023, 8:37:20 UTC - in response to Message 1862.  

77522270 49580149 15 Jan 2023, 3:32:33 UTC 17 Jan 2023, 22:39:26 UTC Completed and validated 241,613.00 3,915,919.00 1,647.69 CurieMarieDock 0.2.0 long tasks v2.00
windows_x86_64

This reported runtime is IMPOSSIBLE because it was running on a single thread and the machine returned 48 other WU in the same time period.

I found this result in the archive, along with its files. It was sent to the machine at 2023-01-15 03:32:33 and received at 2023-01-17 22:39:26 (UTC).
The returned files look correct at first glance. The actual computing time is "5 hours, 25 minutes, 13.132 seconds". I don't know why Xeon 2660 + Windows 8.1 + BOINC reported this incorrect time (1.5 months), but it is obviously a mistake in the task description data. The mistake can occur at different stages: on the machine, in sending, or in processing on the server. Known anomalies usually relate to Windows hosts. Maybe some influence of antivirus, defenders, etc. is involved.

In any case, the "problem" with this result isn't related to hangs. It looks like a task metadata mistake only.
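The figures in this exchange can be cross-checked with a few lines of Python. One assumption to flag: the two numbers 241,613.00 and 3,915,919.00 in the quoted task row are read here as run time and CPU time in seconds, following the usual BOINC task list layout; the column headers themselves are not shown in the thread.

```python
from datetime import datetime, timedelta

# Timestamps quoted above (UTC).
sent = datetime(2023, 1, 15, 3, 32, 33)
received = datetime(2023, 1, 17, 22, 39, 26)
wall_seconds = (received - sent).total_seconds()

# Values from the task row (assumed columns: run time, CPU time, in seconds).
reported_run_time = 241_613.00
reported_cpu_time = 3_915_919.00

# Actual computing time found in the returned files.
actual = timedelta(hours=5, minutes=25, seconds=13.132)

cpu_days = reported_cpu_time / 86_400
print(wall_seconds)            # 241613.0
print(round(cpu_days, 1))      # 45.3, i.e. about 1.5 months
print(actual.total_seconds())  # about 19513 seconds
```

If that reading of the columns is right, the reported run time is exactly the interval between sending and receiving the task, while the reported CPU time of about 45 days is far larger than either the wall-clock interval or the actual computing time.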
ID: 1876 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
hoarfrost
Volunteer moderator
Project administrator
Project developer

Send message
Joined: 11 Oct 20
Posts: 337
Credit: 25,678,308
RAC: 9,173
Message 1877 - Posted: 18 Jan 2023, 8:46:47 UTC - in response to Message 1869.  

It's just that these tasks are MUCH (about 10-20 times) longer than the previous ones from the Sprot_delta series. And calculation times exceeding a day (and on weak computers, more than 2 days of non-stop computing) are a NORMAL situation for these tasks and not a failure!

Yes, absolutely.

Although such long tasks can be a problem in themselves - admins need to at least increase the BOINC deadline setting for them, because weaker computers (or modern but not working 24/7, but only a few hours a day) simply will not have enough time to finish all calculations before the deadline.

We are currently processing a sample batch of workunits, which is needed to gather some statistics before switching to the main dataset. Increasing the deadline is necessary, I agree. It is now 6 days; 8 or 10 days is more appropriate, I think.
ID: 1877 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
marmot

Send message
Joined: 10 Dec 20
Posts: 24
Credit: 10,767,590
RAC: 0
Message 1901 - Posted: 20 Jan 2023, 15:39:14 UTC - in response to Message 1876.  
Last modified: 20 Jan 2023, 15:53:19 UTC

77522270 49580149 15 Jan 2023, 3:32:33 UTC 17 Jan 2023, 22:39:26 UTC Completed and validated 241,613.00 3,915,919.00 1,647.69 CurieMarieDock 0.2.0 long tasks v2.00
windows_x86_64

This reported runtime is IMPOSSIBLE because it was running on a single thread and the machine returned 48 other WU in the same time period.
The mistake can occur at different stages: on the machine, in sending, or in processing on the server. Known anomalies usually relate to Windows hosts. Maybe some influence of antivirus, defenders, etc. is involved.

In any case, the "problem" with this result isn't related to hangs. It looks like a task metadata mistake only.


This machine is dedicated to BOINC 20/7 and any service or 3rd party app that can drain resources is disabled.
No anti-virus, no task schedules, no workstation, no local DNS server, only basic IP 4 packeting.
The 3rd party task scheduler was only added yesterday and couldn't be the cause.
I am very careful to examine all new projects' WUs and have never seen this kind of run time reporting.

If the machine's local clock was off 5 minutes on a WU that ran only 5 minutes, maybe the negative time reporting would be interpreted as 3,915,919.00 sec, but the local clock isn't off by 235k seconds...
ID: 1901 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
der_Day

Send message
Joined: 9 Feb 21
Posts: 7
Credit: 7,473,030
RAC: 2,438
Message 1907 - Posted: 20 Jan 2023, 20:18:17 UTC

Very annoying to have to pause the hanging tasks almost every 10 minutes (sometimes less!).
I think I have to suspend sidock (hopefully for a short period).
ID: 1907 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile vaughan

Send message
Joined: 22 Nov 20
Posts: 10
Credit: 13,167,232
RAC: 127
Message 1939 - Posted: 24 Jan 2023, 1:23:37 UTC - in response to Message 1907.  

Yes, come and join the PrimeGrid challenge.
ID: 1939 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
arcturus

Send message
Joined: 27 Nov 22
Posts: 20
Credit: 4,462,850
RAC: 23,939
Message 1945 - Posted: 24 Jan 2023, 18:13:55 UTC

The newer 2.02 units seem to be completing OK. One puzzle that remains is inconsistent granted credit: in some cases a variation of up to 25% for the same computer.

https://www.sidock.si/sidock/workunit.php?wuid=49667156
https://www.sidock.si/sidock/workunit.php?wuid=49667239
ID: 1945 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote

©2024 SiDock@home Team