Problems and Technical Issues with Rosetta@home

Sid Celery
Joined: 11 Feb 08
Posts: 2024
Credit: 39,862,078
RAC: 19,183
Message 109374 - Posted: 12 Jun 2024, 19:05:03 UTC - in response to Message 109371.  

Tasks starting with RosettaVS run for 8 hours for me.

Great, but I don't say this for the ones that run as expected, but for all those that don't, of which there seem to be many.
Also, I don't recall seeing any RosettaVS tasks. I don't know how they behave.
ID: 109374 · Rating: 0
Bryn Mawr
Joined: 26 Dec 18
Posts: 380
Credit: 11,334,032
RAC: 8,037
Message 109375 - Posted: 13 Jun 2024, 6:51:24 UTC - in response to Message 109367.  

Now out of work new

This has been the best run we've had for a couple of years - bound to end at some point once everyone's offline cache runs down.
It's at this point my 12hr runtime setting ekes out my remaining work as far as possible.

What I'd re-emphasise is that the default runtime for tasks has fallen to 3hrs for some reason, which I believe to be a mistake and contradicts the forced Boinc setting of 8hrs.
As such, people should go into Boinc's Your Account option, select Rosetta@home preferences and change Target CPU run time to an explicit 8hrs rather than "not selected".
This will almost treble how long tasks run and extend the life of work batches so that we run out less, if at all, while almost trebling the credit we get for tasks too.

This should be considered a high priority for everyone imo.


I’ve always figured to leave it on default as the project scientists who set them up know their requirements better than I do.
ID: 109375 · Rating: 0
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1551
Credit: 15,933,616
RAC: 17,840
Message 109376 - Posted: 13 Jun 2024, 7:51:00 UTC

New batch of work over at Ralph, with new errors.

RF_SAVE_ALL_OUT_NOJRAN_IGNORE_THE_REST_validation_env_f_pred_148_16902_5_1

<core_client_version>8.0.2</core_client_version>
<![CDATA[
<message>
The access code is invalid.
 (0xc) - exit code 12 (0xc)</message>
<stderr_txt>
Traceback (most recent call last):
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\cv2\rf2aa\predict.py", line 8, in <module>
    import torch
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\__init__.py", line 124, in <module>
    raise err
OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\lib\caffe2_detectron_ops_gpu.dll" or one of its dependencies.

</stderr_txt>
]]>




RF_SAVE_ALL_OUT_NOJRAN_IGNORE_THE_REST_validation_env_e_pred_195_16901_6_1

<core_client_version>8.0.2</core_client_version>
<![CDATA[
<message>
The access code is invalid.
 (0xc) - exit code 12 (0xc)</message>
<stderr_txt>
Traceback (most recent call last):
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\cv2\rf2aa\predict.py", line 698, in <module>
    b.write(base64.b64decode(f.read()))
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\base64.py", line 87, in b64decode
    return binascii.a2b_base64(s)
binascii.Error: Invalid base64-encoded string: number of data characters (65) cannot be 1 more than a multiple of 4

</stderr_txt>
]]>
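
For reference, the base64 failure above can be reproduced outside BOINC. This is a minimal sketch, assuming the encoded input that predict.py read was cut short (a guess; the log does not confirm the cause). The helper name plausible_base64_length is made up for illustration and is not part of predict.py.

```python
import base64
import binascii


def plausible_base64_length(payload: bytes) -> bool:
    """Return False when the data-character count is 1 more than a multiple of 4.

    Hypothetical helper: it only checks the length rule the decoder
    complained about, not whether the content itself is valid.
    """
    # Drop whitespace and '=' padding; what remains are the data characters.
    data_chars = [c for c in payload if c not in b" \t\r\n="]
    return len(data_chars) % 4 != 1


good = b"A" * 64  # 64 data characters: decodes cleanly
bad = b"A" * 65   # 65 data characters: same count as in the log above

assert plausible_base64_length(good)
assert not plausible_base64_length(bad)

decoded = base64.b64decode(good)

try:
    base64.b64decode(bad)
    raised = False
except binascii.Error:
    raised = True
```

A length check like this could run before decoding, so that a short or damaged input file is reported as such rather than surfacing as a decoder error.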




RF_SAVE_ALL_OUT_NOJRAN_IGNORE_THE_REST_validation_env_f_pred_119_16902_6_1

<core_client_version>8.0.2</core_client_version>
<![CDATA[
<message>
The access code is invalid.
 (0xc) - exit code 12 (0xc)</message>
<stderr_txt>
Traceback (most recent call last):
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\cv2\rf2aa\predict.py", line 708, in <module>
    pred.predict(out_name+f'_{n}',
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\cv2\rf2aa\predict.py", line 551, in predict
    logit_s, logit_aa_s, logit_pae, logit_pde, p_bind, pred_crds, alpha, pred_allatom, pred_lddt_binned,                msa_prev, pair_prev, state_prev = self.model(
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\cv2\rf2aa\RoseTTAFoldModel.py", line 358, in forward
    msa, pair, xyz, alpha_s, xyz_allatom, state, symmsub = self.simulator(
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\cv2\rf2aa\Track_module.py", line 1106, in forward
    msa, pair, xyz, state, alpha, symmsub = self.main_block[i_m](msa, pair,
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\cv2\rf2aa\Track_module.py", line 929, in forward
    xyz, state, alpha = self.str2str(
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\cuda\amp\autocast_mode.py", line 141, in decorate_autocast
    return func(*args, **kwargs)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\cv2\rf2aa\Track_module.py", line 503, in forward
    shift = self.se3(G, node.reshape(B*L, -1, 1), l1_feats, edge_feats)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\cv2\rf2aa\SE3_network.py", line 96, in forward
    return self.se3(G, node_features, edge_features)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\cv2\rf2aa/SE3Transformer\se3_transformer\model\transformer.py", line 185, in forward
    node_feats = self.graph_modules(node_feats, edge_feats, graph=graph, basis=basis)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\cv2\rf2aa/SE3Transformer\se3_transformer\model\transformer.py", line 47, in forward
    input = module(input, *args, **kwargs)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\cv2\rf2aa/SE3Transformer\se3_transformer\model\layers\attention.py", line 162, in forward
    fused_key_value = self.to_key_value(node_features, edge_features, graph, basis)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\cv2\rf2aa/SE3Transformer\se3_transformer\model\layers\convolution.py", line 347, in forward
    out += self.conv_in[str(degree_in)](feature, invariant_edge_feats,
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\cv2\rf2aa/SE3Transformer\se3_transformer\model\layers\convolution.py", line 186, in forward
    radial_weights = self.radial_func(invariant_edge_feats[e_i:e_j])
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\cv2\rf2aa/SE3Transformer\se3_transformer\model\layers\convolution.py", line 118, in forward
    return self.net(features)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\nn\modules\container.py", line 139, in forward
    input = module(input)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\nn\modules\linear.py", line 96, in forward
    return F.linear(input, self.weight, self.bias)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\nn\functional.py", line 1847, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: [enforce fail at ..\c10\core\CPUAllocator.cpp:79] data. DefaultCPUAllocator: not enough memory: you tried to allocate 536870912 bytes.

</stderr_txt>]]>
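
To put the last failure in perspective, the single allocation that failed works out to exactly 512 MiB, on top of whatever the model was already holding at that point. A quick check of the arithmetic (the helper name to_mib is only for illustration):

```python
def to_mib(num_bytes: int) -> float:
    """Convert a byte count to mebibytes (1 MiB = 2**20 bytes)."""
    return num_bytes / 2**20


failed_allocation = 536_870_912  # figure taken from the RuntimeError above
size_mib = to_mib(failed_allocation)
print(f"{size_mib:.0f} MiB")  # 512 MiB
```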

Grant
Darwin NT
ID: 109376 · Rating: 0
kotenok2000
Joined: 22 Feb 11
Posts: 243
Credit: 435,550
RAC: 1,317
Message 109377 - Posted: 13 Jun 2024, 12:29:17 UTC
Last modified: 13 Jun 2024, 13:06:40 UTC

Did they port the Rosetta Python projects to native Windows?
Try increasing the pagefile size.
It helped with the GPUGRID Python project.
It even uses the GPU.
ID: 109377 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2024
Credit: 39,862,078
RAC: 19,183
Message 109379 - Posted: 13 Jun 2024, 23:42:54 UTC - in response to Message 109375.  

What I'd re-emphasise is that the default runtime for tasks has fallen to 3hrs for some reason, which I believe to be a mistake and contradicts the forced Boinc setting of 8hrs.
As such, people should go into Boinc's Your Account option, select Rosetta@home preferences and change Target CPU run time to an explicit 8hrs rather than "not selected".
This will almost treble how long tasks run and extend the life of work batches so that we run out less, if at all, while almost trebling the credit we get for tasks too.

This should be considered a high priority for everyone imo.

I’ve always figured to leave it on default as the project scientists who set them up know their requirements better than I do.

While generally true, it's clear imo this 3hr target runtime is an error as it's inconsistent with what Rosetta tells Boinc.
It only ever slips through when a new version of the app comes out.
ISTR (I seem to recall) it happened once before and was corrected, back in the days when the admins paid more attention to us.
If the 8hr default ever changes I think something would be said - and seeing as no-one's saying anything these days I doubt it ever will change without a very specific reason.
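
For anyone checking the "almost treble" figure quoted above, it is just the ratio of the two target runtimes. The sketch assumes run time and credit scale roughly linearly with the target, which is an assumption here and not something the project has confirmed:

```python
default_hours = 3    # target runtime seen when the preference is "not selected"
explicit_hours = 8   # target runtime when set explicitly in the project preferences

ratio = explicit_hours / default_hours
print(f"{ratio:.2f}x")  # 2.67x
```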
ID: 109379 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2024
Credit: 39,862,078
RAC: 19,183
Message 109380 - Posted: 14 Jun 2024, 3:20:41 UTC

Ooh, 360k tasks. We live to fight another day (or two)
ID: 109380 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1912
Credit: 8,818,406
RAC: 9,577
Message 109383 - Posted: 15 Jun 2024, 6:48:29 UTC
Last modified: 15 Jun 2024, 6:48:45 UTC

Today, a lot of "classical" errors:

ERROR: Error in protocols::cyclic_peptide_predict::SimpleCycpepPredictpplication::set_up_n_to_c_cyclization_mover() function: residue 1 does not have a LOWER_CONNECT.
ERROR:: Exit from: src/protocols/cyclic_peptide_predict/SimpleCycpepPredictApplication.cc line: 2442
BOINC:: Error reading and gzipping output datafile: default.out
08:16:19 (5164): called boinc_finish(1)

ID: 109383 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2024
Credit: 39,862,078
RAC: 19,183
Message 109385 - Posted: 15 Jun 2024, 9:32:32 UTC - in response to Message 109383.  

Today, a lot of "classical" errors:

ERROR: Error in protocols::cyclic_peptide_predict::SimpleCycpepPredictpplication::set_up_n_to_c_cyclization_mover() function: residue 1 does not have a LOWER_CONNECT.
ERROR:: Exit from: src/protocols/cyclic_peptide_predict/SimpleCycpepPredictApplication.cc line: 2442
BOINC:: Error reading and gzipping output datafile: default.out
08:16:19 (5164): called boinc_finish(1)

Yes, but they fail very quickly, so I'm not too worried about them.

More concerning are two Validate errors on tasks that ran to completion:
hal_8a_i_hal_8aa_2jp5597_d99_0001_SAVE_ALL_OUT_2978378_13_0
hal_8a_i_hal_8aa_2jp1316_d224_0001_SAVE_ALL_OUT_2978378_13_0
ID: 109385 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2024
Credit: 39,862,078
RAC: 19,183
Message 109387 - Posted: 17 Jun 2024, 20:29:44 UTC - in response to Message 109380.  

Ooh, 360k tasks. We live to fight another day (or two)

Turned into 3+ days, but we're out again.
ID: 109387 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2024
Credit: 39,862,078
RAC: 19,183
Message 109389 - Posted: 19 Jun 2024, 21:01:17 UTC - in response to Message 109387.  

Ooh, 360k tasks. We live to fight another day (or two)

Turned into 3+ days, but we're out again.

While I know most people will have finished up their outstanding tasks already, I managed to sneak 4 extra returned tasks today and now discover that the validators running under boinc-process are down again.
Better now than at other times, I guess
ID: 109389 · Rating: 0
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1551
Credit: 15,933,616
RAC: 17,840
Message 109390 - Posted: 20 Jun 2024, 6:15:28 UTC
Last modified: 20 Jun 2024, 6:15:54 UTC

That boinc-process server has developed a habit of regularly falling over; it was well past due for another crash.
Grant
Darwin NT
ID: 109390 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2024
Credit: 39,862,078
RAC: 19,183
Message 109391 - Posted: 20 Jun 2024, 7:51:27 UTC - in response to Message 109389.  

Ooh, 360k tasks. We live to fight another day (or two)

Turned into 3+ days, but we're out again.

While I know most people will have finished up their outstanding tasks already, I managed to sneak 4 extra returned tasks today and now discover that the validators running under boinc-process are down again.
Better now than at other times, I guess

Or maybe not better now, as 660k tasks are newly available.
ID: 109391 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1912
Credit: 8,818,406
RAC: 9,577
Message 109396 - Posted: 20 Jun 2024, 20:10:55 UTC - in response to Message 109391.  

Or maybe not better now, as 660k tasks are newly available.


0 WUs, and a lot of daemons are down...
ID: 109396 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2024
Credit: 39,862,078
RAC: 19,183
Message 109397 - Posted: 20 Jun 2024, 23:19:40 UTC - in response to Message 109396.  
Last modified: 20 Jun 2024, 23:26:14 UTC

Or maybe not better now, as 660k tasks are newly available.

0 WUs, and a lot of daemons are down...

Yup. I would've expected 660k to last at least 2 days, but I'm not sure it lasted much more than 15hrs, unless tasks got pulled.
The front page figures are borked, on top of the boinc-process server being borked.

Edit: Actually, I'm now thinking tasks did get pulled.

Unvalidated tasks were about 20k before the new batch arrived - now 160k
In progress tasks were about 30k, now 112k
That implies 222k tasks were grabbed

But the front page is locked at 7am with 660k queued, so roughly 440k have gone missing, presumed pulled.
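
The arithmetic behind that estimate, as a rough sketch. All figures are the approximate ones quoted above, and it assumes every grabbed task shows up as either "in progress" or "unvalidated" (tasks already validated in the meantime are ignored):

```python
# All figures are in thousands of tasks, as quoted above.
unvalidated_before, unvalidated_now = 20, 160
in_progress_before, in_progress_now = 30, 112
queued_on_front_page = 660

# Growth in both counters is taken as the number of tasks handed out.
grabbed = (unvalidated_now - unvalidated_before) + (in_progress_now - in_progress_before)
unaccounted = queued_on_front_page - grabbed

print(grabbed)      # 222
print(unaccounted)  # 438, i.e. roughly the 440k mentioned above
```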
ID: 109397 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org