DPL: switch to ws:// as default client for self hosted #5535

Merged
ktf merged 1 commit into AliceO2Group:dev from ktf:feat-websocket-default
Feb 25, 2021

Conversation

@ktf
Member

@ktf ktf commented Feb 23, 2021

No description provided.

@ktf ktf requested a review from a team as a code owner February 23, 2021 11:18
@ktf
Member Author

ktf commented Feb 23, 2021

@davidrohr can you check this is now working correctly for you?

@ktf
Member Author

ktf commented Feb 23, 2021

@teo, this should not have any effect on AliECS. If you see any, please let me know.

@teo
Member

teo commented Feb 23, 2021

I don't even know what this is, are you piping stdout somewhere via websocket? Are CLI flags affected?

@ktf
Member Author

ktf commented Feb 23, 2021

Yes, but only for the case in which the DPL driver is used. When running under a "Control" things should behave exactly the same.

@ktf
Member Author

ktf commented Feb 23, 2021

@davidrohr do you understand what:

reco_NOGPU.log:[58926:CPVClusterizerSpec]: [20:10:47][ERROR] Exception caught: bitset::test: __position (which is 23040) >= _Nb (which is 23040) 
reco_NOGPU.log-[58926:CPVClusterizerSpec]: /mnt/mesos/sandbox/sandbox/o2-fullci/sw/slc8_x86-64/O2/5535-1/lib/libO2FrameworkFoundation.so(_ZN2o29framework13runtime_errorEPKc+0x74)[0x7fe9968a7a54]
reco_NOGPU.log-[58926:CPVClusterizerSpec]: /mnt/mesos/sandbox/sandbox/o2-fullci/sw/slc8_x86-64/O2/5535-1/lib/libO2Framework.so(+0x1106e5)[0x7fe9979c16e5]
killing child 56782
killing child 56785
killing child 56787
killing child 56788
killing child 56793
killing child 56794
killing child 56834
killing child 57752
killing child 58347
killing child 58779
killing child 60094
killing child 62747
/mnt/mesos/sandbox/sandbox/o2-fullci/sw/slc8_x86-64/O2/5535-1/share/scripts/jobutils.sh: line 48: 56782 Killed                  eval ${finalcommand} >> ${logfile} 2>&1
Running: TIME="#walltime %e" /mnt/mesos/sandbox/sandbox/o2-fullci/sw/slc8_x86-64/O2/5535-1/share/scripts/monitor-mem.sh /usr/bin/time --output=reco_NOGPU.log_time './reco_NOGPU.log_tmp.sh'
/usr/bin/time --output=reco_NOGPU.log_time ./reco_NOGPU.log_tmp.sh

means?

@shahor02
Collaborator

@ktf: bitset overflow will be fixed by #5499
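
For readers unfamiliar with the exception in the log above: the message is what libstdc++ prints when std::bitset::test is called with a position equal to or larger than the bitset size. The following is a minimal sketch of that behaviour, not the actual CPVClusterizerSpec code and not the fix from #5499. The constant kNBits and the helpers testThrows and testOrFalse are hypothetical names chosen for illustration; only the value 23040 comes from the log.

```cpp
#include <bitset>
#include <cstddef>
#include <stdexcept>

// Size taken from the log message; the real container is not shown in this thread.
constexpr std::size_t kNBits = 23040;

// Returns true if std::bitset::test rejects `pos` with std::out_of_range.
// Valid positions are 0 .. kNBits - 1, so pos == kNBits is already out of range.
bool testThrows(const std::bitset<kNBits>& bits, std::size_t pos)
{
  try {
    (void)bits.test(pos);
    return false;
  } catch (const std::out_of_range&) {
    return true;
  }
}

// Bounds-checked lookup that reports "not set" instead of throwing.
// This is one possible guard, assumed here for illustration only.
bool testOrFalse(const std::bitset<kNBits>& bits, std::size_t pos)
{
  return pos < bits.size() && bits.test(pos);
}
```

With a std::bitset<kNBits>, testThrows(bits, 23039) returns false while testThrows(bits, 23040) returns true, which matches the "__position (which is 23040) >= _Nb (which is 23040)" text in the log: the index is exactly one past the last valid bit.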

@ktf
Member Author

ktf commented Feb 23, 2021

Ok. Looks like there are still some errors, though.

@shahor02
Collaborator

If you mean this:

[ 93%] Building CXX object Modules/FT0/CMakeFiles/o2-qc-ft0-data-producer.dir/src/runDataProducer.cxx.o
/mnt/mesos/sandbox/sandbox/o2-fullci/sw/SOURCES/QualityControl/v1.10.0/v1.10.0/Modules/PHOS/src/runQCPHOSRaw.cxx:6:10: fatal error: PHOSWorkflow/PublisherSpec.h: No such file or directory
 #include <PHOSWorkflow/PublisherSpec.h>
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

it is in the QC. @peressounko has moved a class in O2 and will provide the corresponding patch for QC.

@ktf
Member Author

ktf commented Feb 24, 2021

@TimoWilken could you restart the fullCI on this PR?

@TimoWilken
Contributor

@ktf done!

@ktf
Member Author

ktf commented Feb 25, 2021

@jgrosseo @davidrohr @sawenzel I am merging this, since the tests seem to pass. If you see any regression, you can revert to the old behaviour with --driver-client-backend stdout://.

@ktf ktf merged commit 8514bb0 into AliceO2Group:dev Feb 25, 2021
@ktf ktf deleted the feat-websocket-default branch February 25, 2021 08:21
@shahor02
Collaborator

@ktf this PR leads to a memory buildup: I am running gdb on the last device in the workflow, which processes just 1 TF read from the disk (i.e. all upstream devices should not do any processing once the last one starts).
While gdb is paused on a breakpoint, all DPL devices show 100% CPU usage and the memory consumption keeps growing (~100MB/s); eventually the system starts swapping and everything freezes.

If I revert this (and #5537) or run with --driver-client-backend stdout://, everything is back to normal.

@ktf ktf restored the feat-websocket-default branch February 25, 2021 20:33
ktf added a commit that referenced this pull request Feb 25, 2021
ktf added a commit that referenced this pull request Feb 25, 2021
@ktf
Member Author

ktf commented Feb 25, 2021

Ok, reverting... Can you give me more details on your workflow?

@shahor02
Collaborator

@ktf to reproduce it you can download input data from https://cernbox.cern.ch/index.php/s/6Ij0dQbkhvCYGad and start

o2-primary-vertexing-workflow  --shm-segment-size 10000000000 --disable-vertex-track-matching -s

then start gdb on the primary-vertexing device, set a breakpoint, e.g. bre PVertexer.cxx:540, and run. Once it stops on the breakpoint, the memory keeps growing (with this small dataset more slowly than in my full test, but still clearly noticeable).

ktf added a commit that referenced this pull request Feb 25, 2021
ktf added a commit that referenced this pull request Feb 25, 2021