Task #792 (open)
Check replay speed and optimize for online
Files
Updated by Ole Hansen 3 months ago
- File Bildschirmfoto 2023-09-04 um 15.47.12.png added
- Status changed from New to In Progress
- % Done changed from 0 to 10
Here's some profiling information for the current software. This was obtained with the first 1000 events from run 740, with only THcNPSApparatus (containing the THcNPSCalorimeter) configured, using a RelWithDebInfo build. For the interpretation of the graph, cf. https://www.jetbrains.com/help/clion/2023.2/cpu-profiler.html#InterpretingTheResults_FlameChart. THcNPSCalorimeter::Decode consumes 46% of the time, much of that in THcHitList::DecodeToHitList. I think there's some easy room for improvement. I'll investigate further.
Updated by Ole Hansen 2 months ago
- File Bildschirmfoto 2023-09-23 um 14.18.05.png added
- % Done changed from 10 to 80
The low-hanging fruit in this case turned out to be the functions THcHitList::DecodeToHitList and THcRawHit::Compare. I made the following main changes:
- Do not use dynamic_cast in THcRawHit::Compare. Since this function is called thousands of times per event in the call to TClonesArray::Sort from DecodeToHitList, dynamic_cast becomes very expensive.
- Replace the O(N^2)-complexity search for hits already found with a given plane and counter number with a std::map lookup. As N >> 100, the much better complexity of the map lookup pays off. (The map also presents the hits in the desired plane/counter sorted order, so calling TClonesArray::Sort later is a complete waste, but I can't see an easy way to apply the map order to the TClonesArray. ROOT data structures are fundamentally incompatible with STL containers.)
- Inline all Get functions of THcRawAdcHit. There is significant overhead associated with calling essentially trivial getters non-inline.
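For illustration, the first two changes in the list above can be sketched roughly as follows. This is a minimal standalone sketch, not the actual hcana code: Sortable, RawHit, HitList, FindOrAdd and SortedOrder are hypothetical stand-ins, and a std::vector takes the place of the TClonesArray. It assumes the container only ever holds one hit type, which is what makes a static_cast in Compare safe.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a sortable base class (not ROOT's TObject).
struct Sortable {
    virtual ~Sortable() = default;
    virtual int Compare(const Sortable* other) const = 0;
};

// Hypothetical stand-in for a raw hit identified by plane and counter.
struct RawHit : Sortable {
    int plane;
    int counter;
    std::vector<int> samples;

    RawHit(int p, int c) : plane(p), counter(c) {}

    // Assumption: the container holds only RawHit objects, so static_cast
    // avoids the run-time type check that dynamic_cast performs per call.
    int Compare(const Sortable* other) const override {
        assert(other != nullptr);
        const auto* hit = static_cast<const RawHit*>(other);
        if (plane != hit->plane) return plane < hit->plane ? -1 : 1;
        if (counter != hit->counter) return counter < hit->counter ? -1 : 1;
        return 0;
    }
};

// Hypothetical hit list: a (plane, counter) -> index map replaces a linear
// scan over all existing hits for every incoming piece of data.
class HitList {
public:
    // Returns the hit for (plane, counter), creating it on first use.
    // The reference is only valid until the next call that adds a hit.
    RawHit& FindOrAdd(int plane, int counter) {
        const auto key = std::make_pair(plane, counter);
        auto it = index_.find(key);            // O(log N) lookup
        if (it == index_.end()) {
            hits_.emplace_back(plane, counter);
            it = index_.emplace(key, hits_.size() - 1).first;
        }
        return hits_[it->second];
    }

    std::size_t Size() const { return hits_.size(); }

    // Storage indices in ascending (plane, counter) order. std::map keeps
    // its keys ordered, so no separate sort is needed to obtain this.
    std::vector<std::size_t> SortedOrder() const {
        std::vector<std::size_t> order;
        order.reserve(index_.size());
        for (const auto& entry : index_) order.push_back(entry.second);
        return order;
    }

private:
    std::vector<RawHit> hits_;                             // insertion order
    std::map<std::pair<int, int>, std::size_t> index_;     // sorted by key
};
```

Calling FindOrAdd(2, 5) twice returns the same hit, so repeated data for one channel accumulates in a single entry. How the sorted order could be transferred back to a TClonesArray is left open here, as in the comment above.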
See the attached profile for the result. The setup is the same as in my earlier comment here. The percentage of time spent in THaApparatus::Decode drops from 46% to 23%. The overall run time of this test replay drops from 40.9s to 28.7s, a 30% improvement.
I can see opportunities for further improvements: THaOutput spends 20% of its time retrieving global variable data via Podd::Variable::GetValue. Fadc250Module::LoadSlot spends 60% of its time in THaSlotData::loadData.
These changes will be part of the upcoming hcana 1.0.
Updated by Ole Hansen 2 months ago
- File deleted (Bildschirmfoto 2023-09-04 um 15.47.12.png)