Check replay speed and optimize for online
Updated by Ole Hansen 3 months ago
- File Bildschirmfoto 2023-09-04 um 15.47.12.png added
- Status changed from New to In Progress
- % Done changed from 0 to 10
Here's some profiling information for the current software. This was obtained with the first 1000 events from run 740, with only THcNPSApparatus (containing the THcNPSCalorimeter) configured (RelWithDebInfo build). For the interpretation of the graph, cf. https://www.jetbrains.com/help/clion/2023.2/cpu-profiler.html#InterpretingTheResults_FlameChart.
THcNPSCalorimeter::Decode consumes 46% of the time, much of that in THcHitList::DecodeToHitList. I think there's some easy room for improvement. I'll investigate further.
Updated by Ole Hansen 2 months ago
- File Bildschirmfoto 2023-09-23 um 14.18.05.png added
- % Done changed from 10 to 80
The low-hanging fruit in this case turned out to be the function THcRawHit::Compare. I made the following main changes:
- Do not use THcRawHit::Compare. Since this function is called thousands of times per event in the call to
dynamic_cast becomes very expensive.
- Replace the O(N^2)-complexity search for hits already found with a given plane and counter number with a std::map lookup. As N >> 100, the much better complexity of the map lookup pays off. (The map also presents the hits in the desired plane/counter sorted order, and so it is a complete waste to call TClonesArray::Sort later, but I can't see an easy way to apply the map order to the TClonesArray. ROOT data structures are fundamentally incompatible with STL containers.)
- Inlined all Get functions of THcRawAdcHit. There is significant overhead associated with calling essentially trivial getters non-inline.
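To illustrate the second change, here is a minimal, self-contained sketch of the two lookup strategies. It is not the hcana code: RawDatum, Hit, collectLinear and collectMapped are hypothetical names made up for this example, and a plain std::vector stands in for the TClonesArray. The idea is only to show why keying hits by (plane, counter) in a std::map avoids the repeated linear scan and yields the sorted order as a by-product.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration only; not the hcana classes.
struct RawDatum { int plane; int counter; int value; };
struct Hit {
  int plane;
  int counter;
  std::vector<int> values;  // all data words seen for this channel
};

// O(N^2) variant: for every datum, scan all hits found so far.
std::vector<Hit> collectLinear(const std::vector<RawDatum>& data) {
  std::vector<Hit> hits;
  for (const auto& d : data) {
    Hit* found = nullptr;
    for (auto& h : hits) {
      if (h.plane == d.plane && h.counter == d.counter) { found = &h; break; }
    }
    if (!found) {
      hits.push_back(Hit{d.plane, d.counter, {}});
      found = &hits.back();
    }
    found->values.push_back(d.value);
  }
  return hits;  // in order of first appearance, still needs sorting
}

// O(N log N) variant: look up the hit by its (plane, counter) key.
std::vector<Hit> collectMapped(const std::vector<RawDatum>& data) {
  std::map<std::pair<int, int>, Hit> byChannel;
  for (const auto& d : data) {
    const std::pair<int, int> key{d.plane, d.counter};
    auto it = byChannel.find(key);
    if (it == byChannel.end())
      it = byChannel.emplace(key, Hit{d.plane, d.counter, {}}).first;
    it->second.values.push_back(d.value);
  }
  // std::map iterates in key order, i.e. by plane, then by counter.
  std::vector<Hit> hits;
  hits.reserve(byChannel.size());
  for (auto& kv : byChannel) hits.push_back(std::move(kv.second));
  assert(hits.size() == byChannel.size());
  return hits;
}
```

Both functions group the same data; the mapped variant returns the hits already ordered by plane and counter, whereas the linear variant returns them in order of first appearance and would need a separate sort.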
See the attached profile for the result. The setup is the same as in my earlier comment here. The percentage of time spent in THaApparatus::Decode drops from 46% to 23%. The overall run time of this test replay drops from 40.9s to 28.7s, a 30% improvement.
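As a rough cross-check of these numbers, the quoted Decode shares can be converted to absolute times. This assumes the profiler percentages can be applied directly to the overall run times, which is only approximately true, and all values are rounded:

```latex
\begin{align*}
t_{\text{Decode}}^{\text{before}} &\approx 0.46 \times 40.9\,\mathrm{s} \approx 18.8\,\mathrm{s} \\
t_{\text{Decode}}^{\text{after}}  &\approx 0.23 \times 28.7\,\mathrm{s} \approx 6.6\,\mathrm{s} \\
\Delta t_{\text{Decode}} &\approx 18.8\,\mathrm{s} - 6.6\,\mathrm{s} = 12.2\,\mathrm{s} \\
\Delta t_{\text{total}}  &= 40.9\,\mathrm{s} - 28.7\,\mathrm{s} = 12.2\,\mathrm{s} \\
\Delta t_{\text{total}} / 40.9\,\mathrm{s} &\approx 0.30
\end{align*}
```

Under that assumption, the time saved in Decode accounts for essentially the entire reduction in run time, consistent with the changes having touched only the decoding path.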
I can see opportunities for further improvements: THaOutput spends 20% of its time retrieving global variable data via
Fadc250Module::LoadSlot spends 60% of its time in
These changes will be part of the upcoming hcana 1.0.