Task #792

open

Check replay speed and optimize for online

Added by Alexandre Camsonne 10 months ago. Updated 5 months ago.

Status: Resolved
Priority: Normal
Assignee:
Start date: 07/12/2023
Due date:
% Done: 80%
Estimated time:
Spent time:

Files

Bildschirmfoto 2023-09-23 um 14.18.05.png (271 KB) Profile results after DecodeToHitList optimization Ole Hansen, 09/23/2023 02:53 PM
Bildschirmfoto 2023-09-23 um 14.24.31.png (257 KB) Profile results before optimization Ole Hansen, 09/23/2023 03:02 PM
Actions #1

Updated by Alexandre Camsonne 9 months ago

40 Hz with

Actions #2

Updated by Alexandre Camsonne 8 months ago

  • Assignee set to Ole Hansen
Actions #3

Updated by Ole Hansen 8 months ago

  • File Bildschirmfoto 2023-09-04 um 15.47.12.png added
  • Status changed from New to In Progress
  • % Done changed from 0 to 10

Here's some profiling information for the current software. This was obtained with the first 1000 events from run 740 with only THcNPSApparatus containing the THcNPSCalorimeter configured (RelWithDebInfo build). For the interpretation of the graph, cf. https://www.jetbrains.com/help/clion/2023.2/cpu-profiler.html#InterpretingTheResults_FlameChart. THcNPSCalorimeter::Decode consumes 46% of the time, much of that in THcHitList::DecodeToHitList. I think there's some easy room for improvement. I'll investigate further.

Bildschirmfoto 2023-09-23 um 14.24.31.png

Actions #4

Updated by Ole Hansen 7 months ago

The low-hanging fruit in this case turned out to be the functions THcHitList::DecodeToHitList and THcRawHit::Compare. I made the following main changes:

- Removed the dynamic_cast from THcRawHit::Compare. Since this function is called thousands of times per event in the call to TClonesArray::Sort from DecodeToHitList, the dynamic_cast becomes very expensive.
- Replaced the O(N^2)-complexity search for hits already found with a given plane and counter number with a std::map lookup. As N >> 100, the much better complexity of the map lookup pays off. (The map also presents the hits in the desired plane/counter sorted order, so it is a complete waste to call TClonesArray::Sort later, but I can't see an easy way to apply the map order to the TClonesArray. ROOT data structures are fundamentally incompatible with STL containers.)
- Inlined all Get functions of THcRawAdcHit. There is significant overhead associated with calling essentially trivial getters non-inline.
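
To illustrate the second change, here is a minimal sketch of the map-based lookup idea. It is not the actual hcana code: RawWord, RawHit, HitList, buildHitList and sortedOrder are hypothetical stand-ins, a std::vector takes the place of the TClonesArray, and C++17 is assumed. Each incoming data word finds (or creates) its hit through a std::map keyed on (plane, counter), which costs O(log N) per word instead of a linear scan over all hits found so far.

```cpp
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical stand-in for one decoded data word (not an hcana type).
struct RawWord {
    int plane;
    int counter;
    int value;
};

// Hypothetical stand-in for a raw hit (not the hcana THcRawHit class).
struct RawHit {
    int plane;
    int counter;
    std::vector<int> samples;  // all data words seen for this channel
};

// Hit storage plus an index from (plane, counter) to the position in `hits`.
struct HitList {
    std::vector<RawHit> hits;
    std::map<std::pair<int, int>, std::size_t> index;
};

// Build the hit list with one map lookup per data word.
// A linear scan over `hits` for every word would be O(N^2) overall;
// the map lookup brings this down to O(N log N).
HitList buildHitList(const std::vector<RawWord>& words) {
    HitList list;
    for (const RawWord& w : words) {
        // try_emplace inserts only if the key is new; `inserted` reports which.
        auto [it, inserted] = list.index.try_emplace(
            std::make_pair(w.plane, w.counter), list.hits.size());
        if (inserted) {
            list.hits.push_back({w.plane, w.counter, {}});
        }
        list.hits[it->second].samples.push_back(w.value);
    }
    return list;
}

// The map iterates in ascending (plane, counter) order, so the sorted
// sequence of hit positions is available without a separate sort.
std::vector<std::size_t> sortedOrder(const HitList& list) {
    std::vector<std::size_t> order;
    order.reserve(list.index.size());
    for (const auto& entry : list.index) {
        order.push_back(entry.second);
    }
    return order;
}
```

In this sketch the hits stay in arrival order in the vector, and sortedOrder only reads off the positions in plane/counter order. Transferring that order back into a TClonesArray is the part described above as having no easy solution, and the sketch does not attempt it.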

See the attached profile for the result. The setup is the same as in my earlier comment (#3). The percentage of time spent in THaApparatus::Decode drops from 46% to 23%. The overall run time of this test replay drops from 40.9 s to 28.7 s, a 30% improvement.

I can see opportunities for further improvements: THaOutput spends 20% of its time retrieving global variable data via Podd::Variable::GetValue. Fadc250Module::LoadSlot spends 60% of its time in THaSlotData::loadData.

These changes will be part of the upcoming hcana 1.0.

Actions #5

Updated by Ole Hansen 7 months ago

  • File deleted (Bildschirmfoto 2023-09-04 um 15.47.12.png)
Actions #7

Updated by Casey Morean 5 months ago

  • Status changed from In Progress to Resolved