Memory contention model (example)

The 2019 industrial challenge and its solutions to be presented at WATERS'19
Nacho_S
Posts: 13
Joined: Wed Apr 20, 2016
Location: @Unimore

Memory contention model (example)

Post by Nacho_S » Wed Mar 06, 2019

In the challenge, we ask participants to derive a memory contention model for the case in which more than one CPU core and/or the GPU is accessing memory at the same time, taking into account that:
  • a task mapped to run on the GPU needs its input data to be offloaded through the copy engine (GPU CE), and a GPU kernel can also produce output data to be copied back to the host;
  • CPU tasks are modeled with a Read/Compute/Write semantic, with "Read" and "Write" being 100% memory-bound operations; and
  • the GPU CE, the A57 cores, and the Denver cores have significantly different memory bandwidths, latencies, and sensitivities to memory interference.

According to the following references, which measure the impact of memory interference:

https://ieeexplore.ieee.org/stamp/stamp ... er=8247615
http://hercules2020.eu/wp-content/uploa ... tforms.pdf

The idea is that the length of memory phases (read and write) depends on how many other memory controller clients
are accessing main memory at the same time.

Let us make the following assumptions for modeling interference:
  • We are given a task set of CPU and GPU tasks.
  • The size of the buffers to read and to write is known in advance and is fixed for every instance of the periodic job.
  • On the GPU side, we only consider interference from Copy Engine data movements (modeled in Amalthea as "runnables").
  • A GPU CE data movement is a 100% memory-bound runnable.
  • Every memory access is modeled as a sequential access pattern.

Model:
The model we derive from the literature cited above describes what happens to read/write latencies when more than one CPU core is accessing main memory at the same time. It also accounts for the latency increase caused by GPU CE activity during the observed time window.

For CPUs:
Lat(CPUtype,cacheLine)[ns] = baseline(CPUtype) + K(CPUtype)*#C + sGPU(CPUtype)*bGPU

with:
Lat(x,y) = time needed to read or write a cacheLine (64B) from main memory to CPU registers.
CPUtype = Observed CPU core; A57 or Denver
baseline = time taken to read or write a cacheLine (64B) from main memory to CPU registers in isolation (no interference).
baseline(A57) = 20 ns
baseline(Denver) = 8 ns

K(CPUtype) = increase in latency caused by a single interfering core. Note: it does not matter whether the interfering core is a Denver or an A57; this number depends only on the observed CPU core (CPUtype).
K(A57) = 20 ns
K(Denver) = 2 ns
#C = number of interfering cores, ranging from 0 to 5 (one of the six cores is the observed core; 0 means no interference from other CPUs).

sGPU = sensitivity to GPU CE activity. This represents the increase in latency when the GPU is performing operations on the copy engine.
sGPU(A57) = 100 ns
sGPU(Denver) = 20 ns

bGPU = boolean: 1 if the GPU is operating the copy engine, 0 otherwise.
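
As a minimal sketch of how the formula above could be implemented (not part of the challenge material; the function and variable names are our own), assuming the constants given in this post:

# Minimal sketch of the CPU-side latency model defined above.
# All constants come from this post and are in nanoseconds.
CACHE_LINE = 64  # bytes

BASELINE = {"A57": 20.0, "Denver": 8.0}    # baseline(CPUtype)
K        = {"A57": 20.0, "Denver": 2.0}    # added latency per interfering core
S_GPU    = {"A57": 100.0, "Denver": 20.0}  # sensitivity to GPU CE activity

def cpu_cacheline_latency(cpu_type, n_interfering, gpu_ce_active):
    """Lat(CPUtype, cacheLine) in ns, for 0 <= n_interfering <= 5."""
    assert 0 <= n_interfering <= 5, "TX2: 1 observed core + up to 5 interfering"
    return (BASELINE[cpu_type]
            + K[cpu_type] * n_interfering
            + S_GPU[cpu_type] * (1 if gpu_ce_active else 0))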

For the GPU Copy Engine:
Lat(memcpy,64B) = GPUbaseline + 0.5*#C

Lat(memcpy,64B) = Time taken to transfer 64B using the copy engine (cudaMemcpy)
GPUbaseline = 3 ns: time taken to transfer 64B using the copy engine with no interfering CPUs.
Each CPU core active in the same time window as the CE operation increases the baseline by half a nanosecond.
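
A companion sketch for the CE formula, with the same caveats as above:

GPU_BASELINE = 3.0  # ns to transfer 64B with no interfering CPUs
GPU_K        = 0.5  # ns added per CPU core active during the CE operation

def gpu_ce_cacheline_latency(n_active_cores):
    """Lat(memcpy, 64B) in ns."""
    return GPU_BASELINE + GPU_K * n_active_cores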

Numerical Example:
A task mapped on an A57 core has a memory footprint (read) of 128B.
If no other CPU is accessing memory (#C=0)
and if the GPU CE is idle (bGPU=0),
then the time necessary to perform the read operation of the working set size is:
( 128/cacheLine*Lat(A57,cacheLine) = 2*20 = 40 ns ).
If one interfering core is active (regardless of whether it is a Denver or an A57):
(128/cacheLine*Lat(A57,cacheLine) = 2*(20 + 20*1 + 0) = 80 ns)
This would increase to 2*(20 + 20*1 + 100) = 280 ns if the GPU CE were active for the whole duration of this memory phase.
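
Using the sketch above, the numbers of this example can be reproduced as follows:

def read_phase_ns(footprint_bytes, cpu_type, n_interfering, gpu_ce_active):
    # Number of cache lines in the working set, times the per-line latency.
    lines = footprint_bytes // CACHE_LINE
    return lines * cpu_cacheline_latency(cpu_type, n_interfering, gpu_ce_active)

print(read_phase_ns(128, "A57", 0, False))  # 40.0 ns
print(read_phase_ns(128, "A57", 1, False))  # 80.0 ns
print(read_phase_ns(128, "A57", 1, True))   # 280.0 ns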

Please let us know if you have any further questions.

Nacho & Nicola

arne.hamann
Posts: 34
Joined: Thu Jan 28, 2016
Location: Renningen

Re: Memory contention model (example)

Post by arne.hamann » Tue Mar 26, 2019

The description of the memory contention model is now also included in the appendix of the challenge description that can be found here.

zero212
Posts: 5
Joined: Tue Feb 26, 2019

Re: Memory contention model (example)

Post by zero212 » Wed Apr 10, 2019

Hi
I guess the baseline can be derived from the read/write latencies and the PU's frequency, but what about K and sGPU parameters? Are there any model entities these values can be derived from? I am working on an implementation and would like to make it as flexible as possible.

Nacho_S
Posts: 13
Joined: Wed Apr 20, 2016
Location: @Unimore

Re: Memory contention model (example)

Post by Nacho_S » Fri Apr 12, 2019

Hi,
zero212 wrote:
Wed Apr 10, 2019
Hi
I guess the baseline can be derived from the read/write latencies and the PU's frequency, but what about K and sGPU parameters? Are there any model entities these values can be derived from? I am working on an implementation and would like to make it as flexible as possible.
You can find those values here: http://hercules2020.eu/wp-content/uploa ... tforms.pdf
Please check figure 17 (page 22) for the A57 cores and figure 19 (page 24) for Denver cores.

To derive those results we used a memory latency benchmark called "LAT_MEM_RD" (http://www.bitmover.com/lmbench/lat_mem_rd.8.html)
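
For readers unfamiliar with LAT_MEM_RD: it measures latency with a pointer chase, where each load's address depends on the result of the previous load, so successive loads cannot overlap. A rough conceptual sketch in Python (interpreter overhead dominates here and the real benchmark is written in C, so treat this only as an illustration of the idea):

import random
import time

def pointer_chase_ns(n_slots=1 << 20, iters=1_000_000):
    """Estimate per-access time [ns] by chasing a random cyclic chain."""
    # A random cyclic permutation defeats hardware prefetchers.
    order = list(range(n_slots))
    random.shuffle(order)
    chain = [0] * n_slots
    for i in range(n_slots):
        chain[order[i]] = order[(i + 1) % n_slots]
    idx = 0
    t0 = time.perf_counter_ns()
    for _ in range(iters):
        idx = chain[idx]  # the next load depends on this one
    return (time.perf_counter_ns() - t0) / iters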

Best,

Nacho

lkrupp
Posts: 2
Joined: Wed Apr 24, 2019
Location: Kaiserslautern

Re: Memory contention model (example)

Post by lkrupp » Thu Apr 25, 2019

Hi,
the document describing the memory latency measurements states that LMBench (LAT_MEM_RD, as specified in detail above) in conjunction with a custom-made program is used. I am interested in how exactly the latency is measured. Therefore, my questions are:

1) Are the measurements on the Tegra platforms conducted using standard Linux for Tegra or was the Linux kernel adapted?
2) In case of standard L4T: Are timing primitives of the OS (like gettime() or clocks()) used and how do you account for the overhead of the OS?

Thank you in advance!

Best regards,

Lukas Krupp

Nacho_S
Posts: 13
Joined: Wed Apr 20, 2016
Location: @Unimore

Re: Memory contention model (example)

Post by Nacho_S » Mon Apr 29, 2019

lkrupp wrote:
Thu Apr 25, 2019
Hi,
the document describing the memory latency measurements states that LMBench (LAT_MEM_RD, as specified in detail above) in conjunction with a custom-made program is used. I am interested in how exactly the latency is measured. Therefore, my questions are:

1) Are the measurements on the Tegra platforms conducted using standard Linux for Tegra or was the Linux kernel adapted?
2) In case of standard L4T: Are timing primitives of the OS (like gettime() or clocks()) used and how do you account for the overhead of the OS?

Thank you in advance!

Best regards,

Lukas Krupp
Hi,
>> 1) Are the measurements on the Tegra platforms conducted using standard Linux for Tegra or was the Linux kernel adapted?
The experiments were done using the standard Linux for Tegra.

>> 2) In case of standard L4T: Are timing primitives of the OS (like gettime() or clocks()) used and how do you account for the overhead of the OS?
We used OS timing primitives. Unfortunately, we do not account for the OS overhead; however, we tried to minimize it as much as possible by launching the tasks in isolation (with synthetic parameters as input), at maximum OS priority, etc.
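
To make that last point concrete, a hypothetical sketch of how such a measurement could be wrapped (illustrative only, not the actual benchmark code; requires root on Linux for SCHED_FIFO):

import os
import time

def measure_ns(fn, *args):
    os.sched_setaffinity(0, {0})  # pin to core 0, in isolation
    os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(99))  # max priority
    t0 = time.perf_counter_ns()   # monotonic OS timing primitive
    fn(*args)
    return time.perf_counter_ns() - t0
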
Nacho.

lkrupp
Posts: 2
Joined: Wed Apr 24, 2019
Location: Kaiserslautern

Re: Memory contention model (example)

Post by lkrupp » Tue May 07, 2019

Hello,
thank you very much for your answer!

I am currently analyzing the Jetson TX2 platform and have another question:
Does anyone happen to know whether the GPU of the Jetson TX2 has accessible hardware performance counters and, if so, how to access them?

Many thanks in advance.

Best regards

Nacho_S
Posts: 13
Joined: Wed Apr 20, 2016
Location: @Unimore

Re: Memory contention model (example)

Post by Nacho_S » Tue May 14, 2019

lkrupp wrote:
Tue May 07, 2019
Hello,
thank you very much for your answer!

I am currently analyzing the Jetson TX2 platform and have another question:
Does anyone happen to know whether the GPU of the Jetson TX2 has accessible hardware performance counters and, if so, how to access them?

Many thanks in advance.

Best regards
Hi,

We have never profiled an application at that level, but as far as I know you can access them by using this API: https://docs.nvidia.com/cuda/cupti/inde ... r_overview

Nacho
