\section{Design Guidelines}
\label{sec:design_guidelines}

Based on the insights from our model, we propose design guidelines to implement efficient intermittent systems.
The effectiveness of these guidelines is evaluated using seven benchmarks on the reference system used in Sec.~\ref{sec:detailed_execution_model}. 
We ported five benchmarks from the MiBench~\cite{guthausMiBench2001} benchmark suite and implemented two computation kernels (\emph{matmul} and \emph{conv2d}) that are commonly used to evaluate intermittent systems in the literature~\cite{kimLACT2024,maengSupporting2019,bhattacharyyaNvMR2022,ganesanWhat2019,akhunovEnabling2023}.

We evaluate two popular existing checkpointing schemes: \emph{static} and \emph{dynamic}.
In \emph{static}, checkpoint triggers are inserted at every loop latch in the program during compilation~\cite{ransfordMementos2011,kimLivenessAware2023,kimLACT2024,maengAdaptive2018}.
At runtime, checkpoint triggers examine $V_{ES}$ and execute a checkpoint only when it is below a predefined threshold.
In contrast, \emph{dynamic}~\cite{jayakumarQUICKRECALL2014,maengSupporting2019,balsamoHibernus2016,balsamoHibernus2015,kortbeekTimesensitive2020} does not modify the original program code.
Instead, it executes checkpoints via interrupts from the power management system, generated when $V_{ES}$ reaches $V_l$.
These schemes are considered since most checkpoint techniques utilize $V_{ES}$ by either actively polling it (as in \emph{static}) or by receiving a signal (as in \emph{dynamic}).
All evaluations are conducted with 470$\mu$F of energy storage and 1mA of input current at 1.9V, unless otherwise stated.

\subsection{Delaying Checkpoint Executions}
\label{sec:delay_checkpoint_execution}

The first design practice we propose is to delay checkpoint executions until the last possible moment.
While this practice is generally regarded as desirable in existing works~\cite{ransfordMementos2011,bhattiHarvOS2017}, it has not been recognized as a critical property.
Under the traditional execution model, early checkpoint execution is often considered acceptable as it makes the system wake up sooner, incurring only minor costs for initialization and recovery.
For example, some approaches have explored proactive power-offs based on the program's worst-case execution time~\cite{choiCompilerDirected2022,reymondSCHEMATIC2024,raffeckWoCA2024}.
% For example, some approaches have explored proactive power-offs based on the program's worst-case execution time~\cite{choiCompilerDirected2022,reymondSCHEMATIC2024}, which can be overly pessimistic~\cite{raffeckWoCA2024}.
In contrast, our model reveals that significant energy is wasted each time the system powers off (Sec.~\ref{sec:power_efficiency}).%, highlighting the impact of delaying checkpoint executions.
% As a result, the importance of delaying checkpoint executions is greater than previously assumed.

\begin{figure}
    \centering
    \includegraphics[width=\linewidth]{figs/plot_expr_7_cropped.pdf}
    \caption{Execution times across various checkpoint voltages, normalized to the 3.4V configuration.}
    \label{fig:expr_checkpoint_voltages}
\end{figure}

We evaluate the impact of delaying checkpoint executions in \emph{dynamic} by varying the interrupt voltage.
A 1100$\mu$F capacitor is used for $C_{ES}$.
% Fig.~\ref{fig:expr_checkpoint_voltages} presents the benchmark execution times in dynamic checkpoint scheme, across various checkpoint execution voltages.
% A 1100uF capacitor is used for $C_{ES}$ and the execution times are normalized to the 3.4V configuration. 
Fig.~\ref{fig:expr_checkpoint_voltages} presents the average execution times of the benchmarks over 30 runs, normalized to the 3.4V configuration.
Contrary to existing expectations, the results show that executing checkpoints earlier is significantly less efficient: execution takes 1.38x longer in the 3.7V setup and 2.45x longer in the 4.0V setup, on average.
Moreover, the overhead is consistent across all benchmarks since early checkpoint executions directly reduce the energy available for the computing system.
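As a rough illustration of scale, assuming an ideal capacitor and that no useful computation follows the checkpoint within the same power cycle (both simplifications made only for this sketch), the energy left unused in $C_{ES}$ when checkpointing at a voltage $V_{ckpt}$ instead of 3.4V is
\[
\Delta E = \frac{1}{2}\, C_{ES} \left( V_{ckpt}^2 - (3.4\,\mathrm{V})^2 \right),
\]
which for $C_{ES} = 1100\,\mu\mathrm{F}$ gives $\Delta E \approx 1.17\,\mathrm{mJ}$ at $V_{ckpt} = 3.7\,\mathrm{V}$ and $\Delta E \approx 2.44\,\mathrm{mJ}$ at $V_{ckpt} = 4.0\,\mathrm{V}$ per power cycle.
This estimate ignores conversion losses and the energy wasted at power-off, so it indicates only the amount of stored energy left unused and does not by itself reproduce the measured slowdowns.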
% Consequently, to design efficient checkpoint techniques, it important to minimize the margin between checkpoint execution and the power-off.
Consequently, for maximum power efficiency, checkpoint techniques should minimize the margin between the checkpoint execution and the power-off.
% Consequently, delaying checkpoint executions is crucial when designing state-retention techniques.
Achieving this fundamentally depends on accurately predicting imminent power failures, which is the focus of the next section.
% Consequently, it is important to execute as long as possible whenever the system wakes up.
% In the next section, we discuss how this can be implemented in the existing intermittent systems.

\subsection{Using $V_{dd}$ with a Reference Voltage for Checkpoint Signals}
\label{sec:use_vdd_for_checkpoint}

Sec.~\ref{sec:predicting_power_failures} demonstrates that $V_{ES}$ is not a reliable estimate for the system's remaining execution time and that low $V_{dd}$ is the direct cause of power-off.
Based on this insight, we propose using $V_{dd}$ to detect imminent power failures more accurately, as in works without a power management system (Sec.~\ref{sec:related_work}).
We present two efficient implementations, $S_{sta}$ and $S_{dyn}$, that accurately detect imminent power-off events in approaches similar to \emph{static} and \emph{dynamic}, respectively.

% Sec.~\ref{sec:predicting_power_failures} demonstrates that $V_{ES}$ is not a good estimate for the system's remaining execution time.
% Instead, we propose using $V_{dd}$ to more accurately estimate the imminent power-off events, similar to approaches used in works without power management system (Sec.~\ref{sec:related_work}).
% We propose setups that operate correctly below the normal $V_{dd}$, by accounting for the operations of ADC in sub-normal voltage conditions (Sec.~\ref{sec:sub_normal_execution}).
% Additionally, when obtaining $V_{dd}$, it is important to account for the operations of ADC in sub-normal voltage conditions (Sec.~\ref{sec:sub_normal_execution}).

Meanwhile, when designing techniques using $V_{dd}$, designers should account for the behaviors of analog components at sub-normal voltages (Sec.~\ref{sec:sub_normal_execution}).
For consistent operation of ADCs, we adopt a voltage source with a known value of $V_{ref}$.
In STM32L5 and MSP430, an internal reference voltage source of 1.2V is available; alternatively, an external voltage reference (e.g., TI LMV431~\cite{texasinstrumentsLMV431}) can be used.
Note that $V_{ref}$ should be lower than the minimum operating voltage of the MCU (e.g., 1.7V), as $V_{ref}$ is generated by regulating $V_{dd}$.

$S_{sta}$ is designed for techniques similar to \emph{static}, which query whether to execute a checkpoint at checkpoint triggers.
Since directly reading $V_{dd}$ is infeasible (i.e., $V_{dd}$ itself serves as the reference voltage of the ADC), $S_{sta}$ reads $V_{ref}$ instead.
% Instead of reading $V_{ES}$ at checkpoint triggers, $S_{sta}$ reads $V_{ref}$. 
This yields a constant value of $\lfloor V_{ref}/V_{dd} \cdot 2^n \rfloor$ when operating at normal voltage, where $n$ is the ADC resolution.
On the other hand, during sub-normal voltage executions, this value increases as $V_{dd}$ decreases, as discussed in Sec.~\ref{sec:sub_normal_execution}.
As a result, given that the target threshold voltage for checkpoint execution is $V_{th}$, software designers can compare the ADC value against $\lfloor V_{ref}/V_{th} \cdot 2^n \rfloor$ to determine whether to execute a checkpoint.
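For instance, assuming a 12-bit ADC ($n = 12$), the 1.2V internal reference, and a hypothetical threshold of $V_{th} = 1.8$V (values chosen here only for illustration), the comparison value is
\[
\left\lfloor \frac{V_{ref}}{V_{th}} \cdot 2^{n} \right\rfloor
= \left\lfloor \frac{1.2}{1.8} \cdot 4096 \right\rfloor
= \lfloor 2730.67 \rfloor = 2730 .
\]
Under an assumed normal supply of $V_{dd} = 3.3$V, the same reading would be $\lfloor 1.2/3.3 \cdot 4096 \rfloor = 1489$; because the reading rises as $V_{dd}$ falls, a reading at or above 2730 indicates that $V_{dd}$ has dropped to roughly $V_{th}$.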

On the other hand, $S_{dyn}$ utilizes an on-chip comparator, which is available in most modern MCUs including STM32L5 and MSP430.
As $V_{ref}$ is always lower than $V_{dd}$, we use a voltage divider consisting of two resistors, $R1$ and $R2$, to scale $V_{dd}$ and compare it with $V_{ref}$.
Specifically, we configure $R1$ and $R2$ to satisfy $\frac{R2}{R1+R2} \cdot V_{th} = V_{ref}$, so the comparator generates an interrupt when $V_{dd}$ reaches the threshold $V_{th}$.
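Rearranging this condition gives the required resistor ratio.
As an illustrative sketch with $V_{ref} = 1.2$V and a hypothetical threshold of $V_{th} = 1.8$V,
\[
\frac{R1}{R2} = \frac{V_{th}}{V_{ref}} - 1 = \frac{1.8}{1.2} - 1 = 0.5 ,
\]
which is satisfied by, e.g., $R1 = 100\,\mathrm{k}\Omega$ and $R2 = 200\,\mathrm{k}\Omega$ (example values only, not a recommended design).
With these values the divider itself draws $V_{dd}/(R1+R2)$, about $6\,\mu\mathrm{A}$ at $V_{dd} = 1.8$V, so in practice the absolute resistances trade divider current against noise sensitivity.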

% T2 is setup for static checkpoint techniques, which poll the capacitor voltage to determine whether execute checkpoint or not.
% Instead of reading the capacitor voltage, it reads the reference voltage.
% As we discussed in Sec.~\ref{sec:sub_normal_execution}, the voltage remains same while the system executes at normal voltage but the value increases during sub-normal voltage execution.

% \begin{itemize}
%     \item T1 utilizes a on-chip comparator (available both in STM32L5 and MSP430) with a reference voltage.
%     \item T2.
% \end{itemize}

\begin{figure}
    \centering
    \begin{subfigure}{\linewidth}
        \includegraphics[width=\textwidth]{figs/plot_expr_11_cropped.pdf}
        \caption{Static checkpointing with $S_{sta}$.}
        \label{fig:expr_precise_checkpoint_timings_static}
        \vspace{3pt}
    \end{subfigure}
    \begin{subfigure}{\linewidth}
        \includegraphics[width=\textwidth]{figs/plot_expr_10_cropped.pdf}
        \caption{Dynamic checkpointing with $S_{dyn}$.}
        \label{fig:expr_precise_checkpoint_timings_dynamic}
    \end{subfigure}
    \caption{Impact of precise checkpoint timings on the end-to-end execution times.}
    \label{fig:expr_precise_checkpoint_timings}
\end{figure}

Fig.~\ref{fig:expr_precise_checkpoint_timings} compares the average execution times of the benchmarks over 30 iterations between traditional systems and the proposed setups.
Fig.~\ref{fig:expr_precise_checkpoint_timings_static} and Fig.~\ref{fig:expr_precise_checkpoint_timings_dynamic} illustrate the results of $S_{sta}$ and $S_{dyn}$, respectively.
% illustrates the performance of $S_{sta}$ and Fig.~\ref{fig:expr_precise_checkpoint_timings_dynamic} presents the result for $S_{dyn}$.
The whiskers indicate the minimum and maximum execution times for each benchmark.
The results show significant improvements in execution times for both setups, with average gains of 3.04x for $S_{sta}$ and 2.85x for $S_{dyn}$.
While the effectiveness of checkpoint schemes varies with application characteristics, our setups improve performance consistently across all benchmarks.
This underscores the importance of accurately detecting power-off events for efficient intermittent system operation.
% It clearly demonstrates that the both setups can extend the operation at sub-normal voltages: 3.04x in $S_{sta}$ and 2.85x in $S_{dyn}$.
% Furthermore, these improvements are consistent across all benchmarks, regardless of the application characteristics, highlighting the general effectiveness of the proposed setups.

Another advantage of the proposed setups is their simplicity and practical applicability.
Since both setups only modify the method for detecting imminent power failures and leave the checkpoint algorithms unchanged, they are straightforward to apply to existing techniques.
Furthermore, the proposed setups can reduce the system complexity, as they eliminate the need for communication between the energy storage system and the computing system (e.g., interrupt or access to $V_{ES}$).

% \subsection{Checkpoint Techniques and Evaluation Methods}
\subsection{On Selecting Hardware Components}

Our model also helps designers select efficient hardware components across various parameters.
For example, it reveals that the operating voltage of peripherals (e.g., external NVMs) is a critical design consideration (Sec.~\ref{sec:sub_normal_execution}), often more important than other factors such as latency.
% We evaluate this tradeoff by simulating an external FRAM having faster access latency but smaller operating voltage.

To evaluate this tradeoff, we simulate two FRAM configurations, F1 and F2, in our reference system.
F1 represents a slower setup capable of operating down to 2.5V. 
This is achieved by doubling the software-configurable wait time for FRAM accesses.
F2 is set to have the lowest access latency but requires the system to stop operating at 2.8V.

\begin{figure}
    \centering
    \includegraphics[width=\linewidth]{figs/plot_expr_12_cropped.pdf}
    \caption{Impact of peripheral operating voltage.}
    \label{fig:expr_peripheral_voltage}
\end{figure}

Fig.~\ref{fig:expr_peripheral_voltage} presents the execution times of the benchmarks for the two configurations in $S_{dyn}$, averaged over 30 runs.
Despite its doubled latency, F1 completes the workloads 1.46x faster on average, with consistent improvements across all benchmarks.
These results suggest that using a slower FRAM that operates down to 1.8V (e.g.,~\cite{fujitsuMB85R4M2T}) could considerably improve the performance of our reference system.
This example clearly shows that operating voltage, often overlooked in the traditional model, should be considered a critical design parameter.

Finally, our model highlights the advantages of using smaller decoupling capacitors.
Larger buffers not only increase the ratio of sub-normal voltage operations but also raise the amount of energy discharged during power-offs.
Indeed, in our reference system with $C_{ES} = 1100\mu$F, we observe that completing the benchmarks takes 1.18x and 1.36x longer on average when 440$\mu$F and 660$\mu$F capacitors are used as C2, respectively, compared to our setup with a 220$\mu$F capacitor.
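For a rough sense of scale, consider an ideal-capacitor sketch in which C2 is fully discharged from a hypothetical power-off voltage of $V_{off} = 1.7$V (an assumed value used only for illustration); the energy lost per power-off is then
\[
E_{off} = \frac{1}{2}\, C_2\, V_{off}^2 ,
\]
that is, approximately $0.32\,\mathrm{mJ}$, $0.64\,\mathrm{mJ}$, and $0.95\,\mathrm{mJ}$ for 220$\mu$F, 440$\mu$F, and 660$\mu$F, respectively.
Under these assumptions the discharged energy grows linearly with the decoupling capacitance.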
% As a result, it is a good design practice to use the smallest decoupling capacitors for efficiency of intermittent systems.

% \begin{figure}
%     \centering
%     \includegraphics[width=\linewidth]{figs/plot_expr_12_cropped.pdf}
%     \caption{Execution times with varying decoupling capacitors.}
%     % \label{fig:expr_checkpoint_voltages}
% \end{figure}

% Power failure injection (soft reset)~\cite{wuIntOS2024,yildizEfficient2023}.