DevHeads IoT Integration Server•11mo ago

How do I optimize memory usage for a neural network running on an ARM Cortex-M4 using CMSIS-NN?

@Middleware & OS How do I optimize memory usage for a neural network running on an ARM Cortex-M4 using CMSIS-NN? My current model runs out of memory. Here's my code:

#include "arm_nnfunctions.h"

void run_nn(const q7_t* input_data) {
  q7_t intermediate_buffer[INTERMEDIATE_SIZE];
  q7_t output_data[OUTPUT_SIZE];
  // Run the network layers
  arm_convolve_HWC_q7_basic(input_data, CONV1_WEIGHT, CONV1_BIAS, intermediate_buffer

4 Replies

Solution

wafa_ath•11mo ago

Did you consider reusing buffers for the intermediate and output data?

Enthernet Code•11mo ago

No, but I think I'll try it out. They're temporary, right?

wafa_ath•11mo ago

Yes, exactly. You can save memory because each buffer is only needed temporarily during computation: once a layer's output has been consumed by the next layer, its storage can be reused.
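To put a rough number on the saving, here is a minimal sketch in plain C. It assumes a simple layer chain where each layer reads only the previous layer's output, and it assumes two alternating ("ping-pong") buffers, which is one common way to reuse buffers rather than anything CMSIS-NN prescribes. The function names and example sizes are hypothetical. Sizes are in bytes, which equals the element count for q7_t data.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed if every layer output gets its own dedicated buffer. */
size_t separate_buffers_bytes(const size_t *out_sizes, size_t n_layers)
{
    assert(out_sizes != NULL);
    size_t total = 0;
    for (size_t i = 0; i < n_layers; ++i) {
        total += out_sizes[i];
    }
    return total;
}

/* Bytes needed with two reused buffers: even-numbered layers write to one
 * buffer, odd-numbered layers to the other. Each buffer must be large
 * enough for the biggest output it will ever hold. */
size_t ping_pong_bytes(const size_t *out_sizes, size_t n_layers)
{
    assert(out_sizes != NULL);
    size_t max_even = 0;
    size_t max_odd = 0;
    for (size_t i = 0; i < n_layers; ++i) {
        if (i % 2 == 0) {
            if (out_sizes[i] > max_even) max_even = out_sizes[i];
        } else {
            if (out_sizes[i] > max_odd) max_odd = out_sizes[i];
        }
    }
    return max_even + max_odd;
}
```

For hypothetical layer output sizes of {1024, 1024, 512, 512, 10} bytes, the first function returns 3082 and the second returns 2048. The gap grows with network depth, since the reused total depends only on the largest outputs.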

Enthernet Code•11mo ago

Thanks, this was helpful. I tried it out 👇

#include <string.h>  // memcpy
#include "arm_nnfunctions.h"

#define INTERMEDIATE_SIZE 1024
#define OUTPUT_SIZE 512
#define MAX_BUFFER_SIZE ((INTERMEDIATE_SIZE > OUTPUT_SIZE) ? INTERMEDIATE_SIZE : OUTPUT_SIZE)

void run_nn(const q7_t* input_data) {
  // Use a single buffer for both intermediate and output data
  q7_t shared_buffer[MAX_BUFFER_SIZE];
  
  // Run the network layers using the shared buffer.
  // Argument list abbreviated here: the full CMSIS-NN call also takes
  // dimension, padding, stride, shift, and scratch-buffer arguments.
  arm_convolve_HWC_q7_basic(input_data, CONV1_WEIGHT, CONV1_BIAS, shared_buffer);
  
  // Continue with other layers, reusing the shared buffer
  // For example (only safe if the layer has finished reading its input
  // before it overwrites it; otherwise use a second buffer):
  // arm_fully_connected_q7(shared_buffer, FC1_WEIGHT, FC1_BIAS, shared_buffer);
  
  // Copy final output to the output data buffer if needed
  q7_t output_data[OUTPUT_SIZE];
  memcpy(output_data, shared_buffer, OUTPUT_SIZE * sizeof(q7_t));
}
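Two things to watch in the code above: shared_buffer lives on the stack, and the fully connected example passes the same buffer as both input and output. A possible alternative is sketched below, assuming two statically allocated buffers that alternate between layers. This is plain C with no CMSIS-NN dependency: the q7_t typedef, toy_layer, run_nn_sketch, and all sizes are hypothetical stand-ins for illustration, not the library's API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int8_t q7_t;      /* stand-in for the CMSIS-NN q7_t type (assumption) */

#define INPUT_SIZE   16   /* hypothetical sizes, chosen for illustration */
#define LAYER1_SIZE   8
#define LAYER2_SIZE   4
#define LAYER3_SIZE   2

/* buf_a holds the outputs of layers 1 and 3, buf_b the output of layer 2.
 * Static storage goes into .bss rather than the stack, so the cost is
 * visible in the linker map and cannot overflow the stack at run time. */
static q7_t buf_a[LAYER1_SIZE];   /* max(LAYER1_SIZE, LAYER3_SIZE) */
static q7_t buf_b[LAYER2_SIZE];

/* Hypothetical stand-in for a real layer: averages adjacent input pairs.
 * Like many real kernels, it is not written to run in place. */
static void toy_layer(const q7_t *in, size_t in_len, q7_t *out, size_t out_len)
{
    assert(in != out);
    assert(in_len == 2 * out_len);
    for (size_t i = 0; i < out_len; ++i) {
        out[i] = (q7_t)((in[2 * i] + in[2 * i + 1]) / 2);
    }
}

/* Runs three layers, alternating between the two buffers. The returned
 * pointer refers to buf_a and stays valid until the next call. */
const q7_t *run_nn_sketch(const q7_t *input_data)
{
    toy_layer(input_data, INPUT_SIZE, buf_a, LAYER1_SIZE);  /* in    -> buf_a */
    toy_layer(buf_a, LAYER1_SIZE, buf_b, LAYER2_SIZE);      /* buf_a -> buf_b */
    toy_layer(buf_b, LAYER2_SIZE, buf_a, LAYER3_SIZE);      /* buf_b -> buf_a */
    return buf_a;
}
```

Returning a pointer into the static buffer also avoids the final memcpy, at the cost of making the function non-reentrant. If the result has to outlive the next inference, the caller would still need to copy it out.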
