How do I optimize memory usage for a neural network running on an ARM Cortex-M4 using CMSIS-NN?

@Middleware & OS How do I optimize memory usage for a neural network running on an ARM Cortex-M4 using CMSIS-NN? My current model runs out of memory. Here's my code:
#include "arm_nnfunctions.h"

void run_nn(const q7_t* input_data) {
    q7_t intermediate_buffer[INTERMEDIATE_SIZE];
    q7_t output_data[OUTPUT_SIZE];
    // Run the network layers (convolution arguments abbreviated)
    arm_convolve_HWC_q7_basic(input_data, CONV1_WEIGHT, CONV1_BIAS, intermediate_buffer);
    // ... remaining layers write into output_data ...
}
wafa_ath
wafa_ath•5mo ago
Did you consider reusing the buffers for the intermediate and output data?
Enthernet Code
Enthernet Code•5mo ago
No, but I think I'll try it out. They're temporary, right?
wafa_ath
wafa_ath•5mo ago
Yes, exactly. You can save memory since they're only needed temporarily during computation.
Enthernet Code
Enthernet Code•5mo ago
Thanks, this was helpful I tried it out 👇
#include "arm_nnfunctions.h"
#include <string.h>  // for memcpy

#define INTERMEDIATE_SIZE 1024
#define OUTPUT_SIZE 512
#define MAX_BUFFER_SIZE ((INTERMEDIATE_SIZE > OUTPUT_SIZE) ? INTERMEDIATE_SIZE : OUTPUT_SIZE)

void run_nn(const q7_t* input_data) {
    // Use a single buffer, sized for the largest activation,
    // for both intermediate and output data
    q7_t shared_buffer[MAX_BUFFER_SIZE];

    // Run the network layers using the shared buffer
    // (convolution arguments abbreviated; the real call also takes
    // dimensions, shifts, and a scratch buffer)
    arm_convolve_HWC_q7_basic(input_data, CONV1_WEIGHT, CONV1_BIAS, shared_buffer);

    // Continue with other layers, reusing the shared buffer.
    // Note: in-place reuse is only safe if the layer can read and write
    // the same memory; otherwise alternate between two buffers. For example:
    // arm_fully_connected_q7(shared_buffer, FC1_WEIGHT, FC1_BIAS, shared_buffer);

    // Copy the final output to a separate buffer if needed
    q7_t output_data[OUTPUT_SIZE];
    memcpy(output_data, shared_buffer, OUTPUT_SIZE * sizeof(q7_t));
}
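When a layer cannot safely read and write the same buffer, a common variation on the reuse idea is two activation buffers that alternate between layers ("ping-pong"), each sized to the largest activation in the network. A minimal, self-contained sketch of the pattern, where `dummy_layer` and the layer sizes are made-up stand-ins for the real CMSIS-NN calls and your model's dimensions:

```c
#include <stdint.h>
#include <stddef.h>

// Hypothetical layer output sizes -- replace with your model's real dimensions.
#define LAYER1_OUT 1024
#define LAYER2_OUT 512
#define MAX_ACT    ((LAYER1_OUT > LAYER2_OUT) ? LAYER1_OUT : LAYER2_OUT)

// Two activation buffers that alternate ("ping-pong") between layers,
// so a layer never reads and writes the same memory.
static int8_t buf_a[MAX_ACT];
static int8_t buf_b[MAX_ACT];

// Dummy stand-in for a CMSIS-NN layer call
// (e.g. arm_convolve_HWC_q7_basic or arm_fully_connected_q7).
static void dummy_layer(const int8_t *in, size_t in_len,
                        int8_t *out, size_t out_len) {
    for (size_t i = 0; i < out_len; i++)
        out[i] = (int8_t)(in[i % in_len] + 1);
}

const int8_t *run_nn_pingpong(const int8_t *input, size_t in_len) {
    dummy_layer(input, in_len, buf_a, LAYER1_OUT);      // layer 1: input -> A
    dummy_layer(buf_a, LAYER1_OUT, buf_b, LAYER2_OUT);  // layer 2: A -> B
    return buf_b;  // final activations end up in B
}
```

With this layout peak activation RAM is bounded by 2 * MAX_ACT no matter how deep the network is, and making the buffers static keeps them off the (often small) task stack.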