Espressif recently launched its latest-generation SoC, the ESP32-S31. Compared with ESP32-S3, the chip delivers a substantial gain in raw compute performance and integrates several key optimizations in hardware peripherals and communication interfaces. This article uses the main-controller upgrade of ESP-VoCat as a case study to explain ESP32-S31’s technical improvements and real-world performance.

Bluetooth Audio

ESP32-S31 adds full support for LE Audio and Classic Bluetooth, so an end device can play audio directly, like a standard Bluetooth speaker.

Figure: Bluetooth speaker demo (GIF)

With LE Audio broadcast, multiple ESP32-S31 devices can receive a single broadcast source in sync and decode the left and right channels independently. A device can also act as the audio broadcast transmitter, which suits multi-node synchronized playback.

Figure: LE Audio broadcast demo (GIF)

Higher Performance

The most direct improvement in ESP32-S31 is compute performance: the CPU frequency increases from 240 MHz to 320 MHz, and the underlying architecture moves to RISC-V. On CoreMark, overall performance is roughly 65% higher than ESP32-S3.
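As a rough sanity check, the quoted CoreMark gain can be split into the part explained by the clock increase and the part left over for the architecture change. This is a back-of-envelope sketch that treats "roughly 65%" as a 1.65x ratio and assumes CoreMark scales linearly with clock frequency. The function name is illustrative and the result is not a measurement.

```python
# Back-of-envelope split of the CoreMark gain quoted in the text.
S3_CLOCK_MHZ = 240
S31_CLOCK_MHZ = 320
COREMARK_GAIN = 1.65  # assumption: "roughly 65% higher" taken as a 1.65x ratio


def per_mhz_gain(total_gain: float, old_mhz: float, new_mhz: float) -> float:
    """Return the part of a speedup that the clock increase alone does not explain."""
    return total_gain / (new_mhz / old_mhz)


clock_ratio = S31_CLOCK_MHZ / S3_CLOCK_MHZ
efficiency_ratio = per_mhz_gain(COREMARK_GAIN, S3_CLOCK_MHZ, S31_CLOCK_MHZ)

print(f"clock ratio:   {clock_ratio:.3f}x")
print(f"per-MHz ratio: {efficiency_ratio:.4f}x")
```

Under these assumptions the clock accounts for about a third more throughput, and the remaining factor of roughly 1.24 would come from the core and memory system rather than from frequency.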

The higher clock speed makes many conventional vision algorithms run more smoothly. For example, color recognition with OpenCV responds noticeably faster.

Figure: OpenCV color recognition demo (GIF)
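The article does not show the color recognition code. A common approach is to convert each pixel to HSV and threshold on hue, which OpenCV usually does with `cvtColor` and `inRange`. The sketch below does the same in plain Python so the per-pixel work is visible. The hue ranges, thresholds, and function names are hypothetical and would need tuning for a real camera and lighting.

```python
import colorsys

# Hypothetical hue ranges in degrees; real projects tune these per camera and lighting.
HUE_RANGES = {
    "red": [(0, 15), (345, 360)],  # red wraps around the hue circle
    "green": [(90, 150)],
    "blue": [(210, 270)],
}


def classify_pixel(r, g, b, min_saturation=0.4, min_value=0.3):
    """Return a color name for an 8-bit RGB pixel, or None if it is too gray or dark."""
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    if s < min_saturation or v < min_value:
        return None
    hue_deg = h * 360
    for name, ranges in HUE_RANGES.items():
        if any(lo <= hue_deg < hi for lo, hi in ranges):
            return name
    return None


def dominant_color(pixels):
    """Return the most frequent recognized color among (r, g, b) pixels, or None."""
    counts = {}
    for r, g, b in pixels:
        name = classify_pixel(r, g, b)
        if name is not None:
            counts[name] = counts.get(name, 0) + 1
    return max(counts, key=counts.get) if counts else None


frame = [(255, 0, 0), (250, 10, 10), (0, 0, 255), (128, 128, 128)]
print(dominant_color(frame))
```

Every pixel goes through a conversion and several comparisons, so the cost grows with resolution and frame rate. That is why a faster CPU shows up directly as a more responsive result.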

ESP32-S31’s gains go beyond higher benchmark scores. Many hardware updates directly improve the experience in real projects.

Image Capture

On ESP32-S3, ESP-VoCat could not add a DVP camera because of limited I/O. ESP32-S31 expands available GPIO to 60 pins; with a DVP camera connected, 7 GPIOs remain free.

Figure: Camera demo (GIF)

ESP32-S31 includes a hardware JPEG encoder/decoder. After the camera captures a frame, the SoC can compress it with hardware JPEG encoding. Combined with native Wi-Fi 6 support, the device can stream video with very low latency.

Figure: Video streaming demo (GIF)

On-Device AI Vision

For edge AI inference, ESP32-S31 integrates AI instruction-set acceleration and raises the PSRAM interface clock from 80 MHz on ESP32-S3 to 250 MHz. The resulting higher data throughput yields a clear improvement in on-device model inference speed.

Figure: Gesture recognition demo (GIF)
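To put the PSRAM clock change in perspective, the sketch below estimates theoretical peak throughput from the interface clock. The bus width and transfers per clock are assumptions (an 8-bit, double-data-rate interface on both chips), not figures from this article. Only the ratio between the two clocks comes from the numbers quoted above.

```python
def peak_bandwidth_mb_s(clock_mhz, bus_width_bits=8, transfers_per_clock=2):
    """Theoretical peak throughput in MB/s for a memory interface.

    bus_width_bits and transfers_per_clock are illustrative assumptions.
    """
    return clock_mhz * transfers_per_clock * bus_width_bits / 8


s3_peak = peak_bandwidth_mb_s(80)    # ESP32-S3 clock quoted in the text
s31_peak = peak_bandwidth_mb_s(250)  # ESP32-S31 clock quoted in the text
speedup = s31_peak / s3_peak

print(f"assumed S3 peak:  {s3_peak:.0f} MB/s")
print(f"assumed S31 peak: {s31_peak:.0f} MB/s")
print(f"ratio: {speedup}x")
```

If the bus configuration is the same on both chips, the ratio is simply 250 / 80, whatever the actual width. Sustained throughput is lower than the theoretical peak on any real system.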

When running YOLO11n locally for object detection, ESP32-S31 delivers a significant gain in inference speed.

Display Performance

The hardware JPEG encoder/decoder also benefits the display path: images decode faster and CPU utilization drops noticeably.

Figure: ESP-VoCat (ESP32-S31) screen refresh demo (GIF)

ESP32-S31 supports RGB888, which uses 8 bits per channel instead of the 5, 6, and 5 bits of RGB565, for more accurate color reproduction. The difference is clearly visible when color gradients in the two formats are compared side by side.

Figure: RGB color format comparison (GIF)
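The visible banding in RGB565 gradients follows from the bit layout. As a minimal sketch, the code below packs a 256-step gray ramp into RGB565 and counts how many distinct values survive. The helper names are illustrative and do not come from any Espressif API.

```python
def rgb888_to_rgb565(r, g, b):
    """Pack 8-bit channels into 16 bits: 5 bits red, 6 bits green, 5 bits blue."""
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)


def rgb565_to_rgb888(value):
    """Expand RGB565 back to 8-bit channels by replicating the high bits."""
    r5 = (value >> 11) & 0x1F
    g6 = (value >> 5) & 0x3F
    b5 = value & 0x1F
    return ((r5 << 3) | (r5 >> 2), (g6 << 2) | (g6 >> 4), (b5 << 3) | (b5 >> 2))


# A gray ramp with one step per 8-bit level.
gradient = [(level, level, level) for level in range(256)]

distinct_888 = len(set(gradient))
distinct_565 = len({rgb888_to_rgb565(*pixel) for pixel in gradient})

print(f"distinct steps in RGB888: {distinct_888}")
print(f"distinct steps in RGB565: {distinct_565}")
```

The ramp keeps all 256 steps in RGB888 but collapses to 64 in RGB565, because green keeps only 6 bits and red and blue keep 5. Those merged steps are what appear as bands in a smooth gradient.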

ESP32-S31 supports display output up to 720p (HD), suitable for high-resolution UI rendering.

Figure: 720p display demo (GIF)
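The memory cost of a 720p display is easy to estimate. The sketch below computes the size of one 1280 x 720 frame buffer in RGB565 and RGB888, plus the data rate needed to refresh it. The 30 fps figure is a hypothetical value chosen for illustration and is not a specification from this article.

```python
HD_WIDTH, HD_HEIGHT = 1280, 720


def framebuffer_bytes(width, height, bits_per_pixel):
    """Size of one uncompressed frame in bytes."""
    return width * height * bits_per_pixel // 8


def refresh_bytes_per_second(width, height, bits_per_pixel, fps):
    """Data rate needed to push every frame to the panel."""
    return framebuffer_bytes(width, height, bits_per_pixel) * fps


rgb565_frame = framebuffer_bytes(HD_WIDTH, HD_HEIGHT, 16)
rgb888_frame = framebuffer_bytes(HD_WIDTH, HD_HEIGHT, 24)
ASSUMED_FPS = 30  # hypothetical refresh rate for illustration
rgb888_rate = refresh_bytes_per_second(HD_WIDTH, HD_HEIGHT, 24, ASSUMED_FPS)

print(f"RGB565 frame: {rgb565_frame} bytes")
print(f"RGB888 frame: {rgb888_frame} bytes")
print(f"RGB888 at {ASSUMED_FPS} fps: {rgb888_rate} bytes/s")
```

A single RGB888 frame at this resolution is about 2.6 MiB, so frame buffers of this size would normally live in external PSRAM. That makes memory throughput matter for UI rendering as well as for inference.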

ESP32-S31 also integrates a Pixel Processing Accelerator (PPA) and 2D-DMA. Dedicated graphics hardware handles data movement; operations such as rotation, scaling, mirroring, and color format conversion are hardware-accelerated.

Figure: Graphics hardware acceleration demo (GIF)

With PPA handling the work, the camera preview can be scaled in real time.

Figure: PPA scaling the camera preview (GIF)
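To show what scaling costs when the CPU does it, here is a minimal nearest-neighbor scaler in plain Python. It is a software reference for the kind of operation described above, not the PPA driver API. The function name and the flat row-major pixel list are assumptions made for this illustration.

```python
def scale_nearest(src, src_w, src_h, dst_w, dst_h):
    """Scale a row-major pixel list with nearest-neighbor sampling."""
    dst = []
    for y in range(dst_h):
        src_y = y * src_h // dst_h   # source row sampled for this output row
        row_start = src_y * src_w
        for x in range(dst_w):
            src_x = x * src_w // dst_w  # source column sampled for this output pixel
            dst.append(src[row_start + src_x])
    return dst


# 2x2 image upscaled to 4x4: each source pixel becomes a 2x2 block.
small = [1, 2,
         3, 4]
large = scale_nearest(small, 2, 2, 4, 4)
for row in range(4):
    print(large[row * 4:(row + 1) * 4])
```

The inner loop runs once per output pixel, so a software scaler has to keep up with the full output resolution on every frame. Moving this loop into dedicated hardware frees the CPU for other work.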

When display rotation is done in software, the CPU saturates easily and the frame rate drops sharply. With hardware acceleration, ESP32-S31 maintains relatively smooth performance even when rotation is enabled.

Figure: Display rotation performance comparison (GIF)
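The reason software rotation is expensive can be seen in a minimal sketch of a 90-degree clockwise rotation. This is a plain Python reference for the per-pixel work involved, not code from ESP-VoCat or the PPA driver. The function name and the flat row-major buffer layout are assumptions.

```python
def rotate_90_cw(src, width, height):
    """Rotate a row-major image 90 degrees clockwise.

    The result is `height` pixels wide and `width` pixels tall.
    """
    dst = [0] * (width * height)
    for y in range(height):
        for x in range(width):
            # Source pixel (x, y) lands at column (height - 1 - y), row x.
            dst[x * height + (height - 1 - y)] = src[y * width + x]
    return dst


# 3x2 image:      rotated 2x3 image:
#   1 2 3           4 1
#   4 5 6           5 2
#                   6 3
image = [1, 2, 3,
         4, 5, 6]
rotated = rotate_90_cw(image, 3, 2)
print(rotated)

pixels_per_720p_frame = 1280 * 720
print(f"pixel moves per 720p frame: {pixels_per_720p_frame}")
```

Each pixel is read along a row and written along a column. A 720p frame therefore needs 921,600 moves, and the writes are scattered through memory in a way that tends to be unfriendly to caches, which makes this a natural task for dedicated hardware.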