Writing Python Extensions in C++ for Performance
Tags: Python, C++, Performance
Python is slow at tight, CPU-bound loops. C++ is fast at exactly that kind of work. Combine them and you get the best of both worlds.
When to Use C++ Extensions
- CPU-bound operations (image processing, simulations)
- Performance-critical code (hot loops; see the profiling sketch below this list)
- Existing C++ libraries you want to use from Python
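Before reaching for C++ at all, it is worth confirming with a profiler that the suspected loop really dominates the runtime. A minimal sketch using the standard-library cProfile, where simulate() is only a stand-in for your own code:
import cProfile
import pstats

def simulate():
    # Stand-in for a CPU-bound hot loop in your own program.
    total = 0
    for i in range(1_000_000):
        total += i * i
    return total

cProfile.run("simulate()", "profile.out")

# Show the ten most expensive calls by cumulative time.
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(10)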
Tool: pybind11
pip install pybind11
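To confirm the install (and locate the headers if you ever compile by hand instead of using setup.py), the package exposes both from Python; a quick check:
import pybind11

print(pybind11.__version__)    # installed pybind11 version
print(pybind11.get_include())  # directory containing pybind11's C++ headers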
Simple Example
C++ Code
// example.cpp
#include <pybind11/pybind11.h>

int add(int a, int b) {
    return a + b;
}

PYBIND11_MODULE(example, m) {
    m.def("add", &add, "Add two numbers");
}
Build
# setup.py
from pybind11.setup_helpers import Pybind11Extension, build_ext
from setuptools import setup

ext_modules = [
    Pybind11Extension("example", ["example.cpp"]),
]

setup(
    name="example",
    ext_modules=ext_modules,
    cmdclass={"build_ext": build_ext},
)
pip install .
Use in Python
import example
print(example.add(2, 3)) # 5
Real-World Example: Image Processing
Python (Slow)
import numpy as np

def blur_image(image):
    height, width = image.shape
    result = np.zeros_like(image)
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Cast to int so uint8 neighbors don't wrap around when summed.
            result[y, x] = (
                int(image[y - 1, x]) + int(image[y + 1, x]) +
                int(image[y, x - 1]) + int(image[y, x + 1])
            ) // 4
    return result
# Time: 2.5 seconds for a 1000x1000 image
C++ (Fast)
#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>
#include <cstring>

namespace py = pybind11;

py::array_t<uint8_t> blur_image(py::array_t<uint8_t> input) {
    py::buffer_info buf = input.request();

    // Allocate an output array with the same 2-D shape and zero it,
    // so the border pixels we never write stay at 0.
    auto result = py::array_t<uint8_t>(buf.shape);
    py::buffer_info res_buf = result.request();
    std::memset(res_buf.ptr, 0, res_buf.size * res_buf.itemsize);

    auto *ptr = static_cast<uint8_t *>(buf.ptr);
    auto *res_ptr = static_cast<uint8_t *>(res_buf.ptr);
    int height = static_cast<int>(buf.shape[0]);
    int width = static_cast<int>(buf.shape[1]);

    // Average the four nearest neighbors of every interior pixel.
    for (int y = 1; y < height - 1; y++) {
        for (int x = 1; x < width - 1; x++) {
            int idx = y * width + x;
            res_ptr[idx] = static_cast<uint8_t>((
                ptr[(y - 1) * width + x] + ptr[(y + 1) * width + x] +
                ptr[y * width + (x - 1)] + ptr[y * width + (x + 1)]
            ) / 4);
        }
    }
    return result;
}

PYBIND11_MODULE(image_ops, m) {
    m.def("blur_image", &blur_image);
}
Speedup: 50x faster (50ms vs 2.5s)
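A quick sanity check of the binding from Python, assuming the extension was compiled as image_ops with a setup.py analogous to the earlier one; the vectorized NumPy expression here is only a reference for verifying the interior pixels:
import numpy as np
import image_ops  # compiled from the C++ source above (build recipe assumed to mirror the earlier setup.py)

image = np.random.randint(0, 256, (1000, 1000), dtype=np.uint8)
blurred = image_ops.blur_image(image)
assert blurred.shape == image.shape  # border pixels are left at zero

# Vectorized reference for the interior; uint16 avoids overflow while summing.
up, down = image[:-2, 1:-1].astype(np.uint16), image[2:, 1:-1]
left, right = image[1:-1, :-2], image[1:-1, 2:]
reference = ((up + down + left + right) // 4).astype(np.uint8)
assert np.array_equal(blurred[1:-1, 1:-1], reference)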
Working with NumPy
#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>

namespace py = pybind11;

// Doubles every element of a float64 array in place and returns it.
py::array_t<double> process_array(py::array_t<double> input) {
    py::buffer_info buf = input.request();
    double *ptr = static_cast<double *>(buf.ptr);

    for (py::ssize_t i = 0; i < buf.size; i++) {
        ptr[i] *= 2;  // Double each element
    }
    return input;
}
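The snippet above doesn't show a PYBIND11_MODULE block, so the module name below, array_ops, is hypothetical; the point is that a matching float64 array is modified in place and handed straight back:
import numpy as np
import array_ops  # hypothetical module name; expose process_array with m.def as in the earlier examples

a = np.arange(5, dtype=np.float64)
doubled = array_ops.process_array(a)

print(doubled)  # [0. 2. 4. 6. 8.]
print(a)        # same values: the float64 input was modified in place, not copied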
Classes and Objects
#include <pybind11/pybind11.h>

namespace py = pybind11;

class Calculator {
public:
    Calculator(int initial) : value(initial) {}
    void add(int x) { value += x; }
    int get() const { return value; }
private:
    int value;
};

PYBIND11_MODULE(calc, m) {
    py::class_<Calculator>(m, "Calculator")
        .def(py::init<int>())
        .def("add", &Calculator::add)
        .def("get", &Calculator::get);
}
from calc import Calculator
c = Calculator(10)
c.add(5)
print(c.get()) # 15
Benchmarking
import time
import numpy as np

# blur_image_python is the pure-Python function above;
# blur_image_cpp is image_ops.blur_image from the compiled extension.
image = np.random.randint(0, 256, (1000, 1000), dtype=np.uint8)

# Python version
start = time.perf_counter()
result_py = blur_image_python(image)
print(f"Python: {time.perf_counter() - start:.3f}s")

# C++ version
start = time.perf_counter()
result_cpp = blur_image_cpp(image)
print(f"C++: {time.perf_counter() - start:.3f}s")
Conclusion
C++ extensions give you:
- 50-100x speedups for CPU-bound code
- Access to C++ libraries from Python
- Best of both worlds (Python ease + C++ speed)
When to use:
- Profiling shows Python is the bottleneck
- You have existing C++ code
- NumPy/Numba aren’t fast enough
Have you written Python extensions? What speedups did you achieve?