Building a Custom Thread Pool in C++ from Scratch
A thread pool is one of the most fundamental concurrency patterns in systems programming. Instead of spawning a new thread for each task (expensive and wasteful), we maintain a pool of worker threads that process tasks from a shared queue. In this post, we’ll build a production-quality thread pool from scratch in C++17.
Why Thread Pools Matter
Creating and destroying threads is expensive. Each thread creation involves:
- Kernel syscalls for thread allocation
- Stack memory allocation (typically 1-8MB per thread)
- Context switching overhead
A thread pool amortizes these costs by reusing threads. This is critical for high-throughput servers, parallel computation engines, and any system that processes many short-lived tasks.
Architecture Overview
Our thread pool will have three main components:
- Task Queue: A thread-safe queue holding pending work
- Worker Threads: Fixed number of threads that consume tasks
- Synchronization Primitives: Mutexes and condition variables for coordination
```cpp
#include <vector>
#include <queue>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <type_traits>

class ThreadPool {
public:
    explicit ThreadPool(size_t numThreads);
    ~ThreadPool();

    template<typename F, typename... Args>
    auto enqueue(F&& f, Args&&... args)
        -> std::future<std::invoke_result_t<F, Args...>>;

private:
    std::vector<std::thread> workers;
    std::queue<std::function<void()>> tasks;
    std::mutex queueMutex;
    std::condition_variable condition;
    bool stop;

    void workerThread();
};
```

Note that we use std::invoke_result_t rather than std::result_of: the latter is deprecated in C++17 (and removed in C++20).
Implementation: Worker Thread Logic
Each worker thread runs an infinite loop, waiting for tasks:
```cpp
void ThreadPool::workerThread() {
    while (true) {
        std::function<void()> task;
        {
            std::unique_lock<std::mutex> lock(queueMutex);
            // Wait until there's a task or we're stopping
            condition.wait(lock, [this] {
                return stop || !tasks.empty();
            });
            if (stop && tasks.empty()) {
                return; // Exit thread
            }
            task = std::move(tasks.front());
            tasks.pop();
        }
        task(); // Execute outside the lock
    }
}
```
Key Design Decisions:
- Scoped Locking: The mutex is only held while accessing the queue, not during task execution
- Condition Variable: Efficiently puts threads to sleep instead of busy-waiting
- Move Semantics: std::move avoids copying the task function object
Constructor: Spawning Workers
```cpp
ThreadPool::ThreadPool(size_t numThreads) : stop(false) {
    for (size_t i = 0; i < numThreads; ++i) {
        workers.emplace_back([this] { workerThread(); });
    }
}
```
We use emplace_back to construct threads in-place, and capture this to access member variables.
Enqueue: Adding Tasks with Futures
The enqueue method is the most complex part. It needs to:
- Accept any callable with any arguments
- Return a std::future for the result
- Handle exceptions properly
```cpp
template<typename F, typename... Args>
auto ThreadPool::enqueue(F&& f, Args&&... args)
    -> std::future<std::invoke_result_t<F, Args...>>
{
    using ReturnType = std::invoke_result_t<F, Args...>;

    // Wrap the task in a packaged_task so the caller gets a future
    auto task = std::make_shared<std::packaged_task<ReturnType()>>(
        std::bind(std::forward<F>(f), std::forward<Args>(args)...)
    );
    std::future<ReturnType> result = task->get_future();
    {
        std::unique_lock<std::mutex> lock(queueMutex);
        if (stop) {
            throw std::runtime_error("enqueue on stopped ThreadPool");
        }
        tasks.emplace([task]() { (*task)(); });
    }
    condition.notify_one(); // Wake up one worker
    return result;
}
```
Why std::packaged_task?
std::packaged_task connects a callable to a std::future, letting the caller retrieve the result asynchronously. We wrap it in a shared_ptr because the queue stores std::function objects, which require copyable callables, while packaged_task itself is move-only.
Destructor: Graceful Shutdown
```cpp
ThreadPool::~ThreadPool() {
    {
        std::unique_lock<std::mutex> lock(queueMutex);
        stop = true;
    }
    condition.notify_all(); // Wake all threads
    for (std::thread& worker : workers) {
        worker.join(); // Wait for completion
    }
}
```
This ensures all pending tasks complete before destruction.
Usage Example
```cpp
#include <chrono>
#include <iostream>
#include <vector>
// ... plus the ThreadPool definition from above

int main() {
    ThreadPool pool(4); // 4 worker threads
    std::vector<std::future<int>> results;

    for (int i = 0; i < 8; ++i) {
        results.emplace_back(
            pool.enqueue([i] {
                std::this_thread::sleep_for(std::chrono::seconds(1));
                return i * i;
            })
        );
    }
    for (auto& result : results) {
        std::cout << result.get() << " ";
    }
    return 0;
}
```
Output: 0 1 4 9 16 25 36 49. The tasks themselves finish in any order, but the printed order is fixed because we call get() on the futures in submission order.
Performance Considerations
Thread Count Selection
Rule of thumb: use std::thread::hardware_concurrency() threads for CPU-bound tasks, and more for I/O-bound tasks (threads blocked on I/O don't occupy a core). Note that hardware_concurrency() may return 0 when the count cannot be determined, so guard against it:

```cpp
const size_t n = std::thread::hardware_concurrency();
ThreadPool pool(n > 0 ? n : 4); // fall back to a fixed default
```
False Sharing
If tasks modify adjacent memory locations, cache line ping-ponging can occur. Solution: pad data structures to cache line boundaries (typically 64 bytes).
Lock Contention
Our implementation uses a single mutex. For extreme throughput, consider:
- Work Stealing: Each thread has its own queue, can steal from others
- Lock-Free Queues: Using atomic operations instead of mutexes
Advanced: Work Stealing
```cpp
class WorkStealingThreadPool {
    std::vector<std::deque<Task>> perThreadQueues;
    std::vector<std::mutex> perThreadMutexes;

    void workerThread(size_t threadId) {
        while (true) {
            Task task;
            // Try own queue first
            if (tryPopLocal(threadId, task)) {
                task();
                continue;
            }
            // Try stealing from others
            if (tryStealFrom(threadId, task)) {
                task();
                continue;
            }
            // No work anywhere: yield the CPU. A real implementation
            // would back off or block on a condition variable instead
            // of spinning.
            std::this_thread::yield();
        }
    }
};
```
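The tryPopLocal and tryStealFrom helpers are left undefined above. A minimal mutex-based sketch might look like this, with Task assumed to be std::function<void()> and the queues factored into their own class for clarity:

```cpp
#include <deque>
#include <functional>
#include <mutex>
#include <vector>

using Task = std::function<void()>; // assumed definition

class WorkStealingQueues {
public:
    explicit WorkStealingQueues(size_t n) : queues(n), mutexes(n) {}

    void push(size_t threadId, Task task) {
        std::lock_guard<std::mutex> lock(mutexes[threadId]);
        queues[threadId].push_back(std::move(task));
    }

    // Owner takes the most recently pushed task (LIFO, cache-warm).
    bool tryPopLocal(size_t threadId, Task& task) {
        std::lock_guard<std::mutex> lock(mutexes[threadId]);
        if (queues[threadId].empty()) return false;
        task = std::move(queues[threadId].back());
        queues[threadId].pop_back();
        return true;
    }

    // Thief takes the oldest task from another queue (the opposite
    // end), minimizing contention with the owner.
    bool tryStealFrom(size_t thiefId, Task& task) {
        for (size_t victim = 0; victim < queues.size(); ++victim) {
            if (victim == thiefId) continue;
            std::lock_guard<std::mutex> lock(mutexes[victim]);
            if (queues[victim].empty()) continue;
            task = std::move(queues[victim].front());
            queues[victim].pop_front();
            return true;
        }
        return false;
    }

private:
    std::vector<std::deque<Task>> queues;
    std::vector<std::mutex> mutexes;
};
```

Owner-pops-newest, thief-steals-oldest is the classic work-stealing convention: the owner keeps tasks whose data is still hot in its cache, while thieves take work the owner would reach last anyway.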
Common Pitfalls
- Deadlock: Never enqueue a task that waits for another task in the same pool
- Exception Safety: an exception thrown inside a task is captured by the packaged_task and rethrown when get() is called on the future; if no one ever calls get(), the error vanishes silently. Catch exceptions inside fire-and-forget tasks.
- Lifetime Issues: Ensure captured variables outlive the task execution
Benchmarking
```cpp
// Sequential
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 1000; ++i) {
    expensiveComputation(i);
}
auto end = std::chrono::high_resolution_clock::now();
// Time: ~10 seconds

// Thread Pool
ThreadPool pool(8);
std::vector<std::future<void>> futures;
for (int i = 0; i < 1000; ++i) {
    futures.push_back(pool.enqueue(expensiveComputation, i));
}
for (auto& f : futures) f.wait();
// Time: ~1.5 seconds (~6.7x speedup on 8 cores)
```
Conclusion
We’ve built a fully functional thread pool with:
- Type-safe task submission
- Future-based result retrieval
- Graceful shutdown
- Exception handling
Variants of this pattern run throughout production software: Nginx offloads blocking file I/O to a thread pool, and game engines such as Unreal Engine schedule work onto pooled worker threads.
Next Steps:
- Implement priority queues for task scheduling
- Add thread pool resizing (dynamic worker count)
- Explore lock-free queue implementations
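As a starting point for the first item, the plain FIFO queue could be swapped for a std::priority_queue. This is only a sketch; the PrioritizedTask wrapper and its comparator are names invented here, not part of the pool above:

```cpp
#include <functional>
#include <queue>
#include <vector>

// Hypothetical wrapper pairing a task with a priority.
struct PrioritizedTask {
    int priority;             // higher runs first
    std::function<void()> fn;
};

struct ByPriority {
    bool operator()(const PrioritizedTask& a, const PrioritizedTask& b) const {
        return a.priority < b.priority; // max-heap on priority
    }
};

// Drop-in replacement for std::queue<std::function<void()>> tasks:
using TaskQueue =
    std::priority_queue<PrioritizedTask, std::vector<PrioritizedTask>, ByPriority>;
```

One wrinkle: priority_queue::top() returns a const reference, so the worker thread would have to copy the task out (or const_cast before moving) rather than using the std::move(tasks.front()) pattern.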
The complete code is available on GitHub.
Have you implemented a thread pool before? What challenges did you face? Let me know in the comments!