In many-core systems, achieving maximum performance requires producing many more tasks than there are cores and distributing those tasks efficiently among the available resources. Software load balancers provide adequate performance as long as the number of jobs is large enough relative to the load-balancing overhead. To mitigate this overhead, load balancing can be delegated to a hardware accelerator, which can improve the performance of such architectures. This paper presents a hardware dynamic load balancer module implemented on a Zynq UltraScale+ FPGA and based on semi-work-stealing scheduling. The load balancer is specifically designed for DataFlow-Threads (DF-Threads) and can support multi-core and multi-node computing architectures. The performance of the design is initially examined through a simple thread-generating “stress test” (the Recursive-Fibonacci program) on a two-node FPGA cluster.