The focus of our study is the support for fine/medium grained thread level parallelism (TLP) by using a hardware scheduling unit and relying on existing simple cores. Simple cores are grouped into clusters in order to provide a scalable solution. As a proof of concept, we use an implementation based on the cell broadband engine (CBE). Cell is a multiprocessor on a chip developed by Sony, Toshiba and IBM that contains one general purpose core and eight coprocessor elements that accelerate the multimedia and vector processing. The aim of this paper is to present a possible implementation of DTA (decoupled threaded architecture) that is based on the cell processor, while keeping the scalability of the original DTA.