In this paper, the efficient software implementation and side-channel resistance of the LS-Design construction are studied through a series of software implementations of the Fantomas block cipher, one of its most prominent instantiations. Target platforms include resource-constrained ARM devices like the Cortex-M3 and M4, and more powerful processors such as the ARM Cortex-A15 and modern Intel platforms. The implementations span a broad range of characteristics: 32-bit and 64-bit versions, unprotected and side-channel resistant, and vectorized code for NEON and SSE instruction sets. Our results improve the state of the art substantially, in terms of both efficiency and compactness, by making use of novel algorithmic techniques and features specific to the target platform. We finish by proposing and prototyping instruction set extensions to reduce by half the performance penalty of the introduced side-channel countermeasures.