The PRESENT block cipher was one of the first hardware-oriented proposals for implementation in extremely resource-constrained environments. Its design is based on 4-bit S-boxes and a 64-bit permutation, a far from optimal choice to achieve good performance in software. As a result, most software implementations require large lookup tables in order to meet efficiency goals. In this paper, we describe a new portable and efficient software implementation of PRESENT, fully protected against timing attacks. Our implementation uses a novel decomposition of the permutation layer, and bitsliced computation of the S-boxes using optimized Boolean formulas, not requiring lookup tables. The implementations are evaluated in embedded ARM CPUs ranging from microcontrollers to full-featured processors equipped with vector instructions. Timings for our software implementation show a significant performance improvement compared to the numbers from the FELICS benchmarking framework. In particular, encrypting 128 bits using CTR mode takes about 2100 cycles on a Cortex-M3, improving on the best Assembly implementation in FELICS by a factor of 8. Additionally, we present the fastest masked implementation of PRESENT for protection against timing and other side-channel attacks in the scenario we consider, improving on related work by 15%. Hence, we conclude that PRESENT can be remarkably efficient in software if implemented with our techniques, and even compete with a software implementation of AES in terms of latency while offering a much smaller code footprint.