If it can be compiled in Thumb mode, the binary MIGHT come out a little faster, mostly on the RPi2 and RPi3, whose cores support Thumb-2 (the RPi1's ARMv6 core only has the older 16-bit Thumb set, so there it would mostly come down to interworking).  But I would try using thumb-interworking for all three systems so that ARM and Thumb code can each be used where it fits best, which should make for a more responsive binary.  It shouldn't make a huge difference, but a little less overhead could buy a small gain.

What I'm really curious about is compiling it for AArch64 to see what can be done with the wider registers.  If several values can be packed into a register and computed at the same time, that could help MAME's efficiency.  Mind you, a few barriers would be needed to make sure the data is ready when it's computed together on the RPi2 and 3, but the cost should be negligible, since there should be a few barriers already in place.  Some of you would say that 64-bit mode only buys you access to more than 4GB of memory; if so, you probably haven't worked with 64-bit computing much.  It does speed code up considerably when you have multiple items that need to be computed in the same way.  But I digress.

There are a few optimizations here and there that can speed things along, but you have to compile with those options to get them.  I would say toss them in and see how it comes out.
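
To make the "multiple items computed in the same way" point concrete, here is a rough sketch in C of what packing several values into one register can look like with NEON intrinsics. It is not taken from MAME: `scale_samples` and `HAVE_NEON` are hypothetical names made up for illustration. It assumes a toolchain that provides `arm_neon.h`, and it falls back to a plain loop everywhere else, so it compiles on any machine.

```c
#include <stddef.h>

/* NEON is detected through the compiler's predefined macros. */
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper (not from MAME): multiply n samples by gain. */
void scale_samples(float *dst, const float *src, size_t n, float gain)
{
    size_t i = 0;
#if HAVE_NEON
    /* Four floats fit in one 128-bit register and are scaled together. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t v = vld1q_f32(src + i);  /* load 4 floats        */
        v = vmulq_n_f32(v, gain);            /* multiply all 4 lanes */
        vst1q_f32(dst + i, v);               /* store 4 results      */
    }
#endif
    /* Scalar tail; also the whole loop when NEON is not available. */
    for (; i < n; ++i)
        dst[i] = src[i] * gain;
}
```

Whether this is actually a win for MAME is something only a benchmark can answer. These same intrinsics also exist for 32-bit ARM builds with NEON enabled, so any gain from AArch64 specifically would likely come from its larger register file and not from the packing itself.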