Hi Daniel,
It is very likely that 60 games are not enough to tell whether your changes really made your engine play stronger. Evaluate your 60 games with BayesElo to get relative ratings and have a look at the resulting +/- error bars. If you repeat the same run from scratch several times, you might get substantially different results each time; that is what the error bars indicate. Then perhaps play 600 games, and you will see the error bars shrink, although probably not to an acceptable size.
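To get a feeling for the size of those error bars, here is a minimal sketch in Python. It is not what BayesElo computes internally: it assumes a plain normal approximation on the match score and the usual logistic Elo formula. The function names and the win/draw/loss counts are made up for illustration.

```python
import math


def elo_from_score(score, eps=1e-9):
    """Convert a score fraction (0..1) to an Elo difference (logistic model)."""
    # Clamp to avoid division by zero / log of zero at 0% or 100%.
    s = min(max(score, eps), 1.0 - eps)
    return -400.0 * math.log10(1.0 / s - 1.0)


def elo_with_error(wins, draws, losses, z=1.96):
    """Return (elo, lower, upper) for a match result.

    Assumption: games are independent and the mean score is roughly
    normally distributed; z=1.96 gives an approximate 95% interval.
    """
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result values 1, 0.5 and 0.
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    se = math.sqrt(var / n)  # standard error of the mean score
    return (elo_from_score(score),
            elo_from_score(score - z * se),
            elo_from_score(score + z * se))


if __name__ == "__main__":
    # Hypothetical results with the same 55% score at two sample sizes.
    for w, d, l in [(24, 18, 18), (240, 180, 180)]:
        elo, lo, hi = elo_with_error(w, d, l)
        print(f"{w + d + l} games: {elo:+.0f} Elo, interval [{lo:+.0f}, {hi:+.0f}]")
```

With these invented numbers the 60-game interval still includes zero, so you could not even say which version is stronger, while the 600-game interval is clearly narrower.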
I can't give you exact numbers, but some engine authors, like Bob Hyatt, tend to play thousands of games to reliably measure small Elo differences between two engine versions.
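As a rough back-of-the-envelope check on why it ends up in the thousands, the following sketch estimates how many games are needed for a given error bar. It assumes two nearly equal engines (score close to 50%), independent games, a normal approximation, and a draw ratio that you have to guess; `games_needed` and its parameters are hypothetical names, not taken from any tool.

```python
import math


def games_needed(elo_margin, draw_ratio=0.3, z=1.96):
    """Estimate games needed so the ~95% error bar is +/- elo_margin.

    Assumptions: both engines score close to 50%, games are independent,
    and draw_ratio is a guess for the fraction of drawn games.
    """
    # Per-game standard deviation of the result at a 50% score:
    # wins and losses each deviate by 0.5, draws by 0.
    sigma = math.sqrt(0.25 * (1.0 - draw_ratio))
    # Slope of the logistic Elo curve at a 50% score (Elo per unit of score).
    elo_per_score = 400.0 / (math.log(10.0) * 0.25)
    score_margin = elo_margin / elo_per_score
    return math.ceil((z * sigma / score_margin) ** 2)


if __name__ == "__main__":
    for margin in (20, 10, 5):
        print(f"+/- {margin} Elo: about {games_needed(margin)} games")
```

Note that halving the margin needs about four times as many games, which is why small improvements are so expensive to verify.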
Bob also prefers not to use opening books but instead uses a huge number of different (balanced) starting positions which he extracted from high-quality games.
He also avoids playing the same position too often within one test run, since it has been stated that doing so would have a negative impact on the stability of the test results (the measurements become dependent).
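Just to illustrate the idea of limiting repeats, here is a small sketch that builds a test schedule from a list of starting positions. It is not Bob's actual setup; the function name, the `max_repeats` cap and the placeholder position strings are my own assumptions.

```python
import random
from collections import Counter


def build_schedule(positions, num_games, max_repeats=2, seed=0):
    """Pick a starting position for each game, using each at most max_repeats times."""
    if num_games > len(positions) * max_repeats:
        raise ValueError("not enough starting positions for this many games")
    rng = random.Random(seed)  # fixed seed so a test run can be reproduced
    # Put every position into the pool max_repeats times, then shuffle
    # and take the first num_games entries.
    pool = [pos for pos in positions for _ in range(max_repeats)]
    rng.shuffle(pool)
    return pool[:num_games]


if __name__ == "__main__":
    # Placeholder names; in practice these would be FEN strings or similar.
    positions = [f"position_{i}" for i in range(1000)]
    schedule = build_schedule(positions, 600)
    print(len(schedule), "games,", max(Counter(schedule).values()), "max uses per position")
```

The error raised for too few positions is deliberate: it is better to notice that the position set is too small than to silently repeat positions more often.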
Finally, to get such a huge number of games finished within a reasonable time you need
a) to play with an ultra-fast time control (in the range of a few seconds for each game),
b) Bob's cluster.
Since you don't have b), you will have to live with somewhat larger error bars than those Bob is getting now.
There have been huge threads in CCC (IIRC) about this topic within the past 12 months.
Sven