Last weekend we went live with a change where we split one of our central user databases off into a master-master replication pair, with HA and a virtual IP/hostname to point our apps to. It had been previously hosted on one of the shards, so it was time to give it a life of its own as well as some redundancy.
On Friday we did a test of the change in our staging environment.
After setting up the master-master pair, the first thing we verified was writes/table creates, etc. were flowing in both directions for the database we planned to replicate between the two. Lets just call it "db1".
To enable this I had used the option: replicate-do-db=db1 in the my.cnf on both DB machines.
After we verified this, I was told that we should probably also replicate "db2". I edited the config on both machines and changed the option to be replicate-db-db=db1,db2
We then proceeded to take a snapshot of our staging environment for these dbs so that we could test failover, app performance, etc.
We restored the snapshot, brought up the app and everything looked fine. We tested failovers. All looked good.
The change was pushed into production over the weekend and on Sunday night a network glitch caused the HA to failover to make the second DB in the pair active. Then late Monday evening someone noticed that changes didn't actually seem to be replicating between the two, although SHOW SLAVE STATUS reported IO and SQL threads running and everything was caught up.
Enter the culprit - my replicate-do-db setting
The correct way to tell MySQL to enable replication for select DBs is to issue the parameter multiple times:
The configuration as I had it was actually telling MySQL to filter and only replicate a database called "db1,db2" which of course did not exist.
Despite the fact we thought that we'd been good at testing, this managed to slip through.
I guess even someone who has been using MySQL for almost 10 years can make silly mistakes. Whoops!
Here's hoping that someone else out there can learn from my mistake!