Aurora, where’s my data? Extreme IOPS Optimization for MySQL

Today I decided to run another absurd test run on Amazon Aurora, this time playing with different MySQL engines.

A little absurd experimentation with MySQL and Aurora

Amazon RDS and Amazon Aurora fully support the InnoDB storage engine for MySQL DB instances. There are features like snapshot restore which are supported only for InnoDB storage engine. But InnoDB is not the only engine available on RDS for MySQL or Aurora MySQL

You can see all the engines able to run in a simple way:

MySQL> show engines;

+--------------------+---------+----------------------------------------------------------------+--------------+------+------------+
| Engine             | Support | Comment                                                        | Transactions | XA   | Savepoints |
+--------------------+---------+----------------------------------------------------------------+--------------+------+------------+
| FEDERATED          | NO      | Federated MySQL storage engine                                 | NULL         | NULL | NULL       |
| MEMORY             | YES     | Hash based, stored in memory, useful for temporary tables      | NO           | NO   | NO         |
| InnoDB             | DEFAULT | Supports transactions, row-level locking, and foreign keys     | YES          | YES  | YES        |
| PERFORMANCE_SCHEMA | YES     | Performance Schema                                             | NO           | NO   | NO         |
| MyISAM             | YES     | MyISAM storage engine                                          | NO           | NO   | NO         |
| MRG_MYISAM         | YES     | Collection of identical MyISAM tables                          | NO           | NO   | NO         |
| BLACKHOLE          | YES     | /dev/null storage engine (anything you write to it disappears) | NO           | NO   | NO         |
| CSV                | YES     | CSV storage engine                                             | NO           | NO   | NO         |
| ARCHIVE            | YES     | Archive storage engine                                         | NO           | NO   | NO         |
+--------------------+---------+----------------------------------------------------------------+--------------+------+------------+
enter fullscreen mode

exit fullscreen mode

InnoDB is omission engine and which you should be using for (almost) every single table you have. Unless you care about the data stored in the table.

but InnoDB is Not the fastest and lightest storage engine on RDS, No, I’m not talking about MyISAM, I’m talking about Garbage Bin, the blackhole storage engine:

The BLACKHOLE storage engine acts as a “black hole” that accepts data but throws it away and does not store it. Retrieval always returns an empty result.

What?

The best way to optimize IOPS and have less problems with data is by not storing the data in the first place. Ok, we will store less data but do we really benefit in CPU usage? How much does it really cost to store (useless) data?

any real difference?

We are going to run a small experiment running an Aurora Serverless instance on a fixed size of 4 ACUs (about 8GB).

Let’s create a simple table and a procedure to fill it with a few (million and useless) records. Nothing fancy here, just some random numbers and 5 million records for each call.


CREATE TABLE data
(id bigint(20) NOT NULL AUTO_INCREMENT,
datetime TIMESTAMP NULL DEFAULT CURRENT_TIMESTAMP,
value float DEFAULT NULL,
PRIMARY KEY (id)) ENGINE=InnoDB;

DELIMITER $$
CREATE PROCEDURE generate_data()
BEGIN
  DECLARE i INT DEFAULT 0;
  WHILE i < 5000000 DO
    INSERT INTO data (datetime,value) VALUES (
      FROM_UNIXTIME(UNIX_TIMESTAMP('2022-01-01 01:00:00')+FLOOR(RAND()*31536000)),
      ROUND(RAND()*100,2));
    SET i = i + 1;
  END WHILE;
END$$
DELIMITER ;
enter fullscreen mode

exit fullscreen mode

Now we can call our simple procedure (twice, to see that the results are not random and reproducible) and compare the metrics of the database with the InnoDB and BLACKHOLE engines. We sleep a little (200 seconds) to get a clear metric between executions:

CALL generate_data();
SELECT sleep(200);
CALL generate_data();

SELECT sleep(200);
ALTER TABLE data ENGINE=BLACKHOLE;

CALL generate_data();
SELECT sleep(200);
CALL generate_data();

enter fullscreen mode

exit fullscreen mode

And the winner is…

As expected, the difference in CPU load of the database is huge, storing (waste) data is not cheap.

The insert throughput increases, reducing the time required to run the process.

Insert Throughput: InnoDB vs BLACKHOLE

are you kidding, aren’t you?

No, there are some corner cases where running a simple:

ALTER TABLE myUselessTable ENGINE=BLACKHOLE;

Not completely crazy. Between scenarios:

  1. emergency button, Well, your database is actually running out of IOPS, you don’t have any logical sharding to split the traffic into different instances. And vertical scaling won’t help. If you have some data that you can lose, running a table on BLACKHOLE could be an out of jail card.

  2. hurry application change. Well, that cool feature in your product isn’t that great anymore, you’ve decided to stop collecting data and delete it, but you won’t have a fix/new version for days or weeks. You might be lucky enough and be able to drop the records in the meantime.

3 testing The impact of your new great feature on write IOPS on a complex workload. Try it with and without the blackhole, as in these dummy tests. You may find some useful and scary numbers.

sustainability?

As everyone talks about sustainability, let’s add one final note. Last month I attended a talk about sustainability in Serverless, given by Sheen Brissell at the Serverless Architecture conference in Berlin. One of Sheen’s key messages was to get rid of useless data: If you don’t need the data, don’t store it.

If you don't need the data, don't keep it

Gert Leanders recently shared a similar thought:

There is no value in a petabyte data lake without data access.

I totally agree, the best way to have less problems with the data is by not having the data. And the cost of useless data isn’t just storage.

Don’t store data you don’t need to store. You’re just not taking up storage you don’t need. You are burning CPU cycles. You are heating the planet. You are taking out your credit card. You are slowing down your application.

You may not need to replace your product to reap some of the benefits. Proceed with that simple ALTER TABLE.

Or drop the entire database and stop that RDS/Aurora instance. OK, maybe I’m kidding.

Leave a Comment