Generate sample data using built-in database functionality

There are many ways to generate dummy data for a database. You can create a data generator, use a mock data service, or obtain a subset of production data. Additionally, many databases provide built-in functionality for generating synthetic data.

In this article, you will learn how to use special database functions, hierarchical queries, and recursive common table expressions (CTE) to generate sample data sets in PostgreSQL, MySQL, Oracle, and SQL Server.

Sample table

We’ll use the following sample tables to demonstrate various built-in data generation techniques:

create table sample(</code><code> id int, </code><code> val text</code><code>);

The id column stores the identifier of the record, while the val column holds some text value.

Once you learn how to generate simulated data for a table with two columns, you can apply these techniques to tables with more columns of different data types.

Generate series functions in PostgreSQL

PostgreSQL provides a special generate_series function that can generate a series of numeric or date/time values for a specified range.

For example, to generate a range of IDs from 1 to 5, you would use the following function call:

select id from generate_series(1, 5) as id;</code><code>?</code><code> + -- + </code><code>|id|</code><code> + -- + </code><code>|5 |</code><code>|4 |</code><code>|3 |</code><code>|2 |</code><code>|1 |</code><code> + -- + 

Afterwards, you can insert the generated series into the table sample using the following simple statement:

insert into sample(id)</code><code>select id from generate_series(1, 5) as id;

This query adds 5 records to the table, each record has a unique identifier. However, the val column is still null:

select * from sample order by id;</code><code>?</code><code> + -- + ---- + </code><code>|id|val |</code><code> + -- + ---- + </code><code>|1 |null|</code><code>|2 |null|</code><code>|3 |null|</code><code>|4 |null|</code><code>|5 |null|</code><code> + -- + ---- + </code><code>?</pre >
<p>To populate a column with val non-null values, you simply modify the data generation query as follows:</p>
<pre>-- Delete previously generated records</code><code>delete from sample;</code><code>?</code><code>-- Generate 1000 records using the concat function to populate the `val` column</code><code>insert into sample(id,val)</code><code>select id, concat('val', id * 10) from generate_series(1, 1000) as id;

Finally, retrieve the first 5 records from the table to see the sample data:

select * from sample order by id limit 5;</code><code>?</code><code> + -- + ----- + </code><code>|id|val | </code><code> + -- + ----- + </code><code>|1 |val10|</code><code>|2 |val20|</code><code>|3 |val30|</code><code>|4 |val40|</code><code>|5 |val50|</code><code> + -- + ----- + </code><code>?

Quick and easy.

Generate series again…but in SQL Server

The latest version of SQL Server introduced support for the generate_series function. Therefore, the experience of generating sample data is consistent whether you use PostgreSQL or SQL Server.

To generate a range of numbers from 1 to 5, you can call the function as follows:

select value from generate_series(1,5);</code><code>?</code><code> + ----- + </code><code>|value|</code><code> + ----- + </code><code>|1 |</code><code>|2 |</code><code>|3 |</code><code>|4 |</code><code>|5 |</code><code> + ----- + </code><code>?

Subsequently, execute the following command to create a sample of 1,000 records in the table:

insert into sample(id,val)</code><code>select value,concat('val', value * 10) from generate_series(1,1000);

To examine the generated data, query the first 5 records:

select * from sample order by id</code><code>offset 0 rows</code><code>fetch next 5 rows only;</code><code>?</code><code> + - - + ----- + </code><code>|id|val |</code><code> + -- + ----- + </code><code>|1 |val10|</code><code>|2 |val20|</code><code>|3 |val30|</code><code>|4 |val40|</code><code>|5 |val50|</code><code> + -- + ----- + </code><code>?

As demonstrated, the generate_series function is a powerful tool for performing data generation tasks. However, this function is not yet part of the SQL standard and may not be available in all database systems.

Hierarchical queries in Oracle

Oracle is one of the databases that does not support the generate_series feature. However, the database community has devised many alternative methods to generate dummy data.

One popular approach involves using hierarchical queries. For example, the following hierarchical query can generate a series of records from 1 to 5:

select level from dual connect by level <= 5;</code><code>?</code><code> + ----- + </code><code>|LEVEL|</code> <code> + ----- + </code><code>|1 |</code><code>|2 |</code><code>|3 |</code><code>|4 | </code><code>|5 |</code><code> + ----- + </code><code>?

Internally, the query constructs a data tree structure, where the LEVEL pseudo-column indicates the depth of the tree, starting from the root.

By using a hierarchical query with an insert statement, you can generate 1000 records for the sample table:

-- Oracle doesn't support the `text` data type,</code><code>-- requiring you to create the table this way</code><code>create table sample (id int, val varchar (10));</code><code>?</code><code>-- Generate 1000 records</code><code>insert into sample(id,val)</code><code>select level, concat('val', level * 10) from dual </code><code>connect by level <= 1000;</code><code>?</code><code>?

The resulting data will be similar to what you see in PostgreSQL and SQL Server:

select * from sample order by id</code><code>offset 0 rows fetch next 5 rows only;</code><code>?</code><code> + -- + ----- + </code><code>|ID|VAL |</code><code> + -- + ----- + </code><code>|1 |val10|</code><code>| 2 |val20|</code><code>|3 |val30|</code><code>|4 |val40|</code><code>|5 |val50|</code><code> + -- + ----- + </code><code>?

Recursive common table expressions in MySQL

Like Oracle, MySQL does not support the generate_series functionality, so alternative methods need to be found to generate sample data.

One method is to use recursive common table expressions (CTE). To illustrate this, the following recursive query generates a series of numbers from 1 to 5:

with recursive seq as (</code><code> select 1 as id union all select id + 1 from seq where id < 5</code><code>)</code><code>select id from seq ;</code><code>?</code><code> + -- + </code><code>|id|</code><code> + -- + </code><code>|1 |</code><code>|2 |</code><code>|3 |</code><code>|4 |</code><code>|5 |</code><code> + - - + </code><code>?

Next, you can use this recursion with the following insert statement to generate 1000 records:

insert into sample(id,val)</code><code>with recursive seq as (</code><code> select 1 as id union all select id + 1 from seq where id < 1000</code> <code>)</code><code>select id, concat('val', id * 10) from seq;

Finally, to make sure the simulation data has been generated correctly, take a quick look at the first five records:

select * from sample order by id limit 5;</code><code>?</code><code> + -- + ----- + </code><code>|id|val | </code><code> + -- + ----- + </code><code>|1 |val10|</code><code>|2 |val20|</code><code>|3 |val30|</code><code>|4 |val40|</code><code>|5 |val50|</code><code> + -- + ----- + </code><code>?

Continue to master the database

As you can see, a relational database is more than just a storage for application data. They provide a wide range of functionality that allow you to perform various tasks directly on the database side. One of these tasks is sample data generation, which can sometimes be fulfilled through built-in database functionality.

Author: Denis Magda

For more technical information, please pay attention to the official account [Cloud Native Database]

squids.cn, cloud database RDS, migration tool DBMotion, cloud backup DBTwin and other database ecological tools.

The knowledge points of the article match the official knowledge files, and you can further learn related knowledge. PostgreSQL skill treeHomepageOverview 7092 people are learning the system