How to find and delete duplicate rows in MySQL?

How to find duplicate rows

The first step is to define what rows are duplicates. Most of the time it’s simple: some of their columns have the same values. This example uses this definition. Maybe your definition of “repeat” is very complicated and you need to make some modifications to the sql. Data samples to be used in this example:

create table test(id int not null primary key, day date not null);

insert into test(id, day) values(1, '2006-10-08'); insert into test(id, day) values(2, '2006-10-08'); insert into test(id, day) values(3, '2006-10-09');

select * from test;
 + + +
| id | day |
 + + +
| 1 | 2006-10-08 |
| 2 | 2006-10-08 |
| 3 | 2006-10-09 |
 + + +

The first two rows have the same value in the day field, so I treat them as duplicate rows. The query statement uses the GROUP BY clause to group columns with the same field value into a group, and then calculates the number of groups.

select day, count(*) from test GROUP BY day;
 + + +
| day | count(*) |
 + + +
| 2006-10-08 | 2 |
| 2006-10-09 | 1 |
 + + +

The group data of the duplicate row is greater than 1. If you want to display only duplicate rows, you must use the HAVING clause, for example

select day, count(*) from test group by day HAVING count(*) > 1;
 + + +
| day | count(*) |
 + + +
| 2006-10-08 | 2 |
 + + +

Here’s the basic trick: group based on fields with the same value, then only show groups with data greater than 1.

Why can’t I use the WHERE clause? Because the WHERE clause filters the rows before grouping, the HAVING clause filters the rows after grouping.

How to delete duplicate rows

A related question is how to remove duplicate rows. A common task is to keep only one row of duplicate rows and delete the others. You can then create appropriate indexes to prevent duplicate rows from being written to the database in the future.

Again, the first step is to figure out the definition of duplicate rows. Which row do you want to keep? The first row, or the row with the smallest value for a certain field? In this example, assume that you want to keep the first row – the row with the smallest value in the id field, which means you want to delete the other rows.

Perhaps the easiest way to do this is through a temporary table. Especially for MYSQL, there are some restrictions that cannot update the table while selecting in this query statement. For simplicity, only the temporary table method is used here.

Our task is: remove all duplicate rows except the row with the smallest value for the id field in the group. Therefore, you need to find the groups whose count field is greater than 1, and the rows you want to keep. You can use the MIN() function. The statements here are to create a temporary table and find the rows that need to be deleted by DELETE.

create temporary table to_delete (day date not null, min_id int not null);

insert into to_delete(day, min_id)
select day, MIN(id) from test group by day having count(*) > 1;

select * from to_delete;
 + + +
| day | min_id |
 + + +
| 2006-10-08 | 1 |
 + + +

With this data, you can start deleting “dirty data”. There can be several methods, each with its own advantages and disadvantages, but I will not make a detailed comparison here. I will just explain the standard methods used in relational databases that support query statements.

delete from test where exists(
select * from to_delete
where to_delete.day = test.day and to_delete.min_id <> test.id
)

How to find duplicate rows across multiple columns

Someone recently asked this question: There are two fields b and c in one of my tables, which are respectively related to fields b and c in two other tables. I want to find rows that have duplicate values on field b or field c.

It’s hard to understand at first glance, but after talking to me, I understood: He wants to create unique indexes for b and c respectively. As mentioned above, finding duplicate values in a certain field is as simple as grouping by group and then counting the number of groups. And finding rows with duplicate fields is also very simple, just put all fields in the group clause. But if it is to determine whether the b field is repeated or the c field is repeated, the problem is much more difficult. This is the sample data obtained by the questioner

create table a_b_c(
a int not null primary key auto_increment,
b int,
c int
);

insert into a_b_c(b,c) values (1, 1);
insert into a_b_c(b,c) values (1, 2);
insert into a_b_c(b,c) values (1, 3);
insert into a_b_c(b,c) values (2, 1);
insert into a_b_c(b,c) values (2, 2);
insert into a_b_c(b,c) values (2, 3);
insert into a_b_c(b,c) values (3, 1);
insert into a_b_c(b,c) values (3, 2);
insert into a_b_c(b,c) values (3, 3);

Now, you can easily see that there are some duplicate rows in the table, but you can’t find two rows with the same tuple {b, c}. That’s why the problem becomes difficult.

Wrong query statement

If you group the two columns together, you will get different results depending on how you group and calculate the statistics. This is where the questioner is stuck. Sometimes a query finds some duplicate rows but misses others. This is the query he got

select b, c, count(*) from a_b_c
group by b, c
having count(distinct b > 1)
or count(distinct c > 1);

The result returns all rows because CONT(*) is always 1. Why? Because >1 is written in COUNT(). This error is easily overlooked and is in fact equivalent to

select b, c, count(*) from a_b_c
group by b, c having count(1) or count(1);

Why? Because (b > 1) is a Boolean value, it is not the result you want at all. what you want is

select b, c, count(*) from a_b_c
group by b, c
having count(distinct b) > 1
or count(distinct c) > 1;

Returns an empty result. Obviously, because there are no duplicates {b,c}. Many other combinations of OR and AND have been tried here. Originally, two fields were grouped, and statistics were calculated using another field, like this

select b, count(*) from a_b_c group by b having count(distinct c) >1;
 + + +
| b | count(*) |
 + + +
| 1 | 3 |
| 2 | 3 |
| 3 | 3 |
 + + +

None can find all duplicate rows. And the most frustrating thing is that for some situations, this kind of statement is valid. If you mistakenly think that it is written like this, for other situations, you may get wrong results.

In fact, simply using GROUP BY is not advisable. Why? Because when you use group by on certain fields, the values of other fields will be dispersed into different groups. You can see these effects by sorting on these fields, just as grouping does. First, sort the b field to see how it is grouped

When you sort (group) the b field, c with the same value is grouped into different groups, so you cannot use COUNT(DISTINCT c) to count the quantity. Internal functions such as COUNT() only work on the same group and are powerless on values in different groups. Similarly, if the c field is sorted, b with the same value will also be divided into different groups, which cannot achieve our purpose anyway.

Several correct methods

Perhaps the simplest way is to find duplicate values in a field separately, and then use UNION to put them together, like this:

select b as value, count(*) as cnt, 'b' as what_col from a_b_c group by b having count(*) > 1 union
select c as value, count(*) as cnt, 'c' as what_col from a_b_c group by c having count(*) > 1;
 + + + +
| value | cnt | what_col |
 + + + +
| 1 | 3 | b |
| 2 | 3 | b |
| 3 | 3 | b |
| 1 | 3 | c |
| 2 | 3 | c |
| 3 | 3 | c |
 + + + +

Output the what_col field in order to extract which field is repeated. Another way is to use nested queries:

select a, b, c from a_b_c
where b in (select b from a_b_c group by b having count(*) > 1)
or c in (select c from a_b_c group by c having count(*) >1);

 + + + +
|a|b|c|
 + + + +

| 7 | 1 | 1 |
| 8 | 1 | 2 |
| 9 | 1 | 3 |
| 10 | 2 | 1 |
| 11 | 2 | 2 |
| 12 | 2 | 3 |
| 13 | 3 | 1 |
| 14 | 3 | 2 |
| 15 | 3 | 3 |
 + + + +

This method is much less efficient than using UNION, and displays each duplicate row, but not the duplicate field value. There is another way to query the nested query results of multiple tables and groups together. The writing method is relatively complicated, but it is necessary for complex data or situations with high efficiency requirements.

select a, a_b_c.b, a_b_c.c
from a_b_c
left outer join (
select b from a_b_c group by b having count(*) > 1
) as b on a_b_c.b = b.b
left outer join (
select c from a_b_c group by c having count(*) > 1
) as c on a_b_c.c = c.c
where b.b is not null or c.c is not null

The above methods work, I’m sure there are others. If UNION can work, I think it will be the simplest.