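This article analyzes the deadlock that select for update can cause in MySQL under the Repeatable Read isolation level.
1. The Case
The business needs to assign numbers to entities of various types. For entities of type x, for example, the numbers might look like x201712120001, x201712120002, x201712120003. Each number has two parts: a prefix made of x plus the date, and a serial number (four digits here).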
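The numbering scheme can be sketched in a few lines of Java. This is only an illustration of the format described above; the class name EntityNumbers and its two methods are hypothetical and do not come from the original code base.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper illustrating the number format; not part of the original code.
final class EntityNumbers {
    // BASIC_ISO_DATE renders a LocalDate as yyyyMMdd, e.g. 20171212.
    private static final DateTimeFormatter DAY = DateTimeFormatter.BASIC_ISO_DATE;

    // Prefix = type code + date, e.g. "x" + 2017-12-12 gives "x20171212".
    static String prefix(String typeCode, LocalDate date) {
        return typeCode + date.format(DAY);
    }

    // Full number = prefix + serial number zero-padded to four digits.
    static String format(String prefix, long serial) {
        if (serial < 1 || serial > 9999) {
            throw new IllegalArgumentException("serial out of range: " + serial);
        }
        return String.format(Locale.ROOT, "%s%04d", prefix, serial);
    }

    public static void main(String[] args) {
        String prefix = prefix("x", LocalDate.of(2017, 12, 12));
        System.out.println(format(prefix, 1)); // x201712120001
    }
}
```

Only the serial number part needs coordination between threads, which is what the table below is for.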
To hand out serial numbers from a database table, a table like the following is the natural choice:
CREATE TABLE number (
    prefix VARCHAR(20) NOT NULL DEFAULT '' COMMENT 'prefix code',
    value BIGINT NOT NULL DEFAULT 0 COMMENT 'serial number',
    UNIQUE KEY uk_prefix(prefix)
);
In the business layer, once the business rules yield a prefix such as x20171212, the code opens a transaction and uses select for update to control allocation as follows.
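@Transactional
long acquire(String prefix) {
    // select ... for update on the prefix row; returns null when no row exists yet
    SerialNumber current = dao.selectAndLock(prefix);
    if (current == null) {
        // First number for this prefix: insert the initial row
        dao.insert(new SerialNumber(prefix, 1));
        return 1;
    } else {
        // The row exists and is locked: increment it and write it back
        current.number++;
        dao.update(current);
        return current.number;
    }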
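}
This code does a locking read, then updates the row if it exists and inserts it otherwise. Under the Repeatable Read isolation level, however, it has a potential deadlock. (A second, transaction-related issue is also covered further below.)
2. Cause of the Deadlock
When the where condition of select for update matches a record, the code above does not deadlock. When the where condition matches no record, however, multiple threads running the acquire method concurrently can deadlock.
2.1 A Simple Reproduction
The following fairly simple example reproduces the scenario.
First, initialize the table with three rows.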
insert into number select 'bbb',2;
insert into number select 'hhh',8;
insert into number select 'yyy',25;
Then run the operations in the following order:
| session 1 | session 2 |
|---|---|
| begin; | |
| | begin; |
| select * from number where prefix='ddd' for update; | |
| | select * from number where prefix='fff' for update; |
| insert into number select 'ddd',1; | |
| blocked | insert into number select 'fff',1; |
| insert succeeds | deadlock; session 2's transaction is rolled back |
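2.2 Analyzing the Deadlock
Using show engine innodb status (with innodb_status_output_locks enabled so that held locks are listed), we can walk through the state after each step:
2.2.1 session 1 runs select for update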
------------
TRANSACTIONS
------------
Trx id counter 238435
Purge done for trx's n:o < 238430 undo n:o < 0 state: running but idle
History list length 13
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 281479459589696, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 281479459588792, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 238434, ACTIVE 3 sec
2 lock struct(s), heap size 1136, 1 row lock(s)
MySQL thread id 160, OS thread handle 123145573965824, query id 69153 localhost root
TABLE LOCK table `test`.`number` trx id 238434 lock mode IX
RECORD LOCKS space id 1506 page no 3 n bits 80 index uk_prefix of table `test`.`number` trx id 238434 lock_mode X locks gap before rec
Record lock, heap no 3 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
0: len 3; hex 686868; asc hhh;;
1: len 6; hex 00000003a350; asc P;;
2: len 7; hex d2000001ff0110; asc ;;
3: len 8; hex 8000000000000008; asc ;;
Here, transaction 238434 has acquired the gap lock before hhh.
2.2.2 session 2 runs select for update
------------
TRANSACTIONS
------------
Trx id counter 238436
Purge done for trx's n:o < 238430 undo n:o < 0 state: running but idle
History list length 13
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 281479459589696, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 238435, ACTIVE 3 sec
2 lock struct(s), heap size 1136, 1 row lock(s)
MySQL thread id 161, OS thread handle 123145573408768, query id 69155 localhost root
TABLE LOCK table `test`.`number` trx id 238435 lock mode IX
RECORD LOCKS space id 1506 page no 3 n bits 80 index uk_prefix of table `test`.`number` trx id 238435 lock_mode X locks gap before rec
Record lock, heap no 3 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
0: len 3; hex 686868; asc hhh;;
1: len 6; hex 00000003a350; asc P;;
2: len 7; hex d2000001ff0110; asc ;;
3: len 8; hex 8000000000000008; asc ;;
---TRANSACTION 238434, ACTIVE 30 sec
2 lock struct(s), heap size 1136, 1 row lock(s)
MySQL thread id 160, OS thread handle 123145573965824, query id 69153 localhost root
TABLE LOCK table `test`.`number` trx id 238434 lock mode IX
RECORD LOCKS space id 1506 page no 3 n bits 80 index uk_prefix of table `test`.`number` trx id 238434 lock_mode X locks gap before rec
Record lock, heap no 3 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
0: len 3; hex 686868; asc hhh;;
1: len 6; hex 00000003a350; asc P;;
2: len 7; hex d2000001ff0110; asc ;;
3: len 8; hex 8000000000000008; asc ;;
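Here, transaction 238435 has also acquired the gap lock before hhh. Gap locks held by different transactions on the same gap do not conflict with each other, so both requests are granted.
2.2.3 session 1 attempts the insert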
------------
TRANSACTIONS
------------
Trx id counter 238436
Purge done for trx's n:o < 238430 undo n:o < 0 state: running but idle
History list length 13
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 281479459589696, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 238435, ACTIVE 28 sec
2 lock struct(s), heap size 1136, 1 row lock(s)
MySQL thread id 161, OS thread handle 123145573408768, query id 69155 localhost root
TABLE LOCK table `test`.`number` trx id 238435 lock mode IX
RECORD LOCKS space id 1506 page no 3 n bits 80 index uk_prefix of table `test`.`number` trx id 238435 lock_mode X locks gap before rec
Record lock, heap no 3 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
0: len 3; hex 686868; asc hhh;;
1: len 6; hex 00000003a350; asc P;;
2: len 7; hex d2000001ff0110; asc ;;
3: len 8; hex 8000000000000008; asc ;;
---TRANSACTION 238434, ACTIVE 55 sec inserting
mysql tables in use 1, locked 1
LOCK WAIT 3 lock struct(s), heap size 1136, 2 row lock(s)
MySQL thread id 160, OS thread handle 123145573965824, query id 69157 localhost root executing
insert into number select 'ddd',1
------- TRX HAS BEEN WAITING 2 SEC FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 1506 page no 3 n bits 80 index uk_prefix of table `test`.`number` trx id 238434 lock_mode X locks gap before rec insert intention waiting
Record lock, heap no 3 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
0: len 3; hex 686868; asc hhh;;
1: len 6; hex 00000003a350; asc P;;
2: len 7; hex d2000001ff0110; asc ;;
3: len 8; hex 8000000000000008; asc ;;
------------------
TABLE LOCK table `test`.`number` trx id 238434 lock mode IX
RECORD LOCKS space id 1506 page no 3 n bits 80 index uk_prefix of table `test`.`number` trx id 238434 lock_mode X locks gap before rec
Record lock, heap no 3 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
0: len 3; hex 686868; asc hhh;;
1: len 6; hex 00000003a350; asc P;;
2: len 7; hex d2000001ff0110; asc ;;
3: len 8; hex 8000000000000008; asc ;;
RECORD LOCKS space id 1506 page no 3 n bits 80 index uk_prefix of table `test`.`number` trx id 238434 lock_mode X locks gap before rec insert intention waiting
Record lock, heap no 3 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
0: len 3; hex 686868; asc hhh;;
1: len 6; hex 00000003a350; asc P;;
2: len 7; hex d2000001ff0110; asc ;;
3: len 8; hex 8000000000000008; asc ;;
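Here we can see that when transaction 238434 tried to insert 'ddd',1, InnoDB found that another transaction (238435) already held a gap lock on this interval. InnoDB therefore made transaction 238434 request an insert intention lock, with lock mode LOCK_X | LOCK_GAP | LOCK_INSERT_INTENTION, and wait for transaction 238435 to release its gap lock. This lock mode is the one used in the lock_rec_insert_check_and_lock function of the InnoDB source.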
2.2.4 session 2 attempts the insert
------------------------
LATEST DETECTED DEADLOCK
------------------------
2017-12-21 22:50:40 0x70001028a000
*** (1) TRANSACTION:
TRANSACTION 238434, ACTIVE 81 sec inserting
mysql tables in use 1, locked 1
LOCK WAIT 3 lock struct(s), heap size 1136, 2 row lock(s)
MySQL thread id 160, OS thread handle 123145573965824, query id 69157 localhost root executing
insert into number select 'ddd',1
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 1506 page no 3 n bits 80 index uk_prefix of table `test`.`number` trx id 238434 lock_mode X locks gap before rec insert intention waiting
Record lock, heap no 3 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
0: len 3; hex 686868; asc hhh;;
1: len 6; hex 00000003a350; asc P;;
2: len 7; hex d2000001ff0110; asc ;;
3: len 8; hex 8000000000000008; asc ;;
*** (2) TRANSACTION:
TRANSACTION 238435, ACTIVE 54 sec inserting
mysql tables in use 1, locked 1
3 lock struct(s), heap size 1136, 2 row lock(s)
MySQL thread id 161, OS thread handle 123145573408768, query id 69159 localhost root executing
insert into number select 'fff',1
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 1506 page no 3 n bits 80 index uk_prefix of table `test`.`number` trx id 238435 lock_mode X locks gap before rec
Record lock, heap no 3 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
0: len 3; hex 686868; asc hhh;;
1: len 6; hex 00000003a350; asc P;;
2: len 7; hex d2000001ff0110; asc ;;
3: len 8; hex 8000000000000008; asc ;;
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 1506 page no 3 n bits 80 index uk_prefix of table `test`.`number` trx id 238435 lock_mode X locks gap before rec insert intention waiting
Record lock, heap no 3 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
0: len 3; hex 686868; asc hhh;;
1: len 6; hex 00000003a350; asc P;;
2: len 7; hex d2000001ff0110; asc ;;
3: len 8; hex 8000000000000008; asc ;;
*** WE ROLL BACK TRANSACTION (2)
------------
TRANSACTIONS
------------
Trx id counter 238436
Purge done for trx's n:o < 238430 undo n:o < 0 state: running but idle
History list length 13
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 281479459589696, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 281479459588792, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 238434, ACTIVE 84 sec
3 lock struct(s), heap size 1136, 3 row lock(s), undo log entries 1
MySQL thread id 160, OS thread handle 123145573965824, query id 69157 localhost root
TABLE LOCK table `test`.`number` trx id 238434 lock mode IX
RECORD LOCKS space id 1506 page no 3 n bits 80 index uk_prefix of table `test`.`number` trx id 238434 lock_mode X locks gap before rec
Record lock, heap no 3 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
0: len 3; hex 686868; asc hhh;;
1: len 6; hex 00000003a350; asc P;;
2: len 7; hex d2000001ff0110; asc ;;
3: len 8; hex 8000000000000008; asc ;;
Record lock, heap no 7 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
0: len 3; hex 646464; asc ddd;;
1: len 6; hex 00000003a362; asc b;;
2: len 7; hex de000001e60110; asc ;;
3: len 8; hex 8000000000000001; asc ;;
RECORD LOCKS space id 1506 page no 3 n bits 80 index uk_prefix of table `test`.`number` trx id 238434 lock_mode X locks gap before rec insert intention
Record lock, heap no 3 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
0: len 3; hex 686868; asc hhh;;
1: len 6; hex 00000003a350; asc P;;
2: len 7; hex d2000001ff0110; asc ;;
3: len 8; hex 8000000000000008; asc ;;
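At this point, the deadlock information shows that transaction 238435, when inserting, also found transaction 238434's gap lock. It likewise requested an insert intention lock and began waiting for transaction 238434 to release its gap lock. Each transaction is now waiting for the other, which is a deadlock.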
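The sequence above can be replayed with a toy model. The sketch below is not InnoDB's lock manager: the class GapLockModel and its methods are hypothetical, gaps are simply named by the record that follows them, and the model always rolls back the requester that closes the cycle (which happens to match the victim in the log above). It only encodes two rules from the analysis: gap locks are compatible with each other, and an insert intention request must wait for gap locks held by other transactions.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of the scenario above; it is not InnoDB's lock manager.
final class GapLockModel {
    // Gap (named by the record that follows it) -> transactions holding a gap lock on it.
    private final Map<String, Set<Integer>> gapHolders = new HashMap<>();
    // Waiting transaction -> transactions whose gap locks block its insert.
    private final Map<Integer, Set<Integer>> waitsFor = new HashMap<>();

    // Gap locks are compatible with each other, so this always succeeds.
    void lockGap(int trx, String gap) {
        gapHolders.computeIfAbsent(gap, g -> new HashSet<>()).add(trx);
    }

    // An insert intention lock conflicts with gap locks held by other transactions.
    // Returns "granted", "waiting", or "deadlock" (the requester is rolled back).
    String requestInsert(int trx, String gap) {
        Set<Integer> blockers =
                new HashSet<>(gapHolders.getOrDefault(gap, Collections.emptySet()));
        blockers.remove(trx);
        if (blockers.isEmpty()) {
            return "granted";
        }
        waitsFor.put(trx, blockers);
        if (reaches(trx, trx, new HashSet<>())) {
            rollback(trx);
            return "deadlock";
        }
        return "waiting";
    }

    boolean isWaiting(int trx) {
        return waitsFor.containsKey(trx);
    }

    // Drop the transaction's locks and unblock waiters that no longer have blockers.
    private void rollback(int trx) {
        waitsFor.remove(trx);
        gapHolders.values().forEach(holders -> holders.remove(trx));
        waitsFor.values().forEach(blockers -> blockers.remove(trx));
        waitsFor.values().removeIf(Set::isEmpty);
    }

    // Depth-first search of the wait-for graph: does `from` transitively wait for `target`?
    private boolean reaches(int from, int target, Set<Integer> seen) {
        for (int next : waitsFor.getOrDefault(from, Collections.emptySet())) {
            if (next == target || (seen.add(next) && reaches(next, target, seen))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        GapLockModel locks = new GapLockModel();
        locks.lockGap(238434, "hhh"); // session 1: select ... prefix='ddd' for update
        locks.lockGap(238435, "hhh"); // session 2: select ... prefix='fff' for update
        System.out.println(locks.requestInsert(238434, "hhh")); // waiting
        System.out.println(locks.requestInsert(238435, "hhh")); // deadlock
    }
}
```

Once the second transaction is rolled back in the model, the first one no longer has any blocker, mirroring the "insert succeeds" step for session 1.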
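2.3 Avoiding the Deadlock
As we have seen, the cause is this: when two sessions both run select for update and neither matches a record, they may obtain locks on the same gap (depending on the where condition). If both then insert concurrently, the first insert enters a lock wait, and the second insert produces the deadlock. InnoDB then chooses one transaction to roll back based on transaction weight.
So how can this be avoided?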
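One solution is to lower the transaction isolation level to Read Committed, where a locking read that matches no record does not take a gap lock. In the scenario above, inserts for different prefixes then no longer block each other. If two sessions race to insert the same prefix, one of them fails with a duplicate-key error on uk_prefix, which the business code can catch and retry.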
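The catch-and-retry step might look like the following sketch. It assumes plain JDBC exceptions rather than whatever exception type the original dao layer throws; the class RetryOnConflict, the SqlCall interface, and the withRetry method are hypothetical names introduced here. MySQL reports a duplicate key with SQLSTATE 23000 and a deadlock victim with SQLSTATE 40001, so both are treated as retryable.

```java
import java.sql.SQLException;
import java.sql.SQLIntegrityConstraintViolationException;

// Hypothetical retry helper; names are illustrative, not from the original code.
final class RetryOnConflict {
    // A unit of work that may fail with SQLException. Each run should use its own transaction.
    interface SqlCall<T> {
        T run() throws SQLException;
    }

    // Duplicate key: SQLSTATE class 23. Deadlock victim: SQLSTATE 40001.
    static boolean isRetryable(SQLException e) {
        String state = e.getSQLState();
        return e instanceof SQLIntegrityConstraintViolationException
                || (state != null && (state.equals("40001") || state.startsWith("23")));
    }

    // Runs the call up to maxAttempts times, rethrowing non-retryable errors immediately.
    static <T> T withRetry(int maxAttempts, SqlCall<T> call) throws SQLException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        SQLException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.run();
            } catch (SQLException e) {
                if (!isRetryable(e)) {
                    throw e;
                }
                last = e; // remember the conflict and try again
            }
        }
        throw last;
    }

    public static void main(String[] args) throws SQLException {
        int[] attempts = {0};
        // Simulate losing the insert race once, then succeeding.
        SqlCall<Long> acquire = () -> {
            if (attempts[0]++ == 0) {
                throw new SQLIntegrityConstraintViolationException("Duplicate entry", "23000", 1062);
            }
            return 1L;
        };
        long serial = withRetry(3, acquire);
        System.out.println(serial + " after " + attempts[0] + " attempts"); // 1 after 2 attempts
    }
}
```

The retry has to wrap the whole transactional method from the outside, so that each attempt starts a fresh transaction; on the second attempt the locking read finds the row inserted by the winner and takes the update branch.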
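Another point worth noting in the sample code is the propagation behavior of the @Transactional annotation. For a method like this one, which has little to do with the main flow's transaction, consider changing the propagation behavior to REQUIRES_NEW. Otherwise, a thread acquiring a serial number may block because another thread's main business flow has not finished yet and so still holds its lock on the row.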
3. References
InnoDB manual
数据库内核月报 (Database Kernel Monthly), 2016/01
InnoDB source code