Laravel chunk和chunkById的坑

公司中的项目在逐渐的向Laravel框架进行迁移。在编写定时任务脚本的时候，用到了chunk和chunkById的API，记录一下踩到的坑。

一、前言

数据库引擎为innodb。

表结构简述，只列出了本文用到的字段。

字段	类型	注释
id	int(11)	ID
type	int(11)	类型
mark_time	int(10)	标注时间（时间戳）

索引，也只列出需要的部分。

索引名	字段
PRIMARY	id
idx_sid_blogdel_marktime	type blog_del mark_time
Idx_marktime	mark_time

二、需求

每天凌晨一点取出昨天标注type为99的所有数据，进行逻辑判断，然后进行其他操作。本文的重点只在于取数据的阶段。

数据按月分表，每个月表中的数据为1000w上下。

三、chunk处理数据

代码如下:

$this->dao->where('type', 99)->whereBetween('mark_time', [$date, $date+86399])->select(array('mark_time', 'id'))->chunk(1000, function ($rows){
    // 业务处理
});

从一个月中的数据，筛选出type为99，并且标注时间在某天的00:00:00-23:59:59的数据。可以使用到mark_time和type的索引。

type为99，一天的数据大概在15-25w上下的样子。使用->get()->toArray()内存会直接炸掉。所以使用chunk方法，每次取出1000条数据。

使用chucnk，不会出现内存不够的情况。但是性能较差。粗略估计，从一月数据中取出最后一天的数据，跑完20w数据大概需要一两分钟。

查看源码，底层的chunk方法，是为sql语句添加了限制和偏移量。

1	select * from `users` asc limit 500 offset 500;

在数据较多的时候，越往后的话效率会越慢，因为Mysql的limit方法底层是这样的。

limit 10000，10

是扫描满足条件的10010行，然后扔掉前面的10000行，返回最后最后20行。在数据较多的时候，性能会非常差。

查了下API，对于这种情况Laraverl提供了另一个API chunkById。

四、chunkById 原理

使用limit和偏移量在处理大量的数据会有性能的明显下降。于是chunkById使用了id进行分页处理。很好理解，代码如下：

1	select * from `users` where `id` > :last_id order by `id` asc limit 500;

API会自动保存最后一个ID，然后通过id > :last_id 再加上limit就可以通过主键索引进行分页。只取出来需要的行数。性能会有明显的提升。

五、chunkById的坑

API显示chunk和chunkById的用法完全相同。于是把脚本的代码换成了chunkById。

$this->dao->where('type', 99)->whereBetween('mark_time', [$date, $date+86399])->select(array('mark_time', 'id'))->chunkById(1000, function ($rows){
    // 业务处理
});

在执行脚本的时候，1月2号和1月1号的数据没有任何问题。执行速度快了很多。但是在执行12月31号的数据的时候，发现脚本一直执行不完。

在定位后发现是脚本没有进入业务处理的部分，也就是sql一直没有执行完。当时很疑惑，因为刚才执行的没问题，为什么执行12月31号的就出问题了呢。

于是查看sql服务器中的执行情况。

1	show full processlist;

发现了问题。上节说了chunkById的底层是通过id进行order by，然后limie取出一部分一部分的数据，也就是我们预想的sql是这样的。

1	select * from `tabel` where `type` = 99 and mark_time between :begin_date and :end_date limit 500;

explain出来的情况如下：

select_type	type	key	rows	Extra
SIMPLE	Range	idx_marktime	2370258	Using index condition; Using where

实际上的sql是这样的：

1	select * from `tabel` where `type` = 99 and mark_time between :begin_date and :end_date order by id limit 500;

实际explain出来的情况是这样的：

select_type	type	key	rows	Extra
SIMPLE	Index	PRIMARY	4379	Using where

chunkById会自动添加order by id。innodb一定会使用主键索引。那么就不会再使用mark_time的索引了。导致sql执行效率及其缓慢。

六、解决方法

再次查看chunkById的源码。

/**
  * Chunk the results of a query by comparing numeric IDs.
  *
  * @param  int  $count
  * @param  callable  $callback
  * @param  string|null  $column
  * @param  string|null  $alias
  * @return bool
  */
  public function chunkById($count, callable $callback, $column = null, $alias = null)
  {
  	$column = is_null($column) ? $this->getModel()->getKeyName() : $column;

  	$alias = is_null($alias) ? $column : $alias;

  	$lastId = null;

  	do {
        $clone = clone $this;

        // We'll execute the query for the given page and get the results. If there are
        // no results we can just break and return from here. When there are results
        // we will call the callback with the current chunk of these results here.
        $results = $clone->forPageAfterId($count, $lastId, $column)->get();

        $countResults = $results->count();

        if ($countResults == 0) {
          break;
        }

        // On each chunk result set, we will pass them to the callback and then let the
        // developer take care of everything within the callback, which allows us to
        // keep the memory low for spinning through large result sets for working.
        if ($callback($results) === false) {
          return false;
        }

        $lastId = $results->last()->{$alias};

      unset($results);
     } while ($countResults == $count);

     return true;
  }

能看到这个方法有四个参数count，callback，column，alias。

默认的column为null，第一行会进行默认赋值。

1	$column = is_null($column) ? $this->getModel()->getKeyName() : $column;

往下跟:

/**
  * Get the primary key for the model.
  *
  * @return string
  */
  public function getKeyName()
  {
  	return $this->primaryKey;
 	}
 	
/**
  * The primary key for the model.
  *
  * @var string
  */
  protected $primaryKey = 'id';

能看到默认的column为id。

进入forPageAfterId方法。

/**
  * Constrain the query to the next "page" of results after a given ID.
  *
  * @param  int  $perPage
  * @param  int|null  $lastId
  * @param  string  $column
  * @return IlluminateDatabaseQueryBuilder|static
  */
  public function forPageAfterId($perPage = 15, $lastId = 0, $column = 'id')
  {
  	$this->orders = $this->removeExistingOrdersFor($column);

  	if (! is_null($lastId)) {
  		$this->where($column, '>', $lastId);
  	}

  	return $this->orderBy($column, 'asc')
  	->take($perPage);
  }

能看到如果lastId不为0则自动添加where语句，还会自动添加order by column。

看到这里就明白了。上文的chunkById没有添加column参数，所以底层自动添加了order by id。走了主键索引，没有使用上mark_time的索引。导致查询效率非常低。

chunkById的源码显示了我们可以传递一个column字段来让底层使用这个字段来order by。

代码修改如下：

1
2
3

$this->dao->where('type', 99)->whereBetween('mark_time', [$date, $date+86399])->select(array('mark_time', 'id'))->chunkById(1000, function ($rows){
    // 业务处理
}, 'mark_time');

这样最后执行的sql如下：

1	select * from `tabel` where `type` = 99 and mark_time between :begin_date and :end_date order by mark_time limit 500;

再次执行脚本，大概执行一次也就十秒作用了，性能提升显著。

七、总结

使用 chunkById 或者 chunk 方法的时候不要添加自定义的排序，chunk和chunkById的区别就是chunk是单纯的通过偏移量来获取数据，chunkById进行了优化，不使用偏移量，使用id过滤，性能提升巨大。在数据量大的时候，性能可以差到几十倍的样子。

而且使用chunk在更新的时候，也会遇到数据会被跳过的问题。详见解决Laravel中chunk方法分块处理数据的坑

同时chunkById在你没有传递column参数时，会默认添加order by id。可能会遇到索引失效的问题。解决办法就是传递column参数即可。

本人感觉chunkById不光是根据Id分块，而是可以根据某一字段进行分块，这个字段是可以指定的。叫chunkById有一些误导性，chunkByColumn可能更容易理解。算是自己提的小小的建议。

本文非原创，转载于https://www.lqwang.net/13.html

踩过这个坑，还有下一个坑等着你，这一路就是给自己填坑，坑填多了，也就习惯了，直到这一路平坦了，也就无怨无悔了。

Laravel chunk和chunkById的坑

Laravel chunk和chunkById的坑

一、前言

二、需求

三、chunk处理数据

四、chunkById 原理

五、chunkById的坑

六、解决方法

七、总结

Published by

风君子

发表回复取消回复

Laravel chunk和chunkById的坑

一、前言

二、需求

三、chunk处理数据

四、chunkById 原理

五、chunkById的坑

六、解决方法

七、总结

Published by

风君子

发表回复 取消回复

发表回复取消回复