冒險村19 - Activerecord-import

19 - Activerecord-import

在寫入資料的時候,時常會有大量資料一次 import 的情況,可能是匯入整個 csv 檔案到資料表、或者是接收 api 將資料匯入。可能會使用 transaction 的方式來寫入確保資料都有正常匯入。

舉例來說,一次想入一筆資料:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
2.times.each do |time|
Post.create(
title: "Post test #{time}",
content: "mock content #{time}",
is_available: true,
user_id: 1
)
end

TRANSACTION (0.2ms) BEGIN
User Load (0.2ms) SELECT "users".* FROM "users" WHERE "users"."id" = $1 LIMIT $2 [["id", 1], ["LIMIT", 1]]
Post Create (0.3ms) INSERT INTO "posts" ("title", "content", "is_available", "user_id", "created_at", "updated_at") VALUES ($1, $2, $3, $4, $5, $6) RETURNING "id" [["title", "Post test 0"], ["content", "mock content 0"], ["is_available", true], ["user_id", 1], ["created_at", "2021-10-02 08:40:31.090718"], ["updated_at", "2021-10-02 08:40:31.090718"]]
TRANSACTION (0.6ms) COMMIT

TRANSACTION (0.1ms) BEGIN
User Load (0.2ms) SELECT "users".* FROM "users" WHERE "users"."id" = $1 LIMIT $2 [["id", 1], ["LIMIT", 1]]
Post Create (0.3ms) INSERT INTO "posts" ("title", "content", "is_available", "user_id", "created_at", "updated_at") VALUES ($1, $2, $3, $4, $5, $6) RETURNING "id" [["title", "Post test 1"], ["content", "mock content 1"], ["is_available", true], ["user_id", 1], ["created_at", "2021-10-02 08:40:31.093480"], ["updated_at", "2021-10-02 08:40:31.093480"]]
TRANSACTION (0.2ms) COMMIT

這樣子的寫法會得到每一筆都會是一個 transaction,所以可能改成用一個 transaction 來包,只會有一次 transaction

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
Post.transaction do
2.times.each do |time|
Post.create(
title: "Post test #{time}",
content: "mock content #{time}",
is_available: true,
user_id: 1
)
end
end

TRANSACTION (0.1ms) BEGIN
User Load (0.2ms) SELECT "users".* FROM "users" WHERE "users"."id" = $1 LIMIT $2 [["id", 1], ["LIMIT", 1]]
Post Create (0.3ms) INSERT INTO "posts" ("title", "content", "is_available", "user_id", "created_at", "updated_at") VALUES ($1, $2, $3, $4, $5, $6) RETURNING "id" [["title", "Post test 0"], ["content", "mock content 0"], ["is_available", true], ["user_id", 1], ["created_at", "2021-10-02 08:44:53.350348"], ["updated_at", "2021-10-02 08:44:53.350348"]]
User Load (0.1ms) SELECT "users".* FROM "users" WHERE "users"."id" = $1 LIMIT $2 [["id", 1], ["LIMIT", 1]]
Post Create (0.3ms) INSERT INTO "posts" ("title", "content", "is_available", "user_id", "created_at", "updated_at") VALUES ($1, $2, $3, $4, $5, $6) RETURNING "id" [["title", "Post test 1"], ["content", "mock content 1"], ["is_available", true], ["user_id", 1], ["created_at", "2021-10-02 08:44:53.352100"], ["updated_at", "2021-10-02 08:44:53.352100"]]
TRANSACTION (0.6ms) COMMIT

不過如果資料大量的時候,SQL 就會一直 n+1 的 INSERT,也可能因為 transaction 把資料表鎖住造成其他人要使用會需要排隊的情況。

所以這篇要來介紹 Activerecord-Import gem:

gem install

1
2
# Gemfile
gem "activerecord-import"

Example

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
values = 
2.times.each_with_object([]) do |time, array|
array << {
title: "Post test #{time}",
content: "mock content #{time}",
is_available: true,
user_id: 1
}
end

# values => [
# {:title=>"Post test 0", :content=>"mock content 0", :is_available=>true, :user_id=>1},
# {:title=>"Post test 1", :content=>"mock content 1", :is_available=>true, :user_id=>1}
# ]

Post.import values, validate: true

Post Create Many (3.4ms) INSERT INTO "posts" ("title","content","is_available","user_id","created_at","updated_at") VALUES ('Post test 0','mock content 0',TRUE,1,'2021-10-02 08:57:48.839905','2021-10-02 08:57:48.839905'),('Post test 1','mock content 1',TRUE,1,'2021-10-02 08:57:48.839905','2021-10-02 08:57:48.839905') RETURNING "id"
=> #<struct ActiveRecord::Import::Result failed_instances=[], num_inserts=1, ids=[19, 20], results=[]>

會發現只會有一筆 INSERT 且可以指定參數 validate 來看是否需要驗證,不指定則 default 為 true

註: 不需驗證的資料在 import 時速度約快有驗證的 5 倍左右

再者一個滿重要的問題,當如果輸入的資料其實其他欄位都相同,或者是差不多,但需要一個新的寫入時間來記錄是新的一筆資料的這種情況,或者是想覆蓋原本的資料,正常的情況可能會做 delete + create,不過這個套件有提供 Duplicate Key Update 來讓 MySQL 做 ON DUPLICATE KEY UPDATE,或者是 PG 做 ON CONFLICT DO UPDATE,相關可參考各資料庫 Duplicate Key Update 的寫法。

註: 須建立 uniq index 將重複資料改為 upsert

參考來源