At least one of the following MySQL privileges is required for this operation: `SUPER`, `REPLICATION CLIENT`

```sql
CREATE USER 'reader'@'localhost' IDENTIFIED BY 'qwerty';
GRANT REPLICATION CLIENT, REPLICATION SLAVE, SUPER ON *.* TO 'reader'@'localhost';
FLUSH PRIVILEGES;
```
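The "at least one of" requirement can also be checked programmatically. A minimal sketch, assuming the account's grant list has already been fetched (e.g. via `SHOW GRANTS`); the function name is illustrative:

```python
# Check that a MySQL account holds at least one of the privileges
# required for reading the binlog (SUPER or REPLICATION CLIENT).
REQUIRED_ANY = {"SUPER", "REPLICATION CLIENT"}

def can_read_binlog(granted_privileges):
    """Return True if any of the required privileges is granted."""
    granted = {p.strip().upper() for p in granted_privileges}
    # ALL PRIVILEGES implies every static privilege, including SUPER
    if "ALL PRIVILEGES" in granted:
        return True
    return bool(REQUIRED_ANY & granted)

# Example: the grants from the SQL snippet above
print(can_read_binlog(["REPLICATION CLIENT", "REPLICATION SLAVE", "SUPER"]))  # True
print(can_read_binlog(["SELECT", "INSERT"]))  # False
```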

The following MySQL config options are also required:

```ini
[mysqld]
server-id       = 1
max_binlog_size = 100M
binlog-format   = row # very important if you want to receive write, update and delete row events
```
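A quick way to sanity-check an existing config before starting the reader is to parse it; a sketch using Python's `configparser` (note that MySQL treats `-` and `_` interchangeably in option names, so keys are normalized before comparison):

```python
import configparser

# Minimal sanity check of the MySQL config options listed above.
MY_CNF = """
[mysqld]
server-id       = 1
max_binlog_size = 100M
binlog-format   = row # very important for row events
"""

parser = configparser.ConfigParser(inline_comment_prefixes="#")
parser.read_string(MY_CNF)

# Normalize '-' vs '_' in option names
mysqld = {k.replace("-", "_"): v.strip() for k, v in parser["mysqld"].items()}

assert mysqld["binlog_format"].lower() == "row", "row-based binlog is required"
assert "server_id" in mysqld, "server-id must be set for replication"
print("config looks OK")
```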
# Operation

## Requirements and Limitations

The data reader understands INSERT SQL statements only. In practice this means that:

* You need to create the required table in ClickHouse before starting the data read procedure. More on how to create the target ClickHouse table: [MySQL -> ClickHouse Data Types Mapping](#mysql---clickhouse-data-types-mapping)
* UPDATE statements are not handled, meaning UPDATEs within MySQL will not be relayed into ClickHouse
* DELETE statements are not handled, meaning DELETEs within MySQL will not be relayed into ClickHouse
* DDL statements are not handled. For example, a change to the source table structure can lead to insertion errors

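Since only INSERTs are relayed, any UPDATE, DELETE, or DDL events arriving from the binlog are effectively skipped. A hypothetical sketch of that filtering (the event types and dict layout here are illustrative, not the reader's actual classes):

```python
# Hypothetical sketch: keep only write (INSERT) row events and drop
# everything else, mirroring the limitation described above.
SUPPORTED_EVENTS = {"write_rows"}  # INSERTs only

def filter_events(events):
    """Yield only events the reader can relay into ClickHouse."""
    for event in events:
        if event["type"] in SUPPORTED_EVENTS:
            yield event
        # update_rows / delete_rows / DDL are silently skipped

binlog = [
    {"type": "write_rows",  "rows": [(1, "a")]},
    {"type": "update_rows", "rows": [(1, "b")]},  # not relayed
    {"type": "delete_rows", "rows": [(1,)]},      # not relayed
]
relayed = list(filter_events(binlog))
print(len(relayed))  # 1
```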
## Operation General Schema

* Step 1. The data reader reads data from the source event-by-event (for MySQL binlog) or line-by-line (for a file).
* Step 2. **OPTIONAL** Caching in a memory pool. Since ClickHouse prefers to receive data in bundles (row-by-row insertion is extremely slow), we need to introduce some caching. The cache can be flushed by any of:
  * number of rows in the cache
  * number of events in the cache
  * time elapsed
  * data source depleted
* Step 3. **OPTIONAL** Writing a CSV file. Sometimes it is useful to also have the data represented as a file.
* Step 4. Writing data into ClickHouse. Depending on the configuration of the previous steps, data are written into ClickHouse by any of:
  * directly, event-by-event or line-by-line
  * from the memory cache as a bulk insert operation
  * from a CSV file via `clickhouse-client`

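The flush triggers in Step 2 can be sketched as a small cache that flushes on whichever condition fires first. This is an illustrative model only, not the tool's actual internals; the class and parameter names are invented:

```python
import time

class MemPool:
    """Illustrative cache that flushes on any of the conditions above."""

    def __init__(self, max_rows=10000, max_events=1000, max_interval=60.0):
        self.max_rows = max_rows
        self.max_events = max_events
        self.max_interval = max_interval
        self.rows = []
        self.events = 0
        self.last_flush = time.monotonic()

    def add(self, event_rows):
        """Cache one event's worth of rows."""
        self.rows.extend(event_rows)
        self.events += 1

    def should_flush(self, source_depleted=False):
        """True when any flush condition is met."""
        return (
            len(self.rows) >= self.max_rows
            or self.events >= self.max_events
            or time.monotonic() - self.last_flush >= self.max_interval
            or source_depleted
        )

    def flush(self):
        """Hand the batch off (this would become one bulk INSERT)."""
        batch, self.rows = self.rows, []
        self.events = 0
        self.last_flush = time.monotonic()
        return batch

pool = MemPool(max_events=2)
pool.add([(1, "a"), (2, "b")])  # one event, two rows
pool.add([(3, "c")])            # second event triggers the event limit
print(pool.should_flush())  # True: 2 events reached
print(len(pool.flush()))    # 3 rows in the bulk batch
```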
## Example

Let's walk through a test example of the tool's launch command-line options:

```bash
$PYTHON main.py ${*:1} \
    --src-resume \
    --src-wait \
    --nice-pause=1 \
    --log-level=info \
    --log-file=ontime.log \
    --src-host=127.0.0.1 \
    --src-user=root \
    --dst-host=127.0.0.1 \
    --csvpool \
    --csvpool-file-path-prefix=qwe_ \
    --mempool-max-flush-interval=60 \
    --mempool-max-events-num=1000
```

Options description:

* `--src-resume` - resume data loading from the previous point. When the tool starts, resume from the end of the log
* `--src-wait` - wait for new data to come
* `--nice-pause=1` - when no data is available, sleep for 1 second
* `--log-level=info` - log verbosity
* `--log-file=ontime.log` - log file name
* `--src-host=127.0.0.1` - MySQL source host
* `--src-user=root` - MySQL source user (remember about PRIVILEGES for this user)
* `--dst-host=127.0.0.1` - ClickHouse host
* `--csvpool` - make a pool of CSV files (implies `--mempool` as well)
* `--csvpool-file-path-prefix=qwe_` - put these CSV files, having the `qwe_` prefix, in `CWD`
* `--mempool-max-flush-interval=60` - flush the mempool at least every 60 seconds
* `--mempool-max-events-num=1000` - flush the mempool at least every 1000 events (events, not rows)

+
# Performance
133
+
134
+
`pypy` significantly improves performance. You should try it.
135
+
For example you can start with [Portable PyPy distribution for Linux](https://github.com/squeaky-pl/portable-pypy#portable-pypy-distribution-for-linux)
136
+
Unpack it into your place of choice.
137
+
138
+
```bash
[user@localhost ~]$ ls -l pypy3.5-5.9-beta-linux_x86_64-portable
total 32
drwxr-xr-x  2 user user   140 Oct 24 01:14 bin
drwxr-xr-x  5 user user  4096 Oct  3 11:57 include
drwxr-xr-x  4 user user  4096 Oct  3 11:57 lib
drwxr-xr-x 13 user user  4096 Oct  3 11:56 lib_pypy
drwxr-xr-x  3 user user    15 Oct  3 11:56 lib-python
-rw-r--r--  1 user user 11742 Oct  3 11:56 LICENSE
-rw-r--r--  1 user user  1296 Oct  3 11:56 README.rst
drwxr-xr-x 14 user user  4096 Oct 24 01:16 site-packages
drwxr-xr-x  2 user user   195 Oct  3 11:57 virtualenv_support
```
* `TINYINT` -128 to 127. The unsigned range is 0 to 255
* `DOUBLE`, `REAL` Permissible values are -1.7976931348623157E+308 to -2.2250738585072014E-308, 0, and 2.2250738585072014E-308 to 1.7976931348623157E+308

#### Date and Time Types

* `DATE` The supported range is '1000-01-01' to '9999-12-31'
* `DATETIME` The supported range is '1000-01-01 00:00:00.000000' to '9999-12-31 23:59:59.999999'
* `TIMESTAMP` The range is '1970-01-01 00:00:01.000000' UTC to '2038-01-19 03:14:07.999999'
* `TIME` The range is '-838:59:59.000000' to '838:59:59.000000'
* `YEAR` Values display as 1901 to 2155, and 0000

#### String Types

* `CHAR` The range of M is 0 to 255. If M is omitted, the length is 1.
* `VARCHAR` The range of M is 0 to 65,535
* `BINARY` similar to CHAR

---

### ClickHouse Data Types

* `Date` number of days since 1970-01-01
* `DateTime` Unix timestamp

---

### MySQL -> ClickHouse Data Types Mapping

#### Numeric Types

* `BIT` -> ??? (possibly `String`?)
* `TINYINT` -> `Int8`, `UInt8`
* `DOUBLE`, `REAL` -> `Float64`

#### Date and Time Types

* `DATE` -> `Date` (for valid values) or `String` (`Date` allows storing values from just after the beginning of the Unix Epoch up to an upper threshold defined by a constant at the compilation stage; currently this is the year 2038, but it may be expanded to 2106)
* `DATETIME` -> `DateTime` (for valid values) or `String`
* `YEAR` -> `UInt16`

#### String Types

* `CHAR` -> `FixedString`
* `VARCHAR` -> `String`
* `JSON` -> ??? (possibly `String`?)
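The mapping lists above can be collected into a lookup table. A sketch covering only the types shown in this document, with signedness handled for the integer case (the function and table names are illustrative):

```python
# Sketch of a MySQL -> ClickHouse type lookup based on the mapping above.
# Each entry is a (signed, unsigned) pair; non-numeric types repeat the
# same target for both.
MYSQL_TO_CLICKHOUSE = {
    "TINYINT":  ("Int8", "UInt8"),
    "DOUBLE":   ("Float64", "Float64"),
    "REAL":     ("Float64", "Float64"),
    "DATE":     ("Date", "Date"),
    "DATETIME": ("DateTime", "DateTime"),
    "YEAR":     ("UInt16", "UInt16"),
    "CHAR":     ("FixedString", "FixedString"),
    "VARCHAR":  ("String", "String"),
}

def clickhouse_type(mysql_type, unsigned=False):
    """Return the ClickHouse type for a (covered) MySQL column type."""
    signed_t, unsigned_t = MYSQL_TO_CLICKHOUSE[mysql_type.upper()]
    return unsigned_t if unsigned else signed_t

print(clickhouse_type("tinyint"))                 # Int8
print(clickhouse_type("tinyint", unsigned=True))  # UInt8
print(clickhouse_type("varchar"))                 # String
```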

### MySQL Test Tables

We have to split the test table into several tables because of an error produced by MySQL.