Commit ec769a3

Merge pull request #36 from sunsingerus/master

airline.ontime dataset

2 parents 87b5b25 + d4ad93f

4 files changed: +265 -64 lines

README.md: 187 additions & 64 deletions
* [Introduction](#introduction)
* [Requirements](#requirements)
* [Operation](#operation)
* [Requirements and Limitations](#requirements-and-limitations)
* [Operation General Schema](#operation-general-schema)
* [Example](#example)
* [Performance](#performance)
* [Testing](#testing)
* [Testing General Schema](#testing-general-schema)
* [MySQL Data Types](#mysql-data-types)
* [ClickHouse Data Types](#clickhouse-data-types)
* [MySQL -> ClickHouse Data Types Mapping](#mysql---clickhouse-data-types-mapping)
* [MySQL Test Tables](#mysql-test-tables)
* [ClickHouse Test Tables](#clickhouse-test-tables)
* [Test Cases](#test-cases)
* [airline.ontime Test Case](#airlineontime-test-case)

1724
---
1825

1926
# Introduction
2027

21-
Utility to read mysql data
28+
Utility to import data into ClickHouse from MySQL (mainly) and/or CSV files
2229

2330
# Requirements
2431

25-
This package is used for interacting with MySQL:
32+
Data reader requires **Python 3.x** with additional modules to be installed.
33+
34+
`mysql-replication` package is used for communication with MySQL:
2635
[https://github.com/noplay/python-mysql-replication](https://github.com/noplay/python-mysql-replication)
2736
```bash
2837
pip install mysql-replication
2938
```
3039

31-
This package is used for interacting with ClickHouse:
40+
`clickhouse-driver` package is used for communication with ClickHouse:
3241
[https://github.com/mymarilyn/clickhouse-driver](https://github.com/mymarilyn/clickhouse-driver)
3342
```bash
3443
pip install clickhouse-driver
3544
```
3645

Also, at least one of the following MySQL privileges is required for this operation: `SUPER`, `REPLICATION CLIENT`

```sql
CREATE USER 'reader'@'localhost' IDENTIFIED BY 'qwerty';
-- ...
GRANT REPLICATION CLIENT, REPLICATION SLAVE, SUPER ON *.* TO 'reader'@'*';
FLUSH PRIVILEGES;
```

Also, the following MySQL config options are required:
```ini
[mysqld]
server-id = 1
# ...
max_binlog_size = 100M
binlog-format = row # Very important if you want to receive write, update and delete row events
```
# Operation

## Requirements and Limitations

Data reader understands INSERT SQL statements only. In practice this means that:
* You need to create the required table in ClickHouse before starting the data read procedure. More on how to create the target ClickHouse table: [MySQL -> ClickHouse Data Types Mapping](#mysql---clickhouse-data-types-mapping)
* UPDATE statements are not handled, meaning UPDATEs within MySQL will not be relayed into ClickHouse
* DELETE statements are not handled, meaning DELETEs within MySQL will not be relayed into ClickHouse
* DDL statements are not handled. For example, a source table structure change can lead to insertion errors

## Operation General Schema

* Step 1. Data Reader reads data from the source event-by-event (for MySQL binlog) or line-by-line (file).
* Step 2. **OPTIONAL** Caching in a memory pool. Since ClickHouse prefers to receive data in bundles (row-by-row insertion is extremely slow), we need to introduce some caching.
  Cache can be flushed by any of:
  * number of rows in cache
  * number of events in cache
  * time elapsed
  * data source depleted
* Step 3. **OPTIONAL** Writing a CSV file. Sometimes it is useful to have the data also represented as a file.
* Step 4. Writing data into ClickHouse. Depending on the configuration of the previous steps, data are written into ClickHouse by any of:
  * directly, event-by-event or line-by-line
  * from the memory cache as a bulk insert operation
  * from a CSV file via `clickhouse-client`
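
The caching step above can be sketched as a small flush-trigger loop. This is an illustrative model only; the class and parameter names below are hypothetical, not the tool's actual API:

```python
import time

class MemPool:
    """Sketch of the memory-pool step: buffer rows, flush them in bulk."""

    def __init__(self, max_rows=100000, max_events=1000,
                 max_flush_interval=60, writer=None):
        self.max_rows = max_rows
        self.max_events = max_events
        self.max_flush_interval = max_flush_interval
        self.writer = writer or (lambda rows: None)  # bulk-insert callback
        self.rows = []
        self.events = 0
        self.last_flush = time.time()

    def add_event(self, event_rows):
        # one binlog event may carry several rows
        self.rows.extend(event_rows)
        self.events += 1
        # flush triggers: rows in cache, events in cache, time elapsed
        if (len(self.rows) >= self.max_rows
                or self.events >= self.max_events
                or time.time() - self.last_flush >= self.max_flush_interval):
            self.flush()

    def flush(self):
        # also called explicitly when the data source is depleted
        if self.rows:
            self.writer(self.rows)
        self.rows = []
        self.events = 0
        self.last_flush = time.time()
```

Note that a time-based trigger like this only fires when the next event arrives; a real implementation would also need a timer for idle periods.
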
## Example

Let's walk through a test example of the tool's launch command-line options:

```bash
$PYTHON main.py ${*:1} \
--src-resume \
--src-wait \
--nice-pause=1 \
--log-level=info \
--log-file=ontime.log \
--src-host=127.0.0.1 \
--src-user=root \
--dst-host=127.0.0.1 \
--csvpool \
--csvpool-file-path-prefix=qwe_ \
--mempool-max-flush-interval=60 \
--mempool-max-events-num=1000
```

Options description:
* `--src-resume` - resume data loading from the previous point; when the tool starts, resume from the end of the log
* `--src-wait` - wait for new data to come
* `--nice-pause=1` - sleep for 1 second when no data is available
* `--log-level=info` - log verbosity
* `--log-file=ontime.log` - log file name
* `--src-host=127.0.0.1` - MySQL source host
* `--src-user=root` - MySQL source user (remember the PRIVILEGES required for this user)
* `--dst-host=127.0.0.1` - ClickHouse host
* `--csvpool` - build a pool of CSV files (implies `--mempool`)
* `--csvpool-file-path-prefix=qwe_` - write those CSV files with the `qwe_` prefix into the current working directory
* `--mempool-max-flush-interval=60` - flush the mempool at least every 60 seconds
* `--mempool-max-events-num=1000` - flush the mempool at least every 1000 events (events, not rows)
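
For illustration, a subset of these options could be parsed with `argparse` roughly as follows. This is a sketch only; the tool's real option handling may differ:

```python
import argparse

# Illustrative parser covering a subset of the options described above.
parser = argparse.ArgumentParser(
    description="MySQL -> ClickHouse data reader (sketch)")
parser.add_argument("--src-resume", action="store_true",
                    help="resume from the previous position")
parser.add_argument("--src-wait", action="store_true",
                    help="wait for new data to come")
parser.add_argument("--nice-pause", type=int, default=1,
                    help="sleep N seconds when no data is available")
parser.add_argument("--src-host", default="127.0.0.1", help="MySQL source host")
parser.add_argument("--dst-host", default="127.0.0.1", help="ClickHouse host")
parser.add_argument("--mempool-max-flush-interval", type=int, default=60)
parser.add_argument("--mempool-max-events-num", type=int, default=1000)

# parse a sample command line
args = parser.parse_args(["--src-resume", "--nice-pause=2"])
```
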
# Performance

`pypy` significantly improves performance; you should try it.
For example, you can start with the [Portable PyPy distribution for Linux](https://github.com/squeaky-pl/portable-pypy#portable-pypy-distribution-for-linux).
Unpack it into a place of your choice.

```bash
[user@localhost ~]$ ls -l pypy3.5-5.9-beta-linux_x86_64-portable
total 32
drwxr-xr-x  2 user user   140 Oct 24 01:14 bin
drwxr-xr-x  5 user user  4096 Oct  3 11:57 include
drwxr-xr-x  4 user user  4096 Oct  3 11:57 lib
drwxr-xr-x 13 user user  4096 Oct  3 11:56 lib_pypy
drwxr-xr-x  3 user user    15 Oct  3 11:56 lib-python
-rw-r--r--  1 user user 11742 Oct  3 11:56 LICENSE
-rw-r--r--  1 user user  1296 Oct  3 11:56 README.rst
drwxr-xr-x 14 user user  4096 Oct 24 01:16 site-packages
drwxr-xr-x  2 user user   195 Oct  3 11:57 virtualenv_support
```

Install `pip`:
```bash
pypy3.5-5.9-beta-linux_x86_64-portable/bin/pypy -m ensurepip
```
Install the required modules:
```bash
pypy3.5-5.9-beta-linux_x86_64-portable/bin/pip3 install mysql-replication
pypy3.5-5.9-beta-linux_x86_64-portable/bin/pip3 install clickhouse-driver
```

Now you can run the data reader via `pypy`:
```bash
/home/user/pypy3.5-5.9-beta-linux_x86_64-portable/bin/pypy main.py
```
# Testing

## Testing General Schema

### MySQL Data Types

#### Numeric Types

* `BIT` the number of bits per value, from 1 to 64
* `TINYINT` -128 to 127. The unsigned range is 0 to 255
...
* `DOUBLE`, `REAL` Permissible values are -1.7976931348623157E+308 to -2.2250738585072014E-308, 0, and 2.2250738585072014E-308 to 1.7976931348623157E+308
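
These integer ranges can be checked mechanically. A small illustrative helper (the ranges are the standard MySQL ones; the function name is hypothetical):

```python
# Value ranges of MySQL integer types (standard MySQL values).
MYSQL_INT_RANGES = {
    "TINYINT": (-128, 127),
    "TINYINT UNSIGNED": (0, 255),
    "SMALLINT": (-32768, 32767),
    "SMALLINT UNSIGNED": (0, 65535),
    "INT": (-2147483648, 2147483647),
    "INT UNSIGNED": (0, 4294967295),
    "BIGINT": (-2**63, 2**63 - 1),
    "BIGINT UNSIGNED": (0, 2**64 - 1),
}

def fits(value, mysql_type):
    """Return True if `value` fits into the given MySQL integer column."""
    lo, hi = MYSQL_INT_RANGES[mysql_type.upper()]
    return lo <= value <= hi
```
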

#### Date and Time Types

* `DATE` The supported range is '1000-01-01' to '9999-12-31'
* `DATETIME` The supported range is '1000-01-01 00:00:00.000000' to '9999-12-31 23:59:59.999999'
* `TIMESTAMP` The range is '1970-01-01 00:00:01.000000' UTC to '2038-01-19 03:14:07.999999'
* `TIME` The range is '-838:59:59.000000' to '838:59:59.000000'
* `YEAR` Values display as 1901 to 2155, and 0000

#### String Types

* `CHAR` The range of M is 0 to 255. If M is omitted, the length is 1.
* `VARCHAR` The range of M is 0 to 65,535
* `BINARY` similar to `CHAR`
...
---

### ClickHouse Data Types

* `Date` number of days since 1970-01-01
* `DateTime` Unix timestamp
...

---

### MySQL -> ClickHouse Data Types Mapping

#### Numeric Types

* `BIT` -> ??? (possibly `String`?)
* `TINYINT` -> `Int8`, `UInt8`
...
* `DOUBLE`, `REAL` -> `Float64`

#### Date and Time Types

* `DATE` -> `Date` (for valid values) or `String` (`Date` allows storing values from just after the beginning of the Unix Epoch to the upper threshold defined by a constant at the compilation stage (currently, this is until the year 2038, but it may be expanded to 2106))
* `DATETIME` -> `DateTime` (for valid values) or `String`
...
* `YEAR` -> `UInt16`
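
The "for valid values or `String`" rule can be illustrated with a small fallback helper. This is a sketch under the assumption that out-of-range dates are stored in a `String` column; the bounds mirror the 1970-2038 window mentioned above:

```python
from datetime import date

CLICKHOUSE_DATE_MIN = date(1970, 1, 1)
CLICKHOUSE_DATE_MAX = date(2038, 1, 19)  # compile-time upper threshold, see above

def adapt_date(d):
    """Return the date itself if ClickHouse `Date` can store it,
    else its ISO string representation (a `String` fallback value)."""
    if CLICKHOUSE_DATE_MIN <= d <= CLICKHOUSE_DATE_MAX:
        return d
    return d.isoformat()
```
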

#### String Types

* `CHAR` -> `FixedString`
* `VARCHAR` -> `String`
...
* `JSON` -> ?????? (possibly `String`?)
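
The whole mapping can be summarized as a lookup table. This is an illustrative sketch: the dictionary fills in a few standard mappings not repeated here, and defaults unclear types such as `BIT` and `JSON` to `String`, which is only one possible choice:

```python
# Hypothetical lookup table summarizing the mapping above;
# the tool's actual implementation may differ.
MYSQL_TO_CLICKHOUSE = {
    "TINYINT": "Int8", "TINYINT UNSIGNED": "UInt8",
    "SMALLINT": "Int16", "SMALLINT UNSIGNED": "UInt16",
    "INT": "Int32", "INT UNSIGNED": "UInt32",
    "BIGINT": "Int64", "BIGINT UNSIGNED": "UInt64",
    "FLOAT": "Float32",
    "DOUBLE": "Float64", "REAL": "Float64",
    "DATE": "Date", "DATETIME": "DateTime",
    "YEAR": "UInt16",
    "CHAR": "FixedString", "VARCHAR": "String",
}

def map_type(mysql_type):
    # fall back to String for unknown/unsupported types (BIT, JSON, ...)
    return MYSQL_TO_CLICKHOUSE.get(mysql_type.upper(), "String")
```
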

### MySQL Test Tables

We have to separate the test table into several ones because of this error produced by MySQL:
```text
...
```

```sql
...
INSERT INTO long_varbinary_datatypes SET
    ...
;
```

### ClickHouse Test Tables

```sql
CREATE TABLE datatypes(
    ...
)
;

...

CREATE TABLE long_varbinary_datatypes(
    ...
)
;
```

## Test Cases

### airline.ontime Test Case

Main steps:
* Download the airline.ontime dataset
* Create the airline.ontime MySQL table
* Create the airline.ontime ClickHouse table
* Start the data reader (migrate data MySQL -> ClickHouse)
* Start the data importer (import data into MySQL)
* Check how data are loaded into ClickHouse

#### airline.ontime Data Set in CSV files

Run the [download script](run_airline_ontime_data_download.sh).
You may want to adjust the dirs where `ZIP` and `CSV` files are kept.
In `run_airline_ontime_data_download.sh` edit these lines:
```bash
ZIP_FILES_DIR="zip"
CSV_FILES_DIR="csv"
```

```bash
./run_airline_ontime_data_download.sh
```
Downloading can take some time.

#### airline.ontime MySQL Table

Create a MySQL table of the following structure:

```sql
CREATE DATABASE IF NOT EXISTS `airline`;
CREATE TABLE IF NOT EXISTS `airline`.`ontime` (
    ...
);
```

#### airline.ontime ClickHouse Table

Create a ClickHouse table of the following structure:
```sql
CREATE DATABASE IF NOT EXISTS `airline`;
CREATE TABLE IF NOT EXISTS `airline`.`ontime` (
    ...
) ENGINE = MergeTree(FlightDate, (FlightDate, Year, Month, DepDel15), 8192)
```

#### airline.ontime Data Reader

Run the [datareader script](run_airline_ontime_data_reader.sh).
You may want to adjust the `PYTHON` path and the source and target hosts and usernames:
```bash
PYTHON=python3.6
PYTHON=/home/user/pypy3.5-5.9-beta-linux_x86_64-portable/bin/pypy
```
```bash
...
--src-host=127.0.0.1 \
--src-user=root \
--dst-host=127.0.0.1 \
...
```
```bash
./run_airline_ontime_data_reader.sh
```

#### airline.ontime Data Importer

Run the [data importer script](run_airline_ontime_import.sh).
You may want to adjust the `CSV` files location, the number of imported files, and the MySQL user/password used for import:
```bash
# looking for csv files in this dir
FILES_TO_IMPORT_DIR="/mnt/nas/work/ontime"

# limit import to this number of files
FILES_TO_IMPORT_NUM=3
```
```bash
...
-u root \
...
```

```bash
./run_airline_ontime_import.sh
```
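
The `FILES_TO_IMPORT_DIR`/`FILES_TO_IMPORT_NUM` logic amounts to taking the first N CSV files in sorted order. A hypothetical Python equivalent of that selection step (the import itself is done by `mysqlimport` in the script):

```python
import glob
import os

def files_to_import(dir_path, limit):
    """Return the first `limit` CSV files from `dir_path`, sorted by name."""
    return sorted(glob.glob(os.path.join(dir_path, "*.csv")))[:limit]
```
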
