airline.ontime dataset #36

Merged 6 commits on Nov 21, 2017
README.md: 187 additions, 64 deletions

* [Introduction](#introduction)
* [Requirements](#requirements)
* [Operation](#operation)
  * [Requirements and Limitations](#requirements-and-limitations)
  * [Operation General Schema](#operation-general-schema)
  * [Example](#example)
* [Performance](#performance)
* [Testing](#testing)
  * [Testing General Schema](#testing-general-schema)
    * [MySQL Data Types](#mysql-data-types)
    * [ClickHouse Data Types](#clickhouse-data-types)
    * [MySQL -> ClickHouse Data Types Mapping](#mysql---clickhouse-data-types-mapping)
    * [MySQL Test Tables](#mysql-test-tables)
    * [ClickHouse Test Tables](#clickhouse-test-tables)
  * [Test Cases](#test-cases)
    * [airline.ontime Test Case](#airlineontime-test-case)

---

# Introduction

Utility to import data into ClickHouse from MySQL (mainly) and/or CSV files

# Requirements

The data reader requires **Python 3.x** with additional modules installed.

`mysql-replication` package is used for communication with MySQL:
[https://github.com/noplay/python-mysql-replication](https://github.com/noplay/python-mysql-replication)
```bash
pip install mysql-replication
```

`clickhouse-driver` package is used for communication with ClickHouse:
[https://github.com/mymarilyn/clickhouse-driver](https://github.com/mymarilyn/clickhouse-driver)
```bash
pip install clickhouse-driver
```
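
To give a feel for how these two packages fit together, here is a minimal, hypothetical sketch (this is not how the tool itself is launched — see [Example](#example) for that). It tails MySQL `INSERT` row events and relays them into ClickHouse; the connection settings, `server_id`, and target table are assumptions for illustration only:

```python
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import WriteRowsEvent
from clickhouse_driver import Client

# Assumed credentials matching the 'reader' user created below
stream = BinLogStreamReader(
    connection_settings={"host": "127.0.0.1", "port": 3306,
                         "user": "reader", "passwd": "qwerty"},
    server_id=100,                 # must differ from the MySQL server's own server-id
    only_events=[WriteRowsEvent],  # INSERTs only - UPDATE/DELETE are not handled
    blocking=True,                 # wait for new events to come
    resume_stream=True,
)
client = Client("127.0.0.1")

for event in stream:
    # Each row is a dict of column name -> value; clickhouse-driver
    # accepts a list of such dicts for a bulk INSERT
    rows = [row["values"] for row in event.rows]
    client.execute("INSERT INTO airline.ontime VALUES", rows)  # hypothetical target table
```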

Also, at least one of the following MySQL privileges is required for this operation: `SUPER`, `REPLICATION CLIENT`

```sql
CREATE USER 'reader'@'localhost' IDENTIFIED BY 'qwerty';
GRANT REPLICATION CLIENT, REPLICATION SLAVE, SUPER ON *.* TO 'reader'@'%';
FLUSH PRIVILEGES;
```

Also the following MySQL config options are required:
```ini
[mysqld]
server-id = 1
max_binlog_size = 100M
binlog-format = row # Very important if you want to receive write, update and delete row events
```

# Operation

## Requirements and Limitations

Data reader understands INSERT SQL statements only. In practice this means that:
* You need to create the required table in ClickHouse before starting the data read procedure. More on how to create the target ClickHouse table: [MySQL -> ClickHouse Data Types Mapping](#mysql---clickhouse-data-types-mapping)
* UPDATE statements are not handled, meaning UPDATEs within MySQL will not be relayed into ClickHouse
* DELETE statements are not handled, meaning DELETEs within MySQL will not be relayed into ClickHouse
* DDL statements are not handled. For example, a source table structure change can lead to insertion errors

## Operation General Schema

* Step 1. The data reader reads data from the source event-by-event (for MySQL binlog) or line-by-line (file).
* Step 2. **OPTIONAL** Caching in a memory pool. Since ClickHouse prefers to receive data in bundles (row-by-row insertion is extremely slow), we need to introduce some caching. The cache can be flushed by any of:
  * number of rows in cache
  * number of events in cache
  * time elapsed
  * data source depleted
* Step 3. **OPTIONAL** Writing a CSV file. Sometimes it is useful to have the data also represented as a file.
* Step 4. Writing data into ClickHouse. Depending on the configuration of the previous steps, data are written into ClickHouse by one of:
  * directly, event-by-event or line-by-line
  * from the memory cache as a bulk insert operation (a sketch of this caching step follows the list)
  * from a CSV file via `clickhouse-client`
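
A minimal sketch of the caching step under assumed names (`MemPool` and the `writer` callable are hypothetical; the real tool drives this behavior through the `--mempool-*` options shown in the [Example](#example) below):

```python
import time

class MemPool:
    """Hypothetical memory pool illustrating the flush triggers listed above."""

    def __init__(self, writer, max_rows=100000, max_events=1000, max_flush_interval=60):
        self.writer = writer  # e.g. a bulk ClickHouse INSERT function
        self.max_rows = max_rows
        self.max_events = max_events
        self.max_flush_interval = max_flush_interval
        self.rows = []
        self.events_num = 0
        self.last_flush = time.time()

    def add(self, event_rows):
        """Cache one event (an event may carry multiple rows)."""
        self.rows.extend(event_rows)
        self.events_num += 1
        if (len(self.rows) >= self.max_rows                                   # rows in cache
                or self.events_num >= self.max_events                         # events in cache
                or time.time() - self.last_flush >= self.max_flush_interval   # time elapsed
                ):
            self.flush()

    def flush(self):
        """Also called once explicitly when the data source is depleted."""
        if self.rows:
            self.writer(self.rows)  # bulk insert is far faster than row-by-row
        self.rows = []
        self.events_num = 0
        self.last_flush = time.time()
```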

## Example

Let's walk through a test example of the tool's launch command-line options:

```bash
$PYTHON main.py ${*:1} \
--src-resume \
--src-wait \
--nice-pause=1 \
--log-level=info \
--log-file=ontime.log \
--src-host=127.0.0.1 \
--src-user=root \
--dst-host=127.0.0.1 \
--csvpool \
--csvpool-file-path-prefix=qwe_ \
--mempool-max-flush-interval=60 \
--mempool-max-events-num=1000
```
Options description:
* `--src-resume` - resume data loading from the previous point; when the tool starts, it resumes from the end of the log
* `--src-wait` - wait for new data to come
* `--nice-pause=1` - when no data is available, sleep for 1 second
* `--log-level=info` - log verbosity
* `--log-file=ontime.log` - log file name
* `--src-host=127.0.0.1` - MySQL source host
* `--src-user=root` - MySQL source user (remember about PRIVILEGES for this user)
* `--dst-host=127.0.0.1` - ClickHouse host
* `--csvpool` - make a pool of CSV files (assumes `--mempool` also)
* `--csvpool-file-path-prefix=qwe_` - put these CSV files with the `qwe_` prefix into `CWD`
* `--mempool-max-flush-interval=60` - flush mempool at least every 60 seconds
* `--mempool-max-events-num=1000` - flush mempool at least every 1000 events (events, not rows)

# Performance

`pypy` significantly improves performance. You should try it.
For example, you can start with the [Portable PyPy distribution for Linux](https://github.com/squeaky-pl/portable-pypy#portable-pypy-distribution-for-linux).
Unpack it into a place of your choice.

```bash
[user@localhost ~]$ ls -l pypy3.5-5.9-beta-linux_x86_64-portable
total 32
drwxr-xr-x 2 user user 140 Oct 24 01:14 bin
drwxr-xr-x 5 user user 4096 Oct 3 11:57 include
drwxr-xr-x 4 user user 4096 Oct 3 11:57 lib
drwxr-xr-x 13 user user 4096 Oct 3 11:56 lib_pypy
drwxr-xr-x 3 user user 15 Oct 3 11:56 lib-python
-rw-r--r-- 1 user user 11742 Oct 3 11:56 LICENSE
-rw-r--r-- 1 user user 1296 Oct 3 11:56 README.rst
drwxr-xr-x 14 user user 4096 Oct 24 01:16 site-packages
drwxr-xr-x 2 user user 195 Oct 3 11:57 virtualenv_support
```

Install `pip`
```bash
pypy3.5-5.9-beta-linux_x86_64-portable/bin/pypy -m ensurepip
```
Install required modules
```bash
pypy3.5-5.9-beta-linux_x86_64-portable/bin/pip3 install mysql-replication
pypy3.5-5.9-beta-linux_x86_64-portable/bin/pip3 install clickhouse-driver
```

Now you can run the data reader via `pypy`:
```bash
/home/user/pypy3.5-5.9-beta-linux_x86_64-portable/bin/pypy main.py
```

# Testing

## Testing General Schema

### MySQL Data Types

#### Numeric Types

* `BIT` the number of bits per value, from 1 to 64
* `TINYINT` -128 to 127. The unsigned range is 0 to 255
* `SMALLINT` -32768 to 32767. The unsigned range is 0 to 65535
* `MEDIUMINT` -8388608 to 8388607. The unsigned range is 0 to 16777215
* `INT`, `INTEGER` -2147483648 to 2147483647. The unsigned range is 0 to 4294967295
* `BIGINT` -9223372036854775808 to 9223372036854775807. The unsigned range is 0 to 18446744073709551615
* `DECIMAL`, `NUMERIC` The maximum number of digits is 65
* `FLOAT` Permissible values are -3.402823466E+38 to -1.175494351E-38, 0, and 1.175494351E-38 to 3.402823466E+38
* `DOUBLE`, `REAL` Permissible values are -1.7976931348623157E+308 to -2.2250738585072014E-308, 0, and 2.2250738585072014E-308 to 1.7976931348623157E+308


#### Date and Time Types

* `DATE` The supported range is '1000-01-01' to '9999-12-31'
* `DATETIME` The supported range is '1000-01-01 00:00:00.000000' to '9999-12-31 23:59:59.999999'
* `TIMESTAMP` The range is '1970-01-01 00:00:01.000000' UTC to '2038-01-19 03:14:07.999999'
* `TIME` The range is '-838:59:59.000000' to '838:59:59.000000'
* `YEAR` Values display as 1901 to 2155, and 0000

#### String Types
* `CHAR` The range of M is 0 to 255. If M is omitted, the length is 1.
* `VARCHAR` The range of M is 0 to 65,535
* `BINARY` similar to CHAR

---

### ClickHouse Data Types

* `Date` number of days since 1970-01-01
* `DateTime` Unix timestamp

---

### MySQL -> ClickHouse Data Types Mapping

#### Numeric Types

* `BIT` -> ??? (possibly `String`?)
* `TINYINT` -> `Int8`, `UInt8`
* `DOUBLE`, `REAL` -> `Float64`


#### Date and Time Types

* `DATE` -> `Date` (for valid values) or `String` (`Date` Allows storing values from just after the beginning of the Unix Epoch to the upper threshold defined by a constant at the compilation stage (currently, this is until the year 2038, but it may be expanded to 2106))
* `DATETIME` -> `DateTime` (for valid values) or `String`
* `YEAR` -> `UInt16`


#### String Types

* `CHAR` -> `FixedString`
* `VARCHAR` -> `String`
* `JSON` -> ?????? (possibly `String`?)
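
Condensed into a lookup table, the mapping might look as follows. This is a hypothetical sketch, not the tool's actual implementation; entries for types not visible in the lists above follow the same width-for-width pattern and are marked as assumptions:

```python
# Hypothetical condensed mapping (unsigned MySQL types map to the UInt* variants)
MYSQL_TO_CLICKHOUSE = {
    "TINYINT":   "Int8",
    "SMALLINT":  "Int16",       # assumed: same width-for-width pattern
    "MEDIUMINT": "Int32",       # assumed
    "INT":       "Int32",       # assumed
    "BIGINT":    "Int64",       # assumed
    "FLOAT":     "Float32",     # assumed
    "DOUBLE":    "Float64",
    "DATE":      "Date",        # valid values only, otherwise String
    "DATETIME":  "DateTime",    # valid values only, otherwise String
    "YEAR":      "UInt16",
    "CHAR":      "FixedString",
    "VARCHAR":   "String",
}

def clickhouse_type(mysql_type: str) -> str:
    """Fall back to String for types without a clean counterpart (BIT, JSON, ...)."""
    return MYSQL_TO_CLICKHOUSE.get(mysql_type.upper(), "String")
```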


### MySQL Test Tables

We have to split the test table into several ones because of this error produced by MySQL:
```text
ERROR 1118 (42000): Row size too large. The maximum row size for the used table type, not counting BLOBs, is 65535. This includes storage overhead, check the manual. You have to change some columns to TEXT or BLOBs
```

### ClickHouse Test Tables

```sql
CREATE TABLE datatypes(
    -- ...
);
-- ...
CREATE TABLE long_varbinary_datatypes(
    -- ...
);
```

## Test Cases

### airline.ontime Test Case
Main steps:
* Download the airline.ontime dataset
* Create the airline.ontime MySQL table
* Create the airline.ontime ClickHouse table
* Start the data reader (migrates data MySQL -> ClickHouse)
* Start the data importer (imports data into MySQL)
* Check how the data are loaded into ClickHouse (a verification sketch closes this section)

#### airline.ontime Data Set in CSV files
Run the [download script](run_airline_ontime_data_download.sh).
You may want to adjust the dirs where `ZIP` and `CSV` files are kept.
In `run_airline_ontime_data_download.sh`, edit these lines:
```bash
ZIP_FILES_DIR="zip"
CSV_FILES_DIR="csv"
```

```bash
./run_airline_ontime_data_download.sh
```
Downloading can take some time.

#### airline.ontime MySQL Table
Create a MySQL table of the following structure:

```sql
CREATE DATABASE IF NOT EXISTS `airline`;
CREATE TABLE IF NOT EXISTS `airline`.`ontime` (
    -- ...
);
```

#### airline.ontime ClickHouse Table
Create a ClickHouse table of the following structure:
```sql
CREATE DATABASE IF NOT EXISTS `airline`;
CREATE TABLE IF NOT EXISTS `airline`.`ontime` (
    -- ...
) ENGINE = MergeTree(FlightDate, (FlightDate, Year, Month, DepDel15), 8192)
```

#### airline.ontime Data Reader
Run the [data reader script](run_airline_ontime_data_reader.sh).
You may want to adjust the `PYTHON` path and the source and target hosts and usernames:
```bash
PYTHON=python3.6
PYTHON=/home/user/pypy3.5-5.9-beta-linux_x86_64-portable/bin/pypy
```
```bash
...
--src-host=127.0.0.1 \
--src-user=root \
--dst-host=127.0.0.1 \
...
```
```bash
./run_airline_ontime_data_reader.sh
```

#### airline.ontime Data Importer
Run the [data importer script](run_airline_ontime_import.sh).
You may want to adjust the `CSV` files location, the number of imported files, and the MySQL user/password used for import:
```bash
# looking for csv files in this dir
FILES_TO_IMPORT_DIR="/mnt/nas/work/ontime"

# limit import to this number of files
FILES_TO_IMPORT_NUM=3
```
```bash
...
-u root \
...
```

```bash
./run_airline_ontime_import.sh
```
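
To check how the data are loaded into ClickHouse (the last of the main steps above), row counts can be inspected once the importer has run. A minimal sketch using `clickhouse-driver`, assuming the host and table names from the examples above:

```python
from clickhouse_driver import Client

client = Client("127.0.0.1")

# Total number of rows migrated into ClickHouse
rows = client.execute("SELECT count() FROM airline.ontime")
print("rows in airline.ontime:", rows[0][0])

# Rows per year, to spot-check against the imported CSV files
for year, count in client.execute(
        "SELECT Year, count() FROM airline.ontime GROUP BY Year ORDER BY Year"):
    print(year, count)
```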