diff --git a/utilities/Convert_sqlite2csv-Clean.ipynb b/utilities/Convert_sqlite2csv-Clean.ipynb new file mode 100644 index 000000000..e7badbafa --- /dev/null +++ b/utilities/Convert_sqlite2csv-Clean.ipynb @@ -0,0 +1,617 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Use metadata to filter out full datasets\n", + "\n", + "From Felix:\n", + "\n", + "The id is the value that later ends up in observation, and denotes a unique dataset. The payload designates a given data transmission as either incremental, meaning that it’s a slice of data that’s transmitted during the experiment, or full, meaning that it contains the entire dataset to that point (we transmit the full dataset upon completion of the experiment). For incremental datasets, the slice indicates the first row from the dataset contained in that transmission.\n", + "\n", + "\n", + "So the idea here is to convert the `metadata` which is a series of STRING dictonaries [1](#myfootnote1) (those are dictionaries but saved as strings) into a DF. Following that to get the corresponding id or row slice number and select this row from the data list of JSONs.\n", + "\n", + "\n", + " \n", + " 1: For example see the following output:\n", + "\n", + "```\n", + "surveys_df['metadata'][0]\n", + "\n", + "'{\"slice\":47,\"id\":\"17734ede-5c5d-4c50-a48e-2acfaac1aa68\",\"payload\":\"incremental\"}'\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Follow Felix's R code" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Import data from SQLite file" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import sqlite3\n", + "import numpy as np\n", + "\n", + "con = sqlite3.connect(\"./data.sqlite\")\n", + "\n", + "# Load all the data into a DataFrame\n", + "database_d = pd.read_sql_query(\"SELECT * from labjs\", con)\n", + "\n", + "# Be sure to close the connection\n", + "con.close()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Convert JSON data to table" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Extract metadata" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "# create a dataframe from the metadata column\n", + "database_d_meta = database_d['metadata'].apply(eval).apply(pd.Series).rename(columns={\"id\":\"observation\"})" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
sliceobservationpayload
000a460f84-7833-4d43-b505-96d04bdf76d5full
100a460f84-7833-4d43-b505-96d04bdf76d5incremental
\n", + "
" + ], + "text/plain": [ + " slice observation payload\n", + "0 0 0a460f84-7833-4d43-b505-96d04bdf76d5 full\n", + "1 0 0a460f84-7833-4d43-b505-96d04bdf76d5 incremental" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "database_d_meta" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Merge metadata back" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "# merge the two and remove metadata column\n", + "database_d = database_d.join(database_d_meta).drop(\"metadata\", axis=1)" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
idsessiontimestampurldatasliceobservationpayload
01o5pbac2mcj749kscpssso4f56d2021-06-08T09:32:43+00:00http://face-experiment.weizmann.ac.il/3/[{\"url\":[],\"meta\":{\"labjs_version\":\"20.2.2\",\"l...00a460f84-7833-4d43-b505-96d04bdf76d5full
12o5pbac2mcj749kscpssso4f56d2021-06-08T09:32:43+00:00http://face-experiment.weizmann.ac.il/3/[{\"url\":[],\"meta\":{\"labjs_version\":\"20.2.2\",\"l...00a460f84-7833-4d43-b505-96d04bdf76d5incremental
\n", + "
" + ], + "text/plain": [ + " id session timestamp \\\n", + "0 1 o5pbac2mcj749kscpssso4f56d 2021-06-08T09:32:43+00:00 \n", + "1 2 o5pbac2mcj749kscpssso4f56d 2021-06-08T09:32:43+00:00 \n", + "\n", + " url \\\n", + "0 http://face-experiment.weizmann.ac.il/3/ \n", + "1 http://face-experiment.weizmann.ac.il/3/ \n", + "\n", + " data slice \\\n", + "0 [{\"url\":[],\"meta\":{\"labjs_version\":\"20.2.2\",\"l... 0 \n", + "1 [{\"url\":[],\"meta\":{\"labjs_version\":\"20.2.2\",\"l... 0 \n", + "\n", + " observation payload \n", + "0 0a460f84-7833-4d43-b505-96d04bdf76d5 full \n", + "1 0a460f84-7833-4d43-b505-96d04bdf76d5 incremental " + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "database_d" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Shorten the random ids" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To check how many letters in the ID we need to have as short as possible but unique identifiers, it goes over the ID and observation column and checks if number of unique parts of the string equals to number of the unique whole tags.\n", + "\n", + "We are looking for a shortest sequence -- what is the minimal number of characters which will preserve how many unique inputs there is." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "5\n" + ] + } + ], + "source": [ + "for i in range(5,37):\n", + " if (database_d.session.str.slice(stop=i).unique().size == database_d.session.unique().size ) \\\n", + " and (database_d.observation.str.slice(stop=i).unique().size==database_d.observation.unique().size):\n", + " print(i) \n", + " # if found the minimal code, replace the original session and observation with this shorted code\n", + " database_d.session = database_d.session.str.slice(stop=i)\n", + " database_d.observation = database_d.observation.str.slice(stop=i)\n", + " break\n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "scrolled": false + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
idsessiontimestampurldatasliceobservationpayload
01o5pba2021-06-08T09:32:43+00:00http://face-experiment.weizmann.ac.il/3/[{\"url\":[],\"meta\":{\"labjs_version\":\"20.2.2\",\"l...00a460full
12o5pba2021-06-08T09:32:43+00:00http://face-experiment.weizmann.ac.il/3/[{\"url\":[],\"meta\":{\"labjs_version\":\"20.2.2\",\"l...00a460incremental
\n", + "
" + ], + "text/plain": [ + " id session timestamp \\\n", + "0 1 o5pba 2021-06-08T09:32:43+00:00 \n", + "1 2 o5pba 2021-06-08T09:32:43+00:00 \n", + "\n", + " url \\\n", + "0 http://face-experiment.weizmann.ac.il/3/ \n", + "1 http://face-experiment.weizmann.ac.il/3/ \n", + "\n", + " data slice observation \\\n", + "0 [{\"url\":[],\"meta\":{\"labjs_version\":\"20.2.2\",\"l... 0 0a460 \n", + "1 [{\"url\":[],\"meta\":{\"labjs_version\":\"20.2.2\",\"l... 0 0a460 \n", + "\n", + " payload \n", + "0 full \n", + "1 incremental " + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "database_d" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Prepare to extract JSON data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Extract complete datasets" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Int64Index([0], dtype='int64')\n" + ] + }, + { + "data": { + "text/plain": [ + "1" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "completed_idx = database_d.loc[database_d.payload == 'full'].index\n", + "print(completed_idx)\n", + "completed_idx.size" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "# if no experiment ended on full, skip this part\n", + "if not completed_idx.empty:\n", + " # get all the metadata which end up on full AND add the observation column to each register\n", + " database_d_full = pd.concat([pd.read_json(database_d['data'].iloc[i], dtype={'sender_id': object}).join(\n", + " pd.DataFrame({'observation': [database_d.loc[i].observation]})) for i in completed_idx])\n", + "\n", + " # unpack the first JSON meta layer\n", + " temp = pd.DataFrame.from_dict(database_d_full.meta.apply(pd.Series)).drop(0, axis=1)\n", + "\n", + " # Unpack the inner meta layer, add prefix\n", + " temp = pd.concat([temp, temp['labjs_build'].apply(pd.Series).drop(0, axis=1)],\n", + " axis=1).drop('labjs_build', axis=1).add_prefix(\"labjs_build_\")\n", + "\n", + " # concat with the original df and drop id and data columns and rename the empty col to be called x as in the R version\n", + " # also add a prefix meta_ to all the meta columns\n", + " database_d_full = pd.concat([database_d_full, temp.add_prefix(\"meta_\")], axis=1).rename(columns={\"\": \"x\"}).drop(\n", + " \"meta\", axis=1)\n", + "\n", + " # add observation to all columns\n", + " database_d_full.observation.fillna(method=\"ffill\", inplace=True)\n", + " \n", + "else:\n", + " database_d_full=pd.DataFrame({\"observation\":[]})\n", + " print(\"No experiment ended on full!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Extract incremental datasets\n", + "\n", + "do the exact same thing but with the refisters which did not end up on full and are only partial" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "temp = None\n", + "# get all the metadata which does not end up on full AND add the observation column to each register\n", + "# dtype={'sender_id': object} is necessary for https://github.com/pandas-dev/pandas/issues/41408, https://github.com/pandas-dev/pandas/issues/17869\n", + "database_d_incremental = pd.concat([pd.read_json(database_d['data'].iloc[i], dtype={'sender_id': object}).join(\n", + " pd.DataFrame({'observation': database_d.loc[i].observation}, index=[0])) for i in \\\n", + " database_d.loc[database_d.payload.isin(['latest', 'incremental'])].index])\n", + "\n", + "# unpack the first meta layer\n", + "temp = pd.DataFrame.from_dict(database_d_incremental.meta.apply(pd.Series)).drop(0, axis=1)\n", + "\n", + "# Unpack the inner layer, add prefix\n", + "temp = pd.concat([temp, temp['labjs_build'].apply(pd.Series).drop(0, axis=1)],\n", + " axis=1).drop('labjs_build', axis=1).add_prefix(\"labjs_build_\")\n", + "\n", + "# concat with the original df and drop id and data columns and rename the empty col to be called x as in the R version\n", + "# also add a prefix meta_ to all the meta columns\n", + "database_d_incremental = pd.concat([database_d_incremental, temp.add_prefix(\"meta_\")], axis=1).rename(\n", + " columns={\"\": \"x\"}).drop(\"meta\", axis=1)\n", + "\n", + "\n", + "# add observations to all rows\n", + "database_d_incremental.observation.fillna(method=\"ffill\", inplace=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Merge datasets" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "database_d_output = pd.concat([database_d_full, \\\n", + " database_d_incremental[~database_d_incremental.observation.isin(database_d_full.observation)]])" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "# If there is a userID column defined, show how many and which users we have\n", + "if \"userID\" in database_d_output.columns:\n", + " print(f\"There are {database_d_output.userID.unique().shape} participants with the following names:\")\n", + " print(database_d_output.userID.unique())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Postprocessing" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Fill nans" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "if \"code\" in database_d_full.columns:\n", + " database_d_full['code'].fillna(method='ffill')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Remove sensitive data" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "if \"Email\" in database_d_full.columns:\n", + " database_d_full.columns.drop(\"Email\", axis=1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Save dataset as csv" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [], + "source": [ + "name = 'Felix'\n", + "database_d_full.to_csv(name+'_database_full.csv')\n", + "database_d_incremental.to_csv(name+'_database_incremental.csv')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### End of Felix's" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.5" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}