The package is in development. Please leave an issue or raise a pull request if you have ideas for its improvement.
You can install the development version of {azkit} with:
``` r
# install.packages("pak")
pak::pak("The-Strategy-Unit/azkit")
```

A primary function in {azkit} enables access to an Azure blob container:
``` r
data_container <- azkit::get_container()
```
Authentication is handled "under the hood" by the `get_container()` function, but if you need to, you can explicitly return an authentication token for inspection or testing:
``` r
my_token <- azkit::get_auth_token()
```
By default, the container returned is determined by the name stored in the `AZ_CONTAINER` environment variable, if set, but you can override this by supplying a container name to the function:
``` r
custom_container <- azkit::get_container("custom")
```

Return a list of all available containers in your default Azure storage with:
``` r
list_container_names()
```

Once you have access to a container, you can use one of a set of data reading functions to bring data into R from .parquet, .rds, .json or .csv files:
``` r
pqt_data <- azkit::read_azure_parquet(data_container, "v_important_data")
```
The functions will try to match a file of the required type using the file name supplied. In the case above, "v_important_data" would match a file named "v_important_data.parquet", so there is no need to supply the file extension.
By default the `read_*` functions will look in the root folder of the container. To specify a subfolder, supply it to the `path` argument, as in the sketch below. The functions will not search recursively into further subfolders, so the path needs to be full and accurate.
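For example, a minimal sketch (the subfolder and file names here are hypothetical):

``` r
# "reports" is a hypothetical subfolder of the container root
monthly <- azkit::read_azure_parquet(
  data_container,
  "monthly_summary",
  path = "reports"
)
```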
Alternatively, you may have "long" filenames that include the full notional path to the file, in which case you can ignore the `path` argument. Long filenames are returned by `azkit::list_files()`, for example:
``` r
azkit::list_files(data_container, "data/latest", "parquet") |>
  purrr::map(\(x) azkit::read_azure_parquet(data_container, x, info = FALSE))
```

If there is more than one file matching the string supplied to the `file` argument, the functions will throw an error.
Specifying the exact filename will avoid this, of course, but shorter `file` arguments may be convenient in some situations, as sketched below.
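A minimal sketch of the ambiguity, using hypothetical filenames:

``` r
# Suppose the container holds "sales_2023.parquet" and "sales_2024.parquet".
# This short file argument matches both, so it will throw an error:
# azkit::read_azure_parquet(data_container, "sales")

# An unambiguous name reads the single matching file:
sales <- azkit::read_azure_parquet(data_container, "sales_2024")
```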
Currently these functions only read in a single file at a time.
Setting the `info` argument to TRUE will make the functions give some confirmatory feedback about which file is being read in.
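For instance, reusing the earlier example:

``` r
# Prints a confirmatory message naming the file being read
pqt_data <- azkit::read_azure_parquet(
  data_container,
  "v_important_data",
  info = TRUE
)
```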
You can also pass through arguments to the underlying reading function, for example `col_types` for `readr::read_delim()` when reading in a CSV file:
``` r
csv_data <- data_container |>
  azkit::read_azure_csv("vital_data.csv", path = "data", col_types = "ccci")
```
To access Azure Storage you will want to set some environment variables. The neatest way to do this is to include a `.Renviron` file in your project folder. Remember to include `.Renviron` in the `.gitignore` file for your project, so that these values are not committed to version control.
Your `.Renviron` file should contain the variables below. Ask a member of the Data Science team for the necessary values.
```
# essential
AZ_STORAGE_EP=
# useful but not absolutely essential:
AZ_CONTAINER=
# optional, for certain authentication scenarios:
AZ_TENANT_ID=
AZ_CLIENT_ID=
AZ_APP_SECRET=
```
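You can check that the variables are visible in your R session with base R's `Sys.getenv()`:

``` r
# Returns "" if the variable has not been set
Sys.getenv("AZ_STORAGE_EP")
Sys.getenv("AZ_CONTAINER")
```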
The values you need may vary depending on the specific container you're connecting to. For one project you might want to set the default container (`AZ_CONTAINER`) to one value, while for a different project you might mainly be working with a different container. It therefore makes sense to set these values in the `.Renviron` file for each project, rather than globally for your account.
Please use the Issues feature on GitHub to report any bugs, ideas or problems, including with the package documentation. Alternatively, you can contact Fran Barton with any questions about the package.
If you wish to clone this package for development, including running the included tests, you will need some further environment variables in your local `.Renviron`. Contact Fran if you need help with this.
