Acknowledgements
import Pkg; Pkg.status()
Status `~/Projects/2022-July-Workshop/module2-03-DataDeps/Project.toml`
  [336ed68f] CSV v0.10.4
  [13f3f980] CairoMakie v0.8.10
  [944b1d66] CodecZlib v0.7.0
  [124859b0] DataDeps v0.7.9
  [a93c6f00] DataFrames v1.3.4
  [860ef19b] StableRNGs v1.0.0
  [9a3f8284] Random
  [10745b16] Statistics
versioninfo()
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i9-10900KF CPU @ 3.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA = ~/julia-1.7
  JULIADEV = ~/julia-1.7
  JULIALTS = ~/julia-1.6
Abstract:
We present DataDeps.jl: a julia package for the reproducible handling of static datasets to enhance the repeatability of scripts used in the data and computational sciences. It is used to automate the data setup part of running software which accompanies a paper to replicate a result. This step is commonly done manually, which expends time and allows for confusion. This functionality is also useful for other packages which require data to function (e.g. a trained machine learning based model). DataDeps.jl simplifies extending research software by automatically managing the dependencies and makes it easier to run another author’s code, thus enhancing the reproducibility of data science research.
Source: DataDeps.jl documentation
Git is good for files that meet three requirements: they are small, they are (mostly) plain text, and they change together with the code.
There is certainly some room around the edges here; storing a few images in a repository is OK, but storing all of ImageNet is a no-go.
DataDeps.jl is good for the rest.
The main use case is downloading large datasets for machine learning, and corpora for NLP.
In these cases the data is usually not even yours to begin with: it lives on some website somewhere.
You don't want to copy and redistribute it, and depending on the license you may not even be allowed to.
The DataDeps.jl package makes it so that the special incantation
datadep"my_dataset" # equivalent to @datadep_str "my_dataset"
automatically resolves to a local filepath where the dataset is stored, downloading the data first if necessary. Any function that accepts a filepath can then consume the result, as the sketch below shows.
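Here is a minimal, hypothetical usage sketch (the DataDep name mydata and the file data.csv are placeholders, not part of DataDeps.jl):

# Resolves to a local path such as ~/.julia/datadeps/mydata/data.csv,
# downloading and preprocessing the dataset first if it is not yet installed.
path = datadep"mydata/data.csv"
# Any filepath-consuming function can use the result.
df = CSV.read(path, DataFrame)

Let's look at the interface for the DataDep type: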
DataDep(
name::String,
message::String,
remote_path::Union{String,Vector{String}...},
[checksum::Union{String,Vector{String}...},]; # Optional; if omitted, DataDeps.jl warns and prints the hash to add
# Optional keyword arguments
fetch_method=fetch_default # (remote_filepath, local_directory)->local_filepath
post_fetch_method=identity # (local_filepath)->Any
)
A DataDep has a name, which corresponds to a parent folder where the data is to be stored. The message is displayed to the user. This is a good place to make any acknowledgements regarding the source, including important URLs, papers that should be cited, links to additional information, and the license (when applicable). Importantly, users must consent to downloading the data at this stage.

The remote_path is the source associated with a DataDep. This can be a single path or multiple paths when given as a list.

The checksum is verified after downloading to guarantee that the files associated with a DataDep are the same as those specified by the creator of the registration block. DataDeps.jl will help you generate a checksum that you can then paste into your registration block. It can be a single checksum or multiple checksums applied to each file in remote_path.

The checksum is not strictly required, but providing one is good practice!
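If you ever want to compute a file's checksum yourself, here is a minimal sketch using Julia's SHA standard library. I am assuming the hash DataDeps.jl reports is SHA-256 (which matches the warnings shown later in this notebook), and the filepath is a placeholder:

using SHA

# Compute the SHA-256 hex digest of a local file.
hexdigest = bytes2hex(open(sha2_256, "path/to/iris.data"))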
If you just want to include a dataset in your project, without any preprocessing, the first four arguments are all you need.
The fetch_method is a function that lets you override the initial download step for a remote_filepath. Given a local_directory, this function must determine a local file name and return that name at the end.

By default, the fetch_method simply uses HTTP.jl to handle URLs or Base.download to handle other filepaths.
You want to customize this if:

- you need to use a transfer protocol not covered by Base Julia,
- accessing the source requires some form of authentication (see the sketch below), or
- your dataset is generated by a simulation you wish to reproduce locally (the spirals example at the end of this notebook does exactly this).
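Here is a minimal sketch of a custom fetch_method for the authentication case. The MY_API_TOKEN environment variable and the bearer-token scheme are hypothetical; HTTP.jl is the same package the default fetch_method uses:

using HTTP

function my_fetch_method(remote_filepath, local_directory)
    # Derive the local file name from the remote path.
    local_filepath = joinpath(local_directory, basename(remote_filepath))
    # Download with an authorization header (hypothetical token).
    headers = ["Authorization" => "Bearer $(ENV["MY_API_TOKEN"])"]
    HTTP.download(remote_filepath, local_filepath, headers)
    # A fetch_method must return the local file name.
    return local_filepath
end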
The post_fetch_method is a function that allows one to manipulate the original data retrieved by the fetch_method.
You want to customize this if you want to do some processing of the data and use that version as the starting point in your project. For example (a concrete sketch follows the list):

- If you are creating multiple DataDeps for tabular data in your project, you may want to set a standard format for column names (x1, x2, … or gene1, gene2, …).
- You want to screen for missing values, corrupted entries, and so on, and have already determined a set of commands to do this automatically.
- You want to apply some minor transformations to the data so that it is ready to use in your project immediately after loading it.
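As a concrete sketch, suppose the downloaded file were gzip-compressed and we only wanted to keep a decompressed copy. This uses CodecZlib (loaded below); the output name data.csv is a placeholder:

using CodecZlib

function my_post_fetch_method(local_filepath)
    # post_fetch_method runs inside the DataDep's directory,
    # so relative paths like "data.csv" land in the right place.
    open(local_filepath) do io
        write("data.csv", read(GzipDecompressorStream(io)))
    end
    # Keep only the decompressed copy.
    rm(local_filepath)
end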
# Reading, writing, manipulating, and compressing data
using CSV, DataFrames, CodecZlib
# Random number generation
using Random, StableRNGs
# Basic statistics functions
using Statistics
# Visualization
using CairoMakie
# The star of the show: creating data dependencies
using DataDeps
# Set this environment variable to disable the default load paths in DataDeps.jl.
ENV["DATADEPS_NO_STANDARD_LOAD_PATH"] = true
# Specify the load path just for the purposes of this demo.
if haskey(ENV, "DATADEPS_LOAD_PATH")
    # Remove any leftover directory from a previous run.
    rm(ENV["DATADEPS_LOAD_PATH"]; recursive=true, force=true)
end
ENV["DATADEPS_LOAD_PATH"] = mktempdir("/tmp/")
"/tmp/jl_ObwpYP"
For this example we will use the famous iris data hosted by the UCI Machine Learning Repository.
datadep_iris = DataDep(
# 1. Call this dataset "iris".
"iris",
# 2. Set the message to display when downloading.
"""
Dataset: iris
Author: R. A. Fisher (donated by Michael Marshall)
Observations: 150
Features: 4
Classes: 3
Please see https://archive.ics.uci.edu/ml/datasets/iris for additional information.
""",
# 3. Set the remote_path used to download data.
"http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
)
DataDep{Nothing, String, typeof(DataDeps.fetch_default), typeof(identity)}("iris", "http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", nothing, DataDeps.fetch_default, identity, "Dataset: iris\nAuthor: R. A. Fisher (donated by Michael Marshall)\n\nObservations: 150\nFeatures: 4\nClasses: 3\n\nPlease see https://archive.ics.uci.edu/ml/datasets/iris for additional information.\n")
register(datadep_iris)
DataDep{Nothing, String, typeof(DataDeps.fetch_default), typeof(identity)}("iris", "http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", nothing, DataDeps.fetch_default, identity, "Dataset: iris\nAuthor: R. A. Fisher (donated by Michael Marshall)\n\nObservations: 150\nFeatures: 4\nClasses: 3\n\nPlease see https://archive.ics.uci.edu/ml/datasets/iris for additional information.\n")
@datadep_str "iris"
This program has requested access to the data dependency iris.
which is not currently installed. It can be installed automatically, and you will not see this message again.

Dataset: iris
Author: R. A. Fisher (donated by Michael Marshall)

Observations: 150
Features: 4
Classes: 3

Please see https://archive.ics.uci.edu/ml/datasets/iris for additional information.

Do you want to download the dataset from http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data to "/tmp/jl_ObwpYP/iris"?
[y/n]
stdin> y
┌ Info: Downloading
│   source = http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
│   dest = /tmp/jl_ObwpYP/iris/iris.data
│   progress = 1.0
│   time_taken = 0.03 s
│   time_remaining = 0.0 s
│   average_speed = 143.365 KiB/s
│   downloaded = 4.444 KiB
│   remaining = 0 bytes
│   total = 4.444 KiB
└ @ HTTP /home/alanderos/.julia/packages/HTTP/S5kNN/src/download.jl:132
┌ Warning: Checksum not provided, add to the Datadep Registration the following hash line
│   hash = 6f608b71a7317216319b4d27b4d9bc84e6abd734eda7872b71a458569e2656c0
└ @ DataDeps /home/alanderos/.julia/packages/DataDeps/EDWdQ/src/verification.jl:44
"/tmp/jl_ObwpYP/iris"
Note the file information in the help message:
Do you want to download the dataset from http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data to "/tmp/jl_ObwpYP/iris"?
This tells us that the file iris.data will be downloaded to a folder iris located in our DataDeps registry.
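As a quick check (hypothetical; the expected output is shown as a comment, not a recorded run):

readdir(@datadep_str "iris")    # expected: ["iris.data"]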
Let's try reading in the file with CSV.jl and DataFrames.jl:
df = CSV.read(@datadep_str("iris/iris.data"), DataFrame)
first(df, 5)
5 rows × 5 columns
  | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa
---|---|---|---|---|---
  | Float64 | Float64 | Float64 | Float64 | String15
1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
Note the column names: CSV.jl mistakenly treated the first data row as a header. Fix this by specifying header=false.
df = CSV.read(@datadep_str("iris/iris.data"), DataFrame, header=false)
first(df, 5)
5 rows × 5 columns
  | Column1 | Column2 | Column3 | Column4 | Column5
---|---|---|---|---|---
  | Float64 | Float64 | Float64 | Float64 | String15
1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
The download information also mentioned a missing checksum:
┌ Warning: Checksum not provided, add to the Datadep Registration the following hash line
│ hash = 6f608b71a7317216319b4d27b4d9bc84e6abd734eda7872b71a458569e2656c0
└ @ DataDeps /home/alanderos/.julia/packages/DataDeps/EDWdQ/src/verification.jl:44
Let's try recreating the DataDep with the checksum and running the process again.
datadep_iris = DataDep(
# 1. Call this dataset "iris".
"iris",
# 2. Set the message to display when downloading.
"""
Dataset: iris
Author: R. A. Fisher (donated by Michael Marshall)
Observations: 150
Features: 4
Classes: 3
Please see https://archive.ics.uci.edu/ml/datasets/iris for additional information.
""",
# 3. Set the remote_path used to download data.
"http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
# 4. Set the checksum this time.
"6f608b71a7317216319b4d27b4d9bc84e6abd734eda7872b71a458569e2656c0",
)
DataDep{String, String, typeof(DataDeps.fetch_default), typeof(identity)}("iris", "http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", "6f608b71a7317216319b4d27b4d9bc84e6abd734eda7872b71a458569e2656c0", DataDeps.fetch_default, identity, "Dataset: iris\nAuthor: R. A. Fisher (donated by Michael Marshall)\n\nObservations: 150\nFeatures: 4\nClasses: 3\n\nPlease see https://archive.ics.uci.edu/ml/datasets/iris for additional information.\n")
register(datadep_iris)
┌ Warning: Over-writing registration of the datadep
│   name = iris
└ @ DataDeps /home/alanderos/.julia/packages/DataDeps/EDWdQ/src/registration.jl:15
DataDep{String, String, typeof(DataDeps.fetch_default), typeof(identity)}("iris", "http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", "6f608b71a7317216319b4d27b4d9bc84e6abd734eda7872b71a458569e2656c0", DataDeps.fetch_default, identity, "Dataset: iris\nAuthor: R. A. Fisher (donated by Michael Marshall)\n\nObservations: 150\nFeatures: 4\nClasses: 3\n\nPlease see https://archive.ics.uci.edu/ml/datasets/iris for additional information.\n")
rm(@datadep_str("iris"); recursive=true)
df = CSV.read(@datadep_str("iris/iris.data"), DataFrame, header=false)
first(df, 5)
This program has requested access to the data dependency iris.
which is not currently installed. It can be installed automatically, and you will not see this message again.

Dataset: iris
Author: R. A. Fisher (donated by Michael Marshall)

Observations: 150
Features: 4
Classes: 3

Please see https://archive.ics.uci.edu/ml/datasets/iris for additional information.

Do you want to download the dataset from http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data to "/tmp/jl_ObwpYP/iris"?
[y/n]
stdin> y
┌ Info: Downloading
│   source = http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
│   dest = /tmp/jl_ObwpYP/iris/iris.data
│   progress = 1.0
│   time_taken = 0.0 s
│   time_remaining = 0.0 s
│   average_speed = ∞ B/s
│   downloaded = 4.444 KiB
│   remaining = 0 bytes
│   total = 4.444 KiB
└ @ HTTP /home/alanderos/.julia/packages/HTTP/S5kNN/src/download.jl:132
5 rows × 5 columns
  | Column1 | Column2 | Column3 | Column4 | Column5
---|---|---|---|---|---
  | Float64 | Float64 | Float64 | Float64 | String15
1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
Everything worked, and there was no checksum warning this time.

We now have an easy way to access the iris dataset, but there are a few improvements we could make:

- Problem: Remembering that the filename is iris.data is a bit annoying.
- Problem: For our project, we always want to load the data as a DataFrame (or some other standard format).
- Problem: We want to reorder the columns so that all our DataDeps have a common format.

Solution: handle all of this with a custom post_fetch_method!
This example uses breast cytology data for the diagnosis of breast cancer. A summary is provided at https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original).

To demonstrate how DataDeps.jl handles multiple files and allows for preprocessing, we will:

- create a single DataDep that stores two classification datasets, bcw.csv and wdbc.csv,
- recode the class labels from 2 and 4 (or B and M) to benign and malignant, and
- record column information in a separate .info file.

We implement these steps in a custom post_fetch_method, which accepts a local filename as an argument:
function my_function(local_filepath)
# do something; no restrictions here
end
First, we define a function to help organize columns in a standard format.
function create_standard_df(
local_filepath;
header=false,
missingstring="",
class_index::Integer=0,
feature_indices=1:0,
)
# Sanity checks.
if class_index < 1
error("class_index should be a positive integer.")
end
if isempty(feature_indices) || any(<(1), feature_indices)
error("feature_indices should contain positive integers.")
end
# Read the input DataFrame.
input_df = CSV.read(local_filepath, DataFrame; header=header, missingstring=missingstring)
# Initialize output DataFrame.
output_df = DataFrame()
# Add the (first) column corresponding to class labels.
output_df[!, :class] = input_df[!, class_index]
# Add the remaining columns corresponding to features/predictors.
for (i, feature_index) in enumerate(feature_indices)
column_name = Symbol("feature", i)
output_df[!, column_name] = input_df[!, feature_index]
end
return output_df
end
create_standard_df (generic function with 1 method)
Test the function on our iris example:
iris_df = create_standard_df(@datadep_str("iris/iris.data");
header=false, # source does not have a header
class_index=5, # class labels are stored in Column 5
feature_indices=1:4 # features are stored in Columns 1-4
)
first(iris_df, 5)
5 rows × 5 columns
  | class | feature1 | feature2 | feature3 | feature4
---|---|---|---|---|---
  | String15 | Float64 | Float64 | Float64 | Float64
1 | Iris-setosa | 5.1 | 3.5 | 1.4 | 0.2 |
2 | Iris-setosa | 4.9 | 3.0 | 1.4 | 0.2 |
3 | Iris-setosa | 4.7 | 3.2 | 1.3 | 0.2 |
4 | Iris-setosa | 4.6 | 3.1 | 1.5 | 0.2 |
5 | Iris-setosa | 5.0 | 3.6 | 1.4 | 0.2 |
Now let's define a helper function that processes the file breast-cancer-wisconsin.data.

A few things to note (based on the notes in breast-cancer-wisconsin.names):
There are 16 instances in Groups 1 to 6 that contain a single missing (i.e., unavailable) attribute value, now denoted by "?".
The structure of the .data file is:
 #   Attribute                      Domain
 --  -----------------------------------------
 1.  Sample code number             id number
 2.  Clump Thickness                1 - 10
 3.  Uniformity of Cell Size        1 - 10
 4.  Uniformity of Cell Shape       1 - 10
 5.  Marginal Adhesion              1 - 10
 6.  Single Epithelial Cell Size    1 - 10
 7.  Bare Nuclei                    1 - 10
 8.  Bland Chromatin                1 - 10
 9.  Normal Nucleoli                1 - 10
 10. Mitoses                        1 - 10
 11. Class:                         (2 for benign, 4 for malignant)
function process_bcw(local_filepath)
# First, let's standardize the DataFrame format. We'll drop the sample code number.
df = create_standard_df(local_filepath;
header=false,
missingstring=["?",""],
class_index=11,
feature_indices=2:10,
)
# Next, change the class labels from 2 and 4 to benign and malignant, respectively.
# If we encounter another label, set it to missing.
new_label = Vector{Union{Missing,String}}(undef, length(df.class))
for (i, row) in enumerate(eachrow(df))
if row.class == 2
new_label[i] = "benign"
elseif row.class == 4
new_label[i] = "malignant"
else
new_label[i] = missing
end
end
df.class = new_label
# Now drop any rows with missing values and set the type of every feature to Float64.
df = dropmissing(df)
for i in 2:ncol(df)
df[!,i] = map(xi -> Float64(xi), df[!,i])
end
# Set column information.
column_info = Vector{String}(undef, 10)
column_info[1] = "diagnosis"
column_info[2] = "clump_thickness"
column_info[3] = "cell_size_uniformity"
column_info[4] = "cell_shape_uniformity"
column_info[5] = "marginal_adhesion"
column_info[6] = "single_cell_epithelial_size"
column_info[7] = "bare_nuclei"
column_info[8] = "bland_chromatin"
column_info[9] = "normal_nucleoli"
column_info[10] = "mitoses"
column_info_df = DataFrame(columns=column_info)
# Finally, save our formatted data and remove the original source.
CSV.write("bcw.csv", df; writeheader=true, delim=',')
CSV.write("bcw.info", column_info_df; writeheader=false, delim=',')
rm(local_filepath)
return nothing
end
process_bcw (generic function with 1 method)
Important: The post_fetch_method will run inside the DataDep's directory, so we can safely write to that folder without worrying about filepaths.
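A tiny hypothetical sketch to illustrate: relative paths inside a post_fetch_method resolve against the DataDep's folder, so no path bookkeeping is needed.

# Hypothetical post_fetch_method illustrating the working directory.
function show_workdir(local_filepath)
    println("post_fetch_method is running in: ", pwd())
    # A relative write like this lands inside the DataDep's folder:
    CSV.write("processed.csv", CSV.read(local_filepath, DataFrame))
end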
Now let's do the same for the file wdbc.data.
1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32) Ten real-valued features are computed for each cell nucleus:
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)
The mean, standard error, and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
All feature values are recoded with four significant digits.
function process_wdbc(local_filepath)
# First, let's standardize the DataFrame format. We'll drop the ID number.
df = create_standard_df(local_filepath;
header=false,
missingstring=["?",""],
class_index=2,
feature_indices=3:32,
)
# Next, change the class labels from B and M to benign and malignant, respectively.
# If we encounter another label, set it to missing.
new_label = Vector{Union{Missing,String}}(undef, length(df.class))
for (i, row) in enumerate(eachrow(df))
if row.class == "B"
new_label[i] = "benign"
elseif row.class == "M"
new_label[i] = "malignant"
else
new_label[i] = missing
end
end
df.class = new_label
# Now drop any rows with missing values and set the type of every feature to Float64.
df = dropmissing(df)
for i in 2:ncol(df)
df[!,i] = map(xi -> Float64(xi), df[!,i])
end
# Set column information. Columns are <feature>_<transformation>.
features = [
"radius",
"texture",
"perimeter",
"area",
"smoothness",
"compactness",
"concavity",
"n_concave_pts",
"symmetry",
"fractal_dim",
]
transformations = ["mean", "se", "worst"]
column_info = Vector{String}(undef, 31)
idx = 1
column_info[idx] = "diagnosis"
idx += 1
for transformation in transformations, feature in features
column_info[idx] = string(feature, "_", transformation)
idx += 1
end
column_info_df = DataFrame(columns=column_info)
# Finally, save our formatted data and remove the original source.
CSV.write("wdbc.csv", df; writeheader=true, delim=',')
CSV.write("wdbc.info", column_info_df; writeheader=false, delim=',')
rm(local_filepath)
return nothing
end
process_wdbc (generic function with 1 method)
Finally, because post_fetch_method only accepts a single function, let's stitch our functions together so that the correct helper function is called for each file.
function custom_post_fetch_method(local_filepath)
filename = basename(local_filepath)
if filename == "breast-cancer-wisconsin.data"
process_bcw(local_filepath)
elseif filename == "wdbc.data"
process_wdbc(local_filepath)
else
error("Did not specify a branch for $(filename).")
end
end
custom_post_fetch_method (generic function with 1 method)
register(DataDep(
# 1. Set the DataDep's name.
"breast-cancer-wisconsin",
# 2. Set the message to display when downloading.
"""
Dataset: breast-cancer-wisconsin
Author: Dr. William H. Wolberg
Donors: Olvi Mangasarian
Received by David W. Aha
This dataset contains two files, "bcw.csv" and "wdbc.csv", corresponding to "breast-cancer-wisconsin.data"
and "wdbc.data", respectively, in the UCI Machine Learning Repository.
Summary for "bcw.csv":
Observations: 699 (16 missing are dropped)
Features: 9
Classes: 2
Summary for "wdbc.csv":
Observations: 569 (0 missing)
Features: 30
Classes: 2
Please see https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original).
""",
# 3. Set the remote_path used to download data.
[
"https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data",
"https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data",
],
# 4. Set the checksum this time.
"962af71216fdc2cbd457539d59cbadf9fbdc352a01831d8d79a9c5d1509b742e";
# 5. Pass the function to run after downloading files.
post_fetch_method=custom_post_fetch_method
))
DataDep{String, Vector{String}, typeof(DataDeps.fetch_default), typeof(custom_post_fetch_method)}("breast-cancer-wisconsin", ["https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data", "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data"], "962af71216fdc2cbd457539d59cbadf9fbdc352a01831d8d79a9c5d1509b742e", DataDeps.fetch_default, custom_post_fetch_method, "Dataset: breast-cancer-wisconsin\nAuthor: Dr. William H. Wolberg\nDonors: Olvi Mangasarian\n        Received by David W. Aha\n\nThis dataset contains two files, \"bcw.csv\" and \"wdbc.csv\", corresponding to \"breast-cancer-wisconsin.data\"\nand \"wdbc.data\", respectively, in the UCI Machine Learning Repository.\n\nSummary for \"bcw.csv\":\n  Observations: 699 (16 missing are dropped)\n  Features: 9\n  Classes: 2\n\nSummary for \"wdbc.csv\":\n  Observations: 569 (0 missing)\n  Features: 30\n  Classes: 2\n\nPlease see https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original).\n")
dir = @datadep_str "breast-cancer-wisconsin"
This program has requested access to the data dependency breast-cancer-wisconsin.
which is not currently installed. It can be installed automatically, and you will not see this message again.

Dataset: breast-cancer-wisconsin
Author: Dr. William H. Wolberg
Donors: Olvi Mangasarian
        Received by David W. Aha

This dataset contains two files, "bcw.csv" and "wdbc.csv", corresponding to "breast-cancer-wisconsin.data"
and "wdbc.data", respectively, in the UCI Machine Learning Repository.

Summary for "bcw.csv":
  Observations: 699 (16 missing are dropped)
  Features: 9
  Classes: 2

Summary for "wdbc.csv":
  Observations: 569 (0 missing)
  Features: 30
  Classes: 2

Please see https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original).

Do you want to download the dataset from ["https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data", "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data"] to "/tmp/jl_ObwpYP/breast-cancer-wisconsin"?
[y/n]
stdin> y
┌ Info: Downloading
│   source = https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data
│   dest = /tmp/jl_ObwpYP/breast-cancer-wisconsin/breast-cancer-wisconsin.data
│   progress = 1.0
│   time_taken = 0.0 s
│   time_remaining = 0.0 s
│   average_speed = ∞ B/s
│   downloaded = 19.423 KiB
│   remaining = 0 bytes
│   total = 19.423 KiB
└ @ HTTP /home/alanderos/.julia/packages/HTTP/S5kNN/src/download.jl:132
┌ Info: Downloading
│   source = https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data
│   dest = /tmp/jl_ObwpYP/breast-cancer-wisconsin/wdbc.data
│   progress = 1.0
│   time_taken = 0.03 s
│   time_remaining = 0.0 s
│   average_speed = 3.586 MiB/s
│   downloaded = 121.194 KiB
│   remaining = 0 bytes
│   total = 121.194 KiB
└ @ HTTP /home/alanderos/.julia/packages/HTTP/S5kNN/src/download.jl:132
"/tmp/jl_ObwpYP/breast-cancer-wisconsin"
What are the files inside the DataDep?
run(`ls $(dir)`)
bcw.csv bcw.info wdbc.csv wdbc.info
Process(`ls /tmp/jl_ObwpYP/breast-cancer-wisconsin`, ProcessExited(0))
df = CSV.read(joinpath(dir, "bcw.csv"), DataFrame)
first(df, 10)
10 rows × 10 columns (omitted printing of 2 columns)
  | class | feature1 | feature2 | feature3 | feature4 | feature5 | feature6 | feature7
---|---|---|---|---|---|---|---|---
  | String15 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64
1 | benign | 5.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 3.0 |
2 | benign | 5.0 | 4.0 | 4.0 | 5.0 | 7.0 | 10.0 | 3.0 |
3 | benign | 3.0 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 3.0 |
4 | benign | 6.0 | 8.0 | 8.0 | 1.0 | 3.0 | 4.0 | 3.0 |
5 | benign | 4.0 | 1.0 | 1.0 | 3.0 | 2.0 | 1.0 | 3.0 |
6 | malignant | 8.0 | 10.0 | 10.0 | 8.0 | 7.0 | 10.0 | 9.0 |
7 | benign | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 | 10.0 | 3.0 |
8 | benign | 2.0 | 1.0 | 2.0 | 1.0 | 2.0 | 1.0 | 3.0 |
9 | benign | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 |
10 | benign | 4.0 | 2.0 | 1.0 | 1.0 | 2.0 | 1.0 | 2.0 |
run(`cat -n $(joinpath(dir, "bcw.info"))`)
     1	diagnosis
     2	clump_thickness
     3	cell_size_uniformity
     4	cell_shape_uniformity
     5	marginal_adhesion
     6	single_cell_epithelial_size
     7	bare_nuclei
     8	bland_chromatin
     9	normal_nucleoli
    10	mitoses
Process(`cat -n /tmp/jl_ObwpYP/breast-cancer-wisconsin/bcw.info`, ProcessExited(0))
df = CSV.read(@datadep_str("breast-cancer-wisconsin/wdbc.csv"), DataFrame)
first(df, 10)
10 rows × 31 columns (omitted printing of 23 columns)
  | class | feature1 | feature2 | feature3 | feature4 | feature5 | feature6 | feature7
---|---|---|---|---|---|---|---|---
  | String15 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64
1 | malignant | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 |
2 | malignant | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 |
3 | malignant | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 |
4 | malignant | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 |
5 | malignant | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 |
6 | malignant | 12.45 | 15.7 | 82.57 | 477.1 | 0.1278 | 0.17 | 0.1578 |
7 | malignant | 18.25 | 19.98 | 119.6 | 1040.0 | 0.09463 | 0.109 | 0.1127 |
8 | malignant | 13.71 | 20.83 | 90.2 | 577.9 | 0.1189 | 0.1645 | 0.09366 |
9 | malignant | 13.0 | 21.82 | 87.5 | 519.8 | 0.1273 | 0.1932 | 0.1859 |
10 | malignant | 12.46 | 24.04 | 83.97 | 475.9 | 0.1186 | 0.2396 | 0.2273 |
run(`cat -n $(joinpath(dir, "wdbc.info"))`)
     1	diagnosis
     2	radius_mean
     3	texture_mean
     4	perimeter_mean
     5	area_mean
     6	smoothness_mean
     7	compactness_mean
     8	concavity_mean
     9	n_concave_pts_mean
    10	symmetry_mean
    11	fractal_dim_mean
    12	radius_se
    13	texture_se
    14	perimeter_se
    15	area_se
    16	smoothness_se
    17	compactness_se
    18	concavity_se
    19	n_concave_pts_se
    20	symmetry_se
    21	fractal_dim_se
    22	radius_worst
    23	texture_worst
    24	perimeter_worst
    25	area_worst
    26	smoothness_worst
    27	compactness_worst
    28	concavity_worst
    29	n_concave_pts_worst
    30	symmetry_worst
    31	fractal_dim_worst
Process(`cat -n /tmp/jl_ObwpYP/breast-cancer-wisconsin/wdbc.info`, ProcessExited(0))
In this last example we create a DataDep for data generated by a simulation, implemented as a function called spirals.

Here we are assuming that:

- the spirals simulation code can be shared with others (for example, people interested in reproducing our results), and
- the simulation is deterministic given an RNG, its seed, and the simulation parameters, so the data can be regenerated exactly.

function spirals(class_sizes;
rng::AbstractRNG=StableRNG(1903),
max_radius::Real=7.0,
x0::Real=-3.5,
y0::Real=3.5,
angle_start::Real=π/8,
prob::Real=1.0,
)
if length(class_sizes) != 3
error("Must specify 3 classes (length(class_sizes)=$(length(class_sizes))).")
end
if max_radius <= 0
error("Maximum radius (max_radius=$(max_radius)) must be > 0.")
end
if angle_start < 0
error("Starting angle (angle_start=$(angle_start)) should satisfy 0 ≤ θ ≤ 2π.")
end
if prob < 0 || prob > 1
error("Probability (prob=$(prob)) must satisfy 0 ≤ prob ≤ 1.")
end
# Extract parameters.
N = sum(class_sizes)
max_A, max_B, max_C = class_sizes
# Simulate the data.
L, X = Vector{String}(undef, N), Matrix{Float64}(undef, N, 2)
x, y = view(X, :, 1), view(X, :, 2)
inversions = 0
for i in 1:N
if i ≤ max_A
# The first 'max_A' samples are from Class A
(class, k, n, θ) = ("A", i, max_A, angle_start)
noise = 0.1
elseif i ≤ max_A + max_B
# The next 'max_B' samples are from Class B
(class, k, n, θ) = ("B", i-max_A+1, max_B, angle_start + 2π/3)
noise = 0.2
else
# The last 'max_C' samples are from Class C
(class, k, n, θ) = ("C", i-max_A-max_B+1, max_C, angle_start + 4π/3)
noise = 0.3
end
# Compute coordinates.
angle = θ + π * k / n
radius = max_radius * (1 - k / (n + n / 5))
x[i] = x0 + radius*cos(angle) + noise*randn(rng)
y[i] = y0 + radius*sin(angle) + noise*randn(rng)
if rand(rng) < prob
L[i] = class
else
L[i] = rand(rng, setdiff(["A", "B", "C"], [class]))
inversions += 1
end
end
println()
println("[ spirals: $(N) instances / 2 features / 3 classes ]")
println(" ∘ Pr(y | x) = $(prob)")
println(" ∘ $inversions class inversions ($(inversions/N) Bayes error)")
println()
return L, X
end
spirals (generic function with 1 method)
L, X = spirals([100, 100, 100])
label2int = Dict("A" => 1, "B" => 2, "C" => 3)
class_colors = [label2int[li] for li in L]
scatter(X[:,1], X[:,2], color=class_colors)
[ spirals: 300 instances / 2 features / 3 classes ]
  ∘ Pr(y | x) = 1.0
  ∘ 0 class inversions (0.0 Bayes error)
register(DataDep(
# 1. Set the DataDep's name.
"spirals",
# 2. Set the message to display when downloading.
"""
Dataset: spirals
Credit: https://smorbieu.gitlab.io/generate-datasets-to-understand-some-clustering-algorithms-behavior/
A simulated dataset of three noisy spirals. Data is simulated locally; see registration block for details.
Observations: 1000
Features: 2
Classes: 3
""",
# 3. There is nothing to download, so this argument is not used.
"unused",
# 4. Specify the checksum of the simulation file, before running post_fetch_method.
"42d5c3404511db5ab48ab2224ac2d2959c82a47dd4b108fbabb3dfb27631d782";
# 5. Write a fetch_method that calls our simulation routine and creates a local file.
fetch_method = function(unused, localdir)
# Simulate the data. Note that we are forced to specify a RNG, its seed, and simulation parameters.
rng = StableRNG(1903)
L, X = spirals((600, 300, 100);
rng=rng,
max_radius=7.0,
x0=-3.5,
y0=3.5,
angle_start=pi/8,
prob=1.0,
)
# Put everything in a DataFrame.
x, y = view(X, :, 1), view(X, :, 2)
df = DataFrame(class=L, x=x, y=y)
# Shuffle the rows of the DataFrame and write to file.
local_file = joinpath(localdir, "data.csv")
perm = Random.randperm(rng, size(df, 1))
foreach(col -> permute!(col, perm), eachcol(df))
CSV.write(local_file, df)
return local_file
end,
))
DataDep{String, String, var"#9#11", typeof(identity)}("spirals", "unused", "42d5c3404511db5ab48ab2224ac2d2959c82a47dd4b108fbabb3dfb27631d782", var"#9#11"(), identity, "Dataset: spirals\nCredit: https://smorbieu.gitlab.io/generate-datasets-to-understand-some-clustering-algorithms-behavior/\n\nA simulated dataset of three noisy spirals. Data is simulated locally; see registration block for details.\n\nObservations: 1000\nFeatures: 2\nClasses: 3\n")
@datadep_str "spirals"
This program has requested access to the data dependency spirals.
which is not currently installed. It can be installed automatically, and you will not see this message again.

Dataset: spirals
Credit: https://smorbieu.gitlab.io/generate-datasets-to-understand-some-clustering-algorithms-behavior/

A simulated dataset of three noisy spirals. Data is simulated locally; see registration block for details.

Observations: 1000
Features: 2
Classes: 3

Do you want to download the dataset from unused to "/tmp/jl_ObwpYP/spirals"?
[y/n]
stdin> y

[ spirals: 1000 instances / 2 features / 3 classes ]
  ∘ Pr(y | x) = 1.0
  ∘ 0 class inversions (0.0 Bayes error)
"/tmp/jl_ObwpYP/spirals"
df = CSV.read(@datadep_str("spirals/data.csv"), DataFrame)
label2int = Dict("A" => 1, "B" => 2, "C" => 3)
class_colors = [label2int[li] for li in df.class]
scatter(df.x, df.y, color=class_colors)
The point here is that we are forced to document the settings used in generating our data when we create a DataDep registration block.
Random Number Generators (RNGs) on a computer are usually Pseudo-Random Number Generators (PRNGs): the streams they generate are deterministic once you know some initial state (the seed).
This property is great for making certain kinds of programs reproducible, for example simulations, probabilistic methods, and so on.
However, the implementation of a particular RNG algorithm affects the random streams it generates!
Example: At some point in Julia's development, the default RNG was based on the Mersenne Twister; the default is now Xoshiro256++ from the xoshiro/xoroshiro family. Scripts that seeded the global RNG and were run on Julia versions from before the change produce different sequences of values than on more recent versions.
If you absolutely need reproducibility of random numbers in a script, use an RNG that promises stability, such as the ones provided by StableRNGs.jl.
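A minimal sketch of this advice (the seed 1903 matches the one used above; this is illustrative, not a recorded run):

using StableRNGs

# Draw three values from a seeded stable RNG.
rng = StableRNG(1903)
a = rand(rng, 3)

# Re-seeding reproduces exactly the same stream, on any Julia version.
rng = StableRNG(1903)
b = rand(rng, 3)

@assert a == b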