This guide for using the agnes data wrangling library will walk you through a fairly typical, if mundane, data science preprocessing task: loading some source data files, joining them together on the appropriate fields, selecting the specific data you’re looking for, and performing some basic transformations.
This guide is intended for people who have some familiarity with Rust.
You may also be interested in the API documentation.
Data
For this guide, we will be working with data from The World Bank to examine the link between life expectancy and GDP for various countries. We’ll be working with three individual data files (LICENSE):
- Annual GDP data for countries around the world, 1960-2017, rehosted, unmodified, for demonstration purposes here.
- Country metadata information for GDP data file above, rehosted here.
- Annual life expectancy data for countries around the world, 1960-2017, rehosted here.
Initializing our Project
We’ll start by creating a new cargo package for this guide. At a command prompt, navigate to the directory you want to create your project in, and type:
$ cargo new --bin gdp_life
This creates a new cargo package in the gdp_life directory. We specified --bin to indicate that we are making a binary (runnable) application instead of a library.
Next, we need to modify the [dependencies] section of that package’s Cargo.toml file to load agnes:
[dependencies]
agnes = "0.3"
With the appropriate library loaded, we’re ready to move on to actual code!
Defining Tables
The first bit of code we need to write is the data table definition for the GDP table. agnes
handles data loading in a statically-typed manner, and thus we need to tell agnes
what data types we are loading and how we want to label them.
In agnes
, fields (i.e. columns in your data set) are labeled by unit-like structs by which we can refer to an individual field at the type level, as opposed to using a runtime value (for example, a field name as a str
or a usize
field index or enum variant). This provides us with valuable compile-time field type checking and prevents us from running into some avoidable runtime errors.
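To make the idea of type-level labels more concrete, here is a minimal standalone sketch in plain Rust. It does not use agnes: the Label trait, its NAME constant, and its DType associated type are hypothetical names chosen for illustration, and the real label machinery in agnes is more involved.

```rust
// Standalone sketch of type-level field labels; this is NOT agnes's implementation.
// `Label`, `NAME`, and `DType` are hypothetical names used only for illustration.
trait Label {
    const NAME: &'static str;
    type DType;
}

// Unit-like structs: they carry no data and exist only as types.
struct CountryName;
struct Gdp2015;

impl Label for CountryName {
    const NAME: &'static str = "Country Name";
    type DType = String;
}

impl Label for Gdp2015 {
    const NAME: &'static str = "2015";
    type DType = f64;
}

// The label is supplied as a type argument, so an unknown label or a value of
// the wrong data type is rejected by the compiler rather than at runtime.
fn describe<L: Label>(value: &L::DType) -> String
where
    L::DType: std::fmt::Debug,
{
    format!("{} = {:?}", L::NAME, value)
}

fn main() {
    println!("{}", describe::<CountryName>(&String::from("Aruba")));
    println!("{}", describe::<Gdp2015>(&1.5));
    // describe::<Gdp2015>(&String::from("oops")); // would not compile: expected &f64
}
```

The commented-out line shows the kind of mistake that a string-keyed lookup would only catch while the program is running.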
We define these field labels and their associated data types with a macro called tablespace
, which generally looks like this:
tablespace![
table my_first_table {
Table1Col1: u64,
Table1Col2: String,
}
table my_second_table {
Table2Col1: f64,
Table2Col2: f64,
}
]
In this example code we are declaring two tables: my_first_table
and my_second_table
, each with two fields (in the first table, an unsigned integer and a string, and in the second table, two floats). The tablespace
macro will generate a module for each table, as well as the label structs for each field. For example, the call to tablespace
above will create the module my_first_table
, which will include within it the types Table1Col1
and Table1Col2
. Note that you can preface the word table
in these declarations with visibility modifiers (e.g. pub
).
In most cases we will want to define all tables we’re loading in an application within a single tablespace
macro call. Theoretically we could call tablespace
multiple times, but we would not be able to combine fields from tables in different tablespaces.
Getting back to our GDP table example, we first have to load the agnes crate in our main.rs file:
#[macro_use]
extern crate agnes;
Next, the appropriate tablespace
call would be:
tablespace![
table gdp {
CountryName: String,
CountryCode: String,
Gdp2015: f64,
}
]
This generates the gdp
module and the label structs CountryName
, CountryCode
, and Gdp2015
, along with some code to make these label structs work within the agnes
framework. Since this code generates a module, it is best placed outside the main
or any other functions.
Our data source contains GDP data for the years from 1960-2017. We’ve chosen (arbitrarily) to use the 2015 GDP data for our purposes.
Later on, we’ll add to this tablespace, but for now let’s continue with only the gdp
table defined.
Loading a Data Set
Now that we’ve defined the appropriate table and fields, we can move on to loading this table from a data source. To do this, we provide a source schema – a description of how to hook up data from the source file to the fields we described in our call to tablespace
above. This is done with another macro, schema
.
Within our main
function, we add the following code:
let gdp_schema = schema![
fieldname gdp::CountryName = "Country Name";
fieldname gdp::CountryCode = "Country Code";
fieldname gdp::Gdp2015 = "2015";
];
This defines a new source schema using the schema
macro, and stores the result in gdp_schema
.
The schema
macro syntax is a list of fieldname
or fieldindex
declarations that connect the field labels to column headers or column indices (starting from 0), respectively. In this case, we are providing three column headers: the CountryName
field label will take data from the column with the “Country Name” header, the CountryCode
field label will take data from the column with the “Country Code” header, and the Gdp2015
field label will take data from the column with the “2015” header.
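The difference between a name-based and an index-based schema entry can be sketched in plain Rust as follows. This is only an illustration of the lookup idea, not how agnes resolves a schema internally; FieldSource and resolve are hypothetical names.

```rust
// Standalone sketch of resolving schema entries to column positions.
// `FieldSource` and `resolve` are hypothetical names, not part of agnes.
enum FieldSource<'a> {
    // Look the column up by its header text.
    Name(&'a str),
    // Use a zero-based column position directly.
    Index(usize),
}

// Returns the zero-based column position, or None if the header is absent
// or the index is past the last column.
fn resolve(headers: &[&str], source: &FieldSource) -> Option<usize> {
    match source {
        FieldSource::Name(name) => headers.iter().position(|h| h == name),
        FieldSource::Index(idx) => {
            if *idx < headers.len() {
                Some(*idx)
            } else {
                None
            }
        }
    }
}

fn main() {
    let headers = ["Country Name", "Country Code", "2015"];
    println!("{:?}", resolve(&headers, &FieldSource::Name("2015")));
    println!("{:?}", resolve(&headers, &FieldSource::Index(0)));
}
```

A name-based entry survives columns being reordered in the source file, while an index-based entry survives a header being renamed; which one is more robust depends on how the file is likely to change.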
Next, we want to load the CSV file. We could download the CSV file locally and load that, but for convenience, we’re going to have the program load the CSV file directly from the web. Fortunately, agnes
has a utility function for loading a CSV file from a URI string. At the top of our main.rs
file, after the extern crate
line, let’s add the following import:
use agnes::source::csv::load_csv_from_uri;
The load_csv_from_uri function takes two arguments: the URI you wish to load the data from, and the source schema for the file. It returns a Result wrapping the DataView object that holds our data. While agnes strives to limit the number of runtime errors, the types of errors that can occur while loading a file (incorrect URI, network error, etc.) are not predictable, so this particular functionality does require some runtime error-checking. Our call will look like this:
let gdp_view = load_csv_from_uri(
"https://wee.codes/data/gdp.csv",
gdp_schema
).expect("CSV loading failed.");
This code calls the load_csv_from_uri
function with the URI for our data as the first argument and our source schema as the second, unwraps the Result
(with a helpful error message), and stores the resulting DataView
in gdp_view
.
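Calling expect is convenient in a guide, but it panics on failure. If you would rather report the problem and carry on, you can match on the Result instead. The sketch below shows the pattern in plain Rust; load_table is a hypothetical stand-in that performs no real I/O and is not part of agnes, and the example.invalid URIs are placeholders.

```rust
// `load_table` is a hypothetical stand-in for any loader that can fail at
// runtime; it does no real I/O and is not part of agnes.
fn load_table(uri: &str) -> Result<Vec<String>, String> {
    if uri.starts_with("https://") {
        // Pretend the download succeeded and produced two records.
        Ok(vec![String::from("Aruba"), String::from("Angola")])
    } else {
        Err(format!("unsupported URI scheme: {}", uri))
    }
}

// Turn the Result into a human-readable status line instead of panicking.
fn report(uri: &str) -> String {
    match load_table(uri) {
        Ok(rows) => format!("loaded {} rows", rows.len()),
        Err(err) => format!("could not load data: {}", err),
    }
}

fn main() {
    println!("{}", report("https://example.invalid/gdp.csv"));
    println!("{}", report("ftp://example.invalid/gdp.csv"));
}
```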
The DataView object is the primary struct for working with data in the agnes library. Its functionality includes, but is not limited to: viewing data, extracting data, merging or joining other data sources, and computing simple view statistics.
Now that we’ve loaded this data, it is simple to display it using typical Rust display semantics:
println!("{}", gdp_view);
The full code for our initial example is:
#[macro_use]
extern crate agnes;
use agnes::source::csv::load_csv_from_uri;
tablespace![
table gdp {
CountryName: String,
CountryCode: String,
Gdp2015: f64,
}
];
fn main() {
let gdp_schema = schema![
fieldname gdp::CountryName = "Country Name";
fieldname gdp::CountryCode = "Country Code";
fieldname gdp::Gdp2015 = "2015";
];
// load the GDP CSV file from a URI
let gdp_view = load_csv_from_uri(
"https://wee.codes/data/gdp.csv",
gdp_schema
).expect("CSV loading failed.");
println!("{}", gdp_view);
}
If your file looks like this, you should be able to type cargo run
at your command prompt, and after compilation you’ll see your loaded data!
This example is available here. Additionally, you can find a version which loads the CSV file from a local path here.
Preprocessing
So far, we’ve loaded a data file and displayed it, but we haven’t really done any preprocessing work. Looking through the data we displayed in the last section, you may have noticed that this data set includes several aggregate categories for regions and income groups, such as “East Asia & Pacific”, “Upper middle income”, and “World”. While these categories may be useful in some contexts, we really only wanted to examine countries.
Looking at the full source data set, there doesn’t seem to be an easy way to separate the aggregate categories from the countries. However, the World Bank website also provides metadata information for the GDP data, which has been rehosted here. This file contains the country codes referenced in the primary GDP data file, along with their associated region and income group. Upon examination, you may notice that the region and income group are missing for some records: specifically, for all the “country codes” that denote regions or income groups themselves.
Thus, we can come up with a preprocessing step to filter out the non-countries in our data set: filter the metadata so that only actual countries exist, and then perform a join of that metadata with the original data set (effectively filtering the original GDP data set to only include actual countries).
With this plan in mind, let’s translate this into programming tasks:
- Define the GDP metadata table in our tablespace.
- Write the source schema for loading the data from the file.
- Load the file from URI.
- Filter the loaded data set to ignore records without ‘Region’ data (and therefore include only actual countries, not aggregates).
- Join the filtered metadata with the GDP DataView we loaded in the last section.
Metadata Table Definition
For defining the metadata table, let’s augment our existing tablespace
macro call. We’re only concerned with the CountryCode
field (to be able to join with the GDP DataView
), and the Region
field (to filter upon).
tablespace![
/* ... */
table gdp_metadata {
CountryCode: String,
Region: String,
}
]
Metadata Source Schema
Next, let’s write the source schema for the metadata file. We can add these lines to the main
function:
let gdp_metadata_schema = schema![
fieldindex gdp_metadata::CountryCode = 0usize;
fieldname gdp_metadata::Region = "Region";
];
For demonstration purposes, we’re using the fieldindex
specifier to indicate that the CountryCode
data is found in the 0th (first) column of the spreadsheet. We could have equivalently used fieldname gdp_metadata::CountryCode = "Country Code";
.
Load Metadata File
Loading the file works the same way as it did for loading the GDP data:
let mut gdp_metadata_view = load_csv_from_uri(
"https://wee.codes/data/gdp_metadata.csv",
gdp_metadata_schema
).expect("CSV loading failed.");
The only difference here is that we specified that this DataView
is mutable – this will allow us to filter the view.
Data Filtering
Now, we’ll do something new: manipulating a DataView. The DataView type has a method called filter, which removes every record for which a simple true-false predicate on a particular field returns false.
You may recall that in agnes
, field labels are themselves types. Therefore, to specify a field (or fields) in agnes
, we often have to explicitly supply type arguments when calling methods on a DataView
(or other agnes
data structure) using what is often referred to as the “turbofish” syntax: object.method::<...>(...)
. The filter method on DataView
is one such method. To call this method, we write:
let gdp_metadata_view =
gdp_metadata_view.filter::<gdp_metadata::Region, _>(|val: Value<&String>| val.exists());
The filter method takes two type parameters: the label we’re filtering (gdp_metadata::Region) and the type of the predicate we’re supplying. The compiler is smart enough to figure out the predicate type based on what we pass as an argument, so we can tell the compiler to infer it by using the symbol _ as the second type parameter.
This code also introduces another agnes
data type: the Value enum. The Value
enum is quite similar to the standard Option
enum – it has two variants: Value::Exists(...)
, which specifies that a value exists and provides the value, and Value::Na
, which specifies that the value is missing. In the code above, the predicate is expected to take one argument: a Value
type holding a reference to the data that exists in the gdp_metadata::Region
field (in this case, String
). It is typical that we will be dealing with reference-holding Value
objects, since most operations we will perform will not require taking ownership of the data itself. To use the Value
type in our code, though, we need to remember to include it in our imports at the top of our file:
use agnes::field::Value;
The predicate we provide, |val: Value<&String>| val.exists()
, is fairly simple: only return true
for non-missing values.
The filter
method consumes the DataView
and returns the filtered data, so we finish by simply assigning the result into a new variable named gdp_metadata_view
, replacing the previous one. After applying this filter, feel free to try printing out the filtered metadata and check to see if the non-country records have been removed!
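The relationship between Value and Option, and the effect of an exists-style predicate, can be sketched in plain Rust. The enum below is a simplified stand-in written for this illustration, not the definition used by agnes; codes_with_region is a hypothetical helper, and the three rows are illustrative samples rather than the full metadata file.

```rust
// Simplified stand-in for a missing-value-aware enum; NOT agnes's actual definition.
#[derive(Debug)]
enum Value<T> {
    Exists(T),
    Na,
}

impl<T> Value<T> {
    // True when a value is present, mirroring `Option::is_some`.
    fn exists(&self) -> bool {
        match self {
            Value::Exists(_) => true,
            Value::Na => false,
        }
    }
}

// Keep only the country codes whose region value is present.
// `codes_with_region` is a hypothetical helper for this sketch.
fn codes_with_region<'a>(rows: &'a [(&'a str, Value<&'a str>)]) -> Vec<&'a str> {
    rows.iter()
        .filter(|(_, region)| region.exists())
        .map(|(code, _)| *code)
        .collect()
}

fn main() {
    // Illustrative rows: ARB (Arab World) is an aggregate and has no region.
    let rows = [
        ("ABW", Value::Exists("Latin America & Caribbean")),
        ("ARB", Value::Na),
        ("AGO", Value::Exists("Sub-Saharan Africa")),
    ];
    println!("{:?}", codes_with_region(&rows));
}
```

Only the records with a region survive, which is exactly the property we rely on to separate countries from aggregates.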
Joining GDP and Metadata Data
Now that we have a properly filtered metadata DataView, we can perform a join between the metadata and our original GDP data to effectively filter the non-country aggregates out of our GDP data. This takes a single line of code:
let gdp_country_view = gdp_view
.join::<Join<gdp::CountryCode, gdp_metadata::CountryCode, Equal>, _, _>(&gdp_metadata_view);
Here, we’re again using the ‘turbofish’ syntax to provide type arguments to the join method. In this case, we need to provide the type of the Join struct which specifies how the join should operate. Like label structs, the Join struct is never intended to be instantiated: it’s a marker struct that exists just to tell the compiler the type of join and the fields the join is operating upon. In our case, we’re specifying that we wish to perform an equality join (equijoin) on the gdp::CountryCode and the gdp_metadata::CountryCode fields: Join<gdp::CountryCode, gdp_metadata::CountryCode, Equal>.
To use Equal
and Join
, we make sure to import them at the top of the file:
use agnes::join::{Equal, Join};
The remaining two type parameters of join provide information about the type of the DataView we’re joining (gdp_metadata_view) onto this first DataView (gdp_view). Since we’re providing gdp_metadata_view as a method argument, the compiler knows what type it is and we can again tell the compiler to figure out the relevant types using the placeholder syntax _.
The join method takes both gdp_view and gdp_metadata_view by reference, and creates a new DataView object, not consuming or mutating either of the original DataView objects. It should be noted, however, that the join method does not copy any data; it only provides a new window into the data that was originally loaded from CSV files.
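One way to picture a join that produces a window rather than a copy is to have it return pairs of row indices into the two original tables. The sketch below does this in plain Rust with a hash lookup. The guide does not describe which algorithm agnes uses, so treat this as an assumption-laden illustration of an equijoin; equijoin_indices is a hypothetical name.

```rust
use std::collections::HashMap;

// Returns (left_row, right_row) index pairs for rows whose keys are equal.
// No key or record is copied; the result only points back into the inputs.
// `equijoin_indices` is a hypothetical helper, not part of agnes.
fn equijoin_indices(left: &[&str], right: &[&str]) -> Vec<(usize, usize)> {
    // Index the right side by key so each left row needs one hash lookup.
    let mut right_rows: HashMap<&str, Vec<usize>> = HashMap::new();
    for (j, key) in right.iter().enumerate() {
        right_rows.entry(*key).or_default().push(j);
    }

    let mut pairs = Vec::new();
    for (i, key) in left.iter().enumerate() {
        if let Some(matches) = right_rows.get(key) {
            for &j in matches {
                pairs.push((i, j));
            }
        }
    }
    pairs
}

fn main() {
    // Illustrative country codes: the left side still contains an aggregate
    // (ARB), while the right side holds only codes that passed the filter.
    let gdp_codes = ["ABW", "ARB", "AGO"];
    let metadata_codes = ["AGO", "ABW"];
    println!("{:?}", equijoin_indices(&gdp_codes, &metadata_codes));
}
```

Because ARB has no partner on the right, it simply never appears in the output, which is how the join doubles as a filter.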
Now that we’ve joined these two tables, we can print out the results and see what we’ve done!
println!("{}", gdp_country_view);
The aggregate-based records are indeed gone. You may notice, however, that we have some unnecessary columns in our DataView
now – we don’t need the CountryCode
and Region
columns that were added from the gdp_metadata
table after our join. Let’s not worry about that for now, and come back to it in the next section.
Our code so far should look like (also viewable here):
#[macro_use]
extern crate agnes;
use agnes::field::Value;
use agnes::join::{Equal, Join};
use agnes::source::csv::load_csv_from_uri;
tablespace![
table gdp {
CountryName: String,
CountryCode: String,
Gdp2015: f64,
}
table gdp_metadata {
CountryCode: String,
Region: String,
}
];
fn main() {
let gdp_schema = schema![
fieldname gdp::CountryName = "Country Name";
fieldname gdp::CountryCode = "Country Code";
fieldname gdp::Gdp2015 = "2015";
];
// load the CSV file from a URI
let gdp_view =
load_csv_from_uri("https://wee.codes/data/gdp.csv", gdp_schema).expect("CSV loading failed.");
let gdp_metadata_schema = schema![
fieldindex gdp_metadata::CountryCode = 0usize;
fieldname gdp_metadata::Region = "Region";
];
let mut gdp_metadata_view = load_csv_from_uri(
"https://wee.codes/data/gdp_metadata.csv",
gdp_metadata_schema
).expect("CSV loading failed.");
let gdp_metadata_view =
gdp_metadata_view.filter::<gdp_metadata::Region, _>(|val: Value<&String>| val.exists());
let gdp_country_view = gdp_view
.join::<Join<gdp::CountryCode, gdp_metadata::CountryCode, Equal>, _, _>(&gdp_metadata_view);
println!("{}", gdp_country_view);
}
Adding Life Expectancy
Thus far, we’ve loaded up GDP data and successfully filtered out a bunch of unnecessary records. Next, let’s combine it with life expectancy data.
We should have a good idea of how to proceed at this point: we have a new data file to load, so we will need to define the table in our tablespace, write another source schema, and call load_csv_from_uri to load the data. After that, it should be as simple as just joining the life expectancy and GDP views together to get us a combined data view!
So, let’s define our table (in the same tablespace
macro we’ve been using):
tablespace![
/* ... */
pub table life {
CountryCode: String,
Life2015: f64,
}
]
And write our source schema:
let life_schema = schema![
fieldname life::CountryCode = "Country Code";
fieldname life::Life2015 = "2015";
];
And load the file from a URI:
let life_view = load_csv_from_uri(
"https://wee.codes/data/life.csv",
life_schema
).expect("CSV loading failed.");
These should all be fairly recognizable at this point; they’re nearly identical to the source loading we did for the GDP and GDP metadata files.
Joining the GDP and life expectancy data views should also be familiar:
let gdp_life_view = gdp_country_view
.join::<Join<gdp::CountryCode, life::CountryCode, Equal>, _, _>(&life_view);
We can now print it out and take a look!
println!("{}", gdp_life_view);
It seems to have worked! But now, we’ve exacerbated our problem with the extra columns. We really only care about the country name, 2015 GDP, and 2015 life expectancy fields. We have three country code fields and a region field we don’t need anymore!
To fix this, we introduce another DataView method: v (which is shorthand for subview). This method will take a DataView and construct another DataView which only contains a subset of the original fields, and we call it like this:
let gdp_life_view = gdp_life_view
.v::<Labels![gdp::CountryName, gdp::Gdp2015, life::Life2015]>();
We’re again using the ‘turbofish’ syntax to specify type arguments to the v
method. In this case, the method takes a list of labels instead of a single label (like we specified in filter
or join
). agnes
provides the Labels macro to construct this label list, which we use to specify that we only want the CountryName
and Gdp2015
fields originally from the gdp
table, and the Life2015
field from the life
table. The v
method doesn’t consume the original DataView
, but since we no longer need it, we go ahead and store the resultant DataView
with the same name, shadowing the original view.
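The subview idea, a new view over a subset of fields that leaves the original untouched, can be sketched in plain Rust with borrowed columns. Note the difference from agnes: this sketch selects columns by runtime name, whereas v selects them by type-level label. Column and subview are hypothetical names, and the cell values are illustrative.

```rust
// A named column of cells, kept as strings for simplicity.
// `Column` and `subview` are hypothetical names, not part of agnes.
struct Column {
    name: &'static str,
    values: Vec<String>,
}

// Build a "subview": references to the requested columns, in the requested
// order, without copying any of the underlying values.
fn subview<'a>(columns: &'a [Column], wanted: &[&str]) -> Vec<&'a Column> {
    wanted
        .iter()
        .filter_map(|name| columns.iter().find(|c| c.name == *name))
        .collect()
}

fn main() {
    let columns = vec![
        Column { name: "CountryName", values: vec![String::from("Aruba")] },
        Column { name: "CountryCode", values: vec![String::from("ABW")] },
        Column { name: "Region", values: vec![String::from("Latin America & Caribbean")] },
    ];

    // The original `columns` stays usable after the subview is built.
    let view = subview(&columns, &["Region", "CountryName"]);
    for column in &view {
        println!("{}: {:?}", column.name, column.values);
    }
    println!("original still has {} columns", columns.len());
}
```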
Now, when we print gdp_life_view
, we get a much less cluttered data table.
That’s it for this step! We’ve now added life expectancy data to our DataView
and removed extraneous columns. Our code should generally look like this (also viewable here):
#[macro_use]
extern crate agnes;
use agnes::field::Value;
use agnes::join::{Equal, Join};
use agnes::source::csv::load_csv_from_uri;
tablespace![
table gdp {
CountryName: String,
CountryCode: String,
Gdp2015: f64,
}
table gdp_metadata {
CountryCode: String,
Region: String,
}
pub table life {
CountryCode: String,
Life2015: f64,
}
];
fn main() {
let gdp_schema = schema![
fieldname gdp::CountryName = "Country Name";
fieldname gdp::CountryCode = "Country Code";
fieldname gdp::Gdp2015 = "2015";
];
// load the GDP CSV file from a URI
let gdp_view =
load_csv_from_uri("https://wee.codes/data/gdp.csv", gdp_schema).expect("CSV loading failed.");
let gdp_metadata_schema = schema![
fieldindex gdp_metadata::CountryCode = 0usize;
fieldname gdp_metadata::Region = "Region";
];
// load the metadata CSV file from a URI
let mut gdp_metadata_view =
load_csv_from_uri("https://wee.codes/data/gdp_metadata.csv", gdp_metadata_schema)
.expect("CSV loading failed.");
let gdp_metadata_view =
gdp_metadata_view.filter::<gdp_metadata::Region, _>(|val: Value<&String>| val.exists());
let gdp_country_view = gdp_view
.join::<Join<gdp::CountryCode, gdp_metadata::CountryCode, Equal>, _, _>(&gdp_metadata_view);
let life_schema = schema![
fieldname life::CountryCode = "Country Code";
fieldname life::Life2015 = "2015";
];
// load the life expectancy CSV file from a URI
let life_view = load_csv_from_uri(
"https://wee.codes/data/life.csv",
life_schema
).expect("CSV loading failed.");
let gdp_life_view = gdp_country_view
.join::<Join<gdp::CountryCode, life::CountryCode, Equal>, _, _>(&life_view);
let gdp_life_view = gdp_life_view
.v::<Labels![gdp::CountryName, gdp::Gdp2015, life::Life2015]>();
println!("{}", gdp_life_view);
}
Arithmetic Transformation
Our final preprocessing task will be an arithmetic transformation of one of the fields. Let’s say that our downstream code expects the GDP to be measured in euros, not US dollars. Thus, as part of our preprocessing tasks we need to do a quick conversion.
This step introduces a few additional agnes
features. We’ll start by adding the appropriate trait imports:
use agnes::select::FieldSelect;
use agnes::access::DataIndex;
use agnes::label::IntoLabeled;
use agnes::store::IntoView;
FieldSelect is a trait for selecting a single field from a DataView, DataIndex is a trait for accessing individual data in a field, IntoLabeled is a trait for adding a label to an unlabeled data field, and IntoView is a trait for turning a field (or other data structure) into a DataView.
Just from this list, it might start to become apparent what our plan is going to be: we’ll select the current GDP field from our DataView
, access the data, generate a new transformed field, label this new field, convert it into a new DataView
, and then merge the two DataView
s using the DataView
merge method.
One concern is that we need to be able to label this new field, but don’t have any labels to apply to it. Fortunately, we can use the same method for declaring new labels as we did when declaring the labels for tables we load: the tablespace
macro! So let’s add the following to our tablespace
call:
tablespace![
/* ... */
pub table transformed {
GdpEUR2015: f64,
}
]
The table name transformed
is arbitrary, and just provides a place to define our newly transformed data field’s label.
Let’s also define a quick conversion function for converting from USD to EUR. For the sake of simplicity, we’re just going to hard-code the conversion factor, but we could load this from a file, or read it from an API, or request it from the user, or any other method we can come up with.
At the time of the writing of this guide, 1 USD = 0.88395 EUR. Thus, our simple hard-coded conversion function is:
fn usd_to_eur(usd: &f64) -> f64 {
usd * 0.88395
}
Now we can dive into creating the transformed data field:
let transformed_gdp = gdp_life_view
.field::<gdp::Gdp2015>()
.iter()
.map_existing(usd_to_eur)
.label::<transformed::GdpEUR2015>()
.into_view();
This statement starts with our combined GDP-life expectancy DataView, and selects a single field Gdp2015 using the field method provided by the FieldSelect trait. Then we create an iterator over the data in this field using the iter method (provided by the DataIndex trait). This iterator provides the method map_existing which applies a function or closure to every existing element in the field (leaving missing data as missing), which we call with our conversion function.
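The behavior described for map_existing (transform what exists, leave missing data missing) is the same pattern as mapping over Option in plain Rust. The sketch below models a field as a slice of Option<f64>, which is an assumption made for illustration; the free function map_existing here is a stand-in written for this example, not the agnes iterator method.

```rust
// Same hard-coded conversion factor as used in this guide.
fn usd_to_eur(usd: &f64) -> f64 {
    usd * 0.88395
}

// Apply `f` to every present value and leave missing values (None) untouched.
// This free function is a stand-in for illustration, not the agnes method.
fn map_existing<F>(values: &[Option<f64>], f: F) -> Vec<Option<f64>>
where
    F: Fn(&f64) -> f64,
{
    values
        .iter()
        .map(|v| v.as_ref().map(|x| f(x)))
        .collect()
}

fn main() {
    // Illustrative inputs: one present value and one missing value.
    let gdp_usd = [Some(100.0), None];
    let gdp_eur = map_existing(&gdp_usd, usd_to_eur);
    println!("{:?}", gdp_eur);
}
```

The output has the same length as the input, with the missing entry still missing, which matters for the row-count requirement when the new field is merged back.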
Next, we label this transformed data with our new field label, and convert the field (with into_view) into a new DataView
object, which is needed to be able to merge it with our existing DataView
:
let final_view = gdp_life_view
.merge(&transformed_gdp)
.expect("Merge failed.")
.v::<Labels![gdp::CountryName, transformed::GdpEUR2015, life::Life2015]>();
Here, we merge our new single-field DataView
containing our transformed data back on to our DataView
with GDP and life expectancy data. The merge method basically adds all the fields in transformed_gdp
onto gdp_life_view
and returns a new combined DataView
. Merging can fail if you try to merge two DataView
s with different numbers of rows, but that shouldn’t be a problem here since we just defined this new field and know it has the correct number of rows.
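The row-count condition is easy to picture with a small plain-Rust sketch. Here a table is modeled as a list of columns and merging appends one table’s columns to the other’s, failing when the row counts differ. RowMismatch, Column, and this merge function are hypothetical and written for this example; unlike the view-based approach described in this guide, the sketch clones the columns to keep the code short, and it assumes that all columns within one table already have equal length.

```rust
// Error returned when the two sides have different numbers of rows.
// `RowMismatch` is a hypothetical type for this sketch.
#[derive(Debug, PartialEq)]
struct RowMismatch {
    left: usize,
    right: usize,
}

// One column of possibly-missing numeric values.
type Column = Vec<Option<f64>>;

// Append the columns of `right` to the columns of `left`.
// Columns are cloned here for simplicity.
fn merge(left: &[Column], right: &[Column]) -> Result<Vec<Column>, RowMismatch> {
    // Row count of each side, taken from its first column (0 if no columns).
    let left_rows = left.first().map_or(0, |c| c.len());
    let right_rows = right.first().map_or(0, |c| c.len());
    if left_rows != right_rows {
        return Err(RowMismatch { left: left_rows, right: right_rows });
    }
    Ok(left.iter().chain(right.iter()).cloned().collect())
}

fn main() {
    let two_rows = vec![vec![Some(1.0), Some(2.0)]];
    let also_two_rows = vec![vec![Some(3.0), None]];
    let one_row = vec![vec![Some(1.0)]];

    println!("{:?}", merge(&two_rows, &also_two_rows));
    println!("{:?}", merge(&two_rows, &one_row));
}
```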
Finally, since we don’t need the original Gdp2015 field anymore, we perform another subview operation, only choosing the CountryName, GdpEUR2015, and Life2015 columns.
Final Preprocessor
We’ve done it! We now have a preprocessor application which takes original source GDP and life expectancy data, removes unnecessary records and columns, combines our data sources into a single view of the data, and performs a minor arithmetic transformation on one of the fields. Now we can serialize this data into whatever format we need for downstream activities: visualization, regression, storage, whatever!
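This guide does not cover how agnes itself serializes a DataView, so as a rough illustration of the serialization step, here is a plain-Rust sketch that writes rows of (country name, GDP, life expectancy) as CSV text. The row tuple layout and the to_csv, quote, and cell helpers are assumptions made for this example, and the numeric values are placeholders. Quoting matters for this data set because some World Bank country names contain commas.

```rust
// Quote a text field if it contains a comma, a double quote, or a newline,
// doubling any embedded double quotes.
fn quote(field: &str) -> String {
    if field.contains(',') || field.contains('"') || field.contains('\n') {
        format!("\"{}\"", field.replace('"', "\"\""))
    } else {
        field.to_string()
    }
}

// Render an optional number, using an empty cell for missing data.
fn cell(value: &Option<f64>) -> String {
    value.map(|x| x.to_string()).unwrap_or_default()
}

// Render a header row followed by one line per record.
// The (name, gdp, life) tuple layout is an assumption for this sketch.
fn to_csv(header: &[&str], rows: &[(String, Option<f64>, Option<f64>)]) -> String {
    let mut out = header.join(",");
    out.push('\n');
    for (name, gdp, life) in rows {
        out.push_str(&format!("{},{},{}\n", quote(name), cell(gdp), cell(life)));
    }
    out
}

fn main() {
    // Placeholder values, not real GDP or life expectancy figures.
    let rows = vec![
        (String::from("Bahamas, The"), Some(1.5), None),
        (String::from("Aruba"), None, Some(75.5)),
    ];
    print!("{}", to_csv(&["CountryName", "GdpEUR2015", "Life2015"], &rows));
}
```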
Our final application code should look like this (also viewable here):
#[macro_use]
extern crate agnes;

use agnes::access::DataIndex;
use agnes::field::Value;
use agnes::join::{Equal, Join};
use agnes::label::IntoLabeled;
use agnes::select::FieldSelect;
use agnes::source::csv::load_csv_from_uri;
use agnes::store::IntoView;

tablespace![
    table gdp {
        CountryName: String,
        CountryCode: String,
        Gdp2015: f64,
    }
    table gdp_metadata {
        CountryCode: String,
        Region: String,
    }
    pub table life {
        CountryCode: String,
        Life2015: f64,
    }
    pub table transformed {
        GdpEUR2015: f64,
    }
];

fn usd_to_eur(usd: &f64) -> f64 {
    usd * 0.88395
}

fn main() {
    let gdp_schema = schema![
        fieldname gdp::CountryName = "Country Name";
        fieldname gdp::CountryCode = "Country Code";
        fieldname gdp::Gdp2015 = "2015";
    ];
    // load the GDP CSV file from a URI
    let gdp_view =
        load_csv_from_uri("https://wee.codes/data/gdp.csv", gdp_schema).expect("CSV loading failed.");

    let gdp_metadata_schema = schema![
        fieldindex gdp_metadata::CountryCode = 0usize;
        fieldname gdp_metadata::Region = "Region";
    ];
    // load the metadata CSV file from a URI
    let mut gdp_metadata_view =
        load_csv_from_uri("https://wee.codes/data/gdp_metadata.csv", gdp_metadata_schema)
            .expect("CSV loading failed.");

    // keep only records that have a region (actual countries, not aggregates)
    let gdp_metadata_view =
        gdp_metadata_view.filter::<gdp_metadata::Region, _>(|val: Value<&String>| val.exists());

    let gdp_country_view = gdp_view
        .join::<Join<gdp::CountryCode, gdp_metadata::CountryCode, Equal>, _, _>(&gdp_metadata_view);

    let life_schema = schema![
        fieldname life::CountryCode = "Country Code";
        fieldname life::Life2015 = "2015";
    ];
    // load the life expectancy file from a URI
    let life_view = load_csv_from_uri("https://wee.codes/data/life.csv", life_schema)
        .expect("CSV loading failed.");

    let gdp_life_view =
        gdp_country_view.join::<Join<gdp::CountryCode, life::CountryCode, Equal>, _, _>(&life_view);
    let gdp_life_view =
        gdp_life_view.v::<Labels![gdp::CountryName, gdp::Gdp2015, life::Life2015]>();

    // convert the GDP field from USD to EUR and label the result
    let transformed_gdp = gdp_life_view
        .field::<gdp::Gdp2015>()
        .iter()
        .map_existing(usd_to_eur)
        .label::<transformed::GdpEUR2015>()
        .into_view();

    let final_view = gdp_life_view
        .merge(&transformed_gdp)
        .expect("Merge failed.")
        .v::<Labels![gdp::CountryName, transformed::GdpEUR2015, life::Life2015]>();

    println!("{}", final_view);
}