Data Wrangling


Data wrangling (or data munging) is an important step in analyzing any data. Data wrangling is in essence, the transformation of “raw” data into a format that is statistically usable. In this process, the data is transformed in several steps - reshaping, filtering, selecting, mutating, combining, formatting and summarizing. Data wrangling can be done manually using built-in tools or via domain specific languages (DSL) such as SQL, R or Python.

Stagraph has a comprehensive interface for wrangling data. This interface is also available in the Free edition of the program without limitations. Stagraph can be considered free tool for data wrangling. An overview of options and workflow is presented in the following video.

As can be seen from the previous video, Stagraph has complete functionality for data wrangling. Overall, we can divide these functions into three main groups:

The Complex functions group contains essentially two functions - SQL Processing and R Processing. Using these two functions, you can work with the dataset on a “complex level”. However, they requires SQL or R language knowledge. Data Wrangling group includes a complete set of functions based on the Grammar of Data. In this case, your work is divided into several simple steps. By combining them you can make complex and comprehensive data transformations. Finally, the Special Functions group contains functions for special cases. These functions are related with specific data types such as text or numeric values. Documentation for each feature is in the following list.

Data Import Functions

Data Wrangling Workflow

In the previous chapter, we’ve introduced features that you can use to import data to Stagraph. After import, your dataset is displayed in the Project Panel. If you select it in the list, a ribbon toolbar tab, named Data, will be displayed. This tab contains features that can be used for data wrangling.

If you double-click on selected dataset, it will appear in the Properties Panel, where you can process it. If you add selected function from the Data tab, it will appear in the list (Properties Panel). Here you can setup parameters of each function (processing step).

In the following example we added the data_sql function. Using this function, you can process your dataset through SQL commands. In the example, we created a SQL statement that selects the Sepal_Length variable from the dataset and records were filtered by Species variable.

By gradual addition of individual functions, you can edit the dataset to the desired form. You can add individual functions from the ribbon toolbar tab (Data) or from contextual menu if you right-click on a blank space in the Properties Panel.

In the following section, we will shortly introduce all groups of the data wrangling functions.

Complex functions

The first group are Complex Functions. These allows you to prepare very complex processing steps. The current version of the program contains two functions - data_sql and data_r. Using these functions, you can edit your dataset using custom commands written in SQL or R language.

But to use them, you need to know how to work with these languages.

Data Wrangling

The second group is Data Wrangling functions. This is a set of features that divide complex wrangling into simple steps. Each sub-group has a specific purpose.

For example, under the Reshape button, you can find functions for dataset “reshaping” (gather, spread or transpose). Under the Subset button, you can find functions for records selecting such as filtering or sampling. Under the Combine button, you’ll find features to combine dataset into one.

Functions in this group are based on the grammar of data, which is implemented in R packages tidyr and dplyr. When you use them, you follow the “step by step” style. By combining individual functions you can achieve avery advanced data processing.

Special Functions

The Special Functions group represents specific functions for working with selected data types - numeric, text, categorical and temporal.

Under each button there are multiple specific functions. In the following example, features for working with Date-Time values are displayed. The Parse function serves to format Date-Time values from a text variable. Use the Add function to move the variable in time. The Component function gets selected component from variable, such as year, month or hour.

Functions from this group let you work with variables in the selected data type. Their comprehensive description can be found under the links at the beginning of the chapter.