data_sql


Imagine that you are importing data from different sources - web services, CSV and Excel files or databases. Then you want to edit, combine and statistically process these data in a quick and easy way. In this case, use the data_sql function. With this function you work with datasets through SQL commands.

SQL (Structured Query Language) is a domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS), or for stream processing in a relational data stream management system. It is particularly useful in handling structured data where there are relations between different entities/variables of the data.

For data wrangling you can also use in Stagraph SQL language. You can use this feature without a difference in the data source - you can apply your SQL commands on data you import from CSV files, web services or databases. The data_sql function is based on the sqldf R package.

The user specifies an SQL statement (using datasets names in place of table names) and a database with appropriate table layouts is automatically created, the datasets are automatically loaded into the database, the specified SQL statement is performed, the result is imported back to Stagraph. The database is then automatically deleted.

As the back-end database is used an in-memory created SQLite database. SQLite is an in-process library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. SQLite is an embedded SQL database engine - a compact library.

Examples of practical use are shown in the following video-tutorial.

How to use data_sql

In the following, we will describe an example of using data_sql function. As sample data, we use built-in data from R Runtime. These data we import to Stagraph through the data_attach function - click in the ribbon toolbar on R_Term - Attach. The following dialog will appear.

Leave the Dataset Name property to default values - Data1. Then select the sample dataset - diamonds. When you click on the Apply button, the built-in dataset is imported into the project and preview is displayed in the Data Preview Document.

Created dataset we will modify and process using the data_sql function. This function can be found under the button SQL on the ribbon toolbar tab Data.

After pressing this button, the feature is added to the list of applied functions in the Properties Panel. If you select the function in list, the SQL Script Editor appears in the bottom panel part.

This editor includes features such as SQL syntax highlighting or code / variables autocompletion. In this editor you write your SQL statementsfor datasets processing. If you know the SQL language, datasets editing and statistical processing will be a simple task.

In the first example, we modified the imported dataset with a relatively simple command. From the Data1 dataset, we select variables (columns) named cut, carat and price. These records were sorted descending according to the price variable and finally we selected the first 15 records. The result is shown in the following figure.

In the following example, we have summarized the dataset values. From the dataset, we chose the cut variable for records grouping. In each group, we calculated average value of carat and price variables. Calculated averages were rounded to 2 decimal places and resulted dataset was sorted according to the price variable.

In this way, your datasets can be edited, summarized or combined (SQL JOIN). If you can not to complete the processing with one single command, you can apply to selected dataset several individual data_sql functions. The example is shown in the following illustration.

In this case, we applied the additional data_sql function to the previous example. We have selected the cut and carat variables from previous processing step and these values were finally sorted by the carat variable (from the smallest to the highest).

In addition to combining several data_sql functions, you can use the other data wrangling functions. The example is shown in the following image.

The data_sql function gives you a very powerful and easy-to-use data processing tool. If you know the SQL language, you can use its full functionality (in the SQLite implementation).