library(bigrquery) # R Interface to Google BigQuery API
library(dplyr) # Grammar for data manipulation
library(DBI) # Interface definition to connect to databases
BigQuery
RDBMS can be accessed via SQL and since I am covering Python and R in this site I will demonstrate how to connect to Google BigQuery using R, and still use SQL via R packages.
R & RStudio
This first section will cover how to connect to Google BigQuery using R via RStudio.
More information can be found on RStudio Setup in the SQL section of this site. I’ll briefly recap here:
dplyr as a database interface
This information was found here. I refer to that link to connect to DBs.
More information can be found at dplyr regarding connecting to db and lazy evaluation.
This page is just an overview of why I chose RStudio and doesn’t show the intricate details of connections. The dplyr
package simplifies data transformation. It provides a consistent set of functions, called verbs, that can be used in succession and interchangeably to gain understanding of the data iteratively.
dplyr
is able to interact with databases directly by translating the dplyr
verbs into SQL queries. This convenient feature allows you to ‘speak’ directly with the database from R.
SQL Translations for dplyr
A complementary package called dbplyr
contains the translations of the vendor-specific SQL for dplyr
to use. A list of known supported databases are available in our Databases page.
Databases
Name | Posit Pro Driver | dplyr support | Connect via R package |
---|---|---|---|
Amazon Redshift | ✅ | ✅ | |
Apache Hive | ✅ | ✅ | |
Apache Impala | ✅ | ✅ | |
Athena | ✅ | ||
Cassandra | ✅ | ||
Databricks | ✅ | ✅ | |
Google BigQuery | ✅ | ✅ | bigrquery |
Microsoft SQL Server | ✅ | ✅ | |
MonetDB | ✅ | MonetDBLite | |
MongoDB | ✅ | ||
MySQL | ✅ | ✅ | RMariaDB |
Netezza | ✅ | ||
Oracle | ✅ | ✅ | |
PostgreSQL | ✅ | ✅ | RPostgres |
SQLite | ✅ | RSQLite | |
Salesforce | ✅ | ||
Snowflake | ✅ | ✅ | |
Teradata | ✅ | ✅ |
Warning
As we know that R loads data into memory which could be taxing with large datasets. In this section I’ll connect to Google BigQuery and access some of their large public datasets.
Connect to BigQuery
As we know that R loads data into memory which could be taxing with large datasets. In this section I’ll connect to Google BigQuery and access some of their large public datasets.
I already have an account setup in BigQuery as I’ve done lots of work on it, and I also have some sample projects that I saved on Google Cloud that we’ll access in this section to demonstrate SQL from within RStudio and how to integrate R and SQL in the same analysis.
What follows is a short example on how to connect to BigQuery, access a project, data, query the data with R and SQL all using RStudio desktop which is free.
Packages
- “bigrquery”, “dplyr” and “DBI” packages provide an abstraction over the underlying BigQuery REST API to interact with the data.
- bigrquery and DBI provide a low level wrapper over the BigQuery API’s whereas dplyr provides a higher level of abstraction and lets BigQuery tables be treated as data frames.
Setup the Connection
library(bigrquery) # R Interface to Google BigQuery API
library(dplyr) # Grammar for data manipulation
library(DBI)
<- 'prime-depot-415622'
projectid <- 'BikeShare'
datasetid <- dbConnect((bigquery()),
con project = projectid,
dataset = datasetid,
use_legacy_sql = FALSE)
List Datasets in Project
SELECT *
EXCEPT(schema_owner, location, DDL, default_collation_name)
FROM
INFORMATION_SCHEMA.SCHEMATA
catalog_name | schema_name | creation_time | last_modified_time | sync_status |
---|---|---|---|---|
prime-depot-415622 | cus_data | 2024-03-03 21:24:36 | 2024-03-03 21:24:36 | NULL |
prime-depot-415622 | ny_citibikes | 2024-08-25 16:35:40 | 2024-08-25 16:35:40 | NULL |
prime-depot-415622 | bellabeat | 2024-03-20 20:06:32 | 2024-03-20 20:06:32 | NULL |
prime-depot-415622 | employee_data | 2024-03-06 23:59:41 | 2024-03-06 23:59:41 | NULL |
prime-depot-415622 | demos | 2024-03-05 17:43:26 | 2024-03-05 17:43:26 | NULL |
prime-depot-415622 | cars | 2024-03-03 18:17:34 | 2024-03-03 18:17:34 | NULL |
prime-depot-415622 | babynames | 2024-02-28 17:06:55 | 2024-02-28 17:06:55 | NULL |
prime-depot-415622 | BikeShare | 2024-04-01 20:47:30 | 2024-04-01 20:47:30 | NULL |
prime-depot-415622 | city_data | 2024-02-28 16:34:15 | 2024-02-28 16:34:15 | NULL |
prime-depot-415622 | customer_data | 2024-03-03 16:27:32 | 2024-03-03 16:27:32 | NULL |
List Tables w/R
To view the tables in the DB use dbListTables()
dbListTables(con)
[1] "trips19v1" "trips20"
List Tables w/SQL
SELECT table_catalog, table_schema, table_name, table_type, is_insertable_into
FROM BikeShare.INFORMATION_SCHEMA.TABLES
table_catalog | table_schema | table_name | table_type | is_insertable_into |
---|---|---|---|---|
prime-depot-415622 | BikeShare | trips19v1 | BASE TABLE | YES |
prime-depot-415622 | BikeShare | trips20 | BASE TABLE | YES |
List Column Schema
SELECT table_name, column_name, ordinal_position, is_nullable, data_type
FROM BikeShare.INFORMATION_SCHEMA.COLUMNS
table_name | column_name | ordinal_position | is_nullable | data_type |
---|---|---|---|---|
trips19v1 | ride_id | 1 | YES | INT64 |
trips19v1 | started_at | 2 | YES | TIMESTAMP |
trips19v1 | ended_at | 3 | YES | TIMESTAMP |
trips19v1 | ride_length | 4 | YES | STRING |
trips19v1 | rideable_type | 5 | YES | INT64 |
trips19v1 | tripduration | 6 | YES | FLOAT64 |
trips19v1 | start_station_id | 7 | YES | INT64 |
trips19v1 | start_station_name | 8 | YES | STRING |
trips19v1 | end_station_id | 9 | YES | INT64 |
trips19v1 | end_station_name | 10 | YES | STRING |
tbl
Access Tables
As mentioned earlier dplyr
allows access to BigQuery tables and allow us to treat them as dataframes
- use dplyr::tbl() to access the tables
- arguments: connection and TableName
Access and assign tables
<- tbl(con, "trips19v1")
trips_19 <- tbl(con, "trips20") |> show_query() trips_20
<SQL>
SELECT *
FROM `trips20`
- As you see the SQL query is all printed out for you
class(trips_19)
[1] "tbl_BigQueryConnection" "tbl_dbi" "tbl_sql"
[4] "tbl_lazy" "tbl"
Note: trips_19 & trips_20 are of class tbl_sql and tbl_lazy. So we have a direct reference to the table in BigQuery but we have not brought the data into memory
Query with R
Let’s use dplyr
verbs to access the data in BigQuery, with a simple case of counting the number of all users in the groups.
library(gt)
|> count(member_casual) |> gt() |>
trips_20 opt_table_outline()|>
tab_options(column_labels.background.color = '#1a65b9')
member_casual | n |
---|---|
casual | 48480 |
member | 378407 |
|> count(member_casual) |>
trips_19 gt() |>
opt_table_outline() |>
tab_options(column_labels.background.color = '#1a65b9')
member_casual | n |
---|---|
Subscriber | 341906 |
Customer | 23163 |
Show SQL Query
We can see how the R script translate to an SQL query by adding one show_query()
into the code
|> count(member_casual) |> show_query()
trips_19
OUTPUT<SQL>
`member_casual`, count(*) AS `n`
SELECT `trips19v1`
FROM `member_casual` GROUP BY
Query with SQL
We’re going to take the SQL query generated above and use it to directly query the database.
All we do is replace {r} with {sql} at the top of the code chunk and
provide the connection = con (defined earlier in the page)
output.var = “mydataframe” will assign the output of the query into “mydataframe” so it can be used in subsequent R code chunks
{sql, connection = con, output.var = "mydataframe"}
In this case we didn’t use the second argument
{sql, connection = con }
SELECT `member_casual`, count(*) AS `n`
FROM `trips19v1`
GROUP BY `member_casual`
member_casual | n |
---|---|
Subscriber | 341906 |
Customer | 23163 |