Hello Everyone.
This is the first time I have posted in well over a year. Today I am going to focus on visualising Snowflake's geospatial data in Power BI. This blog takes you through the whole process - so read on to find out how you need never hunt for a TopoJSON file again!
Power BI can leverage geospatial data from Snowflake with my favourite map visual - Icon Map (https://www.icon-map.com/). Icon Map can render points, polygons and linestrings supplied in 'Well Known Text' (WKT) format.
Snowflake supports converting its geospatial data types to WKT. Not only that, Snowflake can handle all the engineering and analytical needs of geospatial analysis without any other tool, and the results can be visualised in a variety of tools such as Tableau, Hex, Carto or even a custom-built Streamlit application. I wrote a Streamlit blog on exactly this very recently:
https://blog.streamlit.io/how-to-analyze-geospatial-snowflake-data-in-streamlit/
We all know that Power BI is not a geospatial application - however, there are many use cases where you might need to visualise custom or non-standard polygons and points on a map alongside other, non-geospatial data.
This tutorial will take you through ingesting geospatial data and engineering it into a format which can be easily visualised in Power BI. So here we go....
Log into Snowflake using an IDE of your choice (e.g. Snowsight or VS Code)
Create a Database, Warehouse and Schema
CREATE OR REPLACE WAREHOUSE ENGINEERING;
CREATE OR REPLACE WAREHOUSE POWER_BI;
-- the dynamic tables later in this tutorial run on a dedicated warehouse
CREATE OR REPLACE WAREHOUSE DYNAMIC_TABLE;
CREATE OR REPLACE DATABASE VEHICLE_ACCIDENTS;
CREATE OR REPLACE SCHEMA CURATED;
CREATE OR REPLACE SCHEMA RAW;
Create a Stage Containing the Raw Files
create or replace stage GEOSPATIAL_DATA
directory=(enable = TRUE);
Get some Data that has location information:
I decided to use the UK vehicle accident data and combine it with the Health Care Sub ICB boundaries. Below are the files I used:
Boundaries
Accidents for 2021
Import the files
Use the simple upload tool to add the files into the stage. You can access this via the Data section on the home page.
Refresh to see the files in the directory
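If you prefer SQL to the UI, you can refresh and inspect the stage's directory table with a couple of statements:
ALTER STAGE GEOSPATIAL_DATA REFRESH;
SELECT * FROM DIRECTORY(@GEOSPATIAL_DATA);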
Create a CSV file format
CREATE FILE FORMAT CSV
TYPE = csv
PARSE_HEADER = true
FIELD_OPTIONALLY_ENCLOSED_BY = '"';
Create a table from the accident information, using schema inference to derive the columns:
CREATE TABLE ACCIDENTS_2021
USING TEMPLATE (
SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
FROM TABLE(
INFER_SCHEMA(
LOCATION=>'@VEHICLE_ACCIDENTS.RAW.GEOSPATIAL_DATA/dft-road-casualty-statistics-accident-2021.csv',
FILE_FORMAT=>'CSV'
)
));
Ingest the Data into the new table
COPY INTO ACCIDENTS_2021
FROM @VEHICLE_ACCIDENTS.RAW.GEOSPATIAL_DATA/dft-road-casualty-statistics-accident-2021.csv
FILE_FORMAT = (FORMAT_NAME = 'CSV')
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
View the results to confirm the load succeeded.
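A quick sanity check might look like this:
SELECT COUNT(*) FROM ACCIDENTS_2021;
SELECT * FROM ACCIDENTS_2021 LIMIT 10;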
Create a file format for the GeoJSON - this is simply a JSON file format
CREATE FILE FORMAT JSON TYPE = json;
Create a Table to ingest the boundaries into
CREATE OR REPLACE TABLE BOUNDARIES
USING TEMPLATE (
SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
FROM TABLE(
INFER_SCHEMA(
LOCATION=>'@VEHICLE_ACCIDENTS.RAW.GEOSPATIAL_DATA/'
,FILES =>
'Integrated_Care_Boards_April_2023_EN_BSC_7929772807133590321.geojson'
,FILE_FORMAT=>'JSON'
)
));
Ingest the Boundaries
The downloaded boundaries use the British National Grid coordinate system (EPSG:27700).
COPY INTO BOUNDARIES
FROM @VEHICLE_ACCIDENTS.RAW.GEOSPATIAL_DATA/
FILE_FORMAT = (FORMAT_NAME = 'JSON')
PATTERN = '.*geojson'
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
The boundaries data is loaded as one row per file, so we need the FLATTEN table function to effectively 'explode' the features array out into one row per boundary.
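It is worth probing the loaded row first. Assuming the GeoJSON arrived with "crs" and "features" columns (which the dynamic table below relies on), something like this shows what we are working with:
-- the crs property holds the source co-ordinate system,
-- e.g. 'urn:ogc:def:crs:EPSG::27700' for British National Grid
SELECT "crs":properties:name::TEXT AS CRS_NAME FROM BOUNDARIES;
-- FLATTEN explodes the features array into one row per boundary
SELECT A.VALUE:properties:ICB23NM::TEXT AS ICB_NAME
FROM BOUNDARIES,
LATERAL FLATTEN(input => "features") AS A
LIMIT 5;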
I am using the new Dynamic Tables feature to keep the data refreshed as the raw data changes. In this query the following transformations are applied:
Convert the JSON geometry for each polygon to a GEOMETRY data type
Transform the geometry from British National Grid to the World Geodetic System (WGS 84). NB: Icon Map can handle British National Grid - I have only added this step in case you would like to use the data in other tools that do not offer that support
Convert the GEOMETRY data type to Well Known Text (WKT) format
CREATE OR REPLACE DYNAMIC TABLE HEALTH_BOUNDARIES_NORMALISED
TARGET_LAG = '1 Day'
WAREHOUSE = DYNAMIC_TABLE AS
SELECT
"crs":properties:name::TEXT "CRS",
A.VALUE:properties:ICB23CD::TEXT "ICB Code",
A.VALUE:properties:ICB23NM::TEXT "ICB Name",
A.VALUE:properties:SICBL23CD::TEXT "Sub ICB Code",
A.VALUE:properties:SICBL23NM::TEXT "Sub ICB Name",
A.VALUE:properties:NHSER22CD::TEXT "Regional Code",
A.VALUE:properties:NHSER22NM::TEXT "Regional Name",
A.VALUE:properties:LAT "Latitude",
A.VALUE:properties:LONG "Longitude",
ST_ASWKT(ST_TRANSFORM(TO_GEOMETRY(A.VALUE:geometry, RIGHT(CRS, 5)::INT), 4326)) WKT
FROM BOUNDARIES,
LATERAL FLATTEN (input => "features") AS A;
This is what it should look like
Now I am going to query the length of the WKT field. Power BI limits the number of characters a text value can hold (roughly 32,000 characters), so let's check this to avoid disappointment.
SELECT MAX(LEN(WKT)) LEN FROM HEALTH_BOUNDARIES_NORMALISED;
The results show that the max char length is 391,958 characters - well above the limit. There are two options to rectify this:
Use the ST_SIMPLIFY function, which is a standard Snowflake function. This 'simplifies' a polygon so there are fewer vertices. NB: we have already chosen 'simplified' datasets from the ONS Geoportal.
Reduce the precision of each co-ordinate. This makes sense, as the current level of precision is far higher than we need. The table below shows the approximate ground accuracy for each number of decimal places in a longitude/latitude value:
Decimal places | Approximate accuracy
0 | 111 km
1 | 11.1 km
2 | 1.11 km
3 | 111 m
4 | 11.1 m
5 | 1.11 m
6 | 0.11 m
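To get a feel for the first option before committing, a throwaway comparison query might look like this (the 100 m tolerance is purely illustrative):
SELECT
"ICB Name",
LEN(WKT) ORIGINAL_LENGTH,
-- ST_SIMPLIFY removes vertices within the given tolerance (in metres)
LEN(ST_ASWKT(ST_SIMPLIFY(TO_GEOGRAPHY(WKT), 100))) SIMPLIFIED_LENGTH
FROM HEALTH_BOUNDARIES_NORMALISED
ORDER BY ORIGINAL_LENGTH DESC;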
I have chosen to create a precision-reducing function, which reduces the character count without any meaningful loss of accuracy (six decimal places is roughly 11 cm on the ground).
Create a Function to reduce precision
CREATE OR REPLACE FUNCTION py_reduceprecision(geo geography, n integer)
returns geography
language python
runtime_version = 3.8
packages = ('shapely')
handler = 'udf'
AS $$
from shapely.geometry import shape, mapping
from shapely import wkt

def udf(geo, n):
    if n < 0:
        raise ValueError('Number of digits must be positive')
    else:
        # the GEOGRAPHY argument arrives as a GeoJSON mapping;
        # round the co-ordinates by round-tripping through WKT
        # with a fixed precision
        g1 = shape(geo)
        updated_shape = wkt.loads(wkt.dumps(g1, rounding_precision=n))
        return mapping(updated_shape)
$$;
Run a query to check the difference in Character length
SELECT
WKT,
LEN(WKT) WKT_LENGTH,
ST_ASWKT(py_reduceprecision(ST_GEOGRAPHYFROMWKT(WKT), 6)) WKT_REDUCED,
LEN(WKT_REDUCED) WKT_REDUCED_LENGTH,
* EXCLUDE (WKT)
FROM HEALTH_BOUNDARIES_NORMALISED
ORDER BY WKT_REDUCED_LENGTH DESC;
You will note that although the lengths have reduced, some of the boundaries are still over 30,000 characters. Therefore I will utilise ST_SIMPLIFY for just the boundaries affected.
I did a quick check before creating a new table
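The check was along these lines - the maximum length of the precision-reduced WKT:
SELECT MAX(LEN(ST_ASWKT(py_reduceprecision(ST_GEOGRAPHYFROMWKT(WKT), 6)))) MAX_REDUCED_LENGTH
FROM HEALTH_BOUNDARIES_NORMALISED;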
You can see that the max character length is around 52,000, which I am OK with: I can split the larger values across multiple fields and use DAX in Power BI to concatenate them back together, and two fields are enough for this. So let's create a new table, with two columns that split the well-known text whenever it exceeds 30,000 characters.
CREATE OR REPLACE DYNAMIC TABLE S_HEALTH_BOUNDARIES_NORMALISED
TARGET_LAG = '1 Day'
WAREHOUSE = DYNAMIC_TABLE AS
SELECT
CASE WHEN WKT_LENGTH > 50000 THEN ST_SIMPLIFY(GEOGRAPHY, 500) ELSE GEOGRAPHY END GEOGRAPHY_SIM,
ST_ASWKT(GEOGRAPHY_SIM) S_WKT,
SUBSTR(S_WKT, 1, 29999) S_WKT1,
SUBSTR(S_WKT, 30000) S_WKT2,
* EXCLUDE (WKT, WKT_LENGTH, WKT_REDUCED, WKT_REDUCED_LENGTH, GEOGRAPHY)
FROM
(
SELECT
WKT,
LEN(WKT) WKT_LENGTH,
TO_GEOGRAPHY(py_reduceprecision(ST_GEOGRAPHYFROMWKT(WKT), 6)) GEOGRAPHY,
ST_ASWKT(GEOGRAPHY) WKT_REDUCED,
LEN(WKT_REDUCED) WKT_REDUCED_LENGTH,
* EXCLUDE (WKT)
FROM HEALTH_BOUNDARIES_NORMALISED
);
ALTER DYNAMIC TABLE S_HEALTH_BOUNDARIES_NORMALISED REFRESH;
In Power BI I would like the user to select Regions, ICBs and then Sub ICBs using a simple filter. This can be achieved by creating spatial joins between the three layers, so we know which Region each ICB belongs to and which ICB each Sub ICB belongs to.
CREATE OR REPLACE DYNAMIC TABLE GEOGRAPHY_LOOKUP
TARGET_LAG = '1 day'
WAREHOUSE = DYNAMIC_TABLE AS
SELECT A."Sub ICB Code", B."ICB Code", C."Regional Code", A."Sub ICB Name", B."ICB Name", C."Regional Name" FROM
(SELECT * EXCLUDE ("Regional Code", "Regional Name", "ICB Code", "ICB Name") FROM S_HEALTH_BOUNDARIES_NORMALISED
WHERE "Sub ICB Code" IS NOT NULL) A
INNER JOIN
(SELECT * EXCLUDE ("Regional Code", "Regional Name", "Sub ICB Code", "Sub ICB Name") FROM S_HEALTH_BOUNDARIES_NORMALISED
WHERE "ICB Code" IS NOT NULL) B
ON ST_WITHIN(ST_CENTROID(A.GEOGRAPHY_SIM), B.GEOGRAPHY_SIM) -- the centroid is a single central point, guaranteeing a definite parent ICB per Sub ICB even where boundaries overlap slightly
INNER JOIN
(SELECT * EXCLUDE ("ICB Code", "ICB Name", "Sub ICB Code", "Sub ICB Name") FROM S_HEALTH_BOUNDARIES_NORMALISED
WHERE "Regional Code" IS NOT NULL) C
ON ST_WITHIN(ST_CENTROID(B.GEOGRAPHY_SIM), C.GEOGRAPHY_SIM);
Next, let's create a query which stacks our three boundary levels consistently into shared Code, Name and Type columns.
CREATE OR REPLACE DYNAMIC TABLE GEOGRAPHY_ICON_MAP
TARGET_LAG = '1 Day'
WAREHOUSE = DYNAMIC_TABLE AS
SELECT S_WKT1, S_WKT2, "ICB Code" "Code", "ICB Name" "Name", 'ICB' "Type"
FROM S_HEALTH_BOUNDARIES_NORMALISED WHERE "Code" IS NOT NULL
UNION ALL
SELECT S_WKT1, S_WKT2, "Sub ICB Code" "Code", "Sub ICB Name" "Name", 'Sub ICB' "Type"
FROM S_HEALTH_BOUNDARIES_NORMALISED WHERE "Code" IS NOT NULL
UNION ALL
SELECT S_WKT1, S_WKT2, "Regional Code" "Code", "Regional Name" "Name", 'Region' "Type"
FROM S_HEALTH_BOUNDARIES_NORMALISED WHERE "Code" IS NOT NULL;
ALTER DYNAMIC TABLE GEOGRAPHY_ICON_MAP REFRESH;
SELECT * FROM GEOGRAPHY_ICON_MAP;
We now have the geometries ready. Next we need a spatial join to link the accidents to the Sub ICBs. We will also summarise the data for simplicity.
CREATE OR REPLACE DYNAMIC TABLE "Accident Summary"
TARGET_LAG = '1 day'
WAREHOUSE = DYNAMIC_TABLE AS
select
B."Sub ICB Code"
,C."ICB Code"
,C."Regional Code"
,count(*)"Accidents"
, sum("Fatal") "Fatal"
,sum("Serious") "Serious"
from
(SELECT ST_MAKEPOINT(TRY_TO_DECIMAL("longitude",12,6),TRY_TO_DECIMAL("latitude",12,6)) POINT,
CASE "accident_severity" WHEN 1 THEN 1 ELSE 0 END "Fatal",
CASE "accident_severity" WHEN 2 THEN 1 ELSE 0 END "Serious"
FROM accidents_2021) A
INNER JOIN
(SELECT GEOGRAPHY_SIM,"Sub ICB Code" FROM S_HEALTH_BOUNDARIES_NORMALISED WHERE "Sub ICB Code" is not null) B
ON
ST_WITHIN(A.POINT,B.GEOGRAPHY_SIM)
inner join
geography_lookup C on B."Sub ICB Code" = C."Sub ICB Code"
group by all
;
ALTER DYNAMIC TABLE "Accident Summary" REFRESH;
Now we have summary data for each Sub ICB Code.
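A quick look at the result:
SELECT * FROM "Accident Summary" ORDER BY "Accidents" DESC LIMIT 10;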
Before we move into Power BI, let's create three views in our CURATED schema so our Power BI users can leverage them.
USE SCHEMA CURATED;
CREATE OR REPLACE VIEW "Geographies for Map" as
SELECT * FROM RAW.GEOGRAPHY_ICON_MAP;
CREATE OR REPLACE VIEW "Accident Summary" as
SELECT * FROM RAW."Accident Summary";
CREATE OR REPLACE VIEW "Health Care Lookups" as
select * from RAW.GEOGRAPHY_LOOKUP;
One more thing: let's create a table called 'Type'. This will allow the user to toggle between the different boundary levels without actually filtering the data.
create or replace table TYPE AS
SELECT COLUMN1 "Code", COLUMN2 "Type" FROM (VALUES(1,'Region'),(2,'ICB'),(3,'Sub ICB'));
In Power BI, once connected to our data, we will create a data model which may look like this:
Please do not join the Type table to anything, but do load it into the model.
I then create a measure for WKT which concatenates the two fields
WKT = CONCATENATE(LASTNONBLANK('Geographies for Map'[S_WKT1],0),LASTNONBLANK('Geographies for Map'[S_WKT2],0))
I create a measure for Accidents
Sum Accidents = SUM('Accident Summary'[Accidents])
Let's see what this looks like initially.
Import the Icon Map visual and, from the Geographies for Map table, drag Code into the Category field and Sum Accidents into the Size field.
Switch to the formatting options for the visual. Open the Objects section and click the function button under Image/WKT/GeoJSON. Select the WKT measure from Geographies for Map.
In the Formatting section, press the function button under Fill Colour and select the Sum Accidents measure.
Now add a slicer and drag Type from the Type table into it. Currently there is no relationship from Type to any of the other tables, and we will not be creating one - it is purely for switching calculations when selected.
The measures we create next will automatically switch which relationship is used between the geography table and the accident data.
Modify the Sum Accidents measure so the join changes dynamically when the user interacts with the Type slicer:
Sum Accidents =
VAR ICB = CALCULATE(SUM('Accident Summary'[Accidents]), USERELATIONSHIP('Accident Summary'[ICB Code], 'Geographies for Map'[Code]))
VAR SUBICB = CALCULATE(SUM('Accident Summary'[Accidents]), USERELATIONSHIP('Accident Summary'[Sub ICB Code], 'Geographies for Map'[Code]))
VAR REGION = CALCULATE(SUM('Accident Summary'[Accidents]), USERELATIONSHIP('Accident Summary'[Regional Code], 'Geographies for Map'[Code]))
RETURN
SWITCH(SELECTEDVALUE('Type'[Code]), 1, REGION, 2, ICB, 3, SUBICB)
Also create a new measure for dynamic polygons. The concept is the same; only the base expression is different.
Dynamic WKT =
VAR ICB = CALCULATE([WKT], USERELATIONSHIP('Accident Summary'[ICB Code], 'Geographies for Map'[Code]))
VAR SUBICB = CALCULATE([WKT], USERELATIONSHIP('Accident Summary'[Sub ICB Code], 'Geographies for Map'[Code]))
VAR REGION = CALCULATE([WKT], USERELATIONSHIP('Accident Summary'[Regional Code], 'Geographies for Map'[Code]))
RETURN
SWITCH(SELECTEDVALUE('Type'[Code]), 1, REGION, 2, ICB, 3, SUBICB)
Modify the WKT setting in the Objects section of Icon Map to use this new Dynamic WKT measure.
Now we want to add filters to zoom into areas - for example, to see only the Sub ICBs for a region. This is where our Health Care Lookups come into play.
The easiest way to approach this is almost from a star schema perspective - we will create three views in Snowflake from the original dynamic table, which will be our three dimensions to work with.
CREATE OR REPLACE VIEW "Regional Codes" as
select distinct "Regional Code","Regional Name" from RAW.GEOGRAPHY_LOOKUP;
CREATE OR REPLACE VIEW "ICB Codes" as
select distinct "ICB Code","ICB Name" from RAW.GEOGRAPHY_LOOKUP;
CREATE OR REPLACE VIEW "Sub ICB Codes" as
select distinct "Sub ICB Code","Sub ICB Name" from RAW.GEOGRAPHY_LOOKUP;
In Power Query, add these three tables to the model. For ICBs and Sub ICBs, I chose to make the relationships bi-directional so the dimensions only show values in the context of the parent - i.e. ICBs for a region.
Now create three slicers and format your dashboard. I have also added the other metrics.
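The other metrics can follow exactly the same dynamic-relationship pattern as Sum Accidents - a sketch for Fatal (Serious works the same way):
Sum Fatal =
VAR ICB = CALCULATE(SUM('Accident Summary'[Fatal]), USERELATIONSHIP('Accident Summary'[ICB Code], 'Geographies for Map'[Code]))
VAR SUBICB = CALCULATE(SUM('Accident Summary'[Fatal]), USERELATIONSHIP('Accident Summary'[Sub ICB Code], 'Geographies for Map'[Code]))
VAR REGION = CALCULATE(SUM('Accident Summary'[Fatal]), USERELATIONSHIP('Accident Summary'[Regional Code], 'Geographies for Map'[Code]))
RETURN
SWITCH(SELECTEDVALUE('Type'[Code]), 1, REGION, 2, ICB, 3, SUBICB)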
Now that the dashboard is built, I will change the dimension tables (including the ones used by Icon Map) to Dual storage mode. This caches the infrequently changing data in Power BI while retaining the link to Snowflake, so the live accident data can still be queried efficiently.
I hope you have enjoyed reading this article. Any queries do get in touch!