How to Query Semi-Structured JSON Data in Snowflake?

Alright, let’s kick things off. In today’s world, where data comes in all shapes and sizes, querying semi-structured data has become super important.

It’s not just about rows and columns anymore. Modern databases need to handle messy, complex data, like JSON, XML, or Parquet.

That’s where querying semi-structured data comes in—giving us the flexibility to work with dynamic data without the need for a rigid schema.

Now, JSON (JavaScript Object Notation) has kind of taken over as the go-to format for semi-structured data.

Why?

It’s lightweight, human-readable, and perfect for data interchange between web apps, APIs, and databases. JSON’s popularity has skyrocketed because it easily represents nested objects and arrays, making it ideal for complex data that doesn’t fit neatly into traditional database tables.

That’s where Snowflake comes into the picture. Snowflake has built-in features that make working with JSON super easy. You can store, query, and break down JSON data without dealing with messy ETL processes. This means less hassle and faster insights.

If you want to make your queries faster, flattening your JSON data is a great idea. JSON is like a big, messy tree where finding specific information means you often have to look through the whole thing each time.

There are tools called stream parsers that can help, but they can be tricky to use and still have to scan the entire JSON document to find what you need. On the other hand, if you convert JSON into a clear, organized format, you can index it. This makes searching much quicker, often ten times faster or more.

In this blog, we’ll break down how to query JSON data in Snowflake and make your life easier when handling semi-structured data. Whether you’re new to it or just want to sharpen your skills, learning how Snowflake handles JSON is a must for any data team using Snowflake.

TL;DR: This blog covers essential techniques for querying semi-structured JSON data in Snowflake, including how to flatten nested arrays and objects. We dive into using the FLATTEN function to handle complex, multi-level data structures, and discuss common issues like performance bottlenecks in Snowflake.

You'll learn how to parse JSON using PARSE_JSON and TRY_PARSE_JSON, explore the various ways to load data into Snowflake (ETL tools, the Snowsight UI, SnowSQL commands, and more), and see which data types you can use besides VARIANT and how to use them.

What is the VARIANT Data Type in Snowflake?

The VARIANT data type in Snowflake is used when you’re dealing with semi-structured data like JSON, Avro, Parquet, XML, or even arrays and objects. Instead of worrying about strict schemas, you can throw all that into a VARIANT column, and Snowflake will handle it.

When to Use VARIANT?

  • Flexible schemas: If your data doesn’t follow a strict structure (e.g., fields can change or vary in number), VARIANT helps you store and query that data without needing to define a rigid schema upfront. You can store JSON, arrays, numbers, strings, or even a combination of these in a single column.

  • Dynamic data ingestion: When you're pulling in data from APIs or third-party tools, and you don’t know the exact structure, it’s easier to dump it into a VARIANT column and query as you go.

  • Easy querying: Snowflake has functions (like FLATTEN, :, [] notation) that make it easy to extract and query data from VARIANT columns.

  • Schema-on-read: Instead of defining the schema upfront, you can store data as-is and define the schema at the time of querying (i.e., Snowflake parses it on the fly).

When Not to Use VARIANT?

  • When you need strict control over schema: If you're dealing with highly structured data (e.g., transactional tables), stick with traditional data types like STRING, NUMBER, and DATE instead of VARIANT.

  • Performance critical queries: Since VARIANT columns need to be parsed during queries, performance can take a hit if you’re constantly extracting data. If you query the same fields repeatedly, it’s better to normalize the data into structured columns.

  • Storage: VARIANT stores data in a compressed format, which is efficient, but querying large or deeply nested data structures can be resource-intensive.

  • Query cost: Querying and extracting data from VARIANT columns can be slower than structured columns, especially if you do it repeatedly without optimization.

  • No enforcement: It doesn’t enforce types, which means you could have a mix of data in the same column—awesome for flexibility but potentially messy if not handled right.

Example Use:

Let’s say you have a column customer_data in a table that stores JSON objects with varying fields. You’d define it like this:

CREATE TABLE customers (
  id INT,
  customer_data VARIANT
);

Then, you can insert some data like:

INSERT INTO customers (id, customer_data)
SELECT 1, PARSE_JSON('{"name": "John", "age": 30}')
UNION ALL
SELECT 2, PARSE_JSON('{"name": "Jane", "orders": [{"item": "Laptop", "price": 1200}]}');


To query specific fields from the VARIANT column:

SELECT customer_data:name::STRING, customer_data:age::INT FROM customers;

Here, you're selecting the name and age fields from the customer_data JSON object. The output will look like this:

name | age
John | 30
Jane | NULL

- The first row has both name and age.

- The second row has a name ("Jane") but no age field, so age is NULL in the result.

Or flatten nested data:

SELECT 
    customer_data:name::STRING AS name,
    order1.value:item::STRING AS item,
    order1.value:price::NUMBER AS price
FROM customers, 
LATERAL FLATTEN(input => customer_data:orders) AS order1;

Note: Avoid using ‘order’ as the alias here, because ORDER is a reserved keyword in Snowflake.

In this query, you're flattening the orders array inside the customer_data JSON object and extracting the item and price fields from each order.

Output:

name | item | price
Jane | Laptop | 1200

  • John doesn't have any orders, so he won't appear in the results.

  • Jane has one order with an item ("Laptop") and price (1200).

That was the TL;DR of the blog; we will cover LATERAL FLATTEN in more depth in the sections below. But first, let’s talk about how to transform strings into the VARIANT type and load data into Snowflake so you can run queries on it later on.

Snowflake JSON Parsing Functions: PARSE_JSON and TRY_PARSE_JSON

In Snowflake, when you're working with JSON data, it's common to need a way to transform strings into VARIANT data types so that Snowflake can recognize and query them as JSON. That's where the PARSE_JSON and TRY_PARSE_JSON functions come in.

1. PARSE_JSON Function

What it does:

The PARSE_JSON function converts a string that contains a valid JSON into Snowflake's VARIANT data type. This allows Snowflake to treat the data as semi-structured, meaning you can query it just like any other JSON object.

Usage Example:

SELECT PARSE_JSON('{"name": "Alice", "age": 30}') AS json_data;

Output:

json_data
{ "age": 30, "name": "Alice" }

How it works:

In this example, the PARSE_JSON function is turning a simple JSON string into a VARIANT object, which can then be queried just like any other semi-structured data in Snowflake.

When to Use:

  • Use PARSE_JSON when you're sure that the input string contains valid JSON.

  • It's great for when you're working with well-formed data that doesn’t have the risk of errors, as any invalid JSON will throw an error.

When Not to Use:

  • Avoid using PARSE_JSON with untrusted or dynamically generated data because if the JSON is invalid, Snowflake will throw an error and the query will fail.

2. TRY_PARSE_JSON Function

What it does:
TRY_PARSE_JSON is a safer version of PARSE_JSON. It does the same thing—converts a string into a JSON VARIANT—but instead of throwing an error on invalid JSON, it returns NULL.

Usage Example:

SELECT TRY_PARSE_JSON('{"name": "Alice", "age": 30}') AS valid_json,
TRY_PARSE_JSON('Invalid JSON string') AS invalid_json;

Output:

valid_json | invalid_json
{ "age": 30, "name": "Alice" } | NULL

How it works:
Here, the first string is valid JSON, so TRY_PARSE_JSON successfully converts it into a VARIANT. The second string is invalid, so it returns NULL instead of throwing an error.

When to Use:

  • Use TRY_PARSE_JSON when you’re working with data that might have invalid or unstructured JSON.

  • It’s especially useful when you want to avoid errors in your queries and handle cases where the input string may not be well-formed.

When Not to Use:

  • You might not want to use this if you always expect valid JSON and want to know when something goes wrong (since it silently fails by returning NULL).

Differences Between PARSE_JSON and TRY_PARSE_JSON

Feature | PARSE_JSON | TRY_PARSE_JSON
Error Handling | Throws an error on invalid JSON | Returns NULL on invalid JSON
Use Case | Best for trusted, well-formed JSON data | Best for untrusted or potentially invalid JSON data
Risk | Can cause query failures on invalid JSON | Safer, as it won’t fail your query

Key Takeaways

  • PARSE_JSON: Use it when you’re confident the JSON string is valid. It’s fast and precise, but will error out on bad data.

  • TRY_PARSE_JSON: Use it when dealing with untrusted or dynamic JSON, where you need to handle invalid data gracefully without breaking your queries.

Methods to Load JSON Data into Snowflake

When you’re dealing with JSON data in Snowflake, there are multiple ways to get that data into your tables. Each method has its own use case, and your choice depends on the tools you’re using and how automated or manual you want the process to be. Let’s break down the main methods for loading JSON data:

1. ETL Tools to Snowflake

ETL (Extract, Transform, Load) tools are pretty popular for automating the data pipeline. Tools like Fivetran, Stitch, or Matillion can help you load JSON data into Snowflake.

  • How it works: These tools extract JSON from source systems (like databases or APIs), do any necessary transformations you want, and load it into Snowflake.

  • Why it’s useful: It’s perfect for automating data ingestion, especially when you’re dealing with multiple data sources and need a scalable solution.

Example: Fivetran pulls JSON from a REST API, transforms it (if needed), and pushes it to Snowflake in VARIANT columns.

2. Snowsight Loading (Using the Snowflake UI)

If you want a more visual way to load JSON data into Snowflake, the Snowsight UI can help. It's pretty user-friendly and requires no command-line work.

  • How it works: You can use the web interface to drag and drop JSON files or connect to cloud storage like Amazon S3. Snowflake handles the rest, parsing the data and loading it into tables.

  • Why it’s useful: This is great for quick, one-time loads of JSON files—especially if you’re not comfortable with command-line tools.

Steps:

  1. Go to Snowsight (Snowflake’s web UI).

  2. Select your database and table.

  3. Click Load Data, upload your JSON file or select from S3, and follow the wizard.

3. Data Staging from S3

If your JSON data is stored in Amazon S3, you can use Snowflake’s staging area to load the data into your tables.

  • How it works: First, you set up an external stage that points to your S3 bucket. Then you use a COPY INTO command to load the JSON data into a Snowflake table.

  • Why it’s useful: This method is highly efficient for bulk or recurring loads from cloud storage, and it's automated if you have regularly updated data in S3.

Example:

CREATE OR REPLACE STAGE my_stage URL='s3://my-bucket/'
CREDENTIALS = (AWS_KEY_ID='xxx' AWS_SECRET_KEY='yyy');
COPY INTO my_table
FROM @my_stage/file.json
FILE_FORMAT = (TYPE = 'JSON');

4. Loading Using SnowSQL (Command Line)

For those who prefer the command line or need more control over the process, SnowSQL is Snowflake’s command-line tool. You can load JSON data using an internal stage and the COPY INTO command.

Step 1: Internal Stage Method

An internal stage is a temporary area in Snowflake’s storage where you can upload files before loading them into a table.

  • How it works: First, you upload your JSON file to an internal stage using the PUT command. Then, you use COPY INTO to load the data into your table.

  • Why it’s useful: Internal staging is ideal for ad-hoc or manual loads, especially when you need full control over the process.

Example:

Upload the JSON file to the stage:

PUT file://path_to_local_file.json @%my_table;

Load the data into the table:

COPY INTO my_table
FROM @%my_table/file.json
FILE_FORMAT = (TYPE = 'JSON');

Step 2: Associated Steps (PUT Command, COPY INTO)

  • PUT command: Uploads your local JSON files to Snowflake’s internal stage.

  • COPY INTO: This is where the magic happens—it parses the JSON data and loads it into your table.

Querying JSON Data in Snowflake

Let’s start with some sample data. Suppose the json_data column in my_json_table holds the following rows:

json_data
{"brand_name": "Prometheus System Bone Screw", "company_name": "Jeil Medical Corporation", "device_count_in_base_package": "1"}
{"brand_name": "Atlas Surgical Instruments", "company_name": "Atlas Healthcare", "device_count_in_base_package": "2"}
{"brand_name": "Smith & Nephew Hip Implant", "company_name": "Smith & Nephew", "device_count_in_base_package": "5"}
{"company_name": "No Brand Company", "device_count_in_base_package": "1"}

Assuming you’ve loaded your JSON data into Snowflake—now it's time to query it! As we already know JSON in Snowflake is stored in VARIANT data types, which means it can handle pretty much any semi-structured format. Let’s walk through some common querying techniques before you start flattening your data for more complex tasks.

Performing Simple Reads Before Flattening

Before you even worry about flattening, you can run simple queries to see your JSON data as-is. It’s like taking a peek at the whole object without breaking it down just yet.

SELECT * FROM my_json_table;

What’s happening: This is the simplest query you can run. It just selects the entire JSON object stored in a VARIANT column and shows it in its raw format.

Selecting Specific Fields Using the GET_PATH Function

To pull out specific values from the JSON, you can use the GET_PATH function. It’s a great way to grab values directly from the VARIANT column without needing to flatten anything.

SELECT GET_PATH(json_data, 'brand_name') AS brand_name
FROM my_json_table;

  • What’s happening: The GET_PATH function extracts the value for the specified key, in this case brand_name, from the JSON object.

  • Why it’s useful: This function allows you to directly grab specific fields without needing to explode the entire JSON structure, making it quicker and simpler.

Output:

brand_name
Prometheus System Bone Screw
Atlas Surgical Instruments
Smith & Nephew Hip Implant
(null)

Casting JSON Data into Native SQL Data Types Using :: Operator

Sometimes, the value extracted from JSON may need to be cast into a specific SQL type (like STRING, NUMBER, etc.). You can use the :: operator to explicitly cast it.

SELECT GET_PATH(json_data, 'device_count_in_base_package')::NUMBER AS device_count
FROM my_json_table;

  • What’s happening: Here, we’re extracting the device_count_in_base_package field and casting it to a NUMBER type using ::NUMBER. This converts the raw JSON data into something you can perform calculations or comparisons with.

  • Why it’s useful: JSON data is often stored in string format, so you’ll need to cast it to proper SQL types when performing mathematical operations or applying filters.

Output:

device_count
1
2
5
1

Let’s now standardize things moving forward. We will use the JSON data structure below to write queries and examine their output.

The Sample Query Data

This structure is designed to give you a broad range of nested data to query and flatten so that you understand how to write such complex queries.

The data:

{
  "customer": {
    "id": "C123",
    "name": "John Doe",
    "orders": [
      {
        "order_id": "O1001",
        "date": "2023-09-01",
        "items": [
          {
            "item_id": "I501",
            "name": "Laptop",
            "price": 1200,
            "categories": ["electronics", "computers"]
          },
          {
            "item_id": "I502",
            "name": "Mouse",
            "price": 20,
            "categories": ["electronics", "accessories"]
          }
        ],
        "shipping": {
          "address": {
            "street": "1234 Elm St",
            "city": "Los Angeles",
            "state": "CA",
            "postal_code": "90001"
          },
          "delivery_date": "2023-09-05"
        }
      },
      {
        "order_id": "O1002",
        "date": "2023-09-03",
        "items": [
          {
            "item_id": "I503",
            "name": "Keyboard",
            "price": 50,
            "categories": ["electronics", "accessories"]
          },
          {
            "item_id": "I504",
            "name": "Monitor",
            "price": 300,
            "categories": ["electronics", "displays"]
          }
        ],
        "shipping": {
          "address": {
            "street": "4321 Oak St",
            "city": "San Francisco",
            "state": "CA",
            "postal_code": "94101"
          },
          "delivery_date": "2023-09-07"
        }
      }
    ]
  },
  "feedback": [
    {
      "order_id": "O1001",
      "rating": 4,
      "comments": "Fast shipping and good products."
    },
    {
      "order_id": "O1002",
      "rating": 5,
      "comments": "Everything arrived on time."
    }
  ],
  "preferences": {
    "payment_methods": [
      {
        "type": "credit_card",
        "details": {
          "last4": "1234",
          "expiry": "2025-01"
        }
      },
      {
        "type": "paypal",
        "email": "[email protected]"
      }
    ],
    "newsletter_subscribed": true
  }
}

Explanation of the JSON Structure:

  1. Customer Object:

    • Contains details like id, name, and an orders array.

    • The orders array contains objects, each representing an order with an items array (array of objects).

  2. Feedback Array:

    • Contains feedback related to the orders, with each object referencing order_id and the corresponding rating and comments.
  3. Preferences Object:

    • Contains arrays for payment_methods and other preferences like newsletter_subscribed.

Alright, so you’re trying to take this nested JSON and make it queryable in Snowflake using LATERAL FLATTEN. Let's walk through this step by step.

How to create a table with a JSON column in Snowflake?

Step 1: Loading the JSON into a Snowflake Table

CREATE OR REPLACE TABLE json_data (
    device_info VARIANT
);


Then you’d insert your JSON into this table:

INSERT INTO json_data 
SELECT TRY_PARSE_JSON('
{
  "customer": {
    "id": "C123
...
...
}'
 );


Assuming your JSON is loaded into a table in Snowflake, let's call it json_data with a VARIANT column named device_info. The VARIANT type in Snowflake allows you to store semi-structured data like JSON, so we’ll start by querying that.

Now that your data is in Snowflake, we can move on to the flattening part.

What is FLATTEN()?

In Snowflake, FLATTEN() is a table function used to break down or "unpack" nested structures like arrays or objects in semi-structured data (JSON, Avro, XML, and so on, stored as VARIANT, OBJECT, or ARRAY). Basically, when you have data all bundled up in nested formats, FLATTEN() helps you spread it out into a more familiar table structure so you can query it easily.
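
Here’s a minimal, standalone sketch of FLATTEN() used as a table function (the array literal is just for illustration, not data from this post):

SELECT f.index, f.value
FROM TABLE(FLATTEN(input => PARSE_JSON('["a", "b", "c"]'))) AS f;

-- Each array element becomes its own row:
-- index | value
--   0   | "a"
--   1   | "b"
--   2   | "c"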

What is LATERAL FLATTEN()?

LATERAL FLATTEN() is a more powerful way to use FLATTEN() in Snowflake. The lateral part just means that you can reference columns from the main query inside the FLATTEN() function. It allows you to flatten arrays or objects in each row of a query result, instead of flattening an entire table.

It's like saying, “Hey, for each row, flatten this specific column.”

Notice the use of LATERAL before FLATTEN() in the examples throughout this post. It makes sure that, for each row, we only flatten that row's value of the chosen column (for example, that row's orders field), keeping things scoped to each row in your original query. In almost every use case, you will use LATERAL and FLATTEN together.

How Snowflake Handles Flattening of Objects

In Snowflake, objects are essentially key-value pairs (similar to dictionaries or maps in programming). To query the nested structure of objects, you generally use dot notation or the : operator.

You can also pass an object to FLATTEN() if you want to break its key-value pairs out into separate rows; each pair becomes a row with key and value columns (see the sketch at the end of this section). In practice, though, extracting values from objects usually means referencing the specific keys you’re interested in.

Flattening Objects (Dot Notation)

SELECT data:key1 AS key1_value, data:key2 AS key2_value
FROM my_table;

In this case, the data column stores a JSON object, and we can access specific keys using the : operator.

Limitations:

  • Key-by-Key Access: Most of the time you’ll extract object values by naming the keys you’re interested in. If you do need every key-value pair as separate rows, FLATTEN() accepts an object as input too (see the sketch below).

  • Dot Notation: It's the primary way to handle nested objects. If objects are deeply nested, you can chain : to drill down.

Querying Nested Objects:

SELECT data:outer_key:inner_key AS inner_value
FROM my_table;

This works when you have an object inside another object and want to extract values deeply nested.
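
If you do need every key-value pair as rows, FLATTEN() accepts an object as input. A minimal sketch (the object literal is just for illustration):

SELECT f.key, f.value
FROM TABLE(FLATTEN(input => PARSE_JSON('{"city": "Los Angeles", "state": "CA"}'))) AS f;

-- One row per key-value pair:
-- key   | value
-- city  | "Los Angeles"
-- state | "CA"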

How Snowflake Handles Flattening of Arrays?

Arrays in Snowflake can contain a list of values, and to query them effectively, Snowflake provides the FLATTEN() function. This function "explodes" an array by turning each element into a row. It’s incredibly useful when dealing with JSON arrays.

  • Objects: Use dot notation or : to query specific keys; FLATTEN() also works on objects, emitting one row per key-value pair when you need them all.

  • Arrays: Use LATERAL FLATTEN to explode arrays into rows. For nested arrays, you can flatten them layer by layer.

This approach helps you query even complex nested JSON objects by flattening one level of the structure at a time, making it more manageable to work with semi-structured data in Snowflake.

1. Basic Query to Access Fields Using Brackets ([])

First, let’s start by selecting specific fields from the JSON object directly using the bracket ([]) notation.

Query: Access customer name, first order date, and first item name

SELECT
  device_info:customer:name::string AS customer_name,
  device_info:customer:orders[0]:date::string AS first_order_date,
  device_info:customer:orders[0]:items[0]:name::string AS first_item_name
FROM json_data;

Explanation:

  • device_info is the column name.

  • customer:name::string: Accesses the name field inside the customer object.

  • customer:orders[0]:date::string: Accesses the date field of the first order in the orders array.

  • customer:orders[0]:items[0]:name::string: Accesses the name of the first item in the first order.

Expected Output:

customer_name | first_order_date | first_item_name
John Doe | 2023-09-01 | Laptop

2. Using Dot Notation (:) to Access Nested Arrays

Now, let's query more complex structures like an array of objects. Here, we use the : operator to access fields in deeply nested arrays.

Query: Access the shipping address of the second order

SELECT
  device_info:customer:orders[1]:shipping:address:street::string AS street,
  device_info:customer:orders[1]:shipping:address:city::string AS city,
  device_info:customer:orders[1]:shipping:address:state::string AS state,
  device_info:customer:orders[1]:shipping:address:postal_code::string AS postal_code
FROM json_data;

Explanation:

  • customer:orders[1]:shipping:address:street: Accesses the street field of the shipping address of the second order in the orders array.

  • Similarly, other fields (city, state, postal_code) are extracted from the address.

Expected Output:

street | city | state | postal_code
4321 Oak St | San Francisco | CA | 94101

Alternatives to Dot Notation

Although dotted notation is a convenient way to access nested fields, Snowflake provides other methods that allow you to access these nested values, especially in more complex situations.

1. Using the GET_PATH Function

GET_PATH allows you to access nested values dynamically by specifying the path as an array of strings.

Query:

SELECT
  device_info:customer:name::string AS customer_name,
  GET_PATH(device_info:customer, 'orders[0].order_id')::string AS order_id,
  GET_PATH(device_info:customer, 'orders[0].items[0].name')::string AS first_item_name
FROM json_data;

Explanation:

  • GET_PATH(device_info:customer, 'orders[0].order_id'): This retrieves the order_id from the first element of the orders array using a path expressed as a string.

  • The path is a constant string that can mix object keys and array indices, for example 'orders[0].items[0].name'.

Expected Output:

customer_name | order_id | first_item_name
John Doe | O1001 | Laptop

2. Using Bracket [] Notation

You can also use bracket notation with the field names as strings. This is useful when your field names contain special characters, spaces, or if you are working with dynamically named fields.

Query:

SELECT
  device_info:customer['name']::string AS customer_name,
  device_info:customer['orders'][0]['order_id']::string AS order_id,
  device_info:customer['orders'][0]['items'][0]['name']::string AS first_item_name
FROM json_data;

Explanation:

  • Bracket notation allows you to access fields when the key name may contain characters not allowed in standard dot notation (e.g., spaces or special characters).

Expected Output:

customer_name | order_id | first_item_name
John Doe | O1001 | Laptop

3. Flattening the Items Array Using LATERAL FLATTEN

Next, let’s flatten the items array inside the orders array using LATERAL FLATTEN to access individual items.

Query: Flatten the items array for the first order and display each item

SELECT
  device_info:customer:name::string AS customer_name,
  flattened.value:item_id::string AS item_id,
  flattened.value:name::string AS item_name,
  flattened.value:price::number AS item_price
FROM json_data,
LATERAL FLATTEN(input => device_info:customer:orders[0]:items) AS flattened;

Explanation:

  • LATERAL FLATTEN(input => customer:orders[0]:items): Flattens the items array inside the first order (orders[0]).

  • Each item is extracted as a separate row, and fields like item_id, name, and price are accessed.

Expected Output:

customer_name | item_id | item_name | item_price
John Doe | I501 | Laptop | 1200
John Doe | I502 | Mouse | 20

4. Flattening Nested Arrays: Categories Inside Items

When you're dealing with multiple levels of nested arrays in Snowflake, using the FLATTEN() function becomes even more interesting. You can "flatten" one level at a time, working your way deeper into the nested structure. Now, let’s flatten the nested categories array inside each item.

Query: Flatten the items and categories arrays

SELECT
  device_info:customer:name::string AS customer_name,
  flattened_items.value:name::string AS item_name,
  flattened_categories.value::string AS category
FROM json_data,
LATERAL FLATTEN(input => device_info:customer:orders[0]:items) AS flattened_items,
LATERAL FLATTEN(input => flattened_items.value:categories) AS flattened_categories;

Explanation:

  • The first LATERAL FLATTEN flattens the items array, and the second one flattens the categories array inside each item.

  • This will return one row per category for each item.

Expected Output:

customer_name | item_name | category
John Doe | Laptop | electronics
John Doe | Laptop | computers
John Doe | Mouse | electronics
John Doe | Mouse | accessories

So, till now:

  1. Nested Use of FLATTEN(): We used two FLATTEN() functions, one to unpack the outer orders array and another to unpack the inner items array.

  2. LATERAL Keyword: The LATERAL keyword is crucial when working with nested arrays, as it lets each FLATTEN() function reference the row from the previous step.

  3. Scalability: You can nest as many FLATTEN() functions as needed if you have more deeply nested structures.

5. Accessing Data Without Flattening

In cases where you want to extract data from specific array indices without flattening the entire array, you can use the [] notation.

Query: Access the first category of the first item in the first order

SELECT
  device_info:customer:name::string AS customer_name,
  device_info:customer:orders[0]:items[0]:categories[0]::string AS first_item_first_category
FROM json_data;

Explanation:

  • customer:orders[0]:items[0]:categories[0]: Accesses the first category of the first item in the first order.

Expected Output:

customer_name | first_item_first_category
John Doe | electronics

Running aggregate queries after Flattening of JSON

You can combine flattening with aggregate functions to calculate metrics such as the total price of items for each order.

Query: Calculate the total price for each order

SELECT
  device_info:customer:name::string AS customer_name,
  flattened_orders.value:order_id::string AS order_id,
  SUM(flattened_items.value:price::number) AS total_order_price
FROM json_data,
LATERAL FLATTEN(input => device_info:customer:orders) AS flattened_orders,
LATERAL FLATTEN(input => flattened_orders.value:items) AS flattened_items
GROUP BY device_info:customer:name, flattened_orders.value:order_id;

Explanation:

  • We flatten both the orders and items arrays.

  • We then aggregate the price field of the items to calculate the total price for each order.

Expected Output:

customer_name | order_id | total_order_price
John Doe | O1001 | 1220
John Doe | O1002 | 350

1. Aggregate Query: Count the Number of Items per Order

You can count how many items each order contains after flattening the items array.

Query: Count the number of items per order

COUNT(flattened_items.value) AS item_count
-- replace the SUM() in the previous query with COUNT()
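
For reference, here is the previous aggregate query with COUNT() swapped in for SUM():

SELECT
  device_info:customer:name::string AS customer_name,
  flattened_orders.value:order_id::string AS order_id,
  COUNT(flattened_items.value) AS item_count
FROM json_data,
LATERAL FLATTEN(input => device_info:customer:orders) AS flattened_orders,
LATERAL FLATTEN(input => flattened_orders.value:items) AS flattened_items
GROUP BY device_info:customer:name, flattened_orders.value:order_id;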

Explanation:

  • Flattening the orders array first and then the items array.

  • Using COUNT(flattened_items.value) to count the number of items for each order.

Expected Output:

customer_name | order_id | item_count
John Doe | O1001 | 2
John Doe | O1002 | 2

2. Aggregate Query: Calculate Average Price of Items per Order

We can compute the average price of items per order using AVG().

AVG(flattened_items.value:price::number) AS avg_item_price
-- replace the SUM() in the previous query with AVG()

Expected Output:

customer_name | order_id | avg_item_price
John Doe | O1001 | 610
John Doe | O1002 | 175

Using TO_VARIANT to Convert Values to VARIANT Type

If your data is not in the VARIANT format, TO_VARIANT can be used to cast any data type into Snowflake’s VARIANT type, making it useful when working with semi-structured data.

Example: Casting Simple Values to VARIANT

SELECT
  TO_VARIANT('{"name": "Alice"}') AS variant_json,
  TO_VARIANT(123) AS variant_number,
  TO_VARIANT(CURRENT_DATE) AS variant_date;


Explanation:

  • TO_VARIANT('{"name": "Alice"}'): Converts the string into a VARIANT type.

  • This is useful for dynamically handling structured and semi-structured data.

The reason you see backslashes (\) in the output of the TO_VARIANT function is because Snowflake automatically escapes special characters (like double quotes) when outputting VARIANT data as a string representation of JSON. This is because the VARIANT type stores data in a semi-structured format, and Snowflake escapes these characters to ensure the integrity of the JSON structure.
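
To make the difference concrete, here is a small sketch contrasting TO_VARIANT on a JSON-looking string with PARSE_JSON on the same string:

SELECT
  TO_VARIANT('{"name": "Alice"}') AS variant_string,  -- a VARIANT holding a plain string, displayed with escaped quotes
  PARSE_JSON('{"name": "Alice"}') AS variant_object;  -- a VARIANT holding a JSON object, no escaping

If you want the value treated as JSON (so you can use the : operator and FLATTEN on it), parse it with PARSE_JSON rather than casting the string with TO_VARIANT.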

Working with Array Stored as a String

In some cases, your array may be stored as a string rather than a proper array. In such cases, you need to first parse the string into a VARIANT type.

Example: Parsing a JSON String with an Array

Assume you have a JSON array stored as a string.

SELECT PARSE_JSON('["electronics", "computers", "accessories"]') AS parsed_array;

Expected Output:

parsed_array
[ "electronics", "computers", "accessories" ]

Example: Flattening the Array Stored as a String.

You can then use LATERAL FLATTEN to query the parsed array.

SELECT
  flattened.value::string AS category
FROM json_data,
LATERAL FLATTEN(input => PARSE_JSON('["electronics", "computers", "accessories"]')) AS flattened;

Expected Output:

category
electronics
computers
accessories

ARRAY_AGG to Aggregate Flattened Data

Let’s use ARRAY_AGG() to aggregate all flattened values back into an array.

Query: Aggregate all item names into an array per order

SELECT
  device_info:customer:name::string AS customer_name,
  ARRAY_AGG(flattened_items.value:name::string) AS item_names
FROM json_data,
LATERAL FLATTEN(input => device_info:customer:orders) AS flattened_orders,
LATERAL FLATTEN(input => flattened_orders.value:items) AS flattened_items
GROUP BY device_info:customer:name;

Explanation:

  • ARRAY_AGG() aggregates all item_names into a single array for each customer.

Expected Output:

customer_name | item_names
John Doe | [ "Laptop", "Mouse", "Keyboard", "Monitor" ]

Handling Array and String Containing an Array

When your data is already an array or stored as a string containing an array, you can handle it differently.

Example: Directly Flattening an Array

If the array is directly stored as a VARIANT type (as we already did with the orders and items arrays), you can use LATERAL FLATTEN as we’ve shown in previous queries.

ARRAY_SIZE: Count Elements in an Array

To count the number of elements in an array, you can use the ARRAY_SIZE() function.

Query: Count how many categories each item has

SELECT
  device_info:customer:name::string AS customer_name,
  flattened_items.value:name::string AS item_name,
  ARRAY_SIZE(flattened_items.value:categories) AS category_count
FROM json_data,
LATERAL FLATTEN(input => device_info:customer:orders) AS flattened_orders,
LATERAL FLATTEN(input => flattened_orders.value:items) AS flattened_items;

Expected Output:

customer_name | item_name | category_count
John Doe | Laptop | 2
John Doe | Mouse | 2
John Doe | Keyboard | 2
John Doe | Monitor | 2

Key Points to Remember:

  1. Flatten arrays using LATERAL FLATTEN and access their elements in different rows.

  2. Use PARSE_JSON or TRY_PARSE_JSON to convert strings to VARIANT types and handle semi-structured data.

  3. Use ARRAY_AGG and ARRAY_SIZE to aggregate and count array elements.

  4. TO_VARIANT is useful for casting any type into the VARIANT type for flexible JSON handling.

Creating Views and Performing Table Joins After Using LATERAL FLATTEN

1. Creating a View on Flattened Data

Let's start by creating a VIEW to simplify querying the flattened JSON structure. We'll flatten the orders and items arrays and store that result in a view.

SQL to Create a View:

CREATE OR REPLACE VIEW v_flattened_orders AS
SELECT
  device_info:customer:name::string AS customer_name,
  flattened_orders.value:order_id::string AS order_id,
  flattened_items.value:name::string AS item_name,
  flattened_items.value:price::number AS item_price,
  flattened_items.value:categories AS item_categories
FROM json_data,
LATERAL FLATTEN(input => device_info:customer:orders) AS flattened_orders,
LATERAL FLATTEN(input => flattened_orders.value:items) AS flattened_items;


Explanation:

  • This view will flatten the orders array and the items array within each order.

  • We extract relevant fields like order_id, item_name, item_price, and item_categories.

To see the created view,

SELECT 
    table_name 
FROM 
    information_schema.views 
WHERE 
    table_schema = 'PUBLIC'
    AND table_catalog = 'PRIYANSH_DB';

table_name
V_FLATTENED_ORDERS

Note: PRIYANSH_DB is the name of the database, replace it with your own.

Now, you can query this view directly:

SELECT * FROM v_flattened_orders;

Expected Output:

customer_name | order_id | item_name | item_price | item_categories
John Doe | O1001 | Laptop | 1200 | [ "electronics", "computers" ]
John Doe | O1001 | Mouse | 20 | [ "electronics", "accessories" ]
John Doe | O1002 | Keyboard | 50 | [ "electronics", "accessories" ]
John Doe | O1002 | Monitor | 300 | [ "electronics", "displays" ]

2. Joining the Flattened View with Other Tables

Once we have a flattened VIEW, we can join it with other tables. For example, let's assume we have a customer_rewards table that stores rewards points for customers.

Table: customer_rewards

customer_name | rewards_points
John Doe | 500
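
If you want to follow along, here is one way you might create and populate this helper table (the values match the row above):

CREATE OR REPLACE TABLE customer_rewards (
    customer_name STRING,
    rewards_points NUMBER
);

INSERT INTO customer_rewards (customer_name, rewards_points)
VALUES ('John Doe', 500);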

Example Join Query:

SELECT
  v.customer_name,
  v.order_id,
  v.item_name,
  v.item_price,
  r.rewards_points
FROM v_flattened_orders AS v
JOIN customer_rewards AS r
ON v.customer_name = r.customer_name;

Explanation:

  • This query joins the flattened v_flattened_orders view with the customer_rewards table on the customer_name.

  • You can now analyze the data across multiple tables.

Expected Output:

customer_name | order_id | item_name | item_price | rewards_points
John Doe | O1001 | Laptop | 1200.00 | 500
John Doe | O1001 | Mouse | 20.00 | 500
John Doe | O1002 | Keyboard | 50.00 | 500
John Doe | O1002 | Monitor | 300.00 | 500

3. Join Between Multiple Flattened Views

In more complex cases, you may have multiple views from different JSON structures. Let's assume you also have a v_flattened_categories view that stores categories from a different JSON column, and you want to join that with v_flattened_orders.

SQL to Create v_flattened_categories View:

CREATE OR REPLACE VIEW v_flattened_categories AS
SELECT
  flattened_categories.value::string AS category_name
FROM json_data,
LATERAL FLATTEN(input => device_info:customer:orders[0]:items[0]:categories) AS flattened_categories;

Example Join Between Two Flattened Views:

SELECT
  v.customer_name,
  v.order_id,
  v.item_name,
  v.item_price,
  c.category_name
FROM v_flattened_orders AS v
JOIN v_flattened_categories AS c
ON ARRAY_CONTAINS(c.category_name::VARIANT, v.item_categories);

Explanation:

  • This query joins v_flattened_orders with v_flattened_categories using the ARRAY_CONTAINS function to match categories within the item_categories array.

ARRAY_CONTAINS SYNTAX:

It returns TRUE if the specified value is found in the specified array.
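
A quick illustration of the signature (the value comes first, cast to VARIANT, followed by the array):

SELECT ARRAY_CONTAINS('computers'::VARIANT, ARRAY_CONSTRUCT('electronics', 'computers')) AS found;
-- found = TRUE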

Refer here for more - https://docs.snowflake.com/en/sql-reference/functions/array_contains


4. Handling Views with Nested Arrays

If your data has more deeply nested arrays (e.g., arrays inside arrays), you can flatten multiple levels and join results at different levels of nesting.

Example: Joining on a Nested Array

Let’s say you want to join on a deeply nested tags array inside categories of an item.

First, create a view for the nested array:

CREATE OR REPLACE VIEW v_nested_tags AS

SELECT
  device_info:customer:name::string AS customer_name,
  flattened_orders.value:order_id::string AS order_id,
  flattened_items.value:name::string AS item_name,
  flattened_tags.value::string AS tag
FROM json_data,
LATERAL FLATTEN(input => device_info:customer:orders) AS flattened_orders,
LATERAL FLATTEN(input => flattened_orders.value:items) AS flattened_items,
LATERAL FLATTEN(input => flattened_items.value:categories[0]:tags) AS flattened_tags;

Join on tags with Flattened Orders:

SELECT
  v.customer_name,
  v.order_id,
  v.item_name,
  n.tag
FROM v_flattened_orders AS v
JOIN v_nested_tags AS n
ON v.customer_name = n.customer_name AND v.item_name = n.item_name;

Expected Output:

customer_name | order_id | item_name | tag
John Doe | O1001 | Laptop | new-tech
John Doe | O1001 | Laptop | high-end
John Doe | O1002 | Phone | mobile

5. Common Use Cases for Joins on Flattened Data

  1. Joining Orders and Customer Data: You can flatten the orders array and join it with a table storing customer details to perform customer-based analysis on their orders.

  2. Joining Inventory Data: If you have an inventory table with product details, you can join that with the flattened JSON structure to enrich the data with inventory-related attributes.

  3. Analyzing Categories and Tags: Flattened views allow you to join multiple nested levels of arrays (like categories and tags) for in-depth analysis on item classifications.

  4. Performance Considerations: When joining large, flattened views or tables, consider using appropriate filters and limits to optimize query performance. Snowflake handles lateral flattening efficiently, but deeply nested structures can still be expensive in terms of computation.

6. Nested Joins and Aggregations

You can also perform aggregate operations after joining flattened views or tables. For example:

Query: Calculate Total Price of Items per Customer with a Join

SELECT
  v.customer_name,
  SUM(v.item_price) AS total_spent,
  r.rewards_points
FROM v_flattened_orders AS v
JOIN customer_rewards AS r
ON v.customer_name = r.customer_name
GROUP BY v.customer_name, r.rewards_points;

Expected Output:

customer_name | total_spent | rewards_points
John Doe | 1570.00 | 500

Key Takeaways:

  • Create Views for simplified querying of flattened JSON data.

  • Join flattened views with other tables (e.g., customer, inventory, rewards).

  • Handle deeply nested arrays by creating multiple levels of views.

  • Use aggregate operations like SUM(), COUNT(), or AVG() after joining to perform advanced analytics.

  • Optimize performance by filtering and limiting results when working with large nested structures.

CTE Query Example:

We'll use the existing JSON structure and extract the customer name, order IDs, and the total price of items in each order.

WITH extracted_orders AS (
    SELECT 
        device_info:customer:name::string AS customer_name,
        ord.value:order_id::string AS order_id,
        item.value:price::float AS item_price
    FROM json_data,
    LATERAL FLATTEN(input => device_info:customer:orders) AS ord,
    LATERAL FLATTEN(input => ord.value:items) AS item
)
SELECT
    customer_name,
    order_id,
    ARRAY_AGG(item_price) AS prices_array,
    SUM(item_price) AS total_order_price
FROM extracted_orders
GROUP BY customer_name, order_id;

  1. Double LATERAL FLATTEN:

    • First, LATERAL FLATTEN(input => device_info:customer:orders) flattens the orders array.

    • Then, LATERAL FLATTEN(input => ord.value:items) flattens the items array within each order.

  2. item.value:price:

    • This extracts the price for each item from the flattened items array.
  3. Main Query:

    • ARRAY_AGG(item_price) collects all item prices in the order into an array.

    • SUM(item_price) calculates the total price of the items in each order.

Expected Output:

customer_name | order_id | prices_array | total_order_price
John Doe | O1001 | [ 1200, 20 ] | 1220
John Doe | O1002 | [ 50, 300 ] | 350

Now that we have seen so many queries and their outputs, let's talk about some theories related to flattening in Snowflake.

Difference Between Using [] to Access Values vs. Flattening

When dealing with arrays inside JSON, you have two options: access specific elements using [] or flatten the array using LATERAL FLATTEN. Each has its own use case:

Using [] to Access Values

If you know the index of the element you need in an array, you can use [] notation to grab that specific item without needing to flatten as we showed above.

When to use: Use this when you know exactly which element in the array you’re interested in and don’t need to iterate through the entire array.

Key learnings till now:

  • GET_PATH is great for selecting specific fields, especially at the top level.

  • Use [] notation when accessing specific indices in arrays without flattening.

  • Use LATERAL FLATTEN to explode arrays and nested arrays into separate rows.

  • Aggregation can be combined with flattening to calculate metrics such as sums, averages, etc.

  • Use the :: operator to cast JSON fields into native SQL types when needed.

  • Dotted notation and [] work well for accessing nested fields or specific array elements.

  • LATERAL FLATTEN is essential when you need to break down arrays into separate rows, especially when the array size varies.

  • Nested Arrays: If you have multiple levels of arrays (like items inside orders here), you need to flatten each level one at a time. You can nest LATERAL FLATTEN calls as deep as necessary.

  • Null Checks: If a field might not exist in every object, add some handling to prevent errors, for example a filter like flattened_items.value:price IS NOT NULL.

  • Performance: Flattening large arrays can be resource-intensive, so if you’re working with huge datasets, consider strategies like materializing flattened results into a table or view, filtering early, or breaking the job into smaller pieces.

Query Without LATERAL FLATTEN

Let’s say you want to avoid LATERAL FLATTEN and directly reference fields using the : or [] notations. For this, we won’t be able to split arrays into rows, so the data will stay nested.
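
A quick sketch against the same json_data table from earlier, using only the : and [] operators (no flattening):

SELECT
  device_info:customer:name::string AS customer_name,
  device_info:customer:orders[0]:order_id::string AS first_order_id,
  device_info:customer:orders[0]:items[0]:name::string AS first_item_name
FROM json_data;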

If you do it this way, you’re only accessing the first element of each array, and you won’t get additional rows for other elements.

Unfortunately, this method isn’t the best fit as the data grows. With more data, you’ll have to keep adding index values by hand for every level (orders, items, categories, and so on), because the : and [] notation alone won’t dynamically retrieve every object in an array.

Key Differences Between LATERAL FLATTEN and No Flattening:

  • Without Flatten: You only get one row with the first element from each array, not all the elements.

  • With LATERAL FLATTEN: Each array element becomes its own row, which is what you want when querying data with multiple entries in arrays.

Using LATERAL FLATTEN is crucial when you want to handle arrays like items and payment_methods, because it explodes those arrays into rows and makes each element queryable. Without flattening, you're stuck with just the first element of each array, which limits the insight you can extract from the data.

Can I query semi-structured data without using the VARIANT data type?

Yes. In addition to the VARIANT data type, Snowflake also allows the use of OBJECT and ARRAY data types to store and query semi-structured data. While the VARIANT data type provides flexibility by allowing you to store any type of semi-structured data (JSON, XML, Avro, Parquet, etc.), there are scenarios where using OBJECT or ARRAY can be more effective for specific use cases.

1. OBJECT Data Type:

  • The OBJECT data type is specifically designed to store key-value pairs. You can think of it as similar to a JSON object, where each key has an associated value, which could be a string, number, boolean, or even another OBJECT or ARRAY.

  • An OBJECT is highly useful when you have structured key-value data, but want more efficient querying of specific keys, compared to a generic VARIANT.

Example:

CREATE OR REPLACE TABLE customer_data (
    customer_info OBJECT
);
INSERT INTO customer_data 
SELECT (OBJECT_CONSTRUCT('name', 'John Doe', 'age', 30, 'email', '[email protected]'));

Now, run SELECT * FROM customer_data;

OUTPUT:

customer_info
{ "age": 30, "email": "[email protected]", "name": "John Doe" }

Querying the OBJECT:

SELECT 
    customer_info:name::string AS customer_name,
    customer_info:age::int AS customer_age
FROM customer_data;

customer_name | customer_age
John Doe | 30

In this case, we’re storing structured data with defined keys (name, age, etc.) within the OBJECT type. It allows for faster querying of specific attributes compared to VARIANT, because of its more specific type structure.

2. ARRAY Data Type:

  • The ARRAY data type is used to store ordered lists of elements, where each element can be a scalar value (like a string or integer) or a complex data type (like another OBJECT or ARRAY).

  • ARRAY is particularly useful when you are dealing with lists or collections of similar data items, such as multiple addresses for a customer, multiple items in an order, or multiple tags assigned to a blog post.

Example:

CREATE OR REPLACE TABLE customer_orders (
    customer_id STRING,
    order_ids ARRAY
);
INSERT INTO customer_orders (customer_id, order_ids)
SELECT 'C123', ARRAY_CONSTRUCT('O1001', 'O1002', 'O1003');

Now, run:

SELECT * FROM customer_orders;

Output:

customer_id | order_ids
C123 | [ "O1001", "O1002", "O1003" ]

Querying the ARRAY:

SELECT 
    customer_id,
    order_ids[0] AS first_order,
    ARRAY_SIZE(order_ids) AS total_orders
FROM customer_orders;

Output:

customer_id | first_order | total_orders
C123 | "O1001" | 3

Here, ARRAY_CONSTRUCT is used to store multiple order IDs in an ordered list, and querying an ARRAY is similar to accessing array elements in a programming language (e.g., order_ids[0] for the first element).

Benefits of Using OBJECT and ARRAY Instead of VARIANT:

  1. Performance:

    • Since OBJECT and ARRAY are more specialized data types compared to VARIANT, Snowflake can optimize queries on these types more effectively. When you know your semi-structured data will always follow a key-value or list-based format, using OBJECT or ARRAY can result in faster queries.
  2. Clarity:

    • The schema is clearer when using OBJECT or ARRAY compared to the flexible but less structured VARIANT. This makes it easier for developers to understand the structure of the data.
  3. Type-Specific Functions:

    • Snowflake provides type-specific functions for OBJECT (like GET, OBJECT_INSERT, and OBJECT_PICK) and ARRAY (like ARRAY_SIZE, ARRAY_AGG, ARRAY_TO_STRING) that simplify working with semi-structured data; see the short sketch below.
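
A minimal sketch of a few of these type-specific functions (the literal values are just for illustration):

SELECT
  OBJECT_INSERT(OBJECT_CONSTRUCT('name', 'Jane'), 'age', 30) AS enriched_object,    -- add a key to an object
  ARRAY_SIZE(ARRAY_CONSTRUCT('O1001', 'O1002', 'O1003')) AS order_count,            -- number of elements
  ARRAY_TO_STRING(ARRAY_CONSTRUCT('O1001', 'O1002', 'O1003'), ', ') AS order_list;  -- join elements into one string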

Example Comparison: VARIANT vs OBJECT/ARRAY

Let’s compare using VARIANT versus OBJECT and ARRAY for a customer’s orders.

Using VARIANT:

CREATE OR REPLACE TABLE customer_variant_data (
    customer_data VARIANT
);
INSERT INTO customer_variant_data 
SELECT (PARSE_JSON('{"name": "Jane", "orders": ["O1001", "O1002", "O1003"]}'));
SELECT customer_data:name::string AS name, 
       customer_data:orders[0] AS first_order 
FROM customer_variant_data;

Output:

name | first_order
Jane | "O1001"

Using OBJECT and ARRAY:

-- Create the table
CREATE OR REPLACE TABLE customer_structured_data (
    customer_info OBJECT,
    orders ARRAY
);

-- Insert data into the table
INSERT INTO customer_structured_data (customer_info, orders)
SELECT OBJECT_CONSTRUCT('name', 'Jane'), ARRAY_CONSTRUCT('O1001', 'O1002', 'O1003');

SELECT customer_info:name::string AS name, 
       orders[0] AS first_order
FROM customer_structured_data;

Output:

name | first_order
Jane | "O1001"

You can see the query duration difference for a single row: 129 ms for VARIANT versus 184 ms for OBJECT and ARRAY.

In this example:

  • VARIANT allows flexibility in storing arbitrary semi-structured data, but if you know the structure (e.g., customer with a name and order array), OBJECT and ARRAY can provide better clarity and query performance.

Alternatives to VARIANT:

  • VARIANT is great when the structure of your JSON or semi-structured data is unpredictable or very dynamic.

  • OBJECT and ARRAY should be preferred when your data has more predictable structures (like key-value pairs or lists) for better query optimization and ease of use.

Read more at - https://docs.snowflake.com/en/sql-reference/data-types-semistructured

The Challenges with using Snowflake

So, we were chatting with some other data engineers, and it turns out they had become a little too reliant on Snowflake's lateral flatten function to handle their nested JSON structures and table histories. It had been their go-to tool for a while because it's easy to use, so they ended up using it all over the place in their dbt project.

But as they scale, they're starting to hit a wall. One of their engineers pointed out that the flatten function doesn't just flatten the parts of the JSON they actually care about. Instead, it flattens the entire structure every time, even if they only need the history of one column. That means they're stuck waiting for everything to process before they can get the data they want.

That’s when they realized why their queries have been taking longer and longer to run. Even with tweaks to optimize things, it still takes around 15 minutes to flatten everything, and it’s only going to get worse as the data grows.

Beyond that, the main levers you have for making queries run faster are the size of your warehouse and how much you're willing to spend on Snowflake.

FAQs

1. What file formats does Snowflake support for loading semi-structured data?

Snowflake supports several file formats for loading semi-structured data. These include:

  • JSON (classic for handling key-value pairs)

  • Avro (used in data pipelines)

  • Parquet (for analytics)

  • ORC (efficient storage for large datasets)

  • XML (even though it's messy, it works)

  • CSV (yup, plain old text files too)

2. Which data formats are supported by Snowflake when unloading semi-structured data?

When you’re unloading semi-structured data, Snowflake supports:

  • JSON

  • Parquet

  • CSV

So if you're looking to move your data out in a semi-structured format, those are your go-to options. Parquet is especially handy when you need something lightweight for analytics.

3. Does Snowflake charge a premium for storing semi-structured data?

No, Snowflake doesn’t charge extra for storing semi-structured data. The costs are based on the compressed storage size, so whether it’s JSON or Parquet, you're not paying a premium just because it’s not structured.

4. Why is Snowflake good for working with JSON and semi-structured data?

Snowflake handles JSON and semi-structured data well. Here’s why:

  • Native VARIANT type: Snowflake stores JSON as a VARIANT data type, which can handle nested and flexible schemas.

  • Built-in Functions: You get functions like FLATTEN() and JSON path expressions, making it easy to query nested data.

  • Columnar Storage: Even though it's JSON, Snowflake stores it in a way that still supports fast analytics (without slow JSON parsing).

  • Schema-less Storage: You don’t have to worry about rigid schemas.

5. What is the recommended Snowflake data type to store semi-structured data like JSON?

The recommended data type is VARIANT. It’s Snowflake’s magic box for semi-structured data. You can throw JSON, Avro, Parquet, etc., into this column, and Snowflake will handle it all.

6. How does Snowflake support semi-structured data such as JSON and Parquet?

Snowflake has a native semi-structured data architecture, which means you can:

  • Ingest data: Load JSON, Parquet, Avro, or whatever into VARIANT columns.

  • Query it: Use SQL (and JSON path expressions) to extract the exact data you need.

  • Analyze it: Snowflake's columnar storage makes querying this data fast, even with complex, nested structures.

7. How does Snowflake handle semi-structured JSON data for fast analytics?

Snowflake uses columnar storage even for semi-structured data like JSON. This means:

  • It stores JSON efficiently by breaking it into columns behind the scenes.

  • When you run queries, it doesn’t need to parse through the entire JSON structure — it can just take the parts you need.

So, even though it's semi-structured, you still get analytics speeds similar to structured data!

8. Why is JSON considered semi-structured data?

JSON is considered semi-structured because:

  • It doesn’t have a rigid schema like a relational database.

  • The structure can vary between records (you can have optional fields, nested objects, etc.).

  • It’s more flexible than CSVs or relational tables but still has some structure — it’s not a random string of text, after all!

That’s why we call it "semi-structured" — it’s structured, but loosely.

9. Can I ingest BSON or JSONB or XML directly into Snowflake?

You can ingest JSON and XML directly into Snowflake, no problem. As for BSON and JSONB, you’d need to convert those into regular JSON before loading them. Snowflake likes its JSONs pure and simple.

10. What is the difference between flatten and lateral flatten in Snowflake?

FLATTEN handles basic unnesting of arrays or objects. LATERAL FLATTEN lets you dig deeper, applying flattening row by row in more complex scenarios.