7 Proven Techniques for Handling Changing Data Types during Semi-Structured Data Ingestion, a.k.a. Polymorphic Keys

Polymorphic keys in semi-structured data (where a single field can hold values of different types, and those types keep changing over time) are a challenge to deal with, especially when deciding how best to store them.

From schema evolution to flexible data ingestion and retrieval, handling these variations requires careful thought.

In this blog, we’ll explore 7 different approaches to treating polymorphic data, ranging from schema enforcement and dynamic typing to more advanced methods like data type promotion and continuous schema inference. Each method offers unique advantages and trade-offs, helping data engineers choose the best solution for their specific use case.

1. Add an extra column at the end of the table / materialized view for each new data type that comes in

For example, say column A is an integer initially and its data type later changes to string. We append the data type to the column name, which gives us two columns: A_int and A_string.

METHOD 1: Add new columns

Creating separate columns for each data type is a common strategy when data types can change over time, and it is one way to handle such data.

Let's take a detailed look at how this approach works, along with its pros and cons.

Scenario Breakdown: Initial Table Structure:
Consider a table data_table with a column A of type int.

A_int
5
10
15

1. Integer Changes to String

You encounter a scenario where new data coming into column A is no longer just integers but strings. You create a new column A_string.

Table Structure:

A_int | A_string
5     | NULL
10    | NULL
NULL  | "abc"

Data Handling:

  • All new data goes into A_string, as strings can hold integers by type-casting.

  • Older data in A_int remains unchanged, but the new A_int values will be NULL moving forward.

2. String Changes Back to Integer

In the next case, the incoming data for column A reverts to integers and gets stored there. New integer values will be written back to A_int, and A_string will have NULLs.

3. Integer Becomes an Array of Objects

You now receive arrays instead of single integers. You create a new column A_obj[] to handle this change.

Table Structure:

A_int | A_string | A_obj[]
5     | NULL     | NULL
10    | NULL     | NULL
NULL  | "abc"    | NULL
NULL  | NULL     | [1, 2, 3]

Data Handling:

  • Since A_obj[] can hold multiple values, the array is stored in this column.

  • Previous columns A_int and A_string stay intact.

4. Integer Becomes an Object

Now, the incoming data for column A becomes a structured object rather than a single value.

Table Structure:

A_int | A_string | A_obj[]   | A_obj
5     | NULL     | NULL      | NULL
10    | NULL     | NULL      | NULL
NULL  | "abc"    | NULL      | NULL
NULL  | NULL     | [1, 2, 3] | NULL
NULL  | NULL     | NULL      | {x: 1, y: 2}

Data Handling:

  • A_obj holds the object, while the other columns stay unchanged.
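
Putting the whole scenario together, here is a minimal SQL sketch of this approach (PostgreSQL-style syntax; the table and column names are illustrative, and A_obj_array stands in for the A_obj[] column above, since brackets aren't valid in an identifier):

-- Start with a single integer column.
CREATE TABLE data_table (
    A_int INT
);

-- When strings start arriving, add a dedicated column for them.
ALTER TABLE data_table ADD COLUMN A_string TEXT;

-- When arrays and objects start arriving, add a JSONB column for each new shape.
ALTER TABLE data_table ADD COLUMN A_obj_array JSONB;  -- e.g. [1, 2, 3]
ALTER TABLE data_table ADD COLUMN A_obj JSONB;        -- e.g. {"x": 1, "y": 2}

-- Each incoming record populates exactly one of the typed columns; the rest stay NULL.
INSERT INTO data_table (A_int)       VALUES (5), (10);
INSERT INTO data_table (A_string)    VALUES ('abc');
INSERT INTO data_table (A_obj_array) VALUES ('[1, 2, 3]'::jsonb);
INSERT INTO data_table (A_obj)       VALUES ('{"x": 1, "y": 2}'::jsonb);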

This seems like a good approach but there are some pros and cons associated with it:

Pros:

  1. Reliable ETL: The ETL process doesn't break, because you ingest the data into new columns.

  2. Backward Compatibility: You don’t lose old data. It remains in the older columns.

  3. Type Casting Flexibility: By directing all data to the string column in certain cases, you avoid type conflicts and can cast data types as needed later.

  4. Logical Separation: Each data type gets its column, making it easy to track and handle specific transformations.

Cons:

  1. Column Explosion: Every time a data type changes, you need to create a new column. Over time, the number of columns could grow out of control.

  2. Fragmented Data: A single logical field is spread across multiple columns, so certain queries have to check all of them.

  3. Handling Complex Types: As in the case of arrays or objects, the structure becomes increasingly complex. This approach may not scale well when dealing with deeply nested or highly polymorphic data.

METHOD 2: Type Promotion

This approach is known as type promotion or type widening: the data type of a column is "promoted" to a more flexible or broader type to accommodate different kinds of data while preserving compatibility with older types.

Here are examples of how type promotion could work for handling polymorphic data:

1. Integer to String

Scenario: You start with integers, but later, you receive strings.

Promotion Strategy: Promote the integer type to string, since strings can hold both numeric and textual values.

  • Initial Type: int

  • New Type: string

id | A
1  | 123
2  | "abc"
3  | "456"

Explanation: Since strings can hold numeric values (as long as they can be parsed or cast), the integers are promoted to strings, and future data can include both numbers and text. The column A is now a string type, and integers are converted automatically when necessary.

Not all databases and warehouses support in-place column promotion: you can usually cast values at query time, but you cannot always change the table schema.

Databases that allow data type changes:

  • PostgreSQL, MySQL, SQL Server
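
For engines that do support it, here is a minimal PostgreSQL-style sketch of promoting a column in place (assuming an illustrative table data_table with an integer column A):

-- Promote an integer column to text; the USING clause controls how existing values are cast.
ALTER TABLE data_table
    ALTER COLUMN A TYPE TEXT
    USING A::TEXT;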

Databases that do not allow direct data type changes:

1. Redshift

Does not allow direct column type changes once a column is created. The only way to change a data type is to create a new table or column and migrate the data.

Workaround:

  • Create a new table with the desired column types.

  • Copy the data over and drop the old table.
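
A rough SQL sketch of that workaround (illustrative names and types; the exact DDL depends on your table definition):

-- 1. Create a new table with the widened column type.
CREATE TABLE data_table_new (
    id BIGINT,
    A  VARCHAR(256)   -- previously an INT column
);

-- 2. Copy the existing data across, casting the old column.
INSERT INTO data_table_new (id, A)
SELECT id, CAST(A AS VARCHAR(256))
FROM data_table;

-- 3. Swap the tables.
DROP TABLE data_table;
ALTER TABLE data_table_new RENAME TO data_table;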

2. Snowflake

  • What is Allowed: Changing a column's data type to an equivalent type (e.g., STRING to VARCHAR) ✅

  • What is Not Allowed: Changing a column's data type to a different type (e.g., STRING to NUMBER) ❌

3. BigQuery

Only the following data type conversions are allowed in BigQuery:

  • INT64 to NUMERIC, BIGNUMERIC, FLOAT64

  • NUMERIC to BIGNUMERIC, FLOAT64

2. String to Integer

Scenario: The column starts with string values, but later, numeric data (integer) is introduced.

Promotion Strategy: This is a bit trickier since integers can’t naturally "hold" strings without losing information. You may promote the column to a string type to accommodate both cases, assuming you can perform numeric validation or casting as needed.

  • Initial Type: string

  • New Type: Still string (strings can represent integers)

id | A
1  | "abc"
2  | "456"
3  | "789"

Explanation: Even though you now have integers, strings can still store them, so no actual type change is required.

3. Integer to Array of Objects

Scenario: Initially, you store simple integers in the column, but now you need to store arrays of objects.

Promotion Strategy: Promote the type from int to an array of objects (array<json> or jsonb in databases like PostgreSQL).

Snowflake, BigQuery, and Redshift do not support this kind of in-place column type conversion, so you can use workarounds (such as adding a new column with the required data type, copying the data over, and dropping the old column) as discussed in the section above.

  • Initial Type: int

  • New Type: array<json> (or equivalent in your DBMS)

id | A
1  | 123
2  | [{"x": 1}]
3  | [{"x": 2, "y": 3}]

Explanation: When promoting the type, you convert the existing integers into JSON objects or wrap them in an array (e.g., [{"x": 123}]), so the new data fits into the broader array of objects format. This avoids creating a separate column but requires careful handling of the old and new formats.

Depending on the complexity of conversion you might have to use SQL or custom scripts to achieve above.
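
As a sketch, here is how the promotion could look in PostgreSQL, assuming the illustrative data_table(id, A) from above and wrapping each old integer as [{"x": <value>}] so it matches the new array-of-objects shape (the wrapping convention is an assumption for illustration):

-- Promote the int column to JSONB, wrapping each existing value in an array of objects.
ALTER TABLE data_table
    ALTER COLUMN A TYPE JSONB
    USING jsonb_build_array(jsonb_build_object('x', A));

-- New array-of-objects payloads can now be inserted directly.
INSERT INTO data_table (id, A)
VALUES (3, '[{"x": 2, "y": 3}]'::jsonb);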

4. Integer to Object

Scenario: You start with a simple integer but now need to store complex objects.

Promotion Strategy: Promote the column type to json or jsonb, capable of storing complex data structures.

  • Initial Type: int

  • New Type: json

id | A::jsonb
1  | 123
2  | {"field1": 456}
3  | {"field1": 789, "field2": "abc"}

Explanation: Integers are treated as simple numeric values within the json field, while new data is inserted as full-fledged objects. The flexibility of the json type allows for various structures, avoiding the need for multiple columns.
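
A similar PostgreSQL-style sketch for the object case, including how a query might branch on the stored JSON type (illustrative names):

-- Promote the int column to JSONB; old integers become plain JSON numbers.
ALTER TABLE data_table
    ALTER COLUMN A TYPE JSONB
    USING to_jsonb(A);

-- Old rows hold scalars (123) and new rows hold objects ({"field1": 456}),
-- so reads branch on the JSON type of each value.
SELECT id,
       CASE WHEN jsonb_typeof(A) = 'object' THEN A ->> 'field1' END AS field1,
       CASE WHEN jsonb_typeof(A) = 'number' THEN A #>> '{}' END     AS legacy_value
FROM data_table;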

Benefits of Type Promotion:

  1. Minimizes Schema Changes: By promoting to a broader data type, you avoid having to change the schema repeatedly or create new columns, which can reduce schema maintenance overhead.

  2. Data Compatibility: Data compatibility is retained as the new type can handle older, narrower types (e.g., integers still fit within strings or JSON objects).

  3. Flexibility: This approach is flexible in how it can handle both new and old types of data.

Downsides of Type Promotion:

  1. Query Complexity: Queries need to account for type differences, especially if numeric or text data needs parsing or casting (e.g., converting a string back to an integer).

  2. Performance Overhead: Using broader types like strings or JSON might introduce performance penalties, especially if the column was initially a more optimized type like an integer.

  3. Validation: Data validation becomes more complex. If you allow multiple data types in a single column, you need stricter data validation logic to ensure type correctness for different use cases.

  4. Breaking Changes Downstream: Downstream code might break if it assumes the incoming data is an int but now receives JSON or a string.

Methods I and II also hold when more than one column has a variable data structure.

2. Another Example Scenario

Let's say you have a column called sales that originally stores numbers (e.g., 100.50). Later, you start receiving text values (e.g., "one hundred"). If the column was defined as a numeric type initially, the new text values would be stored as NULL.

However, by converting the column to a string type later, you can retrieve both old numeric and new text values without losing data.

1. Probable Upcasting Order:
When detecting new data types, the system should upcast the column to a type that can accommodate both the old and new values. This order of upcasting can be used:

Upcasting Order
Boolean → Date → Timestamp → Bigint → Decimal → Double → String

For instance, if the sales column starts as a number (double), it can later be changed to a string if text data is detected.

2. Preserving Old Data:
The old data should remain in its original type. For example, if the column started as a number and later changed to a string, the numeric data stays untouched. For instance:

Sales (Before) | Sales (After)
100.50         | "one hundred"
250.75         | "Two hundred fifty"

The original 100.50 remains available for querying, even if new text data is stored.

Combining Old and New Data:
To merge both numeric and text data, use a function like COALESCE, which will return the first non-NULL value:

COALESCE(sales_string, CAST(sales AS VARCHAR)) AS sales_combined

This ensures that both old and new data are available in a unified format.

You can later merge these columns if needed using transformations.
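
A slightly fuller sketch of that transformation, assuming the numeric values stayed in sales and the new text values landed in a sales_string column of a hypothetical sales_table; a view keeps downstream queries pointed at one stable name:

-- Expose both generations of the column behind a single view.
CREATE VIEW sales_unified AS
SELECT order_id,
       sales,                                                 -- original numeric column
       sales_string,                                          -- newer text column
       COALESCE(sales_string, CAST(sales AS VARCHAR)) AS sales_combined
FROM sales_table;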

Handling Complex Data Types [like arrays and timestamps]

If the data type changes from a simple value to a more complex structure, such as an array, check whether the destination table supports integer or string arrays. If it does, use them; if not, JSON is the way to go.

Edge Cases

  • Double vs. Long: If a field contains both double and long values, keep only the double type since it can accommodate long values.

  • Timestamp vs. Long: If a field first appears as a timestamp and later receives long values, create two fields:

Column Name     | Data Type
event_time      | Timestamp
event_time_long | Long (milliseconds since epoch)

Both columns remain populated (as long can be converted to timestamp using custom functions in many warehouses), and the original timestamp is preserved alongside the long values.

For example: Redshift, Snowflake, and Postgres provide TO_TIMESTAMP(); BigQuery provides TIMESTAMP_MILLIS(); and MySQL, Databricks, and Hive provide FROM_UNIXTIME().
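
For instance, a hedged sketch of backfilling the timestamp column from epoch milliseconds (PostgreSQL-style syntax on a hypothetical events table; in BigQuery you would use TIMESTAMP_MILLIS(event_time_long) instead):

-- Derive event_time from epoch milliseconds wherever only the long value is present.
UPDATE events
SET event_time = to_timestamp(event_time_long / 1000.0)
WHERE event_time IS NULL
  AND event_time_long IS NOT NULL;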

Routing Values Based on Type

When data comes in with varying types for the same field, route it to the widest compatible type. Here’s an example of a routing strategy:

Incoming Types                       | Routed to Field
long, date, string, date[], string[] | JSON
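
A PostgreSQL-flavoured sketch of such routing at load time, inspecting the JSON type of each incoming value (staging_raw, payload, and the target columns are illustrative):

-- Route each raw value into the widest compatible column based on its JSON type.
INSERT INTO target_table (id, value_number, value_string, value_json)
SELECT id,
       CASE WHEN jsonb_typeof(payload -> 'field') = 'number'
            THEN (payload ->> 'field')::NUMERIC END,
       CASE WHEN jsonb_typeof(payload -> 'field') = 'string'
            THEN payload ->> 'field' END,
       CASE WHEN jsonb_typeof(payload -> 'field') IN ('array', 'object')
            THEN payload -> 'field' END
FROM staging_raw;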

3. Use a dead letter queue if there is strict type checking pre-ETL

In this method, you use a Dead Letter Queue (DLQ) for handling new incoming data with changed data types that don't match the existing schema.

The DLQ temporarily stores invalid or mismatched records before creating a new table (or materialized view) with an updated schema to accommodate these changes.

Once the new schema is defined, the invalid records are reprocessed and inserted into the new structure.

Scenario Setup

Let's assume we have an initial table customer_data with the following schema:

Initial Table Schema:

id (int),
name (string),
age (int),
purchase_amount (float),
address (object) { "city": string, "pincode": int }

Initial Sample Data:

id | name  | age | purchase_amount | city | pincode
1  | Alice | 25  | 123.45          | NY   | 10001
2  | Bob   | 30  | 345.67          | LA   | 90001

Scenario: Complex Data Type Changes

Let’s consider a scenario where two columns (age and purchase_amount) undergo changes in data type:

  1. age changes from int to string (perhaps capturing data such as “Unknown”).

  2. purchase_amount changes from float to array<float> (indicating multiple purchases).

The system has strict type checking pre-ETL, and the incoming records that don’t match the schema will be sent to the DLQ for processing. Once a new schema is defined, the DLQ records will be reprocessed and inserted into the new structure.

Incoming Data with Schema Mismatch

The following data has mismatches with the existing schema:

id | name    | age     | purchase_amount_1 | purchase_amount_2 | purchase_amount_3 | city    | pincode
3  | Charlie | Unknown | 123.45            | 67.89             | NULL              | Chicago | 60601
4  | Eve     | 28      | 150.0             | 200.5             | 50.75             | Houston | 77001

Here:

  • Charlie has "Unknown" as the age, which is now a string.

  • Charlie and Eve have purchase_amount values as arrays of floats, instead of a single float.

These records will be rejected and moved to the DLQ.

DLQ Handling

When these records hit the DLQ, a new schema needs to be created that accommodates both the string type for the age column and the array<float> type for the purchase_amount column.

New Table Schema (Version 2)

Once you realize the data types have changed, you create a new table (or materialized view) customer_data_v2 with the following updated schema:

New Table Schema:

id (int),
name (string),
age (string),                    -- Promoted from int to string
purchase_amount (array<float>),  -- Changed from float to array<float>
address (object) { "city": string, "pincode": int }

Final Output: Reprocessed Data in New Table

The data from the DLQ will be reprocessed and inserted into the new table customer_data_v2 as follows:

id | name    | age       | purchase_amount       | address
1  | Alice   | 25        | [123.45]              | {"city": "NY", "pincode": 10001}
2  | Bob     | 30        | [345.67]              | {"city": "LA", "pincode": 90001}
3  | Charlie | "Unknown" | [123.45, 67.89]       | {"city": "Chicago", "pincode": 60601}
4  | Eve     | 28        | [150.0, 200.5, 50.75] | {"city": "Houston", "pincode": 77001}

Key Steps in the DLQ Approach:

  1. Invalid Records Stored in DLQ: Incoming records with mismatched data types are temporarily stored in the DLQ.

  2. Create a New Schema: The schema is updated (e.g., age promoted from int to string, purchase_amount updated from float to array<float>).

  3. Reprocessing: The DLQ records are reprocessed into the new table with the updated schema.

  4. Inserting New Data: Future data follows the new schema and is inserted directly into the new table.
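
A rough SQL sketch of steps 2 and 3, assuming the DLQ payloads land in a staging table dlq_customer_data with a raw JSONB column (PostgreSQL-style; the complex fields are kept as JSONB here for simplicity):

-- Illustrative v2 table.
CREATE TABLE customer_data_v2 (
    id              INT,
    name            TEXT,
    age             TEXT,     -- promoted from int to string
    purchase_amount JSONB,    -- array of floats, stored as JSONB in this sketch
    address         JSONB
);

-- Reprocess DLQ payloads, normalizing purchase_amount to an array.
INSERT INTO customer_data_v2 (id, name, age, purchase_amount, address)
SELECT (payload ->> 'id')::INT,
       payload ->> 'name',
       payload ->> 'age',
       CASE WHEN jsonb_typeof(payload -> 'purchase_amount') = 'array'
            THEN payload -> 'purchase_amount'
            ELSE jsonb_build_array(payload -> 'purchase_amount')
       END,
       payload -> 'address'
FROM dlq_customer_data;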

Advantages of the DLQ Approach:

  • Data Integrity: No data is lost, even if it doesn’t match the original schema. It is temporarily stored in the DLQ for later reprocessing.

  • Flexibility: The approach can handle multiple types of changes, including more complex changes like converting a field from float to array<float>.

  • Seamless Transition: Existing tables remain unchanged, and new tables can evolve with the schema.

Disadvantages of the DLQ Approach:

  • Complexity: Managing DLQs, reprocessing, and creating new tables adds complexity to the data pipeline.

  • Redundancy: You may end up with multiple tables (or materialized views) that represent different versions of the schema, which can increase maintenance overhead.

  • Latency: The data in the DLQ isn’t immediately available in the main table. There’s a delay before the new schema is applied and data is reprocessed.

4. Handling Dynamic Nested Data and its Array Explosions

For illustration purposes, we have taken an example JSON with two levels of nesting. It contains a field that can be an integer, a string, an array of integers or strings, or an array of objects.

We want to "explode" the nested data into two different tables, handling both array explosion and polymorphic data changes (the change in data type).

Consider the field size, which transitions across records:

  1. An integer in one record.

  2. A string in another.

  3. An array in another.

  4. An array of objects in another.

Here’s the JSON structure:

Record 1
{
  "_id": "12345",
  "name": "Sample Device",
  "attributes": {
    "size": 5,
    "versions": [
      "v1",
      "v2"
    ]
  }
}

Record 2
  {
    "_id": "67890",
    "name": "Another Device",
    "attributes": {
      "size": "Large",
      "versions": "v1"
    }
  }

Record 3
  {
    "_id": "11223",
    "name": "Third Device",
    "attributes": {
      "size": [5, 10, 15],
      "versions": ["v1", "v2", "v3"]
    }
  }

Record 4
  {
    "_id": "44556",
    "name": "Fourth Device",
    "attributes": {
      "size": [
        {"dimension": "width", "value": 10},
        {"dimension": "height", "value": 15}
      ],
      "versions": "v1"
    }
  }

Step 2: Exploded Table Structure

Now, let’s explode each array into its own table, while still supporting polymorphic data.

1. Main Table: This table will store the non-polymorphic fields such as _id, name, and will maintain a high-level view of the versions and size fields as JSON (before explosion).

_id   | name           | attribute_versions | attribute_size
12345 | Sample Device  | ["v1", "v2"]       | 5
67890 | Another Device | "v1"               | "Large"
11223 | Third Device   | ["v1", "v2", "v3"] | [5, 10, 15]
44556 | Fourth Device  | "v1"               | [{"dimension": "width", "value": 10}, {"dimension": "height", "value": 15}]

A. attribute_size_array Table: [Dynamic Table]
This table stores the size field when it is an array, whether of integers or of objects. Each array element is exploded into its own row.

_id   | attribute_size_array | attribute_size_dimension | attribute_size_value
11223 | 5                    | NULL                     | NULL
11223 | 10                   | NULL                     | NULL
11223 | 15                   | NULL                     | NULL
44556 | NULL                 | width                    | 10
44556 | NULL                 | height                   | 15

B. Version Table: [Dynamic Table]

We also need to explode the versions field into its own table, as it’s an array in multiple records.

_id   | attribute_versions_str
12345 | v1
12345 | v2
11223 | v1
11223 | v2
11223 | v3
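
A PostgreSQL-flavoured sketch of how the size explosion could be derived from the raw documents, branching on the JSON type of each array element (raw_devices and its doc column are illustrative):

-- Explode attributes.size into one row per element, covering both
-- arrays of integers and arrays of objects.
SELECT d.doc ->> '_id'                           AS _id,
       CASE WHEN jsonb_typeof(elem) = 'number'
            THEN (elem #>> '{}')::INT END        AS attribute_size_array,
       CASE WHEN jsonb_typeof(elem) = 'object'
            THEN elem ->> 'dimension' END        AS attribute_size_dimension,
       CASE WHEN jsonb_typeof(elem) = 'object'
            THEN (elem ->> 'value')::INT END     AS attribute_size_value
FROM (
    -- Only documents where size is actually an array get exploded.
    SELECT doc
    FROM raw_devices
    WHERE jsonb_typeof(doc #> '{attributes,size}') = 'array'
) d,
LATERAL jsonb_array_elements(d.doc #> '{attributes,size}') AS elem;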

Now that we have talked about some of these “change data management” strategies, let's look at how existing tools handle them, along with the pros and cons of each approach.

5. How Airbyte Handles Polymorphic Data

Airbyte offers flexibility in handling varying data types across records within the same source, such as MongoDB collections. Depending on the configuration, schema enforcement can be enabled or disabled to handle these changes.

Handling Polymorphic Data (Data with Changing Types)

Polymorphic data can be a bit tricky. In some rows, a field might be a string, and in others, it could be an integer or even an array. Airbyte gives you two main options here:

Schema Enforced Mode: Here, Airbyte enforces strict consistency. If it detects a field type changing across different rows, the sync may fail unless corrected. So, if you need structure and control, this option ensures everything is uniform.

Schema-less Mode: This mode offers more flexibility and allows data fields to vary in type across different rows. For example, if the field address is a string in one document and an object in another, Airbyte can still sync all the data without errors.

  • Schema Enforcement (When Enabled):

    • When schema enforcement is enabled, Airbyte samples documents (default is 10,000 records) and infers a schema from them. If data types change (e.g., a field is an int in one document and a string in another), this can cause sync errors in destinations that require strict schema consistency, like data warehouses.

    • Airbyte will attempt to sync data that matches the enforced schema. Any rows with mismatched types (e.g., new types not matching the original schema) may be rejected or cause errors.

  • Example:

    • Field xyz is an int in some records and a string in others.

      • If the schema was enforced as int, any records with string would cause an error.

To test this out, we inserted the Method 4 data into MongoDB (source) and connected it to MySQL as a destination via Airbyte (with strict schema enabled).

Then, we added the following 2 rows into MongoDB and re-synced after a schema refresh.

New Row 1
{
    "_id": "445561",
    "name": "Fourth Device",
    "attributes": 2
}
New Row 2
{
    "_id": "445562",
    "name": "Fourth Device",
    "attributes": "two"
}

Here’s the sync configuration.

And the table after first full sync:

Note that we changed attributes from type object (which was initially inferred by Airbyte) to int, and then to string. We got NULL values as shown below.

The next logical step would be to perform schema refresh.

After schema refresh, it detected the change and made the object type to string type.

So we inserted some new data.

New Row 3
{
    "_id": "445563",
    "name": "sixth Device",
    "attributes": "6"
}

Now to remove the null values, you need to re-sync.

In the above scenario, we should have received an error message along the lines of “cannot insert INT value into JSON type” (or a similar error for inserting a string into a JSON field), but we got NULL instead. This looks like a bug on Airbyte's side.

Next, we took the following documents and inserted them one by one:

  • Document 1: { "_id": "1", "xyz": 123 }

  • Document 2: { "_id": "2", "xyz": [{"name": "some name", "age": 25}, {"name": "other name", "age": 21}] }

  • Document 3: { "_id": "3", "xyz": "string" }

We get an error from Airbyte after inserting Document 2, as the schema has now changed from int to JSON (or string).

Airbyte retries multiple times, each with a longer backoff, but eventually fails. To resolve this, you refresh the schema and re-sync.

  • Schema Enforcement (Disabled):

    • When schema enforcement is turned off, Airbyte reads data in schema-less mode, allowing different data types to coexist for the same field across documents.

    • In this case, Airbyte stores each record in a JSON-like structure where fields may have different data types depending on the document.

    • For example, the field xyz might be an int in one document and a string in another without causing sync errors.

Airbyte would sync both documents and store the data in the data field.

An example of Data ingested in schema-less mode:

{
  "_id": "1",
  "data": {
    "address": "123 Main St"
  }
}

{
  "_id": "2",
  "data": {
    "address": ["123 Main St", "456 Oak St"]
  }
}

{
  "_id": "3",
  "data": {
    "address": {
      "line1": "123 Main St",
      "city": "Mumbai"
    }
  }
}

If schema enforcement is enabled, this inconsistency might trigger a sync error. But in schema-less mode, all these variations are accommodated in a JSON type without any issue.

Schema Change Propagation Options

Airbyte provides several configurable options to manage schema propagation:

  1. Propagate field changes only: Only column changes are propagated. New or removed streams are ignored.

  2. Propagate all field and stream changes: Both field and stream changes (additions, deletions, or changes in columns) are propagated automatically.

  3. Approve all changes manually: You can manually approve schema changes before they are propagated to the destination.

  4. Stop future syncs: Connections will be automatically paused if schema changes are detected, and you will need to review them manually.

Schema-less Sources and Destinations

For sources like MongoDB that don’t always have a fixed structure, Airbyte allows you to work in schema-less mode.

This mode reads every document as a JSON object, so fields that differ from document to document are still captured. This is very useful for handling unstructured data or data that changes frequently.

In this setup, Airbyte generates a record with an _id and a data field containing all the fields present in the document. It’s a flexible way to deal with datasets where structure varies.

Automatic Schema Refresh

In Airbyte, schema changes are checked regularly. For Cloud users, schema changes are automatically checked every 15 minutes before syncing.

For self-managed users, Airbyte checks every 24 hours. You can also manually refresh the schema to ensure the destination stays updated with any changes in the source.

Update: You'll get data type errors in downstream models where previously-string columns are now JSON. In BigQuery, nested JSON values originating from API sources were previously delivered as type string; they are now delivered as type JSON. Also, object and array properties are now stored by Airbyte as JSON columns in BigQuery.

6. How Hevo Data Handles Polymorphic Data

Hevo Data offers a slightly different solution for managing polymorphic data in data pipelines.

Below is an overview of how Hevo handles this aspect, including real-world examples for better understanding.

1. Data Type Compatibility

For a source field value to be loaded into a destination, according to Hevo, the data type must be compatible with the destination column. When there’s a mismatch, it automatically attempts to convert the data type of the incoming source column to store it in the destination column.

  • Example: Consider a financial application where a source system records transaction IDs as UUIDs. Hevo will convert these UUIDs to varchar for compatibility with data warehouses such as Snowflake, Google BigQuery, and Amazon Redshift, which do not support UUID as a native data type.

2. Handling Arrays and Custom Data Types

  • Arrays and Multidimensional Arrays: When dealing with customer purchase data, if a product has multiple attributes (like size, color, and variants), these attributes might be represented as arrays. During ingestion, these arrays are converted and loaded as strings to ensure compatibility.

  • Custom Data Types: For example, if a PostgreSQL source system has a custom data type for a product's rating (like a structured data type containing multiple fields), Hevo ignores this during ingestion. Users must contact Hevo Support for assistance with such custom types.

3. Approaches for Incompatible Data Types

When the source field's data type is incompatible with the destination, Hevo offers three main approaches:

  • Convert to String

  • Implicit Conversion: In a scenario where a user uploads sensor data, the first recorded temperature value is 22.5 (as a string), and the subsequent value is 25 (as an integer). The integer will be converted and stored as a string in the destination.

  • Data Type Promotion: For example, in an e-commerce platform, a field initially recorded as an integer (representing product quantity) might later receive a decimal value (e.g., 5.99 for a price).

  • If AutoMapping is enabled, the destination column’s data type will be promoted from int to decimal to accommodate this change, else it will fail.

  • If the column type was INT with integer data and the incoming values are now STRING or JSON, Hevo raises an error code:

Hevo showing error when the INT column started to receive string data OR an array of objects with auto-mapping enabled.

"CODE_DATATYPE_MISMATCH": Data type mismatch for fields. Please correct the field value via a transformation or update the column type in Destination.

For the STRING-to-INT case (integer values arriving in a STRING column), Hevo inserts the INT values into the STRING column without any errors.

We inserted the following two documents to check the INT-to-string and string-to-INT ingestion behavior:

  • { "_id": "1", "xyz": 123 }

  • { "_id": "3", "xyz": "two" }

Sample Data in Mongo

4. Schema Mapper Features

Hevo’s Schema Mapper provides automated and manual options for mapping source event types to destination tables:

  • Auto Mapping: Imagine a logistics application where shipment tracking events are ingested. If a new field for delivery_date is added to the source event, Auto Mapping will automatically create a corresponding column in the destination database without user intervention.

  • Manual Mapping: If a business wishes to track customer feedback and decides to map the feedback field of a source event type to a customer_reviews table in the destination, the user can manually configure this mapping at any time.

Schema Mapping Scenarios

Data Scenario                 | Auto-Mapping On                                                                                                     | Auto-Mapping Off
New fields added              | Automatically creates columns in the destination (e.g., a new field discount_code in an order event).              | Requires manual mapping of new columns.
Fields deleted                | No action; Hevo stops loading data for those columns (e.g., a shipping_method field removed from the order event). | No action; Hevo stops loading data for those columns.
Changed to compatible types   | Automatically accommodates type changes (e.g., a quantity field changing from int to decimal).                     | Sidelined (paused) until mappings are resolved.
Changed to incompatible types | Sidelined until mappings are resolved (e.g., a price field changing from decimal to a non-compatible string).      | Sidelined until mappings are resolved.

Here, “sidelined” means the sync stops and manual intervention is required from the user side.

Hevo only reflects changes in the destination when the next event containing the new field is ingested. For instance, if a status field is created in a shipment event, it will be reflected in the destination upon ingestion of a shipment event containing this field.

5. Handling Schema Changes

  • Field Dropped: If a field like promo_code is dropped from an event type, the pipeline continues running, but in subsequent runs, null values will be loaded into the promo_code field in the destination. Hevo updates the metadata column __hevo__marked_deleted to True for identification.

  • Column Size Issues: Suppose a retail application is ingesting customer feedback where the maximum feedback length is defined as 50 characters. If an incoming feedback string exceeds this length, such as “Excellent service! Will definitely recommend this store to friends and family.”, Hevo will mark this event as Failed until the column size is adjusted.

  • Destination Schema Management: If a user decides to delete the customer_orders table in the destination, Hevo automatically detects this change and moves events meant for the deleted table to the Failed Events queue in subsequent runs.

7. How Fivetran Handles Changing Column Schemas

Fivetran automatically manages schema changes, data type conversions, and data syncing between your source and destination systems.

Fivetran replicates the exact schema from MongoDB to the destination by matching MongoDB data types with supported Fivetran types. If a particular data type from MongoDB is unsupported, Fivetran automatically converts it to a compatible Java data type to ensure the data can still be synced without loss.

Fivetran handles incoming data types in two ways:

  1. Explicit data types: These are defined by the source, and Fivetran uses them directly without any modification. For example, Salesforce provides explicit data types for its fields, which are directly mapped to the destination.

  2. Implicit data types: When the source doesn't specify a data type (like in CSV files), Fivetran analyzes the data and infers the most appropriate type for the destination.

Fivetran uses subtyping to minimize data size by selecting the smallest possible data type (e.g., using SHORT instead of INT) without losing precision. If a data type in the source differs from the destination, supertyping ensures that Fivetran selects a larger, compatible data type to store the data accurately.

For example, if the source column "count" is changed to DOUBLE, Fivetran adjusts the destination column to DOUBLE during the next sync.

1. Automatic Schema Change Management

It detects schema changes such as new tables, columns, or data type modifications and reflects them in the destination.

Schema Change Handling Options [in the UI]:

Option        | Description
Allow All     | Syncs new schemas, tables, and columns from the source to the destination.
Allow Columns | Only syncs new columns for existing tables, but blocks new schemas and configurable tables.
Block All     | Prevents syncing of new schemas, configurable tables, and new columns.

Once a schema change occurs, Fivetran triggers a full table re-sync to ensure the destination matches the source schema.

2. Data Type Changes

When data types change in the source, Fivetran manages this by converting column data types to the most specific type that can handle both the old and new data without loss.

  • If a column changes from DOUBLE to BOOLEAN, Fivetran converts it to STRING to store both number and boolean values.

  • If a column changes data types, Fivetran creates a new temporary column, transfers the data, drops the old column, and renames the new column to match the old one.

Data Type Change Process:

Action                     | Description
1. Create Temporary Column | A new column is created with a temporary name when a data type changes.
2. Data Transfer           | Data from the old column is copied into the new column, cast to a data type that accepts both old and new data types.
3. Column Replacement      | After copying, the old column is dropped, and the new column is renamed to match the original.
4. Full Table Re-sync      | A full table re-sync is triggered after a column data type change.

Note: They encourage people to use VIEWS.
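
For context, the same process expressed as plain SQL; this is an illustrative sketch of the steps described above on a hypothetical orders table, not Fivetran's actual implementation:

-- 1. Create a temporary column wide enough for both the old and new values.
ALTER TABLE orders ADD COLUMN order_count_tmp DOUBLE PRECISION;

-- 2. Copy the old data into it, casting as needed.
UPDATE orders SET order_count_tmp = CAST(order_count AS DOUBLE PRECISION);

-- 3. Drop the old column and rename the temporary one into its place.
ALTER TABLE orders DROP COLUMN order_count;
ALTER TABLE orders RENAME COLUMN order_count_tmp TO order_count;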

3. Polymorphic Data Handling (MongoDB and JSON)

For sources like MongoDB or JSON files, which can have varying schemas, Fivetran promotes first-level fields (simple types) to columns in the destination.

However, complex data types such as arrays or nested JSON are mapped directly to JSON type columns without unpacking (flattening / normalizing, hence low array explosion support) them.

If the first level type field’s data type changes (from int to JSON), the same process as described in step 2 is performed.

Level              | Handling
First-level fields | Promoted to columns in the destination.
Complex data types | Complex types like arrays or nested JSON remain as JSON types in the destination. No automatic unpacking of nested data occurs.

4. Normalization and Type Inference

Fivetran normalizes the data by breaking down larger tables into smaller, more structured tables based on relationships, which reduces data redundancy. It also automatically infers data types if they are not explicitly defined.

Source: Fivetran

Normalization Process:

Step              | Description
Data Organization | Data is divided into smaller tables and columns based on defined relationships to reduce redundancy.
Type Inference    | Fivetran infers data types from the source if they are not explicitly specified, especially for file types like CSV that lack clear data type definitions.

Connector schema changed

For this stage, Fivetran generates the following log events:

Event name           | Description
change_schema_config | Schema configuration updated

The default Schema Change Handling setting in Fivetran is ALLOW_ALL, meaning all new columns and tables are synced automatically. To change this, you can modify the strategy via an API request. For example, to block all new schema changes, you can send a PATCH request like this:

PATCH https://api.fivetran.com/v1/connectors/{connector_id}/schemas
{
  "schema_change_handling": "BLOCK_ALL"
}

A successful response would look like:
{
  "code": "Success",
  "data": {
    "schema_change_handling": "BLOCK_ALL"
  }
}

If set to ALLOW_ALL, Fivetran will sync all newly detected columns and tables, which may increase sync time if large tables are added. You can also use the Fivetran Platform Connector to check if new columns or tables have been included in your sync.

I'd love to hear your thoughts on this, so feel free to reach out to me on LinkedIn or respond in the comments below.