How do you set up a Synapse Serverless SQL External Table over partitioned data?

3 min read 05-10-2024

Leveraging Serverless SQL External Tables for Efficient Data Access in Azure Synapse Analytics

Introduction:

Azure Synapse Analytics is a platform for managing and querying large datasets. When the data is laid out in partitioned folders, querying it efficiently is paramount. Serverless SQL External Tables in Synapse Analytics let you query files in place in external storage such as Azure Blob Storage or Azure Data Lake Storage, without loading them into a database first. This article will guide you through setting up a Serverless SQL External Table over partitioned data in Azure Synapse Analytics, highlighting its benefits and key considerations.

Scenario:

Imagine you have a large dataset stored in Azure Blob Storage, partitioned by year and month. You need to query this data regularly, but creating a dedicated table within Synapse can be inefficient due to the data's size and changing nature. Here's how a Serverless SQL External Table comes to the rescue:

CREATE EXTERNAL TABLE dbo.MyPartitionedData (
    CustomerID INT,
    OrderDate DATE,
    OrderValue DECIMAL(10,2)
)
WITH (
    -- Wildcards match every year=/month= folder under the root path.
    -- The *.parquet suffix assumes Parquet files; adjust it for CSV.
    LOCATION = '/my-data-lake-path/year=*/month=*/*.parquet',
    -- Names of existing external data source and file format objects.
    DATA_SOURCE = [my-data-lake-storage-account],
    FILE_FORMAT = [my-file-format]
);

Understanding the Code:

  • CREATE EXTERNAL TABLE: Defines the Serverless SQL External Table.
  • dbo.MyPartitionedData: The name of the external table.
  • CustomerID, OrderDate, OrderValue: Column definitions of the external table. Names and types must match the data in the files.
  • LOCATION: Specifies the path of the data in Azure Blob Storage, relative to the data source root. The year=*/month=* wildcards match all partition folders; placeholders such as {year} are not supported.
  • DATA_SOURCE: Names an external data source object (created with CREATE EXTERNAL DATA SOURCE) that points to the Azure Storage Account containing the data. It is an identifier, not a quoted string.
  • FILE_FORMAT: Names an external file format object (created with CREATE EXTERNAL FILE FORMAT) that defines the file format of the data (e.g., Parquet, CSV).
  • Partitioning: CREATE EXTERNAL TABLE in serverless SQL has no PARTITION_SCHEME option. The table reads every file that matches LOCATION, the year and month folder values are not exposed as columns, and filters do not skip folders. When folder-level filtering matters, use a view over OPENROWSET with the filepath() function instead.
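
The external table refers to its data source and file format by name, so both objects have to exist in the same database before the table is created. A minimal sketch of those prerequisites follows, assuming Parquet files; the storage URL is a placeholder and not taken from the scenario above.

```sql
-- Run in a user database on the serverless SQL pool (not in master).

-- Placeholder URL: replace <storage-account> and <container> with your own values.
CREATE EXTERNAL DATA SOURCE [my-data-lake-storage-account]
WITH (
    LOCATION = 'https://<storage-account>.dfs.core.windows.net/<container>'
);

-- Assumes the files are Parquet; use FORMAT_TYPE = DELIMITEDTEXT for CSV.
CREATE EXTERNAL FILE FORMAT [my-file-format]
WITH (
    FORMAT_TYPE = PARQUET
);
```

The names are bracketed because they contain hyphens. Without a CREDENTIAL clause, the data source is accessed with the identity of the caller, which then needs read permission on the storage account.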

Advantages of Serverless SQL External Tables:

  • Efficient Data Access: Serverless SQL External Tables let you query the files in place without loading them into Synapse. There is no second copy of the data to store, and serverless SQL is billed by the amount of data each query processes.
  • Scalability and Elasticity: There is no infrastructure to provision or size. The serverless SQL pool allocates resources for each query automatically as your data volume grows.
  • Dynamic Data Access: New or changed files that match the table's LOCATION pattern are picked up by the next query without manual intervention, ensuring up-to-date information.
  • Reduced Maintenance: No need for manual data loading or table management, freeing up time for other tasks.

Key Considerations:

  • Data Format: Ensure your data is stored in a format compatible with your file format definition.
  • Partitioning: Proper partitioning is crucial for efficient data access. Choose a column that evenly divides the data and facilitates efficient querying.
  • Permissions: Grant appropriate access to your data source from the Synapse workspace.
  • Security: Secure your data sources appropriately to ensure data confidentiality and integrity.
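
The permissions point can be sketched as follows. This is one possible setup, assuming the workspace managed identity has been granted the Storage Blob Data Reader role on the storage account; the credential name, data source name, and user name are hypothetical, and the URL and password are placeholders.

```sql
-- A master key must exist before a database scoped credential can be created.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong-password>';

-- Hypothetical credential that makes queries use the workspace managed identity.
CREATE DATABASE SCOPED CREDENTIAL WorkspaceIdentity
WITH IDENTITY = 'Managed Identity';

-- Hypothetical data source that attaches the credential to a storage location.
CREATE EXTERNAL DATA SOURCE [my-secured-data-lake]
WITH (
    LOCATION = 'https://<storage-account>.dfs.core.windows.net/<container>',
    CREDENTIAL = WorkspaceIdentity
);

-- Allow a database user (hypothetical name) to use the credential.
GRANT REFERENCES ON DATABASE SCOPED CREDENTIAL::WorkspaceIdentity TO [some_user];
```

With this approach, individual users do not need direct access to the storage account; access is controlled through the credential and database permissions.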

Example:

SELECT COUNT(*) AS NumberOfOrders
FROM dbo.MyPartitionedData
WHERE OrderDate BETWEEN '2023-01-01' and '2023-03-31';

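This query returns the orders for January, February, and March 2023. Note, however, that OrderDate is a regular data column and the external table does not expose the year and month folder values, so serverless SQL does not skip folders here: it reads the files in every folder that matches the table's LOCATION. To restrict the scan to specific year and month folders, query through a view built on OPENROWSET that exposes the folder values with the filepath() function, and filter on those values.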
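
One way to restrict a query to specific year and month folders in serverless SQL is a view over OPENROWSET that exposes the folder values through filepath(). The following is a sketch under stated assumptions: the files are Parquet, the folder names follow the year=/month= layout from the scenario, and the view name and the OrderYear and OrderMonth columns are hypothetical.

```sql
CREATE VIEW dbo.MyPartitionedDataView
AS
SELECT
    -- filepath(n) returns the text matched by the n-th wildcard in the BULK path.
    CAST(r.filepath(1) AS INT) AS OrderYear,   -- value of the year=* folder
    CAST(r.filepath(2) AS INT) AS OrderMonth,  -- value of the month=* folder
    r.CustomerID,
    r.OrderDate,
    r.OrderValue
FROM OPENROWSET(
        BULK '/my-data-lake-path/year=*/month=*/*.parquet',
        DATA_SOURCE = 'my-data-lake-storage-account',
        FORMAT = 'PARQUET'
     )
     WITH (
        CustomerID INT,
        OrderDate DATE,
        OrderValue DECIMAL(10,2)
     ) AS r;
GO

-- Filtering on the folder-derived columns lets serverless SQL target
-- only the matching year=/month= folders.
SELECT COUNT(*) AS NumberOfOrders
FROM dbo.MyPartitionedDataView
WHERE OrderYear = 2023
  AND OrderMonth IN (1, 2, 3);
```

Casting the folder values to INT means the filter works whether the month folders are named month=1 or month=01. It is worth checking the amount of data processed by the query to confirm that only the expected folders were read.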

Conclusion:

Serverless SQL External Tables provide a powerful and efficient solution for accessing partitioned data in Azure Synapse Analytics. They offer significant benefits like reduced storage and compute costs, improved scalability, and simplified data management. By carefully planning your data storage, partitioning strategy, and security measures, you can leverage this feature to enhance your data analytics workflows and gain valuable insights from your data.
