stevejohansen
Databricks Employee

Introduction

Unity Catalog (UC) is the foundation for all governance and management of data objects in the Databricks Data Intelligence Platform. Since its launch several years ago, Unity Catalog has become the best way to experience Azure Databricks.

For most of the time Databricks has existed, the primary method of managing data objects such as tables was the built-in Hive Metastore (HMS) attached to each Azure Databricks workspace, and UC was not enabled by default on new workspaces. This introduced complexity when onboarding new workspaces and required retrospective configuration in the Databricks Account Console. To ease the path to getting started with Unity Catalog, Azure Databricks now allows the automatic assignment of a workspace catalog for each new workspace at provisioning time.

This article presents an overview of exactly what happens when a workspace catalog is automatically provisioned for a new Azure Databricks workspace. It assumes the reader understands what Unity Catalog is and how to set it up on Azure, along with how its securables (data objects) and permissions are managed.

Part two of this article will show the differences between the Azure implementation of UC automatic assignment and the AWS implementation.

What is a Workspace Catalog?

When a new Azure Databricks workspace is automatically enabled for Unity Catalog, an initial catalog called a workspace catalog is created. This catalog lets workspace users get started with UC easily by granting some initial permissions to both the Workspace Administrators and Workspace Users.

uc-by-default-securables-azure.png

The workspace catalog has the following properties:

  • Its name matches the workspace name
  • It is owned by a system-owned group called _workspace_admins_${workspace_name}_${workspace_id}
  • Its storage root is located in the managed Azure storage account for the workspace (sometimes called the DBFS root), in a dedicated container called unity-catalog-storage and a folder named after the workspace_id
  • A system-owned group called _workspace_users_${workspace_name}_${workspace_id} has USE_CATALOG rights on the workspace catalog. This group also has enough rights to create objects in the default schema of the workspace catalog.

The workspace catalog is made up of three Unity Catalog securables:

  • Credential: this adds the unity-catalog-access-connector in the managed resource group to Unity Catalog. The name of this credential is the workspace name
  • External location: this adds the unity-catalog-storage container on the managed storage account as a valid path in Unity Catalog. The name of this external location is also the workspace name. The path is abfss://unity-catalog-storage@${managed_storage_account}.dfs.core.windows.net/${workspace_id}
  • Catalog: this is the workspace catalog, whose storage root points to the unity-catalog-storage container on the external location. The name of this catalog is also the workspace name

All three of these UC securables are bound to the workspace and are not, by default, available to any other workspace sharing the metastore.
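The naming conventions above can be summarised in a short Python sketch. It is only an illustration of the patterns described in this article, not an official API: the function and class names are hypothetical, and it takes the statement that the catalog name matches the workspace name at face value, so any character normalisation the service may apply is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkspaceCatalogNames:
    """Names derived from the conventions described in this article."""
    catalog: str
    credential: str
    external_location: str
    admins_group: str
    users_group: str
    storage_root: str


def workspace_catalog_names(
    workspace_name: str, workspace_id: str, managed_storage_account: str
) -> WorkspaceCatalogNames:
    """Build the expected names for an auto-provisioned workspace catalog.

    Assumption: all three securables are named exactly after the workspace.
    """
    storage_root = (
        f"abfss://unity-catalog-storage@{managed_storage_account}"
        f".dfs.core.windows.net/{workspace_id}"
    )
    return WorkspaceCatalogNames(
        catalog=workspace_name,
        credential=workspace_name,
        external_location=workspace_name,
        admins_group=f"_workspace_admins_{workspace_name}_{workspace_id}",
        users_group=f"_workspace_users_{workspace_name}_{workspace_id}",
        storage_root=storage_root,
    )
```

For example, calling workspace_catalog_names("dev_ws", "1234567890", "dbstorageabc") (all three values are made up) yields a catalog, credential and external location that are each named dev_ws.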

How Automatic Workspace Assignment Works

All new Azure Databricks Accounts created after 9th November 2023 are enabled for Automatic Workspace Assignment, which means that when a new workspace is created it will have a workspace catalog provisioned for it. The process of enabling older Databricks Accounts is ongoing at the time of writing. Organisations that have not had Automatic Workspace Assignment enabled on their Account can request to opt in by contacting their Databricks Account team.

uc-by-default-flow-azure.png

When you create a workspace in a region where Automatic Workspace Assignment is enabled on the Account but no metastore exists, a metastore will be created for you. The properties of this metastore are:

  • The metastore will be called metastore_azure_${cloud_region}
  • The metastore will have no metastore owner (it will show System user)
  • The metastore will be created without a storage root location
  • Delta sharing will be disabled
  • Automatic Workspace Assignment will be enabled

If required, a Metastore Owner can be assigned by an Account Administrator.
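As a rough illustration, the properties of the automatically created metastore can be written down as a small Python record. This is a simplified, hypothetical model for reasoning about the defaults listed above, not the shape of any Databricks API response. The region string format (for example "uksouth") is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MetastoreSettings:
    """A hypothetical, simplified view of metastore settings."""
    name: str
    owner: Optional[str]         # None stands in for "System user"
    storage_root: Optional[str]  # None means no metastore-level storage root
    delta_sharing_enabled: bool
    automatic_workspace_assignment: bool


def expected_auto_created_metastore(cloud_region: str) -> MetastoreSettings:
    """Return the defaults this article lists for an auto-created metastore."""
    return MetastoreSettings(
        name=f"metastore_azure_{cloud_region}",
        owner=None,
        storage_root=None,
        delta_sharing_enabled=False,
        automatic_workspace_assignment=True,
    )
```

Because the metastore has no storage root of its own, managed data for the workspace catalog is stored under the catalog's own storage root instead.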

uc-by-default-azure-no-metastore.png

To automatically enable all new workspaces in a region for Unity Catalog on an existing metastore in that region, check the checkbox under Workspace assignment in the metastore settings, found in the Catalog section of the Account Console.

When a metastore is assigned to a workspace, a default catalog name is set for all users of that workspace. If the workspace is created via the UI and automatically enabled for UC, the default catalog will be the workspace catalog. If the workspace is created via an API (including using Terraform or an SDK), the default catalog will be hive_metastore.
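The default catalog rule can be expressed as a small Python function. This is a sketch of the behaviour described in this article for workspaces that are automatically enabled for UC. The enum and function names are hypothetical and do not correspond to any Databricks API.

```python
from enum import Enum

HIVE_METASTORE = "hive_metastore"


class CreatedVia(Enum):
    """How the workspace was created."""
    UI = "ui"
    API = "api"  # includes Terraform and the SDKs


def default_catalog_name(workspace_name: str, created_via: CreatedVia) -> str:
    """Return the default catalog for an auto-enabled workspace.

    UI-created workspaces default to the workspace catalog, which is named
    after the workspace; API-created workspaces default to hive_metastore.
    """
    if created_via is CreatedVia.UI:
        return workspace_name
    return HIVE_METASTORE
```

This is worth keeping in mind when the same notebook runs in workspaces created by different methods: an unqualified table name resolves against a different catalog in each, so fully qualified three-level names avoid the ambiguity.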

Azure infrastructure deployed during automatic enablement

Several Azure resources must exist before objects can be physically stored in the workspace managed storage account:

  • A storage container in the managed storage account called unity-catalog-storage
  • An Access Connector (Managed Identity) with the Storage Blob Data Contributor role assignment on the managed storage account

In the Azure Portal you can go to the managed resource group attached to the Azure Databricks Workspace and see the Access Connector called unity-catalog-access-connector.

uc-by-default-azure-infra.png

System-owned groups and permissions

Several system-owned groups are provisioned with the workspace and granted enough permissions to manage the workspace catalog and the other securable objects created with the workspace. These groups do not appear in most surfaces in the Workspace UI, Account Console or APIs and cannot be used to grant Unity Catalog privileges on other securables. Their membership is kept in sync with all the users who have been pushed to the workspace with either the ADMIN or USER role using Identity Federation.

 

| Group | Name | Unity Catalog Grants |
|---|---|---|
| Workspace Admin | _workspace_admins_${workspace_name}_${workspace_id} | OWNER on credential, external location and workspace catalog in addition to the metastore level rights listed in the next section |
| Workspace Users | _workspace_users_${workspace_name}_${workspace_id} | Usage (USE_CATALOG) rights on workspace catalog and usage rights on default schema (see below) |

The following shows the grants on the default schema.

uc-by-default-users-default-schema.png
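The grants described above can be inspected with SHOW GRANTS statements in Databricks SQL. The following Python sketch only builds the statement text so it can be pasted into a SQL editor or passed to whatever SQL client you use. The helper names are hypothetical, and the catalog name is assumed to equal the workspace name.

```python
from typing import List


def quote_identifier(name: str) -> str:
    """Wrap an identifier in backticks, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def show_grants_statements(workspace_name: str) -> List[str]:
    """Build SHOW GRANTS statements for the workspace catalog and its
    default schema."""
    catalog = quote_identifier(workspace_name)
    return [
        f"SHOW GRANTS ON CATALOG {catalog}",
        f"SHOW GRANTS ON SCHEMA {catalog}.`default`",
    ]
```

Note that SHOW GRANTS lists privileges rather than ownership, so the OWNER relationship of the Workspace Admins group has to be checked separately, for example in Catalog Explorer.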

Metastore-level grants for Auto-Enabled Workspace Administrators

To create all these Unity Catalog securables, the Workspace Admins system-owned group needs some grants on the Unity Catalog metastore. (The screenshot below also shows that this workspace was provisioned via Terraform, so it has hive_metastore as its default catalog.)

uc-by-default-azure-metastore-grants.png

These grants do not include ownership of the metastore, meaning a workspace admin cannot delete metastore-level UC securables that were created or are owned by other identities, including the workspace catalog and securables of other workspaces that were created with UC by default.

These grants also allow the Workspace Administrators to create other catalogs and related underlying securables such as credentials and external locations. By default, any securable is owned by the individual identity that created it, and the owner can transfer ownership to a group.
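Transferring ownership from an individual to a group can be done with ALTER ... OWNER TO statements in Databricks SQL. The sketch below builds those statements in Python. The helper names, the securable names and the group name are all hypothetical, and only the three securable types discussed in this article are accepted.

```python
from typing import List, Tuple

# Securable types discussed in this article that support OWNER TO.
SUPPORTED_TYPES = ("CATALOG", "EXTERNAL LOCATION", "STORAGE CREDENTIAL")


def quote_identifier(name: str) -> str:
    """Wrap an identifier in backticks, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def transfer_ownership_statements(
    securables: List[Tuple[str, str]], new_owner_group: str
) -> List[str]:
    """Build one ALTER ... OWNER TO statement per (type, name) pair."""
    statements = []
    for securable_type, name in securables:
        normalised = securable_type.upper()
        if normalised not in SUPPORTED_TYPES:
            raise ValueError(f"Unsupported securable type: {securable_type}")
        statements.append(
            f"ALTER {normalised} {quote_identifier(name)} "
            f"OWNER TO {quote_identifier(new_owner_group)}"
        )
    return statements
```

Handing ownership to a regular account-level group, rather than leaving it with one person, means the securables remain manageable when that person changes role or leaves.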

Best practices for using the workspace catalog

While the provisioning of a workspace catalog greatly simplifies the initial setup of Unity Catalog for new workspaces, it also ties the catalog directly to the lifecycle of the workspace: if the workspace is decommissioned, the managed storage account that contains any Unity Catalog objects in the workspace catalog will also be lost.

The recommendation is to adhere to existing best practices for creating catalogs, aligning them with SDLC (Software Development Lifecycle), business units, and/or projects. This allows more flexibility to segregate storage away from the workspace and to bind these catalogs to multiple workspaces where required. It also means that the addition or removal of a workspace does not impact the lifecycle of any data stored in Unity Catalog.
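A minimal sketch of this recommendation, assuming a naming scheme of environment plus business unit and a storage account that lives outside the workspace. The helper names, the naming scheme, the storage path and the group name are illustrative assumptions; the generated statements use the Databricks SQL CREATE CATALOG ... MANAGED LOCATION and GRANT USE CATALOG syntax, and the path must already be covered by an external location.

```python
from typing import List


def quote_identifier(name: str) -> str:
    """Wrap an identifier in backticks, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def catalog_name_for(environment: str, business_unit: str) -> str:
    """Hypothetical naming scheme: <environment>_<business_unit>."""
    return f"{environment}_{business_unit}".lower()


def create_catalog_statements(
    environment: str, business_unit: str, managed_location: str, users_group: str
) -> List[str]:
    """Build statements that create a catalog on storage independent of any
    workspace and let a group use it."""
    if "'" in managed_location:
        # Keep the sketch simple: refuse paths that would need escaping.
        raise ValueError("managed_location must not contain single quotes")
    catalog = quote_identifier(catalog_name_for(environment, business_unit))
    return [
        f"CREATE CATALOG IF NOT EXISTS {catalog} "
        f"MANAGED LOCATION '{managed_location}'",
        f"GRANT USE CATALOG ON CATALOG {catalog} "
        f"TO {quote_identifier(users_group)}",
    ]
```

Because the managed location points at a separate storage account, deleting a workspace leaves the catalog's data intact, and the catalog can be bound to additional workspaces later.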

The metastore permissions granted to the system-owned Workspace Admins group are sufficient to create the securables (credentials, external locations, catalogs, etc.) needed to achieve the required catalog design for your organisation.

Conclusion

To recap, when Azure Databricks workspaces are auto-enabled for Unity Catalog, a default workspace catalog is created along with the necessary cloud resources and permissions, all without manual effort. This makes it much easier for the users of new workspaces to start using Unity Catalog immediately. However, it is still important to follow best practices in catalog design.

Unity Catalog will continue to be the foundation that the Databricks Data Intelligence Platform is built on. The ability to automatically enable Unity Catalog for all new workspaces greatly reduces the friction to start getting all the benefits of the platform.

For details on what happens when using UC automatic assignment on AWS, please see part two of this article.
