Version: Next

CSV Enricher

Important Capabilities

Capability	Status	Notes
Descriptions	✅	Supported by default
Domains	✅	Supported by default
Extract Ownership	✅	Supported by default
Extract Tags	✅	Supported by default

Looking to ingest a CSV data file into DataHub as an asset?

Use the Local File ingestion source. The CSV enricher is used for enriching entities already ingested into DataHub.

This plugin bulk-uploads metadata to DataHub. At the entity level, it applies glossary terms, tags, descriptions, owners, and domains. At the column level, it applies tags, glossary terms, and documentation. These values are read from a CSV file, and you can choose to either overwrite or append to existing values.

The format of the CSV is demonstrated below. The header is required, and URNs should be surrounded by quotes when they contain commas (most URNs contain commas).

resource,subresource,glossary_terms,tags,owners,ownership_type,description,domain,ownership_type_urn
"urn:li:dataset:(urn:li:dataPlatform:snowflake,datahub.growth.users,PROD)",,[urn:li:glossaryTerm:Users],[urn:li:tag:HighQuality],[urn:li:corpuser:lfoe|urn:li:corpuser:jdoe],CUSTOM,"description for users table",urn:li:domain:Engineering,urn:li:ownershipType:a0e9176c-d8cf-4b11-963b-f7a1bc2333c9
"urn:li:dataset:(urn:li:dataPlatform:hive,datahub.growth.users,PROD)",first_name,[urn:li:glossaryTerm:FirstName],,,,"first_name description",
"urn:li:dataset:(urn:li:dataPlatform:hive,datahub.growth.users,PROD)",last_name,[urn:li:glossaryTerm:LastName],,,,"last_name description",

Note that the first row does not have a subresource populated, so its glossary terms, tags, and owners are applied at the entity level. If a subresource is populated (as it is for the second and third rows), glossary terms and tags are applied to that column. Every row MUST have a resource. Also note that owners can only be applied at the resource level.
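The row-level rules above can be sketched with Python's standard csv module. This only illustrates the file format and is not the enricher's actual parser: the helper names parse_array and describe_row are hypothetical, and stripping the square brackets around array cells is an assumption based on the sample rows.

```python
import csv
import io

# Two rows taken from the sample above: one entity-level, one column-level.
SAMPLE = '''resource,subresource,glossary_terms,tags,owners,ownership_type,description,domain,ownership_type_urn
"urn:li:dataset:(urn:li:dataPlatform:snowflake,datahub.growth.users,PROD)",,[urn:li:glossaryTerm:Users],[urn:li:tag:HighQuality],[urn:li:corpuser:lfoe|urn:li:corpuser:jdoe],CUSTOM,"description for users table",urn:li:domain:Engineering,urn:li:ownershipType:a0e9176c-d8cf-4b11-963b-f7a1bc2333c9
"urn:li:dataset:(urn:li:dataPlatform:hive,datahub.growth.users,PROD)",first_name,[urn:li:glossaryTerm:FirstName],,,,"first_name description",
'''


def parse_array(cell, array_delimiter="|"):
    """Split a bracketed cell such as "[a|b]" into ["a", "b"] (hypothetical helper)."""
    if not cell:
        return []
    inner = cell.strip()
    if inner.startswith("[") and inner.endswith("]"):
        inner = inner[1:-1]
    return [item.strip() for item in inner.split(array_delimiter) if item.strip()]


def describe_row(row):
    """Classify a row as entity-level or column-level (hypothetical helper)."""
    if not row.get("resource"):
        raise ValueError("every row must have a resource")
    level = "column" if row.get("subresource") else "entity"
    return {
        "resource": row["resource"],
        "level": level,
        "field": row.get("subresource") or None,
        "glossary_terms": parse_array(row.get("glossary_terms")),
        "tags": parse_array(row.get("tags")),
        # Owners are only meaningful at the resource (entity) level.
        "owners": parse_array(row.get("owners")) if level == "entity" else [],
    }


# The csv module keeps the quoted URN intact despite its embedded commas.
rows = [describe_row(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
for r in rows:
    print(r["level"], r["field"], r["glossary_terms"], r["owners"])
```

Because the URNs are quoted, the commas inside them are not treated as column separators, which is why the quoting requirement matters.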

If ownership_type_urn is set then ownership_type must be set to CUSTOM.
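A pre-flight check for this rule, together with the resource-level restriction on owners, might look like the sketch below. The function name check_ownership is hypothetical, and this is a standalone lint you could run before ingestion, not something the enricher is known to expose.

```python
def check_ownership(row):
    """Return a list of problems found in one CSV row (a dict keyed by header name)."""
    problems = []
    ownership_type = (row.get("ownership_type") or "").strip()
    ownership_type_urn = (row.get("ownership_type_urn") or "").strip()
    owners = (row.get("owners") or "").strip()
    subresource = (row.get("subresource") or "").strip()

    # A custom ownership type URN is only valid alongside ownership_type CUSTOM.
    if ownership_type_urn and ownership_type != "CUSTOM":
        problems.append("ownership_type_urn is set, so ownership_type must be CUSTOM")

    # Owners apply to the resource as a whole, never to a single column.
    if owners and subresource:
        problems.append("owners can only be applied at the resource level")

    return problems


good_row = {
    "resource": "urn:li:dataset:(urn:li:dataPlatform:snowflake,datahub.growth.users,PROD)",
    "owners": "[urn:li:corpuser:jdoe]",
    "ownership_type": "CUSTOM",
    "ownership_type_urn": "urn:li:ownershipType:a0e9176c-d8cf-4b11-963b-f7a1bc2333c9",
}
print(check_ownership(good_row))
```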

Note that your recipe config lets you write as a PATCH or as an OVERRIDE. This choice applies to all metadata for the entity, not just a single aspect. OVERRIDE therefore overrides all metadata, including deleting existing values when a metadata field is empty. The default is PATCH.
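The difference between the two modes can be pictured with a toy model. This is not DataHub's implementation: the function apply_write_semantics is hypothetical, metadata is reduced to plain lists keyed by field name, and every field shown is assumed to be one the CSV can set.

```python
def apply_write_semantics(existing, incoming, write_semantics="PATCH"):
    """Toy model: both arguments map a metadata field name to a list of values."""
    if write_semantics not in ("PATCH", "OVERRIDE"):
        raise ValueError("write_semantics must be PATCH or OVERRIDE")

    result = {}
    for field in sorted(set(existing) | set(incoming)):
        old = existing.get(field, [])
        new = incoming.get(field, [])
        if write_semantics == "OVERRIDE":
            # The CSV wins outright; an empty field wipes the existing values.
            result[field] = list(new)
        else:
            # PATCH keeps what is already there and appends unseen values.
            result[field] = list(old) + [v for v in new if v not in old]
    return result


existing = {"tags": ["urn:li:tag:Legacy"], "owners": ["urn:li:corpuser:jdoe"]}
incoming = {"tags": ["urn:li:tag:HighQuality"], "owners": []}

patched = apply_write_semantics(existing, incoming, "PATCH")
overridden = apply_write_semantics(existing, incoming, "OVERRIDE")
print(patched)
print(overridden)
```

In this model, the empty owners cell leaves the existing owner in place under PATCH but removes it under OVERRIDE, which is the behavior to watch for when a CSV leaves fields blank.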

note

This source will not work on very large CSV files that do not fit in memory.

CLI based Ingestion

Install the Plugin

The csv-enricher source works out of the box with acryl-datahub.

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

source:
  type: csv-enricher
  config:
    # relative path to your csv file to ingest
    filename: ./path/to/your/file.csv

# Default sink is datahub-rest and doesn't need to be configured
# See https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub for customization options
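The same recipe can also be expressed as a Python dictionary and run programmatically. The sketch below assumes the Pipeline class from the acryl-datahub Python package, and assumes that omitting the sink falls back to datahub-rest as the recipe comments describe. The helper names build_recipe and run_recipe are hypothetical.

```python
def build_recipe(filename, write_semantics="PATCH", delimiter=",", array_delimiter="|"):
    """Build the starter recipe as a dict, mirroring the YAML structure."""
    return {
        "source": {
            "type": "csv-enricher",
            "config": {
                "filename": filename,
                "write_semantics": write_semantics,
                "delimiter": delimiter,
                "array_delimiter": array_delimiter,
            },
        },
        # No sink entry: the default datahub-rest sink is assumed to apply.
    }


def run_recipe(recipe):
    """Run a recipe dict; requires acryl-datahub and a reachable DataHub instance."""
    # Imported lazily so build_recipe works without acryl-datahub installed.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.raise_from_status()


recipe = build_recipe("./path/to/your/file.csv")
print(recipe["source"]["type"])
```

Calling run_recipe(recipe) would then perform the ingestion. It is left out of the example so that the snippet runs without a DataHub server.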

Config Details


Note that a . is used to denote nested fields in the YAML recipe.

Field	Description
filename ✅ string	File path or URL of CSV file to ingest.
array_delimiter string	Delimiter to use when parsing array fields (tags, terms and owners) Default: |
delimiter string	Delimiter to use when parsing CSV Default: ,
write_semantics string	Whether the new tags, terms and owners to be added will override the existing ones added only by this source or not. Value for this config can be "PATCH" or "OVERRIDE". NOTE: this will apply to all metadata for the entity, not just a single aspect. Default: PATCH

The JSONSchema for this configuration is inlined below.

{
  "title": "CSVEnricherConfig",
  "type": "object",
  "properties": {
    "filename": {
      "title": "Filename",
      "description": "File path or URL of CSV file to ingest.",
      "type": "string"
    },
    "write_semantics": {
      "title": "Write Semantics",
      "description": "Whether the new tags, terms and owners to be added will override the existing ones added only by this source or not. Value for this config can be \"PATCH\" or \"OVERRIDE\". NOTE: this will apply to all metadata for the entity, not just a single aspect.",
      "default": "PATCH",
      "type": "string"
    },
    "delimiter": {
      "title": "Delimiter",
      "description": "Delimiter to use when parsing CSV",
      "default": ",",
      "type": "string"
    },
    "array_delimiter": {
      "title": "Array Delimiter",
      "description": "Delimiter to use when parsing array fields (tags, terms and owners)",
      "default": "|",
      "type": "string"
    }
  },
  "required": [
    "filename"
  ],
  "additionalProperties": false
}
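To make the schema's constraints concrete, the sketch below checks a config dictionary by hand: filename is required, unknown keys are rejected (additionalProperties is false), every value must be a string, and defaults are filled in. The function name normalize_config is hypothetical. The PATCH/OVERRIDE check comes from the field description rather than the schema itself, which only requires a string.

```python
DEFAULTS = {"write_semantics": "PATCH", "delimiter": ",", "array_delimiter": "|"}
REQUIRED = ("filename",)
ALLOWED = set(DEFAULTS) | set(REQUIRED)


def normalize_config(config):
    """Validate a csv-enricher config dict and return it with defaults applied."""
    unknown = sorted(set(config) - ALLOWED)
    if unknown:
        raise ValueError("unknown config keys: " + ", ".join(unknown))

    missing = [key for key in REQUIRED if key not in config]
    if missing:
        raise ValueError("missing required keys: " + ", ".join(missing))

    for key, value in config.items():
        if not isinstance(value, str):
            raise TypeError(key + " must be a string")

    merged = dict(DEFAULTS)
    merged.update(config)

    # Taken from the field description, not enforced by the JSON schema.
    if merged["write_semantics"] not in ("PATCH", "OVERRIDE"):
        raise ValueError("write_semantics must be PATCH or OVERRIDE")
    return merged


print(normalize_config({"filename": "./path/to/your/file.csv"}))
```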

Code Coordinates

Class Name: datahub.ingestion.source.csv_enricher.CSVEnricherSource
Browse on GitHub

Questions

If you've got any questions on configuring ingestion for CSV Enricher, feel free to ping us on our Slack.
