How to get first row and first column value of CSV in DataFactory?

3 min read 06-10-2024

Extracting the First Row and Column Values from a CSV in Azure Data Factory

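Scenario: You have a CSV file in a store that Azure Data Factory can read (for example, Azure Blob Storage) and need to extract the value in the first column of the very first row, and optionally the first-column value of every row. This is often needed for tasks like: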

  • Identifying header names from the first row.
  • Understanding the first data point in the CSV.
  • Using these values for dynamic calculations or data manipulation.

The sections below show how to do this with Data Factory's Lookup, Set Variable, and ForEach activities.

Original Code (Example; MyCSVDataset and MyOutputDataset are placeholder dataset names):

{
  "name": "GetFirstRowFirstColumn",
  "description": "Plain copy of the CSV; nothing here extracts the first row or first column.",
  "type": "Copy",
  "inputs": [
    {
      "referenceName": "MyCSVDataset",
      "type": "DatasetReference"
    }
  ],
  "outputs": [
    {
      "referenceName": "MyOutputDataset",
      "type": "DatasetReference"
    }
  ],
  "typeProperties": {
    "source": {
      "type": "DelimitedTextSource",
      "storeSettings": {
        "type": "AzureBlobStorageReadSettings",
        "recursive": true
      },
      "formatSettings": {
        "type": "DelimitedTextReadSettings"
      }
    },
    "sink": {
      "type": "DelimitedTextSink",
      "storeSettings": {
        "type": "AzureBlobStorageWriteSettings"
      },
      "formatSettings": {
        "type": "DelimitedTextWriteSettings",
        "fileExtension": ".csv"
      }
    }
  }
}

Analysis:

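This example shows a basic Copy activity in Data Factory. A Copy activity only moves data from a source to a sink; it has no step where an individual cell can be read into a variable, so it contains no logic to extract the first row and column values. Let's break down the steps to achieve this.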

Steps for Extracting the First Row and Column:

  1. Load the first line of the CSV:

    • Use a Lookup activity with firstRowOnly set to true to read the first line of the CSV file.
    • The result is available to later activities through the Lookup activity's output and can be stored in a pipeline variable.
  2. Extract the first value of the first row:

    • Apply the split() function to the first line, splitting it by the CSV delimiter (e.g., ",").
    • The first element of the resulting array (index 0) is the first value in the row.
  3. Extract the first column:

    • Use a second Lookup activity with firstRowOnly set to false to return the rows, and a ForEach activity to iterate over them.
    • In each iteration, use the split() function to separate values by the delimiter.
    • The first element of the resulting array is the first column value for the current row.
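
The steps above depend on how the dataset presents each row. As a rough sketch rather than a definitive configuration, a DelimitedText dataset along the following lines, with firstRowAsHeader set to false, makes the Lookup activity return the first line as data and name the columns Prop_0, Prop_1, and so on. Azure Blob Storage is assumed as the store, and the container name your-container and the file name are placeholders.

```json
{
  "name": "MyCSVDataset",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": {
      "referenceName": "MyStorageLinkedService",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "your-container",
        "fileName": "your-file-path.csv"
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": false
    },
    "schema": []
  }
}
```

With firstRowAsHeader set to true instead, the first line is consumed as column names, and the first row returned by the Lookup is the first data row, keyed by those header names.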

Enhanced Code:

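{
  "name": "ExtractFirstRowFirstColumn",
  "properties": {
    "activities": [
      {
        "name": "LookupFirstLine",
        "type": "Lookup",
        "typeProperties": {
          "source": {
            "type": "DelimitedTextSource",
            "storeSettings": {
              "type": "AzureBlobStorageReadSettings"
            },
            "formatSettings": {
              "type": "DelimitedTextReadSettings"
            }
          },
          "dataset": {
            "referenceName": "MyCSVDataset",
            "type": "DatasetReference"
          },
          "firstRowOnly": true
        }
      },
      {
        "name": "ExtractFirstRow",
        "type": "SetVariable",
        "dependsOn": [
          {
            "activity": "LookupFirstLine",
            "dependencyConditions": [ "Succeeded" ]
          }
        ],
        "typeProperties": {
          "variableName": "FirstRow",
          "value": {
            "value": "@split(activity('LookupFirstLine').output.firstRow.Prop_0, ',')[0]",
            "type": "Expression"
          }
        }
      },
      {
        "name": "LookupAllRows",
        "type": "Lookup",
        "typeProperties": {
          "source": {
            "type": "DelimitedTextSource",
            "storeSettings": {
              "type": "AzureBlobStorageReadSettings"
            },
            "formatSettings": {
              "type": "DelimitedTextReadSettings"
            }
          },
          "dataset": {
            "referenceName": "MyCSVDataset",
            "type": "DatasetReference"
          },
          "firstRowOnly": false
        }
      },
      {
        "name": "ExtractFirstColumn",
        "type": "ForEach",
        "dependsOn": [
          {
            "activity": "LookupAllRows",
            "dependencyConditions": [ "Succeeded" ]
          }
        ],
        "typeProperties": {
          "isSequential": true,
          "items": {
            "value": "@activity('LookupAllRows').output.value",
            "type": "Expression"
          },
          "activities": [
            {
              "name": "SetFirstColumn",
              "type": "SetVariable",
              "typeProperties": {
                "variableName": "FirstColumn",
                "value": {
                  "value": "@split(item().Prop_0, ',')[0]",
                  "type": "Expression"
                }
              }
            },
            {
              "name": "OutputFirstColumn",
              "type": "ExecutePipeline",
              "dependsOn": [
                {
                  "activity": "SetFirstColumn",
                  "dependencyConditions": [ "Succeeded" ]
                }
              ],
              "typeProperties": {
                "pipeline": {
                  "referenceName": "OutputFirstColumnPipeline",
                  "type": "PipelineReference"
                },
                "waitOnCompletion": true,
                "parameters": {
                  "firstColumnValue": {
                    "value": "@variables('FirstColumn')",
                    "type": "Expression"
                  }
                }
              }
            }
          ]
        }
      }
    ],
    "variables": {
      "FirstRow": {
        "type": "String"
      },
      "FirstColumn": {
        "type": "String"
      }
    }
  }
}

Explanation:

  • LookupFirstLine: This Lookup activity reads only the first row of the CSV file (firstRowOnly is true) and returns it as activity('LookupFirstLine').output.firstRow. A Lookup does not write to a dataset; later activities read its output directly. The Prop_0 column name assumes the dataset referenced as MyCSVDataset does not treat the first row as a header.
  • ExtractFirstRow: Using split(), this Set Variable activity extracts the first value from the first row and stores it in the FirstRow variable. If the dataset already parses columns with "," as its delimiter, Prop_0 is already the first value and split() is redundant (it would also cut a quoted value that contains a comma); split() is only required when the dataset returns each line as a single column.
  • LookupAllRows: This second Lookup activity (firstRowOnly is false) returns the rows of the file as an array in output.value, which is what the ForEach activity iterates over.
  • ExtractFirstColumn: This ForEach activity iterates through the rows. In each iteration, it uses split() to get the first value (first column) from the current row and stores it in the FirstColumn variable. isSequential is set to true because pipeline variables are shared across iterations, so parallel iterations would overwrite each other's value.
  • OutputFirstColumnPipeline: This is a separate pipeline (not shown here) that receives the firstColumnValue parameter and handles further processing based on your requirements.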
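
To make the expressions easier to follow, here is an abridged sketch of what a Lookup activity returns when firstRowOnly is false and the dataset has no header row. The cell values are invented for illustration, and the real output carries additional runtime fields that are omitted here.

```json
{
  "count": 3,
  "value": [
    { "Prop_0": "id", "Prop_1": "name" },
    { "Prop_0": "1", "Prop_1": "Alice" },
    { "Prop_0": "2", "Prop_1": "Bob" }
  ]
}
```

With firstRowOnly set to true, the output instead contains a single firstRow object of the same form. That is why the first cell of the file can be read as output.firstRow.Prop_0, while inside a ForEach over output.value each row exposes its first column as item().Prop_0.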

Benefits of this approach:

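  • Built-in activities: The approach needs no custom code or external compute. Note that a Lookup activity returns at most 5000 rows and 4 MB of output, so the row-by-row part suits small files; for larger files, a mapping data flow is a better fit.
  • Flexibility: The approach allows for customization based on your specific needs. You can change the row index or column name in the expressions to extract different rows or columns.
  • Reusability: By parameterizing the dataset's file path, you can reuse the same pipeline to extract values from multiple CSV files.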
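
As one hedged illustration of the flexibility point, a Set Variable activity along these lines would read the third column of the second row. It assumes a Lookup activity named LookupAllRows with firstRowOnly set to false, a dataset without a header row (hence the Prop_2 column name), and a String pipeline variable named CellValue. The names ExtractOtherCell and CellValue are illustrative only.

```json
{
  "name": "ExtractOtherCell",
  "type": "SetVariable",
  "dependsOn": [
    {
      "activity": "LookupAllRows",
      "dependencyConditions": [ "Succeeded" ]
    }
  ],
  "typeProperties": {
    "variableName": "CellValue",
    "value": {
      "value": "@activity('LookupAllRows').output.value[1].Prop_2",
      "type": "Expression"
    }
  }
}
```

Both indexes are zero-based: value[1] is the second row and Prop_2 is the third column.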

Conclusion:

By combining Data Factory's Lookup, Set Variable, and ForEach activities, you can extract the first row and first column values from a CSV file without custom code and pass them to later steps in the pipeline.

Remember to replace the placeholder values with your actual CSV file paths and other relevant configurations.
