Flattening multi nested json into a pandas dataframe

3 min read 05-10-2024

Flatten Your Multi-Nested JSON: A Pandas Guide for Easy Data Analysis

Working with deeply nested JSON data can be a real headache. Trying to extract the information you need often feels like navigating a tangled web. But fear not, pandas is here to rescue you! This article will guide you through the process of flattening multi-nested JSON data into a clean and manageable pandas DataFrame, making your data analysis much smoother.

The Scenario:

Imagine you have a JSON file containing data about different products, each with multiple attributes and nested structures like this:

[
  {
    "product_id": "P123",
    "name": "Laptop",
    "category": "Electronics",
    "specifications": {
      "screen_size": "15.6 inches",
      "processor": "Intel Core i5",
      "ram": "8GB"
    },
    "price": {
      "currency": "USD",
      "amount": 1200
    },
    "reviews": [
      {
        "rating": 4,
        "comment": "Great laptop for the price!"
      },
      {
        "rating": 5,
        "comment": "Excellent performance."
      }
    ]
  },
  {
    "product_id": "P456",
    "name": "Smartphone",
    "category": "Electronics",
    "specifications": {
      "screen_size": "6.5 inches",
      "camera": "48MP",
      "battery": "4000 mAh"
    },
    "price": {
      "currency": "USD",
      "amount": 700
    },
    "reviews": [
      {
        "rating": 3,
        "comment": "Good phone, but the battery life could be better."
      }
    ]
  }
]

You want to analyze this data in a pandas DataFrame, but the nested structure makes it difficult to work with. You need to flatten it into a format where each row represents a product and each column represents a distinct attribute.

The Solution: Recursive Functions and Pandas Power

The core of the solution is a recursive function that walks the nested JSON structure and collects every leaf value into a single-level dictionary. Each key in that dictionary is the path to the value, built by joining the parent keys with a separator (for example, price_amount). A list of such dictionaries, one per product, can then be passed straight to the pandas DataFrame constructor.

Here's a Python code snippet using the json and pandas libraries:

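import json
import pandas as pd

def flatten_json(y, key=None, sep='_'):
    """Flatten a nested dict into a single-level dict with joined keys."""
    items = []
    for k, v in y.items():
        new_key = key + sep + k if key else k
        if isinstance(v, dict):
            # Nested object: recurse, carrying the key prefix down
            items.extend(flatten_json(v, new_key, sep=sep).items())
        elif isinstance(v, list):
            # List: add each element's index to the key
            for i, u in enumerate(v):
                indexed_key = f"{new_key}{sep}{i}"
                if isinstance(u, dict):
                    items.extend(flatten_json(u, indexed_key, sep=sep).items())
                else:
                    # Non-dict list items have no .items(), so store them as-is
                    items.append((indexed_key, u))
        else:
            items.append((new_key, v))
    return dict(items)

# Load the list of product records
with open('products.json', 'r') as f:
    data = json.load(f)

# One flattened dict per product; keys missing from a product become NaN
flattened_data = [flatten_json(d) for d in data]
df = pd.DataFrame(flattened_data)

print(df)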

Understanding the Code:

  • The flatten_json function recursively traverses the JSON structure.
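  • It joins each parent key to its child key with the separator, so the ram entry inside specifications becomes the column specifications_ram.
  • The code handles lists by adding an index to the key for each item in the list, so the rating of the first review becomes reviews_0_rating.
  • Finally, a list comprehension flattens every product, and pd.DataFrame turns the resulting list of dictionaries into one row per product. Keys that a product lacks show up as NaN in its row.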
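
To make the key naming concrete, here is a small self-contained sketch that applies the same kind of flattening to one trimmed-down product record. The record literal and the helper name flatten_record are assumptions made for illustration; the logic mirrors the flatten_json function described above.

```python
def flatten_record(record, parent_key="", sep="_"):
    """Illustrative helper (hypothetical name): flatten nested dicts and lists."""
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Nested object: merge its flattened keys under the current prefix
            flat.update(flatten_record(value, new_key, sep))
        elif isinstance(value, list):
            for index, item in enumerate(value):
                indexed_key = f"{new_key}{sep}{index}"
                if isinstance(item, dict):
                    flat.update(flatten_record(item, indexed_key, sep))
                else:
                    flat[indexed_key] = item
        else:
            flat[new_key] = value
    return flat

# Trimmed-down sample record, made up for this example
product = {
    "product_id": "P123",
    "price": {"currency": "USD", "amount": 1200},
    "reviews": [{"rating": 4}, {"rating": 5}],
}

flat = flatten_record(product)
print(list(flat))
# ['product_id', 'price_currency', 'price_amount', 'reviews_0_rating', 'reviews_1_rating']

# A different separator changes only the key names, not the values
dotted = flatten_record(product, sep=".")
print(list(dotted))
# ['product_id', 'price.currency', 'price.amount', 'reviews.0.rating', 'reviews.1.rating']
```

Note that every extra review adds new columns (reviews_2_rating and so on), so products with many reviews produce a wide and mostly empty table.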

Let's Analyze:

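  1. Clarity: The function is short, and each branch handles exactly one case: a nested dictionary, a list, or a plain value.
  2. Flexibility: The sep parameter allows you to customize the separator used in the key names. For example, sep='.' yields price.amount instead of price_amount.
  3. Efficiency: The function visits every value once and copies it up through each enclosing level, so the cost grows roughly with the number of values times the nesting depth. Recursion depth equals the nesting depth, which for typical JSON stays far below Python's default recursion limit of 1000.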
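
If nesting depth ever does become a concern, the same flattening can be done with an explicit stack instead of recursion. The following is a minimal sketch under that assumption and is not part of the solution above; flatten_iterative is a hypothetical name, and the sample data is made up. Children are pushed in reverse so that keys come out in their original order.

```python
def flatten_iterative(record, sep="_"):
    """Hypothetical helper: flatten nested dicts and lists without recursion."""
    flat = {}
    # Each stack entry is (key_prefix, value still to be expanded)
    stack = [("", record)]
    while stack:
        prefix, value = stack.pop()
        if isinstance(value, dict):
            children = list(value.items())
        elif isinstance(value, list):
            children = list(enumerate(value))
        else:
            flat[prefix] = value  # leaf value: store it under its full path
            continue
        # Push in reverse so children are popped in their original order
        for child_key, child in reversed(children):
            child_prefix = f"{prefix}{sep}{child_key}" if prefix else str(child_key)
            stack.append((child_prefix, child))
    return flat

# Made-up sample record
product = {
    "product_id": "P123",
    "price": {"currency": "USD", "amount": 1200},
    "reviews": [{"rating": 4}, {"rating": 5}],
}
result = flatten_iterative(product)
print(result)
# {'product_id': 'P123', 'price_currency': 'USD', 'price_amount': 1200, 'reviews_0_rating': 4, 'reviews_1_rating': 5}

# A record nested 2,000 levels deep, past the default recursion limit of 1000
deep = {"value": 1}
for _ in range(2000):
    deep = {"level": deep}
deep_flat = flatten_iterative(deep)
print(len(deep_flat))  # 1
```

As in the recursive version, empty dictionaries and empty lists produce no columns at all.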

Adding Value and Beyond:

  • Handling Missing Values: JSON null values load as None, and attributes that a product lacks appear as NaN in the DataFrame. You can replace both with a default using DataFrame.fillna, or substitute a placeholder inside flatten_json itself.
  • Customizing Output: You can further customize the flattening process based on your specific needs. For example, you can choose to flatten only specific levels of the JSON structure or apply custom transformations to the extracted data.
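
Both ideas above can be sketched with pandas built-ins. The example below assumes pandas 1.0 or later (where pd.json_normalize is available at the top level) and uses a small inline sample, made up for illustration, rather than products.json. It is one possible approach, not the only one: max_level limits how deep dictionaries are flattened, record_path with meta produces one row per review instead of one column per review, and fillna supplies defaults for attributes a product lacks.

```python
import pandas as pd

# Inline sample data, made up for this example
products = [
    {
        "product_id": "P123",
        "specifications": {"ram": "8GB", "display": {"size": "15.6 inches"}},
        "reviews": [{"rating": 4}, {"rating": 5}],
    },
    {
        "product_id": "P456",
        "specifications": {"camera": "48MP"},
        "reviews": [{"rating": 3}],
    },
]

# Flatten only one level: deeper dicts stay as Python objects in their column
shallow = pd.json_normalize(products, max_level=1, sep="_")
print(sorted(shallow.columns))
# ['product_id', 'reviews', 'specifications_camera', 'specifications_display', 'specifications_ram']

# Long format: one row per review, with product_id repeated alongside it
reviews = pd.json_normalize(products, record_path="reviews", meta=["product_id"])
print(reviews)

# Replace missing attributes with a placeholder, column by column
filled = shallow.fillna(
    {"specifications_ram": "unknown", "specifications_camera": "unknown"}
)
print(filled[["product_id", "specifications_ram", "specifications_camera"]])
```

The long format is often easier to aggregate, for example with reviews.groupby("product_id")["rating"].mean(), because the number of columns no longer depends on how many reviews a product has.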

Conclusion:

A short recursive function is enough to turn nested JSON into a flat pandas DataFrame with one row per record and one column per leaf value. From there, the usual pandas tools for filtering, grouping, and aggregating apply directly.