Speeding Up Data Extraction and Processing from Multiple CSV Files with Bash
Extracting data from multiple CSV files, applying conditions, sorting the results, and removing duplicates can be a common task in data analysis. While Bash provides the necessary tools for this, the process can become slow for large datasets. This article explores efficient ways to streamline this workflow using Bash scripting, focusing on optimizing performance.
The Scenario: Extracting Data with Conditions
Imagine you have a directory containing hundreds of CSV files, each containing customer data like name, age, and purchase history. You need to extract specific data points (e.g., customers over 30 years old who made a purchase within the last month), sort the results by name, and remove any duplicate entries.
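The examples below assume each CSV file has no header row and three comma-separated columns in the order name, age, purchase date (YYYY-MM-DD), along the lines of this made-up snippet:
Alice Smith,42,2024-05-12
Bob Jones,28,2024-04-30
Carol Lee,35,2024-05-20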
Here's a basic Bash script that demonstrates the process:
#!/bin/bash
# Define the directory containing the CSV files
csv_dir="/path/to/csv/files"
# Create an empty array to store the extracted data
extracted_data=()
# Compute the purchase-date cutoff once, outside the loops
cutoff=$(date -d "1 month ago" +%Y-%m-%d)
# Iterate through each CSV file in the directory
for file in "$csv_dir"/*.csv; do
# Process each file
while IFS=, read -r name age purchase_date; do
# Check if the customer meets the criteria
# (YYYY-MM-DD dates compare correctly as strings)
if [[ "$age" -gt 30 && ! "$purchase_date" < "$cutoff" ]]; then
# Add the data to the array
extracted_data+=("$name")
fi
done < "$file"
done
# Sort the extracted data alphabetically and remove duplicates
sorted_data=$(printf '%s\n' "${extracted_data[@]}" | sort | uniq)
# Display the final sorted data
echo "Extracted and sorted data:"
echo "$sorted_data"
Performance Bottlenecks and Optimization Strategies
This script, while functional, can be inefficient for large datasets due to the following:
- File I/O: Reading each CSV file separately incurs significant overhead, especially if there are many files.
- Looping and Conditionals: Processing each line within a nested loop, along with condition checks, consumes processing power.
- Array Manipulation: Repeatedly appending elements to the extracted_data array can slow down the script.
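A quick way to confirm where the time goes is to run the slow and the optimized versions on the same representative subset of files and compare wall-clock times with the shell's built-in time command (the script names here are placeholders):
# Hypothetical file names; substitute your own scripts
time ./extract_loop.sh      # the Bash-loop version above
time ./extract_pipeline.sh  # the awk pipeline shown later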
Here are some optimization strategies to address these issues:
- Combine Files with cat: Instead of processing each file individually, we can concatenate all files into a single stream using cat: cat "$csv_dir"/*.csv | ...
- Use awk for Data Manipulation: awk is a powerful tool for data manipulation and can handle conditions and data extraction far more efficiently than Bash loops. Note that a $(date ...) substitution inside a single-quoted awk program is never expanded by the shell, so pass the cutoff date in as an awk variable: awk -F, -v cutoff="$(date -d '1 month ago' +%Y-%m-%d)" '$2 > 30 && $3 >= cutoff {print $1}' "$csv_dir"/*.csv | ...
- Utilize sort with uniq: sort and uniq are efficient command-line utilities for sorting and removing duplicates (sort -u combines the two steps into a single process).
- Utilize parallel for Parallel Processing: GNU parallel lets you distribute the processing of multiple files across available CPU cores, significantly accelerating the work (see the sketch after this list).
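Since parallel does not appear in the single-pipeline example that follows, here is a rough sketch of what a parallel variant could look like. It reuses the directory path and cutoff assumptions from above and relies on GNU parallel's -0 (read NUL-delimited file names) and -q (preserve the quoting of the awk program) options:
#!/bin/bash
csv_dir="/path/to/csv/files"
# Compute the cutoff date once so every job compares against the same value
cutoff=$(date -d "1 month ago" +%Y-%m-%d)
# Run the awk filter on each CSV file concurrently (one job per CPU core by default),
# then merge, sort, and de-duplicate the combined output
find "$csv_dir" -name '*.csv' -print0 | \
  parallel -0 -q awk -F, -v cutoff="$cutoff" '$2 > 30 && $3 >= cutoff {print $1}' | \
  sort -u > extracted_data.txt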
Optimized Code Example
Here's an optimized version of the script incorporating the above strategies:
#!/bin/bash
# Define the directory containing the CSV files
csv_dir="/path/to/csv/files"
# Compute the purchase-date cutoff once and pass it to awk as a variable
cutoff=$(date -d "1 month ago" +%Y-%m-%d)
# Extract and filter data with awk, then sort and remove duplicates
cat "$csv_dir"/*.csv | \
awk -F, -v cutoff="$cutoff" '$2 > 30 && $3 >= cutoff {print $1}' | \
sort | uniq > extracted_data.txt
# Display the results
echo "Extracted and sorted data saved to extracted_data.txt"
Explanation:
- cat "$csv_dir"/*.csv: Combines all CSV files into a single stream.
- awk -F, -v cutoff="$cutoff" '$2 > 30 && $3 >= cutoff {print $1}': Filters rows by age and purchase date, printing only the name.
- sort: Sorts the extracted names alphabetically.
- uniq: Removes adjacent duplicate entries (which is why it runs after sort).
- > extracted_data.txt: Redirects the final output to a file.
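If your CSV files do carry a header row (the examples above assume they do not), awk can skip the first line of every file with its built-in FNR counter, provided the files are passed to awk directly rather than through cat:
# FNR resets to 1 at the start of each input file, so FNR > 1 skips every header line
awk -F, -v cutoff="$cutoff" 'FNR > 1 && $2 > 30 && $3 >= cutoff {print $1}' "$csv_dir"/*.csv | \
  sort -u > extracted_data.txt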
Conclusion
This optimized Bash script significantly improves the performance of extracting data from multiple CSV files. By combining files, using awk
for data manipulation, and leveraging efficient command-line tools, we can handle large datasets efficiently. Remember to adjust the conditions and data extraction logic as per your specific requirements.
Further Optimization:
- Consider using specialized CSV processing tools like csvkit for even faster data manipulation (a rough sketch follows below).
- For very large datasets, explore using databases or distributed computing frameworks like Hadoop or Spark for more efficient processing.
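As a very loose sketch of the csvkit route: csvstack, csvsql, and the -H/--no-header-row flag are documented csvkit tools, but the generated a/b/c column names and the stdin table name used in the query are assumptions about csvkit's defaults, so check the csvkit documentation before relying on this.
# Hypothetical csvkit pipeline: stack headerless files, then filter and de-duplicate in SQL
# (csvsql's output is CSV, so it still begins with a header line)
cutoff=$(date -d "1 month ago" +%Y-%m-%d)
csvstack -H "$csv_dir"/*.csv | \
  csvsql --query "SELECT DISTINCT a FROM stdin WHERE b > 30 AND c >= '$cutoff' ORDER BY a"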
Resources:
- awk: https://www.gnu.org/software/gawk/manual/gawk.html
- sort: https://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html
- uniq: https://www.gnu.org/software/coreutils/manual/html_node/uniq-invocation.html
- parallel: https://www.gnu.org/software/parallel/
- csvkit: https://csvkit.readthedocs.io/en/latest/