Use dplyr to compute "streaks" in a column of data

2 min read 06-10-2024

Unleashing Streaks: Using dplyr to Identify Patterns in Your Data

Have you ever needed to find consecutive occurrences of a certain value in a dataset? Perhaps you're analyzing sales data and want to identify periods of consistent growth, or tracking website traffic and seeking out periods of sustained engagement. This is where the concept of "streaks" comes into play, and the dplyr package in R provides powerful tools to efficiently uncover these patterns.

Let's imagine we have a dataset of daily stock prices for a particular company, and we want to identify periods where the stock price increased consecutively for at least three days. Here's a simplified example:

library(dplyr)

stock_data <- data.frame(
  date = seq.Date(from = as.Date("2023-01-01"), to = as.Date("2023-01-10"), by = "day"),
  price = c(100, 102, 101, 104, 106, 108, 107, 109, 111, 110)
)

In this example, we want to identify the streak of increasing prices that starts on January 4th (price 104) and lasts until January 6th (price 108): three consecutive daily increases.
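Before reaching for dplyr, it can help to look at the day-over-day changes directly. The sketch below uses only base R; the names `price`, `dates`, and `daily_change` are introduced here for illustration and simply repeat the sample data above.

```r
price <- c(100, 102, 101, 104, 106, 108, 107, 109, 111, 110)
dates <- seq(as.Date("2023-01-01"), by = "day", length.out = length(price))

# diff() returns today's price minus yesterday's, so it has one element fewer
# than `price`; label each change with the day it was observed on (month-day)
daily_change <- setNames(diff(price), format(dates[-1], "%m-%d"))

# Days on which the price rose compared with the previous day
rising_days <- names(daily_change)[daily_change > 0]
print(daily_change)
print(rising_days)
```

The positive changes on 01-04, 01-05, and 01-06 sit next to each other, and the drop on 01-07 is what breaks the run.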

Here's how we can achieve this using dplyr:

stock_data %>%
  mutate(
    # TRUE when the price is higher than the previous day's; lag() yields NA
    # for the first row, which coalesce() turns into FALSE
    increase = coalesce(price > lag(price), FALSE),
    # Start a new run whenever `increase` flips, so each streak gets its own id
    run_id = cumsum(increase != lag(increase, default = FALSE))
  ) %>%
  filter(increase) %>%              # keep only the days on which the price rose
  group_by(run_id) %>%
  summarise(
    streak_start = first(date),     # first day of the streak
    streak_end = last(date),        # last day of the streak
    streak_length = n(),            # number of consecutive increases
    .groups = "drop"
  ) %>%
  filter(streak_length >= 3) %>%    # require at least three increases in a row
  select(streak_start, streak_end, streak_length)

This code snippet does the following:

  1. Creates an "increase" column: Checks if the current price is higher than the previous day's price. The first row has no previous day, so its NA is replaced with FALSE.
  2. Creates a "run_id" column: Adds 1 to a running counter every time "increase" changes from FALSE to TRUE or back. All rows that belong to the same unbroken run therefore share one id, and the count restarts for every new streak.
  3. Keeps only increasing days: Filters out the rows where the price did not rise, so the remaining groups are exactly the increasing streaks.
  4. Summarises each streak: For every run_id, takes the first date as streak_start, the last date as streak_end, and the number of rows as streak_length.
  5. Filters for relevant streaks: Keeps only the streaks with at least three consecutive increases.

For the sample data, this returns a single row: a streak that starts on 2023-01-04, ends on 2023-01-06, and has a streak_length of 3. The shorter runs (January 2nd on its own, and January 8th to 9th) are dropped by the length filter.
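As a cross-check on the streak boundaries, the same runs can be computed without dplyr using base R's rle() (run-length encoding). This is a minimal sketch of an alternative technique, not part of the dplyr pipeline; the names `run_start`, `run_end`, `streaks`, and `long_streaks` are introduced here for illustration.

```r
price <- c(100, 102, 101, 104, 106, 108, 107, 109, 111, 110)
dates <- seq(as.Date("2023-01-01"), by = "day", length.out = length(price))

# Day-over-day increase flag; the first day has no predecessor, so use FALSE
increase <- c(FALSE, diff(price) > 0)

# rle() collapses the flag into runs of identical values
runs <- rle(increase)
run_end <- cumsum(runs$lengths)           # row index where each run ends
run_start <- run_end - runs$lengths + 1   # row index where each run starts

# One row per run, then keep only the runs of TRUE (increasing days)
streaks <- data.frame(
  streak_start = dates[run_start],
  streak_end = dates[run_end],
  streak_length = runs$lengths
)[runs$values, ]

# Streaks of at least three consecutive increases
long_streaks <- streaks[streaks$streak_length >= 3, ]
print(long_streaks)
```

On the sample data, `long_streaks` holds one streak, running from 2023-01-04 to 2023-01-06 with a length of 3.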

Key Insights:

  • This approach can be applied to identify streaks for any kind of data, not just financial data.
  • The logic can be modified to detect streaks of decreasing values, or even specific values, by changing the condition in the mutate() function.
  • The code can be further customized to cater to specific requirements, such as filtering for streaks longer than a certain length.
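One way to make the condition swappable is to wrap the run logic in a small helper that accepts any logical vector. `streak_counter()` below is a hypothetical name introduced for this sketch, not a dplyr function; it assumes that NA in the condition should count as "not in a streak".

```r
library(dplyr)

# Hypothetical helper: for each element, return its position inside the
# current run of TRUE values (1, 2, 3, ...) and 0 outside of any run
streak_counter <- function(flag) {
  flag <- coalesce(flag, FALSE)                          # treat NA as FALSE
  run_id <- cumsum(flag != lag(flag, default = FALSE))   # new id at every flip
  ave(as.integer(flag), run_id, FUN = cumsum)            # count within each run
}

price <- c(100, 102, 101, 104, 106, 108, 107, 109, 111, 110)

up_streak    <- streak_counter(price > lag(price))   # consecutive increases
down_streak  <- streak_counter(price < lag(price))   # consecutive decreases
at_least_106 <- streak_counter(price >= 106)         # consecutive days at or above 106

print(data.frame(price, up_streak, down_streak, at_least_106))
```

Because the helper returns one value per row, it can be used inside mutate(), and a condition such as `up_streak >= 3` then picks out the rows on which a streak has reached a given length.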

Additional Value:

Beyond identifying streaks, this code can also be used to:

  • Analyze trends: Identify periods of sustained growth or decline.
  • Segment data: Group data based on streaks for further analysis.
  • Create visualizations: Visualize streaks using charts to gain further insight.
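The first two ideas can be sketched by splitting the whole series into alternating "up" and "down" segments and summarising each one. The column names `change`, `direction`, `segment_id`, `from`, `to`, `days`, and `net_change` are assumptions made for this example.

```r
library(dplyr)

stock_data <- data.frame(
  date = seq(as.Date("2023-01-01"), as.Date("2023-01-10"), by = "day"),
  price = c(100, 102, 101, 104, 106, 108, 107, 109, 111, 110)
)

segments <- stock_data %>%
  mutate(change = price - lag(price)) %>%
  filter(!is.na(change)) %>%          # the first day has no previous price
  mutate(
    direction = case_when(
      change > 0 ~ "up",
      change < 0 ~ "down",
      TRUE ~ "flat"
    ),
    # Start a new segment whenever the direction differs from the previous row
    segment_id = cumsum(direction != lag(direction, default = first(direction))) + 1
  ) %>%
  group_by(segment_id, direction) %>%
  summarise(
    from = first(date),               # first day of the segment
    to = last(date),                  # last day of the segment
    days = n(),                       # number of days in the segment
    net_change = sum(change),         # total price move over the segment
    .groups = "drop"
  )

print(segments)
```

Each row of `segments` describes one period of sustained growth or decline, which makes it a convenient input for further grouping or for plotting.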

By understanding the concept of streaks and using the powerful tools provided by dplyr, you can effectively analyze your data to discover meaningful patterns and insights.