Parsing "aTimeLogger" Android app data to graphs using pandas

Drone4four · May-12-2024, 09:15 AM

Greetings Pythonistas!

I’ve got a data set in CSV format which spans from July 2019 up to December 2023. It is a collection of productivity intervals that I’ve meticulously recorded while doing social science research like Philosophy as well as learning how to program Python, Django, some DevOps, and a little algorithmic design. When I finish spending 25 minutes on my lunch hour on my tablet watching Udemy course content teaching the basics of Binary Search Trees, at the end of my lunch break, using an app called aTimeLogger on my Android phone I enter the “Activity” type (discrete subjects such as “Python”, “algorithms”, “Django”, “sysadmin”, or even “Magick” / “writing”). I also enter the date, the start time, and the end time. After entering a start time and end time, a time delta is automatically calculated. Then I write an annotation (1-2 sentences) which is sort of like a mental note for my future reference. See attached for the data set. Take note: I purged the “Comment” (annotation) data objects (string fields) from the data set for the sake of this forum thread.

For additional context, over the term of the data set, I’ve spent a total of ~378 hours doing something “Python” related and ~579 hours working on Django course content (or a Django based web project). Those are the two largest categories. The rest of the Activities aren’t as data dense.

I am trying to plot these activities as line graphs over time. The x-axis is the continuous element of time. The y-axis are the summed duration time deltas for every entry. If I take just the ‘Python’ activity and plot every instance as a line graph over the ~5 year span/period, the data points are noisy, busy, and hard to follow. But they still plot successfully. Here is my initilization of the data:

import pandas as pd
import matplotlib.pyplot as plt

bulk_df = pd.read_csv("data/all-comments-removed.csv",parse_dates=["From","To",])
bulk_df['Duration'] = pd.to_timedelta(bulk_df['Duration'])

Here is some general info about my dataset:

bulk_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2583 entries, 0 to 2582
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype          
---  ------    --------------  -----          
 0   Activity  2583 non-null   object         
 1   Duration  2583 non-null   timedelta64[ns]
 2   From      2583 non-null   datetime64[ns] 
 3   To        2583 non-null   datetime64[ns] 
dtypes: datetime64[ns](2), object(1), timedelta64[ns](1)
memory usage: 80.8+ KB

Here is my ‘Python’ activity data parsed:

# Create a DataFrame
python = bulk_df[bulk_df["Activity"] == "Python"] #.rolling(window=90).mean()

# Plotting
python.plot(x='From', y='Duration', kind='line', marker='o',figsize=(14, 8))

Here is the output so far:

That kind of works.

But what I really want to do is use pandas’ rolling() method and pass in integers like ‘182’ days for a half year and ‘90’ for 3 months (quarter year). I tried that . You can see my code snippet above where I tried to use .rolling(). My Jupyter Notebook didn’t like it so I commented it out. I am clearly doing something wrong.

Here was a separate attempt at a different stage in my coding session this morning where I tried to plot the “Python” activity with the original noisy data points which I tried to simulatenously contrast with a rolling window:

# Convert timedelta to hours
bulk_df['Duration_hours'] = bulk_df['Duration'].dt.total_seconds() / 3600

# Create a DataFrame
python = bulk_df[bulk_df["Activity"] == "Python"]

# Plotting
# python.plot(x='From', y='Duration_hours', kind='line', marker='o')

# Calculate the rolling mean with a window of 2
rolling_mean = python['Duration_hours'].rolling(window=2).mean()

# Plotting
python.plot(x='From', y='Duration_hours', kind='line', marker='o',figsize=(14, 8))
rolling_mean.plot(x='From', y='Duration_hours', kind='line', marker='o', label='Rolling Mean')

Here was the strange output:

How do I tell pandas to show the quarterly and half year averages of the timedelta data point for the “Python” activity. Thanks!

Drone4four · May-12-2024, 07:55 PM

Toq uote myself from earlier:

(May-12-2024, 09:15 AM)Drone4four Wrote:

# Create a DataFrame
python = bulk_df[bulk_df["Activity"] == "Python"] #.rolling(window=90).mean()

# Plotting
python.plot(x='From', y='Duration', kind='line', marker='o',figsize=(14, 8))

At line 2, when I invoke pandas' dt module to parse quarter years with this:

python = bulk_df[bulk_df["Activity"] == "Python"].dt.quarter

I get this traceback:

Error:---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_3152351/2962891946.py in ?()
      1 # Create a DataFrame
      2 # python = bulk_df[bulk_df["Activity"] == "Python"] #.rolling(window=90).mean()
----> 3 python = bulk_df[bulk_df["Activity"] == "Python"].dt.quarter
      4 
      5 # Plotting
      6 python.plot(x='From', y='Duration', kind='line', marker='o',figsize=(14, 8))

/usr/lib/python3.11/site-packages/pandas/core/generic.py in ?(self, name)
   5898             and name not in self._accessors
   5899             and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5900         ):
   5901             return self[name]
-> 5902         return object.__getattribute__(self, name)

AttributeError: 'DataFrame' object has no attribute 'dt'

***snippsat*** · May-13-2024, 09:26 AM

Can try something like this.

import pandas as pd
pd.set_option('display.expand_frame_repr', False)
import matplotlib.pyplot as plt

bulk_df = pd.read_csv('all-comments-removed.csv', parse_dates=["From", "To"])
bulk_df['Duration'] = pd.to_timedelta(bulk_df['Duration'])
bulk_df['Duration_hours'] = bulk_df['Duration'].dt.total_seconds() / 3600
# Copy so changes made to python_df dos not affect bulk_df and vice versa
python_df = bulk_df[bulk_df["Activity"] == "Python"].copy()
python_df.set_index('From', inplace=True)
# Calculate rolling means using the index now
python_df['Rolling_Mean_90'] = python_df['Duration_hours'].rolling('90D').mean()
python_df['Rolling_Mean_182'] = python_df['Duration_hours'].rolling('182D').mean()
# Plot
plt.figure(figsize=(14, 8))
plt.plot(python_df.index, python_df['Duration_hours'], marker='o', linestyle='', label='Raw Duration Data')
plt.plot(python_df.index, python_df['Rolling_Mean_90'], label='90-Day Rolling Average')
plt.plot(python_df.index, python_df['Rolling_Mean_182'], label='182-Day Rolling Average')
plt.title("Python Activity with Rolling Averages")
plt.xlabel("Date")
plt.ylabel("Hours Spent")
plt.legend()
plt.show()

Drone4four · May-19-2024, 07:38 AM

@snippsat: Thank you! This is terrific. It works really well.

I made some light changes. For example, I commented out the “Python” activity raw duration data as well as the “Python” 182-day rolling mean line and then added the “Django” activity 90-day rolling mean. This enables me to compare the two categories. Here is how it looks now:

I thought the average/mean was what I needed, but apparently not. The lines gyrate sporadically. That’s not really what I had in mind. To refine my intention, what I was intending to achieve is all the “Python” activity data stacked (summed together) in quarterly increments. There should only be one data point every 3 months.

There is another way of explaining what I have set out to accomplish. If I replot the data as a bar graph with:

plt.bar(python_df.index, python_df['Rolling_Mean_90'], label='Python 90-Day Rolling Average')
plt.bar(django_df.index, django_df['Rolling_Mean_90'], label='Django 90-Day Rolling Average', color="red")

As-is that parses like this:

It looks to me like every individual data point is getting plotted. What I really want to do is stack all the “Python” and “Django” data points collected and summed together in 90 day intervals rather than all individually.

What changes do I need to make in order to achieve that?

LauraB · May-20-2024, 07:20 PM

Its really difficult to understand! Bonkself

ericxzhou · May-22-2024, 12:37 AM

Try this:

python_df_Month = python_df['Rolling_Mean_90'].resample('MS').sum()
django_df_Month = django_df['Rolling_Mean_90'].resample('MS').sum()
py_dj_Month_combined = python_df_Month.add(django_df_Month, fill_value=0)
plt.figure(figsize=(14, 8))
plt.bar(py_dj_Month_combined.index, py_dj_Month_combined, label='django-90-day Rolling Mean', color='blue')
plt.bar(python_df_Month.index, python_df_Month, label='python-90-day Rolling Mean', color='red')
plt.legend()
plt.show()

Drone4four · 1 hour ago

@ericxzhou! Thank you. This worked. My apologies on my delayed response.

Here is how Jupyter with pandas and matplotlib parse the code now:

As you can see in the bottom (second) bar graph, all the data points are showing. Thanks gain ericxzhou. Although based mostly on your suggestions, I made some slight modifications. Below is my latest code snippet.

import pandas as pd
pd.set_option('display.expand_frame_repr', False)
import matplotlib.pyplot as plt

# Load the data
bulk_df = pd.read_csv('data/all-comments-removed.csv', parse_dates=["From", "To"])
bulk_df['Duration'] = pd.to_timedelta(bulk_df['Duration'])
bulk_df['Duration_hours'] = bulk_df['Duration'].dt.total_seconds() / 3600

# Copy and filter data for Python activity
python_df = bulk_df[bulk_df["Activity"] == "Python"].copy()
python_df.set_index('From', inplace=True)

# Calculate rolling means
python_df['Rolling_Mean_90'] = python_df['Duration_hours'].rolling('90D').mean()
python_df['Rolling_Mean_182'] = python_df['Duration_hours'].rolling('182D').mean()

# Plot raw data and rolling mean
plt.figure(figsize=(12, 6))
plt.plot(python_df.index, python_df['Duration_hours'], marker='o', linestyle='', label='Raw Python Duration Data')
plt.plot(python_df.index, python_df['Rolling_Mean_90'], label='90-Day Rolling Average')
# plt.plot(python_df.index, python_df['Rolling_Mean_182'], label='182-Day Rolling Average')
plt.title("Python Activity with Rolling Averages")
plt.xlabel("Date")
plt.ylabel("Hours Spent")
plt.legend()

# Resample and combine data
python_df_Month = python_df['Rolling_Mean_90'].resample('MS').sum()
django_df_Month = django_df['Rolling_Mean_90'].resample('MS').sum()
py_dj_Month_combined = python_df_Month.add(django_df_Month, fill_value=0)

# Plot the combined data with wider bars
plt.figure(figsize=(12, 6))
plt.bar(py_dj_Month_combined.index, py_dj_Month_combined, width=20, label='Django 90-Day Rolling Mean')  # Adjust width as needed
plt.bar(python_df_Month.index, python_df_Month, width=20, label='Python 90-Day Rolling Mean')  # Adjust width as needed
plt.legend()
plt.show()

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	Parsing and summing time deltas (duration) onto bar + pie charts using pandas - - DRY	Drone4four	2	726	Feb-10-2024, 06:04 PM Last Post: Drone4four
	Grouping in pandas/multi-index data frame	Aleqsie	3	833	Jan-06-2024, 03:55 PM Last Post: deanhystad
	How to further boost the data read write speed using pandas	tjk9501	1	1,330	Nov-14-2022, 01:46 PM Last Post: jefsummers
	How to plot 2 graphs in one figure?	man0s	1	1,450	Apr-25-2022, 09:18 AM Last Post: Axel_Erfurt
	can't access data from URL in pandas/jupyter notebook	aaanoushka	1	1,937	Feb-13-2022, 01:19 PM Last Post: jefsummers
	Sorting data with pandas	TheZaind	4	2,447	Nov-22-2021, 07:33 PM Last Post: aserian
	Pandas Data frame column condition check based on length of the value	aditi06	1	2,785	Jul-28-2021, 11:08 AM Last Post: jefsummers
	[Pandas] Write data to Excel with dot decimals	manonB	1	6,053	May-05-2021, 05:28 PM Last Post: ibreeden
	pandas.to_datetime: Combine data from 2 columns	ju21878436312	1	2,515	Feb-20-2021, 08:25 PM Last Post: perfringo
	pandas read_csv can't handle missing data	mrdominikku	0	2,595	Jul-09-2020, 12:26 PM Last Post: mrdominikku

Parsing "aTimeLogger" Android app data to graphs using pandas

User Panel Messages

Announcements