Python Forum
Parsing "aTimeLogger" Android app data to graphs using pandas
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Parsing "aTimeLogger" Android app data to graphs using pandas
#1
Greetings Pythonistas!

I’ve got a data set in CSV format which spans from July 2019 up to December 2023. It is a collection of productivity intervals that I’ve meticulously recorded while doing social science research like Philosophy as well as learning how to program Python, Django, some DevOps, and a little algorithmic design. When I finish spending 25 minutes on my lunch hour on my tablet watching Udemy course content teaching the basics of Binary Search Trees, at the end of my lunch break, using an app called aTimeLogger on my Android phone I enter the “Activity” type (discrete subjects such as “Python”, “algorithms”, “Django”, “sysadmin”, or even “Magick” / “writing”). I also enter the date, the start time, and the end time. After entering a start time and end time, a time delta is automatically calculated. Then I write an annotation (1-2 sentences) which is sort of like a mental note for my future reference. See attached for the data set. Take note: I purged the “Comment” (annotation) data objects (string fields) from the data set for the sake of this forum thread.

For additional context, over the term of the data set, I’ve spent a total of ~378 hours doing something “Python” related and ~579 hours working on Django course content (or a Django based web project). Those are the two largest categories. The rest of the Activities aren’t as data dense.

I am trying to plot these activities as line graphs over time. The x-axis is the continuous element of time. The y-axis are the summed duration time deltas for every entry. If I take just the ‘Python’ activity and plot every instance as a line graph over the ~5 year span/period, the data points are noisy, busy, and hard to follow. But they still plot successfully. Here is my initilization of the data:

import pandas as pd
import matplotlib.pyplot as plt

bulk_df = pd.read_csv("data/all-comments-removed.csv",parse_dates=["From","To",])
bulk_df['Duration'] = pd.to_timedelta(bulk_df['Duration'])
Here is some general info about my dataset:

bulk_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2583 entries, 0 to 2582
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype          
---  ------    --------------  -----          
 0   Activity  2583 non-null   object         
 1   Duration  2583 non-null   timedelta64[ns]
 2   From      2583 non-null   datetime64[ns] 
 3   To        2583 non-null   datetime64[ns] 
dtypes: datetime64[ns](2), object(1), timedelta64[ns](1)
memory usage: 80.8+ KB
Here is my ‘Python’ activity data parsed:

# Create a DataFrame
python = bulk_df[bulk_df["Activity"] == "Python"] #.rolling(window=90).mean()

# Plotting
python.plot(x='From', y='Duration', kind='line', marker='o',figsize=(14, 8))
Here is the output so far:

   

That kind of works.

But what I really want to do is use pandas’ rolling() method and pass in integers like ‘182’ days for a half year and ‘90’ for 3 months (quarter year). I tried that . You can see my code snippet above where I tried to use .rolling(). My Jupyter Notebook didn’t like it so I commented it out. I am clearly doing something wrong.

Here was a separate attempt at a different stage in my coding session this morning where I tried to plot the “Python” activity with the original noisy data points which I tried to simulatenously contrast with a rolling window:

# Convert timedelta to hours
bulk_df['Duration_hours'] = bulk_df['Duration'].dt.total_seconds() / 3600

# Create a DataFrame
python = bulk_df[bulk_df["Activity"] == "Python"]

# Plotting
# python.plot(x='From', y='Duration_hours', kind='line', marker='o')

# Calculate the rolling mean with a window of 2
rolling_mean = python['Duration_hours'].rolling(window=2).mean()

# Plotting
python.plot(x='From', y='Duration_hours', kind='line', marker='o',figsize=(14, 8))
rolling_mean.plot(x='From', y='Duration_hours', kind='line', marker='o', label='Rolling Mean')
Here was the strange output:

   

How do I tell pandas to show the quarterly and half year averages of the timedelta data point for the “Python” activity. Thanks!

Attached Files

.csv   all-comments-removed.csv (Size: 144.36 KB / Downloads: 3)
Reply
#2
Toq uote myself from earlier:

(May-12-2024, 09:15 AM)Drone4four Wrote:
# Create a DataFrame
python = bulk_df[bulk_df["Activity"] == "Python"] #.rolling(window=90).mean()

# Plotting
python.plot(x='From', y='Duration', kind='line', marker='o',figsize=(14, 8))

At line 2, when I invoke pandas' dt module to parse quarter years with this:

python = bulk_df[bulk_df["Activity"] == "Python"].dt.quarter
I get this traceback:

Error:
--------------------------------------------------------------------------- AttributeError Traceback (most recent call last) /tmp/ipykernel_3152351/2962891946.py in ?() 1 # Create a DataFrame 2 # python = bulk_df[bulk_df["Activity"] == "Python"] #.rolling(window=90).mean() ----> 3 python = bulk_df[bulk_df["Activity"] == "Python"].dt.quarter 4 5 # Plotting 6 python.plot(x='From', y='Duration', kind='line', marker='o',figsize=(14, 8)) /usr/lib/python3.11/site-packages/pandas/core/generic.py in ?(self, name) 5898 and name not in self._accessors 5899 and self._info_axis._can_hold_identifiers_and_holds_name(name) 5900 ): 5901 return self[name] -> 5902 return object.__getattribute__(self, name) AttributeError: 'DataFrame' object has no attribute 'dt'
Reply
#3
Can try something like this.
import pandas as pd
pd.set_option('display.expand_frame_repr', False)
import matplotlib.pyplot as plt

bulk_df = pd.read_csv('all-comments-removed.csv', parse_dates=["From", "To"])
bulk_df['Duration'] = pd.to_timedelta(bulk_df['Duration'])
bulk_df['Duration_hours'] = bulk_df['Duration'].dt.total_seconds() / 3600
# Copy so changes made to python_df dos not affect bulk_df and vice versa
python_df = bulk_df[bulk_df["Activity"] == "Python"].copy()
python_df.set_index('From', inplace=True)
# Calculate rolling means using the index now
python_df['Rolling_Mean_90'] = python_df['Duration_hours'].rolling('90D').mean()
python_df['Rolling_Mean_182'] = python_df['Duration_hours'].rolling('182D').mean()
# Plot
plt.figure(figsize=(14, 8))
plt.plot(python_df.index, python_df['Duration_hours'], marker='o', linestyle='', label='Raw Duration Data')
plt.plot(python_df.index, python_df['Rolling_Mean_90'], label='90-Day Rolling Average')
plt.plot(python_df.index, python_df['Rolling_Mean_182'], label='182-Day Rolling Average')
plt.title("Python Activity with Rolling Averages")
plt.xlabel("Date")
plt.ylabel("Hours Spent")
plt.legend()
plt.show()
Pedroski55 likes this post
Reply
#4
@snippsat: Thank you! This is terrific. It works really well.

I made some light changes. For example, I commented out the “Python” activity raw duration data as well as the “Python” 182-day rolling mean line and then added the “Django” activity 90-day rolling mean. This enables me to compare the two categories. Here is how it looks now:
   
I thought the average/mean was what I needed, but apparently not. The lines gyrate sporadically. That’s not really what I had in mind. To refine my intention, what I was intending to achieve is all the “Python” activity data stacked (summed together) in quarterly increments. There should only be one data point every 3 months.

There is another way of explaining what I have set out to accomplish. If I replot the data as a bar graph with:

plt.bar(python_df.index, python_df['Rolling_Mean_90'], label='Python 90-Day Rolling Average')
plt.bar(django_df.index, django_df['Rolling_Mean_90'], label='Django 90-Day Rolling Average', color="red")
As-is that parses like this:
   
It looks to me like every individual data point is getting plotted. What I really want to do is stack all the “Python” and “Django” data points collected and summed together in 90 day intervals rather than all individually.

What changes do I need to make in order to achieve that?
Reply
#5
Its really difficult to understand! Bonkself
Reply
#6
Try this:

python_df_Month = python_df['Rolling_Mean_90'].resample('MS').sum()
django_df_Month = django_df['Rolling_Mean_90'].resample('MS').sum()
py_dj_Month_combined = python_df_Month.add(django_df_Month, fill_value=0)
plt.figure(figsize=(14, 8))
plt.bar(py_dj_Month_combined.index, py_dj_Month_combined, label='django-90-day Rolling Mean', color='blue')
plt.bar(python_df_Month.index, python_df_Month, label='python-90-day Rolling Mean', color='red')
plt.legend()
plt.show()
Reply
#7
@ericxzhou! Thank you. This worked. My apologies on my delayed response.

Here is how Jupyter with pandas and matplotlib parse the code now:

   

As you can see in the bottom (second) bar graph, all the data points are showing. Thanks gain ericxzhou. Although based mostly on your suggestions, I made some slight modifications. Below is my latest code snippet.

import pandas as pd
pd.set_option('display.expand_frame_repr', False)
import matplotlib.pyplot as plt

# Load the data
bulk_df = pd.read_csv('data/all-comments-removed.csv', parse_dates=["From", "To"])
bulk_df['Duration'] = pd.to_timedelta(bulk_df['Duration'])
bulk_df['Duration_hours'] = bulk_df['Duration'].dt.total_seconds() / 3600

# Copy and filter data for Python activity
python_df = bulk_df[bulk_df["Activity"] == "Python"].copy()
python_df.set_index('From', inplace=True)

# Calculate rolling means
python_df['Rolling_Mean_90'] = python_df['Duration_hours'].rolling('90D').mean()
python_df['Rolling_Mean_182'] = python_df['Duration_hours'].rolling('182D').mean()

# Plot raw data and rolling mean
plt.figure(figsize=(12, 6))
plt.plot(python_df.index, python_df['Duration_hours'], marker='o', linestyle='', label='Raw Python Duration Data')
plt.plot(python_df.index, python_df['Rolling_Mean_90'], label='90-Day Rolling Average')
# plt.plot(python_df.index, python_df['Rolling_Mean_182'], label='182-Day Rolling Average')
plt.title("Python Activity with Rolling Averages")
plt.xlabel("Date")
plt.ylabel("Hours Spent")
plt.legend()

# Resample and combine data
python_df_Month = python_df['Rolling_Mean_90'].resample('MS').sum()
django_df_Month = django_df['Rolling_Mean_90'].resample('MS').sum()
py_dj_Month_combined = python_df_Month.add(django_df_Month, fill_value=0)

# Plot the combined data with wider bars
plt.figure(figsize=(12, 6))
plt.bar(py_dj_Month_combined.index, py_dj_Month_combined, width=20, label='Django 90-Day Rolling Mean')  # Adjust width as needed
plt.bar(python_df_Month.index, python_df_Month, width=20, label='Python 90-Day Rolling Mean')  # Adjust width as needed
plt.legend()
plt.show()
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  Parsing and summing time deltas (duration) onto bar + pie charts using pandas - - DRY Drone4four 2 726 Feb-10-2024, 06:04 PM
Last Post: Drone4four
  Grouping in pandas/multi-index data frame Aleqsie 3 833 Jan-06-2024, 03:55 PM
Last Post: deanhystad
Smile How to further boost the data read write speed using pandas tjk9501 1 1,330 Nov-14-2022, 01:46 PM
Last Post: jefsummers
  How to plot 2 graphs in one figure? man0s 1 1,450 Apr-25-2022, 09:18 AM
Last Post: Axel_Erfurt
Thumbs Up can't access data from URL in pandas/jupyter notebook aaanoushka 1 1,937 Feb-13-2022, 01:19 PM
Last Post: jefsummers
Question Sorting data with pandas TheZaind 4 2,447 Nov-22-2021, 07:33 PM
Last Post: aserian
  Pandas Data frame column condition check based on length of the value aditi06 1 2,785 Jul-28-2021, 11:08 AM
Last Post: jefsummers
  [Pandas] Write data to Excel with dot decimals manonB 1 6,053 May-05-2021, 05:28 PM
Last Post: ibreeden
  pandas.to_datetime: Combine data from 2 columns ju21878436312 1 2,515 Feb-20-2021, 08:25 PM
Last Post: perfringo
  pandas read_csv can't handle missing data mrdominikku 0 2,595 Jul-09-2020, 12:26 PM
Last Post: mrdominikku

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020