I made this dataframe, which contains dates as datetime64 values.
What I want to do is a bit of a contrived example, but it illustrates my point about selecting on multiple criteria.
I want to:
For a single year: plot a bar chart grouped per month of the different values, i.e. one graph for that year showing 12 groups of 3 bars on the x-axis. (The sample data below covers 2006-2008, so substitute any year in that range.)
I hope someone has some idea how this works.
Thank you in advance
import pandas as pd
import numpy as np
import random
date_expected = np.arange('2006-01', '2008-06', dtype= 'datetime64[D]')
cat = ['True','False', 'Maybe']
value = [random.choice(cat) for i in range(len(date_expected))]
data = {'Date_expected': date_expected, 'Value': value }
df = pd.DataFrame(data)
print(df)
First, create a column with the month. Then group by month and value and count the rows.
Unstacking gives one count column per value, so the result can be plotted as a bar chart.
df['month'] = df['Date_expected'].dt.month
df.groupby(['month', 'Value']).count().unstack().plot(kind='bar')
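To restrict the chart to a single year first, one way is to filter on the year before grouping. A minimal sketch, rebuilding the question's sample data (the year 2007 is just picked from its 2006-2008 range; substitute whichever year you need):

```python
import random

import numpy as np
import pandas as pd

# Rebuild the sample data from the question
date_expected = np.arange('2006-01', '2008-06', dtype='datetime64[D]')
random.seed(0)
value = [random.choice(['True', 'False', 'Maybe']) for _ in date_expected]
df = pd.DataFrame({'Date_expected': date_expected, 'Value': value})

# Keep only one year, then count rows per (month, Value) pair
one_year = df[df['Date_expected'].dt.year == 2007]
counts = (one_year.groupby([one_year['Date_expected'].dt.month, 'Value'])
                  .size()
                  .unstack('Value'))
# counts.plot(kind='bar')  # one figure: 12 month groups of 3 bars
```

`counts` has one row per month and one column per value, which is exactly the shape `plot(kind='bar')` needs for grouped bars.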
I have a dataframe with three columns and almost 800,000 rows. I want a line plot where the x-axis is DateTime and the y-axis is Value. The catch is that I want a separate line for EACH code (there are 6 different codes) in the same plot. The codes do not all have the same number of rows, but that does not matter. In the end I want one plot with 6 lines, with DateTime on the x-axis and Value on the y-axis. I have tried many things but cannot get it to plot.
Here is a sample of my dataframe
import pandas as pd
# initialise data of lists
data = {'Code': ['AABB', 'AABC', 'AABB', 'AABC', 'AABD', 'AABC', 'AABB', 'AABC'],
        'Value': [1, 1, 2, 2, 1, 3, 3, 4],
        'Datetime': ['2022-03-29', '2022-03-29', '2022-03-30', '2022-03-30',
                     '2022-03-30', '2022-03-31', '2022-03-31', '2022-03-31']}
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
print(df)
I tried this, but it plots something that does not make any sense (note the column names here must match the dataframe exactly):
import matplotlib.pyplot as plt
plt.plot(df['Datetime'], df['Value'], linewidth=2.0, color='b', alpha=0.5, marker='o')
Your data
There is a duplicate record, as mentioned by @claudio: there are two rows for AABC on 2022/3/31 (values 3 and 4). In such cases I have taken the average of the two values (3.5 here). Also, there is only one entry for AABD, and a line needs two points, so I have added another AABD entry at the end. Finally, the column Datetime has been converted from string to datetime using the pandas function pd.to_datetime().
The method
You can use pivot_table() to reshape the data into a format that can be plotted as lines: the datetime becomes the index, each unique Code becomes a column (so each column turns into a separate line), and Value supplies the values. Note that aggfunc='mean' handles duplicate entries by taking the mean of the datapoints. Once the pivot table is created, it can be plotted with pandas' plot().
Code
import pandas as pd
# initialise data of lists
data = {'Code':['AABB', 'AABC', 'AABB', 'AABC','AABD', 'AABC', 'AABB', 'AABC', 'AABD'],
'Value':[1, 1, 2, 2,1,3,3,4, 4],
'Datetime': ['2022-03-29','2022-03-29','2022-03-30','2022-03-30','2022-03-30','2022-03-31','2022-03-31','2022-03-31','2022-03-31']}
# Create DataFrame
df = pd.DataFrame(data)
df['Datetime'] = pd.to_datetime(df['Datetime'])
df1 = df.pivot_table(index='Datetime', columns='Code', values='Value', aggfunc='mean')
#print the pivoted data
print(df1)
df1.plot()
Output
>>> df1
Code AABB AABC AABD
Datetime
2022-03-29 1.0 1.0 NaN
2022-03-30 2.0 2.0 1.0
2022-03-31 3.0 3.5 NaN
I have a dataset with dates as the index, and each column is the name of an item with a count as the value. I'm trying to work out how to find, for each column, runs of more than 3 consecutive days where the count is zero. I was thinking of using a for loop; any help is appreciated. I'm using Python for this project.
I'm fairly new to Python. So far I tried for loops, but could not get them to work in any way:
for i in a.index:
    if a.loc[i, 'name'] == 3 == a.loc[i+1, 'name'] == a.loc[i+2, 'name']:
        print(a.loc[i, 'name'])
which raises:
Cannot add integral value to Timestamp without freq.
It would be better if you included a sample dataframe and desired output in your question; please do so next time. As it stands, I have to guess what your data looks like and may not be answering your question. I assume the values are integers. Does your dataframe have a row for every day? I will assume it might not, and make sure every day in the last delta days has a row. I created a sample dataframe like this:
import pandas as pd
import numpy as np
import datetime
# Here I am just creating random data from your description
delta = 365
start_date = datetime.datetime.now() - datetime.timedelta(days=delta)
end_date = datetime.datetime.now()
datetimes = [end_date - diff for diff in [datetime.timedelta(days=i) for i in range(delta,0,-1)]]
# This is the list of dates we will have in our final dataframe (includes all days)
dates = pd.Series([date.strftime('%Y-%m-%d') for date in datetimes], name='Date', dtype='datetime64[ns]')
# random integer dataframe
df = pd.DataFrame(np.random.randint(0, 5, size=(delta,4)), columns=['item' + str(i) for i in range(4)])
df = pd.concat([df, dates], axis=1).set_index('Date')
# Create a missing day
df = df.drop(df.loc['2019-08-01'].name)
# Reindex so that index has all consecutive days
df = df.reindex(index=dates)
Now that we have a sample dataframe, the rest is straightforward. I check whether each value in the dataframe equals 0 and then take a rolling sum with a window of 4 (i.e. >3); this avoids for loops. The resulting dataframe has all the rows where at least one of the items had a value of 0 for 4 consecutive rows. If an item is 0 for more than window consecutive rows, it shows up as multiple rows whose dates are just one day apart. I hope that makes sense.
# custom function: "np.nan" is returned if a value does not equal "test_value"
def equals(df_value, test_value=0):
    return 1 if df_value == test_value else np.nan

# apply the function to every value in the dataframe,
# then for each row sum the window of 4 rows (>3)
df = df.applymap(equals).rolling(window=4).sum()
# a np.nan anywhere in the window makes the sum np.nan, so those rows can be dropped
# keep the rows where there is at least 1 value
df = df.dropna(thresh=1)
# drop all columns that don't have any values left
df = df.dropna(thresh=1, axis=1)
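A minimal, self-contained check of this rolling-window idea on a made-up toy frame (the item names and values here are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy data: item0 has one 4-day run of zeros, item1 never does
idx = pd.date_range('2023-01-01', periods=8, freq='D')
df = pd.DataFrame({'item0': [1, 0, 0, 0, 0, 2, 3, 1],
                   'item1': [0, 1, 0, 2, 0, 3, 0, 4]}, index=idx)

# 1 where the value is 0, NaN elsewhere
marks = df.eq(0).astype(int).replace(0, np.nan)
# the rolling sum is only non-NaN when all 4 values in the window were 0
runs = marks.rolling(window=4).sum()
# keep only the rows and columns where such a run ended
hits = runs.dropna(how='all').dropna(axis=1, how='all')
```

Here `hits` ends up with a single row, dated at the last day of item0's zero run, and item1 drops out entirely.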
So I have a dataframe with the indices as datetime objects.
I have created a new column to indicate which month each 'ride' in the dataframe is in:
import numpy as np
import datetime as dt
from datetime import datetime
months = df.index.to_series().apply(lambda x:dt.datetime.strftime(x, '%b %Y')).tolist()
df['months'] = months
df1 = df[['distance','months']]
Which gives:
When I try to plot it as a line graph with seaborn, using months as the x-axis, the axis is sorted alphabetically, starting with April, then August, etc.
import seaborn as sns
import matplotlib.pyplot as plt

l = sns.lineplot(x='months', y='distance', data=df1)
plt.xticks(rotation=45)
I don't really understand why it does this, as the months in the dataframe I use are already in ascending order. Is there a way to make the x-axis start at Jan 2018 and end at July 2019?
When you supply an array of strings as the x values, Seaborn sorts them (alphabetically here) before plotting. What you want can be achieved with sort=False (the default is True):
# sort=True (or omitted) sorts the x values
sns.lineplot(x='months', y='distance', data=df1, sort=True)
# sort=False keeps the original order in your DataFrame
sns.lineplot(x='months', y='distance', data=df1, sort=False)
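Alternatively, if you want the axis ordered chronologically regardless of row order, you can parse the month labels back into real datetimes and sort on those. A sketch with a made-up frame shaped like df1 (the column name `month_dt` is just illustrative):

```python
import pandas as pd

# Hypothetical frame shaped like df1 above: month labels as strings, out of order
df1 = pd.DataFrame({'months': ['Feb 2018', 'Jan 2018', 'Mar 2018'],
                    'distance': [5.0, 3.0, 7.0]})

# Parse the labels back into real datetimes, then sort chronologically
df1['month_dt'] = pd.to_datetime(df1['months'], format='%b %Y')
df1 = df1.sort_values('month_dt')

# sns.lineplot(x='month_dt', y='distance', data=df1)  # x-axis now in date order
```

With real datetimes on the x-axis, matplotlib also formats and spaces the ticks sensibly on its own.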
I have a data frame with a date, a category and a value. I'd like to plot the sum-aggregated values per category. For example I want to sum values which happen in 3 day periods, but for each category individually.
An attempt which seems too complicating is
import random
import datetime as dt
import pandas as pd
random.seed(0)
df=pd.DataFrame([[dt.datetime(2000,1,random.randint(1,31)), random.choice("abc"), random.randint(1,3)] for _ in range(100)], columns=["date", "cat", "value"])
df.set_index("date", inplace=True)
result = df.groupby("cat").resample("3d").sum().unstack("cat").value.fillna(0)
result.plot()
This is basically the right logic, but the resampling doesn't have a fixed start, so the date ranges for the 3-day periods don't align between categories (and I get NaN/0 values).
What is a better way to achieve this plot?
I think you should group by cat and date:
df = pd.DataFrame([[dt.datetime(2000,1,random.randint(1,31)), random.choice("abc"), random.randint(1,3)] for _ in range(100)], columns=["date", "cat", "value"])
df.groupby(["cat", pd.Grouper(freq='3d',key='date')]).sum().unstack(0).fillna(0).plot()
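A small sketch of why pd.Grouper helps here: the 3-day bins are computed once from the whole date column, so both categories share the same bin edges (the tiny frame below is made up for illustration):

```python
import pandas as pd

# Made-up frame: two categories whose dates fall in the same two 3-day bins
df = pd.DataFrame({'date': pd.to_datetime(['2000-01-01', '2000-01-02',
                                           '2000-01-05', '2000-01-06']),
                   'cat': ['a', 'b', 'a', 'b'],
                   'value': [1, 2, 3, 4]})

# pd.Grouper derives the 3-day bins from the full date column,
# so categories 'a' and 'b' end up with identical bin edges
binned = (df.groupby(['cat', pd.Grouper(freq='3D', key='date')])
            .sum()
            .unstack(0)
            .fillna(0))
# binned['value'].plot()  # one line per category, bins aligned
```

Both categories get rows for the same two bin dates, which is what the per-category resample in the question failed to guarantee.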
I'm trying to plot a DataFrame grouped by month, where the index is a DateTime.
My goal is to plot all months on separate plots.
import numpy as np
import pandas as pd

index = pd.date_range('2011-9-1 00:00:03', '2012-09-01 00:00:03', freq='10min')
df = pd.DataFrame(np.random.rand(len(index), 3), index=index)
df2 = df.groupby(lambda x: x.month)
df2.plot()
This gives me 14 plots (not 12): the first 2 are empty, with years from 2000 to 2010 on the x-axis, and the next two plots are both January.
I'm hoping for your advice on how to cope with this.
What are you trying to achieve? When grouping data you usually aggregate it in some way if you want to plot it. For example:
import numpy as np
import pandas as pd

index = pd.date_range('2011-1-1 00:00:03', '2011-12-31 23:50:03', freq='10min')
df = pd.DataFrame(np.random.rand(len(index), 3), index=index)
df2 = df.groupby(lambda x: x.month)
for key, group in df2:
    group.plot()
Update: a fix for groups that span more than one month. This is probably not the best solution, but it's the first one that came to mind.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

num_ticks = 15
index = pd.date_range('2011-9-1 00:00:03', '2012-09-01 00:00:03', freq='10min')
df = pd.DataFrame(np.random.rand(len(index), 3), index=index)
df2 = df.groupby(lambda x: x.month)
for key, group in df2:
    step = len(group) // num_ticks  # integer step so it can be used in a slice
    reset = group.reset_index()
    reset.plot()
    plt.xticks(reset.index[::step],
               reset['index'][::step].apply(
                   lambda x: x.strftime('%Y-%m-%d')).values,
               rotation=70)
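Since the question's date range spans two calendar years, grouping by month alone also mixes September 2011 with September 2012. A sketch that groups by (year, month) instead, with the plot calls left commented out:

```python
import numpy as np
import pandas as pd

index = pd.date_range('2011-9-1 00:00:03', '2012-09-01 00:00:03', freq='10min')
df = pd.DataFrame(np.random.rand(len(index), 3), index=index)

# Grouping by month alone merges the two Septembers and orders groups Jan..Dec;
# grouping by (year, month) keeps each calendar month separate and in order
groups = dict(list(df.groupby([df.index.year, df.index.month])))

# for (year, month), group in groups.items():
#     group.plot(title='%d-%02d' % (year, month))  # one figure per month
```

This yields 13 groups (September 2011 through September 2012), each covering exactly one calendar month.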