How do you group a dataframe by year and month? - python

I'm using the omicron dataset from Kaggle, and I wanted to make a seaborn lineplot of omicron cases in Czechia over time.
I did this, but the x axis label is unreadable, since every single day is put on there. Could you help me sort the dataframe, so that I could visualize only the summed cases for each month of every year?
Here's my code so far:
data = pd.read_csv("..input/omicron-covid19-variant-daily-cases/covid-variants.csv")
data = data[data.variant.str.contains("Omicron")] # a mask with only Omicron cases
data = data[data.location.str.contains("Czech")] # mask only with cases from Czech republic
plt.figure(figsize=(10, 10))
plt.title("Omicron in Czech Republic")
plt.ylabel("Number of cases")
sns.lineplot(x=data.date, y=data.num_sequences_total)
I tried something with the groupby() method, but I only got a Series whose index has two levels, both named "date", and I don't know what to do next.
test = data
test.date = pd.to_datetime(data.date)
test = data.groupby([data.date.dt.year, data.date.dt.month]).num_sequences_total.sum()
test.head()
Please help me figure this out, I'm stuck 🥲

I always use this to group by year and month (it assumes the DataFrame has a DatetimeIndex).
Example:
GB=DF.groupby([(DF.index.year),(DF.index.month)]).sum()
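If the dates live in a column rather than in the index (as in the question), one option is to group on a monthly period instead. This is a minimal sketch, not the answerer's method: the column names `date` and `num_sequences_total` come from the question, while the values are invented for illustration.

```python
import pandas as pd

# Invented sample rows; only the column names match the question's dataset.
data = pd.DataFrame({
    "date": ["2021-11-29", "2021-12-13", "2021-12-27", "2022-01-05"],
    "num_sequences_total": [10, 20, 30, 40],
})
data["date"] = pd.to_datetime(data["date"])

# One row per calendar month, with the individual values summed.
monthly = (
    data.groupby(data["date"].dt.to_period("M"))["num_sequences_total"]
    .sum()
    .reset_index()
)

# Period values are not understood by every plotting function, so convert
# them to plain strings such as "2021-12" for use as x-axis labels.
monthly["date"] = monthly["date"].astype(str)
print(monthly)
```

The resulting frame has one row per month, so passing it to `sns.lineplot(data=monthly, x="date", y="num_sequences_total")` should give one tick per month instead of one per day.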

Related

Why does pandas not recognize X values in a chart as distinct?

I have a DataFrame with the following Columns:
countriesAndTerritories = Name of the Countries (Only contains Portugal and Spain)
cases = Number of Covid Cases
This is what the DataFrame looks like:
We have 1 row per "dateRep".
I tried to create a BarChart with the following code:
df.plot.bar(x="countriesAndTerritories", y="cases", rot=70,
title="Number of Covid cases per Country")
The result is the following:
As you can see, instead of the total number of cases per country (Portugal and Spain), I get multiple values on the x axis.
I've tried to investigate a little, but the examples I found all used an inline df. If someone can help me, I'd appreciate it.
PS: I'm used to QlikSense, and what I'm trying to achieve would be something along these lines:
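For the bar chart question above, a possible approach (not from the original thread) is to aggregate the daily rows into one total per country before plotting. The sketch below uses the column names from the question with invented numbers.

```python
import pandas as pd

# Invented sample with one row per date and country, as described above.
df = pd.DataFrame({
    "dateRep": ["01/03/2020", "02/03/2020", "01/03/2020", "02/03/2020"],
    "countriesAndTerritories": ["Portugal", "Portugal", "Spain", "Spain"],
    "cases": [2, 3, 10, 20],
})

# Collapse the daily rows into one total per country before plotting.
totals = df.groupby("countriesAndTerritories", as_index=False)["cases"].sum()
print(totals)
```

Calling `totals.plot.bar(x="countriesAndTerritories", y="cases", rot=70)` on this aggregated frame should then draw one bar per country.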

The matplotlib chart changes when I change the index in python pandas dataframe

I have a dataset of S&P500 historical prices with the date, the price and other data that I don't need right now.
Date Price
0 1981.01 6.19
1 1981.02 6.17
2 1981.03 6.24
3 1981.04 6.25
. . .
and so on till 2020
The date is a float with the year, a dot and the month.
I tried to plot all historical prices with matplotlib.pyplot as plt.
plt.plot(df["Price"].tail(100))
plt.title("S&P500 Composite Historical Data")
plt.xlabel("Date")
plt.ylabel("Price")
This is the result. I used df["Price"].tail(100) so that the difference between the first and the second graph (shown below) is easier to see.
But then I tried to change the index from the default one (0, 1, 2, etc.) to the df["Date"] column in order to see the dates on the x axis.
df = df.set_index("Date")
plt.plot(df["Price"].tail(100))
plt.title("S&P500 Composite Historical Data")
plt.xlabel("Date")
plt.ylabel("Price")
This is the result, and it's quite disappointing.
I have the dates where they should be on the x axis, but the graph is different from the previous one, which is the right one.
If you need the dataset to try out the problem here you can find it.
It is called U.S. Stock Markets 1871-Present and CAPE Ratio.
Hope you've understood everything.
Thanks in advance
UPDATE
I found something that could be causing the problem. In the original dataset, month 10 of each year is written as a float with the trailing zero dropped; for example, October 1884 appears as 1884.1. The problem occurs when pd.to_datetime() is used to turn the Date float series into datetimes: 1884.1 becomes 1884-01-01, the first month of the year, and that distorts the final plot.
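The pitfall described in this update can be reproduced in a couple of lines. This is only an illustration with a single hand-picked value, not code from the original post.

```python
import pandas as pd

# October 1884 stored as a float loses its trailing zero when turned into text.
as_text = str(1884.10)
print(as_text)

# "%m" accepts a single digit, so the "1" is read as January rather than October.
parsed = pd.to_datetime(as_text, format="%Y.%m")
print(parsed.month)
```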
SOLUTION
Finally, I solved my problem!
Yes, the error was the one I explained in the UPDATE paragraph, so I decided to append a "0" wherever the length of the Date (as a string) is 6, to change, for example, 1884.1 ==> 1884.10
df["Date"] = df["Date"].astype(str)  # the length check below needs strings
df["len"] = df["Date"].apply(len)
df["Date"] = df["Date"].where(df["len"] == 7, df["Date"] + "0")
Then I drop the len column I've just created.
df.drop(columns="len", inplace=True)
At the end I converted "Date" to a datetime with pd.to_datetime
df["Date"] = pd.to_datetime(df["Date"], format='%Y.%m')
df = df.set_index("Date")
And then I plot
df["Price"].tail(100).plot()
plt.title("S&P500 Composite Historical Data")
plt.xlabel("Date")
plt.ylabel("Price")
plt.show()
The easiest way would be to transform the date into an actual datetime index. This way matplotlib will automatically pick it up and plot it accordingly. For example, given your date format, you could do:
df["Date"] = pd.to_datetime(df["Date"].map("{:.2f}".format), format='%Y.%m')  # two decimals keep 1884.1 as October
df = df.set_index("Date")
plt.plot(df["Price"].tail(100))
Currently, the first plot you showed is plotting the Price column against the index, which seems to be a regular range index running from 0 to 1800 and something. Each observation is therefore evenly spaced on the x axis (at an interval of 1, the jump from one index value to the next), even though your data starts in 1981. That's why the chart looks reasonable, yet the x-axis values don't.
Now when you set the Date (as a float) to be the index, note that you're not evenly covering the interval between, for example, 1981 and 1982. You have evenly spaced values from 1981.01 to 1981.12, but nothing between 1981.12 and 1982.01. The second chart is therefore plotted exactly as matplotlib is asked to plot it; it just isn't what you want. Setting the index to a DatetimeIndex as described above should remove this issue, as Matplotlib will know how to evenly space the dates along the x-axis.
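The uneven spacing can be checked directly by looking at the gaps between consecutive float dates. A small sketch with invented values in the same year.month format:

```python
import numpy as np

# Float dates in the question's year.month format, spanning a year boundary.
dates = np.array([1981.10, 1981.11, 1981.12, 1982.01, 1982.02])

# Gap between each pair of consecutive x-values, rounded to hide float noise.
gaps = np.round(np.diff(dates), 2)
print(gaps)
```

Within a year the step is 0.01, but the step from December to January is 0.89, so a line plotted against these floats is stretched at every change of year.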
I think your problem is that your Date is of float type, and using it as the x-axis does exactly what is expected for an array of the kind ([2012.01, 2012.02, ..., 2012.12, 2013.01, ...]). You could convert the Date column to a DatetimeIndex first and then use the built-in pandas plot method:
df["Price"].tail(100).plot()
It is not a good idea to treat df['Date'] as float. It should be converted into pandas datetime64[ns]. This can be achieved using pandas pd.to_datetime method.
Try this:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('ie_data.csv')
df=df[['Date','Price']]
df.dropna(inplace=True)
#converting to pandas datetime format (two decimals so that 1884.1 is read as 1884.10, i.e. October)
df['Date'] = df['Date'].astype(float).map('{:.2f}'.format)
df['Date'] = pd.to_datetime(df['Date'], format='%Y.%m')
df.set_index(['Date'],inplace=True)
#plotting
df.plot() #full data plot
df.tail(100).plot() #plotting just the tail
plt.title("S&P500 Composite Historical Data")
plt.xlabel("Date")
plt.ylabel("Price")
plt.show()
Output:

How can I show the labels of the lines in a groupby data plot in pandas, python

I am trying to use a for loop to plot the total order amount for each hour in each city from the total data.
But I only get one plot, and I can't tell which line belongs to which city. How can I label those lines in my code?
If possible, I would also like to know how to get one plot per city instead of multiple lines in one plot.
Your advice will be much appreciated!
Here is my code:
city_grp=all_data.groupby('city') # to get the list of the cities
for cty in all_data['city'].unique():
    cgroup=city_grp.get_group(cty) # to get the df of each city group
    h_grp=cgroup.groupby('Hour') # to group the df by hours
    hs=[h for h,df in h_grp['Quantity Ordered']]
    plt.xticks(hs)
    plt.xlabel('{} Hour in a day'.format(cty))
    plt.ylabel('Quantity Ordered')
    plt.plot(hs,h_grp['Quantity Ordered'].sum())
Here is the plot that I got
Here's a solution (with fake data). Basically, you should use the legend method.
cities = ["New York", "Los Angeles"]
for city in cities:
    p = plt.plot(df.index, df[city])
plt.legend(cities)
The result is:
You can try something like this.
all_data.groupby(["city","Hour"])["Quantity Ordered"].sum().unstack("city").plot()
You may need one more groupby().
See example here https://cmdlinetips.com/2020/05/fun-with-pandas-groupby-aggregate-multi-index-and-unstack/
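The reshaping behind that one-liner can also be written with `pivot_table`, which makes it easier to see what ends up on each axis. A sketch with invented order data; only the column names `city`, `Hour` and `Quantity Ordered` are taken from the question.

```python
import pandas as pd

# Invented orders; the column names follow the question.
all_data = pd.DataFrame({
    "city": ["Boston", "Boston", "Boston", "Dallas", "Dallas"],
    "Hour": [9, 9, 10, 9, 10],
    "Quantity Ordered": [1, 2, 3, 4, 5],
})

# Rows become hours, columns become cities, values are summed quantities.
per_city = all_data.pivot_table(
    index="Hour", columns="city", values="Quantity Ordered", aggfunc="sum"
)
print(per_city)
```

With the data in this shape, `per_city.plot()` draws one labelled line per city with a legend, and `per_city.plot(subplots=True)` puts each city in its own panel, which covers the second part of the question.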

Sanitizing Time Series whose plots shows erratic graph lines

I want to plot timelines, my dates are formatted as day/month/year.
When building the index, I take care of that:
# format Date
test['DATA'] = pd.to_datetime(test['DATA'], format='%d/%m/%Y')
test.set_index('DATA', inplace=True)
and with a double check I see months and days are correctly interpreted:
#the number of month reflect the month, not the day : correctly imported!
test['Year'] = test.index.year
test['Month'] = test.index.month
test['Weekday Name'] = test.index.day_name()  # weekday_name was removed in pandas 1.0
However, when I plot, I see datapoints get connected erratically (although their distribution seems to be correct, since I expect a seasonality):
# Start and end of the date range to extract
start, end = '2018-01', '2018-04'
# Plot daily, weekly resampled, and 7-day rolling mean time series together
fig, ax = plt.subplots()
ax.plot(test.loc['2018', 'TMIN °C'],
marker='.', linestyle='-', linewidth=0.5, label='Daily')
I suspect it may have to do with misinterpreted dates, or that dates are not put in the right sequence, but could not find a way to verify where an error may be.
Could you help me verify how to import my time series correctly?
Oh, it was super simple. I assumed a datetime index was sorted automatically, but one must sort it explicitly:
test.loc['2018-01':'2018-03'].sort_index().index #sorted
test.loc['2018-01':'2018-03'].index #not sorted
This question may be deleted or marked as a duplicate; I leave that to the moderators:
Pandas - Sorting a dataframe by using datetimeindex
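Whether an index is in chronological order can be checked before plotting. A minimal sketch with invented readings; the column name `TMIN °C` and the day/month/year format follow the question.

```python
import pandas as pd

# Invented readings whose dates are deliberately out of order.
test = pd.DataFrame(
    {"TMIN °C": [3.0, 1.0, 2.0]},
    index=pd.to_datetime(
        ["03/01/2018", "01/01/2018", "02/01/2018"], format="%d/%m/%Y"
    ),
)

print(test.index.is_monotonic_increasing)  # False: rows are not in date order

# Sorting the index puts the rows in chronological order for plotting.
test = test.sort_index()
print(test.index.is_monotonic_increasing)
```

A line plot connects points in row order, so an unsorted index makes the line jump back and forth even when every date was parsed correctly.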

Unable to Plot using Seaborn

Hi there. My dataset is as follows:
username switch_state time
abcd sw-off 07:53:15 +05:00
abcd sw-on 07:53:15 +05:00
Using this, I need to find how many times on a given day the switch state is changed, i.e. switched on or off. My test code is given below:
switch_off=df.loc[df['switch_state']=='sw-off']  # only the "off" events
# group by time and username and count the events
groupy_result=switch_off.groupby(['time','username']).count()['switch_state'].unstack()
The result of this groupby is:
print(groupy_result)
username abcd
time
05:08:35 3
07:53:15 3
07:58:40 1
As you can see, the count is shown alongside the time column. I need to separate them so that I can plot the data with a Seaborn scatter plot, with x=time and y=count.
Kindly help me figure out how to plot this.
You can try the following to get the data as a DataFrame itself
df = df.loc[df['switch_state']=='sw-off'].copy()  # copy so the new column is not written to a view
df['count'] = df.groupby(['username','time'])['username'].transform('count')
The two lines of code will give you an updated data frame df, which will add a column called count.
df = df.drop_duplicates(subset=['username', 'time'], keep='first')
The above line will remove the duplicate rows. Then you can plot df['time'] and df['count'].
plt.scatter(df['time'], df['count'])
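An alternative that avoids the helper column and the duplicate removal is to count the rows per group directly. This is a sketch with invented rows in the same shape as the question's data, not the answerer's code.

```python
import pandas as pd

# Invented rows in the same shape as the question's data.
df = pd.DataFrame({
    "username": ["abcd", "abcd", "abcd", "abcd"],
    "switch_state": ["sw-off", "sw-off", "sw-on", "sw-off"],
    "time": ["05:08:35", "05:08:35", "07:53:15", "07:53:15"],
})

switch_off = df[df["switch_state"] == "sw-off"]

# size() counts rows per group; reset_index turns the result into ordinary columns.
counts = switch_off.groupby(["username", "time"]).size().reset_index(name="count")
print(counts)
```

Because `time` and `count` are now regular columns, the frame can be handed to Seaborn as `sns.scatterplot(data=counts, x="time", y="count")`.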
