Boxplot a dataframe column by date - python

I have a pandas dataframe of all tweets about a sporting event. The tweets are arranged by date and now include a polarity rating using the afinn sentiment library.
What I'd like to do is create a matplotlib boxplot chart for each day in the range. Unfortunately, I'm a coding newb, and I'm stuck.
Here is my code:
import numpy
import matplotlib.pyplot as plt
%matplotlib inline
boxplot_maker = lambda x: plt.boxplot()
#creating a function that I will run against a specific column in the df
Event_df["P Score"].map(boxplot_maker)
#using the .map function and my boxplot maker to the column in question
Unfortunately, this doesn't work. I get the following error:
TypeError: boxplot() missing 1 required positional argument: 'x'
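The error comes from calling plt.boxplot() with no data: the lambda ignores its argument x, and .map would hand it a single score at a time anyway. One possible way to get a box per day is to group the scores by calendar day and pass one array per day to a single boxplot call. This is only a sketch under assumptions: the datetime column name "Date" and the toy data are made up, since the question does not show the frame.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy stand-in for Event_df; the "Date" column name is an assumption.
rng = np.random.default_rng(0)
Event_df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2021-06-01"] * 5 + ["2021-06-02"] * 5 + ["2021-06-03"] * 5
    ),
    "P Score": rng.integers(-5, 6, size=15),
})

# Collect one array of scores per calendar day.
grouped = Event_df.groupby(Event_df["Date"].dt.date)["P Score"]
days = [str(day) for day, _ in grouped]
scores_per_day = [scores.to_numpy() for _, scores in grouped]

# A list of arrays gives one box per array, at positions 1..n.
fig, ax = plt.subplots()
result = ax.boxplot(scores_per_day)
ax.set_xticks(range(1, len(days) + 1))
ax.set_xticklabels(days)
ax.set_ylabel("P Score")
```

In a notebook the figure is displayed automatically; in a script, call plt.show() or fig.savefig(...) afterwards.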

How to create a heatmap of Pandas dataframe in Python

I'm trying to create a heatmap, following this question:
Making heatmap from pandas DataFrame
My dataframe looks like the following picture:
I tried the following code:
years = ["1860","1870", "1880","1890","1900","1910","1920","1930","1940","1950","1960","1970","1980","1990","2000"]
kantons = ["AG","AI","AR","BE","BL","BS","FR","GE","GL","GR","JU","LU","NE","NW","OW","SG","SH","SO","SZ","TG","TI","UR","VD","VS","ZG","ZH"]
df = pd(abs(dfYears), index=years, columns=kantons)
which raises an exception saying that:
"AG" can not be used as float
So I wondered whether I need to drop the index column, which doesn't seem possible.
Any suggestions?
When replicating similar data, you can do:
import pandas as pd
import numpy as np
years = ["1860","1870", "1880","1890","1900","1910","1920","1930","1940","1950","1960","1970","1980","1990","2000"]
kantons = ["AG","AI","AR","BE","BL","BS","FR","GE","GL","GR","JU","LU","NE","NW","OW","SG","SH","SO","SZ","TG","TI","UR","VD","VS","ZG","ZH"]
df = pd.DataFrame(np.random.randint(low=10000, high=200000, size=(15, 26)), index=years, columns=kantons)
df.style.background_gradient(cmap='Reds')
pandas has some built-in styles for the most common visualization needs. The .background_gradient method is a simple way to highlight cells based on their values; its cmap parameter selects the color map, chosen from the Matplotlib colormaps.
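Note that the constructor is pd.DataFrame; pd on its own is the module and cannot be called. If a figure is wanted rather than a styled table, a rough equivalent can be sketched with Matplotlib's imshow. The data below is the same kind of random stand-in as above, not the real dfYears values.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

years = [str(y) for y in range(1860, 2001, 10)]
kantons = ["AG", "AI", "AR", "BE", "BL", "BS", "FR", "GE", "GL", "GR", "JU",
           "LU", "NE", "NW", "OW", "SG", "SH", "SO", "SZ", "TG", "TI", "UR",
           "VD", "VS", "ZG", "ZH"]

# Random placeholder values, one row per year and one column per canton.
df = pd.DataFrame(
    np.random.randint(low=10000, high=200000, size=(len(years), len(kantons))),
    index=years,
    columns=kantons,
)

# Draw the values as a colored grid and label both axes from the frame.
fig, ax = plt.subplots(figsize=(10, 5))
image = ax.imshow(df.to_numpy(), cmap="Reds", aspect="auto")
ax.set_xticks(range(len(kantons)))
ax.set_xticklabels(kantons)
ax.set_yticks(range(len(years)))
ax.set_yticklabels(years)
fig.colorbar(image, ax=ax)
```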

Matplotlib shows NaN on X axis [duplicate]

This question already has answers here:
Use index in pandas to plot data
I'm learning Python, specifically pandas and Matplotlib at the moment. I have a dataset of Premier League hat-trick scorers and have been using pandas to do some basic analysis. I then want to produce a bar chart based on this data extract. I have been able to create a bar chart, but the X axis shows 'nan' instead of the player names.
My code to extract the data...
import matplotlib.pyplot as plt
import pandas as pd
top10 = df.groupby(['Player'])[['Goals']].count().sort_values(['Goals'],ascending=False).head(10)
This produces the following, which I know is a pandas DataFrame because printing the type of 'top10' gives:
<class 'pandas.core.frame.DataFrame'>
Printing it out produces the following...
I tried to create a chart directly from this DataFrame, but got the error message 'KeyError: Player'.
So I made a new DataFrame and plotted that, which was partly successful, but it displayed 'nan' on the X axis.
top10df = pd.DataFrame(top10,columns=['Player','Goals'])
top10df.plot(x ='Player', y='Goals', kind='bar')
plt.show()
When I created a DataFrame manually it worked, so I'm unsure where to go from here. I tried Googling and searching Stack Overflow with no success. Any ideas, please?
You could plot directly using the groupby results in the following way:
top10.plot(kind='bar', title='Your Title', ylabel='Goals',
           xlabel='Player', figsize=(6, 5))
A dummy example, since you did not supply your data (next time it's best to do so):
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'category': list('XYZXY'),
                   'sex': list('mfmff'),
                   'ThisColumnIsNotUsed': range(5,10)})
x = df.groupby('sex').count()
x.plot(kind='bar', ylabel='category', xlabel='sex')
we get:
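As for why the question's top10df showed 'nan': after the groupby, Player is the index of top10, not a column. Passing columns=['Player','Goals'] to pd.DataFrame therefore creates a new Player column with nothing to fill it, so it is all NaN. A small sketch with made-up data (the player names are placeholders) that shows this, and reset_index() as one way to turn Player back into a regular column:

```python
import pandas as pd

# Placeholder data: one row per hat-trick.
df = pd.DataFrame({
    "Player": ["A", "B", "A", "C", "A", "B"],
    "Goals": [3, 3, 3, 3, 3, 3],
})
top = (
    df.groupby(["Player"])[["Goals"]]
    .count()
    .sort_values(["Goals"], ascending=False)
)

# Player lives in the index, so asking for it as a column yields only NaN.
broken = pd.DataFrame(top, columns=["Player", "Goals"])
player_is_all_nan = broken["Player"].isna().all()

# reset_index() moves Player out of the index into an ordinary column.
fixed = top.reset_index()
fixed_columns = fixed.columns.tolist()
```

With Player as a column again, fixed.plot(x='Player', y='Goals', kind='bar') should label the bars with the names.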

Why am I not able to feed my data series from pandas into calmap.yearplot? Trying to create a calendar heat map

Beginner question here.
What I'm trying to build:
A program that takes data from a CSV and creates a calendar heat map from it. I am a language learner (language as in Spanish, Japanese, etc.) and the data set I'm using is a CSV that shows how many hours I spent immersing in my target language per day.
I want the individual values in the heat map to be the number of hours. Y axis will be days of the week, and x axis will be months.
What I have tried:
I have tried many methods for the past two days (most of them using seaborn), that have all resulted in error-infested spaghetti code...
The method I'm using today is with calmap. Here is what I have so far:
import seaborn as sns
import matplotlib as plt
import numpy as np
from vega_datasets import data as vds
import calmap
import pandas as pd
import calplot
# importing CSV from google drive
df = pd.read_csv('ImmersionHours.csv', names=['Type', 'Name', 'Date', 'Time', 'Total Time'])
# deleting extraneous row of data
df.drop([0], inplace=True)
# making sure dates are in datetime format
df['Date'] = pd.to_datetime(df['Date'])
# setting the dates as the index
df.set_index('Date', inplace=True)
# the data is now formatted how I want
# creating a series for the heat map values
hm_values = pd.Series(df.Time)
# trying to create the heat map from the series (hm_values)
calmap.yearplot(data=hm_values, year=2021)
Here is a copy of the data set that I imported into Python, for reference: https://docs.google.com/spreadsheets/d/1owZv0NDLz7S4R5Spf-hzRDGMTCS1FVSMvi0WsZJenWE/edit?usp=sharing
Can someone tell me where I'm going wrong and why the heat map won't show?
Thank you in advance for any advice/tips/corrections.
The question is a bit old, but in case anyone is interested, I had the same problem and found this notebook very helpful in solving it: https://github.com/amandasolis/Fitbit/blob/master/FitbitSummaryPlots.ipynb
import numpy as np
import pandas as pd
import calmap
fulldf = pd.read_csv("./data.csv", index_col=0, header=None,names=['date','duration','frac'], parse_dates=['date'], usecols=['date','frac'], infer_datetime_format=True, dayfirst=True)
fulldf.index=pd.to_datetime(fulldf.index)
events = pd.Series(fulldf['frac'])
calmap.yearplot(events, year=2022) #the notebook linked above has a better but complex viz
First lines of data.csv (I plot frac, the third column, not duration, but it should be similar):
03/11/2022,1,"0.0103"
08/11/2022,1,"0.0103"
15/11/2022,1,"0.0103"
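A likely culprit in the question's own code, though this is a guess without seeing the file: the header row is read as data and then dropped, which leaves the Time column as strings, while the heat map needs numeric values on a DatetimeIndex. A minimal sketch of the preparation step, assuming Time holds hours as plain numbers and using a made-up inline CSV in place of ImmersionHours.csv:

```python
import io

import pandas as pd

# Made-up sample standing in for ImmersionHours.csv.
csv_text = """Type,Name,Date,Time,Total Time
Reading,Book A,2021-01-01,1.5,1.5
Listening,Podcast B,2021-01-01,0.5,2.0
Reading,Book A,2021-01-03,2.0,4.0
"""

# Let pandas use the header row as column names and parse the dates.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

# Make sure the values are numeric; anything unparseable becomes NaN.
df["Time"] = pd.to_numeric(df["Time"], errors="coerce")

# One value per day, indexed by date.
hm_values = df.groupby("Date")["Time"].sum()

# With calmap installed, the series could then be plotted:
# import calmap
# import matplotlib.pyplot as plt  # pyplot, not the top-level matplotlib package
# calmap.yearplot(hm_values, year=2021)
# plt.show()
```

Summing per day also avoids several rows sharing the same date in the index.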

Avoid plotting missing values in Seaborn

Problem: I have time series data covering several days, and I use Seaborn's sns.FacetGrid to plot it in facets. In several cases, I found that Seaborn joins the readings on either side of a run of consecutive missing values (NaN) with a continuous line, whereas Matplotlib shows missing values as a gap, which makes sense. A demo example:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# create timeseries data for 3 days such that day two contains NaN values
time_duration1 = pd.date_range('1/1/2018', periods=24,freq='H')
data1 = np.random.randn(len(time_duration1))
ds1 = pd.Series(data=data1,index=time_duration1)
time_duration2 = pd.date_range('1/2/2018',periods=24,freq='H')
data2 = [float('nan')]*len(time_duration2)
ds2 = pd.Series(data=data2,index=time_duration2)
time_duration3 = pd.date_range('1/3/2018', periods=24,freq='H')
data3 = np.random.randn(len(time_duration3))
ds3 = pd.Series(data=data3,index=time_duration3)
# combine all three days series and then convert series into pandas dataframe
DS = pd.concat([ds1,ds2,ds3])
DF = DS.to_frame()
DF.plot()
This results in the following plot:
The Matplotlib plot above shows the missing values as a gap.
Now let us prepare the same data for Seaborn:
DF['col'] = np.ones(DF.shape[0])# dummy column but required for facets
DF['timestamp'] = DF.index
DF.columns = ['data_val','col','timestamp']
g = sns.FacetGrid(DF,col='col',col_wrap=1,size=2.5)
g.map_dataframe(plt.plot,'timestamp','data_val')
See how the Seaborn plot draws a line across the missing data. How can I force Seaborn not to connect across NaN values like this?
Note: This is a dummy example, and I need facet grid in any case to plot my data.
By default, FacetGrid removes NaN values from the data. The reason is that some functions inside seaborn would not work properly with NaNs (especially some of the statistical functions, I'd say).
To keep the NaN values in the data, pass dropna=False to FacetGrid:
g = sns.FacetGrid(DF,... , dropna=False)

Seaborn tsplot - ValueError: Index contains duplicate entries, cannot reshape tsplot

I am trying to use seaborn's tsplot (sns.tsplot) as I want to show the variation in values of a series for the different months over 5 years (as well as the mean).
I have attached an image of my dataframe. This is the code I am using:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.tsplot(data = df_monthly, time = 'month', value = 'pm_local', unit="dayofweek");
plt.show()
I cannot work out why I receive the following error message since the index just goes from 0 onwards:
ValueError: Index contains duplicate entries, cannot reshape
Any suggestions of why this might be and how I can get the tsplot working?
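One plausible cause, assuming tsplot reshapes the frame internally by pivoting on unit and time (I believe it does, but treat that as an assumption): pandas raises this message when the same (dayofweek, month) pair occurs more than once, which is bound to happen with several weeks per month over 5 years. So the error is about those pairs, not about the dataframe's own 0..n index. A pandas-only sketch with made-up numbers that reproduces the message and shows averaging the duplicates as one way around it:

```python
import pandas as pd

# Made-up data: (dayofweek=0, month=1) and (dayofweek=1, month=2) appear twice.
df_monthly = pd.DataFrame({
    "month": [1, 1, 1, 2, 2, 2],
    "dayofweek": [0, 0, 1, 0, 1, 1],
    "pm_local": [10.0, 14.0, 11.0, 9.0, 8.0, 12.0],
})

# A plain pivot cannot place two values in the same cell.
try:
    df_monthly.pivot(index="dayofweek", columns="month", values="pm_local")
    message = None
except ValueError as err:
    message = str(err)

# pivot_table aggregates the duplicates (here with the mean) instead.
averaged = df_monthly.pivot_table(
    index="dayofweek", columns="month", values="pm_local", aggfunc="mean"
)
```

Aggregating first, or picking a unit that is unique per time point, should remove the duplicates. Note also that tsplot has been deprecated in favor of sns.lineplot, which accepts long-form data with repeated x values and draws the mean with an error band.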
