visualization with pandas in python - python

I have the following problem:
I want to plot the following table:
I want to compare the new_cases from Germany and France per week. How can I visualise this?
I have already tried several plots, but I'm not happy with the results.
For example:
pivot_df['France'].plot(kind='bar')
plt.figure(figsize=(15,5))
pivot_df['France'].plot(kind='bar')
plt.figure(figsize=(15,5))
But it only shows me France.

I think you're trying to get a time series plot. For that you'll need to convert year_week to a datetime object. You can then group by country, resample per week, unstack, and plot the time series:
import datetime
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('https://opendata.ecdc.europa.eu/covid19/testing/csv/data.csv')
df = df[df['country'].isin(['France', 'Germany'])]
df = df[df['level'] == 'national'].reset_index()
df['datetime'] = df['year_week'].apply(lambda x: datetime.datetime.strptime(x + '-1', '%G-W%V-%u')) #https://stackoverflow.com/a/54033252/11380795
df.set_index('datetime', inplace=True)
# select new_cases before summing, so text columns are not summed as well
grouped_df = df.groupby('country')['new_cases'].resample('1W').sum()
ax = grouped_df.unstack().T.plot(figsize=(10,5))
ax.ticklabel_format(style='plain', axis='y')
result:
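To see what the year_week conversion does in isolation, here is a minimal offline sketch. The four rows are made-up values (not ECDC data), and the 'YYYY-Www' format is assumed from the strptime pattern used in the answer.

```python
import datetime
import pandas as pd

# Hypothetical sample that mimics the 'year_week' column (format assumed: 'YYYY-Www')
df = pd.DataFrame({
    "country": ["France", "Germany", "France", "Germany"],
    "year_week": ["2020-W15", "2020-W15", "2020-W16", "2020-W16"],
    "new_cases": [100, 80, 120, 90],
})

# Append '-1' (Monday) so that %G (ISO year), %V (ISO week) and %u (ISO weekday)
# together describe one complete date
df["datetime"] = df["year_week"].apply(
    lambda s: datetime.datetime.strptime(s + "-1", "%G-W%V-%u")
)

# One column per country, indexed by the Monday of each week
weekly = df.pivot(index="datetime", columns="country", values="new_cases")
print(weekly)
```

Calling weekly.plot() on a frame shaped like this should draw one line per country against a date axis.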

Here you go:
sample df:
Country week new_cases
0 FRANCE 9 210
1 GERMANY 9 300
2 FRANCE 10 410
3 GERMANY 10 200
4 FRANCE 11 910
5 GERMANY 9 500
Code:
df.groupby(['week','Country'])['new_cases'].sum().unstack().plot.bar()
plt.ylabel('New cases')
Output:
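As a self-contained sketch of the reshaping step, the sample rows above can be rebuilt and inspected before plotting. Note that the last sample row is recorded as week 9, so it is summed with the other GERMANY row for that week and week 11 has no GERMANY value.

```python
import pandas as pd

# Sample rows copied from the answer above
df = pd.DataFrame({
    "Country": ["FRANCE", "GERMANY", "FRANCE", "GERMANY", "FRANCE", "GERMANY"],
    "week": [9, 9, 10, 10, 11, 9],
    "new_cases": [210, 300, 410, 200, 910, 500],
})

# Rows = week, columns = country; duplicate (week, country) pairs are summed
wide = df.groupby(["week", "Country"])["new_cases"].sum().unstack()
print(wide)
```

With the data in this wide shape, wide.plot.bar() draws the countries side by side for each week, which is what makes the comparison readable.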

Related

Plotting a large data set while filtering for a year that is not a variable in that data set

A really easy question from a beginning Python programmer, but I have been fighting with this for two days now.
I want to plot life expectancy vs. GDP in a scatter plot. The data comes from a huge 60,000-row data set covering the years 1950 to 2018. For this specific scatter plot I am only interested in 2018. How do I filter the df for this specific task without deleting data?
As shown below, the variable year is not in my scatter plot, but I would like to filter on it.
This is the code I have now; I have tried everything, such as df.loc.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('GDP_LE_2018.csv')
le = df['Life expectancy']
gdp = df['GDP per capita']
year = df['Year']
year1 = df[df['Year'] == 2018]
plt.scatter(gdp, le)
plt.show()
This is the data set
Entity Code ... Population (historical estimates) Continent
0 Abkhazia OWID_ABK ... NaN Asia
1 Afghanistan AFG ... 7752117.0 NaN
2 Afghanistan AFG ... 7840151.0 NaN
3 Afghanistan AFG ... 7935996.0 NaN
4 Afghanistan AFG ... 8039684.0 NaN
Help would be so much appreciated
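Filter the rows with query before selecting the columns to plot. pd.read_csv already returns a DataFrame, so no conversion is needed, and storing the result under a new name leaves the full data set in df untouched:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('GDP_LE_2018.csv')
# keep only the 2018 rows; df itself is not modified
df_2018 = df.query('Year == 2018')
le = df_2018['Life expectancy']
gdp = df_2018['GDP per capita']
plt.scatter(gdp, le)
plt.show()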
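An equivalent way to filter is a boolean mask. The sketch below uses a small made-up frame as a stand-in for the real CSV (the column names are taken from the question, the values are invented) and shows that the original frame keeps all of its rows.

```python
import pandas as pd

# Hypothetical stand-in for the real CSV
df = pd.DataFrame({
    "Entity": ["A", "A", "B", "B"],
    "Year": [2017, 2018, 2017, 2018],
    "GDP per capita": [1000.0, 1100.0, 2000.0, 2100.0],
    "Life expectancy": [60.0, 61.0, 70.0, 71.0],
})

# True for the rows to keep; indexing with the mask returns a new frame
mask = df["Year"] == 2018
df_2018 = df[mask]

print(len(df), len(df_2018))
```

The scatter plot would then use df_2018['GDP per capita'] and df_2018['Life expectancy'], while df remains available for other years.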

Plotly axis is showing dates in reverse order (recent to earliest) -Python

I have created a visualization utilizing the plotly library within Python. Everything looks fine, except the axis is starting with 2020 and then shows 2019. The axis should be the opposite of what is displayed.
Here is the data (df):
date percent type
3/1/2020 10 a
3/1/2020 0 b
4/1/2020 15 a
4/1/2020 60 b
1/1/2019 25 a
1/1/2019 1 b
2/1/2019 50 c
2/1/2019 20 d
This is what I am doing
import plotly.express as px
px.scatter(df, x = "date", y = "percent", color = "type", facet_col = "type")
How would I make it so that the dates are sorted correctly, earliest to latest? The dates are sorted within the raw data so why is it not reflecting this on the graph?
Any suggestion will be appreciated.
Here is the result:
It is plotting in the order of your df. If you want the points in date order, sort the DataFrame by date first.
df.sort_values('date', inplace=True)
A lot of other graphing utilities (Seaborn, etc) by default sort when plotting. Plotly Express does not do this.
Your date column seems to be a string. If you convert it to a datetime you don't have to sort your dataframe: plotly express will set the x-axis to datetime:
Working code example:
import pandas as pd
import plotly.express as px
from io import StringIO
text = """
date percent type
3/1/2020 10 a
3/1/2020 0 b
4/1/2020 15 a
4/1/2020 60 b
1/1/2019 25 a
1/1/2019 1 b
2/1/2019 50 c
2/1/2019 20 d
"""
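df = pd.read_csv(StringIO(text), sep=r'\s+', header=0)
# convert the string dates so Plotly Express treats the x-axis as a date axis
df['date'] = pd.to_datetime(df['date'])
px.scatter(df, x="date", y="percent", color="type", facet_col="type")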
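The difference between sorting date strings and sorting real datetimes can be checked without plotting. This is a minimal sketch with three made-up dates; the month/day/year format is an assumption based on the sample data in the question.

```python
import pandas as pd

# Made-up dates, stored as text the way the question's column appears to be
dates = pd.Series(["10/1/2019", "2/1/2019", "1/1/2020"])

# Text sorts character by character, so the result is not chronological
as_text = dates.sort_values().tolist()

# After conversion, sorting follows the calendar
as_dates = pd.to_datetime(dates, format="%m/%d/%Y").sort_values()

print(as_text)
print(as_dates.dt.strftime("%Y-%m-%d").tolist())
```

So sort_values on the string column happens to work for the question's eight rows, but converting to datetime first is the more robust choice.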

Using a scatter plot to plot multiple columns from a data set

import plotly.offline as pyo
import plotly.express as px
import matplotlib.pyplot as pls
pyo.init_notebook_mode()
data = pd.read_csv(r'C:.......Coronovirus Datasets\time_series_covid19_deaths_global.csv')
countries = ['US']
filtered_data = data[data['Country/Region'].isin(countries)]
wanted_values = filtered_data[['Country/Region','1/22/2020','1/23/2020','1/24/2020', '1/25/2020','1/26/2020','1/27/2020','1/28/2020','1/28/2020','1/29/2020',
'1/30/2020','1/31/2020','2/1/2020','2/2/2020','2/3/2020','2/4/2020','2/5/2020','2/6/2020','2/7/2020','2/8/2020','2/9/2020','2/10/2020',
'2/11/2020','2/12/2020','2/13/2020','2/14/2020','2/15/2020','2/16/2020','2/17/2020','2/18/2020','2/19/2020','2/20/2020','2/21/2020','2/22/2020','2/23/2020',
'2/24/2020','2/25/2020','2/26/2020','2/27/2020','2/28/2020','2/29/2020','3/1/2020','3/2/2020','3/3/2020','3/4/2020','3/5/2020','3/6/2020','3/7/2020',
'3/8/2020','3/9/2020','3/10/2020','3/11/2020','3/12/2020','3/13/2020','3/14/2020','3/15/2020','3/16/2020','3/17/2020','3/18/2020','3/19/2020',
'3/20/2020','3/21/2020','4/1/2020','4/2/2020','4/3/2020','4/4/2020','4/5/2020','4/6/2020','4/7/2020','4/8/2020','4/9/2020','4/10/2020',
'4/11/2020','4/12/2020','4/13/2020','4/14/2020','4/15/2020','4/16/2020','4/17/2020','4/18/2020','4/19/2020','4/20/2020','4/21/2020','4/22/2020','4/23/2020',
'4/24/2020','4/25/2020','4/26/2020','4/27/2020','4/28/2020','4/29/2020','5/1/2020','5/2/2020','5/3/2020','5/4/2020','5/5/2020','5/6/2020','5/7/2020','5/8/2020','5/9/2020']]
fig = px.scatter(wanted_values, x ='Country/Region', y = 'dates' , title = 'Number of Deaths Per Day')
fig.show()
#wanted_values.plot(x="5/9/2020, 5/8/2020", y = 'filtered_data' kind = 'bar')
#pls.show()
How can I plot all the dates with their corresponding deaths as a scatter plot? I plan to use linear regression to predict the amount of deaths since January first. I have been having a lot of trouble with plotting these values as I am really new to Python.
The data set can be found here: https://data.humdata.org/dataset/novel-coronavirus-2019-ncov-cases
This is what your data looks like:
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv("time_series_covid19_deaths_global.csv")
data.iloc[:2,:7]
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20
0 NaN Afghanistan 33.0000 65.0000 0 0 0
1 NaN Albania 41.1533 20.1683 0 0 0
First of all, subset it by giving it the start and end of dates (that match the column names) and melting it to give long format:
data = data[data['Country/Region']=='US']
data = data.loc[:,'1/22/20':'5/9/20'].melt(var_name="date")
data['date'] = pd.to_datetime(data['date'])
Looks like this now:
date value
0 2020-01-22 0
1 2020-01-23 0
2 2020-01-24 0
Plotting is simply:
data.plot.scatter(x="date",y="value",rot=45)
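The subset-and-melt step can be tried on a tiny frame first. The sketch below uses invented counts and only three date columns; the extra day column is an assumption about what a later linear regression might want as a numeric x value, not part of the answer.

```python
import pandas as pd

# Made-up miniature of the wide file: one column per date
wide = pd.DataFrame({
    "Country/Region": ["US", "Albania"],
    "1/22/20": [0, 0],
    "1/23/20": [1, 0],
    "1/24/20": [3, 0],
})

us = wide[wide["Country/Region"] == "US"]

# Label-based column slice (both ends included), then wide -> long
long = us.loc[:, "1/22/20":"1/24/20"].melt(var_name="date")
long["date"] = pd.to_datetime(long["date"], format="%m/%d/%y")

# Days since the first date, as a plain integer column
long["day"] = (long["date"] - long["date"].min()).dt.days
print(long)
```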

How to generate a rolling mean for a specific date range and location with pandas

I have a large data set with names of stores, dates and profits.
My data set is not the most organized but I now have it in this df.
df
Store Date Profit
ABC May 1 2018 234
XYZ May 1 2018 410
AZY May 1 2018 145
ABC May 2 2018 234
XYZ May 2 2018 410
AZY May 2 2018 145
I proudly created a function to get each day into a df by itself, until I realized it would be very time-consuming to do this for each day.
def avg(n):
    return df.loc[df['Date'] == "May" + " " + str(n) + " " + str(2018)]
where n is the day I want. So that function gets me just the dates I want.
What I really need is a way to pass all the dates I want as a list and collect them in a DataFrame for each day. I tried this, but it did not work:
def avg(n):
    dlist = []
    for i in n:
        dlist = df.loc[df['Date'] == "May" + " " + str(i) + " " + str(2018)]
        dlist = pd.DataFrame(dlist)
        dlist.append(i)
    return dlist
df2=avg([21,23,24,25])
My goal was to get all the dates (21, 23, 24, 25) in May, each in its own df.
But it failed completely; I got this error:
cannot concatenate object of type ""; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid
I am not sure whether it's also possible to add a rolling average (mean) column for each of the days (21, 23, 24, 25), but that's where the analysis will conclude.
Desired output:
Store Date Profit Rolling Mean
ABC May 1 2018 234 250
XYZ May 1 2018 410 401
AZY May 1 2018 145 415
where the rolling mean is for the past 30 days. Above all, I would like to have each day in its own df, which I can save to a CSV file at the end.
Rolling Mean:
The example data in the question has dates in the format May 1 2018, which are plain strings. Time-based rolling operations need real datetime values.
Instead of string-splitting the original Date column, convert it to datetime with df.Date = pd.to_datetime(df.Date), which gives dates in the format 2018-05-01.
With a properly formatted datetime column, use df['Day'] = df.Date.dt.day and df['Month'] = df.Date.dt.month_name() to get Day and Month columns, if desired.
Given the original data:
Original Data:
Store Date Profit
ABC May 1 2018 234
XYZ May 1 2018 410
AZY May 1 2018 145
ABC May 2 2018 234
XYZ May 2 2018 410
AZY May 2 2018 145
Transformed Original Data:
df.Date = pd.to_datetime(df.Date)
df['Day'] = df.Date.dt.day
df['Month'] = df.Date.dt.month_name()
Store Date Profit Day Month
ABC 2018-05-01 234 1 May
XYZ 2018-05-01 410 1 May
AZY 2018-05-01 145 1 May
ABC 2018-05-02 234 2 May
XYZ 2018-05-02 410 2 May
AZY 2018-05-02 145 2 May
Rolling Example:
The example dataset is insufficient to produce a 30-day rolling average.
To get a 30-day rolling mean, each store needs at least 30 rows of data: rolling(30) yields its first mean on the 30th row, covering that row and the 29 before it. Note that rolling(30) counts rows, not calendar days.
The following example sets up a dataframe consisting of every day in 2018, a random profit between 100 and 1000 (inclusive), and a random store chosen from ['ABC', 'XYZ', 'AZY'].
Extended Sample:
import pandas as pd
import random
import numpy as np
from datetime import datetime, timedelta
list_of_dates = [date for date in np.arange(datetime(2018, 1, 1), datetime(2019, 1, 1), timedelta(days=1)).astype(datetime)]
df = pd.DataFrame({'Store': [random.choice(['ABC', 'XYZ', 'AZY']) for _ in range(365)],
'Date': list_of_dates,
'Profit': [np.random.randint(100, 1001) for _ in range(365)]})
Store Date Profit
ABC 2018-01-01 901
AZY 2018-01-02 540
AZY 2018-01-03 417
XYZ 2018-01-04 280
XYZ 2018-01-05 384
XYZ 2018-01-06 104
XYZ 2018-01-07 691
ABC 2018-01-08 376
XYZ 2018-01-09 942
XYZ 2018-01-10 297
df.set_index('Date', inplace=True)
df_rolling = df.groupby(['Store']).rolling(30).mean()
df_rolling.rename(columns={'Profit': '30-Day Rolling Mean'}, inplace=True)
df_rolling.reset_index(inplace=True)
df_rolling.head():
Note: the first 29 rows for each store will be NaN
Store Date 30-Day Rolling Mean
ABC 2018-01-01 NaN
ABC 2018-01-03 NaN
ABC 2018-01-07 NaN
ABC 2018-01-11 NaN
ABC 2018-01-13 NaN
df_rolling.tail():
Store Date 30-Day Rolling Mean
XYZ 2018-12-17 556.966667
XYZ 2018-12-18 535.633333
XYZ 2018-12-19 534.733333
XYZ 2018-12-24 551.066667
XYZ 2018-12-27 572.033333
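Because each store only has data on some days, a window of 30 rows spans more than 30 calendar days. A time-based window is an alternative to the approach above, not what the answer uses. The sketch below contrasts the two on a made-up frame with a 3-row and a 3-day window so the numbers are easy to follow.

```python
import pandas as pd

# Made-up data: ABC has a gap between Jan 2 and Jan 5
df = pd.DataFrame({
    "Store": ["ABC", "ABC", "ABC", "ABC", "XYZ", "XYZ"],
    "Date": pd.to_datetime([
        "2018-01-01", "2018-01-02", "2018-01-05", "2018-01-06",
        "2018-01-01", "2018-01-02",
    ]),
    "Profit": [10, 20, 30, 40, 100, 200],
}).set_index("Date")

# Window of 3 rows: NaN until a store has 3 rows
by_rows = df.groupby("Store")["Profit"].rolling(3).mean()

# Window of 3 calendar days: needs a sorted datetime index within each store
by_days = df.groupby("Store")["Profit"].rolling("3D").mean()

print(by_rows)
print(by_days)
```

For the Jan 5 row of ABC, the row-based window averages the three most recent rows (10, 20, 30), while the day-based window only contains Jan 5 itself, because Jan 1 and Jan 2 are more than three days back.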
Plot:
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 6))
g = sns.lineplot(x='Date', y='30-Day Rolling Mean', data=df_rolling, hue='Store')
for item in g.get_xticklabels():
    item.set_rotation(60)
plt.show()
Alternatively: A dataframe for each store:
It's also possible to create a separate dataframe for each store and put it inside a dict.
This alternative makes it easier to plot a more detailed graph with less code.
import pandas as pd
import random
import numpy as np
from datetime import datetime, timedelta
list_of_dates = [date for date in np.arange(datetime(2018, 1, 1), datetime(2019, 1, 1), timedelta(days=1)).astype(datetime)]
df = pd.DataFrame({'Store': [random.choice(['ABC', 'XYZ', 'AZY']) for _ in range(365)],
'Date': list_of_dates,
'Profit': [np.random.randint(100, 1001) for _ in range(365)]})
df_dict = dict()
for store in df.Store.unique():
    # copy the slice so later changes do not touch df (avoids SettingWithCopyWarning)
    df_dict[store] = df.loc[df.Store == store, ['Date', 'Profit']].copy()
    df_dict[store].set_index('Date', inplace=True)
    df_dict[store]['Profit: 30-Day Rolling Mean'] = df_dict[store]['Profit'].rolling(30).mean()
print(df_dict.keys())
>>> dict_keys(['ABC', 'XYZ', 'AZY'])
print(df_dict['ABC'].head())
Plot:
import matplotlib.pyplot as plt
_, axes = plt.subplots(1, 1, figsize=(13, 8), sharex=True)
for k, v in df_dict.items():
    axes.plot(v['Profit'], marker='.', linestyle='-', linewidth=0.5, label=k)
    axes.plot(v['Profit: 30-Day Rolling Mean'], marker='o', markersize=4, linestyle='-', linewidth=0.5, label=f'{k} Rolling')
axes.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.ylabel('Profit ($)')
plt.xlabel('Date')
plt.title('Recorded Profit vs. 30-Day Rolling Mean of Profit')
plt.show()
Get a dataframe for a specific month:
Recall, this is randomly generated data, so the stores don't have data for every day of the month.
may_df = dict()
for k, v in df_dict.items():
    v = v.reset_index()
    may_df[k] = v[v.Date.dt.month_name() == 'May'].set_index('Date')
print(may_df['XYZ'])
Plot: May data only:
Save dataframes:
may_df is a dict of dataframes, so call pandas.DataFrame.to_csv() on each one:
for k, v in may_df.items():
    v.reset_index().to_csv(f'may_{k}.csv', index=False)
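The question also asked for one dataframe per selected day. A minimal sketch of that, assuming the Date strings look like 'May 21 2018' as in the question; the rows themselves are invented for illustration.

```python
import pandas as pd

# Invented rows in the question's layout
df = pd.DataFrame({
    "Store": ["ABC", "XYZ", "ABC", "XYZ", "ABC"],
    "Date": ["May 21 2018", "May 21 2018", "May 22 2018", "May 23 2018", "May 23 2018"],
    "Profit": [234, 410, 145, 300, 120],
})
df["Date"] = pd.to_datetime(df["Date"], format="%B %d %Y")

# Keep only the wanted days in May; days without data are simply absent
wanted_days = [21, 23, 24, 25]
subset = df[(df["Date"].dt.month == 5) & (df["Date"].dt.day.isin(wanted_days))]

# One dataframe per day, keyed by an ISO date string
frames = {date.strftime("%Y-%m-%d"): frame for date, frame in subset.groupby("Date")}
print(list(frames))
```

Each entry can then be written out on its own, for example with frame.to_csv(f'{key}.csv', index=False) inside a loop over frames.items().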
A simple solution may be groupby().
Check out this example:
import pandas as pd
listt = [['a',2,3],
['b',5,7],
['a',3,9],
['a',1,3],
['b',9,4],
['a',4,7],
['c',7,2],
['a',2,5],
['c',4,7],
['b',5,5]]
my_df = pd.DataFrame(listt)
my_df.columns=['Class','Day_1','Day_2']
my_df.groupby('Class')['Day_1'].mean()
Output:
Class
a 2.400000
b 6.333333
c 5.500000
Name: Day_1, dtype: float64
Note: similarly, you can group your data by Date and get the average of your Profit.
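Applied to the six rows from the question, that could look like the following sketch. The date format string is an assumption based on how the dates are written in the question.

```python
import pandas as pd

# The six sample rows from the question
df = pd.DataFrame({
    "Store": ["ABC", "XYZ", "AZY", "ABC", "XYZ", "AZY"],
    "Date": ["May 1 2018"] * 3 + ["May 2 2018"] * 3,
    "Profit": [234, 410, 145, 234, 410, 145],
})
df["Date"] = pd.to_datetime(df["Date"], format="%B %d %Y")

# Average profit across all stores, one value per day
daily_mean = df.groupby("Date")["Profit"].mean()
print(daily_mean)
```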

Pandas/NumPy -- Plotting Dates as X axis

My goal is to plot this simple data as a graph, with dates on the x-axis and price on the y-axis. I understand that the dtype of the NumPy record array for the field date is datetime64[D], which means it is a 64-bit np.datetime64 in 'day' units. While this format is more portable, Matplotlib cannot plot it natively yet. We can plot this data by changing the dates to datetime.date instances instead, which can be achieved by converting to an object array, as I did below with astype('O'). But I am still getting
this error:
view limit minimum -36838.00750000001 is less than 1 and is an invalid Matplotlib date value. This often happens if you pass a non-DateTime value to an axis that has DateTime units
code:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv(r'avocado.csv')
df2 = df[['Date','AveragePrice','region']]
df2 = (df2.loc[df2['region'] == 'Albany'])
df2['Date'] = pd.to_datetime(df2['Date'])
df2['Date'] = df2.Date.astype('O')
plt.style.use('ggplot')
ax = df2[['Date','AveragePrice']].plot(kind='line', title ="Price Change",figsize=(15,10),legend=True, fontsize=12)
ax.set_xlabel("Period",fontsize=12)
ax.set_ylabel("Price",fontsize=12)
plt.show()
df.head(3)
Unnamed: 0 Date AveragePrice Total Volume 4046 4225 4770 Total Bags Small Bags Large Bags XLarge Bags type year region
0 0 2015-12-27 1.33 64236.62 1036.74 54454.85 48.16 8696.87 8603.62 93.25 0.0 conventional 2015 Albany
1 1 2015-12-20 1.35 54876.98 674.28 44638.81 58.33 9505.56 9408.07 97.49 0.0 conventional 2015 Albany
2 2 2015-12-13 0.93 118220.22 794.70 109149.67 130.50 8145.35 8042.21 103.14 0.0 conventional 2015 Albany
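Select the two columns, sort by date, and set Date as the index, so that plot() draws AveragePrice against the dates:
df2 = df[['Date', 'AveragePrice', 'region']]
df2 = df2.loc[df2['region'] == 'Albany'].copy()
df2['Date'] = pd.to_datetime(df2['Date'])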
df2 = df2[['Date', 'AveragePrice']]
df2 = df2.sort_values(['Date'])
df2 = df2.set_index('Date')
print(df2)
ax = df2.plot(kind='line', title="Price Change")
ax.set_xlabel("Period", fontsize=12)
ax.set_ylabel("Price", fontsize=12)
plt.show()
output:
