Using a scatter plot to plot multiple columns from a data set - python

import plotly.offline as pyo
import plotly.express as px
import matplotlib.pyplot as pls
pyo.init_notebook_mode()
data = pd.read_csv(r'C:.......Coronovirus Datasets\time_series_covid19_deaths_global.csv')
countries = ['US']
filtered_data = data[data['Country/Region'].isin(countries)]
wanted_values = filtered_data[['Country/Region','1/22/2020','1/23/2020','1/24/2020', '1/25/2020','1/26/2020','1/27/2020','1/28/2020','1/28/2020','1/29/2020',
'1/30/2020','1/31/2020','2/1/2020','2/2/2020','2/3/2020','2/4/2020','2/5/2020','2/6/2020','2/7/2020','2/8/2020','2/9/2020','2/10/2020',
'2/11/2020','2/12/2020','2/13/2020','2/14/2020','2/15/2020','2/16/2020','2/17/2020','2/18/2020','2/19/2020','2/20/2020','2/21/2020','2/22/2020','2/23/2020',
'2/24/2020','2/25/2020','2/26/2020','2/27/2020','2/28/2020','2/29/2020','3/1/2020','3/2/2020','3/3/2020','3/4/2020','3/5/2020','3/6/2020','3/7/2020',
'3/8/2020','3/9/2020','3/10/2020','3/11/2020','3/12/2020','3/13/2020','3/14/2020','3/15/2020','3/16/2020','3/17/2020','3/18/2020','3/19/2020',
'3/20/2020','3/21/2020','4/1/2020','4/2/2020','4/3/2020','4/4/2020','4/5/2020','4/6/2020','4/7/2020','4/8/2020','4/9/2020','4/10/2020',
'4/11/2020','4/12/2020','4/13/2020','4/14/2020','4/15/2020','4/16/2020','4/17/2020','4/18/2020','4/19/2020','4/20/2020','4/21/2020','4/22/2020','4/23/2020',
'4/24/2020','4/25/2020','4/26/2020','4/27/2020','4/28/2020','4/29/2020','5/1/2020','5/2/2020','5/3/2020','5/4/2020','5/5/2020','5/6/2020','5/7/2020','5/8/2020','5/9/2020']]
fig = px.scatter(wanted_values, x ='Country/Region', y = 'dates' , title = 'Number of Deaths Per Day')
fig.show()
#wanted_values.plot(x="5/9/2020, 5/8/2020", y = 'filtered_data' kind = 'bar')
#pls.show()
How can I plot all the dates with their corresponding deaths as a scatter plot? I plan to use linear regression to predict the number of deaths since January first. I have been having a lot of trouble plotting these values as I am really new to Python.
The data set can be found here: https://data.humdata.org/dataset/novel-coronavirus-2019-ncov-cases

This is what your data looks like:
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv("time_series_covid19_deaths_global.csv")
data.iloc[:2,:7]
  Province/State Country/Region      Lat     Long  1/22/20  1/23/20  1/24/20
0            NaN    Afghanistan  33.0000  65.0000        0        0        0
1            NaN        Albania  41.1533  20.1683        0        0        0
First, keep only the US row, select the date columns by giving the first and last column names (they must match the column names exactly), and melt the result into long format:
data = data[data['Country/Region']=='US']
data = data.loc[:,'1/22/20':'5/9/20'].melt(var_name="date")
data['date'] = pd.to_datetime(data['date'])
Looks like this now:
        date  value
0 2020-01-22      0
1 2020-01-23      0
2 2020-01-24      0
Plotting is simply:
data.plot.scatter(x="date",y="value",rot=45)
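The reshaping step above can be tried without downloading the file. The following is only a minimal sketch: the frame `wide` and its numbers are made up for illustration, and the explicit `format` string assumes the month/day/two-digit-year column names shown above.

```python
import pandas as pd

# Hypothetical miniature of the wide-format file: one row per country,
# one column per date. The numbers are invented for illustration only.
wide = pd.DataFrame({
    "Country/Region": ["US", "Italy"],
    "1/22/20": [0, 0],
    "1/23/20": [1, 2],
    "1/24/20": [3, 5],
})

# Keep the US row, slice the date columns by label, and melt to long format.
us = wide[wide["Country/Region"] == "US"]
long_df = us.loc[:, "1/22/20":"1/24/20"].melt(var_name="date")

# An explicit format avoids relying on pandas' date inference.
long_df["date"] = pd.to_datetime(long_df["date"], format="%m/%d/%y")
print(long_df)
```

Each date column becomes one row, so `long_df` has a `date` column and a `value` column that can be passed to `plot.scatter` as in the answer.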

Related

Plotting a large data set while filtering for a year that is not a variable in that data set

A really easy question from a beginning Python programmer, but I have been fighting this for two days now.
I want to plot life expectancy vs. GDP in a scatter plot. The data comes from a huge 60,000-row data set covering the years 1950 to 2018. For this specific scatter plot I am only interested in 2018. How do I filter the df for this specific task without deleting data?
As shown below, the variable year is not in my scatter plot, but I would like to filter on it.
This is the code I have now. I have tried everything, like df.loc, etc.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('GDP_LE_2018.csv')
le = df['Life expectancy']
gdp = df['GDP per capita']
year = df['Year']
year1 = df[df['Year'] == 2018]
plt.scatter(gdp, le)
plt.show()
This is the data set
        Entity      Code  ...  Population (historical estimates)  Continent
0     Abkhazia  OWID_ABK  ...                                NaN       Asia
1  Afghanistan       AFG  ...                          7752117.0        NaN
2  Afghanistan       AFG  ...                          7840151.0        NaN
3  Afghanistan       AFG  ...                          7935996.0        NaN
4  Afghanistan       AFG  ...                          8039684.0        NaN
Help would be so much appreciated
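import pandas as pd
import matplotlib.pyplot as plt
# read_csv already returns a DataFrame, so no extra conversion is needed
df = pd.read_csv('GDP_LE_2018.csv')
# keep only the 2018 rows in a new variable; df itself is left untouched
df_2018 = df.query('Year == 2018')
le = df_2018['Life expectancy']
gdp = df_2018['GDP per capita']
plt.scatter(gdp, le)
plt.show()
Filter the rows with query first, then plot the columns of the filtered frame.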
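Filtering returns a new object and does not delete anything from the original frame. A minimal sketch, using a small made-up frame in place of the real CSV (the entity names and values are hypothetical), shows the boolean-mask form next to the `query` form:

```python
import pandas as pd

# Hypothetical stand-in for the real file; values are invented.
df = pd.DataFrame({
    "Entity": ["A", "A", "B", "B"],
    "Year": [2017, 2018, 2017, 2018],
    "GDP per capita": [1000.0, 1100.0, 2000.0, 2100.0],
    "Life expectancy": [60.0, 61.0, 70.0, 71.0],
})

# Boolean mask: keeps rows where the condition is True.
df_2018 = df[df["Year"] == 2018]

# Equivalent filter written as a query string.
df_2018_q = df.query("Year == 2018")

print(len(df), len(df_2018))
```

The original `df` still holds every year, so other plots can filter it differently later.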

visualization with pandas in python

I have the following problem.
I want to plot the following table:
I want to compare the new_cases from Germany and France per week. How can I visualise this?
I already tried multiple plots but I'm not happy with the results.
For example:
pivot_df['France'].plot(kind='bar')
plt.figure(figsize=(15,5))
pivot_df['France'].plot(kind='bar')
plt.figure(figsize=(15,5))
But it only shows me France.
I think you're trying to get a timeseries plot. For that you'll need to convert year_week to a datetime object. Subsequently you can groupby the country, and unstack and plot the timeseries:
import datetime
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('https://opendata.ecdc.europa.eu/covid19/testing/csv/data.csv')
df = df[df['country'].isin(['France', 'Germany'])]
df = df[df['level'] == 'national'].reset_index()
df['datetime'] = df['year_week'].apply(lambda x: datetime.datetime.strptime(x + '-1', '%G-W%V-%u')) #https://stackoverflow.com/a/54033252/11380795
df.set_index('datetime', inplace=True)
# select new_cases before resampling so only that numeric column is summed
grouped_df = df.groupby('country')['new_cases'].resample('1W').sum()
ax = grouped_df.unstack().T.plot(figsize=(10,5))
ax.ticklabel_format(style='plain', axis='y')
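The week-to-date conversion in the code above can be checked in isolation. This is a minimal sketch that assumes year_week strings of the form '2020-W10', as implied by the format string; the helper name `iso_week_to_monday` is made up for illustration.

```python
import datetime

def iso_week_to_monday(year_week):
    """Return the Monday of an ISO week given a string like '2020-W10'."""
    # Appending '-1' selects ISO weekday 1 (Monday); %G is the ISO year,
    # %V the ISO week number and %u the ISO weekday.
    return datetime.datetime.strptime(year_week + "-1", "%G-W%V-%u")

print(iso_week_to_monday("2020-W10"))
```

For '2020-W10' this gives 2 March 2020, the Monday that starts that ISO week.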
Here you go:
sample df:
   Country  week  new_cases
0   FRANCE     9        210
1  GERMANY     9        300
2   FRANCE    10        410
3  GERMANY    10        200
4   FRANCE    11        910
5  GERMANY     9        500
Code:
df.groupby(['week','Country'])['new_cases'].sum().unstack().plot.bar()
plt.ylabel('New cases')
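To see what the bar chart draws, the intermediate table can be printed before plotting. A minimal sketch that rebuilds the sample frame above (the variable name `pivot` is only for illustration):

```python
import pandas as pd

# The sample frame from the answer, typed in by hand.
df = pd.DataFrame({
    "Country": ["FRANCE", "GERMANY", "FRANCE", "GERMANY", "FRANCE", "GERMANY"],
    "week": [9, 9, 10, 10, 11, 9],
    "new_cases": [210, 300, 410, 200, 910, 500],
})

# Sum per (week, Country), then move Country into the columns so that
# each country becomes its own series of bars.
pivot = df.groupby(["week", "Country"])["new_cases"].sum().unstack()
print(pivot)
```

Because the sample lists GERMANY twice for week 9, those two rows are summed, and week 11 has no German row, so that cell is NaN and no bar is drawn for it.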

Plotly axis is showing dates in reverse order (recent to earliest) -Python

I have created a visualization utilizing the plotly library within Python. Everything looks fine, except the axis is starting with 2020 and then shows 2019. The axis should be the opposite of what is displayed.
Here is the data (df):
date percent type
3/1/2020 10 a
3/1/2020 0 b
4/1/2020 15 a
4/1/2020 60 b
1/1/2019 25 a
1/1/2019 1 b
2/1/2019 50 c
2/1/2019 20 d
This is what I am doing
import plotly.express as px
px.scatter(df, x = "date", y = "percent", color = "type", facet_col = "type")
How would I make it so that the dates are sorted correctly, earliest to latest? The dates are sorted within the raw data so why is it not reflecting this on the graph?
Any suggestion will be appreciated.
Here is the result:
It is plotting in the order of your df. If you want the points in date order, sort the df by date first.
df.sort_values('date', inplace=True)
A lot of other graphing utilities (Seaborn, etc) by default sort when plotting. Plotly Express does not do this.
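One caveat on sorting, assuming the date column holds month/day/year strings as in the question: strings sort character by character, so a month such as 10 would land before month 2. A minimal sketch with made-up rows that converts to datetime before sorting:

```python
import pandas as pd

# Hypothetical rows; the dates are strings in month/day/year order.
df = pd.DataFrame({
    "date": ["10/1/2019", "2/1/2019", "3/1/2020", "1/1/2019"],
    "percent": [5, 50, 10, 25],
})

# Sorting the raw strings compares characters, not dates.
by_string = df.sort_values("date")["date"].tolist()

# Converting first makes the sort chronological.
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")
by_time = df.sort_values("date")["date"].dt.strftime("%Y-%m-%d").tolist()

print(by_string)
print(by_time)
```

With the question's own data the string sort happens to give the right order, but that would stop being true once two-digit months appear.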
Your date column seems to be a string. If you convert it to a datetime you don't have to sort your dataframe: plotly express will set the x-axis to datetime:
Working code example:
import pandas as pd
import plotly.express as px
from io import StringIO
text = """
date percent type
3/1/2020 10 a
3/1/2020 0 b
4/1/2020 15 a
4/1/2020 60 b
1/1/2019 25 a
1/1/2019 1 b
2/1/2019 50 c
2/1/2019 20 d
"""
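df = pd.read_csv(StringIO(text), sep=r'\s+', header=0)
# convert the string dates so the x-axis becomes a chronological date axis
df['date'] = pd.to_datetime(df['date'])
px.scatter(df, x="date", y="percent", color="type", facet_col="type")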

How to turn off order of the line graph in plotly Python?

I want to create a line chart using Plotly. I have three variables (date, shift, runt). I want to plot runt against date, and I also want the shift to be shown.
Dataframe:
What I want is a line chart with both date and shift on the x-axis.
This is what I got from Excel. I want to plot the same graph in Python.
But I can't use two columns for the x-axis. I tried concatenating the date and shift into one column, but the chart shows all the day values first and then the night values.
import plotly.express as px
fig = px.line(df, x="Day-Shift", y="RUNT", title='Yo',template="plotly_dark")
fig.show()
Is there any way to turn off this ordering? What I want is shown in the Excel graph above.
I've created a column that combines the date and the shift and specified it on the x-axis. Does this meet the intent of your question?
import pandas as pd
import numpy as np
import io
data = '''
Date Shift RUNT
0 June-16 Day 350
1 June-16 Night 20
2 June-17 Day 350
3 June-17 Night 20
4 June-18 Day 350
5 June-18 Night 20
6 June-19 Day 350
7 June-19 Night 20
8 June-20 Day 350
9 June-20 Night 20
10 June-21 Day 350
11 June-21 Night 20
'''
df = pd.read_csv(io.StringIO(data), sep=r'\s+')
df['Day-Shift'] = df['Date'].str.cat(df['Shift'], sep='-')
import plotly.express as px
fig = px.line(df, x="Day-Shift", y="RUNT", title='Yo',template="plotly_dark")
fig.show()
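The combined column only gives the Excel-like order if the rows themselves are already in that order, since a string x-axis is, as far as I know, laid out in the order the values appear in the data. A minimal sketch for putting shuffled rows into date-then-shift order first; the helper column `_when` and the year 2020 are assumptions made purely so the dates can be parsed, because the data carries no year:

```python
import pandas as pd

# Hypothetical rows in a shuffled order.
df = pd.DataFrame({
    "Date": ["June-17", "June-16", "June-16", "June-17"],
    "Shift": ["Night", "Day", "Night", "Day"],
    "RUNT": [20, 350, 20, 350],
})

# Assumed year, used only to build a sortable timestamp.
df["_when"] = pd.to_datetime(df["Date"] + "-2020", format="%B-%d-%Y")

# An ordered categorical makes Day sort before Night within each date.
df["Shift"] = pd.Categorical(df["Shift"], categories=["Day", "Night"], ordered=True)

df = df.sort_values(["_when", "Shift"]).reset_index(drop=True)
df["Day-Shift"] = df["Date"].str.cat(df["Shift"].astype(str), sep="-")
print(df["Day-Shift"].tolist())
```

The sorted frame can then be passed to px.line with x="Day-Shift" exactly as in the answer.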

When plotting datetime index data, put markers in the plot on specific days (e.g. weekend)

I create a pandas dataframe with a DatetimeIndex like so:
import datetime
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# create datetime index and random data column
todays_date = datetime.datetime.now().date()
index = pd.date_range(todays_date-datetime.timedelta(10), periods=14, freq='D')
data = np.random.randint(1, 10, size=14)
columns = ['A']
df = pd.DataFrame(data, index=index, columns=columns)
# initialize new weekend column, then set all values to 'yes' where the index corresponds to a weekend day
df['weekend'] = 'no'
df.loc[(df.index.weekday == 5) | (df.index.weekday == 6), 'weekend'] = 'yes'
print(df)
Which gives
A weekend
2014-10-13 7 no
2014-10-14 6 no
2014-10-15 7 no
2014-10-16 9 no
2014-10-17 4 no
2014-10-18 6 yes
2014-10-19 4 yes
2014-10-20 7 no
2014-10-21 8 no
2014-10-22 8 no
2014-10-23 1 no
2014-10-24 4 no
2014-10-25 3 yes
2014-10-26 8 yes
I can easily plot the A column with pandas by doing:
df.plot()
plt.show()
which plots a line of the A column but leaves out the weekend column as it does not hold numerical data.
How can I put a "marker" on each spot of the A column where the weekend column has the value yes?
Meanwhile I found out, it is as simple as using boolean indexing in pandas. Doing the plot directly with pyplot instead of pandas' own plot wrapper (which is more convenient to me):
plt.plot(df.index, df.A)
plt.plot(df[df.weekend=='yes'].index, df[df.weekend=='yes'].A, 'ro')
Now the red dots mark all weekend days, i.e. the rows where df.weekend == 'yes'.
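The helper column is not strictly needed: the same selection can be made from the index directly. A minimal sketch, using a fixed start date and a seeded random generator (both chosen here only so the result is reproducible):

```python
import numpy as np
import pandas as pd

# Fixed 14-day range starting on a Monday, with made-up integer data.
index = pd.date_range("2014-10-13", periods=14, freq="D")
rng = np.random.default_rng(0)
df = pd.DataFrame({"A": rng.integers(1, 10, size=14)}, index=index)

# weekday is 0 for Monday through 6 for Sunday, so >= 5 means Sat or Sun.
is_weekend = df.index.weekday >= 5
weekend_points = df.loc[is_weekend, "A"]

print(weekend_points.index.day.tolist())
```

`weekend_points.index` and `weekend_points` can then be passed to plt.plot with the 'ro' style in place of the two filtered expressions above.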
