I'm trying to plot the maximum value per day of a dataframe column (ext_temp):
import pandas as pd
data = {'vin': {0: 'VF1AG0000KF908155', 1: 'VF1AG0000KF908155', 2: 'VF1AG0000KF908155', 3: 'VF1AG0000KF908155', 4: 'VF1AG0000KF908155', 5: 'VF1AG0000KF908155', 6: 'VF1AG0000KF908155', 7: 'VF1AG0000KF908155', 8: 'VF1AG0000KF908155', 9: 'VF1AG0000KF908155'}, 'date': {0: pd.Timestamp('2019-09-27 07:07:02'), 1: pd.Timestamp('2019-09-27 09:23:08'), 2: pd.Timestamp('2019-09-27 09:39:08'), 3: pd.Timestamp('2020-07-15 11:46:41'), 4: pd.Timestamp('2020-07-16 07:17:52'), 5: pd.Timestamp('2020-07-16 09:23:47'), 6: pd.Timestamp('2020-09-11 07:43:05'), 7: pd.Timestamp('2020-09-17 15:00:33'), 8: pd.Timestamp('2020-10-21 06:49:58'), 9: pd.Timestamp('2020-10-21 14:47:33')}, 'sohe': {0: 101, 1: 101, 2: 101, 3: 96, 4: 96, 5: 96, 6: 96, 7: 96, 8: 96, 9: 96}, 'soc': {0: 60, 1: 63, 2: 99, 3: 66, 4: 68, 5: 69, 6: 86, 7: 58, 8: 9, 9: 9}, 'ext_temp': {0: 27, 1: 30, 2: 31, 3: 30, 4: 26, 5: 29, 6: 26, 7: 29, 8: 28, 9: 27}, 'battery_temp': {0: 27, 1: 33, 2: 32, 3: 26, 4: 26, 5: 26, 6: 26, 7: 30, 8: 27, 9: 29}}
df = pd.DataFrame(data)
Unfortunately, when trying to use
nd = "VF1AG0000KF908155"
df = charge[charge.vin==gop]
df = df.groupby(pd.Grouper(key = 'date', freq = 'D'))
fig,ax = plt.subplots()
ax.plot(df.date, df['ext_temp'].max())
I get the following error message:
VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
Using pd.Grouper fills missing days with NaN.
If you don't want missing days filled in, group by the date component of 'date' using the .dt accessor.
Use pandas.DataFrame.plot to plot the grouped result.
kind='bar' is used here, since there's not much data. For a line plot, use kind='line'.
pd.Grouper
Note the need to use .dropna(), at least to plot the bar plot.
dfg = df.groupby(pd.Grouper(key='date', freq='D'))['ext_temp'].max().dropna()
ax = dfg.plot(kind='bar')
dfg = df.groupby(pd.Grouper(key='date', freq='D'))['ext_temp'].max().dropna()
ax = dfg.plot(kind='line')
.dt.date
Groupby only the date component of the 'date' column
dfg = df.groupby(df.date.dt.date)['ext_temp'].max()
ax = dfg.plot(kind='bar')
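The difference between the two groupings is easiest to see on a gap in the data. Below is a minimal sketch, assuming three hypothetical readings with one empty day between them; it only compares the grouped results and does not plot anything.

```python
import pandas as pd

# Three readings on two days, with 2020-07-16 missing (hypothetical sample values).
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-07-15 11:46:41", "2020-07-17 07:17:52", "2020-07-17 09:23:47"]
    ),
    "ext_temp": [30, 26, 29],
})

# pd.Grouper builds a continuous daily index, so the empty day shows up as NaN.
by_grouper = df.groupby(pd.Grouper(key="date", freq="D"))["ext_temp"].max()

# Grouping on the date component keeps only the days that actually occur.
by_date = df.groupby(df["date"].dt.date)["ext_temp"].max()

print(len(by_grouper), len(by_date))
```

With pd.Grouper the result has one row per calendar day in the range, which is why .dropna() is needed before the bar plot; with .dt.date there is nothing to drop.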
My DataFrame looks like this:
What I would like to do is: if weight is once less than 70, drop all rows that have the same name. So, if Thomas' weight was once less than 70, drop all his data and repeat this for all the other names.
So in my case the result would be:
Code to rebuild data:
data = {'date': {0: pd.Timestamp('2014-01-01 00:00:00'),
  1: pd.Timestamp('2014-01-02 00:00:00'),
  2: pd.Timestamp('2014-01-03 00:00:00'),
  3: pd.Timestamp('2014-01-04 00:00:00'),
  4: pd.Timestamp('2014-01-05 00:00:00'),
  5: pd.Timestamp('2014-01-06 00:00:00'),
  6: pd.Timestamp('2014-01-07 00:00:00'),
  7: pd.Timestamp('2014-01-08 00:00:00')},
'name': {0: 'Thomas', 1: 'Thomas', 2: 'Thomas', 3: 'Max',
4: 'Max', 5: 'Paul', 6: 'Paul', 7: 'Paul'},
'size': {0: 130, 1: 132, 2: 132, 3: 143, 4: 150, 5: 140,
6: 140, 7: 141},
'weight': {0: 60, 1: 65, 2: 80, 3: 75, 4: 56, 5: 75, 6: 76, 7: 74}}
df = pd.DataFrame(data)
Try as follows:
Select column name from the df based on Series.lt and turn into a list with Series.tolist. Feed the resulting list to Series.isin and combine with unary operator (~) for selection from the df.
res = df[~df.name.isin(df[df.weight.lt(70)].name.tolist())]
print(res)
        date  name  size  weight
5 2014-01-06  Paul   140      75
6 2014-01-07  Paul   140      76
7 2014-01-08  Paul   141      74
Or as a variant on this answer to a similar question, try as follows:
Use df.groupby on column name and apply filter with a lambda function, keeping the group only if Series.ge is True for all its values.
res = df.groupby('name').filter(lambda x: x.weight.ge(70).all())
# same result
names = list(df[df['weight']<70]['name'])
df_new = df[~(df['name'].isin(names))]
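To check that the approaches above agree, here is a minimal, self-contained sketch using only the name and weight columns from the question. The third variant, based on transform("min"), does not appear in the answers above and is included only as an assumption-labelled alternative.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Thomas", "Thomas", "Thomas", "Max", "Max", "Paul", "Paul", "Paul"],
    "weight": [60, 65, 80, 75, 56, 75, 76, 74],
})

# Approach 1: collect the names that ever fall below 70, then exclude them.
bad_names = df.loc[df["weight"].lt(70), "name"]
res_isin = df[~df["name"].isin(bad_names)]

# Approach 2: keep a group only if every weight in it is at least 70.
res_filter = df.groupby("name").filter(lambda g: g["weight"].ge(70).all())

# Approach 3 (alternative, not from the answers): broadcast each group's
# minimum weight back to its rows and keep rows whose group minimum is >= 70.
res_transform = df[df.groupby("name")["weight"].transform("min").ge(70)]

print(res_isin)
```

All three keep only Paul's rows. The transform variant avoids calling a Python lambda once per group, which may matter on frames with many distinct names.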
I am unable to drop the index column that pandas adds automatically. I melted a data frame, and for further processing I need to drop the index column, but I can't work out how.
Attached is the data frame which is uploaded in df:
{'Key': {0: 65162552161356, 1: 65162552635756, 2: 65162552843456, 3: 65162552842856, 4: 65162552736856}, '2021-04-01': {0: 31, 1: 0, 2: 281, 3: 207, 4: 55}, '2021-05-01': {0: 25, 1: 0, 2: 72, 3: 104, 4: 6}, '2021-06-01': {0: 16, 1: 0, 2: 108, 3: 32, 4: 14}, '2021-07-01': {0: 8, 1: 0, 2: 107, 3: 78, 4: 10}, '2021-08-01': {0: 21, 1: 0, 2: 80, 3: 40, 4: 9}, '2021-09-01': {0: 24, 1: 0, 2: 40, 3: 73, 4: 3}, '2021-10-01': {0: 13, 1: 0, 2: 36, 3: 79, 4: 11}, '2021-11-01': {0: 59, 1: 0, 2: 65, 3: 139, 4: 14}, '2021-12-01': {0: 51, 1: 0, 2: 41, 3: 87, 4: 10}, '2022-01-01': {0: 2, 1: 0, 2: 43, 3: 47, 4: 6}, '2022-02-01': {0: 0, 1: 0, 2: 0, 3: 63, 4: 3}, '2022-03-01': {0: 0, 1: 0, 2: 16, 3: 76, 4: 18}, '2022-04-01': {0: 0, 1: 0, 2: 37, 3: 32, 4: 8}, '2022-05-01': {0: 0, 1: 0, 2: 106, 3: 96, 4: 40}, '2022-06-01': {0: 0, 1: 0, 2: 101, 3: 75, 4: 16}, '2022-07-01': {0: 0, 1: 0, 2: 60, 3: 46, 4: 14}, '2022-08-01': {0: 0, 1: 0, 2: 73, 3: 91, 4: 13}, '2022-09-01': {0: 0, 1: 0, 2: 19, 3: 17, 4: 2}}
Can someone help me out and let me know how to make the changes?
df = pd.read_excel ('C:/X/X/X/Demand_Data_Used.xlsx')
df['Key'] = df['Key'].astype(str)
df = pd.melt(df,id_vars='Key',value_vars=list(df.columns[1:]),var_name ='ds')
df.columns = df.columns.str.replace('Key', 'unique_id')
df.columns = df.columns.str.replace('value', 'y')
df["ds"] = pd.to_datetime(df["ds"],format='%Y-%m-%d')
df=df[["ds","unique_id","y"]]
df
The df data frame looks like this after the completion of this operation:
I would like it to look like this:
I know the expected output doesn't contain the same values; I was just trying to show the layout I'm after. Can someone help me figure out the correct way to drop the index column?
It looks like you don't actually want to drop the index column, but instead assign another column as the index. You can do this very easily with this assignment:
df.index = df["unique_id"]
Now if you want to drop the column afterwards (the value would now show up twice essentially), you can do this as well:
df = df.drop("unique_id", axis=1)
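The two steps above can also be done in one call. The following is a minimal sketch, assuming a small stand-in for the melted frame (the values are hypothetical); set_index moves the column into the index and removes it from the columns at the same time.

```python
import pandas as pd

# Small stand-in for the melted frame (hypothetical values).
df = pd.DataFrame({
    "ds": pd.to_datetime(["2021-04-01", "2021-05-01"]),
    "unique_id": ["65162552161356", "65162552161356"],
    "y": [31, 25],
})

# Move unique_id into the index; it no longer appears among the columns.
indexed = df.set_index("unique_id")
print(indexed)

# If the goal is only to hide the default 0..n-1 index when displaying,
# the frame can be rendered without it instead of being modified.
print(df.to_string(index=False))
```

A DataFrame always has an index, so it cannot be dropped outright; it can only be replaced by another column or left out when the frame is printed or exported.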
I am trying to merge 2 dataframes and am having trouble figuring out how, as it is not straightforward.
One data frame has match results for over 25000 games and looks like this.
The second one has team performance metrics but only for around 1500 games.
As I am not allowed to post pictures yet, here are the column names of interest:
df_match['date', 'home_team_api_id', 'away_team_api_id']
df_team_attributes['date', 'team_api_id']
Both data frames have additional columns with results or performance metrics.
To be able to merge correctly, I need to merge by date and by looking if the 'team_api_id' matches either 'home...' or 'away_team_api_id'
This is what I have tried until now:
df_team_performance = pd.merge(df_team_attributes, df_match,
how = 'left',
left_on = ['date', 'team_api_id', 'team_api_id'],
right_on = ['date', 'home_team_api_id', 'home_team_api_id'])
I have also tried with only 2 columns, but without success.
What I would like to get is a new data frame with only the rows of the df_team_attributes and columns from both data frames.
Thank you in advance!
Added to request by Correlien:
output of print(df_match[['date', 'home_team_api_id', 'away_team_api_id', 'win_home', 'win_away', 'draw', 'win']].head(10).to_dict())
{'date': {0: '2008-08-17 00:00:00', 1: '2008-08-16 00:00:00', 2: '2008-08-16 00:00:00', 3: '2008-08-17 00:00:00', 4: '2008-08-16 00:00:00', 5: '2008-09-24 00:00:00', 6: '2008-08-16 00:00:00', 7: '2008-08-16 00:00:00', 8: '2008-08-16 00:00:00', 9: '2008-11-01 00:00:00'}, 'home_team_api_id': {0: 9987, 1: 10000, 2: 9984, 3: 9991, 4: 7947, 5: 8203, 6: 9999, 7: 4049, 8: 10001, 9: 8342}, 'away_team_api_id': {0: 9993, 1: 9994, 2: 8635, 3: 9998, 4: 9985, 5: 8342, 6: 8571, 7: 9996, 8: 9986, 9: 8571}, 'win_home': {0: 0, 1: 0, 2: 0, 3: 1, 4: 0, 5: 0, 6: 0, 7: 0, 8: 1, 9: 1}, 'win_away': {0: 0, 1: 0, 2: 1, 3: 0, 4: 1, 5: 0, 6: 0, 7: 1, 8: 0, 9: 0}, 'draw': {0: 1, 1: 1, 2: 0, 3: 0, 4: 0, 5: 1, 6: 1, 7: 0, 8: 0, 9: 0}, 'win': {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 0, 6: 0, 7: 1, 8: 1, 9: 1}}
output for print(df_team_attributes[['date', 'team_api_id', 'buildUpPlaySpeed', 'buildUpPlaySpeedClass']].head(10).to_dict())
{'date': {0: '2010-02-22 00:00:00', 1: '2014-09-19 00:00:00', 2: '2015-09-10 00:00:00', 3: '2010-02-22 00:00:00', 4: '2011-02-22 00:00:00', 5: '2012-02-22 00:00:00', 6: '2013-09-20 00:00:00', 7: '2014-09-19 00:00:00', 8: '2015-09-10 00:00:00', 9: '2010-02-22 00:00:00'}, 'team_api_id': {0: 9930, 1: 9930, 2: 9930, 3: 8485, 4: 8485, 5: 8485, 6: 8485, 7: 8485, 8: 8485, 9: 8576}, 'buildUpPlaySpeed': {0: 60, 1: 52, 2: 47, 3: 70, 4: 47, 5: 58, 6: 62, 7: 58, 8: 59, 9: 60}, 'buildUpPlaySpeedClass': {0: 'Balanced', 1: 'Balanced', 2: 'Balanced', 3: 'Fast', 4: 'Balanced', 5: 'Balanced', 6: 'Balanced', 7: 'Balanced', 8: 'Balanced', 9: 'Balanced'}}
Have you tried casting your date columns to datetime and then attempting the merge? The following worked for me based on the example that you provided:
# Casting to date
df_match["date"] = pd.to_datetime(df_match["date"])
df_team_attributes["date"] = pd.to_datetime(df_team_attributes["date"])
# Merging on the date field alone
df_team_performance = pd.merge(df_team_attributes, df_match,
how = 'left',
on = 'date')
# Filtering out the required rows
result = df_team_performance.query("(team_api_id == home_team_api_id) | (team_api_id == away_team_api_id)")
Please let me know if my understanding of your question is correct.
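To make the merge-then-filter idea concrete, here is a minimal sketch on hypothetical miniature versions of the two frames (the ids and dates are made up for illustration). It merges on date alone and then keeps only the rows where the team played either at home or away.

```python
import pandas as pd

# Hypothetical miniature versions of the two frames.
df_match = pd.DataFrame({
    "date": ["2010-02-22 00:00:00", "2010-02-22 00:00:00", "2011-02-22 00:00:00"],
    "home_team_api_id": [9930, 8485, 8576],
    "away_team_api_id": [8576, 9987, 8485],
})
df_team_attributes = pd.DataFrame({
    "date": ["2010-02-22 00:00:00", "2011-02-22 00:00:00"],
    "team_api_id": [9930, 8485],
    "buildUpPlaySpeed": [60, 47],
})

# Parse each frame's own date column so the merge keys share a dtype.
df_match["date"] = pd.to_datetime(df_match["date"])
df_team_attributes["date"] = pd.to_datetime(df_team_attributes["date"])

# Pair every attribute row with every match played on the same date.
merged = pd.merge(df_team_attributes, df_match, how="left", on="date")

# Keep only the pairs where the team was the home or the away side.
result = merged.query(
    "(team_api_id == home_team_api_id) | (team_api_id == away_team_api_id)"
)
print(result)
```

Note that the filter also discards attribute rows that found no match on their date, because their home and away ids are NaN after the left merge. If those rows must be kept, they would need to be added back separately.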
I have a dataframe with the following data (example):
year  month  count
2020  11     100
      12     50
2021  01     80
      02     765
      03     100
      04     265
      05     500
I would like to plot this with plotly as a bar chart with 2 vertical bars for each month, one for 2020 and another for 2021. I would like the axes to be defined automatically from the values in the dataset, which could change. Today it only contains the years 2020 and 2021, but that could be different.
I have searched for information, but the examples always use hardcoded series names and data, and I don't understand how I could pass these to plotly dynamically.
I was expecting something like this but it is not working:
import plotly.express as px
...
px.bar(df, x=['year','month'], y='count')
fig.show()
Thank you,
To get two vertical bars for each month, I'm guessing the traces should represent each individual year. In that case you can use:
for y in df.year.unique():
    dfy = df[df.year == y]
    fig.add_bar(x = dfy.month, y = dfy.value, name = str(y))
Plot 1
That's the result for your limited dataset, though. If you expand the dataset a bit you'll get a better impression of how it will look:
Plot 2
Complete code:
import plotly.graph_objects as go
import pandas as pd
# limited dataset from the question (Plot 1)
df = pd.DataFrame({'year': {0: 2020, 1: 2020, 2: 2021, 3: 2021, 4: 2021, 5: 2021, 6: 2021},
                   'month': {0: 11, 1: 12, 2: 1, 3: 2, 4: 3, 5: 4, 6: 5},
                   'value': {0: 100, 1: 50, 2: 80, 3: 765, 4: 100, 5: 265, 6: 500}})
# expanded dataset (Plot 2); this replaces the df defined above
df = pd.DataFrame({'year': {0: 2020,
1: 2020,
2: 2020,
3: 2020,
4: 2020,
5: 2020,
6: 2020,
7: 2020,
8: 2020,
9: 2020,
10: 2020,
11: 2020,
12: 2021,
13: 2021,
14: 2021,
15: 2021,
16: 2021,
17: 2021,
18: 2021,
19: 2021,
20: 2021,
21: 2021,
22: 2021,
23: 2021},
'month': {0: 1,
1: 2,
2: 3,
3: 4,
4: 5,
5: 6,
6: 7,
7: 8,
8: 9,
9: 10,
10: 11,
11: 12,
12: 1,
13: 2,
14: 3,
15: 4,
16: 5,
17: 6,
18: 7,
19: 8,
20: 9,
21: 10,
22: 11,
23: 12},
'value': {0: 100,
1: 50,
2: 265,
3: 500,
4: 80,
5: 765,
6: 100,
7: 265,
8: 500,
9: 80,
10: 765,
11: 100,
12: 80,
13: 765,
14: 100,
15: 265,
16: 500,
17: 80,
18: 765,
19: 100,
20: 265,
21: 500,
22: 80,
23: 765}})
fig = go.Figure()
for y in df.year.unique():
    dfy = df[df.year == y]
    fig.add_bar(x = dfy.month, y = dfy.value, name = str(y))
fig.show()
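The loop works for any set of years because the trace names come from the data rather than from hardcoded values. Here is a minimal pandas-only sketch of that split, assuming a small hypothetical dataset; the traces list is an illustrative helper, not part of plotly, and each entry holds the x, y and name values that the loop would pass to fig.add_bar.

```python
import pandas as pd

# Hypothetical sample; the last row is made up so that month 11 has two years.
df = pd.DataFrame({
    "year": [2020, 2020, 2021, 2021, 2021],
    "month": [11, 12, 1, 2, 11],
    "value": [100, 50, 80, 765, 90],
})

# One entry per year found in the data, mirroring one bar trace per year.
traces = [
    {"name": str(year), "x": group["month"].tolist(), "y": group["value"].tolist()}
    for year, group in df.groupby("year")
]

for trace in traces:
    print(trace)
```

Adding rows for another year to the dataframe adds another entry, and therefore another bar series, without any change to the code.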
I have modified your data to demonstrate.
This is an example of a multi-categorical axis (https://plotly.com/python/categorical-axes/#multicategorical-axes), hence the need to use go.
import pandas as pd
import io
import plotly.express as px
import plotly.graph_objects as go
df = pd.read_csv(
io.StringIO(
"""year,month,count
2020,1,50
2020,2,50
2020,3,50
2020,4,50
2020,11,100
2020,12,50
2021,1,80
2021,2,765
2021,3,100
2021,4,265
2021,5,500"""
)
)
go.Figure(go.Bar(x=[df["month"].tolist(), df["year"].tolist()], y=df["count"]))
Using Plotly Express and updated with multi-categorical x-axis:
import pandas as pd
import io
import plotly.express as px
df = pd.read_csv(
io.StringIO(
"""year,month,count
2020,1,50
2020,2,50
2020,3,50
2020,4,50
2020,11,100
2020,12,50
2021,1,80
2021,2,765
2021,3,100
2021,4,265
2021,5,500"""
)
)
# convert year to string so you get a categorical scale
df['year'] = df['year'].astype(str)
channel_top_Level = "year"
channel_2nd_Level = "month"
fig = px.bar(df, x = channel_2nd_Level, y = 'count', color = channel_top_Level)
for num, channel_top_Level_val in enumerate(df[channel_top_Level].unique()):
    temp_df = df.query(f"{channel_top_Level} == {channel_top_Level_val!r}")
    fig.data[num].x = [
        temp_df[channel_2nd_Level].tolist(),
        temp_df[channel_top_Level].tolist()
    ]
fig.layout.xaxis.title.text = f"{channel_top_Level} / {channel_2nd_Level}"
fig
I have df:
df = pd.DataFrame({'period': {0: pd.Timestamp('2016-05-01 00:00:00'),
1: pd.Timestamp('2017-05-01 00:00:00'),
2: pd.Timestamp('2018-03-01 00:00:00'),
3: pd.Timestamp('2018-04-01 00:00:00'),
4: pd.Timestamp('2016-05-01 00:00:00'),
5: pd.Timestamp('2017-05-01 00:00:00'),
6: pd.Timestamp('2016-03-01 00:00:00'),
7: pd.Timestamp('2016-04-01 00:00:00')},
'cost2': {0: 15,
1: 144,
2: 44,
3: 34,
4: 13,
5: 11,
6: 12,
7: 13},
'rev2': {0: 154,
1: 13,
2: 33,
3: 37,
4: 15,
5: 11,
6: 12,
7: 13},
'cost1': {0: 19,
1: 39,
2: 53,
3: 16,
4: 19,
5: 11,
6: 12,
7: 13},
'rev1': {0: 34,
1: 34,
2: 74,
3: 22,
4: 34,
5: 11,
6: 12,
7: 13},
'destination': {0: 'YYZ',
1: 'YYZ',
2: 'YYZ',
3: 'YYZ',
4: 'DFW',
5: 'DFW',
6: 'DFW',
7: 'DFW'},
'source': {0: 'SFO',
1: 'SFO',
2: 'SFO',
3: 'SFO',
4: 'MIA',
5: 'MIA',
6: 'MIA',
7: 'MIA'}})
df = df[['source','destination','period','rev1','rev2','cost1','cost2']]
which looks like:
I want the final df to have the following columns:
2017-05-01 2016-05-01
source, destination, rev1, rev2, cost1, cost2, rev1, rev2, cost1, cost2...
So essentially, for every source/destination pair, I want revenue and cost numbers for each date in a single row.
I've been tinkering with stack and unstack but haven't been able to achieve my objective.
You can use set_index + unstack to reshape from long to wide, then use swaplevel to arrange the column index levels the way you need:
df.set_index(['destination','source','period']).unstack().swaplevel(0,1,axis=1).sort_index(level=0,axis=1)
An alternative to .set_index + .unstack is .pivot_table:
df.pivot_table( \
index=['source', 'destination'], \
columns=['period'], \
values=['rev1', 'rev2', 'cost1', 'cost2'] \
).swaplevel(axis=1).sort_index(axis=1, level=0)
# period             2016-03-01                   2016-04-01                   ...
#                         cost1 cost2  rev1  rev2      cost1 cost2  rev1  rev2
# source destination
# MIA    DFW               12.0  12.0  12.0  12.0       13.0  13.0  13.0  13.0
# SFO    YYZ                NaN   NaN   NaN   NaN        NaN   NaN   NaN   NaN
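Both approaches leave the result with two column levels (period on top, metric below). Here is a minimal sketch on a reduced, hypothetical version of the data, showing the set_index + unstack chain step by step. The final flattening of the column levels into single strings is an optional extra that the question did not ask for.

```python
import pandas as pd

# Reduced, hypothetical version of the frame from the question.
df = pd.DataFrame({
    "source": ["SFO", "SFO", "MIA"],
    "destination": ["YYZ", "YYZ", "DFW"],
    "period": pd.to_datetime(["2016-05-01", "2017-05-01", "2016-05-01"]),
    "rev1": [34, 34, 34],
    "cost1": [19, 39, 19],
})

wide = (
    df.set_index(["source", "destination", "period"])
      .unstack()                    # period moves from the index into the columns
      .swaplevel(0, 1, axis=1)      # put period above the metric names
      .sort_index(axis=1, level=0)  # group the metrics under each period
)

# Optional: collapse the two column levels into names like "2016-05-01_rev1".
flat = wide.copy()
flat.columns = [f"{period:%Y-%m-%d}_{metric}" for period, metric in wide.columns]
print(flat.reset_index())
```

Pairs that have no row for a given period, such as MIA/DFW in 2017 here, come out as NaN, which is also why the integer columns are shown as floats in the wide result.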