I have a dataframe like this:
import pandas as pd
import numpy as np
date_vals = ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-05',
             '2022-01-06', '2022-01-07', '2022-01-08', '2022-01-09', '2022-01-10',
             '2022-01-11', '2022-01-12', '2022-01-13', '2022-01-14', '2022-01-15',
             '2022-01-16', '2022-01-17', '2022-01-18', '2022-01-19', '2022-01-20']
machine_vals = ['M-1', 'M-2', 'M-1', 'M-2', 'M-1', 'M-2', 'M-1', 'M-2', 'M-1', 'M-2',
                'M-1', 'M-2', 'M-1', 'M-2', 'M-1', 'M-2', 'M-1', 'M-2', 'M-1', 'M-2']
shift_vals = ['Day', 'Night', 'Day', 'Night', 'Day', 'Night', 'Day', 'Night', 'Day', 'Night',
              'Day', 'Night', 'Day', 'Night', 'Day', 'Night', 'Day', 'Night', 'Day', 'Night']
type_vals = [['Type A', 'Type B'], 'Type B', 'Type A', 'Type B', 'Type A', 'Type A', 'Type B', 'Type B',
             'Type A', 'Type B', ['Type A', 'Type B'], 'Type B', ['Type A', 'Type B'], 'Type A', 'Type B', ['Type A', 'Type B'],
             'Type A', 'Type B', 'Type A', 'Type B']
meter_vals = [[1000, 800], 1500, 900, 1700, 1200, 1300, 1600, 1400, 1300, 1100, [1400, 200], 1200,
              [1000, 700], 1500, 1600, [1300, 900], 1200, 1100, 1300, 1700]
data = {
    'Date': date_vals,
    'Machine': machine_vals,
    'Shift': shift_vals,
    'Type': type_vals,
    'Meter': meter_vals
}
df = pd.DataFrame(data)
df
Some cells in the Type and Meter columns hold two values each; for example, ['Type A', 'Type B'] paired with [1000, 800] means 1000 meters of Type A and 800 meters of Type B.
In this case, I have two questions:
First of all: is it healthy for a dataset to store two different values in a single cell of a column? Will it cause any problems in later stages of the work? Do you have any advice for situations like this?
The second question: with the dataframe above, I would like to plot
the dates on the bottom axis, and on the other axis how many meters each machine produced on each date as Type A and Type B;
the dates on the bottom axis, and on the other axis how many meters each machine produced on each date in the Day and Night shifts.
The focus of my questions is how I should behave when there are two different values in a cell (for example: ['Type A', 'Type B'] and [1000, 800]). Thanks.
First, thank you for providing a reproducible example. Makes answering such questions so much easier.
So, the first step would be to clean up your df a bit:
Make Date actual Timestamps.
Explode the columns that contain lists. It makes further processing easier, and it is also the answer to your first question: two values in one cell will fight you at every later step (filtering, grouping, plotting), so the usual advice is one value per cell, which is exactly what .explode() produces.
df2 = df.assign(Date=pd.to_datetime(df['Date'])).explode(['Type', 'Meter'])
>>> df2
Date Machine Shift Type Meter
0 2022-01-01 M-1 Day Type A 1000
0 2022-01-01 M-1 Day Type B 800
1 2022-01-02 M-2 Night Type B 1500
.. ... ... ... ... ...
17 2022-01-18 M-2 Night Type B 1100
18 2022-01-19 M-1 Day Type A 1300
19 2022-01-20 M-2 Night Type B 1700
[24 rows x 5 columns]
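One caveat worth noting: after .explode() the Meter column keeps the object dtype (it held a mix of lists and scalars), so it is worth casting it to a numeric type before aggregating; for example:
df2['Meter'] = pd.to_numeric(df2['Meter'])  # explode leaves object dtype; cast so sums and means behave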
Then, I presume you'd like to plot these measurements in various ways.
Here are some suggestions:
By type, each date has its own bar(s)
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (5, 3)
plt.rcParams['figure.facecolor'] = 'white'
# each date with its own bar(s):
z = df2.pivot_table(values='Meter', index='Date', columns='Type', aggfunc='sum')
ax = z.plot.bar()
# matplotlib messes up categorical date formats
ax.set_xticklabels(z.index.strftime('%Y-%m-%d'))
plt.show()
Aggregate by week
zw = z.resample('W').sum()
ax = zw.plot.bar()
ax.set_xticklabels(zw.index.strftime('%Y-%m-%d'))
plt.show()
By shift (Day or Night): total meters, grouped by week
z = df2.groupby([pd.Grouper(freq='W', key='Date'), 'Shift'])['Meter'].sum().unstack('Shift')
ax = z.plot.bar()
ax.set_xticklabels(z.index.strftime('%Y-%m-%d'))
plt.show()
Two plots: one for each shift
In each plot, show meters by type.
fig, axes = plt.subplots(nrows=2, sharex=True, figsize=(5, 5))
for ax, (label, sdf) in zip(axes, df2.groupby('Shift')):
    z = sdf.groupby([pd.Grouper(freq='W', key='Date'), 'Type'])['Meter'].sum().unstack('Type')
    z.plot.bar(ax=ax)
    ax.set_xticklabels(z.index.strftime('%Y-%m-%d'))
    ax.set_title(label)
plt.tight_layout()
plt.show()
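The plots above aggregate over both machines. Since you also asked for the numbers per machine, the same pattern extends by adding Machine to the grouping; a sketch along the lines of the pivot above:
# per-machine, per-type totals: MultiIndex columns, one bar per (Machine, Type) pair
zm = df2.pivot_table(values='Meter', index='Date', columns=['Machine', 'Type'], aggfunc='sum')
ax = zm.plot.bar()
ax.set_xticklabels(zm.index.strftime('%Y-%m-%d'))
plt.show()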
Related
Let's take the following pd.DataFrame as an example
df = pd.DataFrame({
    'month': ['2022-01', '2022-02', '2022-03'],
    'col1': [1_000, 1_500, 2_000],
    'col2': [100, 150, 200],
}).melt(id_vars=['month'], var_name='col_name')
which creates
     month col_name  value
0  2022-01     col1   1000
1  2022-02     col1   1500
2  2022-03     col1   2000
3  2022-01     col2    100
4  2022-02     col2    150
5  2022-03     col2    200
Now when I use simple seaborn,
sns.barplot(data=df, x='month', y='value', hue='col_name');
I get a grouped bar chart with the expected month ticks.
Now I would like to use plotly and the following code
import plotly.express as px
fig = px.histogram(df,
                   x="month",
                   y="value",
                   color='col_name', barmode='group', height=500, width=1_200)
fig.show()
And the x-ticks come out weird. Why are they not simply 2022-01, 2022-02 and 2022-03? What is happening here?
I found that I always have this problem with the ticks when using color; the grouping somehow throws the automatic date ticks off.
You can solve it by pinning the tick step to one month with dtick="M1", as follows (note that this snippet also uses px.bar rather than px.histogram, which is the better fit for plotting pre-aggregated values):
import pandas as pd
import plotly.express as px
df = pd.DataFrame({
    'month': ['2022-01', '2022-02', '2022-03'],
    'col1': [1000, 1500, 2000],
    'col2': [100, 150, 200],
}).melt(id_vars=['month'], var_name='col_name')
fig = px.bar(df,
             x="month",
             y="value",
             color='col_name', barmode='group', height=500, width=1200)
fig.update_xaxes(
    tickformat='%Y-%m',
    dtick="M1",
)
fig.show()
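Alternatively, if you would rather keep the tick labels exactly as the strings in the month column, you can force a categorical axis instead of letting plotly parse the values as dates (a sketch):
fig.update_xaxes(type='category')  # treat the month strings as plain categories, not dates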
I have the following dataframe
df = pd.DataFrame({
    'date': [1988, 1988, 2000, 2005],
    'value': [2100, 4568, 7896, 68909]
})
I want to make a time series based on this df. How can I change the year from int to a DatetimeIndex so I can plot a time series?
Use pd.to_datetime to convert the year to datetime and DataFrame.set_index to get the Series; you can then plot it with Series.plot:
(df.assign(date=pd.to_datetime(df['date'], format='%Y'))
   .set_index('date')['value']
   .plot())
If you want to keep the series, use:
s = (df.assign(date=pd.to_datetime(df['date'], format='%Y'))
       .set_index('date')['value'])
and then:
s.plot()
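Note that your sample data has 1988 twice; if you want a single observation per year, you may need to aggregate first. A sketch, assuming duplicate years should be summed:
s = (df.assign(date=pd.to_datetime(df['date'], format='%Y'))
       .groupby('date')['value']
       .sum())  # assumption: duplicate years are summed into one point
s.plot()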
df = pd.DataFrame({
    'date': [1988, 1988, 2000, 2005],
    'value': [2100, 4568, 7896, 68909]
})
date = []
for year in df.date:
    date.append(pd.Timestamp(year=year, month=1, day=1))  # pd.datetime was removed from pandas; use pd.Timestamp
df.index = date
df['value'].plot()
Objective: Create an output with a comparable SUMPRODUCT method within pandas
Description: There are two data frames that I need to make use of (df and df_2_copy). I am trying to sum the 1-mo CDs, 3-mo CDs, and 6-mo CDs rows after multiplying each by its respective Price in df (2000, 3000, 5000).
import pandas as pd
data = [['1-mo CDs', 1.0, 1, 2000, '1, 2, 3, 4, 5, and 6'],
        ['3-mo CDs', 4.0, 3, 3000, '1 and 4'],
        ['6-mo CDs', 9.0, 6, 5000, '1']]
df = pd.DataFrame(data, columns=['Scenario', 'Yield', 'Term', 'Price', 'Purchase CDs in months'])
df
data_2 = [['Init Cash', 400000, 325000, 335000, 355000, 275000, 225000, 240000],
          ['Matur CDs', 0, 0, 0, 0, 0, 0, 0],
          ['Interest', 0, 0, 0, 0, 0, 0, 0],
          ['1-mo CDs', 0, 0, 0, 0, 0, 0, 0],
          ['3-mo CDs', 0, 0, 0, 0, 0, 0, 0],
          ['6-mo CDs', 0, 0, 0, 0, 0, 0, 0],
          ['Cash Uses', 75000, -10000, -20000, 80000, 50000, -15000, 60000],
          ['End Cash', 0, 0, 0, 0, 0, 0, 0]]
# set up the table
df_2 = pd.DataFrame(data_2, columns=['Month', 'Month 1', 'Month 2', 'Month 3', 'Month 4', 'Month 5', 'Month 6', 'End'])
df_2_copy = df_2.copy()
Ultimately, I would like to place the output of the SUMPRODUCT at the df_2_copy.iloc[7] location.
Any help would be appreciated.
You can do it the following way.
Generate df3: the values from df_2 for the particular months, with the Month
column changed to the index, limited to rows that have corresponding rows in df:
df3 = df_2.drop(columns='End').set_index('Month')\
          .query('index in @df.Scenario')
For my test data, with the Month n values changed from zeros, it was:
Month 1 Month 2 Month 3 Month 4 Month 5 Month 6
Month
1-mo CDs 1 2 0 2 2 0
3-mo CDs 1 0 3 0 4 0
6-mo CDs 1 1 0 2 0 0
Then generate df4: df with Scenario changed to the index,
limited to the Price column, but still as a DataFrame:
df4 = df.set_index('Scenario').Price.to_frame()
The result is:
Price
Scenario
1-mo CDs 2000
3-mo CDs 3000
6-mo CDs 5000
Then calculate sums:
sums = (df3.values * df4.values).sum(axis=0)
The result is:
[10000 9000 9000 14000 16000 0]
And the last step is to write these numbers into the target location (df_2_copy.iloc[7], as described in the question):
df_2_copy.iloc[7, 1:7] = sums
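As an aside, this multiply-then-sum is exactly a dot product, so the same numbers can be obtained with index alignment instead of .values. A small sketch, relying on df4's Scenario index matching df3's index:
# a dot product is a SUMPRODUCT: multiply price by quantity per CD, sum over CDs
sums = df4['Price'] @ df3
df_2_copy.iloc[7, 1:7] = sums.values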
I have a dataframe where I take the subset of only numeric columns, calculate the 5 day rolling average for each numeric column and add it as a new column to the df.
This approach works but currently takes quite a long time (8 seconds per column). I'm wondering if there is a better way to do this.
A working toy example of what I'm doing currently:
data = {'Group': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C'],
        'Year': ['2017', '2017', '2017', '2018', '2018', '2018', '2017', '2017', '2018', '2018', '2017', '2017', '2017', '2017', '2018', '2018'],
        'Score 1': [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
        'Score 2': [1, 4, 5, 7, 5, 5, 6, 1, 4, 5, 6, 7, 4, 6, 4, 6]}
df = pd.DataFrame(data)
for col in ['Score 1', 'Score 2']:
    df[col + '_avg'] = df.groupby(['Year', 'Group'])[col].apply(lambda x: x.rolling(2, 1).mean().shift().bfill())
For anyone who lands on this, I was able to speed this up significantly by sorting first and avoiding the lambda function:
df[col + '_avg'] = df.sort_values(['Group', 'Year']).groupby('Group')[col].rolling(2, 1).mean().shift().values
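One caveat: the .shift() above runs over the whole concatenated result, so the first row of each group inherits a value from the previous group. A group-safe sketch, still avoiding the Python-level lambda (shown for 'Score 1'):
# rolling mean per group; dropping the group level restores the original row labels
rolled = (df.sort_values(['Group', 'Year'])
            .groupby('Group')['Score 1']
            .rolling(2, 1).mean()
            .reset_index(level=0, drop=True))
# shift and backfill within each group so no value leaks across group boundaries
shifted = rolled.groupby(df['Group']).shift()
df['Score 1_avg'] = shifted.groupby(df['Group']).bfill()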
I need to .groupby() using customer, and then add a column for the date in which the customer made his/her first purchase, and add another column for the corresponding purchase amount.
Here is my code. I am doing the first part wrong and don't know how to do the second. I've tried .loc and .idxmin ....
mydata = [{'amount': 3200, 'close_date': '2013-03-31', 'customer': 'Customer 1'},
          {'amount': 1430, 'close_date': '2013-11-30', 'customer': 'Customer 1'},
          {'amount': 4320, 'close_date': '2014-03-31', 'customer': 'Customer 2'},
          {'amount': 2340, 'close_date': '2015-05-18', 'customer': 'Customer 2'},
          {'amount': 4320, 'close_date': '2015-06-29', 'customer': 'Customer 2'}]
df = pd.DataFrame(mydata)
df.close_date = pd.to_datetime(df.close_date)
df['first_date'] = df.groupby('customer')['close_date'].min().apply(lambda x: x.strftime('%Y-%m'))
If you sort your data by close_date, you can do as follows:
df.sort_values('close_date').groupby('customer')[['close_date', 'amount']].first()
close_date amount
customer
Customer 1 2013-03-31 3200
Customer 2 2014-03-31 4320
.sort_values() was added in pandas 0.17; it used to be .sort() (see the docs).
Two steps.
First, the date of each customer's first purchase:
In [34]: first = df.groupby('customer').close_date.min()
In [35]: first
Out[35]:
customer
Customer 1 2013-03-31
Customer 2 2014-03-31
Name: close_date, dtype: object
We'll use first as an indexer,
In [36]: idx = pd.MultiIndex.from_tuples(list(first.items()), names=['customer', 'close_date'])
In [37]: idx
Out[37]:
MultiIndex(levels=[['Customer 1', 'Customer 2'], ['2013-03-31', '2014-03-31']],
labels=[[0, 1], [0, 1]])
For a DataFrame with those two levels
In [38]: df2 = df.set_index(['customer', 'close_date'])
In [39]: df2.loc[idx]
Out[39]:
amount
customer close_date
Customer 1 2013-03-31 3200
Customer 2 2014-03-31 4320
The result keeps the two-level (customer, close_date) index; .reset_index() will flatten it back into a regular DataFrame.
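For completeness, the .idxmin() route mentioned in the question also works and avoids the MultiIndex entirely. A sketch, assuming close_date has already been converted with pd.to_datetime:
# row labels of each customer's earliest close_date, then select those rows
first_rows = df.loc[df.groupby('customer')['close_date'].idxmin()]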