I'm trying to calculate the rolling mean/std for a column in a dataframe. The pandas and numpy_ext rolling methods seem to need a fixed window size. The dataframe has a column "dates", and I want to decide the window size based on "dates": for example, when calculating the mean/std for rows at day 10, include rows from day 2 to day 6; for rows at day 11, rows from day 3 to day 7; for rows at day 12, rows from day 4 to day 8, and so on.
I want to know if there are methods to do this other than brute-force coding. Sample data, where "quantity" is the target field to calculate the mean and std:
| dates | material | location | quantity |
|-------|----------|----------|----------|
| 1     | C        | A        | 870      |
| 2     | D        | A        | 920      |
| 3     | C        | A        | 120      |
| 4     | D        | A        | 120      |
| 6     | C        | A        | 120      |
| 8     | D        | A        | 1200     |
| 8     | C        | A        | 720      |
| 10    | D        | A        | 480      |
| 11    | D        | A        | 600      |
| 12    | C        | A        | 720      |
| 13    | D        | A        | 80       |
| 13    | D        | A        | 600      |
| 14    | D        | A        | 1200     |
| 18    | E        | B        | 150      |
For example, for each row, I want to get the rolling mean of "quantity" over the previous 3-8 days (if any). The expected output would be:
| dates | material | location | quantity | Mean                              |
|-------|----------|----------|----------|-----------------------------------|
| 1     | C        | A        | 870      | NaN                               |
| 2     | D        | A        | 920      | NaN                               |
| 3     | C        | A        | 120      | NaN                               |
| 4     | D        | A        | 120      | NaN                               |
| 6     | C        | A        | 120      | (870+920)/2 = 895                 |
| 8     | D        | A        | 1200     | (870+920+120+120)/4 = 507.5       |
| 8     | C        | A        | 720      | (870+920+120+120)/4 = 507.5       |
| 10    | D        | A        | 480      | (920+120+120+120)/4 = 320         |
| 11    | D        | A        | 600      | (120+120+120)/3 = 120             |
| 12    | C        | A        | 720      | (120+120+1200+720)/4 = 540        |
| 13    | D        | A        | 80       | (120+1200+720)/3 = 680            |
| 13    | D        | A        | 600      | (120+1200+720)/3 = 680            |
| 14    | D        | A        | 1200     | (120+1200+720+480)/4 = 630        |
| 18    | E        | B        | 150      | (480+600+720+80+600+1200)/6 = 613 |
A follow-up question:
Is there a way to further filter the window by other columns? For example, when calculating the rolling mean for "quantity" of the previous 3-8 days, the rows in the rolling window must have the same "material" as the corresponding row. So the new expected output would be:
| dates | material | location | quantity | Mean               |
|-------|----------|----------|----------|--------------------|
| 1     | C        | A        | 870      | NaN                |
| 2     | D        | A        | 920      | NaN                |
| 3     | C        | A        | 120      | NaN                |
| 4     | D        | A        | 120      | NaN                |
| 6     | C        | A        | 120      | (870)/1 = 870      |
| 8     | D        | A        | 1200     | (920+120)/2 = 520  |
| 8     | C        | A        | 720      | (870+120)/2 = 495  |
| 10    | D        | A        | 480      | (920+120)/2 = 520  |
| 11    | D        | A        | 600      | (120)/1 = 120      |
| 12    | C        | A        | 720      | (120+720)/2 = 420  |
| 13    | D        | A        | 80       | (1200)/1 = 1200    |
| 13    | D        | A        | 600      | (1200)/1 = 1200    |
| 14    | D        | A        | 1200     | (1200+480)/2 = 840 |
| 18    | E        | B        | 150      | NaN                |
Inspired by @PaulS's answer, here is a simple way to select based on conditions from multiple columns:
def get_selection(row):
    dates_mask = (df['dates'] < row['dates'] - 3) & (df['dates'] >= row['dates'] - 8)
    material_mask = df['material'] == row['material']
    return df[dates_mask & material_mask]

df['Mean'] = df.apply(lambda row: get_selection(row)['quantity'].mean(), axis=1)
df['Std'] = df.apply(lambda row: get_selection(row)['quantity'].std(), axis=1)
dates material location quantity Mean Std
0 1 C A 870 NaN NaN
1 2 D A 920 NaN NaN
2 3 C A 120 NaN NaN
3 4 D A 120 NaN NaN
4 6 C A 120 870.0 NaN
5 8 D A 1200 520.0 565.685425
6 8 C A 720 495.0 530.330086
7 10 D A 480 520.0 565.685425
8 11 D A 600 120.0 NaN
9 12 C A 720 420.0 424.264069
10 13 D A 80 1200.0 NaN
11 13 D A 600 1200.0 NaN
12 14 D A 1200 840.0 509.116882
13 18 E B 150 NaN NaN
You can perform your operation with a rolling; however, you have to pre- and post-process the DataFrame a bit to generate the shift:
A = 3
B = 8

s = (df
     # de-duplicate by getting the sum/count per identical date
     .groupby('dates')['quantity']
     .agg(['sum', 'count'])
     # reindex to fill missing dates
     .reindex(range(df['dates'].min(),
                    df['dates'].max() + 1),
              fill_value=0)
     # compute classical rolling
     .rolling(B - A, min_periods=1).sum()
     # compute mean
     .assign(mean=lambda d: d['sum'] / d['count'])
     ['mean'].shift(A + 1)
)
df['Mean'] = df['dates'].map(s)
Output:
dates material location quantity Mean
0 1 C A 870 NaN
1 2 C A 920 NaN
2 3 C A 120 NaN
3 4 C A 120 NaN
4 6 C A 120 895.000000
5 8 D A 1200 507.500000
6 8 D A 720 507.500000
7 10 D A 480 320.000000
8 11 D A 600 120.000000
9 12 D A 720 540.000000
10 13 D A 80 680.000000
11 13 D A 600 680.000000
12 14 D A 1200 630.000000
13 18 E B 150 613.333333
Another possible solution:
def f(x):
    return np.arange(np.amax([0, x-8]), np.amax([0, x-3]))

df['Mean'] = df.dates.map(lambda x: df.quantity[df.dates.isin(f(x))].mean())
Output:
dates material location quantity Mean
0 1 C A 870 NaN
1 2 C A 920 NaN
2 3 C A 120 NaN
3 4 C A 120 NaN
4 6 C A 120 895.000000
5 8 D A 1200 507.500000
6 8 D A 720 507.500000
7 10 D A 480 320.000000
8 11 D A 600 120.000000
9 12 D A 720 540.000000
10 13 D A 80 680.000000
11 13 D A 600 680.000000
12 14 D A 1200 630.000000
13 18 E B 150 613.333333
The DataFrame constructor for anyone else to try:
import pandas as pd

d = {'dates': [1, 2, 3, 4, 6, 8, 8, 10, 11, 12, 13, 13, 14, 18],
     'material': ['C','C','C','C','C','D','D','D','D','D','D','D','D','E'],
     'location': ['A','A','A','A','A','A','A','A','A','A','A','A','A','B'],
     'quantity': [870, 920, 120, 120, 120, 1200, 720, 480, 600, 720, 80, 600, 1200, 150]}
df = pd.DataFrame(d)
df.rolling does accept "a time period of each window. Each window will be a variable sized based on the observations included in the time-period. This is only valid for datetimelike indexes."
So we would have to convert your days to something datetimelike (e.g., a pd.Timestamp or a pd.Timedelta) and set them as the index.
But this method won't be able to perform the shift that you want (e.g., for day 14 you want the window to end not at day 14 but at day 10: 4 days before it).
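For illustration, here is a minimal sketch of that datetimelike route (my own addition, not part of the solution below): fabricate timestamps from the integer day numbers so that a '5D' offset window works. Note it only yields a trailing window ending at the current day, without the 4-day shift:

```python
import pandas as pd

# Fabricate timestamps from the integer days so offset windows work;
# the index must be monotonic for time-based rolling.
tmp = df.set_index(pd.to_datetime(df['dates'], unit='D'))

# Trailing 5-day window ending at the current day - no shift applied.
trailing_mean = tmp['quantity'].rolling('5D').mean()
```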
So there is another option, which df.rolling also accepts:
Use a BaseIndexer subclass
There is very little documentation on it and I'm not an expert, but I was able to hack my solution into it. Surely, there must be a better (proper) way to use all its attributes correctly, and hopefully someone will show it in their answer here.
How I did it:
Inside our BaseIndexer subclass, we have to define the get_window_bounds method that returns a tuple of ndarrays: index positions of the starts of all windows, and those of the ends of all windows respectively (index positions like the ones that can be used in iloc - not with loc).
To find them, I used the most efficient method from this answer: np.searchsorted.
Your 'dates' must be sorted for this.
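To see what np.searchsorted contributes here, a standalone sketch (my own illustration, using the sample dates) that computes the bounds of the window [d-8, d-4] for every row:

```python
import numpy as np

days = np.array([1, 2, 3, 4, 6, 8, 8, 10, 11, 12, 13, 13, 14, 18])

# For each day d, locate the positional bounds of the window [d-8, d-4];
# `side` makes both ends inclusive.
starts = np.searchsorted(days, days - 8, side='left')
ends = np.searchsorted(days, days - 4, side='right')

# starts[i]:ends[i] slices the rows whose date falls in row i's window,
# e.g. for day 6 (row 4): days[starts[4]:ends[4]] -> [1, 2].
```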
Any keyword arguments that we pass to the BaseIndexer subclass constructor will be set as its attributes. I will set day_from, day_to and days:
from pandas.api.indexers import BaseIndexer

class CustomWindow(BaseIndexer):
    """
    Indexer that selects the dates.

    It uses the arguments:
    ----------------------
    day_from : int
    day_to : int
    days : np.ndarray
    """
    def get_window_bounds(self,
                          num_values: int,
                          min_periods: int | None,
                          center: bool | None,
                          closed: str | None) -> tuple[np.ndarray, np.ndarray]:
        """
        I'm not using these arguments, but they must be present (not sure why):
        `num_values` is the length of the df,
        `center`: False, `closed`: None.
        """
        days = self.days
        # With `side` I'm making both ends inclusive:
        window_starts = np.searchsorted(days, days + self.day_from, side='left')
        window_ends = np.searchsorted(days, days + self.day_to, side='right')
        return (window_starts, window_ends)
# In my implementation both ends are inclusive:
day_from = -8
day_to = -4
days = df['dates'].to_numpy()

my_indexer = CustomWindow(day_from=day_from, day_to=day_to, days=days)

df[['mean', 'std']] = (df['quantity']
                       .rolling(my_indexer, min_periods=0)
                       .agg(['mean', 'std']))
Result:
dates material location quantity mean std
0 1 C A 870 NaN NaN
1 2 C A 920 NaN NaN
2 3 C A 120 NaN NaN
3 4 C A 120 NaN NaN
4 6 C A 120 895.000000 35.355339
5 8 D A 1200 507.500000 447.911822
6 8 D A 720 507.500000 447.911822
7 10 D A 480 320.000000 400.000000
8 11 D A 600 120.000000 0.000011
9 12 D A 720 540.000000 523.067873
10 13 D A 80 680.000000 541.109970
11 13 D A 600 680.000000 541.109970
12 14 D A 1200 630.000000 452.990066
13 18 E B 150 613.333333 362.803896
I am trying to create a function that will take in CSV files, create dataframes from them, and concatenate/sum them like so:
id number_of_visits
0 3902932804358904910 2
1 5972629290368575970 1
2 5345473950081783242 1
3 4289865755939302179 1
4 36619425050724793929 19
+
id number_of_visits
0 3902932804358904910 5
1 5972629290368575970 10
2 5345473950081783242 3
3 4289865755939302179 20
4 36619425050724793929 13
=
id number_of_visits
0 3902932804358904910 7
1 5972629290368575970 11
2 5345473950081783242 4
3 4289865755939302179 21
4 36619425050724793929 32
My main issue is that in the for loop, after I create the dataframes, I tried to combine them with df += new_df, but new_df wasn't being added. So I tried the following implementation:
def add_dfs(files):
    master = []
    big = pd.DataFrame({'id': 0, 'number_of_visits': 0}, index=[0])  # dummy df to initialize
    for k in range(len(files)):
        new_df = create_df(str(files[k]))  # helper method to read, create and clean dfs
        master.append(new_df)  # creates a list of dataframes in master
    for k in range(len(master)):
        # iterate through the list of dfs and add them together
        big = pd.concat([big, master[k]]).groupby(['id', 'number_of_visits']).sum().reset_index()
    return big
Which gives me the following:
id number_of_visits
1 1000036822946495682 2
2 1000036822946495682 4
3 1000044447054156512 1
4 1000044447054156512 9
5 1000131582129684623 1
So the number_of_visits for each id aren't actually being added together. Because number_of_visits is part of the groupby key, each distinct (id, number_of_visits) pair forms its own group, so there is nothing to sum; the rows are just sorted by number_of_visits.
Pass your list of dataframes directly to concat(), then group on the id and sum:
>>> pd.concat(master).groupby('id').number_of_visits.sum().reset_index()
id number_of_visits
0 36619425050724793929 32
1 3902932804358904910 7
2 4289865755939302179 21
3 5345473950081783242 4
4 5972629290368575970 11
def add_dfs(files):
    master = []
    for f in files:
        new_df = create_df(f)
        master.append(new_df)
    big = pd.concat(master).groupby('id').number_of_visits.sum().reset_index()
    return big
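A quick self-contained sanity check (a sketch using two of the ids from the question):

```python
import pandas as pd

df1 = pd.DataFrame({'id': [3902932804358904910, 5972629290368575970],
                    'number_of_visits': [2, 1]})
df2 = pd.DataFrame({'id': [3902932804358904910, 5972629290368575970],
                    'number_of_visits': [5, 10]})

# Concatenate, then group on id alone so the visit counts are summed.
print(pd.concat([df1, df2]).groupby('id').number_of_visits.sum().reset_index())
#                     id  number_of_visits
# 0  3902932804358904910                 7
# 1  5972629290368575970                11
```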
You can use
df1['number_of_visits'] += df2['number_of_visits']
this gives you:
| | id | number_of_visits |
|---:|---------------------:|-------------------:|
| 0 | 3902932804358904910 | 7 |
| 1 | 5972629290368575970 | 11 |
| 2 | 5345473950081783242 | 4 |
| 3 | 4289865755939302179 | 21 |
| 4 | 36619425050724793929 | 32 |
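Note that += aligns on the index, not on the id column, so this only works when both frames list the same ids in the same row positions. A sketch of a variant (my addition, assuming both frames have an 'id' column) that aligns on id explicitly:

```python
# Match rows by id rather than by position; fill_value=0 handles ids
# that appear in only one of the frames.
result = (df1.set_index('id')['number_of_visits']
             .add(df2.set_index('id')['number_of_visits'], fill_value=0)
             .reset_index())
```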
I am trying to add a new column to a dataframe with an apply function. I need to compute the distance between the X and Y coords in row 0 and those in all other rows. I have created the following logic:
import pandas as pd
import numpy as np

data = {'X': [0,0,0,1,1,5,6,7,8], 'Y': [0,1,4,2,6,5,6,4,8], 'Value': [6,7,4,5,6,5,6,4,8]}
df = pd.DataFrame(data)

def countDistance(lat1, lon1, lat2, lon2):
    print(lat1, lon1, lat2, lon2)
    # use basic knowledge about triangles - values are in meters
    distance = np.sqrt(np.power(lat1-lat2, 2) + np.power(lon1-lon2, 2))
    return distance

def recModif(df):
    x = df.loc[0, 'X']
    y = df.loc[0, 'Y']
    df['dist'] = df.apply(lambda n: countDistance(x, y, df['X'], df['Y']), axis=1)
    # more code will come here

recModif(df)
But this always returns an error: ValueError: Wrong number of items passed 9, placement implies
I thought that since x and y are scalars, using np.repeat might help, but it didn't; the error was still the same. I saw similar posts, such as this one, but they deal with multiplication, which is simple. How can I achieve the subtraction I need?
The lambda passed to .apply() ignores its row argument (n) and instead references the whole columns df['X'] and df['Y'] from the outer scope, so countDistance receives full Series rather than the row's scalars. Use the row argument and the code works:
df['dist'] = df.apply(lambda row: countDistance(x,y,row['X'],row['Y']), axis=1)
df
X Y Value dist
0 0 0 6 0.000000
1 0 1 7 1.000000
2 0 4 4 4.000000
3 1 2 5 2.236068
4 1 6 6 6.082763
5 5 5 5 7.071068
6 6 6 6 8.485281
7 7 4 4 8.062258
8 8 8 8 11.313708
Also note that np.power() and np.sqrt() are already vectorized, so .apply itself is redundant for the dataset given:
countDistance(x,y,df['X'],df['Y'])
Out[154]:
0 0.000000
1 1.000000
2 4.000000
3 2.236068
4 6.082763
5 7.071068
6 8.485281
7 8.062258
8 11.313708
dtype: float64
To achieve your end goal I suggest changing the function recModif to:
def recModif(df):
    x = df.loc[0, 'X']
    y = df.loc[0, 'Y']
    df['dist'] = countDistance(x, y, df['X'], df['Y'])
    # more code will come here
This outputs
X Y Value dist
0 0 0 6 0.000000
1 0 1 7 1.000000
2 0 4 4 4.000000
3 1 2 5 2.236068
4 1 6 6 6.082763
5 5 5 5 7.071068
6 6 6 6 8.485281
7 7 4 4 8.062258
8 8 8 8 11.313708
Solution
Try this:
## Method-1
df['dist'] = ((df.X - df.X[0])**2 + (df.Y - df.Y[0])**2)**0.5
## Method-2: .apply()
x, y = df.X[0], df.Y[0]
df['dist'] = df.apply(lambda row: ((row.X - x)**2 + (row.Y - y)**2)**0.5, axis=1)
Output:
# print(df.to_markdown(index=False))
| X | Y | Value | dist |
|----:|----:|--------:|---------:|
| 0 | 0 | 6 | 0 |
| 0 | 1 | 7 | 1 |
| 0 | 4 | 4 | 4 |
| 1 | 2 | 5 | 2.23607 |
| 1 | 6 | 6 | 6.08276 |
| 5 | 5 | 5 | 7.07107 |
| 6 | 6 | 6 | 8.48528 |
| 7 | 4 | 4 | 8.06226 |
| 8 | 8 | 8 | 11.3137 |
Dummy Data
import pandas as pd
data = {
    'X': [0,0,0,1,1,5,6,7,8],
    'Y': [0,1,4,2,6,5,6,4,8],
    'Value': [6,7,4,5,6,5,6,4,8]
}
df = pd.DataFrame(data)
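As a side note (my addition, not part of the original answer), numpy's hypot computes the same Euclidean distance in one vectorized call:

```python
import numpy as np

# Equivalent to ((dx)**2 + (dy)**2)**0.5, with better overflow behavior.
df['dist'] = np.hypot(df.X - df.X[0], df.Y - df.Y[0])
```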
The issue
I have a dataset that looks like the toy example below.
I need to create a table that sums the value for each combination of item and period, and displays it in a crosstab / pivot table format.
If I use pandas.crosstab() I get the output I want.
If I use pandas.pivot_table I get what seems like a multi-level index for the columns.
How can I get rid of the multi-level index?
Yes, I could use just crosstab, but:
- in general, I want to learn about multi-level indices
- sometimes I don't have the 'raw' data and instead receive it in the format produced by pivot_table
What I have tried
I have tried totals_pivot_table.droplevel(0), but it says there is only one level. What does this mean?
I also read that dataframe.columns.droplevel() is no longer supported.
Example tables
This is the output of pivot_table:
+--------+-------+-------+-------+-------+
| | value | value | value | value |
+--------+-------+-------+-------+-------+
| period | 1 | 2 | 3 | All |
| item | | | | |
| x | 10 | 11 | 12 | 33 |
| y | 13 | 14 | 15 | 42 |
| All | 23 | 25 | 27 | 75 |
+--------+-------+-------+-------+-------+
This is what I need:
+------+----+----+----+-----+
| item | 1 | 2 | 3 | All |
+------+----+----+----+-----+
| x | 10 | 11 | 12 | 33 |
| y | 13 | 14 | 15 | 42 |
| All | 23 | 25 | 27 | 75 |
+------+----+----+----+-----+
Toy code
df = pd.DataFrame()
df['item'] = np.repeat(['x','y'],3)
df['period'] = np.tile([1,2,3],2)
df['value'] = np.arange(10,16)
pivot = df.pivot(index='item', columns='period', values=None)
totals_pivot_table = df.pivot_table(index='item', columns='period', aggfunc='sum', margins=True)
totals_ct = pd.crosstab(df['item'], df['period'], values=df['value'], aggfunc='sum', margins=True)
Better is to specify the values parameter:
totals_pivot_table = df.pivot_table(index='item',
                                    columns='period',
                                    values='value',
                                    aggfunc='sum',
                                    margins=True)
print (totals_pivot_table)
period 1 2 3 All
item
x 10 11 12 33
y 13 14 15 42
All 23 25 27 75
If that is not possible, you can use DataFrame.droplevel, but be careful about duplicated column names:
print (totals_pivot_table.droplevel(0, axis=1))
period 1 2 3 All
item
x 10 11 12 33
y 13 14 15 42
All 23 25 27 75
df = pd.DataFrame()
df['item'] = np.repeat(['x','y'],3)
df['period'] = np.tile([1,2,3],2)
df['value'] = np.arange(10,16)
df['value1'] = np.arange(7,13)
print (df)
item period value value1
0 x 1 10 7
1 x 2 11 8
2 x 3 12 9
3 y 1 13 10
4 y 2 14 11
5 y 3 15 12
totals_pivot_table = df.pivot_table(index='item',
                                    columns='period',
                                    aggfunc='sum',
                                    margins=True)
print (totals_pivot_table)
value value1
period 1 2 3 All 1 2 3 All
item
x 10 11 12 33 7 8 9 24
y 13 14 15 42 10 11 12 33
All 23 25 27 75 17 19 21 57
print (totals_pivot_table.droplevel(0, axis=1))
period 1 2 3 All 1 2 3 All
item
x 10 11 12 33 7 8 9 24
y 13 14 15 42 10 11 12 33
All 23 25 27 75 17 19 21 57
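If the pivot genuinely has two informative column levels, as in the value/value1 example above, dropping a level loses track of which value each column came from. A sketch (my addition) that flattens the MultiIndex instead, joining the two levels into one name:

```python
flat = totals_pivot_table.copy()
# Each column label is a (value_name, period) tuple; join them into one string.
flat.columns = [f'{val}_{per}' for val, per in flat.columns]
# Columns become: value_1, value_2, value_3, value_All, value1_1, ...
```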
Use reset_index() when making your df totals_ct. Currently:
totals_ct.index
gives:
Index(['x', 'y', 'All'], dtype='object', name='item')
However, using reset_index() when making totals_ct turns the item index into a regular column and leaves a default RangeIndex:
totals_ct = pd.crosstab(df['item'], df['period'], values=df['value'], aggfunc='sum', margins=True).reset_index()
check for result:
totals_ct.index
gives:
RangeIndex(start=0, stop=3, step=1)
Maybe this is what you are looking for.
I have a df which looks like this:
        a  b
apple   7  2
google  8  8
swatch  6  6
merc    7  8
other   8  9
I want to select a given row by name, say "apple", and move it to a new location, say -1 (second-to-last row).
Desired output:
        a  b
google  8  8
swatch  6  6
merc    7  8
apple   7  2
other   8  9
Are there any functions available to achieve this?
Use Index.difference to remove the value and numpy.insert to add it at the new position; last, use DataFrame.reindex or DataFrame.loc to change the order of rows:
a = 'apple'
idx = np.insert(df.index.difference([a], sort=False), -1, a)
print (idx)
Index(['google', 'swatch', 'merc', 'apple', 'other'], dtype='object')
df = df.reindex(idx)
#alternative
#df = df.loc[idx]
print (df)
a b
google 8 8
swatch 6 6
merc 7 8
apple 7 2
other 8 9
This seems good: I am using pd.Index.insert() and pd.Index.drop_duplicates():
df.reindex(df.index.insert(-1,'apple').drop_duplicates(keep='last'))
a b
google 8 8
swatch 6 6
merc 7 8
apple 7 2
other 8 9
I'm not aware of any built-in function, but one approach would be to manipulate the index only, then use the new index to re-order the DataFrame (assumes all index values are unique):
name = 'apple'
position = -1
new_index = [i for i in df.index if i != name]
new_index.insert(position, name)
df = df.loc[new_index]
Results:
a b
google 8 8
swatch 6 6
merc 7 8
apple 7 2
other 8 9
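A small reusable wrapper around the same idea (a sketch; it assumes unique index labels, like the approach above):

```python
import pandas as pd

def move_row(df: pd.DataFrame, name, position: int) -> pd.DataFrame:
    """Return df with the row labelled `name` moved to `position`."""
    new_index = [i for i in df.index if i != name]
    new_index.insert(position, name)  # list.insert(-1, ...) puts it second-to-last
    return df.loc[new_index]

df = move_row(df, 'apple', -1)  # same result as above
```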