So basically I just need advice on how to calculate a 24-month rolling mean over each row of a dataframe. Every row indicates a particular city, and the columns are the respective sales for each month. If anyone could help me figure this out, it would be much appreciated.
Edit: Clearly I failed to explain myself properly. I know that pandas has a rolling method built in. The problem is that I don't want to take the moving average of a single column; I want to take it across the columns of each row.
Sample Dataset
State - M1 - M2 - M3 - M4 - ..... - M48
UT - 40 - 20 - 30 - 60 -..... 60
CA - 30 - 60 - 20 - 40 -..... 70
So I want to find the rolling average for each state's most recent 24 months (columns M24-M48)
What I've tried:
Data['24_Month_Moving_Average'] = Data.rolling(window=24, win_type='triang', min_periods=1, axis=1).mean()
error: Wrong number of items passed 139, placement implies 1
Edit 2, Sample Dataset:
Data = pd.DataFrame({'M1': [1, 2], 'M2': [3, 5], 'M3': [5, 6]}, index=['UT', 'CA'])
# need code that will add a column that is the rolling 24-month average for each state
[Picture of the DataFrame]
You can use rolling() together with mean() and specify the parameters you want, such as window and min_periods, as follows:
df.col1.rolling(n, win_type='triang', min_periods=1).mean()
I don't know what your expected output should be, but here is a sample that uses apply() to generate the rolling mean for each row. Make the state column the index of your dataframe. Hope it helps:
import pandas as pd

df = pd.DataFrame({'B': [6, 1, 2, 20, 4], 'C': [1, 1, 2, 30, 4], 'D': [10, 1, 2, 5, 4]})

def test_roll(data):
    # data is a single row of df (a Series); win_type='triang' requires scipy to be installed
    return data.rolling(window=2, win_type='triang', min_periods=1).mean()

print(df.apply(test_roll, axis=1))
pandas.DataFrame.rolling
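For the original question (a 24-month rolling mean per state over its monthly columns, ending in a single new column), one option is to transpose so the months run down the rows, roll over them, and take the last window. This is only a minimal sketch on the question's 3-month sample; the plain (non-triangular) window and the tiny window=2 are assumptions, and for the real M1-M48 data the window would be 24:

import pandas as pd

# Sample data from the question
Data = pd.DataFrame({'M1': [1, 2], 'M2': [3, 5], 'M3': [5, 6]}, index=['UT', 'CA'])

window = 2  # would be 24 for the real M1-M48 data

# Transpose so months are rows, roll down the months, then transpose back
rolled = Data.T.rolling(window=window, min_periods=1).mean().T

# The last rolled column is each state's average over its most recent `window` months
Data['24_Month_Moving_Average'] = rolled.iloc[:, -1]
print(Data)
#     M1  M2  M3  24_Month_Moving_Average
# UT   1   3   5                      4.0
# CA   2   5   6                      5.5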
I am a python newbie, so I am struggling with this problem.
I am using pandas to read in csv files with multiple rows (changes depending on csv file, up to 200,000) and columns (495).
I want to search along the rows separately to find the max value, then I want to take the value that is 90% of the max and index this to find what column number it is. I want to do this for all rows separately.
For example:
row 1 has a max value of 12,098, which is in column 300
90% of 12,098 gives a value of 10,888. It is unlikely there will be an exact match, so I want to find the nearest match in that row and then get the column number (index) where it occurs, which could be column 300 for example.
I then want to repeat this for every row.
This is what I have done so far:
1. Search my rows of data to find the max value:
maxValues = df.max(axis=1)
2. Calculate 90% of this:
newmax = maxValues / 10 * 9
3. Find the value closest to that newmax in the row, and then tell me the column number where that value is - this is the part I can't do. I have tried:
arr = pulses.to_numpy()
x = newmax.values
difference_array = np.absolute(arr-x).axis=1
index = difference_array.argmin().axis=1
provides the following error: operands could not be broadcast together with shapes (114,495) (114,)
I can do up to number 2 above, but can't figure out 3. I have tried converting them to arrays as you can see but this only produces errors.
Let's say we have the following dataframe:
import pandas as pd
d= {'a':[0,1], 'b':[10, 20], 'c':[30, 40], 'd':[15, 30]}
df = pd.DataFrame(data=d)
To go row by row you can use the apply function.
Since you operate on just one row, you can find its maximum with max.
To find the value closest to 0.9 of the maximum, you need to find the smallest absolute difference between the numbers.
To insert values by row index into the initial dataframe, use at.
So the code would be like this:
percent = 0.9
def foo(row):
    max_val = row.max()
    max_col = row[row == max_val].index[0]   # column holding the row maximum
    second_max_val = percent * max_val       # 90% of the maximum
    idx = row.name
    df.at[idx, 'max'] = max_col
    # column whose value is closest to 90% of the max, excluding the max column itself
    df.at[idx, '0.9max'] = (abs(row.loc[row.index != max_col] - second_max_val)).idxmin()
    return row
df.apply(lambda row: foo(row), axis=1)
print(df)
Your error occurs because you are comparing a two-dimensional array with a one-dimensional one (arr - x).
Consider this sample data frame:
import pandas as pd
import numpy as np
N=5
df = pd.DataFrame({
    "col1": np.random.randint(100, size=(N,)),
    "col2": np.random.randint(100, size=(N,)),
    "col3": np.random.randint(100, size=(N,)),
    "col4": np.random.randint(100, size=(N,)),
    "col5": np.random.randint(100, size=(N,))
})
col1 col2 col3 col4 col5
0 48 21 74 76 95
1 66 1 13 56 83
2 91 67 96 93 28
3 49 76 39 95 84
4 65 31 61 68 24
IIUC, you could use the following code (no iteration needed, relies only on numpy and pandas) to find the index positions of the columns whose values are closest to the maximum value in each row multiplied by 0.9. If two values are equally close, the first index will be returned. The code only needs about five seconds for 2 million rows.
Code:
np.argmin(df.sub(df.max(axis=1) * 0.9, axis=0).apply(np.abs).values, axis=1)
Output:
array([3, 4, 0, 4, 2])
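If you'd rather keep your original numpy approach, the broadcasting error can also be fixed by reshaping the per-row targets into a column vector before subtracting. A minimal sketch (the names arr and x follow the question and are assumed to hold the row data and the 90%-of-max values):

import numpy as np

arr = np.array([[48, 21, 74, 76, 95],
                [66,  1, 13, 56, 83]])   # shape (n_rows, n_cols), like pulses.to_numpy()
x = arr.max(axis=1) * 0.9                # shape (n_rows,), like newmax.values

# x[:, None] has shape (n_rows, 1), so it broadcasts against every column of arr
difference_array = np.abs(arr - x[:, None])
index = difference_array.argmin(axis=1)  # column position of the closest value per row
print(index)                             # [3 4] for this sample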
So I'm slicing my timeseries data, but for some of the columns, I need to be able to get the sum of the elements that were sliced. For example, if you had
s = pd.Series([10, 30, 21, 18])
s = s[::-2]
I need to get the sum of a range of elements in this situation so I would need
3 39
1 40
as the output. I've seen things like .cumsum(), but I can't find anything to sum a range of elements.
I don't quite understand what the first column represents, but the second column seems to be the sum result.
If you have the correct slice, it's easy to get the sum with sum(), like this:
import numpy as np
import pandas as pd
data = np.arange(0, 10).reshape(-1, 2)
pd.DataFrame(data).iloc[2:].sum(axis=1)
Output is :
2 9
3 13
4 17
dtype: int64
The answer based only on your title would be df[-15:].sum(), but it seems you're looking to perform a calculation over each slice.
To address this problem, pandas provides the window utilities. So, you can simply do:
s = pd.Series([10, 30, 21, 18])
s.rolling(2).sum()[::-2].astype(int)
which returns:
3 39
1 40
dtype: int64
Also, it's scalable, since you can replace 2 with any other window size, and the .rolling method also works on DataFrame objects.
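For instance, a quick sketch of the same idea on a DataFrame (the column names here are made up for illustration):

import pandas as pd

df = pd.DataFrame({"a": [10, 30, 21, 18], "b": [1, 2, 3, 4]})

# Rolling sum over every column, then keep every second row counting from the end
print(df.rolling(2).sum()[::-2])
#       a    b
# 3  39.0  7.0
# 1  40.0  3.0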
I have to group a dataset with multiple participants. The participants work a specific time on a specific tablet. If rows are the same tablet, and the time difference between consecutive rows is no more than 10 minutes, the rows belong to one participant. I would like to create a new column ("Participant") that numbers the participants. I know some python but this goes over my head. Thanks a lot!
Dataframe:
ID, Time, Tablet
1, 9:12, a
2, 9:14, a
3, 9:17, a
4, 9:45, a
5, 9:49, a
6, 9:51, a
7, 9:13, b
8, 9:15, b
...
Goal:
ID, Time, Tablet, Participant
1, 9:12, a, 1
2, 9:14, a, 1
3, 9:17, a, 1
4, 9:45, a, 2
5, 9:49, a, 2
6, 9:51, a, 2
7, 9:13, b, 3
8, 9:15, b, 3
...
You can groupby first then do a cumsum to get the participant column the way you want. Please make sure the time column is in datetime format and also sort it before you do this.
import numpy as np
import pandas as pd

df['time'] = pd.to_datetime(df['time'])
df['time_diff'] = df.groupby(['tablet'])['time'].diff().dt.seconds / 60
df['participant'] = np.where(df['time_diff'].isnull() | (df['time_diff'] > 10), 1, 0).cumsum()
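A minimal end-to-end sketch of this on the question's sample data (column names taken from the question; the times are assumed to all fall on the same day):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 8],
    "Time": ["9:12", "9:14", "9:17", "9:45", "9:49", "9:51", "9:13", "9:15"],
    "Tablet": ["a", "a", "a", "a", "a", "a", "b", "b"],
})
df["Time"] = pd.to_datetime(df["Time"], format="%H:%M")
df = df.sort_values(["Tablet", "Time"])

# Gap in minutes since the previous row on the same tablet (NaN for each tablet's first row)
df["time_diff"] = df.groupby("Tablet")["Time"].diff().dt.seconds / 60

# A new participant starts at each tablet's first row or after a gap of more than 10 minutes
df["Participant"] = np.where(df["time_diff"].isnull() | (df["time_diff"] > 10), 1, 0).cumsum()
print(df[["ID", "Tablet", "Participant"]])   # Participant: 1 1 1 2 2 2 3 3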
I've done something similar before; I used a combination of a groupby statement and the Pandas shift function.
df = df.sort_values(["Tablet", "Time"])
df["Time_Period"] = df["Time"] - df.groupby("Tablet")["Time"].shift(1)   # gap since the previous row on the same tablet
df["Time_Period"] = df["Time_Period"].dt.total_seconds()
df["New_Participant"] = (df["Time_Period"] > 10*60) | df["Time_Period"].isna()   # over 10 minutes, or a tablet's first row
df["Participant_ID"] = df["New_Participant"].cumsum()
Basically I flag the start of each new session (a tablet's first row, or any row that comes after a gap of over 10 minutes), then take a cumulative sum so each participant gets a unique ID.
The title may come across as confusing (honestly, not quite sure how to summarize it in a sentence), so here is a much better explanation:
I'm currently handling a DataFrame A with different attributes, and I used a .groupby().count() on the age column to create a list of occurrences:
A_sub = A.groupby(['age'])['age'].count()
A_sub returns a Series similar to the following (the values are randomly modified):
age
1 316
2 249
3 221
4 219
5 262
...
59 1
61 2
65 1
70 1
80 1
Name: age, dtype: int64
I would like to plot a list of values from element-wise division. The division I would like to perform is each element's value divided by the sum of all the elements whose index is greater than or equal to that element's index. In other words, for example, for an age of 3, it should return
221/(221+219+262+...+1+2+1+1+1)
The same calculation should apply to all the elements. Ideally, the outcome should be in the similar type/format so that it can be plotted.
Here is a quick example using numpy. A similar approach can be used with pandas. The for loop can most likely be replaced by something smarter and more efficient to compute the coefficients.
import numpy as np
ages = np.asarray([316, 249, 221, 219, 262])
coefficients = np.zeros(ages.shape)
for k, a in enumerate(ages):
    coefficients[k] = sum(ages[k:])   # sum of this element and everything after it
output = ages / coefficients
Output:
array([0.24940805, 0.26182965, 0.31481481, 0.45530146, 1. ])
EDIT: The coefficients initialization at 0 and the for loop can be replaced with:
coefficients = np.flip(np.cumsum(np.flip(ages)))
You can use the function cumsum() in pandas to get accumulated sums:
A_sub = A['age'].value_counts().sort_index(ascending=False)
(A_sub / A_sub.cumsum()).iloc[::-1]
No reason to use numpy, pandas already includes everything we need.
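For instance, checking this against the counts used in the numpy example above (building the counts Series by hand instead of from A):

import pandas as pd

# Hypothetical counts for ages 1-5, matching the numpy example above
A_sub = pd.Series([316, 249, 221, 219, 262], index=[1, 2, 3, 4, 5], name="age")

A_desc = A_sub.sort_index(ascending=False)   # largest age first
print((A_desc / A_desc.cumsum()).iloc[::-1])
# 1    0.249408
# 2    0.261830
# 3    0.314815
# 4    0.455301
# 5    1.000000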
A_sub seems to return a Series where age is the index. That's not ideal, but it should be fine. The code below therefore operates on a Series, but it can easily be modified to work with DataFrames.
import numpy as np
import pandas as pd

s = pd.Series(data=np.random.randint(low=1, high=10, size=10), index=[0, 1, 3, 4, 5, 8, 9, 10, 11, 13], name="age")
print(s)
res = s / s[::-1].cumsum()[::-1]
res = res.rename("cumsum div")
I saw your comment about missing ages in the index. Here is how you would add the missing indexes in the range from min to max index, and then perform the division.
import numpy as np
import pandas as pd

s = pd.Series(data=np.random.randint(low=1, high=10, size=10), index=[0, 1, 3, 4, 5, 8, 9, 10, 11, 13], name="age")
s_all_idx = s.reindex(index=range(s.index.min(), s.index.max() + 1), fill_value=0)
print(s_all_idx)
res = s_all_idx / s_all_idx[::-1].cumsum()[::-1]
res = res.rename("all idx cumsum div")
Let's say I have some data like this:
category = pd.Series(np.ones(4))
job1_days = pd.Series([1, 2, 1, 2])
job1_time = pd.Series([30, 35, 50, 10])
job2_days = pd.Series([1, 3, 1, 3])
job2_time = pd.Series([10, 40, 60, 10])
job3_days = pd.Series([1, 2, 1, 3])
job3_time = pd.Series([30, 15, 50, 15])
Each entry represents an individual (so 4 people total). xxx_days represents the number of days an individual did something and xxx_time represents the number of minutes spent doing that job on a single day
I want to assign a 2 to category for an individual, if across all jobs they spent at least 3 days of 20 minutes each. So for example, person 1 does not meet the criteria because they only spent 2 total days with at least 20 minutes (their job 2 day count does not count toward the total because time is < 20). Person 2 does meet the criteria as they spent 5 total days (jobs 1 and 2).
After replacement, category should look like this:
[1, 2, 2, 1]
My current attempt to do this requires a for loop, manually indexing into each series, and calculating the total days where time is greater than 20. However, this approach doesn't scale well to my actual dataset. I haven't included the code here as I'd like to approach it from a Pandas perspective instead.
What's the most efficient way to do this in Pandas? The thing that stumps me is checking conditions across multiple series and acting accordingly after summing the days.
Put days and time in two data frames with the column-position correspondence maintained, then do the calculation in a vectorized way:
import pandas as pd
time = pd.concat([job1_time, job2_time, job3_time], axis = 1)
days = pd.concat([job1_days, job2_days, job3_days], axis = 1)
((days * (time >= 20)).sum(1) >= 3) + 1
#0 1
#1 2
#2 2
#3 1
#dtype: int64
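To write the result back into the category series from the question (rather than just printing it), a small follow-up:

# Overwrite category: 2 where there are at least 3 qualifying days, 1 otherwise
category = ((days * (time >= 20)).sum(axis=1) >= 3) + 1
print(category.tolist())   # [1, 2, 2, 1]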