I created a dataframe column with the code below, and was trying to figure out how to round it down to the nearest hundred.
...
# This gives my new values rounded down to the nearest whole number.
df['new_values'] = (10000/df['old_values']).apply(numpy.floor)
# How do I get it to round down to the nearest hundred instead?
# i.e. 8450 rounded to 8400
You need to divide by 100, convert to int and then multiply by 100:
df['new_values'] = (df['old_values'] / 100).astype(int) *100
Same as:
df['new_values'] = (df['old_values'] / 100).apply(np.floor).astype(int) *100
Sample:
df = pd.DataFrame({'old_values':[8450, 8470, 343, 573, 34543, 23999]})
df['new_values'] = (df['old_values'] / 100).astype(int) *100
print (df)
old_values new_values
0 8450 8400
1 8470 8400
2 343 300
3 573 500
4 34543 34500
5 23999 23900
EDIT:
df = pd.DataFrame({'old_values':[3, 6, 89, 573, 34, 23]})
#show output of first divide for verifying output
df['new_values1'] = (10000/df['old_values'])
df['new_values'] = (10000/df['old_values']).div(100).astype(int).mul(100)
print (df)
old_values new_values1 new_values
0 3 3333.333333 3300
1 6 1666.666667 1600
2 89 112.359551 100
3 573 17.452007 0
4 34 294.117647 200
5 23 434.782609 400
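One caveat worth noting (my addition, not part of the original answer): .astype(int) truncates toward zero, so the two versions above only agree for non-negative values. A quick sketch:
s = pd.Series([-150.0, 150.0])
print((s / 100).astype(int) * 100)                  # -100, 100  (truncation)
print((s / 100).apply(np.floor).astype(int) * 100)  # -200, 100  (true floor)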
Borrowing @jezrael's sample dataframe
df = pd.DataFrame({'old_values':[8450, 8470, 343, 573, 34543, 23999]})
Use floordiv or //
df // 100 * 100
old_values
0 8400
1 8400
2 300
3 500
4 34500
5 23900
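If you want the same rounding applied directly to the computed column from the question, the floor-division idiom works on a Series as well; a small sketch reusing the values from the EDIT above (append .astype(int) if you want integers rather than floats):
df = pd.DataFrame({'old_values':[3, 6, 89, 573, 34, 23]})
df['new_values'] = (10000 / df['old_values']) // 100 * 100
# new_values: 3300.0, 1600.0, 100.0, 0.0, 200.0, 400.0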
I've tried something similar using the math module
a = [123, 456, 789, 145]
import math

def rdl(x):
    ls = []
    for i in x:
        ls.append(math.floor(i / 100) * 100)
    return ls

rdl(a)
# Output: [100, 400, 700, 100]
Hope this provides some idea. It's very similar to the solution provided by @jezrael.
I have a pandas dataset that tracks the number of cases, n, of an instance over time.
I have sorted the dataset in ascending order from the first recorded date and have created a new column called 'change'.
I am unsure however how to take the data from column n and map it onto the 'change' column such that each cell in the 'change' column represents the difference from the previous day.
For example, if on day 334 there were n = 14000 and on day 335 there were n = 14500 cases, in that corresponding 'change' cell I would want it to say '500'.
I have been trying things out for the past couple of hours but to no avail so have come here for some help.
I know this is wordier than I would like, but if you need any clarification let me know.
import pandas as pd
df = pd.DataFrame({
'date': [1,2,3,4,5,6,7,8,9,10],
'cases': [100, 120, 129, 231, 243, 212, 375, 412, 440, 1]
})
df['change'] = df.cases.diff()
OUTPUT
date cases change
0 1 100 NaN
1 2 120 20.0
2 3 129 9.0
3 4 231 102.0
4 5 243 12.0
5 6 212 -31.0
6 7 375 163.0
7 8 412 37.0
8 9 440 28.0
9 10 1 -439.0
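If you would rather not have the NaN in the first row (an assumption about your preference, not something stated in the question), you can fill it and keep an integer dtype:
df['change'] = df.cases.diff().fillna(0).astype(int)
# the first row becomes 0 and the column stays integer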
The following is the dataset I'm working on
As you can see there are some missing values (NaN) which need to be replaced, on certain conditions:
If Solar.R < 50 then the missing value of Ozone needs to be replaced by the value = 30.166667
If Solar.R < 100 then the missing value of Ozone needs to be replaced by the value = 21.181818
If Solar.R < 150 then the missing value of Ozone needs to be replaced by the value = 53.13043
If Solar.R < 200 then the missing value of Ozone needs to be replaced by the value = 59.840000
If Solar.R < 250 then the missing value of Ozone needs to be replaced by the value = 59.840000
If Solar.R < 300 then the missing value of Ozone needs to be replaced by the value = 50.115385
If Solar.R < 350 then the missing value of Ozone needs to be replaced by the value = 26.571429
Is there any way to do this using pandas and if-else? I've tried using loc() but it resulted in the non-NaN values getting modified too.
PS: This is the code using loc()
while (s['Ozone'].isna() == True):
    s.loc[(s['Solar.R'] < 50), 'Ozone'] = '30.166667'
    s.loc[(s['Solar.R'] < 100), 'Ozone'] = '21.181818'
    s.loc[(s['Solar.R'] < 150), 'Ozone'] = '53.13043'
    s.loc[(s['Solar.R'] < 200), 'Ozone'] = '59.840000'
    s.loc[(s['Solar.R'] < 250), 'Ozone'] = '59.840000'
    s.loc[(s['Solar.R'] < 300), 'Ozone'] = '50.115385'
    s.loc[(s['Solar.R'] < 350), 'Ozone'] = '26.571429'
Try:
import numpy as np

common = df['Ozone'].isnull()
all_conditions = [(df['Solar.R'] < 50) & (common),
                  (df['Solar.R'] >= 50) & (df['Solar.R'] < 100) & (common),
                  (df['Solar.R'] >= 100) & (df['Solar.R'] < 150) & (common),
                  (df['Solar.R'] >= 150) & (df['Solar.R'] < 250) & (common),
                  (df['Solar.R'] >= 250) & (df['Solar.R'] < 300) & (common),
                  (df['Solar.R'] >= 300) & (df['Solar.R'] < 350) & (common)]
fill_with = [30.166667, 21.181818, 53.13043, 59.840000, 50.115385, 26.571429]
df['Ozone'] = np.select(all_conditions, fill_with, default=df['Ozone'])
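A minimal, self-contained check of this approach on a toy frame (the column names follow the question, but the data here is only illustrative):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Solar.R': [25, 87, 134, 187],
                   'Ozone': [1, np.nan, 1, np.nan]})
common = df['Ozone'].isnull()
conditions = [(df['Solar.R'] >= 50) & (df['Solar.R'] < 100) & common,
              (df['Solar.R'] >= 150) & (df['Solar.R'] < 250) & common]
df['Ozone'] = np.select(conditions, [21.181818, 59.840000], default=df['Ozone'])
# rows 1 and 3 (the NaNs) become 21.181818 and 59.840000; rows 0 and 2 keep their value 1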
You could use pd.cut() to bin the Solar.R values and assign Ozone values to each of the bins, and then use the resulting values in fillna().
I use the example data provided in another answer by @Marcello.
import pandas as pd
import numpy as np
# Example dataset with values for each interval - @Marcello.
example = {'Solar.R' : [25, 25, 87, 87, 134, 134, 187, 187, 234, 234, 267, 267, 345, 345],
'Ozone' : [1, np.nan, 1, np.nan, 1, np.nan, 1, np.nan, 1, np.nan, 1, np.nan, 1, np.nan]}
df = pd.DataFrame(example)
# Find rows with missing data.
fill_needed = df["Ozone"].isna()
# In those rows only, put Solar.R into bins, labelled with values for Ozone.
fill_values = pd.cut(df["Solar.R"][fill_needed],
[0, 50, 100, 150, 200, 250, 300, 350],
labels=[30.166667, 21.181818, 53.13043,
59.840000, 59.840000, 50.115385,
26.571429],
ordered=False).astype(float)
# Put the fill values into the holes in the Ozone series.
df["Ozone"].fillna(fill_values, inplace=True)
df
# Solar.R Ozone
# 0 25 1.000000
# 1 25 30.166667
# 2 87 1.000000
# 3 87 21.181818
# 4 134 1.000000
# 5 134 53.130430
# 6 187 1.000000
# 7 187 59.840000
# 8 234 1.000000
# 9 234 59.840000
# 10 267 1.000000
# 11 267 50.115385
# 12 345 1.000000
# 13 345 26.571429
As your conditions are linear (bins of width 50), you can use floordiv to pick the right value for the Ozone column, and mask to hide values outside the valid range:
values = [30.166667, 21.181818, 53.13043, 59.840000,
          59.840000, 50.115385, 26.571429]
# hide Solar.R values outside (0, 350], then map each 50-wide bin to its Ozone value
s['Ozone'] = s.mask(~s['Solar.R'].between(0, 350))['Solar.R'] \
              .sub(1).floordiv(50).map(pd.Series(values))
print(s)
# Output:
Solar.R Ozone
0 50.0 30.166667
1 NaN NaN
2 450.0 NaN
3 98.0 21.181818
4 348.0 26.571429
5 302.0 26.571429
6 348.0 26.571429
7 279.0 50.115385
8 8.0 30.166667
9 80.0 21.181818
10 140.0 53.130430
11 239.0 59.840000
12 227.0 59.840000
13 93.0 21.181818
14 305.0 26.571429
15 80.0 21.181818
16 104.0 53.130430
17 180.0 59.840000
18 179.0 59.840000
19 59.0 21.181818
The function .fillna(value) can be used. It only changes the NaN values in a dataframe and not other values. Here is an example for your specific problem:
import pandas as pd
import numpy as np
#example dataset with values for each interval
example = {'Solar.R' : [25, 25, 87, 87, 134, 134, 187, 187, 234, 234, 267, 267, 345, 345],
'Ozone' : [1, np.nan, 1, np.nan, 1, np.nan, 1, np.nan, 1, np.nan, 1, np.nan, 1, np.nan]}
df = pd.DataFrame(example)
#list of pairs of the cutoff and the respective values
#!!! needs to be sorted from smallest cutoff to largest
cut_off_values = [(50, 30.166667), (100, 21.181818), (150, 53.13043),
(200, 59.840000), (250, 59.840000), (300, 50.115385),
(350, 26.571429)]
#iterate the list of pairs and change only the nan values
for pair in cut_off_values:
    df[df['Solar.R'] < pair[0]] = df[df['Solar.R'] < pair[0]].fillna(pair[1])
print(df.to_string())
Output:
Solar.R Ozone
0 25 1.000000
1 25 30.166667
2 87 1.000000
3 87 21.181818
4 134 1.000000
5 134 53.130430
6 187 1.000000
7 187 59.840000
8 234 1.000000
9 234 59.840000
10 267 1.000000
11 267 50.115385
12 345 1.000000
13 345 26.571429
The title describes my situation. I already have a working version of this, but it is very inefficient when scaled to large DataFrames (>1M rows). I was wondering if anyone has a better idea of doing this.
Example with solution and code
Create a new column next_time that holds the next value of time at which the price is greater than the current row's price.
import pandas as pd
df = pd.DataFrame({'time': [15, 30, 45, 60, 75, 90], 'price': [10.00, 10.01, 10.00, 10.01, 10.02, 9.99]})
print(df)
time price
0 15 10.00
1 30 10.01
2 45 10.00
3 60 10.01
4 75 10.02
5 90 9.99
series_to_concat = []
for price in df['price'].unique():
index_equal_to_price = df[df['price'] == price].index
series_time_greater_than_price = df[df['price'] > price]['time']
time_greater_than_price_backfilled = series_time_greater_than_price.reindex(index_equal_to_price.union(series_time_greater_than_price.index)).fillna(method='backfill')
series_to_concat.append(time_greater_than_price_backfilled.reindex(index_equal_to_price))
df['next_time'] = pd.concat(series_to_concat, sort=False)
print(df)
time price next_time
0 15 10.00 30.0
1 30 10.01 75.0
2 45 10.00 60.0
3 60 10.01 75.0
4 75 10.02 NaN
5 90 9.99 NaN
This gets me the desired result. When scaled up to some large dataframes, calculating this can take a few minutes. Does anyone have a better idea of how to approach this?
Edit: Clarification of constraints
We can assume the dataframe is sorted by time.
Another way to word this would be, given any row n (Time_n, Price_n), 0 <= n <= len(df) - 1, find x such that Time_x > Time_n AND Price_x > Price_n AND there is no y such that n < y < x with Price_y > Price_n.
These solutions were faster when I tested with %timeit on this sample, but I tested on a larger dataframe and they were much slower than your solution. It would be interesting to see if any of the 3 solutions are faster in your larger dataframe. I would look into dask or check out: https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html
I hope someone else is able to post a more efficient solution. Some different answers below:
You can achieve this with a next one-liner that loops through both the time and price columns simultaneously with zip. The generator expression inside next works much like a list comprehension, except that you use parentheses instead of brackets and it only returns the first matching value. You also need to pass None as the default argument to next, so it returns None instead of raising StopIteration when no row matches.
You need to pass axis=1, since apply has to work row by row, comparing each row against the full columns.
This should speed up performance, because for each row the iteration stops as soon as the first match is found rather than scanning the entire column.
import numpy as np
import pandas as pd
df = pd.DataFrame({'time': [15, 30, 45, 60, 75, 90], 'price': [10.00, 10.01, 10.00, 10.01, 10.02, 9.99]})
print(df)
time price
0 15 10.00
1 30 10.01
2 45 10.00
3 60 10.01
4 75 10.02
5 90 9.99
df['next_time'] = (df.apply(lambda x: next((z for (y, z) in zip(df['price'], df['time'])
if y > x['price'] if z > x['time']), None), axis=1))
df
Out[1]:
time price next_time
0 15 10.00 30.0
1 30 10.01 75.0
2 45 10.00 60.0
3 60 10.01 75.0
4 75 10.02 NaN
5 90 9.99 NaN
As you can see, a list comprehension would return the same result, but in theory it will be a lot slower, as the total number of iterations increases significantly, especially with a large dataframe.
df['next_time'] = (df.apply(lambda x: [z for (y, z) in zip(df['price'], df['time'])
if y > x['price'] if z > x['time']], axis=1)).str[0]
df
Out[2]:
time price next_time
0 15 10.00 30.0
1 30 10.01 75.0
2 45 10.00 60.0
3 60 10.01 75.0
4 75 10.02 NaN
5 90 9.99 NaN
Another option: create a function with some numpy and np.where():
def closest(x):
    try:
        lst = df.groupby(df['price'].cummax())['time'].transform('first')
        lst = np.asarray(lst)
        lst = lst[lst > x]
        idx = (np.abs(lst - x)).argmin()
        return lst[idx]
    except ValueError:
        pass
df['next_time'] = np.where((df['price'].shift(-1) > df['price']),
df['time'].shift(-1),
df['time'].apply(lambda x: closest(x)))
This one returned a variation of your dataframe with 1,000,000 rows and 162,000 unique prices for me in less than 7 seconds. As such, I think that since you ran it on 660,000 rows and 12,000 unique prices, the increase in speed would be 100x-1000x.
The added complexity of your question is the condition that the closest higher price must be at a later time. This answer https://stackoverflow.com/a/53553226/6366770 helped me discover the bisect functions, but it didn't have your added complexity of having to rely on a time column. As such, I had to tackle the problem from a couple of different angles (as you mentioned in a comment regarding my np.where() to break it down into a couple of different methods).
import numpy as np
import pandas as pd
df = pd.DataFrame({'time': [15, 30, 45, 60, 75, 90], 'price': [10.00, 10.01, 10.00, 10.01, 10.02, 9.99]})
def bisect_right(a, x, lo=0, hi=None):
    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        if x < a[mid]: hi = mid
        else: lo = mid + 1
    return lo

def get_closest_higher(df, col, val):
    higher_idx = bisect_right(df[col].values, val)
    return higher_idx
df = df.sort_values(['price', 'time']).reset_index(drop=True)
df['next_time'] = df['price'].apply(lambda x: get_closest_higher(df, 'price', x))
df['next_time'] = df['next_time'].map(df['time'])
df['next_time'] = np.where(df['next_time'] <= df['time'], np.nan, df['next_time'] )
df = df.sort_values('time').reset_index(drop=True)
df['next_time'] = np.where((df['price'].shift(-1) > df['price']),
                           df['time'].shift(-1),
                           df['next_time'])
df['next_time'] = df['next_time'].ffill()
df['next_time'] = np.where(df['next_time'] <= df['time'], np.nan, df['next_time'])
df
Out[1]:
time price next_time
0 15 10.00 30.0
1 30 10.01 75.0
2 45 10.00 60.0
3 60 10.01 75.0
4 75 10.02 NaN
5 90 9.99 NaN
David did come up with a great solution for finding the closest greater price at a later time. However, I wanted to find the very next occurrence of a greater price at a later time. Working with a coworker of mine, we found this solution.
The approach keeps a stack of (index, price) tuples:
Iterate through all rows (index i).
While the stack is non-empty AND the top of the stack has a lesser price, pop it and set next_times[popped index] = times[i].
Push (i, prices[i]) onto the stack.
import numpy as np
import pandas as pd
df = pd.DataFrame({'time': [15, 30, 45, 60, 75, 90], 'price': [10.00, 10.01, 10.00, 10.01, 10.02, 9.99]})
print(df)
time price
0 15 10.00
1 30 10.01
2 45 10.00
3 60 10.01
4 75 10.02
5 90 9.99
times = df['time'].to_numpy()
prices = df['price'].to_numpy()
stack = []
next_times = np.full(len(df), np.nan)
for i in range(len(df)):
    while stack and prices[i] > stack[-1][1]:
        stack_time_index, stack_price = stack.pop()
        next_times[stack_time_index] = times[i]
    stack.append((i, prices[i]))
df['next_time'] = next_times
print(df)
time price next_time
0 15 10.00 30.0
1 30 10.01 75.0
2 45 10.00 60.0
3 60 10.01 75.0
4 75 10.02 NaN
5 90 9.99 NaN
This solution actually performs very fast. I am not totally sure, but I believe the complexity would be close to O(n), since it is one full pass through the entire dataframe and each row is pushed onto and popped off the stack at most once. The reason this performs so well is that the stack stays sorted, with the largest prices at the bottom and the smallest price at the top.
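To make that concrete, here is how the stack evolves on the six-row sample above:
i=0: stack is empty, push (0, 10.00)
i=1: 10.01 > 10.00, so pop index 0 and set next_times[0] = 30, then push (1, 10.01)
i=2: 10.00 is not greater than 10.01, push (2, 10.00)
i=3: 10.01 > 10.00, pop index 2 (next_times[2] = 60); 10.01 is not greater than 10.01, push (3, 10.01)
i=4: 10.02 > 10.01, pop index 3 (next_times[3] = 75) and then index 1 (next_times[1] = 75), push (4, 10.02)
i=5: 9.99 is not greater than 10.02, push (5, 9.99)
Indices 4 and 5 are never popped, so their next_time stays NaN.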
Here is my test with an actual dataframe in action
print(f'{len(df):,.0f} rows with {len(df["price"].unique()):,.0f} unique prices ranging from ${df["price"].min():,.2f} to ${df["price"].max():,.2f}')
667,037 rows with 11,786 unique prices ranging from $1,857.52 to $2,022.00
def find_next_time_with_greater_price(df):
    times = df['time'].to_numpy()
    prices = df['price'].to_numpy()
    stack = []
    next_times = np.full(len(df), np.nan)
    for i in range(len(df)):
        while stack and prices[i] > stack[-1][1]:
            stack_time_index, stack_price = stack.pop()
            next_times[stack_time_index] = times[i]
        stack.append((i, prices[i]))
    return next_times
%timeit -n10 -r10 df['next_time'] = find_next_time_with_greater_price(df)
434 ms ± 11.8 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)
I can't seem to get this right... here's what I'm trying to do:
import pandas as pd
df = pd.DataFrame({
'item_id': [1,1,3,3,3],
'contributor_id': [1,2,1,4,5],
'contributor_role': ['sing', 'laugh', 'laugh', 'sing', 'sing'],
'metric_1': [80, 90, 100, 92, 50],
'metric_2': [180, 190, 200, 192, 150]
})
--->
item_id contributor_id contributor_role metric_1 metric_2
0 1 1 sing 80 180
1 1 2 laugh 90 190
2 3 1 laugh 100 200
3 3 4 sing 92 192
4 3 5 sing 50 150
And I want to reshape it into:
item_id SING_1_contributor_id SING_1_metric_1 SING_1_metric_2 SING_2_contributor_id SING_2_metric_1 SING_2_metric_2 ... LAUGH_1_contributor_id LAUGH_1_metric_1 LAUGH_1_metric_2 ... <LAUGH_2_...>
0 1 1 80 180 N/A N/A N/A ... 2 90 190 ... N/A..
1 3 4 92 192 5 50 150 ... 1 100 200 ... N/A..
Basically, for each item_id, I want to collect all relevant data into a single row. Each item could have multiple types of contributors, and there is a max for each type (e.g. max SING contributor = A per item, max LAUGH contributor = B per item). There are a set of metrics tied to each contributor (but for the same contributor, the values could be different across different items / contributor types).
I can probably achieve this through some seemingly inefficient methods (e.g. looping and matching then populating a template df), but I was wondering if there is a more efficient way to achieve this, potentially through cleverly specifying the index / values / columns in the pivot operation (or any other method..).
Thanks in advance for any suggestions!
EDIT:
Ended up adapting Ben's script below into the following:
df['role_count'] = df.groupby(['item_id', 'contributor_role']).cumcount().add(1).astype(str)
df['contributor_role'] = df.apply(lambda row: row['contributor_role'] + '_' + row['role_count'], axis=1)
df = df.set_index(['item_id','contributor_role']).unstack()
df.columns = ['_'.join(x) for x in df.columns.values]
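One small tweak worth considering (my addition, not part of the original post): the helper role_count column gets unstacked too, which produces extra role_count_* columns. Dropping it before the pivot avoids that:
df = df.drop(columns='role_count').set_index(['item_id', 'contributor_role']).unstack()
df.columns = ['_'.join(x) for x in df.columns.values]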
You can create an additional key with cumcount and then unstack:
df['newkey']=df.groupby('item_id').cumcount().add(1).astype(str)
df['contributor_id']=df['contributor_id'].astype(str)
s = df.set_index(['item_id','newkey']).unstack().sort_index(level=1,axis=1)
s.columns=s.columns.map('_'.join)
s
Out[38]:
contributor_id_1 contributor_role_1 ... metric_1_3 metric_2_3
item_id ...
1 1 sing ... NaN NaN
3 1 laugh ... 50.0 150.0
I have following dataframe in pandas
ID Balance ATM_drawings Value
1 100 50 345
1 150 33 233
2 100 100 333
2 100 100 234
I want the data in this desired format:
ID Balance_mean Balance_sum ATM_Drawings_mean ATM_drawings_sum
1 75 250 41.5 83
2 200 100 200 100
I am using following command to do it in pandas
df1= df[['Balance','ATM_drawings']].groupby('ID', as_index = False).agg(['mean', 'sum']).reset_index()
But, it does not give what I intended to get.
You can use a dictionary to specify aggregation functions for each series:
d = {'Balance': ['mean', 'sum'], 'ATM_drawings': ['mean', 'sum']}
res = df.groupby('ID').agg(d)
# flatten MultiIndex columns
res.columns = ['_'.join(col) for col in res.columns.values]
print(res)
Balance_mean Balance_sum ATM_drawings_mean ATM_drawings_sum
ID
1 125 250 41.5 83
2 100 200 100.0 200
Or you can define d via dict.fromkeys:
d = dict.fromkeys(('Balance', 'ATM_drawings'), ['mean', 'sum'])
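Just to spell that out (nothing pandas-specific, only how dict.fromkeys behaves):
d = dict.fromkeys(('Balance', 'ATM_drawings'), ['mean', 'sum'])
# d == {'Balance': ['mean', 'sum'], 'ATM_drawings': ['mean', 'sum']}
# both keys share the same list object, which is fine here since agg only reads it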
Not sure how to achieve this using agg, but you could reuse the groupby object to avoid having to do the operation multiple times, and then use transformations:
import pandas as pd
df = pd.DataFrame({
"ID": [1, 1, 2, 2],
"Balance": [100, 150, 100, 100],
"ATM_drawings": [50, 33, 100, 100],
"Value": [345, 233, 333, 234]
})
gb = df.groupby("ID")
df["Balance_mean"] = gb["Balance"].transform("mean")
df["Balance_sum"] = gb["Balance"].transform("sum")
df["ATM_drawings_mean"] = gb["ATM_drawings"].transform("mean")
df["ATM_drawings_sum"] = gb["ATM_drawings"].transform("sum")
print(df)
Which yields:
ID Balance Balance_mean Balance_sum ATM_drawings ATM_drawings_mean ATM_drawings_sum Value
0 1 100 125 250 50 41.5 83 345
1 1 150 125 250 33 41.5 83 233
2 2 100 100 200 100 100.0 200 333
3 2 100 100 200 100 100.0 200 234
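If you then want a single row per ID, as in the desired output, one extra step (my addition, not part of the original answer) is to drop the per-row columns and deduplicate:
result = df.drop(["Balance", "ATM_drawings", "Value"], axis=1).drop_duplicates("ID")
print(result)
# ID 1: Balance_mean 125.0, Balance_sum 250, ATM_drawings_mean 41.5, ATM_drawings_sum 83
# ID 2: Balance_mean 100.0, Balance_sum 200, ATM_drawings_mean 100.0, ATM_drawings_sum 200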