I have a dataframe with a column showing the time (in minutes) spent organizing each inventory item. The goal is to show the minutes spent as an integer or float. However, the values in this column are not clean; see the examples below. Is there a way to standardize everything and convert it to an integer or float? (For example, 10 hours should become 600 minutes.)
import pandas as pd
df1 = {'min': ['420', '450', '480', '512', '560', '10 hours', '10.5 hours',
               '420 (all inventory)', '3h ', '4.1 hours', '60**', '6h', '7hours ']}
df1 = pd.DataFrame(df1)
The desired output is a single numeric column with every value expressed in minutes.
I used regex for this kind of problem.
import regex as re  # the built-in re module also works for these patterns
import numpy as np
import pandas as pd

df1 = {'min': ['420', '450', '480', '512', '560', '10 hours', '10.5 hours',
               '420 (all inventory)', '3h ', '4.1 hours', '60**', '6h', '7hours ']}
df1 = pd.DataFrame(df1)

# Copy the dataframe for iteration
df1_copy = df1.copy()
# Create an empty numpy array to fill by index
arr = np.zeros(df1.shape[0])

for i, j in df1_copy.iterrows():
    if "h" in j["min"]:
        # value is given in hours: strip letters, parentheses and whitespace, then convert
        j["min"] = re.sub(r"[a-zA-Z()\s]", "", j["min"])
        j["min"] = float(j["min"])
        arr[i] = j["min"] * 60
    else:
        # value is already in minutes: strip letters, parentheses, '*' and whitespace
        j["min"] = re.sub(r"[a-zA-Z()*\s]", "", j["min"])
        j["min"] = float(j["min"])
        arr[i] = j["min"]

df1["min_clean"] = arr
print(df1)
min min_clean
0 420 420.0
1 450 450.0
2 480 480.0
3 512 512.0
4 560 560.0
5 10 hours 600.0
6 10.5 hours 630.0
7 420 (all inventory) 420.0
8 3h 180.0
9 4.1 hours 246.0
10 60** 60.0
11 6h 360.0
12 7hours 420.0
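For reference, a fully vectorized sketch of the same cleanup with no explicit loop; it reuses df1 and the imports from above and assumes that any value containing an 'h' is expressed in hours:
hours_mask = df1['min'].str.contains('h', case=False)                    # rows given in hours
nums = df1['min'].str.replace(r'[^\d.]', '', regex=True).astype(float)   # keep digits and dots only
df1['min_clean'] = np.where(hours_mask, nums * 60, nums)                 # hours -> minutes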
I don't really know pandas yet, but this solution (using plain regex) could help:
import re

df1 = {'min': ['420', '450', '480', '512', '560', '10 hours', '10.5 hours',
               '420 (all inventory)', '3h ', '4.1 hours', '60**', '6h', '7hours ']}

def mins(s):
    if re.match(r"\d*\.?\d+ *(h|hour)", s):
        # value is in hours: keep only digits and the decimal point
        l = re.sub(r"[^\d.]", "", s).split(".")
        m = int(l[0]) * 60
        if len(l) != 1:
            # a single fractional digit, e.g. '10.5' -> 5 * 6 = 30 extra minutes
            m += int(l[1]) * 6
        return m
    # value is already in minutes: strip everything that is not a digit
    return int(re.sub(r"\D", "", s))

min_clear = map(mins, df1["min"])
print(list(min_clear))
# output: [420, 450, 480, 512, 560, 600, 630, 420, 180, 246, 60, 360, 420]
You could later add min_clear to the DataFrame
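For example, a minimal sketch (assuming pandas is imported as pd and df1 is first converted to a DataFrame as in the question):
df1 = pd.DataFrame(df1)
df1['min_clean'] = list(map(mins, df1['min']))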
Btw, I am just a beginner; if any use case fails, tell me and I will try to improve this.
Thanks
Related
I have a pandas dataset that tracks the number of cases, n, of an instance over time.
I have sorted the dataset in ascending order from the first recorded date and have created a new column called 'change'.
However, I am unsure how to take the data from column n and map it onto the 'change' column so that each cell in 'change' represents the difference from the previous day.
For example, if on day 334 there were n = 14000 and on day 335 there were n = 14500 cases, in that corresponding 'change' cell I would want it to say '500'.
I have been trying things out for the past couple of hours but to no avail so have come here for some help.
I know this is wordier than I would like, but if you need any clarification let me know.
import pandas as pd
df = pd.DataFrame({
    'date': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'cases': [100, 120, 129, 231, 243, 212, 375, 412, 440, 1]
})
df['change'] = df.cases.diff()
OUTPUT
date cases change
0 1 100 NaN
1 2 120 20.0
2 3 129 9.0
3 4 231 102.0
4 5 243 12.0
5 6 212 -31.0
6 7 375 163.0
7 8 412 37.0
8 9 440 28.0
9 10 1 -439.0
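Series.diff() subtracts the previous row from each row, so the first row has no predecessor and comes out as NaN. If a 0 is preferred there instead (an assumption about the desired output), one option is:
df['change'] = df.cases.diff().fillna(0)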
I am working with a large dataframe (~10M rows) that contains dates and textual data, and I have a list of values for each of which I need to make some calculations.
For each value, I need to filter/subset my dataframe based on 4 conditions, make my calculations, and move on to the next value.
Currently, ~80% of the time is spent in the filtering block, making the total processing time extremely long (a few hours).
What I currently have is this:
for val in unique_list:  # iterate on values in a list
    if val is not None or val != kip:  # as long as its an acceptable value
        for year_num in range(1, 6):  # split by years
            # filter and make intermediate df based on per value & per year calculation
            cond_1 = df[f'{kip}'].str.contains(re.escape(str(val)), na=False)
            cond_2 = df[f'{kip}'].notna()
            cond_3 = df['Date'].dt.year < 2015 + year_num
            cond_4 = df['Date'].dt.year >= 2015 + year_num - 1
            temp_df = df[cond_1 & cond_2 & cond_3 & cond_4].copy()
Condition 1 takes around 45% of the time, while conditions 3 & 4 take about 22% each.
Is there a better way to implement this? Is there a way to avoid .dt and .str and use something faster?
Here is the profiling output for 3 values (out of thousands):
Total time: 16.338 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def get_word_counts(df, kip, unique_list):
2 # to hold predictors
3 1 1929.0 1929.0 0.0 predictors_df = pd.DataFrame(index=[f'{kip}'])
4 1 2.0 2.0 0.0 n = 0
5
6 3 7.0 2.3 0.0 for val in unique_list: # iterate on values in a list
7 3 3.0 1.0 0.0 if val is not None or val != kip: # as long as its an acceptable value
8 18 39.0 2.2 0.0 for year_num in range(1, 6): # split by years
9
10 # filter and make intermediate df based on per value & per year calculation
11 15 7358029.0 490535.3 45.0 cond_1 = df[f'{kip}'].str.contains(re.escape(str(val)), na=False)
12 15 992250.0 66150.0 6.1 cond_2 = df[f'{kip}'].notna()
13 15 3723789.0 248252.6 22.8 cond_3 = df['Date'].dt.year < 2015 + year_num
14 15 3733879.0 248925.3 22.9 cond_4 = df['Date'].dt.year >= 2015 + year_num -1
The data mainly looks like this (I use only relevant columns when doing the calculations):
Date Ingredient
20 2016-07-20 Magnesium
21 2020-02-18 <NA>
22 2016-01-28 Apple;Cherry;Lemon;Olives General;Peanut Butter
23 2015-07-23 <NA>
24 2018-01-11 <NA>
25 2019-05-30 Egg;Soy;Unspecified Egg;Whole Eggs
26 2020-02-20 Chocolate;Peanut;Peanut Butter
27 2016-01-21 Raisin
28 2020-05-11 <NA>
29 2020-05-15 Chocolate
30 2019-08-16 <NA>
31 2020-03-28 Chocolate
32 2015-11-04 <NA>
33 2016-08-21 <NA>
34 2015-08-25 Almond;Coconut
35 2016-12-18 Almond
36 2016-01-18 <NA>
37 2015-11-18 Peanut;Peanut Butter
38 2019-06-04 <NA>
39 2016-04-08 <NA>
So, it looks like you really just want to split by year of the 'Date' column and do something with each group. Also, for a large df, it is usually faster to filter what you can once beforehand to get a smaller df (in your example, one year's worth of data), and then do all your looping/extraction on that smaller df.
Without knowing much more about the data itself (C-contiguous? F-contiguous? Date-sorted?), it's hard to be sure, but I would guess that the following may prove to be faster (and it also feels more natural IMHO):
# 1. do everything you can outside the loop
# 1.a prep your patterns
escaped_vals = [re.escape(str(val)) for val in unique_list
                if val is not None and val != kip]
# you meant 'and', not 'or', right?

# 1.b filter and sort the data (why sort? better mem locality)
z = df.loc[(df[kip].notna()) & (df['Date'] >= '2015') & (df['Date'] < '2021')].sort_values('Date')

# 2. do one groupby by year
for date, dfy in z.groupby(pd.Grouper(key='Date', freq='Y')):
    year = date.year  # optional, if you need it
    # 2.b reuse each group as much as possible
    for escval in escaped_vals:
        mask = dfy[kip].str.contains(escval, na=False)
        temp_df = dfy[mask].copy()
        # do something with temp_df ...
Example (guessing some data, really):
import re
from collections import defaultdict

import numpy as np
import pandas as pd

n = 10_000_000
str_examples = ['hello', 'world', 'hi', 'roger', 'kilo', 'zulu', None]
df = pd.DataFrame({
    'Date': [pd.Timestamp('2010-01-01') + k * pd.Timedelta('1 day') for k in np.random.randint(0, 3650, size=n)],
    'x': np.random.randint(0, 1200, size=n),
    'foo': np.random.choice(str_examples, size=n),
    'bar': np.random.choice(str_examples, size=n),
})

unique_list = ['rld', 'oger']
kip = 'foo'
escaped_vals = [re.escape(str(val)) for val in unique_list
                if val is not None and val != kip]
%%time
z = df.loc[(df[kip].notna()) & (df['Date'] >= '2015') & (df['Date'] < '2021')].sort_values('Date')
# CPU times: user 1.67 s, sys: 124 ms, total: 1.79 s
%%time
out = defaultdict(dict)
for date, dfy in z.groupby(pd.Grouper(key='Date', freq='Y')):
    year = date.year
    for escval in escaped_vals:
        mask = dfy[kip].str.contains(escval, na=False)
        temp_df = dfy[mask].copy()
        out[year].update({escval: temp_df})
# CPU times: user 2.64 s, sys: 0 ns, total: 2.64 s
Quick sniff test:
>>> out.keys()
dict_keys([2015, 2016, 2017, 2018, 2019])
>>> out[2015].keys()
dict_keys(['rld', 'oger'])
>>> out[2015]['oger'].shape
(142572, 4)
>>> out[2015]['oger'].tail()
Date x foo bar
3354886 2015-12-31 409 roger hello
8792739 2015-12-31 474 roger zulu
3944171 2015-12-31 310 roger hi
7578485 2015-12-31 125 roger None
2963220 2015-12-31 809 roger hi
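If the per-year groups are reused many times, a further micro-optimization (a sketch on the same assumed data, not profiled here) is to materialize the year once as a plain integer column, so the .dt accessor is never touched again:
z['year'] = z['Date'].dt.year  # computed exactly once
for year, dfy in z.groupby('year'):
    ...  # same inner loop over escaped_vals as above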
The title describes my situation. I already have a working version of this, but it is very inefficient when scaled to large DataFrames (>1M rows). I was wondering if anyone has a better idea of doing this.
Example with solution and code
Create a new column next_time that has the next value of time where the price column is greater than the current row.
import pandas as pd
df = pd.DataFrame({'time': [15, 30, 45, 60, 75, 90], 'price': [10.00, 10.01, 10.00, 10.01, 10.02, 9.99]})
print(df)
time price
0 15 10.00
1 30 10.01
2 45 10.00
3 60 10.01
4 75 10.02
5 90 9.99
series_to_concat = []
for price in df['price'].unique():
    index_equal_to_price = df[df['price'] == price].index
    series_time_greater_than_price = df[df['price'] > price]['time']
    time_greater_than_price_backfilled = series_time_greater_than_price.reindex(index_equal_to_price.union(series_time_greater_than_price.index)).fillna(method='backfill')
    series_to_concat.append(time_greater_than_price_backfilled.reindex(index_equal_to_price))
df['next_time'] = pd.concat(series_to_concat, sort=False)
print(df)
time price next_time
0 15 10.00 30.0
1 30 10.01 75.0
2 45 10.00 60.0
3 60 10.01 75.0
4 75 10.02 NaN
5 90 9.99 NaN
This gets me the desired result. When scaled up to some large dataframes, calculating this can take a few minutes. Does anyone have a better idea of how to approach this?
Edit: Clarification of constraints
We can assume the dataframe is sorted by time.
Another way to word this would be, given any row n (Time_n, Price_n), 0 <= n <= len(df) - 1, find x such that Time_x > Time_n AND Price_x > Price_n AND there is no y such that n < y < x with Price_y > Price_n.
These solutions were faster when I tested with %timeit on this sample, but when I tested on a larger dataframe they were much slower than your solution. It would be interesting to see whether any of the 3 solutions are faster on your larger dataframe. I would also look into dask, or check out https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html
I hope someone else is able to post a more efficient solution. Some different answers below:
You can achieve this with a next one-liner that loops through both the time and price columns simultaneously with zip. next over a generator expression works much like a list comprehension, except that you use parentheses instead of brackets and it only returns the first matching value; you also pass None as the default so that next does not raise an error when no match is found.
You need to pass axis=1, since you are comparing values across columns for each row.
This should speed up performance, as you don't loop through the entire column: the iteration stops after the first match is found and then moves on to the next row.
import pandas as pd
df = pd.DataFrame({'time': [15, 30, 45, 60, 75, 90], 'price': [10.00, 10.01, 10.00, 10.01, 10.02, 9.99]})
print(df)
time price
0 15 10.00
1 30 10.01
2 45 10.00
3 60 10.01
4 75 10.02
5 90 9.99
df['next_time'] = df.apply(lambda x: next((z for (y, z) in zip(df['price'], df['time'])
                                           if y > x['price'] if z > x['time']), None), axis=1)
df
Out[1]:
time price next_time
0 15 10.00 30.0
1 30 10.01 75.0
2 45 10.00 60.0
3 60 10.01 75.0
4 75 10.02 NaN
5 90 9.99 NaN
As you can see, a list comprehension would return the same result, but in theory it will be a lot slower, since the total number of iterations increases significantly, especially with a large dataframe.
df['next_time'] = df.apply(lambda x: [z for (y, z) in zip(df['price'], df['time'])
                                      if y > x['price'] if z > x['time']], axis=1).str[0]
df
Out[2]:
time price next_time
0 15 10.00 30.0
1 30 10.01 75.0
2 45 10.00 60.0
3 60 10.01 75.0
4 75 10.02 NaN
5 90 9.99 NaN
Another option is creating a function with some numpy and np.where():
def closest(x):
    try:
        lst = df.groupby(df['price'].cummax())['time'].transform('first')
        lst = np.asarray(lst)
        lst = lst[lst > x]
        idx = (np.abs(lst - x)).argmin()
        return lst[idx]
    except ValueError:
        pass

df['next_time'] = np.where((df['price'].shift(-1) > df['price']),
                           df['time'].shift(-1),
                           df['time'].apply(lambda x: closest(x)))
This one returned a variation of your dataframe with 1,000,000 rows and 162,000 unique prices for me in less than 7 seconds. As such, I think that since you ran it on 660,000 rows and 12,000 unique prices, the increase in speed would be 100x-1000x.
The added complexity of your question is the condition that the closest higher price must be at a later time. This answer https://stackoverflow.com/a/53553226/6366770 helped me discover the bisect functions, but it doesn't deal with your added requirement of relying on a time column. As such, I had to tackle the problem from a couple of different angles (as you mentioned in a comment regarding my np.where(), breaking it down into a couple of different methods).
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': [15, 30, 45, 60, 75, 90], 'price': [10.00, 10.01, 10.00, 10.01, 10.02, 9.99]})

def bisect_right(a, x, lo=0, hi=None):
    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        if x < a[mid]:
            hi = mid
        else:
            lo = mid + 1
    return lo

def get_closest_higher(df, col, val):
    higher_idx = bisect_right(df[col].values, val)
    return higher_idx

df = df.sort_values(['price', 'time']).reset_index(drop=True)
df['next_time'] = df['price'].apply(lambda x: get_closest_higher(df, 'price', x))
df['next_time'] = df['next_time'].map(df['time'])
df['next_time'] = np.where(df['next_time'] <= df['time'], np.nan, df['next_time'])

df = df.sort_values('time').reset_index(drop=True)
df['next_time'] = np.where((df['price'].shift(-1) > df['price']),
                           df['time'].shift(-1),
                           df['next_time'])
df['next_time'] = df['next_time'].ffill()
df['next_time'] = np.where(df['next_time'] <= df['time'], np.nan, df['next_time'])
df
Out[1]:
time price next_time
0 15 10.00 30.0
1 30 10.01 75.0
2 45 10.00 60.0
3 60 10.01 75.0
4 75 10.02 NaN
5 90 9.99 NaN
David did come up with a great solution for finding the closest greater price at a later time. However, what I wanted was the very next occurrence of a greater price at a later time. Working with a coworker of mine, we found this solution:
1. Keep a stack of (index, price) tuples.
2. Iterate through all rows (index i).
3. While the stack is non-empty and the price at the top of the stack is less than prices[i], pop it and set next_times for the popped index to times[i].
4. Push (i, prices[i]) onto the stack.
import numpy as np
import pandas as pd
df = pd.DataFrame({'time': [15, 30, 45, 60, 75, 90], 'price': [10.00, 10.01, 10.00, 10.01, 10.02, 9.99]})
print(df)
time price
0 15 10.00
1 30 10.01
2 45 10.00
3 60 10.01
4 75 10.02
5 90 9.99
times = df['time'].to_numpy()
prices = df['price'].to_numpy()
stack = []
next_times = np.full(len(df), np.nan)
for i in range(len(df)):
    while stack and prices[i] > stack[-1][1]:
        stack_time_index, stack_price = stack.pop()
        next_times[stack_time_index] = times[i]
    stack.append((i, prices[i]))
df['next_time'] = next_times
print(df)
time price next_time
0 15 10.00 30.0
1 30 10.01 75.0
2 45 10.00 60.0
3 60 10.01 75.0
4 75 10.02 NaN
5 90 9.99 NaN
This solution actually performs very fast. The complexity is essentially O(n): it is one full pass through the dataframe, and each row is pushed onto and popped off the stack at most once. The reason it performs so well is that the stack is always kept sorted, with the largest prices at the bottom and the smallest price at the top.
Here is my test with an actual dataframe in action
print(f'{len(df):,.0f} rows with {len(df["price"].unique()):,.0f} unique prices ranging from ${df["price"].min():,.2f} to ${df["price"].max():,.2f}')
667,037 rows with 11,786 unique prices ranging from $1,857.52 to $2,022.00
def find_next_time_with_greater_price(df):
    times = df['time'].to_numpy()
    prices = df['price'].to_numpy()
    stack = []
    next_times = np.full(len(df), np.nan)
    for i in range(len(df)):
        while stack and prices[i] > stack[-1][1]:
            stack_time_index, stack_price = stack.pop()
            next_times[stack_time_index] = times[i]
        stack.append((i, prices[i]))
    return next_times
%timeit -n10 -r10 df['next_time'] = find_next_time_with_greater_price(df)
434 ms ± 11.8 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)
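If the pure-Python loop ever becomes the bottleneck on much larger frames, the same monotonic-stack pass could in principle be JIT-compiled. This is only a sketch, not part of the original answer, and it assumes numba is installed:
import numpy as np
from numba import njit

@njit
def next_greater_times(times, prices):
    n = len(times)
    out = np.full(n, np.nan)
    stack = np.empty(n, dtype=np.int64)  # preallocated stack of row indices
    top = -1
    for i in range(n):
        while top >= 0 and prices[i] > prices[stack[top]]:
            out[stack[top]] = times[i]   # fill the popped row's next_time
            top -= 1
        top += 1
        stack[top] = i
    return out

# usage: df['next_time'] = next_greater_times(df['time'].to_numpy(np.float64), df['price'].to_numpy())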
I am trying to generate a dataset where each day in a given year range has a fixed number of stores. In turn, each store sells a fixed number of products. The products specific to each store and day have a value for sales (£) and number of products sold.
However, running these for loops takes a while to create the dataset.
Is there anyway I can improve the efficiency of my code?
# Generate one row Dataframes (for concatenation) for each product, in each store, on each date
dataframes = []
for d in datelist:
    for s in store_IDs:
        for p in product_IDs:
            products_sold = random.randint(1, 101)
            sales = random.randint(100, 1001)
            data_dict = {'Date': [d], 'Store ID': [s], 'Product ID': [p], 'Sales': [sales], 'Number of Products Sold': [products_sold]}
            dataframe = pd.DataFrame(data_dict)
            dataframes.append(dataframe)
test_dataframe = pd.concat(dataframes)
The main reason your code is really slow now is that you have the dataframe construction buried inside of your triple loop. This is not necessary. Right now, you are creating a new dataframe inside of each loop. It is much more efficient to create all of the data in some type of format that pandas can ingest and then create the dataframe once.
For the structure that you have, the easiest mod you could do is to make a list of the data rows, appending a new dictionary to that list for each row as you are constructing now, and then make a df from the list of dictionaries once at the end; pandas knows how to do that. I also removed the list brackets from the items you had in your dictionary, as they aren't necessary.
import pandas as pd
import random

datelist = [1, 2, 4, 55]
store_IDs = ['6A', '27B', '12C']
product_IDs = ['soap', 'gum']

data = []  # I just renamed this for clarity
for d in datelist:
    for s in store_IDs:
        for p in product_IDs:
            products_sold = random.randint(1, 101)
            sales = random.randint(100, 1001)
            data_dict = {'Date': d, 'Store ID': s, 'Product ID': p, 'Sales': sales, 'Number of Products Sold': products_sold}
            data.append(data_dict)  # this is building a list of dictionaries...

print(data[:3])
df = pd.DataFrame(data)
print(df.head())
Yields:
[{'Date': 1, 'Store ID': '6A', 'Product ID': 'soap', 'Sales': 310, 'Number of Products Sold': 35}, {'Date': 1, 'Store ID': '6A', 'Product ID': 'gum', 'Sales': 149, 'Number of Products Sold': 34}, {'Date': 1, 'Store ID': '27B', 'Product ID': 'soap', 'Sales': 332, 'Number of Products Sold': 60}]
Date Store ID Product ID Sales Number of Products Sold
0 1 6A soap 310 35
1 1 6A gum 149 34
2 1 27B soap 332 60
3 1 27B gum 698 21
4 1 12C soap 658 51
[Finished in 0.6s]
Do you realize how huge your sizes are?
The size is approximately 3.5 years of days (1277) multiplied by 99 stores = 126,423, multiplied by 8999 products = 1,137,680,577 rows.
If each row needs on average 16 bytes (which is already not much), you need at least 17 GB of memory for this!
For this reason, Store_IDs and Product_IDs should really be just small integers, like index in a table of more descriptive names.
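A minimal sketch of that idea (assuming pandas is imported as pd, store_IDs is the list from the question, and 'Store ID index' is the integer-code column produced by the code below):
store_lookup = pd.Series(store_IDs, name='Store ID')    # integer code -> store name
# build the big frame with integer codes only; map back to labels only when needed:
df['Store ID'] = df['Store ID index'].map(store_lookup)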
The way to gain efficiency is to reduce function calls! E.g. you can use numpy random number generation to generate random values in bulk.
Assuming all numbers involved can fit in 16bits, here's one solution to your problem (still needing a lot of memory):
import pandas as pd
import numpy as np

def gen_data(datelist, store_IDs, product_IDs):
    date16 = np.arange(len(datelist), dtype=np.int16)
    store16 = np.arange(len(store_IDs), dtype=np.int16)
    product16 = np.arange(len(product_IDs), dtype=np.int16)
    A = np.array(np.meshgrid(date16, store16, product16), dtype=np.int16).reshape(3, -1)
    length = A.shape[1]
    sales = np.random.randint(100, 1001, size=(1, length), dtype=np.int16)
    sold = np.random.randint(1, 101, size=(1, length), dtype=np.int16)
    data = np.concatenate((A, sales, sold), axis=0)
    df = pd.DataFrame(data.T,
                      columns=['Date index', 'Store ID index', 'Product ID index',
                               'Sales', 'Number of Products Sold'],
                      dtype=np.int16)
    return df
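It would then be called, for example, as follows (assuming datelist, store_IDs and product_IDs are defined as in the question):
df = gen_data(datelist, store_IDs, product_IDs)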
FWIW on my machine I obtain:
Date Store ID Product ID Sales Number of Products Sold
0 0 0 0 127 85
1 0 0 1 292 37
2 0 0 2 180 36
3 0 0 3 558 88
4 0 0 4 519 79
... ... ... ... ... ...
1137680572 1276 98 8994 932 78
1137680573 1276 98 8995 401 47
1137680574 1276 98 8996 840 77
1137680575 1276 98 8997 717 91
1137680576 1276 98 8998 632 24
[1137680577 rows x 5 columns]
real 1m16.325s
user 0m22.086s
sys 0m25.800s
(I don't have enough memory and use swap)
I created a dataframe column with the code below, and was trying to figure out how to round it down to the nearest hundred.
...
# This gives my new values rounded down to the nearest whole number.
df['new_values'] = (10000 / df['old_values']).apply(numpy.floor)
# How do I get it to round down to the nearest hundred instead?
# i.e. 8450 rounded down to 8400
You need to divide by 100, convert to int and finally multiply by 100:
df['new_values'] = (df['old_values'] / 100).astype(int) * 100
Same as:
df['new_values'] = (df['old_values'] / 100).apply(np.floor).astype(int) * 100
Sample:
df = pd.DataFrame({'old_values':[8450, 8470, 343, 573, 34543, 23999]})
df['new_values'] = (df['old_values'] / 100).astype(int) *100
print (df)
old_values new_values
0 8450 8400
1 8470 8400
2 343 300
3 573 500
4 34543 34500
5 23999 23900
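One caveat not covered by this sample: .astype(int) truncates toward zero, so it only matches a true floor for non-negative values, while the np.floor variant also handles negatives. A quick hypothetical check (with pandas as pd and numpy as np imported as above):
(pd.Series([-250]) / 100).astype(int) * 100                   # -200 (truncates toward zero)
(pd.Series([-250]) / 100).apply(np.floor).astype(int) * 100   # -300 (true floor)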
EDIT:
df = pd.DataFrame({'old_values':[3, 6, 89, 573, 34, 23]})
#show output of first divide for verifying output
df['new_values1'] = (10000/df['old_values'])
df['new_values'] = (10000/df['old_values']).div(100).astype(int).mul(100)
print (df)
old_values new_values1 new_values
0 3 3333.333333 3300
1 6 1666.666667 1600
2 89 112.359551 100
3 573 17.452007 0
4 34 294.117647 200
5 23 434.782609 400
Borrowing @jezrael's sample dataframe
df = pd.DataFrame({'old_values':[8450, 8470, 343, 573, 34543, 23999]})
Use floordiv or //
df // 100 * 100
old_values
0 8400
1 8400
2 300
3 500
4 34500
5 23900
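Applied to a single column rather than the whole frame, the same idea reads:
df['new_values'] = df['old_values'] // 100 * 100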
I've tried something similar using the math module
import math

a = [123, 456, 789, 145]

def rdl(x):
    ls = []
    for i in x:
        ls.append(math.floor(i / 100) * 100)
    return ls

rdl(a)
# Output was [100, 400, 700, 100]
Hope this provides some idea. It's very similar to the solution provided by @jezrael.
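For completeness, the same math.floor idea could be applied directly to a DataFrame column (a small sketch reusing @jezrael's sample dataframe):
import math
import pandas as pd

df = pd.DataFrame({'old_values': [8450, 8470, 343, 573, 34543, 23999]})
df['new_values'] = df['old_values'].apply(lambda i: math.floor(i / 100) * 100)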