I am iterating through a pandas dataframe (df) and adding scores to a dictionary containing python lists (scores):
for index, row in df.iterrows():
    scores[row["key"]][row["pos"]] = scores[row["key"]][row["pos"]] + row["score"]
The scores dictionary is not empty initially. The dataframe is very large and this loop takes a long time. Is there a way to do this without a loop, or to speed it up in some other way?
A for loop seems somewhat inevitable, but we can speed things up with NumPy's fancy indexing and Pandas' groupby:
# group the scores over `key` and gather them in a list
grouped_scores = df.groupby("key").agg(list)
# for each key, value in the dictionary...
for key, val in scores.items():
    # first look up the positions to update and the corresponding scores
    pos, score = grouped_scores.loc[key, ["pos", "score"]]
    # then fancy indexing with `pos`: reaching all positions at once
    scores[key][pos] += score
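This only works if the dictionary values are NumPy arrays; fancy indexing raises a TypeError on plain Python lists, so the lists would need to be converted first. A minimal, runnable sketch under that assumption (the toy data is illustrative), using np.add.at so that repeated positions within a group accumulate rather than being written only once:
import numpy as np
import pandas as pd
df = pd.DataFrame({"key": ["a", "a", "b"], "pos": [0, 2, 1], "score": [1.0, 2.0, 3.0]})
scores = {"a": np.zeros(3), "b": np.zeros(3), "c": np.zeros(3)}  # use np.asarray(...) if these start as lists
grouped_scores = df.groupby("key").agg(list)
for key in scores:
    if key not in grouped_scores.index:
        continue  # keys with no rows in df stay unchanged
    pos, score = grouped_scores.loc[key, ["pos", "score"]]
    # np.add.at accumulates correctly even if a position repeats within a group
    np.add.at(scores[key], pos, score)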
I would like to extract lists of indexes based on the value of column ID.
data={'ID':[1,1,2,3,6,4,2,6], 'Number': [10,20,5,6,100,90,40,5]}
df=pd.DataFrame(data)
I know how to do that manually, one value/list at a time:
idx_list=df.index[df.ID == 1].tolist()
but in my code, I usually don't know how many different values of ID I have, so the above approach would not be enough.
Ideally I would like to have multiple lists as output. for each value of ID, a list of indexes.
You can use a for loop:
idx_list = []
for ID in data["ID"]:
    idx_list.append(df.index[df.ID == ID].tolist())
This will give you the indices for each ID. Note that there will be duplicates, because the loop runs once per row rather than once per unique ID. To avoid this, only add to idx_list if the value is not already present:
idx_list = []
for ID in data["ID"]:
    ids = df.index[df.ID == ID].tolist()
    if ids not in idx_list:
        idx_list.append(ids)
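With the sample data above (ID = [1, 1, 2, 3, 6, 4, 2, 6]), the deduplicated version gives one index list per distinct ID:
print(idx_list)  # [[0, 1], [2, 6], [3], [4, 7], [5]]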
You can store the list of indexes for each value you want to filter for in a separate container
i_list = []
for x in df.ID:
    i_list.append(df.index[df['ID'] == x].tolist())
i_list contains the lists of indexes as a 2D list (again with duplicates, since the loop runs once per row).
df is a dataframe containing 12 millions+ lines unsorted.
Each row has a GROUP ID.
The end goal is to randomly select 1 row per unique GROUP ID, populating a new column named SELECTED where 1 means selected and 0 means not selected.
There may be 5000+ unique GROUP IDs.
I'm seeking a better and faster solution than the following, potentially a multi-threaded one:
for sec in df['GROUP'].unique():
    sz = df.loc[df.GROUP == sec, ['SELECTED']].size
    sel = [0] * sz
    sel[random.randint(0, sz - 1)] = 1
    df.loc[df.GROUP == sec, ['SELECTED']] = sel
You could try a vectorized version, which will probably speed things up if you have many groups.
import numpy as np
import pandas as pd
# get fake data
df = pd.DataFrame(np.random.rand(10))
df['GROUP'] = df[0].astype(str).str[2]
# mark one element of each group as selected
df['selected'] = df.index.isin(     # Is the current index in the selected list?
    df.groupby('GROUP')             # Get a GroupBy object.
    .apply(pd.DataFrame.sample)     # Select one row from each group.
    .index.levels[1]                # The index is a (group, old_id) pair; take the old_id.
).astype(int)                       # Convert the booleans to ints.
Note that this may fail if duplicate indices are present.
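If the pandas version in use is 1.1 or newer, a similar one-liner can avoid apply by sampling directly on the GroupBy object; a sketch under that assumption, using the GROUP/SELECTED column names from the question (the same duplicate-index caveat applies):
chosen = df.groupby('GROUP').sample(n=1).index  # one random row index per group
df['SELECTED'] = df.index.isin(chosen).astype(int)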
I don't know pandas DataFrames well, but if you simply set SELECTED where it needs to be 1 and later treat a missing value as not selected, you could avoid updating all elements.
You may also do something like this:
import random
selected = []
for sec in df['GROUP'].unique():
    selected.append(random.choice(df.index[df['GROUP'] == sec].tolist()))
or with a list comprehension
selected = [random.choice(df.index[df['GROUP'] == sec].tolist()) for sec in df['GROUP'].unique()]
This collects one randomly chosen row index per group; maybe it can speed things up because you will not need to allocate new memory and update all elements of your dataframe.
If you really want multithreading, have a look at concurrent.futures: https://docs.python.org/3/library/concurrent.futures.html
Right now I have two dataframes (let's call them A and B) built from Excel imports. Both have different dimensions as well as some empty/NaN cells. Let's say A is data for individual model numbers and B is a set of order information. For every row (unique item) in A, I want to search B for the (possibly) multiple orders for that item number, average the corresponding prices, and append A with a column containing the average price for each item.
The item numbers are alphanumeric, so they have to be strings. Not every item will have orders/pricing information, and I'll be removing those at the next step. This is a large amount of data, so efficiency matters and iterrows probably isn't the right choice. Thank you in advance!
Here's what I have so far:
avgPrice = []
for index, row in dfA.iterrows():
    def avg_unit_price(item_no, unit_price):
        matchingOrders = []
        for item, price in zip(item_no, unit_price):
            if item == row['itemNumber']:
                matchingOrders.append(price)
        avgPrice.append(np.mean(matchingOrders))
    avg_unit_price(dfB['item_no'], dfB['unit_price'])
dfA['avgPrice'] = avgPrice
In general, avoid loops as they perform poorly. If you can't vectorise easily, then as a last resort you can try pd.Series.apply. In this case, neither were necessary.
import pandas as pd
# B: pricing data
df_b = pd.DataFrame([['I1', 34.1], ['I2', 541.31], ['I3', 451.3], ['I2', 644.3], ['I3', 453.2]],
                    columns=['item_no', 'unit_price'])
# create avg price dictionary
item_avg_price = df_b.groupby('item_no')['unit_price'].mean().to_dict()
# A: product data
df_a = pd.DataFrame([['I1'], ['I2'], ['I3'], ['I4']], columns=['item_no'])
# map price info to product data
df_a['avgPrice'] = df_a['item_no'].map(item_avg_price)
# remove unmapped items
df_a = df_a[pd.notnull(df_a['avgPrice'])]
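For the toy data above, df_a ends up like this (I4 is dropped because it has no orders in df_b):
print(df_a)
#   item_no  avgPrice
# 0      I1    34.100
# 1      I2   592.805
# 2      I3   452.250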
I have a large-ish pandas dataframe with multiple columns (c1 ... c8) and ~32 mil rows. The dataframe is already sorted by c1. I want to grab other column values from rows that share a particular value of c1.
something like
keys = big_df['c1'].unique()
red = np.zeros(len(keys))
for i, key in enumerate(keys):
    inds = (big_df['c1'] == key)
    v1 = np.array(big_df.loc[inds]['c2'])
    v2 = np.array(big_df.loc[inds]['c6'])
    red[i] = reduce_fun(v1, v2)
However, this turns out to be very slow, I think because it checks the entire column for the matching criterion (even though there might be only 10 relevant rows out of 32 mil). Since big_df is sorted by c1 and keys is just the list of all unique c1 values, is there a fast way to build the red[] array? (I.e., I know the first row with the next key is the row right after the last row of the previous key, and that the last row for a key is the last row matching it, since all subsequent rows are guaranteed not to match.)
Thanks,
Ilya
Edit: I am not sure what order the unique() method produces, but I basically want a value of reduce_fun() for every key in keys; I don't particularly care about the order (presumably the easiest order is the one c1 is already sorted in).
Edit 2: I slightly restructured the code. Basically, is there an efficient way of constructing inds? big_df['c1'] == key takes 75.8% of the total time on my data, while creating v1 and v2 takes 21.6%, according to the line profiler.
Rather than a list, I chose a dictionary to hold the reduced values keyed on each item in c1.
red = {key: reduce_fun(frame['c2'].values, frame['c6'].values)
       for key, frame in big_df.groupby('c1')}
How about a groupby statement in a list comprehension? This should be especially efficient given the DataFrame is already sorted by c1:
Edit: Forgot that groupby returns a tuple. Oops!
red = [reduce_fun(g['c2'].values, g['c6'].values) for i, g in big_df.groupby('c1', sort=False)]
Seems to chug through pretty quickly for me (~2 seconds for 30 million random rows and a trivial reduce_fun).
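A self-contained sketch of both variants with a stand-in reduce_fun (the toy data and the dot product are only placeholders; column names follow the question):
import numpy as np
import pandas as pd
def reduce_fun(v1, v2):
    return np.dot(v1, v2)  # placeholder for the real reduction
big_df = pd.DataFrame({'c1': ['a', 'a', 'b'], 'c2': [1.0, 2.0, 3.0], 'c6': [4.0, 5.0, 6.0]})
# dictionary keyed by the values of c1
red_dict = {key: reduce_fun(g['c2'].values, g['c6'].values)
            for key, g in big_df.groupby('c1')}
# list in the order the keys appear (sort=False keeps the original c1 order)
red_list = [reduce_fun(g['c2'].values, g['c6'].values)
            for _, g in big_df.groupby('c1', sort=False)]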
In my pandas dataframe column, I need to check whether the column value matches any of the words in the dictionary values; if it does, I should return the corresponding key.
my_dict = {'woodhill': ["woodhill"],
           'woodcocks': ["woodcocks"],
           'whangateau': ["whangateau", "whangate"],
           'whangaripo': ["whangaripo", "whangari", "whangar"],
           'westmere': ["westmere"],
           'western springs': ["western springs", "western springs", "western spring", "western sprin",
                               "western spri", "western spr", "western sp", "western s"]}
I can write a for loop for this; however, I have nearly 1.5 million records in my dataframe, the dictionary has more than 100 items, and each may have up to 20 values in some cases. How do I do this efficiently? Can I reverse the dictionary, making the values keys and the keys values, to make it fast? Thanks.
You can reverse your dictionary
reversed_dict = {val: key for key in my_dict for val in my_dict[key]}
and then map it onto your dataframe column
df = pd.DataFrame({'col1': ['western springs', 'westerns', 'whangateau', 'whangate']})
df['col1'] = df['col1'].map(reversed_dict)
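Note that Series.map only matches exact values, so anything that is not a key of reversed_dict becomes NaN; with the sample frame above:
print(df)
#               col1
# 0  western springs
# 1              NaN
# 2       whangateau
# 3       whangateau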
Try this approach, it may help you.
1st, reverse the dictionary items (as there are a limited number of items, this will be fast).
2nd, create a dataframe from the dictionary (instead of searching all keys for each comparison with the big dataframe, it's better to do a join, so create a dataframe for that).
3rd, do a left join from the big dataframe to the small dataframe (in this case, the dictionary).
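A minimal sketch of that join approach, assuming the large frame keeps the raw place names in a column called col1 (as in the previous answer) and reusing the reversed_dict defined above:
import pandas as pd
# 2nd step: turn the reversed dictionary into a small lookup dataframe
lookup = pd.DataFrame(list(reversed_dict.items()), columns=['col1', 'key'])
# 3rd step: left join the big dataframe onto the small lookup dataframe
df = df.merge(lookup, on='col1', how='left')  # unmatched rows get NaN in 'key'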