How to vectorize a for loop on a pandas DataFrame? - python

I am working with a dataset of about 200,000 rows. In one column of the DataFrame some values are empty lists, while most of them are lists with several values.
What I want to do is replace the empty lists with this list:
[[close*0.95,close*0.94]]
where close is the close value in the same row of the table. The for loop that I use is this one:
for i in range(1,len(data3.index)):
    close = data3.close[data3.index==data3.index[i]].values[0]
    sell_list = data3.sell[data3.index==data3.index[i]].values[0]
    buy_list = data3.buy[data3.index==data3.index[i]].values[0]
    if len(sell_list)== 0:
        data3.loc[data3.index[i],"sell"].append([[close*1.05,close*1.06]])
    if len(buy_list)== 0:
        data3.loc[data3.index[i],"buy"].append([[close*0.95,close*0.94]])
I tried to make it work with multithreading, but since I need to read the whole table to do the next step I can't split the data. I hope you can help me write some kind of lambda function to apply to the df, or something similar. I am not very skilled at this. Thanks for reading!
The expected output for a row whose "buy" column holds an empty list should be [[[11554, 11566]]]

Example data:
import pandas as pd
df = pd.DataFrame({'close': [11763, 21763, 31763], 'buy':[[], [[[21763, 21767]]], []]})
close buy
0 11763 []
1 21763 [[[21763, 21767]]]
2 31763 []
You could do it like this:
# Create mask (a bit faster than df['buy'].apply(len) == 0).
# Assumes there are no NaNs in the column. If you have NaNs, use pd.apply.
m = [len(l) == 0 for l in df['buy'].tolist()]
# Create triple nested lists and assign.
df.loc[m, 'buy'] = list(df.loc[m, ['close', 'close']].mul([0.95, 0.94]).to_numpy()[:, None][:, None])
print(df)
Result:
close buy
0 11763 [[[11174.85, 11057.22]]]
1 21763 [[[21763, 21767]]]
2 31763 [[[30174.85, 29857.219999999998]]]
Some explanation:
m is a boolean mask that selects the rows of the DataFrame with an empty list in the 'buy' column:
m = [len(l) == 0 for l in df['buy'].tolist()]
# Or (a bit slower):
# apply the len() function to all lists in the column.
m = df['buy'].apply(len) == 0
print(m)
0 True
1 False
2 True
Name: buy, dtype: bool
We can use this mask to select where to calculate the values.
df.loc[m, ['close', 'close']].mul([0.95, 0.94]) duplicates the 'close' column and calculates the vectorised product of all the (close, close) pairs with (0.95, 0.94) to obtain (close*0.95, close*0.94) in each row of the resulting array.
[:, None][:, None] is just a trick to create two additional axes on the resulting array. This is required since you want triple nested lists ([[[]]]).
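To see what the indexing trick does to the shape, here is a minimal sketch on a small NumPy array. The values and the names pairs, step1 and step2 are made up for illustration and are not part of the answer above.

```python
import numpy as np

# Two rows of value pairs; the numbers are illustrative only.
pairs = np.array([[1.0, 2.0], [3.0, 4.0]])

step1 = pairs[:, None]   # inserts a new axis after the first one
step2 = step1[:, None]   # inserts another new axis after the first one

print(pairs.shape, step1.shape, step2.shape)

# Each row of step2 converts to a triple nested list.
first_cell = step2[0].tolist()
print(first_cell)
```

Each element along the first axis of step2 is then a (1, 1, 2) array, which is what ends up in each selected cell of the 'buy' column.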

Related

Pandas adding additional values between two row-values in a dataframe with number constraint

I have a data frame. Under the same index I have "early_date" and "late_date" values, which are of "int" dtype. I want to generate the values between the "early_date" and "late_date" of each index and collect them in a new column.
Here is how I did it:
import pandas as pd
df = pd.DataFrame({'index': [1,1,2,2,3,3],
                   'variable': ['early_date', 'late_date']*3,
                   'value': [201952,202001,202002,202004,202006,202012]})
# This is what your data looks like unmelted
# (keyword arguments are required by DataFrame.pivot in pandas 2.x)
df_p = df.pivot(index='index', columns='variable', values='value').reset_index()
df_p.columns.name = ''
df_p['new'] = [list(range(x,y+1)) for x, y in zip(df_p.pop('early_date'), df_p.pop('late_date'))]
In the resulting column "new", the values filled in between "201952" and "202001" for index 1 became 201952, 201953, 201954...201999, 202001.
However, the "new" column actually represents year and week. In the index 1 case, nothing should be filled in between 201952 and 202001, and the result should be [201952, 202001], since week 52 is the end of the year.
How can I handle these cases?
IIUC, you can add a condition in your list comprehension:
df_p['new'] = [list(range(x,y+1)) if str(x)[-2:]!='52' else [x,y]
               for x, y in zip(df_p.pop('early_date'), df_p.pop('late_date'))]
print(df_p)
index new
0 1 [201952, 202001]
1 2 [202002, 202003, 202004]
2 3 [202006, 202007, 202008, 202009, 202010, 20201...
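The condition above only covers ranges that start in week 52. As a more general sketch, the year rollover can be handled explicitly. This assumes every year has exactly 52 weeks (some ISO years have 53), and week_range is a hypothetical helper name, not something from the question or the answer.

```python
def week_range(start, end, weeks_per_year=52):
    """List YYYYWW values from start to end inclusive.

    Assumption: every year has `weeks_per_year` weeks (52 by default).
    """
    weeks = []
    year, week = divmod(start, 100)  # e.g. 201952 -> (2019, 52)
    while year * 100 + week <= end:
        weeks.append(year * 100 + week)
        week += 1
        if week > weeks_per_year:
            # Roll over to week 1 of the next year.
            year, week = year + 1, 1
    return weeks


print(week_range(201952, 202001))
print(week_range(202002, 202004))
```

It could be used in place of list(range(x, y+1)) inside the list comprehension, for example [week_range(x, y) for x, y in zip(...)].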

New list based on 3 other lists

Starting with a CSV file with the columns ['race_number', 'number_of_horses_bet_on','odds']
I would like to add/calculate an extra column called 'desired_output'.
The 'desired_output' column is computed by
For 'race_number' 1, 'number_of_horses_bet_on' = 2, so in the 'desired_output' column only the first 2 'odds' are included. The remaining values for 'race_number' 1 are 0. Then we go to 'race_number' 2 and the cycle repeats.
Code I have tried includes:
import pandas as pd
df=pd.read_csv('test.csv')
desired_output=[]
count=0
for i in df.number_of_horses_bet_on:
    for j in df.odds:
        if count<i:
            desired_output.append(j)
            count+=1
        else:
            desired_output.append(0)
print(desired_output)
and also
df['desired_output']=df.odds.apply(lambda x: x if count<number_of_horses_bet_on else 0)
Neither of these gives the expected 'desired_output' column.
I realise the 'count' in the lambda above is misplaced - but hopefully you can see what I am after.
Thanks.
I would do it a bit differently. The steps are:
get a list of all race_number values
for each race_number, extract the number_of_horses_bet_on
create a list of 1s and 0s, with number_of_horses_bet_on 1s followed by 0s for the remaining rows of that race
multiply this list by the odds column
import pandas as pd
df=pd.read_csv('test.csv')
mask = []
races = df['race_number'].unique().tolist() # unique list of all races
for race in races:
    # filter the dataframe by the race number
    df_race = df[df['race_number'] == race]
    # assuming number of horses is unique for every race, we extract it here
    number_of_horses = df_race['number_of_horses_bet_on'].iloc[0]
    # this mask will contain a list of 1s and 0s, for example for race 1 it'll be [1,1,0,0,0]
    mask = mask + [1] * number_of_horses + [0] * (len(df_race) - number_of_horses)
df['mask'] = mask
df['desired_output'] = df['mask'] * df['odds']
del df['mask']
print(df)
This assumes that for each race number_of_horses_bet_on is less than or equal to the number of rows for that race; otherwise you might need to use min/max to get proper results.
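A loop-free variant of the same idea, as a minimal sketch: assuming the rows of each race are already in the order in which the odds should be counted, groupby(...).cumcount() numbers the rows within each race, and Series.where keeps the odds only for the first number_of_horses_bet_on rows. The sample values below are made up for illustration and stand in for test.csv.

```python
import pandas as pd

# Illustrative data; the real values would come from test.csv.
df = pd.DataFrame({
    "race_number": [1, 1, 1, 2, 2],
    "number_of_horses_bet_on": [2, 2, 2, 1, 1],
    "odds": [3.0, 5.0, 7.0, 2.0, 4.0],
})

# Position of each row within its race: 0, 1, 2, ... restarting per race.
position = df.groupby("race_number").cumcount()

# Keep the odds where the position is below the number of horses bet on,
# and use 0 everywhere else.
df["desired_output"] = df["odds"].where(position < df["number_of_horses_bet_on"], 0)

print(df)
```

Unlike the mask built in the loop, this does not require the rows of one race to be contiguous, only that they appear in the intended order.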

Why does using "==" return a Series instead of bool in pandas?

I just can't figure out what "==" means at the second line:
- It is not a test, there is no if statement...
- It is not a variable declaration...
I've never seen this before. The thing is that data.categ == cat is a pandas Series and not a test...
for cat in data["categ"].unique():
    subset = data[data.categ == cat] # Create the subsample
    print("-"*20)
    print('Catégorie : ' + cat)
    print("moyenne:\n",subset['montant'].mean())
    print("mediane:\n",subset['montant'].median())
    print("mode:\n",subset['montant'].mode())
    print("VAR:\n",subset['montant'].var())
    print("EC:\n",subset['montant'].std())
    plt.figure(figsize=(5,5))
    subset["montant"].hist(bins=30) # Create the histogram
    plt.show() # Show the histogram
It is testing each element of data.categ for equality with cat. That produces a vector of True/False values. This is passed as an indexer to data[], which returns the rows from data that correspond to the True values in the vector.
To summarize, the whole expression returns the subset of rows from data where the value of data.categ equals cat.
(It seems possible that the whole operation could be done more elegantly using data.groupby('categ').apply(someFunc).)
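As a rough sketch of that groupby idea, the per-category statistics from the loop could be computed in one call with agg instead of apply. The small DataFrame below is invented for illustration; only the column names categ and montant come from the question.

```python
import pandas as pd

# Made-up sample data using the column names from the question.
data = pd.DataFrame({
    "categ": ["a", "b", "a", "b"],
    "montant": [1.0, 2.0, 3.0, 6.0],
})

# One row per category, one column per statistic.
stats = data.groupby("categ")["montant"].agg(["mean", "median", "var", "std"])

print(stats)
```

The histograms would still need a loop (or a plotting call per group), since they are side effects rather than values.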
It creates a boolean Series that is True at the indexes where data.categ is equal to cat. With this boolean mask you can filter your dataframe; in other words, subset will have all records where categ is the value stored in cat.
This is an example using numeric data
import numpy as np
import pandas as pd
np.random.seed(0)
a = np.random.choice(np.arange(2), 5)
b = np.random.choice(np.arange(2), 5)
df = pd.DataFrame(dict(a = a, b = b))
df[df.a == 0].head()
# a b
# 0 0 0
# 2 0 0
# 4 0 1
df[df.a == df.b].head()
# a b
# 0 0 0
# 2 0 0
# 3 1 1
Yes, it is a test. Boolean expressions are not restricted to if statements.
It looks as if data is a pandas DataFrame. The expression used as a DataFrame index is how pandas denotes a selector or filter. This says to select every row in which the field categ matches the variable cat (here the loop variable, which takes each unique value of categ in turn). This collection of rows becomes a new DataFrame, subset.
data.categ == cat returns a boolean Series that is used to filter your dataframe, leaving only the rows where the boolean is True.
Booleans are used in many situations, not only in if statements.
Here you are comparing data.categ with cat, the element currently being iterated over from the unique values of data["categ"].
The rows where they are equal are kept in subset, and the rest of the loop body works on that subset.

Determining if Pandas column containing array includes specific values

I have a dataframe that contains three columns: two define the start and end of a period of time (a window) and another which contains an array of individual timepoints. I would like to determine if any of the individual points are within the window's start and end (the two other columns). The ideal output would be True/False for each row.
I can iterate through each row of the dataframe, extract the timepoints and start_window and end_window times and determine this one row at a time, but I was looking for a faster (no-loop) option.
Example of dataframe
row start_window end_window times (numpy array)
0 307.110309 307.710309 [307.48857, 307.6031]
1 309.140340 311.900309 [315.23134]
...
The output based on the above dataframe would be:
True
False
One way to do it is to use pd.DataFrame.apply:
df.apply(lambda x: any(x['start_window']< i< x['end_window'] for i in x['times']), 1)
Output:
0 True
1 False
dtype: bool
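Let us do it vectorized:
s = pd.DataFrame(df["times"].tolist(), index=df.index)
# Compare row-wise (axis=0) so each row's times are checked against that row's window.
(s.gt(df["start_window"], axis=0) & s.lt(df["end_window"], axis=0)).any(axis=1)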
Out[277]:
0 True
1 False
dtype: bool
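Here is another efficient solution. Note that it tests whether the span from the smallest to the largest time point overlaps the window, so it can return True when the points straddle the window without any of them falling inside it.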
t_max = df["times"].apply(max)
t_min = df["times"].apply(min)
out = (t_max > df["start_window"]) & (t_min < df["end_window"])
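Another loop-free option, shown here only as a sketch, is to give each time point its own row with DataFrame.explode, compare it with the window of its row, and then collapse back with a groupby on the original index. The example stores the time points as plain lists rather than NumPy arrays, and the values are the two sample rows from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "start_window": [307.110309, 309.140340],
    "end_window": [307.710309, 311.900309],
    "times": [[307.48857, 307.6031], [315.23134]],
})

# One row per individual time point; the original index label is repeated.
exploded = df.explode("times")
times = exploded["times"].astype(float)

# True where a single time point lies strictly inside its row's window.
inside = (times > exploded["start_window"]) & (times < exploded["end_window"])

# Collapse back to one value per original row.
result = inside.groupby(level=0).any()

print(result)
```

Rows whose list is empty become NaN after explode, and NaN compares as False, so they end up as False in result.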

Using pandas, how to filter rows with similar values in two columns

I have a big dataframe (~10 million rows). Each row has:
category
start position
end position
If two rows are in the same category and the start and end position overlap with a +-5 tolerance, I want to keep just one of the rows.
For example
1, cat1, 10, 20
2, cat1, 12, 21
3, cat2, 10, 25
I want to filter out 1 or 2.
What I'm doing right now isn't very efficient,
import pandas as pd
df = pd.read_csv('data.csv', sep='\t', header=None)
dfs = {}  # one sub-dataframe per category
for seq in df.category.unique():
    dfs[seq] = df[df.category == seq]
discard = []
rows = []
for index, row in df.iterrows():
    if index in discard:
        continue
    df_2 = dfs[row.category]
    res = df_2[(abs(df_2.start - row.start) <= params['min_distance']) & (abs(df_2.end - row.end) <= params['min_distance'])]
    if len(res.index) > 1:
        discard.extend(res.index.values)
    rows.append(row)
df = pd.DataFrame(rows)
I've also tried a different approach making use of a sorted version of the dataframe.
my_index = 0
indexes = []
discard = []
count = 0
curr = 0
total_len = len(df.index)
while my_index < total_len - 1:
    row = df.iloc[[my_index]]
    cond = True
    next_index = 1
    while cond:
        second_row = df.iloc[[my_index + next_index]]
        c1 = (row.iloc[0].category == second_row.iloc[0].category)
        c2 = (abs(second_row.iloc[0].sstart - row.iloc[0].sstart) <= params['min_distance'])
        c3 = (abs(second_row.iloc[0].send - row.iloc[0].send) <= params['min_distance'])
        cond = c1 and c2 and c3
        if cond and (c2 and c3):
            indexes.append(my_index)
            cond = True
            next_index += 1
    indexes.append(my_index)
    my_index += next_index
indexes.append(total_len - 1)
The problem is that this solution is not perfect: sometimes it misses a row, because the overlapping row can be several rows ahead rather than the next one.
I'm looking for any ideas on how to approach this problem in a more pandas-friendly way, if one exists.
The approach here should be this:
pandas.groupby by categories
agg(Func) on groupby result
the Func should implement the logic of finding the best range inside categories (sorted search, balanced trees or anything else)
Do you want to merge all similar or only 2 consecutive?
If all similar, I suggest you first order the rows, by category, then on the 2 other columns and squash similar in a single row.
If only consecutive 2 then, check if the next value is in the range you set and if yes, merge it. Here you can see how:
merge rows pandas dataframe based on condition
I don't believe the numeric comparisons can be made without a loop, but you can make at least part of this cleaner and more efficient:
dfs = {}
for seq in df.category.unique():
    dfs[seq] = df[df.category == seq]
Instead of this, use df.groupby('category').apply(drop_duplicates).droplevel(0), where drop_duplicates is a function containing your second loop. The function will then be called separately for each category, with a dataframe that contains only the rows of that category. The outputs will be combined back into a single dataframe. The result will have a MultiIndex with the value of "category" as the outer level; this can be removed with droplevel(0).
Secondly, within the category you could sort by the first of the two numeric columns for another small speed-up:
def drop_duplicates(df):
    df = df.sort_values("sstart")
    ...
This will allow you to stop the inner loop as soon as the sstart column value is out of range, instead of comparing every row to every other row.
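One way the body of such a function might look, as a sketch under several assumptions: the columns are called category, start and end, the index labels are unique, the tolerance is 5, and a row is dropped when an already kept row of the same category is within the tolerance on both start and end. The name drop_near_duplicates and the sample rows are illustrative. The groups are combined with pd.concat here instead of groupby(...).apply to keep the original index without extra levels.

```python
import pandas as pd


def drop_near_duplicates(group, tol=5):
    """Greedy filter for one category: keep a row unless a kept row is within tol."""
    group = group.sort_values("start")
    kept = []         # (start, end) of rows kept so far, in ascending start order
    keep_labels = []  # index labels of the kept rows
    for label, start, end in zip(group.index, group["start"], group["end"]):
        is_duplicate = False
        # Walk back through kept rows; stop once start is out of range.
        for kept_start, kept_end in reversed(kept):
            if start - kept_start > tol:
                break
            if abs(end - kept_end) <= tol:
                is_duplicate = True
                break
        if not is_duplicate:
            kept.append((start, end))
            keep_labels.append(label)
    return group.loc[keep_labels]


df = pd.DataFrame({
    "category": ["cat1", "cat1", "cat2"],
    "start": [10, 12, 10],
    "end": [20, 21, 25],
})

result = pd.concat([drop_near_duplicates(g) for _, g in df.groupby("category")])
print(result)
```

Because the rows are sorted by start, the inner loop can stop early, which is the speed-up described above. The outer loop is still plain Python, so on ~10 million rows it may remain slow.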
