How to change values in a dataframe efficiently - python

# Incorporate delisting return
i = 0
for tc, col in dlret.iloc[:,0:50].iteritems():
idx = col.index[col.notnull()]
if len(idx) != 0:
tr = idx[0]
val = col.ix[tr]
#ret.ix[tr, tc] = val #this line is too slow
i += 1
if math.floor(i/10) > math.floor((i-1)/10):
print i
The dlret DataFrame has about 600 rows and 25,000+ columns. I iterate through the columns to find the first non-null value (the delisting return) and then locate the corresponding cell in the ret DataFrame to set it to that delisting return. However, the code runs painfully slowly when using ix to index the corresponding location. Any suggestions on how to achieve this efficiently?

According to your comment, what you want is to iterate through the columns, find the first non-null value in each column, and update the ret DataFrame accordingly.
You can do this with the following code:
mask_first_nonnull = dlret.notnull() & (dlret.notnull().cumsum() == 1)
ret[mask_first_nonnull] = dlret[mask_first_nonnull]
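A small sketch of how that mask works, using tiny made-up frames in place of ret and dlret (the real ones are ~600 rows by 25,000+ columns):

import numpy as np
import pandas as pd

# Made-up stand-ins for ret and dlret.
ret = pd.DataFrame(0.0, index=range(4), columns=["A", "B"])
dlret = pd.DataFrame({"A": [np.nan, 0.1, 0.2, np.nan],
                      "B": [np.nan, np.nan, -0.3, np.nan]})

# notnull() marks the non-null cells; cumsum() == 1 keeps only the first
# non-null cell of each column, so the mask is True exactly once per column.
mask_first_nonnull = dlret.notnull() & (dlret.notnull().cumsum() == 1)
ret[mask_first_nonnull] = dlret[mask_first_nonnull]
print(ret)  # A picks up 0.1 at row 1, B picks up -0.3 at row 2; all else stays 0.0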

I want to save the mean (by row) of different sets of dataframe columns and store them in a new dataframe

For doing so, I have a list of lists (which are my clusters), for example:
asset_clusts = [[0, 1], [3, 5], [2, 4, 12], ...]
and the original dataframe (in my code I call it x) contains return time series of S&P 500 companies.
I want to choose columns [0, 1] of the original dataframe, compute their mean (by row), store it in a new dataframe, then compute the mean of columns [3, 5], add it to the new dataframe, and so on ...
mu = pd.DataFrame()
for j in range(get_number_of_elements(asset_clusts)):
    mu = x.iloc[:, asset_clusts[j]].mean(axis=1)
but it gives me only a single column, and I checked: that column is the mean of the last cluster's columns.
In case of ambiguity, the get_number_of_elements function is:
def get_number_of_elements(clist):
    count = 0
    for element in clist:
        count += 1
    return count
I solved it; in case it is helpful for others, here is the final function:
def clustered_series(x, org_asset_clust):
    """
    x: return data
    org_asset_clust: list of clusters
    ----> mean of each cluster's returns by row
    """
    def get_number_of_elements(org_asset_clust):
        count = 0
        for element in org_asset_clust:
            count += 1
        return count

    mu = []
    for j in range(get_number_of_elements(org_asset_clust)):
        mu.append(x.iloc[:, org_asset_clust[j]].mean(axis=1))
    cluster_mean = pd.concat(mu, axis=1)
    return cluster_mean
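A quick usage sketch with a made-up return frame (the data and cluster lists below are only illustrative):

import numpy as np
import pandas as pd

x = pd.DataFrame(np.random.randn(5, 6))   # stand-in for the return data
asset_clusts = [[0, 1], [3, 5], [2, 4]]

cluster_mean = clustered_series(x, asset_clusts)
print(cluster_mean.shape)  # (5, 3): one row-wise mean column per cluster

Note that get_number_of_elements(org_asset_clust) is equivalent to len(org_asset_clust), so the loop could also be written as mu = [x.iloc[:, c].mean(axis=1) for c in org_asset_clust].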

Using pandas, how to filter rows with similar values in two columns

I have a big dataframe (~10 million rows). Each row has:
category
start position
end position
If two rows are in the same category and the start and end position overlap with a +-5 tolerance, I want to keep just one of the rows.
For example
1, cat1, 10, 20
2, cat1, 12, 21
3, cat2, 10, 25
I want to filter out 1 or 2.
What I'm doing right now isn't very efficient,
import pandas as pd

df = pd.read_csv('data.csv', sep='\t', header=None)
params = {'min_distance': 5}  # the +-5 tolerance

dfs = {}
for seq in df.category.unique():
    dfs[seq] = df[df.category == seq]

discard = []
rows = []
for index, row in df.iterrows():
    if index in discard:
        continue
    df_2 = dfs[row.category]
    res = df_2[(abs(df_2.start - row.start) <= params['min_distance']) & (abs(df_2.end - row.end) <= params['min_distance'])]
    if len(res.index) > 1:
        discard.extend(res.index.values)
    rows.append(row)
df = pd.DataFrame(rows)
I've also tried a different approach making use of a sorted version of the dataframe.
my_index = 0
indexes = []
discard = []
count = 0
curr = 0
total_len = len(df.index)
while my_index < total_len - 1:
    row = df.iloc[[my_index]]
    cond = True
    next_index = 1
    while cond:
        second_row = df.iloc[[my_index + next_index]]
        c1 = (row.iloc[0].category == second_row.iloc[0].category)
        c2 = (abs(second_row.iloc[0].sstart - row.iloc[0].sstart) <= params['min_distance'])
        c3 = (abs(second_row.iloc[0].send - row.iloc[0].send) <= params['min_distance'])
        cond = c1 and c2 and c3
        if cond and (c2 and c3):
            indexes.append(my_index)
            cond = True
            next_index += 1
    indexes.append(my_index)
    my_index += next_index
indexes.append(total_len - 1)
The problem is that this solution is not perfect: it sometimes misses a row, because the overlapping row can be several rows ahead rather than the very next one.
I'm looking for ideas on how to approach this problem in a more pandas-friendly way, if one exists.
The approach here should be this:
pandas .groupby by category
agg(Func) on the groupby result
Func should implement the logic of finding the best range inside each category (sorted search, balanced trees or anything else)
Do you want to merge all similar rows, or only 2 consecutive ones?
If all similar, I suggest you first sort the rows by category, then by the 2 other columns, and squash similar rows into a single row.
If only 2 consecutive, check whether the next value is in the range you set and, if yes, merge it. Here you can see how:
merge rows pandas dataframe based on condition
I don't believe the numeric comparisons can be made without a loop, but you can make at least part of this cleaner and more efficient:
dfs = {}
for seq in df.category.unique():
    dfs[seq] = df[df.category == seq]
Instead of this, use df.groupby('category').apply(drop_duplicates).droplevel(0), where drop_duplicates is a function containing your second loop. The function will then be called separately for each category, with a dataframe that contains only that category's rows. The outputs will be combined back into a single dataframe. The result will have a MultiIndex with the value of "category" as an outer level; this can be removed with droplevel(0).
Secondly, within the category you could sort by the first of the two numeric columns for another small speed-up:
def drop_duplicates(df):
    df = df.sort_values("sstart")
    ...
This will allow you to stop the inner loop as soon as the sstart column value is out of range, instead of comparing every row to every other row.
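A rough sketch of how those pieces could fit together, assuming columns named category, sstart and send and the +-5 tolerance from the question (the tiny frame and the inner dedup logic are illustrative only, not the answer's exact implementation):

import pandas as pd

MIN_DISTANCE = 5  # the +-5 tolerance

# Tiny made-up frame standing in for the real ~10 million row data.
df = pd.DataFrame({"category": ["cat1", "cat1", "cat2"],
                   "sstart":   [10, 12, 10],
                   "send":     [20, 21, 25]})

def drop_duplicates(group):
    # Keep one row out of each run of rows whose sstart and send both
    # lie within MIN_DISTANCE of a row that has already been kept.
    group = group.sort_values("sstart")
    kept = []
    for _, row in group.iterrows():
        is_dup = any(abs(row.sstart - k.sstart) <= MIN_DISTANCE and
                     abs(row.send - k.send) <= MIN_DISTANCE
                     for k in kept)
        if not is_dup:
            kept.append(row)
    return pd.DataFrame(kept)

result = df.groupby("category").apply(drop_duplicates).droplevel(0)
print(result)  # the (cat1, 12, 21) row is dropped as a near-duplicate of (cat1, 10, 20)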

pandas select certain rows before and after certain row which meets my criteria

I have a dataframe.
Somewhere in its t column there is a unique value that meets my criteria, for instance t = 800.
Now I want to find the row containing that value and select from n rows before it to n rows after it.
I tried:
I tried:
idx = df[df['t']==stime]
from_idx = idx-1200
to_idx = idx + 1200
vib = df.iloc[to_idx:from_idx]
It finds the row which meets my criteria, but I cannot get the right selection out of my dataframe.
idx in your logic is a pd.DataFrame object, not a scalar representing the index for a row. If you extract the scalar index then, assuming your indices are unique, you can use pd.Index.get_loc to get the integer positional location:
idx = df.index.get_loc(df[df['t'] == stime].index[0])
vib = df.iloc[idx - 1200: idx + 1200]
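One caveat (my addition, not part of the answer above): if the matching row is within 1200 rows of the top of the frame, idx - 1200 becomes negative and iloc treats a negative start as an offset from the end, so it is safer to clamp the lower bound:

start = max(idx - 1200, 0)          # avoid a negative, end-relative start position
vib = df.iloc[start: idx + 1200]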

Keeping 3 rows for particular values in column of dataframe

I have a dataframe df which has the column months_to_maturity, with multiple rows for each months_to_maturity value of 1, 2, etc. I am trying to keep only the first 3 rows associated with each particular months_to_maturity value. For example, for months_to_maturity = 1 I would like to keep only 3 rows, for months_to_maturity = 2 another 3 rows, and so on. I tried the code below, but get the error IndexError: index 21836 is out of bounds for axis 0 with size 4412, and am wondering if there is a better way to do this. pairwise gives the current and next row of the dataframe. The values of months_to_maturity are sorted.
count = 0
for (i1, row1), (i2, row2) in pairwise(df.iterrows()):
    if row1.months_to_maturity == row2.months_to_maturity:
        count = count + 1
        if count == 3:
            df.drop(df.index[i1])
            df = df.reset_index()
    elif row1.months_to_maturity != row2.months_to_maturity:
        count = 0
Thank You
You can do:
df.groupby('months_to_maturity').head(3)
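A quick demonstration with made-up data (column names other than months_to_maturity are illustrative):

import pandas as pd

df = pd.DataFrame({"months_to_maturity": [1, 1, 1, 1, 2, 2, 2],
                   "price": range(7)})

# head(3) keeps the first 3 rows of each months_to_maturity group,
# preserving the original row order and index.
print(df.groupby('months_to_maturity').head(3))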
