drop_duplicates in a range - python

I have a dataframe in Python like this:
st se st_min st_max se_min se_max
42 922444 923190 922434 922454 923180 923200
24 922445 923190 922435 922455 923180 923200
43 928718 929456 928708 928728 929446 929466
37 928718 929459 928708 928728 929449 929469
As you can see, I have a range in the first two columns and a variation of 10 positions around that initial range (the min and max columns).
I know that drop_duplicates can remove duplicate rows based on an exact match of values.
But what if I want to remove rows based on a range of values? For example, indexes 42 and 24 fall in the same range (if I consider a tolerance of 10), and indexes 43 and 37 are in the same situation.
How can I do this?
PS: I can't deduplicate based on only one column (e.g. st or se); I need to remove the redundancy based on both columns (st and se), using the min and max columns as filters...

I assume you want to combine all ranges, so that all overlapping ranges are reduced to one row. I think you need to do that recursively, because multiple ranges, not just two, could form one big range. You could do it like this (just replace df with the variable you use to store your dataframe):
# create a dummy key column to produce a cartesian product
df['fake_key'] = 0
right_df = pd.DataFrame(df, copy=True)
right_df.rename({col: col + '_r' for col in right_df if col != 'fake_key'}, axis='columns', inplace=True)
# this variable indicates that we need to perform the loop once more
change = True
# diff and new_diff are used to see if a loop iteration changed something
# (it's monotonically increasing, btw.)
new_diff = (right_df['se_r'] - right_df['st_r']).sum()
while change:
    diff = new_diff
    joined_df = df.merge(right_df, on='fake_key')
    invalid_indexer = joined_df['se'] < joined_df['st_r']
    joined_df.drop(joined_df[invalid_indexer].index, axis='index', inplace=True)
    right_df = joined_df.groupby('st').aggregate({col: 'max' if '_min' not in col else 'min' for col in joined_df})
    # update the ..._min / ..._max fields in the combined range
    for col in ['st_min', 'se_min', 'st_max', 'se_max']:
        col_r = col + '_r'
        col1, col2 = (col, col_r) if 'min' in col else (col_r, col)
        right_df[col_r] = right_df[col1].where(right_df[col1] <= right_df[col2], right_df[col2])
    right_df.drop(['se', 'st_r', 'st_min', 'se_min', 'st_max', 'se_max'], axis='columns', inplace=True)
    right_df.rename({'st': 'st_r'}, axis='columns', inplace=True)
    right_df['fake_key'] = 0
    # now check if we need to iterate once more
    new_diff = (right_df['se_r'] - right_df['st_r']).sum()
    # iterate again only if this pass actually extended some range
    change = diff < new_diff
# now all ranges which overlap have the same value for se_r
# so we just need to aggregate on se_r to remove them
result = right_df.groupby('se_r').aggregate({col: 'min' if '_max' not in col else 'max' for col in right_df})
result.rename({col: col[:-2] if col.endswith('_r') else col for col in result}, axis='columns', inplace=True)
result.drop('fake_key', axis='columns', inplace=True)
If you execute this on your data, you get:
st se st_min st_max se_min se_max
se_r
923190 922444 923190 922434 922455 923180 923200
929459 928718 929459 922434 928728 923180 929469
Note: if your data set is larger than a few thousand records, you might need to change the join logic above, which produces a cartesian product. In the first iteration you get a joined_df of size n^2, where n is the number of records in your input dataframe; in later iterations joined_df gets smaller due to the aggregation.
I ignored this because I don't know how large your dataset is; avoiding it would make the code a bit more complex. But if you need it, you could create an auxiliary dataframe that lets you "bin" the se values of both dataframes and use the binned value as the fake_key. It's not quite regular binning: you would have to create a dataframe that contains, for each fake_key, all values in the range (0...fake_key). So e.g. if you define your fake key to be fake_key=se//1000, your dataframe would contain
fake_key fake_key_join
922 922
922 921
922 920
... ...
922 0
If you replace the merge in the loop above with code that merges such a dataframe on fake_key with right_df, and the result on fake_key_join with df, you can use the rest of the code unchanged and get the same result as above, but without having to produce a full cartesian product.
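For illustration, here is a rough sketch of how such an auxiliary dataframe could be built, assuming the fake key is defined as fake_key = se // 1000 (the names keys and aux are just placeholders for this sketch, not part of the code above):
import pandas as pd

# assumed: df is the input dataframe from the question
# (df['fake_key'] and right_df['fake_key'] would then be se // 1000 instead of 0)
keys = sorted((df['se'] // 1000).unique())
aux = pd.DataFrame(
    [(k, j) for k in keys for j in range(0, k + 1)],
    columns=['fake_key', 'fake_key_join'],
)
# inside the loop you would then merge aux with right_df on 'fake_key' and merge
# that result with df on 'fake_key_join', instead of the plain merge on 'fake_key'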

Note that e.g. the st values for keys 42 and 24 are different, so you cannot use just the st values.
If e.g. your range can be defined as st / 100 (rounded down to integer),
you can create a column with this value:
df['rng'] = df.st.floordiv(100)
Then use drop_duplicates with subset set to just this column and drop the rng column:
df.drop_duplicates(subset='rng').drop(columns=['rng'])
Or maybe the st value for key 24 should be the same as above (for key 42), and the same for se in the second pair of rows?
In this case you could use:
df.drop_duplicates(subset=['st', 'se'])
without any auxiliary column.
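For reference, a small sketch of both variants on the sample data from the question (the intermediate rng column is only a helper and is dropped again):
import pandas as pd

df = pd.DataFrame(
    {'st': [922444, 922445, 928718, 928718],
     'se': [923190, 923190, 929456, 929459]},
    index=[42, 24, 43, 37],
)

# variant 1: bucket st into coarse ranges and keep one row per bucket
df['rng'] = df.st.floordiv(100)
print(df.drop_duplicates(subset='rng').drop(columns=['rng']))   # keeps rows 42 and 43

# variant 2: exact duplicates on both columns (only removes rows whose st and se match exactly)
print(df.drop_duplicates(subset=['st', 'se']))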

Related

how to vectorize a for loop on pandas dataframe?

I am working with data of about 200,000 rows. In one column of the pandas DataFrame some values are an empty list, but most of them are lists with several values.
What I want to do is replace the empty lists with this list:
[[close*0.95,close*0.94]]
where close is the close value in that row of the table. The for loop that I use is this one:
for i in range(1, len(data3.index)):
    close = data3.close[data3.index == data3.index[i]].values[0]
    sell_list = data3.sell[data3.index == data3.index[i]].values[0]
    buy_list = data3.buy[data3.index == data3.index[i]].values[0]
    if len(sell_list) == 0:
        data3.loc[data3.index[i], "sell"].append([[close*1.05, close*1.06]])
    if len(buy_list) == 0:
        data3.loc[data3.index[i], "buy"].append([[close*0.95, close*0.94]])
I tried to make it work with multithreading, but since I need to read the whole table before the next step I can't split the data. I hope you can help me build some kind of lambda function to apply to the df, or something similar; I am not very skilled at this. Thanks for reading!
The expected output in the "buy" column for a row that had an empty list should be [[[11554, 11566]]].
Example data:
import pandas as pd
df = pd.DataFrame({'close': [11763, 21763, 31763], 'buy':[[], [[21763, 21767]], []]})
close buy
0 11763 []
1 21763 [[[21763, 21767]]]
2 31763 []
You could do it like this:
# Create mask (a bit faster than df['buy'].apply(len) == 0).
# Assumes there are no NaNs in the column. If you have NaNs, use pd.apply.
m = [len(l) == 0 for l in df['buy'].tolist()]
# Create triple nested lists and assign.
df.loc[m, 'buy'] = list(df.loc[m, ['close', 'close']].mul([0.95, 0.94]).to_numpy()[:, None][:, None])
print(df)
Result:
close buy
0 11763 [[[11174.85, 11057.22]]]
1 21763 [[[21763, 21767]]]
2 31763 [[[30174.85, 29857.219999999998]]]
Some explanation:
m is a boolean mask that selects the rows of the DataFrame with an empty list in the 'buy' column:
m = [len(l) == 0 for l in df['buy'].tolist()]
# Or (a bit slower)
# "Apply the len() function to all lists in the column.
m = df['buy'].apply(len) == 0
print(m)
0 True
1 False
2 True
Name: buy, dtype: bool
We can use this mask to select where to calculate the values.
df.loc[m, ['close', 'close']].mul([0.95, 0.94]) duplicates the 'close' column and calculates the vectorised product of all the (close, close) pairs with (0.95, 0.94) to obtain (close*0.95, close*0.94) in each row of the resulting array.
[:, None][:, None] is just a trick to create two additional axes on the resulting array. This is required since you want triple nested lists ([[[]]]).
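For intuition, here is a tiny standalone illustration of that axis trick (the numbers are made up):
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])      # shape (2, 2): one (v1, v2) pair per row
print(a[:, None][:, None].shape)            # (2, 1, 1, 2): two extra axes inserted
print(list(a[:, None][:, None]))            # two elements, each shaped like [[[v1, v2]]]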

Pandas adding additional values between two row-values in a dataframe with number constraint

I have a data frame. Under the same index, I have "early_date" & "late_date" values, which are of "int" dtype. I want to create additional values in between the "early_date" & "late_date" row values and stack the generated values as new rows between them.
Here is how I did it:
df = pd.DataFrame({'index': [1, 1, 2, 2, 3, 3],
                   'variable': ['early_date', 'late_date']*3,
                   'value': [201952, 202001, 202002, 202004, 202006, 202012]})
# This is what your data looks like unmelted
df_p = df.pivot('index', 'variable', 'value').reset_index()
df_p.columns.name = ''
df_p['new'] = [list(range(x, y+1)) for x, y in zip(df_p.pop('early_date'), df_p.pop('late_date'))]
This is the result.
In the column "new", the filling between 201952 & 202001 at index 1 has become 201952, 201953, 201954, ..., 202000, 202001.
However, the "new" column actually represents years and week numbers. In the index 1 case, nothing should be filled in between 201952 & 202001, and the result should be [201952, 202001], since week 52 is the end of the year.
What can I do to handle these cases?
IIUC, you can add a condition in your list comprehension:
df_p['new'] = [list(range(x, y+1)) if str(x)[-2:] != '52' else [x, y]
               for x, y in zip(df_p.pop('early_date'), df_p.pop('late_date'))]
print(df_p)
index new
0 1 [201952, 202001]
1 2 [202002, 202003, 202004]
2 3 [202006, 202007, 202008, 202009, 202010, 20201...

In Python, how do I select the columns of a dataframe satisfying a condition on the number of NaN?

I hope someone can help me. I'm new to Python, and I have a dataframe with 111 columns and over 40,000 rows. All the columns contain NaN values (some columns contain more NaNs than others), so I want to drop the columns having at least 80% NaN values. How can I do this?
To solve my problem, I tried the following code:
df1=df.apply(lambda x : x.isnull().sum()/len(x) < 0.8, axis=0)
The expression x.isnull().sum()/len(x) divides the number of NaNs in column x by the length of x, and the part < 0.8 selects the columns containing less than 80% NaN.
The problem is that when I run this code I only get the column names together with the boolean True, but I want the entire columns, not just their names. What should I do?
You could do this:
filt = df.isnull().sum()/len(df) < 0.8
df1 = df.loc[:, filt]
You want to achieve two things. First, you have to find all columns that contain less than 80% NaNs. Second, you want to keep only those columns in your DataFrame.
To get a pandas Series indicating whether a column should be kept, you can do:
df1 = df.isnull().sum(axis=0) < 0.8*df.shape[0]
(Btw. you have a typo in your question. You should drop the ==True, as it always tests whether 0.5==True.)
This will give True for every column to keep, as .isnull() gives True (or 1) for a NaN and False (or 0) for a valid number, for every element. Then .sum(axis=0) sums along the columns, giving the number of NaNs in each column. The comparison then checks whether that number is smaller than 80% of the number of rows.
For the second task, you can use this to index your columns by using:
df = df[df.columns[df1]]
or as suggested in the comments by doing:
df.drop(df.columns[df1==False], axis=1, inplace=True)
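A small self-contained example of the whole idea (the column names and values below are invented; 'b' is 80% NaN and gets dropped):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [np.nan, np.nan, np.nan, np.nan, 1.0],   # 80% NaN -> should be dropped
    'c': [1.0, np.nan, 3.0, 4.0, 5.0],            # 20% NaN -> kept
})

keep = df.isnull().sum(axis=0) < 0.8 * df.shape[0]   # True for columns to keep
print(df.loc[:, keep].columns.tolist())              # ['a', 'c']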

Pandas: Filter dataframe for values that are too frequent or too rare

On a pandas dataframe, I know I can groupby on one or more columns and then filter values that occur more/less than a given number.
But I want to do this on every column of the dataframe. I want to remove values that are too infrequent (say, occurring less than 5% of the time) or too frequent. As an example, consider a dataframe with the following columns: city of origin, city of destination, distance, type of transport (air/car/foot), time of day, price-interval.
import pandas as pd
import string
import numpy as np
vals = [(c, np.random.choice(list(string.ascii_lowercase), 100, replace=True)) for c in
        ['city of origin', 'city of destination', 'distance, type of transport (air/car/foot)',
         'time of day, price-interval']]
df = pd.DataFrame(dict(vals))
>> df.head()
city of destination city of origin distance, type of transport (air/car/foot) time of day, price-interval
0 f p a n
1 k b a f
2 q s n j
3 h c g u
4 w d m h
If this is a big dataframe, it makes sense to remove rows that have spurious items, for example if time of day = night occurs only 3% of the time, or if the foot mode of transport is rare, and so on.
I want to remove all such values from all columns (or from a list of columns). One idea I have is to run value_counts on every column, transform the result, and add one column for each value_counts; then filter based on whether they are above or below a threshold. But I think there must be a better way to achieve this?
This procedure will go through each column of the DataFrame and eliminate rows where the given category is less than a given threshold percentage, shrinking the DataFrame on each loop.
This answer is similar to the one provided by @Ami Tavory, but with a few subtle differences:
It normalizes the value counts so you can just use a percentile threshold.
It calculates counts just once per column instead of twice. This results in faster execution.
Code:
threshold = 0.03
for col in df:
    counts = df[col].value_counts(normalize=True)
    df = df.loc[df[col].isin(counts[counts > threshold].index), :]
Code timing:
df2 = pd.DataFrame(np.random.choice(list(string.ascii_lowercase), [1000000, 4], replace=True),
                   columns=list('ABCD'))
%%timeit df=df2.copy()
threshold = 0.03
for col in df:
    counts = df[col].value_counts(normalize=True)
    df = df.loc[df[col].isin(counts[counts > threshold].index), :]

1 loops, best of 3: 485 ms per loop

%%timeit df=df2.copy()
m = 0.03 * len(df)
for c in df:
    df = df[df[c].isin(df[c].value_counts()[df[c].value_counts() > m].index)]

1 loops, best of 3: 688 ms per loop
I would go with one of the following:
Option A
m = 0.03 * len(df)
df[np.all(
    df.apply(
        lambda c: c.isin(c.value_counts()[c.value_counts() > m].index).as_matrix()),
    axis=1)]
Explanation:
m = 0.03 * len(df) is the threshold (it's nice to take the constant out of the complicated expression)
df[np.all(..., axis=1)] retains the rows where the condition holds across all columns.
df.apply(...).as_matrix applies a function to all columns, and makes a matrix of the results.
c.isin(...) checks, for each column item, whether it is in some set.
c.value_counts()[c.value_counts() > m].index is the set of all values in a column whose count is above m.
Option B
m = 0.03 * len(df)
for c in df.columns:
    df = df[df[c].isin(df[c].value_counts()[df[c].value_counts() > m].index)]
The explanation is similar to the one above.
Tradeoffs:
Personally, I find B more readable.
B creates a new DataFrame for each filtering of a column; for large DataFrames, it's probably more expensive.
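Note that .as_matrix() has since been removed from pandas; a rough modern sketch of option A could keep the boolean results as a DataFrame instead (same idea, not benchmarked against the timings above, and df is assumed to be the example dataframe):
m = 0.03 * len(df)
# one boolean column per original column: True where the value is frequent enough
mask = df.apply(lambda c: c.isin(c.value_counts()[c.value_counts() > m].index))
df_filtered = df[mask.all(axis=1)]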
I am new to Python and to pandas, and I came up with the solution below. Maybe other people have a better or more efficient approach.
Assuming your DataFrame is DF, you can use the following code to filter out all infrequent values. Just be sure to update the col and bin_freq variables. DF_Filtered is your new filtered DataFrame.
# Column you want to filter
col = 'time of day'
# Set your frequency to filter out. Currently set to 5%
bin_freq = float(5)/float(100)

DF_Filtered = pd.DataFrame()
for i in DF[col].unique():
    counts = DF[DF[col]==i].count()[col]
    total_counts = DF[col].count()
    freq = float(counts)/float(total_counts)
    if freq > bin_freq:
        DF_Filtered = pd.concat([DF[DF[col]==i], DF_Filtered])

print(DF_Filtered)
DataFrames support clip_lower(threshold, axis=None) and clip_upper(threshold, axis=None), which cap all values below or above (respectively) a certain threshold (in recent pandas versions both have been folded into clip(lower=..., upper=...)).
We can also replace all the rare categories with one label, say "Rare", and remove them later if this doesn't add value to prediction.
# function finds the labels that occur more than a certain percentage/threshold
def get_freq_labels(df, var, rare_perc):
    df = df.copy()
    tmp = df.groupby(var)[var].count() / len(df)
    return tmp[tmp > rare_perc].index

vars_cat = [val for val in data.columns if data[val].dtype == 'O']

for var in vars_cat:
    # find the frequent categories
    frequent_cat = get_freq_labels(data, var, 0.05)
    # replace rare categories by the string "Rare"
    data[var] = np.where(data[var].isin(frequent_cat), data[var], 'Rare')
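A toy usage of the snippet above, with an invented single-column dataframe named data, might look like this:
import numpy as np
import pandas as pd

data = pd.DataFrame({'color': ['red'] * 90 + ['blue'] * 8 + ['green'] * 2})

frequent_cat = get_freq_labels(data, 'color', 0.05)   # Index(['blue', 'red'], ...)
data['color'] = np.where(data['color'].isin(frequent_cat), data['color'], 'Rare')
print(data['color'].value_counts())                   # red 90, blue 8, Rare 2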

Understanding this Pandas script

I received this code to group data into histogram-type data. I have been attempting to understand the code in this pandas script in order to edit, manipulate and duplicate it. I have added comments for the sections I understand.
Code
import numpy as np
import pandas as pd

column_names = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6',
                'col7', 'col8', 'col9', 'col10', 'col11']  # names to be used as column labels. If no names are specified then columns can be referred to by number, e.g. df[0], df[1] etc.
df = pd.read_csv('data.csv', header=None, names=column_names)  # header=None means there are no column headings in the csv file
df.ix[df.col11 == 'x', 'col11'] = -0.08  # trick so that 'x' rows will be grouped into a category >-0.1 and <=-0.05. This will allow all of col11 to be treated as numbers
bins = np.arange(-0.1, 1.0, 0.05)  # bins to put col11 values in. >-0.1 and <=-0.05 will be our special 'x' rows, >-0.05 and <=0 will capture all the '0' values
labels = np.array(['%s:%s' % (x, y) for x, y in zip(bins[:-1], bins[1:])])  # create labels for the bins
labels[0] = 'x'  # change first bin label to 'x'
labels[1] = '0'  # change second bin label to '0'
df['col11'] = df['col11'].astype(float)  # convert col11 to numbers so we can do math on them
df['bin'] = pd.cut(df['col11'], bins=bins, labels=False)  # make another column 'bin' holding an integer for the bin each number falls into. Later we'll map the integer to the bin label
df.set_index('bin', inplace=True, drop=False, append=False)  # groupby is meant to run faster with an index

def count_ones(x):
    """aggregate function to count values that equal 1"""
    return np.sum(x == 1)

dfg = df[['bin', 'col7', 'col11']].groupby('bin').agg({'col11': [np.mean], 'col7': [count_ones, len]})
dfg.index = labels[dfg.index]
dfg.ix['x', ('col11', 'mean')] = 'N/A'
print(dfg)
dfg.to_csv('new.csv')
The section I really struggle to understand is this one:
def count_ones(x):
    """aggregate function to count values that equal 1"""
    return np.sum(x == 1)

dfg = df[['bin', 'col7', 'col11']].groupby('bin').agg({'col11': [np.mean], 'col7': [count_ones, len]})
dfg.index = labels[dfg.index]
dfg.ix['x', ('col11', 'mean')] = 'N/A'
print(dfg)
dfg.to_csv('new.csv')
If anyone is able to comment this script I would be greatly appreciative. Also feel free to correct or add to my comments (they reflect what I assume so far, and may not be correct). I'm hoping this isn't too off-topic for SO. I will gladly give a 50 point bounty to any user who can help me with this.
I'll try to explain my code, as it uses a few tricks.
I've called it df to give a shorthand name for a pandas DataFrame.
I've called it dfg to mean "group my df".
Let me build up the expression dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean], 'col7': [count_ones, len]})
The code dfg = df[['bin','col7','col11']] says: take the columns named 'bin', 'col7' and 'col11' from my DataFrame df.
Now that I have the 3 columns I am interested in, I want to group by the values in the 'bin' column. This is done by dfg = df[['bin','col7','col11']].groupby('bin'). I now have groups of data, i.e. all records that are in bin #1, all records in bin #2, etc.
I now want to apply some aggregate functions to the records in each of my bin groups (an aggregate function is something like sum, mean or count).
Now I want to apply three aggregate functions to the records in each of my bins: the mean of 'col11', the number of records in each bin, and the number of records in each bin that have 'col7' equal to one. The mean is easy; numpy already has a function to calculate the mean. If I were just taking the mean of 'col11' I would write: dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean]}). The number of records is also easy; python's built-in len function will give us the number of items in a list. So I now have dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean], 'col7': [len]}). Now I can't think of an existing function that counts the number of ones in a numpy array (it has to work on a numpy array), but I can define my own functions that work on a numpy array, hence my function count_ones.
Now I'll deconstruct the count_ones function. The variable x passed to the function is always going to be a 1d numpy array. In our specific case it will be all the 'col7' values that fall in bin #1, all the 'col7' values that fall in bin #2, etc. The code x==1 will create a boolean (True/False) array the same size as x. The entries in the boolean array will be True if the corresponding values in x are equal to 1 and False otherwise. Because python treats True as 1, if I sum the values of my boolean array I get a count of the values that equal 1. Now that I have my count_ones function I apply it to 'col7' with: dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean], 'col7': [count_ones, len]})
You can see that the syntax of .agg is .agg({'column_name_to_apply_to': [list_of_functions_to_apply]}).
With boolean arrays you can do all sorts of weird condition combinations: (x==6) | (x==3) would be 'x equal to 6 or x equal to 3'. The 'and' operator is &. Always put () around each condition.
Now to dfg.index = labels[dfg.index]. In dfg, because I grouped by 'bin', the index (or row label) of each row of grouped data (i.e. my dfg.index) will be my bin numbers: 1, 2, 3, ... labels[dfg.index] uses fancy indexing of a numpy array. labels[0] would give me the first label, labels[3] would give me the 4th label. With normal python lists you can use slices, e.g. labels[0:3], which would give me labels 0, 1 and 2. With numpy arrays we can go a step further and index with a list of values or another array, so labels[np.array([0,2,4])] would give me labels 0, 2 and 4. By using labels[dfg.index] I'm requesting the labels corresponding to the bin numbers; basically I'm changing my bin numbers into bin labels. I could have done that to my original data, but that would be thousands of rows; by doing it after the groupby I'm doing it to 21 rows or so. Note that I cannot just do dfg.index = labels, as some of my bins might be empty and therefore not present in the grouped data.
Now the dfg.ix['x',('col11', 'mean')]='N/A' part. Remember way back when I did df.ix[df.col11 == 'x', 'col11']=-0.08: that was so all my invalid data would be treated as a number and placed into the 1st bin. After applying the groupby and the aggregate functions, the mean of the 'col11' values in my first bin will be -0.08 (because all such values are -0.08). Now I know this isn't correct: all values of -0.08 actually indicate that the original value was x, and you can't take the mean of x. So I manually set it to N/A, i.e. dfg.ix['x',('col11', 'mean')]='N/A' means: in dfg, where the index (or row) is 'x' and the column is ('col11', 'mean'), set the value to 'N/A'. The ('col11', 'mean') tuple, I believe, is how pandas names the aggregate columns, i.e. when I did .agg({'col11': [np.mean]}), I need ('column_name', 'aggregate_function_name') to refer to the resulting aggregate column.
The motivation for all this was: convert all data to numbers so I can use the power of Pandas; then, after processing, manually change any values that I know are garbage. Let me know if you need any more explanation.
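If it helps, here is a small self-contained toy version of the groupby/agg pattern described above (the numbers are made up), showing both the custom aggregate function and the ('column', 'function') column access:
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'bin':   [0, 0, 1, 1, 1],
    'col7':  [1, 0, 1, 1, 0],
    'col11': [-0.08, -0.08, 0.12, 0.18, 0.15],
})

def count_ones(x):
    """aggregate function to count values that equal 1"""
    return np.sum(x == 1)

g = toy.groupby('bin').agg({'col11': [np.mean], 'col7': [count_ones, len]})
print(g[('col11', 'mean')])       # mean of col11 per bin
print(g[('col7', 'count_ones')])  # how many col7 values equal 1 in each bin
print(g[('col7', 'len')])         # number of records per bin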
