Search for substring with minimum characters match pandas - python

I have a first DataFrame with column 'X':
X
A468593-3
A697269-2
A561044-2
A239882 04
and a second DataFrame with column 'Y':
Y
000A561044
000A872220
I would like to match substrings between the two columns, requiring a minimum number of matching characters (for example 7). Only alphanumeric characters should be considered for matching; all special characters should be excluded.
So my output DataFrame should look like this:
X
A561044-2
Any possible solution would be highly appreciated.
Thanks in advance.

IIUC, and assuming every value in Y starts with three zeros, you can slice Y with [3:] to drop those three leading characters. Then join the remaining values with |. Finally, build a mask with str.contains, which checks whether each element of a Series matches a pattern (in your case the pattern would be something like 'A|B', which checks whether a value contains 'A' or 'B'). This mask can then be used to filter your other data frame.
Code:
import pandas as pd
df1 = pd.DataFrame({"X": ["A468593-3", "A697269-2", "A561044-2", "A239882 04"]})
df2 = pd.DataFrame({"Y": ["000A561044", "000A872220"]})
mask = df1["X"].str.contains(f'({"|".join(df2["Y"].str[3:])})')
df1.loc[mask]
Output:
X
2 A561044-2
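If some values in Y do not start with exactly three zeros, you can use this function to strip all leading numeric characters instead.
def remove_first_numerics(s):
    # Count the leading numeric characters; the length check avoids an
    # IndexError when the string consists of digits only.
    counter = 0
    while counter < len(s) and s[counter].isnumeric():
        counter += 1
    return s[counter:]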
df_test = pd.DataFrame({"A": ["01Abd3Dc", "3Adv3Dc", "d31oVgZ", "10dZb1B", "CCcDx10"]})
df_test["A"].apply(lambda s: remove_first_numerics(s))
Output:
0 Abd3Dc
1 Adv3Dc
2 d31oVgZ
3 dZb1B
4 CCcDx10
Name: A, dtype: object
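If the prefix in Y is not fixed at all, one possible way to apply the "at least 7 alphanumeric characters" rule directly is sketched below. It assumes that a match means a shared contiguous run of characters after all non-alphanumeric characters have been removed. The helper names alnum_only and shares_run and the constant MIN_CHARS are made up for this illustration, and the pairwise comparison may be slow on large frames.

```python
import re
from difflib import SequenceMatcher

import pandas as pd

df1 = pd.DataFrame({"X": ["A468593-3", "A697269-2", "A561044-2", "A239882 04"]})
df2 = pd.DataFrame({"Y": ["000A561044", "000A872220"]})

MIN_CHARS = 7  # minimum shared run length, taken from the question


def alnum_only(s):
    """Drop every character that is not an ASCII letter or digit."""
    return re.sub(r"[^0-9A-Za-z]", "", s)


def shares_run(a, b, min_chars=MIN_CHARS):
    """Return True if a and b share a contiguous run of at least min_chars characters."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.size >= min_chars


# Clean Y once, then test every cleaned X value against every cleaned Y value.
y_clean = [alnum_only(y) for y in df2["Y"]]
mask = df1["X"].map(lambda x: any(shares_run(alnum_only(x), y) for y in y_clean))
result = df1.loc[mask]
print(result)
```

With the sample data from the question, only the row A561044-2 is kept, because it shares the 7-character run A561044 with 000A561044.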

Related

how to vectorize a for loop on pandas dataframe?

I am working with a dataset of about 200,000 rows. In one column of the DataFrame some values are empty lists, while most of them are lists with several values.
What I want to do is replace the empty lists with this list:
[[close*0.95,close*0.94]]
where close is the close value in the table. The for loop that I use is this one:
for i in range(1,len(data3.index)):
    close = data3.close[data3.index==data3.index[i]].values[0]
    sell_list = data3.sell[data3.index==data3.index[i]].values[0]
    buy_list = data3.buy[data3.index==data3.index[i]].values[0]
    if len(sell_list)== 0:
        data3.loc[data3.index[i],"sell"].append([[close*1.05,close*1.06]])
    if len(buy_list)== 0:
        data3.loc[data3.index[i],"buy"].append([[close*0.95,close*0.94]])
I tried to make it work with multithreading, but since I need to read the whole table before the next step I can't split the data. I hope you can help me write some kind of lambda function to apply to the df, or something similar; I am not very skilled at this. Thanks for reading!
The expected output for a row whose "buy" column holds an empty list should be [[[11554, 11566]]].
Example data:
import pandas as pd
df = pd.DataFrame({'close': [11763, 21763, 31763], 'buy':[[], [[[21763, 21767]]], []]})
close buy
0 11763 []
1 21763 [[[21763, 21767]]]
2 31763 []
You could do it like this:
# Create mask (a bit faster than df['buy'].apply(len) == 0).
# Assumes there are no NaNs in the column; len() raises a TypeError on NaN, so handle those rows first.
m = [len(l) == 0 for l in df['buy'].tolist()]
# Create triple nested lists and assign.
df.loc[m, 'buy'] = list(df.loc[m, ['close', 'close']].mul([0.95, 0.94]).to_numpy()[:, None][:, None])
print(df)
Result:
close buy
0 11763 [[[11174.85, 11057.22]]]
1 21763 [[[21763, 21767]]]
2 31763 [[[30174.85, 29857.219999999998]]]
Some explanation:
m is a boolean mask that selects the rows of the DataFrame with an empty list in the 'buy' column:
m = [len(l) == 0 for l in df['buy'].tolist()]
# Or (a bit slower)
# Apply the len() function to all lists in the column.
m = df['buy'].apply(len) == 0
print(m)
0 True
1 False
2 True
Name: buy, dtype: bool
We can use this mask to select where to calculate the values.
df.loc[m, ['close', 'close']].mul([0.95, 0.94]) duplicates the 'close' column and calculates the vectorised product of all the (close, close) pairs with (0.95, 0.94) to obtain (close*0.95, close*0.94) in each row of the resulting array.
[:, None][:, None] is just a trick to create two additional axes on the resulting array. This is required since you want triple nested lists ([[[]]]).
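To see what the double [:, None] does, here is a small standalone sketch. The array values are simply the two result rows from above typed in by hand, and the variable names arr and expanded are only for illustration.

```python
import numpy as np

# Two rows of (close*0.95, close*0.94) pairs, typed in by hand for illustration.
arr = np.array([[11174.85, 11057.22],
                [30174.85, 29857.22]])

# Each [:, None] inserts a new axis of length 1 right after the first axis.
expanded = arr[:, None][:, None]

print(arr.shape)              # (2, 2)
print(expanded.shape)         # (2, 1, 1, 2)
print(expanded[0].tolist())   # [[[11174.85, 11057.22]]]
```

Iterating over the first axis of expanded therefore yields one triple-nested item per row, which is the shape the 'buy' column expects.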

Why does using "==" return a Series instead of bool in pandas?

I just can't figure out what "==" means on the second line:
- It is not a test, there is no if statement...
- It is not a variable declaration...
I've never seen this before. The thing is, data.categ == cat is a pandas Series and not a test...
for cat in data["categ"].unique():
    subset = data[data.categ == cat]  # Create the subsample
    print("-"*20)
    print('Catégorie : ' + cat)
    print("moyenne:\n",subset['montant'].mean())
    print("mediane:\n",subset['montant'].median())
    print("mode:\n",subset['montant'].mode())
    print("VAR:\n",subset['montant'].var())
    print("EC:\n",subset['montant'].std())
    plt.figure(figsize=(5,5))
    subset["montant"].hist(bins=30)  # Create the histogram
    plt.show()  # Display the histogram
It is testing each element of data.categ for equality with cat. That produces a vector of True/False values. This is passed as an indexer to data[], which returns the rows from data that correspond to the True values in the vector.
To summarize, the whole expression returns the subset of rows from data where the value of data.categ equals cat.
(It seems possible the whole operation could be done more elegantly using data.groupby('categ').apply(someFunc).)
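A rough sketch of that groupby idea follows. The sample values in data are invented for illustration (only the column names categ and montant come from the question), and agg is used here in place of apply(someFunc) because the statistics are all built-in aggregations.

```python
import pandas as pd

# Invented sample data; only the column names come from the question.
data = pd.DataFrame({
    "categ": ["a", "a", "b", "b", "b"],
    "montant": [10.0, 20.0, 5.0, 5.0, 8.0],
})

# The boolean mask behind data[data.categ == cat], shown for one category.
mask = data["categ"] == "a"
print(mask.tolist())  # [True, True, False, False, False]

# One row of summary statistics per category, without an explicit loop.
stats = data.groupby("categ")["montant"].agg(["mean", "median", "var", "std"])
print(stats)
```

The histograms from the original loop would still need a loop (or a plotting call per group); this only replaces the printed statistics.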
It creates a boolean Series that is True at the indexes where data.categ is equal to cat. With this boolean mask you can filter your dataframe; in other words, subset will have all records where categ is the value stored in cat.
This is an example using numeric data
import numpy as np
import pandas as pd
np.random.seed(0)
a = np.random.choice(np.arange(2), 5)
b = np.random.choice(np.arange(2), 5)
df = pd.DataFrame(dict(a = a, b = b))
df[df.a == 0].head()
# a b
# 0 0 0
# 2 0 0
# 4 0 1
df[df.a == df.b].head()
# a b
# 0 0 0
# 2 0 0
# 3 1 1
Yes, it is a test. Boolean expressions are not restricted to if statements.
It looks as if data is a pandas DataFrame. An expression used as a DataFrame index is how pandas denotes a selector or filter. This says to select every row in which the field categ matches the variable cat (here, the loop variable). This collection of rows becomes a new DataFrame, subset.
data.categ == cat will return a boolean Series that is used to filter your dataframe, keeping only the rows where the value is True.
Booleans are used in many situations, not only in if statements.
Here you are comparing data.categ with cat, the element currently being iterated over from the unique values of data["categ"].
The rows where they are equal are kept in subset.

Drop rows from dataframe that contain characters outside of a list of characters

I'm trying to remove all rows from a pandas DataFrame whose values contain characters that are not in a list:
from string import ascii_lowercase
allowed_chars = list(ascii_lowercase)
data = df[df['Value'].apply(lambda x : x in allowed_chars)]
print(data.Value.tolist())
The print just prints a list of 'False' values.
What you're doing is comparing the entire string stored in the Value column to the list of characters you allow. This won't work, because your list consists only of one-character strings, which don't match any of the words in the Value column. Here's what you could do instead:
allowed_chars = set('abcde...')
data = df[df['Value'].apply(lambda x: set(x).issubset(allowed_chars))]
print(data.Value.tolist())
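For reference, here is a self-contained sketch of the same idea. The sample values in the Value column are invented, and the allowed set is assumed to be all of ascii_lowercase, as in the question. The str.fullmatch line is an alternative that only applies when the allowed characters can be written as a regex character class.

```python
from string import ascii_lowercase

import pandas as pd

# Invented sample data for illustration.
df = pd.DataFrame({"Value": ["abc", "ab1", "Hello", "xyz"]})

allowed_chars = set(ascii_lowercase)

# Keep a row only if every character of its value is in the allowed set.
mask = df["Value"].map(lambda s: set(s) <= allowed_chars)
data = df[mask]
print(data["Value"].tolist())  # ['abc', 'xyz']

# Regex alternative: the whole string must consist of lowercase ASCII letters.
regex_mask = df["Value"].str.fullmatch(r"[a-z]*")
```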
Your format seems fine, and the only thing I can think of is that maybe the value isn't in the right format for the in test. You may need to do str(x) or something like that in your data = line. If you can give a snippet of ascii_lowercase and data, I can look further.
df2
# a b c
#0 x 2 3
#1 y 2 4
df2[df2.a.apply(lambda x: x in 'x')]
# a b c
#0 x 2 3

Remove outliers from the target column when an independent variable column has a specific value

I have a dataframe that looks as follows (click on the link below):
df.head(10)
https://ibb.co/vqmrkXb
What I would like to do is to remove outliers from the target column (occupied_parking_spaces) when the value of the day column is equal to 6, for instance, which refers to Sunday (df['day'] == 6), using the normal distribution 68-95-99.7 rule.
I tried the following code :
df = df.mask((df['occupied_parking_spaces'] - df['occupied_parking_spaces'].mean()).abs() > 2 * df['occupied_parking_spaces'].std()).dropna()
This line of code removes outliers from the whole dataset regardless of the independent variables, but I only want to remove outliers from the occupied_parking_spaces column where the day value is equal to 6, for example.
What I can do is create a separate dataframe from which I remove the outliers:
sunday_df = df.loc[df['day'] == 6]
sunday_df = sunday_df.mask((sunday_df['occupied_parking_spaces'] - sunday_df['occupied_parking_spaces'].mean()).abs() > 2 * sunday_df['occupied_parking_spaces'].std()).dropna()
But by doing this I will get a separate dataframe for every day of the week, which I will have to concatenate at the end. This is something I do not want to do, as there must be a way to do this inside the same dataframe.
Could you please help me out?
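Having defined some function to remove outliers, you could use np.where to apply it selectively and write the result back to the column:
import numpy as np
# remove_outliers is a function you define; it must return one value per input row.
df['occupied_parking_spaces'] = np.where(
    df['day'] == 6,
    remove_outliers(df['occupied_parking_spaces']),
    df['occupied_parking_spaces'],
)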
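As a sketch of doing this inside the same DataFrame with plain boolean masks, the example below drops only the Sunday rows that lie more than two standard deviations from the Sunday mean. The sample numbers are invented, day 6 is assumed to mean Sunday as in the question, and the names is_sunday, is_outlier and cleaned are only for illustration.

```python
import pandas as pd

# Invented sample data: ten Sunday rows (day == 6) and three Monday rows (day == 0).
df = pd.DataFrame({
    "day": [6] * 10 + [0] * 3,
    "occupied_parking_spaces": [10, 11, 9, 10, 12, 11, 10, 9, 12, 100, 50, 500, 52],
})

is_sunday = df["day"] == 6
sunday_values = df.loc[is_sunday, "occupied_parking_spaces"]

# Deviation from the Sunday mean, flagged only on Sunday rows.
deviation = (df["occupied_parking_spaces"] - sunday_values.mean()).abs()
is_outlier = is_sunday & (deviation > 2 * sunday_values.std())

# Rows for other days are left untouched, whatever their values are.
cleaned = df.loc[~is_outlier]
print(cleaned)
```

To treat every day of the week the same way, the mean and standard deviation could be computed per day with groupby("day") and transform instead of for a single day.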

Re-arrange 1D pandas DataFrame to 2d by splitting index names

I have a 1D DataFrame that is indexed with keys of the form i_n, where i and n are strings (for the sake of this example, i is an integer number and n is a character). This would be a simple example:
values
0_a 0.583772
1_a 0.782358
2_a 0.766844
3_a 0.072565
4_a 0.576667
0_b 0.503876
1_b 0.352815
2_b 0.512834
3_b 0.070908
4_b 0.074875
0_c 0.361226
1_c 0.526089
2_c 0.299183
3_c 0.895878
4_c 0.874512
Now I would like to re-arrange this DataFrame to be 2D such that the number (the part of the index name before the underscore) serves as column name and the character (the part of the index after the underscore) serves as index:
0 1 2 3 4
a 0.583772 0.782358 0.766844 0.0725654 0.576667
b 0.503876 0.352815 0.512834 0.0709081 0.0748752
c 0.361226 0.526089 0.299183 0.895878 0.874512
I have a solution for the problem (the function convert_2d below), but I was wondering whether there would be a more idiomatic way to achieve this. Here is the code that was used to generate the original DataFrame and to convert it to the desired form:
import pandas as pd
import numpy as np
def convert_2d(df):
    names = sorted(set(idx.split('_')[1] for idx in df.index))
    numbers = sorted(set(idx.split('_')[0] for idx in df.index))
    # Rows are the parts after the underscore, columns the parts before it.
    df2 = pd.DataFrame(index=names, columns=numbers)
    for i in numbers:
        for n in names:
            df2.loc[n, i] = df.loc['{}_{}'.format(i, n), 'values']
    return df2
##generating 1d example data:
data = np.random.rand(15)
indices = ['{}_{}'.format(i,n) for n in ['a','b','c'] for i in range(5)]
df = pd.DataFrame(
    data, columns=['values']
).rename(index={i:idx for i,idx in enumerate(indices)})
print(df)
##converting to 2d
print(convert_2d(df))
Some notes about the index keys: it can be assumed (like in my function) that there are no 'missing keys' (i.e. a 2d array can always be achieved) and the only thing that can be taken for granted about the keys is the (single) underscore (i.e. the numbers and letters were only chosen for explanatory reasons, in reality there would be just two arbitrary strings connected by the underscore).
IIUC, create the MultiIndex, then unstack:
df.index=pd.MultiIndex.from_tuples(df.index.str.split('_').map(tuple))
df['values'].unstack(level=0)
Out[65]:
0 1 2 3 4
a 0.583772 0.782358 0.766844 0.072565 0.576667
b 0.503876 0.352815 0.512834 0.070908 0.074875
c 0.361226 0.526089 0.299183 0.895878 0.874512
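A self-contained variant of the same idea is sketched below. It leaves the original df untouched and splits each key on the first underscore only. The level names "number" and "letter" and the sequential sample values are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

# Sample data with sequential values so the result is easy to check by eye.
indices = ["{}_{}".format(i, n) for n in ["a", "b", "c"] for i in range(5)]
df = pd.DataFrame({"values": np.arange(15, dtype=float)}, index=indices)

# Split each key once at the underscore and build a two-level index from the parts.
keys = [tuple(k.split("_", 1)) for k in df.index]
out = df.copy()
out.index = pd.MultiIndex.from_tuples(keys, names=["number", "letter"])

# Move the "number" level into the columns.
wide = out["values"].unstack("number")
print(wide)
```

Because the keys are strings, the resulting column labels are the strings "0" to "4" rather than integers.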
