Determining if Pandas column containing array includes specific values - python

I have a dataframe with three columns: two define the start and end of a period of time (a window), and a third contains an array of individual timepoints. I would like to determine whether any of the individual points fall within the window's start and end (the other two columns). The ideal output would be True/False for each row.
I can iterate through each row of the dataframe, extract the timepoints and start_window and end_window times and determine this one row at a time, but I was looking for a faster (no-loop) option.
Example of the dataframe:
row  start_window  end_window  times (numpy array)
0    307.110309    307.710309  [307.48857, 307.6031]
1    309.140340    311.900309  [315.23134]
...
The output based on the above dataframe would be:
True
False

One way to do this is to use pd.DataFrame.apply:
df.apply(lambda x: any(x['start_window'] < i < x['end_window'] for i in x['times']), axis=1)
Output:
0 True
1 False
dtype: bool

Let us do it vectorized:
s = pd.DataFrame(df.times.tolist(), index=df.index)
(s.gt(df.start_window, axis=0) & s.lt(df.end_window, axis=0)).any(axis=1)
Output:
0 True
1 False
dtype: bool
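For intuition, here is a self-contained sketch of the same idea, reconstructing the question's example frame: the expansion pads the shorter list with NaN, and comparisons against NaN evaluate to False, so the padding never produces a spurious hit.
import pandas as pd

df = pd.DataFrame({
    "start_window": [307.110309, 309.140340],
    "end_window": [307.710309, 311.900309],
    "times": [[307.48857, 307.6031], [315.23134]],
})

# Expand the ragged lists into columns; row 1 gets NaN in column 1.
s = pd.DataFrame(df.times.tolist(), index=df.index)

# Compare each expanded column against the row's window bounds
# (axis=0 aligns the window Series with the rows, not the columns).
print((s.gt(df.start_window, axis=0) & s.lt(df.end_window, axis=0)).any(axis=1))
# 0     True
# 1    False
# dtype: bool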

Here is another efficient solution.
t_max = df["times"].apply(max)
t_min = df["times"].apply(min)
out = (t_max > df["start_window"]) & (t_min < df["end_window"])
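Note that this tests whether the span [min(times), max(times)] overlaps the window, not whether an individual point falls inside it; the two can disagree when the points straddle the window. A minimal sketch of the difference (values made up):
import pandas as pd

# Both points lie outside the window (4, 5), but their span covers it.
df = pd.DataFrame({"start_window": [4.0], "end_window": [5.0],
                   "times": [[1.0, 10.0]]})

t_max = df["times"].apply(max)
t_min = df["times"].apply(min)
print((t_max > df["start_window"]) & (t_min < df["end_window"]))  # 0: True (span overlaps)

print(df.apply(lambda x: any(x["start_window"] < t < x["end_window"]
                             for t in x["times"]), axis=1))  # 0: False (no point inside)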

Related

How to vectorize a for loop on a pandas dataframe?

I am working with data of about 200,000 rows. In one column of the pandas DataFrame some values are an empty list; most of them are lists with several values. What I want to do is replace the empty lists with this list:
[[close*0.95, close*0.94]]
where close is the close value in that row of the table. The for loop that I use is this one:
for i in range(1, len(data3.index)):
    close = data3.close[data3.index == data3.index[i]].values[0]
    sell_list = data3.sell[data3.index == data3.index[i]].values[0]
    buy_list = data3.buy[data3.index == data3.index[i]].values[0]
    if len(sell_list) == 0:
        data3.loc[data3.index[i], "sell"].append([[close*1.05, close*1.06]])
    if len(buy_list) == 0:
        data3.loc[data3.index[i], "buy"].append([[close*0.95, close*0.94]])
I tried to make it work with multithreading, but since I need to read the whole table to do the next step, I can't split the data. I hope you can help me make some kind of lambda function to apply to the df, or something similar; I am not very skilled at this. Thanks for reading!
The expected output for the "buy" column of a row with an empty list would be [[[11554, 11566]]].
Example data:
import pandas as pd
df = pd.DataFrame({'close': [11763, 21763, 31763], 'buy':[[], [[21763, 21767]], []]})
   close  buy
0  11763  []
1  21763  [[21763, 21767]]
2  31763  []
You could do it like this:
# Create mask (a bit faster than df['buy'].apply(len) == 0).
# Assumes there are no NaNs in the column. If you have NaNs, use Series.apply instead.
m = [len(l) == 0 for l in df['buy'].tolist()]
# Create triple nested lists and assign.
df.loc[m, 'buy'] = list(df.loc[m, ['close', 'close']].mul([0.95, 0.94]).to_numpy()[:, None][:, None])
print(df)
Result:
   close  buy
0  11763  [[[11174.85, 11057.22]]]
1  21763  [[21763, 21767]]
2  31763  [[[30174.85, 29857.219999999998]]]
Some explanation:
m is a boolean mask that selects the rows of the DataFrame with an empty list in the 'buy' column:
m = [len(l) == 0 for l in df['buy'].tolist()]
# Or (a bit slower): apply the len() function to all lists in the column.
m = df['buy'].apply(len) == 0
print(m)
0 True
1 False
2 True
Name: buy, dtype: bool
We can use this mask to select where to calculate the values.
df.loc[m, ['close', 'close']].mul([0.95, 0.94]) duplicates the 'close' column and calculates the vectorised product of all the (close, close) pairs with (0.95, 0.94) to obtain (close*0.95, close*0.94) in each row of the resulting array.
[:, None][:, None] is just a trick to create two additional axes on the resulting array. This is required since you want triple nested lists ([[[]]]).
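For reference, a small standalone sketch of the axis trick (the numbers are the ones from the result above):
import numpy as np

a = np.array([[11174.85, 11057.22],
              [30174.85, 29857.22]])  # shape (2, 2): one (close*0.95, close*0.94) pair per row
b = a[:, None][:, None]  # shape (2, 1, 1, 2): two extra axes inserted
print(b.shape)  # (2, 1, 1, 2)
print(list(b)[0].tolist())  # [[[11174.85, 11057.22]]] -- the triple-nested structure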

Performing the equivalent of a vlookup within a merged df in Pandas

I had no pandas/python experience this time last week so I have had a steep learning curve in trying to transfer a complex, multi-step process that was being done in excel, into pandas. Sorry if the following is unclear.
I merged 2 dataframes. I have a column, let's call it 'new_ID', with new ID names from originaldf1, some of which say 'no match was found'. For the 'no match was found' entries I would like to get the old ID number from originaldf2, which is another column in currentdf, let's call this col 'old_ID'. So, I would like to do something like an excel vlookup where I say: "if there is 'no match was found' in col 'new_ID', give me the ID that is in col 'old_ID', in that same row". The output I would like is just a list of all the old IDs where no match was found.
I've tried a few solutions that I found on here, but all just give me blank outputs. I'm assuming this is because they aren't matching each individual instance of "no match was found". For example I tried:
deletes = mydf.loc[mydf['new_ID'] == "no match was found", ['old_ID']]
This turns out with just the column header, then all blank.
Is what I'm trying to do possible in pandas? Or maybe I'm stuck in Excel ways of thinking and there is a better/different way!?
Welcome to Python. What you are trying to do is a straightforward task in pandas. Each column of a pandas DataFrame is a Series object; basically a list of values. You are trying to find which row numbers (aka indices) satisfy this criterion: new_id == "no match was found". This can be done by pulling the column out of the dataframe and applying a lambda function. I would recommend pasting this code in a new file and playing around to see how it works.
import pandas as pd

# Create test data frame
df = pd.DataFrame(columns=('new_id', 'old_id'))
df.loc[0] = (1, None)
df.loc[1] = ("no match", 4)
df.loc[2] = (3, None)
df.loc[3] = ("no match", 4)
print("\nHere is our test dataframe:")
print(df)
print("\nShow the values of 'new_id' that meet our criteria:")
print(df['new_id'][lambda x: x == "no match"])
# Pull the index from these rows
indices = df['new_id'][lambda x: x == "no match"].index.tolist()
print("\nIndices:\n", indices)
print("\nShow only the rows of the data frame that match 'indices':")
print(df.loc[indices]['old_id'])
A couple of notes about this code:
df.loc[] refers to a specific row of a data frame by index label. With the default integer index, df.loc[2] refers to the 3rd row (pandas data frames are zero-indexed).
When a lambda function is placed inside the indexing brackets [], pandas calls it with the Series being indexed ('new_id' here, referred to as x) and uses the boolean Series it returns as a mask. The output of x == "no match" is a Series of True or False values, so only the rows where it is True are returned.
After the lambda function, .index.tolist() converts the index of the resulting Series to a list of labels.
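For reference, the same lookup can be written more directly with a boolean mask, which is the more common pandas idiom (a sketch against the test frame above):
old_ids = df.loc[df['new_id'] == "no match", 'old_id'].tolist()
print(old_ids)  # [4, 4]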
Working off your example, I'm going to assume all new_ID entries are numbers unless there is no match.
So suppose your dataframe looks like this (assuming the 2nd column has some values; I didn't know, so I put 0's):
   new_ID    originaldf2
0  1         0
1  2         0
2  3         0
3  no match  4
Next we can check whether your new_ID column has an ID or not by seeing if it contains a number, using str.isnumeric():
has_id = df1.new_ID.str.isnumeric()
has_id

0     True
1     True
2     True
3    False
Name: new_ID, dtype: bool
Then finally we'll use where().
What this does: it takes the first argument, cond, to which we've passed the has_id boolean filter, and checks whether each entry is True or False. If True, it keeps the original value; if False, it falls back to the argument passed as other, which in this case is the second column of our dataframe.
df1.where(has_id, df1.iloc[:, 1], axis=0)
   new_ID  originaldf2
0       1            0
1       2            0
2       3            0
3       4            4
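If you only need the one column rather than the whole frame, Series.where does the same job with simpler alignment (a sketch; column names follow the example above):
df1['new_ID'] = df1['new_ID'].where(has_id, df1['originaldf2'])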

Why does using "==" return a Series instead of bool in pandas?

I just can't figure out what "==" means on the second line:
- It is not a test; there is no if statement...
- It is not a variable declaration...
I've never seen this before. The thing is, data.categ == cat is a pandas Series and not a test...
for cat in data["categ"].unique():
subset = data[data.categ == cat] # Création du sous-échantillon
print("-"*20)
print('Catégorie : ' + cat)
print("moyenne:\n",subset['montant'].mean())
print("mediane:\n",subset['montant'].median())
print("mode:\n",subset['montant'].mode())
print("VAR:\n",subset['montant'].var())
print("EC:\n",subset['montant'].std())
plt.figure(figsize=(5,5))
subset["montant"].hist(bins=30) # Crée l'histogramme
plt.show() # Affiche l'histogramme
It is testing each element of data.categ for equality with cat. That produces a Series of True/False values. This is passed as an indexer to data[], which returns the rows from data that correspond to the True values.
To summarize, the whole expression returns the subset of rows from data where the value of data.categ equals cat.
(It seems the whole operation could be done more elegantly using data.groupby('categ').apply(someFunc).)
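For example, a sketch of the groupby alternative (column names follow the question's code):
# One call computes most of the per-category statistics the loop prints one by one.
stats = data.groupby('categ')['montant'].agg(['mean', 'median', 'var', 'std'])
print(stats)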
It creates a boolean Series that is True at the indexes where data.categ is equal to cat. With this boolean mask you can filter your dataframe; in other words, subset will have all records where categ is the value stored in cat.
This is an example using numeric data:
import numpy as np
import pandas as pd

np.random.seed(0)
a = np.random.choice(np.arange(2), 5)
b = np.random.choice(np.arange(2), 5)
df = pd.DataFrame(dict(a=a, b=b))
df[df.a == 0].head()
# a b
# 0 0 0
# 2 0 0
# 4 0 1
df[df.a == df.b].head()
# a b
# 0 0 0
# 2 0 0
# 3 1 1
Yes, it is a test. Boolean expressions are not restricted to if statements.
It looks as if data is a data frame (pandas). An expression used as a data frame index is how pandas denotes a selector or filter. This one says to select every row in which the field categ matches the variable cat (apparently a pre-defined variable). This collection of rows becomes a new data frame, subset.
data.categ == cat returns a boolean Series that is used to filter your dataframe, keeping only the rows where the boolean is True.
Booleans are used in many situations, not only in if statements.
Here you are comparing data.categ with cat, the element currently being iterated over among the unique categories of data; the rows where they are equal form the subset used in that pass of the loop.

Pandas - how to filter dataframe by regex comparisons on multiple column values

I have a dataframe like the following, where everything is formatted as a string:
df
  property  value  count
0   propAb   True     10
1   propAA  False     10
2   propAB   blah     10
3   propBb      3      8
4   propBA      4      7
5   propCa    100      4
I am trying to find a way to filter the dataframe by applying a series of regex-style rules to both the property and value columns together.
For example, some sample rules may be like the following:
"if property starts with 'propA' and value is not 'True', drop the row".
Another rule may be something more mathematical, like:
"if property starts with 'propB' and value < 4, drop the row".
Is there a way to accomplish something like this without having to iterate over all rows each time for every rule I want to apply?
You still have to apply each rule (how else?), but let pandas handle the rows. Also, instead of removing the rows that you do not like, keep the rows that you do. Here's an example of how the first two rules can be applied:
rule1 = df.property.str.startswith('propA') & (df.value != 'True')
df = df[~rule1] # Keep everything that does NOT match
rule2 = df.property.str.startswith('propB') & (df.value < 4)
df = df[~rule2] # Keep everything that does NOT match
By the way, the second rule will not work because value is not a numeric column.
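One way to make the numeric comparison work is to coerce the column first (a sketch: pd.to_numeric with errors='coerce' turns non-numeric strings into NaN, and NaN comparisons evaluate to False, so those rows are kept):
value_num = pd.to_numeric(df.value, errors='coerce')
rule2 = df.property.str.startswith('propB') & (value_num < 4)
df = df[~rule2]  # Drops propBb (value 3); keeps everything else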
For the first one:
df = df.drop(df[(df.property.str.startswith('propA')) & (df.value != 'True')].index)
and the other one:
df = df.drop(df[(df.property.str.startswith('propB')) & (pd.to_numeric(df.value, errors='coerce') < 4)].index)

Split a list by half the length and add new column with dependent values

I have a csv datafile that I've split by a column value into 5 datasets, one for each person, using:
for i in range(1,6):
    PersonData = df[df['Person'] == i].values
    P[i] = PersonData
I want to sort the data into ascending order according to one column, then split the data half way at that column to find the median.
So I sorted the data with the following:
dataP = {}
for i in range(1,6):
    sortData = P[i][P[i][:,9].argsort()]
    P[i] = sortData
    dataP[i] = pd.DataFrame(P[i])
dataP[1]
Using that I get a dataframe for each of my datasets 1-6 sorted by the relevant column (9), depending on which number I put into dataP[i].
Then I calculate half the length:
for i in range(1,6):
    middle = len(dataP[i])/2
    print(middle)
Here is where I'm stuck!
I need to create a new column in each dataP[i] dataframe that splits the length in 2 and gives the value 0 if it's in the first half and 1 if it's in the second.
This is what I've tried but I don't understand why it doesn't produce a new list of values 0 and 1 that I can later append to dataP[i]:
for n in range(1, len(dataP[i])):
    for n, line in enumerate(dataP[i]):
        if middle > n:
            confval = 0
        elif middle < n:
            confval = 1
for i in range(1,6):
    Confval[i] = confval
Confval[1]
Sorry if this is basic, I'm quite new to this so a lot of what I've written might not be the best way to do it/necessary, and sorry also for the long post.
Any help would be massively appreciated. Thanks in advance!
If I'm reading your question right, I believe you are attempting to do two things:
1. Find the median value of a column.
2. Create a new column which is 0 if the value is less than the median and 1 if greater.
Let's tackle #1 first:
median = df['originalcolumn'].median()
That's easy! There are many great pandas functions for things like this.
Ok, so number two:
df['newcolumn'] = (df['originalcolumn'] > median).astype(int)
What we're doing here is creating a new boolean Series that is False if the value at that location is less than the median and True otherwise. Then we cast it to int, which gives us 0s and 1s.
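If you specifically need the positional split described in the question (0 for the first half of the sorted rows, 1 for the second half), a sketch along the same lines, reusing the hypothetical column names from above:
import numpy as np

df = df.sort_values('originalcolumn').reset_index(drop=True)
df['newcolumn'] = (np.arange(len(df)) >= len(df) / 2).astype(int)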
