Quick Pandas question:
I'm cleaning up the values in individual columns of a DataFrame by using an apply on a Series:
# For all values in col 'rate' over 1, divide by 100
df['rate'][df['rate']>1] = df['rate'][df['rate']>1].apply(lambda x: x/100)
This is fine when the selection criterion is simple, such as df['rate'] > 1. However, it gets very long when you start adding multiple selection criteria:
df['rate'][(df['rate']>1) & (~df['rate'].isnull()) & (df['rate_type']=='fixed') & (df['something']<= 'nothing')] = df['rate'][(df['rate']>1) & (~df['rate'].isnull()) & (df['rate_type']=='fixed') & (df['something']<= 'nothing')].apply(lambda x: x/100)
What's the most concise way to:
1. Split a column off (as a Series) from a DataFrame
2. Apply a function to the items of the Series
3. Update the DataFrame with the modified series
I've tried using df.update(), but that didn't seem to work. I've also tried using the Series as a selector, e.g. isin(Series), but I wasn't able to get that to work either.
Thank you!
When there are multiple conditions, you can keep things simple using eval:
mask = df.eval("rate > 1 & rate_type == 'fixed' & something <= 'nothing'")
df.loc[mask, 'rate'] = df.loc[mask, 'rate'].apply(lambda x: x / 100)
Read more about evaluating expressions dynamically in the pandas documentation for eval. Of course, this particular function can be vectorized as
df.loc[mask, 'rate'] /= 100
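For reference, here is a minimal, self-contained sketch of the eval-mask-plus-vectorized-update pattern, using a trimmed-down version of the mask above; the column values are invented purely for illustration:

import pandas as pd

# Toy frame mirroring the question's columns (values are made up)
df = pd.DataFrame({'rate': [0.05, 150.0, 2.5],
                   'rate_type': ['fixed', 'fixed', 'variable']})

mask = df.eval("rate > 1 & rate_type == 'fixed'")
df.loc[mask, 'rate'] /= 100   # only the 150.0 row matches and is scaled to 1.5
print(df)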
It will also work with update:
con = (df['rate'] > 1) & (df['rate_type'] == 'fixed') & (df['something'] <= 'nothing')
df.update(df.loc[con, ['rate']].apply(lambda x: x / 100))
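A quick, hedged sketch to show the update approach in action; the frame below is made up to match the question's column names:

import pandas as pd

df = pd.DataFrame({'rate': [0.05, 150.0, 250.0],
                   'rate_type': ['fixed', 'fixed', 'variable'],
                   'something': ['nothing', 'aaa', 'nothing']})

con = (df['rate'] > 1) & (df['rate_type'] == 'fixed') & (df['something'] <= 'nothing')
df.update(df.loc[con, ['rate']].apply(lambda x: x / 100))
print(df)   # only the 150.0 row is rewritten in place, becoming 1.5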
So I have a df here:
and I'm trying to do some filtering based on certain conditions. It works just fine when everything is based on one condition, but I can't do the same thing as pythonically as I'd like, because all the search results I find work on one specific column, and those searches also involve strings. What I'm trying to do is count how many rows have a value that's higher than 3 AND lower than 7 (these are all integers, so that means a 4, 5 or 6).
Sure, with a row count of 5 I could just count by hand, but this is only a subset of a much larger DataFrame.
I've also already tried this resource: https://www.geeksforgeeks.org/filter-pandas-dataframe-with-multiple-conditions/
Thanks!
The answer I got was:
df = df[df.apply(lambda x: (x < 7) & (3 < x), axis=1)]
for col in df.columns:
    df[col] = df[col].count()
All the values are the same count, but that's fine. Thanks!
Try:
df[df.apply(lambda x: (3 < x) & (x < 7)).apply(all, axis=1)]
You can use & to combine two boolean DataFrames:
(df.gt(3) & df.lt(7)).any(axis=1)
You can also use isin() with a range:
l = 3
h = 7
df.isin(range(l+1,h)).any(axis=1)
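For example, with a small made-up frame (the question's real data isn't shown), counting the rows that contain at least one value strictly between 3 and 7 looks like this:

import pandas as pd

df = pd.DataFrame({'a': [1, 4, 8, 10, 2],
                   'b': [6, 9, 2, 3, 1]})

mask = (df.gt(3) & df.lt(7)).any(axis=1)   # True for rows holding a 4, 5 or 6
print(mask.sum())                          # 2 -> only the first two rows qualify
# use .all(axis=1) instead if every column must fall in that range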
I am new to Python and am converting SQL to Python and want to learn the most efficient way to process a large dataset (rows > 1 million and columns > 100). I need to create multiple new columns based on other columns in the DataFrame. I have recently learned how to use pd.concat for new boolean columns, but I also have some non-boolean columns that rely on the values of other columns.
In SQL I would use a single case statement (case when age > 1000 then sample_id else 0 end as custom1, etc...). In Python I can achieve the same result in 2 steps (pd.concat + loc find & replace) as shown below. I have seen references in other posts to using the apply method but have also read in other posts that the apply method can be inefficient.
My question is then, for the code shown below, is there a more efficient way to do this? Can I do it all in one step within the pd.concat (so far I haven't been able to get that to work)? I am okay doing it in 2 steps if necessary. I need to be able to handle large integers (100 billion) in my custom1 element and have decimals in my custom2 element.
And finally, I tried using multiple separate np.where statements but received a warning that my DataFrame was fragmented and that I should try to use concat. So I am not sure which approach overall is most efficient or recommended.
Update: after receiving a comment and an answer pointing me towards np.where, I decided to test both approaches. Using a data set with 2.7 million rows and 80 columns, I added 25 new columns. The first approach was the concat + df.loc replace shown in this post; the second was np.where. I ran the test 10 times and np.where was faster in all 10 trials. As noted above, I think repeated use of np.where in this way can cause fragmentation, so my decision now comes down to the faster np.where with potential fragmentation vs. the slower concat without that risk. Any further insight on this final update is appreciated.
df = pd.DataFrame({'age': [120, 4000],
                   'weight': [505.31, 29.01],
                   'sample_id': [999999999999, 555555555555]},
                  index=['rock1', 'rock2'])
# Step 1: efficiently create the starting custom columns using concat
df = pd.concat(
    [
        df,
        (df["age"] > 1000).rename("custom1").astype(int),
        (df["weight"] < 100).rename("custom2").astype(float),
    ],
    axis=1,
)
# Step 2: assign final values to the custom columns based on other column values
df.loc[df.custom1 == 1, 'custom1'] = df['sample_id']
df.loc[df.custom2 == 1, 'custom2'] = df['weight'] / 2
Thanks for any feedback you can provide...I appreciate your time helping me.
The standard way to do this is with numpy.where:
import numpy as np
df['custom1'] = np.where(df.age.gt(1000), df.sample_id, 0)
df['custom2'] = np.where(df.weight.lt(100), df.weight / 2, 0)
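Regarding the fragmentation warning mentioned in the question: one pattern (a sketch, not benchmarked here) is to build all of the np.where results first and attach them with a single concat, which keeps the speed of np.where while only inserting columns once:

import numpy as np
import pandas as pd

# df is the frame from the question; build the new columns as plain arrays first
new_cols = pd.DataFrame({
    'custom1': np.where(df.age.gt(1000), df.sample_id, 0),
    'custom2': np.where(df.weight.lt(100), df.weight / 2, 0),
}, index=df.index)

# attach everything in one go to avoid repeated single-column inserts
df = pd.concat([df, new_cols], axis=1)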
I want to apply a regex function to selected rows in a dataframe. My solution works but the code is terribly long and I wonder if there is not a better, faster and more elegant way to solve this problem.
In words: I want my regex function to be applied to elements of the source_value column, but only in rows where source_type == 'rhombus' AND rhombus_refer_to_odk_type is 'integer' OR 'decimal'.
The code:
df_arrows.loc[(df_arrows['source_type']=='rhombus') & ((df_arrows['rhombus_refer_to_odk_type']=='integer') | (df_arrows['rhombus_refer_to_odk_type']=='decimal')),'source_value'] = df_arrows.loc[(df_arrows['source_type']=='rhombus') & ((df_arrows['rhombus_refer_to_odk_type']=='integer') | (df_arrows['rhombus_refer_to_odk_type']=='decimal')),'source_value'].apply(lambda x: re.sub(r'^[^<=>]+','', str(x)))
Use Series.isin to build the condition in a variable m, and for the replacement use Series.str.replace:
m = ((df_arrows['source_type'] == 'rhombus')
     & df_arrows['rhombus_refer_to_odk_type'].isin(['integer', 'decimal']))
df_arrows.loc[m, 'source_value'] = (
    df_arrows.loc[m, 'source_value'].astype(str).str.replace(r'^[^<=>]+', '', regex=True)
)
EDIT: If the mask comes out 2-dimensional, the likely problem is duplicated column names; you can test it with:
print(df_arrows['source_type'] == 'rhombus')
print(df_arrows['rhombus_refer_to_odk_type'].isin(['integer', 'decimal']))
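A tiny invented frame to show the effect (the real data isn't shown in the question):

import pandas as pd

df_arrows = pd.DataFrame({
    'source_type': ['rhombus', 'rhombus', 'box'],
    'rhombus_refer_to_odk_type': ['integer', 'text', 'decimal'],
    'source_value': ['age >= 18', 'name', 'weight < 100'],
})

m = ((df_arrows['source_type'] == 'rhombus')
     & df_arrows['rhombus_refer_to_odk_type'].isin(['integer', 'decimal']))
df_arrows.loc[m, 'source_value'] = (
    df_arrows.loc[m, 'source_value'].astype(str).str.replace(r'^[^<=>]+', '', regex=True)
)
print(df_arrows)   # row 0 becomes '>= 18'; the other rows are untouched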
I am selecting/filtering a DataFrame using multiple criteria (comparison with variables), like so:
results = df1[
(df1.Year == Year) &
(df1.headline == text) &
(df1.price > price1) &
(df1.price < price2) &
(df1.promo > promo1) &
(df1.promo < promo2)
]
While this approach works, it is very slow. Hence I wonder, is there any more efficient way of filtering/selecting rows based on multiple criteria using pandas?
Your current approach is pretty by-the-book as far as Pandas syntax goes, in my personal opinion.
One way to optimize, if you really need to do so, is to use the underlying NumPy arrays for generating the boolean masks. Generally speaking, Pandas may come with a bit of additional overhead in how it overloads operators versus NumPy. (With the tradeoff being arguably greater flexibility and intrinsically smooth handling of NaN data.)
price = df1.price.values
promo = df1.promo.values
# Note: boolean indexing like this returns a new (copied) DataFrame, not a view of df1
results = df1.loc[
(df1.Year.values == Year) &
(df1.headline.values == text) &
(price > price1) &
(price < price2) &
(promo > promo1) &
(promo < promo2)
]
Secondly, check that you are already taking advantage of numexpr, which Pandas is enabled to do:
>>> import pandas as pd
>>> pd.get_option('compute.use_numexpr') # use `pd.set_option()` if False
True
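If numexpr is installed, another way to hand it the whole expression at once is df.query; this is just an alternative spelling of the same filter (whether it beats the raw-NumPy masks above is worth timing on your own data):

# @-prefixed names refer to local Python variables from the question
results = df1.query(
    "Year == @Year and headline == @text "
    "and @price1 < price < @price2 "
    "and @promo1 < promo < @promo2"
)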
So, I've got a dataframe that looks like:
with 308 different ORIGIN_CITY_NAME values and 12 different UNIQUE_CARRIER values.
I am trying to remove the cities where the number of unique carrier airlines is < 5. To that end, I performed this function:
Now, I'd like to take this result and manipulate my original data, df, in such a way that I can remove the rows whose ORIGIN_CITY_NAME corresponds to True.
I had an idea to use the isin() function or apply(lambda), but I'm not familiar with how to go about it. Is there a more elegant way to do this? Thank you!
filter was made for this
df.groupby('ORIGIN_CITY_NAME').filter(
lambda d: d.UNIQUE_CARRIER.nunique() >= 5
)
However, to continue along the lines you were attempting...
I'd use map
mask = df.groupby('ORIGIN_CITY_NAME').UNIQUE_CARRIER.nunique() >= 5
df[df.ORIGIN_CITY_NAME.map(mask)]
Or transform
mask = df.groupby('ORIGIN_CITY_NAME').UNIQUE_CARRIER.transform(
lambda x: x.nunique() >= 5
)
df[mask]
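As a quick check that these spellings agree, here is a tiny invented sample where CityA has five distinct carriers and CityB only two, so only the CityA rows should survive:

import pandas as pd

df = pd.DataFrame({
    'ORIGIN_CITY_NAME': ['CityA'] * 5 + ['CityB'] * 3,
    'UNIQUE_CARRIER': ['AA', 'DL', 'UA', 'WN', 'B6', 'AA', 'AA', 'DL'],
})

kept = df.groupby('ORIGIN_CITY_NAME').filter(
    lambda d: d.UNIQUE_CARRIER.nunique() >= 5
)
print(kept['ORIGIN_CITY_NAME'].unique())   # ['CityA']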