So, I've got a dataframe that looks like:
with 308 different ORIGIN_CITY_NAME values and 12 different UNIQUE_CARRIER values.
I am trying to remove the cities where the number of unique carrier airlines is < 5. To that end, I performed this function:
Now, I'd like to take this result and manipulate my original data, df, in such a way that I can remove the rows where the ORIGIN_CITY_NAME corresponds to TRUE.
I had an idea in mind, which is to use the isin() function or apply(lambda) in Python, but I'm not familiar with how to go about it. Is there a more elegant way to do this? Thank you!
filter was made for this
df.groupby('ORIGIN_CITY_NAME').filter(
lambda d: d.UNIQUE_CARRIER.nunique() >= 5
)
However, to continue along the lines you were already attempting, I'd use map:
mask = df.groupby('ORIGIN_CITY_NAME').UNIQUE_CARRIER.nunique() >= 5
df[df.ORIGIN_CITY_NAME.map(mask)]
Or transform
mask = df.groupby('ORIGIN_CITY_NAME').UNIQUE_CARRIER.transform(
lambda x: x.nunique() >= 5
)
df[mask]
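If it helps to see that these approaches agree, here is a minimal end-to-end sketch with invented city and carrier names and a threshold of 2 instead of 5 (all data here is made up for illustration):
import pandas as pd

df = pd.DataFrame({
    'ORIGIN_CITY_NAME': ['Austin', 'Austin', 'Austin', 'Boise', 'Boise'],
    'UNIQUE_CARRIER':   ['AA', 'UA', 'AA', 'DL', 'DL'],
})

# filter keeps whole groups whose unique-carrier count meets the threshold
kept = df.groupby('ORIGIN_CITY_NAME').filter(
    lambda d: d.UNIQUE_CARRIER.nunique() >= 2
)

# map builds a city -> bool Series, then indexes the original frame with it
mask = df.groupby('ORIGIN_CITY_NAME').UNIQUE_CARRIER.nunique() >= 2
same = df[df.ORIGIN_CITY_NAME.map(mask)]

print(kept.equals(same))  # True -- only the Austin rows survive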
Related
I have the following list
x = [1,2,3]
And the following df
Sample df
pd.DataFrame({'UserId':[1,1,1,2,2,2,3,3,3,4,4,4],'Origins':[1,2,3,2,2,3,7,8,9,10,11,12]})
Let's say I want to return the UserIds whose group of Origins contains any of the values in the list.
Wanted result
pd.Series({'UserId':[1,2]})
What would be the best approach to do this? Maybe a groupby with a lambda, but I am having a little trouble formulating the condition.
df['UserId'][df['Origins'].isin(x)].drop_duplicates()
I had considered using unique(), but that returns a numpy array. Since you wanted a series, I went with drop_duplicates().
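For completeness, here is that one-liner run against the sample data from the question:
import pandas as pd

x = [1, 2, 3]
df = pd.DataFrame({'UserId':  [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
                   'Origins': [1, 2, 3, 2, 2, 3, 7, 8, 9, 10, 11, 12]})

print(df['UserId'][df['Origins'].isin(x)].drop_duplicates())
# 0    1
# 3    2
# Name: UserId, dtype: int64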
IIUC, the OP wants the UserIds whose Origins contain a value from the list x. If that is the case, the following, using pandas.Series.isin and pandas.unique, will do the work:
df_new = df[df['Origins'].isin(x)]['UserId'].unique()
[Out]:
[1 2]
Assuming one wants a Series, one can convert the resulting array to a Series as follows:
df_new = pd.Series(df_new)
[Out]:
0 1
1 2
dtype: int64
If one wants to return a Series and do it all in one step, then instead of pandas.unique one can use pandas.DataFrame.drop_duplicates (see Steven Rumbaliski's answer).
I have a pd dataframe which includes the columns CompTotal and CompFreq.
I wanted to add a third column, NormalizedAnnualCompensation, which uses the following logic:
If CompFreq is Yearly, use the existing value in CompTotal.
If CompFreq is Monthly, multiply the value in CompTotal by 12.
If CompFreq is Weekly, multiply the value in CompTotal by 52.
I eventually used np.where() to write what is basically a nested if statement of the kind I'm used to bodging together in Excel (I'm pretty new to coding in general); that's below.
My question is: could I have done it better? This doesn't feel very Pythonic based on what I've read and what I've been taught so far.
df['NormalizedAnnualCompensation'] = np.where(
    df['CompFreq'] == 'Yearly', df.CompTotal,
    np.where(df['CompFreq'] == 'Monthly', df.CompTotal * 12,
             np.where(df['CompFreq'] == 'Weekly', df.CompTotal * 52, 'NA')))
Thanks in advance.
There is no single "proper" way to do things, so what you already have is perfectly fine!
Still, you can certainly learn from asking for different approaches (though that probably goes beyond the scope of what Stack Overflow intends to be).
For example, you could stay within pandas by using boolean masks and assigning only to specific regions of the DataFrame (pd.DataFrame.loc):
df["NormalizedAnnualCompensation"] = "NA"
mask = df["CompFreq"]=="Yearly"
df.loc[mask, "NormalizedAnnualCompensation"] = df.loc[mask, "CompTotal"]
mask = df["CompFreq"]=="Monthly"
df.loc[mask, "NormalizedAnnualCompensation"] = df.loc[mask, "CompTotal"] * 12
mask = df["CompFreq"]=="Weekly"
df.loc[mask, "NormalizedAnnualCompensation"] = df.loc[mask, "CompTotal"] * 52
If you really only need to compare that column for equality, and each case fills a fixed value (i.e. CompTotal is a constant over the whole dataframe), you could consider simply using pd.Series.map; compare the following minimal example achieving a similar thing:
In [1]: pd.Series(np.random.randint(4, size=10)).map({0: "zero", 1: "one", 2: "two"}).fillna(
...: "NA"
...: )
Out[1]:
0 NA
1 two
2 NA
3 zero
4 two
5 zero
6 one
7 two
8 NA
9 two
dtype: object
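Applied to the CompFreq question, that map idea could look like the following sketch (assuming CompFreq only ever holds those three labels; anything else becomes NaN rather than the string 'NA'):
# Map each frequency label to an annualization factor, then multiply once.
multiplier = df['CompFreq'].map({'Yearly': 1, 'Monthly': 12, 'Weekly': 52})
df['NormalizedAnnualCompensation'] = df['CompTotal'] * multiplier
A nice side effect is that the result stays numeric instead of becoming an object column holding the string 'NA'.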
np.where() is good for simple if-then-else processing. However, if you have multiple conditions to test, nested np.where() calls become complicated and difficult to read. In this case, you can get cleaner and more readable code by using np.select(), as follows:
condlist = [df['CompFreq']=='Yearly', df['CompFreq']=='Monthly', df['CompFreq']=='Weekly']
choicelist = [df.CompTotal, df.CompTotal * 12, df.CompTotal * 52]
df['NormalizedAnnualCompensation'] = np.select(condlist, choicelist, default='NA')
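Here is a self-contained version with invented numbers, just to show the shape of the result:
import numpy as np
import pandas as pd

df = pd.DataFrame({'CompFreq':  ['Yearly', 'Monthly', 'Weekly', 'Hourly'],
                   'CompTotal': [50000, 4000, 1000, 25]})

condlist = [df['CompFreq'] == 'Yearly',
            df['CompFreq'] == 'Monthly',
            df['CompFreq'] == 'Weekly']
choicelist = [df.CompTotal, df.CompTotal * 12, df.CompTotal * 52]
df['NormalizedAnnualCompensation'] = np.select(condlist, choicelist, default='NA')
# Result: 50000, 48000, 52000, 'NA' -- note the string default makes the
# column dtype object, so use a numeric default if you need to keep numbers.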
I am fairly new to Python and pandas, and I am trying to do the following:
Here is my dataset:
df5
Out[52]:
NAME
0 JIMMcdonald
1 TomDickson
2 SamHarper
I am trying to extract the first three characters of each name using apply with a lambda.
Here is what I have tried:
df5["FirstName"] = df5.apply(lambda x: x[0:3],axis=1)
Here is the result:
df5
Out[54]:
NAME FirstName
0 JIMMcdonald JIMMcdonald
1 TomDickson TomDickson
2 SamHarper SamHarper
I don't understand why it didn't work. Can someone help me?
Thank you
This is due to the difference between DataFrame.apply (which is what you're using) and Series.apply (which is what you want to use). The easiest way to fix this is to select the series you want from your dataframe, and use .apply on that:
df5["FirstName"] = df5["NAME"].apply(lambda x: x[0:3],axis=1)
Your current code runs the function once per row (that's what axis=1 does); each row is passed in as a Series, so x[0:3] takes the first values of that row rather than the first three characters of the name. The fixed code runs the function on each individual value in the selected column instead. (Series.apply also has no axis argument, which is why it is dropped above.)
Better yet, as @Erfan pointed out in his comment, simple one-liner string operations like this can often be simplified using pandas' .str accessor, which lets you operate on an entire Series of strings much as you'd operate on a single string:
df5["FirstName"] = df5["NAME"].str[:3]
Quick Pandas question:
I am cleaning up the values in individual columns of a dataframe by using an apply on a series:
# For all values in col 'Rate' over 1, divide by 100
df['rate'][df['rate']>1] = df['rate'][df['rate']>1].apply(lambda x: x/100)
This is fine when the selection criterion is simple, such as df['rate']>1. However, the line gets very long once you start adding multiple selection criteria:
df['rate'][(df['rate']>1) & (~df['rate'].isnull()) & (df['rate_type']=='fixed') & (df['something']<='nothing')] = df['rate'][(df['rate']>1) & (~df['rate'].isnull()) & (df['rate_type']=='fixed') & (df['something']<='nothing')].apply(lambda x: x/100)
What's the most concise way to:
1. Split a column off (as a Series) from a DataFrame
2. Apply a function to the items of the Series
3. Update the DataFrame with the modified series
I've tried using df.update(), but that didn't seem to work. I've also tried using the Series as a selector, e.g. isin(Series), but I wasn't able to get that to work either.
Thank you!
When there are multiple conditions, you can keep things simple using eval:
mask = df.eval("rate > 1 & rate_type == 'fixed' & something <= 'nothing'")
df.loc[mask, 'rate'] = df['rate'].apply(function)
Read more about evaluating expressions dynamically here. Of course, this particular function can be vectorized as
df.loc[mask, 'rate'] /= 100
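Here is a runnable sketch with made-up rates showing both steps together:
import pandas as pd

df = pd.DataFrame({'rate':      [5.0, 0.04, 7.5],
                   'rate_type': ['fixed', 'fixed', 'variable'],
                   'something': ['nothing', 'nothing', 'nothing']})

mask = df.eval("rate > 1 & rate_type == 'fixed' & something <= 'nothing'")
df.loc[mask, 'rate'] /= 100
print(df['rate'].tolist())  # [0.05, 0.04, 7.5] -- only the first row matched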
It will work with update:
con = (df['rate'] > 1) & (df['rate_type'] == 'fixed') & (df['something'] <= 'nothing')
df.update(df.loc[con, ['rate']].apply(lambda x: x / 100))
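Note that DataFrame.update aligns on the index, which is why only the rows selected by con are overwritten; it also modifies df in place and skips NaN values in the frame you pass it.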
Coming from R, the code would be
x <- data.frame(vals = c(100,100,100,100,100,100,200,200,200,200,200,200,200,300,300,300,300,300))
x$state <- cumsum(c(1, diff(x$vals) != 0))
This marks every row where the difference from the previous row is non-zero, so I can use it to spot transitions in the data, like so:
vals state
1 100 1
...
7 200 2
...
14 300 3
What would be a clean equivalent in Python?
Additional question
The answer to the original question is posted below, but it won't work properly for a grouped dataframe in pandas.
Data here: https://pastebin.com/gEmPHAb7. Notice that there are 2 different filenames.
When it is imported as df_all, I group it with the following and then apply the solution posted below:
df_grouped = df_all.groupby("filename")
df_all["state"] = (df_grouped['Fit'].diff() != 0).cumsum()
Using diff and cumsum, as in your R example:
df['state'] = (df['vals'].diff()!= 0).cumsum()
This uses the fact that True has the integer value 1.
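Reproducing the R example end to end (the first row counts as a transition because diff() yields NaN there, and NaN != 0 is True):
import pandas as pd

df = pd.DataFrame({'vals': [100]*6 + [200]*7 + [300]*5})
df['state'] = (df['vals'].diff() != 0).cumsum()
print(df['state'].unique())  # [1 2 3]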
Bonus question
df_grouped = df_all.groupby("filename")
df_all["state"] = (df_grouped['Fit'].diff() != 0).cumsum()
I think you misunderstand what groupby does. All groupby does is create groups based on the criterion (filename in this instance). You then need to add another operation that tells pandas what should happen with each group.
Common operations are mean and sum, or more advanced ones such as apply and transform.
You can find more information here or here
If you can explain more in detail what you want to achieve with the groupby I can help you find the correct method. If you want to perform the above operation per filename, you probably need something like this:
def get_state(group):
    return (group.diff() != 0).cumsum()
df_all['state'] = df_all.groupby('filename')['Fit'].transform(get_state)
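A small check with two invented filenames confirms that the counter restarts per group:
import pandas as pd

df_all = pd.DataFrame({'filename': ['a']*4 + ['b']*4,
                       'Fit':      [1, 1, 2, 2, 5, 5, 5, 6]})
df_all['state'] = df_all.groupby('filename')['Fit'].transform(get_state)
print(df_all['state'].tolist())  # [1, 1, 2, 2, 1, 1, 1, 2]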