I am fairly new to Python and pandas and I am trying to do the following:
Here is my dataset:
df5
Out[52]:
NAME
0 JIMMcdonald
1 TomDickson
2 SamHarper
I am trying to extract the first three characters of each name using a lambda with apply.
Here is what I have tried:
df5["FirstName"] = df5.apply(lambda x: x[0:3],axis=1)
here is the result:
df5
Out[54]:
NAME FirstName
0 JIMMcdonald JIMMcdonald
1 TomDickson TomDickson
2 SamHarper SamHarper
I don't understand why it didn't work. Can someone help me?
Thank you
This is due to the difference between DataFrame.apply (which is what you're using) and Series.apply (which is what you want to use). The easiest way to fix this is to select the series you want from your dataframe, and use .apply on that:
df5["FirstName"] = df5["NAME"].apply(lambda x: x[0:3],axis=1)
Because you passed axis=1, your current code runs the function once on each row, so x is the entire row and x[0:3] selects the first (up to) three values in that row, here just the full NAME string, rather than the first three characters. The fixed code runs the function on each individual value in the selected column.
Better yet, as @Erfan pointed out in his comment, simple one-liner string operations like this can often be simplified using pandas' .str accessor, which lets you operate on an entire Series of strings in much the same way you'd operate on a single string:
df5["FirstName"] = df5["NAME"].str[:3]
Hi peeps,
I would like to know whether it is possible to apply a function to the result of a pandas .loc selection, or whether there is a better way to do it.
So what I'm trying to do is:
If the value in this series is != 0, take the values from other columns of that row, use them as parameters for a function (in this case, get_working_days_delta), and then put the result back into the same series.
df.loc[(df["SERIES"] != 0), 'SERIES'] = df.apply(cal.get_working_days_delta(df["DATE_1"],df["DATE_2"]))
The output is: datetime64[ns] is of unsupported type (<class 'pandas.core.series.Series'>)
In this case, the parameters I pass (df["DATE_1"], df["DATE_2"]) are treated as entire Series rather than as cell values.
I don't want to use .apply or .at because this df has over 4 million rows.
Hard to give a good answer without an example.
But I think the problem is that you need to filter before using apply.
Try
df[df["SERIES"] != 0].apply(...)
I have the following list
x = [1,2,3]
And the following df
Sample df
pd.DataFrame({'UserId':[1,1,1,2,2,2,3,3,3,4,4,4],'Origins':[1,2,3,2,2,3,7,8,9,10,11,12]})
Let's say I want to return the UserIds whose grouped Origins contain any of the values in the list.
Wanted result
pd.Series({'UserId':[1,2]})
What would be the best approach to do this? Maybe a groupby with a lambda, but I am having a little trouble formulating the condition.
df['UserId'][df['Origins'].isin(x)].drop_duplicates()
I had considered using unique(), but that returns a numpy array. Since you wanted a series, I went with drop_duplicates().
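For reference, on the sample frame above this returns:
df['UserId'][df['Origins'].isin(x)].drop_duplicates()
[Out]:
0    1
3    2
Name: UserId, dtype: int64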
IIUC, OP wants the UserIds whose Origins contain any value in list x. If that is the case, the following, using pandas.Series.isin and pandas.unique, will do the job
df_new = df[df['Origins'].isin(x)]['UserId'].unique()
[Out]:
[1 2]
Assuming one wants a series, one can convert the dataframe to a series as follows
df_new = pd.Series(df_new)
[Out]:
0 1
1 2
dtype: int64
If one wants to return a Series and do it all in one step, instead of pandas.unique one can use pandas.Series.drop_duplicates (see Steven Rumbaliski's answer).
I am new to Python and am converting SQL to Python and want to learn the most efficient way to process a large dataset (rows > 1 million and columns > 100). I need to create multiple new columns based on other columns in the DataFrame. I have recently learned how to use pd.concat for new boolean columns, but I also have some non-boolean columns that rely on the values of other columns.
In SQL I would use a single case statement (case when age > 1000 then sample_id else 0 end as custom1, etc...). In Python I can achieve the same result in 2 steps (pd.concat + loc find & replace) as shown below. I have seen references in other posts to using the apply method but have also read in other posts that the apply method can be inefficient.
My question is then, for the code shown below, is there a more efficient way to do this? Can I do it all in one step within the pd.concat (so far I haven't been able to get that to work)? I am okay doing it in 2 steps if necessary. I need to be able to handle large integers (100 billion) in my custom1 element and have decimals in my custom2 element.
And finally, I tried using multiple separate np.where statements but received a warning that my DataFrame was fragmented and that I should try to use concat. So I am not sure which approach overall is most efficient or recommended.
Update - after receiving a comment and an answer pointing me towards use of np.where, I decided to test the approaches. Using a data set with 2.7 million rows and 80 columns, I added 25 new columns. First approach was to use the concat + df.loc replace as shown in this post. Second approach was to use np.where. I ran the test 10 times and np.where was faster in all 10 trials. As noted above, I think repeated use of np.where in this way can cause fragmentation, so I suppose now my decision comes down to faster np.where with potential fragmentation vs. slower use of concat without risk of fragmentation. Any further insight on this final update is appreciated.
df = pd.DataFrame({'age': [120, 4000],
                   'weight': [505.31, 29.01],
                   'sample_id': [999999999999, 555555555555]},
                  index=['rock1', 'rock2'])
#step 1: efficiently create starting custom columns using concat
df = pd.concat(
    [
        df,
        (df["age"] > 1000).rename("custom1").astype(int),
        (df["weight"] < 100).rename("custom2").astype(float),
    ],
    axis=1,
)
#step2: assign final values to custom columns based on other column values
df.loc[df.custom1 == 1, 'custom1'] = (df['sample_id'])
df.loc[df.custom2 == 1, 'custom2'] = (df['weight'] / 2)
Thanks for any feedback you can provide...I appreciate your time helping me.
The standard way to do this is using numpy.where:
import numpy as np
df['custom1'] = np.where(df.age.gt(1000), df.sample_id, 0)
df['custom2'] = np.where(df.weight.lt(100), df.weight / 2, 0)
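If the fragmentation warning from adding many new columns one at a time becomes a problem, one pattern worth trying (just a sketch starting from the original frame, not benchmarked here) is to compute all of the np.where results first and attach them with a single concat:
import numpy as np
import pandas as pd

# build every new column up front as plain arrays...
new_cols = pd.DataFrame({
    'custom1': np.where(df.age.gt(1000), df.sample_id, 0),
    'custom2': np.where(df.weight.lt(100), df.weight / 2, 0),
}, index=df.index)

# ...then attach them to the frame in one step
df = pd.concat([df, new_cols], axis=1)
This keeps the vectorized np.where logic while only touching the frame once.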
I'm learning Python and want to use the "apply" function. Reading around the manual, I found that if I have a simple dataframe like this:
df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])
A B
0 4 9
1 4 9
2 4 9
and then I use something like this:
df.apply(lambda x:x.sum(),axis=0)
the output works because, as documented, x receives every column and the sum is applied to each, so the result is correctly this:
A 12
B 27
dtype: int64
When instead I issue something like:
df['A'].apply(lambda x:x.sum())
the result is: 'int' object has no attribute 'sum'
My question is: why does this work on a dataframe column by column, but not on a single column? In the end the logic should be the same: x should receive one column as input instead of two.
I know that for this simple example I should use other functions like df.agg or even df['A'].sum() but the question is to understand the logic of apply.
If you look at a specific column of a pandas.DataFrame object, you are working with a pandas.Series with (in your case) integers as values, and integers don't have a sum() method.
(Run type(df['A']) to see that you are working with a series and not a data frame anymore when slicing a single column).
The irritating part is that if you work with an actual pandas.DataFrame object, every column is a pandas.Series object, and those do have a sum() method.
So there are two ways to fix your problem:
Work with a pandas.DataFrame instead of a pandas.Series: df[['A']]. The additional brackets force pandas to return a pandas.DataFrame object (verify with type(df[['A']])), and then use the lambda function just as you did before.
Use a function rather than a method inside the lambda: df['A'].apply(lambda x: np.sum(x)) (assuming you have imported numpy as np).
I would recommend going with the second option, as it seems to me the more generic and clearer way.
However, this is only relevant if you want to apply a certain function to every element in a pandas.Series or pandas.DataFrame. In your specific case, there is no need to take the detour that you are currently using. Just use df.sum(axis=0).
The approach with apply is overcomplicating things. The reason why it works is that every column of a pandas.DataFrame is a pandas.Series, which has a sum method. But a pandas.DataFrame has one as well, so you can use it right away.
The only case where you actually need to go through apply is if you had arrays in every cell of the pandas.DataFrame.
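For illustration, here are both fixes and the direct sum on the example frame above (a quick sketch):
import numpy as np
import pandas as pd

df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])

df[['A']].apply(lambda x: x.sum())   # option 1: x is the Series 'A', result is A    12
df['A'].apply(lambda x: np.sum(x))   # option 2: np.sum is applied to each individual integer
df.sum(axis=0)                       # no detour needed: A    12, B    27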
I have a csv dataset with texts. I need to search through them. I couldn't find an easy way to search for a string in a dataset and get the row and column indexes. For example, let's say the dataset is like:
df = pd.DataFrame({"China": ['Xi','Lee','Hung'], "India": ['Roy','Rani','Jay'], "England": ['Tom','Sam','Jack']})
Now let's say I want to find the string 'rani' and know its location. Is there a simple function to do that? Or do I have to loop through everything to find it?
One vectorized (and therefore relatively scalable) solution to this is to leverage numpy.where:
import numpy as np
np.where(df == 'Rani')
This returns two arrays, corresponding to row and column indices:
(array([1]), array([1]))
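If you want the labels rather than the positional indices (a small addition, not part of the original answer), you can map the arrays back through df.index and df.columns:
rows, cols = np.where(df == 'Rani')
list(zip(df.index[rows], df.columns[cols]))
[Out]:
[(1, 'India')]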
You can continue to take advantage of vectorized operations, but also write a more complicated filtering function, like so:
np.where(df.applymap(lambda x: "ani" in x))
In other words, "apply to each cell the function that returns True if 'ani' is in the cell", and then conduct the same np.where filtering step.
You can use any function:
def _should_include_cell(cell_contents):
    return cell_contents.lower() == "rani" or "Xi" in cell_contents

np.where(df.applymap(_should_include_cell))
Some final notes:
applymap is slower than simple equality checking
if you need this to scale WAY up, consider using dask instead of pandas
Not sure how this will scale, but it works:
df[df.eq('Rani')].dropna(axis=1, how='all').dropna()
India
1 Rani