So I have a function replaceMonth(string), which is just a series of if statements that returns a new string derived from the one passed in. The strings come from a column in a pandas dataframe, and I need to replace each original string with the derived one.
The dataframe is defined like this:
Index  ID      Year  DSFS         DrugCount
0      111111  Y1    3- 4 months  1
There are around 80K rows in the dataframe. What I need to do is to replace what is in column DSFS with the result from the replaceMonth(string) function.
So if, for example, the value in the first row of DSFS was '3- 4 months', running that string through replaceMonth() would give me '_3_4' as the return value. Then I need to change the value in the dataframe from '3- 4 months' to '_3_4'.
I've been trying to use apply on the dataframe, but I'm either getting the syntax wrong or not correctly understanding what it does, like this:
dataframe['DSFS'].apply(replaceMonth(dataframe['DSFS']))
That doesn't look right to me, but I'm not sure where I'm messing up. I'm fairly new to Python, so it's probably the syntax. :)
Any help is greatly appreciated!
When you call apply, you pass the function that you want applied to each element.
Try
dataframe['DSFS'].apply(replaceMonth)
Reassign the result to the column to preserve the changes:
dataframe['DSFS'] = dataframe['DSFS'].apply(replaceMonth)
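For reference, a minimal sketch of what a mapping-based replaceMonth could look like (your real function is a series of if statements; the exact set of DSFS strings below is an assumption):

def replaceMonth(s):
    # Hypothetical lookup of DSFS strings to derived labels
    months = {
        '0- 1 month': '_0_1',
        '1- 2 months': '_1_2',
        '2- 3 months': '_2_3',
        '3- 4 months': '_3_4',
    }
    return months.get(s, s)  # fall back to the original string if unmapped

dataframe['DSFS'] = dataframe['DSFS'].apply(replaceMonth)

At 80K rows this element-wise apply is still perfectly fast.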
Related
I have a dataframe that looks like this
I need to adjust the time_in_weeks column for the entry with 34. When a uniqueid appears more than once with a different rma_created_date, it means a failure occurred. The 34 needs to be recalculated as the number of weeks between the new, most recent rma_created_date (2020-10-15 in this case) and the rma_processed_date of the row above (2020-06-28).
I hope that makes sense in terms of what I am trying to do.
So far I did this
def clean_df(df):
    '''
    This function fixes the time_in_weeks column to calculate the correct
    number of weeks when there are multiple failures for an item.
    '''
    # Sort by rma_created_date
    df = df.sort_values(by=['rma_created_date'])
Now I need to perform what I described above, but I'm a little confused about how to do it, especially since there could be more than two failures, not just 2.
I should get something like this returned as output
As you can see, the 34 got changed to the number of weeks between 2020-10-15 and 2020-06-28.
Here is another example with more rows
Using the expression suggested
df['time_in_weeks'] = np.where(
    df.uniqueid.duplicated(keep='first'),
    df.rma_processed_date.dt.isocalendar().week.sub(
        df.rma_processed_date.dt.isocalendar().week.shift(1)),
    df.time_in_weeks,
)
I get this
Final note: if there is a date of 1/1/1900 then don't perform any calculation.
The question is not very clear; happy to correct if I interpreted it wrongly.
Try np.where(condition, value if condition is True, value if condition is False):
#Coerce dates into datetime
df['rma_processed_date'] = pd.to_datetime(df['rma_processed_date'])
df['rma_created_date'] = pd.to_datetime(df['rma_created_date'])
#Solution: for repeated uniqueids, the difference between the two dates, converted to weeks
df['time_in_weeks'] = np.where(
    df.uniqueid.duplicated(keep='first'),
    df.rma_created_date.sub(df.rma_processed_date).dt.days.div(7),
    df.time_in_weeks,
)
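If the intent is, as the question describes, to subtract the rma_processed_date of the previous row for the same uniqueid (and to honour the final note about 1/1/1900), a sketch could look like the following. The shift, the week conversion, and the choice of rma_created_date as the column carrying the 1/1/1900 sentinel are all assumptions:

import numpy as np
import pandas as pd

# Sort so that shift(1) looks at the prior failure for the same uniqueid
df = df.sort_values(by=['uniqueid', 'rma_created_date'])
prev_processed = df.groupby('uniqueid')['rma_processed_date'].shift(1)

# Recalculate only repeated uniqueids, skipping the 1/1/1900 sentinel
mask = (df['uniqueid'].duplicated(keep='first')
        & (df['rma_created_date'] != pd.Timestamp('1900-01-01')))

df['time_in_weeks'] = np.where(
    mask,
    df['rma_created_date'].sub(prev_processed).dt.days.div(7),
    df['time_in_weeks'],
)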
I'm really new to Python and would appreciate your help.
I'm trying to add a column to my dataframe. I'm using the following:
df['capital'] = np.where(df['year'] != 1960, 2, df['GDP'])
Except when I write GDP, what I really want is the GDP for the year 1961. Any idea how I can include that?
You can use .shift(-1) to pull in the GDP value from the next row (assuming the frame is sorted by year ascending, so the row after 1960 is 1961). By default a 1 is passed to .shift(), which looks at the previous row; pass -1, 2, etc. to look forward or further back as many rows as you'd like:
df['capital'] = np.where((df['year'] != 1960), 2, df['GDP'].shift(-1))
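A quick sketch with made-up numbers (the years and GDP values are illustrative only):

import numpy as np
import pandas as pd

df = pd.DataFrame({'year': [1959, 1960, 1961],
                   'GDP': [100.0, 110.0, 120.0]})

# Every row except 1960 gets 2; the 1960 row gets the next row's GDP
df['capital'] = np.where(df['year'] != 1960, 2, df['GDP'].shift(-1))
print(df)  # the 1960 row's capital is 120.0, i.e. the 1961 GDP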
So I needed to add the 'count' value to the dataframe as a new column.
I used the below code:
count = HC_Pune.groupby('Customer').size()
HC_Pune.sort_values(by='Customer', inplace=True, ascending=True)
BCP.insert(2, 'HC', count, True)
BCP.head(2)
This is the output that I get
Could someone please help me understand why I am getting NaN for the data?
From the look of your output, the values never line up: count is a Series indexed by Customer, while BCP.insert aligns it against BCP's own row index, so nothing matches and you get NaN. Note also that count was computed from HC_Pune but inserted into BCP, a different frame. Take a look at your dataset and its indexes.
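Two alignment-safe sketches, depending on which frame the counts belong in (whether BCP also has a Customer column is an assumption):

# Broadcast each customer's group size back onto every row of HC_Pune
HC_Pune['HC'] = HC_Pune.groupby('Customer')['Customer'].transform('size')

# Or map the per-customer counts onto another frame that has a Customer column
count = HC_Pune.groupby('Customer').size()
BCP['HC'] = BCP['Customer'].map(count)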
I'm new to Pandas.
I've got a dataframe where I want to group by user and then find their lowest score up until that date in their speed column.
So I can't just use df.groupby(['user'])['speed'].transform('min'), as this would give the min of all values, not just those from the first row up to the current one.
What can I use to get what I need?
Without seeing your dataset it's hard to help you directly, but the problem boils down to the following: you need to select the range of data you want to work with (rows for the date range, columns for the user/speed).
That would look something like x = df.loc["2-4-2018":"2-4-2019", ['user', 'speed']]
From there you could do a simple x['speed'].min() for the value, or x['speed'].idxmin() for the index of the value.
I haven't played around with DataFrames for a bit, but you're looking for how to slice DataFrames.
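If what you're after is the running minimum rather than one slice, GroupBy.cummin computes the lowest speed seen so far for each user, which is a different approach from the slicing above (the column names and sample data here are assumptions):

import pandas as pd

df = pd.DataFrame({
    'user':  ['a', 'a', 'a', 'b', 'b'],
    'date':  pd.to_datetime(['2018-04-01', '2018-04-02', '2018-04-03',
                             '2018-04-01', '2018-04-02']),
    'speed': [5.0, 3.0, 4.0, 7.0, 6.0],
})

df = df.sort_values(['user', 'date'])
# Lowest speed for each user from the first row up to the current one
df['min_so_far'] = df.groupby('user')['speed'].cummin()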
For a list of daily maximum temperature values from 5 to 27 degrees Celsius, I want to calculate the corresponding maximum ozone concentration from the following pandas DataFrame:
I can do this with the following code, changing the 5 to 6, 7, etc.:
df_c = df_b[df_b['Tmax'] == 5]
df_c.O3max.max()
Then I have to copy and paste the output values into an Excel spreadsheet. I'm sure there must be a much more pythonic way of doing this, such as using a list comprehension. Ideally I would like to generate a list of values from the O3max column. Please give me some suggestions.
Use pd.Series.map with another pd.Series:
pd.Series(list_of_temps).map(df_b.set_index('Tmax')['O3max'])
You can also get a dataframe:
result_df = pd.DataFrame(dict(temps=list_of_temps))
result_df['O3max'] = result_df.temps.map(df_b.set_index('Tmax')['O3max'])
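For a self-contained run, list_of_temps can be built from the 5-to-27 range mentioned in the question. One caveat: mapping against a set_index('Tmax') Series assumes Tmax is unique in df_b; if temperatures repeat, collapse them first:

list_of_temps = list(range(5, 28))  # 5 to 27 degrees inclusive

# One O3max per Tmax; taking the max first guards against duplicate Tmax rows
lookup = df_b.groupby('Tmax')['O3max'].max()

result_df = pd.DataFrame(dict(temps=list_of_temps))
result_df['O3max'] = result_df['temps'].map(lookup)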
I had another play around and the following piece of code seems to do the job:
df_c = df_b.groupby(['Tmax'])['O3max'].max()
I would appreciate any thoughts on whether this is correct