I'm new to Pandas.
I've got a DataFrame where I want to group by user and then find their lowest score up until that date in their speed column.
So I can't just use df.groupby(['user'])['speed'].transform('min') as this would give the min of all values, not just from the first row up to the current one.
What can I use to get what I need?
Without seeing your dataset it's hard to help you directly, but the problem boils down to the following: you need to select the range of data you want to work with (so select rows for the date range and columns for the user/speed).
That would look something like x = df.loc["2018-04-02":"2019-04-02", ['user', 'speed']]
From there you could do a simple x['speed'].min() for the value or x['speed'].idxmin() for the index of the value.
I haven't played around with DataFrames for a bit, but what you're looking for is how to slice DataFrames.
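A minimal sketch of that slicing approach (the dates and column names here are made up for illustration):
import pandas as pd

# Hypothetical sample data: a sorted date index plus user/speed columns
df = pd.DataFrame({'user': ['a', 'a', 'b', 'b'],
                   'speed': [10, 7, 12, 9]},
                  index=pd.to_datetime(['2018-04-02', '2018-06-01',
                                        '2018-09-01', '2019-04-02']))

# Slice the date range and the columns of interest, then take the minimum
x = df.loc['2018-04-02':'2019-04-02', ['user', 'speed']]
print(x['speed'].min())     # smallest speed in the window
print(x['speed'].idxmin())  # date of that smallest speed

# For a running minimum per user up to each row, pandas also offers
# df.groupby('user')['speed'].cummin()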
How would I copy the last value from one column into another column in pandas, shifted down by one row? I am coding a stock portfolio withdrawal strategy and I need to multiply the ["Portfolio Remaining"] column by ["Return"] from index [1:End_Date], because after the first year the portfolio after the withdrawal (the ["Portfolio Remaining"] column) needs to be multiplied from index [1:End_Date]. So I want one column to copy another column, but each value in the new column will sit one row lower.
I have tried to use
df["Equity_Value"].loc[0] * df["Return"] and then
df3["Portfolio_Value"].loc[1:End_Date] * (df["Return"] + 1).cumprod(), however the calculations were wrong and Python said this method isn't suitable due to the slicing. I am unsure how to do this in pandas and can't find anything in the documentation or online resources on how to do it.
Assuming df is a pandas DataFrame (avoid calling it pd, which shadows the usual pandas import alias), the following shifts records by one row:
df["new_column"] = df["old_column"].shift(1)
Details at https://pandas.pydata.org/docs/reference/api/pandas.Series.shift.html
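As a minimal sketch applied to the portfolio question (the column names and numbers below are assumptions taken from the question, not a definitive implementation):
import pandas as pd

# Made-up numbers purely for illustration
df = pd.DataFrame({'Return': [0.05, -0.02, 0.03],
                   'Portfolio Remaining': [100.0, 95.0, 92.0]})

# shift(1) copies each value one row down; the first row becomes NaN
df['Prev Remaining'] = df['Portfolio Remaining'].shift(1)

# Grow last year's remaining balance by this year's return
df['Portfolio Value'] = df['Prev Remaining'] * (1 + df['Return'])
print(df)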
I have a dataframe which is similar to this
d1 = pd.DataFrame({'name': ['xyz', 'abc', 'dfg'],
                   'age': [15, 34, 22],
                   'sex': ['s1', 's2', 's3'],
                   'w-1(6)': [96, 66, 74],
                   'w-2(5)': [55, 86, 99],
                   'w-3(4)': [11, 66, 44]})
Note that in my original DataFrame the week numbers are generated dynamically, i.e., the columns
w-1(6), w-2(5) and w-3(4) are generated dynamically and change every week. I want to sort by all three week columns in descending order of their values.
But the names of the columns cannot be used, as they change every week.
Is there any possible way to achieve this?
Edit: The numbers might not always be present for all three weeks, in the sense that if w-1 has no data, I won't have that column in the dataset at all. So that would mean only two week columns and not three.
You can use the column indices.
d1.sort_values(by=[d1.columns[3], d1.columns[4], d1.columns[5]], ascending=False)
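Since the week columns come and go, here is a sketch that avoids hard-coding how many there are, assuming the first three columns are always name, age, and sex:
# Everything after the three fixed columns is a week column
week_cols = list(d1.columns[3:])
d1_sorted = d1.sort_values(by=week_cols, ascending=False)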
Background info
I'm working on a DataFrame where I have successfully joined two different datasets of football players using fuzzymatcher. These datasets did not have keys for an exact match, so the join had to be done on player names. An example of a matched name column from the two databases is the following:
long_name    name
L. Messi     Lionel Andrés Messi Cuccittini
As part of the validation process for an 18,000-row database, I want to check the two date-of-birth columns in the merged DataFrame df, ensuring that the columns match like the example below:
dob           birth_date
1987-06-24    1987-06-24
Both date columns have been converted from strings to dates using pd.to_datetime(), e.g.
df['birth_date'] = pd.to_datetime(df['birth_date'])
My question
I have another column called 'value'. I want to update my pandas DataFrame so that if the two date columns match, the entry is unchanged. However, if the two date columns don't match, I want the data in this value column to be changed to null. This is something I can do quite easily in Excel with a date-diff calculation, but I'm unsure how to do it in pandas.
My current code is the following:
df.loc[(df['birth_date'] != df['dob']),'value'] = np.nan
Reason for this step (feel free to skip)
The reason for this code is that it will quickly show me fuzzy matches that are inaccurate (approx 10% of total database) and allow me to quickly fix those.
Ideally I also need to work on the matching algorithm to ensure a perfect date match; however, my current algorithm works quite well in its current state and the project is nearly complete. I'd be happy to hear any advice on this, though, if it's something you know about.
Many thanks in advance!
IIUC:
Try np.where. It works as follows:
np.where(condition, x, y)
Here the condition is df['birth_date'] != df['dob'],
x is np.nan, and
y is the prevailing df['value']:
df['value'] = np.where(df['birth_date'] != df['dob'], np.nan, df['value'])
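A minimal, self-contained check of that pattern (the rows are made up):
import numpy as np
import pandas as pd

df = pd.DataFrame({'dob': pd.to_datetime(['1987-06-24', '1990-01-01']),
                   'birth_date': pd.to_datetime(['1987-06-24', '1991-05-05']),
                   'value': [100.0, 50.0]})

# Where the dates disagree, blank out value; otherwise keep it
df['value'] = np.where(df['birth_date'] != df['dob'], np.nan, df['value'])
print(df)  # the second row's value becomes NaN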
I have an Excel data file with thousands of rows and columns.
I am using python and have started using pandas dataframes to analyze data.
What I want to do in column D is to calculate annual change for values in column C for each year for each ID.
I can use Excel to do this: if the org ID is the same as that in the prior row, calculate the annual change (leaving the cells blank, highlighted in blue, because that's the first period for that particular ID). I don't know how to do this using Python. Can anyone help?
Assuming the DataFrame is already sorted:
df.groupby('ID').Cash.pct_change()
However, given that sorting assumption, you can speed things up, because it's not necessary to group in order to calculate the percentage change from one row to the next:
df.Cash.pct_change().mask(
df.ID != df.ID.shift()
)
These should produce the column values you are looking for. In order to add the column, you'll need to assign to a column or create a new DataFrame with the new column:
df['AnnChange'] = df.groupby('ID').Cash.pct_change()
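A quick sanity check on toy data (column names assumed from the question):
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2],
                   'Cash': [100, 110, 99, 50, 55]})

# Per-ID percentage change; the first row of each ID is NaN
df['AnnChange'] = df.groupby('ID').Cash.pct_change()
print(df)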
For a list of daily maximum temperature values from 5 to 27 degrees Celsius, I want to calculate the corresponding maximum ozone concentration from the following pandas DataFrame:
I can do this by using the following code, changing the 5 to 6, 7, etc.:
df_c = df_b[df_b['Tmax'] == 5]
df_c.O3max.max()
Then I have to copy and paste the output values into an Excel spreadsheet. I'm sure there must be a much more Pythonic way of doing this, such as by using a list comprehension. Ideally I would like to generate a list of values from the O3max column. Please give me some suggestions.
Use pd.Series.map with another pd.Series:
pd.Series(list_of_temps).map(df_b.set_index('Tmax')['O3max'])
You can also get a DataFrame:
result_df = pd.DataFrame(dict(temps=list_of_temps))
result_df['O3max'] = result_df.temps.map(df_b.set_index('Tmax')['O3max'])
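One caveat: set_index('Tmax') assumes each Tmax appears only once; with several days sharing the same Tmax, building the lookup with a groupby max handles the duplicates and also gives the per-temperature maximum the question asks for. A small made-up example:
import pandas as pd

# Illustrative numbers only
df_b = pd.DataFrame({'Tmax': [5, 5, 6, 7],
                     'O3max': [30, 42, 38, 51]})

list_of_temps = [5, 6, 7]  # the question uses 5 through 27

# Maximum ozone recorded at each Tmax, used as the lookup table
lookup = df_b.groupby('Tmax')['O3max'].max()
result = pd.Series(list_of_temps).map(lookup)
print(result.tolist())  # [42, 38, 51]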
I had another play around and think the following piece of code seems to do the job:
df_c = df_b.groupby('Tmax')['O3max'].max()
I would appreciate any thoughts on whether this is correct.