Apologies for the fairly basic question.
I have a large dataframe from which I'm pulling out the top dates by the sum of certain values. It looks like this:
hv_toploss = hv.groupby(['END_VALID_DT']).sum()
hv_toploss = hv_toploss.sort_values('TOTALPL', ascending=False).iloc[:10]
hv_toploss['END_VALID_DT'] = pd.to_datetime(hv_toploss['END_VALID_DT'])
Now, END_VALID_DT becomes the index of hv_toploss, and I get a KeyError when running line 3. If I try to reindex, I get a multi-index error, and since these are the values I need, I can't just drop the index.
I will be calling these values in a line like:
PnlByDay = PnlByDay.loc[hv_toploss['END_VALID_DT']]
Any help here would be great. I'm still a novice using Python.
You can use the index directly instead of creating another column containing the index.
the_dates = hv_toploss.sort_values('TOTALPL',ascending=False).iloc[:10].index
PnlByDay.loc[PnlByDay.index.isin(the_dates)]
I don't know the structure of PnlByDay, so you may have to modify that part.
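A minimal sketch of the index-based approach, using made-up data. The contents of hv and PnlByDay here are assumptions (the question does not show them), and PnlByDay is assumed to be indexed by date:

```python
import pandas as pd

# Hypothetical stand-in for `hv`; the real frame is not shown in the question
hv = pd.DataFrame({
    "END_VALID_DT": ["2021-01-01", "2021-01-01", "2021-01-02", "2021-01-03"],
    "TOTALPL": [5.0, 7.0, 3.0, 20.0],
})
hv["END_VALID_DT"] = pd.to_datetime(hv["END_VALID_DT"])

# After groupby, END_VALID_DT is the index of the result, not a column,
# which is why hv_toploss['END_VALID_DT'] raises a KeyError
hv_toploss = hv.groupby("END_VALID_DT").sum()
the_dates = hv_toploss.sort_values("TOTALPL", ascending=False).iloc[:2].index

# Hypothetical PnlByDay, assumed to have a DatetimeIndex
PnlByDay = pd.DataFrame(
    {"PNL": [1.0, 2.0, 3.0]},
    index=pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
)

# Keep only the rows whose index is one of the top dates
top_days = PnlByDay.loc[PnlByDay.index.isin(the_dates)]
```

Note that isin keeps the rows in PnlByDay's own order rather than in ranking order.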
OK, I got around this by copying the index values into a new column and using that.
hv_toploss = hv.groupby(['END_VALID_DT']).sum()
hv_toploss['Scenario_Dates'] = hv_toploss.index
hv_toploss = hv_toploss.sort_values('TOTALPL', ascending=False).iloc[:10]
However, if there is a more proper way to do this, please advise.
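One common alternative to copying the index by hand is reset_index(), which turns the group key back into an ordinary column. A sketch with made-up data (the contents of hv are an assumption):

```python
import pandas as pd

# Hypothetical stand-in for `hv`
hv = pd.DataFrame({
    "END_VALID_DT": ["2021-01-01", "2021-01-01", "2021-01-02", "2021-01-03"],
    "TOTALPL": [5.0, 7.0, 3.0, 20.0],
})

hv_toploss = (
    hv.groupby("END_VALID_DT")
      .sum()
      .reset_index()                          # END_VALID_DT becomes a column again
      .sort_values("TOTALPL", ascending=False)
      .iloc[:10]
)

# The column exists now, so the conversion no longer raises a KeyError
hv_toploss["END_VALID_DT"] = pd.to_datetime(hv_toploss["END_VALID_DT"])
```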
This is probably an easy fix that just eludes me right now.
I have an Excel file with the following content.
From it I want to strip the "Num-" from the rest of each value. For simplicity's sake I use .str here.
df_test = pd.read_excel(r'C:\...\test.xlsx')
df_test = df_test.filter(like='Order_Number', axis=1)
df_test = df_test['Order_Number'].str[4:]
df_test.head()
The output comes out without the title Order_Number, though, and I am not sure why. How can I preserve it without manually adding it back?
It appears you are assigning the new values of the 'Order_Number' column to the variable holding the entire dataframe, instead of assigning them to the column itself. Try:
df_test['Order_Number'] = df_test['Order_Number'].str[4:]
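To illustrate the difference, here is a small sketch with made-up order numbers (the real Excel contents are not shown, so the values are assumptions). Selecting a single column returns a Series, whose name is printed at the bottom rather than as a header; assigning back into the column keeps the DataFrame:

```python
import pandas as pd

# Hypothetical data in place of the Excel file
df_test = pd.DataFrame({"Order_Number": ["Num-1001", "Num-1002"]})

# This is a Series, so printing it shows no column header
as_series = df_test["Order_Number"].str[4:]

# Assigning to the column keeps df_test a DataFrame with its header
df_test["Order_Number"] = df_test["Order_Number"].str[4:]
print(df_test.head())
```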
I am working with dataframes for a uni assignment, but do not have a lot of experience with them. One of the datasets we use automatically puts the date as the index, as you can see in the screenshot of the dataframe. I have to work with if-statements and for-loops, which work better with a regular index. I can't find anywhere how to transform the date index into a regular column and add normal index numbers. Can anyone help me with this?
Try this:
df_sleep_2.reset_index()
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html
You can either set the parameter inplace=True to directly modify the dataframe, or assign it to a new variable e.g.
# modify dataframe in place
df_sleep_2.reset_index(inplace=True)
# or assign result to new variable
df_sleep_2_new_index = df_sleep_2.reset_index()
Try resetting the index using reset_index:
df_sleep_2.reset_index()
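A short sketch of what reset_index does, using a made-up frame (the column name "hours" and the index name "date" are assumptions, since the real data is only shown in a screenshot):

```python
import pandas as pd

# Hypothetical frame with dates as the index
df_sleep_2 = pd.DataFrame(
    {"hours": [7.5, 6.0, 8.2]},
    index=pd.to_datetime(["2021-03-01", "2021-03-02", "2021-03-03"]),
)
df_sleep_2.index.name = "date"

# reset_index returns a new frame: the dates become a regular column
# and the index becomes 0, 1, 2, ...
df_flat = df_sleep_2.reset_index()

# Integer positions now work as labels in a plain loop
for i in range(len(df_flat)):
    if df_flat.loc[i, "hours"] < 7:
        print(df_flat.loc[i, "date"], "short night")
```

Without inplace=True or an assignment, the original df_sleep_2 is left unchanged.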
Thanks in advance for the advice. I have a pandas dataframe. What I would like to do is label a column (adata.obs['new_annotation']) as 'pDC' when another column (adata.obs['leiden_scVI']) equals 10.
It seems to me that loc is probably the best way to go about this. Therefore I have tried:
adata.obs.loc[adata.obs['leiden_scVI']== '10', 'new_annotation'] = 'pDC'
But this generates a value error:
ValueError: Cannot setitem on a Categorical with a new category, set the categories first.
I've tried appending .astype('category') but this does not seem to solve the problem.
Is there another way of overcoming this please?
Many thanks.
ADDENDUM
Now solved - just need to change columns to .astype(str)
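The error arises because the target column is categorical and 'pDC' is not yet one of its categories. A sketch of two ways around it, using a plain DataFrame with made-up values as a stand-in for adata.obs (the labels "T cell" and "B cell" are assumptions):

```python
import pandas as pd

# Hypothetical stand-in for adata.obs with two categorical columns
obs = pd.DataFrame({
    "leiden_scVI": pd.Categorical(["3", "10", "10"]),
    "new_annotation": pd.Categorical(["T cell", "B cell", "B cell"]),
})

# Option 1: convert to plain strings, as in the addendum above
obs_str = obs.copy()
obs_str["new_annotation"] = obs_str["new_annotation"].astype(str)
obs_str.loc[obs_str["leiden_scVI"] == "10", "new_annotation"] = "pDC"

# Option 2: stay categorical and register the new label first
obs["new_annotation"] = obs["new_annotation"].cat.add_categories(["pDC"])
obs.loc[obs["leiden_scVI"] == "10", "new_annotation"] = "pDC"
```

Option 2 may be preferable when downstream code expects the column to remain categorical.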
I have a dataframe that's the result of importing a csv and then performing a few operations and adding a column that's the difference between two other columns (column 10 - column 9 let's say). I am trying to sort the dataframe by the absolute value of that difference column, without changing its value or adding another column.
I have seen this syntax over and over all over the internet, with indications that it was a success (accepted answers, comments saying "thanks, that worked", etc.). However, I get the error you see below:
df.sort_values(by='Difference', ascending=False, inplace=True, key=abs)
Error:
TypeError: sort_values() got an unexpected keyword argument 'key'
I'm not sure why the syntax that I see working for other people is not working for me. I have a lot more going on with the code and other dataframes, so I don't think it's a pandas import problem.
I have moved on and just made a new column that is the absolute value of the difference column and sorted by that, and exclude that column from my export to worksheet, but I really would like to know how to get it to work the other way. Any help is appreciated.
I'm using Python 3
df.loc[(df.c - df.b).sort_values(ascending = False).index]
This sorts by the difference between "c" and "b" without creating a new column.
I hope this is what you were looking for.
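key is an optional argument of sort_values. It was added in pandas 1.1.0, so an older installation raises this TypeError.
The function passed as key receives a Series as input and should return a Series of the same length.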
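A sketch of sorting by absolute value without adding a column, on made-up data. The first form avoids the key argument entirely, so it should also work on pandas versions older than 1.1; the second form assumes pandas 1.1 or newer:

```python
import pandas as pd

# Hypothetical data; only the 'Difference' column matters here
df = pd.DataFrame({"Difference": [-5, 2, -1, 4]})

# Version-independent: sort the absolute values, then reorder df by that index
by_abs = df.loc[df["Difference"].abs().sort_values(ascending=False).index]

# pandas >= 1.1: the key function is applied to the column before sorting
by_abs_key = df.sort_values(by="Difference", key=abs, ascending=False)
```

In both cases the stored values keep their sign; only the row order changes.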
I am working with a CSV file and I need to find the greatest several items in a column. I was able to find the top value just by doing the standard looping through and comparing values.
My first idea to get the top few values would be to store all of the values from that column in an array, sort it, and then pull the last three elements. However, I'm not sure that would be a good idea in terms of efficiency. I also need to pull other attributes associated with the top values, and it seems like separating out these column values would make everything messy.
Another thing that I thought about doing is having three variables and keeping a running top three, where every time I find something bigger I compare the "top three" amongst each other and reorder them. That also seems a bit complex and I'm not sure how I would implement it.
I would appreciate some ideas, or for someone to tell me if I'm missing something obvious. Let me know if you need to see my sample code (I felt it was probably unnecessary here).
Edit: To clarify, if the column values are something like [2,5,6,3,1,7] I would want to have the values first = 7, second = 6, third = 5
Pandas looks perfect for your task:
import pandas as pd
df = pd.read_csv('data.csv')
df.nlargest(3, 'column name')
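If pandas is not an option, the "running top three" idea from the question is roughly what heapq.nlargest in the standard library does. A sketch that keeps whole rows, so the other attributes stay attached; the column names "name" and "score" and the inline sample data are assumptions:

```python
import csv
import heapq
import io

# Inline sample in place of a real file; with a file, pass open('data.csv', newline='')
sample = io.StringIO("name,score\na,2\nb,5\nc,6\nd,3\ne,1\nf,7\n")

rows = csv.DictReader(sample)

# Keeps only the three largest rows while scanning, without sorting everything
top3 = heapq.nlargest(3, rows, key=lambda r: float(r["score"]))

first, second, third = top3
print(first["name"], second["name"], third["name"])
```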