Losing data on a Pandas DataFrame reindex [duplicate] - python

This question already has answers here:
Difference between df.reindex() and df.set_index() methods in pandas
(3 answers)
Closed 2 years ago.
I was losing data on a reindex. I just wanted to make an existing column the index.
So this works:
df_all_maa = df_all_maa.set_index("VERSION_SEQ")
Originally I was doing this:
df_all_maa = df_all_maa.reindex(df_all_maa["VERSION_SEQ"])
I think what was happening is that the resulting dataframe only kept rows where the VERSION_SEQ value happened to match the numeric default index, but I would be interested to know what my original, incorrect syntax was actually doing.

reindex is similar to loc, but it allows non-existent index labels: reindex creates a row of NaN values wherever a requested label does not exist in the current index, while loc would raise an error.
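For illustration, here is a minimal sketch with made-up VERSION_SEQ values (not the asker's actual data) showing the difference:
import pandas as pd

df_all_maa = pd.DataFrame({"VERSION_SEQ": [101, 0, 102], "value": ["a", "b", "c"]})

# set_index keeps every row and simply promotes the column to the index.
print(df_all_maa.set_index("VERSION_SEQ"))

# reindex looks up each VERSION_SEQ value (101, 0, 102) as a label in the
# current default integer index (0, 1, 2). Only 0 exists, so the other rows
# come back filled with NaN -- the data loss described above.
print(df_all_maa.reindex(df_all_maa["VERSION_SEQ"]))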


Using Pandas function isin() [duplicate]

This question already has answers here:
How to deal with SettingWithCopyWarning in Pandas
(20 answers)
Closed 1 year ago.
Let me explain my problem. I have a dataframe to which I want to add a True/False column. This dataframe contains the following columns: Référence, msn, description... I have another dataframe containing a reference column called "AM", among other columns. The objective is to fill this True/False column according to whether there is a match between the two tables on the reference field.
here is my python code:
df["Avis BE"]=False
df[df["Référence"].isin(df1["AM"])]["Avis BE"]=True
I have this error message:
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
It's a warning; use
df.loc[:, "Avis BE"] = False
df.loc[df["Référence"].isin(df1["AM"]), "Avis BE"] = True
Also refer to the pandas documentation on indexing and setting values; it highlights this issue and suggests better practices:
Documentation
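For completeness, a minimal runnable sketch with invented data (the real frames have more columns) showing the corrected assignment:
import pandas as pd

df = pd.DataFrame({"Référence": ["A1", "B2", "C3"], "msn": [1, 2, 3]})
df1 = pd.DataFrame({"AM": ["B2", "C3"]})

df["Avis BE"] = False
# One .loc call with a boolean row mask and the column label avoids the
# chained indexing that triggers SettingWithCopyWarning.
df.loc[df["Référence"].isin(df1["AM"]), "Avis BE"] = True
print(df)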

Pandas: How to replace values in a dataframe based on a conditional [duplicate]

This question already has answers here:
Pandas fill missing values in dataframe from another dataframe
(6 answers)
Closed 1 year ago.
I am trying to replace the values of my second dataframe ('area') with those of my first dataframe ('test').
Image of my inputs:
The catch is that I only want to replace the values that are not NaN, so, for example, area.iloc[0,1] will be "6643.68" rather than "3321.84" but area.iloc[-2,-1] will be "19.66" rather than "NaN". I would have thought I could do something like:
area.loc[test.notnull()] = test
or
area.replace(area.loc[test.notnull()], test.notnull())
But this gives me the error "Cannot index with multidimensional key". Any ideas? This should be simple.
Use fillna, like:
area.fillna(test)
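To show what fillna does here, a small sketch with invented numbers (not the asker's actual tables): values are taken from test only where area is NaN, and existing values in area are left untouched.
import pandas as pd
import numpy as np

area = pd.DataFrame({"x": [3321.84, np.nan], "y": [np.nan, 19.66]})
test = pd.DataFrame({"x": [6643.68, 5.0], "y": [2.0, np.nan]})

# fillna with a DataFrame aligns on index and columns and only fills NaNs.
filled = area.fillna(test)
print(filled)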

How to convert index and values to a proper dataframe with callable column names? Python Pandas [duplicate]

This question already has answers here:
Accessing a Pandas index like a regular column
(3 answers)
Closed 1 year ago.
I am working on this dataset where I have used the sort_values() function to get two distinct columns: the index and the values. I can even rename the index and the values columns. However, if I rename the dataset columns and assign everything to a new dataframe, I am not able to call the index column with the name that I assigned to it earlier.
pm_freq = df["payment_method"].value_counts()
pm_freq = pm_freq.head().to_frame()
pm_freq.index.name="Method"
pm_freq.rename(columns={"payment_method":"Frequency"},inplace=True)
pm_freq
Now I want to call it like this:
pm_freq["Method"]
But there's an error which states:
" KeyError: 'Method'"
Can someone please help me out with this?
Check out the answer here; not sure if it is still correct:
https://stackoverflow.com/a/18023468/15600610
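For reference, a hedged sketch (with made-up payment methods) of one common way to make the index addressable like a regular column, reset_index():
import pandas as pd

df = pd.DataFrame({"payment_method": ["card", "cash", "card", "card", "cash"]})

pm_freq = df["payment_method"].value_counts().head().to_frame()
pm_freq.index.name = "Method"
pm_freq = pm_freq.rename(columns={"payment_method": "Frequency"})

# reset_index() turns the named index into an ordinary column, so bracket
# selection by that name works afterwards. (On pandas >= 2.0 the counts
# column is named "count" rather than "payment_method".)
pm_freq = pm_freq.reset_index()
print(pm_freq["Method"])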

change 1 column and leave the rest unchanged [duplicate]

This question already has answers here:
Convert Pandas Column to DateTime
(8 answers)
Closed 2 years ago.
I have a dataset with one column that I want to change to date-time format. If I use this:
df = pd.to_datetime(df['product_first_sold_date'],unit='d',origin='1900-01-01')
df will only have this one particular column while all others are removed. Instead, I want to keep the remaining columns unchanged and just apply the to_datetime function to one column.
I tried using loc with multiple ways, including this:
df.loc[df['product_first_sold_date']] = pd.to_datetime(df['product_first_sold_date'],unit='d',origin='1900-01-01')
but it throws a key error.
How else can I achieve this?
df['product_first_sold_date'] = pd.to_datetime(df['product_first_sold_date'],unit='d',origin='1900-01-01')
should work, I think.
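A small sketch with an invented frame (column values are made up) to confirm the other columns stay in place when only one column is reassigned:
import pandas as pd

df = pd.DataFrame({
    "product_first_sold_date": [41245.0, 41701.0],  # made-up day counts
    "product": ["bike", "helmet"],
})

# Assigning back to the single column converts it in place; every other
# column of df is left unchanged.
df["product_first_sold_date"] = pd.to_datetime(
    df["product_first_sold_date"], unit="D", origin="1900-01-01"
)
print(df)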

Using more than one column for indexing in a pivot table [duplicate]

This question already has an answer here:
Multi-index pivoting in Pandas
(1 answer)
Closed 6 years ago.
I would like to combine two columns for index while pivoting a pandas dataframe. I'm using the following code to do so:
ConceptTemp = Concept.pivot(index=['memberid','testscoreid'], columns='questionid', values='correct')
this gives me the following error:
ValueError: Wrong number of items passed 1532, placement implies 2
1532 is the number of rows in my dataframe. I can't pivot only on memberid or on testscoreid as I'll have duplicate questionid values. The index column has to be a combination of testscoreid AND memberid.
Would anyone have any pointers on how to get this done?
I think you can use pivot_table:
ConceptTemp = Concept.pivot_table(index=['memberid','testscoreid'],
columns='questionid',
values='correct')
pivot_table takes an aggfunc argument; the default is aggfunc=np.mean, which aggregates duplicate entries. A better explanation with a sample is here and in the docs.
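A minimal sketch with invented scores (the real frame is much larger) showing pivot_table handling the duplicate (memberid, testscoreid, questionid) combinations that plain pivot cannot:
import pandas as pd

Concept = pd.DataFrame({
    "memberid":    [1, 1, 2, 2, 2],
    "testscoreid": [10, 10, 10, 11, 11],
    "questionid":  ["q1", "q2", "q1", "q1", "q1"],
    "correct":     [1, 0, 1, 0, 1],
})

# The last two rows repeat (memberid=2, testscoreid=11, questionid="q1"),
# so pivot_table aggregates them with the default mean -> 0.5.
ConceptTemp = Concept.pivot_table(index=["memberid", "testscoreid"],
                                  columns="questionid",
                                  values="correct")
print(ConceptTemp)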
