Modifying dataframe subset isn't changing source

I am trying to normalize subsets of data from a dataframe in python. As I understand, setting one df equal to another only references and doesn't copy - such that changing the new df also changes the old. When I try to leverage this, though, the changes I'm making aren't showing up in the original df.
temp = df.loc[(df['sample']==s) & (df['pixel']==p),['pce']]
nval=temp.iloc[0]['pce']
temp['pce']=temp['pce']/nval
I expected that modification of temp would also modify df, but this doesn't seem to be the case. The normalization is only happening in temp. What am I missing?

Assigning one DataFrame to another name only copies the reference to the same object, so that assumption is correct. Slicing is different: selecting rows with a boolean mask through .loc creates a new object that holds a copy of the selected subset, not a reference to the original. Changes to that object therefore do not reach df. If you want to modify a subset of the original DataFrame, assign the new values back to the rows they came from:
temp = df.loc[(df['sample']==s) & (df['pixel']==p),['pce']]
nval=temp.iloc[0]['pce']
temp['pce']=temp['pce']/nval
df.loc[temp.index, 'pce'] = temp['pce']

As far as I know (hopefully I'm not missing something either), this is the normal behavior.
In order to change the values in df, try using the indices of the sliced DataFrame instead:
df.loc[temp.index, 'pce'] = df.loc[temp.index, 'pce'] / nval
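A minimal sketch of the same idea without the intermediate temp, assuming a small made-up df (the column values and the choices of s and p here are hypothetical, not the asker's data). The boolean mask is built once and used on both sides of the assignment, so the write goes straight into df:

```python
import pandas as pd

# Hypothetical data standing in for the asker's df.
df = pd.DataFrame({
    "sample": ["a", "a", "b", "b"],
    "pixel": [1, 1, 1, 1],
    "pce": [2.0, 4.0, 5.0, 10.0],
})
s, p = "a", 1  # hypothetical sample and pixel to normalize

# Build the row mask once and reuse it for reading and writing.
mask = (df["sample"] == s) & (df["pixel"] == p)
nval = df.loc[mask, "pce"].iloc[0]  # first pce value in the subset

# Writing through df.loc modifies df itself, not a copy.
df.loc[mask, "pce"] = df.loc[mask, "pce"] / nval

print(df)
```

Only the rows matching the mask are normalized; the other rows keep their original pce values.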

Related

Problems with DataFrame indexing with pandas

Using pandas, I have to modify a DataFrame so that it only has the indexes that are also present in a vector, which was acquired by performing operations in one of the df's columns. Here's the specific line of code used for that (please do not mind me picking the name 'dataset' instead of 'dataframe' or 'df'):
dataset = dataset.iloc[list(set(dataset.index).intersection(set(vector.index)))]
It worked, and the image attached here shows the df and some of its indexes. However, when I try to access a specific value by index in the new 'dataset', as in the line shown below, I get the error "single positional indexer is out-of-bounds":
print(dataset.iloc[:, 21612])
Note: I've also tried the following, to make sure it isn't simply an issue with me not knowing how to use iloc:
print(dataset.iloc[21612, :])
and
print(dataset.iloc[21612])
Do I have to create another column to "mimic" the actual indexes? What am I doing wrong? Please mind that it's necessary for me to make it so the indexes are not changed at all, despite the size of the DataFrame changing. E.g. if the DataFrame originally had 21000 rows and the new one only 15000, I still need to use the number 20999 as an index if it passed the intersection check shown in the first code snippet. Thanks in advance

Try this:
print(dataset.loc[21612, :])
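iloc[] is purely positional: after some of the original rows have been removed, its first (row) argument must not be greater than len(dataset) - 1. loc[] looks rows up by index label instead, so the original label 21612 still works as long as that row survived the filter.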
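To make the difference concrete, here is a small sketch with a made-up five-row frame (the data and the labels in keep are hypothetical). After filtering, the original labels are preserved, so loc finds label 4 while iloc only accepts positions 0 and 1:

```python
import pandas as pd

# Hypothetical frame with the default index labels 0..4.
dataset = pd.DataFrame({"value": [10, 20, 30, 40, 50]})
keep = [1, 4]  # hypothetical labels that passed some intersection check

# Select by label so the original index labels are preserved.
filtered = dataset.loc[keep]

by_label = filtered.loc[4, "value"]   # label 4 still exists
by_position = filtered.iloc[1, 0]     # second remaining row, first column

# Position 4 does not exist in a two-row frame, so iloc raises IndexError.
try:
    filtered.iloc[4]
    out_of_bounds = False
except IndexError:
    out_of_bounds = True

print(by_label, by_position, out_of_bounds)
```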

Is there a way to add regular index numbers to a dataframe with dates as the index?

I am working with dataframes for a uni assignment, but do not have a lot of experience with it. One of the datasets we use automatically puts the date as the index, as you can see in the screenshot of the dataframe. I have to work with if- and for-loops, which works better with a regular index. I can't find anywhere how I can transform the date index into a regular column, and add normal index numbers. Can anyone help me with this?

Try this:
df_sleep_2.reset_index()
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html
By default reset_index returns a new DataFrame and leaves the original unchanged. You can either pass inplace=True to modify the dataframe directly, or assign the result to a new variable, e.g.:
# modify dataframe in place
df_sleep_2.reset_index(inplace=True)
# or assign result to new variable
df_sleep_2_new_index = df_sleep_2.reset_index()
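As an illustration, here is a sketch on a made-up frame with a date index (the name df_sleep, the "hours" column and the index name "date" are assumptions, since the real data is only shown in a screenshot). After reset_index the dates become an ordinary column and the rows get the usual 0, 1, 2, ... index:

```python
import pandas as pd

# Hypothetical stand-in for the asker's dataframe with dates as the index.
df_sleep = pd.DataFrame(
    {"hours": [7.5, 6.0, 8.0]},
    index=pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
)
df_sleep.index.name = "date"

# reset_index returns a new frame; the old index becomes a regular column.
df_flat = df_sleep.reset_index()

print(df_flat)
```

The original df_sleep keeps its date index, because the result was assigned to a new variable.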

Try resetting the index with reset_index, and assign the result, since the call returns a new DataFrame:
df_sleep_2 = df_sleep_2.reset_index()

Why to create new dataframe to filter? On Python pandas

I am very new to pandas and was practicing with some examples (code pasted below). I need a clarification. I read the CSV file and changed the values of 'age' and 'survived'
using a multi-condition filter on the data frame, one on each line.
When I print the original data frame after those two lines, both new values have been applied to it.
But when I tried to filter the existing data frame, I had to assign the result to a new data frame object to see the changes. Why is that? Can someone please explain the behavior?
When I tried to do any manipulation on that new data frame, it showed the warning
"A value is trying to be set on a copy of a slice from a DataFrame",
yet the change is applied!
I do not understand. Can someone please help me with the concept and the right way to do it?
Thanks in advance guys!!
import pandas as pd
tit_read = pd.read_csv('titanic.csv').head(10)
tit_read.loc[(tit_read['pclass'] > 1) & (tit_read['sex'] == 'male'), 'age'] = 50
tit_read.loc[(tit_read['age'] > 35) & (tit_read['sex'] == 'male'), 'survived'] = 2
print(tit_read)
#2nd data frame
df = tit_read.loc[(tit_read['pclass'] > 1) & (tit_read['sex'] == 'male')]
df.survived = 3
print(df)

The reason you get the "A value is trying to be set on a copy of a slice from a DataFrame" warning is the difference between operations that return views and operations that return copies.
A view, as the name suggests, is like looking at the DataFrame through a filtered window; technically, it is a subset that still shares its data with the original DataFrame. A copy, by contrast, is an entirely new DataFrame.
Because a view is still linked to the original DataFrame, changes made through it can show up in the original. This does not happen with copies, as they are completely separate objects. The warning flags an assignment to an object that was derived from another DataFrame, where it may be unclear which of the two ends up modified. In your example the boolean filter returned a copy, so the change appears in df but not in tit_read.
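A sketch of the two usual ways to make the intent explicit, using a small made-up frame in place of titanic.csv (the rows are hypothetical, and the name updated is introduced here only for illustration). Calling .copy() gives an independent frame that can be edited freely, while assigning through .loc on the frame itself changes that frame in place:

```python
import pandas as pd

# Hypothetical rows standing in for the first lines of titanic.csv.
tit_read = pd.DataFrame({
    "pclass": [1, 2, 3, 3],
    "sex": ["male", "male", "female", "male"],
    "survived": [0, 1, 1, 0],
})
mask = (tit_read["pclass"] > 1) & (tit_read["sex"] == "male")

# Option 1: an explicit, independent copy; editing it leaves tit_read alone.
df = tit_read.loc[mask].copy()
df["survived"] = 3

# Option 2: write through .loc on the frame that should actually change.
updated = tit_read.copy()  # copied only so both options can be compared
updated.loc[mask, "survived"] = 3

print(df)
print(updated)
```

Which option is right depends on whether the filtered result is meant to be a separate dataset or an edit to the original.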

Save an updated dataframe in Pandas

I am using pandas for the first time.
df.groupby(np.arange(len(df))//10).mean()
I used the code above, which works to average every group of 10 rows. I want to save this updated data frame, but df.to_csv saves the original dataframe that I imported.
I also want to multiply one column from my df (df.groupby dataframe essentially) with a number and make a new column. How do I do that?

The operation:
df.groupby(np.arange(len(df))//10).mean()
may return the averaged dataframe you want, but it won't change the original dataframe. Instead, you'll need to do:
df_new = df.groupby(np.arange(len(df))//10).mean()
You could assign it to the same name if you want. The other option is that some operations you might expect to modify the dataframe accept an inplace argument, which normally defaults to False. See this question on SO.
To create a new column that is an existing column multiplied by a number, you'd do:
df_new['new_col'] = df_new['existing_col']*a_number
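Putting the pieces together, here is a sketch on made-up data (the 20-row frame, the factor 2 and the file name in the comment are hypothetical). Note that it is df_new, not df, that has to be written out:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 20 rows, so there are two groups of 10.
df = pd.DataFrame({"existing_col": np.arange(20, dtype=float)})

# Rows 0-9 get group key 0 and rows 10-19 get group key 1; average each group.
df_new = df.groupby(np.arange(len(df)) // 10).mean()

# New column derived from the averaged one.
df_new["new_col"] = df_new["existing_col"] * 2

# With no path, to_csv returns the CSV text; pass a path such as
# "averaged.csv" (hypothetical name) to write a file instead.
csv_text = df_new.to_csv()
print(csv_text)
```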

Create new dataframe from existing dataframe

I need to create a new dataframe containing specific columns from an existing df. My code runs correctly, but I get the SettingWithCopyWarning. I have researched this warning and I understand why it exists (i.e. chained assignments). I also know you can simply turn the warning off.
However, my question is: what is the correct way to create a new dataframe from specific columns of an existing dataframe without getting this warning? I don't want to simply turn the warning off, because I presume there is a better (more pythonic) way to do this than how I'm currently doing it. In other words, I want a completely new dataframe to work with, and I don't want to just copy all the columns.
The below code passes the existing dataframe to a function (removeBrokenRule). The new dataframe is created which contains only 4 columns from the existing df. I then perform certain operations on the new df and return it.
newdf = removeBrokenRule('Forgot rules', df)
def removeBrokenRule(rule, df):
    newdf = df[['Actual ticks', 'Broken Rules', 'Perfect ticks', 'Cumulative Actual']]
    newdf['Actual ticks'][newdf['Broken Rules'] == rule] = newdf['Perfect ticks']
    newdf['New Curve'] = newdf['Actual ticks'].cumsum()
    return newdf
Much appreciated.
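One common pattern, offered as a sketch and not as an answer from the original thread, is to take an explicit .copy() of the selected columns and replace the chained assignment with a single .loc assignment. The sample data below, including the extra 'Other' column, is hypothetical:

```python
import pandas as pd

def removeBrokenRule(rule, df):
    cols = ['Actual ticks', 'Broken Rules', 'Perfect ticks', 'Cumulative Actual']
    # Explicit copy: newdf is independent of df, so no warning is raised.
    newdf = df[cols].copy()
    # One .loc call instead of chained indexing.
    mask = newdf['Broken Rules'] == rule
    newdf.loc[mask, 'Actual ticks'] = newdf.loc[mask, 'Perfect ticks']
    newdf['New Curve'] = newdf['Actual ticks'].cumsum()
    return newdf

# Hypothetical input data for illustration only.
df = pd.DataFrame({
    'Actual ticks': [1, -2, 3],
    'Broken Rules': ['None', 'Forgot rules', 'None'],
    'Perfect ticks': [1, 2, 3],
    'Cumulative Actual': [1, -1, 2],
    'Other': [0, 0, 0],
})
newdf = removeBrokenRule('Forgot rules', df)
print(newdf)
```

The original df is left untouched, and only the four selected columns plus 'New Curve' appear in the result.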
