I need to create a new dataframe containing specific columns from an existing df. My code runs correctly, but I get the SettingWithCopyWarning. I have researched this warning and I understand why it exists (i.e., chained assignment). I also know you can simply turn the warning off.
However, my question is: what is the correct way to create a new dataframe using specific columns from an existing dataframe so as not to get this warning? I don't want to simply turn the warning off, because I presume there is a better (more pythonic) way to do this than how I'm currently doing it. In other words, I want a completely new dataframe to work with, and I don't want to just copy all the columns.
The code below passes the existing dataframe to a function (removeBrokenRule), which creates a new dataframe containing only four columns from the existing df. I then perform certain operations on the new df and return it.
def removeBrokenRule(rule, df):
    newdf = df[['Actual ticks', 'Broken Rules', 'Perfect ticks', 'Cumulative Actual']]
    # this chained assignment is the line that triggers SettingWithCopyWarning
    newdf['Actual ticks'][newdf['Broken Rules'] == rule] = newdf['Perfect ticks']
    newdf['New Curve'] = newdf['Actual ticks'].cumsum()
    return newdf

newdf = removeBrokenRule('Forgot rules', df)
Much appreciated.
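For reference, a minimal sketch of the pattern commonly recommended to avoid the warning (assuming the same column names as above, with df being the existing dataframe): take an explicit .copy() of the column subset, and replace the chained indexing with a single .loc call.

def removeBrokenRule(rule, df):
    # .copy() makes newdf an independent DataFrame rather than a possible view of df
    newdf = df[['Actual ticks', 'Broken Rules', 'Perfect ticks', 'Cumulative Actual']].copy()
    # one .loc call replaces the chained newdf[...][...] assignment
    mask = newdf['Broken Rules'] == rule
    newdf.loc[mask, 'Actual ticks'] = newdf.loc[mask, 'Perfect ticks']
    newdf['New Curve'] = newdf['Actual ticks'].cumsum()
    return newdf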
I am very new to pandas and was just practicing some examples (code pasted below). I need a clarification. I read the CSV file and applied a change of value on 'age' and 'survived'
based on a multi-condition filter on the data frame, one on each line.
When I print the original data frame after the two lines, both of the new values are applied to the data frame.
But when I tried to filter the existing data frame, I had to assign the result to a new data frame object to see the changes. Why is that? Can someone please explain the behavior?
When I tried to do any manipulation on that new data frame, it shows the
"A value is trying to be set on a copy of a slice from a DataFrame" warning,
yet the change is applied!
I do not understand. Can someone please help me with the concept and the right way to do it?
Thanks in advance guys!!
import pandas as pd
tit_read = pd.read_csv('titanic.csv').head(10)
tit_read.loc[(tit_read['pclass'] > 1) & (tit_read['sex'] == 'male'), 'age'] = 50
tit_read.loc[(tit_read['age'] > 35) & (tit_read['sex'] == 'male'), 'survived'] = 2
print(tit_read)
# 2nd data frame: filtering may return a copy of a slice of tit_read
df = tit_read.loc[(tit_read['pclass'] > 1) & (tit_read['sex'] == 'male')]
df.survived = 3  # this assignment is what triggers the warning
print(df)
The reason you get the "A value is trying to be set on a copy of a slice from a DataFrame" warning is the difference between operations that return views and operations that return copies.
A view, as the name suggests, is like looking at the DataFrame through a filtered window; put technically, it is a subset of the DataFrame that is still linked to the original DataFrame. A copy, by contrast, is an entirely new DataFrame.
The point to note is that, since views are still linked to the original DataFrame, whatever changes you make to a view are reflected in the original DataFrame. This does not happen with copies, as they are completely separate entities.
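A minimal sketch of the usual fix, reusing the question's filter (and assuming the same titanic.csv columns): call .copy() so the filtered result is explicitly an independent DataFrame, and assign the column with bracket syntax rather than attribute access.

import pandas as pd

tit_read = pd.read_csv('titanic.csv').head(10)

# .copy() makes df a standalone DataFrame, not a possible view of tit_read
df = tit_read.loc[(tit_read['pclass'] > 1) & (tit_read['sex'] == 'male')].copy()
df['survived'] = 3  # no warning: we are unambiguously modifying the copy
print(df)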
I have added a new column to an existing dataframe, but it's not reflected in the dataframe.
customerDf.withColumn("fullname",expr("concat(firstname,'|',lastname)"))
customerDf.show()  # still shows the old records, without the new column
We can see the result if we assign the returned dataframe to another variable:
test = customerDf.withColumn("fullname",expr("concat(firstname,'|',lastname)"))
test.show()
Is there any way to add a new column to an existing dataframe (without copying the dataframe)?
We have one option in pandas (inplace=True). Do we have any similar function in pyspark?
Short answer: no, there is no such thing in pyspark.
Spark DataFrames are immutable. This means that when you add a new column (or apply any other transformation), you're not changing the data frame but creating a new one.
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.withColumn:
Returns a new DataFrame by adding a column or replacing the existing column that has the same name.
In Python you can, however, reassign the result to the same variable:
customerDf = customerDf.withColumn("fullname",expr("concat(firstname,'|',lastname)"))
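A minimal self-contained sketch of this pattern (hypothetical data; assumes a local SparkSession):

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()

# hypothetical customer data, for illustration only
customerDf = spark.createDataFrame(
    [("Ada", "Lovelace"), ("Alan", "Turing")],
    ["firstname", "lastname"],
)

# withColumn returns a NEW DataFrame; rebind the name to keep the column
customerDf = customerDf.withColumn("fullname", expr("concat(firstname,'|',lastname)"))
customerDf.show()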
I'm a Pandas newbie, so please bear with me.
Overview: I started with a free-form text file created by a data harvesting script that remotely accessed dozens of different kinds of devices, and multiple instances of each. I used OpenRefine (a truly wonderful tool) to munge that into a CSV that was then input to dataframe df using Pandas in a JupyterLab notebook.
My first inspection of the data showed the 'Timestamp' column was not monotonic. I accessed individual data sources as follows, in this case for the 'T-meter' data source. (The technique was taken from a search result - I don't really understand it, but it worked.)
cond = df['Source']=='T-meter'
rows = df.loc[cond, :]                        # select only the T-meter rows
df_tmeter = pd.DataFrame(columns=df.columns)  # empty frame with the same columns
df_tmeter = df_tmeter.append(rows, ignore_index=True)  # note: .append is deprecated in newer pandas
then checked each as follows:
df_tmeter['Timestamp'].is_monotonic
Fortunately, the problem was easy to identify and fix: Some sensors were resetting, then sending bad (but still monotonic) timestamps until their clocks were updated. I wrote the function healing() to cleanly patch such errors, and it worked a treat:
df_tmeter['healed'] = df_tmeter['Timestamp'].apply(healing)
Now for my questions:
How do I get the 'healed' values back into the original df['Timestamp'] column for only the 'T-meter' items in df['Source']?
Given the function healing(), is there a clean way to do this directly on df?
Thanks!
Edit: I first thought I should be using 'views' into df, but other operations on the data would either generate errors, or silently turn the views into copies.
I wrote a wrapper function heal_row() for healing():
def heal_row(row):
    if row['Source'] == 'T-meter':  # Redundant check, but safe!
        row['Timestamp'] = healing(row['Timestamp'])
    return row
then did the following:
df = df.apply(lambda row: row if row['Source'] != 'T-meter' else heal_row(row), axis=1)
This ordering is important, since healing() is stateful based on the prior row(s), and thus can't be the default operation.
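For the first question, a minimal sketch of one way to write the healed values back (this assumes df_tmeter still holds the T-meter rows in the same order as they appear in df):

# boolean mask selecting the T-meter rows of the original frame
mask = df['Source'] == 'T-meter'

# .values drops df_tmeter's index so the assignment is purely positional
df.loc[mask, 'Timestamp'] = df_tmeter['healed'].values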
I am using pandas for the first time.
df.groupby(np.arange(len(df))//10).mean()
I used the code above, which works, to take the average of every 10 rows. I want to save this updated data frame, but df.to_csv is saving the original dataframe that I imported.
I also want to multiply one column of my df (the df.groupby dataframe, essentially) by a number and make a new column. How do I do that?
The operation:
df.groupby(np.arange(len(df))//10).mean()
might return the averages dataframe as you want it, but it won't change the original dataframe. Instead you'll need to do:
df_new = df.groupby(np.arange(len(df))//10).mean()
You could assign it the same name if you want. The other option: some operations that you might expect to modify the dataframe accept an inplace argument, which normally defaults to False. See this question on SO.
To create a new column that is an existing column multiplied by a number, you'd do:
df_new['new_col'] = df_new['existing_col']*a_number
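Putting the pieces together, a minimal sketch (hypothetical file and column names; 2 stands in for whatever number you need):

import numpy as np
import pandas as pd

df = pd.read_csv('input.csv')  # hypothetical input file

# average every block of 10 rows, binding the result to a new name
df_new = df.groupby(np.arange(len(df)) // 10).mean()

df_new['new_col'] = df_new['existing_col'] * 2

df_new.to_csv('averaged.csv')  # save the new frame, not the original df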
I'm a veteran of Pandas DataFrame objects, but I'm struggling to find a clean, convenient method for altering the values in a Dask DataFrame column. For a specific example, I'm trying to multiply positive values in a numpy.float column by -1, thereby making them negative. Here is my current method (I'm trying to change the last column in the DataFrame):
cols = df.columns
df[[cols[-1]]] = df[[cols[-1]]]*-1
This seems to work only if the column has a string header; otherwise it adds a new column whose name is the index number converted to a string. Is there something akin to the Pandas method of, say, df.iloc[-1,:] = df.iloc[-1,:]*-1 that I can use with a Dask dataframe?
Edit: I'm also trying to implement: df = df.applymap(lambda x: x*-1). This, of course, applies the function to the entire dataframe, but is there a way to apply a function over just one column? Thank you.
first question
If something works for string columns and not for numeric-named columns then that is probably a bug. I recommend raising an issue at https://github.com/dask/dask/issues/new
second question
but is there a way to apply a function over just one column?
You can't directly apply a single Python function over a dask dataframe that is stored in many pieces; however, methods like .map_partitions or .reduction may help you achieve the same result with some cleverness.
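For instance, a minimal sketch of transforming just one column (hypothetical data; map_partitions applies a pandas-level function to each partition of the selected column):

import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [1.5, -2.0, 3.5, -4.0]})
df = dd.from_pandas(pdf, npartitions=2)

# apply a function to a single column, partition by partition
df['b'] = df['b'].map_partitions(lambda s: s * -1)

print(df.compute())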
In the future, we recommend asking separate questions separately on Stack Overflow.