Save an updated dataframe in Pandas - python

I am using pandas for the first time.
df.groupby(np.arange(len(df))//10).mean()
I used the code above, which works to average every block of 10 rows. I want to save this updated dataframe, but calling df.to_csv saves the original dataframe that I imported.
I also want to multiply one column from my df (df.groupby dataframe essentially) with a number and make a new column. How do I do that?

The operation:
df.groupby(np.arange(len(df))//10).mean()
might return the averaged dataframe you want, but it won't change the original dataframe. Instead, you'll need to assign the result to a name:
df_new = df.groupby(np.arange(len(df))//10).mean()
You could assign it to the same name if you want. Alternatively, some operations that you might expect to modify the dataframe accept an inplace argument, which normally defaults to False. See this question on SO.
To create a new column that is an existing column multiplied by a number, you'd do:
df_new['new_col'] = df_new['existing_col']*a_number
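Putting the steps above together, a minimal sketch might look like the following. The sample data, the value of a_number, and the output file name averaged.csv are assumptions made for illustration only.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Hypothetical sample data: 25 rows with one numeric column.
df = pd.DataFrame({"existing_col": np.arange(25, dtype=float)})

# Average each consecutive block of 10 rows. The result is a NEW dataframe;
# df itself is left unchanged.
df_new = df.groupby(np.arange(len(df)) // 10).mean()

# Derive a new column from an existing one (a_number is illustrative).
a_number = 2.5
df_new["new_col"] = df_new["existing_col"] * a_number

# Save the averaged dataframe, not the original one.
# The temp-directory path is only an example location.
out_path = Path(tempfile.gettempdir()) / "averaged.csv"
df_new.to_csv(out_path, index=False)
```

The key point is that to_csv is called on df_new, the object that holds the grouped result, rather than on df.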

Related

How do I add every column in a pandas dataframe to a list except for the first column?

Normally, I would be able to call dataframe.columns for a list of all columns, but I don't want to include the very first column in my list. Writing each column manually is an option, but one I'd like to avoid, given the few hundred column headers I'm working with. I do need to use this column, though, so deleting it from the dataframe entirely wouldn't work. How can I put every column into a list except for the first one?
This should work:
list(df.columns[1:])
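As a small illustration of that slice, here is a sketch on a made-up three-column dataframe (the column names are assumptions). The first column stays in the dataframe; it is only left out of the list.

```python
import pandas as pd

# Hypothetical dataframe; "id" plays the role of the first column.
df = pd.DataFrame({"id": [1, 2], "a": [3, 4], "b": [5, 6]})

# Every column label except the first, as a plain Python list.
cols_except_first = list(df.columns[1:])

# The list can then be used to select those columns.
subset = df[cols_except_first]
```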

Pandas Dataframe delete row by repetition

I am looking to delete a row in a dataframe that is imported into python by pandas.
If you look at the sheet below, the first column has the same name multiple times. So the condition is: if the first-column value reappears in a later row, delete that row. If not, keep that row in the dataframe.
My final output should look like the following:
Presently I am doing it by converting each column into a list and deleting entries by index. I am hoping there is an easier way than this workaround.
df.drop_duplicates([df.columns[0]])
should do the trick.
Try the following code:
df.drop_duplicates(subset='columnName', keep='first', inplace=True)
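A short sketch of what drop_duplicates does on the first column, using made-up data. Since the question could also be read as "drop a row only when it repeats the row directly above it", a shift-based variant is shown as well; that second reading is an assumption, not something the answers above state.

```python
import pandas as pd

# Hypothetical data: the first column repeats some names.
df = pd.DataFrame({
    "name": ["a", "a", "b", "a", "c", "c"],
    "value": [1, 2, 3, 4, 5, 6],
})
first_col = df.columns[0]

# Keep only the first row for each distinct value in the first column.
deduped = df.drop_duplicates(subset=first_col, keep="first")

# Alternative reading: drop a row only if it repeats the row immediately
# above it, so a later non-adjacent "a" is kept.
consecutive = df[df[first_col] != df[first_col].shift()]
```

Both calls return a new dataframe and leave df unchanged, unless inplace=True is passed to drop_duplicates.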

What is the best way to modify (e.g., perform math functions) a column in a Dask DataFrame?

I'm a veteran of Pandas DataFrame objects, but I'm struggling to find a clean, convenient method for altering the values in a Dask DataFrame column. For a specific example, I'm trying to multiply positive values in a numpy.float column by -1, thereby making them negative. Here is my current method (I'm trying to change the last column in the DataFrame):
cols = df.columns
df[[cols[-1]]] = df[[cols[-1]]]*-1
This seems to work only if the column has a string header; otherwise it adds a new column whose name is the index number as a string. Is there something akin to the Pandas method of, say, df.iloc[:,-1] = df.iloc[:,-1]*-1 that I can use with a Dask dataframe?
Edit: I'm also trying to implement: df = df.applymap(lambda x: x*-1). This, of course, applies the function to the entire dataframe, but is there a way to apply a function over just one column? Thank you.
first question
If something works for string columns and not for numeric-named columns then that is probably a bug. I recommend raising an issue at https://github.com/dask/dask/issues/new
second question
but is there a way to apply a function over just one column?
You can't directly apply a single Python function over a dask dataframe that is stored in many pieces. However, methods like .map_partitions or .reduction may help you achieve the same result with some cleverness.
In the future, we recommend asking separate questions separately on Stack Overflow.
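To make the .map_partitions suggestion concrete, here is a sketch of a per-partition function written against plain pandas. The function name negate_positive_last_column and the sample data are hypothetical; the integer column labels mirror the non-string headers mentioned in the question.

```python
import pandas as pd


def negate_positive_last_column(pdf: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose last column has its positive values multiplied by -1."""
    out = pdf.copy()
    last = out.columns[-1]
    # Keep values that are already <= 0; replace positive values by their negation.
    out[last] = out[last].where(out[last] <= 0, -out[last])
    return out


# Hypothetical pandas frame with integer column labels.
pdf = pd.DataFrame({0: [1.0, -2.0], 1: [3.0, -4.0]})
result = negate_positive_last_column(pdf)
```

Because each partition of a Dask DataFrame is a pandas dataframe, the same function could be passed to a Dask DataFrame ddf as ddf.map_partitions(negate_positive_last_column), assuming dask is installed.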

What effect does changing the datatype in a pandas dataframe or series have?

Specifically,
If I don't need to change the datatype, is it better left alone? Does it copy the whole column of a dataframe? Does it copy the whole dataframe? Or does it just alter some setting in the dataframe to treat the entries in that column as a particular type?
Also, is there a way to set the type of the columns while the dataframe is getting created?
Here is one example: "2014-05-25 12:14:01.929000" is cast as an np.datetime64 when the dataframe is created. I then save the dataframe to a csv. When I read it back from the csv, the column becomes a generic object column. How would I avoid this? Or how can I re-cast this particular column as an np.datetime64 while calling pd.read_csv?
Thanks.
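One possible approach to the CSV round trip described above, offered as a sketch rather than a definitive answer: pd.read_csv accepts parse_dates and dtype arguments, so column types can be requested at read time. The column names when and value are assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical dataframe with a datetime column, as in the question.
df = pd.DataFrame({
    "when": pd.to_datetime(["2014-05-25 12:14:01.929000"]),
    "value": [1.5],
})

# Round-trip through CSV text (an in-memory buffer stands in for a file).
csv_text = df.to_csv(index=False)

# Without hints, the timestamp column comes back as text, not datetime.
plain = pd.read_csv(io.StringIO(csv_text))

# parse_dates asks read_csv to convert the named columns to datetimes;
# dtype sets the type of other columns while the dataframe is created.
parsed = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["when"],
    dtype={"value": "float64"},
)
```

For a dataframe that already exists, pd.to_datetime(plain["when"]) would convert the column after the fact; like astype, it returns a new object rather than changing the column in place.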

Create new dataframe from existing dataframe

I need to create a new dataframe containing specific columns from an existing df. My code runs correctly but I get the SettingWithCopyWarning. I have researched this warning and I understand why it exists (i.e., chained assignments). I also know you can simply turn the warning off.
However, my question is: what is the correct way to create a new dataframe using specific columns from an existing dataframe in order to not get this warning? I don't want to simply turn the warning off, because I presume there is a better (more pythonic) way to do this than how I'm currently doing it. In other words, I want a completely new dataframe to work with, and I don't want to just copy all the columns.
The code below passes the existing dataframe to a function (removeBrokenRule). The new dataframe is created, containing only 4 columns from the existing df. I then perform certain operations on the new df and return it.
def removeBrokenRule(rule, df):
    newdf = df[['Actual ticks', 'Broken Rules', 'Perfect ticks', 'Cumulative Actual']]
    newdf['Actual ticks'][newdf['Broken Rules'] == rule] = newdf['Perfect ticks']
    newdf['New Curve'] = newdf['Actual ticks'].cumsum()
    return newdf

newdf = removeBrokenRule('Forgot rules', df)
Much appreciated.
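One common way to avoid the warning in a case like this, sketched here under assumptions: take an explicit .copy() of the selected columns and replace the chained assignment with a single .loc assignment. The function name remove_broken_rule, the sample data, and the extra "Other" column are hypothetical.

```python
import pandas as pd


def remove_broken_rule(rule, df):
    cols = ["Actual ticks", "Broken Rules", "Perfect ticks", "Cumulative Actual"]
    # .copy() makes newdf an independent dataframe rather than a possible view of df.
    newdf = df[cols].copy()
    # One .loc call replaces the chained newdf[...][...] = ... assignment.
    mask = newdf["Broken Rules"] == rule
    newdf.loc[mask, "Actual ticks"] = newdf.loc[mask, "Perfect ticks"]
    newdf["New Curve"] = newdf["Actual ticks"].cumsum()
    return newdf


# Hypothetical input data for illustration.
df = pd.DataFrame({
    "Actual ticks": [1, -2, 3],
    "Broken Rules": ["None", "Forgot rules", "None"],
    "Perfect ticks": [1, 2, 3],
    "Cumulative Actual": [1, -1, 2],
    "Other": [0, 0, 0],
})
newdf = remove_broken_rule("Forgot rules", df)
```

Only the four selected columns are copied, and the original df is not modified by the assignments inside the function.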
