How to rename the index of a Dask DataFrame - python

How would I go about renaming the index on a dask dataframe? I tried it like so
df.index.name = 'foo'
but rechecking df.index.name shows it still being whatever it was previously.

This does not seem like an efficient way to do it, so I wouldn't be surprised if there is something more direct.
Suppose d.index.name starts off as 'foo':
def f(df, name):
    df.index.name = name
    return df

d.map_partitions(f, 'pow')
The output now has index name of 'pow'. If this is done with the threaded scheduler, I think you also change the index name of d in-place (in which case you don't really need the output of map_partitions).

A bit late, but the following works:
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame().assign(si=[1, 2], o=[3, 4], p=[5, 6]).set_index("si")
ddf = dd.from_pandas(df, npartitions=2)
ddf.index = ddf.index.rename("si2")
I hope this can help someone else out!

Related

I wish to print a pandas dataframe name

Please be patient I am new to Python and Pandas.
I have a lot of pandas dataframes, but some are duplicates. So I wrote a function that checks if 2 dataframes are equal; if they are, 1 will be deleted:
def check_eq(df1, df2):
    if df1.equals(df2):
        del[df2]
        print("Deleted %s" % (df_name))
The function works, but I wish to know how to have the variable "df_name" as string with the name of the dataframe.
I don't understand: the parameters df1 and df2 are dataframe objects, so how can I get their names at run-time if I wish to print them?
Thanks in advance.
What you are trying to use is an f-string.
def check_eq(df1, df2):
    if df1.equals(df2):
        del[df2]
        print(f"Deleted {df2.name}")
I'm not certain this print call can work, though, since you delete the dataframe right before accessing its name attribute, so df2 is unbound by then.
Instead try this:
def check_eq(df1, df2):
    if df1.equals(df2):
        print(f"Deleted {df2.name}")
        del df2
Now, do note that your usage of 'del' is also not correct. I assume you want to delete the second dataframe in your code. However, you only delete it inside the scope of the check_eq function. You should familiarize yourself with the scope concept first: https://www.w3schools.com/python/python_scope.asp
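To illustrate the scope point, here is a minimal sketch (the helper name delete_inside is made up for illustration):

```python
import pandas as pd

def delete_inside(df):
    # 'del' only unbinds the local parameter name inside this function;
    # the caller's variable still references the DataFrame
    del df

frame = pd.DataFrame({"a": [1]})
delete_inside(frame)
print(frame.shape)  # (1, 1) -- frame survives the call untouched
```

So deleting a parameter inside a function never removes the caller's dataframe.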
The code I used:
import pandas as pd

d = {'col1': [1, 2], 'col2': [3, 4]}
df1 = pd.DataFrame(data=d)
df2 = pd.DataFrame(data=d)
df1.name = 'dataframe1'
df2.name = 'dataframe2'

def check_eq(df1, df2):
    if df1.equals(df2):
        print(f"Deleted {df2.name}")

Action on one pandas dataframe does the same to the one it was copied from

I was using this bit of code (re-worked for my application) when I found that the df_temp.drop(index=sample.index, inplace=True) performed the same action on df_input i.e. it emptied it!!! I was not expecting that at all.
I solved it by changing df_temp = df_input to df_temp = df_input.copy() but can someone illuminate me on what is going on here?
import seaborn as sns
import pandas as pd

df_input = sns.load_dataset('diamonds')
df = df_input.loc[[]]
df_temp = df_input  # this is where we're sampling from
n_samples = 1000

for _ in range(n_samples):
    sample = df_temp.sample(1)
    df_temp.drop(index=sample.index, inplace=True)
    df = df.append(sample)

assert (df.index.value_counts() > 1).sum() == 0
df
Pandas does not copy the whole df if you simply assign it to a new variable. After executing df_temp = df_input you end up with two variables referring to the exact same df. It's not the case that both refer to identical copies; they actually point to the same df (think: you just gave this one df two variable names). So no matter which variable (think: name) you use to alter the df, the change is visible through the other variable as well. If you use .copy() you get what you intended, namely two variables with two distinct versions of the df.
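The aliasing described above can be sketched in a few lines (the names alias and independent are made up for illustration):

```python
import pandas as pd

df_input = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

alias = df_input               # a second name for the SAME DataFrame
independent = df_input.copy()  # a genuinely new DataFrame

print(alias is df_input)  # True: one object, two names
alias.drop(index=[0], inplace=True)

print(len(df_input))     # 1 -- dropping via 'alias' shrank df_input too
print(len(independent))  # 2 -- the copy is unaffected
```

The `is` check is a quick way to tell whether two variables share one object.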

Python module export Pandas DataFrame

I'm relatively new to Python, but my understanding of Python modules is that any object defined in a module can be exported. For example, if you had:
# my_module.py
obj1 = 4
obj2 = 8
you can import both these objects simply with from my_module import obj1, obj2.
While working with Pandas, it is common to have code which looks like this (not actual working code):
# pandas_module.py
import pandas as pd
df = pd.DataFrame(...)
df = df.drop()
df = df[df.col > 0]
where the same object (df) is redefined multiple times. If I want to export df, how should I handle this? My guess is that if I simply from pandas_module import df from elsewhere, all the pandas code will run first and I will get the final df as expected, but I'm not sure if this is good practice. Maybe it is better to do something like final_df = df.copy() and export final_df instead. This seems like it would be more understandable for someone who is not that familiar with Python.
So my question is, what is the proper way to handle this situation of exporting a df which is defined multiple times?
Personally, I usually create a function that returns a Dataframe object. Such as:
# pandas_module.py
import pandas as pd
def clean_data():
    df = pd.DataFrame(...)
    df = df.drop()
    df = df[df.col > 0]
    return df
Then you can call the function from your main work flow and get the expected Dataframe:
from pandas_module import clean_data
df = clean_data()
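On the asker's guess: importing a module does execute all of its top-level code once, so only the final binding of df is visible to the importer. A dependency-free sketch of that behavior, using a throwaway toy module with plain lists standing in for DataFrames (the module name toy_module is made up):

```python
import importlib.util
import os
import tempfile
import textwrap

# The module rebinds 'df' twice; the importer only ever sees the last value.
source = textwrap.dedent("""
    df = [1, -2, 3]
    df = [x for x in df if x > 0]  # 'df' is rebound during import
""")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_module.py")
    with open(path, "w") as fh:
        fh.write(source)
    spec = importlib.util.spec_from_file_location("toy_module", path)
    toy = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(toy)  # runs the module's top-level code once
    print(toy.df)  # [1, 3]
```

So `from pandas_module import df` would indeed hand back the final df, but wrapping the steps in a function, as above, makes the intent clearer and lets callers rebuild the dataframe on demand.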

Replacing nan with blanks in Python

Below is my dataframe:
Id,ReturnCreated,ReturnTime,TS_startTime
O108808972773560,Return Not Created,nan,2018-08-23 12:30:41
O100497888936380,Return Not Created,nan,2018-08-18 14:57:20
O109648374050370,Return Not Created,nan,2018-08-16 13:50:06
O112787613729150,Return Not Created,nan,2018-08-16 13:15:26
O110938305325240,Return Not Created,nan,2018-08-22 11:03:37
O110829757146060,Return Not Created,nan,2018-08-21 16:10:37
I want to replace the nan with blanks. I tried the below code, but it's not working.
import pandas as pd
import numpy as np
df = pd.concat({k:pd.Series(v) for k, v in ordercreated.items()}).unstack().astype(str).sort_index()
df.columns = 'ReturnCreated ReturnTime TS_startTime'.split()
df1 = df.replace(np.nan,"", regex=True)
df1.to_csv('OrderCreationdetails.csv')
Kindly help me understand where I am going wrong and how I can fix it.
You should try the DataFrame.fillna() method:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html
In your case:
df1 = df.fillna("")
should work, I think.
I think the nans are strings here, because of the .astype(str) call. So you need:
df1 = df.replace('nan', "")
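A quick check of this point: the .astype(str) step is what turns a real NaN into the literal string 'nan', which fillna() then no longer sees as missing.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan]).astype(str)
print(s.tolist())  # ['1.0', 'nan'] -- the NaN became a plain string

cleaned = s.replace('nan', '')  # exact-match replacement of the string 'nan'
print(cleaned.tolist())  # ['1.0', '']
```

This is why replace('nan', "") works after astype(str) while fillna("") does not.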
Either you can use df.fillna("") (I think that will perform better) or simply replace those values with blanks:
df1 = df.replace('nan', "")

Can I set the index column when reading a CSV using Python dask?

When using Python Pandas to read a CSV it is possible to specify the index column. Is this possible using Python Dask when reading the file, as opposed to setting the index afterwards?
For example, using pandas:
df = pandas.read_csv(filename, index_col=0)
Ideally using dask could this be:
df = dask.dataframe.read_csv(filename, index_col=0)
I have tried
df = dask.dataframe.read_csv(filename).set_index(?)
but the index column does not have a name (and this seems slow).
No, these need to be two separate methods. If you try this then Dask will tell you in a nice error message.
In [1]: import dask.dataframe as dd
In [2]: df = dd.read_csv('*.csv', index='my-index')
ValueError: Keyword 'index' not supported. Use dd.read_csv(...).set_index('my-index') instead
But this won't be any slower or faster than doing it the other way.
I know I'm a bit late, but this is the first result on google so it should get answered.
If you write your dataframe with:
# index=True is the default
my_pandas_df.to_csv('path')
# so this is the same:
my_pandas_df.to_csv('path', index=True)
And import with Dask:
import dask.dataframe as dd
my_dask_df = dd.read_csv('path').set_index('Unnamed: 0')
It will use column 0 as your index (which is unnamed, thanks to pandas.DataFrame.to_csv()).
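One way to avoid the 'Unnamed: 0' column entirely is to name the index before writing. A small pandas-only sketch (the index name my_index is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'col 0': [1, 2]})
df.index.name = 'my_index'  # a named index gets its own header in the CSV
csv_text = df.to_csv()
print(csv_text.splitlines()[0])  # 'my_index,col 0' instead of an empty header
```

On the dask side you could then call set_index('my_index') rather than hunting for 'Unnamed: 0'.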
How to figure it out:
my_dask_df = dd.read_csv('path')
my_dask_df.columns
which returns
Index(['Unnamed: 0', 'col 0', 'col 1',
...
'col n'],
dtype='object', length=...)
Now you can write: df = pandas.read_csv(filename, index_col='column_name') (where column_name is the name of the column you want to set as the index).
