I often see redundant imports in code like this:
import pandas as pd
from pandas import Series as s
from pandas import DataFrame as df
I get that it's simpler to use s and df as shorthand rather than typing out pd.DataFrame every time, but wouldn't it be preferable to just assign pd.DataFrame to df rather than importing it, since you essentially already imported it with its parent?
For instance, I think it could be cleaned up like this:
import pandas as pd
s, df = pd.Series, pd.DataFrame
Are there any drawbacks to doing it this way? I'm still fairly new to Python, so I'm wondering if I'm missing something here.
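For what it's worth, both spellings bind the same class object, so the two approaches are behaviorally equivalent; a minimal sketch to confirm (df_import and df_assign are throwaway names):
import pandas as pd
from pandas import DataFrame as df_import
df_assign = pd.DataFrame  # plain attribute assignment, no second import needed
print(df_import is df_assign)  # True: both names point at the same class object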
I use the modin library for multiprocessing.
While the library is great for faster processing, it fails at merge, and I would like to revert to default pandas in between in the code.
I understand that per PEP 8 (E402) imports should be declared once, at the top of the file, but my case requires otherwise.
import pandas as pd
import modin.pandas as mpd
import os
import ray
ray.init()
os.environ["MODIN_ENGINE"] = "ray"
df = mpd.read_csv()
# do stuff
Then I would like to revert to default pandas within the same code, but how would I do the below in pandas? There does not seem to be a clean way to switch between pd and mpd in the lines below, and unfortunately Modin seems to take precedence over pandas.
df = df.loc[:, df.columns.intersection(['col1', 'col2'])]
df = df.drop_duplicates()
df = df.sort_values(['col1', 'col2'], ascending=[True, True])
Is it possible? If yes, how?
You can simply do the following:
import modin.pandas as mpd
import pandas as pd
This way you have both Modin and the original pandas in memory, and you can switch between them as needed.
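For instance, a minimal sketch (the toy column data here is made up) showing the two namespaces living side by side:
import modin.pandas as mpd
import pandas as pd
df_parallel = mpd.DataFrame({'a': [1, 2, 3]})  # Modin dataframe, processed across cores
df_single = pd.DataFrame({'a': [1, 2, 3]})     # plain pandas dataframe, single core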
Since many have posted answers: in this particular case, as pointed out by @Nin17 and this comment from the Modin GitHub, to convert from Modin to pandas for single-core processing of operations like df.merge, you can use
import pandas as pd
import modin.pandas as mpd
import os
import ray
ray.init()
os.environ["MODIN_ENGINE"] = "ray"
df_modin = mpd.read_csv() #reading dataframe into Modin for parallel processing
df_pandas = df_modin._to_pandas() #converting Modin Dataframe into pandas for single core processing
and if you would like to convert the dataframe back to a Modin dataframe for parallel processing:
df_modin = mpd.DataFrame(df_pandas)
You can try the pandarallel package instead of Modin; it is based on a similar concept: https://pypi.org/project/pandarallel/#description
Pandarallel benchmarks: https://libraries.io/pypi/pandarallel
As @Nin17 said in a comment on the question, this comment from the Modin GitHub describes how to convert a Modin dataframe to pandas. Once you have a pandas dataframe, you can call any pandas method on it. This other comment from the same issue describes how to convert the pandas dataframe back to a Modin dataframe.
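Putting the two conversions together for the merge case from the original question, a minimal sketch (the file names and join column are hypothetical):
import modin.pandas as mpd
left_modin = mpd.read_csv('left.csv')    # parallel read with Modin
right_modin = mpd.read_csv('right.csv')
left_pandas = left_modin._to_pandas()    # drop to plain pandas for the operation that fails under Modin
right_pandas = right_modin._to_pandas()
merged_pandas = left_pandas.merge(right_pandas, on='col1')
merged_modin = mpd.DataFrame(merged_pandas)  # back to Modin for further parallel processing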
I have noticed that when we set options for pandas DataFrames, such as pd.set_option('max_rows', 10), it works perfectly for DataFrame objects.
However, it has no effect on Styler objects.
Check the following code:
import pandas as pd
import numpy as np
data= np.zeros((10,20))
pd.set_option('max_rows',4)
pd.set_option('max_columns',10)
df=pd.DataFrame(data)
display(df)
display(df.style)
which results in the DataFrame display being truncated to 4 rows and 10 columns, while the Styler output renders the full table.
I do not know how to set these properties for Styler objects.
Thanks.
Styler is developing its own options. The current version of pandas, 1.3.0, does not have many yet; perhaps only styler.render.max_elements.
Some recent pull requests to the GitHub repo are adding these features, but they will be Styler's own versions.
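To illustrate the one option that does exist, a minimal sketch (assuming pandas >= 1.3, where styler.render.max_elements was introduced):
import pandas as pd
import numpy as np
pd.set_option('styler.render.max_elements', 200)  # trim the rendered Styler output beyond 200 cells
df = pd.DataFrame(np.zeros((100, 100)))
df.style  # in a notebook, the rendered table is now trimmed instead of showing all 10,000 cells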
As @attack69 mentioned, Styler has its own options under development.
However, I could mimic set_option('max_rows') and set_option('max_columns') for Styler objects.
Check the following code:
import pandas as pd
import numpy as np
data = np.zeros((10, 20))
mx_rw = 4
mx_cl = 10
pd.set_option('max_rows', mx_rw)
pd.set_option('max_columns', mx_cl)
df = pd.DataFrame(data)
display(df)
print(type(df))
# overwrite the middle row and column with ellipsis markers
df.loc[mx_rw // 2] = '...'
df[mx_cl // 2] = '...'
# rebuild the index so the ellipsis row is labelled '...'
temp = list(range(0, mx_rw // 2, 1))
temp.append('...')
temp.extend(range(mx_rw // 2 + 1, data.shape[0], 1))
df.index = temp
del temp
# and the same for the columns
temp = list(range(0, mx_cl // 2, 1))
temp.append('...')
temp.extend(range(mx_cl // 2 + 1, data.shape[1], 1))
df.columns = temp
del temp
# drop everything between the ellipsis marker and the kept tail rows/columns
df = df.drop(list(range(mx_rw // 2 + 1, data.shape[0] - mx_rw // 2, 1)), axis=0)
df = df.drop(list(range(mx_cl // 2 + 1, data.shape[1] - mx_cl // 2, 1)), axis=1)
df = df.style.format(precision=1)
display(df)
print(type(df))
after which both the DataFrame and the Styler object display the same thing.
I read my arff file into a dataframe from https://archive.ics.uci.edu/ml/machine-learning-databases/00426/ like this:
from scipy.io import arff
import pandas as pd
data = arff.loadarff('Autism-Adult-Data.arff')
df = pd.DataFrame(data[0])
df.head()
But my dataframe has b'' prefixes on all values in all columns.
How do I remove them?
When I try this, it doesn't work either:
from scipy.io import arff
import pandas as pd
data = arff.loadarff('Autism-Adult-Data.arff')
df = pd.DataFrame(data[0].str.decode('utf-8'))
df.head()
It says AttributeError: 'numpy.ndarray' object has no attribute 'str'.
As you can see, the .str.decode('utf-8') approach from "Removing b'' from string column in a pandas dataframe" didn't solve the problem.
This doesn't work either:
df.index = df.index.str.encode('utf-8')
As you can see, both the strings and the numbers are bytes objects.
I was looking at the same dataset and had a similar issue, and I found a workaround that may be helpful. Rather than using from scipy.io import arff, I used another library called liac-arff. So the code should look like
pip install liac-arff
Or whatever pip command works for your operating system or IDE, and then
import arff
import pandas as pd
data = arff.load(open('Autism-Adult-Data.arff'))  # note: load() takes a file object; loads() takes an ARFF string
arff.load returns a dictionary. To find what keys that dictionary has, you do
data.keys()
and you will find that all arff files have the following keys
['description', 'relation', 'attributes', 'data']
Where data is the actual data and attributes holds the column names and the unique values of those columns. So to get a data frame, you need to do the following:
colnames = []
for i in range(len(data['attributes'])):
    colnames.append(data['attributes'][i][0])
df = pd.DataFrame.from_dict(data['data'])
df.columns = colnames
df.head()
I went a bit overboard with building the dataframe here, but this returns a data frame with no b'' issues, and the key is using import arff.
So the GitHub for the library I used can be found here.
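For reference, the same construction can be written more compactly, a minimal sketch using the same 'attributes' and 'data' keys described above:
import arff
import pandas as pd
with open('Autism-Adult-Data.arff') as f:
    data = arff.load(f)  # liac-arff parses the file object into a dictionary
colnames = [attr[0] for attr in data['attributes']]  # each attribute's first element is the column name
df = pd.DataFrame(data['data'], columns=colnames)
df.head()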
Although Shimon shared an answer, you could also give this a try, decoding only the byte-string columns (plain .str.decode fails on the numeric ones):
df.apply(lambda x: x.str.decode('utf8') if x.dtype == object else x)
I have written some functions to read data from a custom file format and convert it into a pandas data frame. I would like to be able to access this from within the pandas namespace; i.e., after installing my package with pip, I should be able to do
import pandas as pd
pd.read_custom("/my/file")
My questions are:
Is this even possible?
How would I implement this?
P.S.: I remember that pandas support for Feather used to work this way until it officially became part of pandas.io. I can't seem to find the code for it now.
Just create your own class, which should inherit from the DataFrame class and implement the to_custom() method.
Simple example:
class MyDF(pd.DataFrame):
    def to_custom(self, filename, **kwargs):
        # put your serializer code here ...
        return self.to_csv(filename, **kwargs)
Test:
In [16]: df = pd.DataFrame(np.arange(9).reshape(3,3), columns=list('abc'))
In [17]: mdf = MyDF(df)
In [18]: type(mdf)
Out[18]: __main__.MyDF
In [19]: mdf.to_custom('d:/temp/res.csv', index=False)
Result:
In [20]: from pathlib import Path
In [21]: print(Path('d:/temp/res.csv').read_text())
a,b,c
0,1,2
3,4,5
6,7,8
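As for the read_custom half of the question: as far as I know pandas has no official hook for registering top-level read_* functions, but one option is to attach a module-level function to the pandas namespace when your package is imported. This is a sketch only; read_custom and the '|'-delimited format are hypothetical:
# mypackage/__init__.py
import pandas as pd

def read_custom(path, **kwargs):
    # parse the (assumed) '|'-delimited custom format, header row first
    with open(path) as f:
        rows = [line.rstrip('\n').split('|') for line in f]
    return pd.DataFrame(rows[1:], columns=rows[0])

pd.read_custom = read_custom  # monkey-patch onto the pandas module object
After import mypackage, pd.read_custom('/my/file') works as the question asks, at the cost of modifying a module you don't own.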
print(pd.read_excel(File,Sheet_Name,0,None,0,None,["Column_Name"],1))
Since I am new to pandas, I want to retrieve a column of an Excel sheet using pandas in the form of an array. I tried the code above, but it didn't really work.
The way to do it is:
import pandas as pd
df = pd.read_excel(File, sheet_name=Sheet_Name)
print(df['column_name'])
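And since the question asks for the column in the form of an array, a short follow-up sketch (File, Sheet_Name, and column_name are the placeholders from above):
import pandas as pd
df = pd.read_excel(File, sheet_name=Sheet_Name)
arr = df['column_name'].to_numpy()  # the column as a NumPy array
print(arr)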