I have written some functions to read data from a custom file format and convert it into a pandas data frame. I would like to be able to access this from within the pandas namespace, i.e., after installing my package with pip, I should be able to do:
import pandas as pd
pd.read_custom("/my/file")
My questions are:
Is this even possible?
How would I implement this?
P.S: I remember that pandas support for feather used to work this way until it officially became a part of pandas.io. I can't seem to find the code for it now.
Just create your own class, which should inherit from the DataFrame class and implement the to_custom() method.
Simple example:
import pandas as pd

class MyDF(pd.DataFrame):
    def to_custom(self, filename, **kwargs):
        # put your serializer code here; delegating to to_csv is just a placeholder
        return self.to_csv(filename, **kwargs)
Test:
In [16]: df = pd.DataFrame(np.arange(9).reshape(3,3), columns=list('abc'))
In [17]: mdf = MyDF(df)
In [18]: type(mdf)
Out[18]: __main__.MyDF
In [19]: mdf.to_custom('d:/temp/res.csv', index=False)
Result:
In [20]: from pathlib import Path
In [21]: print(Path('d:/temp/res.csv').read_text())
a,b,c
0,1,2
3,4,5
6,7,8
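And for the reading direction that the question actually asks about, a minimal sketch along the same lines (assuming, purely for illustration, that the custom format can be delegated to read_csv; attaching the function to the pandas module is a plain monkeypatch, not an official extension mechanism):
import pandas as pd

def read_custom(filepath, **kwargs):
    # put your parser code here; delegating to read_csv is only a placeholder
    return MyDF(pd.read_csv(filepath, **kwargs))

# optionally expose it as pd.read_custom, e.g. from your package's __init__.py
pd.read_custom = read_custom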
I use pandas and Dask all the time. I also have a number of custom classes and functions that I use for different analyses, and I am always having to edit them to account for either Dask or pandas. I consistently find myself wishing I could assign attributes to the dataset I am analyzing, which would minimize calls to Dask's compute and make it easier to manage functions as I switch between data types. Something effectively akin to:
import pandas as pd
import dask.dataframe as dd
from pydataset import data
df = data('titanic')
setattr(df, 'vals12', 1)
test = dd.from_pandas(df, npartitions = 2)
test.vals12 #would still contain the attribute vals12
df = test.compute()
df.vals12 #would still contain the attribute vals12
However, I do not know of a way to achieve this without editing the base packages (pandas / Dask). So I was wondering: is there a way to achieve the above example without creating a new class (or a static version of the packages), or is there a way to "branch" the repos in a non-public way (allowing my edits to be added while still letting me pick up future features without pain)?
In the upcoming release of Dask, you will be able to do this by using the attrs feature introduced in pandas 1.0. For now, you can pip install dask from GitHub to use this functionality.
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame({
    "a": [0, 1, 2],
    "b": [2, 3, 4],
})
df.attrs["vals12"] = 1
ddf = dd.from_pandas(df, npartitions=2)
ddf.attrs
{'vals12': 1}
I'm trying to read JSON files into dataframes.
df = pd.read_json('test.log', lines=True)
However, there are values which are int64, and pandas raises:
ValueError: Value is too big
I tried setting precise_float to True, but this didn't solve it.
It works when I do it line by line:
import json
import pandas as pd

df = pd.DataFrame()
with open('test.log') as f:
    for line in f:
        data = json.loads(line)
        df = df.append(data, ignore_index=True)
However, this is very slow: even for files of around 50k lines it takes a very long time.
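As an aside on the loop above (separate from the int64 question): appending to a DataFrame row by row copies the whole frame on every iteration, and DataFrame.append has since been removed in pandas 2.0. A common pattern is to collect the parsed lines in a list and build the frame once; a minimal sketch on the same test.log:
import json
import pandas as pd

records = []
with open('test.log') as f:
    for line in f:
        records.append(json.loads(line))  # the standard json module handles big integers

df = pd.DataFrame(records)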
Is there a way I can set the value of certain columns to use int64?
After updating pandas to a newer version (tested with 1.0.3), this workaround by artdgn can be applied to overwrite the loads() function in pandas.io.json._json, which is ultimately used when pd.read_json() is called.
Copying the workaround in case the links above stop working:
import pandas as pd
# monkeypatch using standard python json module
import json
pd.io.json._json.loads = lambda s, *a, **kw: json.loads(s)
# monkeypatch using faster simplejson module
import simplejson
pd.io.json._json.loads = lambda s, *a, **kw: simplejson.loads(s)
# normalising (unnesting) at the same time (for nested jsons)
pd.io.json._json.loads = lambda s, *a, **kw: pd.json_normalize(simplejson.loads(s))
After overwriting the loads() function with one of the three methods described by artdgn, read_json() also works with int64.
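For example, after applying one of the patches above, the call from the question should go through unchanged:
df = pd.read_json('test.log', lines=True)
df.dtypes  # the big-integer columns should now come back as int64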
This is a well-known issue.
The decoding of big numbers is still not implemented in pandas' fork of the ultrajson library. The closest implementation was not merged. In any case, you can use the workarounds provided in the other answers.
There are multiple ways to read Excel data into Python. Pandas also provides an API for reading and writing:
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
df = pd.read_excel('File.xlsx', sheet_name='Sheet1')
That works fine.
BUT: what is the way to read the tables of every sheet directly into a pandas dataframe?
The sheet in question contains a table that does not start at cell (1,1).
Moreover, the sheet might include several tables (ListObjects in VBA).
I cannot find anywhere a way to read them into pandas.
Note 1: It is not possible to modify the workbook to move all the tables to cell (1,1).
Note 2: I would like to use just pandas (if possible) and minimize the need to import other libraries. But if there is no other way, I am ready to use another library. In any case, I could not manage it with xlwings, for instance.
Here it looks like it's possible to parse the Excel file, but no solution is provided for tables, just for complete sheets.
The documentation of pandas does not seem to offer that possibility.
Thanks.
You can use xlwings, a great package for working with Excel files in Python.
This is for a single table, but it is pretty trivial to use the xlwings collections (App > books > sheets > tables) to iterate over all tables; a sketch of that loop follows the single-table example below. Tables are, of course, ListObjects.
import xlwings
import pandas
with xlwings.App() as App:
    _ = App.books.open('my.xlsx')
    rng = App.books['my.xlsx'].sheets['mysheet'].tables['mytablename'].range
    df: pandas.DataFrame = rng.expand().options(pandas.DataFrame).value
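Building on that, a rough sketch of the loop over every sheet and table (the workbook name is the same placeholder as above):
import pandas
import xlwings

tables = {}
with xlwings.App() as App:
    book = App.books.open('my.xlsx')
    for sheet in book.sheets:
        for table in sheet.tables:
            # convert each table's range straight into a DataFrame
            tables[(sheet.name, table.name)] = table.range.options(pandas.DataFrame).value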
I understand that this question has been marked solved already, but I found an article that provides a much more robust solution:
Full Post
I suppose a newer version of this library supports better visibility of the workbook structure. Here is a summary:
Load the workbook using the load_workbook function from openpyxl.
Then you can access the sheets within, each of which contains a collection of list objects (tables).
Once you have the tables, you can get the range addresses of those tables.
Finally, loop through the ranges and create a pandas DataFrame from each one.
This is a nicer solution, as it gives us the ability to loop through all the sheets and tables in a workbook; a rough sketch of those steps is given below.
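A minimal sketch of those steps (the file name is hypothetical, and it assumes ws.tables maps table names to their cell ranges, as in recent openpyxl versions):
import pandas as pd
from openpyxl import load_workbook

wb = load_workbook('my.xlsx', data_only=True)  # data_only returns cached values instead of formulas

frames = {}
for ws in wb.worksheets:
    # ws.tables maps table names to range strings such as "B3:E10"
    for name, ref in ws.tables.items():
        rows = [[cell.value for cell in row] for row in ws[ref]]
        # the first row of a ListObject is its header row
        frames[(ws.title, name)] = pd.DataFrame(rows[1:], columns=rows[0])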
Here is a way to parse one table; however, it requires you to know some information about the sheet being parsed.
df = pd.read_excel("file.xlsx", usecols="B:I", index_col=3)
print(df)
Not elegant, and it works only if a single table is present in the sheet, but it is a first step:
import pandas as pd
import string
letter = list(string.ascii_uppercase)
df1 = pd.read_excel("file.xlsx")
def get_start_column(df):
    # letter of the first column that actually contains data
    for i, column in enumerate(df.columns):
        if df[column].first_valid_index() is not None:
            return letter[i]

def get_last_column(df):
    # heuristic: infer the last data column from the total column count
    # and the offset of the first data column
    columns = df.columns
    len_column = len(columns)
    for i, column in enumerate(columns):
        if df[column].first_valid_index() is not None:
            return letter[len_column - i]

def get_first_row(df):
    # row to use as the table's header row
    for index, row in df.iterrows():
        if not row.isnull().values.all():
            return index + 1

def usecols(df):
    start = get_start_column(df)
    end = get_last_column(df)
    return f"{start}:{end}"
df = pd.read_excel("file.xlsx", usecols=usecols(df1), header=get_first_row(df1))
print(df)
I often see redundant imports in code like this:
import pandas as pd
from pandas import Series as s
from pandas import DataFrame as df
I get that it's simpler to use s and df as shorthand rather than typing out pd.DataFrame every time, but would it not be preferable to just assign pd.DataFrame to df rather than importing it (since you essentially already imported it with its parent)?
For instance, I think it could be cleaned up like this:
import pandas as pd
s, df = pd.Series, pd.DataFrame
Are there any drawbacks to doing it this way? I'm still fairly new to Python, so I'm wondering if I'm missing something here.
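For what it's worth, a quick check shows both spellings bind the very same class object, so the difference is purely stylistic:
import pandas as pd
from pandas import DataFrame as df

print(df is pd.DataFrame)  # True: both names refer to the same class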
In the lab that I work in, we process a lot of data produced by a 96 well plate reader. I'm trying to write a script that will perform a few calculations and output a bar graph using matplotlib.
The problem is that the plate reader outputs data into a .xlsx file. I understand that some modules like pandas have a read_excel function; can you explain how I should go about reading the Excel file and putting it into a dataframe?
Thanks
Data sample of a 24 well plate (for simplicity):
0.0868 0.0910 0.0912 0.0929 0.1082 0.1350
0.0466 0.0499 0.0367 0.0445 0.0480 0.0615
0.6998 0.8476 0.9605 0.0429 1.1092 0.0644
0.0970 0.0931 0.1090 0.1002 0.1265 0.1455
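For reference, a minimal sketch of reading a raw plate export like the sample above and drawing a bar graph (the file name is hypothetical, and header=None assumes the export has no label row):
import pandas as pd
import matplotlib.pyplot as plt

plate = pd.read_excel('plate_readings.xlsx', header=None)  # hypothetical file name

# e.g. bar chart of the mean reading in each column of wells
plate.mean().plot.bar()
plt.ylabel('absorbance')
plt.show()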
I'm not exactly sure what you mean by array, but if you mean a matrix, you might be looking for:
import pandas as pd
df = pd.read_excel([path here])
df.to_numpy()  # formerly df.as_matrix(), which was removed in pandas 1.0
This returns a numpy.ndarray.
This task is super easy in Pandas these days.
import pandas as pd
df = pd.read_excel('file_name_here.xlsx', sheet_name='Sheet1')
or
df = pd.read_csv('file_name_here.csv')
This returns a pandas.DataFrame object, which is very powerful for performing operations by column, by row, over the entire df, or over individual items with iterrows(), not to mention slicing in different ways.
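For instance, a couple of the operations mentioned (column selection, slicing, and iterrows), sketched on the df read above:
first_col = df[df.columns[0]]   # select a column by name
top_rows = df.iloc[:2]          # slice the first two rows

for idx, row in df.iterrows():  # iterate over rows as (index, Series) pairs
    print(idx, row.values)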
There is the awesome xlrd package, with a quick-start example here.
You can just google it to find code snippets. I have never used pandas' read_excel function, but xlrd covers all my needs and can offer even more, I believe.
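For reference, a minimal xlrd sketch (file name hypothetical; note that xlrd 2.0+ reads only legacy .xls files, so for .xlsx you need an older xlrd or openpyxl):
import xlrd

book = xlrd.open_workbook('readings.xls')  # hypothetical file name
sheet = book.sheet_by_index(0)

# read every row into a list of lists
rows = [sheet.row_values(r) for r in range(sheet.nrows)]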
You could also try it with my wrapper library, which uses xlrd as well:
import pyexcel as pe # pip install pyexcel
import pyexcel.ext.xls # pip install pyexcel-xls
your_matrix = pe.get_array(file_name=path_here) # done