Converting Pandas DataFrame to Orange Table

Converting Pandas DataFrame to Orange Table - python

I notice that this is an issue on GitHub already. Does anyone have any code that converts a Pandas DataFrame to an Orange Table?
Explicitly, I have the following table.
user hotel star_rating user home_continent gender
0 1 39 4.0 1 2 female
1 1 44 3.0 1 2 female
2 2 63 4.5 2 3 female
3 2 2 2.0 2 3 female
4 3 26 4.0 3 1 male
5 3 37 5.0 3 1 male
6 3 63 4.5 3 1 male

The documentation of Orange package didn't cover all the details. Table._init__(Domain, numpy.ndarray) works only for int and float according to lib_kernel.cpp.
They really should provide an C-level interface for pandas.DataFrames, or at least numpy.dtype("str") support.
Update: Adding table2df, df2table performance improved greatly by utilizing numpy for int and float.
Keep this piece of script in your orange python script collections, now you are equipped with pandas in your orange environment.
Usage: a_pandas_dataframe = table2df( a_orange_table ) , a_orange_table = df2table( a_pandas_dataframe )
Note: This script works only in Python 2.x, refer to #DustinTang 's answer for Python 3.x compatible script.
import pandas as pd
import numpy as np
import Orange
#### For those who are familiar with pandas
#### Correspondence:
#### value <-> Orange.data.Value
#### NaN <-> ["?", "~", "."] # Don't know, Don't care, Other
#### dtype <-> Orange.feature.Descriptor
#### category, int <-> Orange.feature.Discrete # category: > pandas 0.15
#### int, float <-> Orange.feature.Continuous # Continuous = core.FloatVariable
#### # refer to feature/__init__.py
#### str <-> Orange.feature.String
#### object <-> Orange.feature.Python
#### DataFrame.dtypes <-> Orange.data.Domain
#### DataFrame.DataFrame <-> Orange.data.Table = Orange.orange.ExampleTable
#### # You will need this if you are reading sources
def series2descriptor(d, discrete=False):
if d.dtype is np.dtype("float"):
return Orange.feature.Continuous(str(d.name))
elif d.dtype is np.dtype("int"):
return Orange.feature.Continuous(str(d.name), number_of_decimals=0)
else:
t = d.unique()
if discrete or len(t) < len(d) / 2:
t.sort()
return Orange.feature.Discrete(str(d.name), values=list(t.astype("str")))
else:
return Orange.feature.String(str(d.name))
def df2domain(df):
featurelist = [series2descriptor(df.icol(col)) for col in xrange(len(df.columns))]
return Orange.data.Domain(featurelist)
def df2table(df):
# It seems they are using native python object/lists internally for Orange.data types (?)
# And I didn't find a constructor suitable for pandas.DataFrame since it may carry
# multiple dtypes
# --> the best approximate is Orange.data.Table.__init__(domain, numpy.ndarray),
# --> but the dtype of numpy array can only be "int" and "float"
# --> * refer to src/orange/lib_kernel.cpp 3059:
# --> * if (((*vi)->varType != TValue::INTVAR) && ((*vi)->varType != TValue::FLOATVAR))
# --> Documents never mentioned >_<
# So we use numpy constructor for those int/float columns, python list constructor for other
tdomain = df2domain(df)
ttables = [series2table(df.icol(i), tdomain[i]) for i in xrange(len(df.columns))]
return Orange.data.Table(ttables)
# For performance concerns, here are my results
# dtndarray = np.random.rand(100000, 100)
# dtlist = list(dtndarray)
# tdomain = Orange.data.Domain([Orange.feature.Continuous("var" + str(i)) for i in xrange(100)])
# tinsts = [Orange.data.Instance(tdomain, list(dtlist[i]) )for i in xrange(len(dtlist))]
# t = Orange.data.Table(tdomain, tinsts)
#
# timeit list(dtndarray) # 45.6ms
# timeit [Orange.data.Instance(tdomain, list(dtlist[i])) for i in xrange(len(dtlist))] # 3.28s
# timeit Orange.data.Table(tdomain, tinsts) # 280ms
# timeit Orange.data.Table(tdomain, dtndarray) # 380ms
#
# As illustrated above, utilizing constructor with ndarray can greatly improve performance
# So one may conceive better converter based on these results
def series2table(series, variable):
if series.dtype is np.dtype("int") or series.dtype is np.dtype("float"):
# Use numpy
# Table._init__(Domain, numpy.ndarray)
return Orange.data.Table(Orange.data.Domain(variable), series.values[:, np.newaxis])
else:
# Build instance list
# Table.__init__(Domain, list_of_instances)
tdomain = Orange.data.Domain(variable)
tinsts = [Orange.data.Instance(tdomain, [i]) for i in series]
return Orange.data.Table(tdomain, tinsts)
# 5x performance
def column2df(col):
if type(col.domain[0]) is Orange.feature.Continuous:
return (col.domain[0].name, pd.Series(col.to_numpy()[0].flatten()))
else:
tmp = pd.Series(np.array(list(col)).flatten()) # type(tmp) -> np.array( dtype=list (Orange.data.Value) )
tmp = tmp.apply(lambda x: str(x[0]))
return (col.domain[0].name, tmp)
def table2df(tab):
# Orange.data.Table().to_numpy() cannot handle strings
# So we must build the array column by column,
# When it comes to strings, python list is used
series = [column2df(tab.select(i)) for i in xrange(len(tab.domain))]
series_name = [i[0] for i in series] # To keep the order of variables unchanged
series_data = dict(series)
print series_data
return pd.DataFrame(series_data, columns=series_name)

Answer below from a closed issue on github
from Orange.data.pandas_compat import table_from_frame
out_data = table_from_frame(df)
Where df is your dataFrame. So far I've only noticed a need to manually define a domain to handle dates if the data source wasn't 100% clean and to the required ISO standard.
I realize this is an old question and a lot changed from when it was first asked - but this question comes up top on google search results on the topic.

from Orange.data.pandas_compat import table_from_frame,table_to_frame
df= table_to_frame(in_data)
#here you go
out_data = table_from_frame(df)
based on answer of Creo

In order to convert pandas DataFrame to Orange Table you need to construct a domain, which specifies the column types.
For continuous variables, you only need to provide the name of the variable, but for Discrete variables, you also need to provide a list of all possible values.
The following code will construct a domain for your DataFrame and convert it to an Orange Table:
import numpy as np
from Orange.feature import Discrete, Continuous
from Orange.data import Domain, Table
domain = Domain([
Discrete('user', values=[str(v) for v in np.unique(df.user)]),
Discrete('hotel', values=[str(v) for v in np.unique(df.hotel)]),
Continuous('star_rating'),
Discrete('user', values=[str(v) for v in np.unique(df.user)]),
Discrete('home_continent', values=[str(v) for v in np.unique(df.home_continent)]),
Discrete('gender', values=['male', 'female'])], False)
table = Table(domain, [map(str, row) for row in df.as_matrix()])
The map(str, row) step is needed so Orange know that the data contains values of discrete features (and not the indices of values in the values list).

This code is revised from #TurtleIzzy for Python3.
import numpy as np
from Orange.data import Table, Domain, ContinuousVariable, DiscreteVariable
def series2descriptor(d):
if d.dtype is np.dtype("float") or d.dtype is np.dtype("int"):
return ContinuousVariable(str(d.name))
else:
t = d.unique()
t.sort()
return DiscreteVariable(str(d.name), list(t.astype("str")))
def df2domain(df):
featurelist = [series2descriptor(df.iloc[:,col]) for col in range(len(df.columns))]
return Domain(featurelist)
def df2table(df):
tdomain = df2domain(df)
ttables = [series2table(df.iloc[:,i], tdomain[i]) for i in range(len(df.columns))]
ttables = np.array(ttables).reshape((len(df.columns),-1)).transpose()
return Table(tdomain , ttables)
def series2table(series, variable):
if series.dtype is np.dtype("int") or series.dtype is np.dtype("float"):
series = series.values[:, np.newaxis]
return Table(series)
else:
series = series.astype('category').cat.codes.reshape((-1,1))
return Table(series)

Something like this?
table = Orange.data.Table(df.as_matrix())
The columns in Orange will get generic names (a1, a2...). If you want to copy the names and the types from the data frame, construct Orange.data.Domain object (http://docs.orange.biolab.si/reference/rst/Orange.data.domain.html#Orange.data.Domain.init) from the data frame and pass it as the first argument above.
See the constructors in http://docs.orange.biolab.si/reference/rst/Orange.data.table.html.

table_from_frame, which is available in Python 3, doesn't allow the definition of a class column and therefore, the generated table cannot be used directly to train a classification model. I tweaked the table_from_frame function so it'll allow the definition of a class column. Notice that the class name should be given as an additional parameter.
"""Pandas DataFrame↔Table conversion helpers"""
import numpy as np
import pandas as pd
from pandas.api.types import (
is_categorical_dtype, is_object_dtype,
is_datetime64_any_dtype, is_numeric_dtype,
)
from Orange.data import (
Table, Domain, DiscreteVariable, StringVariable, TimeVariable,
ContinuousVariable,
)
__all__ = ['table_from_frame', 'table_to_frame']
def table_from_frame(df,class_name, *, force_nominal=False):
"""
Convert pandas.DataFrame to Orange.data.Table
Parameters
----------
df : pandas.DataFrame
force_nominal : boolean
If True, interpret ALL string columns as nominal (DiscreteVariable).
Returns
-------
Table
"""
def _is_discrete(s):
return (is_categorical_dtype(s) or
is_object_dtype(s) and (force_nominal or
s.nunique() < s.size**.666))
def _is_datetime(s):
if is_datetime64_any_dtype(s):
return True
try:
if is_object_dtype(s):
pd.to_datetime(s, infer_datetime_format=True)
return True
except Exception: # pylint: disable=broad-except
pass
return False
# If df index is not a simple RangeIndex (or similar), put it into data
if not (df.index.is_integer() and (df.index.is_monotonic_increasing or
df.index.is_monotonic_decreasing)):
df = df.reset_index()
attrs, metas,calss_vars = [], [],[]
X, M = [], []
# Iter over columns
for name, s in df.items():
name = str(name)
if name == class_name:
discrete = s.astype('category').cat
calss_vars.append(DiscreteVariable(name, discrete.categories.astype(str).tolist()))
X.append(discrete.codes.replace(-1, np.nan).values)
elif _is_discrete(s):
discrete = s.astype('category').cat
attrs.append(DiscreteVariable(name, discrete.categories.astype(str).tolist()))
X.append(discrete.codes.replace(-1, np.nan).values)
elif _is_datetime(s):
tvar = TimeVariable(name)
attrs.append(tvar)
s = pd.to_datetime(s, infer_datetime_format=True)
X.append(s.astype('str').replace('NaT', np.nan).map(tvar.parse).values)
elif is_numeric_dtype(s):
attrs.append(ContinuousVariable(name))
X.append(s.values)
else:
metas.append(StringVariable(name))
M.append(s.values.astype(object))
return Table.from_numpy(Domain(attrs, calss_vars, metas),
np.column_stack(X) if X else np.empty((df.shape[0], 0)),
None,
np.column_stack(M) if M else None)

This works well
from Orange.data.pandas_compat import table_from_frame,table_to_frame
import pandas as pd
# read the input data into pandas data-frame
df= table_to_frame(in_data)
# perform all data operations / wrangling
# for example only few columns are required in output
df = df[['Col1', 'Col2']]
# Final output
out_data = table_from_frame(df)

Related

Running Scipy Linregress Across Dataframe Where Each Element is a List

I am working with a Pandas dataframe where each element contains a list of values. I would like to run a regression between the lists in the first column and the lists in each subsequent column for every row in the dataframe, and store the t-stats of each regression (currently using a numpy array to store them). I am able to do this using a nested for loop that loops through each row and column, but the performance is not optimal for the amount of data I am working with.
Here is a quick sample of what I have so far:
import numpy as np
import pandas as pd
from scipy.stats import linregress
df = pd.DataFrame(
{'a': [list(np.random.rand(11)) for i in range(100)],
'b': [list(np.random.rand(11)) for i in range(100)],
'c': [list(np.random.rand(11)) for i in range(100)],
'd': [list(np.random.rand(11)) for i in range(100)],
'e': [list(np.random.rand(11)) for i in range(100)],
'f': [list(np.random.rand(11)) for i in range(100)]
}
)
Here is what the data looks like:
a b c d e f
0 [0.279347961395256, 0.07198822780319691, 0.209... [0.4733815106836531, 0.5807425586417414, 0.068... [0.9377037591435088, 0.9698329284595916, 0.241... [0.03984770879654953, 0.650429630364027, 0.875... [0.04654151678901641, 0.1959629573862498, 0.36... [0.01328000288459652, 0.10429773699794731, 0.0...
1 [0.1739544898167934, 0.5279297754363472, 0.635... [0.6464841177367048, 0.004013634850660308, 0.2... [0.0403944630279538, 0.9163938509072009, 0.350... [0.8818108296208096, 0.2910758930807579, 0.739... [0.5263032002243185, 0.3746299115677546, 0.122... [0.5511171062367501, 0.327702669239891, 0.9147...
2 [0.49678125158054476, 0.807770957943305, 0.396... [0.6218806473477556, 0.01720135741717188, 0.15... [0.6110516368605904, 0.20848099927159314, 0.51... [0.7473669581190695, 0.5107081859246958, 0.442... [0.8231961741887535, 0.9686869510163731, 0.473... [0.34358121300094313, 0.9787339533782848, 0.72...
3 [0.7672751789941814, 0.412055981587398, 0.9951... [0.8470471648467321, 0.9967427749160083, 0.818... [0.8591072331661481, 0.6279199806511635, 0.365... [0.9456189188046846, 0.5084362869897466, 0.586... [0.2685328112579779, 0.8893788305422594, 0.235... [0.029919732007230193, 0.6377951981939682, 0.1...
4 [0.21420195955828203, 0.15178914447352077, 0.9... [0.6865307542882283, 0.0620359602798356, 0.382... [0.6469510945986712, 0.676059598071864, 0.0396... [0.2320436872397288, 0.09558341089961908, 0.98... [0.7733653233006889, 0.2405189745554751, 0.016... [0.8359561624563979, 0.24335481664355396, 0.38...
... ... ... ... ... ... ...
95 [0.42373270776373506, 0.7731750012629109, 0.90... [0.9430465078763153, 0.8506292743184455, 0.567... [0.41367168515273345, 0.9040247409476362, 0.72... [0.23016875953835192, 0.8206550830081965, 0.26... [0.954233948805146, 0.995068745046983, 0.20247... [0.26269690906898413, 0.5032835345055103, 0.26...
96 [0.36114607798432685, 0.11322299769211142, 0.0... [0.729848741496316, 0.9946930423163686, 0.2265... [0.17207915211677138, 0.3270055732644267, 0.73... [0.13211243241239223, 0.28382298905995607, 0.2... [0.03915259352564071, 0.05639914089770948, 0.0... [0.12681415759423675, 0.006417761276839351, 0....
97 [0.5020186971295065, 0.04018166955309821, 0.19... [0.9082402680300308, 0.1334790715379094, 0.991... [0.7003469664104871, 0.9444397336912727, 0.113... [0.7982221018200218, 0.9097963438776192, 0.163... [0.07834894180973451, 0.7948519146738178, 0.56... [0.5833962514812425, 0.403689767723475, 0.7792...
98 [0.16413822314461857, 0.40683312270714234, 0.4... [0.07366489230864415, 0.2706766599711766, 0.71... [0.6410967759869383, 0.5780018716586993, 0.622... [0.5466463581695835, 0.4949639043264169, 0.749... [0.40235314091318986, 0.8305539205264385, 0.35... [0.009668651763079184, 0.8071825962911674, 0.0...
99 [0.8189246990381518, 0.69175150213841, 0.82687... [0.40469941577758317, 0.49004906937461257, 0.7... [0.4940080411615112, 0.33621539942693246, 0.67... [0.8637418291877355, 0.34876318713083676, 0.09... [0.3526913672876807, 0.5177762589812651, 0.746... [0.3463129199717484, 0.9694802522161138, 0.732...
100 rows × 6 columns
My code to run the regressions and store the t-stats:
rows = len(df)
cols = len(df.columns)
tstats = np.zeros(shape=(rows,cols-1))
for i in range(0,rows):
for j in range(1,cols):
lg = linregress(df.iloc[i,0],df.iloc[i,j])
tstats[i,j-1] = lg.slope/lg.stderr
The code above works just fine and is doing exactly what I need, however as I mentioned above the performance begins to slow down when the # of rows and columns in df increases substantially.
I'm hoping someone could offer advice on how to optimize my code for better performance.
Thank you!

I am newbie to this but I do optimization your original code:
by purely use python builtin list object (there is no need to use pandas and to be honest I cannot find a better way to solve your problem in pandas than you original code :D)
by using numpy, which should be (at least they claimed) faster than python builtin list.
You can jump to see the code, its in Jupyter notebook format so you need to install Jupyter first.
Conclusion
Here is the test result:
On a (100, 100) matrix containing (30,) length random lists,
the total time difference is around 1 second.
Time elapsed to run 1 times on new method is 24.282760 seconds.
Time elapsed to run 1 times on old method is 25.954801 seconds.
Refer to
test_perf
in sample code for result.
PS: During test only one thread is used, so maybe multi-thread will help to improve performance, but that's out of my ability...
Idea
I think numpy.nditer is suitable for your request, though the result of optimization is not that significant. Here is my idea:
Generate the input array
I have altered you first part of script, I think using list comprehension along is enough to build a matrix of random lists. Refer to
get_matrix_from_builtin.
Please note I have stored the random lists in another 1-element tuple to keep the shape as ndarray generate from numpy.
As a compare, you can also construct such matrix with numpy. Refer to
get_matrix_from_numpy.
Because ndarray try to boardcast list-like object (and I don't know how to stop it), I have to wrap it into a tuple to avoid auto boardcast from numpy.array constructor. If anyone have a better solution please note it, thanks :)
Calculate the result
I altered you original code using pandas.DataFrame to access element by row/col index, but it is not that way.
Pandas provides some iteration tool for DataFrame: pipe, apply, agg, and appymap, search API for more info, but it seems not suitable for your request here, as you want to obtain the current index of row and col during iteration.
I searched and found numpy.nditer can provide that needs: it return a iterator of ndarray, which have an attribution multi_index that provide the row/col pair of current element. see iterating-over-arrays
Explain on solve.ipynb
I use Jupyter Notebook to test this, you might need got one, here is the instruction of install.
I have altered your original code, which remove the request of pandas and purely used builtin list. Refer to
old_calc_tstat
in the sample code.
Also, I used numpy.nditer to calc your tstats matrix, Refer to
new_calc_tstat
in the sample code.
Then, I tested if the result of both methods are equal, I used same input array to ensure random won't affect the test. Refer to
test_equal
for result.
Finally, do the time performance. I am not patient so I only run it for one time, you may add the repeats count of test in the
test_perf function.
The code
# To add a new cell, type '# %%'
# To add a new markdown cell, type '# %% [markdown]'
# %% [markdown]
# [origin question](https://stackoverflow.com/questions/69228572/running-scipy-linregress-across-dataframe-where-each-element-is-a-list)
#
# %%
import sys
import time
import numpy as np
from scipy.stats import linregress
# %%
def get_matrix_from_builtin():
# use builtin list to construct matrix of random list
# note I put random list inside a tuple to keep it same shape
# as I later use numpy to do the same thing.
return [
[(list(np.random.rand(11)),)
for col in range(6)]
for row in range(100)
]
# %timeit get_matrix_from_builtin()
# %%
def get_matrix_from_numpy(
gen=np.random.rand,
shape=(1, 1),
nest_shape=(1, ),
):
# custom dtype for random lists
mydtype = [
('randonlist', 'f', nest_shape)
]
a = np.empty(shape, dtype=mydtype)
# [DOC] moditfying array values
# https://numpy.org/doc/stable/reference/arrays.nditer.html#modifying-array-values
# enable per operation flags 'readwrite' to modify element in ndarray
# enable global flag 'refs_ok' to allow use callable function 'gen' in iteration
with np.nditer(a, op_flags=['readwrite'], flags=['refs_ok']) as it:
for x in it:
# pack list in a 1-d turple to prevent numpy boardcast it
x[...] = (gen(nest_shape[0]), )
return a
def test_get_matrix_from_numpy():
gen = np.random.rand # generator of random list
shape = (6, 100) # shape of matrix to hold random lists
nest_shape = (11, ) # shape of random lists
return get_matrix_from_numpy(gen, shape, nest_shape)
# access a random list by a[row][col][0]
# %timeit test_get_matrix_from_numpy()
# %%
def test_get_matrix_from_numpy():
gen = np.random.rand
shape = (6, 100)
nest_shape = (11, )
return get_matrix_from_numpy(gen, shape, nest_shape)
# %%
def old_calc_tstat(a=None):
if a is None:
a = get_matrix_from_builtin()
a = np.array(a)
rows, cols = a.shape[:2]
tstats = np.zeros(shape=(rows, cols))
for i in range(0, rows):
for j in range(1, cols):
lg = linregress(a[i][0][0], a[i][j][0])
tstats[i, j-1] = lg.slope/lg.stderr
return tstats
# %%
def new_calc_tstat(a=None):
# read input metrix of random lists
if a is None:
gen = np.random.rand
shape = (6, 100)
nest_shape = (11, )
a = get_matrix_from_numpy(gen, shape, nest_shape)
# construct ndarray for t-stat result
tstats = np.empty(a.shape)
# enable global flags 'multi_index' to retrive index of current element
# [DOC] Tracking an Index or Multi-Index
# https://numpy.org/doc/stable/reference/arrays.nditer.html#tracking-an-index-or-multi-index
it = np.nditer(tstats, op_flags=['readwrite'], flags=['multi_index'])
# obtain total columns count of tstats's shape
col = tstats.shape[1]
for x in it:
i, j = it.multi_index
# trick to avoid IndexError: substract len(list) after +1 to index
j = j + 1 - col
lg = linregress(
a[i][0][0],
a[i][j][0]
)
# note: nditer ignore ZeroDivisionError by default, and return np.inf to the element
# you have to override it manually:
if lg.stderr == 0:
x[...] = 0
else:
x[...] = lg.slope / lg.stderr
return tstats
# new_calc_tstat()
# %%
def test_equal():
"""Test if the new method has equal output to old one"""
# use same input list to avoid affect of rand
a = test_get_matrix_from_numpy()
old = old_calc_tstat(a)
new = new_calc_tstat(a)
print(
"Is the shape of old and new same ?\n%s. old: %s, new: %s\n" % (
old.shape == new.shape, old.shape, new.shape),
)
res = (old == new)
print(
"Is the result object same?"
)
if res.all() == True:
print("True.")
else:
print("False. Difference(new - old) as below:\n")
print(new - old)
return old, new
old, new = test_equal()
# %%
# the only diff is the last element
# in old method it is 0
# in new method it is inf
# if you perfer the old method, just add condition in new method to override
# [new[x][99] for x in range(6)]
# %%
# python version: 3.8.8
timer = time.clock if sys.platform[:3] == 'win' else time.time
def total(func, *args, _reps=1, **kwargs):
start = timer()
for i in range(_reps):
ret = func(*args, **kwargs)
elapsed = timer() - start
return elapsed
def test_perf():
"""Test of performance"""
# first, get a larger input array
gen = np.random.rand
shape = (1000, 100)
nest_shape = (30, )
a = get_matrix_from_numpy(gen, shape, nest_shape)
# repeat how many time for each test
reps = 1
# then, time both old and new calculation method
old = total(old_calc_tstat, a, _reps=reps)
new = total(new_calc_tstat, a, _reps=reps)
msg = "Time elapsed to run %d times on %s is %f seconds."
print(msg % (reps, 'new method', new))
print(msg % (reps, 'old method', old))
test_perf()

Create a Python Function in Excel

I have a python function that I want to call in Excel. However I'm getting #VALUE errors, I think it's to do with passing an excel range to a python list.
Python function requires 2 inputs with 4 optional, the first input is a string with the second being a list of list of strings [rows[columns]], a score is then produced for the string against each row in the list. Finally outputting a dataframe:
import pythoncom
import win32com.client
import pandas as pd
from sklearn import feature_extraction, metrics
from typing import List, Any
class PythonObjectLibrary:
_reg_clsid_ = pythoncom.CreateGuid()
_reg_clsctx_ = pythoncom.CLSCTX_LOCAL_SERVER
_reg_progid_ = "Python.ObjectLibrary"
_reg_desc_ = "This is our Python object library."
_public_methods_ = ['nlp_vlookup']
def nlp_vlookup(value: str,
table: List[List[Any]],
col_index: int = None,
include_score: bool = False,
all_matches: bool = False,
threshold: float = 0.5) -> pd.DataFrame:
words = [str(x[0]) for x in table]
vectorizer = feature_extraction.text.CountVectorizer()
vectors = vectorizer.fit_transform([value]+words).toarray()
cosine_sim = metrics.pairwise.cosine_similarity(list(vectors))
scores = cosine_sim[0][1:]
scores_df = pd.DataFrame({"score": scores}, index=words)
table_df = pd.DataFrame(table, index=words)
df = table_df.join(scores_df)
df = df[df["score"] >= threshold]
if not len(df.index):
raise ValueError("No matches found")
df = df.sort_values(by="score", ascending=False)
if not all_matches:
df = df.head(1)
columns = table_df.columns.to_list() if col_index is None else [col_index-1]
if include_score:
columns = ["score"] + columns
df = df.reindex(columns=columns)
return df
if __name__ == '__main__':
import win32com.server.register
win32com.server.register.UseCommandLine(PythonObjectLibrary)
# test function:
# search_list = [['Capital bank'], ['The Little Bank'], ['The Big Bank']]
# match = PythonObjectLibrary.nlp_vlookup("The Capital Bank", search_list, include_score=True, all_matches=True, threshold=0.5)
# print(match)
Excel Function to map the python function:
Function nlp_vlookup(value As String, table As Range, col_index As Integer, include_score As Boolean, all_matches As Boolean, threshold As Double)
nlp_vlookup = VBA.CreateObject("PythonObjectLibrary").nlp_vlookup(value, table, col_index, include_score, all_matches, threshold)
End Function

EDITED after #norie's comment, as it seems the zero-based array thing is a red herring.
I used this test Object (note the hard-coded UUID). It has a simple doubleArray() method that receives a Variant Array of numbers from Excel, doubles them, and returns an array. I have created a dummy DataFrame to test the extraction to a list. Apologies if the Python is inelegant ... I am only a beginner!
import pythoncom as pc
import win32com.client as cl
import numpy as np
import pandas as pd
from win32com.client import Dispatch
class PythonComTestObject:
_reg_clsid_ = '{BB58C07E-B9AD-4BC7-BB8C-01D2FF8FD4E9}' #replace this
_reg_clsctx_ = pc.CLSCTX_LOCAL_SERVER
_reg_progid_ = "PythonComTestObject.Library"
_reg_desc_ = "A library for doubling things."
# a list of strings that indicate the public methods for the object. If they aren't listed they are considered private.
_public_methods_ = ['doubleArray']
# double every value in array and return
def doubleArray(self, vArray):
#the VARIANT array comes gets converted to an [[]] array
nRows = len(vArray)
nCols = len(vArray[0])
headers = [ 'Col{0:}'.format(n) for n in range(1,nCols+1)]
retDF = pd.DataFrame(np.array(vArray) * 2,columns=headers)
return [retDF.columns.values.tolist()] + retDF.values.tolist()
if __name__ == '__main__':
import win32com.server.register
win32com.server.register.UseCommandLine(PythonComTestObject)
Then this is my VBA in Excel. I realize it is a bit more than the original. The conversion from a zero-based to a 1-based array could be put into a utility function. Note I've added a global object for my Python library, so that it gets created once and re-used. This makes subsequent function calls more rapid as you don't have to do the hard work of creating the object again.
Option Explicit
Dim g_PythonObj As Object
Public Function pythonDouble(rng As Variant) As Variant
On Error GoTo comerror
'Initialize the global object
If g_PythonObj Is Nothing Then
Set g_PythonObj = CreateObject("PythonComTestObject.Library")
End If
'Convert the rng parameter to a variant array
Dim vIn As Variant
vIn = rng
'Call the Python function
pythonDouble = g_PythonObj.doubleArray(vIn)
Exit Function
'Catch any errors here
comerror:
Debug.Print Err.Description
pythonDouble = CVErr(xlErrValue)
End Function
'A test subroutine for debugging
Sub testPython()
Dim rng As Range
Set rng = Range("Input") 'A range on my test sheet
Dim vIn As Variant
vIn = rng
Dim vRes As Variant
vRes = pythonDouble(vIn)
End Sub
This is the spreadsheet result:

Equivalent of pd.Series.str.slice() and pd.Series.apply() in cuDF

I am wanting to convert the following code (which runs in pandas) to code that runs in cuDF.
Sample data from .head() of Series being manipulated is plugged into OG code in the 3rd code cell down -- should be able to copy/paste run.
Original code in pandas
# both are float columns now
# rawcensustractandblock
s_rawcensustractandblock = df_train['rawcensustractandblock'].apply(lambda x: str(x))
# adjust/set new tract number
df_train['census_tractnumber'] = s_rawcensustractandblock.str.slice(4,11)
# adjust block number
df_train['block_number'] = s_rawcensustractandblock.str.slice(start=11)
df_train['block_number'] = df_train['block_number'].apply(lambda x: x[:4]+'.'+x[4:]+'0' )
df_train['block_number'] = df_train['block_number'].apply(lambda x: int(round(float(x),0)) )
df_train['block_number'] = df_train['block_number'].apply(lambda x: str(x).ljust(4,'0') )
Data being manipulated
# series of values from df_train.['rawcensustractandblock'].head()
data = pd.Series([60371066.461001, 60590524.222024, 60374638.00300401,
60372963.002002, 60590423.381006])
Code adjusted to start with this sample data
Here's how the code looks when using the above provided data instead of the entire dataframe.
Based on errors encountered when trying to convert, this issue is at the Series level, so the converting the cell below to execute in cuDF should solve the problem.
import pandas as pd
# series of values from df_train.['rawcensustractandblock'].head()
data = pd.Series([60371066.461001, 60590524.222024, 60374638.00300401,
60372963.002002, 60590423.381006])
# how the first line looks using the series
s_rawcensustractandblock = data.apply(lambda x: str(x))
# adjust/set new tract number
census_tractnumber = s_rawcensustractandblock.str.slice(4,11)
# adjust block number
block_number = s_rawcensustractandblock.str.slice(start=11)
block_number = block_number.apply(lambda x: x[:4]+'.'+x[4:]+'0' )
block_number = block_number.apply(lambda x: int(round(float(x),0)) )
block_number = block_number.apply(lambda x: str(x).ljust(4,'0') )
Expected changes (output)
df_train['census_tractnumber'].head()
# out
0 1066.46
1 0524.22
2 4638.00
3 2963.00
4 0423.38
Name: census_tractnumber, dtype: object
df_train['block_number'].head()
0 1001
1 2024
2 3004
3 2002
4 1006
Name: block_number, dtype: object

You can use cuDF string methods (via nvStrings) for almost everything you're trying to do. You will lose some precision converting these floats to strings in cuDF (though it may not matter in your example above), so for this example I've simply converted beforehand. If possible, I would recommend initially creating the rawcensustractandblock as a string column rather than a float column.
import cudf
import pandas as pd

gdata = cudf.from_pandas(pd_data.astype('str'))

tractnumber = gdata.str.slice(4,11)
blocknumber = gdata.str.slice(11)
blocknumber = blocknumber.str.slice(0,4).str.cat(blocknumber.str.slice(4), '.')
blocknumber = blocknumber.astype('float').round(0).astype('int')
blocknumber = blocknumber.astype('str').str.ljust(4, '0')

tractnumber
0 1066.46
1 0524.22
2 4638.00
3 2963.00
4 0423.38
dtype: object
blocknumber
0 1001
1 2024
2 3004
3 2002
4 1006
dtype: object

for loop solution
pandas (original code)
import pandas as pd
# data from df_train.rawcensustractandblock.head()
pd_data = pd.Series([60371066.461001, 60590524.222024, 60374638.00300401,
60372963.002002, 60590423.381006])
# using series instead of dataframe
pd_raw_block = pd_data.apply(lambda x: str(x))
# adjust/set new tract number
pd_tractnumber = pd_raw_block.str.slice(4,11)
# set/adjust block number
pd_block_number = pd_raw_block.str.slice(11)
pd_block_number = pd_block_number.apply(lambda x: x[:4]+'.'+x[4:]+'0')
pd_block_number = pd_block_number.apply(lambda x: int(round(float(x),0)))
pd_block_number = pd_block_number.apply(lambda x: str(x).ljust(4,'0'))
# print(list(pd_tractnumber))
# print(list(pd_block_number))
cuDF (solution code)
import cudf
# data from df_train.rawcensustractandblock.head()
cudf_data = cudf.Series([60371066.461001, 60590524.222024, 60374638.00300401,
60372963.002002, 60590423.381006])
# using series instead of dataframe
cudf_tractnumber = cudf_data.values_to_string()
# adjust/set new tract number
for i in range(len(cudf_tractnumber)):
funct = slice(4,11)
cudf_tractnumber[i] = cudf_tractnumber[i][funct]
# using series instead of dataframe
cudf_block_number = cudf_data.values_to_string()
# set/adjust block number
for i in range(len(cudf_block_number)):
funct = slice(11, None)
cudf_block_number[i] = cudf_block_number[i][funct]
cudf_block_number[i] = cudf_block_number[i][:4]+'.'+cudf_block_number[i][4:]+'0'
cudf_block_number[i] = int(round(float(cudf_block_number[i]), 0))
cudf_block_number[i] = str(cudf_block_number[i]).ljust(4,'0')
# print(cudf_tractnumber)
# print(cudf_block_number)

How to make my modified pandas/numpy .where function adaptable to different sizes of a list parameter?

I want to create my own function that scans a number of user-specified columns in a dataframe, and that function will create a new variable and assign it as '1' if all the specified columns == 1, otherwise 0.
In the following codes, I am accommodating if users are inputting exactly two columns to be scanned over.
import numpy as np
class Tagger:
def __init__(self):
pass
def summing_all_tagger(self, df, tag_var_list, tag_value=1):
# This tagger creates a tag='1' if all variables in tag_var_list equals to tag_value; otherwise='0'
self.df = df
self.tag_var_list = tag_var_list
self.tag_value = tag_value
self.df['temp'] = np.where((self.df[self.tag_var_list[0]]==self.tag_value) &
(self.df[self.tag_var_list[1]]==self.tag_value), 1, 0)
return self.df_pin['temp']
Then I can call it in the main.py file:
import pandas as pd
import datetime
import feature_tagger.feature_tagger as ft
tagger_obj = ft.Tagger()
df_pin['PIN_RX&TIME_TAG'] = tagger_obj.summing_all_tagger(df_pin, tag_var_list=['PIN_RX_TAG', 'PIN_TIME_TAG'], tag_value=1)
How can I modify it so users can enter as many column names for tag_var_list as they want?
Such as
df_pin['PIN_RX&TIME_TAG'] = tagger_obj.summing_all_tagger(df_pin, tag_var_list=['PIN_RX_TAG', 'PIN_TIME_TAG', 'PIN_NAME_TAG'], tag_value=1)
# or
df_pin['PIN_RX&TIME_TAG'] = tagger_obj.summing_all_tagger(df_pin, tag_var_list=['PIN_RX_TAG'], tag_value=1)

The np.all() is your friend.
self.df['temp'] = np.where(np.all(self.df[self.tag_var_list] == self.tag_value, axis=1), 1, 0)

I think you can create list comprehension for list of boolean masks and then reduce of masks to one with casting to integer for 0/1 column:
L = [self.df[x]==self.tag_value for x in tag_var_list]
self.df['temp'] = np.logical_and.reduce(L).astype(int)
Or DataFrame.all with casting boolean mask to integers:
self.df['temp'] = (self.df[self.tag_var_list] == self.tag_value).all(axis=1).astype(int)

Python Pandas Merge Causing Memory Overflow

I'm new to Pandas and am trying to merge a few subsets of data. I'm giving a specific case where this happens, but the question is general: How/why is it happening and how can I work around it?
The data I load is around 85 Megs or so but I often watch my python session run up close to 10 gigs of memory usage then give a memory error.
I have no idea why this happens, but it's killing me as I can't even get started looking at the data the way I want to.
Here's what I've done:
Importing the Main data
import requests, zipfile, StringIO
import numpy as np
import pandas as pd
STAR2013url="http://www3.cde.ca.gov/starresearchfiles/2013/p3/ca2013_all_csv_v3.zip"
STAR2013fileName = 'ca2013_all_csv_v3.txt'
r = requests.get(STAR2013url)
z = zipfile.ZipFile(StringIO.StringIO(r.content))
STAR2013=pd.read_csv(z.open(STAR2013fileName))
Importing some Cross Cross Referencing Tables
STARentityList2013url = "http://www3.cde.ca.gov/starresearchfiles/2013/p3/ca2013entities_csv.zip"
STARentityList2013fileName = "ca2013entities_csv.txt"
r = requests.get(STARentityList2013url)
z = zipfile.ZipFile(StringIO.StringIO(r.content))
STARentityList2013=pd.read_csv(z.open(STARentityList2013fileName))
STARlookUpTestID2013url = "http://www3.cde.ca.gov/starresearchfiles/2013/p3/tests.zip"
STARlookUpTestID2013fileName = "Tests.txt"
r = requests.get(STARlookUpTestID2013url)
z = zipfile.ZipFile(StringIO.StringIO(r.content))
STARlookUpTestID2013=pd.read_csv(z.open(STARlookUpTestID2013fileName))
STARlookUpSubgroupID2013url = "http://www3.cde.ca.gov/starresearchfiles/2013/p3/subgroups.zip"
STARlookUpSubgroupID2013fileName = "Subgroups.txt"
r = requests.get(STARlookUpSubgroupID2013url)
z = zipfile.ZipFile(StringIO.StringIO(r.content))
STARlookUpSubgroupID2013=pd.read_csv(z.open(STARlookUpSubgroupID2013fileName))
Renaming a Column ID to Allow for Merge
STARlookUpSubgroupID2013 = STARlookUpSubgroupID2013.rename(columns={'001':'Subgroup ID'})
STARlookUpSubgroupID2013
Successful Merge
merged = pd.merge(STAR2013,STARlookUpSubgroupID2013, on='Subgroup ID')
Try a second merge. This is where the Memory Overflow Happens
merged=pd.merge(merged, STARentityList2013, on='School Code')
I did all of this in ipython notebook, but don't think that changes anything.

Although this is an old question, I recently came across the same problem.
In my instance, duplicate keys are required in both dataframes, and I needed a method which could tell if a merge will fit into memory ahead of computation, and if not, change the computation method.
The method I came up with is as follows:
Calculate merge size:
def merge_size(left_frame, right_frame, group_by, how='inner'):
left_groups = left_frame.groupby(group_by).size()
right_groups = right_frame.groupby(group_by).size()
left_keys = set(left_groups.index)
right_keys = set(right_groups.index)
intersection = right_keys & left_keys
left_diff = left_keys - intersection
right_diff = right_keys - intersection
left_nan = len(left_frame[left_frame[group_by] != left_frame[group_by]])
right_nan = len(right_frame[right_frame[group_by] != right_frame[group_by]])
left_nan = 1 if left_nan == 0 and right_nan != 0 else left_nan
right_nan = 1 if right_nan == 0 and left_nan != 0 else right_nan
sizes = [(left_groups[group_name] * right_groups[group_name]) for group_name in intersection]
sizes += [left_nan * right_nan]
left_size = [left_groups[group_name] for group_name in left_diff]
right_size = [right_groups[group_name] for group_name in right_diff]
if how == 'inner':
return sum(sizes)
elif how == 'left':
return sum(sizes + left_size)
elif how == 'right':
return sum(sizes + right_size)
return sum(sizes + left_size + right_size)
Note:
At present with this method, the key can only be a label, not a list. Using a list for group_by currently returns a sum of merge sizes for each label in the list. This will result in a merge size far larger than the actual merge size.
If you are using a list of labels for the group_by, the final row size is:
min([merge_size(df1, df2, label, how) for label in group_by])
Check if this fits in memory
The merge_size function defined here returns the number of rows which will be created by merging two dataframes together.
By multiplying this with the count of columns from both dataframes, then multiplying by the size of np.float[32/64], you can get a rough idea of how large the resulting dataframe will be in memory. This can then be compared against psutil.virtual_memory().available to see if your system can calculate the full merge.
def mem_fit(df1, df2, key, how='inner'):
rows = merge_size(df1, df2, key, how)
cols = len(df1.columns) + (len(df2.columns) - 1)
required_memory = (rows * cols) * np.dtype(np.float64).itemsize
return required_memory <= psutil.virtual_memory().available
The merge_size method has been proposed as an extension of pandas in this issue. https://github.com/pandas-dev/pandas/issues/15068.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Converting Pandas DataFrame to Orange Table - python

from Orange.data.pandas_compat import table_from_frame,table_to_frame df= table_to_frame(in_data) #here you go out_data = table_from_frame(df) based on answer of Creo

Related

Running Scipy Linregress Across Dataframe Where Each Element is a List

Create a Python Function in Excel

Equivalent of pd.Series.str.slice() and pd.Series.apply() in cuDF

How to make my modified pandas/numpy .where function adaptable to different sizes of a list parameter?

Python Pandas Merge Causing Memory Overflow

Categories

Resources