I have two dataframes: data and rules.
>>> data            >>> rules
   vendor                rule
0  googel           0   google
1  google           1     dell
2  googly           2  macbook
I am trying to add two new columns to the data dataframe after computing the Levenshtein similarity between each vendor and rule. My dataframe should ideally end up looking like this:
>>> data
   vendor    rule  similarity
0  googel  google         0.8
So far I have been trying an apply function that would return this structure, but apply is not accepting the axis argument.
>>> for index,r in rules.iterrows():
... data[['rule','similarity']]=data['vendor'].apply(lambda row:[r[0],ratio(row[0],r[0])],axis=1)
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/home/mnnr/test/env/test-1.0/runtime/lib/python3.4/site-packages/pandas/core/series.py", line 2220, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas/src/inference.pyx", line 1088, in pandas.lib.map_infer (pandas/lib.c:62658)
File "/home/mnnr/test/env/test-1.0/runtime/lib/python3.4/site-packages/pandas/core/series.py", line 2209, in <lambda>
f = lambda x: func(x, *args, **kwds)
TypeError: <lambda>() got an unexpected keyword argument 'axis'
Could someone please help me figure out what I am doing wrong? Any change I make just creates new errors. Thank you.
You're calling the Series version of apply, for which an axis argument doesn't make sense, hence the error.
If you did:
data[['rule','similarity']]=data[['vendor']].apply(lambda row:[r[0],ratio(row[0],r[0])],axis=1)
then this makes a single-column DataFrame, for which this would work.
Or just remove the axis arg:
data[['rule','similarity']]=data['vendor'].apply(lambda row:[r[0],ratio(row[0],r[0])])
Update
Looking at what you're doing, you need to calculate the Levenshtein ratio for each rule against every vendor.
You can do this by:
data['vendor'].apply(lambda row: rules['rule'].apply(lambda x: ratio(x, row)))
This, I think, should calculate the ratio of each vendor against every rule.
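Putting it together: if you also want to keep only the best-matching rule and its score per vendor, here is a minimal sketch (assuming ratio comes from the python-Levenshtein package, and using the toy frames from the question):

import pandas as pd
from Levenshtein import ratio

data = pd.DataFrame({"vendor": ["googel", "google", "googly"]})
rules = pd.DataFrame({"rule": ["google", "dell", "macbook"]})

# ratio matrix: one row per vendor, one column per rule
scores = data["vendor"].apply(lambda v: rules["rule"].apply(lambda r: ratio(v, r)))

# for each vendor, keep the best rule and its similarity
data["rule"] = rules["rule"].iloc[scores.values.argmax(axis=1)].values
data["similarity"] = scores.max(axis=1)

This avoids the iterrows loop entirely: the nested apply produces the full vendor-by-rule matrix in one pass, and argmax/max reduce it to the two columns you want.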
Related
Is there a reason why pandas raises a ValueError when setting a DataFrame column using a list, but doesn't do the same when using a Series? The result is that superfluous Series values are silently ignored (e.g. the 7 in the example below).
>>> import pandas as pd
>>> df = pd.DataFrame([[1],[2]])
>>> df
   0
0  1
1  2
>>> df[0] = [5,6,7]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "D:\Python310\lib\site-packages\pandas\core\frame.py", line 3655, in __setitem__
self._set_item(key, value)
File "D:\Python310\lib\site-packages\pandas\core\frame.py", line 3832, in _set_item
value = self._sanitize_column(value)
File "D:\Python310\lib\site-packages\pandas\core\frame.py", line 4529, in _sanitize_column
com.require_length_match(value, self.index)
File "D:\Python310\lib\site-packages\pandas\core\common.py", line 557, in require_length_match
raise ValueError(
ValueError: Length of values (3) does not match length of index (2)
>>>
>>> df[0] = pd.Series([5,6,7])
>>> df
   0
0  5
1  6
Tested using Python 3.10.6 and pandas 1.5.3 on Windows 10.
You're right that the behaviour is different between a list and a Series, but it's expected.
If you take a look at the source code in the frame.py module, you will see that when the value is a list pandas checks its length, whereas for a Series it doesn't check the length: it aligns the Series on the DataFrame's index instead, so values whose labels are missing from that index (the 7 above) are silently dropped.
NOTE: the details of the Series alignment are here
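To see the alignment explicitly, a small sketch using the same toy frame:

import pandas as pd

df = pd.DataFrame([[1], [2]])   # index is [0, 1]
s = pd.Series([5, 6, 7])        # index is [0, 1, 2]

# assignment aligns s on df's index; label 2 has no match, so 7 is dropped
df[0] = s

# the same selection, spelled out
print(s.reindex(df.index))
# 0    5
# 1    6
# dtype: int64

A plain list carries no index, so pandas has nothing to align on and falls back to the strict length check, hence the ValueError.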
I am currently facing a problem handling and manipulating dataframes with Pandas that I don't seem to be able to solve.
To give you an idea of the dataframes I'm talking about, you'll see them reconstructed in the answer's code below.
I'm trying to replace the words found in column 'exercise' of the dataset 'data' with the words found in column 'name' of the dataset 'exercise'.
For example, the acronym 'Dl' in the exercise column of the 'data' dataset should be changed into 'Dead lifts', found in the 'name' column of the 'exercise' dataset.
I have tried many methods, but all have seemed to fail; I receive the same error every time.
Here is my code with the methods I tried:
### Method 1 ###
# Rename Name Column in 'exercise'
exercise = exercise.rename(columns={'label': 'exercise'})
# Merge Exercise Columns in 'exercise' and in 'data'
data = pd.merge(data, exercise, how = 'left', on='exercise')
### Method 2 ###
data.merge(exercise, left_on='exercise', right_on='label')
### Method 3 ###
data['exercise'] = data['exercise'].astype('category')
EXERCISELIST = exercise['name'].copy().to_list()
data['exercise'].cat.rename_categories(new_categories = EXERCISELIST, inplace = True)
### Same Error, New dataset ###
# Rename Name Column in 'area'
area = area.rename(columns={'description': 'area'})
# Merge Exercise Columns in 'exercise' and in 'data'
data = pd.merge(data, area, how = 'left', on = 'area')
This is the error I get:
Traceback (most recent call last):
File "---", line 232, in
data.to_frame().merge(exercise, left_on='exercise', right_on='label')
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/frame.py", line 8192, in merge
return merge(
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/reshape/merge.py", line 74, in merge
op = _MergeOperation(
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/reshape/merge.py", line 668, in init
) = self._get_merge_keys()
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/reshape/merge.py", line 1046, in _get_merge_keys
left_keys.append(left._get_label_or_level_values(lk))
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/generic.py", line 1683, in _get_label_or_level_values
raise KeyError(key)
KeyError: 'exercise'
Is someone able to help me with this? Thank you very much in advance.
1. merge data with area, then drop and rename the columns
2. merge the result of step 1 with exercise, then drop and rename the columns
area = pd.DataFrame({"arealabel":["AGI","BAL"],
"description":["Agility","Balance"]})
exercise = pd.DataFrame({"description":["Jump rope","Dead lifts"],
"label":["Jr","Dl"]})
data = pd.DataFrame({"exercise":["Dl","Dl"],
"area":["AGI","BAL"],
"level":[0,3]})
(data.merge(area, left_on="area", right_on="arealabel")
.drop(columns=["arealabel","area"])
.rename(columns={"description":"area"})
.merge(exercise, left_on="exercise", right_on="label")
.drop(columns=["exercise","label"])
.rename(columns={"description":"exercise"})
)
   level     area    exercise
0      0  Agility  Dead lifts
1      3  Balance  Dead lifts
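An alternative to the double merge (my suggestion, not part of the answer above) is to turn each lookup table into a Series and map it onto the columns in place:

# build code -> description lookups and replace the codes in place
data["area"] = data["area"].map(area.set_index("arealabel")["description"])
data["exercise"] = data["exercise"].map(exercise.set_index("label")["description"])

map leaves unmatched codes as NaN, which mirrors a left merge while keeping the original row order and column layout.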
I'm trying to use dask to process a dataset which does not fit into memory. It's time series data for various "IDs". After reading dask documentation, I chose to use the "parquet" file format and partitioning by "ID".
However, while reading from parquet and setting the index, I encountered a "TypeError: to union ordered Categoricals, all categories must be the same" which I did not manage to solve by myself.
This code replicates the issue I'm having:
import dask.dataframe as dd
import numpy as np
import pandas as pd
import traceback
# create ids
ids = ["AAA", "BBB", "CCC", "DDD"]
# create data
df = pd.DataFrame(index=np.random.choice(ids, 50), data=np.random.rand(50, 1), columns=["FOO"]).reset_index().rename(columns={"index": "ID"})
# serialize to parquet
f = r"C:/temp/foo.pq"
df.to_parquet(f, compression='gzip', engine='fastparquet', partition_cols=["ID"])
# read with dask
df = dd.read_parquet(f)
try:
    df = df.set_index("ID")
except Exception as ee:
    print(traceback.format_exc())
At this point I get the following error:
~\.conda\envs\env_dask_py37\lib\site-packages\pandas\core\arrays\categorical.py in check_for_ordered(self, op)
1492 if not self.ordered:
1493 raise TypeError(
-> 1494 f"Categorical is not ordered for operation {op}\n"
1495 "you can use .as_ordered() to change the "
1496 "Categorical to an ordered one\n"
TypeError: Categorical is not ordered for operation max
you can use .as_ordered() to change the Categorical to an ordered one
I then did:
# we order the categorical
df.ID = df.ID.cat.as_ordered()
df = df.set_index("ID")
And when I try to use df.compute(scheduler="processes"), I get the TypeError I mentioned before:
try:
    schd_str = 'processes'
    aa = df.compute(scheduler=schd_str)
    print(f"{schd_str}: OK")
except:
    print(f"{schd_str}: KO")
    print(traceback.format_exc())
gives:
Traceback (most recent call last):
File "<ipython-input-6-e15c4e86fee2>", line 3, in <module>
aa = df.compute(scheduler=schd_str)
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\base.py", line 166, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\base.py", line 438, in compute
return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\base.py", line 438, in <listcomp>
return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\core.py", line 103, in finalize
return _concat(results)
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\core.py", line 98, in _concat
else methods.concat(args2, uniform=True, ignore_index=ignore_index)
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\methods.py", line 383, in concat
ignore_index=ignore_index,
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\methods.py", line 431, in concat_pandas
ind = concat([df.index for df in dfs])
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\methods.py", line 383, in concat
ignore_index=ignore_index,
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\methods.py", line 400, in concat_pandas
return pd.CategoricalIndex(union_categoricals(dfs), name=dfs[0].name)
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\pandas\core\dtypes\concat.py", line 352, in union_categoricals
raise TypeError("Categorical.ordered must be the same")
TypeError: Categorical.ordered must be the same
Surprisingly enough, using df.compute(scheduler="threads"), df.compute(scheduler="synchronous"), or not setting the index at all works properly.
However, that does not seem like something I should settle for, since I'm actually trying to merge several of these datasets and thought that setting the index would result in a speed-up over not setting one. (I'm getting the same exact error when trying to merge two dataframes indexed this way.)
I tried to inspect df._meta, and it turns out my categories are "known", as they should be (see dask-categoricals).
I also read this GitHub post about something that looks similar, but somehow did not find a solution.
Thanks for your help,
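One workaround worth trying (an assumption on my part, not something confirmed in this thread): drop the categorical dtype before setting the index, since plain strings concatenate without the ordered/unordered union check:

df = dd.read_parquet(f)
# cast the partition column from categorical back to plain strings
df["ID"] = df["ID"].astype(str)
df = df.set_index("ID")

This trades the memory savings of the categorical for behaviour that is consistent across schedulers.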
This very simple piece of code,
# imports...
from lifelines import CoxPHFitter
import pandas as pd
src_file = "Pred.csv"
df = pd.read_csv(src_file, header=0, delimiter=',')
df = df.drop(columns=['score'])
cph = CoxPHFitter()
cph.fit(df, duration_col='Length', event_col='Status', show_progress=True)
produces an error:
Traceback (most recent call last):
  File "C:/Users/.../predictor.py", line 11, in <module>
    cph.fit(df, duration_col='Length', event_col='Status', show_progress=True)
  File "C:\Users\...\AppData\Local\conda\conda\envs\hrpred\lib\site-packages\lifelines\fitters\coxph_fitter.py", line 298, in fit
    self._check_values(df)
  File "C:\Users\...\AppData\Local\conda\conda\envs\hrpred\lib\site-packages\lifelines\fitters\coxph_fitter.py", line 323, in _check_values
    cols = str(list(X.columns[low_var]))
  File "C:\Users\...\AppData\Local\conda\conda\envs\hrpred\lib\site-packages\pandas\core\indexes\base.py", line 1754, in __getitem__
    result = getitem(key)
IndexError: boolean index did not match indexed array along dimension 0; dimension is 88 but corresponding boolean dimension is 76
However, when I print df itself, everything looks all right. As you can see, the traceback is entirely inside the library, and the library's own examples work fine.
Without knowing what your data look like: I had the same error, and it was resolved when I removed all but the duration, event, and coefficient columns from the pandas df I was using. That is, I had a lot of extra columns in the df that were confusing the Cox PH fitter, since you don't actually specify which coefficients you want to include as an argument to cph.fit().
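In code, that suggestion looks something like this (a sketch; 'age' and 'grade' are hypothetical covariate names standing in for whichever columns you actually want in the model):

from lifelines import CoxPHFitter
import pandas as pd

df = pd.read_csv("Pred.csv", header=0, delimiter=',')

# keep only the duration, event, and covariate columns;
# every other column is dropped so it cannot confuse the fitter
model_cols = ['Length', 'Status', 'age', 'grade']  # 'age'/'grade' are placeholders
cph = CoxPHFitter()
cph.fit(df[model_cols], duration_col='Length', event_col='Status', show_progress=True)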
I have a dataset imported via Pandas that has a column full of arrays with strings in them, i.e.:
'Entry'
0    ['test', 'test1', 'test2']
.
.
.
[n]  ['test', 'test1n', 'test2n']
What I would like to do is apply a function that ensures there are no duplicate elements in each array. My method is as follows:
def remove_duplicates(test_id_list):
    new_test_ids = []
    for tags in test_id_list:
        if tags not in new_test_ids:
            new_test_ids.append(tags)
    return new_test_ids
I want to apply this to the 'Entry' column in my DataFrame via either apply() or map() to clean up each entry. I am doing this via
training_data['Entry'].apply(remove_duplicates(training_data['Entry']))
But I am getting the error:
Traceback (most recent call last):
File "/home/main.py", line 32, in <module>
training_data['Entry'].apply(remove_duplicates(training_data['Entry']))
File "/home/~~~~/.local/lib/python2.7/site-packages/pandas/core/series.py", line 2294, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer (pandas/lib.c:66124)
TypeError: 'list' object is not callable
If anybody can help point me in the right direction, that would be wonderful! I am a bit lost at this point, and new to using Pandas for data manipulation.
If you decompose your expression a bit, you can see what's wrong.
training_data['Entry'].apply(remove_duplicates(training_data['Entry']))
is functionally equivalent to
x = remove_duplicates(training_data['Entry'])
training_data['Entry'].apply(x)
x is a list, because that's what your remove_duplicates function returns when called on the column. The apply method wants a function, as Rauch points out, so you'd want x to simply be remove_duplicates.
Setup
df
Out[1190]:
Entry
0 [test, test, test2]
1 [test, test1n, test2n]
To make your code work, you can just do:
df.Entry.apply(func=remove_duplicates)
Out[1189]:
0 [test, test2]
1 [test, test1n, test2n]
Name: Entry, dtype: object
You can actually do this without a custom function, in a one-liner:
df.Entry.apply(lambda x: list(set(x)))
Out[1193]:
0 [test, test2]
1 [test, test2n, test1n]
Name: Entry, dtype: object
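One caveat worth adding (my note, not from the answer above): set() does not preserve element order, as the second row of the last output shows. On Python 3.7+, dict.fromkeys keeps the first occurrence of each element while still dropping duplicates:

df.Entry.apply(lambda x: list(dict.fromkeys(x)))
# 0             [test, test2]
# 1    [test, test1n, test2n]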