FutureWarning:
Panel is deprecated and will be removed in a future version.
The recommended way to represent these types of 3-dimensional data are with a MultiIndex on a DataFrame, via the Panel.to_frame() method.
I am getting the above warning whenever I run this code:
difference = pd.Panel(dict(df1=df1,df2=df2))
Can anyone please tell me an alternative to using Panel in the line of code above?
Edit 1:
def report_diff(x):
    return x[0] if x[0] == x[1] else '{} ---> {}'.format(*x)

difference = pd.Panel(dict(df1=df1, df2=df2))
res = difference.apply(report_diff, axis=0)
Here df1 and df2 contain both categorical and numerical data.
I am just comparing the two dataframes to get the differences between them.
As stated in the docs, the recommended replacements for a pandas Panel are a MultiIndex on a DataFrame or the xarray library.
For your specific use case, this somewhat hacky code gets you the same result:
# Flatten both frames to 1-D arrays, compare element by element, then rebuild the frame.
a = df1.values.reshape(df1.shape[0] * df1.shape[1])
b = df2.values.reshape(df2.shape[0] * df2.shape[1])
res = np.array([v if v == b[idx] else str(v) + '--->' + str(b[idx])
                for idx, v in enumerate(a)]).reshape(df1.shape[0], df1.shape[1])
res = pd.DataFrame(res, columns=df1.columns)
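If df1 and df2 share the same shape, index, and columns, the same "old ---> new" report can also be built with plain DataFrame operations, avoiding the reshape round-trip (a minimal sketch, not the code above):

# Keep the original value where the frames agree, otherwise show "old ---> new".
# Assumes df1 and df2 have identical shape, index, and columns.
res = df1.where(df1.eq(df2), df1.astype(str) + ' ---> ' + df2.astype(str))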
I am developing a dashboard using Dash.
The user can select different parameters (six of them), and a dataframe is updated accordingly.
The idea was to do:
filtering = []
if len(filter1) > 0:
    filtering.append("df['col1'].isin(filter1)")
if len(filter2) > 0:
    filtering.append("df['col2'].isin(filter2)")
condition = ' & '.join(filtering)
df.loc[condition]
But I get a KeyError, which I understand, since condition is just a string.
Any advice on how I can do this? What is the best practice?
NB: I have a working solution with if conditions, but I would like to optimize this part and avoid copying the dataframe (>10 million rows).
dff = df.copy()
if len(filter1) > 0:
    dff = dff.loc[dff.col1.isin(filter1)]
if len(filter2) > 0:
    dff = dff.loc[dff.col2.isin(filter2)]
You can use eval:
filtering = []
if len(filter1) > 0:
    filtering.append("df['col1'].isin(filter1)")
if len(filter2) > 0:
    filtering.append("df['col2'].isin(filter2)")
condition = ' & '.join(filtering)
df.loc[eval(condition)]
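If you would rather not eval arbitrary strings, pandas' query can reference local variables with @ and achieves the same selective filtering (a sketch, assuming filter1 and filter2 are lists of allowed values):

# Build query fragments only for the active filters; @filter1/@filter2 are
# resolved by query() from the local scope.
parts = []
if len(filter1) > 0:
    parts.append("col1 in @filter1")
if len(filter2) > 0:
    parts.append("col2 in @filter2")

dff = df.query(" and ".join(parts)) if parts else df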
You can build the masks separately, combine them with the & operator, and apply the merged mask only once:
from functools import reduce

filters = []
if len(filter1) > 0:
    filters.append(df.col1.isin(filter1))
if len(filter2) > 0:
    filters.append(df.col2.isin(filter2))

if len(filters) > 0:
    final_filter = reduce(lambda a, b: a & b, filters)
    df = df.loc[final_filter]
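Functionally the same thing can be written with numpy folding the list of masks for you (a small variation on the answer above, reusing the filters list):

import numpy as np

if filters:
    # Combine all boolean masks element-wise and apply once.
    df = df.loc[np.logical_and.reduce(filters)]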
I am running a fairly complex filter on a dataframe in pandas (I am filtering for passing test results against 67 different thresholds via a dictionary). In order to do this I have the following:
query_string = ' | '.join([f'{k} > {v}' for k , v in dictionary.items()])
test_passes = df.query(query_string, engine='python')
Where k is the test name and v is the threshold value.
This is working nicely and I am able to export the rows with test passes to csv.
I am wondering, though, if there is also a way to attach a column that counts the number of test passes, so that each row would record between 1 and 67 passes.
So I finally 'solved' it with the following, starting after the pandas query originally posted. The original question asked about test passes, but my use case is actually for test failures.
test_failures = data.query(query_string, engine='python').copy()
The copy is there to prevent unintentional data manipulation and chained-assignment warnings.
for k, row in test_failures.iterrows():
    failure_count = 0
    test_count = 0
    for key, val in threshold_dict.items():
        test_count += 1
        if row[key] > val:
            failure_count += 1
    test_failures.at[k, 'Test Count'] = test_count
    test_failures.at[k, 'Failure Count'] = failure_count
From what I have read, iterrows() is not the fastest iteration method, but it provides the index (k) and the row data (row) separately, which I found more useful for these purposes than the tuple returned by itertuples().
sorted_test_failures = test_failures.sort_values('Failure Count', ascending=False)
sorted_test_failures.to_csv('failures.csv', encoding='utf8')
A little sorting and saving to finish.
I have tested on a dummy data set of (8000 x 66) - it doesn't provide groundbreaking speed but it does the job. Any improvements would be great!
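For what it's worth, the same two columns can be computed without a Python-level loop; this is a hedged sketch assuming threshold_dict maps column names to numeric thresholds, as in the loop above:

# Build one boolean column per threshold test, then sum across columns per row.
failed = pd.concat(
    {col: test_failures[col] > thresh for col, thresh in threshold_dict.items()},
    axis=1,
)
test_failures['Test Count'] = len(threshold_dict)
test_failures['Failure Count'] = failed.sum(axis=1)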
This was answered here:
https://stackoverflow.com/a/24516612/6815750
But to give an example you could do the following:
new_df = df.apply(pd.Series.value_counts, axis = 1) #where df is your current dataframe holding the pass/fails
df[new_df.columns] = new_df
You can use the following approach instead:
dictionary = {'a': 'b', 'b': 'c'}
data = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 1, 2], 'c': [2, 1, 1]})
test_components = pd.DataFrame([data.loc[:, k] > data.loc[:, v] for k, v in dictionary.items()]).T
# now you can inspect which conditions were met in the `test_components` variable
condition = test_components.any(axis=1)
data_filtered = data.loc[condition, :]
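Since test_components holds one boolean column per condition, the per-row count the question asks about falls out of the same frame (the conditions_met column name here is just an example):

# Number of conditions met per row (0 .. len(dictionary)).
data['conditions_met'] = test_components.sum(axis=1)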
I am using the following function with a DataFrame:
df['error_code'] = df.apply(lambda row: replace_semi_colon(row), axis=1)
The embedded function is:
def replace_semi_colon(row):
    errrcd = str(row['error_code'])
    semi_colon_pat = re.compile(r'.*;.*')
    if pd.notnull(errrcd):
        if semi_colon_pat.match(errrcd):
            mod_error_code = str(errrcd.replace(';', ':'))
            return mod_error_code
    return errrcd
But I am receiving the (in)famous
SettingWithCopyWarning
I have read many posts but still do not know how to prevent it.
The strange thing is that I use other apply functions in the same way and they do not raise this warning.
Can someone explain why I am getting this warning?
Before the apply there was another statement:
df = df.query('error_code != "BM" and error_code != "PM"')
I modified that to:
df.loc[:] = df.query('error_code != "BM" and error_code != "PM"')
That solved it.
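For completeness: the more conventional fix is to make the post-query frame an explicit copy, and the semicolon replacement itself can be done without apply at all (a sketch of that variant, not the exact code above):

# Explicit copy: df is now clearly a new frame, not a view of the original.
df = df.query('error_code != "BM" and error_code != "PM"').copy()

# Vectorised replacement; no regex match or row-wise apply needed.
df['error_code'] = df['error_code'].astype(str).str.replace(';', ':', regex=False)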
Like others before me (e.g., questions like this), I am attempting to use statsmodels OLS within a pandas groupby. However, in trying to send the results' residuals to a column in the extant dataframe, I run up against either indexing ValueErrors (if I use apply) or else KeyErrors (if I use transform).
My current code is:
def regression_residuals(df, **kwargs):
    X = df[kwargs['x_column']]
    y = df[kwargs['y_column']]
    regr_ols = sm.OLS(y, X).fit()
    resid = regr_ols.resid.reset_index(drop=True)
    return resid

df['residuals'] = df.groupby(['year_and_month']).apply(
    regression_residuals, x_column='x_var', y_column='y_var')
As is, the code raises "ValueError: Wrong number of items passed 4, placement implies 1", while changing apply to transform results in "KeyError: ('x_var', 'occurred at index item_label')". From the debug output, the residuals themselves appear to be computed correctly, but pandas has a hard time placing the residual series back into the original frame with the correct indexing, and it is not apparent what would do that correctly.
If I try to iterate through the DataFrameGroupBy with a for loop, as in the question I cited, the original frame remains unmodified. As a result, things like
grps = df.groupby(['year_and_month'])
for year_month, grp in grps:
    grp['residuals'] = apply_reg_resid(grp, x_column='x_var', y_column='y_var')
are of no use here, as they do nothing to the original df.
What should I more properly be doing?
Thanks all for any help.
EDIT:
Hi all, I'm apparently unable to post an answer to my own question yet, but I think I've found the solution. Using:
def regression_residuals(df, **kwargs):
    X = df[kwargs.pop('x_column')].values
    y = df[kwargs.pop('y_column')].values
    X = sm.add_constant(X, prepend=False)
    regr_ols = sm.OLS(y, X).fit()
    resid = regr_ols.resid
    # Re-attach the group's original index so the residuals align on assignment.
    df_resid = pd.DataFrame(resid, index=df.index)
    return df_resid
seems to solve the problem.
I am able to answer my question. It is:
def regression_residuals(df, **kwargs):
    X = df[kwargs.pop('x_column')]
    y = df[kwargs.pop('y_column')]
    X = sm.add_constant(X, prepend=False)
    regr_ols = sm.OLS(y, X).fit()
    resid = regr_ols.resid
    # Re-attach the group's original index so the residuals align on assignment.
    df_resid = pd.DataFrame(resid, index=df.index)
    return df_resid
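As a usage sketch (assuming the function above, which returns a frame indexed like each group): passing group_keys=False keeps the group label out of the result's index, so the residuals align with df's own index and can be attached directly.

res = df.groupby('year_and_month', group_keys=False).apply(
    regression_residuals, x_column='x_var', y_column='y_var')

# res comes back indexed like df, so squeeze the single column into a Series and assign.
df['residuals'] = res.squeeze()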
I'm new to Pandas and am trying to merge a few subsets of data. I'm giving a specific case where this happens, but the question is general: How/why is it happening and how can I work around it?
The data I load is only around 85 MB, but I often watch my Python session climb to close to 10 GB of memory usage and then give a memory error.
I have no idea why this happens, but it's killing me as I can't even get started looking at the data the way I want to.
Here's what I've done:
Importing the Main data
import requests, zipfile, StringIO
import numpy as np
import pandas as pd
STAR2013url="http://www3.cde.ca.gov/starresearchfiles/2013/p3/ca2013_all_csv_v3.zip"
STAR2013fileName = 'ca2013_all_csv_v3.txt'
r = requests.get(STAR2013url)
z = zipfile.ZipFile(StringIO.StringIO(r.content))
STAR2013=pd.read_csv(z.open(STAR2013fileName))
Importing Some Cross-Referencing Tables
STARentityList2013url = "http://www3.cde.ca.gov/starresearchfiles/2013/p3/ca2013entities_csv.zip"
STARentityList2013fileName = "ca2013entities_csv.txt"
r = requests.get(STARentityList2013url)
z = zipfile.ZipFile(StringIO.StringIO(r.content))
STARentityList2013=pd.read_csv(z.open(STARentityList2013fileName))
STARlookUpTestID2013url = "http://www3.cde.ca.gov/starresearchfiles/2013/p3/tests.zip"
STARlookUpTestID2013fileName = "Tests.txt"
r = requests.get(STARlookUpTestID2013url)
z = zipfile.ZipFile(StringIO.StringIO(r.content))
STARlookUpTestID2013=pd.read_csv(z.open(STARlookUpTestID2013fileName))
STARlookUpSubgroupID2013url = "http://www3.cde.ca.gov/starresearchfiles/2013/p3/subgroups.zip"
STARlookUpSubgroupID2013fileName = "Subgroups.txt"
r = requests.get(STARlookUpSubgroupID2013url)
z = zipfile.ZipFile(StringIO.StringIO(r.content))
STARlookUpSubgroupID2013=pd.read_csv(z.open(STARlookUpSubgroupID2013fileName))
Renaming a Column ID to Allow for Merge
STARlookUpSubgroupID2013 = STARlookUpSubgroupID2013.rename(columns={'001':'Subgroup ID'})
STARlookUpSubgroupID2013
Successful Merge
merged = pd.merge(STAR2013,STARlookUpSubgroupID2013, on='Subgroup ID')
Try a second merge. This is where the Memory Overflow Happens
merged=pd.merge(merged, STARentityList2013, on='School Code')
I did all of this in an IPython notebook, but I don't think that changes anything.
Although this is an old question, I recently came across the same problem.
In my instance, duplicate keys are required in both dataframes, and I needed a method which could tell if a merge will fit into memory ahead of computation, and if not, change the computation method.
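One quick way to confirm that situation is to look at how often the join key repeats on each side before merging; heavily duplicated keys make the join many-to-many and the row count multiplies (a small diagnostic sketch reusing the frames from the question):

# Count repeats of the join key in each frame; large counts on both sides mean the
# merged row count is the sum over keys of (left count * right count).
print(merged['School Code'].value_counts().head())
print(STARentityList2013['School Code'].value_counts().head())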
The method I came up with is as follows:
Calculate merge size:
def merge_size(left_frame, right_frame, group_by, how='inner'):
    left_groups = left_frame.groupby(group_by).size()
    right_groups = right_frame.groupby(group_by).size()
    left_keys = set(left_groups.index)
    right_keys = set(right_groups.index)
    intersection = right_keys & left_keys
    left_diff = left_keys - intersection
    right_diff = right_keys - intersection

    # NaN != NaN, so this counts rows whose key is missing on each side.
    left_nan = len(left_frame[left_frame[group_by] != left_frame[group_by]])
    right_nan = len(right_frame[right_frame[group_by] != right_frame[group_by]])
    left_nan = 1 if left_nan == 0 and right_nan != 0 else left_nan
    right_nan = 1 if right_nan == 0 and left_nan != 0 else right_nan

    sizes = [(left_groups[group_name] * right_groups[group_name]) for group_name in intersection]
    sizes += [left_nan * right_nan]

    left_size = [left_groups[group_name] for group_name in left_diff]
    right_size = [right_groups[group_name] for group_name in right_diff]
    if how == 'inner':
        return sum(sizes)
    elif how == 'left':
        return sum(sizes + left_size)
    elif how == 'right':
        return sum(sizes + right_size)
    return sum(sizes + left_size + right_size)
Note:
At present with this method, the key can only be a label, not a list. Using a list for group_by currently returns a sum of merge sizes for each label in the list. This will result in a merge size far larger than the actual merge size.
If you are using a list of labels for the group_by, the final row size is:
min([merge_size(df1, df2, label, how) for label in group_by])
Check if this fits in memory
The merge_size function defined here returns the number of rows which will be created by merging two dataframes together.
By multiplying this with the count of columns from both dataframes, then multiplying by the size of np.float[32/64], you can get a rough idea of how large the resulting dataframe will be in memory. This can then be compared against psutil.virtual_memory().available to see if your system can calculate the full merge.
def mem_fit(df1, df2, key, how='inner'):
    rows = merge_size(df1, df2, key, how)
    cols = len(df1.columns) + (len(df2.columns) - 1)
    required_memory = (rows * cols) * np.dtype(np.float64).itemsize
    return required_memory <= psutil.virtual_memory().available
The merge_size method has been proposed as an extension of pandas in this issue: https://github.com/pandas-dev/pandas/issues/15068.
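A hedged usage sketch of the two helpers above, reusing the frames from the question (it assumes numpy and psutil are importable and that both frames share the 'School Code' column):

import numpy as np
import pandas as pd
import psutil

# Estimate the merge size first; only run the real merge if it should fit in RAM.
if mem_fit(merged, STARentityList2013, 'School Code', how='inner'):
    merged = pd.merge(merged, STARentityList2013, on='School Code')
else:
    print('The merge would not fit in memory; consider merging in chunks or using an on-disk tool.')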