Pandas replace the values of multiple columns - python

If a row in sample_input matches a row in match, the corresponding values in sample_input should be replaced. The merge I am using now can do the matching, but I don't know how to do the replacement. There are many duplicate values among the rows being replaced. The sample data I used is uploaded to GitHub:
sample_data_input
import pandas as pd
#Read file
match = pd.read_excel('match.xlsx', sheet_name='Sheet1')
replace = pd.read_excel('replace.xlsx', sheet_name='Sheet1') #replace value
sample_input = pd.read_excel('sample_input.xlsx', sheet_name='Sheet1') #raw file
#column
match_col_n1 = ['e', 'i', 'j', 'k', 'l', 'n', 'label']
match_col_n2 = ['e', 'i', 'j', 'k', 'l', 'n']
replace_col_n = ['i', 'j', 'k', 'l', 'label'] #replace
sample_input_col_n = ['a', 'b', 'c', 'd', 'e', 'f',
'g', 'h', 'i', 'j', 'k', 'l',
'm', 'n']
#DataFrame
match_data = pd.DataFrame(match, columns=match_col_n1)
replace_data = pd.DataFrame(replace, columns=replace_col_n)
sample_input_data = pd.DataFrame(sample_input, columns=sample_input_col_n)
# Pull the label column into sample_input_data by matching on the shared columns
tmp = sample_input_data.merge(match_data, how='left', on=match_col_n2)
sample_input_data['label'] = tmp['label']
#for num in match_data.index.values:
# label = match_data.loc[num, 'label']
# sample_input_data[sample_input_data['label'] == label][replace_col_n] = replace_data.iloc[num, :].values
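# A hedged fix for the commented-out loop above: chained indexing like
# df[mask][cols] = ... assigns to a temporary copy, so nothing is written back.
# Using .loc writes in place (this assumes, as the loop above does, that
# replace_data rows correspond positionally to match_data rows):
for num in match_data.index.values:
    label = match_data.loc[num, 'label']
    mask = sample_input_data['label'] == label
    sample_input_data.loc[mask, replace_col_n] = replace_data.iloc[num, :].values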
sample_input_data.to_excel('output.xlsx', index=False)  # to_excel returns None, so don't assign it back

Here's a pretty straightforward way of comparing and contrasting two Excel files.
import pandas as pd
import numpy as np
# Next, read in both of our excel files into dataframes
df1 = pd.read_excel('C:\\your_path\\Book1.xlsx', 'Sheet1', na_values=['NA'])
df2 = pd.read_excel('C:\\your_path\\Book2.xlsx', 'Sheet1', na_values=['NA'])
# Order by column H1 and reset the index so that it stays this way
# (assign the result back; sort_values does not sort in place by default).
df1 = df1.sort_values(by='H1').reset_index(drop=True)
df2 = df2.sort_values(by='H1').reset_index(drop=True)
# Create a diff function to show what the changes are.
def report_diff(x):
    return x[0] if x[0] == x[1] else '{} ---> {}'.format(*x)
# Merge the two datasets together in a Panel . I will admit that I haven’t fully grokked the panel concept yet but the only way to learn is to keep pressing on!
diff_panel = pd.Panel(dict(df1=df1, df2=df2))  # pd.Panel requires pandas < 1.0
# Once the data is in a panel, we use the report_diff function to highlight all the changes. I think this is a very intuitive way (for this data set) to show changes. It is relatively simple to see what the old value is and the new one. For example, someone could easily check and see why that postal code changed for account number 880043.
diff_output = diff_panel.apply(report_diff, axis=0)
diff_output.tail()
# One of the things we want to do is flag rows that have changes so it is easier to see the changes. We will create a has_change function and use apply to run the function against each row.
def has_change(row):
    if "--->" in row.to_string():
        return "Y"
    else:
        return "N"
diff_output['has_change'] = diff_output.apply(has_change, axis=1)
diff_output.tail()
# It is simple to show all the columns with a change:
diff_output[(diff_output.has_change == 'Y')]
# Finally, let’s write it out to an Excel file:
diff_output[(diff_output.has_change == 'Y')].to_excel('C:\\your_path\\diff.xlsx')
https://pbpython.com/excel-diff-pandas.html
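pd.Panel was removed in pandas 1.0, so the snippet above needs an older pandas. On a current version, DataFrame.compare (pandas >= 1.1) gives a similar old-vs-new view; a minimal sketch, assuming df1 and df2 have identical row and column labels:
import pandas as pd
df1 = pd.read_excel('C:\\your_path\\Book1.xlsx', 'Sheet1', na_values=['NA'])
df2 = pd.read_excel('C:\\your_path\\Book2.xlsx', 'Sheet1', na_values=['NA'])
diff = df1.compare(df2)  # cells that differ, side by side as 'self' (df1) vs 'other' (df2)
diff.to_excel('C:\\your_path\\diff.xlsx')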

Related

How to automatically drop index levels that only have a single value?

I have a dataframe with columns A to M, for example. Then I did:
groups = df.groupby(['E', 'D', 'B', 'G', 'I'])
stats = pd.concat(
    [
        groups['N'].mean().rename('N_mean'),
        groups['H'].median().rename('H_median'),
    ],
    axis=1,  # side by side as columns; the default would stack them into one long Series
)
stats = stats[stats['N_mean'] > 0]
Now if I print stats, the index levels are ('E', 'D', 'B', 'G', 'I'). However, many of them contain only a single value, which makes them insignificant. I know I can determine which levels are insignificant and then call stats.index.droplevel(...), but is there a built-in method that does this automatically?
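As far as I know there is no single built-in for this, but a short sketch (assuming pandas >= 0.24, where droplevel accepts a list of level names) gets close: find the levels with only one distinct value and drop them by name.
single_valued = [name for name in stats.index.names
                 if stats.index.get_level_values(name).nunique() == 1]
if len(single_valued) < stats.index.nlevels:  # droplevel cannot remove every level
    stats = stats.droplevel(single_valued)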

Fill in missing column names - Python

I'm trying to concatenate a bunch of dataframes together, which all have the same information. But some column names are missing and some dataframes have extra columns. However, for the columns they do have, they all follow the same order. I'd like a function to fill in the missing names. The following almost works:
def fill_missing_colnames(colnames):
    valid_colnames = ['Z', 'K', 'C', 'T', 'A', 'E', 'F', 'G']
    missing = list(set(valid_colnames) - set(colnames))
    if len(missing) > 0:
        for i, col in enumerate(colnames):
            if col not in valid_colnames and len(missing) > 0:
                colnames[i] = missing.pop(0)
    return colnames
But the problem is that set() does not preserve the original order (here the difference happens to come back alphabetically), whereas I'd like to preserve the order of the column names (or rather of the valid column names).
colnames = ['K', 'C', 'T', 'E', 'XY', 'F', 'G']
list(set(valid_colnames) - set(colnames))
Out[9]: ['A', 'Z']
The concat looks like this:
concat_errors = {}
all_data = pd.DataFrame(list_of_dataframes[0])
for i, data in enumerate(list_of_dataframes[1:]):
    try:
        all_data = pd.concat([all_data, pd.DataFrame(data)], axis=0, sort=False)
    except Exception as e:
        concat_errors.update({i + 1: e})
You can use a list comprehension instead of a set operation.
missing = [col for col in valid_colnames if col not in colnames]
That simply keeps the valid names that are missing from colnames, preserving the order of valid_colnames.
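For the example above:
valid_colnames = ['Z', 'K', 'C', 'T', 'A', 'E', 'F', 'G']
colnames = ['K', 'C', 'T', 'E', 'XY', 'F', 'G']
missing = [col for col in valid_colnames if col not in colnames]
# ['Z', 'A'] -- ordered as in valid_colnames, not alphabetically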

More than the specified no. of columns are renamed with Pandas

I joined multiple files using Pandas join(), and now I want to rename a few of the duplicate columns. But when I specify the indices of the columns to rename, more columns than the ones I specified are renamed.
Input CSV files have the format
F1.csv
A,B,C,D,E,F
1,4,5,6,7,8
2,1,3,4,5,6
3,4,1,5,1,8
4,5,1,5,6,7
F2.csv
A,B,C,M,N
1,4,5,6,7
2,1,3,4,5
3,4,1,5,1
4,5,1,5,6
F3.csv
A,B,C,X,Y,Z
1,4,5,6,7,8
2,1,3,4,5,6
3,4,1,5,1,8
4,5,1,5,6,7
F4.csv
A,B,C,T,Q,R
1,4,5,6,7,8
2,1,3,4,5,6
3,4,1,5,1,8
4,5,1,5,6,7
And my code:
data = None
for f in filelist:
    if data is None:
        data = pandas.read_csv(f, index_col='A')
    else:
        data = data.join(pandas.read_csv(f, index_col='A'), lsuffix='_left', rsuffix='_right', how=join_type)
print(list(data))
new_names = ["HH", "XX"]
old_names = data.columns[[0, 1]]
data.rename(columns=dict(zip(old_names, new_names)), inplace=True)
print(list(data))
The first print gives the output
['B_left', 'C_left', 'D', 'E', 'F', 'B_right', 'C_right', 'M', 'N', 'B_left', 'C_left', 'X', 'Y', 'Z', 'B_right', 'C_right', 'T', 'Q', 'R']
And print after renaming gives
['HH', 'XX', 'D', 'E', 'F', 'B_right', 'C_right', 'M', 'N', 'HH', 'XX', 'X', 'Y', 'Z', 'B_right', 'C_right', 'T', 'Q', 'R']
My problem is that instead of renaming the columns at indices 0 and 1 alone, it is changing the columns at indices 10 and 11 too. Could anyone help me with this? I am new to Pandas and cannot figure this out. Thanks.
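The cause is that df.rename maps labels, not positions, and the join produced duplicate labels: 'B_left' and 'C_left' each occur twice, so every column carrying those labels gets renamed. A minimal sketch that renames strictly by position instead:
cols = data.columns.tolist()
cols[0], cols[1] = 'HH', 'XX'
data.columns = cols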

How to move a column in a pandas dataframe

I want to take a column indexed 'length' and make it my second column. It currently exists as the 5th column. I have tried:
colnames = big_df.columns.tolist()
# make index "length" the second column in the big_df
colnames = colnames[0] + colnames[4] + colnames[:-1]
big_df = big_df[colnames]
I see the following error:
TypeError: must be str, not list
I'm not sure how to interpret this error because it actually should be a list, right?
Also, is there a general method to move any column by label to a specified position? My columns only have one level, i.e. no MultiIndex involved.
Correcting your error
I'm not sure how to interpret this error because it actually should be
a list, right?
No: colnames[0] and colnames[4] are scalars, not lists. You can't concatenate a scalar with a list. To make them lists, use square brackets:
colnames = [colnames[0]] + [colnames[4]] + colnames[:-1]
(This resolves the TypeError, though note the list still duplicates the first and fifth names and drops the last one.) You can then use either df[colnames] or df.reindex(columns=colnames): both necessarily trigger a copy operation, as this transformation cannot be processed in place.
Generic solution
But converting arrays to lists and then concatenating lists manually is not only expensive, but prone to error. A related answer has many list-based solutions, but a NumPy-based solution is worthwhile since pd.Index objects are stored as NumPy arrays.
The key here is to modify the NumPy array via slicing rather than concatenation. There are only 2 cases to handle: when the desired position exists after the current position, and vice versa.
import pandas as pd, numpy as np
from string import ascii_uppercase
df = pd.DataFrame(columns=list(ascii_uppercase))

def shifter(df, col_to_shift, pos_to_move):
    arr = df.columns.values.copy()  # copy, so the slice assignments below don't mutate df's index in place
    idx = df.columns.get_loc(col_to_shift)
    if idx == pos_to_move:
        pass
    elif idx > pos_to_move:
        arr[pos_to_move+1: idx+1] = arr[pos_to_move: idx]  # shift the displaced names one slot right
    else:
        arr[idx: pos_to_move] = arr[idx+1: pos_to_move+1]  # shift the displaced names one slot left
    arr[pos_to_move] = col_to_shift
    df = df.reindex(columns=arr)
    return df

df = df.pipe(shifter, 'J', 1)
print(df.columns)

Index(['A', 'J', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N',
       'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'],
      dtype='object')
Performance benchmarking
Using NumPy slicing is more efficient with a large number of columns versus a list-based method:
n = 10000
df = pd.DataFrame(columns=list(range(n)))

def shifter2(df, col_to_shift, pos_to_move):
    cols = df.columns.tolist()
    cols.insert(pos_to_move, cols.pop(df.columns.get_loc(col_to_shift)))
    df = df.reindex(columns=cols)
    return df

%timeit df.pipe(shifter, 590, 5)   # 381 µs
%timeit df.pipe(shifter2, 590, 5)  # 1.92 ms

A Faster Way of Removing Unused Categories in Pandas?

I'm running some models in Python, with the data subset by category.
To save memory and simplify preprocessing, all the categorical variables are stored with the category data type.
For each level of the categorical variable in my 'group by' column, I am running a regression, and I need to reset all my categorical variables so that they contain only the categories present in that subset.
I am currently doing this using .cat.remove_unused_categories(), which is taking nearly 50% of my total runtime. At the moment, the worst offender is my grouping column, others are not taking as much time (as I guess there are not as many levels to drop).
Here is a simplified example:
import itertools
import pandas as pd
#generate some fake data
alphabets = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
keywords = [''.join(i) for i in itertools.product(alphabets, repeat = 2)]
z = pd.DataFrame({'x':keywords})
#convert to category datatype
z.x = z.x.astype('category')
#groupby
z = z.groupby('x')
#loop over groups
for i in z.groups:
    x = z.get_group(i)
    x.x = x.x.cat.remove_unused_categories()
    # run my fancy model here
On my laptop this takes about 20 seconds. For this small example we could convert to str and back to category for a speed-up, but my real data has at least 300 rows per group.
Is it possible to speed up this loop? I have tried using x.x = x.x.cat.set_categories(i) which takes a similar time, and x.x.cat.categories = i, which asks for the same number of categories as I started with.
Your problem is that you are assigning z.get_group(i) to x, which makes x a copy of a portion of z. Your code will work fine with this change:
for i in z.groups:
    x = z.get_group(i).copy()  # will no longer be tied to z
    x.x = x.x.cat.remove_unused_categories()
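A further hedged micro-optimization for the grouping column specifically: every row of a group holds the same value i, so the single-category column can be rebuilt directly instead of pruning the full category list. An untested sketch, which may or may not beat remove_unused_categories on your data:
for i in z.groups:
    x = z.get_group(i).copy()
    x['x'] = pd.Categorical([i] * len(x), categories=[i])  # all rows of this group equal i
    # run my fancy model here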
