How to move a column in a pandas dataframe - python

I want to take a column indexed 'length' and make it my second column. It currently exists as the 5th column. I have tried:
colnames = big_df.columns.tolist()
# make index "length" the second column in the big_df
colnames = colnames[0] + colnames[4] + colnames[:-1]
big_df = big_df[colnames]
I see the following error:
TypeError: must be str, not list
I'm not sure how to interpret this error because it actually should be a list, right?
Also, is there a general method to move any column by label to a specified position? My columns only have one level, i.e. no MultiIndex involved.

Correcting your error
I'm not sure how to interpret this error because it actually should be
a list, right?
No: colnames[0] and colnames[4] are scalars, not lists. You can't concatenate a scalar with a list. To make them lists, use square brackets:
colnames = [colnames[0]] + [colnames[4]] + colnames[:-1]
You can either use df[colnames] or df.reindex(columns=colnames): both necessarily trigger a copy operation, as this transformation cannot be performed in place.
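For the specific goal in the question (making the 5th column the second one), the reordered list also needs to keep every other column in place; a minimal sketch along those lines:
colnames = big_df.columns.tolist()
# move the column at index 4 ('length') to position 1, keeping the rest in their original order
colnames = colnames[:1] + [colnames[4]] + colnames[1:4] + colnames[5:]
big_df = big_df[colnames]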
Generic solution
But converting arrays to lists and then concatenating lists manually is not only expensive but also error-prone. A related answer has many list-based solutions, but a NumPy-based solution is worthwhile, since a pd.Index is backed by a NumPy array.
The key here is to modify the NumPy array via slicing rather than concatenation. There are only 2 cases to handle: when the desired position is after the current position, and vice versa.
import pandas as pd, numpy as np
from string import ascii_uppercase
df = pd.DataFrame(columns=list(ascii_uppercase))
def shifter(df, col_to_shift, pos_to_move):
    # work on a copy so the slice assignments below don't mutate df's own column index
    arr = df.columns.values.copy()
    idx = df.columns.get_loc(col_to_shift)
    if idx == pos_to_move:
        pass
    elif idx > pos_to_move:
        # target position is before the current one: shift the block in between right by one
        arr[pos_to_move+1: idx+1] = arr[pos_to_move: idx]
    else:
        # target position is after the current one: shift the block in between left by one
        arr[idx: pos_to_move] = arr[idx+1: pos_to_move+1]
    arr[pos_to_move] = col_to_shift
    df = df.reindex(columns=arr)
    return df
df = df.pipe(shifter, 'J', 1)
print(df.columns)
Index(['A', 'J', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N',
       'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'],
      dtype='object')
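The same function handles the other direction too (the second case above), e.g. continuing from the frame we just built:
df = df.pipe(shifter, 'A', 3)
print(df.columns[:5])
Index(['J', 'B', 'C', 'A', 'D'], dtype='object')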
Performance benchmarking
With a large number of columns, the NumPy slicing approach is more efficient than a list-based method:
n = 10000
df = pd.DataFrame(columns=list(range(n)))
def shifter2(df, col_to_shift, pos_to_move):
    cols = df.columns.tolist()
    cols.insert(pos_to_move, cols.pop(df.columns.get_loc(col_to_shift)))
    df = df.reindex(columns=cols)
    return df
%timeit df.pipe(shifter, 590, 5) # 381 µs
%timeit df.pipe(shifter2, 590, 5) # 1.92 ms

Related

Fill in missing column names - Python

I'm trying to concatenate a bunch of dataframes together, which all have the same information. But some column names are missing and some dataframes have extra columns. However, for the columns they do have, they all follow the same order. I'd like a function to fill in the missing names. The following almost works:
def fill_missing_colnames(colnames):
    valid_colnames = ['Z', 'K', 'C', 'T', 'A', 'E', 'F', 'G']
    missing = list(set(valid_colnames) - set(colnames))
    if len(missing) > 0:
        for i, col in enumerate(colnames):
            if col not in valid_colnames and len(missing) > 0:
                colnames[i] = missing.pop(0)
    return colnames
But the problem is that the set difference does not preserve order (here it happens to come out alphabetically), whereas I'd like to preserve the order from the column names (or rather from the valid column names).
colnames = ['K', 'C', 'T', 'E', 'XY', 'F', 'G']
list(set(valid_colnames) - set(colnames))
Out[9]: ['A', 'Z']
The concat looks like this:
concat_errors = {}
all_data = pd.DataFrame(list_of_dataframes[0])
for i, data in enumerate(list_of_dataframes[1:]):
    try:
        all_data = pd.concat([all_data, pd.DataFrame(data)], axis=0, sort=False)
    except Exception as e:
        concat_errors.update({i+1: e})
You can use a list comprehension instead of a set operation.
missing = [col for col in valid_colnames if col not in colnames]
That will simply filter out the values that are not in colnames and preserve order.
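With the sample lists from the question, the comprehension keeps the order of valid_colnames rather than the alphabetical order produced by the set difference:
valid_colnames = ['Z', 'K', 'C', 'T', 'A', 'E', 'F', 'G']
colnames = ['K', 'C', 'T', 'E', 'XY', 'F', 'G']
missing = [col for col in valid_colnames if col not in colnames]
print(missing)  # ['Z', 'A']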

Pandas replace the values of multiple columns

If a value in match matches the data in sample_input, the corresponding values in sample_input should be replaced (with the values from replace).
The merge method I'm using now can find the matches, but I don't know how to do the replacement.
There are many duplicate values in the data being replaced.
The sample data I used is uploaded to GitHub.
sample_data_input
import pandas as pd
#Read file
match = pd.read_excel('match.xlsx', sheet_name='Sheet1')
replace = pd.read_excel('replace.xlsx', sheet_name='Sheet1') #replace value
sample_input = pd.read_excel('sample_input.xlsx', sheet_name='Sheet1') #raw file
#column
match_col_n1 = ['e', 'i', 'j', 'k', 'l', 'n', 'label']
match_col_n2 = ['e', 'i', 'j', 'k', 'l', 'n']
replace_col_n = ['i', 'j', 'k', 'l', 'label'] #replace
sample_input_col_n = ['a', 'b', 'c', 'd', 'e', 'f',
                      'g', 'h', 'i', 'j', 'k', 'l',
                      'm', 'n']
#DataFrame
match_data = pd.DataFrame(match, columns=match_col_n1)
replace_data = pd.DataFrame(replace, columns=replace_col_n)
sample_input_data = pd.DataFrame(sample_input, columns=sample_input_col_n)
# tmp
tmp = sample_input_data.merge(match_data, how='left', on=None,
                              left_on=match_col_n2, right_on=match_col_n2,
                              left_index=False, right_index=False, sort=False,
                              suffixes=('_x', '_y'), copy=True,
                              indicator=False, validate=None)
sample_input_data['label'] = tmp['label']
#for num in match_data.index.values:
# label = match_data.loc[num, 'label']
# sample_input_data[sample_input_data['label'] == label][replace_col_n] = replace_data.iloc[num, :].values
sample_input_data = sample_input_data.to_excel('output.xlsx', index=False)
Here's a pretty straightforward way of comparing and contrasting two Excel files.
import pandas as pd
import numpy as np
# Next, read in both of our excel files into dataframes
df1 = pd.read_excel('C:\\your_path\\Book1.xlsx', 'Sheet1', na_values=['NA'])
df2 = pd.read_excel('C:\\your_path\\Book2.xlsx', 'Sheet1', na_values=['NA'])
# Order by account number and reindex so that it stays this way.
df1 = df1.sort_values(by=["H1"]).reset_index(drop=True)
df2 = df2.sort_values(by=["H1"]).reset_index(drop=True)
# Create a diff function to show what the changes are.
def report_diff(x):
    return x[0] if x[0] == x[1] else '{} ---> {}'.format(*x)
# Merge the two datasets together in a Panel. I will admit that I haven’t fully grokked the panel concept yet but the only way to learn is to keep pressing on!
diff_panel = pd.Panel(dict(df1=df1,df2=df2))
# Once the data is in a panel, we use the report_diff function to highlight all the changes. I think this is a very intuitive way (for this data set) to show changes. It is relatively simple to see what the old value is and the new one. For example, someone could easily check and see why that postal code changed for account number 880043.
diff_output = diff_panel.apply(report_diff, axis=0)
diff_output.tail()
# One of the things we want to do is flag rows that have changes so it is easier to see the changes. We will create a has_change function and use apply to run the function against each row.
def has_change(row):
    if "--->" in row.to_string():
        return "Y"
    else:
        return "N"
diff_output['has_change'] = diff_output.apply(has_change, axis=1)
diff_output.tail()
# It is simple to show all the columns with a change:
diff_output[(diff_output.has_change == 'Y')]
# Finally, let’s write it out to an Excel file:
diff_output[(diff_output.has_change == 'Y')].to_excel('C:\\your_path\\diff.xlsx')
https://pbpython.com/excel-diff-pandas.html
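Note that pd.Panel has since been removed from pandas, so the snippet above only runs on older versions. A rough equivalent of the element-wise diff, assuming df1 and df2 are already aligned (same index and same columns), is a sketch like this:
# hedged sketch: element-wise diff without pd.Panel
# note: NaN values will always show up as a diff here, since NaN != NaN
diff_output = pd.DataFrame(
    np.where(df1.values == df2.values,
             df1,
             df1.astype(str) + ' ---> ' + df2.astype(str)),
    index=df1.index, columns=df1.columns)
diff_output['has_change'] = diff_output.apply(has_change, axis=1)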

Indexing failure/odd behaviour with array

I have some code that is intended to convert a 3-dimensional list to an array. Technically it works, in that I get a 3-dimensional array, but indexing only works when I don't iterate across one of the dimensions, and doesn't work when I do.
Indexing works here:
listTempAllDays = []
for j in listGPSDays:
    listTempDay = []
    for i in listGPSDays[0]:
        arrayDay = np.array(i)
        listTempDay.append(arrayDay)
    arrayTemp = np.array(listTempDay)
    listTempAllDays.append(arrayTemp)
arrayGPSDays = np.array(listTempAllDays)
print(arrayGPSDays[0,0,0])
It doesn't work here:
listTempAllDays = []
for j in listGPSDays:
    listTempDay = []
    for i in j:
        arrayDay = np.array(i)
        listTempDay.append(arrayDay)
    arrayTemp = np.array(listTempDay)
    listTempAllDays.append(arrayTemp)
arrayGPSDays = np.array(listTempAllDays)
print(arrayGPSDays[0,0,0])
The difference between the two pieces of code is in the inner for loop. The first piece of code also works for all elements in listGPSDays (e.g. for i in listGPSDays[1]: etc...).
Removing the final print call allows the code to run in the second case, or changing the final line to print(arrayGPSDays[0][0,0]) does also run.
In both cases checking the type at all levels returns <class 'numpy.ndarray'>.
I would like this array indexing to work, if possible - what am I missing?
The following is provided as example data:
Anonymised results from print(arrayGPSDays[0:2,0:2,0:2]), generated using the first piece of code (so that the indexing works! - but also resulting in arrayGPSDays[0] being the same as arrayGPSDays[1]):
[[['1' '2']
  ['3' '4']]

 [['1' '2']
  ['3' '4']]]
numpy's array constructor can handle arbitrarily dimensioned iterables. The only stipulation is that they can't be jagged (i.e. each "row" in each dimension must have the same length).
Here's an example:
In [1]: list_3d = [[['a', 'b', 'c'], ['d', 'e', 'f']], [['g', 'h', 'i'], ['j', 'k', 'l']]]
In [2]: import numpy as np
In [3]: np.array(list_3d)
Out[3]:
array([[['a', 'b', 'c'],
['d', 'e', 'f']],
[['g', 'h', 'i'],
['j', 'k', 'l']]], dtype='<U1')
In [4]: array_3d = np.array(list_3d)
In [5]: array_3d[0,0,0]
Out[5]: 'a'
In [6]: array_3d.shape
Out[6]: (2, 2, 3)
If the array is jagged, numpy will "squash" down to the dimension where the jagged-ness happens. Since that explanation is clear as mud, an example might help:
In [20]: jagged_3d = [ [['a', 'b'], ['c', 'd']], [['e', 'f'], ['g', 'h'], ['i', 'j']] ]
In [21]: jagged_arr = np.array(jagged_3d)
In [22]: jagged_arr.shape
Out[22]: (2,)
In [23]: jagged_arr
Out[23]:
array([list([['a', 'b'], ['c', 'd']]),
list([['e', 'f'], ['g', 'h'], ['i', 'j']])], dtype=object)
The reason the constructor isn't working out of the box is because you have a jagged array. numpy simply does not support jagged arrays due to the fact that each numpy array has a well-defined shape representing the length of each dimension. So if the items in a given dimension are different lengths, this abstraction falls apart, and numpy simply doesn't allow it.
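For the data in the question, a quick check along these lines (using the listGPSDays name from the question) should show whether, and at which level, the lengths diverge:
# each set should contain exactly one value if the nested list is not jagged
day_lengths = {len(day) for day in listGPSDays}
row_lengths = {len(row) for day in listGPSDays for row in day}
print(day_lengths, row_lengths)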
HTH.
So Isaac, it seems your code has some syntax misinterpretations.
In your for statement, j represents an ITEM inside the list listGPSDays (I assume it is a list), not the ITEM INDEX inside the list, and you don't need to "get" the range of the list; Python can do it for you. Try:
for j in listGPSdays:
instead of
for j in range(len(listGPSDays)):
Also, try changing this line of code from:
for i in listGPSDays[j]:
to:
for i in listGPSDays.index(j):
I think it will solve your problem, hope it works!

Reading rows from Excel sheet to list of lists using openpyxl

I'm new to Python and working on an Excel sheet that I want to read using Python. I want to read the rows into lists of lists. I've tried this using openpyxl:
rows_iter = ws.iter_rows(min_col = 1, min_row = 2, max_col = 11, max_row = ws.max_row)
val1 = [[cell.value for cell in row] for row in rows_iter]
But this gives me a single list of all the rows as lists inside.
I want to make different lists consisting of 15, 12, or 10 rows (depending on a condition).
Could you please help me?
Here are the sample Excel file, the obtained output, and the expected output. I'm not able to attach more than 2 attachments!
Thanks in advance!
rows_iter = ws.iter_rows(min_col = 1, min_row = 2, max_col = 11, max_row = ws.max_row)
val_intermediate = [[cell.value for cell in list(row)] for row in rows_iter]
# This would be set to whatever cell value contains N
N = ws['A1'].value
# This splits your shallow list of lists into groups of N rows each (a list of lists of lists!)
val1 = [val_intermediate[j:j+N] for j in range(1,len(val_intermediate),N)]
Explanation: The first list comprehension will return a shallow list of lists. The second list comprehension converts your shallow list of lists into a group of lists of lists that are N entries long. To see how this works, consider this sample data:
In [1]: a = [['a','b','c'],['d','e','f'],['g','h','i'],['j','k','l'],['m','n','o'],['p','q','r']]
In [2]: N=2; [a[j:j+N] for j in range(1,len(a),N)]
Out[2]:
[[['d', 'e', 'f'], ['g', 'h', 'i']],
[['j', 'k', 'l'], ['m', 'n', 'o']],
[['p', 'q', 'r']]]
In [3]: N=3; [a[j:j+N] for j in range(1,len(a),N)]
Out[3]:
[[['d', 'e', 'f'], ['g', 'h', 'i'], ['j', 'k', 'l']],
[['m', 'n', 'o'], ['p', 'q', 'r']]]
The range(1,len(a),N) call creates a sequence of start indices from 1 up to len(a) in steps of N; those are the positions at which the shallow list of lists gets split (starting at 1 skips the first entry, as in the output above). Each slice [j:j+N] then grabs up to N items.
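If the group sizes really do vary (the 15, 12, or 10 rows mentioned in the question), a small helper along these lines can hand out the rows in whatever sizes are needed; sizes is a hypothetical list saying how many rows each group should get:
def chunk_rows(rows, sizes):
    # consume `rows` front to back, taking sizes[0] rows, then sizes[1] rows, and so on
    it = iter(rows)
    return [[row for _, row in zip(range(n), it)] for n in sizes]

groups = chunk_rows(val_intermediate, [15, 12, 10])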

A Faster Way of Removing Unused Categories in Pandas?

I'm running some models in Python, with data subset on categories.
For memory usage, and preprocessing, all the categorical variables are stored as category data type.
For each level of a categorical variable in my 'group by' column, I am running a regression, where I need to reset all my categorical variables to those that are present in that subset.
I am currently doing this using .cat.remove_unused_categories(), which is taking nearly 50% of my total runtime. At the moment, the worst offender is my grouping column, others are not taking as much time (as I guess there are not as many levels to drop).
Here is a simplified example:
import itertools
import pandas as pd
#generate some fake data
alphabets = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
keywords = [''.join(i) for i in itertools.product(alphabets, repeat = 2)]
z = pd.DataFrame({'x':keywords})
#convert to category datatype
z.x = z.x.astype('category')
#groupby
z = z.groupby('x')
#loop over groups
for i in z.groups:
    x = z.get_group(i)
    x.x = x.x.cat.remove_unused_categories()
    # run my fancy model here
On my laptop, this takes about 20 seconds. For this small example, we could convert to str and then back to category for a speedup, but my real data has at least 300 lines per group.
Is it possible to speed up this loop? I have tried using x.x = x.x.cat.set_categories(i) which takes a similar time, and x.x.cat.categories = i, which asks for the same number of categories as I started with.
Your problem is that you are assigning z.get_group(i) to x; x is now a copy of a portion of z. Your code will work fine with this change:
for i in z.groups:
    x = z.get_group(i).copy()  # will no longer be tied to z
    x.x = x.x.cat.remove_unused_categories()
