How to reindex the columns MultiIndex of a Pandas DataFrame? - python

I have a Pandas DataFrame with a MultiIndex on its columns (let's say 3 levels):
MultiIndex(levels=[['BA-10.0', 'BA-2.5', ..., 'p'], ['41B004', '41B005', ..., 'T1M003', 'T1M011'], [25, 26, ..., 276, 277]],
           labels=[[0, 0, 0, ..., 18, 19, 19], [4, 5, 6, ..., 14, 12, 13], [24, 33, 47, ..., 114, 107, 113]],
           names=['measurandkey', 'sitekey', 'channelid'])
When I iterate through the first level and yield a subset of the DataFrame:
def cluster(df):
    for key in df.columns.levels[0]:
        yield df[key]

for subdf in cluster(df):
    print(subdf.columns)
The columns index has lost its first level, but the MultiIndex still contains references to all the other keys in the sub-levels, even those that are missing from the subset:
MultiIndex(levels=[['41B004', '41B005', '41B006', '41B008', '41B011', '41MEU1', '41N043', '41R001', '41R002', '41R012', '41WOL1', '41WOL2', 'T1M001', 'T1M003', 'T1M011'], [25, 26, 27, 28, 30, 31, 32, 3, ....
           labels=[[4, 5, 6, 7, 9, 10], [24, 33, 47, 61, 83, 98]],
           names=['sitekey', 'channelid'])
How can I force subdf to have its columns MultiIndex updated so that it only contains the keys that are actually present?

import pandas as pd

def cluster(df):
    for key in df.columns.levels[0]:
        d = df[key]
        # rebuild the columns MultiIndex from the tuples that actually occur in this subset
        d.columns = pd.MultiIndex.from_tuples(d.columns.to_series())
        yield d
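An alternative sketch (assuming pandas 0.20+, where MultiIndex.remove_unused_levels is available): instead of rebuilding the index from tuples, drop the level values that no longer occur in the subset.
def cluster(df):
    for key in df.columns.levels[0]:
        d = df[key]
        # remove level values that are no longer referenced by this subset
        d.columns = d.columns.remove_unused_levels()
        yield d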

Related

Selecting randomly from a dataframe index into a list of sublists; a value can be repeated between sublists but not within a sublist - python

I have to create a list of lists of sublists, where each sublist contains items that can be repeated across lists (the first level of the main list) but not within the sublists. They are taken from a dataframe with the df.sample() method. For example, I got this result:
[[[20, 12, 23, 5, 31], [], [2, 29, 14, 22, 21]],
[[32, 17, 24, 23, 9], [], [20, 41, 2, 27, 14], [2, 9, 7]], #9 repeated
[[10, 44, 14, 27, 29], [], [14, 7, 22, 26, 41]], #wrong because of 14 repeated
[[10, 29, 9, 41, 7], [], [41, 29, 7, 15, 43]], #this sublist is wrong as the 41,9 and 29 are repeated within its sublists
[[44, 29, 23, 28, 19], [], [21, 44, 42, 41, 43]]] #also wrong for same reason as above
The second, empty sublists are fine; that is the desired output from previous code.
The complete code is the following:
matrix = df.copy()
visited = []
iteration = 0
smpls = []
booters = [5, 0, 5, 3]
shiftsdict = dict(zip(shifts, booters))
chunks = [3, 4, 3, 3, 3]
dicts = [[[] for e in range(0, 3)] for w in range(0, 5)]
dicts[1].append([])  # to aggregate a sublist in the 2nd list of the main list
for i in turns['1st week'].keys():
    for j in turns['1st week'][i]:
        n = shiftsdict[j]
        smpls.append(n)
iterate = iter(smpls)
cuantosamples = [list(islice(iterate, 0, e)) for e in chunks]
# cuantosamples is a list of sublists that indicates how many numbers the sample method will take for each period/list
for i, j, k, p in zip(turns['1st week'].keys(), cuantosamples, dicts, turncounts['1st week'].keys()):
    for l, m, o in zip(turns['1st week'][i], j, k):
        if iteration < 20:
            sample = matrix.sample(m, weights='weights')  # take the samples from the df
            vls = list(sample.index.values)
            o.extend(vls)  # add samples to the dicts list of lists
            turncounts['1st week'][p] += len(vls)  # a dict counter
            visited.append(list(vls))
            visited2 = [i for m in visited for i in m]
            visitedunique = list(set(visited2))
            iteration += 1
dicts  # this is the list of lists that I want to fill with the conditions I mentioned above
Printing dicts gives the result I showed at the top of the question; a more desirable output would be something like this:
[[[20, 12, 23, 5, 31], [], [2, 29, 14, 22, 21]],
[[32, 17, 24, 23, 9], [], [20, 41, 2, 27, 14], [2, 8, 7]], #9 is no longer repeated
[[10, 44, 14, 27, 29], [], [6, 7, 22, 26, 41]], #6 is no longer repeated
[[10, 29, 9, 41, 7], [], [43, 30, 7, 15, 43]], #29 and 41 no longer repeated
[[44, 29, 23, 28, 19], [], [21, 45, 42, 41, 43]]] #44 no longer repeated
Again, it doesn't matter if the values are repeated ACROSS lists, but I don't want them repeated within the sublists of a single list.
I'm writing code that solves this with a separate function, but that would mean evaluating each of the sublists, looking for duplicates, and then replacing each duplicate with another random sample from the df (after excluding the already-drawn values). That process doesn't seem very optimal, and I would like to know how to do it all inside this loop; I can't picture how, because it seems to require looping over all the sublists and, at the same time, over the list of lists.
Thanks
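Not a full answer, but a minimal sketch of the replace-duplicates idea described in the question: keep a set of the indices already drawn for the current outer list and drop them from the dataframe before sampling again, so values can repeat across lists but never within one. The names df (with a 'weights' column) and sizes_per_list are assumptions standing in for the real data.
import pandas as pd

# hypothetical stand-ins for the real data
df = pd.DataFrame({'weights': 1.0}, index=range(50))
sizes_per_list = [[5, 0, 5], [5, 0, 5, 3], [5, 0, 5], [5, 0, 5], [5, 0, 5]]

result = []
for sizes in sizes_per_list:
    used = set()                              # indices already drawn for THIS outer list
    outer = []
    for m in sizes:
        pool = df.drop(index=list(used))      # exclude within-list duplicates
        sample = pool.sample(m, weights='weights')
        vls = list(sample.index.values)
        used.update(vls)
        outer.append(vls)
    result.append(outer)
print(result)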

How to change or create a new ndarray from a list

I have a value X of type ndarray with shape (40000, 2).
The second column of X contains lists of 50 numbers.
Example:
[17, [1, 2, 3, ...]],
[39, [44, 45, 45, ...]], ...
I want to convert it to an ndarray of shape (40000, 51):
the first column will stay the same,
and every element of the list will be in its own column.
For my example:
[17, 1, 2, 3, ....],
[39, 44, 45, 45, ...]
How can I do it?
np.hstack((arr[:,0].reshape(-1,1), np.array(arr[:,1].tolist())))
Example:
>>> arr
array([[75, list([90, 39, 63])],
       [20, list([82, 92, 22])],
       [80, list([12, 6, 89])],
       [79, list([11, 96, 74])],
       [96, list([26, 37, 65])]], dtype=object)
>>> np.hstack((arr[:,0].reshape(-1,1),np.array(arr[:,1].tolist()))).astype(int)
array([[75, 90, 39, 63],
       [20, 82, 92, 22],
       [80, 12, 6, 89],
       [79, 11, 96, 74],
       [96, 26, 37, 65]])
You can do this for each row of your ndarray; here is an example:
import numpy

# X = [39, [44, 45, 45, ...]]  -- one row: a number followed by a list of 50 numbers
newX = numpy.ndarray(shape=(51,))
newX[0] = X[0]  # adding the first element
# now the rest of the elements
i = 1
for e in X[1]:
    newX[i] = e
    i = i + 1
You can turn this process into a function and apply it like this:
newArray = numpy.ndarray(shape=(40000, 51))
i = 0
for x in oldArray:
    Process(newArray[i], x)
    i = i + 1
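Process is not defined in the answer above; a minimal sketch of what it could look like, mirroring the per-row loop (the name and signature are assumptions):
def Process(row, x):
    # row: one length-51 row of newArray, filled in place
    # x:   one row of the source array, [number, list_of_50_numbers]
    row[0] = x[0]
    i = 1
    for e in x[1]:
        row[i] = e
        i = i + 1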
I defined the source array (with shorter lists in column 1) as:
X = np.array([[17, [1, 2, 3, 4]], [39, [44, 45, 45, 46]]])
To do your task, define the following function:
def myExplode(row):
    tbl = [row[0]]
    tbl.extend(row[1])
    return tbl
Then apply it to each row:
np.apply_along_axis(myExplode, axis=1, arr=X)
The result is:
array([[17, 1, 2, 3, 4],
       [39, 44, 45, 45, 46]])

Order a list of lists based on a list - python

I have a structure like this:
data = [[2,5,6,9,12,45,32] , [43,23,12,76,845,1] ,[65,23,1,54,22,123] ,
[323,23,412,656,2,3] , [8,5,3,9,12,45,32] , [60,23,12,76,845,1] ,
[5,23,1,54,22,123] , [35,2,12,56,22,34] ]
and I want to order these lists based on another list with the positions:
order = [5, 4, 1, 3, 0, 6, 7, 2]
the result would be:
data_ordered = [[60,23,12,76,845,1],[8,5,3,9,12,45,32], [43,23,12,76,845,1],
[323,23,412,656,2,3] , [2,5,6,9,12,45,32] , [5,23,1,54,22,123] ,
[35,2,12,56,22,34] ,[65,23,1,54,22,123] ]
Any idea?
data_ordered = [ data[i] for i in order]
Pretty basic list comprehension.
import numpy as np
data_ordered = np.array(data)[np.array(order)].tolist()
and that's it. A full example is given below:
import numpy as np
data = [[2,5,6,9,12,45,32] , [43,23,12,76,845,1] ,[65,23,1,54,22,123] ,
[323,23,412,656,2,3] , [8,5,3,9,12,45,32] , [60,23,12,76,845,1] ,
[5,23,1,54,22,123] , [35,2,12,56,22,34] ]
order = [5,4,1,3,0,6,7, 2]
data_ordered= np.array(data)[np.array(order)].tolist()
print(data_ordered)
The output is:
[[60, 23, 12, 76, 845, 1], [8, 5, 3, 9, 12, 45, 32], [43, 23, 12, 76, 845, 1], [323, 23, 412, 656, 2, 3], [2, 5, 6, 9, 12, 45, 32], [5, 23, 1, 54, 22, 123], [35, 2, 12, 56, 22, 34], [65, 23, 1, 54, 22, 123]]
Use numpy to solve it.
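One caveat, assuming your real sublists can have unequal lengths like the sample data above: recent NumPy versions refuse to build such a ragged array implicitly, so you would need an explicit object dtype (or simply stick with the list comprehension, which does not care about lengths):
import numpy as np

# sublists of unequal length need dtype=object on recent NumPy
data_ordered = np.array(data, dtype=object)[np.array(order)].tolist()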

How can I convert all columns from my Excel file using pandas

I want to convert all the columns (59 columns) from my Excel file to a dataframe, specifying the types.
Some columns are strings, others dates, others int, and so on.
I know I can use converters in the read_excel method,
but I have a lot of columns and I don't want to write converters={'column1': type1, 'column2': type2, ..., 'column59': type59}.
My code is:
import numpy as np
import pandas as pd
import recordlinkage
import xlrd
fileName = 'C:/Users/Tito/Desktop/banco ZIKA4.xlsx'
strcols = [0, 5, 31, 36, 37, 38, 39, 40, 41, 45]
datecols = [3, 4, 29, 30, 32, 48, 50, 51, 52, 53, 54, 55]
intcols = [33, 43, 59]
booleancols = [6, ..., 28]
df = pd.read_excel(fileName, sheet_name=0, true_values=['s'], false_values=['n'], converters={strcols: str, intcols: np.int, booleancols: np.bool, datecols: pd.to_datetime})
print(df.iat[1, 31], df.iat[1, 32], df.iat[1, 33])
IIUC, your code doesn't work because the converters kwarg doesn't accept lists of several columns as keys.
What you can do is create dicts instead of lists and pass the merged dicts to converters:
strcols = {c: str for c in [0, 5, 31, 36, 37, 38, 39, 40, 41, 45]}
datecols = {c: pd.to_datetime for c in [3, 4, 29, 30, 32, 48, 50, 51, 52, 53, 54, 55]}
intcols = {c: int for c in [33, 43, 59]}
booleancols = {c: bool for c in range(6, 29)}
conv_fcts = {**strcols, **datecols, **intcols, **booleancols}
df = pd.read_excel(fileName, converters=conv_fcts, sheet_name=0, true_values=['s'], false_values=['n'])

Comparing lists with their indices and content in Python

I have a list of numbers as
N = [13, 14, 15, 25, 27, 31, 35, 36, 43]
After some calculations, for each element in N, I get the following list as the answers.
ndlist = [4, 30, 0, 42, 48, 4, 3, 42, 3]
That is, for the first index in N (which is 13), my answer is 4 in ndlist.
For some indices in N, I get the same answer in ndlist. For example, when N = 13 and 31, the answer is 4 in ndlist.
I need to find the numbers in N (13 and 31 in my example) that have the same answer in ndlist.
Can someone help me with that?
You can use a defaultdict and put the values into lists keyed by the answer, like:
Code:
N = [13, 14, 15, 25, 27, 31, 35, 36, 43]
ndlist = [4, 30, 0, 42, 48, 4, 3, 42, 3]
from collections import defaultdict
answers = defaultdict(list)
for n, answer in zip(N, ndlist):
    answers[answer].append(n)
print(answers)
print([v for v in answers.values() if len(v) > 1])
Results:
defaultdict(<class 'list'>, {4: [13, 31], 30: [14], 0: [15], 42: [25, 36], 48: [27], 3: [35, 43]})
[[13, 31], [25, 36], [35, 43]]
Here is a way using only a nested list comprehension:
[N[idx] for idx, nd in enumerate(ndlist) if nd in [i for i in ndlist if ndlist.count(i)>1]]
#[13, 25, 31, 35, 36, 43]
To explain: the inner list comprehension ([i for i in ndlist if ndlist.count(i) > 1]) collects all values that appear more than once in ndlist, and the outer comprehension picks out the values in N whose positions in ndlist hold one of those duplicate values.
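If ndlist is long, note that ndlist.count(i) inside the comprehension makes this quadratic; a sketch of the same idea using collections.Counter keeps it linear:
from collections import Counter

counts = Counter(ndlist)
duplicated = {nd for nd, c in counts.items() if c > 1}
print([n for n, nd in zip(N, ndlist) if nd in duplicated])
# [13, 25, 31, 35, 36, 43]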
