get value by column index where row is a specific value - python

I have a dataframe sheet_overview:
  Unnamed: 0 Headline Unnamed: 2 Unnamed: 3
0        nan    1. 1.   username       Erik
1        nan    1. 2.    userage         23
2        nan    1. 3.   favorite        ice
I want to get the value 23, by looking for "1. 2." in the second column.
Since I don't want to rely on the column names, I have to use the column index. My question is whether my approach is too complicated.
It works, but it seems overly verbose and not very pythonic:
age = sheet_overview.iloc[
    sheet_overview[sheet_overview.iloc[:, 1] == '1. 2.'].index[0], 3]

Convert the boolean mask to a numpy array so it can be used to filter with iloc, then use next to return the first matched value, falling back to 'missing' if there is no match:
a = sheet_overview.iloc[(sheet_overview.iloc[:, 1] == '1. 2.').values, 3]
a = next(iter(a), 'missing')
print (a)
23
If performance is important, use numba:
from numba import njit

@njit
def first_val(A, k):
    a = A[:, 0]
    b = A[:, 1]
    for i in range(len(a)):
        if a[i] == k:
            return b[i]
    return 'missing'

a = first_val(sheet_overview.iloc[:, [1,3]].values, '1. 2.')


Cleaning outliers inside a column with interpolation

I'm trying to do the following.
I have some data with wrong values (x<=0 or x>=1100) inside a dataframe.
I am trying to change those values to values inside an acceptable range.
For the time being, this is what I do, code-wise:
def while_non_nan(A, k):
    init = k
    if k+1 >= len(A)-1:
        return A.iloc[k-1]
    while np.isnan(A[k+1]):
        k += 1
    # Calculate the value.
    n = k-init+1
    value = (n*A.iloc[init-1] + A.iloc[k])/(n+1)
    return value

evoli.loc[evoli['T1'] >= 1100, 'T1'] = np.nan
evoli.loc[evoli['T1'] <= 0, 'T1'] = np.nan
inds = np.where(np.isnan(evoli))
# Place column means in the indices. Align the arrays using take
for k in inds[0]:
    evoli['T1'].iloc[k] = while_non_nan(evoli['T1'], k)
I transform the outlier values into nan.
Afterwards, I get the position of those nan.
Finally, I modify the nan to the mean value between the previous value and the next one.
Since several nan values can be next to each other, while_non_nan searches for the next non-nan value and computes a weighted mean.
Example of what I'm hoping to get:
Input :
[nan 0 1 2 nan 4 nan nan 7 nan ]
Output:
[0 0 1 2 3 4 5 6 7 7 ]
Hope it is clear enough. Thanks!
Pandas has a built-in interpolation you could use after setting your limits to NaN:
from numpy import NaN
import pandas as pd
df = pd.DataFrame({"T1": [1, 2, NaN, 3, 5, NaN, NaN, 4, NaN]})
df["T1"] = df["T1"].interpolate(method='linear', axis=0).ffill().bfill()
print(df)
interpolate is a DataFrame/Series method that fills NaN values using the specified interpolation method (linear in this case). Chaining .ffill() (forward fill) and .bfill() (backward fill) ensures the first and last items are also replaced if needed, using the second and second-to-last items respectively. If you want a fancier strategy for the first and last items, you need to write it yourself.
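Applied to the example series from the question, a minimal sketch of this approach looks like the following (interior gaps are interpolated linearly, and the chained fills handle the endpoints):
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 0, 1, 2, np.nan, 4, np.nan, np.nan, 7, np.nan])
filled = s.interpolate(method='linear').ffill().bfill()
print(filled.tolist())
# [0.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 7.0]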

How to replace values in an array?

I'm beginning to study Python and saw this:
I have an array (km_media) that has nan values,
km_media = km / (2019 - year)
It happened because the variable year contains some 2019 values.
So, for the sake of learning, I would like to know how to do 2 things:
how can I use replace() to substitute the nan values with 0 in the variable;
how can I print the variable that had the nan values, with the replacement applied.
What I have until now:
1.
km_media = km_media.replace('nan', 0)
print(f"{km_media.replace('nan', 0)}")
Thanks
Not sure if this will do what you are looking for:
import numpy as np

a = 2 / np.arange(5)
print(a)
array([ inf, 2. , 1. , 0.66666667, 0.5 ])
b = [i if i != np.inf and i != np.nan else 0 for i in a]
print(b)
Output:
[0, 2.0, 1.0, 0.6666666666666666, 0.5]
Or:
np.where(np.isinf(a) | np.isnan(a), 0, a)
Or:
a[np.isinf(a)] = 0
Also, for part 2 of your question, I'm not sure what you mean. If you have just replaced the inf's with 0, then you will just be printing zeros. If you want the index position of the inf's you have replaced, you can grab them before replacement:
np.where(a == np.inf)[0][0]
Output:
0 # this is the index position of np.inf in array a
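Since the question is specifically about nan values (km / (2019 - year) divides by zero whenever year is 2019), here is a minimal sketch with hypothetical km and year arrays. It silences the division warnings and then replaces the resulting nan/inf entries with 0 via np.nan_to_num; if km_media were a pandas Series, km_media.fillna(0) would do the same job for the nans:
import numpy as np

# Hypothetical data: the last car is from 2019, so its denominator is zero.
km = np.array([44410.0, 5712.0, 37123.0, 0.0])
year = np.array([2003, 1991, 1990, 2019])

with np.errstate(divide='ignore', invalid='ignore'):
    km_media = km / (2019 - year)

# Replace nan (0/0) and inf (non-zero/0) with 0.
km_media = np.nan_to_num(km_media, nan=0.0, posinf=0.0, neginf=0.0)
print(km_media)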

Best way to find Order of Time series from matrix of groups

Is there a standard way of doing this?
I basically have some users, who performed actions together and split off as a group. We don't know the order of the events, but can infer them:
                   A  B  C  D  E
WentToMall         1  1  1  0  0
DroveToTheMovies   1  0  0  0  0
AteLunchTogether   1  1  0  0  0
BoughtClothes      1  1  0  0  1
BoughtElectronics  1  1  0  0  0
The rule is that once they split, they can't converge together again.
So the time series would look like:
Time 0 is always all of them together; then the largest 'grouping' splits off at 'WentToMall', where we get A,B,C and D,E splitting apart.
From there, it looks like AB split off from C, and AB proceed to 'AteLunchTogether, BoughtClothes, BoughtElectronics'. Sometime during 'boughtclothes', it looks like E split off from D.
Finally, A and B split off at the end as A 'Drove to the movies'.
If possible, I'd like to also show this visually, maybe with nodes showing the number of events separating the split (which would look like):
ABCDE ---> ABC --> AB --> A
  |          |      |---> B
  |          |
  |          |----> C
  |
  |
  |------> DE ----> D
            |-----> E
A problem that comes up is that sometimes you get time points which are 'difficult to assess' or appear contradictory, and don't fit in based on the minimal number of columns. I'm not sure what to do about those either. I am given 'weights' for the actions, so I could decide based on those, or I guess generate all versions of the graph.
I was thinking maybe of using recursion to do a search, or something similar?
edit: the latest file is here
The process works through recursion. Pandas is useful in your scenario, though there might be more efficient ways to do this.
We search from the furthest nodes. In your case, these would be the A and E nodes. How do we know these are the furthest nodes? Count the 0 and 1 values of every row, sum them, and sort by the 0 count. For the first case, it looks like this:
                     0    1
DroveToTheMovies   4.0  1.0
AteLunchTogether   3.0  2.0
BoughtElectronics  3.0  2.0
WentToMall         2.0  3.0
BoughtClothes      2.0  3.0
FirstCase          0.0  5.0
This means that 1 person drove to the movies. You can see the pattern: people join this person later on. In the first case, there are the 5 people we began with. But there is a problem: how do we know whether the previous person was in the group? Let's say X drove to the movies. Now we check who ate lunch; say Y and Z joined the group, but not X. For this case, we check whether the latest group is contained in the new group. So, until we reach the first case, we append every event to an array. Now we have a branch.
Assume some people were not in the group. In this case, we store this odd behavior as well and continue from there. In the first case our beginning node was A; now it is B, found using the same technique. So the process is repeated again.
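For concreteness, a minimal sketch of that counting step (it rebuilds the event matrix from the question plus a synthetic FirstCase row; the full script at the end of this answer constructs the same frames):
import pandas as pd

# Event matrix from the question, plus a "FirstCase" row where everyone is together.
df = pd.DataFrame(
    [[1, 1, 1, 0, 0],
     [1, 0, 0, 0, 0],
     [1, 1, 0, 0, 0],
     [1, 1, 0, 0, 1],
     [1, 1, 0, 0, 0]],
    columns=list("ABCDE"),
    index=['WentToMall', 'DroveToTheMovies', 'AteLunchTogether',
           'BoughtClothes', 'BoughtElectronics'])
first_case = pd.DataFrame([[1, 1, 1, 1, 1]], columns=list("ABCDE"),
                          index=['FirstCase'])
all_case = pd.concat([first_case, df])

# Count the 0s and 1s in each row and sort so the smallest groups come first.
counts = all_case.apply(lambda r: r.value_counts(), axis=1).fillna(0)
print(counts.sort_values(by=1))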
My final results look like this:
0 1
0 DroveToTheMovies Index(['A'], dtype='object')
1 AteLunchTogether Index(['A', 'B'], dtype='object')
2 BoughtElectronics Index(['A', 'B'], dtype='object')
3 WentToMall Index(['A', 'B', 'C'], dtype='object')
4 FirstCase Index(['A', 'B', 'C', 'D', 'E'], dtype='object')
5 BoughtClothes Index(['E'], dtype='object')
6 FirstCase Index(['D', 'E'], dtype='object')
There are two FirstCase entries. You need to process these two FirstCase values and recognize that the D-E group came from the first FirstCase group, and that E then went on to buy clothes. D is unknown and could therefore be assigned somewhere else. And there you have it.
First branch:
ABCDE ---> ABC --> AB --> A
             |      |---> B
             |
             |----> C
Second branch:
(first case) ---> DE --> D
                   |---> E
All you have to do now is find who left the branches. For the first branch it is B, C, and D-E. These are easy to calculate from here on. Hope it helps you. The code is below, and I suggest stepping through it with a debugger to make the whole idea clearer:
import pandas as pd

df = pd.DataFrame(
    [[1, 1, 1, 0, 0],
     [1, 0, 0, 0, 0],
     [1, 1, 0, 0, 0],
     [1, 1, 0, 0, 1],
     [1, 1, 0, 0, 0]], columns=list("ABCDE"))
df.index = ['WentToMall', 'DroveToTheMovies', 'AteLunchTogether', 'BoughtClothes', 'BoughtElectronics']

first_case = pd.DataFrame(
    [[1, 1, 1, 1, 1]], columns=list("ABCDE"), index=['FirstCase'])

all_case = pd.concat([first_case, df])

def case_finder(all_case):
    df_case = all_case.apply(lambda x: x.value_counts(), axis=1).fillna(0)
    df_case = df_case.loc[df_case[1] != 0]
    return df_case.sort_values(by=1)

def check_together(x):
    x = df.iloc[x]
    activity = all_case.loc[x.name]
    does_activity = activity.loc[activity == 1]
    return activity.name, does_activity.index

def check_in(pre, now):
    return pre.isin(now).all()

def check_odd(i):
    act = check_together(i)[0]
    who = check_together(i)[1][~check_together(i)[1].isin(check_together(i-1)[1])]
    return act, who

df = case_finder(all_case)
total = all_case.shape[0]

all_acts = []
last_stable = []
while True:
    for i in range(total):
        act, ind = check_together(i)
        if ind.size == 1:
            print("Initiliazed!")
            all_acts.append([act, ind])
            pass
        else:
            p_act, p_ind = check_together(i-1)
            if check_in(p_ind, ind) == True:
                print("So a new person joins us!")
                all_acts.append([act, ind])
            else:
                print("This is weird. We'll check later!")
                # act, who = check_odd(i)
                last_stable.append([i, p_ind])
                continue
        if act == 'FirstCase':
            break
    if len(last_stable) == 0:
        print("Process done!")
        break
    else:
        print("Update cases!")
        ls_ind = last_stable[0]
        all_case = all_case.drop(last_stable[0][1], axis=1)
        total = all_case.shape[0]
        df = case_finder(all_case)
        last_stable = last_stable[1:]

print(all_acts)
x = pd.DataFrame(all_acts)

Comparing rows of two pandas dataframes?

This is a continuation of my question. Fastest way to compare rows of two pandas dataframes?
I have two dataframes A and B:
A is 1000 rows x 500 columns, filled with binary values indicating either presence or absence.
For a condensed example:
   A  B  C  D  E
0  0  0  0  1  0
1  1  1  1  1  0
2  1  0  0  1  1
3  0  1  1  1  0
B is 1024 rows x 10 columns, and is a full iteration from 0 to 1023 in binary form.
Example:
   0  1  2
0  0  0  0
1  0  0  1
2  0  1  0
3  0  1  1
4  1  0  0
5  1  0  1
6  1  1  0
7  1  1  1
I am trying to find which rows in A, at a particular 10 columns of A, correspond with each row of B.
Each row of A[My_Columns_List] is guaranteed to be somewhere in B, but not every row of B will match up with a row in A[My_Columns_List]
For example, I want to show that for columns [B,D,E] of A,
rows [1,3] of A match up with row [6] of B,
row [0] of A matches up with row [2] of B,
row [2] of A matches up with row [3] of B.
I have tried using:
pd.merge(B.reset_index(), A.reset_index(),
         left_on=B.columns.tolist(),
         right_on=A.columns[My_Columns_List].tolist(),
         suffixes=('_B', '_A'))
This works, but I was hoping that this method would be faster:
S = 2**np.arange(10)
A_ID = np.dot(A[My_Columns_List],S)
B_ID = np.dot(B,S)
out_row_idx = np.where(np.in1d(A_ID,B_ID))[0]
But when I do this, out_row_idx returns an array containing all the indices of A, which doesn't tell me anything.
I think this method will be faster, but I don't know why it returns an array from 0 to 999.
Any input would be appreciated!
Also, credit goes to @jezrael and @Divakar for these methods.
I'll stick by my initial answer but maybe explain better.
You are asking to compare 2 pandas dataframes. Because of that, I'm going to build dataframes. I may use numpy, but my inputs and outputs will be dataframes.
Setup
You said we have a 1000 x 500 array of ones and zeros. Let's build that.
import numpy as np
import pandas as pd

A_init = pd.DataFrame(np.random.binomial(1, .5, (1000, 500)))
A_init.columns = pd.MultiIndex.from_product([range(A_init.shape[1] // 10), range(10)])
A = A_init
In addition, I gave A a MultiIndex to easily group by columns of 10.
Solution
This is very similar to @Divakar's answer, with one minor difference that I'll point out.
For one group of 10 ones and zeros, we can treat it as a bit array of length 10. We can then calculate its integer value by taking the dot product with an array of powers of 2.
twos = 2 ** np.arange(10)
I can execute this for every group of 10 ones and zeros in one go like this
AtB = A.stack(0).dot(twos).unstack()
I stack level 0 so that each row becomes 50 groups of 10 columns, in order to do the dot product more elegantly. I then bring it back with the unstack.
I now have a 1000 x 50 dataframe of numbers that range from 0-1023.
Assume B is a dataframe with each row one of 1024 unique combinations of ones and zeros. B should be sorted like B = B.sort_values().reset_index(drop=True).
This is the part I think I failed at explaining last time. Look at
AtB.loc[:2, :2]
The value in the (0, 0) position, 951, means that the first group of 10 ones and zeros in the first row of A matches the row in B with index 951. That's what you want! The funny thing is, I never looked at B. You know why? B is irrelevant! It's just a goofy way of representing the numbers from 0 to 1023. This is the difference with my answer: I'm ignoring B. Ignoring this useless step should save time.
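As a minimal sketch of why the packed value doubles as the row index (assuming B enumerates 0-1023 with column 0 treated as the least significant bit, which is the ordering implied by the dot product with 2 ** np.arange(10)):
import numpy as np
import pandas as pd

# One group of 10 ones and zeros from a row of A (hypothetical values).
bits = pd.Series([1, 1, 1, 0, 1, 1, 0, 1, 1, 1])
twos = 2 ** np.arange(10)

packed = int(bits.dot(twos))
print(packed)          # 951

# Unpacking the integer reproduces the original bits, so looking the value
# up in B is unnecessary: the integer already is B's row index.
print([(packed >> i) & 1 for i in range(10)])   # [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]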
These are all functions that take two dataframes A and B and returns a dataframe of indices where A matches B. Spoiler alert, I'll ignore B completely.
def FindAinB(A, B):
    assert A.shape[1] % 10 == 0, 'Number of columns in A is not a multiple of 10'
    rng = np.arange(A.shape[1])
    A.columns = pd.MultiIndex.from_product([range(A.shape[1] // 10), range(10)])
    twos = 2 ** np.arange(10)
    return A.stack(0).dot(twos).unstack()
def FindAinB2(A, B):
    assert A.shape[1] % 10 == 0, 'Number of columns in A is not a multiple of 10'
    rng = np.arange(A.shape[1])
    A.columns = pd.MultiIndex.from_product([range(A.shape[1] // 10), range(10)])
    # use clever bit shifting instead of dot product with powers
    # questionable improvement
    return (A.stack(0) << np.arange(10)).sum(1).unstack()
I'm channelling my inner @Divakar (read, this is stuff I've learned from Divakar)
def FindAinB3(A, B):
    assert A.shape[1] % 10 == 0, 'Number of columns in A is not a multiple of 10'
    a = A.values.reshape(-1, 10)
    a = np.einsum('ij->i', a << np.arange(10))
    return pd.DataFrame(a.reshape(A.shape[0], -1), A.index)
Minimalist One Liner
f = lambda A: pd.DataFrame(np.einsum('ij->i', A.values.reshape(-1, 10) << np.arange(10)).reshape(A.shape[0], -1), A.index)
Use it like
f(A)
Timing
FindAinB3 is an order of magnitude faster
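If you want to reproduce the comparison yourself, a minimal timing harness might look like this (a sketch, assuming the FindAinB and FindAinB3 functions above are defined; B is ignored by them, so None is passed for it, and absolute numbers will vary by machine):
import numpy as np
import pandas as pd
from timeit import timeit

A = pd.DataFrame(np.random.binomial(1, .5, (1000, 500)))

for fn in (FindAinB, FindAinB3):
    # A fresh copy each call, since the functions reassign A's columns.
    t = timeit(lambda: fn(A.copy(), None), number=10)
    print(fn.__name__, round(t, 4))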

Pandas DataFrame: Writing values to column depending on a value check of existing column

I want to add a column to a pd.DataFrame in which I write values based on a check in an existing column.
I want to check for values in a dictionary. Let's say I have the following dictionary:
{"<=4":[0,4], "(4,10]":[4,10], ">10":[10,inf]}
Now I want to check in a column in my DataFrame, if the values in this column belong to any of the intervals in the dictionary. If so, I want to write the matching dictionary key to a second column in the same data frame.
So a DataFrame like:
   col_1
a      3
b     15
c      8
will become:
   col_1     col_2
a      3     "<=4"
b     15     ">10"
c      8  "(4,10]"
The pd.cut() function is used to convert a continuous variable to a categorical variable. In this case we have the bin edges [0, 4, 10, np.inf], which means we have 3 categories: [0, 4], [4, 10], and [10, inf]. Any value between 0 and 4 will be assigned to category [0, 4], any value between 4 and 10 will be assigned to category [4, 10], and so on.
Then you assign a name to each category, in the same order, using the labels parameter. Since we have the 3 categories [0, 4], [4, 10], and [10, inf], we simply pass ['<=4', '(4,10]', '>10'] as labels, which means the [0, 4] category will be named <=4, the [4, 10] category will be named (4,10], and so on.
In [83]:
df['col_2'] = pd.cut(df.col_1, [0, 4, 10, np.inf], labels=['<=4', '(4,10]', '>10'])
df
Out[83]:
   col_1   col_2
0      3     <=4
1     15     >10
2      8  (4,10]
This solution creates a function named extract_str which is applied to col_1. It uses a conditional list comprehension to iterate through the keys and values in the dictionary, checking if the value is greater than or equal to the lower bound and less than the upper bound. A check is made to ensure the resulting list does not contain more than one result. If there is a value in the list, it is returned. Otherwise None is returned by default.
from numpy import inf

d = {"<=4": [0, 4], "(4,10]": [4, 10], ">10": [10, inf]}

def extract_str(val):
    results = [key for key, value_range in d.items()
               if value_range[0] <= val < value_range[1]]
    if len(results) > 1:
        raise ValueError('Multiple ranges satisfied.')
    if results:
        return results[0]

df['col_2'] = df.col_1.apply(extract_str)
>>> df
   col_1   col_2
a      3     <=4
b     15     >10
c      8  (4,10]
On this small dataframe, this solution is much faster than the solution provided by @ColonelBeauvel.
%timeit df['col_2'] = df.col_1.apply(extract_str)
1000 loops, best of 3: 220 µs per loop
%timeit df['col_2'] = df['col_1'].map(foo)
1000 loops, best of 3: 1.46 ms per loop
You can use this approach:
dico = pd.DataFrame({"<=4":[0,4], "(4,10]":[4,10], ">10":[10,float('inf')]}).transpose()
foo = lambda x: dico.index[(dico[1]>x) & (dico[0]<=x)][0]
df['col_1'].map(foo)
#0 <=4
#1 >10
#2 (4,10]
#Name: col1, dtype: object
You can use a function with map, like the example below. I hope it helps you.
import pandas as pd
from numpy import inf

d = {'col_1': [3, 15, 8]}
test = pd.DataFrame(d, index=['a', 'b', 'c'])
newdict = {"<=4": [0, 4], "(4,10]": [4, 10], ">10": [10, inf]}

def mapDict(num):
    print(num)
    for key, value in newdict.items():
        tmp0 = value[0]
        tmp1 = value[1]
        if num == 0:
            return "<=4"
        elif (num > tmp0) & (num <= tmp1):
            return key

test['col_2'] = test.col_1.map(mapDict)
then test will become:
   col_1   col_2
a      3     <=4
b     15     >10
c      8  (4,10]
P.S. I would like to know how to write code quickly on Stack Overflow; can someone tell me the tricks?
