How to compare values to the next or previous item in a loop?
I need to summarize consecutive repetitions of occurrences in columns.
After that I need to create a "frequency table", so dfoutput should look like the picture at the bottom.
This code doesn't work because I can't compare an item to the next one.
Maybe there is another, simpler way to do this without looping?
sumrep = 0
df = pd.DataFrame(data={'1': [0,0,1,0,1,1,0,1,1,0,1,1,1,1,0],
                        '2': [0,0,1,1,1,1,0,0,1,0,1,1,0,1,0]})
df.index = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]  # it will be easier to assign repetitions in the output df - the index will equal the number of repetitions
dfoutput = pd.DataFrame(0, index=[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15], columns=['1','2'])

# example for column 1
for val1 in df.columns[1]:
    if val1 == 1 and val1 == 0:  # can't find a way to check the NEXT val1 (one row below) in column 1 :/
        if sumrep == 0:
            dfoutput.loc[1,1] = dfoutput.loc[1,1] + 1  # count only SINGLE occurrences of values and assign them to row 1 in dfoutput
        if sumrep > 0:
            dfoutput.loc[sumrep,1] = dfoutput.loc[sumrep,1] + 1  # count repeated occurrences greater than 1 and assign them to the proper row in dfoutput
            sumrep = 0
    elif val1 == 1 and df[val1+1] == 1:
        sumrep = sumrep + 1
Desired output table for column 1 - dfoutput:
I don't understand why there isn't a simple method to move around a DataFrame like the OFFSET function in Excel VBA :/
You can use the function defined here to perform fast run-length-encoding:
import numpy as np

def rlencode(x, dropna=False):
    """
    Run length encoding.
    Based on http://stackoverflow.com/a/32681075, which is based on the rle
    function from R.

    Parameters
    ----------
    x : 1D array_like
        Input array to encode
    dropna : bool, optional
        Drop all runs of NaNs.

    Returns
    -------
    start positions, run lengths, run values
    """
    where = np.flatnonzero
    x = np.asarray(x)
    n = len(x)
    if n == 0:
        return (np.array([], dtype=int),
                np.array([], dtype=int),
                np.array([], dtype=x.dtype))
    starts = np.r_[0, where(~np.isclose(x[1:], x[:-1], equal_nan=True)) + 1]
    lengths = np.diff(np.r_[starts, n])
    values = x[starts]
    if dropna:
        mask = ~np.isnan(values)
        starts, lengths, values = starts[mask], lengths[mask], values[mask]
    return starts, lengths, values
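For instance (an illustrative check, not part of the original answer), encoding the first six values of column '1' yields the run starts, run lengths and run values:

starts, lengths, values = rlencode([0, 0, 1, 0, 1, 1])
# starts  -> array([0, 2, 3, 4])
# lengths -> array([2, 1, 1, 2])
# values  -> array([0, 1, 0, 1])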
With this function your task becomes a lot easier:
import pandas as pd
from collections import Counter
from functools import partial

def get_frequency_of_runs(col, value=1, index=None):
    _, lengths, values = rlencode(col)
    return pd.Series(Counter(lengths[np.where(values == value)]), index=index)

df = pd.DataFrame(data={'1': [0,0,1,0,1,1,0,1,1,0,1,1,1,1,0],
                        '2': [0,0,1,1,1,1,0,0,1,0,1,1,0,1,0]})

df.apply(partial(get_frequency_of_runs, index=df.index)).fillna(0)
# 1 2
# 0 0.0 0.0
# 1 1.0 2.0
# 2 2.0 1.0
# 3 0.0 0.0
# 4 1.0 1.0
# 5 0.0 0.0
# 6 0.0 0.0
# 7 0.0 0.0
# 8 0.0 0.0
# 9 0.0 0.0
# 10 0.0 0.0
# 11 0.0 0.0
# 12 0.0 0.0
# 13 0.0 0.0
# 14 0.0 0.0
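If you prefer to stay entirely within pandas, a similar frequency table can be built with the usual shift/cumsum run-labelling trick. This is only a sketch under the assumption that you count runs of 1s and want the run length as the row index, as in your desired output; run_length_frequencies is a helper name chosen here, not something from your code:

def run_length_frequencies(col, value=1):
    # Give every consecutive run of equal values its own id.
    run_id = (col != col.shift()).cumsum()
    # For each run, record its length and the value it consists of.
    runs = col.groupby(run_id).agg(['size', 'first'])
    # Keep only runs of `value` and count how often each length occurs.
    lengths = runs.loc[runs['first'] == value, 'size']
    return lengths.value_counts().reindex(range(1, len(col) + 1), fill_value=0)

dfoutput = df.apply(run_length_frequencies)

For column '1' this gives 1 run of length 1, 2 runs of length 2 and 1 run of length 4, matching the table above.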
I'm trying to compare two string columns from two different pandas DataFrames (A and B), and if part of the string matches, I would like to assign the value of a column in DataFrame A to DataFrame B.
This is my code:
import numpy as np
import pandas as pd
A = ['DF-PI-05', 'DF-PI-09', 'DF-PI-10', 'DF-PI-15', 'DF-PI-16',
'DF-PI-19', 'DF-PI-89', 'DF-PI-92', 'DF-PI-93', 'DF-PI-94',
'DF-PI-95', 'DF-PI-96', 'DF-PI-25', 'DF-PI-29', 'DF-PI-30',
'DF-PI-34', 'DF-PI-84']
B = ['PI-05', 'PI-10', 'PI-89', 'PI-90', 'PI-93', 'PI-94', 'PI-95',
'PI-96', 'PI-09', 'PI-15', 'PI-16', 'PI-19', 'PI-91A', 'PI-91b',
'PI-92', 'PI-25-CU', 'PI-29', 'PI-30', 'PI-34', 'PI-84-CU-S1',
'PI-84-CU-S2']
import random
sample_size = len(A)
Group = [random.randint(0,1) for _ in range(sample_size)]
A = pd.DataFrame(list(zip(A,Group)),columns=['ID','Group'])
B = pd.DataFrame(B,columns=['Name'])
clus_tx = np.array([])
for date, row in B.iterrows():
    for date2, row2 in A.iterrows():
        if row2['ID'] in row['Name']:
            clus = row['Group']
        else:
            clus = 999
        clus_tx = np.append(clus_tx, clus)
B['Group'] = clus_tx
What I would like to have is an np.array clus_tx with the length of B: if there is an element of A whose string matches ('PI-xx'), take the value of the 'Group' column from A and assign it to B; if there is no match, assign 999 to the 'Group' column in B.
I think I'm doing the loop wrong, because the size of clus_tx is not what I expected. My real dataset is huge, so I can't do this manually.
First, the reason why the size of clus_tx is not what you want is that you put clus_tx = np.append(clus_tx,clus) in the innermost loop, which has no break. So the length of clus_tx will always be len(A) x len(B).
Second, the logic of the if block is not what you want.
I've changed the code a bit, hope it helps:
import numpy as np
import pandas as pd
A = ['DF-PI-05', 'DF-PI-09', 'DF-PI-10', 'DF-PI-15', 'DF-PI-16',
'DF-PI-19', 'DF-PI-89', 'DF-PI-92', 'DF-PI-93', 'DF-PI-94',
'DF-PI-95', 'DF-PI-96', 'DF-PI-25', 'DF-PI-29', 'DF-PI-30',
'DF-PI-34', 'DF-PI-84']
B = ['PI-05', 'PI-10', 'PI-89', 'PI-90', 'PI-93', 'PI-94', 'PI-95',
'PI-96', 'PI-09', 'PI-15', 'PI-16', 'PI-19', 'PI-91A', 'PI-91b',
'PI-92', 'PI-25-CU', 'PI-29', 'PI-30', 'PI-34', 'PI-84-CU-S1',
'PI-84-CU-S2']
import random
sample_size = len(A)
Group = [random.randint(0,1) for _ in range(sample_size)]
A = pd.DataFrame(list(zip(A,Group)),columns=['ID','Group'])
B = pd.DataFrame(B,columns=['Name'])
clus_tx = np.array([])
for date, row_B in B.iterrows():
    clus = 999
    for date2, row_A in A.iterrows():
        if row_B['Name'] in row_A['ID']:
            clus = row_A['Group']
            break
    clus_tx = np.append(clus_tx, clus)
B['Group'] = clus_tx
print(B)
The print output of B looks like:
           Name  Group
0         PI-05    0.0
1         PI-10    0.0
2         PI-89    1.0
3         PI-90  999.0
4         PI-93    0.0
5         PI-94    1.0
6         PI-95    1.0
7         PI-96    0.0
8         PI-09    1.0
9         PI-15    0.0
10        PI-16    1.0
11        PI-19    1.0
12       PI-91A  999.0
13       PI-91b  999.0
14        PI-92    1.0
15     PI-25-CU  999.0
16        PI-29    0.0
17        PI-30    1.0
18        PI-34    0.0
19  PI-84-CU-S1  999.0
20  PI-84-CU-S2  999.0
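If A and B are large, the nested iterrows loops will still be slow. A possible vectorized shortcut (a sketch that assumes every ID in A is exactly 'DF-' followed by the name it should match, as in the sample data, so the substring test reduces to an exact lookup) is:

# Build a lookup from the stripped ID ('PI-xx') to its Group, then map it onto B.
lookup = A.set_index(A['ID'].str.replace('DF-', '', regex=False))['Group']
B['Group'] = B['Name'].map(lookup).fillna(999)

For the sample data this reproduces the looped result, including 999 for names like 'PI-25-CU' that are not contained in any ID.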
I have a simulation that uses pandas Dataframes to describe objects in a hierarchy. To achieve this, I have used a MultiIndex to show the route to a child object.
Parent df
        par_val
a b
0 0.0  0.366660
  1.0  0.613888
1 2.0  0.506531
  3.0  0.327356
2 4.0  0.684335
  0.0  0.013800
3 1.0  0.590058
  2.0  0.179399
4 3.0  0.790628
  4.0  0.310662
Child df
          child_val
a b   c
0 0.0 0    0.528217
  1.0 0    0.515479
1 2.0 0    0.719221
  3.0 0    0.785008
2 4.0 0    0.249344
  0.0 0    0.455133
3 1.0 0    0.009394
  2.0 0    0.775960
4 3.0 0    0.639091
  4.0 0    0.150854
0 0.0 1    0.319277
  1.0 1    0.571580
1 2.0 1    0.029063
  3.0 1    0.498197
2 4.0 1    0.424188
  0.0 1    0.572045
3 1.0 1    0.246166
  2.0 1    0.888984
4 3.0 1    0.818633
  4.0 1    0.366697
This implies that objects (0, 0.0, 0) and (0, 0.0, 1) in the child DataFrame are both characterised by the value at (0, 0.0) in the parent DataFrame.
When a function is performed on the child DataFrame for a certain subset of 'a', it may therefore need to grab a value from the parent at the corresponding ('a', 'b') index. My current solution locates the value in the parent DataFrame by index within the solution function:
import pandas as pd
import numpy as np
import time
from matplotlib import pyplot as plt
r = range(10, 1000, 10)
dt = []
for i in r:
    start = time.time()

    df_par = pd.DataFrame(
        {'a': np.repeat(np.arange(5), i // 5),
         'b': np.append(np.arange(i // 2), np.arange(i // 2)),
         'par_val': np.random.rand(i)
         }).set_index(['a', 'b'])

    df_child = pd.concat([df_par[[]]] * 2, keys=[0, 1], names=['c'])\
        .reorder_levels(['a', 'b', 'c'])
    df_child['child_val'] = np.random.rand(i * 2)
    df_child['solution'] = np.nan

    def solution(row, df_par, var):
        data_level = len(df_par.index.names)
        index_filt = tuple([row.name[i] for i in range(data_level)])
        sol = df_par.loc[index_filt, 'par_val'] / row.child_val
        return sol

    a_mask = df_child.index.get_level_values('a') == 0
    df_child.loc[a_mask, 'solution'] = df_child.loc[a_mask].apply(solution,
                                                                  df_par=df_par,
                                                                  var=10,
                                                                  axis=1)

    stop = time.time()
    dt.append(stop - start)
plt.plot(r, dt)
plt.show()
The solution function is becoming very costly for large numbers of iterations in the simulation:
(iterations on x vs. time in seconds on y)
Is there a more efficient way of calculating this? I have considered including 'par_val' in the child df, but I was trying to avoid this, as the large amount of repetition reduces the number of simulations I can fit in RAM.
par_val is a float64, which takes 8 bytes per value. If the child data frame has 1 million rows, that's 8 MB of memory (before the OS's memory compression kicks in). If it has 1 billion rows, then yes, I would worry about the memory impact.
The bigger performance bottleneck, though, is your df_child.loc[a_mask].apply(..., axis=1) line. This makes pandas use a slow Python loop instead of much faster vectorized code. In SQL we call the loop approach "row-by-agonizing-row", and it's an anti-pattern. You generally want to avoid .apply(..., axis=1) for this reason.
Here's one way to improve the performance without changing df_par or df_child:
a_mask = df_child.index.get_level_values('a') == 0
child_val = df_child.loc[a_mask, 'child_val'].droplevel(-1)
solution = df_par.loc[child_val.index, 'par_val'] / child_val
df_child.loc[a_mask, 'solution'] = solution.to_numpy()
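If at some point every row of df_child needs a solution rather than just the a == 0 slice, the same index-alignment idea extends to the whole frame. This is a sketch under the assumption that every (a, b) pair in df_child also exists in df_par, which the construction above guarantees:

# Align par_val to the child rows by dropping the extra 'c' level, then divide element-wise.
par_aligned = df_par['par_val'].reindex(df_child.index.droplevel('c')).to_numpy()
df_child['solution'] = par_aligned / df_child['child_val'].to_numpy()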
Could you please help me with a specific task? I need to process a pandas DataFrame column line by line. The main point is that "None" values must be turned into "0" or "1" so that they continue the "0" or "1" values already present in the column. I've done it with a "for" loop, and it works correctly:
for i in np.arange(1, len(df['signal'])):
    if pd.isnull(df['signal'].iloc[i]) and df['signal'].iloc[i-1] == 0:
        df['signal'].iloc[i] = 0
    if pd.isnull(df['signal'].iloc[i]) and df['signal'].iloc[i-1] == 1:
        df['signal'].iloc[i] = 1
But iterating over a DataFrame like this is not considered a good approach.
I tried to use the "loc" method, but it gives incorrect results because each step does not take previously assigned values into account, so some "None" values remain unchanged.
df.loc[(pd.isnull(df['signal'])) & (df['signal'].shift(1) == 0), 'signal'] = 0
df.loc[(pd.isnull(df['signal'])) & (df['signal'].shift(1) == 1), 'signal'] = 1
Does anyone have any idea how to implement this task without a "for" loop?
There are vectorized functions for exactly this purpose that will be much faster, for example ffill:
df = pd.DataFrame(dict(a=[1,1,np.nan, np.nan], b=[0,1,0,np.nan]))
df.ffill()
# df
a b
0 1.0 0.0
1 1.0 1.0
2 NaN 0.0
3 NaN NaN
# output
a b
0 1.0 0.0
1 1.0 1.0
2 1.0 0.0
3 1.0 0.0
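Applied to the column from the question, that would be roughly the following (a sketch; it assumes the first value of 'signal' is already 0 or 1, since there is nothing earlier to fill from):

# Each NaN in 'signal' takes the most recent non-NaN value above it.
df['signal'] = df['signal'].ffill()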
You can use numpy where:
import numpy as np
df['signal'] = np.where(pd.isnull(df['signal']), df['signal'].shift(1), df['signal'])
I currently have a dictionary that stores a user's ID as a key and the events he's performed as a list of tuples. Each tuple contains the date the event was performed and the event itself.
Here is an excerpt from the dictionary:
{
 '56d306892fcf7d8a0563b488bbe72b0df1c42f8b62edf18f68a180eab2ca7dc5':
     [('2018-10-24T08:30:12.761Z', 'booking_initialized')],
 'ac3406118670ef98ee2e3e76ab0f21edccba7b41fa6e4960eea10d2a4d234845':
     [('2018-10-20T14:12:35.088Z', 'visited_hotel'),
      ('2018-10-20T14:17:38.521Z', 'visited_hotel'),
      ('2018-10-20T14:16:41.968Z', 'visited_hotel'),
      ('2018-10-20T13:39:36.064Z', 'search_hotel'),
      ('2018-10-20T13:47:03.086Z', 'visited_hotel')],
 '19813b0b79ec87975e42e02ff34724dd960c7b05efec71477ec66fb04b6bed9c':
     [('2018-10-10T18:10:10.242Z', 'referal_code_shared')]
}
I also have a dataframe with the corresponding columns:
Columns: [visited_hotel, search_hotel, booking_initialized, referal_code_shared]
What I wanted to do was iterate over each dictionary entry and append it as a row to my dataframe, where each value is the number of times the user has performed that event.
So, for example after reading through my dictionary excerpt, my dataframe would read like:
   visited_hotel  search_hotel  booking_initialized  referal_code_shared
0              0             0                    1                    0
1              4             1                    0                    0
2              0             0                    0                    1
Thanks in advance :)
from collections import Counter
import pandas as pd
# d is your dictionary of values
result = {user: Counter(x[1] for x in records)
          for user, records in d.items()}
df = pd.DataFrame(result).fillna(0).T.reset_index(drop=True)
A slightly cleaner approach:
result = {i: Counter(x[1] for x in records)
          for i, records in enumerate(d.values())}
df = pd.DataFrame(result).fillna(0).T
If you want your columns in the specific order, then
cols = ['visited_hotel', 'search_hotel', 'booking_initialized', 'referal_code_shared']
df = df.loc[:, cols]
d = {
'56d306892fcf7d8a0563b488bbe72b0df1c42f8b62edf18f68a180eab2ca7dc5': [('2018-10-24T08:30:12.761Z', 'booking_initialized')],
'ac3406118670ef98ee2e3e76ab0f21edccba7b41fa6e4960eea10d2a4d234845': [('2018-10-20T14:12:35.088Z', 'visited_hotel'), ('2018-10-20T14:17:38.521Z', 'visited_hotel'), ('2018-10-20T14:16:41.968Z', 'visited_hotel'), ('2018-10-20T13:39:36.064Z', 'search_hotel'), ('2018-10-20T13:47:03.086Z', 'visited_hotel')],
'19813b0b79ec87975e42e02ff34724dd960c7b05efec71477ec66fb04b6bed9c': [('2018-10-10T18:10:10.242Z', 'referal_code_shared')]
}
import numpy as np
import pandas as pd

def user_actions(user, actions):
    # Convert the actions to a dataframe
    df = pd.DataFrame(actions).rename(columns={0: 'timestamp', 1: 'action'})
    # Count each action
    counted = df.groupby(['action'])['timestamp'].agg('count').reset_index().rename(columns={'timestamp': 'counter'})
    # Pivot the result so each action is a column
    pivoted = counted.pivot(columns='action', values='counter')
    return pivoted

# Process each user's actions and concatenate all
all_actions_df = pd.concat([user_actions(user, user_actions_list)
                            for user, user_actions_list in d.items()]).replace(np.nan, 0)
Output
   booking_initialized  referal_code_shared  search_hotel  visited_hotel
0                  1.0                  0.0           0.0            0.0
0                  0.0                  0.0           1.0            0.0
1                  0.0                  0.0           0.0            4.0
0                  0.0                  1.0           0.0            0.0
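Another option, shown here only as a sketch that is not part of either answer above, is to flatten the dictionary into one (user, event) row per tuple and let pd.crosstab do the counting:

import pandas as pd

# One row per recorded event; the timestamp is not needed for counting.
rows = [(user, event) for user, records in d.items() for _, event in records]
long_df = pd.DataFrame(rows, columns=['user', 'event'])

# Count events per user; one column per event type.
counts = pd.crosstab(long_df['user'], long_df['event']).reset_index(drop=True)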
I am trying to get rid of NaN values in a dataframe.
Instead of filling NaN with averages or doing ffill, I wanted to fill missing values according to the distribution of values inside a column.
In other words, if a column has 120 rows, of which 20 are NaN, 80 contain 1.0 and 20 contain 0.0, I want to fill roughly 80% of the NaN values with 1.0 and the rest with 0.0. Note that the column contains floats.
I made a function to do so:
def fill_cr_hist(x):
    if x is pd.np.nan:
        r = random.random()
        if r > 0.80:
            return 0.0
        else:
            return 1.0
    else:
        return x
However when I call the function it does not change NaN values.
df['Credit_History'] = df['Credit_History'].apply(fill_cr_hist)
I tried filling the NaN values with pd.np.nan first, but it didn't change anything.
df['Credit_History'].fillna(value=pd.np.nan, inplace=True)
df['Credit_History'] = df['Credit_History'].apply(fill_cr_hist)
The other function I wrote is almost identical and works fine; in that case the column contains strings.
def fill_self_emp(x):
    if x is pd.np.nan:
        r = random.random()
        if r > 0.892442:
            return 'Yes'
        else:
            return 'No'
    else:
        return x
ser = pd.Series([1, 1, np.nan, 0, 0, 1, np.nan, 1, 1, np.nan, 0, 0, np.nan])
Use value_counts with normalize=True to get a list of probabilities corresponding to your values. Then generate values randomly according to the given probability distribution and use fillna to fill NaNs.
p = ser.value_counts(normalize=True).sort_index().tolist()
u = np.sort(ser.dropna().unique())
ser = ser.fillna(pd.Series(np.random.choice(u, len(ser), p=p)))
This solution should work for any number of numeric/categorical values, not just 0s and 1s. If data is a string type, use pd.factorize and convert to numeric.
Details
First, compute the probability distribution:
ser.value_counts(normalize=True).sort_index()
0.0 0.444444
1.0 0.555556
dtype: float64
Get a list of unique values, sorted in the same way:
np.sort(ser.dropna().unique())
array([0., 1.])
Finally, generate random values with specified probability distribution.
pd.Series(np.random.choice(u, len(ser), p=p))
0 0.0
1 0.0
2 1.0
3 0.0
4 0.0
5 0.0
6 1.0
7 1.0
8 0.0
9 0.0
10 1.0
11 0.0
12 1.0
dtype: float64
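As a side note, the original fill_cr_hist most likely never triggers because `x is pd.np.nan` is an identity check, and NaN values taken out of a float column are generally not the same object as np.nan; pd.isna(x) is the reliable test. Applied to the column from the question, the approach above would look roughly like this (a sketch; 'Credit_History' is simply the column name mentioned in the question):

col = df['Credit_History']
p = col.value_counts(normalize=True).sort_index().tolist()
u = np.sort(col.dropna().unique())
# Fill NaNs with values drawn according to the observed 0/1 distribution.
df['Credit_History'] = col.fillna(
    pd.Series(np.random.choice(u, len(df), p=p), index=df.index))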