I have a sample of my dataframe as follows:
data = {'retailer': [2, 2, 2, 2, 2, 5, 5, 5, 5, 5],
'store': [1, 1, 1, 1, 1, 7, 7, 7, 7, 7],
'week':[2021110701, 2021101301, 2021100601, 2021092901, 2021092201, 2021110701, 2021101301, 2021100601, 2021092901, 2021092201],
'dollars': [353136.2, 379263.8, 507892.1, 491528.2, 503602.8, 435025.2, 406698.5, 338383.5, 360845.1, 372385.2]
}
data = pd.DataFrame(data)
I have sorted my columns by doing
data = data.sort_values(['retailer', 'store', 'week'], ascending=(True, True, False))
I would like to find the percent difference in dollars between each row WITHIN each group. So essentially: group by retailer, then store, then find the percent difference in 'dollars' between each week and the week below it, and save the value in a column next to dollars.
Basically have the output look like:
data = {'retailer': [2, 2, 2, 2, 2, 5, 5, 5, 5, 5],
'store': [1, 1, 1, 1, 1, 7, 7, 7, 7, 7],
'week':[2021110701, 2021101301, 2021100601, 2021092901, 2021092201, 2021110701, 2021101301, 2021100601, 2021092901, 2021092201],
'dollars': [353136.2, 379263.8, 507892.1, 491528.2, 503602.8, 435025.2, 406698.5, 338383.5, 360845.1, 372385.2],
'pc_diff': [-0.06889030801252315, -0.2532591075939161, 0.03329188437204613, -0.02397643539710259, 'NaN', 0.06965036753270545, 0.20188632128930636, -0.062247208012523876, -0.030989684874694362, 'NaN']
}
data = pd.DataFrame(data)
So for retailer 2, store 1, I am trying to find the percent difference between week 2021110701 and 2021101301, which would be (353136.2 - 379263.8)/379263.8.
The NAs exist because there is no row below that one so there is nothing to find the percent change between (if that makes sense). Is there a way I can do this/is there a pandas equivalent of using a lag function?
You can use groupby+pct_change:
data['pc_diff'] = data.groupby(['retailer', 'store'])['dollars'].pct_change(-1)
output:
retailer store week dollars pc_diff
0 2 1 2021110701 353136.2 -0.068890
1 2 1 2021101301 379263.8 -0.253259
2 2 1 2021100601 507892.1 0.033292
3 2 1 2021092901 491528.2 -0.023976
4 2 1 2021092201 503602.8 NaN
5 5 7 2021110701 435025.2 0.069650
6 5 7 2021101301 406698.5 0.201886
7 5 7 2021100601 338383.5 -0.062247
8 5 7 2021092901 360845.1 -0.030990
9 5 7 2021092201 372385.2 NaN
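Since the question also asks about a lag function, here is a minimal sketch of the same calculation written with groupby + shift, which is the closest pandas counterpart to a SQL-style lag/lead. The column name pc_diff_manual is only illustrative and not part of the original answer; the result should match pct_change(-1) on this sample.

```python
import pandas as pd

data = pd.DataFrame({
    'retailer': [2, 2, 2, 2, 2, 5, 5, 5, 5, 5],
    'store': [1, 1, 1, 1, 1, 7, 7, 7, 7, 7],
    'week': [2021110701, 2021101301, 2021100601, 2021092901, 2021092201,
             2021110701, 2021101301, 2021100601, 2021092901, 2021092201],
    'dollars': [353136.2, 379263.8, 507892.1, 491528.2, 503602.8,
                435025.2, 406698.5, 338383.5, 360845.1, 372385.2],
})
data = data.sort_values(['retailer', 'store', 'week'],
                        ascending=[True, True, False])

# shift(-1) pulls the value from the row below, within each group
# (the earlier week, given the descending sort on week)
row_below = data.groupby(['retailer', 'store'])['dollars'].shift(-1)

# percent difference relative to the row below; the last row of each
# group has nothing below it, so it becomes NaN
data['pc_diff_manual'] = (data['dollars'] - row_below) / row_below
print(data)
```

Using shift directly is handy when you need the lagged value itself, or a formula other than a plain percent change.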
I would like to know if there is a way to create a subset from a dataframe in python, based on the last exam status of a patient with a corresponding id (one id per patient)
For example, if a certain id has 5 exams (and exam_status can be 1 or 0), I would like to create a new dataframe based only on the last exam status (let's say it is 1). The original df has more columns (72 to be exact).
Example: patient id 13 has 2 exam statuses, and I want a dataframe with the ids and only the row corresponding to the last status (either 0 or 1).
How can I do that?
Solution (from the answers!!):
df.groupby("id").last()[list(df.groupby("id")["exam_status"].last() == 1)]
ALSO, how can I create a subset with the patients that changed status (from 0 to 1). I know it's probably a small change in the given solution, but I'm new at learning python! Thank you!!
This should work:
df.groupby("id").last()[list(df.groupby("id")["exam_status"].last() == 1)]
Assuming your dataframe is df, you group by id, take the last row of each group, then filter with a list of booleans marking the ids whose last exam_status == 1.
Mock data:
df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 13, 14, 14, 15],
'exam_status': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1],
'col3': [2, 3, 1, 2, 3, 2, 1, 23, 4, 3 ,1, 24, 5, 6, 3, 6, 1]})
Output for mock data:
#Out:
# exam_status col3
#id
#1 1 2
#2 1 3
#3 1 1
#4 1 2
#5 1 3
#6 1 2
#7 1 1
#8 1 23
#9 1 4
#10 1 3
#12 1 24
#13 1 6
#14 1 6
#15 1 1
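For the follow-up about patients whose status changed from 0 to 1, one possible sketch is to compare the first and last exam_status per id. This assumes the rows for each patient are already in chronological order and that "changed" means the first recorded status is 0 and the last is 1; the names changed and subset are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 13, 14, 14, 15],
                   'exam_status': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1],
                   'col3': [2, 3, 1, 2, 3, 2, 1, 23, 4, 3, 1, 24, 5, 6, 3, 6, 1]})

status = df.groupby("id")["exam_status"]

# True for ids whose first recorded status is 0 and whose last is 1
changed = status.first().eq(0) & status.last().eq(1)
changed_ids = changed[changed].index

# keep every row (all columns) for those patients
subset = df[df["id"].isin(changed_ids)]
print(subset)
```

On the mock data this keeps ids 13 and 14. A patient who goes 0, 1, 0 would not be selected, so adjust the condition if any 0 to 1 transition should count.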
I have a data frame with a column 'score'. It contains scores from 1 to 10. I want to create a new column "color" which gives the column color according to the score.
For example, if the score is 1, the value of color should be "#75968f", and if the score is 2, the value of color should be "#a5bab7", i.e. we need colors ["#75968f", "#a5bab7", "#c9d9d3", "#e2e2e2","#f1d4Af", "#dfccce", "#ddb7b1", "#cc7878", "#933b41", "#550b1d"] for scores [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] respectively.
Is it possible to do this without using a loop?
Let me know in case you have a problem understanding the question.
Use Series.map with a dictionary built by zipping the two lists. If the scores are simply 1 through the length of the colors list, you can build the dictionary with enumerate instead:
df = pd.DataFrame({'score':[2,4,6,3,8]})
colors = ["#75968f", "#a5bab7", "#c9d9d3", "#e2e2e2","#f1d4Af",
"#dfccce", "#ddb7b1", "#cc7878", "#933b41", "#550b1d"]
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
df['new'] = df['score'].map(dict(zip(scores, colors)))
df['new1'] = df['score'].map(dict(enumerate(colors, 1)))
print (df)
score new new1
0 2 #a5bab7 #a5bab7
1 4 #e2e2e2 #e2e2e2
2 6 #dfccce #dfccce
3 3 #c9d9d3 #c9d9d3
4 8 #cc7878 #cc7878
I have a df like this:
Customer# Facility Transp
1 RS 4
2 RS 7
3 RS 9
1 CM 2
2 CM 8
3 CM 5
I want to convert to a dictionary that looks like this:
transp = {'RS' : {1 : 4, 2 : 7, 3 : 9},
          'CM' : {1 : 2, 2 : 8, 3 : 5}}
I'm unfamiliar with this conversion. I tried various options. The data has to be exactly in this dictionary format. I can't have nesting with []. Essentially Facility is the primary level then Customer / Transp. I feel like this should be easy.... Thanks,
You can do it in one go.
df = pd.DataFrame({"Customer#": [1, 2, 3, 1, 2, 3],
"Facility": ['RS', 'RS', 'RS', 'CM', 'CM', 'CM'],
"Transp": [4, 7, 9, 2, 8, 5]})
transp = df.groupby('Facility')[['Customer#','Transp']].apply(lambda g: dict(g.values.tolist())).to_dict()
print(transp)
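As an alternative sketch (not from the original answer), the same nested dictionary can be built with a dict comprehension over the groups, which some readers find easier to follow than apply with a lambda. The variable name transp_alt is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Customer#": [1, 2, 3, 1, 2, 3],
                   "Facility": ['RS', 'RS', 'RS', 'CM', 'CM', 'CM'],
                   "Transp": [4, 7, 9, 2, 8, 5]})

# one inner dict per facility, mapping Customer# -> Transp
transp_alt = {
    facility: group.set_index('Customer#')['Transp'].to_dict()
    for facility, group in df.groupby('Facility')
}
print(transp_alt)
```

Note that groupby sorts the group keys by default, so 'CM' comes before 'RS' in the resulting dictionary; the contents are the same.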
So I created this post regarding my problem 2 days ago and got an answer thankfully.
I have data made of 20 rows and 2500 columns. Each column is a unique product and the rows are a time series of measurement results. So each product is measured 20 times and there are 2500 products.
This time I want to know for how many consecutive rows my measurement result can stay above a specific threshold.
AKA: I want to count the number of consecutive values that are above a value, let's say 5.
A = [1, 2, 6, 8, 7, 3, 2, 3, 6, 10, 2, 1, 0, 2]
The values above 5 are 6, 8, 7 and then 6, 10, so according to what I defined above, I should get NumofConsFeature = 3 as the result (taking the max if more than one series meets the condition).
I thought of filtering using .gt, then getting the indexes and using a loop afterwards in order to detect the consecutive index numbers but couldn't make it work.
In 2nd phase, I'd like to know the index of the first value of my consecutive series. For the above example, that would be 3.
But I have no idea how to do this one.
Thanks in advance.
Here's another answer using only Pandas functions:
A = [1, 2, 6, 8, 7, 3, 2, 3, 6, 10, 2, 1, 0, 2]
a = pd.DataFrame(A, columns = ['foo'])
a['is_large'] = (a.foo > 5)
a['crossing'] = (a.is_large != a.is_large.shift()).cumsum()
a['count'] = a.groupby(['is_large', 'crossing']).cumcount(ascending=False) + 1
a.loc[a.is_large == False, 'count'] = 0
which gives
foo is_large crossing count
0 1 False 1 0
1 2 False 1 0
2 6 True 2 3
3 8 True 2 2
4 7 True 2 1
5 3 False 3 0
6 2 False 3 0
7 3 False 3 0
8 6 True 4 2
9 10 True 4 1
10 2 False 5 0
11 1 False 5 0
12 0 False 5 0
13 2 False 5 0
From there on you can easily find the maximum and its index.
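As a minimal sketch of that last step: because count is numbered in descending order within each run, the first row of a run holds the run's full length. So the maximum of count is the longest run, and idxmax gives the index where that run starts. The names longest and start are illustrative.

```python
import pandas as pd

A = [1, 2, 6, 8, 7, 3, 2, 3, 6, 10, 2, 1, 0, 2]
a = pd.DataFrame(A, columns=['foo'])
a['is_large'] = a.foo > 5
a['crossing'] = (a.is_large != a.is_large.shift()).cumsum()
a['count'] = a.groupby(['is_large', 'crossing']).cumcount(ascending=False) + 1
a.loc[~a.is_large, 'count'] = 0

longest = a['count'].max()    # length of the longest run above 5
start = a['count'].idxmax()   # index label of the first row of that run
print(longest, start)
```

The index here is zero-based, so the run 6, 8, 7 starts at index 2. If several runs tie for the maximum, idxmax returns the first one.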
There is a simple way to do that.
Let's say your list is: A = [1, 2, 6, 8, 7, 6, 8, 3, 2, 3, 6, 10, 6, 7, 8, 2, 1, 0, 2]
And you want to find how many consecutive series of length 5 have values of at least 6. Here the answer is 2: there are two series of length 5 whose values are all 6 or more. In Python and pandas we do that like below:
df = pd.DataFrame({'wanted_row': A})
condition = (df.wanted_row >= 6) & \
    (df.wanted_row.shift(-1) >= 6) & \
    (df.wanted_row.shift(-2) >= 6) & \
    (df.wanted_row.shift(-3) >= 6) & \
    (df.wanted_row.shift(-4) >= 6)
# each True marks the start of 5 consecutive values that are all >= 6
consecutive_count = condition.sum()
Here's one with maxisland_start_len_mask -
# https://stackoverflow.com/a/52718782/ #Divakar
def maxisland_start_len_mask(a, fillna_index = -1, fillna_len = 0):
    # a is a boolean array
    pad = np.zeros(a.shape[1],dtype=bool)
    mask = np.vstack((pad, a, pad))
    mask_step = mask[1:] != mask[:-1]
    idx = np.flatnonzero(mask_step.T)
    island_starts = idx[::2]
    island_lens = idx[1::2] - idx[::2]
    n_islands_percol = mask_step.sum(0)//2
    bins = np.repeat(np.arange(a.shape[1]),n_islands_percol)
    scale = island_lens.max()+1
    scaled_idx = np.argsort(scale*bins + island_lens)
    grp_shift_idx = np.r_[0,n_islands_percol.cumsum()]
    max_island_starts = island_starts[scaled_idx[grp_shift_idx[1:]-1]]
    max_island_percol_start = max_island_starts%(a.shape[0]+1)
    valid = n_islands_percol!=0
    cut_idx = grp_shift_idx[:-1][valid]
    max_island_percol_len = np.maximum.reduceat(island_lens, cut_idx)
    out_len = np.full(a.shape[1], fillna_len, dtype=int)
    out_len[valid] = max_island_percol_len
    out_index = np.where(valid,max_island_percol_start,fillna_index)
    return out_index, out_len

def maxisland_start_len(a, trigger_val, comp_func=np.greater):
    # a is 2D array as the data
    mask = comp_func(a,trigger_val)
    return maxisland_start_len_mask(mask, fillna_index = -1, fillna_len = 0)
Sample run -
In [169]: a
Out[169]:
array([[ 1, 0, 3],
[ 2, 7, 3],
[ 6, 8, 4],
[ 8, 6, 8],
[ 7, 1, 6],
[ 3, 7, 8],
[ 2, 5, 8],
[ 3, 3, 0],
[ 6, 5, 0],
[10, 3, 8],
[ 2, 3, 3],
[ 1, 7, 0],
[ 0, 0, 4],
[ 2, 3, 2]])
# Per column results
In [170]: row_index, length = maxisland_start_len(a, 5)
In [172]: row_index
Out[172]: array([2, 1, 3])
In [173]: length
Out[173]: array([3, 3, 4])
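For comparison, here is a slower but shorter pandas-only sketch that computes the same per-column start index and length on the sample array above. The helper name longest_run_above is hypothetical and not part of the linked answer; it returns (-1, 0) when no value exceeds the threshold, mirroring the fill values used above.

```python
import numpy as np
import pandas as pd

def longest_run_above(col, threshold):
    """Return (start_label, length) of the longest run of values > threshold."""
    mask = col.gt(threshold)
    run_id = mask.ne(mask.shift()).cumsum()   # new id every time the mask flips
    lengths = mask.groupby(run_id).sum()      # number of True values per run
    if lengths.max() == 0:
        return -1, 0                          # nothing above the threshold
    best = lengths.idxmax()                   # first run with the maximum length
    start = run_id[run_id == best].index[0]
    return start, int(lengths.max())

a = np.array([[1, 0, 3], [2, 7, 3], [6, 8, 4], [8, 6, 8], [7, 1, 6],
              [3, 7, 8], [2, 5, 8], [3, 3, 0], [6, 5, 0], [10, 3, 8],
              [2, 3, 3], [1, 7, 0], [0, 0, 4], [2, 3, 2]])
df = pd.DataFrame(a)

results = {name: longest_run_above(df[name], 5) for name in df.columns}
print(results)
```

With 2500 columns this loops once per column, so the vectorised NumPy version is likely to be faster; the sketch is mainly useful for checking results or for small frames.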
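You can apply diff() on your Series, and then just count the number of consecutive entries where the difference is 1 and the actual value is above your cutoff. The largest count is the maximum number of consecutive values. Note that this relies on the values in a run increasing by exactly 1, as they do in the sample below (6, 7, 8).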
First compute diff():
df = pd.DataFrame({"a":[1, 2, 6, 7, 8, 3, 2, 3, 6, 10, 2, 1, 0, 2]})
df['b'] = df.a.diff()
df
a b
0 1 NaN
1 2 1.0
2 6 4.0
3 7 1.0
4 8 1.0
5 3 -5.0
6 2 -1.0
7 3 1.0
8 6 3.0
9 10 4.0
10 2 -8.0
11 1 -1.0
12 0 -1.0
13 2 2.0
Now count consecutive sequences:
above = 5
n_consec = 1
max_n_consec = 1
for a, b in df.values[1:]:
    if (a > above) & (b == 1):
        n_consec += 1
    else: # check for new max, then start again from 1
        max_n_consec = max(n_consec, max_n_consec)
        n_consec = 1
max_n_consec
3
Here's how I did it using numpy:
import pandas as pd
import numpy as np
df = pd.DataFrame({"a":[1, 2, 6, 7, 8, 3, 2, 3, 6, 10, 2, 1, 0, 2]})
consecutive_steps = 2
marginal_price = 5
assertions = [(df.loc[:, "a"].shift(-i) < marginal_price) for i in range(consecutive_steps)]
condition = np.all(assertions, axis=0)
consecutive_count = df.loc[condition, :].count()
print(consecutive_count)
which yields 6.
I have json records in the file json_data. I used pd.DataFrame(json_data) to make a new table, pd_json_data, using these records.
I want to manipulate pd_json_data to return a new table with primary key (url,hour), and then a column updated that contains a boolean value.
hour is based on the number of checks. For example, if number of checks contains 378 at row 0, the new table should have the numbers 1 through 378 in hour, with True in updated if the number in hour is a number in positive checks.
Any ideas for how I should approach this?
Updated Answer
Make fake data
df = pd.DataFrame({'number of checks': [5, 10, 300, 8],
'positive checks':[[1,3,10], [10,11], [9,200], [1,8,7]],
'url': ['a', 'b', 'c', 'd']})
Output
number of checks positive checks url
0 5 [1, 3, 10] a
1 10 [10, 11] b
2 300 [9, 200] c
3 8 [1, 8, 7] d
Iterate and create new dataframes, then concatenate
dfs = []
for i, row in df.iterrows():
    hour = np.arange(1, row['number of checks'] + 1)
    df_cur = pd.DataFrame({'hour' : hour,
                           'url': row['url'],
                           'updated': np.isin(hour, row['positive checks'])})
    dfs.append(df_cur)
df_final = pd.concat(dfs)
hour updated url
0 1 True a
1 2 False a
2 3 True a
3 4 False a
4 5 False a
0 1 False b
1 2 False b
2 3 False b
3 4 False b
4 5 False b
5 6 False b
6 7 False b
7 8 False b
8 9 False b
9 10 True b
0 1 False c
1 2 False c
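As an alternative sketch without building one dataframe per row, you could repeat each row "number of checks" times and number the copies within each url. This assumes url is unique per input row, as in the fake data; the name expanded is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'number of checks': [5, 10, 300, 8],
                   'positive checks': [[1, 3, 10], [10, 11], [9, 200], [1, 8, 7]],
                   'url': ['a', 'b', 'c', 'd']})

# repeat each row once per check
repeats = df['number of checks'].to_numpy()
expanded = df.loc[df.index.repeat(repeats), ['url', 'positive checks']]
expanded = expanded.reset_index(drop=True)

# number the copies 1..n within each url
expanded['hour'] = expanded.groupby('url').cumcount() + 1

# True where the hour appears in that row's list of positive checks
expanded['updated'] = [hour in checks
                       for hour, checks in zip(expanded['hour'],
                                               expanded['positive checks'])]
expanded = expanded[['url', 'hour', 'updated']]
print(expanded.head(10))
```

The result has one row per (url, hour) pair, so those two columns can serve as the key.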
Old answer
Now build new dataframe
df1 = df[['url']].copy()
df1['hour'] = df['number of checks'].map(lambda x: list(range(1, x + 1)))
df1['updated'] = df.apply(lambda x: x['number of checks'] in x['positive checks'], axis=1)
Output
url hour updated
0 a [1, 2, 3, 4, 5] False
1 b [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] True
2 c [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... False
3 d [1, 2, 3, 4, 5, 6, 7, 8] True