How to iterate a vectorized if/else statement over additional columns? - python

import pandas as pd
import numpy as np

ltlist = [1, 2]
org = pd.DataFrame({'ID': [1, 3, 4, 5, 6, 7], 'ID2': [3, 4, 5, 6, 7, 2]})
ltlist_set = set(ltlist)
org['LT'] = np.where(org['ID'].isin(ltlist_set), org['ID'], 0)
I'll need to check the ID2 column and write the ID in, unless it already has an ID.
Output:
ID  ID2  LT
 1    3   1
 3    4   0
 4    5   0
 5    6   0
 6    7   0
 7    2   2
Thanks!

Option 1
You can nest numpy.where statements:
org['LT'] = np.where(org['ID'].isin(ltlist_set), org['ID'],
                     np.where(org['ID2'].isin(ltlist_set), org['ID2'], 0))
Option 2
Alternatively, you can use pd.DataFrame.loc sequentially:
org['LT'] = 0  # default value
org.loc[org['ID2'].isin(ltlist_set), 'LT'] = org['ID2']
# ID matches take precedence, so apply them last to overwrite ID2 matches
org.loc[org['ID'].isin(ltlist_set), 'LT'] = org['ID']
Option 3
A third option is to use numpy.select:
conditions = [org['ID'].isin(ltlist_set), org['ID2'].isin(ltlist_set)]
values = [org['ID'], org['ID2']]
org['LT'] = np.select(conditions, values, 0)  # 0 is the default value
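All three options produce the desired LT column for the sample data; a quick sanity check (my addition, not part of the original answer):
print(org['LT'].tolist())  # [1, 0, 0, 0, 0, 2]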

Related

In pandas, filter for duplicate values appearing in 1 of 2 different columns, for list of certain values only

zed = pd.DataFrame(data={'date': ['2022-03-01', '2022-03-02', '2022-03-03', '2022-03-04', '2022-03-05'],
                         'a': [1, 5, 7, 3, 4],
                         'b': [3, 4, 9, 12, 5]})
How can this dataframe be filtered to keep the earliest row (earliest == lowest date) for each of the three values 1, 5, 4 appearing in either column a or column b? In this example, the rows with dates '2022-03-01' and '2022-03-02' would be kept, as they are the lowest dates where each of the three values appears.
We have tried zed[zed.isin({'a': [1, 5, 4], 'b': [1, 5, 4]}).any(1)].sort_values(by=['date']) but this returns the incorrect result as it returns 3 rows.
Without reshaping your dataframe, you can use:
idx = max([zed[['a', 'b']].eq(i).sum(axis=1).idxmax() for i in [1, 5, 4]])
out = zed.loc[:idx]
Output:
>>> out
date a b
0 2022-03-01 1 3
1 2022-03-02 5 4
You can reshape with DataFrame.stack, which makes it possible to filter by the list and remove duplicates:
s = zed.set_index('date')[['a', 'b']].stack()
idx = s[s.isin([1, 5, 4])].drop_duplicates().index.remove_unused_levels().levels[0]
print(idx)
Index(['2022-03-01', '2022-03-02'], dtype='object', name='date')
out = zed[zed['date'].isin(idx)]
print(out)
         date  a  b
0  2022-03-01  1  3
1  2022-03-02  5  4
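Here drop_duplicates keeps only the first occurrence of each matched value, and remove_unused_levels().levels[0] then reads off the dates that still remain in the index.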
Or filter the first index value matching each condition, get the unique values, and select rows with DataFrame.loc:
L = [1, 5, 4]
idx = pd.unique([y for x in L for y in zed[zed[['a', 'b']].eq(x).any(axis=1)].index[:1]])
df = zed.loc[idx]
print(df)
         date  a  b
0  2022-03-01  1  3
1  2022-03-02  5  4
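For comparison, an equivalent route using DataFrame.melt (my sketch, not one of the original answers; it relies on the ISO-formatted date strings sorting chronologically):
m = zed.melt(id_vars='date', value_vars=['a', 'b'])
first_dates = m[m['value'].isin([1, 5, 4])].groupby('value')['date'].min().unique()
out = zed[zed['date'].isin(first_dates)]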

groupby count of values not equal to other col value pandas

I'm aiming to produce a groupby count of values, but only considering rows where Item and Item2 are different. The following achieves this but drops rows if no values are different. If one or more values are present but are identical between Item and Item2, I'm hoping to return 0.
import pandas as pd

df = pd.DataFrame({
    'Time': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 4, 4, 4],
    'Item': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'A', 'B', 'B', 'B'],
    'Item2': ['B', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'B', 'A', 'B', 'B', 'B', 'A', 'B', 'A', 'A'],
    'Value': [5, 6, 6, 5, 5, 6, 5, 6, 3, 1, 4, 6, 7, 4, 5, 1, 2, 3],
})
df1 = df[df['Item'] != df['Item2']].groupby(['Time']).size().reset_index(name='count')
Intended Output:
   Time  count
0     1      4
1     2      3
2     3      0
3     4      2
Edit 2: I also need the average of Value per Time, considering only the rows where Item and Item2 differ (again returning 0 for groups with no differing rows). My attempt below averages the boolean flag instead of Value, so it doesn't produce the intended output:
df = pd.DataFrame({
    'Time': ['1', '1', '1', '1', '1', '1', '1', '2', '2', '2', '2', '2', '2', '2', '3', '4', '4', '4'],
    'Item': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'A', 'B', 'B', 'B'],
    'Item2': ['B', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'B', 'A', 'B', 'B', 'B', 'A', 'B', 'A', 'A'],
    'Value': [2, 6, 6, 5, 3, 3, 4, 6, 5, 1, 4, 6, 7, 4, 5, 1, 2, 3],
})
df1 = (df.assign(new=df['Item'] != df['Item2'])
         .groupby('Time')['new']
         .mean()
         .reset_index(name='avg')
       )
Intended Output:
   Time  avg
0     1  3.0
1     2  5.0
2     3  0.0
3     4  2.5
The idea is not to filter, but to count the True values per group with sum; here the Series df['Time'] is passed to groupby:
df1 = (df['Item'] != df['Item2']).groupby(df['Time']).sum().reset_index(name='count')
print(df1)
   Time  count
0     1      4
1     2      3
2     3      0
3     4      2
Another similar solution is to create a new helper column and aggregate it:
df1 = (df.assign(new=df['Item'] != df['Item2'])
         .groupby('Time')['new']
         .sum()
         .reset_index(name='count'))
EDIT: You can replace non-matched values with missing values using Series.where, and then fill the missing values with fillna:
df1 = (df.assign(new=df['Value'].where(df['Item'] != df['Item2']))
         .groupby('Time')['new']
         .mean()
         .fillna(0)
         .reset_index(name='avg')
       )
print(df1)
   Time  avg
0     1  3.0
1     2  5.0
2     3  0.0
3     4  2.5
An alternative is to use Series.reindex with the unique values of the original Time column:
df1 = (df[df['Item'] != df['Item2']]
         .groupby(['Time'])['Value']
         .mean()
         .reindex(df['Time'].unique(), fill_value=0)
         .reset_index(name='avg'))
Have a look at pivot tables in pandas:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Time': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3],
    'Item': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'A'],
    'Item2': ['B', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'B', 'A', 'B', 'B', 'B', 'A'],
    'Value': [5, 6, 6, 5, 5, 6, 5, 6, 3, 1, 4, 6, 7, 4, 5],
})
# keep just the rows where there is a difference
df2 = df[df['Item'] != df['Item2']]
# then count the remaining rows for each Time
pd.pivot_table(df2, index='Time', aggfunc='count')
This gives you the table:
      Item  Item2  Value
Time
1        4      4      4
2        3      3      3
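Note that Time 3 disappears here because the filter removes all of its rows; if the zero-count groups are needed, a reindex should restore them (my addition, not part of the original answer):
pd.pivot_table(df2, index='Time', aggfunc='count').reindex(df['Time'].unique(), fill_value=0)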

Pandas - add a row at the end of a for loop iteration

So I have a for loop that gets a series of values and makes some tests:
lst = [1, 2, 3, 4, 5, 6]
df = pd.DataFrame(columns=['columnX', 'columnY', 'columnZ'])
for value in lst:
    if value > 3:
        df['columnX'] = "A"
    else:
        df['columnX'] = "B"
        df['columnZ'] = "Another value only to be filled in this condition"
    df['columnY'] = value - 1
How can I do this so that each loop iteration fills a single row, no matter what the if outcome is? Can I leave some columns empty?
I mean something like the following process:
[create empty row] -> [process] -> [fill column X] -> [process] -> [fill column Y if true] ...
Like:
[index  columnX  columnY  columnZ]
[0      A        0        NULL   ]
[1      A        1        NULL   ]
[2      B        2        "..."  ]
[3      B        3        "..."  ]
[4      B        4        "..."  ]
I am not sure I understand exactly, but I think this may be a solution:
lst = [1, 2, 3, 4, 5, 6]
d = {'columnX': [], 'columnY': []}
for value in lst:
    if value > 3:
        d['columnX'].append("A")
    else:
        d['columnX'].append("B")
    d['columnY'].append(value - 1)
df = pd.DataFrame(d)
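As a side note, appending to plain Python lists and building the DataFrame once at the end, as above, is generally much faster than growing a DataFrame row by row inside the loop.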
For the second question, just add another condition (condition and xxx below are placeholders for your own test and value):
lst = [1, 2, 3, 4, 5, 6]
d = {'columnX': [], 'columnY': [], 'columnZ': []}
for value in lst:
    if value > 3:
        d['columnX'].append("A")
    else:
        d['columnX'].append("B")
    if condition:              # placeholder for your own test
        d['columnZ'].append(xxx)
    else:
        d['columnZ'].append(None)
    d['columnY'].append(value - 1)  # needed so all lists have equal length
df = pd.DataFrame(d)
According to the example you have given, I have changed your code a bit to achieve the result you shared:
lst = [1, 2, 3, 4, 5, 6]
df = pd.DataFrame(columns=['columnX', 'columnY', 'columnZ'])
for index, value in enumerate(lst):
    temp = []
    if value > 3:
        # df['columnX'] = "A"
        temp.append("A")
        temp.append(None)
    else:
        # df['columnX'] = "B"
        temp.append("B")
        temp.append("Another value")  # or you can add any conditions
    # df['columnY'] = value - 1
    temp.append(value - 1)
    df.loc[index] = temp
print(df)
This produces the result:
  columnX        columnY  columnZ
0       B  Another value      0.0
1       B  Another value      1.0
2       B  Another value      2.0
3       A           None      3.0
4       A           None      4.0
5       A           None      5.0
df.index is printed as: Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')
You can simply prepare/initialize your DataFrame with an index sized to the input list, then get the heavy lifting from the np.where routine:
In [111]: lst = [1, 2, 3, 4, 5, 6]
     ...: df = pd.DataFrame(columns=['columnX', 'columnY', 'columnZ'], index=range(len(lst)))

In [112]: int_arr = np.array(lst)

In [113]: df['columnX'] = np.where(int_arr > 3, 'A', 'B')

In [114]: df['columnZ'] = np.where(int_arr > 3, df['columnZ'], '...')

In [115]: df['columnY'] = int_arr - 1

In [116]: df
Out[116]:
  columnX columnY columnZ
0       B       0     ...
1       B       1     ...
2       B       2     ...
3       A       3     NaN
4       A       4     NaN
5       A       5     NaN
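The trick in In [114] is that np.where keeps the existing (still NaN) columnZ entries where int_arr > 3 and writes '...' everywhere else, which is why rows 3-5 show NaN.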

Counting the number of consecutive values that meet a condition (Pandas Dataframe)

So I created a post regarding my problem 2 days ago and thankfully got an answer. I have data made of 20 rows and 2500 columns. Each column is a unique product, and the rows are time series, the results of measurements. Each product is therefore measured 20 times, and there are 2500 products.
This time I want to know for how many consecutive rows my measurement result can stay above a specific threshold.
In other words, I want to count the number of consecutive values that are above a given value, let's say 5.
A = [1, 2, 6, 8, 7, 3, 2, 3, 6, 10, 2, 1, 0, 2]
The qualifying values here are 6, 8, 7, so according to what I defined above I should get NumofConsFeature = 3 as the result (taking the max if more than one run meets the condition).
I thought of filtering with .gt, then getting the indexes and using a loop afterwards to detect the consecutive index numbers, but couldn't make it work.
In a second phase, I'd like to know the index of the first value of my consecutive series. For the above example, that would be 3.
But I have no idea how to approach that one.
Thanks in advance.
Here's another answer using only Pandas functions:
A = [1, 2, 6, 8, 7, 3, 2, 3, 6, 10, 2, 1, 0, 2]
a = pd.DataFrame(A, columns=['foo'])
a['is_large'] = a.foo > 5
a['crossing'] = (a.is_large != a.is_large.shift()).cumsum()
a['count'] = a.groupby(['is_large', 'crossing']).cumcount(ascending=False) + 1
a.loc[~a.is_large, 'count'] = 0
which gives
foo is_large crossing count
0 1 False 1 0
1 2 False 1 0
2 6 True 2 3
3 8 True 2 2
4 7 True 2 1
5 3 False 3 0
6 2 False 3 0
7 3 False 3 0
8 6 True 4 2
9 10 True 4 1
10 2 False 5 0
11 1 False 5 0
12 0 False 5 0
13 2 False 5 0
From there on you can easily find the maximum and its index.
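Concretely, since count counts down within each run, the maximum sits on the first row of the longest run, so both numbers the question asks for fall out directly (a small follow-up sketch building on the columns created above):
max_len = a['count'].max()   # length of the longest run above the threshold, here 3
start = a['count'].idxmax()  # 0-based row where that run begins, here 2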
There is a simple way to do that.
Let's say your list is: A = [1, 2, 6, 8, 7, 6, 8, 3, 2, 3, 6, 10, 6, 7, 8, 2, 1, 0, 2]
Say you want to find how many runs there are with values of at least 6 and a length of 5. Here the answer is 2: there are two runs whose values are all at least 6 and whose length is 5. In Python and pandas we do that like below (wanted_row is the column holding the values):
condition = (df.wanted_row >= 6) & \
            (df.wanted_row.shift(-1) >= 6) & \
            (df.wanted_row.shift(-2) >= 6) & \
            (df.wanted_row.shift(-3) >= 6) & \
            (df.wanted_row.shift(-4) >= 6)
consecutive_count = df[condition].count().head(1)[0]
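For an arbitrary window length, the same condition can be built programmatically instead of chaining shifts by hand (my sketch; wanted_row again stands for whichever column you are testing):
import numpy as np

window = 5
condition = np.logical_and.reduce([df['wanted_row'].shift(-i) >= 6 for i in range(window)])
consecutive_count = int(condition.sum())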
Here's one with maxisland_start_len_mask -
# https://stackoverflow.com/a/52718782/ #Divakar
def maxisland_start_len_mask(a, fillna_index=-1, fillna_len=0):
    # a is a boolean array
    pad = np.zeros(a.shape[1], dtype=bool)
    mask = np.vstack((pad, a, pad))

    mask_step = mask[1:] != mask[:-1]
    idx = np.flatnonzero(mask_step.T)
    island_starts = idx[::2]
    island_lens = idx[1::2] - idx[::2]
    n_islands_percol = mask_step.sum(0) // 2

    bins = np.repeat(np.arange(a.shape[1]), n_islands_percol)
    scale = island_lens.max() + 1

    scaled_idx = np.argsort(scale * bins + island_lens)
    grp_shift_idx = np.r_[0, n_islands_percol.cumsum()]
    max_island_starts = island_starts[scaled_idx[grp_shift_idx[1:] - 1]]

    max_island_percol_start = max_island_starts % (a.shape[0] + 1)

    valid = n_islands_percol != 0
    cut_idx = grp_shift_idx[:-1][valid]
    max_island_percol_len = np.maximum.reduceat(island_lens, cut_idx)

    out_len = np.full(a.shape[1], fillna_len, dtype=int)
    out_len[valid] = max_island_percol_len
    out_index = np.where(valid, max_island_percol_start, fillna_index)
    return out_index, out_len

def maxisland_start_len(a, trigger_val, comp_func=np.greater):
    # a is a 2D array of the data
    mask = comp_func(a, trigger_val)
    return maxisland_start_len_mask(mask, fillna_index=-1, fillna_len=0)
Sample run -
In [169]: a
Out[169]:
array([[ 1,  0,  3],
       [ 2,  7,  3],
       [ 6,  8,  4],
       [ 8,  6,  8],
       [ 7,  1,  6],
       [ 3,  7,  8],
       [ 2,  5,  8],
       [ 3,  3,  0],
       [ 6,  5,  0],
       [10,  3,  8],
       [ 2,  3,  3],
       [ 1,  7,  0],
       [ 0,  0,  4],
       [ 2,  3,  2]])

# Per column results
In [170]: row_index, length = maxisland_start_len(a, 5)

In [172]: row_index
Out[172]: array([2, 1, 3])

In [173]: length
Out[173]: array([3, 3, 4])
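So, per column, row_index holds the 0-based start row of the longest run above 5 and length holds its length; for example, column 0's longest run covers rows 2-4 (the values 6, 8, 7).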
You can apply diff() to your Series and then count the number of consecutive entries where the difference is 1 and the actual value is above your cutoff. The largest count is the maximum number of consecutive values. Note that the diff == 1 check only catches runs that increase by exactly 1 at each step, which holds for the example below (6, 7, 8) but not for the original data (6, 8, 7).
First compute diff():
df = pd.DataFrame({"a":[1, 2, 6, 7, 8, 3, 2, 3, 6, 10, 2, 1, 0, 2]})
df['b'] = df.a.diff()
df
a b
0 1 NaN
1 2 1.0
2 6 4.0
3 7 1.0
4 8 1.0
5 3 -5.0
6 2 -1.0
7 3 1.0
8 6 3.0
9 10 4.0
10 2 -8.0
11 1 -1.0
12 0 -1.0
13 2 2.0
Now count consecutive sequences:
above = 5
n_consec = 1
max_n_consec = 1

for a, b in df.values[1:]:
    if (a > above) & (b == 1):
        n_consec += 1
    else:  # check for new max, then start again from 1
        max_n_consec = max(n_consec, max_n_consec)
        n_consec = 1
max_n_consec = max(n_consec, max_n_consec)  # in case the series ends mid-run

max_n_consec
3
Here's how I did it using numpy:
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1, 2, 6, 7, 8, 3, 2, 3, 6, 10, 2, 1, 0, 2]})

consecutive_steps = 2
marginal_price = 5

assertions = [df.loc[:, "a"].shift(-i) < marginal_price for i in range(consecutive_steps)]
condition = np.all(assertions, axis=0)
consecutive_count = df.loc[condition, :].count()
print(consecutive_count)
which yields 6.
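Here condition marks every position that starts a window of consecutive_steps values below marginal_price, so the count is the number of such windows; flip the comparison to > to count windows above the threshold instead.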

Trouble creating/manipulating Pandas DataFrame from given list of JSON records

I have json records in the file json_data. I used pd.DataFrame(json_data) to make a new table, pd_json_data, using these records.
[image: pandas table pd_json_data]
I want to manipulate pd_json_data to return a new table with primary key (url,hour), and then a column updated that contains a boolean value.
hour is based on the number of checks. For example, if number of checks contains 378 at row 0, the new table should have the numbers 1 through 378 in hour, with True in updated if the number in hour is a number in positive checks.
Any ideas for how I should approach this?
Updated Answer
Make fake data
df = pd.DataFrame({'number of checks': [5, 10, 300, 8],
                   'positive checks': [[1, 3, 10], [10, 11], [9, 200], [1, 8, 7]],
                   'url': ['a', 'b', 'c', 'd']})
Output
   number of checks positive checks url
0                 5      [1, 3, 10]   a
1                10        [10, 11]   b
2               300        [9, 200]   c
3                 8       [1, 8, 7]   d
Iterate and create new dataframes, then concatenate
dfs = []
for i, row in df.iterrows():
    hour = np.arange(1, row['number of checks'] + 1)
    df_cur = pd.DataFrame({'hour': hour,
                           'url': row['url'],
                           'updated': np.in1d(hour, row['positive checks'])})
    dfs.append(df_cur)

df_final = pd.concat(dfs)
hour updated url
0 1 True a
1 2 False a
2 3 True a
3 4 False a
4 5 False a
0 1 False b
1 2 False b
2 3 False b
3 4 False b
4 5 False b
5 6 False b
6 7 False b
7 8 False b
8 9 False b
9 10 True b
0 1 False c
1 2 False c
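For larger inputs, a vectorized variant along these lines should be faster than row-wise iteration (my sketch, not part of the original answer):
rep = df.loc[df.index.repeat(df['number of checks'])].copy()
rep['hour'] = rep.groupby(level=0).cumcount() + 1
rep['updated'] = [h in checks for h, checks in zip(rep['hour'], rep['positive checks'])]
out_vec = rep[['hour', 'updated', 'url']]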
Old answer
Now build the new dataframe:
df1 = df[['url']].copy()
df1['hour'] = df['number of checks'].map(lambda x: list(range(1, x + 1)))
df1['updated'] = df.apply(lambda x: x['number of checks'] in x['positive checks'], axis=1)
Output
url hour updated
0 a [1, 2, 3, 4, 5] False
1 b [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] True
2 c [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... False
3 d [1, 2, 3, 4, 5, 6, 7, 8] True
