Suppose I have the following data frame
from pandas import DataFrame
Cars = {'value':   [10, 31, 661, 1, 51, 61, 551],
        'action1': [1, 1, 1, 1, 1, 1, 1],
        'price1':  [12, 0, 15, 3, 0, 12, 0],
        'action2': [2, 2, 2, 2, 2, 2, 2],
        'price2':  [0, 16, 19, 0, 1, 10, 0],
        'action3': [3, 3, 3, 3, 3, 3, 3],
        'price3':  [14, 36, 9, 0, 0, 0, 0]
        }
df = DataFrame(Cars, columns=['value', 'action1', 'price1', 'action2', 'price2', 'action3', 'price3'])
print (df)
How can I randomly select a value (action and price) among the 3 column pairs? As a result I want a dataframe that looks something like this one:
RandCars = {'value': [10, 31, 661, 1, 51, 61, 551],
'action': [1, 3, 1, 3, 1, 2, 2],
'price': [ 12, 36, 15, 0, 3, 10, 0]
}
df2 = DataFrame(RandCars, columns = ['value','action', 'price'])
print(df2)
Use:
import numpy as np
import pandas as pd

# get column names not starting with action or price
cols = df.columns[~df.columns.str.startswith(('action', 'price'))]
print (cols)
Index(['value'], dtype='object')
#convert filtered columns to 2 numpy arrays
arr1 = df.filter(regex='^action').values
arr2 = df.filter(regex='^price').values
#pandas 0.24+
#arr1 = df.filter(regex='^action').to_numpy()
#arr2 = df.filter(regex='^price').to_numpy()
i, c = arr1.shape
# pick one random column position per row
idx = np.random.choice(np.arange(c), i)
df3 = pd.DataFrame({'action': arr1[np.arange(len(df)), idx],
'price': arr2[np.arange(len(df)), idx]},
index=df.index)
print (df3)
action price
0 2 0
1 3 36
2 3 9
3 1 3
4 3 0
5 1 12
6 1 0
# add all the other columns back with join
df4 = df[cols].join(df3)
print (df4)
value action price
0 10 2 0
1 31 3 36
2 661 3 9
3 1 1 3
4 51 3 0
5 61 1 12
6 551 1 0
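The same row-wise pick can also be written with np.take_along_axis instead of the fancy indexing above; a small sketch reusing arr1, arr2 and idx from the answer (numpy 1.15+):
action = np.take_along_axis(arr1, idx[:, None], axis=1).ravel()
price = np.take_along_axis(arr2, idx[:, None], axis=1).ravel()
df3_alt = pd.DataFrame({'action': action, 'price': price}, index=df.index)
print(df3_alt.equals(df3))  # True - both pick the same column per row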
Related
I'm aiming to pass a groupby count of values, but only considering rows where Item and Item2 are different. The following achieves this, but it drops a Time group entirely if none of its values are different. If one or more values are present but Item and Item2 are identical, I'm hoping to return 0 instead.
import pandas as pd
df = pd.DataFrame({
'Time' : [1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,4,4,4],
'Item' : ['A','A','A','A','A','A','A','B','B','B','B','B','B','B','A','B','B','B'],
'Item2' : ['B','A','A','A','B','B','B','A','A','B','A','B','B','B','A','B','A','A'],
'Value' : [5, 6, 6, 5, 5, 6, 5, 6, 3, 1, 4, 6, 7, 4, 5, 1, 2, 3],
})
df1 = df[df['Item'] != df['Item2']].groupby(['Time']).size().reset_index(name='count')
Intended Output:
Time count
0 1 4
1 2 3
2 3 0
3 4 2
Edit 2:
df = pd.DataFrame({
'Time' : ['1','1','1','1','1','1','1','2','2','2','2','2','2','2','3','4','4','4'],
'Item' : ['A','A','A','A','A','A','A','B','B','B','B','B','B','B','A','B','B','B'],
'Item2' : ['B','A','A','A','B','B','B','A','A','B','A','B','B','B','A','B','A','A'],
'Value' : [2, 6, 6, 5, 3, 3, 4, 6, 5, 1, 4, 6, 7, 4, 5, 1, 2, 3],
})
df1 = (df.assign(new = df['Item'] != df['Item2'])
.groupby('Time')['new']
.mean()
.reset_index(name='avg')
)
Intended Output:
Time avg
0 1 3.0
1 2 5.0
2 3 0.0
3 4 2.5
The idea is not to filter, but to count the True values per group with sum; here the Series df['Time'] is passed to groupby:
df1 = (df['Item'] != df['Item2']).groupby(df['Time']).sum().reset_index(name='count')
print (df1)
Time count
0 1 4
1 2 3
2 3 0
3 4 2
Another similar solution is to create a new helper column and aggregate it:
df1 = (df.assign(new = df['Item'] != df['Item2'])
.groupby('Time')['new']
.sum()
.reset_index(name='count'))
EDIT: You can replace non-matched values with missing values using Series.where and then replace the missing values with fillna:
df1 = (df.assign(new = df['Value'].where(df['Item'] != df['Item2']))
.groupby('Time')['new']
.mean()
.fillna(0)
.reset_index(name='avg')
)
print (df1)
Time avg
0 1 3.0
1 2 5.0
2 3 0.0
3 4 2.5
An alternative is to use Series.reindex with the unique values of the original Time column:
df1 = (df[df['Item'] != df['Item2']]
.groupby(['Time'])['Value']
.mean()
.reindex(df['Time'].unique(), fill_value=0)
.reset_index(name='avg'))
Have a look at pivot tables in pandas:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Time' : [1,1,1,1,1,1,1,2,2,2,2,2,2,2,3],
'Item' : ['A','A','A','A','A','A','A','B','B','B','B','B','B','B','A'],
'Item2' : ['B','A','A','A','B','B','B','A','A','B','A','B','B','B','A'],
'Value' : [5, 6, 6, 5, 5, 6, 5, 6, 3, 1, 4, 6, 7, 4, 5],
})
# this gives you just the rows where there is a difference
df2 = df[df['Item'] != df['Item2']]
# then count the entries for each Time
pd.pivot_table(df2,index='Time',aggfunc='count')
This gives you the table
Item Item2 Value
Time
1 4 4 4
2 3 3 3
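If you also need the Time groups where nothing differs (Time 3 here), one option, sketched below with a hypothetical pt variable, is to reindex the pivot result with the unique Time values and fill with 0:
pt = pd.pivot_table(df2, index='Time', aggfunc='count')
pt = pt.reindex(df['Time'].unique(), fill_value=0)
print(pt)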
I have a dataset where I group the monthly data by the same id:
temp1 = listvar[2].groupby(["id", "month"])["value"].mean()
This results in this:
id month
SN10380 1 -9.670370
2 -8.303571
3 -4.932143
4 0.475862
5 5.732000
...
SN99950 8 6.326786
9 4.623529
10 1.290566
11 -0.867273
12 -2.485455
I then want each month and the corresponding value as its own column on the same ID, like this:
id month_1 month_2 month_3 month_4 .... month_12
SN10380 -9.670370 -8.303571 .....
SN99950
I have tried different solutions using apply(), transform() and agg(), but haven't been able to produce the wanted output.
You could use unstack. Here's the sample code:
import pandas as pd
df = pd.DataFrame({
"id": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
"month": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
"value": [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
})
temp1 = df.groupby(["id", "month"])["value"].mean()
temp1.unstack()
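If you also want the month_1, month_2, ... column names from the question, a possible follow-up (a sketch assuming the sample df above, using a hypothetical wide variable):
wide = temp1.unstack().add_prefix('month_').reset_index()
print(wide)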
I hope it helps!
I'm working with a data frame containing 582,260 rows and 24 columns. Each row is a 24-hour time series (a vector of length 24), and 20 rows (days) correspond to id_1, 20 to id_2, and so on up to id_N. I would like to concatenate all 20 rows of id_1 into a single row, so that the concatenated time series becomes a vector of length 480 (20 days * 24 hrs/day), and repeat this operation for id_1 through id_N.
A very reduced and reproducible version of my data frame is shown below (the ID column should be an index, but for iteration purposes I reset it):
import pandas as pd

df = pd.DataFrame([['id1', 1, 1, 3, 4, 1], ['id1', 0, 1, 5, 2, 1], ['id1', 3, 4, 5, 0, 0],
['id2', 1, 1, 8, 0, 6], ['id2', 5, 3, 1, 1, 2], ['id2', 5, 4, 5, 2, 7]],
columns = ['ID', 'h0', 'h1', 'h2', 'h3', 'h4'] )
I've tried the following function to iterate over the rows of the data frame, but it doesn't give me the expected output.
def concatenation(df):
    for i, row in df.iterrows():
        if df.ix[i]['ID'] == df.ix[i+1]['ID']:
            pd.concat([df], axis=1)
    return df

concatenation(df)
The expected output should look like this:
df = pd.DataFrame([['id1', 1, 1, 3, 4, 1, 0, 1, 5, 2, 1, 3, 4, 5, 0, 0],
['id2', 1, 1, 8, 0, 6, 5, 3, 1, 1, 2, 5, 4, 5, 2, 7]],
columns = ['ID', 'h0', 'h1', 'h2', 'h3', 'h4',
'h0', 'h1', 'h2', 'h3', 'h4',
'h0', 'h1', 'h2', 'h3', 'h4'])
Is there a compact and elegant way of programming this task with pandas tools?
Thank you in advance for your help.
First add a day column, then create a hierarchical index of ID and day, which then gets unstacked:
df['day'] = df.groupby('ID').cumcount()
df = df.set_index(['ID','day'])
res = df.unstack()
Intermediate result:
h0 h1 h2 h3 h4
day 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2
ID
id1 1 0 3 1 1 4 3 5 5 4 2 0 1 1 0
id2 1 5 5 1 3 4 8 1 5 0 1 2 6 2 7
Now we flatten the index and re-order the columns as requested:
res = res.set_axis([f"{y}{x}" for x, y in res.columns], axis=1)
res = res.reindex(sorted(res.columns), axis=1)
Final result:
0h0 0h1 0h2 0h3 0h4 1h0 1h1 1h2 1h3 1h4 2h0 2h1 2h2 2h3 2h4
ID
id1 1 1 3 4 1 0 1 5 2 1 3 4 5 0 0
id2 1 1 8 0 6 5 3 1 1 2 5 4 5 2 7
You can use defaultdict(list) and the .extend() method to store all the values in the exact order and create the same output as you defined.
But this would require a crude Python loop, which is not recommended for large dataframes.
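A minimal sketch of that idea, assuming the small sample df from the question (for illustration only, not something I'd run on the full 582,260-row frame):
from collections import defaultdict

rows = defaultdict(list)
for _, row in df.iterrows():
    # append h0..h4 of each day to the list collected for this ID
    rows[row['ID']].extend(row.drop('ID').tolist())

flat = pd.DataFrame.from_dict(rows, orient='index')
flat.index.name = 'ID'
print(flat)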
Let's assume that I have the following data-frame:
import numpy as np
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 1, 2, 2],
                   "nominal": [1, np.nan, 1, 1, np.nan],
                   "numeric1": [3, np.nan, np.nan, 7, np.nan],
                   "numeric2": [2, 3, np.nan, 2, np.nan],
                   "numeric3": [np.nan, 2, np.nan, np.nan, 3],
                   "date": [pd.Timestamp(2005, 6, 22), pd.Timestamp(2006, 2, 11), pd.Timestamp(2008, 9, 13),
                            pd.Timestamp(2009, 5, 12), pd.Timestamp(2010, 5, 9)]})
As output, I want a data frame that indicates, for each numeric column and each id, the number of days that have passed since a non-NaN value was last seen in that column. If a column has a value on the corresponding date, or if a column hasn't had a value yet at the start of a new id, the value should be 0. With that said, the output data frame should be:
output_df = pd.DataFrame({"numeric1_delta": [0, 234, 1179, 0, 362], "numeric2_delta": [0, 0, 945, 0, 362], "numeric3_delta": [0, 0, 945, 0, 0]})
Looking forward to your answers!
You can group by the cumsum of the non-null flags and then subtract the first date of each group:
In [11]: df.numeric1.notnull().cumsum()
Out[11]:
0 1
1 1
2 1
3 2
4 2
Name: numeric1, dtype: int64
In [12]: df.groupby(df.numeric1.notnull().cumsum()).date.transform(lambda x: x.iloc[0])
Out[12]:
0 2005-06-22
1 2005-06-22
2 2005-06-22
3 2009-05-12
4 2009-05-12
Name: date, dtype: datetime64[ns]
In [13]: df.date - df.groupby(df.numeric1.notnull().cumsum()).date.transform(lambda x: x.iloc[0])
Out[13]:
0 0 days
1 234 days
2 1179 days
3 0 days
4 362 days
Name: date, dtype: timedelta64[ns]
For multiple columns:
ncols = [col for col in df.columns if col.startswith("numeric")]
for c in ncols:
    df[c + "_delta"] = df.date - df.groupby(df[c].notnull().cumsum()).date.transform('first')
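If integer day counts are wanted, as in the expected output_df, the resulting timedeltas can be converted with .dt.days, for example:
for c in ncols:
    df[c + "_delta"] = df[c + "_delta"].dt.days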
So I created this post regarding my problem 2 days ago and thankfully got an answer.
I have data made of 20 rows and 2500 columns. Each column is a unique product and the rows are a time series of measurement results. So each product is measured 20 times and there are 2500 products.
This time I want to know for how many consecutive rows my measurement result can stay above a specific threshold.
In other words: I want to count the number of consecutive values that are above a value, let's say 5.
A = [1, 2, 6, 8, 7, 3, 2, 3, 6, 10, 2, 1, 0, 2]
The qualifying run here is 6, 8, 7, so according to what I defined above I should get NumofConsFeature = 3 as the result (taking the max if there is more than one run that meets the condition).
I thought of filtering using .gt, then getting the indexes and using a loop afterwards in order to detect the consecutive index numbers but couldn't make it work.
In 2nd phase, I'd like to know the index of the first value of my consecutive series. For the above example, that would be 3.
But I have no idea how to do that one.
Thanks in advance.
Here's another answer using only Pandas functions:
import pandas as pd

A = [1, 2, 6, 8, 7, 3, 2, 3, 6, 10, 2, 1, 0, 2]
a = pd.DataFrame(A, columns=['foo'])
a['is_large'] = (a.foo > 5)
a['crossing'] = (a.is_large != a.is_large.shift()).cumsum()
a['count'] = a.groupby(['is_large', 'crossing']).cumcount(ascending=False) + 1
a.loc[a.is_large == False, 'count'] = 0
which gives
foo is_large crossing count
0 1 False 1 0
1 2 False 1 0
2 6 True 2 3
3 8 True 2 2
4 7 True 2 1
5 3 False 3 0
6 2 False 3 0
7 3 False 3 0
8 6 True 4 2
9 10 True 4 1
10 2 False 5 0
11 1 False 5 0
12 0 False 5 0
13 2 False 5 0
From there on you can easily find the maximum and its index.
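For example, with the frame above:
print(a['count'].max())     # 3  - length of the longest run above 5
print(a['count'].idxmax())  # 2  - index of the first value of that run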
There is a simple way to do that.
Let's say your list is: A = [1, 2, 6, 8, 7, 6, 8, 3, 2, 3, 6, 10, 6, 7, 8, 2, 1, 0, 2]
And you want to find how many consecutive runs have values of at least 6 and a length of 5. For instance, here your answer is 2: there are two runs whose values are at least 6 and whose length is 5. In python and pandas we do that like below:
condition = (df.wanted_row >= 6) & \
            (df.wanted_row.shift(-1) >= 6) & \
            (df.wanted_row.shift(-2) >= 6) & \
            (df.wanted_row.shift(-3) >= 6) & \
            (df.wanted_row.shift(-4) >= 6)
consecutive_count = df[condition].count().head(1).iloc[0]
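A more general sketch of the same fixed-window idea, using rolling so the window length becomes a variable (still assuming the hypothetical wanted_row column):
window = 5
at_least_6 = (df.wanted_row >= 6).astype(int)
# a window sums to `window` only when every value in it meets the condition
consecutive_count = int((at_least_6.rolling(window).sum() == window).sum())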
Here's one with maxisland_start_len_mask -
import numpy as np

# https://stackoverflow.com/a/52718782/ #Divakar
def maxisland_start_len_mask(a, fillna_index=-1, fillna_len=0):
    # a is a boolean array
    pad = np.zeros(a.shape[1], dtype=bool)
    mask = np.vstack((pad, a, pad))

    mask_step = mask[1:] != mask[:-1]
    idx = np.flatnonzero(mask_step.T)
    island_starts = idx[::2]
    island_lens = idx[1::2] - idx[::2]
    n_islands_percol = mask_step.sum(0) // 2

    bins = np.repeat(np.arange(a.shape[1]), n_islands_percol)
    scale = island_lens.max() + 1

    scaled_idx = np.argsort(scale * bins + island_lens)
    grp_shift_idx = np.r_[0, n_islands_percol.cumsum()]
    max_island_starts = island_starts[scaled_idx[grp_shift_idx[1:] - 1]]

    max_island_percol_start = max_island_starts % (a.shape[0] + 1)

    valid = n_islands_percol != 0
    cut_idx = grp_shift_idx[:-1][valid]
    max_island_percol_len = np.maximum.reduceat(island_lens, cut_idx)

    out_len = np.full(a.shape[1], fillna_len, dtype=int)
    out_len[valid] = max_island_percol_len
    out_index = np.where(valid, max_island_percol_start, fillna_index)
    return out_index, out_len

def maxisland_start_len(a, trigger_val, comp_func=np.greater):
    # a is 2D array as the data
    mask = comp_func(a, trigger_val)
    return maxisland_start_len_mask(mask, fillna_index=-1, fillna_len=0)
Sample run -
In [169]: a
Out[169]:
array([[ 1, 0, 3],
[ 2, 7, 3],
[ 6, 8, 4],
[ 8, 6, 8],
[ 7, 1, 6],
[ 3, 7, 8],
[ 2, 5, 8],
[ 3, 3, 0],
[ 6, 5, 0],
[10, 3, 8],
[ 2, 3, 3],
[ 1, 7, 0],
[ 0, 0, 4],
[ 2, 3, 2]])
# Per column results
In [170]: row_index, length = maxisland_start_len(a, 5)
In [172]: row_index
Out[172]: array([2, 1, 3])
In [173]: length
Out[173]: array([3, 3, 4])
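Applied to the 1-D list from the question, one way (a sketch) is to feed it in as a single-column array:
A = np.array([1, 2, 6, 8, 7, 3, 2, 3, 6, 10, 2, 1, 0, 2])
start, length = maxisland_start_len(A[:, None], 5)
print(start, length)  # [2] [3] -> longest run starts at index 2 and has length 3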
You can apply diff() on your Series, and then just count the number of consecutive entries where the difference is 1 and the actual value is above your cutoff. The largest count is the maximum number of consecutive values.
First compute diff():
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 6, 7, 8, 3, 2, 3, 6, 10, 2, 1, 0, 2]})
df['b'] = df.a.diff()
df
a b
0 1 NaN
1 2 1.0
2 6 4.0
3 7 1.0
4 8 1.0
5 3 -5.0
6 2 -1.0
7 3 1.0
8 6 3.0
9 10 4.0
10 2 -8.0
11 1 -1.0
12 0 -1.0
13 2 2.0
Now count consecutive sequences:
above = 5
n_consec = 1
max_n_consec = 1

for a, b in df.values[1:]:
    if (a > above) & (b == 1):
        n_consec += 1
    else:  # check for new max, then start again from 1
        max_n_consec = max(n_consec, max_n_consec)
        n_consec = 1

max_n_consec
3
Here's how I did it using numpy:
import pandas as pd
import numpy as np
df = pd.DataFrame({"a":[1, 2, 6, 7, 8, 3, 2, 3, 6, 10, 2, 1, 0, 2]})
consecutive_steps = 2
marginal_price = 5
assertions = [(df.loc[:, "a"].shift(-i) < marginal_price) for i in range(consecutive_steps)]
condition = np.all(assertions, axis=0)
consecutive_count = df.loc[condition, :].count()
print(consecutive_count)
which yields 6.