how to add a DataFrame to some columns of another DataFrame

I want to add a DataFrame a (containing a load profile) to some of the columns of another DataFrame b (which also contains one load profile per column). So some columns (load profiles) of b should be overlaid with the load profile of a.
So let's say my DataFrames look like:
a:
P[kW]
0 0
1 0
2 0
3 8
4 8
5 0
b:
P1[kW] P2[kW] ... Pn[kW]
0 2 2 2
1 3 3 3
2 3 3 3
3 4 4 4
4 2 2 2
5 2 2 2
Now I want to overlay some columns of b:
b.iloc[:, [1]] += a.iloc[:, 0]
I would expect this:
b:
P1[kW] P2[kW] ... Pn[kW]
0 2 2 2
1 3 3 3
2 3 3 3
3 4 12 4
4 2 10 2
5 2 2 2
but what I actually get:
b:
P1[kW] P2[kW] ... Pn[kW]
0 2 nan 2
1 3 nan 3
2 3 nan 3
3 4 nan 4
4 2 nan 2
5 2 nan 2
That's not exactly what my code and data look like, but the principle is the same as in this abstract example.
Any guesses as to what the problem could be?
Many thanks for any help in advance!
EDIT:
I actually have to overlay more than one column. Another example:
load = [0,0,0,0,0,0,0]
data = pd.DataFrame(load)
for i in range(1, 10):
data[i] = data[0]
data
overlay = pd.DataFrame([0,0,0,0,6,6,0])
overlay
data.iloc[:, [1,2,4,5,7,8]] += overlay.iloc[:, 0]
data
WHAT??! The result is completely crazy. Columns 1 and 2 aren't changed at all. Columns 4 and 5 are changed, but in every row. Columns 7 and 8 are nans. What am I missing?
That is what I would expect the result to look like:

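Pass the column index 1 of DataFrame b as a scalar, not as a list. With a list, b.iloc[:, [1]] is a DataFrame, and adding a Series to a DataFrame matches the Series index against the DataFrame's column labels, which gives NaN here. With a scalar, both sides are Series and they align on the row index.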
Code
b.iloc[:, 1] += a.iloc[:, 0]
b
Output
P1[kW] P2[kW] Pn[kW]
0 2 2 2
1 3 3 3
2 3 3 3
3 4 12 4
4 2 10 2
5 2 2 2
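The NaN values in the question most likely come from label alignment. Here is a minimal sketch that rebuilds a and b from the question and contrasts the two behaviours. It assumes three columns, since the columns elided by "..." are unknown, and the names misaligned, fixed and cols are only illustrative.

```python
import pandas as pd

a = pd.DataFrame({"P[kW]": [0, 0, 0, 8, 8, 0]})
b = pd.DataFrame({
    "P1[kW]": [2, 3, 3, 4, 2, 2],
    "P2[kW]": [2, 3, 3, 4, 2, 2],
    "Pn[kW]": [2, 3, 3, 4, 2, 2],
})

# DataFrame + Series: the Series index (0..5) is matched against the
# DataFrame's column labels ('P2[kW]'), so nothing lines up.
misaligned = b.iloc[:, [1]] + a.iloc[:, 0]
print(misaligned.isna().all().all())

# Series + Series: both operands align on the row index.
fixed = b.copy()
fixed["P2[kW]"] = fixed["P2[kW]"] + a["P[kW]"]
print(fixed)

# For a list of columns, axis=0 tells pandas to align the Series with rows.
cols = ["P2[kW]"]
also_fixed = b.copy()
also_fixed[cols] = also_fixed[cols].add(a["P[kW]"], axis=0)
```

The axis=0 form keeps index alignment, so it also behaves sensibly when the two frames do not share exactly the same row labels.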
Edit
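This seems to be what you are looking for, i.e. summing certain columns of the data df with the overlay df. Using .values strips the index from overlay, so the single overlay column is broadcast across the selected columns instead of being aligned by label.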
Two Options
Option 1
cols=[1,2,4,5,7,8]
data[cols] = data[cols] + overlay.values
data
Option 2, if we want to use iloc
cols=[1,2,4,5,7,8]
data[cols] = data.iloc[:,cols] + overlay.iloc[:].values
data
Output
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0
4 0 6 6 0 6 6 0 6 6 0
5 0 6 6 0 6 6 0 6 6 0
6 0 0 0 0 0 0 0 0 0 0
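The "crazy" result in the question is consistent with label alignment: overlay's row labels 0..6 are matched against the column labels 1, 2, 4, 5, 7, 8. Labels 1 and 2 map to overlay values of 0, labels 4 and 5 map to 6 (added to every row), and labels 7 and 8 have no match, which gives NaN. The sketch below reproduces that and shows an alternative to .values that keeps index alignment, using DataFrame.add with axis=0. The name misaligned is only illustrative.

```python
import pandas as pd

data = pd.DataFrame({i: [0] * 7 for i in range(10)})
overlay = pd.DataFrame([0, 0, 0, 0, 6, 6, 0])
cols = [1, 2, 4, 5, 7, 8]

# Default alignment: overlay's row labels 0..6 are matched to column labels.
misaligned = data[cols] + overlay.iloc[:, 0]
print(misaligned[4].tolist())       # column 4 receives overlay[4] in every row
print(misaligned[7].isna().all())   # column 7 has no matching label

# axis=0 asks pandas to align the Series with the rows instead.
data[cols] = data[cols].add(overlay.iloc[:, 0], axis=0)
print(data)
```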

Related

Increment the value in a new column based on a condition using an existing column

I have a pandas dataframe with two columns:
temp_1 flag
1 0
1 0
1 0
2 0
3 0
4 0
4 1
4 0
5 0
6 0
6 1
6 0
and I wanted to create a new column named "final" based on :
if "flag" has a value of 1, then "final" is "temp_1" incremented by 1 for that row and for all following rows. If a value of 1 appears again in the flag column, "final" is incremented by 1 once more from that row onwards; please refer to the expected output.
I have tried using .cumsum() with filters but am not getting the desired result.
Expected output
temp_1 flag final
1 0 1
1 0 1
1 0 1
2 0 2
3 0 3
4 0 4
4 1 5
4 0 5
5 0 6
6 0 7
6 1 8
6 0 8

Just do cumsum for flag:
>>> df['final'] = df['temp_1'] + df['flag'].cumsum()
>>> df
temp_1 flag final
0 1 0 1
1 1 0 1
2 1 0 1
3 2 0 2
4 3 0 3
5 4 0 4
6 4 1 5
7 4 0 5
8 5 0 6
9 6 0 7
10 6 1 8
11 6 0 8
>>>
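For reference, a self-contained version of the same idea, with the frame rebuilt from the question's table. The running total of flag counts how many 1s have been seen up to and including each row, and that count is the offset added to temp_1.

```python
import pandas as pd

df = pd.DataFrame({
    "temp_1": [1, 1, 1, 2, 3, 4, 4, 4, 5, 6, 6, 6],
    "flag":   [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0],
})

# Number of flags seen so far, including the current row.
offset = df["flag"].cumsum()
df["final"] = df["temp_1"] + offset
print(df)
```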

Difference of one multi index level

For a MultiIndex with a repeating level, how can I calculate the differences with another level of the index, effectively ignoring it?
Let me explain in code.
>>> ix = pd.MultiIndex.from_product([(0, 1, 2), (0, 1, 2, 3)])
>>> df = pd.DataFrame([5]*4 + [4]*4 + [3, 2, 1, 0], index=ix)
>>> df
     0
0 0  5
  1  5
  2  5
  3  5
1 0  4
  1  4
  2  4
  3  4
2 0  3
  1  2
  2  1
  3  0
Now by some operation I'd like to subtract the last set of values (2, 0:4) from the whole data frame. I.e. df - df.loc[2] to produce this:
     0
0 0  2
  1  3
  2  4
  3  5
1 0  1
  1  2
  2  3
  3  4
2 0  0
  1  0
  2  0
  3  0
But the statement produces an error. df - df.loc[2:3] does not, but in addition to the trailing zeros only NaNs are produced - naturally of course because the indices don't match.
How could this be achieved?
I realised that the index level is precisely the problem. So I got a bit closer.
>>> df.droplevel(0) - df.loc[2]
0
0 2
0 1
0 0
1 3
1 2
1 0
2 4
2 3
2 0
3 5
3 4
3 0
Still not quite what I want. But I don't know if there's a convenient way of achieving what I'm after.

This can be done with unstack and stack:
new_df = df.unstack()
new_df.sub(new_df.loc[2]).stack()
Output:
     0
0 0  2
  1  3
  2  4
  3  5
1 0  1
  1  2
  2  3
  3  4
2 0  0
  1  0
  2  0
  3  0
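A self-contained sketch of the same unstack/stack idea, written against the single column as a Series so that the wide table has plain column labels 0..3. This is a variant of the answer, not the exact code above, and the names s, wide and result are only illustrative.

```python
import pandas as pd

ix = pd.MultiIndex.from_product([(0, 1, 2), (0, 1, 2, 3)])
df = pd.DataFrame([5] * 4 + [4] * 4 + [3, 2, 1, 0], index=ix)

s = df[0]
# Rows: first index level, columns: second index level.
wide = s.unstack()
# Subtract the row for first-level key 2 from every row, matching on columns.
result = wide.sub(wide.loc[2], axis=1).stack()
print(result)
```

Unstacking moves the second level into the columns, so the subtraction is an ordinary row-wise operation, and stack restores the original two-level index.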

Try creating a DataFrame with an identical index by mapping the last set of data onto the second index level (get_level_values(1)), so that it is repeated across the whole frame, then subtract:
df - pd.DataFrame(index=df.index,data=df.index.get_level_values(1).map(df.loc[2].squeeze()))
     0
0 0  2
  1  3
  2  4
  3  5
1 0  1
  1  2
  2  3
  3  4
2 0  0
  1  0
  2  0
  3  0
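The mapping approach can also be sketched step by step on the single column, which makes the intermediate values easier to inspect. The names s, last, offsets and result are only illustrative.

```python
import pandas as pd

ix = pd.MultiIndex.from_product([(0, 1, 2), (0, 1, 2, 3)])
df = pd.DataFrame([5] * 4 + [4] * 4 + [3, 2, 1, 0], index=ix)

s = df[0]
# Values of the last block, indexed by the second level (0..3).
last = s.loc[2]
# Look up the matching last-block value for every row via its second-level key.
offsets = s.index.get_level_values(1).map(last).to_numpy()
result = s - offsets
print(result)
```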

Padding and reshaping pandas dataframe

I have a dataframe with the following form:
data = pd.DataFrame({'ID':[1,1,1,2,2,2,2,3,3],'Time':[0,1,2,0,1,2,3,0,1],
'sig':[2,3,1,4,2,0,2,3,5],'sig2':[9,2,8,0,4,5,1,1,0],
'group':['A','A','A','B','B','B','B','A','A']})
print(data)
ID Time sig sig2 group
0 1 0 2 9 A
1 1 1 3 2 A
2 1 2 1 8 A
3 2 0 4 0 B
4 2 1 2 4 B
5 2 2 0 5 B
6 2 3 2 1 B
7 3 0 3 1 A
8 3 1 5 0 A
I want to reshape and pad the data so that each 'ID' has the same number of Time values, the signal columns (sig and sig2, shown as sig1 and sig2 in the padded frame) are padded with zeros (or with the mean value within each ID), and the group keeps the same letter value. The output after padding would be:
data_pad = pd.DataFrame({'ID':[1,1,1,1,2,2,2,2,3,3,3,3],'Time':[0,1,2,3,0,1,2,3,0,1,2,3],
'sig1':[2,3,1,0,4,2,0,2,3,5,0,0],'sig2':[9,2,8,0,0,4,5,1,1,0,0,0],
'group':['A','A','A','A','B','B','B','B','A','A','A','A']})
print(data_pad)
ID Time sig1 sig2 group
0 1 0 2 9 A
1 1 1 3 2 A
2 1 2 1 8 A
3 1 3 0 0 A
4 2 0 4 0 B
5 2 1 2 4 B
6 2 2 0 5 B
7 2 3 2 1 B
8 3 0 3 1 A
9 3 1 5 0 A
10 3 2 0 0 A
11 3 3 0 0 A
My end goal is to ultimately reshape this into something with shape (number of ID, number of time points, number of sequences {2 here}).
It seems that if I pivot data, it fills in with nan values, which is fine for the signal values, but not the groups. I am also hoping to avoid looping through data.groupby('ID'), since my actual data has a large number of groups and the looping would likely be very slow.

Here's one approach creating the new index with pd.MultiIndex.from_product and using it to reindex on the Time column:
df = data.set_index(['ID', 'Time'])
# define the new index
ix = pd.MultiIndex.from_product([df.index.levels[0],
df.index.levels[1]],
names=['ID', 'Time'])
# reindex using the above multiindex
df = df.reindex(ix, fill_value=0)
# forward fill the missing values in group
df['group'] = df.group.mask(df.group.eq(0)).ffill()
print(df.reset_index())
ID Time sig sig2 group
0 1 0 2 9 A
1 1 1 3 2 A
2 1 2 1 8 A
3 1 3 0 0 A
4 2 0 4 0 B
5 2 1 2 4 B
6 2 2 0 5 B
7 2 3 2 1 B
8 3 0 3 1 A
9 3 1 5 0 A
10 3 2 0 0 A
11 3 3 0 0 A
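A hedged variant of the reindex approach that fills each column separately and then builds the 3-D array the question names as its end goal. It assumes every ID has a row for its first Time value, so that a forward fill within each ID can recover the group. The names full, padded and arr are only illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2, 2, 3, 3],
    "Time": [0, 1, 2, 0, 1, 2, 3, 0, 1],
    "sig": [2, 3, 1, 4, 2, 0, 2, 3, 5],
    "sig2": [9, 2, 8, 0, 4, 5, 1, 1, 0],
    "group": ["A", "A", "A", "B", "B", "B", "B", "A", "A"],
})

df = data.set_index(["ID", "Time"])
# Every (ID, Time) combination, in ID-major order.
full = pd.MultiIndex.from_product(
    [df.index.levels[0], df.index.levels[1]], names=["ID", "Time"]
)
df = df.reindex(full)

# Pad the signals with zeros and carry the group letter forward within each ID.
df[["sig", "sig2"]] = df[["sig", "sig2"]].fillna(0).astype(int)
df["group"] = df.groupby(level="ID")["group"].ffill()
padded = df.reset_index()

# Shape: (number of IDs, number of time points, number of signals).
n_ids = padded["ID"].nunique()
n_times = padded["Time"].nunique()
arr = padded[["sig", "sig2"]].to_numpy().reshape(n_ids, n_times, 2)
print(arr.shape)
```

The groupby forward fill is a vectorised pandas operation, so it avoids the explicit Python loop over groups that the question wanted to avoid.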

IIUC:
(data.pivot_table(columns='Time', index=['ID','group'], fill_value=0)
.stack('Time')
.sort_index(level=['ID','Time'])
.reset_index()
)
Output:
ID group Time sig sig2
0 1 A 0 2 9
1 1 A 1 3 2
2 1 A 2 1 8
3 1 A 3 0 0
4 2 B 0 4 0
5 2 B 1 2 4
6 2 B 2 0 5
7 2 B 3 2 1
8 3 A 0 3 1
9 3 A 1 5 0
10 3 A 2 0 0
11 3 A 3 0 0

Creating a column that assigns max value of set of rows by condition to all rows in that group

I have a dataframe that looks like this:
data metadata
A 0
A 1
A 2
A 3
A 4
B 0
B 1
B 2
A 0
A 1
B 0
A 0
A 1
B 0
df.data contains two different categories, A and B. df.metadata stores a running count of the number of times a category appears consecutively before the category changes. I want to create a column consecutive_count that assigns the max value of metadata per consecutive group to every row in that group. It should look like this:
data metadata consecutive_count
A 0 4
A 1 4
A 2 4
A 3 4
A 4 4
B 0 2
B 1 2
B 2 2
A 0 1
A 1 1
B 0 0
A 0 1
A 1 1
B 0 0
Please advise. Thank you.

Method 1:
You can use transform('max') on a groupby over each consecutive run of data:
s = df.data.ne(df.data.shift()).cumsum()
df['consecutive_count'] = df.groupby(s).metadata.transform('max')
Out[96]:
data metadata consecutive_count
0 A 0 4
1 A 1 4
2 A 2 4
3 A 3 4
4 A 4 4
5 B 0 2
6 B 1 2
7 B 2 2
8 A 0 1
9 A 1 1
10 B 0 0
11 A 0 1
12 A 1 1
13 B 0 0
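A self-contained sketch of Method 1 that also prints the intermediate run labels. A new label starts whenever data differs from the previous row, so each consecutive block gets its own id. The name run_id is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "data": list("AAAAABBBAABAAB"),
    "metadata": [0, 1, 2, 3, 4, 0, 1, 2, 0, 1, 0, 0, 1, 0],
})

# True where the category changes; the cumulative sum numbers the runs.
run_id = df["data"].ne(df["data"].shift()).cumsum()
print(run_id.tolist())

# Broadcast each run's maximum back to all of its rows.
df["consecutive_count"] = df.groupby(run_id)["metadata"].transform("max")
print(df)
```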
Method 2:
Since metadata is sorted within each group, you can reverse the DataFrame and use groupby cummax:
s = df.data.ne(df.data.shift()).cumsum()
df['consecutive_count'] = df[::-1].groupby(s).metadata.cummax()
Out[101]:
data metadata consecutive_count
0 A 0 4
1 A 1 4
2 A 2 4
3 A 3 4
4 A 4 4
5 B 0 2
6 B 1 2
7 B 2 2
8 A 0 1
9 A 1 1
10 B 0 0
11 A 0 1
12 A 1 1
13 B 0 0
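A short sketch of Method 2 on the same data. It relies on metadata never decreasing within a run: walking each run backwards, the first value seen is then the run's maximum, so the cumulative maximum is constant. The name run_id is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "data": list("AAAAABBBAABAAB"),
    "metadata": [0, 1, 2, 3, 4, 0, 1, 2, 0, 1, 0, 0, 1, 0],
})

run_id = df["data"].ne(df["data"].shift()).cumsum()

# Reverse the rows, take the running maximum per run, and let index
# alignment put the values back in the original order on assignment.
df["consecutive_count"] = df[::-1].groupby(run_id)["metadata"].cummax()
print(df)
```

If metadata could go down within a run, transform('max') from Method 1 would be the safer choice.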

Finding efficiently pandas (part of) rows with unique values

I have a pandas DataFrame with one row per individual/record. Each row holds a property value and its evolution over time (0 to N).
In the following example, a schedule contains the estimated values of a variable 'property' for a number of entities from day 1 to day 10.
I want to filter the entities whose values are constant over a given period and get those values.
csv=',property,1,2,3,4,5,6,7,8,9,10\n0,100011,0,0,0,0,3,3,3,3,3,0\n1,100012,0,0,0,0,2,2,2,8,8,0\n2, \
100012,0,0,0,0,2,2,2,2,2,0\n3,100012,0,0,0,0,0,0,0,0,0,0\n4,100011,0,0,0,0,2,2,2,2,2,0\n5, \
180011,0,0,0,0,2,2,2,2,2,0\n6,110012,0,0,0,0,0,0,0,0,0,0\n7,110011,0,0,0,0,3,3,3,3,3,0\n8, \
110012,0,0,0,0,3,3,3,3,3,0\n9,110013,0,0,0,0,0,0,0,0,0,0\n10,100011,0,0,0,0,3,3,3,3,4,0'
from io import StringIO
import numpy as np
import pandas as pd
schedule = pd.read_csv(StringIO(csv), index_col=0)
print(schedule)
property 1 2 3 4 5 6 7 8 9 10
0 100011 0 0 0 0 3 3 3 3 3 0
1 100012 0 0 0 0 2 2 2 8 8 0
2 100012 0 0 0 0 2 2 2 2 2 0
3 100012 0 0 0 0 0 0 0 0 0 0
4 100011 0 0 0 0 2 2 2 2 2 0
5 180011 0 0 0 0 2 2 2 2 2 0
6 110012 0 0 0 0 0 0 0 0 0 0
7 110011 0 0 0 0 3 3 3 3 3 0
8 110012 0 0 0 0 3 3 3 3 3 0
9 110013 0 0 0 0 0 0 0 0 0 0
10 100011 0 0 0 0 3 3 3 3 4 0
I want to find the records/individuals for whom property has not changed during a given period, and the corresponding unique values.
Here is what I came up with: I want to locate individuals with property in [100011, 100012, 1100012] between days 7 and 10
props = [100011, 100012, 1100012]
begin = 7
end = 10
res = schedule['property'].isin(props)
# positional column slice: positions 7..9 hold days '7', '8' and '9'
df = schedule.loc[res].iloc[:, begin:end]
print("df \n%s " % df)
We have :
df
7 8 9
0 3 3 3
1 2 8 8
2 2 2 2
3 0 0 0
4 2 2 2
10 3 3 4
res = df.apply(lambda x: np.unique(x).size == 1, axis=1)
print("res : %s\n" % res)
df_f = df.loc[res]
print("df filtered %s \n" % df_f)
res = pd.Series(df_f.values.ravel()).unique().tolist()
print("unique values : %s " % res)
Giving :
res :
0 True
1 False
2 True
3 True
4 True
10 False
dtype: bool
df filtered
7 8 9
0 3 3 3
2 2 2 2
3 0 0 0
4 2 2 2
unique values : [3, 2, 0]
As those operations need to be run many times (millions of times) on a DataFrame with a million rows, I need them to run as quickly as possible.
(#MaxU) : schedule can be seen as a database/repository that is updated many times. The repository is then also queried many times for unique values.
Would you have any ideas for improvements or alternative approaches?

Given your df
7 8 9
0 3 3 3
1 2 8 8
2 2 2 2
3 0 0 0
4 2 2 2
10 3 3 4
You can simplify your code to:
df_f = df[df.apply(pd.Series.nunique, axis=1) == 1]
print(df_f)
7 8 9
0 3 3 3
2 2 2 2
3 0 0 0
4 2 2 2
And the final step to:
res = df_f.iloc[:,0].unique().tolist()
print(res)
[3, 2, 0]
It's not fully vectorised, but maybe this clarifies things a bit towards that?
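One possible fully vectorised variant, which is not part of the answer above but a swapped-in technique: compare every column with the first column and keep the rows where all comparisons hold. The sketch rebuilds the small df from the question, and the name constant is only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[3, 3, 3], [2, 8, 8], [2, 2, 2], [0, 0, 0], [2, 2, 2], [3, 3, 4]],
    index=[0, 1, 2, 3, 4, 10],
    columns=["7", "8", "9"],
)

# A row is constant when every column equals the first column.
constant = df.eq(df.iloc[:, 0], axis=0).all(axis=1)
df_f = df[constant]

# All columns are equal in the kept rows, so the first one is enough.
res = df_f.iloc[:, 0].unique().tolist()
print(res)
```

This avoids calling a Python function once per row, which is usually the main cost of apply with axis=1 on large frames.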
