I want to fill df1 using values from df2. I could do it with a nested loop, but that takes far too long.
Is there a smarter way to do this?
P.S. The dataframes are around 8000 rows by 8000 columns.
df1 initially looks like this:
   A  B  C  D
A  0  0  0  0
B  0  0  0  0
C  0  0  0  0
D  0  0  0  0
df2 looks like this:
   P  Q  R  S  T
P  1  5  7  5  3
Q  5  6  2  8  5
R  3  5  4  9  3
S  9  4  5  0  8
T  2  9  4  2  1
Now there is a correspondence list between the indices of df1 and df2:
df1 df2
A P
B Q
C R
D S
B T
df1 should be filled like this:
    A   B  C   D
A   1   8  7   5
B   7  21  6  10
C   3   8  4   9
D   9  12  5   0
Since 'B' occurs twice in the list, the values from 'Q' and 'T' are added together.
Thank you in advance.
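For reference, the nested-loop baseline the question alludes to might look roughly like this (my sketch, not the asker's actual code; the correspondence is assumed to be stored as a list of (df1_label, df2_label) pairs):

import pandas as pd

labels1, labels2 = list('ABCD'), list('PQRST')
df1 = pd.DataFrame(0, index=labels1, columns=labels1)
df2 = pd.DataFrame([[1, 5, 7, 5, 3],
                    [5, 6, 2, 8, 5],
                    [3, 5, 4, 9, 3],
                    [9, 4, 5, 0, 8],
                    [2, 9, 4, 2, 1]], index=labels2, columns=labels2)

# correspondence as (df1_label, df2_label) pairs; note that 'B' appears twice (Q and T)
pairs = [('A', 'P'), ('B', 'Q'), ('C', 'R'), ('D', 'S'), ('B', 'T')]

# naive fill: every df1 cell accumulates all matching df2 cells
# with ~8000 labels this means tens of millions of Python-level lookups, hence the slowness
for r1, r2 in pairs:
    for c1, c2 in pairs:
        df1.loc[r1, c1] += df2.loc[r2, c2]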
You could try changing the row and column names in df1 (based on the correspondence with df2); for the cases with multiple correspondences (like B) you could first name them B1, B2, etc., and then sum them together:
> di
{'Q': 'B1', 'P': 'A', 'S': 'D', 'R': 'C', 'T': 'B2'}
> df1 = df2.copy()
> df1.columns = [di[c] for c in df2.columns]
> df1.index = [di[c] for c in df2.index]
> ## sum B1,B2 column wise
> df1['B'] = df1.B1 + df1.B2
> ## sum B1,B2 row wise
> df1.loc["B", :] = df1.loc["B1"] + df1.loc["B2"]
> ## subset with original index and column names
> df1[["A", "B", "C", "D"]].loc[["A", "B", "C", "D"]]
##output
A B C D
A 1.0 8.0 7.0 5.0
B 7.0 21.0 6.0 10.0
C 3.0 8.0 4.0 9.0
D 9.0 12.0 5.0 0.0
You can also stack df2 into a Series, so that the columns become an inner index (level_1) of the Series.
Then replace the indices with {'P': 'A', 'Q': 'B', 'R': 'C', 'S': 'D', 'T': 'B'}.
Use groupby with sum to add the values that share the same indices, then unstack to turn the inner index level back into columns.
amap = {'P': 'A', 'Q': 'B', 'R': 'C', 'S': 'D', 'T': 'B'}
obj2 = df2.stack().reset_index()
for col in ['level_0', 'level_1']:
    obj2[col] = obj2[col].map(amap)
df1 = obj2.groupby(['level_0', 'level_1'])[0].sum().unstack()
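A further option (my own sketch, not taken from the answers above) is to rename both axes with the mapping and let groupby sum the duplicated labels directly; this avoids materialising the roughly 64-million-row stacked Series an 8000 x 8000 frame would produce:

amap = {'P': 'A', 'Q': 'B', 'R': 'C', 'S': 'D', 'T': 'B'}

# relabel rows and columns, then collapse duplicate labels by summing:
# first along the rows, then (via a transpose) along the columns
out = df2.rename(index=amap, columns=amap)
out = out.groupby(level=0).sum()        # sum rows sharing a label (Q + T -> B)
out = out.T.groupby(level=0).sum().T    # sum columns sharing a label

On the example data this reproduces the 4x4 result shown above.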
Related
I have a dataframe where column 'A' holds the messages and column 'B' flags the sender: 1 is the client, 2 is the admin.
I need to merge consecutive client messages (rows where B is 1) and, likewise, consecutive admin responses (rows where B is 2) across the dataframe.
df1 = pd.DataFrame({'A' : ['a', 'b', 'c', 'd', 'e', 'f', 'h', 'j', 'de', 'be'],
'B' : [1, 1, 2, 1, 1, 1, 2, 2, 1, 2]})
df1
A B
0 a 1
1 b 1
2 c 2
3 d 1
4 e 1
5 f 1
6 h 2
7 j 2
8 de 1
9 be 2
In the end I need to get this dataframe:
df2 = pd.DataFrame({'A' : ['a, b', 'd, e, f', 'de'],
'B' : ['c', 'h, j', 'be' ]})
Out:
A B
0 a,b c
1 d,e,f h,j
2 de be
I do not know how to do this
Create groups from consecutive values in B (the trick: compare the values with their shifted version and take the cumulative sum), then aggregate B with first and join the A values. A helper column is also created for the pivoting done by DataFrame.pivot in the next step.
The solution works as long as the 1,2 pairs occur in sequential order, possibly with duplicates:
df = (df1.groupby(df1['B'].ne(df1['B'].shift()).cumsum())
.agg(B = ('B','first'), A= ('A', ','.join))
.assign(C = lambda x: x['B'].eq(1).cumsum()))
print (df)
B A C
B
1 1 a,b 1
2 2 c 1
3 1 d,e,f 2
4 2 h,j 2
5 1 de 3
6 2 be 3
df = (df.pivot(index='C', columns='B', values='A')
        .rename(columns={1: 'A', 2: 'B'})
        .reset_index(drop=True)
        .rename_axis(None, axis=1))
print (df)
A B
0 a,b c
1 d,e,f h,j
2 de be
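As an illustrative aside (not part of the original answer), the shift/cumsum trick can be seen in isolation on the question's df1: a new group starts whenever B differs from the previous row, and the cumulative sum of those change points yields exactly the group labels 1 to 6 printed above.

g = df1['B'].ne(df1['B'].shift()).cumsum()
print(g.tolist())   # [1, 1, 2, 3, 3, 3, 4, 4, 5, 6]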
I have two data sets as follows:
df1 = pd.DataFrame(np.array([[10, 20, 30, 40],
[11, 21, 31, 41]]), columns = ['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(np.array([0, 1, 0, 1]).reshape(1, -1), columns = ['A', 'B', 'C', 'D'])
What I want is: if an item of df2 is greater than 0.5, the corresponding item of df1 should become 0. After running the code, df1 will be:
print(df1)
A B C D
10 0 30 0
11 0 31 0
I tried using
df1[df2>= 0.5] = 0
I think you should use pandas.DataFrame.where(), after bringing df2 to the same shape as df1. Please note that df.where() replaces the values where the condition is not met, which is the reason why >= is changed to <.
df1 = df1.where(df2<0.5, 0)
>>> df1
A B C D
0 10 0 30 0
1 11 0 31 0
If you have trouble extending df2, you can use this:
df2 = pd.DataFrame([[0, 1, 0, 1]], columns = ['A', 'B', 'C', 'D'])
>>>df2
A B C D
0 0 1 0 1
n = 1 # df1.shape[0] - 1
df2 = pd.concat([df2] + [df2.loc[[0]]] * n, ignore_index=True)  # DataFrame.append was removed in pandas 2.x
>>> df2
A B C D
0 0 1 0 1
1 0 1 0 1
Since both of the data frames have the same number of columns, the where() method of the pandas DataFrame can get the job done, i.e.:
>>> df1.where(df2 < 0.5)
A B C D
0 10.0 NaN 30.0 NaN
1 NaN NaN NaN NaN
By default, if the condition evaluates to False in the where() method, the position is replaced with NaN, and not in place. We can change the replacement value by setting the other argument to 0 instead of its default, and to make the changes in place we set inplace=True.
>>> df1.where(df2 < 0.5, other=0, inplace=True)
>>> df1
A B C D
0 10 0 30 0
1 0 0 0 0
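A further sketch of my own (assuming df2 is the original single-row frame sharing df1's columns): build a boolean mask from that single row and zero out the matching columns directly, which sidesteps reshaping df2 altogether.

# columns whose single df2 value is >= 0.5 are set to 0 in every row of df1
cols_to_zero = df2.iloc[0] >= 0.5
df1.loc[:, cols_to_zero] = 0

This zeroes only the flagged columns and leaves everything else untouched, which matches the output the question asks for.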
I know this is on SO somewhere but I can't seem to find it. I want to subset a df on a specific value and also include the rows of the following unique value. Using the code below, I can return the rows equal to 'A', but I'm hoping to also return the next unique value, which here is 'B'.
Note: the subsequent unique value may not be 'B' and may have a varying number of rows, so I need a function that finds and returns all subsequent unique values.
import pandas as pd
df = pd.DataFrame({
'Time' : [1,1,1,1,1,1,2,2,2,2,2,2],
'ID' : ['A','A','B','B','C','C','A','A','B','B','C','C'],
'Val' : [2.0,5.0,2.5,2.0,2.0,1.0,1.0,6.0,4.0,2.0,5.0,1.0],
})
df = df[df['ID'] == 'A']
intended output:
Time ID Val
0 1 A 2.0
1 1 A 5.0
2 1 B 2.5
3 1 B 2.0
4 2 A 1.0
5 2 A 6.0
6 2 B 4.0
7 2 B 2.0
OK OP, let me do this again: you want to find all the rows that are "A" (the base condition) plus all the rows whose ID follows an "A" row at some point, right?
Then,
is_A = df["ID"] == "A"
not_A_follows_from_A = (df["ID"] != "A") &( df["ID"].shift() == "A")
candidates = df["ID"].loc[is_A | not_A_follows_from_A].unique()
df.loc[df["ID"].isin(candidates)]
Should work as intended.
Edit: example
df = pd.DataFrame({
'Time': [1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1],
'ID': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'A', 'E', 'E', 'E', 'A', 'F'],
'Val': [7, 2, 7, 5, 1, 6, 7, 3, 2, 4, 7, 8, 2]})
is_A = df["ID"] == "A"
not_A_follows_from_A = (df["ID"] != "A") &( df["ID"].shift() == "A")
candidates = df["ID"].loc[is_A | not_A_follows_from_A].unique()
df.loc[df["ID"].isin(candidates)]
outputs this:
Time ID Val
0 1 A 7
1 1 A 2
2 1 B 7
3 0 B 5
7 1 A 3
8 0 E 2
9 0 E 4
10 1 E 7
11 1 A 8
12 1 F 2
Let us try drop_duplicates, then groupby and keep the number of unique IDs we want with head, and merge:
out = df.merge(df[['Time','ID']].drop_duplicates().groupby('Time').head(2))
Time ID Val
0 1 A 2.0
1 1 A 5.0
2 1 B 2.5
3 1 B 2.0
4 2 A 1.0
5 2 A 6.0
6 2 B 4.0
7 2 B 2.0
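If 'A' is not guaranteed to be the first ID within each Time group, a more general helper (my own, hypothetical sketch) can rank the IDs by first appearance and keep the target plus the following n_next unique IDs:

def keep_target_and_next(df, target='A', n_next=1):
    # first appearance of each ID within each Time group, in row order
    firsts = df.drop_duplicates(['Time', 'ID']).copy()
    firsts['order'] = firsts.groupby('Time').cumcount()
    # position of the target ID within each Time group
    start = (firsts.loc[firsts['ID'] == target, ['Time', 'order']]
                   .rename(columns={'order': 'start'}))
    keep = firsts.merge(start, on='Time')
    keep = keep[keep['order'].between(keep['start'], keep['start'] + n_next)]
    # pull back every original row whose (Time, ID) pair survived
    return df.merge(keep[['Time', 'ID']])

print(keep_target_and_next(df, 'A'))   # reproduces the intended output above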
How do I apply a function to each column of a dataframe group-wise?
I.e. group by the values of one column and calculate, e.g., the mean of every other column for each group. The expected output is a dataframe whose index holds the group names and whose values are the per-group means of each column.
E.g. consider:
df = pd.DataFrame(np.arange(16).reshape(4,4), columns=['A', 'B', 'C', 'D'])
df['group'] = ['a', 'a', 'b','b']
A B C D group
0 0 1 2 3 a
1 4 5 6 7 a
2 8 9 10 11 b
3 12 13 14 15 b
I want to calculate e.g. np.mean for each column, but group-wise; in this particular example it can be done by:
t = df.groupby('group').agg({'A': np.mean, 'B': np.mean, 'C': np.mean, 'D': np.mean })
A B C D
group
a 2 3 4 5
b 10 11 12 13
However, it requires the explicit use of the column names 'A': np.mean, 'B': np.mean, 'C': np.mean, 'D': np.mean,
which is unacceptable for my task, since they can change.
As MaxU commented, simpler is groupby + GroupBy.mean:
df1 = df.groupby('group').mean()
print (df1)
A B C D
group
a 2 3 4 5
b 10 11 12 13
If you need the group column back out of the index:
df1 = df.groupby('group', as_index=False).mean()
print (df1)
group A B C D
0 a 2 3 4 5
1 b 10 11 12 13
You don't need to explicitly name the columns.
df.groupby('group').agg('mean')
This will produce the mean of each column for each group, as requested:
A B C D
group
a 2 3 4 5
b 10 11 12 13
The below does the job:
df.groupby('group').apply(np.mean, axis=0)
giving back
A B C D
group
a 2.0 3.0 4.0 5.0
b 10.0 11.0 12.0 13.0
apply forwards the extra axis = {0,1} argument to np.mean, which specifies whether the function is applied column-wise (axis=0) or row-wise (axis=1) within each group.
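If the reduction is something other than the mean, the same column-agnostic pattern works by passing any callable to agg (a sketch; value_range is a hypothetical example function, not from the answers above):

def value_range(s):
    # example reduction, applied to every non-grouping column within each group
    return s.max() - s.min()

print(df.groupby('group').agg(value_range))
#        A  B  C  D
# group
# a      4  4  4  4
# b      4  4  4  4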
I have two dataframes: df1 and df2.
df1 is the following:
name exist
a 1
b 1
c 1
d 1
e 1
df2 (it has just one column, name) is the following:
name
e
f
g
a
h
I want to merge these two dataframes without repeating names. I mean, if a name from df2 already exists in df1, it should appear only once; if a name from df2 does not exist in df1, its exist value should be set to 0 or NaN. For example, 'a' and 'e' appear in both df1 and df2, so they should be shown only once. I want to end up with the following df:
a 1
b 1
c 1
d 1
e 1
f 0
g 0
h 0
I used the concat function to do it; my code is the following:
import pandas as pd
df1 = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'e'],
'exist': ['1', '1', '1', '1', '1']})
df2 = pd.DataFrame({'name': ['e', 'f', 'g', 'h', 'a']})
df = pd.concat([df1, df2])
print(df)
but the result is wrong (names a and e are shown twice):
exist name
0 1 a
1 1 b
2 1 c
3 1 d
4 1 e
0 NaN e
1 NaN f
2 NaN g
3 NaN h
4 NaN a
Please lend a hand, thanks in advance!
As indicated by your title, you can use merge instead of concat and set the how parameter to 'outer', since you want to keep all records from df1 and df2, which is exactly what an outer join does:
import pandas as pd
pd.merge(df1, df2, on = 'name', how = 'outer').fillna(0)
# exist name
# 0 1 a
# 1 1 b
# 2 1 c
# 3 1 d
# 4 1 e
# 5 0 f
# 6 0 g
# 7 0 h
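An alternative of my own that stays with concat (not part of the answer above): give df2 an exist column of 0, concatenate, and drop duplicate names while keeping the df1 rows:

df = (pd.concat([df1, df2.assign(exist=0)])
        .drop_duplicates(subset='name', keep='first')   # keeps the df1 row for names present in both frames
        .reset_index(drop=True))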