Filling rows based on column partitions & conditions in Python

I have a dataframe:
     A  B   C
0  id1  2  10
1  id1  3  20
2  id2  1  30
and a list of possible B values:
possible_B_values = [1, 2, 3]
My goal is to iterate through the list of possible B values, so that each ID (column A) gets new rows with C = 0 for every possible B value that does not already exist in the dataframe.
Resulting in:
     A  B   C
0  id1  1   0
1  id1  2  10
2  id1  3  20
3  id2  1  30
4  id2  2   0
5  id2  3   0
Thanks in advance!

Using some index trickery:
import pandas as pd

df = pd.DataFrame({
    "A": ["id1", "id1", "id2"],
    "B": [2, 3, 1],
    "C": [10, 20, 30],
})  # your df here
possible_B_values = [1, 2, 3]
extrapolate_columns = ["A", "B"]

index = pd.MultiIndex.from_product(
    [df["A"].unique(), possible_B_values],
    names=extrapolate_columns,
)
out = df.set_index(extrapolate_columns).reindex(index, fill_value=0).reset_index()
out:
     A  B   C
0  id1  1   0
1  id1  2  10
2  id1  3  20
3  id2  1  30
4  id2  2   0
5  id2  3   0
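If the possible values should simply be every B seen anywhere in the data rather than a hand-written list, they can be derived first; a small sketch of the same pattern:
possible_B_values = sorted(df["B"].unique())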

Maybe you can create a dataframe from a list of tuples with the possible B values and merge it with the original one:
import pandas as pd

# Build the full scaffold: one (A, B) pair per ID and possible B value
possible_b_values = [1, 2, 3]
possible_b_rows = [(a, b) for a in df['A'].unique() for b in possible_b_values]

# Create a new DataFrame from the list of tuples
possible_b_df = pd.DataFrame(possible_b_rows, columns=['A', 'B'])

# Merge the new DataFrame with the original one, using 'A' and 'B' as the keys.
# The scaffold deliberately has no 'C' column: if it did, the merge would
# produce 'C_x'/'C_y' suffixes instead of a single 'C'.
df = df.merge(possible_b_df, on=['A', 'B'], how='outer')

# Fill the null values in 'C' (rows that only existed in the scaffold) with 0
df['C'] = df['C'].fillna(0)
print(df)
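Since the outer merge appends the scaffold-only rows at the end, a final sort reproduces the row order shown in the question; a small sketch (ignore_index needs pandas >= 1.0):
df = df.sort_values(['A', 'B'], ignore_index=True)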

Here is a one-liner pure-pandas way of solving this:
Set the index to B (this will help with the reindexing later).
Group by column A, then apply the following function to column C to reindex B.
The lambda function x.reindex(range(1, 4), fill_value=0) takes each group x (one per ID), reindexes it over range(1, 4) = 1, 2, 3, and fills the resulting NaN values with 0.
Finally, reset_index brings A and B back into the dataframe.
out = (
    df.set_index('B')                                         # set index to B
      .groupby('A')['C']                                      # group by A, use apply on column C
      .apply(lambda x: x.reindex(range(1, 4), fill_value=0))  # reindex B over range(1, 4) per group, filling 0
      .reset_index()                                          # reset index
)
print(out)
     A  B   C
0  id1  1   0
1  id1  2  10
2  id1  3  20
3  id2  1  30
4  id2  2   0
5  id2  3   0

Related

How do you generate a rolling count of the number of rows that are duplicated in pandas? [duplicate]

I come from a SQL background, and I use the following data-processing step frequently:
Partition the table of data by one or more fields.
For each partition, add a row number to each of its rows that ranks the row by one or more other fields, with the analyst specifying ascending or descending per field.
EX:
df = pd.DataFrame({'key1': ['a','a','a','b','a'],
                   'data1': [1,2,2,3,3],
                   'data2': [1,10,2,3,30]})
df
   data1  data2 key1
0      1      1    a
1      2     10    a
2      2      2    a
3      3      3    b
4      3     30    a
I'm looking for how to do the PANDAS equivalent to this sql window function:
RN = ROW_NUMBER() OVER (PARTITION BY Key1 ORDER BY Data1 ASC, Data2 DESC)
   data1  data2 key1  RN
0      1      1    a   1
1      2     10    a   2
2      2      2    a   3
3      3      3    b   1
4      3     30    a   4
I've tried the following which I've gotten to work where there are no 'partitions':
def row_number(frame, orderby_columns, orderby_direction, name):
    frame.sort_index(by=orderby_columns, ascending=orderby_direction, inplace=True)
    frame[name] = list(xrange(len(frame.index)))
I tried to extend this idea to work with partitions (groups in pandas) but the following didn't work:
df1 = df.groupby('key1').apply(lambda t: t.sort_index(by=['data1', 'data2'], ascending=[True, False], inplace=True)).reset_index()

def nf(x):
    x['rn'] = list(xrange(len(x.index)))

df1['rn1'] = df1.groupby('key1').apply(nf)
But I just got a lot of NaNs when I do this.
Ideally, there'd be a succinct way to replicate the window-function capability of SQL (I've figured out the window-based aggregates; that's a one-liner in pandas). Can someone share with me the most idiomatic way to number rows like this in pandas?
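For reference, the window-based aggregate mentioned above is typically the groupby/transform pattern; a minimal sketch of the SQL SUM() OVER (PARTITION BY key1) equivalent:
df['data1_sum'] = df.groupby('key1')['data1'].transform('sum')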
You can also use sort_values(), groupby() and finally cumcount() + 1:
df['RN'] = df.sort_values(['data1','data2'], ascending=[True,False]) \
.groupby(['key1']) \
.cumcount() + 1
print(df)
yields:
   data1  data2 key1  RN
0      1      1    a   1
1      2     10    a   2
2      2      2    a   3
3      3      3    b   1
4      3     30    a   4
PS tested with pandas 0.18
Use the groupby.rank function.
Here is a working example.
df = pd.DataFrame({'C1':['a', 'a', 'a', 'b', 'b'], 'C2': [1, 2, 3, 4, 5]})
df
  C1  C2
0  a   1
1  a   2
2  a   3
3  b   4
4  b   5
df["RANK"] = df.groupby("C1")["C2"].rank(method="first", ascending=True)
df
  C1  C2  RANK
0  a   1   1.0
1  a   2   2.0
2  a   3   3.0
3  b   4   1.0
4  b   5   2.0
You can do this by using groupby twice along with the rank method:
In [11]: g = df.groupby('key1')
Use the min method argument to give values which share the same data1 the same RN:
In [12]: g['data1'].rank(method='min')
Out[12]:
0 1
1 2
2 2
3 1
4 4
dtype: float64
In [13]: df['RN'] = g['data1'].rank(method='min')
And then groupby these results and add the rank with respect to data2:
In [14]: g1 = df.groupby(['key1', 'RN'])
In [15]: g1['data2'].rank(ascending=False) - 1
Out[15]:
0 0
1 0
2 1
3 0
4 0
dtype: float64
In [16]: df['RN'] += g1['data2'].rank(ascending=False) - 1
In [17]: df
Out[17]:
   data1  data2 key1  RN
0      1      1    a   1
1      2     10    a   2
2      2      2    a   3
3      3      3    b   1
4      3     30    a   4
It feels like there ought to be a native way to do this (there may well be!...).
You can use transform and rank together. Here is an example:
df = pd.DataFrame({'C1': ['a','a','a','b','b'],
                   'C2': [1,2,3,4,5]})
df['Rank'] = df.groupby(by=['C1'])['C2'].transform(lambda x: x.rank())
df
Have a look at the pandas rank method for more information.
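Note that x.rank() gives tied values the same averaged rank, unlike SQL's ROW_NUMBER. If ties should receive distinct consecutive numbers, rank's method='first' option can be used inside the same transform; a small variant:
df['Rank'] = df.groupby(by=['C1'])['C2'].transform(lambda x: x.rank(method='first'))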
pandas.lib.fast_zip() can create a tuple array from a list of arrays. You can use this function to create a tuple Series and then rank it:
values = {'key1': ['a','a','a','b','a','b'],
          'data1': [1,2,2,3,3,3],
          'data2': [1,10,2,3,30,20]}
df = pd.DataFrame(values, index=list("abcdef"))

def rank_multi_columns(df, cols, **kw):
    data = []
    for col in cols:
        if col.startswith("-"):
            flag = -1
            col = col[1:]
        else:
            flag = 1
        data.append(flag * df[col])
    values = pd.lib.fast_zip(data)
    s = pd.Series(values, index=df.index)
    return s.rank(**kw)

rank = df.groupby("key1").apply(lambda df: rank_multi_columns(df, ["data1", "-data2"]))
print(rank)
the result:
a    1
b    2
c    3
d    2
e    4
f    1
dtype: float64
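pandas.lib (and fast_zip with it) was removed in later pandas releases; here is a hedged sketch of the same tuple-ranking idea using plain Python zip instead, assuming the tuple values are comparable so that Series.rank can order them:
import pandas as pd

values = {'key1': ['a','a','a','b','a','b'],
          'data1': [1,2,2,3,3,3],
          'data2': [1,10,2,3,30,20]}
df = pd.DataFrame(values, index=list("abcdef"))

def rank_multi_columns(df, cols, **kw):
    data = []
    for col in cols:
        # a leading "-" means "descending": negate the column before zipping
        flag = -1 if col.startswith("-") else 1
        data.append(flag * df[col.lstrip("-")])
    # build a Series of tuples and rank it lexicographically
    s = pd.Series(list(zip(*data)), index=df.index)
    return s.rank(**kw)

rank = df.groupby("key1").apply(lambda g: rank_multi_columns(g, ["data1", "-data2"]))
print(rank)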

Insert/replace/merge values from one dataframe to another

I have two dataframes like this:
df1 = pd.DataFrame({'ID1': ['A','B','C','D','E','F'],
                    'ID2': ['0','10','80','0','0','0']})
df2 = pd.DataFrame({'ID1': ['A','D','E','F'],
                    'ID2': ['50','30','90','50'],
                    'aa': ['1','2','3','4']})
I want to insert ID2 from df2 into ID2 in df1 and, at the same time, bring aa into df1, matching rows on ID1, to obtain a new dataframe like this:
df_result = pd.DataFrame({'ID1': ['A','B','C','D','E','F'],
                          'ID2': ['50','10','80','30','90','50'],
                          'aa': ['1','NaN','NaN','2','3','4']})
I've tried to use merge, but it didn't work.
You can use combine_first on the DataFrame after setting the index to ID1:
(df2.set_index('ID1')                     # values of df2 have priority in case of overlap
    .combine_first(df1.set_index('ID1'))  # add missing values from df1
    .reset_index()                        # reset ID1 as a column
)
output:
  ID1 ID2   aa
0   A  50    1
1   B  10  NaN
2   C  80  NaN
3   D  30    2
4   E  90    3
5   F  50    4
Try this:
import numpy as np

new_df = (
    df1.assign(ID2=df1['ID2'].replace('0', np.nan))
       .merge(df2, on='ID1', how='left')
       .pipe(lambda g: g.assign(ID2=g.filter(like='ID2').bfill(axis=1).iloc[:, 0])
                        .drop(['ID2_x', 'ID2_y'], axis=1))
)
Output:
>>> new_df
  ID1   aa ID2
0   A    1  50
1   B  NaN  10
2   C  NaN  80
3   D    2  30
4   E    3  90
5   F    4  50
Use df.merge with Series.combine_first:
In [568]: x = df1.merge(df2, on='ID1', how='left')
In [571]: x['ID2'] = x.ID2_y.combine_first(x.ID2_x)
In [574]: x.drop(['ID2_x', 'ID2_y'], axis=1, inplace=True)
In [575]: x
Out[575]:
  ID1   aa ID2
0   A    1  50
1   B  NaN  10
2   C  NaN  80
3   D    2  30
4   E    3  90
5   F    4  50
OR use df.filter with df.ffill:
In [568]: x = df1.merge(df2, on='ID1', how='left')
In [597]: x['ID2'] = x.filter(like='ID2').ffill(axis=1)['ID2_y']
In [599]: x.drop(['ID2_x', 'ID2_y'], axis=1, inplace=True)

Merge duplicated cells instead of dropping them [duplicate]

I have a pandas dataframe (here represented using Excel):
Now I would like to delete all duplicates (1) in a specific column (B).
How can I do it?
For this example, the result would look like that:
You can use duplicated to build a boolean mask and then set NaNs with loc, mask, or numpy.where:
df.loc[df['B'].duplicated(), 'B'] = np.nan
df['B'] = df['B'].mask(df['B'].duplicated())
df['B'] = np.where(df['B'].duplicated(), np.nan,df['B'])
Alternatively, if you need to remove the duplicate rows by column B:
df = df.drop_duplicates(subset=['B'])
Sample:
df = pd.DataFrame({
    'B': [1,2,1,3],
    'A': [1,5,7,9]
})
print(df)
   A  B
0  1  1
1  5  2
2  7  1
3  9  3
df.loc[df['B'].duplicated(), 'B'] = np.nan
print(df)
   A    B
0  1  1.0
1  5  2.0
2  7  NaN
3  9  3.0
df = df.drop_duplicates(subset=['B'])
print(df)
   A  B
0  1  1
1  5  2
3  9  3
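duplicated also accepts a keep parameter if a different occurrence should survive; for example, to keep the last occurrence of each B value instead of the first:
df.loc[df['B'].duplicated(keep='last'), 'B'] = np.nan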

Match columns of pandas DataFrames

I want to match two pandas DataFrames by the names of their columns.
import pandas as pd
df1 = pd.DataFrame([[0,2,1],[1,3,0],[0,4,0]], columns=['A', 'B', 'C'])
   A  B  C
0  0  2  1
1  1  3  0
2  0  4  0
df2 = pd.DataFrame([[0,0,1],[1,5,0],[0,7,0]], columns=['A', 'B', 'D'])
   A  B  D
0  0  0  1
1  1  5  0
2  0  7  0
If the names match, do nothing. (Keep the column of df2)
If a column is in Dataframe 1 but not in Dataframe 2, add the column in Dataframe 2 as a vector of zeros.
If a column is in Dataframe 2 but not in Dataframe 1, drop it.
The output should look like this:
   A  B  C
0  0  0  0
1  1  5  0
2  0  7  0
I know if I do:
df2 = df2[df1.columns]
I get:
KeyError: "['C'] not in index"
I could also add the vectors of zeros manually, but of course this is a toy example of a much longer dataset. Is there any smarter/pythonic way of doing this?
After this operation, df2's columns should be exactly df1's columns: columns that are in df1 but not in df2 get added, and columns only in df2 get dropped. We can simply reindex df2 to df1's columns with fill_value=0 (this is the safe equivalent of df2 = df2[df1.columns] that also creates the missing columns with a fill value):
df2 = df2.reindex(columns=df1.columns, fill_value=0)
df2:
   A  B  C
0  0  0  0
1  1  5  0
2  0  7  0
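If both frames need to be brought to a common set of columns in one call, DataFrame.align expresses the same idea; a sketch (join='left' keeps df1's columns, and fill_value=0 fills the columns df2 lacks):
_, df2 = df1.align(df2, join='left', axis=1, fill_value=0)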

Pandas: set the value of a column in a row to be the value stored in a different df at the index of its other rows

>>> df
   0  1
0  0  0
1  1  1
2  2  1
>>> df1
   0  1  2
0  A  B  C
1  D  E  F
>>> crazy_magic()
>>> df
   0  1  3
0  0  0  A  # df1[0][0]
1  1  1  E  # df1[1][1]
2  2  1  F  # df1[2][1]
Is there a way to achieve this without a for loop?
import pandas as pd
df = pd.DataFrame([[0,0],[1,1],[2,1]])
df1 = pd.DataFrame([['A', 'B', 'C'],['D', 'E', 'F']])
df2 = df1.reset_index(drop=False)
# index 0 1 2
# 0 0 A B C
# 1 1 D E F
df3 = pd.melt(df2, id_vars=['index'])
# index variable value
# 0 0 0 A
# 1 1 0 D
# 2 0 1 B
# 3 1 1 E
# 4 0 2 C
# 5 1 2 F
result = pd.merge(df, df3, left_on=[0,1], right_on=['variable', 'index'])
result = result[[0, 1, 'value']]
print(result)
yields
   0  1 value
0  0  0     A
1  1  1     E
2  2  1     F
My reasoning goes as follows:
We want to use two columns of df as coordinates. The word "coordinates" reminds me of pivot, since if you have two columns whose values represent "coordinates" and a third column representing values, and you want to convert that to a grid, then pivot is the tool to use.
But df does not have a third column of values. The values are in df1. In fact, df1 looks like the result of a pivot operation. So instead of pivoting df, we want to unpivot df1. pd.melt is the function to use when you want to unpivot.
So I tried melting df1. Comparison with other uses of pd.melt led me to conclude that df1 needed the index as a column. That's the reason for defining df2. So we melt df2.
Once you get that far, visually comparing df3 to df leads naturally to the use of pd.merge.
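An alternative sketch that skips melt/merge entirely: since the two columns of df are positional coordinates into df1 (column 0 of df indexes df1's columns and column 1 its rows, as in the example above), NumPy fancy indexing can pull the values out directly:
import pandas as pd

df = pd.DataFrame([[0, 0], [1, 1], [2, 1]])
df1 = pd.DataFrame([['A', 'B', 'C'], ['D', 'E', 'F']])

# arr[row_indices, col_indices] selects one value per (row, col) pair
df['value'] = df1.to_numpy()[df[1].to_numpy(), df[0].to_numpy()]
print(df)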
