Mapping column from one pandas DataFrame to another - python

I have a dataframe like this:
df_1 = pd.DataFrame({'players.name': ['John', 'Will' ,'John', 'Jim', 'Tim', 'John', 'Will', 'Tim'],
'players.diff': [0, 0, 0, 0, 0, 0, 0, 0],
'count': [3, 2, 3, 1, 2, 3, 2, 2]})
'count' values are constant.
And I have a different shape dataframe with players ordered differently, like so:
df_2 = pd.DataFrame({'players.name': ['Will', 'John' ,'Jim'],
'players.diff': [0, 0, 0]})
How do I map from df_1 values and populate a 'count' value on df_2, ending up with:
players.name players.diff counts
0 Will 0 2
1 John 0 3
2 Jim 0 1

Since you're just trying to create a column of counts, it'd be more meaningful to map your player names to counts:
df_2['counts'] = df_2['players.name'].map(
df_1.groupby('players.name')['count'].first())
df_2
players.name players.diff counts
0 Will 0 2
1 John 0 3
2 Jim 0 1

This could work:
pd.merge(df_1, df_2, on=["players.name", "players.diff"]).drop_duplicates()

Your sample df_1 has duplicated players.name with same count, so you need left-merge and drop_duplicates
new_df_2 = df_2.merge(df_1[['players.name','count']], on='players.name', how='left').drop_duplicates()
Out[89]:
players.name players.diff count
0 Will 0 2
2 John 0 3
5 Jim 0 1

Related

How to create multiple columns in Pandas Dataframe?

I have data as you can see in the terminal. I need it to be converted to the Excel sheet format as you can see in the Excel sheet file by creating multi-levels in columns.
I researched this and reached many different things but cannot achieve my goal then, I reached "transpose", and it gave me the shape that I need but unfortunately that it did reshape from a column to a row instead where I got the wrong data ordering.
Current result:
Desired result:
What can I try next?
You can use pivot() function and reorder multi-column levels.
Before that, index/group data for repeated iterations/rounds:
data=[
(2,0,0,1),
(10,2,5,3),
(2,0,0,0),
(10,1,1,1),
(2,0,0,0),
(10,1,2,1),
]
columns = ["player_number", "cel1", "cel2", "cel3"]
df = pd.DataFrame(data=data, columns=columns)
df_nbr_plr = df[["player_number"]].groupby("player_number").agg(cnt=("player_number","count"))
df["round"] = list(itertools.chain.from_iterable(itertools.repeat(x, df_nbr_plr.shape[0]) for x in range(df_nbr_plr.iloc[0,0])))
[Out]:
player_number cel1 cel2 cel3 round
0 2 0 0 1 0
1 10 2 5 3 0
2 2 0 0 0 1
3 10 1 1 1 1
4 2 0 0 0 2
5 10 1 2 1 2
Now, pivot and reorder the colums levels:
df = df.pivot(index="round", columns="player_number").reorder_levels([1,0], axis=1).sort_index(axis=1)
[Out]:
player_number 2 10
cel1 cel2 cel3 cel1 cel2 cel3
round
0 0 0 1 2 5 3
1 0 0 0 1 1 1
2 0 0 0 1 2 1
This can be done with unstack after setting player__number as index. You have to reorder the Multiindex columns and fill missing values/delete duplicates though:
import pandas as pd
data = {"player__number": [2, 10 , 2, 10, 2, 10],
"cel1": [0, 2, 0, 1, 0, 1],
"cel2": [0, 5, 0, 1, 0, 2],
"cel3": [1, 3, 0, 1, 0, 1],
}
df = pd.DataFrame(data).set_index('player__number', append=True)
df = df.unstack('player__number').reorder_levels([1, 0], axis=1).sort_index(axis=1) # unstacking, reordering and sorting columns
df = df.ffill().iloc[1::2].reset_index(drop=True) # filling values and keeping only every two rows
df.to_excel('output.xlsx')
Output:

How to convert column values into new columns showing frequency

I created a new dataframe by splitting a column and expanding it.
I now want to convert the dataframe to create new columns for every value and only display the frequency of the value.
I wrote an example below.
Example dataframe:
import pandas as pd
import numpy as np
df= pd.DataFrame({0:['cake','fries', 'ketchup', 'potato', 'snack'],
1:['fries', 'cake', 'potato', np.nan, 'snack'],
2:['ketchup', 'cake', 'potatos', 'snack', np.nan],
3:['potato', np.nan,'cake', 'ketchup',np.nan],
'index':['james','samantha','ashley','tim', 'mo']})
df.set_index('index')
Expected output:
output = pd.DataFrame({'cake': [1, 2, 1, 0, 0],
'fries': [1, 1, 0, 0, 0],
'ketchup': [1, 0, 1, 1, 0],
'potatoes': [1, 0, 2, 1, 0],
'snack': [0, 0, 0, 1, 2],
'index': ['james', 'samantha', 'asheley', 'tim', 'mo']})
output.set_index('index')
Based on the description of what you want, you would need a crosstab on the reshaped data:
df2 = df.reset_index().melt('index')
out = pd.crosstab(df2['index'], df2['value'].str.lower())
This, however, doesn't match the provided output.
Output:
value apple berries cake chocolate drink fries fruits ketchup potato potatoes snack
index
Ashley 0 0 0 0 0 0 0 1 1 0 1
James 0 1 1 0 0 1 1 0 0 0 0
Mo 0 0 0 1 0 0 1 1 0 1 0
samantha 1 0 0 1 0 1 0 0 0 0 0
tim 0 0 0 0 1 0 0 0 0 0 1

How do I merge (insert) rows from one dataframe into another one in Pandas?

Let's suppose I have a following dataframe:
df = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'val': [0, 0, 0, 0, 0]})
I want to modify the column val with values from another dataframes like these:
df1 = pd.DataFrame({'id': [2, 3], 'val': [1, 1]})
df2 = pd.DataFrame({'id': [1, 5], 'val': [2, 2]})
I need a function merge_values_into_df that would work in the way to provide the following result:
df = merge_values_into_df(df1, on='id', field='val')
df = merge_values_into_df(df2, on='id', field='val')
print(df)
id val
0 1 2
1 2 1
2 3 1
3 4 0
4 5 2
I need an efficient (by CPU and memory) solution because I want to apply the approach to huge dataframes.
Use DataFrame.update with convert id to index in all DataFrames:
df = df.set_index('id')
df1 = df1.set_index('id')
df2 = df2.set_index('id')
df.update(df1)
df.update(df2)
df = df.reset_index()
print (df)
id val
0 1 2.0
1 2 1.0
2 3 1.0
3 4 0.0
4 5 2.0
You can concat all dataframes and drop_duplicated same id by keeping the last occurrence of id.
out = pd.concat([df, df1, df2]).drop_duplicates('id', keep='last')
print(out.sort_values('id', ignore_index=True))
# Output
id val
0 1 2
1 2 1
2 3 1
3 4 0
4 5 2

How to create a new DataFrame repeating rows using indexes from original DF

I have a DataFrame of generated random agents. However, I want to expand them to match the population I am looking for, so I need to repeat rows, according to my sampled indexes.
Here is a loop code that is taking forever:
df = pd.DataFrame({'a': [0, 1, 2]})
sampled_indexes = [0, 0, 1, 1, 2, 2, 2]
new_df = pd.DataFrame(columns=['a'])
for i, idx in enumerate(sampled_indexes):
new_df.loc[i] = df.loc[idx]
Then, the original DataFrame:
df
a
0 0
1 1
2 2
gives me the result of an enlarged new dataframe
new_df
a
0 0
1 0
2 1
3 1
4 2
5 2
6 2
So, this loop is too slow with a DataFrame that has 34,000 or more rows (takes forever).
How can I do this simpler and faster?
Reindex the dataframe with sampled_indexes, then reset the index.
df.reindex(sampled_indexes).reset_index(drop=True)
You can do DataFrame.merge:
df = pd.DataFrame({'a': [0, 1, 2]})
sampled_indexes = [0, 0, 1, 1, 2, 2, 2]
print( df.merge(pd.DataFrame({'a': sampled_indexes})) )
Prints:
a
0 0
1 0
2 1
3 1
4 2
5 2
6 2

Pandas - Does row fall below a row with a column value and same id

I am new to Pandas. I have a Pandas data frame like so:
df = pd.DataFrame(data={'id': [1, 1, 1, 2, 2, 2, 2], 'val1': [0, 1, 0, 0, 1, 0, 0]})
I want to add a column val2, that indicates whether an row falls below another row having the same id as itself where val1 == 1.
The result would be a data frame like:
df = pd.DataFrame(data={'id': [1, 1, 1, 2, 2, 2, 2], 'val1': [0, 1, 0, 0, 1, 0, 0], 'val2': [0, 0, 1, 0, 0, 1, 1]})
My first thought was to use an apply statement, but these only go by row. And from my experience for loops are never the answer. Any help would be greatly appreciated!
Let's try shift + cumsum inside a groupby.
df['val2'] = df.groupby('id').val1.apply(
lambda x: x.shift().cumsum()
).ge(1).astype(int)
Or, in an attempt to avoid the lambda,
df['val2'] = (
df.groupby('id')
.val1.shift()
.groupby(df.id)
.cumsum()
.ge(1)
.astype(int)
)
df
id val1 val2
0 1 0 0
1 1 1 0
2 1 0 1
3 2 0 0
4 2 1 0
5 2 0 1
6 2 0 1
Using groupby + transform. Similar to coldspeed's but using bool conversion for non-zero cumsum values.
df['val2'] = df.groupby('id')['val1'].transform(lambda x: x.cumsum().shift())\
.fillna(0).astype(bool).astype(int)
print(df)
id val1 val2
0 1 0 0
1 1 1 0
2 1 0 1
3 2 0 0
4 2 1 0
5 2 0 1
6 2 0 1

Categories

Resources