I created a new dataframe by splitting a column and expanding it.
I now want to transform the dataframe so that there is a new column for every distinct value, with each cell showing how often that value occurs in the corresponding row.
I wrote an example below.
Example dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({0: ['cake', 'fries', 'ketchup', 'potato', 'snack'],
                   1: ['fries', 'cake', 'potato', np.nan, 'snack'],
                   2: ['ketchup', 'cake', 'potatos', 'snack', np.nan],
                   3: ['potato', np.nan, 'cake', 'ketchup', np.nan],
                   'index': ['james', 'samantha', 'ashley', 'tim', 'mo']})
df = df.set_index('index')
Expected output:
output = pd.DataFrame({'cake': [1, 2, 1, 0, 0],
                       'fries': [1, 1, 0, 0, 0],
                       'ketchup': [1, 0, 1, 1, 0],
                       'potatoes': [1, 0, 2, 1, 0],
                       'snack': [0, 0, 0, 1, 2],
                       'index': ['james', 'samantha', 'ashley', 'tim', 'mo']})
output = output.set_index('index')
Based on the description of what you want, you would need a crosstab on the reshaped data:
df2 = df.reset_index().melt('index')
out = pd.crosstab(df2['index'], df2['value'].str.lower())
This, however, doesn't exactly match the provided output: the 'potatos' typo in the data keeps 'potato' and 'potatos' as separate columns, while the expected output merges them into 'potatoes' (see the normalization sketch after the output below).
Output:
value     cake  fries  ketchup  potato  potatos  snack
index
ashley       1      0        1       1        1      0
james        1      1        1       1        0      0
mo           0      0        0       0        0      2
samantha     2      1        0       0        0      0
tim          0      0        1       1        0      1
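If the 'potatos' entry is just a typo, you can normalize the spelling before building the crosstab so the result matches your expected output. A minimal sketch, assuming both 'potato' and 'potatos' should count as 'potatoes':
# map both spellings to 'potatoes' before counting (the mapping is an assumption)
df2['value'] = df2['value'].str.lower().replace({'potato': 'potatoes', 'potatos': 'potatoes'})
out = pd.crosstab(df2['index'], df2['value'])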
My problem is that I cannot convert this:
import pandas as pd
example = {
    "ID": [1, 1, 2, 2, 2, 3],
    "place": ["Maryland", "Maryland", "Washington", "Washington", "Washington", "Los Angeles"],
    "sex": ["male", "male", "female", "female", "female", "other"],
    "depression": [0, 0, 0, 0, 0, 1],
    "stressed": [1, 0, 0, 0, 0, 0],
    "sleep": [1, 1, 1, 0, 1, 1],
    "ate": [0, 1, 0, 1, 0, 1],
}
# load into a DataFrame:
example = pd.DataFrame(example)
print(example)
to this:
import pandas as pd
result = {
    "ID": [1, 2, 3],
    "place": ["Maryland", "Washington", "Los Angeles"],
    "sex": ["male", "female", "other"],
    "depression": [0, 0, 1],
    "stressed": [1, 0, 0],
    "sleep": [1, 1, 1],
    "ate": [1, 1, 1],
}
# load into a DataFrame:
result = pd.DataFrame(result)
print(result)
I was trying to pivot it like this:
table = example.pivot_table(index='place',columns='ID')
print (table)
However, the result looks totally different and I am confused about how to set the values for it. Could you please let me know what I am doing wrong?
Huge thanks in advance!
I think you just want groupby with max (which acts as a logical OR on 1/0 values) as an aggregation function:
example.groupby(['ID', 'place','sex']).max().reset_index()
Output:
ID place sex depression stressed sleep ate
0 1 Maryland male 0 1 1 1
1 2 Washington female 0 0 1 1
2 3 Los Angeles other 1 0 1 1
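As a quick illustration of why max behaves like a logical OR here (a minimal sketch): on 0/1 data, the max of a group is 1 exactly when at least one row in the group is 1.
pd.Series([0, 1, 0]).max()  # 1 -> at least one row is 1 (OR is True)
pd.Series([0, 0, 0]).max()  # 0 -> no row is 1 (OR is False)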
You can use groupby and any to get there:
example.groupby(['ID','place','sex']).any().astype(int).reset_index()
ID place sex depression stressed sleep ate
0 1 Maryland male 0 1 1 1
1 2 Washington female 0 0 1 1
2 3 Los Angeles other 1 0 1 1
The default aggregation function is mean; to keep the result binary, use aggfunc='max':
table = example.pivot_table(index='place', columns='ID', aggfunc='max', fill_value=0)
Output:
             ate        depression        sleep        stressed
ID             1  2  3           1  2  3      1  2  3          1  2  3
place
Los Angeles    0  0  1           0  0  1      0  0  1          0  0  0
Maryland       1  0  0           0  0  0      1  0  0          1  0  0
Washington     0  1  0           0  0  0      0  1  0          0  0  0
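Note that the pivoted result has two column levels (the measure and the ID). If flat column names are easier to work with downstream, a sketch (the measure_ID naming scheme is just one choice):
# flatten the (measure, ID) MultiIndex into single strings like 'stressed_1'
table.columns = [f'{measure}_{id_}' for measure, id_ in table.columns]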
Although in your case you might want a GroupBy.max:
example.groupby(['ID', 'place', 'sex'], as_index=False).max()
Output:
ID place sex depression stressed sleep ate
0 1 Maryland male 0 1 1 1
1 2 Washington female 0 0 1 1
2 3 Los Angeles other 1 0 1 1
I have the following dataframe called df. For each sector column (sector_*) I basically want to do a group by and get the number of unique ids in each sector. A sector column is 1 for a row if that row's id is a part of that sector. How can I do this group by if the columns are one-hot encoded?
id winner sector_food sector_learning sector_parenting sector_consumer
1 1 1 0 0 0
1 0 1 0 0 0
2 1 0 0 0 0
2 0 1 0 0 0
3 1 0 0 0 1
expected output
sector unique_id
sector_food 2
sector_learning 0
sector_parenting 0
sector_consumer 1
You can do something like this:
# Multiply each sector column by id (axis=0): rows in a sector keep their id,
# rows outside it become 0. nunique() then counts the distinct values per column,
# and subtract(1) removes the 0 (this assumes ids are nonzero and every sector
# column contains at least one 0).
out = df.drop(columns=["id", "winner"]).multiply(df["id"], axis=0).nunique().subtract(1)
#sector_food 2
#sector_learning 0
#sector_parenting 0
#sector_consumer 1
#dtype: int64
To get your exact expected output you can add:
out = out.rename_axis("sector").to_frame("unique_id")
# unique_id
#sector
#sector_food 2
#sector_learning 0
#sector_parenting 0
#sector_consumer 1
Try this — sum the sector flags per id, then count how many ids have a nonzero sum in each sector:
df.drop(columns='winner').groupby('id').sum().gt(0).sum().to_frame('unique_id')
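For the sample data (reproduced under "Given" below), this produces:
                  unique_id
sector_food               2
sector_learning           0
sector_parenting          0
sector_consumer           1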
Given
import pandas as pd
ids = [1, 1, 2, 2, 3]
winner = [1, 0, 1, 0, 1]
sector_food = [1, 1, 0, 1, 0]
sector_learning = [0, 0, 0, 0, 0]
sector_parenting = [0, 0, 0, 0, 0]
sector_consumer = [0, 0, 0, 0, 1]
df = pd.DataFrame({
'id': ids,
'winner': winner,
'sector_food': sector_food,
'sector_learning': sector_learning,
'sector_parenting': sector_parenting,
'sector_consumer': sector_consumer
})
print(df)
output
id winner sector_food sector_learning sector_parenting sector_consumer
0 1 1 1 0 0 0
1 1 0 1 0 0 0
2 2 1 0 0 0 0
3 2 0 1 0 0 0
4 3 1 0 0 0 1
You can do
_df = (df
       # drop unused cols
       .drop('winner', axis=1)
       # melt with 'id' as index
       .melt(id_vars='id')
       # keep one row per (id, sector, value) triple
       .drop_duplicates(['id', 'variable', 'value'])
       # summing the remaining 0/1 flags counts the unique ids per sector
       .groupby('variable').value.sum()
)
print(_df)
output
variable
sector_consumer 1
sector_food 2
sector_learning 0
sector_parenting 0
Name: value, dtype: int64
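If you want the sectors back in their original column order instead of alphabetical, a small sketch reusing the sector columns of df:
_df = _df.reindex(df.columns[2:])  # sector_food, sector_learning, sector_parenting, sector_consumer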
I have a dataframe like this:
df_1 = pd.DataFrame({'players.name': ['John', 'Will' ,'John', 'Jim', 'Tim', 'John', 'Will', 'Tim'],
'players.diff': [0, 0, 0, 0, 0, 0, 0, 0],
'count': [3, 2, 3, 1, 2, 3, 2, 2]})
'count' is constant for each player.
And I have a different shape dataframe with players ordered differently, like so:
df_2 = pd.DataFrame({'players.name': ['Will', 'John' ,'Jim'],
'players.diff': [0, 0, 0]})
How do I map the values from df_1 to populate a 'count' column on df_2, ending up with:
players.name players.diff counts
0 Will 0 2
1 John 0 3
2 Jim 0 1
Since you're just trying to create a column of counts, it'd be more meaningful to map your player names to counts:
df_2['counts'] = df_2['players.name'].map(
df_1.groupby('players.name')['count'].first())
df_2
players.name players.diff counts
0 Will 0 2
1 John 0 3
2 Jim 0 1
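An equivalent sketch that avoids the groupby, assuming each player really does have a single constant count as stated:
# build a name -> count lookup from the first occurrence of each player
counts = df_1.drop_duplicates('players.name').set_index('players.name')['count']
df_2['counts'] = df_2['players.name'].map(counts)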
This could work:
pd.merge(df_1, df_2, on=["players.name", "players.diff"]).drop_duplicates()
Your sample df_1 has duplicated players.name values with the same count, so you need a left merge followed by drop_duplicates:
new_df_2 = df_2.merge(df_1[['players.name','count']], on='players.name', how='left').drop_duplicates()
Output:
players.name players.diff count
0 Will 0 2
2 John 0 3
5 Jim 0 1
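If you also want a clean 0..n index after dropping the duplicates, you can add:
new_df_2 = new_df_2.reset_index(drop=True)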
How can I delete rows and columns in a Pandas dataframe that contain all zeros? For example, I have a df:
1 0 1 0 1
0 0 0 0 0
1 1 1 0 1
0 1 1 0 1
1 1 0 0 0
0 0 0 0 0
0 0 1 0 1
I want to delete the 2nd and 6th rows and also the 4th column. The output should look like:
1 0 1 1
1 1 1 1
0 1 1 1
1 1 0 0
0 0 1 1
here we go:
import numpy as np
import pandas as pd

ad = np.array([[1, 0, 1, 0, 1],
               [0, 0, 0, 0, 0],
               [1, 1, 1, 0, 1],
               [0, 1, 1, 0, 1],
               [1, 1, 0, 0, 0],
               [0, 0, 0, 0, 0],
               [0, 0, 1, 0, 1]])
df = pd.DataFrame(ad)
df.drop(df.loc[df.sum(axis=1)==0].index, inplace=True)
df.drop(columns=df.columns[df.sum()==0], inplace=True)
The code above drops a row/column when its sum is zero. This is achieved by calculating the sum along axis 1 for rows and axis 0 for columns, and then dropping the rows/columns whose sum is 0 (df.drop(...)).
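After the two drops, print(df) matches the expected output (with the original row and column labels kept):
   0  1  2  4
0  1  0  1  1
2  1  1  1  1
3  0  1  1  1
4  1  1  0  0
6  0  0  1  1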
I found a similar question that was already answered:
Pandas DataFrame, How do I remove all columns and rows that sum to 0
df.loc[(df != 0).any(axis=1), (df != 0).any(axis=0)]
But this does not work in place. How can I change that?
df = df.loc[(df != 0).any(axis=1), (df != 0).any(axis=0)]
To expand on why the solution drops every all-zero row/column: df != 0 gives a frame of True/False values, and (df != 0).any() tells you whether any element is True, potentially over an axis (from the documentation). Indexing with those two boolean masks keeps only the rows and columns that contain at least one nonzero value.
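If you truly need to mutate df in place rather than rebinding the name, a sketch using drop with both index and columns:
# drop the all-zero rows and columns directly on df
df.drop(index=df.index[~(df != 0).any(axis=1)],
        columns=df.columns[~(df != 0).any(axis=0)],
        inplace=True)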
I am new to Pandas. I have a Pandas data frame like so:
df = pd.DataFrame(data={'id': [1, 1, 1, 2, 2, 2, 2], 'val1': [0, 1, 0, 0, 1, 0, 0]})
I want to add a column val2 that indicates whether a row falls below another row with the same id where val1 == 1.
The result would be a data frame like:
df = pd.DataFrame(data={'id': [1, 1, 1, 2, 2, 2, 2], 'val1': [0, 1, 0, 0, 1, 0, 0], 'val2': [0, 0, 1, 0, 0, 1, 1]})
My first thought was to use an apply statement, but that only operates row by row, and from my experience for loops are never the answer. Any help would be greatly appreciated!
Let's try shift + cumsum inside a groupby: shift pushes each row's val1 down so the current row is excluded, and the cumulative sum then counts how many earlier rows in the group had val1 == 1.
df['val2'] = df.groupby('id').val1.apply(
lambda x: x.shift().cumsum()
).ge(1).astype(int)
Or, in an attempt to avoid the lambda,
df['val2'] = (
df.groupby('id')
.val1.shift()
.groupby(df.id)
.cumsum()
.ge(1)
.astype(int)
)
df
id val1 val2
0 1 0 0
1 1 1 0
2 1 0 1
3 2 0 0
4 2 1 0
5 2 0 1
6 2 0 1
Using groupby + transform. Similar to coldspeed's but using bool conversion for non-zero cumsum values.
df['val2'] = df.groupby('id')['val1'].transform(lambda x: x.cumsum().shift())\
.fillna(0).astype(bool).astype(int)
print(df)
id val1 val2
0 1 0 0
1 1 1 0
2 1 0 1
3 2 0 0
4 2 1 0
5 2 0 1
6 2 0 1
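A lambda-free alternative (a sketch): within each id, the cumulative sum minus the current value counts how many earlier rows had val1 == 1.
df['val2'] = (
    df.groupby('id')['val1'].cumsum()
      .sub(df['val1'])   # exclude the current row from the running total
      .gt(0)
      .astype(int)
)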