I have two dataframes as follows:
d1 = {'person' : ['1', '1', '1', '2', '2', '3', '3', '4', '4'],
'category' : ['A', 'B', 'C', 'B', 'D', 'E', 'F', 'F', 'D'],
'value' : [2, 3, 1, 2, 1, 4, 2, 1, 3]}
d2 = {'group' : [100, 100, 100, 200, 200, 300, 300],
'category' : ['A', 'D', 'F', 'B', 'C', 'A', 'F'],
'value' : [10, 8, 8, 6, 7, 8, 5]}
I want to get vectors of the same length from the category column (i.e. indexed by category) for each person and group. In other words, I want to transform these long dataframes into wide format, where the names of the new columns are the values of the category column.
What is the best way to do this? This is an example of what I need:
id type A B C D E F
0 100 group 10 0 0 8 0 8
1 200 group 0 6 7 0 0 0
2 300 group 8 0 0 0 0 5
3 1 person 2 3 1 0 0 0
4 2 person 0 2 0 1 0 0
5 3 person 0 0 0 0 4 2
6 4 person 0 0 0 3 0 1
My current script appends both dataframes and then builds a pivot table. My concern is that the id columns have different types (strings for person, integers for group).
I do this because sometimes not all the categories are in each dataframe (e.g. 'E' is not in df2).
This is what I have:
import pandas as pd
d1 = {'person' : ['1', '1', '1', '2', '2', '3', '3', '4', '4'],
'category' : ['A', 'B', 'C', 'B', 'D', 'E', 'F', 'F', 'D'],
'value' : [2, 3, 1, 2, 1, 4, 2, 1, 3]}
d2 = {'group' : [100, 100, 100, 200, 200, 300, 300],
'category' : ['A', 'D', 'F', 'B', 'C', 'A', 'F'],
'value' : [10, 8, 8, 6, 7, 8, 5]}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
df1['type'] = 'person'
df2['type'] = 'group'
df1.rename(columns={'person': 'id'}, inplace = True)
df2.rename(columns={'group': 'id'}, inplace = True)
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
rawpivot = pd.concat([df1, df2])
pivot = rawpivot.pivot_table(index=['id','type'], columns='category', values='value', aggfunc='sum', fill_value=0)
pivot.reset_index(inplace = True)
import pandas as pd
d1 = {'person' : ['1', '1', '1', '2', '2', '3', '3', '4', '4'],
'category' : ['A', 'B', 'C', 'B', 'D', 'E', 'F', 'F', 'D'],
'value' : [2, 3, 1, 2, 1, 4, 2, 1, 3]}
d2 = {'group' : [100, 100, 100, 200, 200, 300, 300],
'category' : ['A', 'D', 'F', 'B', 'C', 'A', 'F'],
'value' : [10, 8, 8, 6, 7, 8, 5]}
cols = ['idx', 'type', 'A', 'B', 'C', 'D', 'E', 'F']
df1 = pd.DataFrame(columns=cols)
def add_data(type_, data):
    global df1
    for id_, category, value in zip(data[type_], data['category'], data['value']):
        if id_ not in df1.idx.values:
            # start a new row for an id that has not been seen yet
            row = pd.DataFrame({'idx': id_, 'type': type_}, columns=cols, index=[0])
            # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
            df1 = pd.concat([df1, row], ignore_index=True)
        df1.loc[df1['idx'] == id_, category] = value
add_data('group', d2)
add_data('person', d1)
df1 = df1.fillna(0)
df1 now holds the following values:
idx type A B C D E F
0 100 group 10 0 0 8 0 8
1 200 group 0 6 7 0 0 0
2 300 group 8 0 0 0 0 5
3 1 person 2 3 1 0 0 0
4 2 person 0 2 0 1 0 0
5 3 person 0 0 0 0 4 2
6 4 person 0 0 0 3 0 1
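On the concern about mixed id types: one way to sidestep it is to cast both id columns to strings before combining. The following is only a minimal sketch under that assumption, using pd.concat and pivot_table. The helper name to_long is hypothetical, and the rows come out sorted by type and id with the type column first, so the layout differs slightly from the table above.

```python
import pandas as pd

d1 = {'person': ['1', '1', '1', '2', '2', '3', '3', '4', '4'],
      'category': ['A', 'B', 'C', 'B', 'D', 'E', 'F', 'F', 'D'],
      'value': [2, 3, 1, 2, 1, 4, 2, 1, 3]}
d2 = {'group': [100, 100, 100, 200, 200, 300, 300],
      'category': ['A', 'D', 'F', 'B', 'C', 'A', 'F'],
      'value': [10, 8, 8, 6, 7, 8, 5]}


def to_long(data, id_col):
    """Hypothetical helper: normalise one dict into id/type/category/value."""
    out = pd.DataFrame(data).rename(columns={id_col: 'id'})
    out['id'] = out['id'].astype(str)  # assumption: string ids are acceptable
    out['type'] = id_col
    return out


long_df = pd.concat([to_long(d2, 'group'), to_long(d1, 'person')],
                    ignore_index=True)

# Categories missing from one source (e.g. 'E' for groups) are filled with 0.
wide = (long_df
        .pivot_table(index=['type', 'id'], columns='category',
                     values='value', aggfunc='sum', fill_value=0)
        .reset_index()
        .rename_axis(None, axis=1))
print(wide)
```

Because both inputs go through the same pivot, every row gets the full set of category columns regardless of which categories each source contains.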
I have a dataframe like this:
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c'], 'C': [4, 5, 6], 'D': ['e', 'f', 'g'], 'E': [7, 8, 9], 'id': [25, 15, 30]})
I would like to use the values of df1 (and their respective columns) as a basis for filling in df2.
Expected:
expected = pd.DataFrame({'column': ['A', 'B', 'C', 'D', 'E', 'A', 'B', 'C', 'D', 'E'], 'value': [1, 'a', 4, 'e', 7, 2, 'b', 5, 'f', 8], 'id': [25, 25, 25, 25, 25, 15, 15, 15, 15, 15]})
I tried using iterrows, but as I need to use it for a large amount of data, the performance results were not positive. Can you help me?
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c'], 'C': [4, 5, 6], 'D': ['e', 'f', 'g'], 'E': [7, 8, 9], 'id': [25, 15, 30]})
pd.melt(df1, id_vars=['id'], var_name = 'column')
id column value
0 25 A 1
1 15 A 2
2 30 A 3
3 25 B a
4 15 B b
5 30 B c
6 25 C 4
7 15 C 5
8 30 C 6
9 25 D e
10 15 D f
11 30 D g
12 25 E 7
13 15 E 8
14 30 E 9
Have you tried DataFrame.melt? Something like this could do the trick:
df1.melt(ignore_index=False).merge(
df1, left_index=True, right_index=True
)[['variable', 'value', 'id']].reset_index()
There are some rows to be ignored, but that should be easy. I don't know about performance on large data frames, though.
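The melt output above is ordered column by column, while the expected frame lists all columns for one id before moving to the next. As a hedged sketch, one way to get that ordering is to keep the original row index during the melt and then sort on it with a stable sort; the variable name long_df is just an illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c'], 'C': [4, 5, 6],
                    'D': ['e', 'f', 'g'], 'E': [7, 8, 9], 'id': [25, 15, 30]})

long_df = (
    # ignore_index=False keeps the original row label on every melted row
    df1.melt(id_vars='id', var_name='column', ignore_index=False)
       # a stable sort groups rows by original label, preserving A..E order
       .sort_index(kind='mergesort')
       .reset_index(drop=True)
       [['column', 'value', 'id']]
)
print(long_df)
```

This stays fully vectorised, so it should scale much better than iterrows, although that has not been measured here.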
I have a small problem with transforming a dataset from wide to long. I tried melt but didn't get a good result. I hope someone can help me. The dataset is as follows:
df = pd.DataFrame({'id': [0, 1, 2, 3, 4, 5],
'type': ['a', 'b', 'c', 'd', 'e', 'f'],
'rank': ['alpha', 'beta', 'gamma', 'epsilon', 'phi', 'ro'],
'type.1': ['d', 'g', 'z', 'a', 'nan', 'nan'],
'rank.1': ['phi', 'sigma', 'gamma', 'lambda', 'nan', 'nan'],
'type.2': ['nan', 'nan', 'j', 'r', 'nan', 'nan'],
'rank.2': ['nan', 'nan', 'eta', 'theta', 'nan', 'nan']})
And I need the dataset in this way:
pd.DataFrame({'id': [0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 4, 5],
'type': ['a', 'd', 'b', 'g', 'c', 'z', 'j', 'd', 'a', 'r', 'e', 'f'],
'rank': ['alpha', 'phi', 'beta', 'sigma', 'gamma', 'gamma', 'eta', 'epsilon', 'lambda', 'theta', 'phi', 'ro']})
Can anyone help me with that? Thanks a lot
Use wide_to_long:
# normalize the `type` and `rank` columns so they have the same format as others
df = df.rename(columns={'type': 'type.0', 'rank': 'rank.0'})
(pd.wide_to_long(df, stubnames=['type', 'rank'], i='id', j='var', sep='.')
[lambda x: (x['type'] != 'nan') | (x['rank'] != 'nan')].reset_index())
id var type rank
0 0 0 a alpha
1 1 0 b beta
2 2 0 c gamma
3 3 0 d epsilon
4 4 0 e phi
5 5 0 f ro
6 0 1 d phi
7 1 1 g sigma
8 2 1 z gamma
9 3 1 a lambda
10 2 2 j eta
11 3 2 r theta
You can drop the var column if not needed.
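If the 'nan' strings really stand for missing values, a variation on the same idea is to convert them to real NaN first, so that dropna can do the filtering and a sort on the index gives the id-ordered layout asked for in the question. This is a sketch under that assumption, not a replacement for the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [0, 1, 2, 3, 4, 5],
                   'type': ['a', 'b', 'c', 'd', 'e', 'f'],
                   'rank': ['alpha', 'beta', 'gamma', 'epsilon', 'phi', 'ro'],
                   'type.1': ['d', 'g', 'z', 'a', 'nan', 'nan'],
                   'rank.1': ['phi', 'sigma', 'gamma', 'lambda', 'nan', 'nan'],
                   'type.2': ['nan', 'nan', 'j', 'r', 'nan', 'nan'],
                   'rank.2': ['nan', 'nan', 'eta', 'theta', 'nan', 'nan']})

# Assumption: the literal string 'nan' marks a missing value.
df = df.replace('nan', np.nan)
# Give the unsuffixed columns a '.0' suffix so every stub has the same shape.
df = df.rename(columns={'type': 'type.0', 'rank': 'rank.0'})

long_df = (pd.wide_to_long(df, stubnames=['type', 'rank'],
                           i='id', j='var', sep='.')
           .dropna(how='all')   # drop slots where both type and rank are missing
           .sort_index()        # order by id, then by slot number
           .reset_index()
           .drop(columns='var'))
print(long_df)
```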
One option is pivot_longer from pyjanitor, which abstracts the reshaping process:
# pip install pyjanitor
import janitor
(df
 .pivot_longer(
     index = 'id',
     names_to = ['type', 'rank'],
     names_pattern = ['type', 'rank'],
     sort_by_appearance = True)
 # keep only rows that contain no 'nan' string
 .loc[lambda df: ~df.eq('nan').any(axis=1)]
)
id type rank
0 0 a alpha
1 0 d phi
3 1 b beta
4 1 g sigma
6 2 c gamma
7 2 z gamma
8 2 j eta
9 3 d epsilon
10 3 a lambda
11 3 r theta
12 4 e phi
15 5 f ro
The idea for this particular reshape is that each regex in names_pattern is used to pair the matching column with the paired name in names_to.
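To make the pairing idea concrete, here is a small illustration written with the standard re module only. It is not pyjanitor's implementation, just a sketch of which source column ends up under which output name; the variable pairing is hypothetical.

```python
import re

columns = ['id', 'type', 'rank', 'type.1', 'rank.1', 'type.2', 'rank.2']
names_to = ['type', 'rank']
names_pattern = ['type', 'rank']

# For every column, find the first pattern that matches and record the
# output name that sits at the same position in names_to.
pairing = {}
for col in columns:
    for name, pattern in zip(names_to, names_pattern):
        if re.search(pattern, col):
            pairing[col] = name
            break

print(pairing)
```

Columns that match no pattern (here only id) are left out of the pairing, which is why id is passed separately as the index.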
I have the following problem. Suppose I have this dataframe:
import pandas as pd
d = {'Name': ['c', 'c', 'c', 'a', 'a', 'b', 'b', 'd', 'd'], 'Project': ['aa','ab','bc', 'aa', 'ab','aa', 'ab','ca', 'cb'],
'col2': [3, 4, 0, 6, 45, 6, -3, 8, -3]}
df = pd.DataFrame(data=d)
I need to add a new column that assigns a number to each set of projects per name. Desired output is:
import pandas as pd
dnew = {'Name': ['c', 'c', 'c', 'a', 'a', 'b', 'b', 'd', 'd'], 'Project': ['aa','ab','bc', 'aa', 'ab','aa', 'ab','ca', 'cb'],
'col2': [3, 4, 0, 6, 45, 6, -3, 8, -3], 'New_column': ['1', '1','1','2', '2','2','2','3','3']}
NEWdf = pd.DataFrame(data=dnew)
In other words: 'aa', 'ab', 'bc' in Project occur in the first rows, so I put 1 in the new column. 'aa', 'ab' is the second set of projects from the beginning. It occurs for Names 'a' and 'b', so I put 2 in the new column for both. 'ca', 'cb' is the third set and it occurs only for Name 'd', so I put 3 only for Name 'd'.
I tried to combine groupby with a for loop, but it did not work for me. Thanks a lot for the help!
This looks like a job for networkx, since Name and Project are related. You can use:
import networkx as nx
G=nx.from_pandas_edgelist(df, 'Name', 'Project')
l = list(nx.connected_components(G))
s = pd.Series(map(list,l)).explode()
df['new'] = df['Project'].map({v:k for k,v in s.items()}).add(1)
print(df)
Output (the sample shown here uses different data from the frame in the question):
Name Project col2 new
0 a aa 3 1
1 a ab 4 1
2 b bb 6 2
3 b bc 6 2
4 c aa 6 1
5 c ab 6 1
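If the goal is to number each distinct set of projects in order of first appearance, as in the desired output of the question, a plain pandas sketch is possible as well. It assumes that two names belong together only when their project sets are exactly equal, and that project names contain no commas; this is a different grouping rule from connected components.

```python
import pandas as pd

d = {'Name': ['c', 'c', 'c', 'a', 'a', 'b', 'b', 'd', 'd'],
     'Project': ['aa', 'ab', 'bc', 'aa', 'ab', 'aa', 'ab', 'ca', 'cb'],
     'col2': [3, 4, 0, 6, 45, 6, -3, 8, -3]}
df = pd.DataFrame(data=d)

# One string key per name describing its full set of projects.
project_sets = df.groupby('Name')['Project'].agg(lambda s: ','.join(sorted(s)))
key = df['Name'].map(project_sets)

# factorize numbers the keys in order of first appearance, starting at 0.
df['New_column'] = pd.factorize(key)[0] + 1
print(df)
```

The new column holds integers here; append .astype(str) if strings are needed as in the question's example.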
I have a pandas dataframe. The dataframe has 4 columns. The last one is just some random data. The first two columns are columns I will group by and sum the value column. Of each grouping, I would only like to keep the first row (i.e. the group with the largest sum).
My data:
import pandas as pd
df = pd.DataFrame(data=[['0', 'A', 3, 'a'],
['0', 'A', 2, 'b'],
['0', 'A', 1, 'c'],
['0', 'B', 3, 'd'],
['0', 'B', 4, 'e'],
['0', 'B', 4, 'f'],
['1', 'C', 3, 'g'],
['1', 'C', 2, 'h'],
['1', 'C', 1, 'i'],
['1', 'D', 3, 'j'],
['1', 'D', 4, 'k'],
['1', 'D', 4, 'l']
], columns=['group col 1', 'group col 2', 'value', 'random data']
)
Desired output:
group col 1 group col 2 value random data
3 0 B 3 d
4 0 B 4 e
5 0 B 4 f
9 1 D 3 j
10 1 D 4 k
11 1 D 4 l
I have an inefficient way of getting there, but looking for a simpler solution.
My solution:
df1 = df.groupby(['group col 1','group col 2']).agg('sum').reset_index()
biggest_groups= df1.sort_values(by=['group col 1', 'value'], ascending=[True, False])
biggest_groups = biggest_groups.groupby('group col 1').head(1)
pairs = biggest_groups[['group col 1', 'group col 2']].values.tolist()
pairs = [tuple(i) for i in pairs]
df = df[df[['group col 1', 'group col 2']].apply(tuple, axis = 1).isin(pairs)]
IIUC, you need two groupby steps here: the first gets the sum for each (group col 1, group col 2) pair, and the second selects the rows whose pair has the largest sum within each group col 1.
s=df.groupby(['group col 1', 'group col 2']).value.transform('sum')
s=df[s.groupby(df['group col 1']).transform('max')==s]
group col 1 group col 2 value random data
3 0 B 3 d
4 0 B 4 e
5 0 B 4 f
9 1 D 3 j
10 1 D 4 k
11 1 D 4 l
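The same two steps can also be written with an explicit aggregation followed by idxmax, which may be easier to read. This is only a sketch; note that, unlike the transform version, idxmax keeps a single pair per group when two pairs tie on the largest sum.

```python
import pandas as pd

df = pd.DataFrame(data=[['0', 'A', 3, 'a'], ['0', 'A', 2, 'b'], ['0', 'A', 1, 'c'],
                        ['0', 'B', 3, 'd'], ['0', 'B', 4, 'e'], ['0', 'B', 4, 'f'],
                        ['1', 'C', 3, 'g'], ['1', 'C', 2, 'h'], ['1', 'C', 1, 'i'],
                        ['1', 'D', 3, 'j'], ['1', 'D', 4, 'k'], ['1', 'D', 4, 'l']],
                  columns=['group col 1', 'group col 2', 'value', 'random data'])

keys = ['group col 1', 'group col 2']

# Step 1: sum of value for every (group col 1, group col 2) pair.
sums = df.groupby(keys)['value'].sum()

# Step 2: for each group col 1, the pair with the largest sum (as a tuple).
best = sums.groupby(level='group col 1').idxmax()

# Keep the original rows whose pair is one of the winners.
mask = pd.MultiIndex.from_frame(df[keys]).isin(best.tolist())
result = df[mask]
print(result)
```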
I have pandas dataframe that looks like this:
df = pd.DataFrame({'name': [0, 1, 2, 3], 'cards': [['A', 'B', 'C', 'D'],
['B', 'C', 'D', 'E'],
['E', 'F', 'G', 'H'],
['A', 'A', 'E', 'F']]})
name cards
0 ['A', 'B', 'C', 'D']
1 ['B', 'C', 'D', 'E']
2 ['E', 'F', 'G', 'H']
3 ['A', 'A', 'E', 'F']
And I'd like to create a matrix that looks like this:
name 0 1 2 3
name
0 4 3 0 1
1 3 4 1 1
2 0 1 4 2
3 1 1 2 4
Where the values are the number of items in common.
Any ideas?
Using the .apply method and a lambda we can get a dataframe directly:
def func(df, j):
    return pd.Series([len(set(i)&set(j)) for i in df.cards])
newdf = df.cards.apply(lambda x: func(df, x))
newdf
0 1 2 3
0 4 3 0 1
1 3 4 1 1
2 0 1 4 2
3 1 1 2 3
With a list comprehension that iterates through all pairs we can build the result:
import pandas as pd
df = pd.DataFrame({'name': [0, 1, 2, 3], 'cards': [['A', 'B', 'C', 'D'],
['B', 'C', 'D', 'E'],
['E', 'F', 'G', 'H'],
['A', 'A', 'E', 'F']]})
result=[[len(list(set(x) & set(y))) for x in df['cards']] for y in df['cards']]
print(result)
output:
[[4, 3, 0, 1], [3, 4, 1, 1], [0, 1, 4, 2], [1, 1, 2, 3]]
'&' computes the intersection of two sets.
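Because a set drops duplicates, the row ['A', 'A', 'E', 'F'] shares only 3 items with itself in the set-based results, whereas the matrix in the question has 4 there. A possible way to count duplicates too is a multiset intersection with collections.Counter; the sketch below assumes that is the intended meaning of "items in common".

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({'name': [0, 1, 2, 3],
                   'cards': [['A', 'B', 'C', 'D'],
                             ['B', 'C', 'D', 'E'],
                             ['E', 'F', 'G', 'H'],
                             ['A', 'A', 'E', 'F']]})

# One Counter (multiset) per row of cards.
counts = [Counter(cards) for cards in df['cards']]

# Counter & Counter keeps the minimum count of each shared item,
# so summing the values gives the number of items in common.
matrix = pd.DataFrame(
    [[sum((a & b).values()) for b in counts] for a in counts],
    index=df['name'], columns=df['name'])
print(matrix)
```

Using df['name'] for both the index and the columns also gives the labelled layout shown in the question.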
This is exactly what you want:
import pandas as pd
df = pd.DataFrame({'name': [0, 1, 2, 3], 'cards': [['A', 'B', 'C', 'D'],
['B', 'C', 'D', 'E'],
['E', 'F', 'G', 'H'],
['A', 'A', 'E', 'F']]})
result=[[len(x)-max(len(set(y) - set(x)),len(set(x) - set(y))) for x in df['cards']] for y in df['cards']]
print(result)
output:
[[4, 3, 0, 1], [3, 4, 1, 1], [0, 1, 4, 2], [1, 1, 2, 4]]
import pandas as pd
import numpy as np
df = pd.DataFrame([['A', 'B', 'C', 'D'],
['B', 'C', 'D', 'E'],
['E', 'F', 'G', 'H'],
['A', 'A', 'E', 'F']])
nrows = df.shape[0]
# Initialization
matrix = np.zeros((nrows,nrows),dtype= np.int64)
for i in range(0,nrows):
    for j in range(0,nrows):
        # count the positions at which columns i and j hold the same value
        matrix[i,j] = sum(df.iloc[:,i] == df.iloc[:,j])
print(matrix)
output:
[[4 1 0 0]
[1 4 0 0]
[0 0 4 0]
[0 0 0 4]]