Create dataframe from values/columns from another dataframe [duplicate] - python

This question already has answers here:
How do I melt a pandas dataframe?
(3 answers)
Closed 12 months ago.
I hava a dataframe like this:
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c'], 'C': [4, 5, 6], 'D': ['e', 'f', 'g'], 'E': [7, 8, 9], id: [25, 15, 30]})
I would like to use the values of df1 (and their respective columns) as a basis for filling in df2.
Expected:
expected = pd.DataFrame({'column': ['A', 'B', 'C', 'D', 'E', 'A', 'B', 'C', 'D', 'E'], 'value': [1, 'a', 4, 'e', 7, 2, 'b', 5, 'f', 8], 'id': [25, 15]})
I tried using iterrows, but as I need to use it for a large amount of data, the performance results were not positive. Can you help me?

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c'], 'C': [4, 5, 6], 'D': ['e', 'f', 'g'], 'E': [7, 8, 9], 'id': [25, 15, 30]})
pd.melt(df1, id_vars=['id'], var_name = 'column')
id column value
0 25 A 1
1 15 A 2
2 30 A 3
3 25 B a
4 15 B b
5 30 B c
6 25 C 4
7 15 C 5
8 30 C 6
9 25 D e
10 15 D f
11 30 D g
12 25 E 7
13 15 E 8
14 30 E 9

Have you tried Dataframe.melt? I guess something like this could do the trick:
df1.melt(ignore_index=False).merge(
df1, left_index=True, right_index=True
)[['variable', 'value', 'id']].reset_index()
There are some rows to be ignored, but that should be easy. I don't now about performance regarding large data frames, though.

Related

Looking to drop first 5 rows of dataframe after ever new value occurs

I am looking to drop the first 5 rows each time a new value occurs in a dataframe
data = {
'col1': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C'],
'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]
}
df = pd.DataFrame(data)
I am looking to drop the first 5 rows after each new value. Ex: 'A' value is new... delete first 5 rows. Now encounter 'B' value... delete its first 5 rows...
You need to do the following:
mask = df.groupby('col1').cumcount() >= 5
df = df.loc[mask]
You can use a negative tail:
df.groupby('col1').tail(-5)
To group by consecutive values:
group = df['col1'].ne(df['col1'].shift()).cumsum()
df.groupby(group).tail(-5)
Output:
col1 col2
5 A 6
6 A 7
12 B 13
13 B 14
19 C 20
20 C 21
NB. As pointed out by #Mark, there is an issue for older pandas versions (<1.4), in which case the cumcount approach can be used.

Count sequence within a column in pandas

I have a following problem. Suppose I have this dataframe:
import pandas as pd
d = {'Name': ['c', 'c', 'c', 'a', 'a', 'b', 'b', 'd', 'd'], 'Project': ['aa','ab','bc', 'aa', 'ab','aa', 'ab','ca', 'cb'],
'col2': [3, 4, 0, 6, 45, 6, -3, 8, -3]}
df = pd.DataFrame(data=d)
I need to add a new column that add a number to each project per name. Desired output is:
import pandas as pd
dnew = {'Name': ['c', 'c', 'c', 'a', 'a', 'b', 'b', 'd', 'd'], 'Project': ['aa','ab','bc', 'aa', 'ab','aa', 'ab','ca', 'cb'],
'col2': [3, 4, 0, 6, 45, 6, -3, 8, -3], 'New_column': ['1', '1','1','2', '2','2','2','3','3']}
NEWdf = pd.DataFrame(data=dnew)
In other words: 'aa','ab','bc' in Project occurs in the first rows, so I add 1 to the new column. 'aa', 'ab' is the second Project from the beginning. It occurs for Name 'a' and 'b', so I add 2 to the both new column. 'ca', 'cb' is the third project and it occurs only for name 'd', so I add 3 only to the name 'd'.
I tried to combine groupby with a for loop, but it did not worked to me. Thanks a lot for a help!
Looks like networkx since Name and Project are related , you can use:
import networkx as nx
G=nx.from_pandas_edgelist(df, 'Name', 'Project')
l = list(nx.connected_components(G))
s = pd.Series(map(list,l)).explode()
df['new'] = df['Project'].map({v:k for k,v in s.items()}).add(1)
print(df)
Name Project col2 new
0 a aa 3 1
1 a ab 4 1
2 b bb 6 2
3 b bc 6 2
4 c aa 6 1
5 c ab 6 1

Pairwise similarity

I have pandas dataframe that looks like this:
df = pd.DataFrame({'name': [0, 1, 2, 3], 'cards': [['A', 'B', 'C', 'D'],
['B', 'C', 'D', 'E'],
['E', 'F', 'G', 'H'],
['A', 'A', 'E', 'F']]})
name cards
0 ['A', 'B', 'C', 'D']
1 ['B', 'C', 'D', 'E']
2 ['E', 'F', 'G', 'H']
3 ['A', 'A', 'E', 'F']
And I'd like to create a matrix that looks like this:
name 0 1 2 3
name
0 4 3 0 1
1 3 4 1 1
2 0 1 4 2
3 1 1 2 4
Where the values are the number of items in common.
Any ideas?
Using .apply method and lambda we can directly get a dataframe
def func(df, j):
return pd.Series([len(set(i)&set(j)) for i in df.cards])
newdf = df.cards.apply(lambda x: func(df, x))
newdf
0 1 2 3
0 4 3 0 1
1 3 4 1 1
2 0 1 4 2
3 1 1 2 3
By list comprehension and iterate through all pairs we can make the result:
import pandas as pd
df = pd.DataFrame({'name': [0, 1, 2, 3], 'cards': [['A', 'B', 'C', 'D'],
['B', 'C', 'D', 'E'],
['E', 'F', 'G', 'H'],
['A', 'A', 'E', 'F']]})
result=[[len(list(set(x) & set(y))) for x in df['cards']] for y in df['cards']]
print(result)
output :
[[4, 3, 0, 1], [3, 4, 1, 1], [0, 1, 4, 2], [1, 1, 2, 3]]
'&' is used to calculate intersection of two sets
This is exactly what you want:
import pandas as pd
df = pd.DataFrame({'name': [0, 1, 2, 3], 'cards': [['A', 'B', 'C', 'D'],
['B', 'C', 'D', 'E'],
['E', 'F', 'G', 'H'],
['A', 'A', 'E', 'F']]})
result=[[len(x)-max(len(set(y) - set(x)),len(set(x) - set(y))) for x in df['cards']] for y in df['cards']]
print(result)
output:
[[4, 3, 0, 1], [3, 4, 1, 1], [0, 1, 4, 2], [1, 1, 2, 4]]
import pandas as pd
import numpy as np
df = pd.DataFrame([['A', 'B', 'C', 'D'],
['B', 'C', 'D', 'E'],
['E', 'F', 'G', 'H'],
['A', 'A', 'E', 'F']])
nrows = df.shape[0]
# Initialization
matrix = np.zeros((nrows,nrows),dtype= np.int64)
for i in range(0,nrows):
for j in range(0,nrows):
matrix[i,j] = sum(df.iloc[:,i] == df.iloc[:,j])
output
print(matrix)
[[4 1 0 0]
[1 4 0 0]
[0 0 4 0]
[0 0 0 4]]

What is the 'name' in pandas.DataFrame.columns?

When I execute a pivot on a pandas dataframe,
df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two','two'],
'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
'baz': [1, 2, 3, 4, 5, 6],
'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
df.pivot(index='foo', columns='bar', values='baz')
>>> bar A B C
foo
one 1 2 3
two 4 5 6
Which has these columns,
df.pivot(index='foo', columns='bar', values='baz').columns
>>> Index(['A', 'B', 'C'], dtype='object', name='bar')
My question is, what does name=bar part mean?
From the docs
name : object
Name to be stored in the index
In your example, it's the name of the pandas.Index that is used as the column name.
The name attribute becomes useful in some cases, for instance if you have a multiindex, you can refer to the level of the index by it's name:
>>> df
idx1 1 2 3 # <- column header 1
idx2 a b c # <- column header 2
vals 5 4 6
>>> df.columns
MultiIndex(levels=[[1, 2, 3], ['a', 'b', 'c']],
labels=[[0, 1, 2], [0, 1, 2]],
names=['idx1', 'idx2'])
>>> df.columns.get_level_values('idx1')
Int64Index([1, 2, 3], dtype='int64', name='idx1')
>>> df.columns.get_level_values('idx2')
Index(['a', 'b', 'c'], dtype='object', name='idx2')

Python: Create vectors of same length using two DataFrames

I have two dataframes as follows:
d1 = {'person' : ['1', '1', '1', '2', '2', '3', '3', '4', '4'],
'category' : ['A', 'B', 'C', 'B', 'D', 'E', 'F', 'F', 'D'],
'value' : [2, 3, 1, 2, 1, 4, 2, 1, 3]}
d2 = {'group' : [100, 100, 100, 200, 200, 300, 300],
'category' : ['A', 'D', 'F', 'B', 'C', 'A', 'F'],
'value' : [10, 8, 8, 6, 7, 8, 5]}
I want to get vectors of the same length out of the column category (i.e. indexed by category) for each person and group. In other words, I want to transform this long dataframes into wide format where the name of the new columns are the values of the column category.
What is the best way to do this? This is an example of what I need:
id type A B C D E F
0 100 group 10 0 0 8 0 8
1 200 group 0 6 7 0 0 0
2 300 group 8 0 0 0 0 5
3 1 person 2 3 1 0 0 0
4 2 person 0 2 0 1 0 0
5 3 person 0 0 0 0 4 2
6 4 person 0 0 0 3 0 1
My current script appends both dataframes and then it gets a pivot table. My concern is that in this case the types of the id columns are different.
I do this because sometimes not all the categories are in each dataframe (e.g. 'E' is not in df2).
This is what I have:
import pandas as pd
d1 = {'person' : ['1', '1', '1', '2', '2', '3', '3', '4', '4'],
'category' : ['A', 'B', 'C', 'B', 'D', 'E', 'F', 'F', 'D'],
'value' : [2, 3, 1, 2, 1, 4, 2, 1, 3]}
d2 = {'group' : [100, 100, 100, 200, 200, 300, 300],
'category' : ['A', 'D', 'F', 'B', 'C', 'A', 'F'],
'value' : [10, 8, 8, 6, 7, 8, 5]}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
df1['type'] = 'person'
df2['type'] = 'group'
df1.rename(columns={'person': 'id'}, inplace = True)
df2.rename(columns={'group': 'id'}, inplace = True)
rawpivot = pd.DataFrame([])
rawpivot = rawpivot.append(df1)
rawpivot = rawpivot.append(df2)
pivot = rawpivot.pivot_table(index=['id','type'], columns='category', values='value', aggfunc='sum', fill_value=0)
pivot.reset_index(inplace = True)
import pandas as pd
d1 = {'person' : ['1', '1', '1', '2', '2', '3', '3', '4', '4'],
'category' : ['A', 'B', 'C', 'B', 'D', 'E', 'F', 'F', 'D'],
'value' : [2, 3, 1, 2, 1, 4, 2, 1, 3]}
d2 = {'group' : [100, 100, 100, 200, 200, 300, 300],
'category' : ['A', 'D', 'F', 'B', 'C', 'A', 'F'],
'value' : [10, 8, 8, 6, 7, 8, 5]}
cols = ['idx', 'type', 'A', 'B', 'C', 'D', 'E', 'F']
df1 = pd.DataFrame(columns=cols)
def add_data(type_, data):
global df1
for id_, category, value in zip(data[type_], data['category'], data['value']):
if id_ not in df1.idx.values:
row = pd.DataFrame({'idx': id_, 'type': type_}, columns = cols, index=[0])
df1 = df1.append(row, ignore_index = True)
df1.loc[df1['idx']==id_, category] = value
add_data('group', d2)
add_data('person', d1)
df1 = df1.fillna(0)
df1 now holds the following values
idx type A B C D E F
0 100 group 10 0 0 8 0 8
1 200 group 0 6 7 0 0 0
2 300 group 8 0 0 0 0 5
3 1 person 2 3 1 0 0 0
4 2 person 0 2 0 1 0 0
5 3 person 0 0 0 0 4 2
6 4 person 0 0 0 3 0 1

Categories

Resources