What is the 'name' in pandas.DataFrame.columns? - python

When I execute a pivot on a pandas dataframe,
df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two','two'],
'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
'baz': [1, 2, 3, 4, 5, 6],
'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
df.pivot(index='foo', columns='bar', values='baz')
>>> bar A B C
foo
one 1 2 3
two 4 5 6
Which has these columns,
df.pivot(index='foo', columns='bar', values='baz').columns
>>> Index(['A', 'B', 'C'], dtype='object', name='bar')
My question is, what does the name='bar' part mean?

From the docs
name : object
Name to be stored in the index
In your example, it's the name of the pandas.Index that holds the column labels; pivot sets it to 'bar' because that is the column you passed as columns=.
The name attribute becomes useful in some cases. For instance, if you have a MultiIndex, you can refer to a level of the index by its name:
>>> df
idx1 1 2 3 # <- column header 1
idx2 a b c # <- column header 2
vals 5 4 6
>>> df.columns
MultiIndex(levels=[[1, 2, 3], ['a', 'b', 'c']],
labels=[[0, 1, 2], [0, 1, 2]],
names=['idx1', 'idx2'])
>>> df.columns.get_level_values('idx1')
Int64Index([1, 2, 3], dtype='int64', name='idx1')
>>> df.columns.get_level_values('idx2')
Index(['a', 'b', 'c'], dtype='object', name='idx2')
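As a small sketch of the same idea on the pivot from the question: the name can be read from columns.name and changed or cleared with rename_axis. The variable names pivoted, renamed and cleared, and the label 'letter', are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
                   'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'baz': [1, 2, 3, 4, 5, 6],
                   'zoo': ['x', 'y', 'z', 'q', 'w', 't']})

pivoted = df.pivot(index='foo', columns='bar', values='baz')

# The column Index is named after the column passed as columns=
print(pivoted.columns.name)
# The row Index is named after the column passed as index=
print(pivoted.index.name)

# rename_axis returns a new frame with a different (or no) column-axis name
renamed = pivoted.rename_axis(columns='letter')  # 'letter' is an arbitrary label
cleared = pivoted.rename_axis(columns=None)
print(renamed.columns.name, cleared.columns.name)
```

Clearing the name only changes how the header is displayed; the column labels A, B, C stay the same.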

Related

Create dataframe from values/columns from another dataframe [duplicate]

I have a dataframe like this:
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c'], 'C': [4, 5, 6], 'D': ['e', 'f', 'g'], 'E': [7, 8, 9], 'id': [25, 15, 30]})
I would like to use the values of df1 (and their respective columns) as a basis for filling in df2.
Expected:
expected = pd.DataFrame({'column': ['A', 'B', 'C', 'D', 'E', 'A', 'B', 'C', 'D', 'E'], 'value': [1, 'a', 4, 'e', 7, 2, 'b', 5, 'f', 8], 'id': [25, 25, 25, 25, 25, 15, 15, 15, 15, 15]})
I tried using iterrows, but as I need to use it for a large amount of data, the performance results were not positive. Can you help me?
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c'], 'C': [4, 5, 6], 'D': ['e', 'f', 'g'], 'E': [7, 8, 9], 'id': [25, 15, 30]})
pd.melt(df1, id_vars=['id'], var_name = 'column')
id column value
0 25 A 1
1 15 A 2
2 30 A 3
3 25 B a
4 15 B b
5 30 B c
6 25 C 4
7 15 C 5
8 30 C 6
9 25 D e
10 15 D f
11 30 D g
12 25 E 7
13 15 E 8
14 30 E 9
Have you tried DataFrame.melt? I guess something like this could do the trick:
df1.melt(ignore_index=False).merge(
df1, left_index=True, right_index=True
)[['variable', 'value', 'id']].reset_index()
The result includes extra rows where variable is 'id' (because the id column is melted too), but those are easy to filter out. I don't know how this performs on large data frames, though.
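The expected frame in the question lists the values row by row (A to E for id 25, then A to E for id 15), while plain melt lists them column by column. A possible sketch for getting the row-wise order, assuming a pandas version where melt accepts ignore_index (1.1 or later), is to keep the original index and sort on it with a stable sort. The name out is made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c'], 'C': [4, 5, 6],
                    'D': ['e', 'f', 'g'], 'E': [7, 8, 9], 'id': [25, 15, 30]})

out = (
    df1.melt(id_vars='id', var_name='column', ignore_index=False)
       # the original row number is still the index; a stable sort keeps
       # the A..E order inside each original row
       .sort_index(kind='mergesort')
       .reset_index(drop=True)[['column', 'value', 'id']]
)
print(out)
```

This stays vectorised, so it should avoid the iterrows overhead, though it has not been benchmarked here.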

Count sequence within a column in pandas

I have the following problem. Suppose I have this dataframe:
import pandas as pd
d = {'Name': ['c', 'c', 'c', 'a', 'a', 'b', 'b', 'd', 'd'], 'Project': ['aa','ab','bc', 'aa', 'ab','aa', 'ab','ca', 'cb'],
'col2': [3, 4, 0, 6, 45, 6, -3, 8, -3]}
df = pd.DataFrame(data=d)
I need to add a new column that assigns a number to each set of projects per name. The desired output is:
import pandas as pd
dnew = {'Name': ['c', 'c', 'c', 'a', 'a', 'b', 'b', 'd', 'd'], 'Project': ['aa','ab','bc', 'aa', 'ab','aa', 'ab','ca', 'cb'],
'col2': [3, 4, 0, 6, 45, 6, -3, 8, -3], 'New_column': ['1', '1','1','2', '2','2','2','3','3']}
NEWdf = pd.DataFrame(data=dnew)
In other words: the projects 'aa', 'ab', 'bc' occur in the first rows, so I put 1 in the new column. 'aa', 'ab' is the second set of projects from the beginning. It occurs for names 'a' and 'b', so I put 2 in the new column for both. 'ca', 'cb' is the third set and occurs only for name 'd', so I put 3 only for name 'd'.
I tried to combine groupby with a for loop, but it did not work for me. Thanks a lot for any help!
This looks like a job for networkx, since Name and Project are related. You can use:
import networkx as nx
G=nx.from_pandas_edgelist(df, 'Name', 'Project')
l = list(nx.connected_components(G))
s = pd.Series(map(list,l)).explode()
df['new'] = df['Project'].map({v:k for k,v in s.items()}).add(1)
print(df)
Name Project col2 new
0 a aa 3 1
1 a ab 4 1
2 b bb 6 2
3 b bc 6 2
4 c aa 6 1
5 c ab 6 1
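If the goal is exactly the numbering described in the question (one number per distinct set of projects, in order of first appearance), a pandas-only sketch is to build one key per Name from its projects and factorize those keys. This is an alternative to the connected-components approach, not the same technique. The names project_key and name_to_number are made up for illustration, and the key assumes project names contain no commas.

```python
import pandas as pd

d = {'Name': ['c', 'c', 'c', 'a', 'a', 'b', 'b', 'd', 'd'],
     'Project': ['aa', 'ab', 'bc', 'aa', 'ab', 'aa', 'ab', 'ca', 'cb'],
     'col2': [3, 4, 0, 6, 45, 6, -3, 8, -3]}
df = pd.DataFrame(data=d)

# One key per Name: its sorted projects joined into a single string.
# sort=False keeps the Names in order of first appearance (c, a, b, d).
project_key = (df.groupby('Name', sort=False)['Project']
                 .agg(lambda s: ','.join(sorted(s))))

# factorize numbers distinct keys 0, 1, 2, ... in order of first appearance
codes, _ = pd.factorize(project_key)
name_to_number = pd.Series(codes + 1, index=project_key.index)

df['New_column'] = df['Name'].map(name_to_number)
print(df)
```

The new column holds integers rather than the strings shown in the desired output; .astype(str) would convert them if strings are really needed.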

Pairwise similarity

I have a pandas dataframe that looks like this:
df = pd.DataFrame({'name': [0, 1, 2, 3], 'cards': [['A', 'B', 'C', 'D'],
['B', 'C', 'D', 'E'],
['E', 'F', 'G', 'H'],
['A', 'A', 'E', 'F']]})
name cards
0 ['A', 'B', 'C', 'D']
1 ['B', 'C', 'D', 'E']
2 ['E', 'F', 'G', 'H']
3 ['A', 'A', 'E', 'F']
And I'd like to create a matrix that looks like this:
name 0 1 2 3
name
0 4 3 0 1
1 3 4 1 1
2 0 1 4 2
3 1 1 2 4
Where the values are the number of items in common.
Any ideas?
Using the .apply method with a lambda, we can get a dataframe directly:
def func(df, j):
    return pd.Series([len(set(i) & set(j)) for i in df.cards])
newdf = df.cards.apply(lambda x: func(df, x))
newdf
0 1 2 3
0 4 3 0 1
1 3 4 1 1
2 0 1 4 2
3 1 1 2 3
With a list comprehension that iterates through all pairs, we can build the result:
import pandas as pd
df = pd.DataFrame({'name': [0, 1, 2, 3], 'cards': [['A', 'B', 'C', 'D'],
['B', 'C', 'D', 'E'],
['E', 'F', 'G', 'H'],
['A', 'A', 'E', 'F']]})
result=[[len(list(set(x) & set(y))) for x in df['cards']] for y in df['cards']]
print(result)
output :
[[4, 3, 0, 1], [3, 4, 1, 1], [0, 1, 4, 2], [1, 1, 2, 3]]
'&' is used to calculate intersection of two sets
This is exactly what you want:
import pandas as pd
df = pd.DataFrame({'name': [0, 1, 2, 3], 'cards': [['A', 'B', 'C', 'D'],
['B', 'C', 'D', 'E'],
['E', 'F', 'G', 'H'],
['A', 'A', 'E', 'F']]})
result=[[len(x)-max(len(set(y) - set(x)),len(set(x) - set(y))) for x in df['cards']] for y in df['cards']]
print(result)
output:
[[4, 3, 0, 1], [3, 4, 1, 1], [0, 1, 4, 2], [1, 1, 2, 4]]
import pandas as pd
import numpy as np
df = pd.DataFrame([['A', 'B', 'C', 'D'],
['B', 'C', 'D', 'E'],
['E', 'F', 'G', 'H'],
['A', 'A', 'E', 'F']])
nrows = df.shape[0]
# Initialization
matrix = np.zeros((nrows,nrows),dtype= np.int64)
for i in range(0,nrows):
    for j in range(0,nrows):
        matrix[i,j] = sum(df.iloc[:,i] == df.iloc[:,j])
print(matrix)
output:
[[4 1 0 0]
[1 4 0 0]
[0 0 4 0]
[0 0 0 4]]
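For comparison, the matrix requested in the question has 4 in the last diagonal cell, which means the duplicate 'A' is counted twice. Plain set intersection cannot do that. One way to count duplicates is a multiset intersection with collections.Counter. This is a sketch, and the names counts, matrix and result are made up for illustration.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({'name': [0, 1, 2, 3],
                   'cards': [['A', 'B', 'C', 'D'],
                             ['B', 'C', 'D', 'E'],
                             ['E', 'F', 'G', 'H'],
                             ['A', 'A', 'E', 'F']]})

# Count each card per row so duplicates are not collapsed
counts = [Counter(cards) for cards in df['cards']]

# Counter & Counter keeps the minimum count of each shared card;
# summing those counts gives the number of items in common
matrix = [[sum((a & b).values()) for b in counts] for a in counts]

result = pd.DataFrame(matrix, index=df['name'], columns=df['name'])
print(result)
```

The nested loop is still quadratic in the number of rows, so it suits small frames like this one.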

Why does Pandas Series.isin work for strings but not numbers?

Simple example:
>>> df = pd.DataFrame(
columns=['x', 'y', 'z'],
data=np.array([
['a', 1, 'foo'],
['b', 2, 'bar'],
['c', 3, 'biz'],
['d', 99, 'baz'] ]))
>>> df
x y z
0 a 1 foo
1 b 2 bar
2 c 3 biz
3 d 99 baz
>>> df[df.z.isin(['foo', 'biz'])]
x y z
0 a 1 foo
2 c 3 biz
That works as expected!
However, now I try to use y:
>>> df[df.y.isin([1,3])]
Empty DataFrame
Columns: [x, y, z]
Index: []
What just happened?
I would have expected the same two rows to be output as in the above .z.isin(...) example.
Let's look at the source of the problem. It's actually the call to np.array.
np.array([['a', 1, 'foo'],
['b', 2, 'bar'],
['c', 3, 'biz'],
['d', 99, 'baz']])
This actually coerces the integers to strings:
array([['a', '1', 'foo'],
['b', '2', 'bar'],
['c', '3', 'biz'],
['d', '99', 'baz']], dtype='<U3')
Notice the second column is all strings, because of type coercion. OTOH, if you initialise the array with an explicit dtype=object, the individual types are preserved:
data = np.array([['a', 1, 'foo'],
['b', 2, 'bar'],
['c', 3, 'biz'],
['d', 99, 'baz']], dtype=object)
df = pd.DataFrame(columns=['x', 'y', 'z'], data=data)
df.y.isin([1,3])
0 True
1 False
2 True
3 False
Name: y, dtype: bool
Or, better still, pass a heterogeneous list of lists (without conversion to array).
df = pd.DataFrame(data=[['a', 1, 'foo'],
['b', 2, 'bar'],
['c', 3, 'biz'],
['d', 99, 'baz']],
columns=list('xyz'))
df.y.isin([1,3])
0 True
1 False
2 True
3 False
Name: y, dtype: bool
If you look at df.y, you will see it has dtype object; if you convert it to int, you will get the behavior you expect.
In [8]: df.y
Out[8]:
0 1
1 2
2 3
3 99
Name: y, dtype: object
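A minimal sketch of that conversion, assuming the frame was built from the np.array call in the question so that y holds the strings '1', '2', '3', '99'. The name mask is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    columns=['x', 'y', 'z'],
    data=np.array([
        ['a', 1, 'foo'],
        ['b', 2, 'bar'],
        ['c', 3, 'biz'],
        ['d', 99, 'baz']]))

# np.array coerced the integers to strings, so convert the column back
df['y'] = df['y'].astype(int)

mask = df['y'].isin([1, 3])
print(df[mask])
```

Comparing against strings instead, as in df.y.isin(['1', '3']), would also match without any conversion, but leaves the column as text.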

Pandas String Series to int normalisation for Tensor

I have a pandas Series object with repeated string values that I need to normalise into int values to feed into TensorFlow.
I have looked at converting this into a Category as per this but it creates a code per item rather than identifying duplicates.
e.g. I wish for the following conversion
['a', 'b', 'c', 'd', 'a', 'a', 'c'] -> [1, 2, 3, 4, 1, 1, 3]
You need to adjust the output of factorize slightly:
print ((pd.factorize(['a', 'b', 'c', 'd', 'a', 'a', 'c'])[0] + 1).tolist())
[1, 2, 3, 4, 1, 1, 3]
You need to add cat.codes after converting to category:
pd.Series(['a', 'b', 'c', 'd', 'a', 'a', 'c']).astype('category').cat.codes+1
Out[1407]:
0 1
1 2
2 3
3 4
4 1
5 1
6 3
dtype: int8
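The two approaches agree on the example above only because the values first appear in alphabetical order. As a sketch of the difference: factorize numbers values in order of first appearance, while cat.codes follows the sorted categories. The input below is a made-up example.

```python
import pandas as pd

s = pd.Series(['b', 'a', 'b', 'c'])  # hypothetical input, not from the question

# factorize: codes follow order of first appearance (b, a, c)
by_appearance = (pd.factorize(s)[0] + 1).tolist()

# category codes: categories are sorted (a, b, c) before numbering
by_sorted = (s.astype('category').cat.codes + 1).tolist()

print(by_appearance)  # [1, 2, 1, 3]
print(by_sorted)      # [2, 1, 2, 3]
```

Either mapping works as input to a model as long as the same one is used consistently.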
