Pandas get data from a list of column names - python

I have a pandas dataframe df and a list of column names columns, like so:
df = pd.DataFrame({
'A': ['b','b','c','d'],
'C': ['b1','b2','c1','d2'],
'B': list(range(4))})
columns = ['A','B']
Now I want to get all the data from these columns of the dataframe in one single series like so:
b
0
b
1
c
2
d
3
This is what I tried:
srs = pd.Series()
srs.append(df[column].values for column in columns)
But it is throwing this error:
TypeError: cannot concatenate object of type '<class 'generator'>';
only Series and DataFrame objs are valid
How can I fix this issue?

I think you can use numpy.ravel:
srs = pd.Series(np.ravel(df[columns]))
print (srs)
0 b
1 0
2 b
3 1
4 c
5 2
6 d
7 3
dtype: object
Or DataFrame.stack with Series.reset_index and drop=True:
srs = df[columns].stack().reset_index(drop=True)
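For comparison, this gives the same row-wise order as the np.ravel version above (a quick sketch to verify):

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['b', 'b', 'c', 'd'],
    'C': ['b1', 'b2', 'c1', 'd2'],
    'B': list(range(4))})
columns = ['A', 'B']

# stack() walks the frame row by row, emitting each row's 'A' value
# followed by its 'B' value; reset_index(drop=True) replaces the
# resulting MultiIndex with a plain RangeIndex
srs = df[columns].stack().reset_index(drop=True)
print(srs.tolist())  # ['b', 0, 'b', 1, 'c', 2, 'd', 3]
```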
If the order may be changed, it is possible to use DataFrame.melt:
srs = df[columns].melt()['value']
print (srs)
0 b
1 b
2 c
3 d
4 0
5 1
6 2
7 3
Name: value, dtype: object

You could do:
from itertools import chain
import pandas as pd
df = pd.DataFrame({
'A': ['b','b','c','d'],
'C': ['b1','b2','c1','d2'],
'B': list(range(4))})
columns = ['A','B']
res = pd.Series(chain.from_iterable(df[columns].to_numpy()))
print(res)
Output
0 b
1 0
2 b
3 1
4 c
5 2
6 d
7 3
dtype: object
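One more possibility, not shown in the answers above but worth noting as a sketch: concatenating the column Series directly. Note this gives the column-wise order (all of 'A', then all of 'B'), like melt:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['b', 'b', 'c', 'd'],
    'C': ['b1', 'b2', 'c1', 'd2'],
    'B': list(range(4))})
columns = ['A', 'B']

# concatenating the selected columns one after another appends all of
# 'A' first, then all of 'B'; ignore_index=True renumbers the result
res = pd.concat([df[c] for c in columns], ignore_index=True)
print(res.tolist())  # ['b', 'b', 'c', 'd', 0, 1, 2, 3]
```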

Related

how to transform a dict of lists to a dataframe in python?

I have a dict in python like this:
d = {"a": [1,2,3], "b": [4,5,6]}
I want to transform in a dataframe like this:
letter number
a 1
a 2
a 3
b 4
b 5
b 6
I have tried this code:
df = pd.DataFrame.from_dict(d, orient='index').T
but this gave me:
   a  b
0  1  4
1  2  5
2  3  6
You can always read your data in as you already have and then .melt it:
When passed no id_vars or value_vars, melt turns each of your columns into their own rows.
import pandas as pd
d = {"a": [1,2,3], "b": [4,5,6]}
out = pd.DataFrame(d).melt(var_name='letter', value_name='value')
print(out)
letter value
0 a 1
1 a 2
2 a 3
3 b 4
4 b 5
5 b 6
To use 'letter' and 'number' as column labels you could use:
a2 = [[key, val] for key, x in d.items() for val in x]
dict2 = pd.DataFrame(a2, columns = ['letter', 'number'])
which gives
letter number
0 a 1
1 a 2
2 a 3
3 b 4
4 b 5
5 b 6
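Alternatively, melt can produce the 'number' label directly via its value_name parameter, so no second step is needed (a sketch):

```python
import pandas as pd

d = {"a": [1, 2, 3], "b": [4, 5, 6]}

# var_name / value_name set both column labels in a single call
out = pd.DataFrame(d).melt(var_name='letter', value_name='number')
print(out)
```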
Yet another possible solution:
(pd.Series(d, index=d.keys(), name='numbers')
.rename_axis('letters').reset_index()
.explode('numbers', ignore_index=True))
Output:
letters numbers
0 a 1
1 a 2
2 a 3
3 b 4
4 b 5
5 b 6
This will yield what you want (there might be a simpler way though):
import pandas as pd
my_dict = {"a": [1,2,3], "b": [4,5,6]}
my_list = [[key, val] for key in my_dict for val in my_dict[key] ]
df = pd.DataFrame(my_list, columns=['letter','number'])
df
# Out[106]:
# letter number
# 0 a 1
# 1 a 2
# 2 a 3
# 3 b 4
# 4 b 5
# 5 b 6

Pandas DataFrame from a dict of ndarray

Consider a dictionary like the following:
>>> dict_temp = {'a': np.array([[0,1,2], [3,4,5]]),
'b': np.array([[3,4,5], [2,5,1], [5,3,7]])}
How can I build a pandas DataFrame out of this, using a multi-index with level 0 and 1 as follows:
level_0 = ['a', 'b']
level_1 = [[0,1], [0,1,2]]
I expect the code to build the multi-index levels itself... I don't care about the column names for now.
Appreciate comments...
Try concat:
pd.concat({k:pd.DataFrame(d) for k, d in dict_temp.items()})
Output:
0 1 2
a 0 0 1 2
1 3 4 5
b 0 3 4 5
1 2 5 1
2 5 3 7
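If you also want the two index levels to carry names, concat accepts a names parameter; a sketch (the level names here are arbitrary, chosen to match the question's level_0/level_1 wording):

```python
import numpy as np
import pandas as pd

dict_temp = {'a': np.array([[0, 1, 2], [3, 4, 5]]),
             'b': np.array([[3, 4, 5], [2, 5, 1], [5, 3, 7]])}

# names= labels the outer (dict key) and inner (row) index levels
df = pd.concat({k: pd.DataFrame(v) for k, v in dict_temp.items()},
               names=['level_0', 'level_1'])

# selecting on the outer level recovers one of the original arrays
print(df.loc['b'])
```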

Creating new DataFrame from the cartesian product of 2 lists

What I want to achieve is the following in Pandas:
a = [1,2,3,4]
b = ['a', 'b']
Can I create a DataFrame like:
column1 column2
'a' 1
'a' 2
'a' 3
'a' 4
'b' 1
'b' 2
'b' 3
'b' 4
Use itertools.product with DataFrame constructor:
a = [1, 2, 3, 4]
b = ['a', 'b']
from itertools import product
# pandas 0.24.0+
df = pd.DataFrame(product(b, a), columns=['column1', 'column2'])
# pandas below
# df = pd.DataFrame(list(product(b, a)), columns=['column1', 'column2'])
print (df)
column1 column2
0 a 1
1 a 2
2 a 3
3 a 4
4 b 1
5 b 2
6 b 3
7 b 4
I will put here another method, just in case someone prefers it.
full mockup below:
import pandas as pd
a = [1,2,3,4]
b = ['a', 'b']
df=pd.DataFrame([(y, x) for x in a for y in b], columns=['column1','column2'])
df
result below:
column1 column2
0 a 1
1 b 1
2 a 2
3 b 2
4 a 3
5 b 3
6 a 4
7 b 4
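Since pandas 1.2 there is also a built-in cross join, which avoids itertools and list comprehensions entirely (a sketch; note it orders the rows like the first answer, grouped by column1):

```python
import pandas as pd

a = [1, 2, 3, 4]
b = ['a', 'b']

# how='cross' (pandas >= 1.2) pairs every row of the left frame
# with every row of the right frame
df = pd.merge(pd.DataFrame({'column1': b}),
              pd.DataFrame({'column2': a}),
              how='cross')
print(df)
```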

Compare values from two pandas data frames, order-independent

I am new to data science. I want to check which elements from one data frame exist in another data frame, e.g.
df1 = [1,2,8,6]
df2 = [5,2,6,9]
# for 1 output should be False
# for 2 output should be True
# for 6 output should be True
etc.
Note: I have a matrix, not a vector.
I have tried using the following code:
import pandas as pd
import numpy as np
priority_dataframe = pd.read_excel(prioritylist_file_path, sheet_name='Sheet1', index=None)
priority_dict = {column: np.array(priority_dataframe[column].dropna(axis=0, how='all').str.lower())
                 for column in priority_dataframe.columns}
keys_found_per_sheet = []
if file_path.lower().endswith('.csv'):
    file_dataframe = pd.read_csv(file_path)
else:
    file_dataframe = pd.read_excel(file_path, sheet_name=sheet, index=None)
file_cell_array = list()
for column in file_dataframe.columns:
    for file_cell in np.array(file_dataframe[column].dropna(axis=0, how='all')):
        if isinstance(file_cell, str):
            file_cell_array.append(file_cell)
        else:
            file_cell_array.append(str(file_cell))
converted_file_cell_array = np.array(file_cell_array)
for key, values in priority_dict.items():
    for priority_cell in values:
        if priority_cell in converted_file_cell_array[:]:
            keys_found_per_sheet.append(key)
            break
Am I doing something wrong in if priority_cell in converted_file_cell_array[:]?
Is there any other efficient way to do that?
You can take the .values from each dataframe, convert them to a set(), and take the set intersection.
set1 = set(df1.values.reshape(-1).tolist())
set2 = set(df2.values.reshape(-1).tolist())
common = set1 & set2
You can flatten all values of DataFrames by numpy.ravel and then use set.intersection():
df1 = pd.DataFrame({'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')})
print (df1)
A B C D E F
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
df2 = pd.DataFrame({'A':[2,3,13,4], 'Z':list('abfr')})
print (df2)
A Z
0 2 a
1 3 b
2 13 f
3 4 r
L = list(set(df1.values.ravel()).intersection(df2.values.ravel()))
print (L)
['f', 2, 3, 4, 'a', 'b']
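If you need the per-element True/False that the question describes (rather than the set of common values), DataFrame.isin accepts a flat array of the other frame's values; a sketch using the small matrices from the question:

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [8, 6]])
df2 = pd.DataFrame([[5, 2], [6, 9]])

# each cell of df1 is checked against all values of df2:
# 1 -> False, 2 -> True, 8 -> False, 6 -> True
mask = df1.isin(df2.to_numpy().ravel())
print(mask)
```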

How to simply add a column level to a pandas dataframe

Let's say I have a dataframe that looks like this:
df = pd.DataFrame(index=list('abcde'), data={'A': range(5), 'B': range(5)})
df
Out[92]:
A B
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
Assuming that this dataframe already exists, how can I simply add a level 'C' to the column index so I get this:
df
Out[92]:
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
I saw SO answers like this: python/pandas: how to combine two dataframes into one with hierarchical column index? But that concats different dataframes instead of adding a column level to an already existing dataframe.
As suggested by @StevenG himself, a better answer:
df.columns = pd.MultiIndex.from_product([df.columns, ['C']])
print(df)
# A B
# C C
# a 0 0
# b 1 1
# c 2 2
# d 3 3
# e 4 4
option 1
set_index and T
df.T.set_index(np.repeat('C', df.shape[1]), append=True).T
option 2
pd.concat, keys, and swaplevel
pd.concat([df], axis=1, keys=['C']).swaplevel(0, 1, 1)
A solution which adds a name to the new level and is easier on the eyes than other answers already presented:
df['newlevel'] = 'C'
df = df.set_index('newlevel', append=True).unstack('newlevel')
print(df)
# A B
# newlevel C C
# a 0 0
# b 1 1
# c 2 2
# d 3 3
# e 4 4
You could just assign the columns like:
>>> df.columns = [df.columns, ['C', 'C']]
>>> df
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
>>>
Or for unknown length of columns:
>>> df.columns = [df.columns.get_level_values(0), np.repeat('C', df.shape[1])]
>>> df
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
>>>
Another way for MultiIndex (appending 'E'):
df.columns = pd.MultiIndex.from_tuples(map(lambda x: (x[0], 'E', x[1]), df.columns))
A B
E E
C D
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
I like it explicit (using MultiIndex) and chain-friendly (.set_axis):
df.set_axis(pd.MultiIndex.from_product([df.columns, ['C']]), axis=1)
This is particularly convenient when merging DataFrames with different column level numbers, where Pandas (1.4.2) raises a FutureWarning (FutureWarning: merging between different levels is deprecated and will be removed ... ):
import pandas as pd
df1 = pd.DataFrame(index=list('abcde'), data={'A': range(5), 'B': range(5)})
df2 = pd.DataFrame(index=list('abcde'), data=range(10, 15), columns=pd.MultiIndex.from_tuples([("C", "x")]))
# df1:
A B
a 0 0
b 1 1
# df2:
C
x
a 10
b 11
# merge while giving df1 another column level:
pd.merge(df1.set_axis(pd.MultiIndex.from_product([df1.columns, ['']]), axis=1),
         df2,
         left_index=True, right_index=True)
# result:
A B C
x
a 0 0 10
b 1 1 11
Another method, but using a list comprehension of tuples as the arg to pandas.MultiIndex.from_tuples():
df.columns = pd.MultiIndex.from_tuples([(col, 'C') for col in df.columns])
df
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
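Whichever approach you pick, the added level can be removed again with droplevel if needed later (a sketch):

```python
import pandas as pd

df = pd.DataFrame(index=list('abcde'), data={'A': range(5), 'B': range(5)})
df.columns = pd.MultiIndex.from_product([df.columns, ['C']])

# droplevel(1, axis=1) strips the inner 'C' level, restoring the
# original flat columns
flat = df.droplevel(1, axis=1)
print(flat.columns.tolist())  # ['A', 'B']
```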
