Pandas DataFrame from a dict of ndarray - python

Consider a dictionary like the following:
>>> dict_temp = {'a': np.array([[0,1,2], [3,4,5]]),
'b': np.array([[3,4,5], [2,5,1], [5,3,7]])}
How can I build a pandas DataFrame out of this, using a multi-index with level 0 and 1 as follows:
level_0 = ['a', 'b']
level_1 = [[0,1], [0,1,2]]
I expect the code to build the multi-index levels itself... I don't care about the column names for now.
Appreciate comments...

Try concat:
pd.concat({k:pd.DataFrame(d) for k, d in dict_temp.items()})
Output:
0 1 2
a 0 0 1 2
1 3 4 5
b 0 3 4 5
1 2 5 1
2 5 3 7

Related

Create pandas dataframe columns using column names and elements [duplicate]

How can one idiomatically run a function like get_dummies, which expects a single column and returns several, on multiple DataFrame columns?
With pandas 0.19, you can do that in a single line :
pd.get_dummies(data=df, columns=['A', 'B'])
Columns specifies where to do the One Hot Encoding.
>>> df
A B C
0 a c 1
1 b c 2
2 a b 3
>>> pd.get_dummies(data=df, columns=['A', 'B'])
C A_a A_b B_b B_c
0 1 1.0 0.0 0.0 1.0
1 2 0.0 1.0 0.0 1.0
2 3 1.0 0.0 1.0 0.0
Since pandas version 0.15.0, pd.get_dummies can handle a DataFrame directly (before that, it could only handle a single Series, and see below for the workaround):
In [1]: df = DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'],
...: 'C': [1, 2, 3]})
In [2]: df
Out[2]:
A B C
0 a c 1
1 b c 2
2 a b 3
In [3]: pd.get_dummies(df)
Out[3]:
C A_a A_b B_b B_c
0 1 1 0 0 1
1 2 0 1 0 1
2 3 1 0 1 0
Workaround for pandas < 0.15.0
You can do it for each column seperate and then concat the results:
In [111]: df
Out[111]:
A B
0 a x
1 a y
2 b z
3 b x
4 c x
5 a y
6 b y
7 c z
In [112]: pd.concat([pd.get_dummies(df[col]) for col in df], axis=1, keys=df.columns)
Out[112]:
A B
a b c x y z
0 1 0 0 1 0 0
1 1 0 0 0 1 0
2 0 1 0 0 0 1
3 0 1 0 1 0 0
4 0 0 1 1 0 0
5 1 0 0 0 1 0
6 0 1 0 0 1 0
7 0 0 1 0 0 1
If you don't want the multi-index column, then remove the keys=.. from the concat function call.
Somebody may have something more clever, but here are two approaches. Assuming you have a dataframe named df with columns 'Name' and 'Year' you want dummies for.
First, simply iterating over the columns isn't too bad:
In [93]: for column in ['Name', 'Year']:
...: dummies = pd.get_dummies(df[column])
...: df[dummies.columns] = dummies
Another idea would be to use the patsy package, which is designed to construct data matrices from R-type formulas.
In [94]: patsy.dmatrix(' ~ C(Name) + C(Year)', df, return_type="dataframe")
Unless I don't understand the question, it is supported natively in get_dummies by passing the columns argument.
The simple trick I am currently using is a for-loop.
First separate categorical data from Data Frame by using select_dtypes(include="object"),
then by using for loop apply get_dummies to each column iteratively
as I have shown in code below:
train_cate=train_data.select_dtypes(include="object")
test_cate=test_data.select_dtypes(include="object")
# vectorize catagorical data
for col in train_cate:
cate1=pd.get_dummies(train_cate[col])
train_cate[cate1.columns]=cate1
cate2=pd.get_dummies(test_cate[col])
test_cate[cate2.columns]=cate2

Pandas get data from a list of column names

I have a pandas dataframe: df and list of column names: columns like so:
df = pd.DataFrame({
'A': ['b','b','c','d'],
'C': ['b1','b2','c1','d2'],
'B': list(range(4))})
columns = ['A','B']
Now I want to get all the data from these columns of the dataframe in one single series like so:
b
0
b
1
c
2
d
4
This is what I tried:
srs = pd.Series()
srs.append(df[column].values for column in columns)
But it is throwing this error:
TypeError: cannot concatenate object of type '<class 'generator'>';
only Series and DataFrame objs are valid
How can I fix this issue?
I think you can use numpy.ravel:
srs = pd.Series(np.ravel(df[columns]))
print (srs)
0 b
1 0
2 b
3 1
4 c
5 2
6 d
7 3
dtype: object
Or DataFrame.stack with Series.reset_index and drop=True:
srs = df[columns].stack().reset_index(drop=True)
If order should be changed is possible use DataFrame.melt:
srs = df[columns].melt()['value']
print (srs)
0 b
1 b
2 c
3 d
4 0
5 1
6 2
7 3
Name: value, dtype: object
You could do:
from itertools import chain
import pandas as pd
df = pd.DataFrame({
'A': ['b','b','c','d'],
'C': ['b1','b2','c1','d2'],
'B': list(range(4))})
columns = ['A','B']
res = pd.Series(chain.from_iterable(df[columns].to_numpy()))
print(res)
Output
0 b
1 0
2 b
3 1
4 c
5 2
6 d
7 3
dtype: object

Pandas: Split columns into multiple columns by two delimiters

I have data like this
ID INFO
1 A=2;B=2;C=5
2 A=3;B=4;C=1
3 A=1;B=3;C=2
I want to split the Info columns into
ID A B C
1 2 2 5
2 3 4 1
3 1 3 2
I can split columns with one delimiter by using
df['A'], df['B'], df['C'] = df['INFO'].str.split(';').str
then split again by = but this seems to not so efficient in case I have many rows and especially when there are so many field that cannot be hard-coded beforehand.
Any suggestion would be greatly welcome.
You could use named groups together with Series.str.extract. In the end concat back the 'ID'. This assumes you always have A=;B=;and C= in a line.
pd.concat([df['ID'],
df['INFO'].str.extract('A=(?P<A>\d);B=(?P<B>\d);C=(?P<C>\d)')], axis=1)
# ID A B C
#0 1 2 2 5
#1 2 3 4 1
#2 3 1 3 2
If you want a more flexible solution that can deal with cases where a single line might be 'A=1;C=2' then we can split on ';' and partition on '='. pivot in the end to get to your desired output.
### Starting Data
#ID INFO
#1 A=2;B=2;C=5
#2 A=3;B=4;C=1
#3 A=1;B=3;C=2
#4 A=1;C=2
(df.set_index('ID')['INFO']
.str.split(';', expand=True)
.stack()
.str.partition('=')
.reset_index(-1, drop=True)
.pivot(columns=0, values=2)
)
# A B C
#ID
#1 2 2 5
#2 3 4 1
#3 1 3 2
#4 1 NaN 2
Browsing a Series is much faster that iterating across the rows of a dataframe.
So I would do:
pd.DataFrame([dict([x.split('=') for x in t.split(';')]) for t in df['INFO']], index=df['ID']).reset_index()
It gives as expected:
ID A B C
0 1 2 2 5
1 2 3 4 1
2 3 1 3 2
It should be faster than splitting twice dataframe columns.
values = [dict(item.split("=") for item in value.split(";")) for value in df.INFO]
df[['a', 'b', 'c']] = pd.DataFrame(values)
This will give you the desired output:
ID INFO a b c
1 a=1;b=2;c=3 1 2 3
2 a=4;b=5;c=6 4 5 6
3 a=7;b=8;c=9 7 8 9
Explanation:
The first line converts every value to a dictionary.
e.g.
x = 'a=1;b=2;c=3'
dict(item.split("=") for item in x.split(";"))
results in :
{'a': '1', 'b': '2', 'c': '3'}
DataFrame can take a list of dicts as an input and turn it into a dataframe.
Then you only need to assign the dataframe to the columns you want:
df[['a', 'b', 'c']] = pd.DataFrame(values)
Another solution is Series.str.findAll to extract values and then apply(pd.Series):
df[["A", "B", "C"]] = df.INFO.str.findall(r'=(\d+)').apply(pd.Series)
df = df.drop("INFO", 1)
Details:
df = pd.DataFrame([[1, "A=2;B=2;C=5"],
[2, "A=3;B=4;C=1"],
[3, "A=1;B=3;C=2"]],
columns=["ID", "INFO"])
print(df.INFO.str.findall(r'=(\d+)'))
# 0 [2, 2, 5]
# 1 [3, 4, 1]
# 2 [1, 3, 2]
df[["A", "B", "C"]] = df.INFO.str.findall(r'=(\d+)').apply(pd.Series)
print(df)
# ID INFO A B C
# 0 1 A=2;B=2;C=5 2 2 5
# 1 2 A=3;B=4;C=1 3 4 1
# 2 3 A=1;B=3;C=2 1 3 2
# Remove INFO column
df = df.drop("INFO", 1)
print(df)
# ID A B C
# 0 1 2 2 5
# 1 2 3 4 1
# 2 3 1 3 2
Another solution :
#split on ';'
#explode
#then split on '='
#and pivot
df_INFO = (df.INFO
.str.split(';')
.explode()
.str.split('=',expand=True)
.pivot(columns=0,values=1)
)
pd.concat([df.ID,df_INFO],axis=1)
ID A B C
0 1 2 2 5
1 2 3 4 1
2 3 1 3 2

in Pandas, how to create a variable that is n for the nth observation within a group?

consider this
df = pd.DataFrame({'B': ['a', 'a', 'b', 'b'], 'C': [1, 2, 6,2]})
df
Out[128]:
B C
0 a 1
1 a 2
2 b 6
3 b 2
I want to create a variable that simply corresponds to the ordering of observations after sorting by 'C' within each groupby('B') group.
df.sort_values(['B','C'])
Out[129]:
B C order
0 a 1 1
1 a 2 2
3 b 2 1
2 b 6 2
How can I do that? I am thinking about creating a column that is one, and using cumsum but that seems too clunky...
I think you can use range with len(df):
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3],
'B': ['a', 'a', 'b'],
'C': [5, 3, 2]})
print df
A B C
0 1 a 5
1 2 a 3
2 3 b 2
df.sort_values(by='C', inplace=True)
#or without inplace
#df = df.sort_values(by='C')
print df
A B C
2 3 b 2
1 2 a 3
0 1 a 5
df['order'] = range(1,len(df)+1)
print df
A B C order
2 3 b 2 1
1 2 a 3 2
0 1 a 5 3
EDIT by comment:
I think you can use groupby with cumcount:
import pandas as pd
df = pd.DataFrame({'B': ['a', 'a', 'b', 'b'], 'C': [1, 2, 6,2]})
df.sort_values(['B','C'], inplace=True)
#or without inplace
#df = df.sort_values(['B','C'])
print df
B C
0 a 1
1 a 2
3 b 2
2 b 6
df['order'] = df.groupby('B', sort=False).cumcount() + 1
print df
B C order
0 a 1 1
1 a 2 2
3 b 2 1
2 b 6 2
Nothing wrong with Jezrael's answer but there's a simpler (though less general) method in this particular example. Just add groupby to JohnGalt's suggestion of using rank.
>>> df['order'] = df.groupby('B')['C'].rank()
B C order
0 a 1 1.0
1 a 2 2.0
2 b 6 2.0
3 b 2 1.0
In this case, you don't really need the ['C'] but it makes the ranking a little more explicit and if you had other unrelated columns in the dataframe then you would need it.
But if you are ranking by more than 1 column, you should use Jezrael's method.

Assign to selection in pandas

I have a pandas dataframe and I want to create a new column, that is computed differently for different groups of rows. Here is a quick example:
import pandas as pd
data = {'foo': list('aaade'), 'bar': range(5)}
df = pd.DataFrame(data)
The dataframe looks like this:
bar foo
0 0 a
1 1 a
2 2 a
3 3 d
4 4 e
Now I am adding a new column and try to assign some values to selected rows:
df['xyz'] = 0
df.loc[(df['foo'] == 'a'), 'xyz'] = df.loc[(df['foo'] == 'a')].apply(lambda x: x['bar'] * 2, axis=1)
The dataframe has not changed. What I would expect is the dataframe to look like this:
bar foo xyz
0 0 a 0
1 1 a 2
2 2 a 4
3 3 d 0
4 4 e 0
In my real-world problem, the 'xyz' column is also computated for the other rows, but using a different function. In fact, I am also using different columns for the computation. So my questions:
Why does the assignment in the above example not work?
Is it neccessary to do df.loc[(df['foo'] == 'a') twice (as I am doing it now)?
You're changing a copy of df (a boolean mask of the DataFrame is a copy, see docs).
Another way to achieve the desired result is as follows:
In [11]: df.apply(lambda row: (row['bar']*2 if row['foo'] == 'a' else row['xyz']), axis=1)
Out[11]:
0 0
1 2
2 4
3 0
4 0
dtype: int64
In [12]: df['xyz'] = df.apply(lambda row: (row['bar']*2 if row['foo'] == 'a' else row['xyz']), axis=1)
In [13]: df
Out[13]:
bar foo xyz
0 0 a 0
1 1 a 2
2 2 a 4
3 3 d 0
4 4 e 0
Perhaps a neater way is just to:
In [21]: 2 * (df1.bar) * (df1.foo == 'a')
Out[21]:
0 0
1 2
2 4
3 0
4 0
dtype: int64

Categories

Resources