Decompose cell with multiple values in a DataFrame - python

I have a pandas DataFrame in the following format (working example):
df = pd.DataFrame({'foo1':[1,2,3], 'foo2': ["a:1, b:2", "d:4", "a:6, d:5"]})
df
foo1 foo2
0 1 a:1, b:2
1 2 d:4
2 3 a:6, d:5
I would like to decompose the foo2 cell values into columns (output df):
foo1 foo2_a foo2_b foo2_d
0 1 1 2 0
1 2 0 0 4
2 3 6 0 5
I could iterate over the dataframe by index and store the values for each row, but that doesn't seem elegant.
Is there some pandas trick or elegant, Pythonic solution to this problem?
Thanks!

If you use
df.foo2.str.split(', ').apply(lambda l: pd.Series({e.split(':')[0]: int(e.split(':')[1]) for e in l})).fillna(0)
You get
a b d
0 1.0 2.0 0.0
1 0.0 0.0 4.0
2 6.0 0.0 5.0
Note that once each row is parsed into a dictionary, you can turn it into a pandas Series, and those Series become the rows of the result.
From this point, it is just a question of renaming the columns and concatenating the result back onto the original frame.
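Putting those two steps together, a minimal end-to-end sketch (using the example df from the question):

import pandas as pd

df = pd.DataFrame({'foo1': [1, 2, 3],
                   'foo2': ["a:1, b:2", "d:4", "a:6, d:5"]})

# parse each "key:value" list into a dict, then into a Series per row
expanded = (df.foo2.str.split(', ')
              .apply(lambda l: pd.Series({e.split(':')[0]: int(e.split(':')[1])
                                          for e in l}))
              .fillna(0)
              .astype(int))

# rename the new columns and concatenate them back onto foo1
print(pd.concat([df[['foo1']], expanded.add_prefix('foo2_')], axis=1))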

Use split + apply with a list comprehension to build a dict per row. Then convert the column to a list of dicts with values + tolist, build a DataFrame from it, add_prefix, and finally join column foo1 back on:
s = df['foo2'].str.split(', ').apply(lambda x: dict([y.split(':') for y in x]))
df1 = pd.DataFrame(s.values.tolist()).fillna(0).add_prefix('foo2_').astype(int)
df = df[['foo1']].join(df1)
print (df)
foo1 foo2_a foo2_b foo2_d
0 1 1 2 0
1 2 0 0 4
2 3 6 0 5

#find all the keys ('a','b','d',...)
d = {k:0 for k in df.foo2.str.extractall('([a-z]+)(?=:)').iloc[:,0].unique()}
#split foo2 and build a new DF then merge it into the existing DF.
pd.concat([df['foo1'].to_frame(),
           df.foo2.str.split(', ')
             .apply(lambda x: pd.Series(dict(d, **dict([e.split(':') for e in x]))))
             .add_prefix('foo2_')], axis=1)
Out[149]:
foo1 foo2_a foo2_b foo2_d
0 1 1 2 0
1 2 0 0 4
2 3 6 0 5

Related

How can I group a DataFrame and, at the same time, count the values and put the counts in different columns?

I have a DataFrame that looks like the one below
Index Category Class
0 1 A
1 1 A
2 1 B
3 2 A
4 3 B
5 3 B
And I would like to get an output dataframe grouped by category, with one column per class counting the occurrences of that class in each category, such as the one below:
Index Category A B
0 1 2 1
1 2 1 0
2 3 0 2
So far I've tried various combinations of the groupby and agg methods, but I still can't get what I want. I've also tried df.pivot_table(index='Category', columns='Class', aggfunc='count'), but that returns a DataFrame without columns. Any ideas of what could work in this case?
You can use aggfunc="size" to achieve your desired result:
>>> df.pivot_table(index='Category', columns='Class', aggfunc='size', fill_value=0)
Class A B
Category
1 2 1
2 1 0
3 0 2
Alternatively, you can use .groupby(...).size() to get the counts, and then unstack to reshape your data as well:
>>> df.groupby(["Category", "Class"]).size().unstack(fill_value=0)
Class A B
Category
1 2 1
2 1 0
3 0 2
Assign a dummy value to count:
out = df.assign(val=1).pivot_table('val', 'Category', 'Class',
                                   aggfunc='count', fill_value=0).reset_index()
print(out)
# Output
Class Category A B
0 1 2 1
1 2 1 0
2 3 0 2
import pandas as pd

df = pd.DataFrame({'Index': [0, 1, 2, 3, 4, 5],
                   'Category': [1, 1, 1, 2, 3, 3],
                   'Class': ['A', 'A', 'B', 'A', 'B', 'B']})

# count rows per (Category, Class) pair, then pivot Class into the columns
df = df.groupby(['Category', 'Class']).count()
df = df.pivot_table(index='Category', columns='Class')
print(df)
output:
Index
Class A B
Category
1 2.0 1.0
2 1.0 NaN
3 NaN 2.0
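To get from here to the desired integer layout, one option (a minimal follow-up sketch, continuing from the df above) is to flatten the column level and fill the missing counts:

# drop the outer 'Index' column level, fill the NaN counts with 0,
# and turn Category back into a column
df = df.droplevel(0, axis=1).fillna(0).astype(int).reset_index()
print(df)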
Use crosstab:
pd.crosstab(df['Category'], df['Class']).reset_index()
output:
Class Category A B
0 1 2 1
1 2 1 0
2 3 0 2

Python Pandas Change Column to Headings

I have data in the following format: Table 1
This data is loaded into a pandas dataframe, with the date column as the index. How would I make the names become the column headings (they must be unique), with the values lined up against the right dates?
So it would look something like this:
Table 2
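Since the original tables are images, here is a minimal sketch of the reshape being asked for, with assumed column names ('name' and 'value' are stand-ins):

import pandas as pd

# hypothetical stand-in for Table 1: a date index plus name/value columns
df = pd.DataFrame({'name': ['A', 'B', 'A', 'B'],
                   'value': [1, 2, 3, 4]},
                  index=pd.to_datetime(['2021-01-01', '2021-01-01',
                                        '2021-01-02', '2021-01-02']))

# pivot on the existing date index: each unique name becomes a heading
print(df.pivot(columns='name', values='value'))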
Consider the following toy DataFrame:
>>> df = pd.DataFrame({'x': [1,2,3,4], 'y':['0 a','2 a','3 b','0 b']})
>>> df
x y
0 1 0 a
1 2 2 a
2 3 3 b
3 4 0 b
Start by processing each row into a Series:
>>> new_columns = df['y'].apply(lambda x: pd.Series(dict([reversed(x.split())])))
>>> new_columns
a b
0 0 NaN
1 2 NaN
2 NaN 3
3 NaN 0
Alternatively, new columns can be generated using pivot (the effect is the same):
>>> new_columns = df['y'].str.split(n=1, expand=True).pivot(columns=1, values=0)
Finally, concatenate the original and the new DataFrame objects:
>>> df = pd.concat([df, new_columns], axis=1)
>>> df
x y a b
0 1 0 a 0 NaN
1 2 2 a 2 NaN
2 3 3 b NaN 3
3 4 0 b NaN 0
Drop any columns that you don't require:
>>> df.drop(['y'], axis=1)
x a b
0 1 0 NaN
1 2 2 NaN
2 3 NaN 3
3 4 NaN 0
You will need to split out the column's values, rename the resulting dataframe's columns, and then you can pivot() the dataframe. I have added the steps below:
df = df[0].str.split(' ', expand=True)  # assumes you only have the one column
df.columns = ['col_name', 'values']  # use whatever naming convention you like
df.pivot(columns='col_name', values='values')
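Applied to the toy frame from the answer above (a sketch; note that there the value comes before the name, so the column names are assigned in the opposite order):

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': ['0 a', '2 a', '3 b', '0 b']})

# split 'y' into its two parts, name them, then pivot the names into headings
parts = df['y'].str.split(' ', expand=True)
parts.columns = ['values', 'col_name']
print(parts.pivot(columns='col_name', values='values'))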
Please let me know if this helps.

How can I operate with the output of a DataFrame?

I have a DataFrame object, and I'm grouping by some keys and counting the results. The problem is that I want to replace one of the resulting columns with a ratio between the counts.
df.groupby(['A','B', 'C'])['C'].count().apply(f).reset_index()
I'm looking for an f that replaces column C with the value of (#times C==1) / (#times C==0) for each combination of A and B.
Is this what you want?
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, 3, 1, 2, 3],
                   'B': [2, 0, 1, 2, 0, 1],
                   'C': [1, 1, 0, 1, 1, 1]})
print(df)

def f(x):
    # the ratio is undefined when the group contains no zeros
    if np.count_nonzero(x == 0) == 0:
        return np.nan
    else:
        return np.count_nonzero(x == 1) / np.count_nonzero(x == 0)

result = df.groupby(['A', 'B'])['C'].apply(f).reset_index()
print(result)
Result:
#df
A B C
0 1 2 1
1 2 0 1
2 3 1 0
3 1 2 1
4 2 0 1
5 3 1 1
#result
A B C
0 1 2 NaN
1 2 0 NaN
2 3 1 1.0
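A vectorized variant (a sketch, not from the answer above; it assumes both 0 and 1 occur somewhere in C, and division by a zero count yields inf rather than NaN):

# count occurrences of each C value per (A, B) group
counts = df.groupby(['A', 'B'])['C'].value_counts().unstack(fill_value=0)
# ratio of (#C == 1) to (#C == 0) per group
result = (counts[1] / counts[0]).reset_index(name='C')
print(result)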

Pandas comparing multiindex dataframes without looping

I want to compare two multi-index dataframes and add another column showing the difference in values (where the index values match between the first and second dataframes), without using loops.
import numpy as np
import pandas as pd

index_a = [1, 2, 2, 3, 3, 3]
index_b = [0, 0, 1, 0, 1, 2]
index_c = [1, 2, 2, 4, 4, 4]
index = pd.MultiIndex.from_arrays([index_a, index_b], names=('a', 'b'))
index_1 = pd.MultiIndex.from_arrays([index_c, index_b], names=('a', 'b'))
df1 = pd.DataFrame(np.random.rand(6,), index=index, columns=['p'])
df2 = pd.DataFrame(np.random.rand(6,), index=index_1, columns=['q'])
df1
p
a b
1 0 .4655
2 0 .8600
1 .9010
3 0 .0652
1 .5686
2 .8965
df2
q
a b
1 0 .6591
2 0 .5684
1 .5689
4 0 .9898
1 .3656
2 .6989
The resultant dataframe (df1 - df2) should look like:
p diff
a b
1 0 .4655 -0.1936
2 0 .8600 .2916
1 .9010 .3321
3 0 .0652 No Match
1 .5686 No Match
2 .8965 No Match
Use reindex_like or reindex for intersection of indices:
df1['new'] = (df1['p'] - df2['q'].reindex_like(df1)).fillna('No Match')
#alternative
#df1['new'] = (df1['p'] - df2['q'].reindex(df1.index)).fillna('No Match')
print (df1)
p new
a b
1 0 0.955587 0.924466
2 0 0.312497 -0.310224
1 0.306256 0.231646
3 0 0.575613 No Match
1 0.674605 No Match
2 0.462807 No Match
Another idea with Index.intersection and DataFrame.loc:
df1['new'] = (df1['p'] - df2.loc[df2.index.intersection(df1.index), 'q']).fillna('No Match')
Or with merge with left join:
df = pd.merge(df1, df2, how='left', left_index=True, right_index=True)
df['new'] = (df['p'] - df['q']).fillna('No Match')
print (df)
p q new
a b
1 0 0.789693 0.665148 0.124544
2 0 0.082677 0.814190 -0.731513
1 0.762339 0.235435 0.526905
3 0 0.727695 NaN No Match
1 0.903596 NaN No Match
2 0.315999 NaN No Match
Use the following to get the difference for matched indexes. Unmatched indices will be NaN:
diff = df1['p'] - df2['q']
#Output
a b
1 0 -0.666542
2 0 -0.389033
1 0.064986
3 0 NaN
1 NaN
2 NaN
4 0 NaN
1 NaN
2 NaN
dtype: float64
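If you then want the question's exact layout (only df1's rows, with 'No Match' for the gaps), a small follow-up sketch:

# keep only df1's rows and label the unmatched ones
df1['diff'] = diff.reindex(df1.index).fillna('No Match')
print(df1)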

Start counting at zero by group

Consider the following dataframe:
>>> import pandas as pd
>>> df = pd.DataFrame({'group': list('aaabbabc')})
>>> df
group
0 a
1 a
2 a
3 b
4 b
5 a
6 b
7 c
I want to count the cumulative number of times each group has occurred. My desired output looks like this:
>>> df
group n
0 a 0
1 a 1
2 a 2
3 b 0
4 b 1
5 a 3
6 b 2
7 c 0
My initial approach was to do something like this:
df['n'] = df.groupby('group').apply(lambda x: list(range(x.shape[0])))
Basically assigning a length n array, zero-indexed, to each group. But that has proven difficult to transpose and join.
You can use groupby + cumcount, and horizontally concat the new column:
>>> pd.concat([df, df.group.groupby(df.group).cumcount()], axis=1).rename(columns={0: 'n'})
group n
0 a 0
1 a 1
2 a 2
3 b 0
4 b 1
5 a 3
6 b 2
7 c 0
Simply use groupby on the column name, in this case group, then apply cumcount, and finally add a column to the dataframe with the result:
df['n']=df.groupby('group').cumcount()
group n
0 a 0
1 a 1
2 a 2
3 b 0
4 b 1
5 a 3
6 b 2
7 c 0
You can use the apply method, passing a lambda expression as the parameter.
The idea is that the count for a row is the number of appearances of its group among the previous rows.
df['n'] = df.apply(lambda x: list(df['group'])[:int(x.name)].count(x['group']), axis=1)
Output
group n
0 a 0
1 a 1
2 a 2
3 b 0
4 b 1
5 a 3
6 b 2
7 c 0
Note: cumcount is a built-in groupby method, so it is both simpler and faster than this apply-based approach; see the pandas documentation for details.
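As a quick sanity check, a sketch verifying on the example frame that the built-in method and the apply version agree:

# both produce the cumulative per-group count, starting at zero
expected = df.apply(lambda x: list(df['group'])[:int(x.name)].count(x['group']),
                    axis=1)
assert df.groupby('group').cumcount().equals(expected)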
