pandas: dataframe from dict with comma separated values - python

I'm trying to create a DataFrame from a nested dictionary, where the values are in comma separated strings.
Each value is nested in a dict, such as:
dict = {"1": {"event": "A, B, C"},
        "2": {"event": "D, B, A, C"},
        "3": {"event": "D, B, C"}}
My desired output is:
     A  B  C    D
0    A  B  C  NaN
1    A  B  C    D
2  NaN  B  C    D
All I have so far is converting the dict to a dataframe and splitting the items in each list, but I'm not sure this is getting me any closer to my objective.
df = pd.DataFrame(dict)
Out[439]:
              1           2        3
event   A, B, C  D, B, A, C  D, B, C
In [441]: df.loc['event'].str.split(',').apply(pd.Series)
Out[441]:
   0  1  2    3
1  A  B  C  NaN
2  D  B  A    C
3  D  B  C  NaN
Any help is appreciated. Thanks

You can use a couple of comprehensions to massage the nested dict into a better shape for DataFrame creation, flagging whether an entry for each column exists or not:
the_dict = {"1": {"event": "A, B, C"},
            "2": {"event": "D, B, A, C"},
            "3": {"event": "D, B, C"}}
df = pd.DataFrame([[{z:1 for z in y.split(', ')} for y in x.values()][0] for x in the_dict.values()])
>>> df
     A  B  C    D
0  1.0  1  1  NaN
1  1.0  1  1  1.0
2  NaN  1  1  1.0
Once you've made the DataFrame, you can simply loop through the columns and convert the existence flags into letters using the where method (below: where the value is NaN it is left as NaN, otherwise the column's letter is inserted):
for col in df.columns:
    df_mask = df[col].isnull()
    df[col] = df[col].where(df_mask, col)
>>> df
     A  B  C    D
0    A  B  C  NaN
1    A  B  C    D
2  NaN  B  C    D
Based on @merlin's suggestion, you can go straight to the answer within the comprehension:
df = pd.DataFrame([[{z:z for z in y.split(', ')} for y in x.values()][0] for x in the_dict.values()])
>>> df
     A  B  C    D
0    A  B  C  NaN
1    A  B  C    D
2  NaN  B  C    D
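As an alternative to the comprehensions (my own sketch, not from the answers above), pandas' built-in str.get_dummies can build the same membership table, which can then be mapped back to letters:

```python
import pandas as pd

the_dict = {"1": {"event": "A, B, C"},
            "2": {"event": "D, B, A, C"},
            "3": {"event": "D, B, C"}}

s = pd.DataFrame(the_dict).loc['event']  # one comma-separated string per key
flags = s.str.get_dummies(sep=', ')      # 1/0 table with columns A, B, C, D
# Map each 1 back to its column's letter, and each 0 to a missing value
out = flags.apply(lambda col: col.map({1: col.name, 0: None}))
```

This avoids splitting by hand, at the cost of an extra pass to turn the flags back into letters.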

From what you have (I modified the split a little to strip the extra spaces), df1, you can probably just stack the result and use pd.crosstab() on the index and the value column:
df1 = df.loc['event'].str.split(r'\s*,\s*').apply(pd.Series)
df2 = df1.stack().rename('value').reset_index()
pd.crosstab(df2.level_0, df2.value)
# value    A  B  C  D
# level_0
# 1        1  1  1  0
# 2        1  1  1  1
# 3        0  1  1  1
This is not exactly what you asked for, but I imagine you may prefer it to your desired output.
To get exactly what you are looking for, you can add an extra column equal to the value column above, and then unstack the index level that contains the values:
df2 = df1.stack().rename('value').reset_index()
df2['value2'] = df2.value
df2.set_index(['level_0', 'value']).drop('level_1', axis = 1).unstack(level = 1)
#            value2
# value           A  B  C     D
# level_0
# 1               A  B  C  None
# 2               A  B  C     D
# 3            None  B  C     D
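A self-contained version of the crosstab route, assuming the same input dict as the question:

```python
import pandas as pd

the_dict = {"1": {"event": "A, B, C"},
            "2": {"event": "D, B, A, C"},
            "3": {"event": "D, B, C"}}

# Split each string on commas (stripping surrounding whitespace)
df1 = pd.DataFrame(the_dict).loc['event'].str.split(r'\s*,\s*').apply(pd.Series)
df2 = df1.stack().rename('value').reset_index()  # long format: one letter per row
counts = pd.crosstab(df2.level_0, df2.value)     # 1/0 membership table
```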

Related

Expand pandas dataframe by replacing cell value with a list

I have a pandas dataframe like this below:
A B C
a b c
d e f
where A, B and C are column names. Now I have a list:
mylist = [1,2,3]
I want to replace the c in column C with the list, so that the dataframe expands with one row for each value of the list, like below:
A B C
a b 1
a b 2
a b 3
d e f
Any help would be appreciated!
I tried this:
mylist = [1, 2, 3]
x = pd.DataFrame({'mylist': mylist})
x['C'] = 'c'
res = pd.merge(df, x, on=['C'], how='left')
res['mylist'] = res['mylist'].fillna(res['C'])
Then:
del res['C']
res.rename(columns={"mylist": "C"}, inplace=True)
print(res)
Output:
A B C
0 a b 1
1 a b 2
2 a b 3
3 d e f
You can use:
print (df)
A B C
0 a b c
1 d e f
2 a b c
3 t e w
mylist = [1,2,3]
idx1 = df.index[df.C == 'c']
df = (df.loc[idx1.repeat(len(mylist))]
        .assign(C=mylist * len(idx1))
        .append(df[df.C != 'c']))
print (df)
A B C
0 a b 1
0 a b 2
0 a b 3
2 a b 1
2 a b 2
2 a b 3
1 d e f
3 t e w
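On newer pandas (explode was added in 0.25, which postdates this answer), the same expansion can be done more directly - replace the matching cells with the list, then explode; a sketch under that assumption:

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'd'], 'B': ['b', 'e'], 'C': ['c', 'f']})
mylist = [1, 2, 3]

out = df.copy()
# Put the list into the matching cells; other cells stay scalar
out['C'] = out['C'].apply(lambda v: mylist if v == 'c' else v)
# explode emits one row per list element and passes scalar rows through
out = out.explode('C').reset_index(drop=True)
```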

selecting not None value from a dataframe column

I would like to use the fillna function to fill the None values of a column with that column's most frequent value that is not None or NaN.
Input DF:
Col_A
a
None
None
c
c
d
d
The output DataFrame could be (filling with c; filling with d would be equally valid, since both are tied for most frequent):
Col_A
a
c
c
c
c
d
d
Any suggestion would be very appreciated.
Many Thanks, Best Regards,
Carlo
Prelude: If your None is actually a string, you can simplify any headaches by getting rid of them first-up. Use replace:
df = df.replace('None', np.nan)
I believe you could use fillna + value_counts:
df
Col_A
0 a
1 NaN
2 NaN
3 c
4 c
5 d
6 d
df.fillna(df.Col_A.value_counts(sort=False).index[0])
Col_A
0 a
1 c
2 c
3 c
4 c
5 d
6 d
Or, with Vaishali's suggestion, use idxmax to pick c:
df.fillna(df.Col_A.value_counts(sort=False).idxmax())
Col_A
0 a
1 c
2 c
3 c
4 c
5 d
6 d
The fill-values could either be c or d, depending on whether you include sort=False or not.
Details
df.Col_A.value_counts(sort=False)
c 2
a 1
d 2
Name: Col_A, dtype: int64
fillna + mode
df.Col_A.fillna(df.Col_A.mode()[0])
Out[963]:
0 a
1 c
2 c
3 c
4 c
5 d
6 d
Name: Col_A, dtype: object
To address the string 'None', you need to use replace and then fillna, much like @COLDSPEED suggests:
dr = df.Col_A.replace('None',np.nan)
dr.fillna(dr.dropna().value_counts().index[0])
Output:
0 a
1 d
2 d
3 c
4 c
5 d
6 d
Name: Col_A, dtype: object
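Putting the two ideas together - a self-contained sketch (my own consolidation, not verbatim from either answer) that first normalizes literal 'None' strings and then fills with the mode:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col_A': ['a', 'None', 'None', 'c', 'c', 'd', 'd']})

s = df['Col_A'].replace('None', np.nan)  # literal 'None' strings -> real NaN
# mode() ignores NaN and returns ties in sorted order, so [0] picks the
# alphabetically first of the most frequent values ('c' here, tied with 'd')
filled = s.fillna(s.mode()[0])
```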

creating a column in dataframe pandas based on a condition

I have two lists
A = ['a','b','c','d','e']
B = ['c','e']
A dataframe with column
A
0 a
1 b
2 c
3 d
4 e
I wish to create an additional column for rows where elements in B match A.
A M
0 a
1 b
2 c match
3 d
4 e match
You can use loc, or numpy.where, with a condition built from isin:
df.loc[df.A.isin(B), 'M'] = 'match'
print (df)
A M
0 a NaN
1 b NaN
2 c match
3 d NaN
4 e match
Or:
df['M'] = np.where(df.A.isin(B),'match','')
print (df)
A M
0 a
1 b
2 c match
3 d
4 e match
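For reference, a runnable end-to-end version of the np.where variant (same technique as above, just self-contained):

```python
import numpy as np
import pandas as pd

A = ['a', 'b', 'c', 'd', 'e']
B = ['c', 'e']
df = pd.DataFrame({'A': A})

# isin gives a boolean mask; np.where maps True/False to the two labels
df['M'] = np.where(df['A'].isin(B), 'match', '')
```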

Adding rows to a Pandas dataframe based on a list in a column (and vice versa)

I have this dataframe:
dfx = pd.DataFrame([[1,2],['A','B'],[['C','D'],'E']],columns=list('AB'))
A B
0 1 2
1 A B
2 [C, D] E
... that I want to transform in ...
A B
0 1 2
1 A B
2 C E
3 D E
... adding a row for each value contained in column A if it's a list.
Which is the most pythonic way?
And vice versa, if I want to group by a column (let's say B) and have in column A a list of the grouped values? (so the opposite that the example above)
Thanks in advance,
Gianluca
You have a mixed dataframe - int, str and list values (very problematic, because many functions raise errors on mixed types), so first convert all numeric values to str using where; the mask comes from to_numeric with parameter errors='coerce', which converts anything non-numeric to NaN:
dfx.A = dfx.A.where(pd.to_numeric(dfx.A, errors='coerce').isnull(), dfx.A.astype(str))
print (dfx)
A B
0 1 2
1 A B
2 [C, D] E
and then create a new DataFrame with np.repeat, flattening the lists with chain.from_iterable:
import numpy as np
from itertools import chain

df = pd.DataFrame({
    "B": np.repeat(dfx.B.values, dfx.A.str.len()),
    "A": list(chain.from_iterable(dfx.A))})
print (df)
A B
0 1 2
1 A B
2 C E
3 D E
A pure pandas solution: convert column A to a list and create a new DataFrame with from_records. Then drop the original column A and join the stacked df:
df = pd.DataFrame.from_records(dfx.A.values.tolist(), index = dfx.index)
df = dfx.drop('A', axis=1).join(df.stack().rename('A')
.reset_index(level=1, drop=True))[['A','B']]
print (df)
A B
0 1 2
1 A B
2 C E
2 D E
If you need lists, use groupby and apply tolist:
print (df.groupby('B')['A'].apply(lambda x: x.tolist()).reset_index())
B A
0 2 [1]
1 B [A]
2 E [C, D]
but if you need a list only when there is more than one value, an if..else is necessary:
print (df.groupby('B')['A'].apply(lambda x: x.tolist() if len(x) > 1 else x.values[0])
.reset_index())
B A
0 2 1
1 B A
2 E [C, D]
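On newer pandas (>= 0.25, after this answer was written), DataFrame.explode covers the first direction in one call, and groupby + agg(list) the reverse; a sketch under that assumption:

```python
import pandas as pd

dfx = pd.DataFrame([[1, 2], ['A', 'B'], [['C', 'D'], 'E']], columns=list('AB'))

# One row per list element; scalar cells in A pass through unchanged
long = dfx.explode('A').reset_index(drop=True)

# And back: collect A into a list per value of B, keeping row order
back = long.groupby('B', sort=False)['A'].agg(list).reset_index()
```

Note that the reverse trip always produces lists (e.g. [1]), matching the first groupby variant in the answer rather than the if..else one.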

Append values of row with same values

I have the following data frame:
1 A a
1 A b
2 B c
1 A d
How do I combine all the values of rows that share the same values in the first two columns into one row of the data frame:
1 A a, b, d
2 B c
You can use groupby and apply the function join:
df.columns = ['a','b','c']
print (df)
a b c
0 1 A a
1 1 A b
2 2 B c
3 1 A d
print (df.groupby(['a', 'b'])['c'].apply(', '.join).reset_index())
a b c
0 1 A a, b, d
1 2 B c
Or, if the first column is the index:
df.columns = ['a','b']
print (df)
a b
1 A a
1 A b
2 B c
1 A d
df1 = df.b.groupby([df.index, df.a]).apply(', '.join).reset_index(name='c')
df1.columns = ['a','b','c']
print (df1)
a b c
0 1 A a, b, d
1 2 B c
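The same technique end-to-end with the sample data typed in (column names a, b, c are assigned as in the answer):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 1],
                   'b': ['A', 'A', 'B', 'A'],
                   'c': ['a', 'b', 'c', 'd']})

# Within each (a, b) group, join the strings in order of appearance
out = df.groupby(['a', 'b'])['c'].apply(', '.join).reset_index()
```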
