Pandas: Splitting JSON list value into new columns - python

I used Pandas to load a CSV to the following DataFrame:
value values
0 56.0 [-0.5554548,10.0748005,4.232949]
1 72.0 [-0.1953888,0.15093994,-0.058532715]
...
Now I would like to replace "values" column with 3 new columns like so:
value values_a values_b values_c
0 56.0 -0.5554548 10.0748005 4.232949
1 72.0 -0.1953888 0.15093994 -0.058532715
...
How can I split the list to 3 columns?

You can use str.split after removing the [] with str.strip:
df1 = df.pop('values').str.strip('[]').str.split(',',expand=True).astype(float)
df[['values_a', 'values_b', 'values_c']] = df1
Solution if there are no NaNs:
L = [x.split(',') for x in df.pop('values').str.strip('[]').values.tolist()]
df[['values_a', 'values_b', 'values_c']] = pd.DataFrame(L).astype(float)
Solution converting the column to a list first and then using the DataFrame constructor:
import ast
s = df.pop('values').apply(ast.literal_eval)
df[['values_a', 'values_b', 'values_c']] = pd.DataFrame(s.values.tolist()).astype(float)
Similarly, the lists can be parsed directly while reading the CSV:
df = pd.read_csv(file, converters={'values': ast.literal_eval})
print (df)
value values
0 56.0 [-0.5554548, 10.0748005, 4.232949]
1 72.0 [-0.1953888, 0.15093994, -0.058532715]
df1 = pd.DataFrame(df.pop('values').tolist()).astype(float)
df[['values_a', 'values_b', 'values_c']] = df1
Final:
print (df)
value values_a values_b values_c
0 56.0 -0.555455 10.074801 4.232949
1 72.0 -0.195389 0.150940 -0.058533
EDIT:
If some rows can contain more than 3 values, it is not possible to assign to exactly 3 new columns. The solution is to use join:
df = df.join(df1.add_prefix('val'))
print (df)
value val0 val1 val2
0 56.0 -0.555455 10.074801 4.232949
1 72.0 -0.195389 0.150940 -0.058533
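A self-contained sketch of the first solution, using the sample data and column names from the question:
import pandas as pd

# the 'values' column holds list-like strings, as loaded from the CSV
df = pd.DataFrame({
    'value': [56.0, 72.0],
    'values': ['[-0.5554548,10.0748005,4.232949]',
               '[-0.1953888,0.15093994,-0.058532715]'],
})

# strip the brackets, split on commas, expand to columns, convert to float
df1 = df.pop('values').str.strip('[]').str.split(',', expand=True).astype(float)
df[['values_a', 'values_b', 'values_c']] = df1
print(df)
#    value  values_a   values_b  values_c
# 0   56.0 -0.555455  10.074801  4.232949
# 1   72.0 -0.195389   0.150940 -0.058533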

Related

Merging df in python

Say I have two DataFrames
df1 = pd.DataFrame({'A':[1,2], 'B':[3,4]}, index = [0,1])
df2 = pd.DataFrame({'B':[8,9], 'C':[10,11]}, index = [1,2])
I want to merge so that any value in df1 is overwritten if there is a value in df2 at that location, and any new values in df2 are added, including new rows and columns.
The result should be:
A B C
0 1 3 nan
1 2 8 10
2 nan 9 11
I've tried combine_first, but that causes only NaN values to be overwritten.
update has the issue where new rows are created rather than overwritten.
merge has many issues.
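For illustration, here is a quick sketch of what combine_first gives with the frames above; df1's existing value (B at row 1) is kept rather than overwritten, which is the problem described:
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=[0, 1])
df2 = pd.DataFrame({'B': [8, 9], 'C': [10, 11]}, index=[1, 2])

# combine_first prefers the caller's non-NaN values, so df1's B at index 1
# stays 4 instead of being replaced by df2's 8 (exact dtype display may vary)
print(df1.combine_first(df2))
#      A  B     C
# 0  1.0  3   NaN
# 1  2.0  4  10.0
# 2  NaN  9  11.0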
I've tried writing my own function
import math
import numpy as np
import pandas as pd

def take_right(df1, df2, j, i):
    print(df1)
    print(df2)
    try:
        s1 = df1[j][i]
    except:
        s1 = np.NaN
    try:
        s2 = df2[j][i]
    except:
        s2 = np.NaN
    if math.isnan(s2):
        # print(s1)
        return s1
    else:
        # print(s2)
        return s2

def combine_df(df1, df2):
    rows = set(df1.index.values.tolist()) | set(df2.index.values.tolist())
    # print(rows)
    columns = set(df1.columns.values.tolist()) | set(df2.columns.values.tolist())
    # print(columns)
    df = pd.DataFrame()
    # df.columns = columns
    for i in rows:
        # df[:][i] = []
        for j in columns:
            df = df.insert(int(i), j, take_right(df1, df2, j, i), allow_duplicates=False)
            # print(df)
    return df
This won't add new columns or rows to an empty DataFrame.
Thank you!!
One approach is to create an empty output DataFrame with the union of columns and indices from df1 and df2, and then use the df.update method to assign their values into out_df:
import pandas as pd
df1 = pd.DataFrame({'A':[1,2], 'B':[3,4]}, index = [0,1])
df2 = pd.DataFrame({'B':[8,9], 'C':[10,11]}, index = [1,2])
out_df = pd.DataFrame(
    columns=df1.columns.union(df2.columns),
    index=df1.index.union(df2.index),
)
out_df.update(df1)
out_df.update(df2)
out_df
Why does combine_first not work? It does, if you call it on df2 so that df2's values take priority:
df = df2.combine_first(df1)
print(df)
Output:
A B C
0 1.0 3 NaN
1 2.0 8 10.0
2 NaN 9 11.0

Assigning values in DataFrame when columns names and values are in single row

I have two dataframes like below,
import numpy as np
import pandas as pd
df1 = pd.DataFrame({1: np.zeros(5), 2: np.zeros(5)}, index=['a','b','c','d','e'])
and
df2 = pd.DataFrame({'category': [1,1,2,2], 'value':[85,46, 39, 22]}, index=[0, 1, 3, 4])
The values from the second dataframe need to be assigned into the first dataframe so that the index and column relationship is maintained. The second dataframe's index is iloc-based, and its category column actually contains column names of the first dataframe; value is the value to be assigned.
Following is my solution with the expected output.
for _category in df2['category'].unique():
    df1.loc[df1.iloc[df2[df2['category'] == _category].index.tolist()].index, _category] = df2[df2['category'] == _category]['value'].values
Is there a pythonic way of doing so without the for loop?
One option is to pivot and update:
df3 = df1.reset_index()
df3.update(df2.pivot(columns='category', values='value'))
df3 = df3.set_index('index').rename_axis(None)
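To see why the update lines up, here is a quick sketch of the intermediate frame produced by the pivot (values from the question's df2; display shown approximately as comments):
import pandas as pd

df2 = pd.DataFrame({'category': [1, 1, 2, 2], 'value': [85, 46, 39, 22]},
                   index=[0, 1, 3, 4])

# each category becomes a column and the original positional index is kept,
# which is what df3.reset_index() aligns against in the update step
print(df2.pivot(columns='category', values='value'))
# category     1     2
# 0         85.0   NaN
# 1         46.0   NaN
# 3          NaN  39.0
# 4          NaN  22.0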
Alternatively, reindex df2 (in two steps, numerical and by label) and combine_first with df1:
df3 = (df2
       .pivot(columns='category', values='value')
       .reindex(range(max(df2.index)+1))
       .set_axis(df1.index)
       .combine_first(df1)
)
output:
1 2
a 85.0 0.0
b 46.0 0.0
c 0.0 0.0
d 0.0 39.0
e 0.0 22.0
Here's one way: replace the 0s in df1 with NaN, pivot df2, and fill in the NaNs in df1 from df2:
out = (df1.replace(0, pd.NA).reset_index()
          .fillna(df2.pivot(columns='category', values='value'))
          .set_index('index').rename_axis(None).fillna(0))
Output:
1 2
a 85.0 0.0
b 46.0 0.0
c 0.0 0.0
d 0.0 39.0
e 0.0 22.0

insert missing rows in df with dictionary values

Hello, I have the following dataframe:
df = pd.DataFrame(data={'grade_1': ['A', 'B', 'C'],
                        'grade_1_count': [19, 28, 32],
                        'grade_2': ['pass', 'fail', np.nan],
                        'grade_2_count': [39, 18, np.nan]})
whereby some grades are missing and need to be inserted into the grade_n columns according to the values in this dictionary:
grade_dict = {'grade_1': ['A', 'B', 'C', 'D', 'E', 'F'],
              'grade_2': ['pass', 'fail', 'not present', 'borderline']}
and the corresponding row value in the _count column should be filled with np.nan
so the expected output is like this:
expected_df = pd.DataFrame(data={'grade_1': ['A', 'B', 'C', 'D', 'E', 'F'],
                                 'grade_1_count': [19, 28, 32, 0, 0, 0],
                                 'grade_2': ['pass', 'fail', 'not present', 'borderline', np.nan, np.nan],
                                 'grade_2_count': [39, 18, 0, 0, np.nan, np.nan]})
So far I have this rather inelegant code that creates a column including all the correct categories for the grades, but I cannot reinsert it into the dataframe or fill the count columns with zeros (the np.nans just reflect empty cells due to coercing columns with different numbers of rows). I hope that makes sense. Any advice would be great, thanks.
x = []
for k, v in grade_dict.items():
    out = df[k].reindex(grade_dict[k], axis=0, fill_value=0)
    x = pd.concat([out], axis=1)
    x[k] = x.index
    x = x.reset_index(drop=True)
    df[k] = x.fillna(np.nan)
Here is a solution using two consecutive merges:
# set up combinations
from itertools import zip_longest
df2 = pd.DataFrame(list(zip_longest(*grade_dict.values())), columns=grade_dict)
# merge
(df2.merge(df.filter(like='grade_1'),
           on='grade_1', how='left')
    .merge(df.filter(like='grade_2'),
           on='grade_2', how='left')
    .sort_index(axis=1)
)
output:
grade_1 grade_1_count grade_2 grade_2_count
0 A 19.0 pass 39.0
1 B 28.0 fail 18.0
2 C 32.0 not present NaN
3 D NaN borderline NaN
4 E NaN None NaN
5 F NaN None NaN
Or, with multiple merges in a loop:
df2 = pd.DataFrame(list(zip_longest(*grade_dict.values())), columns=grade_dict)
for col in grade_dict:
    df2 = df2.merge(df.filter(like=col), on=col, how='left')
df2
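For reference, the combinations frame df2 built with zip_longest looks like this (a quick sketch; output shown as comments):
import pandas as pd
from itertools import zip_longest

grade_dict = {'grade_1': ['A', 'B', 'C', 'D', 'E', 'F'],
              'grade_2': ['pass', 'fail', 'not present', 'borderline']}

# zip_longest pads the shorter list with None so both columns get 6 rows
df2 = pd.DataFrame(list(zip_longest(*grade_dict.values())), columns=grade_dict)
print(df2)
#   grade_1      grade_2
# 0       A         pass
# 1       B         fail
# 2       C  not present
# 3       D   borderline
# 4       E         None
# 5       F         None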
If you only need to merge on grade_1 without updating the non-NaNs of grade_2, you can cast grade_dict into a df and then use combine_first:
print(df.set_index("grade_1")
        .combine_first(pd.DataFrame(grade_dict.values(),
                                    index=grade_dict.keys()).T.set_index("grade_1"))
        .fillna({"grade_1_count": 0}).reset_index())
grade_1 grade_1_count grade_2 grade_2_count
0 A 19.0 pass 39.0
1 B 28.0 fail 18.0
2 C 32.0 not present NaN
3 D 0.0 borderline NaN
4 E 0.0 None NaN
5 F 0.0 None NaN

Can you simultaneously select and assign a column in a pandas DataFrame?

Using data.table in R, you can simultaneously select and assign columns. Assume one has a data.table with 3 columns--col1, col2, and col3. One could do the following using data.table:
dt2 <- dt[, .(col1, col2, newcol = 3, anothercol = col3)]
I want to do something similar in pandas but it looks like it would take 3 lines.
df2 = df.copy()
df2['newcol'] = 3
df2 = df2.rename(columns={"col3": "anothercol"})
Is there a more concise way to do what I did above?
This might work:
import pandas as pd
ddict = {
    'col1': ['A', 'A', 'B', 'X'],
    'col2': ['A', 'A', 'B', 'X'],
    'col3': ['A', 'A', 'B', 'X'],
}
df = pd.DataFrame(ddict)
df.loc[:, ['col1', 'col2', 'col3']].rename(columns={"col3":"anothercol"}).assign(newcol=3)
result:
col1 col2 anothercol newcol
0 A A A 3
1 A A A 3
2 B B B 3
3 X X X 3
I don't know R, but what I'm seeing is that you are adding a new column called newcol with a value of 3 in every row, and renaming the column col3 to anothercol.
You don't really need the copy step:
df2 = df.rename(columns = {'col3': 'anothercol'})
df2['newcol'] = 3
You can use df.assign for that. Example:
>>> df = pd.DataFrame({'temp_c': [17.0, 25.0]},
...                   index=['Portland', 'Berkeley'])
>>> df
temp_c
Portland 17.0
Berkeley 25.0
>>> df.assign(temp_f=lambda x: x.temp_c * 9 / 5 + 32)
temp_c temp_f
Portland 17.0 62.6
Berkeley 25.0 77.0
>>> df.assign(newcol=3).rename(columns={"temp_c": "anothercol"})
anothercol newcol
Portland 17.0 3
Berkeley 25.0 3
And then you can assign the result to df2.
The first examples are taken from the pandas docs.
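Applied to the original col1/col2/col3 example, the chained version might look like this (a sketch, assuming df has columns col1, col2, col3 as in the question):
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A'], 'col2': ['B', 'B'], 'col3': ['C', 'C']})

# rename col3 and add the constant column in one chained expression
df2 = df.rename(columns={'col3': 'anothercol'}).assign(newcol=3)
print(df2.columns.tolist())  # ['col1', 'col2', 'anothercol', 'newcol']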

Python: Fill 'na' in pandas column with random elements from a list

I am trying to fill 'NA' in a pandas column by randomly selecting elements from a list.
For example:
import pandas as pd
df = pd.DataFrame()
df['A'] = [1, 2, None, 5, 53, None]
fill_list = [22, 56, 84]
Is it possible to write a function which takes the pandas DF with column name as input and replaces all NA by randomly selecting elements from the list 'fill_list'?
fun(df['column_name'], fill_list)
Create a new Series with numpy.random.choice and then replace NaNs using fillna or combine_first:
df['A'] = df['A'].fillna(pd.Series(np.random.choice(fill_list, size=len(df.index))))
#alternative
#df['A'] = df['A'].combine_first(pd.Series(np.random.choice(fill_list, size=len(df.index))))
print (df)
A
0 1.0
1 2.0
2 84.0
3 5.0
4 53.0
5 56.0
Or:
#get mask of NaNs
m = df['A'].isnull()
#count rows with NaNs
l = m.sum()
#create array with size l
s = np.random.choice(fill_list, size=l)
#set NaNs values
df.loc[m, 'A'] = s
print (df)
A
0 1.0
1 2.0
2 56.0
3 5.0
4 53.0
5 56.0
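Wrapping the first approach into the requested helper might look like this (a sketch; the function name fill_na_random is hypothetical):
import numpy as np
import pandas as pd

def fill_na_random(s, fill_list):
    """Return a copy of Series s with its NaNs replaced by random picks from fill_list."""
    # draw one candidate per row; build the Series on s.index so fillna aligns
    random_values = pd.Series(np.random.choice(fill_list, size=len(s)), index=s.index)
    return s.fillna(random_values)

df = pd.DataFrame({'A': [1, 2, None, 5, 53, None]})
df['A'] = fill_na_random(df['A'], [22, 56, 84])
print(df)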
# Another option: fill the NaNs in each column with a fixed placeholder label
data_rnr['CO BORROWER NAME'].fillna("NO", inplace=True)
data_rnr['ET REASON'].fillna("ET REASON NOT AVAILABLE", inplace=True)
data_rnr['INSURANCE COMPANY NM'].fillna("INSURANCE COMPANY-NOT AVAILABLE", inplace=True)
data_rnr['GENDER'].fillna("GENDER DATA- NOT AVAILABLE", inplace=True)
