Given the following dataframe and list of dictionaries:
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict([
    {'id': '912SAFD', 'key': 3, 'list_index': [0]},
    {'id': '812SAFD', 'key': 4, 'list_index': [0, 1]},
    {'id': '712SAFD', 'key': 5, 'list_index': [2]}])
designs = [{'designs': [{'color_id': 609090, 'value': 'b', 'lang': ''}]},
           {'designs': [{'color_id': 609091, 'value': 'c', 'lang': ''}]},
           {'designs': [{'color_id': 609092, 'value': 'd', 'lang': 'fr'}]}]
Dataframe output:
        id  key list_index
0  912SAFD    3        [0]
1  812SAFD    4     [0, 1]
2  712SAFD    5        [2]
Without using explicit loops (if possible), is it feasible to iterate through the lists in 'list_index' for each row, extract the values, use them to access the list of dictionaries by index, and then create new columns based on the values in those dictionaries?
Here is an example of the expected result:
        id  key list_index 609090 609091 609092 609092_lang
0  912SAFD    3        [0]      b    NaN    NaN         NaN
1  812SAFD    4     [0, 1]      b      c    NaN         NaN
2  712SAFD    5        [2]    NaN    NaN      d          fr
If 'lang' is not empty, it should be added as a column to the dataframe by using the color_id value combined with an underscore and its own name as the column name. For example: 609092_lang.
Any help would be much appreciated.
# flatten the inner dictionaries and make a tidy dataframe from them
designs = [info for design in designs for info in design['designs']]
df_designs = pd.DataFrame(designs)
# per the question, name the lang column '<color_id>_lang'
df_designs['lang_code'] = df_designs['color_id'].astype(str) + '_lang'
df_designs['lang'] = df_designs['lang'].replace('', np.nan)
df = df.explode('list_index').merge(df_designs, left_on='list_index', right_index=True)
df_color = df.pivot(index=['id', 'key'], columns='color_id', values='value')
df_lang = df.pivot(index=['id', 'key'], columns='lang_code', values='lang')
df = df_color.join(df_lang).reset_index().dropna(how='all', axis=1)
print(df)
output:
>>>
        id  key 609090 609091 609092 609092_lang
0  712SAFD    5    NaN    NaN      d          fr
1  812SAFD    4      b      c    NaN         NaN
2  912SAFD    3      b    NaN    NaN         NaN
Alternatively, if you can work with a MultiIndex DataFrame instead of flat column names, it is simpler:
# flatten the inner dictionaries and make a tidy dataframe from them
designs = [info for design in designs for info in design['designs']]
df_designs = pd.DataFrame(designs)
df_designs['lang'] = df_designs['lang'].replace('', np.nan)
df = df.explode('list_index').merge(df_designs, left_on='list_index', right_index=True)
df = df.pivot(index=['id', 'key'], columns='color_id',
              values=['value', 'lang']).dropna(how='all', axis=1).reset_index()
print(df)
output:
>>>
              id key  value               lang
color_id             609090 609091 609092 609092
0        712SAFD   5    NaN    NaN      d     fr
1        812SAFD   4      b      c    NaN    NaN
2        912SAFD   3      b    NaN    NaN    NaN
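As a side note, the flattening list comprehension can also be written with pd.json_normalize; a minimal sketch, starting from the original (unflattened) designs list:
df_designs = pd.json_normalize(designs, record_path='designs')
# same tidy frame as above, with columns color_id, value and lang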
I would like to know if it's possible to combine rows when specific columns contain NaN values. The order can change; I thought of combining the rows where Name is duplicated.
import pandas as pd
import numpy as np
d = {'Name': ['Jacque','Paul', 'Jacque'], 'City': [np.nan, '4', '10'], 'Birthday' : ['1','2',np.nan]}
df = pd.DataFrame(data=d)
df
And I would like to have this output:
Check with sorted:
out = df.apply(lambda x: sorted(x, key=pd.isnull)).dropna()
     Name City Birthday
0  Jacque    4        1
1    Paul   10        2
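Note that sorted reorders each column independently, so the combined values no longer necessarily line up with their original rows. If you instead want to combine the duplicated Name rows while keeping each person's own values, a sketch using groupby(...).first(), which takes the first non-null value per column within each group:
out = df.groupby('Name', as_index=False, sort=False).first()
#      Name City Birthday
# 0  Jacque   10        1
# 1    Paul    4        2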
I have a pandas DataFrame as shown below. I want to identify the index values of the columns in df that match a given string (more specifically, a string that matches the column names after 'sim-' or 'act-').
# Sample df
import pandas as pd
df = pd.DataFrame({
    'sim-prod1': [1, 1.4],
    'sim-prod2': [2, 2.1],
    'act-prod1': [1.1, 1],
    'act-prod2': [2.5, 2]
})
# Get unique prod values from df.columns
prods = pd.Series(df.columns[1:]).str[4:].unique()
prods
array(['prod2', 'prod1'], dtype=object)
I now want to loop through prods and identify the columns where prod1 and prod2 occur, and then use those columns to create new dataframes. How can I do this? In R I could use the which function to do this easily. Example dataframes I want to obtain are below.
df_prod1
sim_prod1 act_prod1
0 1.0 1.1
1 1.4 1.0
df_prod2
sim_prod2 act_prod2
0 2.0 2.5
1 2.1 2.0
Try groupby with axis=1:
for prod, d in df.groupby(df.columns.str[4:], axis=1):
    print(f'this is {prod}')
    print(d)
    print('=' * 20)
Output:
this is prod1
   sim-prod1  act-prod1
0        1.0        1.1
1        1.4        1.0
====================
this is prod2
   sim-prod2  act-prod2
0        2.0        2.5
1        2.1        2.0
====================
Now, to have them as variables:
dfs = {prod: d for prod, d in df.groupby(df.columns.str[4:], axis=1)}
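For example, dfs['prod1'] is then the first dataframe above. Note that axis=1 grouping is deprecated in recent pandas (2.1+); a forward-compatible sketch groups the transposed frame and transposes each piece back:
dfs = {prod: d.T for prod, d in df.T.groupby(df.columns.str[4:])}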
Try this, storing the parts of the dataframe as a dictionary:
df_dict = dict(tuple(df.groupby(df.columns.str[4:], axis=1)))
print(df_dict['prod1'])
print('\n')
print(df_dict['prod2'])
Output:
   sim-prod1  act-prod1
0        1.0        1.1
1        1.4        1.0

   sim-prod2  act-prod2
0        2.0        2.5
1        2.1        2.0
You can also do this without groupby() or a for loop:
# note: prods == array(['prod2', 'prod1']), hence the reversed indices
df_prod2 = df[df.columns[df.columns.str.contains(prods[0])]]
df_prod1 = df[df.columns[df.columns.str.contains(prods[1])]]
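Since the question mentions R's which, the closest numpy analogue is np.flatnonzero over a boolean mask of the column names; a minimal sketch using the sample df:
import numpy as np
# integer positions of the columns whose names contain 'prod1'
idx = np.flatnonzero(df.columns.str.contains('prod1'))
df_prod1 = df.iloc[:, idx]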
I have the following sample DataFrame
import numpy as np
import pandas as pd
df = pd.DataFrame({'Tom': [2, np.nan, np.nan],
                   'Ron': [np.nan, 5, np.nan],
                   'Jim': [np.nan, np.nan, 6],
                   'Mat': [7, np.nan, np.nan]},
                  index=['Min', 'Max', 'Avg'])
that looks like this, where each column has only one non-null value
     Tom  Ron  Jim  Mat
Min  2.0  NaN  NaN  7.0
Max  NaN  5.0  NaN  NaN
Avg  NaN  NaN  6.0  NaN
Desired Outcome
For each column, I want to have the non-null value and then append the index of the corresponding non-null value to the name of the column. So the final result should look like this
   Tom_Min  Ron_Max  Jim_Avg  Mat_Min
0      2.0      5.0      6.0      7.0
My attempt
Using list comprehensions: Find the non-null value, and append the corresponding index to the column name and then create a new DataFrame
values = [df[col][~pd.isna(df[col])].values[0] for col in df.columns]
# [2.0, 5.0, 6.0, 7.0]
new_cols = [col + '_{}'.format(df[col][~pd.isna(df[col])].index[0]) for col in df.columns]
# ['Tom_Min', 'Ron_Max', 'Jim_Avg', 'Mat_Min']
df_new = pd.DataFrame([values], columns=new_cols)
My question
Is there some in-built functionality in pandas which can do this without using for loops and list comprehensions?
If there is only one non-missing value per column, you can use DataFrame.stack, convert the Series to a DataFrame and then flatten the MultiIndex; for the correct column order, use DataFrame.swaplevel with DataFrame.reindex:
df = df.stack().to_frame().T.swaplevel(1,0, axis=1).reindex(df.columns, level=0, axis=1)
df.columns = df.columns.map('_'.join)
print(df)
   Tom_Min  Ron_Max  Jim_Avg  Mat_Min
0      2.0      5.0      6.0      7.0
Use:
s = df.T.stack()
s.index = s.index.map('_'.join)
df = s.to_frame().T
Result:
# print(df)
   Tom_Min  Ron_Max  Jim_Avg  Mat_Min
0      2.0      5.0      6.0      7.0
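To see why this works, here is a sketch of the intermediate Series (using the sample df above; stack drops the NaNs by default):
s = df.T.stack()
print(s)
# Tom  Min    2.0
# Ron  Max    5.0
# Jim  Avg    6.0
# Mat  Min    7.0
# dtype: float64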
I'm collecting data over the course of many days, and rather than filling it in for every day, I can elect to say that the data from one day should really be a repeat of another day. I'd like to repeat some of the rows from my existing data frame into the days specified as repeats. I have a column that indicates which day the current day is to repeat from, but I am getting stuck with errors.
I have found ways to repeat rows n times based on a column value, but I am trying to use a column as an index to repeat data from previous rows.
I'd like to copy parts of my "Data" column for Day 1 into the "Data" column for Day 3, using my "repeat_tag" column as the index. I would like to do this for many more days.
import numpy as np
import pandas as pd

data = [['1', 5, np.nan], ['1', 5, np.nan], ['1', 5, np.nan],
        ['2', 6, np.nan], ['2', 6, np.nan], ['2', 6, np.nan],
        ['3', np.nan, 1], ['3', np.nan, np.nan], ['3', np.nan, np.nan]]
df = pd.DataFrame(data, columns=['Day', 'Data', 'repeat_tag'])
I slightly extended your test data:
data = [['1', 51, np.nan], ['1', 52, np.nan], ['1', 53, np.nan],
['2', 61, np.nan], ['2', 62, np.nan], ['2', 63, np.nan],
['3', np.nan, 1], ['3', np.nan, np.nan], ['3', np.nan, np.nan],
['4', np.nan, 2], ['4', np.nan, np.nan], ['4', np.nan, np.nan]]
df = pd.DataFrame(data, columns = ['Day', 'Data', 'repeat_tag'])
Details:
There are 4 days with observations.
Each observation has a different value (Data).
To avoid a "single day copy", values for day '3' are to be copied from day '1', and for day '4' from day '2'.
I assume that a non-null value of repeat_tag is placed in only one observation for the "target" day.
I also added an obsNo column to identify observations within each day (it will be needed later):
df['obsNo'] = df.groupby('Day').cumcount().add(1)
The first step of the actual processing is to generate the replDays table, where the Day column is the target day and repeat_tag is the source day:
replDays = df.loc[df['repeat_tag'].notnull(), ['Day', 'repeat_tag']]
replDays['repeat_tag'] = replDays['repeat_tag'].astype(int).astype(str)
A bit of type manipulation was needed for the repeat_tag column. As this column contains NaN values and the non-null values are int, the column is coerced to float64. Hence, to get the string type (comparable with Day) it must be converted:
must be converted:
First to int, to drop the decimal part.
Then to str.
The result is:
  Day repeat_tag
6   3          1
9   4          2
(fill data for day 3 with data from day 1 and data for day 4 with data from day 2).
The next step is to generate the replData table:
replData = pd.merge(replDays, df, left_on='repeat_tag', right_on='Day',
suffixes=('_src', ''))[['Day_src', 'Day', 'Data', 'obsNo']]\
.set_index(['Day_src', 'obsNo']).drop(columns='Day')
The result is:
Data
Day_src obsNo
3 1 51.0
2 52.0
3 53.0
4 1 61.0
2 62.0
3 63.0
As you can see:
There is only one column of replacement data: Data (from days 1 and 2).
The MultiIndex contains both the day and the observation number (both are needed for a proper update).
And the final part includes the following steps:
Copy df to res (the result), setting the index to Day and obsNo (required for update).
Update this table with data from replData.
Move Day and obsNo from index back to "regular" columns.
The code is:
res = df.copy().set_index(['Day', 'obsNo'])
res.update(replData)
res.reset_index(inplace=True)
If you want, you can also drop the obsNo column.
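For example, a one-line sketch:
res = res.drop(columns='obsNo')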
And a remark concerning the solution by Peter:
If the source data contains different values within any single day, his code fails with an InvalidIndexError, probably due to the lack of identification of individual observations within a particular day.
This confirms that my idea of adding the obsNo column is valid.
Setup
# Start with Valdi_Bo's expanded example data
data = [['1', 51, np.nan], ['1', 52, np.nan], ['1', 53, np.nan],
['2', 61, np.nan], ['2', 62, np.nan], ['2', 63, np.nan],
['3', np.nan, 1], ['3', np.nan, np.nan], ['3', np.nan, np.nan],
['4', np.nan, 2], ['4', np.nan, np.nan], ['4', np.nan, np.nan]]
df = pd.DataFrame(data, columns=['Day', 'Data', 'repeat_tag'])
# Convert Day to integer data type
df['Day'] = df['Day'].astype(int)
# Spread repeat_tag values into all rows of tagged day
df['repeat_tag'] = df.groupby('Day')['repeat_tag'].ffill()
Solution
# Within each day, assign a number to each row
df['obs'] = df.groupby('Day').cumcount()
# Self-join
filler = (pd.merge(df, df,
                   left_on=['repeat_tag', 'obs'],
                   right_on=['Day', 'obs'])
          .set_index(['Day_x', 'obs'])['Data_y'])
# Fill missing data
df = df.set_index(['Day', 'obs'])
df.loc[df['Data'].isnull(), 'Data'] = filler
df = df.reset_index()
Result
df
    Day  obs  Data  repeat_tag
0     1    0  51.0         NaN
1     1    1  52.0         NaN
2     1    2  53.0         NaN
3     2    0  61.0         NaN
4     2    1  62.0         NaN
5     2    2  63.0         NaN
6     3    0  51.0         1.0
7     3    1  52.0         1.0
8     3    2  53.0         1.0
9     4    0  61.0         2.0
10    4    1  62.0         2.0
11    4    2  63.0         2.0