Fill a table based on multiple positions in another dataframe - python

I have two dataframes. One is:
Age,Name,Dog,Cat,House,Car,Food
12,'Bob',0,0,0,0,0
12,'Sam',0,0,0,0,0
18,'Sam',0,0,0,0,0
And I have a much longer table:
Age,Name,Item,Amount
12,'Bob','Dog',1
12,'Bob','Cat',3
12,'Sam','Cat',1
18,'Sam','Cat',1
18,'Sam','House',3
Final product:
Age,Name,Dog,Cat,House,Car,Food
12,'Bob',1,0,0,0,0
12,'Sam',0,1,0,0,0
18,'Sam',0,1,3,0,0
Basically I have to fill the first table with values from the second table.
I have to match the Age and Name from the first table to the second, see which of the first table's columns the second table's Item names, and fill in the Amount.
I've hardcoded this using three chained & conditions, but I have millions of rows/columns, so it would literally take days to run that way.
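For reference, a minimal sketch of the kind of slow row-by-row approach described above (the df1/df2 names and the loop shape are assumptions, not the asker's actual code):
import pandas as pd

# O(len(df1) * len(df2)) boolean-mask loop - this is what the answer below avoids
for _, row in df2.iterrows():
    mask = (df1['Age'] == row['Age']) & (df1['Name'] == row['Name'])
    df1.loc[mask, row['Item']] = row['Amount']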

You do not need your first df; just use pivot_table on df2:
import pandas as pd
from io import StringIO
# your sample data
s2 = """Age,Name,Item,Amount
12,'Bob','Dog',1
12,'Bob','Cat',3
12,'Sam','Cat',1
18,'Sam','Cat',1
18,'Sam','House',3"""
df2 = pd.read_csv(StringIO(s2), quotechar="'")
# use pivot_table to reshape your DataFrame and reset the index
df2.pivot_table('Amount', ['Age', 'Name'], 'Item', aggfunc='sum').reset_index()
Item Age Name Cat Dog House
0 12 Bob 3.0 1.0 NaN
1 12 Sam 1.0 NaN NaN
2 18 Sam 1.0 NaN 3.0
Or just use groupby and unstack:
df2.groupby(['Age', 'Name', 'Item'])['Amount'].sum().unstack().reset_index()
Item Age Name Cat Dog House
0 12 Bob 3.0 1.0 NaN
1 12 Sam 1.0 NaN NaN
2 18 Sam 1.0 NaN 3.0
For the first example, change aggfunc to whatever function you want to use to handle multiple values; likewise for the groupby version, change .sum() to whatever aggregation you need.
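For instance, a quick sketch taking the mean instead of the sum whenever an (Age, Name, Item) combination occurs more than once:
df2.pivot_table('Amount', ['Age', 'Name'], 'Item', aggfunc='mean').reset_index()
# equivalent groupby form
df2.groupby(['Age', 'Name', 'Item'])['Amount'].mean().unstack().reset_index()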
An update to answer your real question, replacing the values in an existing df:
import pandas as pd
from io import StringIO
# your sample data
s = """Age,Name,Dog,Cat,House,Car,Food
12,'Bob',0,0,0,0,0
12,'Sam',0,0,0,0,0
18,'Sam',0,0,0,0,0"""
df1 = pd.read_csv(StringIO(s), quotechar="'")
s2 = """Age,Name,Item,Amount
12,'Bob','Dog',1
12,'Bob','Cat',3
12,'Sam','Cat',1
18,'Sam','Cat',1
18,'Sam','House',3"""
df2 = pd.read_csv(StringIO(s2), quotechar="'")
# use pivot_table to reshape the second DataFrame
pivot = df2.pivot_table('Amount', ['Age', 'Name'], 'Item', aggfunc='sum')
# set the same index on df1 so the two frames align
df1 = df1.set_index(['Age', 'Name'])
# use update to replace values
df1.update(pivot)
print(df1.reset_index())
Age Name Dog Cat House Car Food
0 12 Bob 1.0 3.0 0.0 0 0
1 12 Sam 0.0 1.0 0.0 0 0
2 18 Sam 0.0 1.0 3.0 0 0
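Note that update casts the touched columns to float, as visible above. If you need integers back (assuming all amounts are whole numbers), one option is:
df1 = df1.reset_index().astype({'Dog': int, 'Cat': int, 'House': int})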

Related

Merge dataframes based on substrings

I want to merge/join two large dataframes, where the 'matching' column of the dataframe on the right is assumed to contain substrings of the left dataframe's 'id' column.
For illustration purposes:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'id': ['abc', 'adcfek', 'acefeasdq'],
                    'numbers': [1, 2, np.nan],
                    'add_info': [3123, np.nan, 312441]})
df2 = pd.DataFrame({'matching': ['adc', 'fek', 'acefeasdq', 'abcef', 'acce', 'dcf'],
                    'needed_info': [1, 2, 3, 4, 5, 6],
                    'other_info': [22, 33, 11, 44, 55, 66]})
This is df1:
id numbers add_info
0 abc 1.0 3123.0
1 adcfek 2.0 NaN
2 acefeasdq NaN 312441.0
And this is df2:
matching needed_info other_info
0 adc 1 22
1 fek 2 33
2 acefeasdq 3 11
3 abcef 4 44
4 acce 5 55
5 dcf 6 66
And this is the desired output:
id numbers add_info needed_info other_info
0 abc 1.0 3123.0 NaN NaN
1 adcfek 2.0 NaN 2.0 33.0
2 adcfek 2.0 NaN 6.0 66.0
3 acefeasdq NaN 312441.0 3.0 11.0
So, as described, I only want to merge the additional columns when 'matching' is a substring of 'id'. If it is the other way around, e.g. 'abc' is a substring of 'abcef', nothing should happen.
In my data, a lot of the matches between df1 and df2 are actually exact, like the 'acefeasdq' row. But there are cases where an 'id' contains multiple 'matching' values. For the moment it is OK-ish to ignore those cases, but I'd like to learn how to tackle them. Additionally, is it possible to mark which rows were merged on a substring and which were merged exactly?
You can use pd.merge(how='cross') to create a dataframe containing all combinations of the rows. And then filter the dataframe using a boolean series:
df = pd.merge(df1, df2, how="cross")
include_row = df.apply(lambda row: row.matching in row.id, axis=1)
filtered = df.loc[include_row]
print(filtered)
Docs:
pd.merge
Indexing and selecting data
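To also mark exact versus substring matches, as asked at the end of the question, a small sketch on top of filtered (the exact column name is just an illustration):
filtered = filtered.copy()
# True where the match was an exact string match, False where only a substring
filtered['exact'] = filtered['matching'] == filtered['id']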
If your processing can handle a CROSS JOIN (problematic with large datasets), then you can cross join and then delete/filter only those rows you want:
cross = pd.merge(df1, df2, how='cross')
mask = cross.apply(lambda x: str(x['matching']) in str(x['id']), axis=1)  # boolean mask of matching rows
final = cross[mask]  # keep only the rows where the condition was met
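Note that both cross-join answers drop ids with no match at all, whereas the desired output keeps 'abc' with NaNs. A sketch that restores those rows with a left merge back onto df1 (result is a hypothetical name):
result = df1.merge(final[['id', 'needed_info', 'other_info']], on='id', how='left')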

Pandas DataFrame, group by column into single line items but extend columns by number of occurrences per group

I am trying to reformat a DataFrame into a single line item per categorical group, but my fixed format needs to retain all elements of data associated to the category as new columns.
For example, I have a DataFrame:
import pandas as pd
import numpy as np

dta = {'day': ['A', 'A', 'A', 'A', 'B', 'C', 'C', 'C', 'C', 'C'],
       'param1': [100, 200, 2, 3, 7, 23, 43, 98, 1, 0],
       'param2': [1, 20, 65, 3, 67, 2, 3, 98, 654, 5]}
df = pd.DataFrame(dta)
I need to be able to transform/reformat the DataFrame where the data is grouped by the 'day' column (e.g. one row per day) but then has columns generated dynamically according to how many entries are within each category.
For example category C in the 'day' column has 5 entries, meaning for 'day' C you would have 5 param1 values and 5 param2 values.
The associated values for days A and B would be populated with NaN or empty where they do not have entries.
e.g.
dta2 = {'day': ['A', 'B', 'C'],
        'param1_1': [100, 7, 23],
        'param1_2': [200, np.nan, 43],
        'param1_3': [2, np.nan, 98],
        'param1_4': [3, np.nan, 1],
        'param1_5': [np.nan, np.nan, 0],
        'param2_1': [1, 67, 2],
        'param2_2': [20, np.nan, 3],
        'param2_3': [65, np.nan, 98],
        'param2_4': [3, np.nan, 654],
        'param2_5': [np.nan, np.nan, 5]}
df2 = pd.DataFrame(dta2)
Unfortunately this is a predefined format that I have to maintain.
I am aiming to use Pandas as efficiently as possible to minimise deconstructing and reassembling the DataFrame.
You first need to melt, then add a helper column that cumcounts the labels per group, and pivot:
df2 = (
    df.melt(id_vars='day')
      .assign(group=lambda d: d.groupby(['day', 'variable']).cumcount().add(1).astype(str))
      .pivot(index='day', columns=['variable', 'group'], values='value')
)
df2.columns = df2.columns.map('_'.join)
df2 = df2.reset_index()
output:
day param1_1 param1_2 param1_3 param1_4 param1_5 param2_1 param2_2 param2_3 param2_4 param2_5
0 A 100.0 200.0 2.0 3.0 NaN 1.0 20.0 65.0 3.0 NaN
1 B 7.0 NaN NaN NaN NaN 67.0 NaN NaN NaN NaN
2 C 23.0 43.0 98.0 1.0 0.0 2.0 3.0 98.0 654.0 5.0
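To see why the helper column is needed: after the melt, a (day, variable) pair can occur several times, and pivot raises on duplicate entries. The cumcount turns each repeat into its own column suffix. A sketch of the intermediate frame, for inspection only:
melted = df.melt(id_vars='day')
melted['group'] = melted.groupby(['day', 'variable']).cumcount().add(1).astype(str)
# the four (A, param1) rows get group values '1'..'4', which become the
# suffixes in param1_1 .. param1_4 after the pivot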

pivot df with duplicates as new rows

Evening, I have a dataframe that I want to reshape. There are duplicate id vars for some columns, and I want the duplicate values to appear as new rows.
My data looks like the below, and I want to have the ids as rows, the groups as columns, and the choices as the values. If there are multiple choices picked per id within a group, then the row should be replicated as shown below. When I use pivot I end up just getting the mean or sum of the combined values, e.g. 11.5 for id i1, group1. All tips very welcome, thank you.
import pandas as pd
import numpy as np

df = pd.DataFrame({'id': ['i1', 'i1', 'i1', 'i2', 'i2', 'i2', 'i2', 'i2', 'i3', 'i3'],
                   'group': ['group1', 'group1', 'group2', 'group3', 'group1', 'group2', 'group2', 'group3', 'group1', 'group2'],
                   'choice': [12, 11, 12, 14, 11, 19, 9, 7, 8, 9]})
pd.DataFrame({'id': ['i1', 'i1', 'i2', 'i2', 'i3'],
              'group1': [12, 11, 11, np.nan, 8],
              'group2': [12, np.nan, 19, 9, 9],
              'group3': [np.nan, np.nan, 14, 7, np.nan]})
Use GroupBy.cumcount with Series.unstack and DataFrame.droplevel:
g = df.groupby(['id', 'group']).cumcount().add(1)
df = (df.set_index(['id', 'group', g])['choice']
        .unstack(level=1)
        .droplevel(level=1)
        .rename_axis(None, axis=1)
        .reset_index())
print (df)
id group1 group2 group3
0 i1 12.0 12.0 NaN
1 i1 11.0 NaN NaN
2 i2 11.0 19.0 14.0
3 i2 NaN 9.0 7.0
4 i3 8.0 9.0 NaN
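The cumcount helper is what separates the duplicates into distinct index entries, so unstack never has to aggregate. For the sample data it looks like this (computed on the original df, before the reshape above):
g = df.groupby(['id', 'group']).cumcount().add(1)
# 1 for the first occurrence of each (id, group) pair, 2 for the second,
# e.g. the two (i1, group1) rows get 1 and 2 and so end up as two output rows
print(g.tolist())  # [1, 2, 1, 1, 1, 1, 2, 2, 1, 1]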

need to add a column together and put the average beneath the column in Pandas

I'm currently trying to sum a column that has two rows (the table itself was shown as an image). I just need to add rows 1 and 2 together for each column, then append the average underneath each column, under its respective header name. I currently have this:
for x in sub_houseKeeping:
    if "phch" in x:
        sub_houseKeeping['Average'] = sub_houseKeeping[x].sum()/2
However, this adds together the entire row and appends it to the end of the rows, not the bottom of the column as I wished. How can I fix it to add to the bottom of the column?
This?
import io
import pandas as pd

data = ''' id   a   b
0   1  34  10
1   2  27  40'''
df = pd.read_csv(io.StringIO(data), sep=r'\s+', engine='python')
df1 = df.append(df[['a', 'b']].mean(), ignore_index=True)
df1
id a b
0 1.0 34.0 10.0
1 2.0 27.0 40.0
2 NaN 30.5 25.0
Try this:
sub_houseKeeping = pd.DataFrame({'ID': ['200650_s_at', '1565446_at'],
                                 'phchp003v1': [2174.84972, 6.724141107],
                                 'phchp003v2': [444.9008362, 4.093883364]})
sub_houseKeeping = sub_houseKeeping.append(pd.DataFrame(sub_houseKeeping.mean(axis=0)).T, ignore_index=True)
Output:
print(sub_houseKeeping)
ID phchp003v1 phchp003v2
0 200650_s_at 2174.849720 444.900836
1 1565446_at 6.724141 4.093883
2 NaN 1090.786931 224.497360
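Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. On current pandas the same result comes from pd.concat; for the first example above:
mean_row = df[['a', 'b']].mean().to_frame().T  # one-row DataFrame holding the means
df1 = pd.concat([df, mean_row], ignore_index=True)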

fill na in df.col1 on the basis of df2.col2. both dataframes are of different size

Apologies if this has already been asked and answered, but having searched for one whole day I could not locate the right solution. Please point me towards it if a solution already exists.
I am trying to fill na/nan values in a column of my pandas dataframe (df1). The fill values are located in another dataframe (df2), which contains the unique ids and a corresponding value. How do I match df1.Prod_id (where the existing value in df1.item_wt is nan) to the corresponding value in df2.mean_wt, and fill the nan value in df1.item_wt? The dataframes are of different sizes, df1 being 80k+ rows while df2 is only 1559, and the column names differ as the data comes from different sources. The fill has to be done in place.
I would appreciate any pandas way, to avoid iterative looping given the size of the actual dataframe.
I have tried combine_first and map with zero success, as the dataframe sizes are different, so the extra rows get no replacement.
import pandas as pd
import numpy as np

data1 = {'Prod_id': ['PR1', 'PR2', 'PR3', 'PR4', 'PR2', 'PR3', 'PR1', 'PR4'],
         'store': ['store1', 'store2', 'store3', 'store6', 'store3', 'store8', 'store45', 'store23'],
         'item_wt': [28, np.nan, 29, 42, np.nan, 34, 87, np.nan]}
df1 = pd.DataFrame(data1)
data2 = {'Item_name': ['PR1', 'PR2', 'PR3', 'PR4'],
         'mean_wt': [18, 12, 22, 9]}
df2 = pd.DataFrame(data2)
The final df should look like:
data1 = {'Prod_id': ['PR1', 'PR2', 'PR3', 'PR4', 'PR2', 'PR3', 'PR1', 'PR4'],
         'store': ['store1', 'store2', 'store3', 'store6', 'store3', 'store8', 'store45', 'store23'],
         'item_wt': [28, 12, 29, 42, 12, 34, 87, 9]}
df1 = pd.DataFrame(data1)
You can use fillna, taking the underlying numpy array with .values because the original and the new Series have different indices:
df1['item_wt'] = (df1.set_index('Prod_id')['item_wt']
                     .fillna(df2.set_index('Item_name')['mean_wt']).values)
print (df1)
Prod_id store item_wt
0 PR1 store1 28.0
1 PR2 store2 12.0
2 PR3 store3 29.0
3 PR4 store6 42.0
4 PR2 store3 12.0
5 PR3 store8 34.0
6 PR1 store45 87.0
7 PR4 store23 9.0
Or use map first:
s = df2.set_index('Item_name')['mean_wt']
df1['item_wt'] = df1['item_wt'].fillna(df1['Prod_id'].map(s))
#alternative
#df1['item_wt'] = df1['item_wt'].combine_first(df1['Prod_id'].map(s))
print (df1)
Prod_id store item_wt
0 PR1 store1 28.0
1 PR2 store2 12.0
2 PR3 store3 29.0
3 PR4 store6 42.0
4 PR2 store3 12.0
5 PR3 store8 34.0
6 PR1 store45 87.0
7 PR4 store23 9.0
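If some Prod_id values in df1 have no match in df2, they stay NaN with either approach. A sketch that falls back to the overall mean of mean_wt in that case (assuming such a fallback makes sense for your data):
s = df2.set_index('Item_name')['mean_wt']
df1['item_wt'] = df1['item_wt'].fillna(df1['Prod_id'].map(s)).fillna(s.mean())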
