I've got a dataframe df in Pandas that looks like this:
stores          product  discount
Westminster     102141   T
Westminster     102142   F
City of London  102141   T
City of London  102142   F
City of London  102143   T
And I'd like to end up with a dataset that looks like this:
stores          product_1  discount_1  product_2  discount_2  product_3  discount_3
Westminster     102141     T           102142     F
City of London  102141     T           102142     F           102143     T
How do I do this in pandas?
I think this is some kind of pivot on the stores column, but with multiple value columns. Or perhaps it's an "unmelt" rather than a "pivot"?
I tried:
df.pivot("stores", ["product", "discount"], ["product", "discount"])
But I get TypeError: MultiIndex.name must be a hashable type.
Use DataFrame.unstack to reshape. You only need to create a counter with GroupBy.cumcount first; then change the ordering of the second level and flatten the MultiIndex columns with map:
df = (df.set_index(['stores', df.groupby('stores').cumcount().add(1)])
        .unstack()
        .sort_index(axis=1, level=1))
df.columns = df.columns.map('{0[0]}_{0[1]}'.format)
df = df.reset_index()
print(df)
stores discount_1 product_1 discount_2 product_2 discount_3 \
0 City of London T 102141.0 F 102142.0 T
1 Westminster T 102141.0 F 102142.0 NaN
product_3
0 102143.0
1 NaN
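A side note on column order: sort_index(axis=1, level=1) also sorts the remaining level alphabetically, which is why discount_1 lands before product_1. If you would rather have product_N first, one small variation (a sketch that relies on 'product' sorting after 'discount' alphabetically, so a descending sort on the first level puts it first):
df = (df.set_index(['stores', df.groupby('stores').cumcount().add(1)])
        .unstack()
        .sort_index(axis=1, level=[1, 0], ascending=[True, False]))  # counter ascending, names descending
df.columns = df.columns.map('{0[0]}_{0[1]}'.format)
df = df.reset_index()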
I'm trying to merge these dataframes in a way that the final dataframe matches the country, year, and GDP from the first dataframe with their corresponding values from the second dataframe.
first dataframe:
| Country  | Country code | year  | rgdpe   |
|:---------|:------------:|:-----:|:-------:|
| country1 | Code1        | year1 | rgdpe1  |
| country1 | Code1        | yearn | rgdpen  |
| country2 | Code2        | year1 | rgdpe1' |
second dataframe:
| countries | value  | year  |
|:----------|:------:|:-----:|
| country1  | value1 | year1 |
| country1  | valuen | yearn |
| country2  | Code2  | year1 |
combined dataframe:
| Country | Country code | year |rgdpe |value|
|:--------|:------------:|:----:|:-----:|:---:|
|country1 | Code1 | year1|rgdpe1 |value|
|country1 | Code1 | yearn|rgdpen |Value|
|country2 | Code2 | year1|rgdpe1'|Value|
combined=pd.merge(left=df_biofuel_prod, right=df_GDP[['rgdpe']], left_on='Value', right_on='country', how='right')
combined.to_csv('../../combined_test.csv')
The result of this code gives me just the rgdpe column while the other columns are empty.
What would be the most efficient way to merge and match these dataframes?
First, from the data screen cap, it looks like the "country" column in your first dataset "df_GDP" is set as the index. Reset it using "reset_index()". Then merge on multiple columns, with left_on=["countries","year"] and right_on=["country","year"]. And since you want to retain all records from your main dataframe "df_biofuel_prod", it should be a "left" join:
combined_df = df_biofuel_prod.merge(df_GDP.reset_index(), left_on=["countries","year"], right_on=["country","year"], how="left")
Full example with dummy data:
df_GDP = pd.DataFrame(data=[["USA",2001,400],["USA",2002,450],["CAN",2001,150],["CAN",2002,170]], columns=["country","year","rgdpe"]).set_index("country")
df_biofuel_prod = pd.DataFrame(data=[["USA",400,2001],["USA",450,2003],["CAN",150,2001],["CAN",170,2003]], columns=["countries","Value","year"])
combined_df = df_biofuel_prod.merge(df_GDP.reset_index(), left_on=["countries","year"], right_on=["country","year"], how="left")
[Out]:
countries Value year country rgdpe
0 USA 400 2001 USA 400.0
1 USA 450 2003 NaN NaN
2 CAN 150 2001 CAN 150.0
3 CAN 170 2003 NaN NaN
You see "NaN" where matching data is not available in "df_GDP".
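If you want to see exactly which rows failed to match, merge's indicator flag tags each row with its source; a small optional addition to the call above:
combined_df = df_biofuel_prod.merge(df_GDP.reset_index(),
                                    left_on=["countries", "year"],
                                    right_on=["country", "year"],
                                    how="left", indicator=True)
# rows that found no GDP record are marked 'left_only' in the _merge column
print(combined_df[combined_df["_merge"] == "left_only"])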
The dataframe looks like:
name    education    education_2   education_3
name_1  NaN          some college  NaN
name_2  NaN          NaN           graduate degree
name_3  high school  NaN           NaN
I just want to keep one education column. I tried a conditional statement comparing the columns to each other, but got nothing but errors. I also looked through merge-based solutions, in vain. Does anyone know how to deal with this using Python or pandas? Thank you in advance.
name education
name_1 some college
name_2 graduate degree
name_3 high school
One day I hope they'll have better functions for String type rows, rather than the limited support for columns currently available:
df['education'] = (df.filter(like='education')  # filters to only the education columns
                     .T                         # transposes
                     .convert_dtypes()          # converts to pandas dtypes, still somewhat in beta
                     .max()                     # gets the max value per column, which will be the non-null one
                  )
df = df[['name', 'education']]
print(df)
Output:
name education
0 name 1 some college
1 name 2 graduate degree
2 name 3 high school
Looping this wouldn't be too hard, e.g.:
cols = ['education', 'age', 'income']
for col in cols:
    df[col] = df.filter(like=col).bfill(axis=1)[col]
df = df[['name'] + cols]
You can use df.fillna to do so.
df['combine'] = df[['education','education_2','education_3']].fillna('').sum(axis=1)
df
     name    education    education_2   education_3      combine
0    name_1  NaN          some college  NaN              some college
1    name_2  NaN          NaN           graduate degree  graduate degree
2    name_3  high school  NaN           NaN              high school
If you have a lot of columns to combine, you can try this:
df['combine'] = df[df.columns[1:]].fillna('').sum(axis=1)
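Another option, a sketch that assumes each row has exactly one non-null education value: stack() drops NaNs by default, which leaves exactly that one value per name:
out = (df.set_index('name')
         .stack()          # drops the NaNs, one value left per name
         .droplevel(1)     # discard the original column labels
         .rename('education')
         .reset_index())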
Use bfill to fill the empty (NaN) values:
df.bfill(axis=1).drop(columns=['education_2','education_3'])
     name    education
0    name_1  some college
1    name_2  graduate degree
2    name_3  high school
If there are other columns in between, then choose the columns to apply bfill to. In essence, if you have multiple columns for education that you need to consolidate under a single column, choose the columns to which you apply the bfill; subsequently, you can drop the columns from which you back-filled:
df[['education','education_2','education_3']].bfill(axis=1).drop(columns=['education_2','education_3'])
I am trying to collapse all the rows of a dataframe into one single row across all columns.
My data frame looks like the following:
name  job       value
bob   business  100
NaN   dentist   NaN
jack  NaN       NaN
I am trying to get the following output:
name      job               value
bob jack  business dentist  100
I am trying to group across all columns, I do not care if the value column is converted to dtype object (string).
I'm just trying to collapse all the rows across all columns.
I've tried groupby(index=0) but did not get good results.
You could apply join:
out = df.apply(lambda x: ' '.join(x.dropna().astype(str))).to_frame().T
Output:
name job value
0 bob jack business dentist 100.0
Try this:
new_df = df.agg(lambda x: x.dropna().astype(str).tolist()).str.join(' ').to_frame().T
Output:
>>> new_df
name job value
0 bob jack business dentist 100.0
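Both versions print the value as 100.0 because the NaNs force the column to float before the string conversion. If that matters, one optional tweak (assuming pandas >= 0.24 for the nullable Int64 dtype) is to convert that column first:
df['value'] = df['value'].astype('Int64')  # keeps 100 an integer despite the NaNs
out = df.apply(lambda x: ' '.join(x.dropna().astype(str))).to_frame().T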
I have two dataframes, A and B. I want to iterate through certain columns of df B, check whether the values in each of its rows exist in one of the columns of A, and if so fill B's null values with values from A's other columns.
df A:
country region product
USA NY apple
USA NY orange
UK LON banana
UK LON chocolate
CANADA TOR syrup
CANADA TOR fish
df B:
country ID product1 product2 product3 product4 region
USA 123 other stuff other stuff apple NA NA
USA 456 orange other stuff other stuff NA NA
UK 234 banana other stuff other stuff NA NA
UK 766 other stuff other stuff chocolate NA NA
CANADA 877 other stuff other stuff syrup NA NA
CANADA 109 NA fish NA other stuff NA
So I want to iterate through dfB and, for example, see if dfA.product (apple) appears in dfB's columns product1-product4. If it does, as in the first row of dfB, then I want to copy the region value from dfA.region into dfB's region column, which is currently NA.
Here is the code I have; I am not sure if it is right:
import pandas as pd
from tqdm import tqdm

def fill_null_value(dfA, dfB):
    for i, row in tqdm(dfA.iterrows()):
        for index, row in tqdm(dfB.iterrows()):
            if dfB['product1'][index] == dfA['product'][i]:
                dfB['region'] = dfA['region'][i]
            elif dfB['product2'][index] == dfA['product'][i]:
                dfB['region'] = dfA['region'][i]
            elif dfB['product3'][index] == dfA['product'][i]:
                dfB['region'] = dfA['region'][i]
            elif dfB['product4'][index] == dfA['product'][i]:
                dfB['region'] = dfA['region'][i]
            else:
                dfB['region'] = "not found"
    print('outputting data')
    return dfB.to_excel('test.xlsx')
If I were you, I would create some joins, then concat them and drop duplicates:
df_1 = df_A.merge(df_B, left_on=['country', 'product'], right_on=['country', 'product1'], how='right')
df_2 = df_A.merge(df_B, left_on=['country', 'product'], right_on=['country', 'product2'], how='right')
df_3 = df_A.merge(df_B, left_on=['country', 'product'], right_on=['country', 'product3'], how='right')
df_4 = df_A.merge(df_B, left_on=['country', 'product'], right_on=['country', 'product4'], how='right')
df = pd.concat([df_1, df_2, df_3, df_4]).drop_duplicates()
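One caveat: df_B already carries a (mostly empty) region column, so each merge will suffix the two region columns as region_x/region_y, and a df_B row can come back several times (once per product column that matched), which drop_duplicates() alone may not collapse since those rows differ in the merged columns. A hedged sketch for getting back to one row per df_B record, assuming ID uniquely identifies a row and df_B's empty region column is dropped before the merges above:
df_B = df_B.drop(columns='region')  # avoids the region_x/region_y suffixes
df = (pd.concat([df_1, df_2, df_3, df_4])
        .sort_values('region', na_position='last')  # rows with a matched region first
        .drop_duplicates(subset='ID'))              # one row per original df_B record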
The main issue here seems to be finding a single product column in your second dataset that you can do your join on. It's not clear exactly how you decide which values in the various product columns of df_b are meant to be used as lookup keys vs. which ones are ignored.
Assuming, though, that your df_a contains an exhaustive list of product values, and that each of those values only ever occurs once in a row, you could do something like this (simplifying your example):
import pandas as pd
df_a = pd.DataFrame({'Region':['USA', 'Canada'], 'Product': ['apple', 'banana']})
df_b = pd.DataFrame({'product1': ['apple', 'xyz'], 'product2': ['xyz', 'banana']})
product_cols = ['product1', 'product2']
df_b['Product'] = df_b[product_cols].apply(lambda x: x[x.isin(df_a.Product)].iloc[0], axis=1)
df_b = df_b.merge(df_a, on='Product')
The big thing here is generating a single column that you can join on for your lookup.
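For what it's worth, an equivalent melt-based sketch under the same assumptions (each row contains exactly one known product), which avoids the row-wise apply:
# reshape the product columns to long form, keep only the known products,
# and join the matched product back by original row position
long = (df_b.reset_index()
            .melt(id_vars='index', value_vars=product_cols, value_name='Product'))
matches = long[long['Product'].isin(df_a['Product'])]
df_b = (df_b.join(matches.set_index('index')['Product'])
            .merge(df_a, on='Product'))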
I have a dataframe df:
Open Volume Adj Close Ticker
Date
2006-11-22 140.750000 45505300 114.480649 SPY
I want to turn df into another dataframe of Open prices, like below:
SPY AGG
Date
2006-11-22 140.750000 NA
It only uses the Open data and two tickers, so how do I change one dataframe into the other?
I think you can use the DataFrame constructor with reindex by a list of tickers L:
L = ['SPY','AGG']
df1 = pd.DataFrame({'SPY': [df.Open.iloc[0]]},
                   index=[df.index[0]])
df1 = df1.reindex(columns=L)
print(df1)
SPY AGG
2006-11-22 140.75 NaN
You can use read_html to find a list of tickers:
df2 = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies', header=0)[0]
#print (df2)
#filter only ticker symbols starting with SP
df2 = df2[df2['Ticker symbol'].str.startswith('SP')]
print (df2)
Ticker symbol Security SEC filings \
407 SPG Simon Property Group Inc reports
415 SPGI S&P Global, Inc. reports
418 SPLS Staples Inc. reports
GICS Sector GICS Sub Industry \
407 Real Estate REITs
415 Financials Diversified Financial Services
418 Consumer Discretionary Specialty Stores
Address of Headquarters Date first added CIK
407 Indianapolis, Indiana NaN 1063761
415 New York, New York NaN 64040
418 Framingham, Massachusetts NaN 791519
#convert column to list, add SPY because it is missing
L = ['SPY'] + df2['Ticker symbol'].tolist()
print (L)
['SPY', 'SPG', 'SPGI', 'SPLS']
df1 = pd.DataFrame({'SPY': [df.Open.iloc[0]]},
                   index=[df.index[0]])
df1 = df1.reindex(columns=L)
print(df1)
SPY SPG SPGI SPLS
2006-11-22 140.75 NaN NaN NaN
Suppose you have a list of dataframes df_list for different tickers, where every item of the list looks like the df in your example.
You can first concatenate them into one frame with
df1 = pd.concat(df_list)
Then with
df1[["Open", "Ticker"]].reset_index().set_index(["Date", "Ticker"]).unstack()
It should give you an output like
Open
Ticker AGG SPY
Date
2006-11-22 NaN 140.75
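The unstacked result keeps 'Open' as an extra column level; if you want plain ticker columns like in the desired output, dropping that level should do it (a small follow-up sketch):
out = df1[["Open", "Ticker"]].reset_index().set_index(["Date", "Ticker"]).unstack()
out.columns = out.columns.droplevel(0)  # leaves just the ticker names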