Merging dataframes with pandas on two keys - python

I have two datasets, one with individual reports and one with regional conditions. There are many more individual rows than regional, but I want to append the regional data onto each individual row. The problem I am facing is that I must merge using two keys, e.g.:
Individual - 5000 rows
Code | Time | Data1 | Data2 | Data3
Regional - 100 rows
Code | Time | RData1 | RData2
I have attempted, and failed, with:
df = individual.merge(regional, how='left', on=['Code', 'Time'])
This leaves RData1 and RData2 as null values in the new df, which does, to its credit, look like
df - 5000 rows
Code | Time | Data1 | Data2 | Data3 | RData1 | RData2
but the null values don't help me...
(Example data and the output I am seeing were posted as screenshots.)

Data
Generate a random df:
import pandas as pd
import numpy as np

rng = pd.date_range('2015-02-24', periods=5, freq='T')
df = pd.DataFrame({'Time': rng, 'data1': np.random.randn(len(rng)), 'code': [201, 897, 345, 70, 879]})
df.set_index(['Time', 'code'], inplace=True)
df
Generate a random df1:
df1 = pd.DataFrame({'Time': rng, 'data1': np.random.randn(len(rng)), 'code': [201, 30, 345, 70, 879]})
df1.set_index(['Time', 'code'], inplace=True)
df1
A merge on the indexes can then be done as follows:
result = df1.merge(df, left_index=True, right_index=True, suffixes=('_Left', '_Right'))
result
Or better:
result = pd.merge(df, df1, left_index=True, right_index=True, suffixes=('_Left', '_Right'))
result
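
If the left merge from the question still produces NaN in RData1/RData2, the usual cause is that the key columns do not match exactly across the two frames, most often a dtype mismatch. A minimal diagnostic sketch, assuming the individual/regional frames from the question (the normalization steps are assumptions about typical causes, not part of the answer above):
# Confirm the key dtypes agree; a left merge only fills RData1/RData2
# where BOTH keys match exactly.
print(individual[['Code', 'Time']].dtypes)
print(regional[['Code', 'Time']].dtypes)

# Typical fixes (assumptions): parse Time to real datetimes in both
# frames and align the type of Code (e.g. int vs. str).
individual['Time'] = pd.to_datetime(individual['Time'])
regional['Time'] = pd.to_datetime(regional['Time'])
individual['Code'] = individual['Code'].astype(int)
regional['Code'] = regional['Code'].astype(int)

df = individual.merge(regional, how='left', on=['Code', 'Time'])
# Any rows whose (Code, Time) pair has no regional match stay NaN:
print(df[df['RData1'].isna()])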

Related

Get a single rating against particular id by finding nearest date from a series of dates

DataFrame 1 has three columns (customer_id, date, and rating) and DataFrame 2 has three (customer_id, start_date, instrument_id). The function needs to run such that each instrument_id in DF2 picks up the rating whose date is closest to start_date.
DF1:
customer_id date rating
84952608 31-Mar-20 4-
84952608 31-Dec-19 3-
84952608 30-Jun-19 4
84952608 31-Mar-19 5-
DF2:
Instrument_id customer_id start_date
000LCLN190240003 84952608 31-Mar-2019
Result DF:
Instrument_id customer_id rating
000LCLN190240003 84952608 5-
5- is selected since its date is the closest to start_date.
I have a working sample, but the compute time is significant: around 40-50 seconds for roughly 3k records.
DF2 is exposure and DF1 is file:
for w in range(len(exposure)):
    max_preceding_date = file.loc[(file['customer_id'] == exposure.loc[w, 'customer_id']) & (file['date'] <= exposure.loc[w, 'start_date']), ['rating', 'date']].sort_values('date', ascending=False)
    value = max_preceding_date.iloc[0, 0]
I also tried using df.merge to first merge both DataFrames, but I was unable to figure out how to use groupby to get the final output.
I appreciate your time and effort in helping with this one.
Merging dataframes and comparing datetime objects:
In [254]: res_df = df2.merge(df1, how='left', on='customer_id')
In [255]: res_df[['start_date', 'date']] = res_df[['start_date', 'date']].apply(lambda s: pd.to_datetime(s))
In [256]: res_df[res_df['date'] <= res_df['start_date']].sort_values(['start_date', 'date'], ascending=[False, False]).drop(['start_date', 'date'], axis=1)
Out[256]:
Instrument_id customer_id rating
3 000LCLN190240003 84952608 5-
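
An alternative that avoids the manual sort-and-filter is pd.merge_asof, which matches each start_date to the nearest earlier date within each customer. A sketch, assuming df1 and df2 as shown above (this is a different technique from the merge-and-filter answer, not the same code):
import pandas as pd

df1['date'] = pd.to_datetime(df1['date'])
df2['start_date'] = pd.to_datetime(df2['start_date'])

# merge_asof requires both frames sorted on the merge keys;
# direction='backward' picks the nearest date at or before start_date.
res = pd.merge_asof(df2.sort_values('start_date'),
                    df1.sort_values('date'),
                    left_on='start_date', right_on='date',
                    by='customer_id', direction='backward')
print(res[['Instrument_id', 'customer_id', 'rating']])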

DataFrame index is clearly a DateTimeIndex but .plot() can't graph it properly?

I made a very simple two-column Excel file with a date and a random float to test some other stuff. I read_excel() with parse_dates=True on the first column, and verified that type(df.index) is a DatetimeIndex.
When I go to .plot(), the x-axis of the graph is a bunch of random years like 2016, then 1987, then 1678... and none of my values show up, of course (the y-axis is correct, though).
What did I miss?
The xlsx (column A is formatted as Short Date in Excel; B is just General):
   |     A      |   B   |
 0 | 2019-01-01 | 12.87 |
 1 | 2019-01-02 | 15.20 |
..
90 | 2019-03-31 | 10.12 |
The code is:
import pandas as pd

fakedf = pd.read_excel('sampledata.xlsx', index_col=0, parse_dates=True, header=None)
fakedf.index.name = 'Date'
fakedf.columns = ['Price']
fakedf.head()
From there I get a decent plot, but when I do this:
import numpy as np

df = pd.DataFrame({'Date': [np.nan], 'Price': [np.nan]})
df.index = df.Date
del df['Date']
for x in range(len(fakedf)):
    df = df.append(fakedf.iloc[x])
df.plot()
It messes up. I had the for loop there because I was trying to test something related to time.sleep().
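
A likely diagnosis (an assumption on my part, since no answer is recorded here): seeding df with a NaN row turns its index into a plain object index, so .plot() no longer treats the x-axis as dates. A sketch of how to check, and how to build the frame without the NaN seed:
# fakedf's index is datetime64[ns] and plots correctly; after appending
# onto the NaN-seeded frame, the combined index degrades to object.
print(fakedf.index.dtype)
print(df.index.dtype)

# Rebuilding row by row with pd.concat, without the NaN seed,
# keeps the DatetimeIndex intact:
df = pd.concat([fakedf.iloc[[x]] for x in range(len(fakedf))])
df.plot()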

Column dates to quarters with pandas

I have a pandas DataFrame with these columns (the important part is that I have every month from 1996-04 to 2016-08):
Index(['RegionID', 'RegionName', 'State', 'Metro', 'CountyName', 'SizeRank',
'1996-04', '1996-05', '1996-06', '1996-07',
...
'2015-11', '2015-12', '2016-01', '2016-02', '2016-03', '2016-04',
'2016-05', '2016-06', '2016-07', '2016-08'],
dtype='object', length=251)
I need to group the columns in threes to represent financial quarters, e.g.:
| 1998-01 | 1998-02 | 1998-03 |
|    2    |    4    |    7    |
Needs to become
|   1998q1   |
| avg(2,4,7) |
Any hint about the right approach?
First move all non-date columns into the index, convert the date column labels to quarterly periods, and aggregate across columns with mean:
df = df.set_index(['RegionID', 'RegionName', 'State', 'Metro', 'CountyName', 'SizeRank'])
df.columns = pd.to_datetime(df.columns).to_period('Q').strftime('%Yq%q')
df = df.groupby(level=0, axis=1).mean().reset_index()
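
As a quick check of the approach, a sketch on a toy frame (the values are made up for illustration):
import pandas as pd

df = pd.DataFrame({'RegionID': [1], 'RegionName': ['A'], 'State': ['NY'],
                   'Metro': ['M'], 'CountyName': ['C'], 'SizeRank': [1],
                   '1998-01': [2], '1998-02': [4], '1998-03': [7]})
df = df.set_index(['RegionID', 'RegionName', 'State', 'Metro', 'CountyName', 'SizeRank'])
df.columns = pd.to_datetime(df.columns).to_period('Q').strftime('%Yq%q')
# the three monthly columns collapse into a single 1998q1 column
# holding mean(2, 4, 7) ~= 4.33
print(df.groupby(level=0, axis=1).mean().reset_index())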

How to reshape dataframe with multi year data in Python

I believe my question can be solved with a loop but I haven't been able to create one. I have a data sample which looks like this:
(sample data screenshot)
And I would like to have a DataFrame that is organised by year:
(result data screenshot)
I tried the pivot function, creating a year column with df['year'] = df.index.year and then reshaping with pivot, but it populates only the first year column because of the index.
I have managed to do this type of reshaping manually, but with several years of data it is a time-consuming solution. Here is the example code for the manual solution:
mydata = pd.DataFrame()
mydata1 = pd.DataFrame()
mydata2 = pd.DataFrame()
mydata3 = pd.DataFrame()
mydata1['1'] = df['data'].iloc[160:664]
mydata2['2'] = df['data'].iloc[2769:3273]
mydata3['3'] = df['data'].iloc[5583:6087]
mydata1.reset_index(drop=True, inplace=True)
mydata2.reset_index(drop=True, inplace=True)
mydata3.reset_index(drop=True, inplace=True)
mydata = pd.concat([mydata1, mydata2, mydata3], axis=1, ignore_index=True)
mydata.columns = ['78', '88', '00']
Welcome to StackOverflow! I think I understood what you were asking for from your question, but please correct me if I'm wrong. Basically, you want to reshape your current pandas.DataFrame using a pivot. I set up a sample dataset and solved the problem in the following way:
import pandas as pd

# test set
df = pd.DataFrame({'Index': ['2.1.2000', '3.1.2000', '3.1.2001', '4.1.2001', '3.1.2002', '4.1.2002'],
                   'Value': [100, 101, 110, 111, 105, 104]})

# create a year column for yourself
# by splitting on '.' and selecting the year element
df['Year'] = df['Index'].str.split('.', expand=True)[2]

# pivot your table
pivot = pd.pivot_table(df, index=df.index, columns='Year', values='Value')

# the pivot leaves unwanted null values (each original row fills only
# one year's cell), so drop the NaNs in each column without losing
# values in the other columns
pivot = pivot.apply(lambda x: pd.Series(x.dropna().values))
Result on my end
| Year | 2000 | 2001 | 2002 |
|------|------|------|------|
| 0 | 100 | 110 | 105 |
| 1 | 101 | 111 | 104 |
Hope this solves your problem!

Merging on non-unique column - pandas python

I have been trying to merge two DataFrames together (df and df_details) in a similar fashion to an Excel "vlookup" but am getting strange results. Below I show the structure of the two DataFrames without populating real data for simplicity
df_details:
Abstract_Title | Abstract_URL | Session_No_v2 | Session_URL | Session_ID
-------------------------------------------------------------------------
Abstract_Title1 Abstract_URL1 1 Session_URL1 12345
Abstract_Title2 Abstract_URL2 1 Session_URL1 12345
Abstract_Title3 Abstract_URL3 1 Session_URL1 12345
Abstract_Title4 Abstract_URL4 2 Session_URL2 22222
Abstract_Title5 Abstract_URL5 2 Session_URL2 22222
Abstract_Title6 Abstract_URL6 3 Session_URL3 98765
Abstract_Title7 Abstract_URL7 3 Session_URL3 98765
df:
Session_Title | Session_URL | Sponsors | Type | Session_ID
-------------------------------------------------------------------------------
Session_Title1 Session_URL1 x, y z Paper 12345
Session_Title2 Session_URL2 x, y Presentation 22222
Session_Title3     Session_URL3    a, b, c       Presentation    98765
Session_Title4 Session_URL4 c Talk 12121
Session_Title5 Session_URL5 a, x Paper 33333
I want to merge along Session_ID. (The desired final DataFrame was shown as a screenshot.)
I've tried the following script, which yields a DataFrame that duplicates certain rows several times and does strange things. For example, df_details has 7,046 rows and df has 1,856 rows, yet when I run the following merge code, final_df comes out with 21,148 rows:
final_df = pd.merge(df_details, df, how = 'outer', on = 'Session_ID')
Please help!
To generate your final output table I used the following code:
final_df = pd.merge(df_details,
                    df[['Session_ID', 'Session_Title', 'Sponsors', 'Type']],
                    on='Session_ID', how='outer')
Use 'left' instead of 'outer':
final_df = pd.merge(df_details, df[['Session_ID', 'Session_Title', 'Sponsors', 'Type']], on='Session_ID', how='left')
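
To see why the row count explodes in the first place, a quick check is whether Session_ID repeats in both frames: if it does, merge produces every pairing (a many-to-many join). A sketch using the names from the question (the deduplication step is an assumption about the intended vlookup-style behaviour):
# Want True for a clean one-row-per-key lookup:
print(df['Session_ID'].is_unique)
print(df['Session_ID'].duplicated().sum())  # how many repeats exist

# Dropping duplicate Session_IDs on the right-hand side restores the
# Excel-vlookup behaviour (keep the first match per Session_ID):
lookup = df[['Session_ID', 'Session_Title', 'Sponsors', 'Type']].drop_duplicates(subset='Session_ID')
final_df = pd.merge(df_details, lookup, on='Session_ID', how='left')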
