Plotting 2 data frames after merging - python

I have two data frames that share a column called date. I want to plot them with the shared date column on the X axis and value on the Y axis. I also want to do this after concatenating both data frames into a third one. Here is what I have so far:
import pandas as pd
import matplotlib.pyplot as plt
df1 = pd.DataFrame({'value': [1,2,3,4,5], 'date': [20,40,60,80,100]})
df2 = pd.DataFrame({'value': [11,21,31,41,51], 'date': [20,40,60,80,100]})
df3 = pd.concat([df1, df2], keys=['df1','df2'], axis=1)
df3.plot()
plt.show()
but the resulting plot is not what I wanted. It draws 4 lines, as can be seen from the legend.
How can I get just 2 lines with a common X axis and the difference reflected on the Y axis? Please note that I want to do this after concatenating the data frames df1 and df2, and by calling plot on df3.

You could use the "date" column as index before concatenating.
df1 = pd.DataFrame({'value': [1,2,3,4,5], 'date': [20,40,60,80,100]})
df2 = pd.DataFrame({'value': [11,21,31,41,51], 'date': [20,40,60,80,100]})
df3 = pd.concat([df1.set_index("date"), df2.set_index("date")], keys=['df1','df2'], axis=1)
df3.plot()
This creates a dataframe with only the two "value" columns and the date as index.
When plotting, the index is used as the x values and one line is drawn for each column.
You could also ignore the column index and set new column names afterwards:
df3 = pd.concat([df1.set_index("date"), df2.set_index("date")], axis=1, ignore_index=True)
df3.columns=['df1','df2']
Or you drop the level of the index that is common to both columns after concatenation.
df3 = pd.concat([df1.set_index("date"), df2.set_index("date")], keys=['df1','df2'], axis=1)
df3.columns = df3.columns.droplevel(level=1)
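To make the difference between the two concatenations concrete, here is a minimal sketch that compares the shapes before and after setting "date" as the index. The names wide_raw and wide are only illustrative, and plotting is left out so the snippet depends on pandas alone.

```python
import pandas as pd

df1 = pd.DataFrame({'value': [1, 2, 3, 4, 5], 'date': [20, 40, 60, 80, 100]})
df2 = pd.DataFrame({'value': [11, 21, 31, 41, 51], 'date': [20, 40, 60, 80, 100]})

# Without set_index, 'value' AND 'date' of each frame become columns,
# which is why the original plot showed four lines.
wide_raw = pd.concat([df1, df2], keys=['df1', 'df2'], axis=1)

# With 'date' as the index, only the two 'value' columns remain.
wide = pd.concat([df1.set_index('date'), df2.set_index('date')],
                 keys=['df1', 'df2'], axis=1)
wide.columns = wide.columns.droplevel(1)  # keep only the 'df1'/'df2' level

print(wide_raw.shape)       # (5, 4)
print(wide.shape)           # (5, 2)
print(list(wide.columns))   # ['df1', 'df2']
```

Calling wide.plot() on the second frame should then draw one line per column against the date index.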

Try :
df3=pd.merge(df1,df2,on='date')
df3.plot.line(x="date")
plt.show()
First, since the dates are the same in both frames, you can merge on the date column:
df3=pd.merge(df1,df2,on='date')
   value_x  date  value_y
0        1    20       11
1        2    40       21
2        3    60       31
3        4    80       41
4        5   100       51
Another way is to use matplotlib directly:
Plot date against value_x and date against value_y
plt.plot(df3["date"],df3["value_x"],label="df1")
plt.plot(df3["date"],df3["value_y"],label="df2")
plt.legend()
plt.show()
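If the default value_x / value_y names are too cryptic for the legend, the suffixes parameter of pd.merge can name them at merge time. This is a small sketch under that assumption; the suffix strings and the name plot_ready are arbitrary choices, not part of the answer above.

```python
import pandas as pd

df1 = pd.DataFrame({'value': [1, 2, 3, 4, 5], 'date': [20, 40, 60, 80, 100]})
df2 = pd.DataFrame({'value': [11, 21, 31, 41, 51], 'date': [20, 40, 60, 80, 100]})

# suffixes replaces the default ('_x', '_y') for overlapping column names.
merged = pd.merge(df1, df2, on='date', suffixes=('_df1', '_df2'))

# Moving 'date' into the index leaves only the two value columns to plot.
plot_ready = merged.set_index('date')

print(list(merged.columns))      # ['value_df1', 'date', 'value_df2']
print(list(plot_ready.columns))  # ['value_df1', 'value_df2']
```

plot_ready.plot() would then use date on the x axis and label the two lines value_df1 and value_df2.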

Related

How to remove duplication of columns names using Pandas Merge function

When we merge two dataframes using the pandas merge function, is it possible to ensure that the key(s) the two dataframes are merged on are not repeated in the result? For example, I tried to merge two DFs with a column named 'isin_code' in the left DF and a column named 'isin' in the right DF. Even though the column names are different, the values of both columns are the same. In the final result, though, I get both the 'isin_code' column and the 'isin' column, which I am trying to avoid.
Code used:
result = pd.merge(df1,df2[['isin','issue_date']],how='left',left_on='isin_code',right_on = 'isin')
Either rename the column before the merge so that both frames use the same key name, and specify only on:
result = pd.merge(
    df1,
    df2[['isin', 'issue_date']].rename(columns={'isin': 'isin_code'}),
    on='isin_code',
    how='left'
)
Or drop the duplicate column after the merge:
result = pd.merge(
    df1,
    df2[['isin', 'issue_date']],
    how='left',
    left_on='isin_code',
    right_on='isin'
).drop(columns='isin')
Sample DataFrames and output:
import pandas as pd
df1 = pd.DataFrame({'isin_code': [1, 2, 3], 'a': [4, 5, 6]})
df2 = pd.DataFrame({'isin': [1, 3], 'issue_date': ['2021-01-02', '2021-03-04']})
df1:
   isin_code  a
0          1  4
1          2  5
2          3  6
df2:
   isin  issue_date
0     1  2021-01-02
1     3  2021-03-04
result:
   isin_code  a  issue_date
0          1  4  2021-01-02
1          2  5         NaN
2          3  6  2021-03-04
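As a quick check, the following sketch runs both variants on the sample frames and compares the results. The variable names renamed and dropped are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'isin_code': [1, 2, 3], 'a': [4, 5, 6]})
df2 = pd.DataFrame({'isin': [1, 3], 'issue_date': ['2021-01-02', '2021-03-04']})

# Variant 1: rename the right-hand key so a single `on` column suffices.
renamed = pd.merge(
    df1,
    df2[['isin', 'issue_date']].rename(columns={'isin': 'isin_code'}),
    on='isin_code',
    how='left',
)

# Variant 2: merge on differently named keys, then drop the duplicate key.
dropped = pd.merge(
    df1,
    df2[['isin', 'issue_date']],
    how='left',
    left_on='isin_code',
    right_on='isin',
).drop(columns='isin')

print(list(renamed.columns))  # ['isin_code', 'a', 'issue_date']
print(list(dropped.columns))  # ['isin_code', 'a', 'issue_date']
```

The rename variant avoids creating the extra column in the first place, which may matter on wide frames; the drop variant leaves df2's column names untouched.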

Concat two Pandas DataFrame column with different length of index

How do I add a column from one Pandas dataframe to another dataframe when the new column has fewer rows? Specifically, I need the new column to be filled with NaN in the first few rows of the merged DataFrame instead of the last few rows. Thanks.
Use:
df1 = pd.DataFrame({
    'A':list('abcdef'),
    'B':[4,5,4,5,5,4],
})
df2 = pd.DataFrame({
    'SMA':list('rty')
})
df3 = df1.join(df2.set_index(df1.index[-len(df2):]))
Or:
df3 = pd.concat([df1, df2.set_index(df1.index[-len(df2):])], axis=1)
print (df3)
   A  B  SMA
0  a  4  NaN
1  b  5  NaN
2  c  4  NaN
3  d  5    r
4  e  5    t
5  f  4    y
How it works:
First, the last len(df2) index values of df1 are selected:
print (df1.index[-len(df2):])
RangeIndex(start=3, stop=6, step=1)
Then the existing index of df2 is overwritten with them by DataFrame.set_index:
print (df2.set_index(df1.index[-len(df2):]))
  SMA
3   r
4   t
5   y
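The two steps can be wrapped in a small helper. This is only a sketch: join_at_tail is a hypothetical name, and it slices with len(left) - len(right) rather than a negative index so that an empty right frame does not select the whole index.

```python
import pandas as pd

def join_at_tail(left, right):
    """Join `right` onto the last len(right) rows of `left` (hypothetical helper)."""
    if len(right) > len(left):
        raise ValueError("right has more rows than left")
    # Index labels of the last len(right) rows of `left`.
    tail_index = left.index[len(left) - len(right):]
    # Re-label `right` with those labels, then align on the index.
    return left.join(right.set_index(tail_index))

df1 = pd.DataFrame({'A': list('abcdef'), 'B': [4, 5, 4, 5, 5, 4]})
df2 = pd.DataFrame({'SMA': list('rty')})

df3 = join_at_tail(df1, df2)
print(df3)
```

The first three rows of SMA come out as NaN and the last three hold r, t and y, matching the output shown above.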

How to concatenate two dataframes with different indices along column axis

I want to merge two dataframes: the first has dm.shape = (21184, 34) and the second has po.shape = (21184, 6). After merging, the result should have 40 columns. I wrote this:
dm = dm.merge(po, left_index=True, right_index=True)
but afterwards dm.shape = (4554, 40), so the number of rows decreased.
P.S. po is the PolynomialFeatures output for the numerical data of dm.
The problem is that the index values differ, so convert them to the default RangeIndex in both DataFrames:
df = dm.reset_index(drop=True).merge(po.reset_index(drop=True),
                                     left_index=True,
                                     right_index=True)
Solution with concat, which does an outer join by default but gives the same result when both DataFrames have the same index values:
df = pd.concat([dm.reset_index(drop=True), po.reset_index(drop=True)], axis=1)
Or use:
dm = pd.DataFrame([dm.values.flatten().tolist(), po.values.flatten().tolist()]).rename(index=dict(zip(range(2),[*po.columns.tolist(), *dm.columns.tolist()]))).T
You can use the method join and set the parameter on to the index of the joined dataframe:
df1 = pd.DataFrame({'col1': [1, 2]}, index=[1,2])
df2 = pd.DataFrame({'col2': [3, 4]}, index=[3,4])
df1.join(df2, on=df2.index)
Output:
   col1  col2
1     1     3
2     2     4
The joined dataframe must not contain duplicated indices.
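To see why the row count drops in the question, here is a small sketch with made-up data: dm keeps index labels from an earlier filtering step (an assumption about how the mismatch arose), while po has a fresh RangeIndex. An index merge is an inner join by default, so only matching labels survive.

```python
import pandas as pd

# Hypothetical data: dm kept labels 0, 1, 5, 7 after some filtering step,
# while po was built from a plain array and has labels 0..3.
dm = pd.DataFrame({'x': [1, 2, 3, 4]}, index=[0, 1, 5, 7])
po = pd.DataFrame({'x2': [1, 4, 9, 16]})

# Inner join on the index: only labels 0 and 1 exist in both frames.
inner = dm.merge(po, left_index=True, right_index=True)

# Resetting both indexes aligns the rows by position instead.
aligned = pd.concat([dm.reset_index(drop=True),
                     po.reset_index(drop=True)], axis=1)

print(inner.shape)    # (2, 2)
print(aligned.shape)  # (4, 2)
```

Resetting the index is only appropriate when the rows of both frames are already in the same order, as they are when po is computed row by row from dm.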

Merge two dataframe by day to make one graph

I have these two dataframes,
df1 and df2:
df1 :
Data1 Created
1 22-01-01
4 22-01-01
3 22-01-01
df2 :
Data1 Created
1 22-01-01
6 23-01-01
Both have the same column names.
I would like to use the shared column "Created", which is a date, to count occurrences by day and plot them in the same graph.
I've tried this:
ax = df1.plot()
df2.plot(ax=ax,x_compat=True,figsize=(20,10))
but the resulting plot is not what I want.
Edit:
df2.resample('D').sum() gives me:
Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'
I've also tried this:
ax = df1.set_index('Created').resample('1D', how='count').plot()
df2.set_index('Created').resample('1D', how='count').plot(ax=ax,x_compat=True,figsize=(20,10))
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df1 = pd.DataFrame({'Data1': np.random.randint(0,30,size=10),'Created': pd.date_range("20180101", periods=10)})
df2 = pd.DataFrame({'Data1': np.random.randint(0,30,size=10),'Created': pd.date_range("20180103", periods=10)})
# An outer merge keeps every date found in either frame; missing values become 0.
df = df1.merge(df2, on='Created', how='outer').fillna(0)
df['sum'] = df['Data1_x'] + df['Data1_y']
df now holds all the data.
To plot the sum, with the date on the x axis:
plt.plot(df['Created'], df['sum'])
Or to plot the two series separately:
plt.plot(df['Created'], df['Data1_x'])
plt.plot(df['Created'], df['Data1_y'])
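Since the question asks for the number of occurrences per day rather than a sum of Data1, a counting variant might look like the sketch below. It assumes the Created strings are in year-month-day order with a two-digit year ('%y-%m-%d'), which the question does not confirm, and daily_counts is a hypothetical helper name.

```python
import pandas as pd

df1 = pd.DataFrame({'Data1': [1, 4, 3],
                    'Created': ['22-01-01', '22-01-01', '22-01-01']})
df2 = pd.DataFrame({'Data1': [1, 6],
                    'Created': ['22-01-01', '23-01-01']})

def daily_counts(frame, name):
    """Count rows per day in `frame` (hypothetical helper)."""
    # Assumption: two-digit year first, then month, then day.
    days = pd.to_datetime(frame['Created'], format='%y-%m-%d')
    return days.value_counts().rename(name)

# Align both count series on the date; days missing from one frame count as 0.
counts = pd.concat([daily_counts(df1, 'df1'), daily_counts(df2, 'df2')],
                   axis=1).fillna(0).sort_index()
print(counts)
```

Calling counts.plot() would then draw both series on one set of axes, with the dates on the x axis.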

Python sum with condition using a date and a condition

I have two dataframes and I am using pandas.
I want to compute a cumulative sum starting from a variable date and filtered by the value in a column.
I want to add a second column to df2 that shows the day on which the sum of the AVG column becomes greater than 100, counting from date2 in df2.
For example, with df1 and df2 being the dataframes I start with, df3 being what I want, and df3['date100'] being the day the sum of AVG exceeds 100:
df1 = pd.DataFrame({'date1': ['1/1/2014', '2/1/2014', '3/1/2014','1/1/2014', '2/1/2014', '3/1/2014','1/1/2014', '2/1/2014', '3/1/2014'],
'Place':['A','A','A','B','B','B','C','C','C'],'AVG': [62,14,47,25,74,60,78,27,41]})
df2 = pd.DataFrame({'date2': ['1/1/2014', '2/1/2014'], 'Place':['A','C']})
*Something*
df3 = pd.DataFrame({'date2': ['1/1/2014', '2/1/2014'], 'Place':['A','C'], 'date100': ['3/1/2014', '2/1/2014'], 'sum': [123, 105]})
I found some answers, but most of them use groupby and df2 has no groups.
Since your example is very basic, if you have edge cases you want me to take care of, just ask. This solution implies that :
The solution :
# For this solution your DataFrame needs to be sorted by date.
limit = 100
df = pd.DataFrame({
    'date1': ['1/1/2014', '2/1/2014', '3/1/2014','1/1/2014',
              '2/1/2014', '3/1/2014','1/1/2014', '2/1/2014', '3/1/2014'],
    'Place':['A','A','A','B','B','B','C','C','C'],
    'AVG': [62,14,47,25,74,60,78,27,41]})
df2 = pd.DataFrame({'date2': ['1/1/2014', '2/1/2014'], 'Place':['A','C']})
result = []
for row in df2.to_dict('records'):
    # For each date, I want to select the date that comes AFTER this one.
    # Then, I take the .cumsum(), because it's the agg you wish to do.
    # Filter by your limit and take the first occurrence.
    # Converting this to a dict, appending it to a list, makes it easy
    # to rebuild a DataFrame later.
    ndf = df.loc[ (df['date1'] >= row['date2']) & (df['Place'] == row['Place']) ]\
            .sort_values(by='date1')
    ndf['avgsum'] = ndf['AVG'].cumsum()
    final_df = ndf.loc[ ndf['avgsum'] >= limit ]
    # Error handling, in case there is not avgsum above the threshold.
    try:
        final_df = final_df.iloc[0][['date1', 'avgsum']].rename({'date1' : 'date100'})
        result.append( final_df.to_dict() )
    except IndexError:
        continue
df3 = pd.DataFrame(result)
final_df = pd.concat([df2, df3], axis=1, sort=False)
print(final_df)
# date2 Place avgsum date100
# 0 1/1/2014 A 123.0 3/1/2014
# 1 2/1/2014 C NaN NaN
Here is a direct solution, with following assumptions:
df1 is sorted by date
one solution exists for every date in df2
You can then do:
df2 = df2.join(pd.concat([
pd.DataFrame(pd.DataFrame(df1.loc[df1.date1 >= d].AVG.cumsum()).query('AVG>=100')
.iloc[0]).transpose()
for d in df2.date2]).rename_axis('ix').reset_index())\
.join(df1.drop(columns='AVG'), on='ix').rename(columns={'AVG': 'sum', 'date1': 'date100'})\
.drop(columns='ix')[['date2', 'date100', 'sum']]
This does the following:
for each date in df2, find the first date on which the cumulative sum of AVG reaches at least 100
combine the results in one single dataframe indexed by the index of that line in df1
store that index in an ix column and reset the index to join that dataframe to df2
join that to df1 minus the AVG column using the ix column
rename the columns, remove the ix column, and re-order everything
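For comparison, here is a more explicit sketch of the same idea that parses the dates first, so comparisons are chronological rather than string-based, and also filters by Place. It assumes the dates are day-first ('%d/%m/%Y'), which the question does not state; first_date_reaching and the column name total are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({
    'date1': ['1/1/2014', '2/1/2014', '3/1/2014'] * 3,
    'Place': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'AVG': [62, 14, 47, 25, 74, 60, 78, 27, 41]})
df2 = pd.DataFrame({'date2': ['1/1/2014', '2/1/2014'], 'Place': ['A', 'C']})

# Assumption: day-first dates. Parsing makes >= a real date comparison.
df1['date1'] = pd.to_datetime(df1['date1'], format='%d/%m/%Y')
df2['date2'] = pd.to_datetime(df2['date2'], format='%d/%m/%Y')

def first_date_reaching(frame, place, start, limit=100):
    """Return (date, running sum) of the first row on or after `start`
    where the cumulative AVG for `place` reaches `limit`, else (NaT, NaN)."""
    rows = frame[(frame['Place'] == place)
                 & (frame['date1'] >= start)].sort_values('date1')
    running = rows['AVG'].cumsum()
    hit = running[running >= limit]
    if hit.empty:
        return pd.NaT, float('nan')
    return rows.loc[hit.index[0], 'date1'], hit.iloc[0]

results = [first_date_reaching(df1, place, start)
           for place, start in zip(df2['Place'], df2['date2'])]
df3 = df2.assign(date100=[r[0] for r in results],
                 total=[r[1] for r in results])
print(df3)
```

With the sample data, place A reaches 123 on the third date, while place C starting from its second date only accumulates 68, so its row stays empty, which agrees with the NaN row in the first answer's output.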
