Merge two dataframes by day to make one graph - python

I have two dataframes, df1 and df2.
df1 :
Data1 Created
1 22-01-01
4 22-01-01
3 22-01-01
df2 :
Data1 Created
1 22-01-01
6 23-01-01
Both have the same column names.
I would like to use the shared column "Created", which is a date, to count occurrences by day and plot both results on the same graph.
I've tried this:
ax = df1.plot()
df2.plot(ax=ax,x_compat=True,figsize=(20,10))
but I get this:
Edit:
df2.resample('D').sum() gives me:
Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'
I've also tried this:
ax = df1.set_index('Created').resample('1D', how='count').plot()
df2.set_index('Created').resample('1D', how='count').plot(ax=ax,x_compat=True,figsize=(20,10))

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df1 = pd.DataFrame({'Data1': np.random.randint(0,30,size=10),'Created': pd.date_range("20180101", periods=10)})
df2 = pd.DataFrame({'Data1': np.random.randint(0,30,size=10),'Created': pd.date_range("20180103", periods=10)})
df = df1.merge(df2, on='Created', how='outer').fillna(0)
# After the merge, Data1_x comes from df1 and Data1_y from df2
df['sum'] = df['Data1_x'] + df['Data1_y']
df now holds all the data.
To plot the sum, with the date on the x axis:
plt.plot(df['Created'], df['sum'])
Or plot the two series separately:
plt.plot(df['Created'], df['Data1_x'])
plt.plot(df['Created'], df['Data1_y'])
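The answer above merges and sums the values, while the question asks for a count of occurrences per day. A minimal sketch of that counting step follows. It assumes the "Created" strings are day-month-year with a two-digit year (the question does not say), and the helper name daily_counts is made up for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({'Data1': [1, 4, 3],
                    'Created': ['22-01-01', '22-01-01', '22-01-01']})
df2 = pd.DataFrame({'Data1': [1, 6],
                    'Created': ['22-01-01', '23-01-01']})


def daily_counts(frame):
    """Count rows per calendar day (hypothetical helper)."""
    # Assumption: day-month-year with a two-digit year.
    days = pd.to_datetime(frame['Created'], format='%d-%m-%y').dt.floor('D')
    return frame.groupby(days).size()


# One column per source frame, indexed by day; days missing in one frame count as 0.
counts = pd.concat([daily_counts(df1), daily_counts(df2)],
                   axis=1, keys=['df1', 'df2']).fillna(0)
```

Calling counts.plot() afterwards (with matplotlib installed) should draw both series on one set of axes, because the two columns share the same date index.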

Related

Pandas Apply returns a Series instead of a dataframe

The goal of following code is to go through each row in df_label, extract app1 and app2 names, filter df_all using those two names, concatenate the result and return it as a dataframe. Here is the code:
def create_dataset(se):
    # extracting the names of applications
    app1 = se.app1
    app2 = se.app2
    # extracting each application from df_all
    df1 = df_all[df_all.workload == app1]
    df1.columns = df1.columns + '_0'
    df2 = df_all[df_all.workload == app2]
    df2.columns = df2.columns + '_1'
    # combining workloads to create the pairs dataframe
    df3 = pd.concat([df1, df2], axis=1)
    display(df3)
    return df3

df_pairs = pd.DataFrame()
df_label.apply(create_dataset, axis=1)
#df_pairs = df_pairs.append(df_label.apply(create_dataset, axis=1))
I would like to append all dataframes returned from apply. However, while display(df3) shows the correct dataframe, when returned from function, it's not a dataframe anymore and it's a series. A series with one element and that element seems to be the whole dataframe. Any ideas what I am doing wrong?
When you select a single column, you'll get a Series instead of a DataFrame so df1 and df2 will both be series.
However, concatenating them on axis=1 should produce a DataFrame (whereas combining them on axis=0 would produce a series). For example:
>>> df = pd.DataFrame({'a':[1,2],'b':[3,4]})
>>> df1 = df['a']
>>> df2 = df['b']
>>> pd.concat([df1,df2],axis=1)
   a  b
0  1  3
1  2  4
>>> pd.concat([df1,df2],axis=0)
0    1
1    2
0    3
1    4
dtype: int64
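For the original goal of appending every per-row dataframe, one possible approach (not taken from the answer above) is to build the frames in a plain loop and concatenate them once at the end. This is a minimal sketch: the small df_all and df_label frames are stand-ins invented for illustration, and the reset_index call assumes that rows of the two applications should be paired by position, which the question's code does not do.

```python
import pandas as pd

# Hypothetical stand-ins for the question's df_all and df_label.
df_all = pd.DataFrame({
    "workload": ["a", "a", "b", "c"],
    "metric": [1, 2, 3, 4],
})
df_label = pd.DataFrame({"app1": ["a", "b"], "app2": ["b", "c"]})


def create_dataset(row):
    # Filter each application and tag its columns with a suffix.
    # Assumption: rows are paired by position, hence reset_index.
    df1 = df_all[df_all.workload == row.app1].reset_index(drop=True).add_suffix("_0")
    df2 = df_all[df_all.workload == row.app2].reset_index(drop=True).add_suffix("_1")
    return pd.concat([df1, df2], axis=1)


# Collect one dataframe per label row, then concatenate them in a single call.
frames = [create_dataset(row) for row in df_label.itertuples(index=False)]
df_pairs = pd.concat(frames, ignore_index=True)
```

Building a list and calling pd.concat once also avoids repeatedly growing df_pairs inside the loop.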

Manipulate multiindex column in pivot_table

I have seen this question asked multiple times, but the solutions from the other questions did not work.
I have a data frame like this:
df = pd.DataFrame({
    "date": ["20180920"] * 3 + ["20180921"] * 3,
    "id": ["A12","A123","A1234","A12345","A123456","A0"],
    "mean": [1,2,3,4,5,6],
    "std" :[7,8,9,10,11,12],
    "test": ["a", "b", "c", "d", "e", "f"],
    "result": [70, 90, 110, "(-)", "(+)", 0.3],})
using pivot_table
df_sum_table = (pd.pivot_table(df,index=['id'], columns = ['date'], values = ['mean','std']))
I got
df_sum_table.columns
MultiIndex([('mean', '20180920'),
            ('mean', '20180921'),
            ( 'std', '20180920'),
            ( 'std', '20180921')],
           names=[None, 'date'])
I want to move the date labels down one row and remove the separate id row, while keeping the name id in the first column.
by following these past solutions
ValueError when trying to have multi-index in DataFrame.pivot
Removing index name from df created with pivot_table()
Resetting index to flat after pivot_table in pandas
pandas pivot_table keep index
df_sum_table = (pd.pivot_table(df,index=['id'], columns = ['date'], values = ['mean','std'])).reset_index().rename_axis(None, axis=1)
but I get this error:
TypeError: Must pass list-like as names.
How can I remove date but keep the id in the first column ?
The desired output
#jezrael
Try with rename_axis:
df = df.pivot_table(index=['id'], columns = ['date'], values = ['mean', 'std']).rename_axis(columns={'date': None}).fillna('').reset_index().T.reset_index(level=1).T.reset_index(drop=True).reset_index(drop=True)
df.index = df.pop('id').replace('', 'id').tolist()
print(df)
Output:
            mean     mean      std      std
id      20180920 20180921 20180920 20180921
A0                       6                12
A12            1                 7
A123           2                 8
A1234          3                 9
A12345                   4                10
A123456                  5                11
You could use rename_axis and rename the specific column axis name with dictionary mapping. I specify the columns argument for column axis name mapping.
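To isolate the rename_axis step from the rest of the chain, here is a minimal sketch on a cut-down version of the question's data (four rows, numeric columns only; this smaller frame is an assumption made to keep the example short). It checks only that the dictionary mapping clears the 'date' level name, which is the step that avoids the TypeError from the question.

```python
import pandas as pd

# Cut-down version of the question's data (assumed for brevity).
df = pd.DataFrame({
    "date": ["20180920", "20180920", "20180921", "20180921"],
    "id": ["A12", "A123", "A12345", "A0"],
    "mean": [1, 2, 4, 6],
    "std": [7, 8, 10, 12],
})

table = df.pivot_table(index="id", columns="date", values=["mean", "std"])

# The column MultiIndex has two levels; only the second one is named ('date').
# A dict maps old level names to new ones, so just 'date' is cleared.
renamed = table.rename_axis(columns={"date": None})

# reset_index moves 'id' into the first column, keeping the two header rows.
flat = renamed.reset_index()
```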

Python sum with condition using a date and a condition

I have two dataframes and I am using pandas.
I want to compute a cumulative sum that starts from a variable date and is restricted by the value in a column.
I want to add a second column to df2 showing the first day, after date2 in df2, on which the sum of the AVG column is greater than 100.
For example, df1 and df2 are the dataframes I start with, df3 is what I want, and df3['date100'] is the day the sum of AVG becomes greater than 100:
df1 = pd.DataFrame({'date1': ['1/1/2014', '2/1/2014', '3/1/2014','1/1/2014', '2/1/2014', '3/1/2014','1/1/2014', '2/1/2014', '3/1/2014'],
'Place':['A','A','A','B','B','B','C','C','C'],'AVG': [62,14,47,25,74,60,78,27,41]})
df2 = pd.DataFrame({'date2': ['1/1/2014', '2/1/2014'], 'Place':['A','C']})
*Something*
df3 = pd.DataFrame({'date2': ['1/1/2014', '2/1/2014'], 'Place':['A','C'], 'date100': ['3/1/2014', '2/1/2014'], 'sum': [123, 105]})
I found some answers, but most of them use groupby, and df2 has no groups.
Since your example is very basic, if you have edge cases you want me to take care of, just ask. This solution implies that :
The solution :
import pandas as pd

# For this solution your DataFrame needs to be sorted by date.
limit = 100
df = pd.DataFrame({
    'date1': ['1/1/2014', '2/1/2014', '3/1/2014','1/1/2014',
              '2/1/2014', '3/1/2014','1/1/2014', '2/1/2014', '3/1/2014'],
    'Place':['A','A','A','B','B','B','C','C','C'],
    'AVG': [62,14,47,25,74,60,78,27,41]})
df2 = pd.DataFrame({'date2': ['1/1/2014', '2/1/2014'], 'Place':['A','C']})
result = []
for row in df2.to_dict('records'):
    # For each date, I select the rows dated on or AFTER this one (same Place).
    # Then, I take the .cumsum(), because it's the agg you wish to do.
    # Filter by your limit and take the first occurrence.
    # Converting this to a dict, appending it to a list, makes it easy
    # to rebuild a DataFrame later.
    ndf = df.loc[ (df['date1'] >= row['date2']) & (df['Place'] == row['Place']) ]\
        .sort_values(by='date1')
    ndf['avgsum'] = ndf['AVG'].cumsum()
    final_df = ndf.loc[ ndf['avgsum'] >= limit ]
    # Error handling, in case there is no avgsum above the threshold.
    try:
        final_df = final_df.iloc[0][['date1', 'avgsum']].rename({'date1' : 'date100'})
        result.append( final_df.to_dict() )
    except IndexError:
        continue
df3 = pd.DataFrame(result)
final_df = pd.concat([df2, df3], axis=1, sort=False)
print(final_df)
#       date2 Place  avgsum   date100
# 0  1/1/2014     A   123.0  3/1/2014
# 1  2/1/2014     C     NaN       NaN
Here is a direct solution, with following assumptions:
df1 is sorted by date
one solution exists for every date in df2
You can then do:
df2 = df2.join(pd.concat([
        pd.DataFrame(pd.DataFrame(df1.loc[df1.date1 >= d].AVG.cumsum()).query('AVG>=100')
                     .iloc[0]).transpose()
        for d in df2.date2]).rename_axis('ix').reset_index())\
    .join(df1.drop(columns='AVG'), on='ix').rename(columns={'AVG': 'sum', 'date1': 'date100'})\
    .drop(columns='ix')[['date2', 'date100', 'sum']]
This does the following:
for each date in df2 find the first date when the cumulative sum of AVG is at least 100
combine the results in one single dataframe indexed by the index of that line in df1
store that index in an ix column and reset the index to join that dataframe to df2
join that to df1 minus the AVG column using the ix column
rename the columns, remove the ix column, and re-order everything
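The same lookup can also be written as an explicit loop over df2. This is a minimal sketch rather than either answerer's method. It assumes the date strings are day/month/year (the question does not say) and, like the first answer, it filters by Place and leaves a missing value when the threshold is never reached, so Place C does not reproduce the 105 shown in the question's df3.

```python
import pandas as pd

df1 = pd.DataFrame({
    'date1': ['1/1/2014', '2/1/2014', '3/1/2014'] * 3,
    'Place': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'AVG': [62, 14, 47, 25, 74, 60, 78, 27, 41],
})
df2 = pd.DataFrame({'date2': ['1/1/2014', '2/1/2014'], 'Place': ['A', 'C']})

# Assumption: the strings are day/month/year. Parsing them avoids comparing text.
df1['date1'] = pd.to_datetime(df1['date1'], format='%d/%m/%Y')
df2['date2'] = pd.to_datetime(df2['date2'], format='%d/%m/%Y')

limit = 100
records = []
for row in df2.itertuples(index=False):
    # Rows for the same Place, dated on or after date2, in date order.
    window = df1[(df1['Place'] == row.Place) & (df1['date1'] >= row.date2)]
    window = window.sort_values('date1')
    running = window['AVG'].cumsum()
    hit = running[running >= limit]
    if hit.empty:
        # The threshold is never reached for this row.
        records.append({'date100': pd.NaT, 'sum': float('nan')})
    else:
        first = hit.index[0]
        records.append({'date100': window.loc[first, 'date1'],
                        'sum': hit.iloc[0]})

# Both frames share the default 0..n-1 index, so join lines the rows up.
df3 = df2.join(pd.DataFrame(records))
```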

Plotting 2 data frames after merging

I have 2 different data frames that have the same column called date. Now, I want to plot these data frames where the values on X axis be the date column common to both the data frames and Y axis be the value. Also, I want to do this after concatenating both the data frames into a third frame. Currently here is what I did:
df1 = pd.DataFrame({'value': [1,2,3,4,5], 'date': [20,40,60,80,100]})
df2 = pd.DataFrame({'value': [11,21,31,41,51], 'date': [20,40,60,80,100]})
df3 = pd.concat([df1, df2], keys=['df1','df2'], axis=1)
df3.plot()
plt.show()
but the resultant plot is not what I wanted. It generates 4 plots as could be seen from the legend.
How could I just have 2 plots with a common X axis and the difference reflected in the Y axis? Please note that I want to do this after concatenating the data frames df1 and df2 and by calling plot on df3
You could use the "date" column as index before concatenating.
df1 = pd.DataFrame({'value': [1,2,3,4,5], 'date': [20,40,60,80,100]})
df2 = pd.DataFrame({'value': [11,21,31,41,51], 'date': [20,40,60,80,100]})
df3 = pd.concat([df1.set_index("date"), df2.set_index("date")], keys=['df1','df2'], axis=1)
df3.plot()
This creates a dataframe with only the two "value" columns and the date as index.
When plotting, the index is used for the x values and one line is drawn per column.
You could also ignore the column index and set new column names afterwards:
df3 = pd.concat([df1.set_index("date"), df2.set_index("date")], axis=1, ignore_index =True)
df3.columns=['df1','df2']
Or you drop the level of the index that is common to both columns after concatenation.
df3 = pd.concat([df1.set_index("date"), df2.set_index("date")], keys=['df1','df2'], axis=1)
df3.columns = df3.columns.droplevel(level=1)
Try :
df3=pd.merge(df1,df2,on='date')
df3.plot.line(x="date")
plt.show()
First since the dates seem to be same, you can merge on the date column
df3=pd.merge(df1,df2,on='date')
   value_x  date  value_y
0        1    20       11
1        2    40       21
2        3    60       31
3        4    80       41
4        5   100       51
Another way to do it using matplotlib :
Plot the date vs value_x and date vs value_y
plt.plot(df3["date"],df3["value_x"],label="df1")
plt.plot(df3["date"],df3["value_y"],label="df2")
plt.legend()
plt.show()
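If the default value_x and value_y names are hard to read in the legend, the suffixes parameter of pd.merge can name the columns after their source frames. A minimal sketch using the question's data (the suffix strings are arbitrary choices):

```python
import pandas as pd

df1 = pd.DataFrame({'value': [1, 2, 3, 4, 5], 'date': [20, 40, 60, 80, 100]})
df2 = pd.DataFrame({'value': [11, 21, 31, 41, 51], 'date': [20, 40, 60, 80, 100]})

# The overlapping 'value' column gets one suffix per source frame.
df3 = pd.merge(df1, df2, on='date', suffixes=('_df1', '_df2'))
```

df3.plot.line(x="date") should then label the two lines value_df1 and value_df2.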

Create label for two column in pandas

I have a pandas dataframe with two columns of data. Now I want to add a single label above the two columns, like in the picture below:
The two columns do not have the same values, so I can't use groupby. I only want to add the label AAA like that. How can I do it? Thank you
Reassign the columns attribute with a newly constructed pd.MultiIndex:
df.columns = pd.MultiIndex.from_product([['AAA'], df.columns.tolist()])
Consider the dataframe df
df = pd.DataFrame(1, ['hostname', 'tmserver'], ['value', 'time'])
print(df)
          value  time
hostname      1     1
tmserver      1     1
Then
df.columns = pd.MultiIndex.from_product([['AAA'], df.columns.tolist()])
print(df)
           AAA
         value time
hostname     1    1
tmserver     1    1
If you need to create a MultiIndex in the columns, the simplest way is:
df.columns = [['AAA'] * len(df.columns), df.columns]
This is similar to MultiIndex.from_arrays, which also lets you pass the names parameter:
n = ['a','b']
df.columns = pd.MultiIndex.from_arrays([['AAA'] * len(df.columns), df.columns], names=n)
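A different technique from the two answers above, shown here only as an alternative sketch: passing a dict to pd.concat with axis=1 uses the dict key as the top column level, which returns a new frame instead of reassigning df.columns. The name labelled is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(1, ['hostname', 'tmserver'], ['value', 'time'])

# The dict key 'AAA' becomes the outer level of the column MultiIndex.
labelled = pd.concat({'AAA': df}, axis=1)
```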
