python dataframe concatenate based on a chosen date - python

Say I have the following variables and dataframe:
a = '2020-04-23 14:00:00+00:00','2020-04-23 13:00:00+00:00','2020-04-23 12:00:00+00:00','2020-04-23 11:00:00+00:00','2020-04-23 10:00:00+00:00','2020-04-23 09:00:00+00:00','2020-04-23 08:00:00+00:00','2020-04-23 07:00:00+00:00','2020-04-23 06:00:00+00:00','2020-04-23 04:00:00+00:00'
b = '2020-04-23 10:00:00+00:00','2020-04-23 09:00:00+00:00','2020-04-23 08:00:00+00:00','2020-04-23 07:00:00+00:00','2020-04-23 06:00:00+00:00','2020-04-23 05:00:00+00:00','2020-04-23 04:00:00+00:00','2020-04-23 03:00:00+00:00','2020-04-23 02:00:00+00:00','2020-04-23 01:00:00+00:00'
aa = 7105.50,6923.50,6692.50,6523.00,6302.5,6081.5,6262.0,6451.50,6369.50,6110.00
bb = 6386.00,6221.00,6505.00,6534.70,6705.00,6535.00,7156.50,7422.00,7608.50,8098.00
df1 = pd.DataFrame()
df1['timestamp'] = a
df1['price'] = aa
df2 = pd.DataFrame()
df2['timestamp'] = b
df2['price'] = bb
print(df1)
print(df2)
I am trying to concatenate the rows of following:
top row of df1 to '2020-04-23 08:00:00+00:00'
'2020-04-23 07:00:00+00:00' to the last row of df2
for illustration purposes the following is what the dataframe should look like:
c = '2020-04-23 14:00:00+00:00','2020-04-23 13:00:00+00:00','2020-04-23 12:00:00+00:00','2020-04-23 11:00:00+00:00','2020-04-23 10:00:00+00:00','2020-04-23 09:00:00+00:00','2020-04-23 08:00:00+00:00','2020-04-23 07:00:00+00:00','2020-04-23 06:00:00+00:00','2020-04-23 05:00:00+00:00','2020-04-23 04:00:00+00:00','2020-04-23 03:00:00+00:00','2020-04-23 02:00:00+00:00','2020-04-23 01:00:00+00:00'
cc = 7105.50,6923.50,6692.50,6523.00,6302.5,6081.5,6262.0,6534.70,6705.00,6535.00,7156.50,7422.00,7608.50,8098.00
df = pd.DataFrame()
df['timestamp'] = c
df['price'] = cc
print(df)
Any ideas?

You can convert the timestamp columns to pd.date_time objects, and then use boolean indexing and pd.concat to select and merge them:
df1.timestamp = pd.to_datetime(df1.timestamp)
df2.timestamp = pd.to_datetime(df2.timestamp)
dfs = [df1.loc[df1.timestamp >= pd.to_datetime("2020-04-23 08:00:00+00:00"),:],
df2.loc[df2.timestamp <= pd.to_datetime("2020-04-23 07:00:00+00:00"),:]
]
df_conc = pd.concat(dfs)

Related

Merging df in python

Say I have two DataFrames
df1 = pd.DataFrame({'A':[1,2], 'B':[3,4]}, index = [0,1])
df2 = pd.DataFrame({'B':[8,9], 'C':[10,11]}, index = [1,2])
I want to merge so that any values in df1 are overwritten in there is a value in df2 at that location and any new values in df2 are added including the new rows and columns.
The result should be:
A B C
0 1 3 nan
1 2 8 10
2 nan 9 11
I've tried combine_first but that causes only nan values to be overwritten
updated has the issue where new rows are created rather than overwritten
merge has many issues.
I've tried writing my own function
def take_right(df1, df2, j, i):
print (df1)
print (df2)
try:
s1 = df1[j][i]
except:
s1 = np.NaN
try:
s2 = df2[j][i]
except:
s2 = np.NaN
if math.isnan(s2):
#print(s1)
return s1
else:
# print(s2)
return s2
def combine_df(df1, df2):
rows = (set(df1.index.values.tolist()) | set(df2.index.values.tolist()))
#print(rows)
columns = (set(df1.columns.values.tolist()) | set(df2.columns.values.tolist()))
#print(columns)
df = pd.DataFrame()
#df.columns = columns
for i in rows:
#df[:][i]=[]
for j in columns:
df = df.insert(int(i), j, take_right(df1,df2,j,i), allow_duplicates=False)
# print(df)
return df
This won't add new columns or rows to an empty DataFrame.
Thank you!!
One approach is to create an empty output dataframe with the union of columns and indices from df1 and df2 and then use the df.update method to assign their values into the out_df
import pandas as pd
df1 = pd.DataFrame({'A':[1,2], 'B':[3,4]}, index = [0,1])
df2 = pd.DataFrame({'B':[8,9], 'C':[10,11]}, index = [1,2])
out_df = pd.DataFrame(
columns = df1.columns.union(df2.columns),
index = df1.index.union(df2.index),
)
out_df.update(df1)
out_df.update(df2)
out_df
Why does combine_first not work?
df = df2.combine_first(df1)
print(df)
Output:
A B C
0 1.0 3 NaN
1 2.0 8 10.0
2 NaN 9 11.0

how to get max 3 value from DataFrame at Axis =1

after processing some data I got df, now I need to get max 3 value from the data frame with column name
data=[[4.12,3,2],[1.0123123,-6.12312,5.123123],[-3.123123,-8.512323,6.12313]]
df = pd.DataFrame(data,columns =['a','b','c'],index=['aa','bb','cc'])
df
output:
a b c
aa 4.120000 3.000000 2.000000
bb 1.012312 -6.123120 5.123123
cc -3.123123 -8.512323 6.123130
Now I assigned each value with a columns name
df1 = df.astype(str).apply(lambda x:x+'='+x.name)
a b c
aa 4.12=a 3.0=b 2.0=c
bb 1.0123123=a -6.12312=b 5.123123=c
cc -3.123123=a -8.512323=b 6.12313=c
I need to get the max, I have tried to sort the data frame but not able to get the output
what I need is
final_df
max=1 max=2 max=3
aa 4.12=a 3.0=b 2.0=c
bb 5.123123=c 1.0123123=a -6.12312=b
cc 6.12313=c -3.123123=a -8.512323=b
I suggest you proceed as follows:
import pandas as pd
data=[[4.12,3,2],[1.0123123,-6.12312,5.123123],[-3.123123,-8.512323,6.12313]]
df = pd.DataFrame(data,columns =['a','b','c'],index=['aa','bb','cc'])
# first sort values in descending order
df.values[:, ::-1].sort(axis=1)
# then rename row values
df1 = df.astype(str).apply(lambda x: x + '=' + x.name)
# rename columns
df1.columns = [f"max={i}" for i in range(1, len(df.columns)+1)]
Result as desired:
max=1 max=2 max=3
aa 4.12=a 3.0=b 2.0=c
bb 5.123123=a 1.0123123=b -6.12312=c
cc 6.12313=a -3.123123=b -8.512323=c
As the solution proposed by #GuglielmoSanchini does not give the expected result, It works as follows:
# Imports
import pandas as pd
import numpy as np
# Data
data=[[4.12,3,2],[1.0123123,-6.12312,5.123123],[-3.123123,-8.512323,6.12313]]
df = pd.DataFrame(data,columns =['a','b','c'],index=['aa','bb','cc'])
df1 = df.astype(str).apply(lambda x:x+'='+x.name)
data = []
for index, row in df1.iterrows():
# the indices of the numbers sorted in descending order
indices_max = np.argsort([float(item[:-2]) for item in row])[::-1]
# We add the new values sorted
data.append(row.iloc[indices_max].values.tolist())
# We create the new dataframe with values sorted
df = pd.DataFrame(data, columns = [f"max={i}" for i in range(1, len(df1.columns)+1)])
df.index = df1.index
print(df)
Here is the result:
max=1 max=2 max=3
aa 4.12=a 3.0=b 2.0=c
bb 5.123123=c 1.0123123=a -6.12312=b
cc 6.12313=c -3.123123=a -8.512323=b

Fater update pandas DataFrame

I have a DataFrame named df has column GENDER, AGE and ID and others columns, and there is another DataFrame named df_2 which has only 3 columns GENDER, AGE and ID too. I want to update the value of GENDER and AGE in df with values from df_2.
So my ideas is
df_id = df.ID.tolist()
df_2_id = df_2.ID.tolist()
df = df.set_index('ID')
df_2 = df_2.set_index('ID')
# all the ids in df_2_id are in df_id
for id in tqdm.tqdm_notebook(df_2_id):
df.loc[id, 'GENDER'] = df_2.loc[id, 'GENDER']
df.loc[id, 'AGE'] = df_2.loc[id, 'AGE']
However, the for loop only has 17.2 iterations per seconds, and it around takes 2 hours to update the data. How can I make it faster?
I think you need first intersection of indices and then set values:
idx = df.index.intersection(df_2.index)
df.loc[idx, 'GENDER'] = df_2['GENDER']
df.loc[idx, 'AGE'] = df_2['AGE']
Or concat them together and remove duplicates, keep last value:
df = pd.concat([df, df_2])
df = df[~df.index.duplicated(keep='last')]
Similar solution:
df = pd.concat([df, df_2]).reset_index().drop_duplicates('ID', keep='last')
Sample:
df = pd.DataFrame({'ID':list('abcdef'),
'AGE':[5,3,6,9,2,4],
'GENDER':list('aaabbb')})
#print (df)
df_2 = pd.DataFrame({'ID':list('def'),
'AGE':[90,20,40],
'GENDER':list('eee')})
#print (df_2)
df = df.set_index('ID')
df_2 = df_2.set_index('ID')
idx = df.index.intersection(df_2.index)
df.loc[idx, 'GENDER'] = df_2['GENDER']
df.loc[idx, 'AGE'] = df_2['AGE']
print (df)
AGE GENDER
ID
a 5 a
b 3 a
c 6 a
d 90 e
e 20 e
f 40 e

Python Dataframe Rolling median with group by

I have a dataframe with three columns, viz., date,commodity and values. I want to add another column, median_20, the rolling median of last 20 days for each commodity in the df. Also, I want to add other columns which show value n days before, for example, lag_1 column shows value 1 day before for a given commodity, lag_2 shows value 2 days before, and so on. My df is quite big (>2 million rows) in size.
dates = pd.date_range('2017-01-01', '2017-07-02')
df1 = pd.DataFrame({'date':dates, 'commodity':np.random.normal(size = len(dates)), 'market':'GOLD'})
df2 = pd.DataFrame({'date':dates, 'commodity':np.random.normal(size = len(dates)), 'market':'SILVER'})
df = pd.concat([df1, df2])
df = df.sort('date')
date commodity value
0 2017-01-01 GOLD -1.239422
0 2017-01-01 SILVER -0.209840
1 2017-01-02 SILVER 0.146293
1 2017-01-02 GOLD 1.422454
2 2017-01-03 GOLD 0.453222
...
Try:
import pandas as pd
import numpy as np
# create dataframe
dates = pd.date_range('2017-01-01', '2017-07-02')
df1 = pd.DataFrame({'date':dates, 'commodity':np.random.normal(size = len(dates)), 'market':'GOLD'})
df2 = pd.DataFrame({'date':dates, 'commodity':np.random.normal(size = len(dates)), 'market':'SILVER'})
df = pd.concat([df1, df2])
df = df.sort_values(by='date').reset_index(drop=True)
# create columns
df['median_20_temp'] = df.groupby('market')['commodity'].rolling(20).median()
df['median_20'] = df.groupby('market')['median_20_temp'].shift(1)
df['lag_1'] = df.groupby('market')['commodity'].shift(1)
df['lag_2'] = df.groupby('market')['commodity'].shift(2)
df.drop(['median_20_temp'], axis=1, inplace=True)
Edit:
The following should work with version 0.16.2:
import numpy as np
import pandas as pd
np.random.seed(123)
dates = pd.date_range('2017-01-01', '2017-07-02')
df1 = pd.DataFrame({'date':dates, 'commodity':np.random.normal(size = len(dates)), 'market':'GOLD'})
df2 = pd.DataFrame({'date':dates, 'commodity':np.random.normal(size = len(dates)), 'market':'SILVER'})
df = pd.concat([df1, df2])
df = df.sort('date').reset_index(drop=True)
# create columns
df['median_20_temp'] = df.groupby('market')['commodity'].apply(lambda s: pd.rolling_median(s, 20))
df['median_20'] = df.groupby('market')['median_20_temp'].shift(1)
df['lag_1'] = df.groupby('market')['commodity'].shift(1)
df['lag_2'] = df.groupby('market')['commodity'].shift(2)
df.drop(['median_20_temp'], axis=1, inplace=True)
I hope this helps.
I am sure there is a more efficient way, meanwhile try this solution:
for commo in df.market.unique():
df.loc[df.market==commo,'lag_1'] = df.loc[df.market==commo,'commodity'].shift(1)
df.loc[df.market==commo,'median_20'] = pd.rolling_median(df.loc[df.market==commo,'commodity'],20)

Appending to an empty DataFrame in Pandas?

Is it possible to append to an empty data frame that doesn't contain any indices or columns?
I have tried to do this, but keep getting an empty dataframe at the end.
e.g.
import pandas as pd
df = pd.DataFrame()
data = ['some kind of data here' --> I have checked the type already, and it is a dataframe]
df.append(data)
The result looks like this:
Empty DataFrame
Columns: []
Index: []
This should work:
>>> df = pd.DataFrame()
>>> data = pd.DataFrame({"A": range(3)})
>>> df = df.append(data)
>>> df
A
0 0
1 1
2 2
Since the append doesn't happen in-place, so you'll have to store the output if you want it:
>>> df = pd.DataFrame()
>>> data = pd.DataFrame({"A": range(3)})
>>> df.append(data) # without storing
>>> df
Empty DataFrame
Columns: []
Index: []
>>> df = df.append(data)
>>> df
A
0 0
1 1
2 2
And if you want to add a row, you can use a dictionary:
df = pd.DataFrame()
df = df.append({'name': 'Zed', 'age': 9, 'height': 2}, ignore_index=True)
which gives you:
age height name
0 9 2 Zed
You can concat the data in this way:
InfoDF = pd.DataFrame()
tempDF = pd.DataFrame(rows,columns=['id','min_date'])
InfoDF = pd.concat([InfoDF,tempDF])
The answers are very useful, but since pandas.DataFrame.append was deprecated (as already mentioned by various users), and the answers using pandas.concat are not "Runnable Code Snippets" I would like to add the following snippet:
import pandas as pd
df = pd.DataFrame(columns =['name','age'])
row_to_append = pd.DataFrame([{'name':"Alice", 'age':"25"},{'name':"Bob", 'age':"32"}])
df = pd.concat([df,row_to_append])
So df is now:
name age
0 Alice 25
1 Bob 32
pandas.DataFrame.append Deprecated since version 1.4.0: Use concat() instead.
Therefore:
df = pd.DataFrame() # empty dataframe
df2 = pd..DataFrame(...) # some dataframe with data
df = pd.concat([df, df2])

Categories

Resources