Faster update pandas DataFrame - python

I have a DataFrame named df with columns GENDER, AGE, ID and others, and another DataFrame named df_2 which has only the 3 columns GENDER, AGE and ID. I want to update the values of GENDER and AGE in df with the values from df_2.
So my idea is:
import tqdm

df_id = df.ID.tolist()
df_2_id = df_2.ID.tolist()
df = df.set_index('ID')
df_2 = df_2.set_index('ID')
# all the ids in df_2_id are in df_id
for id in tqdm.tqdm_notebook(df_2_id):
    df.loc[id, 'GENDER'] = df_2.loc[id, 'GENDER']
    df.loc[id, 'AGE'] = df_2.loc[id, 'AGE']
However, the loop runs at only about 17.2 iterations per second, so updating the data takes around 2 hours. How can I make it faster?

I think you first need the intersection of the indices, and can then set the values:
idx = df.index.intersection(df_2.index)
df.loc[idx, 'GENDER'] = df_2['GENDER']
df.loc[idx, 'AGE'] = df_2['AGE']
Or concat them together and remove duplicates, keeping the last value:
df = pd.concat([df, df_2])
df = df[~df.index.duplicated(keep='last')]
Similar solution:
df = pd.concat([df, df_2]).reset_index().drop_duplicates('ID', keep='last')
Sample:
df = pd.DataFrame({'ID': list('abcdef'),
                   'AGE': [5, 3, 6, 9, 2, 4],
                   'GENDER': list('aaabbb')})
#print (df)
df_2 = pd.DataFrame({'ID': list('def'),
                     'AGE': [90, 20, 40],
                     'GENDER': list('eee')})
#print (df_2)
df = df.set_index('ID')
df_2 = df_2.set_index('ID')
idx = df.index.intersection(df_2.index)
df.loc[idx, 'GENDER'] = df_2['GENDER']
df.loc[idx, 'AGE'] = df_2['AGE']
print (df)
    AGE GENDER
ID
a     5      a
b     3      a
c     6      a
d    90      e
e    20      e
f    40      e
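As a side note (not part of the original answer): pandas also ships a built-in for exactly this overwrite pattern, DataFrame.update, which aligns both frames on their index and overwrites matching cells in place:
# assumes df and df_2 are already indexed by 'ID', as above;
# note that update skips NaN values in df_2 by default
df.update(df_2)
print(df)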

Related

Merging df in python

Say I have two DataFrames:
df1 = pd.DataFrame({'A':[1,2], 'B':[3,4]}, index = [0,1])
df2 = pd.DataFrame({'B':[8,9], 'C':[10,11]}, index = [1,2])
I want to merge so that any values in df1 are overwritten if there is a value in df2 at that location, and any new values in df2 are added, including new rows and columns.
The result should be:
     A  B    C
0    1  3  nan
1    2  8   10
2  nan  9   11
I've tried combine_first, but that causes only NaN values to be overwritten.
update has the issue that new rows from df2 are ignored rather than added.
merge has many issues.
I've tried writing my own function:
import math
import numpy as np
import pandas as pd

def take_right(df1, df2, j, i):
    print(df1)
    print(df2)
    try:
        s1 = df1[j][i]
    except:
        s1 = np.NaN
    try:
        s2 = df2[j][i]
    except:
        s2 = np.NaN
    if math.isnan(s2):
        #print(s1)
        return s1
    else:
        #print(s2)
        return s2

def combine_df(df1, df2):
    rows = set(df1.index.values.tolist()) | set(df2.index.values.tolist())
    #print(rows)
    columns = set(df1.columns.values.tolist()) | set(df2.columns.values.tolist())
    #print(columns)
    df = pd.DataFrame()
    #df.columns = columns
    for i in rows:
        #df[:][i]=[]
        for j in columns:
            # bug: DataFrame.insert returns None, so this sets df to None
            df = df.insert(int(i), j, take_right(df1, df2, j, i), allow_duplicates=False)
            #print(df)
    return df
This won't add new columns or rows to an empty DataFrame.
Thank you!!
One approach is to create an empty output dataframe with the union of the columns and indices from df1 and df2, and then use the df.update method to assign their values into out_df:
import pandas as pd
df1 = pd.DataFrame({'A':[1,2], 'B':[3,4]}, index = [0,1])
df2 = pd.DataFrame({'B':[8,9], 'C':[10,11]}, index = [1,2])
out_df = pd.DataFrame(
    columns=df1.columns.union(df2.columns),
    index=df1.index.union(df2.index),
)
out_df.update(df1)
out_df.update(df2)
out_df
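Running this should print something like:
     A  B    C
0    1  3  NaN
1    2  8   10
2  NaN  9   11
Note the columns come out with object dtype because the empty frame was created without dtypes; call out_df.infer_objects() or .astype(...) afterwards if numeric dtypes matter.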
Why does combine_first not work? It does, as long as you call it on df2: combine_first keeps the caller's values and only falls back to the argument's, so values from df2 win wherever both frames have one.
df = df2.combine_first(df1)
print(df)
Output:
     A  B     C
0  1.0  3   NaN
1  2.0  8  10.0
2  NaN  9  11.0
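(The 1.0/10.0 formatting is just NaN forcing the integer columns A and C to upcast to float; the values themselves match the expected output.)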

Filter dataframe based on corresponding rows in another one

I would like to create df3, where the url comes from df1 and the traffic value from the corresponding rows in df2.
Current code:
import pandas as pd
data = [['http://url1.com'], ['http://url3.com']]
data_2 = [[{'url': 'http://url1.com', 'traffic': 100}],
          [{'url': 'http://url2.com', 'traffic': 200}],
          [{'url': 'http://url3.com', 'traffic': 300}]]
df1 = pd.DataFrame(data=data, columns=['url'])
df2 = pd.DataFrame(data=data_2, columns=['url', 'traffic'])
df3 = pd.merge(left=df1, right=df2, on='url')
Expected output:
               url  traffic
0  http://url1.com      100
1  http://url3.com      300
Current output:
ValueError: 2 columns passed, passed data had 1 columns
Two things need fixing here. df2 has to be built from the dicts nested inside data_2 (the one-element inner lists are what raise the ValueError), and regarding https and http, you need to make sure you overwrite df1 with the replaced values so the urls actually match when merging:
import pandas as pd

data = [['https://url1.com'], ['https://url3.com']]
data_2 = [[{'url': 'http://url1.com', 'traffic': 100}],
          [{'url': 'http://url2.com', 'traffic': 200}],
          [{'url': 'http://url3.com', 'traffic': 300}]]
df1 = pd.DataFrame(data=data, columns=['url'])
# unwrap the dicts from the one-element inner lists
df2 = pd.DataFrame([row[0] for row in data_2])
# reassign the result, otherwise the replacement is lost
df1 = df1.replace(to_replace='https', value='http', regex=True)
df3 = pd.merge(left=df1, right=df2, on='url')
print(df3)
               url  traffic
0  http://url1.com      100
1  http://url3.com      300
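An equivalent way to do the flattening step, assuming data_2 always holds one-element lists of dicts:
from itertools import chain

# flatten the nested lists and build the frame straight from the dicts
df2 = pd.DataFrame(list(chain.from_iterable(data_2)))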

python dataframe concatenate based on a chosen date

Say I have the following variables and dataframes:
a = '2020-04-23 14:00:00+00:00','2020-04-23 13:00:00+00:00','2020-04-23 12:00:00+00:00','2020-04-23 11:00:00+00:00','2020-04-23 10:00:00+00:00','2020-04-23 09:00:00+00:00','2020-04-23 08:00:00+00:00','2020-04-23 07:00:00+00:00','2020-04-23 06:00:00+00:00','2020-04-23 04:00:00+00:00'
b = '2020-04-23 10:00:00+00:00','2020-04-23 09:00:00+00:00','2020-04-23 08:00:00+00:00','2020-04-23 07:00:00+00:00','2020-04-23 06:00:00+00:00','2020-04-23 05:00:00+00:00','2020-04-23 04:00:00+00:00','2020-04-23 03:00:00+00:00','2020-04-23 02:00:00+00:00','2020-04-23 01:00:00+00:00'
aa = 7105.50,6923.50,6692.50,6523.00,6302.5,6081.5,6262.0,6451.50,6369.50,6110.00
bb = 6386.00,6221.00,6505.00,6534.70,6705.00,6535.00,7156.50,7422.00,7608.50,8098.00
df1 = pd.DataFrame()
df1['timestamp'] = a
df1['price'] = aa
df2 = pd.DataFrame()
df2['timestamp'] = b
df2['price'] = bb
print(df1)
print(df2)
I am trying to concatenate the following rows:
the top row of df1 down to '2020-04-23 08:00:00+00:00'
'2020-04-23 07:00:00+00:00' down to the last row of df2
For illustration purposes, the following is what the dataframe should look like:
c = '2020-04-23 14:00:00+00:00','2020-04-23 13:00:00+00:00','2020-04-23 12:00:00+00:00','2020-04-23 11:00:00+00:00','2020-04-23 10:00:00+00:00','2020-04-23 09:00:00+00:00','2020-04-23 08:00:00+00:00','2020-04-23 07:00:00+00:00','2020-04-23 06:00:00+00:00','2020-04-23 05:00:00+00:00','2020-04-23 04:00:00+00:00','2020-04-23 03:00:00+00:00','2020-04-23 02:00:00+00:00','2020-04-23 01:00:00+00:00'
cc = 7105.50,6923.50,6692.50,6523.00,6302.5,6081.5,6262.0,6534.70,6705.00,6535.00,7156.50,7422.00,7608.50,8098.00
df = pd.DataFrame()
df['timestamp'] = c
df['price'] = cc
print(df)
Any ideas?
You can convert the timestamp columns to datetime objects with pd.to_datetime, and then use boolean indexing and pd.concat to select and merge them:
df1.timestamp = pd.to_datetime(df1.timestamp)
df2.timestamp = pd.to_datetime(df2.timestamp)

dfs = [df1.loc[df1.timestamp >= pd.to_datetime("2020-04-23 08:00:00+00:00"), :],
       df2.loc[df2.timestamp <= pd.to_datetime("2020-04-23 07:00:00+00:00"), :]]
df_conc = pd.concat(dfs)
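To line the result up with the expected frame c/cc from the question, you can reset the index after concatenating:
# rows now run from 14:00 down to 01:00, matching the expected output
df_conc = df_conc.reset_index(drop=True)
print(df_conc)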

Compare values based on common keys in pandas

Hi, I have two data frames, both with two columns: identifier and weight.
What I would like is, for each key (A and B), if the weights have opposite signs across the two dataframes (one positive and one negative), to create a new column with the lowest absolute value.
import pandas as pd
A = {"ID":["A", "B"], "Weight":[500,300]}
B = {"ID":["A", "B"], "Weight":[-300,100]}
dfA = pd.DataFrame(data=A)
dfB = pd.DataFrame(data=B)
dfC = dfA.merge(dfB, how='outer', left_on=['ID'], right_on=['ID'])
So the expected output would be a new column on dfC with the lowest absolute value of the two weight columns if they have opposite signs.
Here is one way via the .loc accessor:
import numpy as np
import pandas as pd

dfA = dfA.set_index('ID')
dfB = dfB.set_index('ID')
dfC = dfA.copy()
dfC['Result'] = 0
# True where the two weights have opposite signs
mask = (dfA['Weight'] > 0) != (dfB['Weight'] > 0)
dfC.loc[mask, 'Result'] = np.minimum(dfA['Weight'].abs(), dfB['Weight'].abs())
dfC = dfC.reset_index()

#   ID  Weight  Result
# 0  A     500     300
# 1  B     300       0
Here is another way to get the result you want, using df.apply and pd.concat.
Step 1: Create dfC with the columns ID, WeightA and WeightB:
import numpy as np

A = dfA.set_index('ID')
B = dfB.set_index('ID')
dfC = pd.concat([A, B], axis=1).reset_index()
dfC.columns = ['ID', 'WeightA', 'WeightB']
Edit:
You can use your dfC too; just rename the columns as below and use Step 2 for your result.
dfC = dfA.merge(dfB, how='outer', left_on=['ID'], right_on=['ID'])
dfC.columns = ['ID', 'WeightA', 'WeightB']
Step 2: Create the column 'lowestAbsWeight', which is the lower absolute value of the two weights A and B:
dfC['lowestAbsWeight'] = dfC.apply(
    lambda row: np.absolute(row['WeightA'])
                if np.absolute(row['WeightA']) < np.absolute(row['WeightB'])
                else np.absolute(row['WeightB']),
    axis=1)
The output looks like:
  ID  WeightA  WeightB  lowestAbsWeight
0  A      500     -300              300
1  B      300      100              100
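As a side note (not part of the original answer), the apply can be replaced by a vectorized row-wise minimum over the absolute values:
# same result, no Python-level loop
dfC['lowestAbsWeight'] = dfC[['WeightA', 'WeightB']].abs().min(axis=1)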
Hope this helps.

pandas convert grouped rows into columns

I have a dataframe such as:
label  column1
a      1
a      2
b      6
b      4
I would like to make a dataframe with a new column containing, for each row, the other value from column1 within the same label group. Such as:
label  column1  column2
a      1        2
a      2        1
b      6        4
b      4        6
I know this is probably very simple to do with a groupby command but I've been searching and can't find anything.
The following uses groupby and apply and seems to work okay:
import numpy as np
import pandas as pd

x = pd.DataFrame({'label': ['a', 'a', 'b', 'b'],
                  'column1': [1, 2, 6, 4]})
y = x.groupby('label').apply(
    lambda g: g.assign(column2=np.asarray(g.column1[::-1])))
y = y.reset_index(drop=True)  # optional: drop weird index
print(y)
You can try the code block below:
# create the Dataframe
df = pd.DataFrame({'label': ['a', 'a', 'b', 'b'],
                   'column1': [1, 2, 6, 4]})

# group by label, taking the first and last row of each group
a = df.groupby('label').first().reset_index()
b = df.groupby('label').last().reset_index()

# concat those groups to create column2
df2 = (pd.concat([b, a])
         .sort_values(by='label')
         .rename(columns={'column1': 'column2'})
         .reset_index()
         .drop('index', axis=1))

# merge with the original Dataframe by row position
# (passing both on= and left_index=/right_index= raises a MergeError)
df = df.merge(df2.drop('label', axis=1),
              left_index=True, right_index=True)[['label', 'column1', 'column2']]
Hope this helps
Assuming there are only pairs of labels, you could use the following as well:
# create dataframe
df = pd.DataFrame(data={'label': ['a', 'a', 'b', 'b'],
                        'column1': [1, 2, 6, 4]})

# iterate over the dataframe, find the row with the same label and the other value
for index, row in df.iterrows():
    newvalue = int(df[(df.label == row.label) & (df.column1 != row.column1)].column1.values[0])
    # write the value to the new column
    # (df.set_value was removed in pandas 1.0; .at is the modern equivalent)
    df.at[index, 'column2'] = newvalue

df.head()
You can use groupby with apply to create a new Series in reverse order:
df['column2'] = df.groupby('label')["column1"] \
                  .apply(lambda x: pd.Series(x[::-1].values)).reset_index(drop=True)
print (df)
   column1 label  column2
0        1     a        2
1        2     a        1
2        6     b        4
3        4     b        6
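As a side note (not from the original answers), groupby.transform achieves the same reversal more directly:
# reverse column1 within each label group; .values drops the index so the
# reversed values are written back by position
df['column2'] = df.groupby('label')['column1'].transform(lambda s: s[::-1].values)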
