I have two dataframes and I am using pandas.
I want to do a cumulative sum starting from a variable date, grouped by the value in a column.
I want to add a second column to df2 that shows the date when the cumulative sum of the AVG column first exceeds 100, counting from date2 in df2.
For example, with df1 and df2 being the dataframes I start with and df3 what I want, df3['date100'] is the day the sum of AVG first exceeds 100:
df1 = pd.DataFrame({'date1': ['1/1/2014', '2/1/2014', '3/1/2014','1/1/2014', '2/1/2014', '3/1/2014','1/1/2014', '2/1/2014', '3/1/2014'],
'Place':['A','A','A','B','B','B','C','C','C'],'AVG': [62,14,47,25,74,60,78,27,41]})
df2 = pd.DataFrame({'date2': ['1/1/2014', '2/1/2014'], 'Place': ['A', 'C']})
*Something*
df3 = pd.DataFrame({'date2': ['1/1/2014', '2/1/2014'], 'Place':['A','C'], 'date100': ['3/1/2014', '2/1/2014'], 'sum': [123, 105]})
I found some answers, but most of them use groupby, and df2 has no groups.
Since your example is very basic, if you have edge cases you want me to take care of, just ask. This solution assumes that your DataFrame is sorted by date.
The solution:
# For this solution your DataFrame needs to be sorted by date.
limit = 100

df = pd.DataFrame({
    'date1': ['1/1/2014', '2/1/2014', '3/1/2014', '1/1/2014',
              '2/1/2014', '3/1/2014', '1/1/2014', '2/1/2014', '3/1/2014'],
    'Place': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'AVG': [62, 14, 47, 25, 74, 60, 78, 27, 41]})
df2 = pd.DataFrame({'date2': ['1/1/2014', '2/1/2014'], 'Place': ['A', 'C']})

result = []
for row in df2.to_dict('records'):
    # For each date, select the rows that come ON or AFTER this one.
    # Then take the .cumsum(), because it's the aggregation you want.
    # Filter by your limit and take the first occurrence.
    # Converting this to a dict and appending it to a list makes it easy
    # to rebuild a DataFrame later.
    ndf = df.loc[(df['date1'] >= row['date2']) & (df['Place'] == row['Place'])]\
            .sort_values(by='date1')
    ndf['avgsum'] = ndf['AVG'].cumsum()
    final_df = ndf.loc[ndf['avgsum'] >= limit]
    # Error handling, in case there is no avgsum above the threshold.
    try:
        final_df = final_df.iloc[0][['date1', 'avgsum']].rename({'date1': 'date100'})
        result.append(final_df.to_dict())
    except IndexError:
        continue

df3 = pd.DataFrame(result)
final_df = pd.concat([df2, df3], axis=1, sort=False)
print(final_df)
#       date2 Place  avgsum   date100
# 0  1/1/2014     A   123.0  3/1/2014
# 1  2/1/2014     C     NaN       NaN
Here is a direct solution, with the following assumptions:
df1 is sorted by date
one solution exists for every date in df2
You can then do:
df2 = df2.join(pd.concat([
        pd.DataFrame(pd.DataFrame(df1.loc[df1.date1 >= d].AVG.cumsum())
                     .query('AVG >= 100').iloc[0]).transpose()
        for d in df2.date2]).rename_axis('ix').reset_index())\
    .join(df1.drop(columns='AVG'), on='ix')\
    .rename(columns={'AVG': 'sum', 'date1': 'date100'})\
    .drop(columns='ix')[['date2', 'date100', 'sum']]
This does the following:
for each date in df2, find the first date when the cumulative sum of AVG reaches at least 100
combine the results in one single dataframe, indexed by the index of that line in df1
store that index in an ix column and reset the index, to join that dataframe to df2
join that to df1 minus the AVG column, using the ix column
rename the columns, remove the ix column, and re-order everything
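If the chained one-liner is hard to follow, here is a step-by-step sketch of the same logic (my rewrite, same assumptions; note that, like the one-liner, it filters only on date, not on Place):
rows = []
for d in df2.date2:
    # Cumulative sum of AVG over the rows on or after d.
    cum = df1.loc[df1.date1 >= d, 'AVG'].cumsum()
    first = cum[cum >= 100].index[0]  # first row reaching the limit
    rows.append({'date100': df1.loc[first, 'date1'], 'sum': cum[first]})

df3 = pd.concat([df2, pd.DataFrame(rows)], axis=1)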
I would like to assign agent_code to a specific number of rows in df2.
df1
df2
df3 (Output)
Thank you.
First make sure both DataFrames have the default index, via DataFrame.reset_index with drop=True, then repeat agent_code, reset its index, and finally use concat:
df1 = df1.reset_index(drop=True)
df2 = df2.reset_index(drop=True)
s = df1['agent_code'].repeat(df1['number']).reset_index(drop=True)
df3 = pd.concat([df2, s], axis=1)
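The question's tables aren't shown above, so here is a minimal sketch with made-up data (the agent_code and number columns come from the answer's code; the values and df2's order_id column are my assumptions):
import pandas as pd

# Hypothetical inputs, since the question's tables are not shown.
df1 = pd.DataFrame({'agent_code': ['X1', 'X2'], 'number': [2, 3]})
df2 = pd.DataFrame({'order_id': [10, 11, 12, 13, 14]})

df1 = df1.reset_index(drop=True)
df2 = df2.reset_index(drop=True)

# Repeat each agent_code 'number' times, then align positionally with df2.
s = df1['agent_code'].repeat(df1['number']).reset_index(drop=True)
df3 = pd.concat([df2, s], axis=1)
print(df3)
#    order_id agent_code
# 0        10         X1
# 1        11         X1
# 2        12         X2
# 3        13         X2
# 4        14         X2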
I’ve got two data frames:
Df1
Time V1 V2
02:00 D3F3 0041
02:01 DD34 0040
Df2
FileName V1 V2
1111.txt D3F3 0041
2222.txt 0000 0040
Basically I want to compare the V1 and V2 columns and, if they match, print the Time from the df1 row and the FileName from the df2 row. So far all I can find is isin(), which simply gives you a boolean output.
So the output would be:
1111.txt 02:00
I started using dataframes because I thought I could query the two df's on the V1/V2 values, but I can't see a way. Any pointers would be much appreciated.
Use merge on the dataframe columns that you want to have the same values. You can then drop the rows with NaN values, as those will not have matching values. From there, you can print the merged dataframe's values however you see fit.
df1 = pd.DataFrame({'Time': ['8a', '10p'], 'V1': [1, 2], 'V2': [3, 4]})
df2 = pd.DataFrame({'fn': ['8.txt', '10.txt'], 'V1': [3, 2], 'V2': [3, 4]})
df1.merge(df2, on=['V1', 'V2'], how='outer').dropna()
=== Output: ===
Time V1 V2 fn
1 10p 2 4 10.txt
The most intuitive solution is:
1) iterate the V1 column in DF1;
2) for each item in this column, check if this item exists in the V1 column of DF2;
3) if the item exists in DF2's V1, find the index of that item in DF2, and from that index you can read off the file name (see the sketch below).
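A minimal sketch of those steps, rebuilding the question's frames (string dtypes assumed; it matches on V1 only, as the steps describe, and extending to V2 is the same pattern):
import pandas as pd

df1 = pd.DataFrame({'Time': ['02:00', '02:01'],
                    'V1': ['D3F3', 'DD34'], 'V2': ['0041', '0040']})
df2 = pd.DataFrame({'FileName': ['1111.txt', '2222.txt'],
                    'V1': ['D3F3', '0000'], 'V2': ['0041', '0040']})

# 1) iterate df1's V1; 2) check membership in df2's V1; 3) use the
# matching index to look up the file name.
for i, v1 in df1['V1'].items():
    matches = df2.index[df2['V1'] == v1]
    for j in matches:
        print(df2.loc[j, 'FileName'], df1.loc[i, 'Time'])
# 1111.txt 02:00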
You can try using pd.concat.
In this case it would be like:
pd.concat([df1, df2.reindex(df1.index)], axis=1)
It will create a new dataframe with all the values, but if some values don't match in both dataframes, it'll return NaN for those. If you don't want that to happen, use an inner join:
pd.concat([df1, df2], axis=1, join='inner')
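For instance, a quick sketch of the difference, on toy frames of my own with partially overlapping indexes:
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3]}, index=[0, 1, 2])
df2 = pd.DataFrame({'b': [10, 20]}, index=[1, 2])

# Outer (default) alignment keeps every index label; missing cells become NaN.
print(pd.concat([df1, df2], axis=1))
#    a     b
# 0  1   NaN
# 1  2  10.0
# 2  3  20.0

# join='inner' keeps only the labels present in both frames.
print(pd.concat([df1, df2], axis=1, join='inner'))
#    a   b
# 1  2  10
# 2  3  20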
If you want to learn a bit more, see the pandas merging guide: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
You can use merge with an inner join:
df2.merge(df1,how="inner",on=["V1","V2"])[["FileName","Time"]]
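With the question's frames, this yields:
   FileName   Time
0  1111.txt  02:00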
While I think Eric's solution is more Pythonic, if your only aim is to print the rows on which df1 and df2 have the same V1 and V2 values, and provided the two dataframes are of the same length, you can do the following:
for row in range(len(df1)):
    if (df1.iloc[row, 1:] == df2.iloc[row, 1:]).all():
        print(df1.iloc[row], df2.iloc[row])
Try this:
import io

import boto3
import pandas as pd

client = boto3.client('s3')
obj = client.get_object(Bucket='', Key='')
data = obj['Body'].read()

# Read the first two sheets (positional; the original used the strings
# '0' and '1', which would only work for sheets literally named that way).
df1 = pd.read_excel(io.BytesIO(data), sheet_name=0)
df2 = pd.read_excel(io.BytesIO(data), sheet_name=1)

head = df2.columns[0]
print(head)

# Value at row 8, column 0 of the first sheet.
data = df1.iloc[[8], [0]].values[0]
print(data)
print(df2)

# Promote df2's first row to be its header, then drop that row.
df2.columns = df2.iloc[0]
df2 = df2.drop(labels=0, axis=0)

df2['Head'] = head
df2['ID'] = pd.Series([data, data])
print(df2)
df2.to_csv('test.csv', index=False)
Hi, I have two data frames, both with two columns: identifier and weight.
What I would like is, for each "key" (so A and B): if the weight columns have opposite signs across the two dataframes (one positive and one negative), create a new column with the lowest absolute value.
import pandas as pd
A = {"ID":["A", "B"], "Weight":[500,300]}
B = {"ID":["A", "B"], "Weight":[-300,100]}
dfA = pd.DataFrame(data=A)
dfB = pd.DataFrame(data=B)
dfC = dfA.merge(dfB, how='outer', left_on=['ID'], right_on=['ID'])
So the expected output would be a new column on dfC with the lowest absolute value between both weight columns when they have opposite signs.
Here is one way, via the .loc accessor:
import numpy as np
import pandas as pd
dfA = dfA.set_index('ID')
dfB = dfB.set_index('ID')
dfC = dfA.copy()
dfC['Result'] = 0
mask = (dfA['Weight'] > 0) != (dfB['Weight'] > 0)
dfC.loc[mask, 'Result'] = np.minimum(dfA['Weight'].abs(), dfB['Weight'].abs())
dfC = dfC.reset_index()
# ID Weight Result
# 0 A 500 300
# 1 B 300 0
Here is another way to get the result you want, using df.apply and df.concat.
Step 1: Create dfC with ID, WeightA and WeightB:
import numpy as np
A = dfA.set_index('ID')
B = dfB.set_index('ID')
dfC = pd.concat([A, B], axis=1).reset_index()
dfC.columns = ['ID', 'WeightA', 'WeightB']
Edit:
You can use your dfC too; just rename the columns as below and use Step 2 for your result.
dfC = dfA.merge(dfB, how='outer', left_on=['ID'], right_on=['ID'])
dfC.columns = ['ID', 'WeightA', 'WeightB']
Step 2: Create the column 'lowestAbsWeight', which is the lower absolute value of the two weights A and B:
dfC['lowestAbsWeight'] = dfC.apply(
    lambda row: np.absolute(row['WeightA'])
    if np.absolute(row['WeightA']) < np.absolute(row['WeightB'])
    else np.absolute(row['WeightB']), axis=1)
The output looks like:
ID WeightA WeightB lowestAbsWeight
0 A 500 -300 300
1 B 300 100 100
Hope this helps.
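As a side note (my addition, not part of either answer), the same column can be computed without apply, using numpy's vectorized minimum on the renamed dfC:
import numpy as np

# Element-wise minimum of the absolute weights; equivalent to the
# row-wise apply above, but vectorized.
dfC['lowestAbsWeight'] = np.minimum(dfC['WeightA'].abs(), dfC['WeightB'].abs())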
df.index = 10,100,1000
df2.index = 1,2,11,50,101,500,1001
Just a sample.
I need to match the closest index from df2 against df, under these conditions:
df2.index has to be > df.index
only one closest value
For example, the output:
df | df2
10 | 11
100 | 101
1000 | 1001
Right now I do it with a for-loop and it's extremely slow.
And I used new_df2 to keep the matched rows, instead of df2:
new_df2 = pd.DataFrame(columns=["value"])
for col in df.index:
    for col2 in df2.index:
        if col2 > col:
            new_df2.loc[col2] = df2.loc[col2]
            break
        else:
            df2 = df2[1:]  # delete first row for index speed
How can I avoid the for-loop in this case? Thanks.
Not sure how robust this is, but you can sort df2 so its index is decreasing, and use asof to find the most recent index label matching each key in df's index:
df2.sort_index(ascending=False, inplace=True)
df['closest_df2'] = df.index.map(lambda x: df2.index.asof(x))
df
Out[19]:
a closest_df2
10 1 11
100 2 101
1000 3 1001
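As an alternative sketch (my addition, not part of the answer above), pd.merge_asof with direction='forward' does the same match in a vectorized way; allow_exact_matches=False makes the match strictly greater, and both indexes must be sorted ascending:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]}, index=[10, 100, 1000])
df2 = pd.DataFrame({'value': range(7)}, index=[1, 2, 11, 50, 101, 500, 1001])

# Expose both indexes as columns so the matched df2 label survives the merge.
left = df.rename_axis('df_index').reset_index()
right = df2.rename_axis('df2_index').reset_index()

out = pd.merge_asof(left, right, left_on='df_index', right_on='df2_index',
                    direction='forward', allow_exact_matches=False)
print(out[['df_index', 'df2_index']])
#    df_index  df2_index
# 0        10         11
# 1       100        101
# 2      1000       1001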
I have two dataframes, df1 and df2.
I would like to get whatever values are common to df1 and df2, where the dt value in df2 is greater than df1's dt value.
In this case, the expected value is 'fee':
df1 = pd.DataFrame([['2015-01-01 06:00', 'foo'],
                    ['2015-01-01 07:00', 'fee'],
                    ['2015-01-01 08:00', 'fum']],
                   columns=['dt', 'value'])
df1.dt = pd.to_datetime(df1.dt)

df2 = pd.DataFrame([['2015-01-01 06:10', 'zoo'],
                    ['2015-01-01 07:10', 'fee'],
                    ['2015-01-01 08:10', 'feu'],
                    ['2015-01-01 09:10', 'boo']],
                   columns=['dt', 'value'])
df2.dt = pd.to_datetime(df2.dt)
One way would be to merge on the 'value' column, so that only matching rows are produced; you can then filter the merged df using the 'dt_x' and 'dt_y' columns:
In [15]:
merged = df2.merge(df1, on='value')
merged[merged['dt_x'] > merged['dt_y']]
Out[15]:
dt_x value dt_y
0 2015-01-01 07:10:00 fee 2015-01-01 07:00:00
You can't do something like the following because the lengths don't match:
df2[ (df2['value'].isin(df1['value'])) & (df2['dt'] > df1['dt']) ]
raises:
ValueError: Series lengths must match to compare