I’ve got two data frames :-
Df1
Time V1 V2
02:00 D3F3 0041
02:01 DD34 0040
Df2
FileName V1 V2
1111.txt D3F3 0041
2222.txt 0000 0040
Basically I want to compare the v1 v2 columns and if they match print the row time from df1 and the row from df2 filename. So far all i can find is the
isin()
, which simply gives you a boolean output.
So the output would be :
1111.txt 02:00
I started using dataframes because i though i could query the two df's on the V1 / V2 values but I can't see a way. Any pointers would be much appreciated
Use merge on the dataframe columns that you want to have the same values. You can then drop the rows with NaN values, as those will not have matching values. From there, you can print the merged dataframes values however you see fit.
df1 = pd.DataFrame({'Time': ['8a', '10p'], 'V1': [1, 2], 'V2': [3, 4]})
df2 = pd.DataFrame({'fn': ['8.txt', '10.txt'], 'V1': [3, 2], 'V2': [3, 4]})
df1.merge(df2, on=['V1', 'V2'], how='outer').dropna()
=== Output: ===
Time V1 V2 fn
1 10p 2 4 10.txt
The most intuitive solution is:
1) iterate the V1 column in DF1;
2) for each item in this column, check if this item exists in the V1 column of DF2;
3) if the item exists in DF2's V1, then find the index of that item in the DF2 and then you would be able to find the file name.
You can try using pd.concat.
On this case it would be like:
pd.concat([df1, df2.reindex(df1.index)], axis=1)
It will create a new dataframe with all the values, but in case there are some values that doesn't match in both dataframes, it'll return NaN. If you doesn't want this to happen you must use this:
pd.concat([df1, df4], axis=1, join='inner')
If you wanna learn a bit more, use pydata: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
You can use merge option with inner join
df2.merge(df1,how="inner",on=["V1","V2"])[["FileName","Time"]]
While I think Eric's solution is more pythonic, if your only aim is to print the rows on which df1 and df2 have v1 and v2 values the same, provided the two dataframes are of the same length, you can do the following:
for row in range(len(df1)):
if (df1.iloc[row,1:] == df2.iloc[row,1:]).all() == True:
print(df1.iloc[row], df2.iloc[row])
Try this:
client = boto3.client('s3')
obj = client.get_object(Bucket='', Key='')
data = obj['Body'].read()
df1 = pd.read_excel(io.BytesIO(data), sheet_name='0')
df2 = pd.read_excel(io.BytesIO(data), sheet_name='1')
head = df2.columns[0]
print(head)
data = df1.iloc[[8],[0]].values[0]
print(data)
print(df2)
df2.columns = df2.iloc[0]
df2 = df2.drop(labels=0, axis=0)
df2['Head'] = head
df2['ID'] = pd.Series([data,data])
print(df2)
df2.to_csv('test.csv',index=False)
Related
One of the solutions that is similar is found in here where the asker only have a single dataframe and their requirements was to match a fixed string value:
result = df.loc[(df['Col1'] =='Team2') & (df['Col2']=='Medium'), 'Col3'].values[0]
However, the problem I encountered with the .loc method is that it requires the 2 dataframes to have the same size because it will only match values on the same row position of each dataframe. So if the orders of the rows are mixed in either of the dataframes, it will not work as expected.
Sample of this situation is shown below:
df1 - df1 = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6]})
df2 - df2 = pd.DataFrame({'a': [1, 3, 2], 'b': [4, 6, 5]})
Using, df1.loc[(df1['a'] == df2['a']) & (df1['b'] == df2['b']), 'Status'] = 'Included' will yield:
But i'm looking for something like this:
I have looked into methods such as .lookup but is deprecated as of December of the year 2020 (which also requires similar sized dataframes).
Use DataFrame.merge with indicator parameter for new column with this information, if need change values e.g. use numpy.where:
df = df1.merge(df2, indicator='status', how='left')
df['status'] = np.where(df['status'].eq('both'), 'included', 'not included')
print (df)
a b status
0 1 4 included
1 2 5 included
2 3 6 included
Basically, I have 5 pd.dataframes, named= df0, df1, df2, df3, df4. What I would like to do is use a for loop to add data to these 5 dataframes. Something the likes of:
for i, dataset in enumerate([df0,df1,df2,df3,df4]):
dataset = pd.concat([dataset, NEW_DATA])
However, when you do it like this (or when you use a solely list instead of enumerate), 'dataset' returns the dataset, rather than the name (i.e. df0). How can I solve this. For example, the output for the second iteration should be:
for i, dataset in enumerate([df0,df1,df2,df3,df4]):
df1 = pd.concat([df1, NEW_DATA])
edit: I have also tried dictionaries, such as {'df0':df0... etc}, however, it again prints the dataset rather than the dataset 'variable name'.
You can re-assign the new df into your list:
# setup example
df0 = pd.DataFrame(np.random.randint(0, 10, (3, 2)))
df1 = pd.DataFrame(np.random.randint(0, 10, (3, 2)))
df2 = pd.DataFrame(np.random.randint(0, 10, (3, 2)))
# then
lst = [df0, df1, df2]
for i, df in enumerate(lst):
newdata = pd.DataFrame([[0,0], [0,0]]) # (say)
lst[i] = df.append(newdata)
df0, df1, df2 = lst
>>> df0
0 1
0 8 7
1 9 1
2 5 6
0 0 0
1 0 0
But, BTW, it might be better to store your DataFrames collection in a dict instead of a list, if you want to refer to them by name instead of by index.
Edit: Rewriting the solution to provide some proper practice.
So the problem is you have a bunch of values that need to be updated through reassignment. There's a stylistic thing going on where if you have df1, df2, ..., maybe you'd much rather have them in a list.
Using a list in any case is also how I'd address the issue.
dfs = [df0, df1, df2, ...]
dfs = [pd.concat([df, NEW_DATA]) for df in dfs]
[df0, df1, df2, ...] = dfs
See how, if you'd just use dfs in general and refer to dfs[0] instead of df0, this solution could almost come for free?
I want to merge 2 dataframes and first is dm.shape = (21184, 34), second is po.shape = (21184, 6). I want to merge them then it will be 40 columns. I write as this
dm = dm.merge(po, left_index=True, right_index=True)
then it is dm.shape = (4554, 40) my rows decreased.
P.s po is the PolynomialFeatures of numerical data of dm.
Problem is different index values, so convert them to default RangeIndex in both DataFrames:
df = dm.reset_index(drop=True).merge(po.reset_index(drop=True),
left_index=True,
right_index=True)
Solution with concat - by default outer join, but if same index values in both working same:
df = pd.concat([dm.reset_index(drop=True), po.reset_index(drop=True)], axis=1)
Or use:
dm = pd.DataFrame([dm.values.flatten().tolist(), po.values.flatten().tolist()]).rename(index=dict(zip(range(2),[*po.columns.tolist(), *dm.columns.tolist()]))).T
You can use the method join and set the parameter on to the index of the joined dataframe:
df1 = pd.DataFrame({'col1': [1, 2]}, index=[1,2])
df2 = pd.DataFrame({'col2': [3, 4]}, index=[3,4])
df1.join(df2, on=df2.index)
Output:
col1 col2
1 1 3
2 2 4
The joined dataframe must not contain duplicated indices.
I have to dataframes and I am using pandas.
I want to do a cumulative sum from a variable date and by the value in a column
I want to add a second column to df2 that show the date to know the day when the sum of the AVG column is greater than 100 after date2 in df2.
For example with df1 and df2 being the dataframe I start with and df3 what I want and df3['date100'] is the day the sum of avg is greater than 100:
df1 = pd.DataFrame({'date1': ['1/1/2014', '2/1/2014', '3/1/2014','1/1/2014', '2/1/2014', '3/1/2014','1/1/2014', '2/1/2014', '3/1/2014'],
'Place':['A','A','A','B','B','B','C','C','C'],'AVG': [62,14,47,25,74,60,78,27,41]})
df2 = pd.DataFrame({'date2': ['1/1/2014', '2/1/2014'], 'Place':['A','C'])})
*Something*
df3 = pd.DataFrame({'date2': ['1/1/2014', '2/1/2014'], 'Place':['A','C'], 'date100': ['3/1/2014', '2/1/2014'], 'sum': [123, 105]})
I found some answers but most them use groupby and df2 has no groups.
Since your example is very basic, if you have edge cases you want me to take care of, just ask. This solution implies that :
The solution :
# For this solution your DataFrame needs to be sorted by date.
limit = 100
df = pd.DataFrame({
'date1': ['1/1/2014', '2/1/2014', '3/1/2014','1/1/2014',
'2/1/2014', '3/1/2014','1/1/2014', '2/1/2014', '3/1/2014'],
'Place':['A','A','A','B','B','B','C','C','C'],
'AVG': [62,14,47,25,74,60,78,27,41]})
df2 = pd.DataFrame({'date2': ['1/1/2014', '2/1/2014'], 'Place':['A','C']})
result = []
for row in df2.to_dict('records'):
# For each date, I want to select the date that comes AFTER this one.
# Then, I take the .cumsum(), because it's the agg you wish to do.
# Filter by your limit and take the first occurrence.
# Converting this to a dict, appending it to a list, makes it easy
# to rebuild a DataFrame later.
ndf = df.loc[ (df['date1'] >= row['date2']) & (df['Place'] == row['Place']) ]\
.sort_values(by='date1')
ndf['avgsum'] = ndf['AVG'].cumsum()
final_df = ndf.loc[ ndf['avgsum'] >= limit ]
# Error handling, in case there is not avgsum above the threshold.
try:
final_df = final_df.iloc[0][['date1', 'avgsum']].rename({'date1' : 'date100'})
result.append( final_df.to_dict() )
except IndexError:
continue
df3 = pd.DataFrame(result)
final_df = pd.concat([df2, df3], axis=1, sort=False)
print(final_df)
# date2 Place avgsum date100
# 0 1/1/2014 A 123.0 3/1/2014
# 1 2/1/2014 C NaN NaN
Here is a direct solution, with following assumptions:
df1 is sorted by date
one solution exists for every date in df2
You can then do:
df2 = df2.join(pd.concat([
pd.DataFrame(pd.DataFrame(df1.loc[df1.date1 >= d].AVG.cumsum()).query('AVG>=100')
.iloc[0]).transpose()
for d in df2.date2]).rename_axis('ix').reset_index())\
.join(df1.drop(columns='AVG'), on='ix').rename(columns={'AVG': 'sum', 'date1': 'date100'})\
.drop(columns='ix')[['date2', 'date100', 'sum']]
This does the following:
for each date in df2 find the first date when the cumul on AVG will be at least 100
combine the results in one single dataframe indexed by the index of that line in df1
store that index in an ix column and reset the index to join that dataframe to df2
join that to df1 minus the AVG column using the ix column
rename the columns, remove the ix column, and re-order everything
I'm searching and haven't found an answer to this question, can you perform a merge of pandas dataframes using OR logic? Basically, the equivalent of a SQL merge using "where t1.A = t2.A OR t1.A = t2.B".
I have a situation where I am pulling information from one database into a dataframe (df1) and I need to merge it with information from another database, which I pulled into another dataframe (df2), merging based on a single column (col1). If these always used the same value when they matched, it would be very straightforward. The situation I have is that sometimes they match and sometimes they use a synonym. There is a third database that has a table that provides a lookup between synonyms for this data entity (col1 and col1_alias), which could be pulled into a third dataframe (df3). What I am looking to do is merge the columns I need from df1 and the columns I need from df2.
As stated above, in cases where df1.col1 and df2.col1 match, this would work...
df = df1.merge(df2, on='col1', how='left')
However, they don't always have the same value and sometimes have the synonyms. I thought about creating df3 based on when df3.col1 was in df1.col1 OR df3.col1_alias was in df1.col1. Then, creating a single list of values from df3.col1 and df3.col1_alias (list1) and selecting df2 based on df2.col1 in list1. This would give me the rows from df2 I need but, that still wouldn't put me in position to merge df1 and df2 matching the appropriate rows. I think if there an OR merge option, I can step through this and make it work, but all of the following threw a syntax error:
df = df1.merge((df3, left_on='col1', right_on='col1', how='left')|(df3, left_on='col1', right_on='col1_alias', how='left'))
and
df = df1.merge(df3, (left_on='col1', right_on='col1')|(left_on='col1', right_on='col1_alias'), how='left')
and
df = df1.merge(df3, left_on='col1', right_on='col1'|right_on='col1_alias', how='left')
and several other variations. Any guidance on how to perform an OR merge or suggestions on a completely different approach to merging df1 and df2 using the synonyms in two columns in df3?
I think I would do this as two merges:
In [11]: df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=["A", "B"])
In [12]: df2 = pd.DataFrame([[1, 7], [2, 8], [4, 9]], columns=["C", "D"])
In [13]: res = df.merge(df2, left_on="B", right_on="C", how="left")
In [14]: res.update(df.merge(df2, left_on="A", right_on="C", how="left"))
In [15]: res
Out[15]:
A B C D
0 1 2 1.0 7.0
1 3 4 4.0 9.0
2 5 6 NaN NaN
As you can see this picks A = 1 -> D = 7 rather than B = 2 -> D = 8.
Note: For more extensibility (matching different columns) it might make sense to pull out a single column, although they're both the same in this example:
In [21]: res = df.merge(df2, left_on="B", right_on="C", how="left")["C"]
In [22]: res.update(df.merge(df2, left_on="A", right_on="C", how="left")["C"])
In [23]: res
Out[23]:
0 1.0
1 4.0
2 NaN
Name: C, dtype: float64
#will this work?
df = pd.concat([df1.merge(df3, left_on='col1', right_on='col1', how='left'), df1.merge(df3, left_on='col1', right_on='col1_alias', how='left')]