I was trying to merge two dataframes using a less-than condition, but ended up using pandasql.
Is it possible to do the same query below using pandas functions?
(Records may be duplicated, but that is fine as I'm looking for something like a cumulative total later.)
sql = '''select A.Name,A.Code,B.edate from df1 A
inner join df2 B on A.Name = B.Name
and A.Code=B.Code
where A.edate < B.edate '''
df4 = sqldf(sql)
The suggested answer looks similar, but I couldn't get the expected result from it. The answer below also looks very crisp.
Use:
df = df1.merge(df2, on=['Name','Code']).query('edate_x < edate_y')[['Name','Code','edate_y']]
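For illustration, here is a minimal, hedged sketch on made-up data (the values are assumptions, not from the question) showing how the merge-then-query pattern reproduces the inequality join:

import pandas as pd

# Hypothetical sample frames purely for illustration.
df1 = pd.DataFrame({'Name': ['A', 'A'], 'Code': [1, 1],
                    'edate': pd.to_datetime(['2021-01-01', '2021-02-01'])})
df2 = pd.DataFrame({'Name': ['A', 'A'], 'Code': [1, 1],
                    'edate': pd.to_datetime(['2021-01-15', '2021-03-01'])})

# Equi-join on the key columns, then keep only rows where df1.edate < df2.edate.
out = (df1.merge(df2, on=['Name', 'Code'])
          .query('edate_x < edate_y')[['Name', 'Code', 'edate_y']])
print(out)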
The first df I have is one that has station codes and names, along with lat/long (not as relevant), like so:
code name latitude longitude
I have another df with start/end dates for travel times. This df has only the station code, not the station name, like so:
start_date start_station_code end_date end_station_code duration_sec
I am looking to add columns that have the name of the start/end stations to the second df by matching the first df "code" and second df "start_station_code" / "end_station_code".
I am relatively new to pandas, and was looking for a way to optimize doing this as my current method takes quite a while. I use the following code:
for j in range(0, len(df_stations)):
    for i in range(0, len(df)):
        if df_stations['code'][j] == df['start_station_code'][i]:
            df['start_station'][i] = df_stations['name'][j]
        if df_stations['code'][j] == df['end_station_code'][i]:
            df['end_station'][i] = df_stations['name'][j]
I am looking for a faster method; any help is appreciated. Thank you in advance.
Use merge. If you are familiar with SQL, merge is the equivalent of a JOIN (an inner join by default; pass how="left" for a LEFT JOIN):
cols = ["code", "name"]
result = (
    second_df
    .merge(first_df[cols], left_on="start_station_code", right_on="code")
    .merge(first_df[cols], left_on="end_station_code", right_on="code")
    .rename(columns={"code_x": "start_station_code", "code_y": "end_station_code"})
)
The answer by @Code-Different is very nearly correct. However, the columns to be renamed are the name columns, not the code columns. For neatness you will likely want to drop the additional code columns created by the merges. Using your names for the dataframes, df and df_stations, the code needed to produce required_df is:
cols = ["code", "name"]
required_df = (
    df
    .merge(df_stations[cols], left_on="start_station_code", right_on="code")
    .merge(df_stations[cols], left_on="end_station_code", right_on="code")
    .rename(columns={"name_x": "start_station", "name_y": "end_station"})
    .drop(columns=["code_x", "code_y"])
)
As you may notice, the merges mean the dataframe acquires duplicate 'code' columns, which are suffixed automatically; this is a built-in default of the merge command. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html for more detail.
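As a quick sanity check, here is a self-contained sketch on tiny invented station data (the values are assumptions); passing how="left" keeps trips whose codes are missing from df_stations:

import pandas as pd

# Invented sample data purely for illustration.
df_stations = pd.DataFrame({'code': [10, 20], 'name': ['North', 'South']})
df = pd.DataFrame({'start_station_code': [10, 20],
                   'end_station_code': [20, 10],
                   'duration_sec': [300, 450]})

cols = ['code', 'name']
required_df = (
    df
    .merge(df_stations[cols], left_on='start_station_code', right_on='code', how='left')
    .merge(df_stations[cols], left_on='end_station_code', right_on='code', how='left')
    .rename(columns={'name_x': 'start_station', 'name_y': 'end_station'})
    .drop(columns=['code_x', 'code_y'])
)
print(required_df)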
I would like to do the following in pandas which I would do in SQL:
SELECT * FROM table WHERE field = value
I was thinking I could use something with an interface similar to apply or map. Something like:
def filter_func(row):
    if row['name'] == 'Bob':
        return True
    else:
        return False

df.filter(filter_func, axis=1)
Similar to how I can do:
df['new_col'] = df.apply(apply_func, axis=1)
Is there a way to do something similar so that it only returns the rows where name == 'Bob'?
The strangest thing is the pandas filter function says:
Note that this routine does not filter a dataframe on its contents. The filter is applied to the labels of the index.
That seems to me like quite a useless way to make use of a filter?
Check with
df_filter = df[df['name'] == 'Bob']
For the SQL IN operation we have isin:
# SELECT * FROM table WHERE field IN ('A', 'B')
df_filter = df[df['name'].isin(['A', 'B'])]
filter is badly named: it filters on column or index labels, or is used with a groupby filter, not on the dataframe's contents.
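A small, self-contained sketch (toy data assumed, not from the question) of the usual filtering idioms: boolean mask, isin, and query:

import pandas as pd

# Toy frame for illustration only.
df = pd.DataFrame({'name': ['Bob', 'Alice', 'Carol'], 'age': [30, 25, 35]})

# SELECT * FROM table WHERE name = 'Bob'
print(df[df['name'] == 'Bob'])

# SELECT * FROM table WHERE name IN ('Alice', 'Carol')
print(df[df['name'].isin(['Alice', 'Carol'])])

# The same equality filter expressed with query()
print(df.query("name == 'Bob'"))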
In Scala it's easy to avoid duplicate columns after join operation:
df1.join(df2, Seq("id"), "left").show()
However, is there a similar solution in PySpark? If I do df1.join(df2, df1["id"] == df2["id"], "left").show() in PySpark, I get two columns id...
You have 3 options :
1. Use outer join
aDF.join(bDF, "id", "outer").show()
2. Use aliasing: you will lose data related to B-specific ids with this.
from pyspark.sql.functions import col

aDF.alias("a").join(bDF.alias("b"), aDF.id == bDF.id, "outer").drop(col("b.id")).show()
3. Use drop to drop the columns
columns_to_drop = ['ida', 'idb']
df = df.drop(*columns_to_drop)
Let me know if that helps.
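As a rough, hedged sketch (the frame contents are invented for illustration), passing the join key as a string, as in option 1, is what keeps a single id column:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Invented sample frames for illustration.
aDF = spark.createDataFrame([(1, "x"), (2, "y")], ["id", "a_val"])
bDF = spark.createDataFrame([(1, "p"), (3, "q")], ["id", "b_val"])

# Passing the join key as a string (or list of strings) keeps a single `id` column.
aDF.join(bDF, "id", "left").show()

# Joining on an expression keeps both id columns, which then have to be dropped.
aDF.join(bDF, aDF.id == bDF.id, "left").drop(bDF.id).show()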
I am wondering if there is a way in Python (within or outside pandas) to do the equivalent of a SQL join of two tables on multiple complex conditions, such as the value in table 1 being more than 10 less than the value in table 2, or only rows in table 1 satisfying some condition, etc.
This is for combining some fundamental tables to achieve a joint table with more fields and information. I know in Pandas, we can merge two dataframes on some column names, but such a mechanism seems to be too simple to give the desired results.
For example, the equivalent SQL code could be like:
SELECT
    a.*,
    b.*
FROM Table1 AS a
JOIN Table2 AS b
    ON a.id = b.id
    AND a.sales - b.sales > 10
    AND a.country IN ('US', 'MX', 'GB', 'CA')
I would like an equivalent way to achieve the same joined table in Python on two dataframes. Can anyone share insights?
Thanks!
In principle, your query could be rewritten as a join and a filter where clause.
SELECT a.*, b.*
FROM Table1 AS a
JOIN Table2 AS b
ON a.id = b.id
WHERE a.sales - b.sales > 10 AND a.country IN ('US', 'MX', 'GB', 'CA')
Assuming the dataframes are gigantic and you don't want a big intermediate table, we can filter Dataframe A first.
import pandas as pd

df_a, df_b = pd.DataFrame(...), pd.DataFrame(...)

# Since a.country has nothing to do with the join, we can filter on it first.
df_a = df_a[df_a["country"].isin(['US', 'MX', 'GB', 'CA'])]

# Join
merged = pd.merge(df_a, df_b, on='id', how='inner')

# Filter
merged = merged[merged["sales_x"] - merged["sales_y"] > 10]
Off-topic: depending on the use case, you may want to take the abs() of the difference.
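A quick end-to-end check of the filter-then-merge pattern with invented data (the rows are assumptions, not from the question):

import pandas as pd

# Invented rows purely to exercise the filter-then-merge pattern above.
df_a = pd.DataFrame({'id': [1, 2, 3], 'sales': [100, 50, 200],
                     'country': ['US', 'FR', 'CA']})
df_b = pd.DataFrame({'id': [1, 2, 3], 'sales': [80, 10, 30]})

df_a = df_a[df_a['country'].isin(['US', 'MX', 'GB', 'CA'])]
merged = pd.merge(df_a, df_b, on='id', how='inner')
merged = merged[merged['sales_x'] - merged['sales_y'] > 10]
print(merged)  # ids 1 and 3 survive: 100 - 80 = 20 > 10 and 200 - 30 = 170 > 10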
I'm doing something that I know that I shouldn't be doing. I'm doing a for loop within a for loop (it sounds even more horrible, as I write it down.) Basically, what I want to do, theoretically, using two dataframes is something like this:
for index, row in df_2.iterrows():
    for index_1, row_1 in df_1.iterrows():
        if (row['column_1'] == row_1['column_1']
                and row['column_2'] == row_1['column_2']
                and row['column_3'] == row_1['column_3']):
            row['column_4'] = row_1['column_4']
There has got to be a (better) way to do something like this. Please help!
As pointed out by @Andy Hayden in "is it possible to do fuzzy match merge with python pandas?", you can use difflib's get_close_matches function to create new join columns.
import difflib
df_2['fuzzy_column_1'] = df_2['column_1'].apply(lambda x: difflib.get_close_matches(x, df_1['column_1'])[0])
# Do same for all other columns
Now you can apply inner join using pandas merge function.
result_df = df_1.merge(df_2, left_on=['column_1', 'column_2', 'column_3'], right_on=['fuzzy_column_1', 'fuzzy_column_2', 'fuzzy_column_3'])
You can use the drop function to remove unwanted columns.
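Putting it together as a hedged sketch on toy strings (the data and a single join column are invented for illustration; the same pattern extends to the other columns):

import difflib
import pandas as pd

# Invented sample frames; in practice these come from your own data.
df_1 = pd.DataFrame({'column_1': ['apple', 'banana'], 'column_4': [1, 2]})
df_2 = pd.DataFrame({'column_1': ['aple', 'bananna']})

# Map each value in df_2 to its closest match in df_1.
df_2['fuzzy_column_1'] = df_2['column_1'].apply(
    lambda x: difflib.get_close_matches(x, df_1['column_1'])[0])

# Join on the fuzzy key and drop the helper column afterwards.
result_df = df_1.merge(df_2, left_on='column_1', right_on='fuzzy_column_1')
result_df = result_df.drop(columns=['fuzzy_column_1'])
print(result_df)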