I was asked to do a merge between two dataframes, The first contains sales orders (1m rows) and the second contains discounts. applied to that sales orders, established according to a date range.
Both are joined by an ID (A975 ID) but I can't find an efficient way to do that merge without running out of memory.
My idea so far is to do an "outer" merge and once that's done(*), filter the "Sales Order Date" column by the value that is within the date range "DATAB"(start) - "DATBI"(end)
I was able to do it in PowerQuery with "SelectRows" but it takes too long, if I can replicate the same procedure in Python and gain a few minutes of processing it would be more than enough.
I know that an outer join generates tons of duplicated rows with only the date changed, but I don't know what else to do.
IMG Table to Merge
The solution was merging and pd.query
df = pd.merge(all_together, A904, how="outer", left_on='ID A904', right_on='ID A904')\
.query('DATAB <= ERDAT <= DATBI')
Related
My code currently looks like this:
df1 = pd.DataFrame(statsTableList)
df2 = pd.read_csv('StatTracker.csv')
result = pd.concat([df1,df2]).drop_duplicates().reset_index(drop=True)
I get an error and I'm not sure why.
The goal of my program is to pull data from an API, and then write it all to a file for analyzing. df1 is the lets say the first 100 games written to the csv file as the first version. df2 is me reading back those first 100 games the second time around and comparing it to that of df1 (new data, next 100 games) to check for duplicates and delete them.
The part that is not working is the drop duplicates part. It gives me an error of unhashable list, I would assume that's because its two dataframes that are lists of dictionaries. The goal is to pull 100 games of data, and then pull the next 50, but if I pull number 100 again, to drop that one, and just add 101-150 and then add it all to my csv file. Then if I run it again, to pull 150-200, but drop 150 if its a duplicate, etc etc..
Based from your explanation, you can use this one liner to find unique values in df1:
df_diff = df1[~df1.apply(tuple,1)\
.isin(df2.apply(tuple,1))]
This code checks if the rows is exists in another dataframe. To do the comparision it converts each row to tuple (apply tuple conversion along 1 (row) axis).
This solution is indeed slow because its compares each row inside df1 to all rows in df2. So it has time complexity n^2.
If you want more optimised version, try to use pandas built in compare method
df1.compare(df2)
Background info
I'm working on a DataFrame where I have successfully joined two different datasets of football players using fuzzymatcher. These datasets did not have keys for an exact match and instead had to be done by their names. An example match of the name column from two databases to merge as one is the following
long_name name
L. Messi Lionel Andrés Messi Cuccittini
As part of the validation process of a 18,000 row database, I want to check the two date of birth columns in the merged DataFrame - df, ensuring that the columns match like the example below
dob birth_date
1987-06-24 1987-06-24
Both date columns have been converted from strings to dates using pd.to_datetime(), e.g.
df['birth_date'] = pd.to_datetime(df['birth_date'])
My question
My query, I have another column called 'value'. I want to update my pandas DataFrame so that if the two date columns match, the entry is unchanged. However, if the two date columns don't match, I want the data in this value column to be changed to null. This is something I can do quite easily in Excel with a date_diff calculation but I'm unsure in pandas.
My current code is the following:
df.loc[(df['birth_date'] != df['dob']),'value'] = np.nan
Reason for this step (feel free to skip)
The reason for this code is that it will quickly show me fuzzy matches that are inaccurate (approx 10% of total database) and allow me to quickly fix those.
Ideally I need to also work on the matching algorithm to ensure a perfect date match, however, my current algorithm currently works quite well in it's current state and the project is nearly complete. Any advice on this however I'd be happy to hear, if this is something you know about
Many thanks in advance!
IICU:
Please Try np.where.
Works as follows;
np.where(if condition, assign x, else assign y)
if condition=df.loc[(df['birth_date'] != df['dob'],
x=np.nan and
y= prevailing df.value
df['value']= np.where(df.loc[(df['birth_date'] != df['dob']),'value'], np.nan, df['value'])
Let's say we have several dataframes that contain relevant information that need to be compiled into one single dataframe. There are several conditions involved in choosing which pieces of data can be brought over to the results dataframe.
Here are 3 dataframes (columns only) that we need to pull and compile data from:
df1 = ["Date","Order#","Line#","ProductID","Quantity","Sale Amount"]
df2 = ["Date","PurchaseOrderID","ProductID","Quantity","Cost"]
df3 = ["ProductID","Quantity","Location","Cost"]
df3 is the only table in this set that actually contains a unique non-repeating key "productid". The other two dataframes have keys, but they can repeat. the only way to find uniqueness is to refer to date and the other foreign keys.
Now, we'd like the desired result to show which all products grouped by product where df1.date after x date, where df2.quantity<5, where df3.quantity>0. Ideally the results would show the df3.quantity, df.cost (sum both in grouping), most recent purchase date from df2.date, and total number of sale by part from df1.count where all above criteria met.
This is the quickest example I could come up with on this issue. I'm able to accomplish this in VBA with only one problem... it's EXCRUCIATINGLY slow. I understand how list comprehension and perhaps other means of completing this task would be faster than VBA (maybe?), but it would still take a while with all of the logic and decision making that happens behind the scenes.
This example doesn't exactly show the complexities but any advice or direction you have to offer may help me and others understand how to treat these kinds of problems in Python. Any expert opinion, advice, direction is very much appreciated.
If I understand correctly:
You simply need to apply the conditions as filters on each dataframe, then group by ProductID and put it together.
df1 = df1[df1.Date > x].groupby('ProductID').agg({'Quantity':'sum','Sale Amount':'sum'})
df2 = df2.groupby('ProductID').agg({'Date':'max','Quantity':'sum','Cost':'sum'})
df2 = df2[df2.Quantity > 5].copy()
df3 = df3[df3.Quantity > 0].copy()
Once you have all of those, probably something like:
g = [i for i in list(df3.index) if i in list(df2.index) and i in list(df1.index)]
df = df3.loc[g] #use df3 as a frame, with only needed indexes
I am not sure what you want to pull from df1 and df2 - but it will look something like:
df = df.join(df2['col_needed'])
You may need to rename columns to avoid overlap.
This avoids inefficient looping and should be orders of magnitude faster than a loop in VBA.
I have a large dataset ('df'; ~400,000 lines) of rows with a datetime index describing features of cities.
eg.
df = pd.DataFrame([['2016-01-01 00:00:00','Jacksonville'], ['2016-01-01 01:00:00','Jacksonville'],
['2016-01-01 02:00:00','Jacksonville'], ['2016-01-01 03:00:00','Toronto']], columns=['timestamp','City'])
I want to merge this with another smaller dataset I've created ('public_holidays'; ~300 lines) that lists public holidays for those cities.
eg.
public_holidays = pd.DataFrame([['1/01/2016','New Year\'s Day','Jacksonville'], ['1/01/2016','New Year\'s Day','San Francisco'],
['25/12/2018','Christmas Day','Toronto'], ['26/12/2018','Boxing Day','Toronto']], columns=['timestamp','Holiday','City'])
Currently I've done this:
new_df= pd.merge(df, public_holidays, how = 'left', on = ['timestamp','City'])
This works, however as 'df's timestamp contains every hour of each day, the merge only occures at the hour 00:00 (as 'public_holidays' "timestamp" is only by date).
How can I get 'public_holidays' to map to every row that matches its date, regardless of time?
Many thanks for any assistance.
Add to df an auxiliary column with normalized timestamp:
df['dat'] = df.timestamp.dt.normalize()
Then in merge, instead of on=... pass:
left_on=['dat', 'City'],
right_on=['timestamp', 'City'].
Finally (after the new_df is created) you can drop this auxiliary column.
An alternative is to overwrite timestamp column with the normalized timestamp:
df.timestamp = df.timestamp.dt.normalize()
and perform the merge without any change.
Note: As you failed to include sample data, the above advice is only
"theoretical", not supported by any actual test run.
I have a pandas DataFrame with columns patient_id, patient_sex, patient_dob (and other less relevant columns). Rows can have duplicate patient_ids, as each patient may have more than one entry in the data for multiple medical procedures. I discovered, however, that a great many of the patient_ids are overloaded, i.e. more than one patient has been assigned to the same id (evidenced by many instances of a single patient_id being associated with multiple sexes and multiple days of birth).
To refactor the ids so that each patient has a unique one, my plan was to group the data not only by patient_id, but by patient_sex and patient_dob as well. I figure this must be sufficient to separate the data into individual users (and if two patients with the same sex and dob just happened to be assigned the same id, then so be it.
Here is the code I currently use:
# I just use first() here as a way to aggregate the groups into a DataFrame.
# Bonus points if you have a better solution!
indv_patients = patients.groupby(['patient_id', 'patient_sex', 'patient_dob']).first()
# Create unique ids
new_patient_id = 'new_patient_id'
for index, row in indv_patients.iterrows():
# index is a tuple of the three column values, so this should get me a unique
# patient id for each patient
indv_patients.loc[index, new_patient_id] = str(hash(index))
# Merge new ids into original patients frame
patients_with_new_ids = patients.merge(indv_patients, left_on=['patient_id', 'patient_sex', 'patient_dob'], right_index=True)
# Remove byproduct columns, and original id column
drop_columns = [col for col in patients_with_new_ids.columns if col not in patients.columns or col == new_patient_id]
drop_columns.append('patient_id')
patients_with_new_ids = patients_with_new_ids.drop(columns=drop_columns)
patients = patients_with_new_ids.rename(columns={new_patient_id : 'patient_id'})
The problem is that with over 7 million patients, this is way too slow a solution, the biggest bottleneck being the for-loop. So my question is, is there a better way to fix these overloaded ids? (The actual id doesn't matter, so long as its unique for each patient)
I don't know what the values for the columns are but have you tried something like this?
patients['new_patient_id'] = patients.apply(lambda x: x['patient_id'] + x['patient_sex'] + x['patient_dob'],axis=1)
This should create a new column and you can then use groupby with the new_patient_id