Make piece of code efficient for big data - python

I have the following code:
new_df = pd.DataFrame(columns=df.columns)
for i in list:
    temp = df[df["customer id"] == i]
    new_df = new_df.append(temp)
where list is a list of customer IDs for the customers that meet a criterion chosen earlier. I use the temp dataframe because there are multiple rows for the same customer.
I consider that I know how to code, but I have never learned how to code for big-data efficiency. In this case, the df has around 3 million rows and list contains around 100,000 items. This code ran for more than 24 hours and was still not done, so I need to ask: am I doing something terribly wrong? Is there a way to make this code more efficient?

list is a built-in type in Python. You should avoid naming your variables after built-in types or functions. I simulated the problem with 3 million rows and a customer id list of size 100,000. It took only a few seconds using isin.
new_df = df[ df['customer id'].isin(customer_list) ]

You can try the code below, which should make things faster.
new_df = df.loc[df['customer id'].isin(list)]
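
For scale, here is a minimal, self-contained sketch of the isin approach on synthetic data of roughly the sizes in the question (the synthetic values and the value column are made up purely for illustration; the "customer id" column name is from the question):

import numpy as np
import pandas as pd

# Synthetic data roughly matching the sizes described (illustrative only).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "customer id": rng.integers(0, 500_000, size=3_000_000),
    "value": rng.random(3_000_000),
})
customer_list = rng.choice(500_000, size=100_000, replace=False).tolist()

# isin builds a hash set of the ids once and filters the column in a single
# vectorized pass, instead of growing a dataframe one customer at a time.
new_df = df[df["customer id"].isin(customer_list)]
print(len(new_df))

The original loop is slow mainly because each append copies the accumulated dataframe, so the total work grows quadratically with the number of customers.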

Related

I need help concatenating 1 csv file and 1 pandas dataframe together without duplicates

My code currently looks like this:
df1 = pd.DataFrame(statsTableList)
df2 = pd.read_csv('StatTracker.csv')
result = pd.concat([df1,df2]).drop_duplicates().reset_index(drop=True)
I get an error and I'm not sure why.
The goal of my program is to pull data from an API and then write it all to a file for analysis. df1 is, let's say, the first 100 games written to the csv file as the first version. df2 is me reading back those first 100 games the second time around and comparing them to df1 (new data, the next 100 games) to check for duplicates and delete them.
The part that is not working is the drop-duplicates part. It gives me an 'unhashable list' error; I would assume that's because the two dataframes are built from lists of dictionaries. The goal is to pull 100 games of data and then pull the next 50, but if I pull game 100 again, to drop that one, just add 101-150, and then add it all to my csv file. Then if I run it again, to pull 150-200 but drop 150 if it's a duplicate, and so on.
Based on your explanation, you can use this one-liner to find the rows of df1 that are not in df2:
df_diff = df1[~df1.apply(tuple, 1).isin(df2.apply(tuple, 1))]
This code checks whether each row of df1 exists in the other dataframe. To do the comparison it converts each row to a tuple (applying the tuple conversion along axis 1, i.e. row-wise).
This solution is indeed slow because it compares each row inside df1 to all rows in df2, so it has O(n^2) time complexity.
If you want a more optimised version, try pandas' built-in compare method:
df1.compare(df2)
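
A common cause of the 'unhashable' error is cells that contain Python lists, which drop_duplicates cannot hash. Under that assumption, here is a minimal sketch of one workaround (the game_id and stats columns are hypothetical): build a hashable view of the rows and use it only to decide which rows to drop.

import pandas as pd

# Hypothetical frames whose 'stats' cells hold lists (lists are unhashable,
# which is what makes drop_duplicates raise a TypeError).
df1 = pd.DataFrame({"game_id": [1, 2], "stats": [[10, 3], [7, 1]]})
df2 = pd.DataFrame({"game_id": [2, 3], "stats": [[7, 1], [5, 5]]})

combined = pd.concat([df1, df2], ignore_index=True)

# Turn list cells into tuples so every cell is hashable, then use that view
# only to mark duplicate rows; the original cells are kept unchanged.
hashable = combined.applymap(lambda x: tuple(x) if isinstance(x, list) else x)
result = combined[~hashable.duplicated()].reset_index(drop=True)
print(result)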

How can I speed up a pandas search with a condition and location?

I have a DataFrame named df_trade_prices_up_to_snap that has about 295k rows and looks something like this:
For each ticker in the DataFrame, I need to get the last trade price and append it into a new DataFrame. The data frame is already ordered properly.
I wrote a little routine that works:
df_trade_prices_at_snap = pd.DataFrame()
ticker_list = list(df_trade_prices_up_to_snap.ticker.unique())
for ticker in ticker_list:
    df_trade_prices_at_snap = df_trade_prices_at_snap.append(
        df_trade_prices_up_to_snap[df_trade_prices_up_to_snap.ticker == ticker].tail(1)
    )
It takes about six seconds to run that loop which is too long for my needs. Can someone suggest a way to get the resulting DataFrame in a much faster way?
If the prices are ordered chronologically, you can use GroupBy.last:
df_trade_prices_at_snap = df_trade_prices_up_to_snap.groupby('ticker')['trade_price'].last()
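
If you need the full last row per ticker (all columns, not just the price), a sketch along the same lines could look like this; the sample data is made up and the trade_price column name is an assumption:

import pandas as pd

# Tiny illustrative frame; the real one has ~295k rows and is already
# ordered chronologically.
df_trade_prices_up_to_snap = pd.DataFrame({
    "ticker": ["AAPL", "AAPL", "MSFT", "MSFT", "MSFT"],
    "trade_price": [190.1, 190.4, 410.0, 409.8, 410.2],
})

# Keep the last row for each ticker while preserving every column.
df_trade_prices_at_snap = df_trade_prices_up_to_snap.drop_duplicates("ticker", keep="last")
print(df_trade_prices_at_snap)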

Python - Ways to create dataframe with multiple sources and conditions

Let's say we have several dataframes that contain relevant information that need to be compiled into one single dataframe. There are several conditions involved in choosing which pieces of data can be brought over to the results dataframe.
Here are 3 dataframes (columns only) that we need to pull and compile data from:
df1 = ["Date","Order#","Line#","ProductID","Quantity","Sale Amount"]
df2 = ["Date","PurchaseOrderID","ProductID","Quantity","Cost"]
df3 = ["ProductID","Quantity","Location","Cost"]
df3 is the only table in this set that actually contains a unique, non-repeating key, ProductID. The other two dataframes have keys, but they can repeat; the only way to find uniqueness is to refer to the date and the other foreign keys.
Now, we'd like the result to show all products, grouped by product, where df1.Date is after some date x, df2.Quantity < 5, and df3.Quantity > 0. Ideally the results would show df3.Quantity and the Cost (summing both within the grouping), the most recent purchase date from df2.Date, and the total number of sales per part from df1, where all of the above criteria are met.
This is the quickest example I could come up with for this issue. I'm able to accomplish this in VBA with only one problem... it's EXCRUCIATINGLY slow. I understand that list comprehensions and perhaps other approaches would be faster than VBA (maybe?), but it would still take a while with all of the logic and decision making that happens behind the scenes.
This example doesn't capture all of the complexities, but any advice or direction you have to offer may help me and others understand how to treat these kinds of problems in Python. Any expert opinion, advice, or direction is very much appreciated.
If I understand correctly:
You simply need to apply the conditions as filters on each dataframe, then group by ProductID and put it together.
df1 = df1[df1.Date > x].groupby('ProductID').agg({'Quantity':'sum','Sale Amount':'sum'})
df2 = df2.groupby('ProductID').agg({'Date':'max','Quantity':'sum','Cost':'sum'})
df2 = df2[df2.Quantity < 5].copy()
df3 = df3[df3.Quantity > 0].copy()
Once you have all of those, probably something like:
g = [i for i in df3.index if i in df2.index and i in df1.index]
df = df3.loc[g]  # use df3 as the frame, with only the needed indexes
I am not sure what you want to pull from df1 and df2 - but it will look something like:
df = df.join(df2['col_needed'])
You may need to rename columns to avoid overlap.
This avoids inefficient looping and should be orders of magnitude faster than a loop in VBA.
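
Putting those pieces together, a rough end-to-end sketch might look like the following. The column names come from the question, while the toy data, the cutoff date x, the direction of the Quantity filters, the choice of aggregations, and which Cost column to report are all assumptions made only so the example hangs together:

import pandas as pd

# Toy versions of the three frames described above (made-up values).
df1 = pd.DataFrame({
    "Date": pd.to_datetime(["2023-02-01", "2023-03-01"]),
    "Order#": [1001, 1002], "Line#": [1, 1],
    "ProductID": ["A", "B"], "Quantity": [2, 1], "Sale Amount": [20.0, 15.0],
})
df2 = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-15", "2023-02-20"]),
    "PurchaseOrderID": [9001, 9002],
    "ProductID": ["A", "B"], "Quantity": [3, 10], "Cost": [12.0, 40.0],
})
df3 = pd.DataFrame({
    "ProductID": ["A", "B"], "Quantity": [5, 0], "Cost": [6.0, 4.0],
})

x = pd.Timestamp("2023-01-01")  # hypothetical cutoff date

# Filter and aggregate each source, keyed by ProductID.
sales = (
    df1[df1["Date"] > x]
    .groupby("ProductID")
    .agg(sale_count=("Order#", "count"), sale_amount=("Sale Amount", "sum"))
)
purchases = (
    df2.groupby("ProductID")
    .agg(last_purchase=("Date", "max"), purchased_qty=("Quantity", "sum"))
)
purchases = purchases[purchases["purchased_qty"] < 5]
stock = df3[df3["Quantity"] > 0].set_index("ProductID")[["Quantity", "Cost"]]

# Inner joins keep only the products that satisfy every condition.
result = stock.join(purchases, how="inner").join(sales, how="inner")
print(result)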

loc in dataframe filtering taking lots of time

I have a dataframe Emp (details of employees) with 3,500,000 rows and 5 columns. I have to filter the dataframe on Emp_Name == "John". I am using loc for this purpose, but this step is taking several hours. What is the best and fastest way to filter a dataframe with a huge dataset?
Emp_subset=Emp.loc[Emp['Emp_Name'] == "John"]
It shouldn't be taking that long. There's no need to use loc here.
Try this and see how much it speeds things up:
emp_subset=Emp[Emp['Emp_Name'] == "John"]
Also try not to use capitals for df object names as it could lead to confusion: https://www.python.org/dev/peps/pep-0008/
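
If the same string column is filtered over and over, one further option worth trying is converting it to a categorical dtype so the comparison works on integer codes rather than Python strings. A sketch on a tiny made-up frame (the real Emp has 3.5 million rows; column names follow the question):

import pandas as pd

# Tiny synthetic stand-in for the Emp frame.
Emp = pd.DataFrame({
    "Emp_Name": ["John", "Mary", "John", "Alice"],
    "Emp_Id": [1, 2, 3, 4],
})

# Plain boolean-mask filtering is normally fast even on millions of rows.
emp_subset = Emp[Emp["Emp_Name"] == "John"]

# For repeated filtering on the same column, a categorical dtype compares
# integer codes instead of strings, which can cut the per-query cost.
Emp["Emp_Name"] = Emp["Emp_Name"].astype("category")
emp_subset = Emp[Emp["Emp_Name"] == "John"]
print(emp_subset)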

Is there a way to increase the speed of the loop, or a faster way to do the same thing without using a for loop?

I have a huge dataframe (4 million rows and 25 columns). I am trying to investigate 2 categorical columns. One of them has around 5000 levels (app_id) and the other has 50 levels (app_category).
I have seen that for each level in app_id there is a unique value of app_category. How do I write code to prove that?
I have tried something like this:
app_id_unique = list(train['app_id'].unique())
for unique in app_id_unique:
    train.loc[train['app_id'] == unique].app_category.nunique()
This code takes forever.
I think you need groupby with nunique:
train.groupby('app_id').app_category.nunique()
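
To actually prove the one-to-one claim, you can check that every group has exactly one distinct category. A small self-contained sketch (the toy data is made up; the column names are from the question):

import pandas as pd

# Small illustrative frame; the real one has ~4 million rows.
train = pd.DataFrame({
    "app_id": [101, 101, 102, 103, 103],
    "app_category": ["games", "games", "news", "sports", "sports"],
})

# Distinct categories per app_id, computed in one grouped pass.
per_app = train.groupby("app_id").app_category.nunique()

# The claim "each app_id maps to exactly one app_category" holds iff all counts are 1.
print((per_app == 1).all())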
