I have two dataframes, and some code that extracts data from one of them and adds it to the other:
import pandas as pd

sales = pd.read_excel("data.xlsx", sheet_name='sales', header=0)
born = pd.read_excel("data.xlsx", sheet_name='born', header=0)

bornuni = born.number.unique()
for babies in bornuni:
    dataframe = born[born["number"] == babies]
    for i, r in sales.iterrows():
        if r["number"] == babies:
            sales.loc[i, 'ini_weight'] = dataframe["weight"].iloc[0]
            sales.loc[i, 'ini_date'] = dataframe["date of birth"].iloc[0]
        else:
            pass
This is pretty inefficient with bigger data sets, so I want to parallelize this code, but I don't have a clue how to do it. Any help would be great. Here is a link to a mock dataset.
So before worrying about parallelizing, I can't help but notice that you're using lots of for loops to deal with the dataframes. Dataframes are pretty fast when you use their vectorized capabilities.
I see a lot of inefficient use of pandas here, so maybe we first fix that and then worry about throwing more CPU cores at it.
It seems to me you want to accomplish the following:
For each unique baby id number in the born dataframe, you want to update the ini_weight and ini_date fields of the corresponding entry in the sales dataframe.
There's a good chance that you can use some dataframe merging / joining to help you with that, as well as using the pivot table functionality:
https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
I strongly suggest you take a look at those, try using the ideas from these articles, and then reframe your question in terms of these operations, because as you correctly notice, looping over all the rows repeatedly to find the row with some matching index is very inefficient.
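To make that concrete, here is a rough sketch of what the merge-based version could look like, assuming 'number' is the key shared by both sheets and that born has one row per number (column names taken from your snippet):

import pandas as pd

sales = pd.read_excel("data.xlsx", sheet_name='sales', header=0)
born = pd.read_excel("data.xlsx", sheet_name='born', header=0)

# Keep only the columns to carry over, renamed to the target names.
births = born[["number", "weight", "date of birth"]].rename(
    columns={"weight": "ini_weight", "date of birth": "ini_date"})

# A single left merge does what the nested loops do row by row:
# every sales row picks up the matching weight and birth date by 'number'.
sales = sales.merge(births, on="number", how="left")

That replaces both loops with one vectorized operation, and only then is it worth asking whether you still need to parallelize at all.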
So, I work at a place where I use a LOT of Python (pandas), and the data keeps getting bigger and bigger. Last month I was working with a few hundred thousand rows, a few weeks after that with a few million rows, and now I am working with 42 million rows. Most of my work is taking a dataframe and, for each row, looking up its "equivalent" in another dataframe and processing the data - sometimes just a merge, but more often I need to apply a function to the equivalent data. Back when I had a few hundred thousand rows it was fine to just use apply and a simple filter, but now it is EXTREMELY SLOW. Recently I switched to vaex, which is way faster than pandas in every aspect except apply, and after some searching I found that apply is a last resort and should only be used if you don't have another option. So, is there another option? I really don't know.
Some code to explain how I was doing this entire time:
def get_secondary(row: pd.Series):
    cnae = row["cnae_fiscal"]
    cnpj = row["cnpj"]
    # cnaes is another dataframe
    secondary = cnaes[cnaes.cnpj == cnpj]
    return [cnae] + list(secondary["cnae"].values)

empresas["cnae_secundarios"] = empresas.apply(get_secondary, axis=1)
This isn't the only use case, as I said.
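Edit: for reference, this is the kind of rewrite people keep suggesting to me (untested sketch, assuming cnaes and empresas are as in the snippet above and cnpj is the lookup key):

# Collect all secondary CNAEs per cnpj once, instead of filtering per row.
secondary_lists = cnaes.groupby("cnpj")["cnae"].agg(list)

# Prepend the fiscal CNAE and fall back to an empty list for unknown cnpjs.
empresas["cnae_secundarios"] = [
    [cnae] + secondary_lists.get(cnpj, [])
    for cnae, cnpj in zip(empresas["cnae_fiscal"], empresas["cnpj"])
]

The big win is doing the grouping once up front rather than scanning cnaes for every row.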
I have a hard time formulating this problem in abstract terms, so I will mostly try to explain it with examples.
I have 2 pandas dataframes (I get them from a sqlite DB).
First DF:
Second DF:
So the thing is: There are several images per "capture". I would like to add the images to the capture df as columns, so that each capture has 9 image columns, each with a path. There are always 9 images per capture.
I solved it in pandas with what I know in the following way:
cam_idxs = sorted(list(range(9)) * 2)
for cam_idx in cam_idxs:
    sub_df = images.loc[images["camera_id"] == cam_idx]
    captures = captures.merge(sub_df[["image", "capture_id"]],
                              left_on="id", right_on="capture_id")
I imagine there must be a better way, though - people probably stumble into this problem quite often when getting data from a SQL database.
Since I am getting the data into pandas from a SQL database, I am also open to SQL commands that produce this result. I'd also be grateful if someone could tell me what this kind of operation is called; I did not find a good way to google for it, which is why I am asking here. Excuse me if this question has already been asked somewhere - I did not find anything with my search terms.
So the question at the end is: Is there a better way to do this, especially a more efficient way to do this?
What you are looking for is the pivot table.
You just need to create a column containing the index of each image within its capture_id, which you then use as the columns of the pivot table.
For example, this could be:
images['column_pivot'] = [x for x in range(1,10)]*int(images.shape[0]/9)
In your case 'column_pivot' would be [1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, ..., 7, 8, 9] (i.e. cycling from 1 to 9).
Then you pivot:
pd.pivot_table(images, columns='column_pivot', index='capture_id', values='image', aggfunc='first')
This will give the expected result. Note that aggfunc='first' is needed because the image paths are strings, so the default mean aggregation would not work on them.
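And if you want those nine columns back on the captures dataframe, one possible follow-up (assuming captures has an id column that matches capture_id, as in your merge) is a single join on the pivoted result:

# Pivot as above, then give the columns clearer names.
wide = pd.pivot_table(images, columns='column_pivot', index='capture_id',
                      values='image', aggfunc='first')
wide.columns = [f"image_{c}" for c in wide.columns]

# One merge attaches all nine image columns to captures.
captures = captures.merge(wide, left_on="id", right_index=True, how="left")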
I'm a Pandas newbie, so please bear with me.
Overview: I started with a free-form text file created by a data harvesting script that remotely accessed dozens of different kinds of devices, and multiple instances of each. I used OpenRefine (a truly wonderful tool) to munge that into a CSV that was then input to dataframe df using Pandas in a JupyterLab notebook.
My first inspection of the data showed the 'Timestamp' column was not monotonic. I accessed individual data sources as follows, in this case for the 'T-meter' data source. (The technique was taken from a search result - I don't really understand it, but it worked.)
cond = df['Source']=='T-meter'
rows = df.loc[cond, :]
df_tmeter = pd.DataFrame(columns=df.columns)
df_tmeter = df_tmeter.append(rows, ignore_index=True)
then checked each as follows:
df_tmeter['Timestamp'].is_monotonic
Fortunately, the problem was easy to identify and fix: Some sensors were resetting, then sending bad (but still monotonic) timestamps until their clocks were updated. I wrote the function healing() to cleanly patch such errors, and it worked a treat:
df_tmeter['healed'] = df_tmeter['Timestamp'].apply(healing)
Now for my questions:
How do I get the 'healed' values back into the original df['Timestamp'] column for only the 'T-meter' items in df['Source']?
Given the function healing(), is there a clean way to do this directly on df?
Thanks!
Edit: I first thought I should be using 'views' into df, but other operations on the data would either generate errors, or silently turn the views into copies.
I wrote a wrapper function heal_row() for healing():
def heal_row(row):
    if row['Source'] == 'T-meter':  # Redundant check, but safe!
        row['Timestamp'] = healing(row['Timestamp'])
    return row
then did the following:
df = df.apply(lambda row: row if row['Source'] != 'T-meter' else heal_row(row), axis=1)
This ordering is important, since healing() is stateful based on the prior row(s), and thus can't be the default operation.
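A shorter variant that should also work (a sketch under the same assumption, i.e. that healing keeps its own state and the rows are visited in order) is to heal only the masked slice and assign it back in place:

# Heal only the T-meter rows and write the result back into the original column.
mask = df['Source'] == 'T-meter'
df.loc[mask, 'Timestamp'] = df.loc[mask, 'Timestamp'].apply(healing)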
A continuation of a previous post. Previously, I had help creating a new column in a dataframe using pandas, where each value represents a factorized or unique value based on another column's value. I used this on a test case and it works, but I am having trouble doing the same for a much larger log and htm file. I have 12 log files (one for each month) and after combining them I get a 17 GB file to work with. I want to factorize every username in it. I have been looking into using Dask; however, I can't replicate the sort and factorize functionality I need on a Dask dataframe. Would it be better to use Dask, continue with pandas, or try a MySQL database to manipulate a 17 GB file?
import pandas as pd
import numpy as np
#import dask.dataframe as pf
df = pd.read_csv('example2.csv', header=0, dtype='unicode')
df_count = df['fruit'].value_counts()
df.sort_values(['fruit'], ascending=True, inplace=True)  # sort by the 'fruit' column
df.reset_index(drop=True, inplace=True)
f, u = pd.factorize(df.fruit.values)
n = np.core.defchararray.add('Fruit', f.astype(str))
df = df.assign(NewCol=n)
#print(df)
df.to_csv('output.csv')
Would it be better to try to use Dask, continue with Pandas or try with a MySQL database to manipulate a 17GB file?
The answer to this question depends on a great many things and is probably too general to get a good answer on Stack Overflow.
However, there are a few particular questions you bring up that are easier to answer:
How do I factorize a column?
The easy way here is to categorize a column:
df = df.categorize(columns=['fruit'])
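Once the column is categorical, the factorized integer codes are available lazily via .cat.codes, so something along these lines (the 'fruit_code' and 'NewCol' names are just illustrative, mirroring your pandas version) should give you the same labels without loading everything into memory:

# Integer code per row, computed lazily per partition with globally
# consistent categories.
df['fruit_code'] = df['fruit'].cat.codes
df['NewCol'] = 'Fruit' + df['fruit_code'].astype(str)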
How do I sort unique values within a column?
You can always set the column as the index, which will cause a sort. However, beware that sorting in a distributed setting can be quite expensive.
However if you want to sort a column with a small number of options then you might find the unique values, sort those in-memory, and then join those back onto the dataframe. Something like the following might work:
unique_fruit = df.fruit.drop_duplicates().compute()  # this is now a pandas series
unique_fruit = unique_fruit.sort_values()
# Name the mapping something other than 'fruit' so the merge below
# doesn't produce fruit_x / fruit_y suffix columns.
numbers = pd.Series(unique_fruit.index, index=unique_fruit.values, name='fruit_id')
df = df.merge(numbers.to_frame(), left_on='fruit', right_index=True)
I have a DataFrame ('main') that has about 300 columns. I created a smaller DataFrame ('public') and have been working on this.
I now want to delete the columns contained within 'public' from the larger DataFrame ('main').
I've tried the following instructions:
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.drop.html
Python Pandas - Deleting multiple series from a data frame in one command
without any success, along with various other statements that have been unsuccessful.
The columns that make up 'public' are not consecutive - i.e. they are taken from various points in the larger DataFrame 'main'. All of the columns have the same Index. [Not sure if this is important, but 'public' was created using the 'join' function].
Yes, I'm being lazy - I don't want to have to type out the names of every column! I'm hoping there's a way to use the DataFrame 'public' in a statement that will allow deletion of these columns en masse. If anyone has any suggestions and/or guidance I'd be most grateful.
(Have Python 2.7 and am using Pandas, numpy, math, pylab etc.)
Thanks in advance.
Ignore my question - Murphy's Law prevails and I've just solved it.
I was using the statement from the stackoverflow question mentioned above:
df.drop(df.columns[1:], axis=1)
and this was not working. I have instead used
df = df.drop(df2, axis=1)
and this worked (df = main, df2 = public). Simple really once you don't overthink it.
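Equivalently (and maybe a bit more explicit about the intent), dropping by the column labels of 'public' should give the same result:

# Drop every column of 'public' from 'main' by label.
main = main.drop(public.columns, axis=1)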