Running update of Pandas dataframe - python

I'd like to use a DataFrame to manage data from many trials of an experiment I'm controlling with Python code. Ideally I will have one master dataframe with a row for each trial that lives in the main function namespace, and then a separate dict (or dataframe) returned from the function that I call to execute the important bits of code for each trial.
What is the best way to do a running update of the master dataframe with this returned set of data? So far I've come up with:
df = df.append(df_trial, ignore_index=True)
or
df = pd.concat([df, df_trial])
But neither seems ideal (and both take a relatively long time according to %timeit). Is there a more Pandonic way?

You should build a list of the pieces and concatenate them all in one shot at the end.
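For example, a minimal sketch of that pattern (hedged; run_trial and the fields it returns are illustrative stand-ins for the real per-trial function):
import pandas as pd

def run_trial(trial_id):
    # stand-in for the real experiment code; returns one trial's results as a dict
    return {"trial": trial_id, "score": trial_id * 0.1}

# accumulate the per-trial pieces in a plain Python list...
pieces = [pd.DataFrame([run_trial(i)]) for i in range(10)]

# ...and concatenate them in one shot at the end
df = pd.concat(pieces, ignore_index=True)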

Related

What is the most Pythonic way to relate 2 pandas DataFrames based on a key value?

I work at a place where I use a LOT of Python (pandas), and the data keeps getting bigger and bigger: last month I was working with a few hundred thousand rows, a few weeks later with a few million, and now I am working with 42 million rows. Most of my work is to take a dataframe and, for each row, look up its "equivalent" in another dataframe and process the data; sometimes it is just a merge, but more often I need to run a function on the equivalent data. Back when I had a few hundred thousand rows it was fine to just use apply and a simple filter, but now it is EXTREMELY SLOW. Recently I switched to vaex, which is way faster than pandas in every aspect except apply, and after some searching I found that apply is a last resort that should only be used when there is no other option. So, is there another option? I really don't know.
Some code to explain how I was doing this entire time:
def get_secondary(row: pd.Series):
    cnae = row["cnae_fiscal"]
    cnpj = row["cnpj"]
    # cnaes is another dataframe
    secondary = cnaes[cnaes.cnpj == cnpj]
    return [cnae] + list(secondary["cnae"].values)

empresas["cnae_secundarios"] = empresas.apply(get_secondary, axis=1)
This isn't the only use case, as I said.
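One hedged sketch of an alternative for the snippet above (assuming the goal is only to collect the secondary cnae codes per cnpj): build the lookup once with groupby instead of filtering cnaes inside apply for every row, then attach it with a merge.
import pandas as pd

# group the lookup table once, instead of filtering it once per row
secondary = (
    cnaes.groupby("cnpj")["cnae"]
    .apply(list)
    .rename("cnae_secundarios")
    .reset_index()
)

# attach the precomputed lists by key
empresas = empresas.merge(secondary, on="cnpj", how="left")

# prepend the primary cnae, mirroring the original function's return value
empresas["cnae_secundarios"] = [
    [cnae] + (rest if isinstance(rest, list) else [])
    for cnae, rest in zip(empresas["cnae_fiscal"], empresas["cnae_secundarios"])
]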

How to run the same function multiple times simultaneously?

I have a function that takes a dataframe as input and returns a dataframe, like:
def process(df):
    # <all the code for processing>
    return df

# input df has 250K rows and 30 columns
# saving it in a variable
result = process(df)
# processing transforms the input df into 10,000K rows and over 50 columns
It does a lot of processing and thus takes a long time to return the output. I am using a Jupyter notebook.
I have come up with a new function that filters the original dataframe into 5 chunks, not of equal size but between 30K and 100K rows, based on a category filter on a column of the original df, and passes them separately as process(df1), process(df2), etc., saving the outputs as result1, result2, etc., and then merging the results together into one single final dataframe.
But I want them to run simultaneously and have the results combined automatically: code that runs the 5 process calls together and, once all are completed, joins them into the same "result" as before, but with a lot of run time saved.
Even better would be to split the original dataframe into equal parts and run each part through process(df) simultaneously, i.e. split those 250K rows randomly into 5 dataframes of 50K each, pass them to process(df) five times, run them in parallel, and get the same final output I would be getting right now without any of this optimization.
I was reading a lot about multi-threading and found some useful answers on Stack Overflow, but I wasn't able to really get it to work. I am very new to this concept of multi-threading.
You can use the multiprocessing library for this, which allows you to run a function on different cores of the CPU.
The following is an example
from multiprocessing import Pool

def f(df):
    # Process dataframe
    return df

if __name__ == '__main__':
    dataframes = [df1, df2, df3]
    with Pool(len(dataframes)) as p:
        processed_dfs = p.map(f, dataframes)
    print(processed_dfs)
    # You would join them here
You should check dask (https://dask.org/) since it seems like you have mostly operations on dataframes. A big advantage is that you won't have to worry about all the details of manually splitting your dataframe and all of that.
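A minimal dask sketch of that idea (hedged; it assumes process only uses operations dask.dataframe supports, and the partition count is illustrative):
import dask.dataframe as dd

# split the pandas dataframe into partitions that dask can process in parallel
ddf = dd.from_pandas(df, npartitions=5)

# run the existing per-chunk function on each partition and
# gather the pieces back into a single pandas dataframe
result = ddf.map_partitions(process).compute()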

Applying corrections to a subsampled copy of a dataframe back to the original dataframe?

I'm a Pandas newbie, so please bear with me.
Overview: I started with a free-form text file created by a data harvesting script that remotely accessed dozens of different kinds of devices, and multiple instances of each. I used OpenRefine (a truly wonderful tool) to munge that into a CSV that was then input to dataframe df using Pandas in a JupyterLab notebook.
My first inspection of the data showed the 'Timestamp' column was not monotonic. I accessed individual data sources as follows, in this case for the 'T-meter' data source. (The technique was taken from a search result - I don't really understand it, but it worked.)
cond = df['Source']=='T-meter'
rows = df.loc[cond, :]
df_tmeter = pd.DataFrame(columns=df.columns)
df_tmeter = df_tmeter.append(rows, ignore_index=True)
then checked each as follows:
df_tmeter['Timestamp'].is_monotonic
Fortunately, the problem was easy to identify and fix: Some sensors were resetting, then sending bad (but still monotonic) timestamps until their clocks were updated. I wrote the function healing() to cleanly patch such errors, and it worked a treat:
df_tmeter['healed'] = df_tmeter['Timestamp'].apply(healing)
Now for my questions:
How do I get the 'healed' values back into the original df['Timestamp'] column for only the 'T-meter' items in df['Source']?
Given the function healing(), is there a clean way to do this directly on df?
Thanks!
Edit: I first thought I should be using 'views' into df, but other operations on the data would either generate errors, or silently turn the views into copies.
I wrote a wrapper function heal_row() for healing():
def heal_row(row):
    if row['Source'] == 'T-meter':  # Redundant check, but safe!
        row['Timestamp'] = healing(row['Timestamp'])
    return row
then did the following:
df = df.apply(lambda row: row if row['Source'] != 'T-meter' else heal_row(row), axis=1)
This ordering is important, since healing() is stateful based on the prior row(s), and thus can't be the default operation.
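For the first question, a minimal sketch of writing the healed values straight back into df with a boolean mask (hedged: it assumes df_tmeter still holds the 'T-meter' rows in the same order they appear in df, as in the snippet above):
# select the same 'T-meter' rows in the original dataframe...
mask = df['Source'] == 'T-meter'

# ...and overwrite their timestamps with the healed values
# (.values drops df_tmeter's index so the assignment lines up positionally)
df.loc[mask, 'Timestamp'] = df_tmeter['healed'].values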

How to run an ADFuller test on timeseries data using statsmodels library?

I am completely new to programming languages and I have picked up Python for backtesting a trading strategy (because I heard it is relatively easy). I have made some progress in learning the basics, but I am currently stuck on performing an ADFuller test on a timeseries dataframe.
This is how my Dataframe looks
Now I need to run the ADF test on the columns - "A-Btd", "A-Ctd" and so on (I have 66 columns like these). I would like to get the test statistic/output for each of them.
I tried using lines such as cadfs = [ts.adfuller(df1)]. Since I lack the expertise, I am not able to adjust the code for my dataframe.
I apologize in advance if I have missed out some important information I have to give out. Please leave a comment and I will provide it asap.
Thanks a lot in advance!
If you have to do it for so many, I would try putting the results in a dict, like this:
import statsmodels.tsa.stattools as tsa

df = ...  # load your dataframe
adf_results = {}
for col in df.columns.values:  # or edit this for a subset of columns first
    adf_results[col] = tsa.adfuller(df[col])
Obviously specifying other settings as desired, e.g. tsa.adfuller(df[col], autolag='BIC'). Or if you don't want all the output and would rather just parse each column to find out if it's stationary or not, the test statistic is the first entry in the tuple returned by adfuller(), so you could just use tsa.adfuller(df[col])[0] and test it against your threshold to get a boolean result, then make that the value in your dict.
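A hedged sketch of that last idea, comparing the test statistic against the 5% critical value that adfuller() also returns (the 5% level is just an illustrative choice):
import statsmodels.tsa.stattools as tsa

stationary = {}
for col in df.columns.values:
    result = tsa.adfuller(df[col])
    # result[0] is the test statistic, result[4] is the dict of critical values
    stationary[col] = result[0] < result[4]['5%']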

Python Pandas - Main DataFrame, want to drop all columns in smaller DataFrame

I have a DataFrame ('main') that has about 300 columns. I created a smaller DataFrame ('public') and have been working on this.
I now want to delete the columns contained within 'public' from the larger DataFrame ('main').
I've tried the following instructions:
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.drop.html
Python Pandas - Deleting multiple series from a data frame in one command
without any success, along with various other statements that have been unsuccessful.
The columns that make up 'public' are not consecutive - i.e. they are taken from various points in the larger DataFrame 'main'. All of the columns have the same Index. [Not sure if this is important, but 'public' was created using the 'join' function].
Yes, I'm being lazy - I don't want to have to type out the names of every column! I'm hoping there's a way to use the DataFrame 'public' in a statement that will allow deletion of these columns en masse. If anyone has any suggestions and/or guidance I'd be most grateful.
(Have Python 2.7 and am using Pandas, numpy, math, pylab etc.)
Thanks in advance.
Ignore my question - Murphy's Law prevails and I've just solved it.
I was using the statement from the stackoverflow question mentioned above:
df.drop(df.columns[1:], axis=1)
and this was not working. I have instead used
df = df.drop(df2, axis=1)
and this worked (df = main, df2 = public). Simple really once you don't overthink it.
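On a newer pandas, a hedged equivalent that makes the intent explicit by dropping public's column labels rather than passing the DataFrame itself:
# drop every column of 'main' that also appears in the smaller 'public' dataframe
main = main.drop(columns=public.columns)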
