I have a dataframe like this:
df = pd.DataFrame({'market': ['ES', 'UK', 'DE'],
                   'provider': ['A', 'B', 'C'],
                   'item': ['X', 'Y', 'Z']})
Then I have a list with the providers and the following loop:
providers_list = ['A','B','C']
for provider in providers_list:
    a = df.loc[df['provider'] == provider]
That loop creates a dataframe for each provider, which I later write out to an Excel file. I would like to use apply instead, for speed. I have transformed the code like this:
providers_list = pd.DataFrame({'provider': ['A', 'B', 'C']})

def report(provider):
    a = df.loc[df['provider'] == provider]

providers_list.apply(report)

This raises the following error:

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\ops.py", line 1190, in wrapper
    raise ValueError("Can only compare identically-labeled "
ValueError: ('Can only compare identically-labeled Series objects', 'occurred at index provider')
Thanks
The apply method is generally inefficient. It's nothing more than a glorified loop with some extra functionality. Instead, you can use GroupBy to cycle through each provider:
for prov, df_prov in df.groupby('provider'):
    df_prov.to_excel(f'{prov}.xlsx', index=False)
If you only want to include a pre-defined list of providers in your output, you can define a GroupBy object and iterate your list instead:
providers_list = ['A', 'B', 'C']
grouper = df.groupby('provider')
for prov in providers_list:
    grouper.get_group(prov).to_excel(f'{prov}.xlsx', index=False)
If you're interested in speed for your process as a whole, I strongly advise you avoid Excel: exporting to csv, csv.gz or pkl will all be much more efficient. For large datasets, the Excel export, rather than the dataframe filtering, is likely to be your bottleneck.
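For example, a minimal sketch of the same per-provider export written to compressed csv instead of Excel (the file names are just illustrative):

for prov, df_prov in df.groupby('provider'):
    # one gzipped csv per provider; much faster to write than .xlsx
    df_prov.to_csv(f'{prov}.csv.gz', index=False, compression='gzip')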
This worked for me with a million entries for each provider in under a second:
import pandas as pd
from tqdm import tqdm
tqdm.pandas(desc="Progress:")
df = pd.DataFrame({'market': ['ES', 'UK', 'DE'] * 1000000,
                   'provider': ['A', 'B', 'C'] * 1000000,
                   'item': ['X', 'Y', 'Z'] * 1000000})
grouped = df.groupby("provider")
providers_list = ['A','B','C']
for prov in tqdm(providers_list):
    frame_name = prov
    globals()[frame_name] = pd.DataFrame(grouped.get_group(prov))
print(A)
print(B)
print(C)
100%|██████████| 3/3 [00:00<00:00, 9.59it/s]
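If you'd rather not write into globals(), a dict keyed by provider gives the same per-provider frames; a minimal sketch under the same setup (the frames name is just illustrative):

# one DataFrame per provider, accessible as frames['A'], frames['B'], ...
frames = {prov: grouped.get_group(prov) for prov in providers_list}
print(frames['A'].head())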
I am new to python/pandas/numpy and I need to create the following Dataframe:
DF = pd.concat([pd.Series(x[2]).apply(lambda r: pd.Series(re.split('\#|/',r))).assign(id=x[0]) for x in hDF])
where hDF is a dataframe that has been created by:
hDF=pd.DataFrame(h.DF)
and h.DF is a list whose elements looks like this:
['5203906',
 ['highway=primary',
  'maxspeed=30',
  'oneway=yes',
  'ref=N 22',
  'surface=asphalt'],
 ['3655224911#1.735928/42.543651',
  '3655224917#1.735766/42.543561',
  '3655224916#1.735694/42.543523',
  '3655224915#1.735597/42.543474',
  '4817024439#1.735581/42.543469']]
However, in some cases the list is very long (O(10^7) elements) and the inner list in h.DF[*][2] is also very long, so I run out of memory.
I can obtain the same result, avoiding the use of the lambda function, like so:
DF = pd.concat([pd.Series(x[2]).str.split('\#|/', expand=True).assign(id=x[0]) for x in hDF])
But I am still running out of memory in the cases where the lists are very long.
Can you think of a possible solution that obtains the same result without exhausting memory?
I managed to make it work using the following code:
import itertools
import numpy as np
import pandas as pd

bl = []
for x in h.DF:
    # split each 'id#lon/lat' string: first on '#' (keep the coordinate part), then on '/'
    data = np.loadtxt(
        np.loadtxt(x[2], dtype=str, delimiter="#")[:, 1], dtype=float, delimiter="/"
    ).tolist()
    # tag every [lon, lat] pair with the way id
    [i.append(x[0]) for i in data]
    bl.append(data)

# flatten the list of lists into one list of rows and build the DataFrame
bbl = list(itertools.chain.from_iterable(bl))
DF = pd.DataFrame(bbl).rename(columns={0: "lon", 1: "lat", 2: "wayid"})
Now it's super fast :)
I am using investpy to get historical stock data for 2 stocks (TRP_pb, TRP_pc):
import investpy
import pandas as pd
import numpy as np
TRP_pb = investpy.get_stock_historical_data(stock='TRP_pb',
                                            country='canada',
                                            from_date='01/01/2022',
                                            to_date='01/04/2022')
print(TRP_pb.head())
TRP_pc = investpy.get_stock_historical_data(stock='TRP_pc',
                                            country='canada',
                                            from_date='01/01/2022',
                                            to_date='01/04/2022')
print(TRP_pc.head())
I can append the two tables by using the append method
appendedtable = TRP_pb.append(TRP_pc, ignore_index=False)
What I am trying to do is use a loop to combine these two tables.
Here is what I have tried so far:
preferredlist = ['TRP_pb','TRP_pc']
for i in preferredlist:
    new = investpy.get_stock_historical_data(stock=i,
                                             country='canada',
                                             from_date='01/01/2022',
                                             to_date='01/04/2022')
    new.append(new, ignore_index=True)
However, this doesn't work.
I would appreciate any help.
Since get_stock_historical_data returns a DataFrame, you can create an empty DataFrame before the for loop and concatenate to it inside the loop:
preferredlist = ['TRP_pb','TRP_pc']
final_list = pd.DataFrame()
for i in preferredlist:
    new = investpy.get_stock_historical_data(stock=i,
                                             country='canada',
                                             from_date='01/01/2022',
                                             to_date='01/04/2022')
    final_list = pd.concat([final_list, new])
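If the ticker list grows, a slightly cheaper variant of the same idea (same investpy call, just a different accumulation pattern) is to collect the frames in a list and concatenate once at the end:

frames = []
for i in preferredlist:
    frames.append(investpy.get_stock_historical_data(stock=i,
                                                     country='canada',
                                                     from_date='01/01/2022',
                                                     to_date='01/04/2022'))
# a single concat avoids rebuilding the combined frame on every iteration
appendedtable = pd.concat(frames)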
I have a function which does some operations on each DataFrame column and extracts a shorter series from it (in the original code there are some time-consuming calculations going on).
Then it adds the result to a dictionary before it goes on with the next column.
In the end it creates a dataframe from the dictionary and manipulates its index.
How can I parallelize the loop in which each column is manipulated?
This is a less complicated, reproducible sample of the code:
import pandas as pd
raw_df = pd.DataFrame({"A": [1.1] * 100000,
                       "B": [2.2] * 100000,
                       "C": [3.3] * 100000})
def preprocess_columns(raw_df):
    df = {}
    width = 137
    for name in raw_df.columns:
        # Note: the operations in this loop do not have a deep sense and are just for
        # illustration of the function preprocess_columns. In the original code there
        # are ~50 lines of list comprehensions etc.

        # do some column operations (actually there's more than just this operation)
        seriesF = raw_df[[name]].dropna()
        afterDropping_indices = seriesF.index.copy(deep=True)
        list_ = list(raw_df[name])[width:]
        df[name] = pd.Series(list_.copy(), index=afterDropping_indices[width:])

    # create df from dict and reindex
    df = pd.concat(df, axis=1)
    df = df.reindex(df.index[::-1])
    return df

raw_df = preprocess_columns(raw_df)
Maybe you can use this:
https://github.com/xieqihui/pandas-multiprocess

pip install pandas-multiprocess

from pandas_multiprocess import multi_process

args = {'width': 137}
# func below is the processing function you want applied to the data; see the
# package's README for the signature it expects
result = multi_process(func=func, data=df, num_process=8, **args)
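If you'd rather stay in the standard library, here is a hedged sketch of the same per-column loop spread across processes with concurrent.futures; process_column, preprocess_columns_parallel and the worker count are illustrative names and choices, not part of the original code:

import pandas as pd
from concurrent.futures import ProcessPoolExecutor

def process_column(args):
    # mirrors the body of the loop in preprocess_columns above
    name, col = args
    width = 137
    after_dropping = col.dropna().index
    return name, pd.Series(list(col)[width:], index=after_dropping[width:])

def preprocess_columns_parallel(raw_df, workers=4):
    # hand each column to a worker process, then rebuild the dict of results
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = dict(pool.map(process_column, [(n, raw_df[n]) for n in raw_df.columns]))
    out = pd.concat(results, axis=1)
    return out.reindex(out.index[::-1])

if __name__ == '__main__':
    raw_df = pd.DataFrame({"A": [1.1] * 100000,
                           "B": [2.2] * 100000,
                           "C": [3.3] * 100000})
    print(preprocess_columns_parallel(raw_df).head())

Whether this actually helps depends on how heavy the per-column work is: each column has to be pickled to a worker process, so for cheap operations the overhead can outweigh the gain.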
I'm trying to calculate tracking error for a number of different benchmarks versus a fund that I'm looking at (tracking error is defined as the standard deviation of the percent difference between the fund and the benchmark). The time series for the fund and all the benchmarks are in a data frame that I'm reading from an Excel file on disk. What I have so far is below (the idea being that arg1 represents each benchmark, applied with applymap), but it's returning a KeyError. Any suggestions?
import pandas as pd
import numpy as np
data = pd.read_excel('File_Path.xlsx')
def index_analytics(arg1):
    tracking_err = np.std((data['Fund'] - data[arg1]) / data[arg1])
    return tracking_err
data.applymap(index_analytics)
There are a few things that need to be fixed. First, applymap passes each individual value from every column to your function (index_analytics), so arg1 is a scalar value from your dataframe, and data[arg1] will always raise a KeyError unless all of your values also happen to be column names.
You also shouldn't need to use apply to do this. Assuming your benchmarks are in the same dataframe, you should be able to do something like this for each benchmark (next time, include a sample of your dataframe):

data['Benchmark1_result'] = (data['Fund'] - data['Benchmark1']) / data['Benchmark1']
And if you want to calculate the standard deviations for all the benchmarks at once, you can do this:

# assume you have a list of all the benchmark column names
benchmark_columns = [list, of, benchmark, columns]
np.std((data['Fund'].values[:, None] - data[benchmark_columns].values) / data[benchmark_columns].values, axis=0)
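If you want the result labelled by benchmark, a small follow-up sketch wraps the same calculation in a Series (the benchmark column names here are assumptions, not from your data):

benchmark_columns = ['Benchmark1', 'Benchmark2']  # assumed column names
tracking_errors = pd.Series(
    np.std((data['Fund'].values[:, None] - data[benchmark_columns].values)
           / data[benchmark_columns].values, axis=0),
    index=benchmark_columns)
print(tracking_errors)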
Assuming you're following the definition of tracking error as the square root of the sum of squared active returns (fund return minus benchmark return):
import pandas as pd
import numpy as np
# Example DataFrame
df = pd.DataFrame({'Portfolio_Returns': [5.00, 1.67], 'Bench_Returns': [2.89, .759]})
df['Active_Return'] = df['Portfolio_Returns'] - df['Bench_Returns']
print(df.head())
list_ = df['Active_Return']
temp_ = []
for val in list_:
    x = val**2
    temp_.append(x)
tracking_error = np.sqrt(sum(temp_))
print(f"Tracking Error is: {tracking_error}")
Or if you want it more compact (because apparently the cool kids do it):
df = pd.DataFrame({'Portfolio_Returns': [5.00, 1.67], 'Bench_Returns': [2.89, .759]})
tracking_error = np.sqrt(sum([val**2 for val in df['Portfolio_Returns'] - df['Bench_Returns']]))
print(f"Tracking Error is: {tracking_error}")
I'm doing a lot of cleaning, annotating and simple transformations on very large twitter datasets (~50M messages). I'm looking for some kind of data structure that contains column info the way pandas does, but works with iterators rather than reading the whole dataset into memory at once. I'm considering writing my own, but I wondered if there was something with similar functionality out there. I know I'm not the only one doing things like this!
Desired functionality:
>>> ds = DataStream.read_sql("SELECT id, message from dataTable WHERE epoch < 129845")
>>> ds.columns
['id', 'message']
>>> ds.iterator.next()
[2385, "Hi it's me, Sally!"]
>>> ds = datastream.read_sql("SELECT id, message from dataTable WHERE epoch < 129845")
>>> ds_tok = get_tokens(ds)
>>> ds_tok.columns
['message_id', 'token', 'n']
>>> ds_tok.iterator.next()
[2385, "Hi", 0]
>>> ds_tok.iterator.next()
[2385, "it's", 1]
>>> ds_tok.iterator.next()
[2385, "me", 2]
>>> ds_tok.to_sql(db_info)
UPDATE: I've settled on a combination of dict iterators and pandas dataframes to satisfy these needs.
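Purely as an illustration of what such a combination might look like (the generator, helper names and chunk size below are assumptions, not the actual solution):

import itertools
import pandas as pd

def dict_chunks(rows, chunksize=50000):
    # rows: any iterator of dicts (e.g. a DB cursor wrapped to yield dicts);
    # materialise them into small DataFrames one chunk at a time
    while True:
        chunk = list(itertools.islice(rows, chunksize))
        if not chunk:
            break
        yield pd.DataFrame(chunk)

# usage (hypothetical source and processing):
# for df in dict_chunks(iter_rows_as_dicts()):
#     df = tokenize(df)  # whatever per-chunk processing you need
#     df.to_sql('tokens', con, if_exists='append')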
As commented, there is a chunksize argument for read_sql, which means you can work on the sql results piecemeal. I would probably use an HDFStore to save the intermediary results... or you could just append it back to another sql table.
dfs = pd.read_sql(..., chunksize=100000)
store = pd.HDFStore("store.h5")
for df in dfs:
    clean_df = ...  # whatever munging you have to do
    store.append("df", clean_df)
(see hdf5 section of the docs), or
dfs = pd.read_sql(..., chunksize=100000)
for df in dfs:
    clean_df = ...
    clean_df.to_sql(..., if_exists='append')
see the sql section of the docs.
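Either way, the cleaned data can be read back later; for the HDF route, the key 'df' matches the append calls above:

clean = pd.read_hdf("store.h5", "df")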