Pandas efficient MultiIndex getting and setting - python

Below is a snapshot of my data structure in pandas. I build this structure in a for loop and use sortlevel to lexsort the DataFrame:
df.sortlevel(inplace=True)
1) I need an efficient way to get and set specific rows, as shown below. This is the expression I am currently using, and it is not efficient.
a) Will I be able to set the values of the rows using assignment?
df.loc[idx['AAA', slice(None), 'LLL']].iloc[:, 0:n]
df.loc[idx['AAA', slice(None), 'LLL']].iloc[:, 0:n] = another_df
2) How do I efficiently sum the columns to get a result like the one below?
df.loc[idx['AAA', slice(None), 'LLL']].iloc[:, 0:n].sum(axis=1)
I am looking for an efficient way to slice the DataFrame.
Thanks

Thanks for letting me know the right way to post Pandas questions. In any case, below are my findings regarding this problem. A MultiIndex is certainly powerful for organizing data and exporting to CSV or Excel; however, accessing and selecting data has been challenging.
Best practices for initializing a MultiIndex
I found it easier to pre-allocate the index rather than create it on the fly. Creating the index on the fly is not efficient, and you will run into the lexsort warning.
Sort the DataFrame index once the DataFrame is initialized.
When accessing, do not leave the row or column identifier empty; use : instead.
counter = 0
for site_name in site_s:
    no_of_progs = len(site_s[site_name])
    prog_name_in_sites = site_s[site_name].keys()
    prog_level_cols = ['A', 'B', 'C']
    prog_level_cols = ['A', 'C']
    site_level_cols = ['A PLAN', 'A TOTAL', 'A UP', 'A DOWN', 'A AVAILABLE']
    if counter == 0:
        pd_index_col = pd.MultiIndex.from_product([[site_name], prog_name_in_sites, prog_level_cols],
                                                  names=['SITE', 'PROGRAM', 'TYPE'])
    else:
        pd_index_col = pd_index_col.append(pd.MultiIndex.from_product([[site_name], prog_name_in_sites, prog_level_cols],
                                                                      names=['SITE', 'PROGRAM', 'TYPE']))
    if no_of_progs > 1:
        pd_index_col = pd_index_col.append(pd.MultiIndex.from_product([[site_name], ['LINES'], site_level_cols],
                                                                      names=['SITE', 'PROGRAM', 'TYPE']))
    counter = counter + 1

df_A_site_level = pd.DataFrame(0, columns=arr_wk_num_wkly, index=pd_index_col, dtype=np.float64)
df_A_site_level.sort_index(inplace=True)
For setting and getting, below are the two methods I recommend:
df.iloc - if you know the positional index of rows and/or columns
df.loc - if you want to access data based on labels
Accessing using loc - use the following to get or set cell/row values:
idx = pd.IndexSlice
df_A_site_level.loc[idx[site_name, :, 'C'], df_A_site_level.columns[0:no]]
Accessing using iloc - use the following to get or set cell/row values:
df_A_site_level.iloc[no_1:no_2, no_3:no_4]
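To tie this back to the original question, here is a minimal, hedged sketch of getting, setting, and summing a slice in a single .loc call, using made-up labels ('AAA', 'LLL') and a made-up column count n. Selecting rows and columns together avoids the chained .loc[...].iloc[...] pattern, which returns a copy and makes assignment unreliable:

import numpy as np
import pandas as pd

# Hypothetical three-level index and a few weekly columns, for illustration only.
index = pd.MultiIndex.from_product(
    [['AAA', 'BBB'], ['P1', 'P2'], ['KKK', 'LLL']],
    names=['SITE', 'PROGRAM', 'TYPE'])
df = pd.DataFrame(0.0, index=index, columns=[1, 2, 3, 4, 5])
df.sort_index(inplace=True)            # lexsort once, up front

idx = pd.IndexSlice
n = 3

# Get: SITE 'AAA', any PROGRAM, TYPE 'LLL', first n columns.
block = df.loc[idx['AAA', :, 'LLL'], df.columns[:n]]

# Set: the same selection can be assigned to directly in one .loc call.
df.loc[idx['AAA', :, 'LLL'], df.columns[:n]] = 7.0

# Sum the selected columns row-wise.
row_sums = df.loc[idx['AAA', :, 'LLL'], df.columns[:n]].sum(axis=1)
print(block, row_sums, sep='\n')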

Related

Error when using variables to filter dataframe with multiple conditions (Pandas)

I'm trying to make a temporary dataframe that is created by filtering an existing dataframe (stock_data) based on two criteria:
The stock_data ticker column matches the tick_id variable
The stock_data date column is within a range from start to end (the variables are created using pd.to_datetime)
I've attempted this using two different solutions.
First:
temp = stock_data[(stock_data.ticker == tick_id) & (stock_data["date"].isin(pd.date_range(start, end)))]
Second:
mask = (stock_data.ticker == tick_id) & ((stock_data.date > start) & (stock_data.date <= end))
temp = stock_data.loc[mask]
Both solutions result in the same error:
ValueError: Can only compare identically-labeled Series objects
The error is telling you that your tick_id Series has different labels from stock_data['ticker']. I'm guessing one is a ticker name like 'AAPL' and the other is a numerical ticker id? (Or, even worse, just auto-indexed 0, 1, 2, ... from the output of some previous operation.)
Solution: make ticker or tick_id the index of both series/dataframes. Do this everywhere, if you can. No numerical indices. Then joins, aggregations, etc. become trivially easy.
(PS: this is better for most use-cases, and also makes exporting to CSV or pickle more intuitive.)
Anyway, this looks like a roundabout attempt at a join operation.
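For illustration, a minimal sketch under the assumption that tick_id accidentally became a Series (e.g. the result of an earlier selection) rather than a plain string; the column names come from the question, the data is made up:

import pandas as pd

# Made-up data with the columns named in the question.
stock_data = pd.DataFrame({
    'ticker': ['AAPL', 'AAPL', 'MSFT'],
    'date': pd.to_datetime(['2020-01-02', '2020-01-03', '2020-01-02']),
    'close': [75.0, 74.4, 160.6],
})

# If tick_id is itself a Series, `stock_data.ticker == tick_id` tries to align
# the two indexes label-by-label and raises the ValueError above.
# Reduce it to a plain scalar first:
tick_id = 'AAPL'

start, end = pd.to_datetime('2020-01-01'), pd.to_datetime('2020-01-03')
mask = (stock_data['ticker'] == tick_id) & stock_data['date'].between(start, end)
temp = stock_data.loc[mask]

# Or, as suggested above, make ticker the index and select by label:
temp2 = stock_data.set_index('ticker').loc[tick_id]
print(temp, temp2, sep='\n')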

Fastest way to generate new rows and append them to DataFrame

I want to change the target value in the data set within a certain interval. Doing it with 500 rows takes about 1.5 seconds, but I have around 100,000 rows. Most of the execution time is spent in this process, and I want to speed it up.
What is the fastest and most efficient way to append rows to a DataFrame?
I tried the solution in this link and tried to create a dictionary, but I couldn't get it to work.
Here is the code, which takes around 1.5 seconds for 500 rows:
def add_new(df, base, interval):
    df_appended = pd.DataFrame()
    np.random.seed(5)
    s = np.random.normal(base, interval / 3, 4)
    s = np.append(s, base)
    for i in range(0, 5):
        df_new = df
        df_new["DeltaG"] = s[i]
        df_appended = df_appended.append(df_new)
    return df_appended
DataFrames in pandas are contiguous pieces of memory, so appending or concatenating DataFrames repeatedly is very inefficient - each of these operations creates a new DataFrame and copies all the data from the old ones.
Basic Python structures such as lists and dicts are not: when you append a new element, Python just stores a pointer to it.
So my advice: do all of your data processing with lists or dicts and convert them to a DataFrame at the end.
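A hedged sketch of that idea applied to the add_new function above, assuming the goal is simply to repeat the input rows five times with a different DeltaG each time (the column name is taken from the question, the sample data is made up):

import numpy as np
import pandas as pd

def add_new_rows(df, base, interval):
    """Collect plain dicts in a list, build the DataFrame once at the end."""
    np.random.seed(5)
    s = np.append(np.random.normal(base, interval / 3, 4), base)

    base_records = df.to_dict('records')      # each original row as a dict
    rows = []                                  # cheap to append to
    for value in s:
        for record in base_records:
            rows.append(dict(record, DeltaG=value))
    return pd.DataFrame(rows)                  # single DataFrame construction

# Hypothetical usage:
df = pd.DataFrame({'x': [1, 2, 3]})
print(add_new_rows(df, base=10.0, interval=3.0))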
Another option is to create a preallocated DataFrame of the final size and just change values in it using .iloc. But that only works if you know the final size of your resulting DataFrame.
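And a small sketch of the preallocation option, assuming you know up front that the result has 5 * len(df) rows (shapes, values, and column names here are made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, 3.0]})        # hypothetical input
s = np.array([9.5, 10.2, 9.8, 10.1, 10.0])        # five DeltaG values

# Preallocate the final frame, then fill blocks in place with .iloc.
out = pd.DataFrame(np.nan, index=range(5 * len(df)), columns=['x', 'DeltaG'])
for i, value in enumerate(s):
    block = slice(i * len(df), (i + 1) * len(df))
    out.iloc[block, 0] = df['x'].to_numpy()       # column 0 = 'x'
    out.iloc[block, 1] = value                    # column 1 = 'DeltaG'
print(out)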
Good examples with code: Add one row to pandas DataFrame
If you need more code examples - let me know.
def add_new(df1, base, interval, has_interval):
    dictionary = {}
    if has_interval == 0:
        for i in range(0, 5):
            dictionary[i] = df1.copy()
    elif has_interval == 1:
        np.random.seed(5)
        s = np.random.normal(base, interval / 3, 4)
        s = np.append(s, base)
        for i in range(0, 5):
            df_new = df1
            df_new[4] = s[i]
            dictionary[i] = df_new.copy()
    return dictionary
It works, and it takes around 10 seconds for the whole data set. Thanks for your answers.

Pandas - KeyError - Dropping rows by index in a nested loop

I have a Pandas dataframe named df. I am attempting to use a nested for loop to iterate through each tuple of the dataframe and, at each iteration, compare the tuple with all other tuples in the frame. During the comparison step, I am using Python's difflib.SequenceMatcher().ratio() and dropping tuples that have a high similarity (ratio > 0.8).
Problem:
Unfortunately, I am getting a KeyError after the first outer-loop iteration.
I suspect that, by dropping the tuples, I am invalidating the outer-loop's indexer. Or, I am invalidating the inner-loop's indexer by attempting to access an element that doesn't exist (dropped).
Here is the code:
import json
import pandas as pd
import pyreadline
import pprint
from difflib import SequenceMatcher

# Note: this file, 'tweetsR.json', was originally csv, but has been translated to json.
with open("twitter data/tweetsR.json", "r") as read_file:
    data = json.load(read_file)  # Load the source data set, esport tweets.

df = pd.DataFrame(data)  # Load data into a pandas (pd) data frame for pandas utilities.
df = df.drop_duplicates(['text'], keep='first')  # Drop tweets with identical text content.
                                                 # Note: these tweets are likely reposts/retweets, etc.
df = df.reset_index(drop=True)  # Adjust the index to reflect the dropping of duplicates.

def duplicates(df):
    for ind in df.index:
        a = df['text'][ind]
        for indd in df.index:
            if indd != 26747:  # Trying to prevent an overstep KeyError here
                b = df['text'][indd+1]
                if similar(a, b) >= 0.80:
                    df.drop((indd+1), inplace=True)
        print(str(ind) + "Completed")  # Debugging statement, tells us which iterations have completed

duplicates(df)
Error Output:
Can anyone help me understand this and/or fix it?
One solution, which was mentioned by #KazuyaHatta, is itertools.combinations(). Although, the way I've used it (there may be another way), it's O(n^2). So, in this case, with 27,000 tuples, it's nearly 357,714,378 combinations to iterate over (too long).
Here is the code:
# Create a set of the dropped tuples and run this code on bizon overnight.
import itertools

def duplicates(df):
    # Find out how to improve the speed of this
    excludes = set()
    combos = itertools.combinations(df.index, 2)
    for combo in combos:
        if f'{combo[0]}, {combo[1]}' not in excludes:
            if similar(df['text'][combo[0]], df['text'][combo[1]]) > 0.8:
                excludes.add(f'{combo[0]}, {combo[1]}')
                excludes.add(f'{combo[1]}, {combo[0]}')
                print("Dropped: " + str(combo))
                print(len(excludes))

duplicates(df)
My next step, which #KazuyaHatta described, is to attempt the drop-by-mask method (a rough sketch of that approach is included below).
Note: I unfortunately won't be able to post a sample of the dataset.
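For reference, here is a rough sketch of that drop-by-mask idea: mark the indices to remove while iterating (still O(n^2) comparisons), then drop them all at once at the end, so the index never changes underneath the loop. similar() is assumed to wrap SequenceMatcher as in the question, and the sample data is made up:

import itertools
from difflib import SequenceMatcher

import pandas as pd

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

def drop_near_duplicates(df, threshold=0.8):
    """Mark near-duplicate rows first, then drop them all at once with a mask."""
    to_drop = set()
    for i, j in itertools.combinations(df.index, 2):
        if i in to_drop or j in to_drop:
            continue                       # already scheduled for removal
        if similar(df.loc[i, 'text'], df.loc[j, 'text']) >= threshold:
            to_drop.add(j)                 # keep the first occurrence, drop the later one
    mask = df.index.isin(to_drop)
    return df.loc[~mask]

# Hypothetical usage:
df = pd.DataFrame({'text': ['great match today', 'great match today!', 'patch notes are out']})
print(drop_near_duplicates(df))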

Correct way of testing Pandas dataframe values and modifying them

I need to modify some values of a Pandas dataframe based on a test, and leave the other values intact. I also need to leave the order of the rows intact.
I have working code based on iterating over the dataframe's rows, but it's horrendously slow. Is there a quicker way to get it done?
Here are two examples of this very slow code:
for index, row in df.iterrows():
    if df.number[index].is_integer():
        df.number[index] = int(df.number[index])

for index, row in df.iterrows():
    if df.string[index] == "XXX":
        df.string[index] = df.other_colum[index].split("\")[0] + df.other_colum[index].split("\")[1]
    else:
        df.string[index] = df.other_colum[index].split("\")[1] + df.other_colum[index].split("\")[0]
Thanks
Generally you want to avoid iterating through rows in a pandas dataframe as it is slower than other methods pandas has created for accomplishing the same thing. One way of getting around this is using apply. You would redefine the number column:
df["number"] = df["number"].apply(lambda x: int(x) if x.is_integer() else x)
And (re)define the string column:
df["string"] = df["other column"].apply(lambda x: x.split("\\")[0] + x.split("\\")[1] if x == r"XX\X" else x.split("\\")[1] + x.split("\\")[0])
Made some assumptions based off of the data you removed from the problem set up -- .split("\") is incorrect syntax, and "other column" above necessarily has to have a backslash in it in order for your code (and mine) to work, otherwise .split("\\")[1] will return an error.
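If apply is still too slow at scale, the string branch can also be vectorized with the .str accessor and numpy.where. This is only a sketch under the same assumptions (the column is spelled other_colum as in the question, and values are separated by a single literal backslash):

import numpy as np
import pandas as pd

# Made-up frame mirroring the columns described in the question.
df = pd.DataFrame({
    'string': ['XXX', 'abc', 'XXX'],
    'other_colum': ['left\\right', 'foo\\bar', 'aa\\bb'],
})

# Split each value once on the backslash, then pick the order per row.
parts = df['other_colum'].str.split('\\', n=1, expand=True)
df['string'] = np.where(df['string'] == 'XXX',
                        parts[0] + parts[1],
                        parts[1] + parts[0])
print(df)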

Pandas formatting column within DataFrame and adding timedelta Index error

I'm trying to use pandas to do some analysis on some messaging data and am running into a few problems trying to prep the data. It is coming from a database I don't have control of, so I need to do a little pruning and formatting before analyzing it.
Here is where I'm at so far:
# Select all the messages in the database. Be careful if you get the whole test
# database - it may have 5,000,000 messages.
full_set_data = pd.read_sql("Select * from message", con=engine)
After I make this change to the timestamp and set it as the index, I'm no longer able to call to_csv:
# Convert timestamp to a timedelta and set it as the index.
#full_set_data[['timestamp']] = full_set_data[['timestamp']].astype(np.timedelta64)
indexed = full_set_data.set_index('timestamp')
indexed.to_csv('indexed.csv')
# Extract the data columns I really care about, since there are a bunch I don't need.
datacolumns = indexed[['address', 'subaddress', 'rx_or_tx', 'wordcount'] + [col for col in indexed.columns if 'DATA' in col]]
Here I need to format the DATA columns, and I get a "SettingWithCopyWarning":
# Format the DATA columns to something useful by removing the upper 4 bytes.
for col in datacolumns.columns:
    if 'DATA' in col:
        datacolumns[col] = datacolumns[col].apply(lambda x: int(x, 16) & 0x0000ffff)
datacolumns.to_csv('data_col.csv')
# Now group the data by "interaction key".
groups = datacolumns.groupby(['address', 'subaddress', 'rx_or_tx'])
I need to figure out how to get all the messages from a given group; get_group() requires that I know the key values ahead of time:
key_group = groups.get_group((1, 1, 1))
# for each group in groups:
#     do analysis
I have tried everything I could think of to fix the problems I'm running into, but I can't seem to get around them. I'm sure it's from me misunderstanding/misusing Pandas, as I'm still figuring it out.
I'm looking to solve these issues:
1) I can't save to csv after I add an index of timestamp as timedelta64.
2) How do I apply a function to a set of columns so that I avoid the SettingWithCopyWarning when reformatting the DATA columns?
3) How do I grab the rows for each group without having to use get_group(), since I don't know the keys ahead of time?
Thanks for any insight and help so I can better understand how to properly use Pandas.
Firstly, you can set the index column(s) and parse dates while querying the DB:
indexed = pd.read_sql_query("Select * from message", con=engine,
                            parse_dates='timestamp', index_col='timestamp')
Note I've used pd.read_sql_query here rather than pd.read_sql, which is deprecated, I think.
The SettingWithCopy warning is due to the fact that datacolumns is a view of indexed, i.e. a subset of its rows/columns, not an object in its own right. Check out this part of the docs: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
One way to get around this is to define
datacolumns = indexed[<cols>].copy()
Another would be to do
indexed = indexed[<cols>]
which effectively removes the columns you don't want, if you're happy that you won't need them again. You can then manipulate indexed at your leisure.
As for the groupby, you could introduce a column of tuples to serve as the group keys:
indexed['interaction_key'] = list(zip(indexed['address'], indexed['subaddress'], indexed['rx_or_tx']))
indexed.groupby('interaction_key').apply(
    lambda df: some_function(df.interaction_key, ...))
I'm not sure if it's all exactly what you want, but let me know and I can edit.
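Regarding issue 3: a groupby object is itself iterable, so you don't need get_group() or the key values up front. A small sketch with made-up data (column names taken from the question):

import pandas as pd

# Made-up frame with the grouping columns from the question.
datacolumns = pd.DataFrame({
    'address':    [1, 1, 2],
    'subaddress': [1, 1, 3],
    'rx_or_tx':   [1, 1, 0],
    'DATA01':     [0x12340001, 0x56780002, 0x9abc0003],
})

groups = datacolumns.groupby(['address', 'subaddress', 'rx_or_tx'])

# Iterate over every (key, sub-frame) pair without knowing the keys ahead of time.
for key, group in groups:
    print(key, len(group))
    # ... do the per-group analysis on `group` here ...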
