pandas grouping based on different data - python

I want to group data based on a different dataframe's cuts.
So, for instance, I cut from one frame:
my_fcuts = pd.qcut(frame1['prices'],5)
pd.groupby(frame2, my_fcuts)
Since the lengths must be the same, the above statement will fail.
I know I could easily write a mapper function, but what if the case were
my_fcuts = pd.qcut(frame1['prices'], 20)
or some even higher number of bins? Surely there must be some built-in statement in pandas for this very simple thing. groupby should be able to accept "cuts" from different data and reclassify.
Any ideas?

Thanks, I figured out the answer myself:
import numpy as np

# bin each proxy into 10 evenly spaced buckets spanning the reference data's range
volgroups = np.digitize(btest['vol_proxy'], np.linspace(min(data['vol_proxy']), max(data['vol_proxy']), 10))
trendgroups = np.digitize(btest['trend_proxy'], np.linspace(min(data['trend_proxy']), max(data['trend_proxy']), 10))
#btest.groupby([volgroups, trendgroups]).mean()['pnl'].plot(kind='bar')
#plt.show()
df = btest.groupby([volgroups, trendgroups]).groups
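For reference, the cut/reclassify step can also be kept inside pandas: ask qcut for its bin edges (retbins=True) and apply them to the other frame with pd.cut. A minimal sketch, using made-up frames in place of frame1 and frame2:
import numpy as np
import pandas as pd

# made-up data; frame1 defines the bins, frame2 is what gets grouped
frame1 = pd.DataFrame({'prices': np.random.uniform(10, 100, 500)})
frame2 = pd.DataFrame({'prices': np.random.uniform(10, 100, 200),
                       'volume': np.random.randint(1, 1000, 200)})

# quantile bin edges derived from frame1
_, bin_edges = pd.qcut(frame1['prices'], 5, retbins=True)

# classify frame2's prices into frame1's bins and group on that
# (values outside frame1's range become NaN and are dropped by groupby)
frame2_bins = pd.cut(frame2['prices'], bins=bin_edges, include_lowest=True)
grouped = frame2.groupby(frame2_bins)['volume'].mean()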

Related

Take value of a cell and store it in a variable for given values of column and row in python

I want to look up the value of X and store it as a variable for given values of a column and a row.
In this case, the table is the relation between yield strength and slenderness ratio provided in the local building byelaws. I want to use it in a script (REST API) and call it through Flask (localhost) from a Flutter app.
I am storing the table as a LibreOffice Calc/CSV file; should I use SQLite?
I have tried using pandas.
For example, I want Python to take the value of sigma (pink) if the yield strength is 250 and the slenderness ratio is 60:
(image: table of values)
I've been stuck for a couple of weeks; any help would be much appreciated, thanks!
If you want a pandas solution, DataFrame.at should work: you'd do df.at[60, 250], but this will only work if those values exist as row/column labels.
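For illustration, a minimal sketch of that label-based lookup, assuming a hypothetical table whose index holds slenderness ratios and whose columns hold yield strengths:
import pandas as pd

# hypothetical values; rows = slenderness ratio, columns = yield strength
table = pd.DataFrame({250: [150.0, 148.0, 145.0],
                      300: [170.0, 168.0, 165.0]},
                     index=[50, 60, 70])

sigma = table.at[60, 250]   # scalar lookup by exact row/column labels
# table.loc[60, 250] does the same and also accepts slices or boolean masks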
Assuming both values you need are on the same row, it's easy to operate on them; if you want to use pandas you can simply add a new column with the desired value.
Example:
import pandas as pd
data = {'Name': ['Jack', 'Maria', 'Alan', 'Casey'],
        'Salary': [2400, 3000, 2000, 1700],
        'Rent_Cost': [900, 850, 550, 650]}
df = pd.DataFrame(data)
df['Salary_%_Spent'] = round((df['Rent_Cost']/df['Salary']*100), 2)
This would result in:
    Name  Salary  Rent_Cost  Salary_%_Spent
0   Jack    2400        900           37.50
1  Maria    3000        850           28.33
2   Alan    2000        550           27.50
3  Casey    1700        650           38.24
Now if you prefer this as a variable...
percentageSpent = round((df['Rent_Cost']/df['Salary']*100), 2)
If you already have the column with the values you need then it becomes way easier.
percentageSpent = df['Salary_%_Spent']
or
caseyPercentage = df['Salary_%_Spent'][df['Name']=='Casey']
I hope this helps you, or at least guides you toward solving your issue.

Defining a function with args to be used in df.transform

For a current project, I am planning to winsorize a Pandas DataFrame that consists of two columns/objects df['Policies'] and df['ProCon']. This means that the outliers at the high and the low end of the set shall be cut out.
The winsorising shall be conducted at 0.05 and 0.95 based on the values shown in the df['ProCon'] section, while both columns shall be cut out in case an outlier is identified.
The code below is, however, not accepting the direct reference to the 'ProCon' column in the line def winsorize_series(df['ProCon']):, yielding an invalid-syntax error.
Is there any smart way to indicate that ProCon shall be the determining value for the winsorizing?
import pandas as pd
from scipy.stats import mstats

# Loading the file
df = pd.read_csv("3d201602.csv")

# Winsorizing
def winsorize_series(df['ProCon']):
    return mstats.winsorize(df['ProCon'], limits=[0.05,0.95])

# Defining the winsorized DataFrame
df = df.transform(winsorize_series)
Have you tried separating the column name from the table?
def winsorize_series(df, column):
    return mstats.winsorize(df[column], limits=[0.05,0.95])
I can't test it, though, since there's no sample data.
As per the comments, .transform is not the right choice for modifying only one or selected columns from df. Whatever the function definition and arguments passed, transform will iterate and pass EVERY column to func and try to broadcast the joined result back to the original shape of df.
What you need is much simpler
limits = [0.05,0.95] # keep limits static for any calls you make
colname = 'ProCon' # you could even have a list of columns and loop... for colname in cols
df[colname] = mstats.winsorize(df[colname], limits=limits)
df.transform(func) can be passed *args and **kwargs, which will be forwarded to func (each column is passed as the first argument automatically), as in
df = df.transform(mstats.winsorize, limits=[0.05,0.95])
So there is no need for
def winsorize_series...
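For completeness, a runnable sketch of that simpler approach with made-up data standing in for the asker's CSV. Note that mstats.winsorize reads limits as the fraction clipped from each tail, so cutting the bottom and top 5% is limits=[0.05, 0.05]:
import numpy as np
import pandas as pd
from scipy.stats import mstats

# made-up data in place of 3d201602.csv
df = pd.DataFrame({'Policies': np.arange(100),
                   'ProCon': np.random.normal(0, 1, 100)})

# clip the lowest and highest 5% of ProCon; Policies is left untouched
df['ProCon'] = mstats.winsorize(df['ProCon'], limits=[0.05, 0.05])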

What is a proper idiom in pandas for creating dataframes from the output of an apply function on a df?

Edit --- I've made some progress and discovered the drop_duplicates method in pandas, which replaces some custom duplicate-removal functions I had created.
This changes the question in a couple of ways, because it changes my initial requirements.
One of the operations I need to conduct is grabbing the latest feed entries --- the feed URLs exist in a column in a data frame. Once I've done the apply, I get feed objects back:
import pandas as pd
import feedparser
import datetime
df_check_feeds = pd.DataFrame({'account_name':['NYTimes', 'WashPo'],'feed_url':['http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml', 'http://feeds.washingtonpost.com/rss/homepage'], 'last_update':['2015-12-28 23:50:40', '2015-12-28 23:50:40']})
df_check_feeds["feeds_results"] = pd.DataFrame(df_check_feeds.feed_url.apply(lambda feed_url: feedparser.parse(feed_url)))
df_check_feeds["entries"] = df_check_feeds.feeds_results.apply(lambda x: x.entries)
So now I'm stuck with the feed entries in the "entries" column; I'd like to create two new data frames in one apply call and concatenate the two frames immediately.
I've expressed the equivalent in a for loop:
frames_list = []
for index in df_check_feeds.index:
    df_temp = pd.DataFrame(df_check_feeds.entries[index])
    df_temp['account_name'] = df_check_feeds.loc[index, 'account_name']
    # some error checking on the info here
    frames_list.append(df_temp)
df_total_results = pd.concat(frames_list)
df_total_results
I realize I could do this in a for loop (and indeed have written that), but I feel there is a better, more succinct, idiomatic pandas way of writing this.
A more compact way could be:
df_total_results = df_check_feeds.groupby('account_name').apply(lambda x: pd.DataFrame(x['entries'].iloc[0]))
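Note that the group key ends up in the index of that result; if you want account_name back as a column, a reset_index call (sketch below) flattens it:
df_total_results = (df_check_feeds.groupby('account_name')
                    .apply(lambda x: pd.DataFrame(x['entries'].iloc[0]))
                    .reset_index(level='account_name'))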

My time series plot is showing the wrong order

I'm plotting:
df['close'].plot(legend=True,figsize=(10,4))
The original data series comes in descending order, so I then did:
df.sort_values(['quote_date'])
The table now looks good and sorted in the desired manner, but the graph is still the same, showing today first and then going back in time.
Does .plot() order by index? If so, how can I fix this?
Alternatively, I'm importing the data with:
df = pd.read_csv(url1)
Can I somehow sort the data there already?
There are two problems with this code:
1) df.sort_values(['quote_date']) does not sort in place. It returns a sorted data frame and leaves df unchanged, so assign it back:
df = df.sort_values(['quote_date'])
2) Yes, the plot() method plots by index by default but you can change this behavior with the keyword use_index
df['close'].plot(use_index=False, legend=True,figsize=(10,4))
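Putting both points together with the read step, one possible sketch (assuming the CSV really does have a quote_date column to sort on):
df = pd.read_csv(url1, parse_dates=['quote_date'])
df = df.sort_values('quote_date').set_index('quote_date')
df['close'].plot(legend=True, figsize=(10, 4))   # index is now in ascending date order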

Pandas formatting column within DataFrame and adding timedelta Index error

I'm trying to use pandas to do some analysis on some messaging data and am running into a few problems trying to prep the data. It is coming from a database I don't have control of, and therefore I need to do a little pruning and formatting before analyzing it.
Here is where I'm at so far:
#select all the messages in the database. Be careful if you get the whole test data base, may have 5000000 messages.
full_set_data = pd.read_sql("Select * from message",con=engine)
After I make this change to the timestamp and set it as the index, I'm no longer able to call to_csv.
#convert timestamp to a timedelta and set as index
#full_set_data[['timestamp']] = full_set_data[['timestamp']].astype(np.timedelta64)
indexed = full_set_data.set_index('timestamp')
indexed.to_csv('indexed.csv')
#extract the data columns I really care about since there are a bunch I don't need
datacolumns = indexed[['address','subaddress','rx_or_tx', 'wordcount'] + [col for col in indexed.columns if ('DATA' in col)]]
Here, when I format the DATA columns, I get a "SettingWithCopyWarning".
#now need to format the DATA columns to something useful by removing the upper 4 bytes
for col in datacolumns.columns:
    if 'DATA' in col:
        datacolumns[col] = datacolumns[col].apply(lambda x: int(x, 16) & 0x0000ffff)
datacolumns.to_csv('data_col.csv')
#now group the data by "interaction key"
groups = datacolumns.groupby(['address','subaddress','rx_or_tx'])
I need to figure out how to get all the messages from a given group. get_group() requires that I know the key values ahead of time.
key_group = groups.get_group((1,1,1))
#foreach group in groups:
#do analysis
I have tried everything I could think of to fix the problems I'm running into, but I can't seem to get around them. I'm sure it's from me misunderstanding/misusing pandas, as I'm still figuring it out.
I'm looking to solve these issues:
1) I can't save to CSV after I set the timestamp (converted to timedelta64) as the index.
2) How do I apply a function to a set of columns without triggering the SettingWithCopyWarning when reformatting the DATA columns?
3) How do I grab the rows for each group without having to use get_group(), since I don't know the keys ahead of time?
Thanks for any insight and help so I can better understand how to properly use Pandas.
Firstly, you can set the index column(s) and parse dates while querying the DB:
indexed = pd.read_sql_query("Select * from message", con=engine,
                            parse_dates='timestamp', index_col='timestamp')
Note I've used pd.read_sql_query here rather than pd.read_sql, which is deprecated, I think.
The SettingWithCopyWarning is due to the fact that datacolumns is a view of indexed, i.e. a subset of its rows/columns, not an object in its own right. Check out this part of the docs: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
One way to get around this is to define
datacolumns = indexed[<cols>].copy()
Another would be to do
indexed = indexed[<cols>]
which effectively removes the columns you don't want, if you're happy that you won't need them again. You can then manipulate indexed at your leisure.
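As an illustration, a sketch of the copy-then-modify route for the DATA columns (column names and the hex mask are taken from the question; applymap is just one way to apply the function across several columns):
# take an explicit copy of only the columns of interest
cols = (['address', 'subaddress', 'rx_or_tx', 'wordcount']
        + [col for col in indexed.columns if 'DATA' in col])
datacolumns = indexed[cols].copy()

# strip the upper bytes on the owned copy -- no SettingWithCopyWarning
data_cols = [col for col in datacolumns.columns if 'DATA' in col]
datacolumns[data_cols] = datacolumns[data_cols].applymap(lambda x: int(x, 16) & 0x0000ffff)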
As for the groupby, you could introduce a column of tuples which would be the group keys:
indexed['interaction_key'] = list(zip(indexed['address'], indexed['subaddress'], indexed['rx_or_tx']))
indexed.groupby('interaction_key').apply(
    lambda df: some_function(df.interaction_key, ...))
I'm not sure if it's all exactly what you want but let me know and I can edit.
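On the third question, it may also be enough to iterate over the groupby object directly; it yields each key together with its sub-frame and avoids get_group entirely, e.g.:
groups = datacolumns.groupby(['address', 'subaddress', 'rx_or_tx'])
for key, group_df in groups:
    # key is the (address, subaddress, rx_or_tx) tuple; group_df holds its rows
    print(key, len(group_df))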
