My time series plot is showing the wrong order - python

I'm plotting:
df['close'].plot(legend=True,figsize=(10,4))
The original data series comes in descending order, so I then did:
df.sort_values(['quote_date'])
The table now looks good and sorted in the desired manner, but the graph is still the same, showing today first and then going back in time.
Does .plot() order by index? If so, how can I fix this?
Alternatively, I'm importing the data with:
df = pd.read_csv(url1)
Can I somehow sort the data there already?

There are two problems with this code:
1) df.sort_values(['quote_date']) does not sort in place. It returns a sorted DataFrame and leaves df unchanged, so assign the result back:
df = df.sort_values(['quote_date'])
2) Yes, plot() uses the index for the x-axis by default, but you can change this behavior with the keyword use_index:
df['close'].plot(use_index=False, legend=True,figsize=(10,4))
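On the last part of the question (sorting at import time), one possible approach is to make quote_date the index while reading and then sort by it. This is a minimal sketch with a small made-up CSV standing in for url1; the sample values are assumptions, only the column names quote_date and close come from the question.

```python
import io

import pandas as pd

# Hypothetical sample data in descending date order, standing in for url1.
csv_text = """quote_date,close
2021-03-03,103.5
2021-03-02,101.0
2021-03-01,100.0
"""

# Parse the dates, use them as the index, and sort oldest to newest.
df = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["quote_date"],
    index_col="quote_date",
).sort_index()

# df['close'].plot(legend=True, figsize=(10, 4)) would now run left to right in time.
print(df)
```

With a date index, the default use_index=True also gives dates on the x-axis instead of row positions.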

Related

Displaying consistent scientific notation in pyplot table from Pandas dataframe?

I'm trying to output a Pandas dataframe as a pyplot table to be inserted into a presentation, but I'm running into problems with the formatting of the values in the table itself.
This is one part of a much larger project, so I'll try to give as much code and context as I can:
values = []
for item in scenarios:
    value = dataframe.loc[dataframe['Scenario'] == item, 'Value'].sum()
    # Formatting for scientific notation with 2 decimal points (so, 3 sig figs)
    values.append("{:.2e}".format(value))
# Create a dataframe to hold this new, visual-ready data and add the generated values to it.
visual_data = pd.DataFrame(scens, columns=["Scenario"])
visual_data["Data"] = values
# Formatting the data as a float so Pandas can sort it properly.
visual_data["Data"] = visual_data["Data"].astype(float)
# Sorting. By default, Pandas sorts in ascending order.
visual_data.sort_values(by=['Data'], inplace=True)
# Matplotlib code.
fig, ax = plt.subplots()
# fig.patch.set_visible(False)
ax.axis('off')
ax.axis('tight')
table = ax.table(cellText=visual_data.to_numpy(), colLabels=visual_data.columns, loc='center', cellLoc='center')
fig.tight_layout()
plt.show()
The problem I run into is how everything in the table itself displays. The data I'm working with runs from 1e-8 to 1e-4. When I just have my sorted dataframe, everything is formatted in proper scientific notation from lowest to highest values.
As soon as I insert the data into the table (the '.to_numpy' statement when creating the table) the output looks something like the following:
[['DUJ' 3.964e-08]
['DUE' 4.467e-08]
['DUD' 1.172e-07]
['DUC' 2.098e-07]
['DUG' 2.136e-07]
...
['DUN' 7.356e-05]
['MCC' 0.0001046]
['ALU' 0.0001652]]
With the final two entries rendering as standard floats instead of being consistent scientific notation like all the other entries in the table.
I know that NumPy's "set_printoptions" has a "suppress" option that controls whether small floats are printed in scientific notation, but I can't figure out how to use it to get consistent output here.
Any ideas?
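One possible way around this, sketched under the assumption that the table only needs display strings: keep the column as floats for sorting, then format each value back to a string just before building the table, so the float formatting never reaches the cell text. The three rows below are taken from the output shown in the question; cell_text is a hypothetical name.

```python
import pandas as pd

# Small stand-in for visual_data, using values from the question's output.
visual_data = pd.DataFrame(
    {"Scenario": ["ALU", "DUJ", "MCC"], "Data": [1.652e-04, 3.964e-08, 1.046e-04]}
)

# Sort while the column is still numeric.
visual_data = visual_data.sort_values(by="Data")

# Format every value explicitly so each cell uses the same notation.
cell_text = [
    [scenario, f"{value:.2e}"]
    for scenario, value in zip(visual_data["Scenario"], visual_data["Data"])
]
print(cell_text)
```

The resulting list of lists could then be passed as ax.table(cellText=cell_text, colLabels=visual_data.columns, ...) in place of visual_data.to_numpy().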

dataframe line plot is not plotting a line with column values

I think there is something wrong with the data in my dataframe, but I am having a hard time coming to a conclusion. I think there might be some missing datetime values, which is the index of the dataframe. Given that there are over 1000 rows, it isn't possible for me to check each row manually. Here is a picture of my data and the corresponding line plot. Clearly this isn't a line plot!
Is there any way to supplement the possible missing values in my dataframe somehow?
I also did a line plot in seaborn to get another perspective, but I don't think it was immediately helpful.
You have effectively done the same as I have simulated below. Your data really has a multi-index of date and age_group. Plotting both groups together as one series means the line jumps between the two. Separate them out, plot them as separate lines, and the result is as you expect.
import numpy as np
import pandas as pd
d = pd.date_range("1-jan-2020", "16-mar-2021")
df = pd.concat([pd.DataFrame({"daily_percent":np.sort(np.random.uniform(0.5,1, len(d)))}, index=d).assign(age_group="0-9 Years"),
                pd.DataFrame({"daily_percent":np.sort(np.random.uniform(0,0.5, len(d)))}, index=d).assign(age_group="20-29 Years")])
# A single line through both age groups: it jumps back to the first date between groups.
df.plot(kind="line", y="daily_percent", color="red")
# One column per age group, so each group is drawn as its own line.
df.set_index("age_group", append=True).unstack(1).droplevel(0, axis=1).plot(kind="line", color=["red","blue"])

Sorting based on the alt.Color field in Altair

I am attempting to sort a horizontal barchart based on the group to which it belongs. I have included the dataframe, code that I thought would get me to group-wise sorting, and image. The chart is currently sorted according to the species column in alphabetical order, but I would like it sorted by the group so that all "bads" are together, similarly, all "goods" are together. Ideally, I would like to take it one step further so that the goods and bads are subsequently sorted by value of 'LDA Score', but that was the next step.
Dataframe:
Unnamed: 0,Species,Unknown,group,LDA Score,p value
11,a,3.474929757,bad,3.07502591,5.67e-05
16,b,3.109308852,bad,2.739744898,0.000651725
31,c,3.16979865,bad,2.697247855,0.03310557
38,d,0.06730106400000001,bad,2.347746497,0.013009626000000002
56,e,2.788383183,good,2.223874347,0.0027407140000000004
65,f,2.644346144,bad,2.311106698,0.00541244
67,g,3.626001112,good,2.980960068,0.038597163
74,h,3.132399759,good,2.849798377,0.007021518000000001
117,i,3.192113412,good,2.861299028,8.19e-06
124,j,0.6140430960000001,bad,2.221483531,0.0022149739999999998
147,k,2.873671544,bad,2.390164757,0.002270102
184,l,3.003479213,bad,2.667274876,0.008129727
188,m,2.46344998,good,2.182085465,0.001657861
256,n,0.048663767,bad,2.952260299,0.013009626000000002
285,o,2.783848855,good,2.387345098,0.00092491
286,p,3.636219,good,3.094047,0.001584756
The code:
bars = alt.Chart(df).mark_bar().encode(
    alt.X('LDA Score:Q'),
    alt.Y("Species:N"),
    alt.Color('group:N', sort=alt.EncodingSortField(field="Clinical group", op='distinct', order='ascending'))
)
bars
The resulting figure:
Two things:
If you want to sort the y-axis, you should put the sort expression in the y encoding. Above, you are sorting the color labels in the legend.
Sorting by field in Vega-Lite only works for numeric data (Edit: this is incorrect; see below), so you can use a calculate transform to map the entries to numbers by which to sort.
The result might look something like this:
alt.Chart(df).transform_calculate(
    order='datum.group == "bad" ? 0 : 1'
).mark_bar().encode(
    alt.X('LDA Score:Q'),
    alt.Y("Species:N", sort=alt.SortField('order')),
    alt.Color('group:N')
)
Edit: it turns out the reason sorting by group fails is that the default operation for sort fields is sum, which only works well on quantitative data. If you choose a different operation, you can sort on nominal data directly. For example, this shows the correct output:
alt.Chart(df).mark_bar().encode(
    alt.X('LDA Score:Q'),
    alt.Y("Species:N", sort=alt.EncodingSortField('group', op='min')),
    alt.Color('group:N')
)
See vega/vega-lite#6064 for more information.
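For the follow-up wish in the question (grouping first, then ordering by 'LDA Score' within each group), one option is to compute the order in pandas and hand Altair an explicit list, since the sort argument of an encoding also accepts a list of values. This is a rough sketch on five rows from the question's dataframe; species_order is a hypothetical name, and descending score order within each group is an assumption.

```python
import pandas as pd

# A few rows from the question's dataframe.
df = pd.DataFrame(
    {
        "Species": ["a", "b", "e", "g", "p"],
        "group": ["bad", "bad", "good", "good", "good"],
        "LDA Score": [3.07502591, 2.739744898, 2.223874347, 2.980960068, 3.094047],
    }
)

# Order by group first, then by score (highest first) within each group.
species_order = (
    df.sort_values(["group", "LDA Score"], ascending=[True, False])["Species"].tolist()
)
print(species_order)

# The list could then be used as: alt.Y("Species:N", sort=species_order)
```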

pandas grouping based on different data

I want to group data based on different dataframe's cuts.
So for instance I cut from a frame:
my_fcuts = pd.qcut(frame1['prices'],5)
pd.groupby(frame2, my_fcuts)
Since the lengths must be same, the above statement will fail.
I know I can easily write a mapper function, but what if this was the case
my_fcuts = pd.qcut(frame1['prices'],20) or some higher number. Surely there must be some built-in statement in pandas to do this very simple thing. groupby should be able to accept "cuts" from different data and reclassify.
Any ideas?
Thanks, I figured out the answer myself:
volgroups = np.digitize(btest['vol_proxy'],np.linspace(min(data['vol_proxy']), max(data['vol_proxy']), 10))
trendgroups = np.digitize(btest['trend_proxy'],np.linspace(min(data['trend_proxy']), max(data['trend_proxy']), 10))
#btest.groupby([volgroups,trendgroups]).mean()['pnl'].plot(kind='bar')
#plt.show()
df = btest.groupby([volgroups,trendgroups]).groups
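An alternative that stays closer to the original qcut idea, sketched with made-up data: pd.qcut can return the bin edges via retbins=True, and pd.cut can then apply those edges to a different frame of any length. The frames, the pnl column, and the random values below are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frame1 = pd.DataFrame({"prices": rng.uniform(10, 100, 200)})
frame2 = pd.DataFrame(
    {"prices": rng.uniform(10, 100, 50), "pnl": rng.normal(size=50)}
)

# Quantile edges computed on frame1 only.
_, bin_edges = pd.qcut(frame1["prices"], 5, retbins=True)

# Open up the outer edges so frame2 values outside frame1's range still get a bin.
edges = np.concatenate(([-np.inf], bin_edges[1:-1], [np.inf]))

# Classify frame2 with frame1's edges, then group by the resulting categories.
frame2_bins = pd.cut(frame2["prices"], bins=edges)
grouped = frame2.groupby(frame2_bins, observed=False)["pnl"].mean()
print(grouped)
```

Compared with np.digitize over np.linspace, this keeps quantile-based bins rather than equal-width ones, which is what pd.qcut produced in the question.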

Pandas formatting column within DataFrame and adding timedelta Index error

I'm trying to use pandas to do some analysis on some messaging data and am running into a few problems trying to prep the data. It is coming from a database I don't have control of, and therefore I need to do a little pruning and formatting before analyzing it.
Here is where I'm at so far:
#select all the messages in the database. Be careful if you get the whole test data base, may have 5000000 messages.
full_set_data = pd.read_sql("Select * from message",con=engine)
After I make this change to the timestamp and set it as the index, I'm no longer able to call to_csv.
#convert timestamp to a timedelta and set as index
#full_set_data[['timestamp']] = full_set_data[['timestamp']].astype(np.timedelta64)
indexed = full_set_data.set_index('timestamp')
indexed.to_csv('indexed.csv')
#extract the data columns I really care about since there are a bunch I don't need
datacolumns = indexed[['address','subaddress','rx_or_tx', 'wordcount'] + [col for col in indexed.columns if ('DATA' in col)]]
Here I need to format the DATA columns, I get a "SettingWithCopyWarning".
#now need to format the DATA columns to something useful by removing the upper 4 bytes
for col in datacolumns.columns:
    if 'DATA' in col:
        datacolumns[col] = datacolumns[col].apply(lambda x : int(x,16) & 0x0000ffff)
datacolumns.to_csv('data_col.csv')
#now group the data by "interaction key"
groups = datacolumns.groupby(['address','subaddress','rx_or_tx'])
I need to figure out how to get all the messages from a given group. get_group() requires I know key values ahead of time.
key_group = groups.get_group((1,1,1))
#foreach group in groups:
#do analysis
I have tried everything I could think of to fix the problems I'm running into, but I can't seem to get around them. I'm sure it's from me misunderstanding/misusing Pandas, as I'm still figuring it out.
I'm looking to solve these issues:
1) Can't save to csv after I add index of timestamp as timedelta64
2) How do I apply a function to a set of columns to remove SettingWithCopyWarning when reformatting DATA columns.
3) How to grab the rows for each group without having to use get_group() since I don't know the keys ahead of time.
Thanks for any insight and help so I can better understand how to properly use Pandas.
Firstly, you can set the index column(s) and parse dates while querying the DB:
indexed = pd.read_sql_query("Select * from message", con=engine,
                            parse_dates=['timestamp'], index_col='timestamp')
Note I've used pd.read_sql_query here rather than pd.read_sql. pd.read_sql is a convenience wrapper that dispatches to read_sql_query when given a query string, so either works.
The SettingWithCopyWarning arises because datacolumns is derived from indexed by selecting a subset of its columns, so pandas cannot tell whether it is a view or a copy, and an assignment to it might not do what you expect. Check out this part of the docs: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
One way to get around this is to define
datacolumns = indexed[<cols>].copy()
Another would be to do
indexed = indexed[<cols>]
which effectively removes the columns you don't want, if you're happy that you won't need them again. You can then manipulate indexed at your leisure.
As for the groupby, you could introduce a column of tuples which would be the group keys:
indexed['interaction_key'] = list(zip(indexed['address'], indexed['subaddress'], indexed['rx_or_tx']))
indexed.groupby('interaction_key').apply(
    lambda df: some_function(df.interaction_key, ...))
I'm not sure if it's all exactly what you want but let me know and I can edit.
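For points 2) and 3) of the question, here is a small sketch on made-up rows: work on an explicit copy to avoid the warning, mask all DATA columns in one assignment, and iterate over the groupby object to get each key with its rows without knowing the keys in advance. The sample values and the names messages, cleaned, and group_sizes are assumptions, not part of the original data.

```python
import pandas as pd

# Hypothetical stand-in for the message table.
messages = pd.DataFrame(
    {
        "address": [1, 1, 2, 2],
        "subaddress": [1, 1, 3, 3],
        "rx_or_tx": [1, 1, 0, 0],
        "DATA0": ["0x12345678", "0xffff0001", "0x0000abcd", "0x00010002"],
    }
)

data_cols = [col for col in messages.columns if "DATA" in col]

# An explicit copy is its own object, so assigning to it raises no warning.
cleaned = messages.copy()
cleaned[data_cols] = cleaned[data_cols].apply(
    lambda col: col.map(lambda x: int(x, 16) & 0x0000FFFF)
)

# Iterating a groupby yields (key, sub-DataFrame) pairs for every group.
group_sizes = {}
for key, group in cleaned.groupby(["address", "subaddress", "rx_or_tx"]):
    group_sizes[key] = len(group)  # replace with the per-group analysis

print(cleaned)
print(group_sizes)
```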
