I'm trying to plot the counts of some pandas dataframe columns, grouped by date:
by_date = data.groupby(data.index.day).count()
The data are correct, but the data.index.day I specified is no good for plotting purposes.
Is there a way of specifying that I want to group by Python Date objects, or am I doing this completely wrong?
Update: Dan Allan's resample suggestion worked, but now the xticks are unreadable. Should I be extracting them separately?
I think this task is more easily accomplished using resample, not groupby. How about
data.resample('D').count()
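For the follow-up about unreadable xticks: plotting through matplotlib directly and letting it format the dates usually helps. A minimal sketch, where the toy data and the 'value' column name are assumptions about the real frame's shape:

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# Toy stand-in for the original frame: one row per event on a DatetimeIndex
idx = pd.date_range("2013-01-01", periods=200, freq="6h")
data = pd.DataFrame({"value": range(200)}, index=idx)

# The resample-and-count step from the answer above
by_date = data.resample("D").count()

# Plot via matplotlib so the date ticks stay readable
fig, ax = plt.subplots()
ax.plot(by_date.index, by_date["value"])
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m-%d"))
fig.autofmt_xdate()  # rotate and right-align the date labels
plt.show()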
I have a pandas dataframe with an index and just one column. The index holds dates and the column holds values. I would like to find the NewValue for a NewDate that is not in the index. I suppose I could use an interpolation function, something like NewValue = InterpolationFunction(NewDate, Index, Column, method, etc.). So what is that function? It seems that most interpolation functions are used for padding, i.e. filling in missing values, which is not what I want. I just want the NewValue, without building a new dataframe.
Thank you very very much for taking the time to read this.
I am afraid that you cannot find the missing values without constructing a base for your data. Here is the answer to your question if you build a base dataframe:
You need to construct a panel in order to set up your data for proper interpolation. For instance, in the case of dates, say you have yearly data and you want to add information for a missing year in between, or generate data at quarterly intervals.
You need to construct a base for the time series i.e.:
import pandas as pd

dates = pd.date_range(start="1987-01-01", end="2021-01-01", freq="Q").values
panel = pd.DataFrame({'date_Q': dates})
Now you can join your data to this structure, and the dates without information will show up as missing. Then you need a proper interpolation algorithm to fill those missing values. Pandas' .interpolate() method has some basic interpolation methods such as polynomial and linear, which are described in the pandas documentation.
However, many more interpolation methods are offered by SciPy; see the scipy.interpolate tutorials.
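A minimal sketch of the base-then-interpolate idea; the yearly figures and the 'value' column name are made up for illustration:

import pandas as pd

# Sparse yearly data (invented numbers)
yearly = pd.DataFrame(
    {"value": [10.0, 14.0, 21.0]},
    index=pd.to_datetime(["2018-12-31", "2019-12-31", "2020-12-31"]),
)

# Quarterly base covering the same span
base = pd.DataFrame(index=pd.date_range("2018-12-31", "2020-12-31", freq="Q"))

# Join the sparse data onto the dense base; the gaps appear as NaN...
panel = base.join(yearly)

# ...and fill them with time-aware linear interpolation
panel["value"] = panel["value"].interpolate(method="time")
print(panel)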
I have a fairly large pandas dataframe (4000 rows x 103 columns), and for smaller dataframes I love using pairplot to visually see patterns in my data. But for my larger dataset the same command runs for an hour or more with no output.
Is there an alternative tool to get the same outcome, or a way to speed up the command? I tried using pandas' sample option to reduce the dataset, but it still takes over an hour with no output.
import seaborn as sns

dfSample = myData.sample(100)  # make dataset smaller
sns.pairplot(dfSample, diag_kind="hist")
You should sample from the columns, not the rows: pairplot draws one subplot for every pair of columns, so with 103 columns it has to lay out over 10,000 axes, and reducing the number of rows barely helps. Replace your first line with
dfSample = myData.sample(10, axis=1)
And live happy. A self-contained sketch follows below.
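Here is that fix in context; the random frame is a stand-in for the question's myData:

import numpy as np
import pandas as pd
import seaborn as sns

# Stand-in for myData: 4000 rows x 103 columns of random values
myData = pd.DataFrame(
    np.random.randn(4000, 103), columns=[f"c{i}" for i in range(103)]
)

# Sampling 10 of the 103 columns shrinks the grid from 103*103 to 10*10 panels
dfSample = myData.sample(10, axis=1)
sns.pairplot(dfSample, diag_kind="hist")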
I have a dataframe with columns that are an aggregation of coronavirus cases over time.
I need the data in the date columns to be the new number of cases for that day instead of the aggregation.
So for example, I am trying to get the first row to be like
Anhui, Mainland China, 1, 8, 6
I think there might be a quick pandas way to do this but can't find it by google searching. A brute force method would be okay too. Thanks!
You can take the finite difference across the columns of each row. If df is a copy of the numerical part of the dataframe, then the following will do it:
df.diff(axis=1)
See the pandas DataFrame.diff documentation.
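A small sketch of that on data shaped like the question's; the numbers are invented so the Anhui row matches the desired 1, 8, 6:

import pandas as pd

# Hypothetical frame mimicking the aggregated case counts
df = pd.DataFrame(
    {"1/22/20": [1, 14], "1/23/20": [9, 22], "1/24/20": [15, 36]},
    index=["Anhui", "Hubei"],
)

# Difference across columns within each row; the first column has no
# predecessor, so restore the original values there
daily = df.diff(axis=1)
daily.iloc[:, 0] = df.iloc[:, 0]
print(daily)  # Anhui row: 1, 8, 6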
I'm trying to get a Seaborn barplot containing the top n entries from a dataframe, sorted by one of the columns.
In Pandas, I'd typically do this with something like:
df = df.sort_values('ColumnFoo', ascending=False)
sns.barplot(data=df[:10], x='ColumnFoo', y='ColumnBar')
Trying out Dask, though, there is (fairly obviously) no option to sort a dataframe, since dataframes are largely deferred objects, and sorting them would eliminate many of the benefits of using Dask in the first place.
Is there a way to either get ordered entries from a Dask dataframe, or to have Seaborn pick the top n values from a dataframe's column?
If you're moving data to Seaborn then it almost certainly fits in memory. I recommend just converting to a pandas dataframe and doing the sorting there.
Generally, once you've hit the small-data regime there is no reason to use Dask over pandas. Pandas is more mature and offers a smoother experience. The Dask dataframe developers recommend using pandas when feasible.
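A sketch of both routes, assuming df is a Dask dataframe with the question's column names (the toy data is fabricated):

import dask.dataframe as dd
import pandas as pd
import seaborn as sns

# Toy stand-in for the real Dask dataframe
pdf = pd.DataFrame({"ColumnFoo": range(1000), "ColumnBar": range(1000)})
df = dd.from_pandas(pdf, npartitions=4)

# Route 1: nlargest is lazy and only materialises the top rows on compute()
top10 = df.nlargest(10, "ColumnFoo").compute()

# Route 2: pull everything into pandas and sort there, as recommended above
# top10 = df.compute().sort_values("ColumnFoo", ascending=False)[:10]

sns.barplot(data=top10, x="ColumnFoo", y="ColumnBar")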
Is there any possibility to plot an Apache DataFrame? I managed to do it by converting to a pandas dataframe, but that takes a lot of time and is not my goal.
In particular, the goal is to plot a map from an Apache DataFrame without conversion to a pandas DataFrame.
By plotting I mean using a library such as matplotlib or plotly to draw a graph or something similar.
Any ideas?
Thanks!
Do you mean plotting a Spark DataFrame?
In that case, you could do something like this, with yourDF as your dataframe:
yourDF.show(100, truncate=False)
This will print your dataframe's structure and values (in this case, the first 100 rows) to the log, much like you would see in pandas. With truncate=False the column values are printed in full instead of being cut off.
EDIT: in order to plot directly from a dataframe, check the plotly library, or the
display(dataframe)
function available in Databricks notebooks.
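One pattern worth sketching (an assumption on my part, not from the answer above): aggregate inside Spark so only a handful of summary rows ever reach the plotting library, which sidesteps converting the full dataframe. The tiny dataframe here is fabricated for illustration:

from pyspark.sql import SparkSession
import matplotlib.pyplot as plt

spark = SparkSession.builder.getOrCreate()

# Fabricated stand-in for the real Spark dataframe
sdf = spark.createDataFrame([("a", 1), ("a", 2), ("b", 5)], ["key", "value"])

# The heavy lifting stays in Spark; collect() returns only the summary rows
rows = sdf.groupBy("key").count().collect()

keys = [r["key"] for r in rows]
counts = [r["count"] for r in rows]
plt.bar(keys, counts)
plt.show()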