I'm studying Python, and one of my goals is to write most of my code without packages. I would like to write a structure that works like pandas's DataFrame, but without using any other package. Is there any way to do that?
Using pandas, my code looks like this:
from pandas import DataFrame
...
s = DataFrame(s, index=ind)
where ind is the result of a function.
Maybe a dictionary could be the answer?
Thanks
No native Python data structure has all the features of a pandas DataFrame; that gap is part of why pandas was written in the first place. Leveraging packages others have written brings the time and work of many other people into your code, advancing your own code's capabilities much as Isaac Newton said his famous discoveries were only possible because he stood on the shoulders of giants.
There's no easy summary for your question except to point out that pandas is open source, and its implementation of the DataFrame can be found at https://github.com/pandas-dev/pandas.
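That said, if you just want the flavor of the interface, here is a very minimal pure-Python sketch (a dict of column lists plus an index list). It is purely illustrative and covers only a sliver of what pandas does:
class SimpleFrame:
    def __init__(self, columns, index=None):
        # columns: {"name": [values, ...]}; index: optional row labels
        self.columns = columns
        n = len(next(iter(columns.values()), []))
        self.index = list(index) if index is not None else list(range(n))

    def __getitem__(self, name):
        # Column access, like df["col"]
        return self.columns[name]

    def row(self, label):
        # Row access by index label, like df.loc[label]
        i = self.index.index(label)
        return {name: col[i] for name, col in self.columns.items()}

sf = SimpleFrame({"price": [10, 20]}, index=["a", "b"])
print(sf["price"])   # [10, 20]
print(sf.row("b"))   # {'price': 20}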
I'm a Python user and I'm quite lost on the task below.
Let df be a time series of 1000 stock returns.
I would like to calculate an iterating mean as below:
df[0:500].mean()
df[0:501].mean()
df[0:502].mean()
...
df[0:999].mean()
df[0:1000].mean()
How can I write efficient code for this?
Many thanks
Pandas has common transformations like this built in. See for example:
df.expanding().mean()
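If it helps, here is a minimal sketch (with a made-up returns series) of how that lines up with the windows in the question:
import pandas as pd
import numpy as np

# Hypothetical stand-in for the 1000 stock returns
df = pd.Series(np.random.randn(1000))

# Row i of the result holds df[0:i+1].mean()
expanding_mean = df.expanding().mean()

# To start at the 500-observation window from the question,
# require at least 500 values; earlier rows become NaN
expanding_mean_from_500 = df.expanding(min_periods=500).mean()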
I'm using the sample Python Machine Learning "IRIS" dataset (as the starting point of a project). This data is POSTed into a Flask web service. Thus, the key difference between what I'm doing and all the examples I can find is that I'm trying to load a pandas DataFrame from a variable, not from a file or URL, which both seem to be easy.
I extract the IRIS data from the Flask POST request.values. All good. But at that point, I can't figure out how to get a pandas DataFrame the way "pd.read_csv(....)" does. So far, it seems the only solution is to parse each row and build up several Series I can use with the DataFrame constructor? There must be something I'm missing, since reading this data from a URL is a simple one-liner.
I'm assuming reading a variable into a Pandas DataFrame should not be difficult since it seems like an obvious use-case.
I tried wrapping the data with io.StringIO(csv_data) and then calling read_csv on that variable, but that doesn't work either.
Note: I also tried things like ...
data = pd.DataFrame(csv_data, columns=['....'])
but got nothing but errors (for example, "constructor not called correctly!")
I am hoping for a simple method to call that can infer the columns and names and create the DataFrame for me, from a variable, without me needing to know a lot about Pandas (just to read and load a simple CSV data set, anyway).
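For reference, here is a minimal sketch of the io.StringIO approach I'm attempting; the CSV content and column names are just an illustration:
import io
import pandas as pd

# csv_data stands in for the text extracted from request.values
csv_data = "sepal_length,sepal_width,species\n5.1,3.5,setosa\n4.9,3.0,setosa"

# Wrap the string in a file-like buffer so read_csv can parse it
df = pd.read_csv(io.StringIO(csv_data))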
Forgive me if my question is too general, or if it's been asked before. I've been tasked to manipulate (e.g. copy and paste several ranges of entries, perform calculations on them, and then save them all to a new CSV file) several large datasets in Python 3.
What are the pros/cons of using the csv and pandas libraries for this?
Thanks in advance.
I have not used the csv library, but many people are enjoying the benefits of pandas. Pandas provides a lot of the tools you'll need, and it is built on NumPy. You can then easily use more advanced libraries for all sorts of analysis (sklearn for machine learning, NLTK for NLP, etc.).
For your purposes, you'll find it easy to manage different CSVs, merge, concatenate, and do whatever you want, really.
Here's a link to a quick start guide; lots of other resources are out there as well:
http://pandas.pydata.org/pandas-docs/stable/10min.html
Hope that helps a little bit.
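To give a feel for it, here is a minimal sketch of the read, transform, and write workflow you describe; the file and column names are made up for illustration:
import pandas as pd

# Read the source data into a DataFrame
df = pd.read_csv('sales.csv')

# Copy a range of rows and add a calculated column
subset = df.iloc[100:200].copy()
subset['total'] = subset['price'] * subset['quantity']

# Save the result to a new CSV file
subset.to_csv('sales_subset.csv', index=False)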
You should always try, as much as possible, to build on the work other people have already done for you (such as the pandas library); it saves you a lot of time. Pandas has a lot to offer when you want to process such files, so it seems to me to be the best way to deal with them. Since the question is very general, I can also only give a general answer... When you use pandas you will need to read more of the documentation, but I would not call that a downside.
Still new to this, sorry if I ask something really stupid. What are the differences between a Python OrderedDict and a pandas Series?
The only difference I could think of is that an OrderedDict can have nested dictionaries within the data. Is that all? Is that even true?
Would there be a performance difference between using one vs the other?
My project is a sales forecast, most of the data will be something like: {Week 1 : 400 units, Week 2 : 550 units}... Perhaps an ordered dictionary would be redundant since input order is irrelevant compared to Week#?
Again I apologize if my question is stupid, I am just trying to be thorough as I learn.
Thank you!
-Stephen
Most importantly, pd.Series is part of the pandas library, so it comes with a lot of added functionality - see the attributes and methods as you scroll down the pd.Series docs. Compare that with the OrderedDict docs.
For your use case, using pd.Series or pd.DataFrame (which could be a way of using nested dictionaries, as it has an index and multiple columns) seems quite appropriate. If you take a look at the pandas docs, you'll also find quite comprehensive time series functionality that should come in handy for a project around weekly sales forecasts.
Since pandas is built on numpy, the specialized scientific computing package, performance is quite good.
OrderedDict is implemented as part of Python's collections module. These collections are very fast containers for specific use cases. If you were looking only for dictionary functionality (like ordering, in this case), I would go with that. But you say you are going to do deeper analysis in an area pandas was really made for (e.g. plotting, filling missing values), so I would recommend going with pandas.Series.
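To make the comparison concrete, here is a small sketch using the sales data from your question:
from collections import OrderedDict
import pandas as pd

# The weekly sales data from the question
sales = OrderedDict([('Week 1', 400), ('Week 2', 550)])

# The same data as a pandas Series: keys become the index,
# and analysis methods come for free
s = pd.Series(sales)
print(s.mean())        # 475.0
print(s.pct_change())  # week-over-week change (NaN for the first week)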
How can I easily handle uncertainties on a Series or DataFrame in pandas (the Python Data Analysis Library)? I recently discovered the Python uncertainties package, but I am wondering if there is any simpler way to manage uncertainties directly within pandas. I didn't find anything about this in the documentation.
To be more precise, I don't want to store the uncertainties as a new column in my DataFrame, because I think they are part of a data series and shouldn't be logically separated from it. For example, it doesn't make sense to delete a column in a DataFrame but not its uncertainties, so I would have to handle that case by hand.
I was looking for something like data_frame.uncertainties which could work like the data_frame.values attribute. A data_frame.units (for data units) would be great too but I think those things don't exist in Pandas (yet?)...
If you really want it to feel built in, you can create a class to wrap your DataFrame. Then you can define whatever attributes or methods you want. Below is a quick example, but you could easily add a units definition or a more complicated uncertainty formula.
import pandas as pd

data = {'target_column': [100, 105, 110]}

class data_analysis():
    def __init__(self, data, percentage_uncertainty):
        # Wrap the raw data in a DataFrame
        self.df = pd.DataFrame(data)
        # Store the uncertainty as a fixed percentage of each value
        self.uncertainty = percentage_uncertainty * self.df['target_column'].values
When I run
example = data_analysis(data, .01)
example.uncertainty
I get out
array([1. , 1.05, 1.1 ])
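And as mentioned, a units definition could be bolted on the same way; here is a hypothetical extension of the class above:
import pandas as pd

class data_analysis_with_units():
    def __init__(self, data, percentage_uncertainty, units):
        self.df = pd.DataFrame(data)
        self.uncertainty = percentage_uncertainty * self.df['target_column'].values
        # Carry units alongside the data, e.g. {'target_column': 'm/s'}
        self.units = units

example = data_analysis_with_units(data, .01, {'target_column': 'm/s'})
example.units['target_column']  # 'm/s'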
Hope this helps