powerlaw Python package errors - python

I'm currently trying to fit a set of (positive) data with the powerlaw.Fit() function from the powerlaw package. However, every single time I do this I obtain the following message:
<powerlaw.Fit at 0x25eac6d3e80>
I've been trying to figure out what this means for ages, but without success. Another issue I've been facing is that when I plot my CCDF using
powerlaw.plot_ccdf()
and my PDF using
powerlaw.plot_pdf()
with my data, I only obtain a plot for the CCDF but nothing for the PDF. Why is all of this happening? My data is in a NumPy array and looks as follows:
array([ 9.90857053e-06, 3.45336391e-05, 4.06757403e-05, ...,
6.91411789e-02, 6.92511375e-02, 7.45046008e-02])
I doubt there is any issue with my data, since, as I said, the CCDF plots just fine. Any kind of help would be highly appreciated. Thanks in advance. (Edit: the data is composed of 1908 non-integer values.)

It probably helps to read the documentation. http://pythonhosted.org/powerlaw/
powerlaw.Fit is a class, so when you call powerlaw.Fit(...), you will get an object with associated methods. Save the object in a variable, then pull the results you want from it. For example:
results = powerlaw.Fit(data)
print(results.power_law.alpha)
print(results.power_law.xmin)
The 'message' you are getting is not an error at all: it is just the default repr of the Fit object, printed because you did not assign the object to anything.
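As a minimal sketch of the usual workflow (assuming data is the 1-D array of positive values shown above; exact plotting keywords may differ across powerlaw versions), you create the Fit object once and then read results and draw plots from it. The Fit object exposes plot_pdf and plot_ccdf methods of its own:
import powerlaw
import matplotlib.pyplot as plt

# data: your 1-D NumPy array of positive values
results = powerlaw.Fit(data)

# Fitted parameters live on the Fit object
print(results.power_law.alpha)  # estimated exponent
print(results.power_law.xmin)   # estimated lower cutoff

# Draw the empirical CCDF and PDF on the same axes
fig, ax = plt.subplots()
results.plot_ccdf(ax=ax, label='CCDF')
results.plot_pdf(ax=ax, label='PDF')
ax.legend()
plt.show()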

What does ... mean in Python?

I am an elementary Python programmer and have been using a module called "Pybaseball" to analyze sabermetrics data. While using this module, I ran into a problem when trying to retrieve information. The module reads stats from a baseball stats site into a table for ease of use, but some of the information is not shown and is instead replaced with a "...". An example of this is shown:
from pybaseball import batting_stats_range
data = batting_stats_range('2017-05-01', '2017-05-08')
print(data.head())
I should be getting:
https://github.com/jldbc/pybaseball#batting-stats-hitting-stats-for-players-within-seasons-or-during-a-specified-time-period
But the information is cut off from 'TM' all the way to 'CS' and is replaced with a ... in my output. Can someone explain to me why this happens and how I can prevent it?
As the docs state, head() is meant for "quickly testing if your object has the right type of data in it." So it is expected that some data may not show, because it is collapsed.
If you need to analyze the data with more detail you can access specific columns with other methods.
For example, using .iloc[]. You can read more about it here, but essentially you can "ask" for a slice of those columns and then apply a new slice to get only n rows.
Another example would be .loc[], docs here. The main difference is that .loc[] filters data by labels (column names) instead of by the numerical position of the columns. You can filter down to a subset of specific columns and then take a sample of rows from that.
So, to answer your question: "..." is pandas' way of collapsing data in order to give a prettier view of the results.
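If you simply want the full table printed, you can also tell pandas not to collapse wide frames at all. A short sketch (the 'Tm' and 'CS' labels come from the question; verify the exact names with data.columns):
import pandas as pd
from pybaseball import batting_stats_range

data = batting_stats_range('2017-05-01', '2017-05-08')

# Tell pandas not to collapse wide frames into "..."
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
print(data.head())

# Or pull out just the columns you care about, by position...
print(data.iloc[:5, 10:20])

# ...or by label (exact names are an assumption -- check data.columns)
print(data.loc[:, 'Tm':'CS'].head())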

Assistance with Keras for a noise detection script

I'm currently trying to learn more about deep learning, CNNs, and Keras through what I thought would be a quite simple project: training a CNN to detect a single specific sound. It's been a lot more of a headache than I expected.
I'm currently reading through this, ignoring the second section about GPU usage; the first part definitely seems like exactly what I need. But when I go to run the script (my script is pretty much entirely lifted from the section in the link above that says "Putting the pieces together, you may end up with something like this:"), it gives me this error:
AttributeError: 'DataFrame' object has no attribute 'file_path'
I can't find anything in the pandas documentation about a DataFrame.file_path attribute, so I'm confused as to what that part of the code is attempting to do.
My CSV file contains two columns: one with the file paths, and a second column labeling each file as either positive or negative.
Sidenote: I'm also aware that this entire guide just may not be the thing I'm looking for. I'm having a very hard time finding any material that is useful for the specific project I'm trying to do and if anyone has any links that would be better I'd be very appreciative.
The expression df.file_path means that you want to access the file_path column of your DataFrame. It seems that your DataFrame does not contain a column with that name. With df.head() you can check whether your DataFrame contains the needed columns.
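A minimal sketch of one way to make that attribute resolve, assuming your CSV has no header row (the file name labels.csv and the column names are hypothetical; adjust them to your file):
import pandas as pd

# Give the two columns explicit names while reading the CSV
# ('labels.csv' is a placeholder for your actual file)
df = pd.read_csv('labels.csv', names=['file_path', 'label'])

print(df.head())        # verify the columns now exist
print(df.file_path[0])  # attribute access resolves to the column
If your CSV already has a header with different names, df.rename(columns={'your_path_column': 'file_path'}) gets you to the same place.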

groupby for pandas data frame gives wrong results

I am trying to replicate, in Python, a paper whose code was written in Stata, for my course project. I am having difficulty replicating the results of a collapse command in their do-file. The corresponding line in the do-file is
collapse lexptot, by(clwpop right)
while I have
df.groupby(['cwpop', 'right'])['lexptot'].agg(['mean'])
The lexptot variable is the logarithm of a variable exptot, which I calculated previously using np.log(dfs['exptot']).
Does anyone have an idea what is going wrong here? The means I calculate are typically around 1.5 higher than the means calculated in Stata.
Once you update the question with more relevant details I may be able to answer more fully, but this is what I think might help you:
df.groupby(['cwpop', 'right'])['lexptot'].mean()
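One thing worth ruling out, sketched on toy data rather than your actual frame: averaging logs is not the same as logging averages, so make sure your code and the do-file apply the transformations in the same order. (Also note the group key is spelled clwpop in the do-file but cwpop in your Python code.)
import numpy as np
import pandas as pd

# A toy frame, purely to illustrate: mean-of-logs != log-of-means
df = pd.DataFrame({
    'cwpop': [0, 0, 1, 1],
    'right': [0, 0, 1, 1],
    'exptot': [100.0, 300.0, 50.0, 450.0],
})
df['lexptot'] = np.log(df['exptot'])

mean_of_logs = df.groupby(['cwpop', 'right'])['lexptot'].mean()
log_of_means = np.log(df.groupby(['cwpop', 'right'])['exptot'].mean())

print(mean_of_logs)  # what collapse lexptot, by(...) computes
print(log_of_means)  # what you get if you log after averaging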

Preprocessing data for Time-Series prediction

Okay, so I am doing research on how to do time-series prediction. As always, it's preprocessing the data that's the difficult part. I get that I have to convert the "time-stamp" in a data file into a "datetime" or "timestep", and I did that:
df = pd.read_csv("airpassengers.csv")
month = pd.to_datetime(df['Month'])
(I may have parsed the datetime incorrectly; I've seen people use pd.read_csv() itself to parse the dates instead. If I should, please advise on how to do it properly.)
I also understand the part where I scale my data. (Could someone explain to me how the scaling works? I know that it maps all my data into the range I give it, but will the output of my prediction also come back scaled?)
Lastly, once I have the scaled data and parsed timestamps, how would I actually predict with the trained model? I don't know what to pass into, for example, model.predict().
I did some research, and it seems like I have to shift my dataset or something, but I don't really understand what the documentation is saying, and the example isn't directly related to time-series prediction.
I know this is a lot, and you might not be able to answer all the questions. I am fairly new to this, so just help with whatever you can. Thank you!
So, because you're working with airpassengers.csv and asking about predictive modeling, I'm going to assume you're working through this GitHub example.
There are a couple of things I want to make sure you know before I dive into the answers to your questions.
There are lots of different types of predictive models used in forecasting; you can find out all about them here.
You're asking a lot of broad questions, so I'll break the main ones down into two steps and describe what's happening using the example I believe you're trying to replicate.
Let's break it down
Loading and parsing the data
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 15, 6
air_passengers = pd.read_csv("./data/AirPassengers.csv", header = 0, parse_dates = [0], names = ['Month', 'Passengers'], index_col = 0)
This section of code loads the data from a .csv (comma-separated values) file into the data frame air_passengers. In the call that reads the csv we also state that the header is in the first row, the first column contains dates to be parsed, we assign names to the two columns, and we set the first column as the index of the data frame.
Scaling the data
log_air_passengers = np.log(air_passengers.Passengers)
This is done to make the math better behaved. Logarithms are the inverse of exponentiation (if y = 2^x, then x = log2(y)). NumPy's log function gives us the natural log, i.e. the log with base e. A handy side effect is that differences between logged values closely approximate percentage changes, so you can read them as such.
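A quick numeric check of that last claim, with toy numbers rather than the AirPassengers data:
import numpy as np

# A 5% increase from 100 to 105...
print(np.log(105) - np.log(100))  # ~0.0488, close to the 0.05 percent change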
Now that the data has been scaled, we can prep it for statistical modeling
log_air_passengers_diff = log_air_passengers - log_air_passengers.shift()
log_air_passengers_diff.dropna(inplace=True)
This changes the data frame to hold the difference between each data point and the previous one (the first difference of the logged values) instead of the log values themselves.
The last part of your question contains too many steps to cover here, and it is not as simple as calling a single function. I encourage you to keep learning from here.
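To give a feel for that last step anyway, here is a minimal sketch of how predictions made on the log-differenced scale get mapped back to passenger counts. The naive "model" that just repeats the last observed difference is a placeholder assumption, not the guide's actual model:
import numpy as np
import pandas as pd

# Pretend these are a model's predictions on the log-differenced scale
# (faked here by repeating the last observed difference, as a placeholder)
predicted_diffs = pd.Series([log_air_passengers_diff.iloc[-1]] * 3)

# Undo the differencing: cumulative sum, offset by the last known log value
predicted_logs = log_air_passengers.iloc[-1] + predicted_diffs.cumsum()

# Undo the log scaling: exponentiate back to passenger counts
print(np.exp(predicted_logs))
This also answers your scaling question: a model trained on transformed data predicts on the transformed scale, and you invert the transformations to get back to the original units.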

How to find the minimum value of netCDF array excluding zero

I am using Python 2 and dealing with netCDF data.
The array is a variable called cloud water mixing ratio, an output from the WRF climate model that has 4 dimensions:
QC(time (25), vertical level (69), latitude (119), longitude (199))
I'm trying to get the minimum of the values in this array. From initial analysis using the NCVIEW visualisation, I found that the minimum value is approximately 1×10⁻⁵ and the maximum is 1×10⁻³.
I've used
var = fh.variables['QC']
var[:].max()
var[:].min()
The max works fine, but the min gives me 0.0.
Then I tried a solution from here, which is
var[var>0].min()
but I also get zero. Then I realised that the above code was written for arrays containing negatives, while mine has none.
I've tried looking for solutions here and there but found nothing that works for my situation. Please, if anyone could point me to the right directions, I'd appreciate it a lot.
Thanks.
var[var>0].min is a function, so you need to call it using parentheses:
var[var>0].min() should work much better.
Sorry for not being able to post the data; I don't have the privilege to share it. I tried creating a random 4-D array similar to the data and used all the solutions you provided, especially the one by @Joao Abrantes, and they all seemed to work fine. So I thought maybe there was some problem with the data.
Fortunately, there is nothing wrong with the data. I discussed this with my friend and we finally found the solution.
The solution is
qc[:][qc[:]>0].min()
I have to specify the [:] after the variable instead of just doing
qc[qc>0].min()
There is also another way, which is to convert the variable into a NumPy array first, because qc = fh.variables['QC']
returns a netCDF4.Variable. Adding the second line qc2 = qc[:] turns it into a numpy.ndarray:
qc = fh.variables['QC']
qc2 = qc[:] # create numpy array
qc2[qc2>0].min()  # minimum over strictly positive values
I'm sorry if my question was not clear when I posted it yesterday; I only learned about all this today.
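For anyone who wants to reproduce the pattern without access to the data, a stand-in with a random 4-D array behaves the same way once you are working with a plain NumPy array (the shape matches QC above, but the values and the fraction of zeros are made up):
import numpy as np

# Hypothetical stand-in for QC(time, level, lat, lon)
qc2 = np.random.rand(25, 69, 119, 199)
qc2[qc2 < 0.3] = 0.0  # inject zeros, as in the real field

print(qc2.min())           # 0.0 -- the zeros dominate the plain minimum
print(qc2[qc2 > 0].min())  # smallest strictly positive value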
