I am a pandas newbie and I want to make a graph from a CSV file I have. The CSV contains dates, and I want to graph how frequently each date appears.
This is how it looks:
2022-01-12
2022-01-12
2022-01-12
2022-01-13
2022-01-13
2022-01-14
Here, we can see that I have three records on the 12th of January, two records on the 13th, and only one record on the 14th. So we should see a decrease on the graph.
So I tried converting my CSV like this:
date,records
2022-01-12,3
2022-01-13,2
2022-01-14,1
And then making a graph with the date as the x axis and the record count as the y axis.
But is there a way pandas (or matplotlib, I never understand which one to use) can make a graph based on the frequency of appearance, so that I don't have to convert the CSV first?
There is a pandas function that counts the number of occurrences of each value.
First off, you'd need to read your csv file into a dataframe. Do this by using:
import pandas as pd
df = pd.read_csv("~csv file name~")
Using the unique() function in the pandas library, you can get all of the unique values. The syntax should look like:
uniqueVals = df["~column name~"].unique()
That should return an array of all the unique values. Then use value_counts() on the column and index the result with the value you want to count. The syntax should look something like this:
totalOfVals = []
for date in uniqueVals:
    numDate = df["~column name~"].value_counts()[date]
    totalOfVals.append(numDate)
Then you can use the two arrays you have for the unique dates and the amount of dates there are to then use matplotlib to create a graph.
You'll want to use the syntax:
import matplotlib.pyplot as mpl
mpl.plot(uniqueVals, totalOfVals, color = "~whatever colour you want the line to be~", marker = "~whatever you want the marker to look like~")
mpl.xlabel('Date')
mpl.ylabel('Number of occurrences')
mpl.title('Number of occurrences of dates')
mpl.grid(True)
mpl.show()
And that should display a graph with all the dates and their number of occurrences, with a grid behind it. Of course, if you don't want the grid, either set mpl.grid to False or remove that line.
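As a side note, the unique()/loop combination above can usually be collapsed into a single chained call. A minimal sketch, assuming the CSV column is named "date" (adjust to your actual header):

```python
import pandas as pd

# Stand-in for pd.read_csv(...); assumes a column named "date".
df = pd.DataFrame({"date": ["2022-01-12", "2022-01-12", "2022-01-12",
                            "2022-01-13", "2022-01-13", "2022-01-14"]})

# value_counts() counts each unique date; sort_index() restores
# chronological order (value_counts sorts by count by default).
counts = df["date"].value_counts().sort_index()
print(counts)

# With matplotlib installed, the Series plots directly:
# counts.plot(xlabel="Date", ylabel="Number of occurrences")
```

This produces exactly the date,records table from the question without writing an intermediate CSV.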
I have a really large dataset from beating two laser frequencies and reading out the beat frequency with a freq. counter.
The problem is that I have a lot of outliers in my dataset.
Filtering is not an option, since filtering/removing outliers kills precious information for the Allan deviation I use to analyze my beat frequency.
The problem with removing the outliers is that I want to compare the Allan deviations of three different beat frequencies. If I now remove some points, I will have a shorter x-axis than before, and my Allan deviation x-axis will scale differently. (The ADEV basically builds up a new x-axis, starting with intervals of my sample rate up to my longest measurement time, which is my highest beat-frequency x-axis value.)
Sorry if this is confusing; I wanted to give as much information as possible.
So anyway, what I have done so far is get my whole Allan deviation to work and remove outliers successfully, by chopping my list into intervals and comparing all y-values of each interval to the standard deviation of the interval.
What I want to change now is that, instead of removing the outliers, I want to replace them with the mean of their previous and next neighbours.
Below you can find my test code for a list with outliers; it seems to have a problem using numpy where, and I don't really understand why.
The error is given as "'numpy.int32' object has no attribute 'where'". Do I have to convert my dataset to a pandas structure?
What the code does is search for values above/below my threshold, replace them with NaN, and then replace the NaN with my mean. I'm not really familiar with NaN replacement, so I would be very grateful for any help.
import numpy as np

l = np.array([[0,4],[1,3],[2,25],[3,4],[4,28],[5,4],[6,3],[7,4],[8,4]])
print(*l)
sd = np.std(l[:,1])
print(sd)
for i in l[:,1]:
    if l[i,1] > sd:
        print(l[i,1])
        l[i,1].where(l[i,1].replace(to_replace=l[i,1], value=np.nan),
                     other=(l[i,1].fillna(method='ffill') + l[i,1].fillna(method='bfill'))/2)
So what I want is a list/array with the outliers replaced by the means of their previous/following neighbours.
Error message: 'numpy.int32' object has no attribute 'where'
One option is indeed to transform all the work into pandas, just with
import pandas as pd
dataset = pd.DataFrame({'Column1':data[:,0],'Column2':data[:,1]})
That will solve the error, since a pandas DataFrame object has a where method.
However, that is not obligatory, and we can still operate with plain numpy.
For example, the easiest way to detect outliers is to check whether values fall outside the range mean ± k·std (the code below uses k = 2).
Code example below, using your setup:
import numpy as np

l = np.array([[0,4],[1,3],[2,25],[3,4],[4,28],[5,4],[6,3],[7,4],[8,4]])
std = np.std(l[:,1])
mean = np.mean(l[:,1])
for i in range(len(l[:,1])):
    if (l[i,1] <= mean + 2*std) & (l[i,1] >= mean - 2*std):
        pass
    else:
        if (i != len(l[:,1]) - 1) & (i != 0):
            l[i,1] = (l[i-1,1] + l[i+1,1]) / 2
        else:
            l[i,1] = mean
What we did here: first, check whether the value is an outlier, at the lines
if (l[i,1] <= mean + 2*std) & (l[i,1] >= mean - 2*std):
    pass
Then check that it is not the first or last element:
if (i != len(l[:,1]) - 1) & (i != 0):
If it is the first or last element, just put the mean into the field:
else:
    l[i,1] = mean
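For completeness, the pandas route mentioned at the top of this answer can be sketched with mask() (the inverse of where()) plus interpolate(), which fills each NaN from its neighbours. Note that with the 2·std threshold, only the value 28 is actually flagged on this test array (|25 − mean| ≈ 16.2 is within 2·std ≈ 19.0), exactly as in the loop version:

```python
import numpy as np
import pandas as pd

l = np.array([[0,4],[1,3],[2,25],[3,4],[4,28],[5,4],[6,3],[7,4],[8,4]])
s = pd.Series(l[:, 1], dtype=float)

# Flag values more than 2 standard deviations from the mean as NaN,
# then linearly interpolate: each interior NaN becomes the mean of its
# previous and next valid neighbours.
mean, std = s.mean(), s.std(ddof=0)  # ddof=0 matches np.std
cleaned = s.mask((s - mean).abs() > 2 * std).interpolate()
```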
I have a relatively clean data set with two columns and no gaps, a snapshot is shown below:
I run the following line of code:
correlation = pd.rolling_corr(data['A'], data['B'], window=120)
and for some reason, this outputs a dataframe (shown as a plot below) with large gaps in it:
I haven't personally seen this issue before, and after reviewing the data (more than the code) I am not sure what the issue could be.
It happens due to missing dates in the time series (weekends etc.); evidence of this in your example is 7/2/2003 -> 10/2/2003. One solution is to fill in these gaps by re-indexing the time-series dataframe.
df.index = pd.DatetimeIndex(df.index) # required
df = df.asfreq('D') # reindex will include missing days
df = df.fillna(method='bfill') # fill / interpolate NaNs
corr = df.A.rolling(30).corr(df.B) # no gaps
You are getting NaN values in your correlation variable wherever fewer rows are available than the value of the window attribute.
import pandas as pd
import numpy as np

data = pd.DataFrame({'A': np.random.randn(10), 'B': np.random.randn(10)})
# Note: pd.rolling_corr was deprecated in pandas 0.18 and later removed;
# the equivalent is data['A'].rolling(window=3).corr(data['B']).
correlation = pd.rolling_corr(data['A'], data['B'], window=3)
print(correlation)
0 NaN
1 NaN
2 0.852602
3 0.020681
4 -0.915110
5 -0.741857
6 0.173987
7 0.874049
8 -0.874258
9 -0.835340
The docs for this function warn about this in the min_periods attribute section: "Minimum number of observations in window required to have a value (otherwise result is NA)."
With the default min_periods=None, the minimum is taken to be the window size itself, which is why the first window − 1 rows come out as NaN; set min_periods to a smaller value if you want (less reliable) estimates for those early rows.
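A small sketch of this effect with the modern rolling API (hypothetical data), showing how min_periods controls where the NaN rows end:

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 4.0, 3.0, 5.0])
b = pd.Series([2.0, 1.0, 3.0, 5.0, 4.0])

# Default min_periods equals the window: the first window-1 results are NaN.
strict = a.rolling(window=3).corr(b)

# min_periods=2 yields a (less reliable) value as soon as two points exist.
relaxed = a.rolling(window=3, min_periods=2).corr(b)
```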
I have a pandas.DataFrame indexed by time, as seen below. The other column contains data recorded from a device measuring current. I want to filter the second column with a low-pass filter with a 5 Hz cutoff to eliminate high-frequency noise. I want to get a dataframe back, but I do not mind if it changes type for the application of the filter (numpy array, etc.).
In [18]: print df.head()
Time
1.48104E+12 1.1185
1.48104E+12 0.8168
1.48104E+12 0.8168
1.48104E+12 0.8168
1.48104E+12 0.8168
I am graphing this data by df.plot(legend=True, use_index=False, color='red') but would like to graph the filtered data instead.
I am using pandas 0.18.1 but I can change.
I have visited https://oceanpython.org/2013/03/11/signal-filtering-butterworth-filter/ and many other sources of similar approaches.
Perhaps I am over-simplifying this, but you can create a simple condition, create a new dataframe with the filter applied, and then build your graph from the new dataframe, basically reducing the dataframe to only the records that meet the condition. I admit I do not know what the exact cutoff for high frequency is, but let's assume your second column is named "Frequency":
condition = df["Frequency"] < 1.0
low_pass_df = df[condition]
low_pass_df.plot(legend=True, use_index=False, color='red')
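That said, since the question asks for a 5 Hz low-pass filter in the signal-processing sense, a sketch with scipy.signal's Butterworth filter may be closer to the goal. The sample rate (here 100 Hz), column name, and test signal are all assumptions for illustration:

```python
import numpy as np
import pandas as pd
from scipy.signal import butter, filtfilt

# Hypothetical data: 1 s of a 1 Hz signal plus 30 Hz noise at fs = 100 Hz.
fs = 100.0
t = np.arange(0, 1, 1 / fs)
df = pd.DataFrame({'Current': np.sin(2 * np.pi * t)
                              + 0.5 * np.sin(2 * np.pi * 30 * t)})

# 4th-order Butterworth low-pass with a 5 Hz cutoff; filtfilt runs the
# filter forward and backward so the result has no phase shift.
b, a = butter(N=4, Wn=5.0, btype='low', fs=fs)
df['Filtered'] = filtfilt(b, a, df['Current'].to_numpy())
```

df['Filtered'].plot(legend=True, use_index=False, color='red') then graphs the smoothed series in place of the raw one.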
I'm trying to determine what kinds of corrections a market makes in response to changes.
A simple version of this using Python, Pandas and matplotlib might look like:
from pandas import *
ts = read_csv("time_series.csv")
chng = ts / ts.shift(1)
chng.name="current"
f = chng.shift(-1)
f.name="future"
frame = concat([chng, f], axis=1)
frame.groupby(frame.current.round(2)).future.mean().plot()
For example, if my data set had a strong habit of correcting changes back to the original value (within 1 tick), the output of the code above might show a negative correlation.
The problem with this method is that it can only show what the response is over a fixed time frame (i.e. I can change the amount that the future data set is shifted, but it can only be one value).
What I would like to do is divide the initial values into buckets and show a trendline for how each range of initial values was received over time (1 tick later, 3 ticks later, etc).
How could I go about doing that?
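To sketch the intent (purely illustrative: the random series, bucket count, and offsets are assumptions), one could bucket the current change with pd.cut and compute the mean future change for several shift offsets at once:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Stand-in for the "current" change series from the snippet above.
chng = pd.Series(rng.normal(1.0, 0.01, 1000))

# Bucket current changes into 10 ranges, then take the mean response
# 1, 3 and 5 ticks later within each bucket.
bucket = pd.cut(chng, bins=10)
response = pd.DataFrame({
    f'+{k} ticks': chng.shift(-k).groupby(bucket, observed=True).mean()
    for k in (1, 3, 5)
})
# response.plot() would then draw one trendline per horizon.
```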