I am graphing data that is stored in a CSV. I pull two columns of data into a DataFrame, convert to a Series, and graph with matplotlib.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('Proxy/Proxy_Analytics/API_Statistics.csv')
df
Date Distinct_FLD Not_On_MM API_Call_Count Cost CACHE_Count
0 2018-11-12 35711 18468 18468 8.31060 35711
1 2018-11-13 36118 18741 11004 4.95180 46715
2 2018-11-14 34073 17629 8668 3.90060 55383
3 2018-11-15 34126 17522 7817 3.51765 63200
#Cost
cost_df = df[['Date','Cost']]
cost_series = cost_df.set_index('Date')['Cost']
plt.style.use('dark_background')
plt.title('Domain Rank API Cost Over Time')
plt.ylabel('Cost in Dollars')
cost_series.plot(c = 'red')
plt.show()
And this works totally fine. I would like to do the same and graph multiple columns, but when I try to convert the df to a Series I get an error:
#Not Cost
not_cost = df[['Date','Distinct_FLD','Not_On_MM','API_Call_Count','CACHE_Count']]
not_cost_series = not_cost.set_index('Date')['Distinct_FLD','Not_On_MM','API_Call_Count','CACHE_Count']
Error:
KeyError: ('Distinct_FLD', 'Not_On_MM', 'API_Call_Count', 'CACHE_Count')
What can I do to fix this?
It seems that you are trying to convert the columns of a DataFrame into multiple Series, indexed by the 'Date' column of your DataFrame.
Maybe you can try:
not_cost = df[['Date','Distinct_FLD','Not_On_MM','API_Call_Count','CACHE_Count']]
not_cost_series = not_cost.set_index('Date')
Distinct_FLD = not_cost_series['Distinct_FLD']
Not_On_MM = not_cost_series['Not_On_MM']
...
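For what it's worth, the KeyError comes from passing a bare tuple of column names; selecting several columns needs a list (double brackets), which yields a DataFrame rather than a Series. A minimal sketch with made-up numbers shaped like the frame above:

```python
import pandas as pd

# Hypothetical data shaped like the question's CSV
df = pd.DataFrame({
    'Date': ['2018-11-12', '2018-11-13', '2018-11-14'],
    'Distinct_FLD': [35711, 36118, 34073],
    'Not_On_MM': [18468, 18741, 17629],
})

# A list of column names selects a DataFrame;
# a bare tuple of names is what raised the KeyError
not_cost = df.set_index('Date')[['Distinct_FLD', 'Not_On_MM']]

# Each column is then an individual Series indexed by Date
Distinct_FLD = not_cost['Distinct_FLD']
print(type(Distinct_FLD).__name__)  # Series
```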
Pretty straightforward question here.
I'm loading data in from a csv. The csv column for age is then converted into a histogram. Finally I'm showing a graph and the data is populated to it.
For the life of me though, I don't understand how the matplotlib plt is getting the data from the pandas command dftrain.age.hist() without me explicitly passing it in.
Is hist an extension method? That's the only thing that makes sense to me currently.
import pandas as pd
import matplotlib.pyplot as plt
#load csv files
##training data
dftrain = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/train.csv')
#generate a histogram of ages
dftrain.age.hist()
#show the graph
plt.show()
So according to this article I think that internally the pandas hist function is calling the matplotlib hist function:
https://www.educba.com/pandas-hist/
The pandas hist() function is used to draw histograms in Python. A
histogram is a representation of the distribution of data. The
function calls matplotlib.pyplot.hist() on each Series in the
DataFrame, resulting in one histogram per column.
Every pandas DataFrame is an instance of a Python class, so you can access its attributes as with any other Python object.
So when you write dftrain.age, you are accessing that column, and dftrain.age.hist() generates a histogram of that column's values.
For example:
import pandas as pd
# Creating the DataFrame
df = pd.DataFrame({'Weight': [45, 88, 56, 15, 71],
                   'Name': ['Sam', 'Andrea', 'Alex', 'Robin', 'Kia'],
                   'Age': [14, 25, 55, 8, 21]})
print("Type of a dataframe: ",type(df))
print("Type of a dataframe column: ",type(df.Age))
print("Printing that column\n",df.Age)
Output will be this:
Type of a dataframe: <class 'pandas.core.frame.DataFrame'>
Type of a dataframe column: <class 'pandas.core.series.Series'>
Printing that column
0 14
1 25
2 55
3 8
4 21
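To make the connection to matplotlib explicit: pandas draws on matplotlib's current figure, which is why plt.show() displays the histogram without it being passed anywhere. A small sketch (the Agg backend is an assumption so it runs without a display):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend, assumed so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'age': [22, 38, 26, 35, 35, 54, 2, 27]})

# Series.hist() draws onto matplotlib's current figure and returns the Axes
ax = df['age'].hist()
print(ax.figure is plt.gcf())  # True: plt.show() would display this same figure
print(len(ax.patches))         # 10 bars, one per default bin
```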
I know this question has been asked many times, but it just doesn't work for me and I don't know why.
I need to see a graph without zero values, but no matter what I try it only shows a graph with them.
These are the first 5 rows from the .csv file; these are the values I need:
0 2022-10-24T12:12:35Z 44.61
1 2022-10-24T12:12:40Z 17.33
2 2022-10-24T12:12:45Z 41.20
3 2022-10-24T12:12:51Z 33.49
4 2022-10-24T12:12:56Z 55.68
29944 2022-11-03T09:51:58Z 0.0
29945 2022-11-03T09:52:28Z 0.0
29946 2022-11-03T09:52:58Z 0.0
29947 2022-11-03T09:53:28Z 0.0
29948 2022-11-03T09:53:58Z 0.0
These are the last 5 rows from the .csv file; these are the rows I need deleted.
This is my code now:
# libraries
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
import seaborn as sns
from datetime import datetime
import plotly.express as px
from sklearn.ensemble import IsolationForest
import plotly.graph_objects as go
headers = ['time', 'value']
# Read data
#dataset = pd.read_csv("raw_input_filtered.csv", parse_dates=[0])
dataset2 = pd.read_csv("coffe-2col.csv", sep=';', skiprows=1, names=headers) #, dtype=dtypes)
# Printing head of the DataFrame
print(dataset2.head())
# remove empty lines
#groups = dataset2.groupby((dataset2.drop('value', axis= 1) == 0).all(axis=1))
#all_zero = groups.get_group(True)
#non_all_zero = groups.get_group(False)
new_dataset2 = dataset2.dropna()
#dataset2 = dataset2.drop(dataset2[dataset2.value == 0].index)
#dataset2.drop(dataset2[dataset2.value == 0].index, inplace=True)
None of these worked.
# select rows that have value 0,
# then take the negation of it and keep the rest using .loc
dataset2 = dataset2.loc[~dataset2['value'].eq(0)]
This keeps every row that is not zero in all of its columns:
dataset2 = dataset2.loc[~(dataset2 == 0).all(axis=1)]
Alternatively, to examine only a specific column, filter on that column directly:
dataset2 = dataset2.loc[dataset2['value'] != 0]
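A self-contained sketch of the column-based filter, with a couple of made-up rows mirroring the time/value layout above:

```python
import pandas as pd

# Hypothetical frame mirroring the question's 'time'/'value' columns
dataset2 = pd.DataFrame({
    'time': ['2022-10-24T12:12:35Z', '2022-11-03T09:51:58Z', '2022-11-03T09:52:28Z'],
    'value': [44.61, 0.0, 0.0],
})

# Keep only the rows whose 'value' column is non-zero
filtered = dataset2.loc[dataset2['value'] != 0]
print(len(filtered))  # 1
```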
My data consists of a particular OHLCV object that is a bit weird in that its fields can only be accessed by name, like this:
# rA = [<MtApi.MqlRates object at 0x000000A37A32B308>,...]
type(rA)
# <class 'list'>
ccnt = len(rA)  # 100
for i in range(ccnt):
    print('{} {} {} {} {} {} {}'.format(i, rA[i].MtTime, rA[i].Open, rA[i].High,
                                        rA[i].Low, rA[i].Close, rA[i].TickVolume))
#0 1607507400 0.90654 0.90656 0.90654 0.90656 7
#1 1607507340 0.90654 0.9066 0.90653 0.90653 20
#2 1607507280 0.90665 0.90665 0.90643 0.90653 37
#3 1607507220 0.90679 0.90679 0.90666 0.90666 22
#4 1607507160 0.90699 0.90699 0.90678 0.90678 29
with some additional formatting I have:
Time Open High Low Close Volume
-----------------------------------------------------------------
1607507400 0.90654 0.90656 0.90654 0.90656 7
1607507340 0.90654 0.90660 0.90653 0.90653 20
1607507280 0.90665 0.90665 0.90643 0.90653 37
1607507220 0.90679 0.90679 0.90666 0.90666 22
I have tried things like this:
df = pd.DataFrame(data = rA, index = range(100), columns = ['MtTime', 'Open', 'High','Low', 'Close', 'TickVolume'])
# Resulting in:
# TypeError: iteration over non-sequence
How can I convert this thing into a pandas DataFrame, so that I can plot it using the original names?
Plotting using matplotlib should then be possible with something like this:
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
...
df = pd.DataFrame(rA) # not working
df['time'] = pd.to_datetime(df['MtTime'], unit='s')
plt.plot(df['MtTime'], df['Open'], 'r-', label='Open')
plt.plot(df['MtTime'], df['Close'], 'b-', label='Close')
plt.legend(loc='upper left')
plt.title('EURAUD candles')
plt.show()
Possibly related questions (but were not helpful to me):
Numpy / Matplotlib - Transform tick data into OHLCV
OHLC aggregator doesn't work with dataframe on pandas?
How to convert a pandas dataframe into a numpy array with the column names
Converting Numpy Structured Array to Pandas Dataframes
Pandas OHLC aggregation on OHLC data
Getting Open, High, Low, Close for 5 min stock data python
Converting OHLC stock data into a different timeframe with python and pandas
One idea is use list comprehension for extract values to list of tuples:
L = [(rA[i].MtTime, rA[i].Open, rA[i].High, rA[i].Low, rA[i].Close, rA[i].TickVolume)
     for i in range(len(rA))]
df = pd.DataFrame(L, columns=['MtTime', 'Open', 'High', 'Low', 'Close', 'TickVolume'])
Or if possible:
df = pd.DataFrame({'MtTime': list(rA.MtTime), 'Open': list(rA.Open),
                   'High': list(rA.High), 'Low': list(rA.Low),
                   'Close': list(rA.Close), 'TickVolume': list(rA.TickVolume)})
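The list-comprehension approach can be checked with a small stand-in class; the `Rate` class here is an assumption, since the real objects come from the MtApi library:

```python
import pandas as pd

# Hypothetical stand-in for MtApi.MqlRates: attributes only, not iterable
class Rate:
    def __init__(self, t, o, h, l, c, v):
        self.MtTime, self.Open, self.High = t, o, h
        self.Low, self.Close, self.TickVolume = l, c, v

rA = [Rate(1607507400, 0.90654, 0.90656, 0.90654, 0.90656, 7),
      Rate(1607507340, 0.90654, 0.90660, 0.90653, 0.90653, 20)]

cols = ['MtTime', 'Open', 'High', 'Low', 'Close', 'TickVolume']
# Pull each attribute out explicitly, since the objects cannot be iterated
L = [tuple(getattr(r, c) for c in cols) for r in rA]
df = pd.DataFrame(L, columns=cols)
print(df.shape)  # (2, 6)
```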
I have converted a continuous dataset to categorical. I am getting NaN values whenever the value of the continuous data is 0.0 after conversion. Below is my code:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('NSL-KDD/KDDTrain+.txt',header=None)
data = df[33]
bins = [0.000,0.05,0.10,0.15,0.20,0.25,0.30,0.35,0.40,0.45,0.50,0.55,0.60,0.65,0.70,0.75,0.80,0.85,0.90,0.95,1.00]
category = pd.cut(data,bins)
category = category.to_frame()
print (category)
How do I convert the values so that I don't get NaN values? I have attached two screenshots for better understanding of how the actual data looks and how the converted data looks. This is the main dataset, and this is what it becomes after using bins and pandas.cut(). How can those "0.00" values stay like the other values in the dataset?
When using pd.cut, you can specify the parameter include_lowest=True. This makes the first interval left-inclusive (it will include the 0 value, since your first interval starts at 0).
So in your case, you can adjust your code to be
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('NSL-KDD/KDDTrain+.txt',header=None)
data = df[33]
bins = [0.000,0.05,0.10,0.15,0.20,0.25,0.30,0.35,0.40,0.45,0.50,0.55,0.60,0.65,0.70,0.75,0.80,0.85,0.90,0.95,1.00]
category = pd.cut(data,bins,include_lowest=True)
category = category.to_frame()
print (category)
Documentation Reference for pd.cut
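A quick way to see the effect, using a tiny made-up series:

```python
import pandas as pd

data = pd.Series([0.0, 0.03, 0.5, 1.0])
bins = [0.0, 0.05, 0.5, 1.0]

# Intervals are right-closed by default, so 0.0 falls outside (0.0, 0.05]
without = pd.cut(data, bins)
# include_lowest=True widens the first interval to take in the lowest value
with_low = pd.cut(data, bins, include_lowest=True)

print(without.isna().sum())   # 1  -> the 0.0 value became NaN
print(with_low.isna().sum())  # 0
```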
I have a large pandas dataframe, which is a log of user ids that login in a website:
id datetime
130 2018-05-17 19:46:18
133 2018-05-17 20:59:57
133 2018-05-17 21:54:01
142 2018-05-17 22:49:07
114 2018-05-17 23:02:34
136 2018-05-18 06:06:48
136 2018-05-18 12:21:38
180 2018-05-18 12:49:33
.......
120 2018-05-18 14:03:58
120 2018-05-18 15:28:36
How can I visualize the above pandas dataframe as a time series plot? For example I would like to represent the frequency of logins of each person id as a line of a different color (note that I have about 400 ids). Something like this plot (*):
I tried to:
from datetime import date
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd
# set your data as df
# strip only YYYY-mm-dd part from original `datetime` column
df3.timestamp = df3.datetime.apply(lambda x: str(x)[:10])
df3.timestamp = df3.datetime.apply(lambda x: date(int(x[:4]), int(x[5:7]), int(x[8:10])))
# plot
plt.figure(figsize=(150,10))
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
plt.gca().xaxis.set_major_locator(mdates.DayLocator())
plt.plot(df3.datetime[:800], df3.id[:800], '-')
plt.gcf().autofmt_xdate()
and
import matplotlib.dates as dates
df5 = df3.set_index('datetime')
df5.plot(x_compat=True)
plt.gca().xaxis.set_major_locator(dates.DayLocator())
plt.gca().xaxis.set_major_formatter(dates.DateFormatter('%d\n\n%a'))
plt.gca().invert_xaxis()
plt.gcf().autofmt_xdate(rotation=0, ha="center")
plt.figure(figsize=(150,10))
However, I got something like this:
Any idea of how to get a plot similar to (*)?
I've played with your sample data a little so that one user logs in on three days. The problem in your attempt is that you are trying to "just plot" the logins. If you want to see the frequency of logins, you have to calculate that. So I read the data and use a proper DateTime index, then use groupby followed by resample to calculate the frequencies. I think with 400 users this might become a bit messy, but this will do a plot of the daily logins per user.
import pandas
import io
d = """id,datetime
130,2018-05-17T19:46:18
133,2018-05-17T20:59:57
133,2018-05-17T21:54:01
142,2018-05-17T22:49:07
114,2018-05-17T23:02:34
136,2018-05-18T06:06:48
136,2018-05-18T12:21:38
130,2018-05-18T12:49:33
120,2018-05-18T14:03:58
130,2018-05-19T15:28:36"""
# for the data above, this is a quick way to parse it
df = pandas.read_csv(io.StringIO(d), parse_dates=['datetime'], index_col='datetime')
# This method is more roundabout but is perhaps useful if you have other data
df = pandas.read_csv(io.StringIO(d))
df.datetime = pandas.to_datetime(df.datetime)
df = df.set_index('datetime')
# Plot daily logins per user id
df.groupby('id').resample('D').apply(len).unstack('id').plot()
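For reference, the same daily-counts table can be built without plotting, which makes the numbers easy to inspect before deciding how to draw them. A sketch using a few of the sample rows above:

```python
import io
import pandas as pd

d = """id,datetime
130,2018-05-17T19:46:18
133,2018-05-17T20:59:57
133,2018-05-17T21:54:01
130,2018-05-18T12:49:33
130,2018-05-19T15:28:36"""

df = pd.read_csv(io.StringIO(d), parse_dates=['datetime'], index_col='datetime')

# Daily login counts per user: one row per day, one column per id
counts = df.groupby('id').resample('D').size().unstack('id', fill_value=0)
print(counts.shape)  # (3, 2): three days, two user ids
```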