I'm new to Python development and I have to implement a project on data analysis. I have a data.txt file which has the following values:
ID,name,date,confirmedInfections
DE2,BAYERN,2020-02-24,19
.
.
DE2,BAYERN,2020-02-25,19
DE1,BADEN-WÜRTTEMBERG,2020-02-24,1
.
.
DE1,BADEN-WÜRTTEMBERG,2020-02-26,7
.
.(lot of other names and data)
What I'm trying to do:
As you can see in the file above, each name represents a city with covid infections. I need to save a data frame for each city and plot a time series graph that uses the date as the index on the x-axis and confirmedInfections on the y-axis. An example:
Because the data file I was given is large and has four columns, I think I'm making a mistake in parsing the file and selecting the correct values. Here is an example of my code:
# Getting the data from Bayern city
data = pd.read_csv("data.txt", index_col="name")
first = data.loc["BAYERN"]
print(first)
# Plotting the timeseries
series = read_csv('data.txt' ,header=0, index_col=0, parse_dates=True, squeeze=True)
series.plot()
pyplot.show()
And here is a photo of the result:
As you can see, on the x-axis I get all the different IDs that are included in data.txt. Instead, I want to exclude the ID and plot the stats of each city.
Thanks for your time.
You need to parse the dates after reading the CSV:
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt
# Name the columns explicitly; header=0 skips the header row that is already in the file
headers = ['ID','name','date','confirmedInfections']
data = pd.read_csv('data.txt', names=headers, header=0)
# The column is called 'date' and its values look like 2020-02-24
data['date'] = data['date'].map(lambda x: datetime.strptime(str(x), '%Y-%m-%d'))
x = data['date']
y = data['confirmedInfections']
# Plot using matplotlib
plt.plot(x,y)
# display chart
plt.show()
I haven't tested this particular code.
I hope this will work for you
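To get one data frame per city rather than a single plot of the whole file, grouping on the name column is one option. This is only a minimal sketch: the SAMPLE text is a made-up stand-in for data.txt with the same four columns, and split_by_name is a hypothetical helper name, not something from the question.

```python
import io

import pandas as pd

# Hypothetical sample standing in for data.txt (same four columns as in the question).
SAMPLE = """ID,name,date,confirmedInfections
DE2,BAYERN,2020-02-24,19
DE2,BAYERN,2020-02-25,19
DE1,BADEN-WÜRTTEMBERG,2020-02-24,1
DE1,BADEN-WÜRTTEMBERG,2020-02-26,7
"""


def split_by_name(source):
    """Return {name: DataFrame indexed by date} for every name in the file."""
    data = pd.read_csv(source, parse_dates=["date"])
    return {
        name: group.set_index("date").sort_index()[["confirmedInfections"]]
        for name, group in data.groupby("name")
    }


frames = split_by_name(io.StringIO(SAMPLE))
for name, frame in frames.items():
    print(name, len(frame), "rows")
```

With the real file you would pass "data.txt" instead of the StringIO object. Calling frames["BAYERN"]["confirmedInfections"].plot() followed by plt.show() should then draw a single city with the dates on the x-axis.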
Related
Beginner question here.
What I'm trying to build:
A program that takes data from a CSV and creates a calendar heat map from it. I am a language learner (language as in Spanish, Japanese, etc.) and the data set I'm using is a CSV that shows how many hours I spent immersing in my target language per day.
I want the individual values in the heat map to be the number of hours. Y axis will be days of the week, and x axis will be months.
What I have tried:
I have tried many methods for the past two days (most of them using seaborn), that have all resulted in error-infested spaghetti code...
The method I'm using today is with calmap. Here is what I have so far:
import seaborn as sns
import matplotlib as plt
import numpy as np
from vega_datasets import data as vds
import calmap
import pandas as pd
import calplot
# importing CSV from google drive
df = pd.read_csv('ImmersionHours.csv', names=['Type', 'Name', 'Date', 'Time', 'Total Time'])
# deleting extraneous row of data
df.drop([0], inplace=True)
# making sure dates are in datetime format
df['Date'] = pd.to_datetime(df['Date'])
# setting the dates as the index
df.set_index('Date', inplace=True)
# the data is now formatted how I want
# creating a series for the heat map values
hm_values = pd.Series(df.Time)
# trying to create the heat map from the series (hm_values)
calmap.yearplot(data=hm_values, year=2021)
and here is a copy of the data set that I imported into Python (for reference) https://docs.google.com/spreadsheets/d/1owZv0NDLz7S4R5Spf-hzRDGMTCS1FVSMvi0WsZJenWE/edit?usp=sharing
Can someone tell me where I'm going wrong and why the heat map won't show?
Thank you in advance for any advice/tips/corrections.
The question is a bit old, but in case anyone is interested, I had the same problem and found this notebook very helpful in solving the issue: https://github.com/amandasolis/Fitbit/blob/master/FitbitSummaryPlots.ipynb
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import calmap
# infer_datetime_format is left out because it is deprecated since pandas 2.0
fulldf = pd.read_csv("./data.csv", index_col=0, header=None,
                     names=['date','duration','frac'], parse_dates=['date'],
                     usecols=['date','frac'], dayfirst=True)
fulldf.index=pd.to_datetime(fulldf.index)
events = pd.Series(fulldf['frac'])
calmap.yearplot(events, year=2022) #the notebook linked above has a better but complex viz
plt.show() #needed outside a notebook, otherwise no window opens
first lines of data.csv (I plot frac, the 3rd column, not duration, but it should be similar):
03/11/2022,1,"0.0103"
08/11/2022,1,"0.0103"
15/11/2022,1,"0.0103"
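For the immersion-hours data, the series handed to calmap probably needs numeric values and a datetime index with one entry per day. Here is a minimal sketch of that preparation step using pandas only. The SAMPLE text and its two column names are assumptions for illustration, and the Time column is read as text on purpose to mimic a column that came in as strings.

```python
import io

import pandas as pd

# Hypothetical sample; the real CSV has more columns (Type, Name, Date, Time, Total Time).
SAMPLE = """Date,Time
2021-01-01,1.5
2021-01-01,0.5
2021-01-03,2
"""

# Read Time as text to mimic a column that was not parsed as numbers.
df = pd.read_csv(io.StringIO(SAMPLE), dtype={"Time": str}, parse_dates=["Date"])

# Convert the hours to numbers and index them by date.
hours = pd.to_numeric(df.set_index("Date")["Time"], errors="coerce")

# Collapse several entries on the same day into one daily total.
daily = hours.groupby(level=0).sum()
print(daily)
```

The resulting daily series can then be passed to calmap.yearplot(daily, year=2021), followed by plt.show() from matplotlib.pyplot when running outside a notebook.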
I'm trying to import data from both iex and FRED. Although both time series cover the same time period, when I graph them together the data does not show up correctly on the same x axis. I suspect this is due to differences between how the iex dates are formatted and how the FRED dates are formatted.
Code below:
import matplotlib.pyplot as plt
import pandas as pd
from pandas_datareader.data import DataReader
from datetime import date
start = date(2016,1,1)
end = date(2016,12,31)
ticker = 'AAPL'
data_source = 'iex'
stock_prices = DataReader(ticker, data_source, start, end)
print(stock_prices.head())
stock_prices.info()
stock_prices['close'].plot(title=ticker)
plt.show()
series = 'DCOILWTICO'
start = date(2016,1,1)
end = date(2016,12,31)
oil = DataReader(series,'fred',start,end)
print(oil.head())
oil.info()
data = pd.concat([stock_prices[['close']],oil],axis=1)
print(data.head())
data.columns = ['AAPL','Oil Price']
data.plot()
plt.show()
Using join instead of pd.concat will give you what you want:
data = stock_prices[['close']].join(oil)
The main issue with pd.concat is that the indexes of your two DataFrames are not aligned, hence the weirdly stitched DataFrame. DataFrame.join will take care of the misalignment.
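If the suspicion in the question is right and one index holds date strings while the other holds real timestamps, converting the string index before combining is another way to line the two series up. A minimal sketch with made-up numbers (the two small frames are hypothetical stand-ins for the iex and FRED downloads, and the prices are not real data):

```python
import pandas as pd

# Hypothetical stand-ins: one frame indexed by date strings, one by real timestamps.
stock_prices = pd.DataFrame(
    {"close": [105.0, 103.0]},
    index=["2016-01-04", "2016-01-05"],
)
oil = pd.DataFrame(
    {"DCOILWTICO": [36.0, 35.0]},
    index=pd.to_datetime(["2016-01-04", "2016-01-05"]),
)

# Make both indexes the same type so that rows for the same day line up.
stock_prices.index = pd.to_datetime(stock_prices.index)

data = pd.concat([stock_prices[["close"]], oil], axis=1)
data.columns = ["AAPL", "Oil Price"]
print(data)
```

Checking stock_prices.index.dtype and oil.index.dtype on the real downloads would show whether this mismatch is actually present.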
I wrote a Python script to read in a distance matrix that was provided via a CSV text file. This distance matrix shows the difference between different animal species, and I'm trying to sort them in different ways (diet, family, genus, etc.) using data from another CSV file that just has one row of ordering information. The code used is here:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as mp
dietCols = pd.read_csv("label_diet.txt", header=None)
df = pd.read_csv("distance_matrix.txt", header=None)
ax = sns.heatmap(df)
fig = ax.get_figure()
fig.savefig("fig1.png")
mp.clf()
dfDiet = pd.read_csv("distance_matrix.txt", header=None, names=dietCols)
ax2 = sns.heatmap(dfDiet, linewidths=0)
fig2 = ax2.get_figure()
fig2.savefig("fig2.png")
mp.clf()
When plotting the distance matrix, the original graph looks like this:
However, when the additional naming information is read from the text file, the graph produced only has one column and looks like this:
You can see the matrix data is being used as row labeling, and I'm not sure why that would be. Some of the rows provided have no values so they're listed as "NaN", so I'm not sure if that would be causing a problem. Is there any easy way to order this distance matrix using an exterior file? Any help would be appreciated!
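One possible approach, as a minimal sketch: the names argument of read_csv expects a flat list of column labels, so it may help to flatten the label file into a plain array first and then reorder the rows and columns of the matrix together. The MATRIX and LABELS strings below are made-up stand-ins for distance_matrix.txt and label_diet.txt, and the assumption is that the label file holds exactly one label per matrix row.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-ins for distance_matrix.txt and label_diet.txt.
MATRIX = """0,3,1
3,0,2
1,2,0
"""
LABELS = "herbivore,carnivore,herbivore\n"

df = pd.read_csv(io.StringIO(MATRIX), header=None)

# Flatten the label file into a 1-D array (works for a single row or a single column).
labels = pd.read_csv(io.StringIO(LABELS), header=None).to_numpy().ravel()

# A stable sort keeps the original order within each label group.
order = np.argsort(labels, kind="stable")

# Apply the same permutation to rows and columns so the matrix stays symmetric.
ordered = df.iloc[order, order].copy()
ordered.index = labels[order]
ordered.columns = labels[order]
print(ordered)
```

The ordered frame can be passed to sns.heatmap(ordered) in place of the unlabeled one.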
I haven't had much training with Matplotlib at all, and this really seems like a basic plotting application, but I'm getting nothing but errors.
Using Python 3, I'm simply trying to plot historical stock price data from a CSV file, using the date as the x axis and prices as the y. The data CSV looks like this:
(only just now noticing the big gap in times, but whatever)
import glob
import pandas as pd
import matplotlib.pyplot as plt
def plot_test():
    files = glob.glob('./data/test/*.csv')
    for file in files:
        df = pd.read_csv(file, header=1, delimiter=',', index_col=1)
        df['close'].plot()
        plt.show()
plot_test()
I'm using glob for now just to identify any CSV file in that folder, but I've also tried just designating one specific CSV filename and get the same error:
KeyError: 'close'
I've also tried just designating a specific column number to only plot one particular column instead, but I don't know what's going on.
Ideally, I would like to plot it just like real stock data, with everything on the same graph: volume at the bottom on its own axis, open, high, low and close on the y axis, and date on the x axis for every row in the file. I know this has probably been asked before, and I've tried lots of different solutions from SO and elsewhere, but I can't seem to figure it out. Thanks so much for the newbie help!
The pandas documentation says that the header kwarg should be 0 for your CSV, as the first row contains the column names. What is happening is that the DataFrame you are building doesn't have the column close, because header=1 takes the column names from the second row. It will probably work fine if you remove the header kwarg or change it to header=0. The same goes for the other kwargs; there is no need to define them. A simple df = pd.read_csv(file) will do just fine.
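To see the effect of the header kwarg, here is a small sketch that reads the same text twice. The SAMPLE rows are made up, and the column names (timestamp, open, high, low, close, volume) are an assumption about what the CSV looks like.

```python
import io

import pandas as pd

# Hypothetical sample with the column names in the first row.
SAMPLE = """timestamp,open,high,low,close,volume
2018-01-02 09:30,10.0,10.5,9.9,10.2,1000
2018-01-02 09:31,10.2,10.6,10.1,10.4,1200
"""

# Default (header=0): the first row provides the column names.
default = pd.read_csv(io.StringIO(SAMPLE))

# header=1: the second row is used as column names, so 'close' disappears.
shifted = pd.read_csv(io.StringIO(SAMPLE), header=1)

print(list(default.columns))
print(list(shifted.columns))
```

With header=1 the first data row becomes the column names, so looking up df['close'] raises the KeyError from the question.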
You can prettify this according to your needs
import pandas
import matplotlib.pyplot as plt
def plot_test(file):
    df = pandas.read_csv(file)
    # convert timestamp
    df['timestamp'] = pandas.to_datetime(df['timestamp'], format = '%Y-%m-%d %H:%M')
    # plot prices (plain plot() accepts datetime x values)
    ax1 = plt.subplot(211)
    ax1.plot(df['timestamp'], df['open'], '-', label = 'open')
    ax1.plot(df['timestamp'], df['close'], '-', label = 'close')
    ax1.plot(df['timestamp'], df['high'], '-', label = 'high')
    ax1.plot(df['timestamp'], df['low'], '-', label = 'low')
    ax1.legend()
    # plot volume
    ax2 = plt.subplot(212)
    # issue: https://github.com/matplotlib/matplotlib/issues/9610
    df.set_index('timestamp', inplace = True)
    # pass plain datetime objects to bar() as a workaround for the issue above
    ax2.bar(df.index.to_pydatetime(), df['volume'], width = 1e-3)
    ax2.xaxis_date()
    plt.show()
I have a csv file with data in the following format -
Issue_Type DateTime
Issue1 03/07/2011 11:20:44
Issue2 01/05/2011 12:30:34
Issue3 01/01/2011 09:44:21
... ...
I'm able to read this csv file, but what I'm unable to achieve is to plot a graph or rather trend based on the data.
For instance, I'm trying to plot a graph with the X-axis as DateTime (month only) and the Y-axis as the number of issues. So I would show the trend in a line graph with 3 lines indicating the pattern of issues under each category per month.
I don't really have any code for plotting the graph, so I can't share any; so far I'm only reading the CSV file. I'm not sure how to proceed further to plot a graph.
PS: I'm not bent on using Python. Since I've parsed CSV using Python earlier I thought of using the language, but if there is an easier approach using some other language, I would be open to exploring that as well.
A way to do this is to use dataframes with pandas.
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
df = pd.read_csv(r"D:\Programmes Python\Data\Data_csv.txt",sep=";") #Reads the csv (raw string, so the backslashes are not treated as escapes)
df.index = pd.to_datetime(df["DateTime"]) #Set the index of the dataframe to the DateTime column
del df["DateTime"] #The DateTime column is now useless
fig, ax = plt.subplots()
ax.plot(df.index,df["Issue_Type"])
ax.xaxis.set_major_formatter(mdates.DateFormatter('%m')) #This will only show the month number on the graph
This assumes that Issue1/2/3 are integers. I assumed they were, as I didn't really understand what they were supposed to be.
Edit: This should do the trick then, it's not pretty and can probably be optimised, but it works well:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
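df = pd.read_csv(r"D:\Programmes Python\Data\Data_csv.txt",sep=";") #Raw string keeps the backslashes literal
df.index = pd.to_datetime(df["DateTime"])
del df["DateTime"]
#Take the number after the "Issue" prefix; a separate name avoids shadowing the built-in list
issue_numbers = []
for issue in df["Issue_Type"]:
    issue_numbers.append(int(issue[5:]))
df["Issue_number"] = issue_numbers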
fig, ax = plt.subplots()
ax.plot(df.index,df["Issue_number"])
ax.xaxis.set_major_formatter(mdates.DateFormatter('%m'))
plt.show()
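If the goal is the number of issues per month for each issue type, as the question describes, counting with pandas may be closer to what is wanted than plotting the issue number. A minimal sketch: the SAMPLE text is made up, the ";" separator follows the answer above, and reading the dates as day/month/year is an assumption.

```python
import io

import pandas as pd

# Hypothetical sample in the layout from the question, with ';' as separator.
SAMPLE = """Issue_Type;DateTime
Issue1;03/07/2011 11:20:44
Issue2;01/05/2011 12:30:34
Issue3;01/01/2011 09:44:21
Issue1;15/07/2011 08:00:00
"""

df = pd.read_csv(io.StringIO(SAMPLE), sep=";")

# Assumption: the dates are day/month/year.
df["DateTime"] = pd.to_datetime(df["DateTime"], format="%d/%m/%Y %H:%M:%S")

# Rows are month numbers, columns are issue types, cells are issue counts.
counts = pd.crosstab(df["DateTime"].dt.month, df["Issue_Type"])
print(counts)
```

Calling counts.plot() followed by plt.show() should then draw one line per issue type with the month on the x-axis.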
The first thing you need to do is to parse the datetime fields as dates/times. Try using dateutil.parser for that.
Next, you will need to count the number of issues of each type in each month. The naive way to do that would be to maintain lists of lists for each issue type, and just iterate through each row, see which month and which issue type it is, and then increment the appropriate counter.
When you have such a frequency count of issues, sorted by issue types, you can simply plot them against dates like this:
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import datetime as dt
# starting_year, ending_year and issues (one list of monthly counts per issue type)
# are assumed to be defined already; ending_year itself is not included by range()
dates = []
for year in range(starting_year, ending_year):
    for month in range(1, 13):  # 13, because range() stops before its end value
        dates.append(dt.datetime(year=year, month=month, day=1))
month_formatter = mdates.DateFormatter('%b') # Format dates to only show month names
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(dates, issues[0]) # To plot just issues of type 1
ax.plot(dates, issues[1]) # To plot just issues of type 2
ax.plot(dates, issues[2]) # To plot just issues of type 3
ax.xaxis.set_major_formatter(month_formatter) # Format X tick labels
plt.show()
plt.close()
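The counting step described in this answer could be sketched with the standard library alone. Note that this uses datetime.strptime rather than dateutil.parser, the rows list is made-up sample data, monthly_counts is a hypothetical helper name, and the day/month/year order is an assumption.

```python
from collections import Counter
from datetime import datetime

# Hypothetical rows as (issue type, timestamp text) pairs.
rows = [
    ("Issue1", "03/07/2011 11:20:44"),
    ("Issue2", "01/05/2011 12:30:34"),
    ("Issue3", "01/01/2011 09:44:21"),
    ("Issue1", "15/07/2011 08:00:00"),
]


def monthly_counts(rows, fmt="%d/%m/%Y %H:%M:%S"):
    """Count issues per (issue type, year, month); fmt assumes day/month/year."""
    counts = Counter()
    for issue_type, stamp in rows:
        when = datetime.strptime(stamp, fmt)
        counts[(issue_type, when.year, when.month)] += 1
    return counts


counts = monthly_counts(rows)
print(counts[("Issue1", 2011, 7)])
```

A Counter returns 0 for months with no issues, which makes it straightforward to build one list of monthly counts per issue type for plotting.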
Honestly, I would just use R. Check this link out on downloading / setting up R & RStudio.
data <- read.csv(file="c:/yourdatafile.csv", header=TRUE, sep=",")
attach(data)
data$Month <- format(as.Date(data$DateTime, format="%d/%m/%Y"), "%m")  # assumes day/month/year; use "%m/%d/%Y" otherwise
plot(DateTime, Issue_Type)