How to select two columns to plot with dataframe? - python

apple is a DataFrame with the following structure:
apple
Date Open High Low Close Adj Close
0 2017-01-03 115.800003 116.330002 114.760002 116.150002 114.311760
1 2017-01-04 115.849998 116.510002 115.750000 116.019997 114.183815
2 2017-01-05 115.919998 116.860001 115.809998 116.610001 114.764473
3 2017-01-06 116.779999 118.160004 116.470001 117.910004 116.043915
4 2017-01-09 117.949997 119.430000 117.940002 118.989998 117.106812
5 2017-01-10 118.769997 119.379997 118.300003 119.110001 117.224907
6 2017-01-11 118.739998 119.930000 118.599998 119.750000 117.854782
7 2017-01-12 118.900002 119.300003 118.209999 119.250000 117.362694
8 2017-01-13 119.110001 119.620003 118.809998 119.040001 117.156021
9 2017-01-17 118.339996 120.239998 118.220001 120.000000 118.100822
Now I want to select two columns, Date and Close, and plot Close on the y axis against Date on the x axis. How do I plot it?
import pandas as pd
import matplotlib.pyplot as plt
x=pd.DataFrame({'key':apple['Date'],'data':apple['Close']})
x.plot()
plt.show()
I got a graph, but its x axis is not the Date column!

A new DataFrame is not necessary; plot apple directly and pass the x and y parameters:
# if Date is not already a datetime column, convert it first
#apple['Date'] = pd.to_datetime(apple['Date'])
apple.plot(x='Date', y='Close')
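As a self-contained sketch of the same idea, the snippet below rebuilds three rows of the apple data shown above, converts Date to datetime, and plots Close against it. The `Agg` backend and the `ax.set_ylabel` call are assumptions added so the example runs without a display; in an interactive session you would simply call `plt.show()`.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Three rows copied from the question's data, Date still stored as strings
apple = pd.DataFrame({
    "Date": ["2017-01-03", "2017-01-04", "2017-01-05"],
    "Close": [116.150002, 116.019997, 116.610001],
})

# Convert to datetime so the x axis is treated as dates rather than labels
apple["Date"] = pd.to_datetime(apple["Date"])

# x and y select the columns; no intermediate DataFrame is needed
ax = apple.plot(x="Date", y="Close")
ax.set_ylabel("Close")
```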

Related

Adding repeating date column to pandas DataFrame

I am new to pandas and I am struggling to add dates to my pandas DataFrame df, which comes from a .csv file. The DataFrame has several unique ids, and each id has 120 months; I need to add a date column. Each id should have exactly the same dates for its 120 periods. The difficulty is that after the first id comes another id, and the dates should start over again. My data in the csv file looks like this:
month id
1 1593
2 1593
...
120 1593
1 8964
2 8964
...
120 8964
1 58944
...
Here is my code. I am not sure how to use the groupby method to add dates to my DataFrame based on id:
group=df.groupby('id')
group['date']=pd.date_range(start='2020/6/1', periods=120, freq='MS').shift(14,freq='D')
Please help me!!!
If you know how many sets of 120 you have, you can use this; just change the 2 at the end. This example repeats the same 120 dates twice. You may have to adapt it for your specific use.
new_dates = list(pd.date_range(start='2020/6/1', periods=120, freq='MS').shift(14,freq='D'))*2
df = pd.DataFrame({'date': new_dates})
These two are equivalent, except that the second one uses a lambda:
def repeatingDates(numIds):
    return [d.strftime('%Y/%m/%d') for d in pd.date_range(start='2020/6/1', periods=120, freq='MS')] * numIds
repeatingDates = lambda numIds: [d.strftime('%Y/%m/%d') for d in pd.date_range(start='2020/6/1', periods=120, freq='MS')] * numIds
You can use pandas transform. This is how I solved it:
dataf['dates'] = (
    dataf
    .groupby("id")["month"]
    .transform(lambda d: pd.date_range(start='2020/6/1', periods=d.max(), freq='MS').shift(14, freq='D'))
)
Results:
month id dates
0 1 1593 2020-06-15
1 2 1593 2020-07-15
2 3 1593 2020-08-15
3 1 8964 2020-06-15
4 2 8964 2020-07-15
5 1 58944 2020-06-15
6 2 58944 2020-07-15
7 3 58944 2020-08-15
8 4 58944 2020-09-15
Test data:
import io
import pandas as pd
dataf = pd.read_csv(io.StringIO("""
month,id
1,1593
2,1593
3,1593
1,8964
2,8964
1,58944
2,58944
3,58944
4,58944""")).astype(int)
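Since the month column already numbers the periods within each id, another possible route is to build the dates once and look them up by month number, with no groupby at all. The sketch below is not taken from any of the answers; it assumes months are numbered from 1 within each id, and the `lookup` name is made up for illustration.

```python
import pandas as pd

# Same shape as the test data above: month counter restarts for every id
df = pd.DataFrame({
    "month": [1, 2, 3, 1, 2, 1, 2, 3, 4],
    "id": [1593] * 3 + [8964] * 2 + [58944] * 4,
})

# One date per month number: month starts from 2020-06-01, shifted to the 15th
dates = pd.date_range(start="2020-06-01", periods=df["month"].max(), freq="MS").shift(14, freq="D")

# Hypothetical lookup table: month number (1-based) -> date
lookup = pd.Series(dates, index=range(1, len(dates) + 1))

# Every id reuses the same dates because the lookup depends only on month
df["date"] = df["month"].map(lookup)
```

Because the lookup depends only on the month number, it also copes with ids that have different numbers of months.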

Boxplot of Multiindex df

I want to do 2 things:
I want to create one boxplot per date/day with all the values for MeanTravelTimeSeconds in that date. The number of MeanTravelTimeSeconds elements varies from date to date (e.g. one day might have a count of 300 values while another, 400).
Also, I want to transform the rows in my multiindex series into columns because I don't want the rows to repeat every time. If it stays like this I'd have tens of millions of unnecessary rows.
Here is the resulting series after using df.stack() on a df indexed by date (date is a datetime object index):
Date
2016-01-02 NumericIndex 1611664
OriginMovementID 4744
DestinationMovementID 5084
MeanTravelTimeSeconds 1233
RangeLowerBoundTravelTimeSeconds 756
...
2020-03-31 DestinationMovementID 3594
MeanTravelTimeSeconds 1778
RangeLowerBoundTravelTimeSeconds 1601
RangeUpperBoundTravelTimeSeconds 1973
DayOfWeek Tuesday
Length: 11281655, dtype: object
When I use seaborn to plot the boxplot I get a bunch of errors, whichever selection I try.
If I try df.stack().unstack() or df.stack().T, I get the following error:
Index contains duplicate entries, cannot reshape
How do I plot the boxplot and how do I turn the rows into columns?
You really do need to make your index unique for the functions you want to work. I suggest a sequential number that resets at every change in the other two key columns.
import datetime as dt
import random
import numpy as np
import pandas as pd
cat = ["NumericIndex","OriginMovementID","DestinationMovementID","MeanTravelTimeSeconds",
"RangeLowerBoundTravelTimeSeconds"]
df = pd.DataFrame(
[{"Date":d, "Observation":cat[random.randint(0,len(cat)-1)],
"Value":random.randint(1000,10000)}
for i in range(random.randint(5,20))
for d in pd.date_range(dt.datetime(2016,1,2), dt.datetime(2016,3,31), freq="14D")])
# starting point....
df = df.sort_values(["Date","Observation"]).set_index(["Date","Observation"])
# generate an array that is sequential within change of key
seq = np.full(df.index.shape, 0)
s=0
p=""
for i, v in enumerate(df.index):
    # restart the counter whenever the (Date, Observation) key changes
    if i == 0 or p != v:
        s = 0
    else:
        s += 1
    seq[i] = s
    p = v
df["SeqNo"] = seq
# add to index - now unstack works as required
dfdd = df.set_index(["SeqNo"], append=True)
dfdd.unstack(0).loc["MeanTravelTimeSeconds"].boxplot()
print(dfdd.unstack(1).head().to_string())
output
Value
Observation DestinationMovementID MeanTravelTimeSeconds NumericIndex OriginMovementID RangeLowerBoundTravelTimeSeconds
Date SeqNo
2016-01-02 0 NaN NaN 2560.0 5324.0 5085.0
1 NaN NaN 1066.0 7372.0 NaN
2016-01-16 0 NaN 6226.0 NaN 7832.0 NaN
1 NaN 1384.0 NaN 8839.0 NaN
2 NaN 7892.0 NaN NaN NaN
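The sequential number built by the loop above can probably also be obtained with `groupby(...).cumcount()`, which counts rows within each key. The sketch below uses a small hand-made long-format frame; its values are illustrative only, and the names `long_df`, `wide`, and `per_date` are made up for this example.

```python
import pandas as pd

# Illustrative long-format data: repeated (Date, Observation) pairs
long_df = pd.DataFrame({
    "Date": pd.to_datetime(["2016-01-02"] * 4 + ["2016-01-16"] * 3),
    "Observation": [
        "MeanTravelTimeSeconds", "MeanTravelTimeSeconds",
        "OriginMovementID", "OriginMovementID",
        "MeanTravelTimeSeconds", "MeanTravelTimeSeconds",
        "OriginMovementID",
    ],
    "Value": [1233, 1400, 4744, 4745, 1778, 1601, 3594],
})

# Counter that restarts at 0 for every (Date, Observation) pair
long_df["SeqNo"] = long_df.groupby(["Date", "Observation"]).cumcount()

# The index is now unique, so the observations can become columns
wide = long_df.set_index(["Date", "SeqNo", "Observation"])["Value"].unstack("Observation")

# One column per date, ready for per_date.boxplot()
per_date = wide["MeanTravelTimeSeconds"].unstack("Date")
```

Calling `per_date.boxplot()` would then draw one box per date, with dates that have fewer values padded by NaN.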

Creating dataframe from dict

First I create a list with some values:
list = ['SBSP3.SA', 'CSMG3.SA', 'CGAS5.SA']
I create an empty dictionary because that is the only way I have found to read several .csv files as DataFrames. I then loop over the tickers to store each .csv file in the dictionary:
d = {}
d = {ticker: pd.read_csv('{}.csv'.format(ticker)) for ticker in list}
After that, I can only access each DataFrame through its dictionary key:
d['SBSP3.SA'].head(5)
Date High Low Open Close Volume Adj Close
0 2017-01-02 14.70 14.60 14.64 14.66 7525700.0 13.880955
1 2017-01-03 15.65 14.95 14.95 15.50 39947800.0 14.676315
2 2017-01-04 15.68 15.31 15.45 15.50 37071700.0 14.676315
3 2017-01-05 15.91 15.62 15.70 15.75 47586300.0 14.913031
4 2017-01-06 15.92 15.50 15.78 15.66 25592000.0 14.827814
I can't do this, for example:
df = pd.DataFrame(d)
My question is:
Can I merge all the DataFrames I put in the dictionary (d) along axis = 1, to view them as one?
After a lot of head-scratching I managed to put all the DataFrames together, but I lost their keys and could not tell them apart, since the column names are the same.
Can I put these keys in the column names?
Example:
Date High_SBSP3.SA Low_SBSP3.SA Open_SBSP3.SA Close_SBSP3.SA Volume_SBSP3.SA Adj Close_SBSP3.SA
0 2017-01-02 14.70 14.60 14.64 14.66 7525700.0 13.880955
1 2017-01-03 15.65 14.95 14.95 15.50 39947800.0 14.676315
2 2017-01-04 15.68 15.31 15.45 15.50 37071700.0 14.676315
3 2017-01-05 15.91 15.62 15.70 15.75 47586300.0 14.913031
4 2017-01-06 15.92 15.50 15.78 15.66 25592000.0 14.827814
Don't use list as a variable name, it shadows the actual built-in list.
You don't need a dictionary, a simple list is enough to store all your dataframes.
Call pd.concat on this list - it should properly concatenate the dataframes one below the other, as long as they have the same column names.
ticker_list = ['SBSP3.SA', 'CSMG3.SA', 'CGAS5.SA']
pd_list = [pd.read_csv('{}.csv'.format(ticker)) for ticker in ticker_list]
df = pd.concat(pd_list)
Use df = pd.concat(pd_list, ignore_index=True) if you want to reset the indices when concatenating.
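If the ticker needs to stay attached to each row after concatenating, one option is to pass the dictionary itself to pd.concat, which uses the keys as an outer index level. The sketch below uses two tiny in-memory frames instead of the CSV files; the CSMG3.SA values are made up for illustration.

```python
import pandas as pd

# Stand-ins for the frames read from the CSV files (CSMG3.SA values are invented)
frames = {
    "SBSP3.SA": pd.DataFrame({"Date": ["2017-01-02", "2017-01-03"], "Close": [14.66, 15.50]}),
    "CSMG3.SA": pd.DataFrame({"Date": ["2017-01-02", "2017-01-03"], "Close": [10.0, 11.0]}),
}

# Dictionary keys become the outer level of the row index
stacked = pd.concat(frames, names=["Ticker"])

# Optionally turn that level into an ordinary column
tidy = stacked.reset_index(level="Ticker").reset_index(drop=True)
```

`stacked.loc["CSMG3.SA"]` then recovers the rows of a single ticker.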
pd.merge will do what you want (including renaming columns), but since it merges only two frames at a time, the column names will not be consistent when the merge is repeated. You therefore need to rename the columns manually beforehand.
import pandas as pd
from functools import reduce
ticker_list = ['SBSP3.SA', 'CSMG3.SA', 'CGAS5.SA']
pd_list = [pd.read_csv('{}.csv'.format(ticker)) for ticker in ticker_list]
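for idx, df in enumerate(pd_list):
    # suffix every column except the first (the date) with the ticker name
    old_names = df.columns[1:]
    new_names = list(map(lambda x: x + '_' + ticker_list[idx], old_names))
    zipped = dict(zip(old_names, new_names))
    df.rename(zipped, axis=1, inplace=True)
def dfmerge(x, y):
    # the merge key must match the date column's name in the CSV files ("Date" in the question's data)
    return pd.merge(x, y, on="date")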
df = reduce(dfmerge, pd_list)
print(df)
Output (with my data):
date High_SBSP3.SA Low_SBSP3.SA Open_SBSP3.SA High_CSMG3.SA Low_CSMG3.SA Open_CSMG3.SA High_CGAS5.SA Low_CGAS5.SA Open_CGAS5.SA
0 2017-01-02 1 2 3 1 2 3 1 2 3
1 2017-01-03 4 5 6 4 5 6 4 5 6
2 2017-01-04 7 8 9 7 8 9 7 8 9
Hint: you may need to edit or delete your comment, since I overwrote my previous answer instead of adding a new one.
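A shorter way to get the same suffixed layout might be to index each frame by its date column, suffix the remaining columns with `add_suffix`, and concatenate along axis=1. This is a sketch under the assumption that every file has a Date column with matching dates; the frames below are small in-memory stand-ins with invented CSMG3.SA values.

```python
import pandas as pd

# Stand-ins for the frames read from the CSV files (CSMG3.SA values are invented)
frames = {
    "SBSP3.SA": pd.DataFrame({"Date": ["2017-01-02", "2017-01-03"], "Close": [14.66, 15.50]}),
    "CSMG3.SA": pd.DataFrame({"Date": ["2017-01-02", "2017-01-03"], "Close": [10.0, 11.0]}),
}

# Index by Date, suffix the other columns with the ticker, then align side by side
wide = pd.concat(
    [df.set_index("Date").add_suffix("_" + ticker) for ticker, df in frames.items()],
    axis=1,
).reset_index()
```

Rows are aligned on the Date index, so a date missing from one file would show up as NaN in that ticker's columns rather than being dropped.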

How can I plot different length pandas series with matplotlib?

I've got two pandas series, one with a 7 day rolling mean for the entire year and another with monthly averages. I'm trying to plot them both on the same matplotlib figure, with the averages as a bar graph and the 7 day rolling mean as a line graph. Ideally, the line would be drawn on top of the bar graph.
The issue I'm having is that, with my current code, the bar graph is showing up without the line graph, but when I try plotting the line graph first I get a ValueError: ordinal must be >= 1.
Here's what the series look like.
These are the first 15 values of the 7 day rolling mean series, which has a date and a value for the entire year:
date
2016-01-01 NaN
2016-01-03 NaN
2016-01-04 NaN
2016-01-05 NaN
2016-01-06 NaN
2016-01-07 NaN
2016-01-08 0.088473
2016-01-09 0.099122
2016-01-10 0.086265
2016-01-11 0.084836
2016-01-12 0.076741
2016-01-13 0.070670
2016-01-14 0.079731
2016-01-15 0.079187
2016-01-16 0.076395
This is the entire monthly average series:
dt_month
2016-01-01 0.498323
2016-02-01 0.497795
2016-03-01 0.726562
2016-04-01 1.000000
2016-05-01 0.986411
2016-06-01 0.899849
2016-07-01 0.219171
2016-08-01 0.511247
2016-09-01 0.371673
2016-10-01 0.000000
2016-11-01 0.972478
2016-12-01 0.326921
Here's the code I'm using to try and plot them:
ax = series_one.plot(kind="bar", figsize=(20,2))
series_two.plot(ax=ax)
plt.show()
Here's the graph that generates:
Any help is hugely appreciated! Also, advice on formatting this question and creating code to make two series for a minimum working example would be awesome.
Thanks!!
The problem is that pandas bar plots are categorical (the bars sit at successive integer positions). Since in your case the two series have a different number of elements, plotting the line graph in categorical coordinates is not really an option. What remains is to plot the bar graph in numerical coordinates as well. This is not possible with pandas, but it is the default behaviour with matplotlib.
Below I shift the monthly dates by 15 days to the middle of the month to have nicely centered bars.
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(42)
import pandas as pd
t1 = pd.date_range("2018-01-01", "2018-12-31", freq="D")
s1 = pd.Series(np.cumsum(np.random.randn(len(t1)))+14, index=t1)
s1[:6] = np.nan
t2 = pd.date_range("2018-01-01", "2018-12-31", freq="MS")
s2 = pd.Series(np.random.rand(len(t2))*15+5, index=t2)
# shift monthly data to middle of month
s2.index += pd.Timedelta('15 days')
fig, ax = plt.subplots()
ax.bar(s2.index, s2.values, width=14, alpha=0.3)
ax.plot(s1.index, s1.values)
plt.show()
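The question also asked how to build two such series for a minimal working example. One way, sketched below with random data, is to derive both from a single daily series using `rolling` and `resample`; the variable names and the random values are assumptions for illustration, not the asker's data.

```python
import numpy as np
import pandas as pd

# Reproducible random daily values for 2016 (illustrative data only)
rng = np.random.default_rng(0)
days = pd.date_range("2016-01-01", "2016-12-31", freq="D")
daily = pd.Series(rng.random(len(days)), index=days)

# 7 day rolling mean: the first six entries are NaN, as in the question
rolling_7d = daily.rolling(7).mean()

# Monthly averages labelled with the first day of each month
monthly = daily.resample("MS").mean()
```

These two series can be passed to `ax.plot` and `ax.bar` in the same way as s1 and s2 above.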
The problem might be that the two series' indices are on very different scales. You can use ax.twiny to plot them:
ax = series_one.plot(kind="bar", figsize=(20,2))
ax_tw = ax.twiny()
series_two.plot(ax=ax_tw)
plt.show()

Data Cleaning in Python/Pandas to iterate through months combinations

I am doing some data cleaning to do some machine learning on a data set.
Basically I would like to predict next 12 months values based on last 12 months.
I have a data set with values per month (example below).
I would like to train my model by iterating over each possible window of 12 consecutive months.
For example, I want to train it on 2014-01 to 2014-12 to predict 2015-01 to 2015-12, but also on 2014-02 to 2015-01 to predict 2015-02 to 2016-01, and so on.
But I am struggling to generate all these possibilities.
Below I show where my code currently stands, followed by an example of what I would like to have (with just 6 months instead of 12).
import pandas as pd
import numpy as np
data = [[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24]]
Months=['201401','201402','201403','201404','201405','201406','201407','201408','201409','201410','201411','201412','201501','201502','201503','201504','201505','201506','201507','201508','201509','201510','201511','201512']
df = pd.DataFrame(data,columns=Months)
This is the part that I can't get to work:
X = np.array([])
Y = np.array([])
for month in Months:
    loc = df.columns.get_loc(month)
    print(month,loc)
    if loc + 11 <= df.shape[1]:
        X = np.append(X,df.iloc[:,loc:loc+5].values,axis=0)
        Y = np.append(Y,df.iloc[:,loc+6:loc+1].values,axis=0)
This is what I am expecting (for the first 3 iterations):
### RESULTS EXPECTED ####
X = [[1,2,3,4,5,6],[2,3,4,5,6,7],[3,4,5,6,7,8]]
Y = [[7,8,9,10,11,12],[8,9,10,11,12,13],[9,10,11,12,13,14]]
To generate date ranges like the ones you describe in your explanation (rather than the ones shown in your sample output), you could use Pandas functionality like so:
import pandas as pd
months = pd.Series([
'201401','201402','201403','201404','201405','201406',
'201407','201408','201409','201410','201411','201412',
'201501','201502','201503','201504','201505','201506',
'201507','201508','201509','201510','201511','201512'
])
# this function converts strings like "201401"
# to datetime objects, and then uses DateOffset
# and date_range to generate a sequence of months
def date_range(month):
    date = pd.to_datetime(month, format="%Y%m")
    return pd.date_range(date, date + pd.DateOffset(months=11), freq='MS')
# apply function to original Series
# and then apply pd.Series to expand resulting arrays
# into DataFrame columns
month_ranges = months.apply(date_range).apply(pd.Series)
# sample of output:
# 0 1 2 3 4 5 \
# 0 2014-01-01 2014-02-01 2014-03-01 2014-04-01 2014-05-01 2014-06-01
# 1 2014-02-01 2014-03-01 2014-04-01 2014-05-01 2014-06-01 2014-07-01
# 2 2014-03-01 2014-04-01 2014-05-01 2014-06-01 2014-07-01 2014-08-01
# 3 2014-04-01 2014-05-01 2014-06-01 2014-07-01 2014-08-01 2014-09-01
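To get the X and Y arrays shown in the question's expected results, rather than the date ranges, a plain sliding window over the row of values may be enough. The sketch below rebuilds the question's one-row DataFrame and uses 6 input and 6 output months as in the expected output; `n_in`, `n_out`, and `starts` are names made up for this example.

```python
import numpy as np
import pandas as pd

# One-row frame as in the question: 24 monthly values for 2014 and 2015
months = [f"{year}{month:02d}" for year in (2014, 2015) for month in range(1, 13)]
df = pd.DataFrame([list(range(1, 25))], columns=months)

row = df.iloc[0].to_numpy()
n_in, n_out = 6, 6  # set both to 12 for the real 12-month case

# Every start position that leaves room for a full input and output window
starts = range(len(row) - n_in - n_out + 1)

X = np.array([row[s:s + n_in] for s in starts])
Y = np.array([row[s + n_in:s + n_in + n_out] for s in starts])
```

With 24 months and 6 + 6 windows this yields 13 training pairs; with 12 + 12 windows only a single pair fits, so more history would be needed for the 12-month setup.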
