I am using the google finance api to pull data into a pandas dataframe. The index is a number and I would like to change it to be a date inclusive of hours and minutes. Any ideas? Thanks!
import pandas as pd
api_call = 'http://finance.google.com/finance/getprices?q=SPY&i=300&p=1d&f=d,o,h,l,c,'
df = pd.read_csv(api_call, skiprows=8, header=None)
df.columns = ['Record', 'Open', 'High', 'Low', 'Close']
df['Record'] = df.index
Record Open High Low Close
0 0 268.19 268.48 268.18 268.46
1 1 268.14 268.23 267.98 268.19
2 2 268.11 268.19 268.06 268.13
3 3 268.05 268.16 267.96 268.11
4 4 267.93 268.1 267.9 268.06
5 5 267.98 268.01 267.89 267.92
6 6 267.95 267.99 267.86 267.97
7 7 267.88 267.95 267.85 267.94
8 8 267.78 267.9 267.78 267.88
9 9 267.94 267.96 267.68 267.78
10 10 267.91 267.95 267.87 267.94
Doesn't look like Pandas supports reading from the Google api. If you look at the raw response from the api it looks like this:
EXCHANGE%3DNYSEARCA
MARKET_OPEN_MINUTE=570
MARKET_CLOSE_MINUTE=960
INTERVAL=300
COLUMNS=DATE,CLOSE,HIGH,LOW,OPEN
DATA=
TIMEZONE_OFFSET=-300
a1514557800,268.51,268.55,268.48,268.55
1,268.19,268.48,268.18,268.46
2,268.14,268.23,267.98,268.19
3,268.11,268.19,268.06,268.13
That first datetime value (with the leading a) is the unix timestamp. Each subsequent "datetime" is really the data for next 300 seconds (INTERVAL value) from the previous row. You need to write something that will parse the header information, and use that to create the timestamps.
Related
I am working on a web app that will allow the user to create a dataframe from within the web app within certain parameters. This is being made with flask, and requires passing a dataframe between different pages, and converting it from a dataframe to a HTML table as necessary, and vice versa.
When I retrieve the values of the table from the client, this is how it appears:
data = request.values.lists()
[('Time', ['00:30', '03:30']), ('Blood Lactate', ['1', '1.1']), ('Velocity (km/h)', ['9', '10'])]
Note: the user cannot add columns, they can only add/remove rows.
When I try to convert this to a dataframe, this is what I get:
df = pd.Dataframe(data)
0 1
0 Time [00:30, 03:30]
1 Blood Lactate [1, 1.1]
2 Velocity (km/h) [9, 10]
I would like if my dataframe was in this format:
Time Blood Lactate Velocity (km/h)
0 00:30 1 9
1 03:30 1.1 10
Is there a way in pandas to convert the data into this format?
You can convert your data to a dictionary first:
pd.DataFrame({e[0]: e[1] for e in request.values.lists()})
Result:
Time Blood Lactate Velocity (km/h)
0 00:30 1 9
1 03:30 1.1 10
Edit:
There is a shorter more readable way:
pd.DataFrame(dict(request.values.lists()))
I need to import a .xlsx sheet into pandas which has a column for the processing time of an associated activity. All entries in this column look somewhat like this:
01:20:34
12:22:30
25:01:02
155:20:56
Which says how much hours, minutes and seconds were needed. When I use pd.read_excel pandas correctly interprets each of the timestamps with less than 24 hours, and reads them as above in the first two cases. The timestamps with more than 24h (last two) on the other hand are converted into a datetime object, which in turn looks like this: 1900-01-02T14:58:03 instead of 62:58:03.
Is there a simple solution?
I think that part of the problem is not in Python/Pandas, but in Excel. Date '1900-01-01' is the base date used by Excel represented by number '1'. You can check that if you write '0' in a cell and then formate that cell to date, you get '1900-01-00' and '1' you get '1900-01-01'.
So, try to export your Excel file to a CSV file before importing to pandas and then import this way:
import pandas as pd
df1 = pd.read_csv('sample_data.csv')
In this case, you can get this DataFrame with the column Duration as a string (I added a column id for reference).
duration id
0 01:20:34 1
1 12:22:30 2
2 25:01:02 3
3 155:20:56 4
Then for your purpose, I suggest you Do not try to convert those values to datetime type, but a timedelta. A strategy will be to split the strings by colons and then build an instance of timedelta using those three fields: hours, minutes, and seconds.
import datetime as dt
def converter1(x):
vals = x.split(':')
vals = [int(val) for val in vals ]
out = dt.timedelta(hours=vals[0], minutes=vals[1], seconds=vals[2])
return out
df1['deltat'] = df1['duration'].apply(converter1)
duration id deltat
0 01:20:34 1 0 days 01:20:34
1 12:22:30 2 0 days 12:22:30
2 25:01:02 3 1 days 01:01:02
3 155:20:56 4 6 days 11:20:56
If you need to convert those values to a number of decimals hours or other new fields use the total_seconds() method from timedelta:
df1['deltat_hr'] = df1['deltat'].apply(lambda x: x.total_seconds()/3600)
duration id deltat deltat_hr
0 01:20:34 1 0 days 01:20:34 1.342778
1 12:22:30 2 0 days 12:22:30 12.375000
2 25:01:02 3 1 days 01:01:02 25.017222
3 155:20:56 4 6 days 11:20:56 155.348889
Search column for each month of the year. Column is organized like this "01-Jan-2018". I want to find how many times "Jan-2018" appears in the column. Basically count it and plot it on a bar graph. I want it to show all the quantities for "Jan-2018" , "Feb-2018", etc. Should be 12 bars on the graph. Maybe using count or sum. I am pulling the data from a CSV using pandas and python.
I have tried to printing it out onto the console with some success. But I am getting confused as correct way to search a portion of the date.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import csv
import seaborn as sns
data = pd.read_csv(r'C:\Users\rmond\Downloads\PS_csvFile1.csv', error_bad_lines=False, encoding="ISO-8859-1", skiprows=6)
cols = data.columns
cols = cols.map(lambda x: x.replace(' ', '_') if isinstance(x, (str)) else x)
data.columns = cols
print(data.groupby('Case_Date').mean().plot(kind='bar'))
I am expecting the a bar graph that will show the total quantity for each month. So there should be 12 bar graphs. But I am not sure how to search the column 12 times and each time only looking for the data of each month. While excluding the date, only searching for the month and year.
IIUC, this is what you need.
Let's work with the below dataframe as input dataframe.
date
0 1/31/2018
1 2/28/2018
2 2/28/2018
3 3/31/2018
4 4/30/2018
5 5/31/2018
6 6/30/2018
7 6/30/2018
8 7/31/2018
9 8/31/2018
10 9/30/2018
11 9/30/2018
12 9/30/2018
13 9/30/2018
14 10/31/2018
15 11/30/2018
16 12/31/2018
The below mentioned lines of code will get the number of count for each month as a bar graph. When you have a column as as datetime object, a lot of function are much easy & the contents of the column are much more flexible. With that, you don't need search string of the name of the month.
df['date'] = pd.to_datetime(df['date'])
df['my']=df.date.dt.strftime('%b-%Y')
ax = df.groupby('my', sort=False)['my'].value_counts().plot(kind='bar')
ax.set_xticklabels(df.my, rotation=90);
Output
I am trying to group by hospital staff working hours bi monthly. I have raw data on daily basis which look like below.
date hourse_spent emp_id
9/11/2016 8 1
15/11/2016 8 1
22/11/2016 8 2
23/11/2016 8 1
How I want to group by is.
cycle hourse_spent emp_id
1/11/2016-15/11/2016 16 1
16/11/2016-31/11/2016 8 2
16/11/2016-31/11/2016 8 1
I am trying to do the same with grouper and frequency in pandas something as below.
data.set_index('date',inplace=True)
print data.head()
dt = data.groupby(['emp_id', pd.Grouper(key='date', freq='MS')])['hours_spent'].sum().reset_index().sort_values('date')
#df.resample('10d').mean().interpolate(method='linear',axis=0)
print dt.resample('SMS').sum()
I also tried resampling
df1 = dt.resample('MS', loffset=pd.Timedelta(15, 'd')).sum()
data.set_index('date',inplace=True)
df1 = data.resample('MS', loffset=pd.Timedelta(15, 'd')).sum()
But this is giving data of 15 days interval not like 1 to 15 and 15 to 31.
Please let me know what I am doing wrong here.
You were almost there. This will do it -
dt = df.groupby(['emp_id', pd.Grouper(key='date', freq='SM')])['hours_spent'].sum().reset_index().sort_values('date')
emp_id date hours_spent
1 2016-10-31 8
1 2016-11-15 16
2 2016-11-15 8
The freq='SM' is the concept of semi-months which will use the 15th and the last day of every month
Put DateTime-Values into Bins
If I got you right, you basically want to put your values in the date column into bins. For this, pandas has the pd.cut() function included, which does exactly what you want.
Here's an approach which might help you:
import pandas as pd
df = pd.DataFrame({
'hours' : 8,
'emp_id' : [1,1,2,1],
'date' : [pd.datetime(2016,11,9),
pd.datetime(2016,11,15),
pd.datetime(2016,11,22),
pd.datetime(2016,11,23)]
})
bins_dt = pd.date_range('2016-10-16', freq='SM', periods=3)
cycle = pd.cut(df.date, bins_dt)
df.groupby([cycle, 'emp_id']).sum()
Which gets you:
cycle emp_id hours
------------------------ ------ ------
(2016-10-31, 2016-11-15] 1 16
2 NaN
(2016-11-15, 2016-11-30] 1 8
2 8
Had a similar question, here was my solution:
df1['BiMonth'] = df1['Date'] + pd.DateOffset(days=-1) + pd.offsets.SemiMonthEnd()
df1['BiMonth'] = df1['BiMonth'].dt.to_period('D')
The construction "df1['Date'] + pd.DateOffset(days=-1)" will take whatever is in the date column and -1 day.
The construction "+ pd.offsets.SemiMonthEnd()" converts it to a bimonthly basket, but its off by a day unless you reduce the reference date by 1.
The construction "df1['BiMonth'] = df1['BiMonth'].dt.to_period('D')" cleans out the time so you just have days.
I am interacting through a number of csv files and want to append the mean temperatures to a blank csv file. How do you create an empty csv file with pandas?
for EachMonth in MonthsInAnalysis:
TheCurrentMonth = pd.read_csv('MonthlyDataSplit/Day/Day%s.csv' % EachMonth)
MeanDailyTemperaturesForCurrentMonth = TheCurrentMonth.groupby('Day')['AirTemperature'].mean().reset_index(name='MeanDailyAirTemperature')
with open('my_csv.csv', 'a') as f:
df.to_csv(f, header=False)
So in the above code how do I create the my_csv.csv prior to the for loop?
Just a note I know you can create a data frame then save the data frame to csv but I am interested in whether you can skip this step.
In terms of context I have the following csv files:
Each of which have the following structure:
The Day column reads up to 30 days for each file.
I would like to output a csv file that looks like this:
But obviously includes all the days for all the months.
My issue is that I don't know which months are included in each analysis hence I wanted to use a for loop that used a list that has that information in it to access the relevant csvs, calculate the mean temperature then save it all into one csv.
Input as text:
Unnamed: 0 AirTemperature AirHumidity SoilTemperature SoilMoisture LightIntensity WindSpeed Year Month Day Hour Minute Second TimeStamp MonthCategorical TimeOfDay
6 6 18 84 17 41 40 4 2016 1 1 6 1 1 10106 January Day
7 7 20 88 22 92 31 0 2016 1 1 7 1 1 10107 January Day
8 8 23 1 22 59 3 0 2016 1 1 8 1 1 10108 January Day
9 9 23 3 22 72 41 4 2016 1 1 9 1 1 10109 January Day
10 10 24 63 23 83 85 0 2016 1 1 10 1 1 10110 January Day
11 11 29 73 27 50 1 4 2016 1 1 11 1 1 10111 January Day
Just open the file in write mode to create it.
with open('my_csv.csv', 'w'):
pass
Anyway I do not think you should be opening and closing the file so many times. You'd better open the file once, write several times.
with open('my_csv.csv', 'w') as f:
for EachMonth in MonthsInAnalysis:
TheCurrentMonth = pd.read_csv('MonthlyDataSplit/Day/Day%s.csv' % EachMonth)
MeanDailyTemperaturesForCurrentMonth = TheCurrentMonth.groupby('Day')['AirTemperature'].mean().reset_index(name='MeanDailyAirTemperature')
df.to_csv(f, header=False)
Creating a blank csv file is as simple as this one
import pandas as pd
pd.DataFrame({}).to_csv("filename.csv")
I would do it this way: first read up all your CSV files (but only the columns that you really need) into one DF, then make groupby(['Year','Month','Day']).mean() and save resulting DF into CSV file:
import glob
import pandas as pd
fmask = 'MonthlyDataSplit/Day/Day*.csv'
df = pd.concat((pd.read_csv(f, sep=',', usecols=['Year','Month','Day','AirTemperature']) for f in glob.glob(fmask)))
df.groupby(['Year','Month','Day']).mean().to_csv('my_csv.csv')
and if want to ignore the year:
import glob
import pandas as pd
fmask = 'MonthlyDataSplit/Day/Day*.csv'
df = pd.concat((pd.read_csv(f, sep=',', usecols=['Month','Day','AirTemperature']) for f in glob.glob(fmask)))
df.groupby(['Month','Day']).mean().to_csv('my_csv.csv')
Some details:
(pd.read_csv(f, sep=',', usecols=['Month','Day','AirTemperature']) for f in glob.glob('*.csv'))
will generate tuple of data frames from all your CSV files
pd.concat(...)
will concatenate them into resulting single DF
df.groupby(['Year','Month','Day']).mean()
will produce wanted report as a data frame, which might be saved into new CSV file:
.to_csv('my_csv.csv')
The problem is a little unclear, but assuming you have to iterate month by month, and apply the groupby as stated just use:
#Before loops
dflist=[]
Then in each loop do something like:
dflist.append(MeanDailyTemperaturesForCurrentMonth)
Then at the end:
final_df = pd.concat([dflist], axis=1)
and this will join everything into one dataframe.
Look at:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html
http://pandas.pydata.org/pandas-docs/stable/merging.html
You could do this to create an empty CSV and add columns without an index column as well.
import pandas as pd
df=pd.DataFrame(columns=["Col1","Col2","Col3"]).to_csv(filename.csv,index=False)