Data Formatting pandas - python

I am trying to enter a line of code that creates a row for the index 31st January 1995. I am unable to get the row to look like 31/01/1995; instead the output is 1995-01-31 00:00:00.
My original data is in a dataframe called MainData.
I am trying to add a row at the top for 31st January 1995 in the same format as the data below.
My code is
MainData.loc[pd.to_datetime('31/01/1995',format='%d/%m/%Y'),:] = [100 for number in range(7)]
MainData
Please let me know if there is a way to reformat this to 31/01/1995.
Thanks in advance.

#Making the data look more normal by removing the first column index level
MainData = MainData.rename(columns=MainData.iloc[0])
MainData = MainData.iloc[1:]
#Re-adjusting the Index to a datetime format
MainData['DateAdjusted'] = MainData.index
MainData = MainData.reset_index(drop=True)
MainData['DateAdjusted'] = pd.to_datetime(MainData['DateAdjusted'],dayfirst=True)
#Just renaming the Column and converting the index back to Date
MainData.rename(columns={'DateAdjusted':'Date'},inplace=True)
MainData.index = MainData['Date']
del MainData['Date']
#Defining the date for the row I want to add
import datetime
InitialDate = "31/01/1995"
format_str = '%d/%m/%Y'
datetime_obj = datetime.datetime.strptime(InitialDate, format_str)
print(datetime_obj.date())
MainData.loc[datetime_obj,:] = [100 for number in range(7)]
MainData = MainData.sort_index(ascending=True)
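On the display format itself: a pandas DatetimeIndex always prints in ISO order (1995-01-31), no matter which format string parsed it. If the goal is purely presentation, one option is to convert the index to strings with strftime. A minimal sketch (note the index then becomes plain strings, so date-based slicing and sorting stop working):

MainData.index = MainData.index.strftime('%d/%m/%Y')
MainData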

Related

How do I make this function iterable (getting IndexError)

I am fairly new to python and coding in general.
I have a big data file that provides daily data for the period 2011-2018 for a number of stock tickers (~300).
The data is a .csv file with circa 150k rows and looks as follows (short example):
Date,Symbol,ShortExemptVolume,ShortVolume,TotalVolume
20110103,AAWW,0.0,28369,78113.0
20110103,AMD,0.0,3183556,8095093.0
20110103,AMRS,0.0,14196,18811.0
20110103,ARAY,0.0,31685,77976.0
20110103,ARCC,0.0,177208,423768.0
20110103,ASCMA,0.0,3930,26527.0
20110103,ATI,0.0,193772,301287.0
20110103,ATSG,0.0,23659,72965.0
20110103,AVID,0.0,7211,18896.0
20110103,BMRN,0.0,21740,213974.0
20110103,CAMP,0.0,2000,11401.0
20110103,CIEN,0.0,625165,1309490.0
20110103,COWN,0.0,3195,24293.0
20110103,CSV,0.0,6133,25394.0
I have a function that allows me to filter for a specific symbol and get 10 observations before and after a specified date (could be any date between 2011 and 2018).
import pandas as pd
from datetime import datetime
import urllib
import datetime

def get_data(issue_date, stock_ticker):
    df = pd.read_csv(r'D:\Project\Data\Short_Interest\exampledata.csv')
    df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")
    d = df
    df = pd.DataFrame(d)
    short = df.loc[df.Symbol.eq(stock_ticker)]
    # get the index of the row of interest
    ix = short[short.Date.eq(issue_date)].index[0]
    # get the positional index for that row's label
    iloc_ix = short.index.get_loc(ix)
    # get the +/-10 iloc rows (+11 because slice ends are exclusive), basically 10 trading days either side
    short_data = short.iloc[iloc_ix-10: iloc_ix+11]
    return [short_data]
I want to create a script that iterates over a list of 'issue_dates' and 'stock_tickers'. The list (a .csv) looks as follows:
ARAY,07/08/2017
ARAY,24/04/2014
ACETQ,16/11/2015
ACETQ,16/11/2015
NVLNA,15/08/2014
ATSG,29/09/2017
ATI,24/05/2016
MDRX,18/06/2013
MDRX,18/06/2013
AMAGX,10/05/2017
AMAGX,14/02/2014
AMD,14/09/2016
To break down my problem and question I would like to know how to do the following:
First, how do I load the inputs?
Second, how do I call the function on each input?
And last, how do I accumulate all the function returns in one dataframe?
To load the inputs and call the function for each row, iterate over the csv file, pass each row's values to the function, and accumulate the resulting Series in a list.
I modified your function a bit: I removed the DataFrame creation so it is only done once, and added a try/except block to account for missing dates or tickers (your example data didn't match up too well). The dates in the second csv look like day/month/year, so I converted them with that format.
import pandas as pd
import datetime, csv

def get_data(df, issue_date, stock_ticker):
    '''Return the rows for the ticker centered on the issue date.
    '''
    short = df.loc[df.Symbol.eq(stock_ticker)]
    # get the index of the row of interest
    try:
        ix = short[short.Date.eq(issue_date)].index[0]
        # get the positional index for that row's label
        iloc_ix = short.index.get_loc(ix)
        # get the +/-10 iloc rows (+11 because slice ends are exclusive), basically 10 trading days either side
        short_data = short.iloc[iloc_ix-10: iloc_ix+11]
    except IndexError:
        msg = f'no data for {stock_ticker} on {issue_date}'
        #log.info(msg)
        print(msg)
        short_data = None
    return short_data

df = pd.read_csv(datafile)  # datafile: path to the daily data csv
df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")

results = []
with open('issues.csv') as issues:
    for ticker, date in csv.reader(issues):
        day, month, year = map(int, date.split('/'))
        # dt = datetime.datetime.strptime(date, r'%d/%m/%Y')
        date = datetime.date(year, month, day)
        s = get_data(df, date, ticker)
        results.append(s)
        # print(s)
Creating a single DataFrame or table for all that info may be problematic, especially since the date ranges are all different. That is probably worth a separate question; its MCVE (minimal, complete, verifiable example) should include a few small pandas Series with a couple of different date ranges and tickers.
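If a single table is still wanted despite that caveat, a minimal sketch continuing the script above (pandas is already imported there) is to stack whatever came back, skipping the None placeholders:

combined = pd.concat([r for r in results if r is not None], ignore_index=True)
print(combined.shape)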

split the date range into multiple ranges

I have data in CSV like this:
1940-10-01,somevalue
1940-11-02,somevalue
1940-11-03,somevalue
1940-11-04,somevalue
1940-12-05,somevalue
1940-12-06,somevalue
1941-01-07,somevalue
1941-02-08,somevalue
1941-03-09,somevalue
1941-05-01,somevalue
1941-06-02,somevalue
1941-07-03,somevalue
1941-10-04,somevalue
1941-12-05,somevalue
1941-12-06,somevalue
1942-01-07,somevalue
1942-02-08,somevalue
1942-03-09,somevalue
I want to separate out the dates from 1 October of each year to 31 March of the next year, for all the data. So for the data above the output will be:
1940/1941:
1940-11-02,somevalue
1940-11-03,somevalue
1940-11-04,somevalue
1940-12-05,somevalue
1940-12-06,somevalue
1941-01-07,somevalue
1941-02-08,somevalue
1941-03-09,somevalue
1941/1942:
1941-10-04,somevalue
1941-12-05,somevalue
1941-12-06,somevalue
1942-01-07,somevalue
1942-02-08,somevalue
1942-03-09,somevalue
1942-10-01,somevalue
My attempt so far is:
import csv
from datetime import datetime

with open('data.csv','r') as f:
    data = list(csv.reader(f))

quaters = []
year = datetime.strptime(data[0][0], '%Y-%m-%d').year
for each in data:
    date = datetime.strptime(each[0], '%Y-%m-%d')
    print(each)
    if (date >= datetime(year=date.year, month=10, day=1) and date <= datetime(year=date.year+1, month=3, day=31)):
        middle_quaters[-1].append(each)
    if year != date.year:
        quaters.append([])
But I am not getting expected output. I want to store each range of dates in separate list.
I would use a pandas dataframe to do this; it would be easier. Follow this:
Pandas: Selecting DataFrame rows between two dates (Datetime Index)
So for your case:

df = pd.read_csv("data.csv", index_col=0, parse_dates=True)
df.loc[startDate : endDate]

# you can walk through a bunch of ranges like so:
listOfDateRanges = [(), (), ()]
for date_range in listOfDateRanges:
    df.loc[date_range[0] : date_range[1]]
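Building on that, here is a minimal sketch of the actual October-to-March split (assuming the two-column CSV shown above, with no header row): tag each row with the year its season starts in, then group.

import pandas as pd

df = pd.read_csv("data.csv", header=None, names=["date", "value"], parse_dates=["date"])
# keep only October-March rows
winter = df[(df["date"].dt.month >= 10) | (df["date"].dt.month <= 3)].copy()
# October-December rows belong to the season starting that year,
# January-March rows to the season that started the year before
start = winter["date"].dt.year.where(winter["date"].dt.month >= 10, winter["date"].dt.year - 1)
winter["season"] = start.astype(str) + "/" + (start + 1).astype(str)
for season, group in winter.groupby("season"):
    print(season + ":")
    print(group[["date", "value"]].to_string(index=False))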
Without external packages: create a lookup keyed on the field of choice, then turn each date into an int and compare with less-than and greater-than to establish the range.
import re
data = '''1940-10-01,somevalue
1940-11-02,somevalue
1940-11-03,somevalue
1940-11-04,somevalue
1940-12-05,somevalue
1940-12-06,somevalue
1941-01-07,somevalue
1941-02-08,somevalue
1941-03-09,somevalue
1941-05-01,somevalue
1941-06-02,somevalue
1941-07-03,somevalue
1941-10-04,somevalue
1941-12-05,somevalue
1941-12-06,somevalue
1942-01-07,somevalue
1942-02-08,somevalue
1942-03-09,somevalue'''
lookup = {}
lines = data.split('\n')
for line in lines:
    d = re.sub(r'-', '', line.split(',')[0])
    lookup[d] = line

dates = sorted(lookup.keys())
_in = 19401201
out = 19411004
outfile = []
for date in dates:
    if int(date) > _in and int(date) < out:
        outfile.append(lookup[date])
for l in outfile:
    print(l)
For this purpose you can use the pandas library. Here is sample code:
import pandas as pd
df = pd.read_csv('so.csv', parse_dates=['timestamp']) #timestamp is your time column
current_year, next_year = 1940, 1941
df = df.query(f'(timestamp >= "{current_year}-10-01") & (timestamp <= "{next_year}-03-31")')
print (df)
This gives the following result on your data:
timestamp value
0 1940-10-01 somevalue
1 1940-11-02 somevalue
2 1940-11-03 somevalue
3 1940-11-04 somevalue
4 1940-12-05 somevalue
5 1940-12-06 somevalue
6 1941-01-07 somevalue
7 1941-02-08 somevalue
8 1941-03-09 somevalue
Hope this helps!
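To cover every season in the file rather than one hard-coded pair of years, the same query can run in a loop. A sketch, assuming df still holds the full file with its parsed timestamp column (the snippet above reassigns df, so re-read it first if needed):

for current_year in range(df['timestamp'].dt.year.min(), df['timestamp'].dt.year.max() + 1):
    next_year = current_year + 1
    season = df.query(f'(timestamp >= "{current_year}-10-01") & (timestamp <= "{next_year}-03-31")')
    if not season.empty:
        print(f'{current_year}/{next_year}:')
        print(season)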

How to change value into date format and How to change value into time format

I have two columns: one with values that represent a time and another with values that represent a date (both stored as floats). I have the following data in each column:
df['Time']
540.0
630.0
915.0
1730.0
2245.0
df['Date']
14202.0
14202.0
14203.0
14203.0
I need to create new columns with the correct data format for these two columns, to be able to analyze data with date and time in distinct columns.
For ['Time'] I need to convert the format to:
540.0 = 5h40 OR TO 5.40 am
2245.0 = 22h45 OR TO 10.45 pm
For ['Date'], I need to convert the format to:
Each number represents a count of days, where day 0 = 01-01-1980.
So if I add 14202.0 days to 01-01-1980 I get 18-11-2018,
and if I add 14203.0 days to 01-01-1980 I get 19-11-2018.
This is possible to do in Excel, but I need a way to do it in Python.
I tried different types of code but nothing works; for example, one of the codes I tried is below:
# creating a variable with the data in column ['Date'], adding the days into the date:
Time1 = pd.to_datetime(df["Date"])
# when printed, the 14203 in row 55384 ends up as nanoseconds on the epoch date, which is not what I want:
print(Time1.loc[[55384]])
55384   1970-01-01 00:00:00.000014203
Name: Date, dtype: datetime64[ns]
# printing the same row (55384) to check the value 14203.0 that was added above:
print(df["Date"].loc[[55384]])
55384    14203.0
Name: Date, dtype: float64
For ['Time'] I have the same problem: I can't have a time without a date. I also tried to insert ':', but that is not working even after converting the data type to string.
I hope that someone can help me with this matter. If anything is unclear please let me know; sometimes it is not easy to explain.
Regarding the time conversion:

# change to integer, then to string so the %H%M format can parse it
tt = [str(int(i)) for i in df['Time']]
# convert to time
time_ = pd.to_datetime(tt, format='%H%M').time
# convert from 24-hour to 12-hour time format
[t.strftime("%I:%M %p") for t in time_]
Solving problems with Date

from datetime import datetime
from datetime import timedelta

startdate_string = "1980/01/01"  # defining the start date in string format
# changing the string date to a date object using the strptime function
startdate_object = datetime.strptime(startdate_string, "%Y/%m/%d").date()
startdate_object  # print startdate_object to check the date

Creating a list to add a new date-format column to the dataframe:

import math

datenew = []
dates = df['UTS_Date']  # data from the original column 'UTS_Date'
for values in dates:
    # using an if statement to accept null values, appending them to the new list
    if math.isnan(values):
        datenew.append('NaN')
        continue
    # add a delta (the value in each row of the column) to the reference date (startdate_object)
    currentdate1 = startdate_object + timedelta(days=float(values))
    # convert the date to a string before appending, so no datetime.date wrapper ends up in the list
    datenew.append(str(currentdate1))

print(len(datenew))  # check the length of datenew, to ensure all rows of the data are in the new list
df.insert(3, 'Date', datenew)  # creating a new date-format column in the dataframe

Solving problems with Time

timenew = []  # creating a new list
times = df['Time']  # the df['Time'] column of the dataframe
i = 0  # variable to find the location of times that are >= 2400

def Normalize_time(val):
    offset = 0
    if val >= 2400:
        offset = 1
    # converting val into an integer, to remove decimal places
    hours = int(val / 100)
    # remove the hours, keeping just the minutes
    minutes = int(val) - hours * 100
    # wrap hours above 24 back into range
    hours = (hours % 23) - offset
    # zfill pads hours and minutes with zeros on the left until each has two characters
    return str(hours).zfill(2) + ':' + str(minutes).zfill(2)

A for statement to add all the values to the new list, using Normalize_time():

for values in times:
    # using an if statement to accept null values, appending them to the new list
    if math.isnan(values):
        timenew.append('NaN')
        continue
    # run each value through Normalize_time()
    timestr = Normalize_time(values)
    # append each converted value to the new list
    timenew.append(timestr)

print(len(timenew))  # check the length of timenew, to ensure all rows of the data are in the new list
df.insert(4, 'ODTime', timenew)  # creating a new column in the dataframe
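As a shorter alternative sketch for the date part, pandas can apply the day counts and the 1980 origin in one call (assuming the float day counts are in df['Date']; missing rows come out as NaT rather than the string 'NaN'):

df['DateFormatted'] = pd.to_datetime(df['Date'], unit='D', origin=pd.Timestamp('1980-01-01'))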

Python - Pandas library returns wrong column values after parsing a CSV file

SOLVED: Found the solution by myself. It turns out that when you want to retrieve specific columns by their names, you should pass the names in the order they appear inside the csv (which is really stupid for a library that is intended to save a developer some parsing time, IMO). Correct me if I am wrong, but I don't see an option to get a specific column's values by its name if the columns are in a different order...
I am trying to read a comma separated value file with Python and then parse it using the pandas library. Since the file has many values (columns) that are not needed, I make a list of the column names I do need.
Here's a look at the csv file format.
Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,Attendance,Referee,HS,AS,HST,AST,HHW,AHW,HC,AC,HF,AF,HO,AO,HY,AY,HR,AR,HBP,ABP,GBH,GBD,GBA,IWH,IWD,IWA,LBH,LBD,LBA,SBH,SBD,SBA,WHH,WHD,WHA
E0,19/08/00,Charlton,Man City,4,0,H,2,0,H,20043,Rob Harris,17,8,14,4,2,1,6,6,13,12,8,6,1,2,0,0,10,20,2,3,3.2,2.2,2.9,2.7,2.2,3.25,2.75,2.2,3.25,2.88,2.1,3.2,3.1
E0,19/08/00,Chelsea,West Ham,4,2,H,1,0,H,34914,Graham Barber,17,12,10,5,1,0,7,7,19,14,2,3,1,2,0,0,10,20,1.47,3.4,5.2,1.6,3.2,4.2,1.5,3.4,6,1.5,3.6,6,1.44,3.6,6.5
E0,19/08/00,Coventry,Middlesbrough,1,3,A,1,1,D,20624,Barry Knight,6,16,3,9,0,1,8,4,15,21,1,3,5,3,1,0,75,30,2.15,3,3,2.2,2.9,2.7,2.25,3.2,2.75,2.3,3.2,2.75,2.3,3.2,2.62
E0,19/08/00,Derby,Southampton,2,2,D,1,2,A,27223,Andy D'Urso,6,13,4,6,0,0,5,8,11,13,0,2,1,1,0,0,10,10,2,3.1,3.2,1.8,3,3.5,2.2,3.25,2.75,2.05,3.2,3.2,2,3.2,3.2
E0,19/08/00,Leeds,Everton,2,0,H,2,0,H,40010,Dermot Gallagher,17,12,8,6,0,0,6,4,21,20,6,1,1,3,0,0,10,30,1.65,3.3,4.3,1.55,3.3,4.5,1.55,3.5,5,1.57,3.6,5,1.61,3.5,4.5
E0,19/08/00,Leicester,Aston Villa,0,0,D,0,0,D,21455,Mike Riley,5,5,4,3,0,0,5,4,12,12,1,4,2,3,0,0,20,30,2.15,3.1,2.9,2.3,2.9,2.5,2.35,3.2,2.6,2.25,3.25,2.75,2.4,3.25,2.5
E0,19/08/00,Liverpool,Bradford,1,0,H,0,0,D,44183,Paul Durkin,16,3,10,2,0,0,6,1,8,8,5,0,1,1,0,0,10,10,1.25,4.1,7.2,1.25,4.3,8,1.35,4,8,1.36,4,8,1.33,4,8
This list is passed to pandas.read_csv()'s names parameter.
See code.
# Returns an array of the column names needed for our raw data table
def cols_to_extract():
    cols_to_use = [None] * RawDataCols.COUNT
    cols_to_use[RawDataCols.DATE] = 'Date'
    cols_to_use[RawDataCols.HOME_TEAM] = 'HomeTeam'
    cols_to_use[RawDataCols.AWAY_TEAM] = 'AwayTeam'
    cols_to_use[RawDataCols.FTHG] = 'FTHG'
    cols_to_use[RawDataCols.HG] = 'HG'
    cols_to_use[RawDataCols.FTAG] = 'FTAG'
    cols_to_use[RawDataCols.AG] = 'AG'
    cols_to_use[RawDataCols.FTR] = 'FTR'
    cols_to_use[RawDataCols.RES] = 'Res'
    cols_to_use[RawDataCols.HTHG] = 'HTHG'
    cols_to_use[RawDataCols.HTAG] = 'HTAG'
    cols_to_use[RawDataCols.HTR] = 'HTR'
    cols_to_use[RawDataCols.ATTENDANCE] = 'Attendance'
    cols_to_use[RawDataCols.HS] = 'HS'
    cols_to_use[RawDataCols.AS] = 'AS'
    cols_to_use[RawDataCols.HST] = 'HST'
    cols_to_use[RawDataCols.AST] = 'AST'
    cols_to_use[RawDataCols.HHW] = 'HHW'
    cols_to_use[RawDataCols.AHW] = 'AHW'
    cols_to_use[RawDataCols.HC] = 'HC'
    cols_to_use[RawDataCols.AC] = 'AC'
    cols_to_use[RawDataCols.HF] = 'HF'
    cols_to_use[RawDataCols.AF] = 'AF'
    cols_to_use[RawDataCols.HFKC] = 'HFKC'
    cols_to_use[RawDataCols.AFKC] = 'AFKC'
    cols_to_use[RawDataCols.HO] = 'HO'
    cols_to_use[RawDataCols.AO] = 'AO'
    cols_to_use[RawDataCols.HY] = 'HY'
    cols_to_use[RawDataCols.AY] = 'AY'
    cols_to_use[RawDataCols.HR] = 'HR'
    cols_to_use[RawDataCols.AR] = 'AR'
    return cols_to_use
# Extracts raw data from the raw data csv and populates the raw match data table in the database
def extract_raw_data(csv):
    # Clear the database table if it has any logs
    # if MatchRawData.objects.count != 0:
    #     MatchRawData.objects.delete()
    cols_to_use = cols_to_extract()
    # Read and parse the csv file
    parsed_csv = pd.read_csv(csv, delimiter=',', names=cols_to_use, header=0)
    for col in cols_to_use:
        values = parsed_csv[col].values
        for val in values:
            print(str(col) + ' --------> ' + str(val))
Where RawDataCols is an IntEnum:
class RawDataCols(IntEnum):
    DATE = 0
    HOME_TEAM = 1
    AWAY_TEAM = 2
    FTHG = 3
    HG = 4
    FTAG = 5
    AG = 6
    FTR = 7
    RES = 8
    ...
The column names are obtained using it; that part of the code works fine. The correct column name is obtained, but after trying to get its values using
values = parsed_csv[col].values
pandas returns the values of the wrong column. The wrong column is around 13 positions away from the one I am trying to get. What am I missing?
You can select columns by name. Just use the following line:
values = parsed_csv[["Column Name","Column Name2"]]
Or you can select by index:
cols = [1,2,3,4]
values = parsed_csv[parsed_csv.columns[cols]]
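On the column-order complaint in the SOLVED note: read_csv's usecols parameter selects columns by header name, and the order of the list you pass does not have to match the file. A small sketch (the filename and column subset here are illustrative):

import pandas as pd

wanted = ['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR']
# usecols matches names against the header row, so list order does not matter
df = pd.read_csv('matches.csv', usecols=wanted)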

Pandas and stocks: From daily values (in columns) to monthly values (in rows)

I am having trouble reformatting a dataframe.
My input has rows of daily values and a column per symbol (each symbol has its own dates and values):
Input
code to generate input
data = [("01-01-2010", 15, 10), ("02-01-2010", 16, 11), ("03-01-2010", 16.5, 10.5)]
labels = ["date", "AAPL", "AMZN"]
df_input = pd.DataFrame.from_records(data, columns=labels)
The needed output (a new row for each month and stock) is:
Needed output
code to generate output
data = [("01-01-2010","29-01-2010", "AAPL", 15, 20), ("01-01-2010","29-01-2010", "AMZN", 10, 15),("02-02-2010","30-02-2010", "AAPL", 20, 32)]
labels = ['bd start month', 'bd end month','stock', 'start_month_value', "end_month_value"]
df = pd.DataFrame.from_records(data, columns=labels)
Meaning (pseudocode):
1. For each row take only the non-NaN values to create a new "row" (maybe a dictionary with the date as the index and the [stock, value] as the value).
2. Take only rows that fall on the business start of month or the business end of month.
3. Write those rows to a new dataframe.
I have read several posts like this and this and several more.
All deal with dataframes of the same "type" and just resample, while I need to change the structure...
My code so far:
import collections

# creating the new index with business days
df1 = pd.DataFrame(range(10000), index=pd.date_range(df.iloc[0].name, periods=10000, freq='D'))
from pandas.tseries.offsets import CustomBusinessMonthBegin
from pandas.tseries.holiday import USFederalHolidayCalendar
bmth_us = CustomBusinessMonthBegin(calendar=USFederalHolidayCalendar())
df2 = df1.resample(bmth_us).mean()
# creating the new index by intersecting my old (daily) index with the monthly index
new_index = df.index.intersection(df2.index)
# selecting only the rows I want
df = df.loc[new_index]
# creating a dict that will be my new dataset
new_dict = collections.OrderedDict()
# iterating over the rows and adding to the dictionary
for index, row in df.iterrows():
    date = df.loc[index].name
    # values are the non-null values
    values = df.loc[index][~df.loc[index].isnull().values]
    new_dict[date] = values
# from dict to list
data = []
for key, values in new_dict.items():
    for i in range(0, len(values)):
        date = key
        stock_name = str(values.index[i])
        stock_value = values.iloc[i]
        row = (key, stock_name, stock_value)
        data.append(row)
# from the list to a df
labels = ['date', 'stock', 'value']
df = pd.DataFrame.from_records(data, columns=labels)
df.to_excel("migdal_format.xls")
Current output I get
One big problem:
I only get the value of each stock on the start-of-month day; I need both the start and end values so I can calculate the stock's gain for the month.
One smaller problem:
I am sure this is not the cleanest and fastest code :)
Thanks a lot!
So I have found a way: loop through each column, group by month, take the first and last value I have in that month, and calculate the return.
df_migdal = pd.DataFrame()
for col in df_input.columns[0:]:
    stock_position = df_input.loc[:, col]
    name = stock_position.name
    name = re.sub('[^a-zA-Z]+', '', name)
    name = name[0:-4]
    stock_position = stock_position.groupby([pd.TimeGrouper('M')]).agg(['first', 'last'])
    stock_position["name"] = name
    stock_position["return"] = ((stock_position["last"] / stock_position["first"]) - 1) * 100
    stock_position.dropna(inplace=True)
    df_migdal = df_migdal.append(stock_position)
df_migdal = df_migdal.round(decimals=2)
I tried a way cooler way, but did not know how to handle the MultiIndex I got... I needed, for each column, to take the two sub-columns and create a third one from some lambda function.
df_input.groupby([pd.TimeGrouper('M')]).agg(['first', 'last'])
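For what it's worth, that MultiIndex can be flattened with stack. A sketch, assuming df_input is indexed by parsed dates (pd.Grouper(freq='M') is the current spelling of the deprecated TimeGrouper):

monthly = df_input.groupby(pd.Grouper(freq='M')).agg(['first', 'last'])
# columns are now a MultiIndex of (ticker, 'first'/'last'); stack the ticker level into rows
tidy = monthly.stack(level=0)
tidy['return'] = (tidy['last'] / tidy['first'] - 1) * 100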
