I am trying to split a data set for training and testing using Pandas.
data = pd.read_csv("housingdata.csv", header=None)
train = data.sample(frac=0.6)
train.reindex()
test = data.loc[~data.index.isin(train.index)]
print(train)
print(test)
When I print the data, I get:
0 1 2 3 4
9 0.17004 12.5 7.87 0 0.524
1 0.02731 0.0 7.07 0 0.469
5 0.02985 0.0 2.18 0 0.458
3 0.03237 0.0 2.18 0 0.458
7 0.14455 12.5 7.87 0 0.524
6 0.08829 12.5 7.87 0 0.524
0 1 2 3 4
0 0.00632 18.0 2.31 0 0.538
2 0.02729 0.0 7.07 0 0.469
4 0.06905 0.0 2.18 0 0.458
8 0.21124 12.5 7.87 0 0.524
As you can see, the row indices are shuffled. How can I re-index the rows in both data sets?
This, however, does not change global settings. E.g.,
train.iloc[0,4]
gives 0.524
As @EdChum's comments point out, it's not exactly clear what behavior you're looking for. But if all you want is to give both new dataframes indices running 0, 1, 2 ... n, then you can use reset_index():
train.reset_index(inplace=True, drop=True)
test.reset_index(inplace=True, drop=True)
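To make the split-then-renumber idea concrete, here is a minimal sketch. The small frame below is a hypothetical stand-in for housingdata.csv, and random_state is only there to make the sketch repeatable; neither comes from the original question.

```python
import pandas as pd

# Hypothetical stand-in for the first rows of housingdata.csv
data = pd.DataFrame({
    0: [0.00632, 0.02731, 0.02729, 0.03237, 0.06905,
        0.02985, 0.08829, 0.14455, 0.21124, 0.17004],
    1: [18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5],
})

# Draw 60% of the rows at random for training
train = data.sample(frac=0.6, random_state=0)
# Every row that was not drawn becomes the test set
test = data.drop(train.index)

# Renumber both frames from 0; drop=True discards the old labels
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)
```

Assigning the result of reset_index() back, rather than using inplace=True, is just one way to write it; both give frames indexed from 0.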
First question here and a long one - there are a couple of things I am struggling with regarding merging and formatting my dataframes. I have some half-working solutions, but I am unsure whether they are the best possible for what I want.
Here are the standard formats of the dataframes I am merging with pandas.
df1 =
RT %Area RRT
0 4.83 5.257 0.509
1 6.76 0.424 0.712
2 7.27 0.495 0.766
3 7.70 0.257 0.811
4 7.79 0.122 0.821
5 9.49 92.763 1.000
6 11.40 0.681 1.201
df2=
RT %Area RRT
0 4.83 0.731 0.508
1 6.74 1.243 0.709
2 7.28 0.109 0.766
3 7.71 0.287 0.812
4 7.79 0.177 0.820
5 9.50 95.824 1.000
6 11.31 0.348 1.191
7 11.40 1.166 1.200
8 12.09 0.113 1.273
df3 = ...
Currently I am using a reduce operation on pd.merge_ordered() like below to merge my dataframes (3+). This kind of yields what I want and was from a previous question (pandas three-way joining multiple dataframes on columns). I am merging on RRT, and want the indexes with the same RRT values to be placed on the same row - and if the RRT values are unique for that dataset I want a NaN for missing data from other datasets.
# The for loop I use to generate the list of formatted dataframes prior to merging
dfs = []
for entry in os.scandir(directory):
    if entry.path.endswith(".csv") and entry.is_file():
        entry = pd.read_csv(entry.path, header=None)
        # Block of formatting code removed
        dfs.append(entry.round(2))
from functools import reduce
dfs = [df1ar,df2ar,df3ar]
df_final = reduce(lambda left,right: pd.merge_ordered(left,right,on='RRT'), dfs)
cols = ['RRT', 'RT_x', '%Area_x', 'RT_y', '%Area_y', 'RT', '%Area']
df_final = df_final[cols]
print(df_final)
RRT RT_x %Area_x RT_y %Area_y RT %Area
0 0.508 NaN NaN 4.83 0.731 NaN NaN
1 0.509 4.83 5.257 NaN NaN 4.83 5.257
2 0.709 NaN NaN 6.74 1.243 NaN NaN
3 0.712 6.76 0.424 NaN NaN 6.76 0.424
4 0.766 7.27 0.495 7.28 0.109 7.27 0.495
5 0.811 7.70 0.257 NaN NaN 7.70 0.257
6 0.812 NaN NaN 7.71 0.287 NaN NaN
7 0.820 NaN NaN 7.79 0.177 NaN NaN
8 0.821 7.79 0.122 NaN NaN 7.79 0.122
9 1.000 9.49 92.763 9.50 95.824 9.49 92.763
10 1.191 NaN NaN 11.31 0.348 NaN NaN
11 1.200 NaN NaN 11.40 1.166 NaN NaN
12 1.201 11.40 0.681 NaN NaN 11.40 0.681
13 1.273 NaN NaN 12.09 0.113 NaN NaN
This works, but:
Can I insert a multiindex based on the filename of the dataframe that the data came from and place it above the corresponding columns? Like the suffix option, but related back to the filename and for more than two sets of data. Is this better done prior to merging, and if so, how do I do it? (I've included the for loop I use to create a list of tables prior to merging.)
Is this reduced merge_ordered the simplest way of doing this?
Can I do a similar merge with pd.merge_asof() and use the tolerance value to fine tune the merging based on the similarities between the RRT values? That is, can it be done without cutting off data from the longer dataframes?
I've tried the above and searched for answers, but I'm struggling to find the most efficient way to do everything I want.
concat = pd.concat(dfs, axis=1, keys=['A','B','C'])
concat_final = concat.round(3)
print(concat_final)
A B C
RT %Area RRT RT %Area RRT RT %Area RRT
0 4.83 5.257 0.509 4.83 0.731 0.508 4.83 5.257 0.509
1 6.76 0.424 0.712 6.74 1.243 0.709 6.76 0.424 0.712
2 7.27 0.495 0.766 7.28 0.109 0.766 7.27 0.495 0.766
3 7.70 0.257 0.811 7.71 0.287 0.812 7.70 0.257 0.811
4 7.79 0.122 0.821 7.79 0.177 0.820 7.79 0.122 0.821
5 9.49 92.763 1.000 9.50 95.824 1.000 9.49 92.763 1.000
6 11.40 0.681 1.201 11.31 0.348 1.191 11.40 0.681 1.201
7 NaN NaN NaN 11.40 1.166 1.200 NaN NaN NaN
8 NaN NaN NaN 12.09 0.113 1.273 NaN NaN NaN
I have also tried this - and I get the multiindex to denote which file (A,B,C, just as placeholders) it came from. However, it has obviously not merged based on the RRT value like I want.
Can I apply an operation to change this into a similar format to the pd.merge_ordered() format above? Would groupby() work?
Thanks!
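One possible direction, offered as a sketch and not a definitive answer: if each frame is indexed by RRT before concatenating, pd.concat(axis=1) aligns rows on RRT and the keys become the top level of the column multiindex. The frames below are shortened copies of df1 and df2, and the labels "A" and "B" are placeholders for filenames.

```python
import pandas as pd

# Shortened copies of df1 and df2 from the question
df1 = pd.DataFrame({
    "RT": [4.83, 6.76, 9.49],
    "%Area": [5.257, 0.424, 92.763],
    "RRT": [0.509, 0.712, 1.000],
})
df2 = pd.DataFrame({
    "RT": [4.83, 6.74, 9.50, 12.09],
    "%Area": [0.731, 1.243, 95.824, 0.113],
    "RRT": [0.508, 0.709, 1.000, 1.273],
})

# Placeholder labels; in practice these could be the source filenames
frames = {"A": df1, "B": df2}

# Index each frame by RRT so concat aligns rows on it (outer join);
# the dict keys become the top level of the column multiindex
merged = pd.concat(
    {name: df.set_index("RRT") for name, df in frames.items()},
    axis=1,
).sort_index()
```

This only lines up rows whose RRT values are exactly equal (1.000 here), so near matches such as 0.508 and 0.509 stay on separate rows, the same as with merge_ordered. It also assumes RRT values are unique within each file.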
I have the following NFL tracking data:
Event PlayId FrameId x-coord y-coord
0 Start 1 1 20.2 20.0
1 NaN 1 2 21.0 19.1
2 NaN 1 3 21.3 18.3
3 NaN 1 4 22.0 17.5
4 End 1 5 22.5 17.2
4 NaN 1 6 22.5 17.2
4 NaN 1 7 22.5 17.2
4 NaN 1 8 22.5 17.2
4 NaN 1 9 22.5 17.2
4 NaN 1 10 22.5 17.2
5 NaN 2 1 23.0 16.9
6 Start 2 2 23.6 16.7
7 End 2 3 25.1 34.1
8 NaN 2 4 25.9 34.2
10 NaN 3 1 22.7 34.2
11 Nan 3 2 21.5 34.5
12 NaN 3 3 21.1 37.3
13 Start 3 4 21.2 44.3
14 NaN 3 5 20.4 44.6
15 End 3 6 21.9 42.7
How can I filter this list to only get the rows in between the "Start" and "End" values for the Event column? To clarify, this is the data I want to filter for:
Event PlayId FrameId x-coord y-coord
0 Start 1 1 20.2 20.0
1 NaN 1 2 21.0 19.1
2 NaN 1 3 21.3 18.3
3 NaN 1 4 22.0 17.5
4 End 1 5 22.5 17.2
6 Start 2 2 23.6 16.7
7 End 2 3 25.1 34.1
13 Start 3 4 21.2 44.3
14 NaN 3 5 20.4 44.6
15 End 3 6 21.9 42.7
An explicit solution will not work because the actual dataset is very large and there is no way to predict where the Start and End values fall.
You can do this with a slice and ffill, then concat the pieces back together. Also, you have Nan in your df; should it be NaN?
df1=df.copy()
newdf=pd.concat([df1[df.Event.ffill()=='Start'],df1[df.Event=='End']]).sort_index()
newdf
Event PlayId FrameId x-coord y-coord
0 Start 1 1 20.2 20.0
1 NaN 1 2 21.0 19.1
2 NaN 1 3 21.3 18.3
3 NaN 1 4 22.0 17.5
4 End 1 5 22.5 17.2
6 Start 2 2 23.6 16.7
7 End 2 3 25.1 34.1
13 Start 3 4 21.2 44.3
14 NaN 3 5 20.4 44.6
15 End 3 6 21.9 42.7
Or
newdf=df[~((df.Event.ffill()=='End')&(df.Event.isna()))]
newdf
Event PlayId FrameId x-coord y-coord
0 Start 1 1 20.2 20.0
1 NaN 1 2 21.0 19.1
2 NaN 1 3 21.3 18.3
3 NaN 1 4 22.0 17.5
4 End 1 5 22.5 17.2
6 Start 2 2 23.6 16.7
7 End 2 3 25.1 34.1
13 Start 3 4 21.2 44.3
14 NaN 3 5 20.4 44.6
15 End 3 6 21.9 42.7
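To see why the ffill approach works, it may help to look at the forward-filled column on its own. This is a small sketch on made-up data shaped like the question's frame (the values are not from the real dataset).

```python
import numpy as np
import pandas as pd

# Made-up frame with the same shape as the tracking data
df = pd.DataFrame({
    "Event": ["Start", np.nan, "End", np.nan, np.nan, "Start", "End", np.nan],
    "PlayId": [1, 1, 1, 1, 2, 2, 2, 2],
    "FrameId": [1, 2, 3, 4, 1, 2, 3, 4],
})

# Carry the most recent event label forward over the NaN rows
last_event = df["Event"].ffill()

# Keep a row if the most recent event is Start (inside a block),
# or if the row is the End marker itself
keep = (last_event == "Start") | (df["Event"] == "End")
result = df[keep]
```

Rows after an End and before the next Start have "End" as their most recent event and no event of their own, so they are the ones dropped.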
Could anyone please help me read this file from a URL?
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data'
data = pandas.read_csv(url, sep=',', header = None)
I tried sep=',', sep=';' and sep='\t', but the data was not read correctly,
and with
data = pandas.read_csv(url, sep=' ', header = None)
I received an error:
pandas/parser.pyx in pandas.parser.TextReader.read (pandas/parser.c:7988)()
pandas/parser.pyx in pandas.parser.TextReader._read_low_memory (pandas/parser.c:8244)()
pandas/parser.pyx in pandas.parser.TextReader._read_rows (pandas/parser.c:8970)()
pandas/parser.pyx in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8838)()
pandas/parser.pyx in pandas.parser.raise_parser_error (pandas/parser.c:22649)()
CParserError: Error tokenizing data. C error: Expected 30 fields in line 2, saw 31
A similar question may have been asked before, but the accepted answer does not help me.
Any help reading this file from the URL provided would be appreciated.
BTW, I know there is Boston = load_boston() to read this data, but when I load it with that function, the 'MEDV' attribute is not included in the dataset.
There are multiple spaces used as a delimiter, that's why it's not working when you use a single space as a delimiter (sep=' ')
You can do it using sep=r'\s+' (a raw string, so the backslash is not treated as a string escape):
In [171]: data = pd.read_csv(url, sep=r'\s+', header = None)
In [172]: data.shape
Out[172]: (506, 14)
In [173]: data.head()
Out[173]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222.0 18.7 396.90 5.33 36.2
or using delim_whitespace=True (deprecated in recent pandas releases in favor of sep=r'\s+'):
In [174]: data = pd.read_csv(url, delim_whitespace=True, header = None)
In [175]: data.shape
Out[175]: (506, 14)
In [176]: data.head()
Out[176]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222.0 18.7 396.90 5.33 36.2
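The effect of the whitespace separator can be checked without a network connection. The sketch below parses a few hand-written lines in the same style as housing.data (leading spaces, runs of spaces between fields); the sample text is an assumption modelled on the file, not a download of it.

```python
import io

import pandas as pd

# Hand-written sample: fields separated by runs of spaces, with leading spaces
raw = (
    " 0.00632  18.00   2.310  0  0.5380\n"
    " 0.02731   0.00   7.070  0  0.4690\n"
    " 0.02729   0.00   7.070  0  0.4690\n"
)

# r"\s+" treats any run of whitespace as a single delimiter
data = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None)
```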
I have these two DataFrames:
Seasonal_Component:
# DataFrame that has the seasonal component of a time series
Date
2014-12 -1.08
2015-01 -0.28
2015-02 0.15
2015-03 0.46
2015-04 0.48
2015-05 0.37
2015-06 0.20
2015-07 0.15
2015-08 0.12
2015-09 -0.02
2015-10 -0.17
2015-11 -0.39
Prediction_df:
# DataFrame with the prediction of the trend of that same time series
Prediction MAPE Score
2015-11-01 7.93 1.83 1
2015-12-01 7.93 1.67 1
2016-01-01 7.92 1.71 1
2016-02-01 7.95 1.84 1
2016-03-01 7.94 1.53 1
2016-04-01 7.87 1.45 1
2016-05-01 7.91 1.53 1
2016-06-01 7.87 1.40 1
2016-07-01 7.84 1.40 1
2016-08-01 7.89 1.77 1
2016-09-01 7.87 1.99 1
What I need to do:
For each row of Prediction_df, find the Seasonal_Component entry with the same month and add that seasonal component to the prediction, so that Prediction_df looks like this:
Prediction MAPE Score
2015-11-01 7.54 1.83 1
2015-12-01 6.85 1.67 1
2016-01-01 7.64 1.71 1
2016-02-01 8.10 1.84 1
2016-03-01 8.40 1.53 1
2016-04-01 8.35 1.45 1
2016-05-01 8.28 1.53 1
2016-06-01 8.07 1.40 1
2016-07-01 7.99 1.40 1
2016-08-01 8.01 1.77 1
2016-09-01 7.85 1.99 1
Can anyone enlighten me?
I'm already at the "almost mad" stage trying to solve this.
EDIT
Important note to make it clearer: I need to disregard the year and consider only the month when summing. Something like "every time an April appears (it doesn't matter whether it is 2006 or 2025), I need to add the April value of the Seasonal_Component frame".
Consider a data frame merge on the date fields (month values), then a simple addition of the two fields. The date fields may require conversion from string values:
import datetime as dt
...
# IF DATES ARE REGULAR COLUMNS
seasonal_component['Date'] = pd.to_datetime(seasonal_component['Date'])
seasonal_component['Month'] = seasonal_component['Date'].dt.month
predict_df['Date'] = pd.to_datetime(predict_df['Date'])
predict_df['Month'] = predict_df['Date'].dt.month
# IF DATES ARE INDICES
seasonal_component.index = pd.to_datetime(seasonal_component.index)
seasonal_component['Month'] = seasonal_component.index.month
predict_df.index = pd.to_datetime(predict_df.index)
predict_df['Month'] = predict_df.index.month
However, think about how you need to join the two data sets (akin to SQL's join clauses):
inner (default) - keeps only records matching both
left - keeps records of predict_df and only those matching seasonal_component where predict_df is first argument
right - keeps records of seasonal_component and only those matching predict_df where predict_df is first argument
outer - keeps all records, those that match and those that don't match
The code below assumes an outer join, where records from both sides are kept and NaNs fill in the missing values.
# MERGING DATA FRAMES
merge_df = pd.merge(predict_df, seasonal_component[['Month', 'SeasonalComponent']],
on=['Month'], how='outer')
# ADDING COLUMNS
merge_df['Prediction'] = merge_df['Prediction'] + merge_df['SeasonalComponent']
Outcome (using posted data)
Date Prediction MAPE Score Month SeasonalComponent
0 2015-11-01 7.54 1.83 1 11 -0.39
1 2015-12-01 6.85 1.67 1 12 -1.08
2 2016-01-01 7.64 1.71 1 1 -0.28
3 2016-02-01 8.10 1.84 1 2 0.15
4 2016-03-01 8.40 1.53 1 3 0.46
5 2016-04-01 8.35 1.45 1 4 0.48
6 2016-05-01 8.28 1.53 1 5 0.37
7 2016-06-01 8.07 1.40 1 6 0.20
8 2016-07-01 7.99 1.40 1 7 0.15
9 2016-08-01 8.01 1.77 1 8 0.12
10 2016-09-01 7.85 1.99 1 9 -0.02
11 NaT NaN NaN NaN 10 -0.17
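If the goal is to keep Prediction_df's rows and index exactly as they are, a month lookup with map() is another option. This is a sketch under the assumption that both frames have date-like indices and that the seasonal frame holds at most one value per calendar month; the names seasonal and prediction_df and the three-row samples are illustrative only.

```python
import pandas as pd

# Three-row samples taken from the posted data
seasonal = pd.Series(
    [-1.08, -0.28, 0.15],
    index=pd.to_datetime(["2014-12", "2015-01", "2015-02"], format="%Y-%m"),
)
prediction_df = pd.DataFrame(
    {"Prediction": [7.93, 7.92, 7.95], "MAPE": [1.67, 1.71, 1.84]},
    index=pd.to_datetime(["2015-12-01", "2016-01-01", "2016-02-01"]),
)

# Re-key the seasonal values by month number (1-12), ignoring the year
seasonal_by_month = pd.Series(seasonal.to_numpy(), index=seasonal.index.month)

# Look up each prediction row's month; to_numpy() makes the addition positional
adjustment = prediction_df.index.month.map(seasonal_by_month).to_numpy()
prediction_df["Prediction"] = (prediction_df["Prediction"] + adjustment).round(2)
```

Months with no seasonal value would come back as NaN from the lookup, so those predictions would become NaN as well.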
First, extract the month from both dataframes, then merge on the month. After that, add the two value columns and store the result in a new column. Here is the code:
import pandas as pd
import numpy as np
from pandas import DataFrame,Series
from numpy.random import randn
Seasonal_Component = DataFrame({
'Date': ['2014-12','2015-01','2015-02','2015-03','2015-04','2015-05','2015-06','2015-07','2015-08','2015-09','2015-10','2015-11'],
'Value': [-1.08,-0.28,0.15,0.46,0.48,0.37,0.20,0.15,0.12,-0.02,-0.17,-0.39]
})
Prediction_df = DataFrame({
'Date': ['2015-11-01','2015-12-01','2016-01-01','2016-02-01','2016-03-01','2016-04-01','2016-05-01','2016-06-01','2016-07-01','2016-08-01','2016-09-01'],
'Prediction': [7.93,7.93,7.92,7.95,7.94,7.87,7.91,7.87,7.84,7.89,7.87],
'MAPE':[1.83,1.67,1.71,1.84,1.53,1.45,1.53,1.40,1.40,1.77,1.99],
'Score':[1,1,1,1,1,1,1,1,1,1,1]
})
def mon_extract(date):
    return date.split('-')[1]
Seasonal_Component['Month']=Seasonal_Component['Date'].apply(mon_extract)
def mon_extract(date):
    return date.split('-')[1].split('-')[0]
Prediction_df['Month']=Prediction_df['Date'].apply(mon_extract)
FinalDF=pd.merge(Seasonal_Component,Prediction_df,on='Month',how='right')
FinalDF
FinalDF['PredictionF']=FinalDF['Value']+FinalDF['Prediction']
FinalDF.loc[:,['Date_y','PredictionF','MAPE','Score']]
I am trying to reshape the dataframe below.
Tenor 2013M06D12 2013M06D13 2013M06D14 \
1 1 1.24 1.26 1.23
4 2 2.01 0.43 0.45
5 3 1.21 2.24 1.03
8 4 0.39 2.32 1.23
So, that it looks as follows. I was looking at using pivot_table, but this is sort of the opposite of what that would do as I need to convert Column Headers to rows and not the other way around. Hence, I am not sure how to proceed in order to obtain this dataframe.
Date Tenor Rate
1 2013-06-12 1 1.24
2 2013-06-13 1 1.26
4 2013-06-14 1 1.23
The code just involves reading from a CSV:
result = pd.read_csv("BankofEngland.csv")
I think you can do this with a melt, a sort, a date parse, and some column shuffling:
dfm = pd.melt(df, id_vars="Tenor", var_name="Date", value_name="Rate")
dfm = dfm.sort_values(["Tenor", "Date"]).reset_index(drop=True)
dfm["Date"] = pd.to_datetime(dfm["Date"], format="%YM%mD%d")
dfm = dfm[["Date", "Tenor", "Rate"]]
produces
In [104]: dfm
Out[104]:
Date Tenor Rate
0 2013-06-12 1 1.24
1 2013-06-13 1 1.26
2 2013-06-14 1 1.23
3 2013-06-12 2 2.01
4 2013-06-13 2 0.43
5 2013-06-14 2 0.45
6 2013-06-12 3 1.21
7 2013-06-13 3 2.24
8 2013-06-14 3 1.03
9 2013-06-12 4 0.39
10 2013-06-13 4 2.32
11 2013-06-14 4 1.23
import pandas as pd
import numpy as np
# try to read your sample data, replace with your read_csv func
df = pd.read_clipboard()
Out[139]:
Tenor 2013M06D12 2013M06D13 2013M06D14
1 1 1.24 1.26 1.23
4 2 2.01 0.43 0.45
5 3 1.21 2.24 1.03
8 4 0.39 2.32 1.23
# reshaping
df.set_index('Tenor', inplace=True)
df = df.stack().reset_index()
df.columns=['Tenor', 'Date', 'Rate']
# suggested by DSM, use the date parser
df.Date = pd.to_datetime(df.Date, format='%YM%mD%d')
Out[147]:
Tenor Date Rate
0 1 2013-06-12 1.24
1 1 2013-06-13 1.26
2 1 2013-06-14 1.23
3 2 2013-06-12 2.01
4 2 2013-06-13 0.43
.. ... ... ...
7 3 2013-06-13 2.24
8 3 2013-06-14 1.03
9 4 2013-06-12 0.39
10 4 2013-06-13 2.32
11 4 2013-06-14 1.23
[12 rows x 3 columns]
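For reference, here is a compact, self-contained sketch of the melt approach written against current pandas, where DataFrame.sort has been replaced by sort_values. The two-row, two-date frame is a cut-down version of the sample data, and long_df is just an illustrative name.

```python
import pandas as pd

# Cut-down version of the wide sample data
df = pd.DataFrame({
    "Tenor": [1, 2],
    "2013M06D12": [1.24, 2.01],
    "2013M06D13": [1.26, 0.43],
})

# Turn the date column headers into rows: one (Tenor, Date, Rate) row per cell
long_df = df.melt(id_vars="Tenor", var_name="Date", value_name="Rate")

# Parse headers such as "2013M06D12" into real timestamps
long_df["Date"] = pd.to_datetime(long_df["Date"], format="%YM%mD%d")

# Order by tenor, then date, and put the columns in the requested order
long_df = (
    long_df.sort_values(["Tenor", "Date"])
    .reset_index(drop=True)[["Date", "Tenor", "Rate"]]
)
```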