I am analyzing data from an Excel file.
I want to create a data frame by parsing the Excel data with Python.
The data in my Excel file looks as follows:
The first row, highlighted in yellow, contains the match, which will be one of the columns in the data frame I want to create.
The second and fourth rows are the names of the columns I want in the new data frame.
The third and fifth rows hold the values of those columns.
The sample here is for one match only; I have multiple matches in the Excel file.
I want to create a data frame that contains the Match column and all the names shown in blue in the file.
I have attached the sample file that contains multiple matches.
Download the file here.
My expected data frame is
Match 1-0 2-0 2-1 3-0 3-1 3-2 4-0 4-1 4-2 4-3.......
MOL Vivi -vs- Chelsea 14 42 20 170 85 85 225 225 225 .....
Can anyone advise me on how to parse the Excel data and convert it to a data frame?
Thanks,
Zep
Use:
import pandas as pd
from datetime import datetime
df = pd.read_excel('test_match.xlsx')
# mask checking for a-z letters in the 'HOME -vs- AWAY' column
m1 = df['HOME -vs- AWAY'].str.contains('[a-z]', na=False)
#create index by matches
df.index = df['HOME -vs- AWAY'].where(m1).ffill()
df.index.name = 'Match'
# remove the match-name rows (where the index equals the 'HOME -vs- AWAY' value)
df = df[df.index != df['HOME -vs- AWAY']].copy()
#test if datetime or string
m2 = df['HOME -vs- AWAY'].apply(lambda x: isinstance(x, datetime))
m3 = df['HOME -vs- AWAY'].apply(lambda x: isinstance(x, str))
# select the following rows and set the new column names
df1 = df[m2.shift().fillna(False)]
df1.columns = df[m2].iloc[0]
# also remove columns that are all NaN
df2 = df[m3.shift().fillna(False)].dropna(axis=1, how='all')
df2.columns = df[m3].iloc[0].dropna()
#join together
df = pd.concat([df1, df2], axis=1).astype(float).reset_index().rename_axis(None, axis=1)
print(df.head())
Match 2000-01-01 00:00:00 2000-02-01 00:00:00 \
0 MOL Vidi -vs- Chelsea 14.00 42.00
1 Lazio -vs- Eintracht Frankfurt 8.57 11.55
2 Sevilla -vs- FC Krasnodar 7.87 6.63
3 Villarreal -vs- Spartak Moscow 7.43 7.03
4 Rennes -vs- FC Astana 4.95 6.38
2018-02-01 00:00:00 2000-03-01 00:00:00 2018-03-01 00:00:00 \
0 20.00 170.00 85.00
1 7.87 23.80 15.55
2 7.87 8.72 8.65
3 7.07 10.00 9.43
4 7.33 12.00 13.20
2018-03-02 00:00:00 2000-04-01 00:00:00 2018-04-01 00:00:00 \
0 85.0 225.00 225.00
1 21.3 64.30 42.00
2 25.9 14.80 14.65
3 23.9 19.35 17.65
4 38.1 31.50 34.10
2018-04-02 00:00:00 ... 0-1 0-2 2018-01-02 00:00:00 \
0 225.0 ... 5.6 6.80 7.00
1 55.7 ... 11.0 19.05 10.45
2 38.1 ... 28.0 79.60 29.20
3 38.4 ... 20.9 58.50 22.70
4 81.4 ... 12.9 42.80 22.70
0-3 2018-01-03 00:00:00 2018-02-03 00:00:00 0-4 \
0 12.5 12.0 32.0 30.0
1 48.4 27.4 29.8 167.3
2 223.0 110.0 85.4 227.5
3 203.5 87.6 73.4 225.5
4 201.7 97.6 103.6 225.5
2018-01-04 00:00:00 2018-02-04 00:00:00 2018-03-04 00:00:00
0 29.0 60.0 220.0
1 91.8 102.5 168.3
2 227.5 227.5 227.5
3 225.5 225.5 225.5
4 225.5 225.5 225.5
[5 rows x 27 columns]
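Note that the date-like column labels are score strings that Excel itself converted to dates ('3-0' was read as March 2000, '3-1' as March 1 of the year the workbook was handled, and so on, judging by the sample output). If you want the original '1-0'-style labels back, a hedged sketch (my assumption about the mangling pattern, so verify it against your file) could be:
# map the datetime column labels back to score strings;
# assumption: 'M-0' was parsed as month M in year 2000, 'M-D' as month M, day D of 2018
def score_from_label(col):
    if not isinstance(col, datetime):
        return col  # 'Match' and labels like '0-1' are already fine
    return f'{col.month}-0' if col.year == 2000 else f'{col.month}-{col.day}'
df.columns = [score_from_label(c) for c in df.columns]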
Related
I have this df:
Week U.S. 30 yr FRM U.S. 15 yr FRM
0 2014-12-31 3.87 3.15
1 2015-01-01 NaN NaN
2 2015-01-02 NaN NaN
3 2015-01-03 NaN NaN
4 2015-01-04 NaN NaN
... ... ... ...
2769 2022-07-31 NaN NaN
2770 2022-08-01 NaN NaN
2771 2022-08-02 NaN NaN
2772 2022-08-03 NaN NaN
2773 2022-08-04 4.99 4.26
And when I try to run this interpolation:
pmms_df.interpolate(method = 'nearest', inplace = True)
I get ValueError: Invalid fill method. Expecting pad (ffill) or backfill (bfill). Got nearest
I read in this post that pandas interpolate doesn't do well with the time columns, so I tried this:
pmms_df[['U.S. 30 yr FRM', 'U.S. 15 yr FRM']].interpolate(method = 'nearest', inplace = True)
but the output is exactly the same as before the interpolation.
It may not work great with date columns, but it works well with a datetime index, which is probably what you should be using here:
df = df.set_index('Week')
df = df.interpolate(method='nearest')
print(df)
# Output:
U.S. 30 yr FRM U.S. 15 yr FRM
Week
2014-12-31 3.87 3.15
2015-01-01 3.87 3.15
2015-01-02 3.87 3.15
2015-01-03 3.87 3.15
2015-01-04 3.87 3.15
2022-07-31 4.99 4.26
2022-08-01 4.99 4.26
2022-08-02 4.99 4.26
2022-08-03 4.99 4.26
2022-08-04 4.99 4.26
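If Week is still a plain string column, a hedged version of the same steps (my assumption about the starting dtype) would be:
# make sure Week is a real datetime before using it as the index
pmms_df['Week'] = pd.to_datetime(pmms_df['Week'])
# 'nearest' interpolation is delegated to SciPy, so SciPy must be installed
pmms_df = (pmms_df.set_index('Week')
                  .interpolate(method='nearest')
                  .reset_index())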
I have three dataframes:
df_1 =
Name Description Date Quant Value
0 B100 text123 2021-01-02 3 89.1
1 B101 text567 2021-01-03 2 90.1
2 A200 text820 2021-03-02 1 90.2
3 B101 text567 2021-03-02 6 90.2
4 A500 text758 2021-03-06 1 94.0
5 A500 text758 2021-03-06 2 94.0
6 A500 text758 2021-03-07 2 94.0
7 A200 text820 2021-04-02 1 90.2
8 A999 text583 2021-05-05 2 90.6
9 A998 text834 2021-05-09 1 99.9
df_2 = # the index is funny because I did some manipulations and dropped some NaNs before
Code Name Person
0 900 B100 600
1 901 B100 610
2 959 B101 670
3 979 A999 670
6 944 A200 388
7 921 A500 663
8 988 B300 794
df_3 =
Code StartDate EndDate RealValue
0 900 2000-01-01 2007-12-31 80.9
1 901 2008-01-01 2099-12-31 98.8
2 902 2000-01-01 2020-02-02 98.3
3 903 2000-01-01 2007-01-10 90.6
4 903 2007-01-11 2099-12-31 90.7
5 959 2020-04-09 2099-12-31 98.9
6 979 2000-01-01 2009-02-12 87.6
7 979 2009-02-13 2021-06-13 78.0
8 979 2021-06-15 2099-12-31 89.5
9 944 2020-04-09 2099-12-31 98.9
10 921 2020-04-09 2099-12-31 98.9
I want to do the following:
Start from df_1 and find the corresponding Code(s) in df_2 for each Name in df_1. Then I look in df_1 to see what the Value and Quant were at every Date, and I compare the Value with the RealValue of the date range that my Date falls into. The difficult part is selecting the right Code and then the right date range. So:
Name Date Code Value RealValue Quant
B100 2021-01-02 901 89.1 98.8 3
B101 2021-01-03 959 90.1 98.9 2
A200 2021-03-02 944 90.2 98.9 1
B101 2021-03-02 959 90.1 98.9 6
A500 2021-03-06 921 94.0 98.9 1
A500 2021-03-06 921 94.0 98.9 2
A500 2021-03-07 921 94.0 98.9 2
A200 2021-05-05 944 90.2 98.9 2
A999 2021-05-05 979 90.6 78.0 2
What I did was merge everything into one table, but since my real dataset is huge and there are many records that do not appear everywhere, I might have lost some data or ended up with NaNs. So I would rather leave the dataframes as they are here and navigate through them for every record in df_1. Is that possible?
First, map the Name column from df2 onto df3, then merge df1 and df3 on the Name column. Finally, keep only the rows where Date is between StartDate and EndDate:
COLS = df1.columns.tolist() + ['RealValue']
df3['Name'] = df3['Code'].map(df2.set_index('Code')['Name'])
out = df1.merge(df3, on='Name', how='left') \
         .query('Date.between(StartDate, EndDate)')[COLS]
Output:
>>> out
Name Description Date Quant Value RealValue
1 B100 text123 2021-01-02 3 89.1 98.8
2 B101 text567 2021-01-03 2 90.1 98.9
3 A200 text820 2021-03-02 1 90.2 98.9
4 B101 text567 2021-03-02 6 90.2 98.9
5 A500 text758 2021-03-06 1 94.0 98.9
6 A500 text758 2021-03-06 2 94.0 98.9
7 A500 text758 2021-03-07 2 94.0 98.9
8 A200 text820 2021-04-02 1 90.2 98.9
10 A999 text583 2021-05-05 2 90.6 78.0
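If query with a method call complains in your environment (some pandas/numexpr combinations need engine='python' for it), a hedged equivalent using plain boolean indexing with the same names is:
merged = df1.merge(df3, on='Name', how='left')
# keep only the rows whose Date falls inside the matching validity window
out = merged.loc[merged['Date'].between(merged['StartDate'], merged['EndDate']), COLS]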
Example:
start_date = "2019-1-1"
end_date = "2019-1-31"
after_start_dates = df['date'] >= start_date
before_end_dates = df['date'] <= end_date
between_dates = after_start_dates & before_end_dates
filtered_dates = df.loc[between_dates]
print(filtered_dates)
try this:
try1 = pd.merge(df_1, df_2, on = 'Name', how = 'outer')
try2 = pd.merge(try1, df_3, on = 'Code', how = 'outer')
try2
and then you can navigate through try2:
try2[['Name','Date','Code','Value','RealValue','Quant']]
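To then keep only the rows whose Date falls inside each code's validity window (the step the question calls the difficult part), a hedged follow-up, assuming the date columns are already datetimes, could be:
mask = try2['Date'].between(try2['StartDate'], try2['EndDate'])
try3 = try2.loc[mask, ['Name', 'Date', 'Code', 'Value', 'RealValue', 'Quant']]
print(try3)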
I tried to merge two dataframes using pandas, but this is the error that I get:
ValueError: You are trying to merge on datetime64[ns] and datetime64[ns, UTC] columns. If you wish to proceed you should use pd.concat
I have tried different solutions found online, but nothing works! The code was provided to me and seems to work on other PCs, but not on my computer.
This is my code:
import sys
import os
from datetime import datetime
import numpy as np
import pandas as pd
# --------------------------------------------------------------------
# -- price, consumption and production --
# --------------------------------------------------------------------
fn = '../data/np_data.csv'
if os.path.isfile(fn):
    df_data = pd.read_csv(fn, header=[0], parse_dates=[0])
else:
    sys.exit('Could not open data file {}'.format(fn))
# --------------------------------------------------------------------
# -- temp. data --
# --------------------------------------------------------------------
fn = '../data/temp.csv'
if os.path.isfile(fn):
    dtemp = pd.read_csv(fn, header=[0], parse_dates=[0])
else:
    sys.exit('Could not open data file {}'.format(fn))
# --------------------------------------------------------------------
# -- price data --
# -- first date: 2014-01-13 --
# -- last date: 2020-02-01 --
# --------------------------------------------------------------------
fn = '../data/eprice.csv'
if os.path.isfile(fn):
    eprice = pd.read_csv(fn, header=[0])
else:
    sys.exit('Could not open data file {}'.format(fn))
# --------------------------------------------------------------------
# -- combine dataframes (and save as CSV file) --
# --------------------------------------------------------------------
#
df = df_data.merge(dtemp, on='time', how='left')  ## This is where I get the error.
print(df.info())
print(eprice.info())
#
# add eprice
df = df.merge(eprice, on='date', how='left')
#
# eprice is only available on trading days,
# so fill in missing values with the last observation (forward fill)
df = df.fillna(method='ffill')
#
# keep only the relevant time period
df = df[df.date > '2014-01-23']
df = df[df.date < '2020-02-01']
df.to_csv('../data/my_data.csv',index=False)
The imported datasets look normal, with the expected numbers of columns and observations. The pandas version I have is 1.0.3.
Edit:
This is the output (df) when I first merge df_data and dtemp:
time price_sys price_no1 ... temp_no3 temp_no4 temp_no5
0 2014-01-23 00:00:00+00:00 32.08 32.08 ... NaN NaN NaN
1 2014-01-24 00:00:00+00:00 31.56 31.60 ... -2.5 -8.7 2.5
2 2014-01-24 00:00:00+00:00 30.96 31.02 ... -2.5 -8.7 2.5
3 2014-01-24 00:00:00+00:00 30.84 30.79 ... -2.5 -8.7 2.5
4 2014-01-24 00:00:00+00:00 31.58 31.10 ... -2.5 -8.7 2.5
[5 rows x 25 columns]
This is the output for eprice before I merge:
<bound method NDFrame.head of date gas price oil price coal price carbon price
0 2014-01-24 00:00:00 66.00 107.88 79.42 6.89
1 2014-01-27 00:00:00 64.20 106.69 79.43 7.04
2 2014-01-28 00:00:00 63.75 107.41 79.29 7.20
3 2014-01-29 00:00:00 63.20 107.85 78.52 7.21
4 2014-01-30 00:00:00 62.60 107.95 78.18 7.46
... ... ... ... ...
1608 2020-03-25 00:00:00 22.30 27.39 67.81 17.51
1609 2020-03-26 00:00:00 21.55 26.34 70.35 17.35
1610 2020-03-27 00:00:00 18.90 24.93 72.46 16.39
1611 2020-03-30 00:00:00 19.20 22.76 71.63 17.06
1612 2020-03-31 00:00:00 18.00 22.74 71.13 17.68
[1613 rows x 5 columns]>
This is what happens when I merge df and eprice:
<bound method NDFrame.head of date gas price oil price coal price carbon price
0 2014-01-24 00:00:00 66.00 107.88 79.42 6.89
1 2014-01-27 00:00:00 64.20 106.69 79.43 7.04
2 2014-01-28 00:00:00 63.75 107.41 79.29 7.20
3 2014-01-29 00:00:00 63.20 107.85 78.52 7.21
4 2014-01-30 00:00:00 62.60 107.95 78.18 7.46
... ... ... ... ...
1608 2020-03-25 00:00:00 22.30 27.39 67.81 17.51
1609 2020-03-26 00:00:00 21.55 26.34 70.35 17.35
1610 2020-03-27 00:00:00 18.90 24.93 72.46 16.39
1611 2020-03-30 00:00:00 19.20 22.76 71.63 17.06
1612 2020-03-31 00:00:00 18.00 22.74 71.13 17.68
[1613 rows x 5 columns]>
time price_sys ... coal price carbon price
0 2014-01-23 00:00:00+00:00 32.08 ... NaN NaN
1 2014-01-24 00:00:00+00:00 31.56 ... NaN NaN
2 2014-01-24 00:00:00+00:00 30.96 ... NaN NaN
3 2014-01-24 00:00:00+00:00 30.84 ... NaN NaN
4 2014-01-24 00:00:00+00:00 31.58 ... NaN NaN
[5 rows x 29 columns]
Try doing df['time'] = pd.to_datetime(df['time'], utc=True) on both time columns before joining (or rather, only the one without UTC needs to go through this!).
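A minimal sketch of applying that to the script above, assuming dtemp's time column is the tz-naive side (swap the lines if it is the other frame that lacks the UTC offset):
# make both merge keys tz-aware UTC datetimes so their dtypes match before merging
df_data['time'] = pd.to_datetime(df_data['time'], utc=True)
dtemp['time'] = pd.to_datetime(dtemp['time'], utc=True)
df = df_data.merge(dtemp, on='time', how='left')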
I am combining 3 separate columns of year, month and day into a single column of my dataframe, but the year has only 2 digits, which is giving an error.
I have tried to_datetime() in a Jupyter notebook.
The dataframe is in this form:
Yr Mo Dy RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL
61 1 1 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50
61 1 2 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54
61 1 3 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75
data.rename(columns={'Yr':'Year','Mo':'Month','Dy':'Day'},inplace=True)
data['Date']=pd.to_datetime(data[['Year','Month','Day']],format='%y%m%d')
The error I am getting is:
cannot assemble the datetimes: time data 610101 does not match format '%Y%m%d' (match)
There is a problem: to_datetime with the specified columns ['Year','Month','Day'] needs a four-digit YYYY year, so an alternative solution is needed because the year here is YY only:
# join the three columns into 'YY-M-D' strings, then parse with a two-digit year format
s = data[['Yr','Mo','Dy']].astype(str).apply('-'.join, axis=1)
data['Date'] = pd.to_datetime(s, format='%y-%m-%d')
print(data)
Yr Mo Dy RPT VAL ROS KIL SHA BIR DUB CLA MUL \
0 61 1 1 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83
1 61 1 2 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79
2 61 1 3 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50
CLO BEL Date
0 12.58 18.50 2061-01-01
1 9.67 17.54 2061-01-02
2 7.67 12.75 2061-01-03
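Note that %y resolves two-digit years 00-68 to 2000-2068, which is why 61 becomes 2061 above. If the data is actually from the 1960s (my assumption based on the sample), a hedged adjustment is to shift the parsed dates back one century:
# shift every parsed date back 100 years; assumes all rows truly belong to the 1900s
data['Date'] = data['Date'] - pd.DateOffset(years=100)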
I have several .csv files which I am importing via pandas, and then I work out a summary of the data (min, max, mean), ideally as weekly and monthly reports. I have the following code, but I just do not seem to get the monthly summary to work; I am sure the problem is with the timestamp conversion.
What am I doing wrong?
import pandas as pd
import numpy as np
#Format of the data that is been imported
#2017-05-11 18:29:14+00:00,264.0,987.99,26.5,23.70,512.0,11.763,52.31
df = pd.read_csv('data.csv')
df['timestamp'] = pd.to_datetime(df['time'], format='%Y-%m-%d %H:%M:%S')
print 'month info'
print [g for n, g in df.groupby(pd.Grouper(key='timestamp',freq='M'))]
print(data.groupby('timestamp')['light'].mean())
IIUC, you almost have it, and your datetime conversion is fine. Here is an example:
Starting from a dataframe like this (which is your example row, duplicated with slight modifications):
>>> df
time x y z a b c d
0 2017-05-11 18:29:14+00:00 264.0 947.99 24.5 53.7 511.0 11.463 12.31
1 2017-05-15 18:29:14+00:00 265.0 957.99 25.5 43.7 512.0 11.563 22.31
2 2017-05-21 18:29:14+00:00 266.0 967.99 26.5 33.7 513.0 11.663 32.31
3 2017-06-11 18:29:14+00:00 267.0 977.99 26.5 23.7 514.0 11.763 42.31
4 2017-06-22 18:29:14+00:00 268.0 997.99 27.5 13.7 515.0 11.800 52.31
You can do what you did before with your datetime:
df['timestamp'] = pd.to_datetime(df['time'], format='%Y-%m-%d %H:%M:%S')
And then get your summaries either separately:
monthly_mean = df.groupby(pd.Grouper(key='timestamp',freq='M')).mean()
monthly_max = df.groupby(pd.Grouper(key='timestamp',freq='M')).max()
monthly_min = df.groupby(pd.Grouper(key='timestamp',freq='M')).min()
weekly_mean = df.groupby(pd.Grouper(key='timestamp',freq='W')).mean()
weekly_min = df.groupby(pd.Grouper(key='timestamp',freq='W')).min()
weekly_max = df.groupby(pd.Grouper(key='timestamp',freq='W')).max()
# Examples:
>>> monthly_mean
x y z a b c d
timestamp
2017-05-31 265.0 957.99 25.5 43.7 512.0 11.5630 22.31
2017-06-30 267.5 987.99 27.0 18.7 514.5 11.7815 47.31
>>> weekly_mean
x y z a b c d
timestamp
2017-05-14 264.0 947.99 24.5 53.7 511.0 11.463 12.31
2017-05-21 265.5 962.99 26.0 38.7 512.5 11.613 27.31
2017-05-28 NaN NaN NaN NaN NaN NaN NaN
2017-06-04 NaN NaN NaN NaN NaN NaN NaN
2017-06-11 267.0 977.99 26.5 23.7 514.0 11.763 42.31
2017-06-18 NaN NaN NaN NaN NaN NaN NaN
2017-06-25 268.0 997.99 27.5 13.7 515.0 11.800 52.31
Or aggregate them all together to get a multi-indexed dataframe with your summaries:
monthly_summary = df.groupby(pd.Grouper(key='timestamp',freq='M')).agg(['mean', 'min', 'max'])
weekly_summary = df.groupby(pd.Grouper(key='timestamp',freq='W')).agg(['mean', 'min', 'max'])
# Example of summary for column 'x':
>>> monthly_summary['x']
mean min max
timestamp
2017-05-31 265.0 264.0 266.0
2017-06-30 267.5 267.0 268.0
>>> weekly_summary['x']
mean min max
timestamp
2017-05-14 264.0 264.0 264.0
2017-05-21 265.5 265.0 266.0
2017-05-28 NaN NaN NaN
2017-06-04 NaN NaN NaN
2017-06-11 267.0 267.0 267.0
2017-06-18 NaN NaN NaN
2017-06-25 268.0 268.0 268.0
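An equivalent, arguably tidier spelling uses resample on a datetime index; a sketch under the same assumptions about the example data:
# drop the raw string column, then group by calendar month end ('M') or week ('W')
numeric = df.set_index('timestamp').drop(columns='time')
monthly_summary = numeric.resample('M').agg(['mean', 'min', 'max'])
weekly_summary = numeric.resample('W').agg(['mean', 'min', 'max'])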