In a notebook inside Azure Databricks, the following code loads data from a CSV file into a pandas DataFrame. The OrderDate column values look like the ones shown below. print(data_df['OrderDate']) prints all the values successfully, but the next line of code then raises the error shown below:
Question: What could be the cause of the error and how can we fix it?
Error raised while converting the OrderDate column:
ParserError: hour must be in 0..23: 48:03.3
Output of print(data_df['OrderDate']) [the above error occurs at row 145207]:
0 48:03.3
1 25:25.8
2 05:19.4
3 35:16.9
4 56:40.6
...
145204 40:22.4
145205 25:17.8
145206 25:19.7
Name: OrderDate, Length: 145207, dtype: object
Error occurs at last line of the following code:
import sqlalchemy as sq
import pandas as pd
data_df = pd.read_csv('/dbfs/FileStore/tables/myDataFile.csv', low_memory=False, quotechar='"', header='infer')
print(data_df['OrderDate'])
data_df['OrderDate'] = pd.to_datetime(data_df['OrderDate'])
Your input seems to be a duration, not a time of day. Parse it to timedelta, then add that to a reference date to get a datetime dtype.
Ex:
import pandas as pd

df = pd.DataFrame({'duration': ["40:22.4", "25:17.8", "25:19.7"]})
# Prepend '00:' so each MM:SS.f string parses as an HH:MM:SS.f timedelta
df['datetime'] = pd.Timestamp("1900-01-01") + pd.to_timedelta('00:' + df['duration'])
df['datetime']
0 1900-01-01 00:40:22.400
1 1900-01-01 00:25:17.800
2 1900-01-01 00:25:19.700
Name: datetime, dtype: datetime64[ns]
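Applied to the question's DataFrame, the same idea would look like the sketch below (assuming every OrderDate value follows the MM:SS.f pattern shown in the printed output):

```python
import pandas as pd

# Stand-in for the first few OrderDate values from the question's CSV
data_df = pd.DataFrame({'OrderDate': ["48:03.3", "25:25.8", "05:19.4"]})

# Prepend '00:' so each MM:SS.f string parses as an HH:MM:SS.f timedelta,
# then anchor it to a reference date to get a datetime64 column
data_df['OrderDate'] = pd.Timestamp("1900-01-01") + pd.to_timedelta('00:' + data_df['OrderDate'])
print(data_df['OrderDate'].dtype)  # datetime64[ns]
```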
Probably a simple answer but I am new to coding and this is my first project.
I have managed to sum together the necessary information from individual spreadsheets and would now like to write an 'End of Month' spreadsheet to sum all individual data.
Here's what I have so far:
import pandas as pd
from pathlib import Path

path = Path("Spreadsheets")
for file in path.glob("*.xlsx"):
    df = pd.read_excel(f"{file}")
    client_total = df.groupby(["Nominal"]).sum()["Amount"]
    print(client_total)
This returns
Nominal
1118 379
1135 2367
1158 811
Name: Amount, dtype: int64
Nominal
1118 1147.85
1135 422.66
1158 990.68
Name: Amount, dtype: float64
Nominal
1118 736.38
1135 477.40
1158 470.16
Name: Amount, dtype: float64
Please let me know how I can merge these three separate results into one easy to read month total.
Many thanks.
Create a list of Series called out, then concat them and sum matching index values with groupby(level=0).sum() (older pandas versions also allowed sum(level=0), but that shortcut has since been removed):
import pandas as pd
from pathlib import Path

out = []
path = Path("Spreadsheets")
for file in path.glob("*.xlsx"):
    df = pd.read_excel(f"{file}")
    client_total = df.groupby(["Nominal"])["Amount"].sum()
    out.append(client_total)

# Stack all per-file totals end to end, then sum rows sharing a Nominal code
df = pd.concat(out).groupby(level=0).sum()
print(df)
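As a self-contained illustration of the concat-then-sum step, here is the same operation run on the three per-file totals printed in the question:

```python
import pandas as pd

# The three per-file totals shown in the question's output
s1 = pd.Series([379, 2367, 811], index=[1118, 1135, 1158], name='Amount')
s2 = pd.Series([1147.85, 422.66, 990.68], index=[1118, 1135, 1158], name='Amount')
s3 = pd.Series([736.38, 477.40, 470.16], index=[1118, 1135, 1158], name='Amount')
for s in (s1, s2, s3):
    s.index.name = 'Nominal'

# Stack the Series end to end, then sum rows sharing the same Nominal code
total = pd.concat([s1, s2, s3]).groupby(level=0).sum()
print(total)
```

The result matches the month totals quoted later in the thread (1118 -> 2263.23, 1135 -> 3267.06, 1158 -> 2271.84).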
Assuming you have three Series of totals (df1, df2, df3), you can simply chain the add method:
df_sum = df1.add(df2)
df_sum = df_sum.add(df3)
print(df_sum)
Nominal
1118 2263.23
1135 3267.06
1158 2271.84
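One caveat with plain add: if a file is missing one of the Nominal codes, the sum for that code becomes NaN. Passing fill_value=0 treats the missing entry as zero instead. A minimal sketch, reusing two of the codes from the question:

```python
import pandas as pd

s1 = pd.Series([379, 2367], index=[1118, 1135], name='Amount')     # no 1158 here
s2 = pd.Series([1147.85, 990.68], index=[1118, 1158], name='Amount')  # no 1135 here

print(s1.add(s2))                # 1135 and 1158 come out as NaN
print(s1.add(s2, fill_value=0))  # missing codes are treated as 0
```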
Hopefully, this can help you:
import pandas as pd
from pathlib import Path

path = Path("Spreadsheets")
# Initialise the running total as a Series indexed by the Nominal codes,
# so it aligns with the grouped Amount totals on each add
df_sum = pd.Series(0.0, index=[1118, 1135, 1158], name='Amount')
df_sum.index.name = 'Nominal'
for file in path.glob("*.xlsx"):
    df = pd.read_excel(f"{file}")
    client_total = df.groupby(["Nominal"]).sum()["Amount"]
    print(client_total)
    df_sum = df_sum.add(client_total)
print(df_sum)
Would this work?
import pandas as pd
from pathlib import Path

path = Path("Spreadsheets")
dfs = []
for file in path.glob("*.xlsx"):
    df = pd.read_excel(f"{file}")
    client_total = df.groupby(["Nominal"]).sum()["Amount"]
    dfs.append(client_total)

# Accumulate into a separate variable so the loop variable
# does not shadow the running total
total = dfs[0]
for s in dfs[1:]:
    total = total.add(s)
print(total)
I am new to Python. I read data from an Excel file. How can I turn a column into a list? The column is part of a pandas DataFrame, read from an xlsx file with the xlrd package. Any better way to solve the problem is also appreciated.
import pandas as pd
import xlrd

workbook = xlrd.open_workbook("MyData_XYZ.xlsx")
sheet1 = workbook.sheet_by_index(0)

def get_cell_range2(sheet, start_col, start_row, end_col, end_row):
    return [sheet.row_slice(row, start_colx=start_col-1, end_colx=end_col)
            for row in range(start_row-1, end_row)]

er_aaa = get_cell_range2(sheet1, 1, 2, 2, 67)
er_aaa_df = pd.DataFrame(er_aaa, columns=['date', 'aaa'])
raw_seq = list(er_aaa_df['aaa'])
I got this in Spyder
raw_seq
Out[61]:
0 number:25.405
1 number:25.427
2 number:25.411
3 number:25.423
4 number:25.45
61 number:26.054
62 number:26.09
63 number:26.103
64 number:26.1
65 number:26.03
Name: aaa, Length: 66, dtype: object
How can I turn the result to a simple list, namely,
[25.405, 25.427, 25.411, ...... 26.03]
Thank you!!
If I understand it correctly, you want to read data from the xlsx file and get one of its columns as a list. You can get it like this:
df = pd.read_excel("MyData_XYZ.xlsx")
raw_seq = df['aaa'].tolist()  # avoid naming the variable 'list', which shadows the built-in
Why don't you just use data = pd.read_excel("MyData_XYZ.xlsx", header=1)? See this link.
Then you can just select your values with data['number']. But be careful: this is still a pandas Series; if you want a plain list, just do list(data['number']).
Edit: I believe this was all user error. I have been typing df.T by default, and it just occurred to me that this is very likely the TRANSPOSE output. By typing df, the DataFrame is output normally (headers as columns). Thank you to those who stepped up to try and help. In the end, it was just my misunderstanding of pandas syntax.
Original Post
I'm not sure if I am making a simple mistake but the columns in a .csv file are being imported as rows using pd.read_csv. The dataframe turns out to be 5 rows by 2000 columns. I am importing only 5 columns out of 14 so I set up a list to hold the names of the columns I want. They match exactly those in the .csv file. What am I doing wrong here?
import os
import numpy as np
import pandas as pd
fp = 'C:/Users/my/file/path'
os.chdir(fp)
cols_to_use = ['VCOMPNO_CURRENT', 'MEASUREMENT_DATETIME',
               'EQUIPMENT_NUMBER', 'AXLE', 'POSITION']

df = pd.read_csv('measurement_file.csv',
                 usecols=cols_to_use,
                 dtype={'EQUIPMENT_NUMBER': np.int64,  # np.int was removed in NumPy 1.24
                        'AXLE': np.int64},
                 parse_dates=[2],
                 infer_datetime_format=True)
Output:
0 ... 2603
VCOMPNO_CURRENT T92656 ... T5M247
MEASUREMENT_DATETIME 7/26/2018 13:04 ... 9/21/2019 3:21
EQUIPMENT_NUMBER 208 ... 537
AXLE 1 ... 6
POSITION L ... R
[5 rows x 2000 columns]
Thank you.
Edit: To note, if I import the entire .csv with the standard pd.read_csv('measurement_file.csv'), the columns are imported properly.
Edit 2: Sample csv:
VCOMPNO_CURRENT,MEASUREMENT_DATETIME,REPAIR_ORDER_NUMBER,EQUIPMENT_NUMBER,AXLE,POSITION,FLANGE_THICKNESS,FLANGE_HEIGHT,FLANGE_SLOPE,DIAMETER,RO_NUMBER_SRC,CL,VCOMPNO_AT_MEAS,VCOMPNO_SRC
T92656,10/19/2018 7:11,5653054,208,1,L,26.59,27.34,6.52,691.3,OPTIMESS_DATA,2MTA ,T71614 ,RO_EQUIP
T92656,10/19/2018 7:11,5653054,208,1,R,26.78,27.25,6.64,691.5,OPTIMESS_DATA,2MTA ,T71614 ,RO_EQUIP
T92656,10/19/2018 7:11,5653054,208,2,L,26.6,27.13,6.49,691.5,OPTIMESS_DATA,2MTA ,T71614 ,RO_EQUIP
T92656,10/19/2018 7:11,5653054,208,2,R,26.61,27.45,6.75,691.6,OPTIMESS_DATA,2MTA ,T71614 ,RO_EQUIP
T7L672,10/19/2018 7:11,5653054,208,3,L,26.58,27.14,6.58,644.4,OPTIMESS_DATA,2CTC ,T7L672 ,BOTH
T7L672,10/19/2018 7:11,5653054,208,3,R,26.21,27.44,6.17,644.5,OPTIMESS_DATA,2CTC ,T7L672 ,BOTH
A simple workaround here is just to take the transpose of the DataFrame.
Link to Pandas Documentation
df = df.T  # equivalent to pd.DataFrame.transpose(df)
Can you try it like this?
import pandas as pd

dataset = pd.read_csv('yourfile.csv')
# filter to the wanted columns here
dataset = dataset[cols_to_use]
I have a csv file imported into a dataframe and have trouble sorting the data.
df looks like this:
Data
0 <WindSpeed>0.69</WindSpeed>
1 <PowerOutput>0</PowerOutput>
2 <ThrustCoEfficient>0</ThrustCoEffici...
3 <RotorSpeed>8.17</RotorSpeed>
4 <ReactivePower>0</ReactivePower>
5 </DataPoint>
6 <DataPoint>
7 <WindSpeed>0.87</WindSpeed>
8 <PowerOutput>0</PowerOutput
I want it to look like this:
0 Windspeed Poweroutput
1 0.69 0.0
HereĀ“s the code that I wrote so far:
import pandas as pd
from pandas.compat import StringIO
import re
import numpy as np

df = pd.read_csv('powercurve.csv', encoding='utf-8', skiprows=42)
df.columns = ['Data']
no_of_rows = df.Data.str.count("WindSpeed").sum()/2
rows = no_of_rows.astype(np.uint32)
TRBX = pd.DataFrame(index=range(0, abs(rows)),
                    columns=['WSpd[m/s]', 'Power[kW]'], dtype='float')
i = 0
for i in range(len(df)):
    if 'WindSpeed' in df['Data']:
        TRBX['WSpd[m/s]', i] = re.findall("'(\d+)'", 'Data')
    elif 'Rotorspeed' in df['Data']:
        TRBX['WSpd[m/s]', i] = re.findall("'(\d+)'", 'Data')
Is this a suitable approach? If yes, so far there are no values written into the TRBX dataframe. Where is my mistake?
The code below should help you if your df is indeed in the same format as shown:
import re
import pandas as pd

# Split each row on the angle brackets, e.g. '<WindSpeed>0.69</WindSpeed>'
# -> ['', 'WindSpeed', '0.69', '/WindSpeed', '']
split_func = lambda x: re.split('<|>', str(x))
split_series = df.Data.apply(split_func)
data = split_series.apply(lambda x: x[2]).rename('data')
features = split_series.apply(lambda x: x[1]).rename('features')
df = pd.DataFrame(data).set_index(features).T
You may want to drop some columns that have no data or input some N/A values afterwards. You also may want to rename the variables and series to different names that make more sense to you.
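Here is a self-contained run of that approach on a few of the rows shown in the question (assuming each row holds exactly one <Tag>value</Tag> pair):

```python
import re
import pandas as pd

df = pd.DataFrame({'Data': ['<WindSpeed>0.69</WindSpeed>',
                            '<PowerOutput>0</PowerOutput>',
                            '<RotorSpeed>8.17</RotorSpeed>']})

# '<WindSpeed>0.69</WindSpeed>' -> ['', 'WindSpeed', '0.69', '/WindSpeed', '']
split_series = df.Data.apply(lambda x: re.split('<|>', str(x)))
data = split_series.apply(lambda x: x[2]).rename('data')
features = split_series.apply(lambda x: x[1]).rename('features')
wide = pd.DataFrame(data).set_index(features).T
print(wide)  # one column per tag name: WindSpeed, PowerOutput, RotorSpeed
```

Note the values come out as strings; a final pd.to_numeric or astype(float) would be needed for calculations.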