"Unknown string format:" pandas / pyplot parser error - python

I'm trying to plot a section of a dataframe. The first column is formatted using the to_datetime method:
all_data['Date_Time_(GMT)'] = pd.to_datetime(all_data['Date_Time_(GMT)'])
...
all_data['Date_Time_(GMT)'].dtype
[out] dtype('<M8[ns]')
The second column is a bunch of integers:
all_data[new_column].dtype
[out] dtype('int64')
When I try to plot the two columns I get a parser error. Here is the code for the plot:
my_column = 'My Column'
start_date = '2020-08-11 09:28:37'
end_date = '2020-08-11 09:29:28'
new_plot = pd.DataFrame()
new_plot['Date_Time_(GMT)'] = all_data['Date_Time_(GMT)']
new_plot[my_column] = all_data[my_column]
mask = (new_plot['Date_Time_(GMT)'] > start_date) & (new_plot['Date_Time_(GMT)'] <= end_date)
new_plot = new_plot.loc[mask]
df = pd.DataFrame(new_plot, columns=[new_plot['Date_Time_(GMT)'], new_plot[my_column]])
df.plot(x='Date_Time_(GMT)', y=my_column, kind='line' )
plt.show()
Here is the error output:
ParserError Traceback (most recent call last)
pandas\_libs\tslibs\conversion.pyx in pandas._libs.tslibs.conversion._convert_str_to_tsobject()
pandas\_libs\tslibs\parsing.pyx in pandas._libs.tslibs.parsing.parse_datetime_string()
c:\users\user name\appdata\local\programs\python\python38\lib\site-packages\dateutil\parser\_parser.py in parse(timestr, parserinfo, **kwargs)
1373 else:
-> 1374 return DEFAULTPARSER.parse(timestr, **kwargs)
1375
c:\users\user name\appdata\local\programs\python\python38\lib\site-packages\dateutil\parser\_parser.py in parse(self, timestr, default, ignoretz, tzinfos, **kwargs)
648 if res is None:
--> 649 raise ParserError("Unknown string format: %s", timestr)
650
ParserError: Unknown string format: Date_Time_(GMT)
Any ideas what completely obvious thing I've done wrong?

Setting the index should do the trick. Note that the columns argument of pd.DataFrame expects column labels (strings), not Series, which is what mangles the rebuilt frame; filtering and setting the datetime index avoids rebuilding it at all:
df = all_data[
    (all_data['Date_Time_(GMT)'] > start_date) &
    (all_data['Date_Time_(GMT)'] <= end_date)
].copy().set_index('Date_Time_(GMT)')
df.plot(y=my_column, kind='line')
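Alternatively, a minimal sketch that keeps your original plot call: after the mask, new_plot already holds exactly the two columns being plotted, so there is no need to rebuild the frame at all:

new_plot = new_plot.loc[mask]
new_plot.plot(x='Date_Time_(GMT)', y=my_column, kind='line')
plt.show()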

Related

ValueError: "Columns must be same length as key" > Python

I'm using Colab to run this code, but I get an error I can't fix. Could you help me out?
I don't know what to do to fix it; I have already tried changing the column names between upper and lower case.
#Inativos: adjust the column names
dfInativos = dfInativos.rename(columns={'userId': 'id'})
dfInativos = dfInativos.rename(columns={'classId': 'ClasseId'})
dfInativos[['id','ClasseId','lastActivityDate','inactivityDaysCount','sevenDayInactiveStatus']] = dfInativos
#dfInativos['id'] = dfInativos['id'].astype(int, errors = 'ignore')
#dfInativos['ClasseId'] = dfInativos['ClasseId'].astype(int, errors = 'ignore')
dfInativos['id'] = pd.to_numeric(dfInativos['id'],errors = 'coerce')
dfInativos['ClasseId'] = pd.to_numeric(dfInativos['ClasseId'],errors = 'coerce')
#dfInativos.dropna(subset = ['lastActivityDate'], inplace=True)
dfInativos.drop_duplicates(subset = ['id','ClasseId'], inplace=True)
dfInativos['seven DayInactiveStatus'] = dfInativos['sevenDayInactiveStatus'].replace(0,'')
#Add Inactive data to main data frame
df = df.merge(dfInativos, on=['id','ClasseId'], how='left')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-79-10fe94c48d1f> in <module>()
2 dfInativos = dfInativos.rename(columns={'userId': 'id'})
3 dfInativos = dfInativos.rename(columns={'classId': 'ClasseId'})
----> 4 dfInativos[['id','ClasseId','lastActivityDate','inactivityDaysCount','sevenDayInactiveStatus']] = dfInativos
5
6
2 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/indexers.py in check_key_length(columns, key, value)
426 if columns.is_unique:
427 if len(value.columns) != len(key):
--> 428 raise ValueError("Columns must be same length as key")
429 else:
430 # Missing keys in columns are represented as -1
ValueError: Columns must be same length as key
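The line the traceback points at fails because assigning to a list of five column keys requires the right-hand side to supply exactly five columns, and dfInativos evidently has a different number. A minimal sketch of one way around it, assuming the five columns already exist under those names after the renames above:

cols = ['id', 'ClasseId', 'lastActivityDate',
        'inactivityDaysCount', 'sevenDayInactiveStatus']

# Keep only the five columns the key names, instead of self-assigning:
dfInativos = dfInativos[cols]

# Or, if the frame has exactly five columns that just need relabelling:
# dfInativos = dfInativos.set_axis(cols, axis=1)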

TypeError for >= between float and string values

I have been using the script below to calculate the storage capacity on one of our server environments. It reads the values from a report I get every two weeks and then creates a file I can import into Power BI to create graphs. It ran without an issue two weeks ago, but today when I tried to run it I got a TypeError. I assume if float(df['Capacity(TB)']) >= 0.01: is causing the issue, as per the error message.
The data I am importing is an xls sheet with a header name and values underneath it. I had a look to see if there are any blank fields but could not find any. Any help/suggestions would be greatly appreciated.
import pandas as pd
import numpy as np
from datetime import datetime
import os
from os import listdir
from os.path import isfile, join
#SCCM resource import as 'df'
pathres = r'C:\Capacity Reports\SCOM Reports'
onlyfiles = [f for f in listdir(pathres) if isfile(join(pathres, f))]
df = pd.DataFrame()
for i in onlyfiles:
    print(i)
    dfresimp = pd.read_excel(pathres+'\\'+i)
    df = pd.concat([df, dfresimp])
#CMDB import as 'df2'
df2 = pd.read_excel('C:\\Capacity Reports\\CMDB_Export.xlsx')
#Windows Lifecycle import as 'df3'
df3 = pd.read_excel('C:\\Capacity Reports\\Windows Server Lifecycle.xlsx')
#SCVMM clusters import as 'df4'
df4 = pd.read_excel('C:\\Capacity Reports\\HyperV Overview.xlsx')
#SCVMM Storage reports import as 'df5'
pathstor = r'C:\Capacity Reports\Hyper-V Storage'
Storfiles = [f for f in listdir(pathstor) if isfile(join(pathstor, f))]
df5 = pd.DataFrame()
for i in Storfiles:
    print(i)
    dfstorimp = pd.read_excel(pathstor+'\\'+i)
    df5 = pd.concat([df5, dfstorimp])
#CREATE MAIN TABLE
df['NAME'] = df['Computer Name'].str.upper()
df11 = pd.DataFrame()
df11['NAME'] = df2['NAME'].str.upper()
df11['Application Owner'] = df2['Application Owner'].str.title()
df11['HW EOSL'] = df2['HW EOSL'].str.title()
#print(df11['HW EOSL'])
Main_Table = df.merge(df11, on='NAME', how='left')
Main_Table = Main_Table.merge(df3, on='Operating System Edition', how='left')
df13 = pd.DataFrame()
df13['Hyper V Cluster name'] = df4['Hyper V Cluster name']
df13['Computer Name'] = df4['Server Name'].str.upper()
Main_Table = Main_Table.merge(df13, on='Computer Name', how='left')
Main_Table['OS_Support'] = pd.to_datetime(Main_Table['Extended_Support_End_Date'], format='%Y-%m-%d %H:%S:%f')
Main_Table['OS_Support'] = Main_Table['OS_Support'].dt.strftime("%Y-%m-%d")
#print(Main_Table['OS_Support'])
def f(df):
    if df['Host/GuestVM'] == 'GuestVM':
        result = (df['Total Physical Memory GB']-(df['Total Physical Memory GB']*(df['Memory % Used Max Value']/100)))/2
        return result
    else:
        return np.nan
Main_Table['Reclaimable Memory Calculated'] = Main_Table.apply(f, axis=1)
def f(df):
    if df['Host/GuestVM'] == 'GuestVM':
        result = (df['Total Logical Processors']-(df['Total Logical Processors']*(df['CPU % Used Max Value']/100)))/2
        return result
    else:
        return np.nan
Main_Table['Reclaimable CPU Calculated'] = Main_Table.apply(f, axis=1)
Main_Table['Reclaimable Memory Calculated'] = round(Main_Table['Reclaimable Memory Calculated'])
Main_Table['Reclaimable CPU Calculated'] = round(Main_Table['Reclaimable CPU Calculated'])
Main_Table['Report Timestamp'] = Main_Table['Report Timestamp'].dt.strftime("%Y%m%d")
Main_Table = Main_Table.drop_duplicates()
Main_Table['Report Timestamp Number'] = Main_Table['Report Timestamp']
column = Main_Table["Report Timestamp Number"]
max_value = column.max()
Total_Memory_Latest = 0
def f(df):
    global Total_Memory_Latest
    if df['Report Timestamp Number'] == max_value and df['Host/GuestVM'] == 'Host':
        Total_Memory_Latest += df['Total Physical Memory GB']
        return 0
    else:
        return np.nan
Main_Table['DummyField'] = Main_Table.apply(f, axis=1)
Main_Table.to_excel(r'C:\Users\storm_he\OneDrive - MTN Group\Documents\Testing\Main_Table.xlsx')
#CREATE STORAGE TABLE AND EXPORT
def f(df):
    #if df['Host/GuestVM'] == 'Host':
    #try:
    if float(df['Capacity(TB)']) >= 0.01:
        result = (df['Available(TB)']/df['Capacity(TB)'])*100
        return round(result)
    else:
        return ''
    #except:
    #return np.nan
df5['% Storage free'] = df5.apply(f, axis=1)
pattern = '|'.join(['.mtn.co.za', '.mtn.com'])
df5['VMHost'] = df5['VMHost'].str.replace(pattern,'')
df5['VMHost'] = df5['VMHost'].str.upper()
df5['Report Timestamp'] = df5['Report Timestamp'].dt.strftime("%Y%m%d")
#print(df5['Report Timestamp'])
df5.to_excel(r'C:\Users\storm_he\OneDrive - MTN Group\Documents\Testing\Main_Storage_table.xlsx')
print('Run Finished')
Stack trace:
TypeError Traceback (most recent call last)
<ipython-input-1-3c53bb32e311> in <module>
108 column = Main_Table["Report Timestamp Number"]
109
--> 110 max_value = column.max()
111 Total_Memory_Latest = 0
112
~\Anaconda3\lib\site-packages\pandas\core\generic.py in stat_func(self, axis, skipna, level, numeric_only, **kwargs)
11212 if level is not None:
11213 return self._agg_by_level(name, axis=axis, level=level, skipna=skipna)
--> 11214 return self._reduce(
11215 f, name, axis=axis, skipna=skipna, numeric_only=numeric_only
11216 )
~\Anaconda3\lib\site-packages\pandas\core\series.py in _reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
3889 )
3890 with np.errstate(all="ignore"):
-> 3891 return op(delegate, skipna=skipna, **kwds)
3892
3893 # TODO(EA) dispatch to Index
~\Anaconda3\lib\site-packages\pandas\core\nanops.py in f(values, axis, skipna, **kwds)
123 result = alt(values, axis=axis, skipna=skipna, **kwds)
124 else:
--> 125 result = alt(values, axis=axis, skipna=skipna, **kwds)
126
127 return result
~\Anaconda3\lib\site-packages\pandas\core\nanops.py in reduction(values, axis, skipna, mask)
835 result = np.nan
836 else:
--> 837 result = getattr(values, meth)(axis)
838
839 result = _wrap_results(result, dtype, fill_value)
~\Anaconda3\lib\site-packages\numpy\core\_methods.py in _amax(a, axis, out, keepdims, initial, where)
28 def _amax(a, axis=None, out=None, keepdims=False,
29 initial=_NoValue, where=True):
---> 30 return umr_maximum(a, axis, None, out, keepdims, initial, where)
31
32 def _amin(a, axis=None, out=None, keepdims=False,
TypeError: '>=' not supported between instances of 'float' and 'str'
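The failing line is max_value = column.max(). After dt.strftime, 'Report Timestamp' holds strings, while rows with missing timestamps stay as float NaN, so the column can mix str and float and NumPy cannot compare them. A minimal sketch of one fix, coercing the column to numeric before taking the max:

# Coerce to numeric: strings like '20200811' become numbers and
# anything unparseable becomes NaN, which max() skips.
Main_Table['Report Timestamp Number'] = pd.to_numeric(
    Main_Table['Report Timestamp Number'], errors='coerce')
max_value = Main_Table['Report Timestamp Number'].max()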

More efficient way to deal with KeyError from dictionary

I want to find the country names for a dataframe column of top-level domains such as 'de', 'it', 'us', using the iso3166 package.
Some domains in the dataset do not exist in iso3166, so a KeyError gets raised.
I tried to work around the error by letting the code compare values and fall back to 'unknown', but it runs for a really long time. It would be great to know how to speed it up.
Sample data: df['country']
0 an
1 de
2 it
My code (note: this code does not raise the KeyError; my question is how to make it faster):
df['country_name'] = df['country'].apply(lambda x: countries.get(x)[0] if \
    df['country'].str.find(x).any() == countries.get(x)[1].lower() else 'unknown')
df['country'] is the dataframe column. countries.get() gets country names from iso3166.
df['country'].str.find(x).any() finds top-level domains and countries.get(x)[1].lower() returns top-level domains. If they are the same, I use countries.get(x)[0] to return the country name.
Expected output
country country_name
an unknown
de Germany
it Italy
Error if I run df['country_name'] = df['country'].apply(lambda x: countries.get(x)[0]) (I renamed the dataframe, so the name differs in the error message):
KeyError Traceback (most recent call last)
<ipython-input-110-d51176ce2978> in <module>
----> 1 bf['country_name'] = bf['country'].apply(lambda x: countries.get(x)[0])
/opt/anaconda3/lib/python3.8/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
3846 else:
3847 values = self.astype(object).values
-> 3848 mapped = lib.map_infer(values, f, convert=convert_dtype)
3849
3850 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()
<ipython-input-110-d51176ce2978> in <lambda>(x)
----> 1 bf['country_name'] = bf['country'].apply(lambda x: countries.get(x)[0])
/opt/anaconda3/lib/python3.8/site-packages/iso3166/__init__.py in get(self, key, default)
358
359 if r == NOT_FOUND:
--> 360 raise KeyError(key)
361
362 return r
KeyError: 'an'
A little error handling and defining your logic outside of the apply() method should get you where you want to go. Something like:
def get_country_name(x):
    try:
        return countries.get(x)[0]
    except KeyError:
        return 'unknown'

df['country_name'] = df['country'].apply(lambda x: get_country_name(x))
This is James Tollefson's answer narrowed down to its core (I didn't want to change his answer too much); here's the implementation:
def get_country_name(x: str) -> str:
    # pass a default so countries.get returns None instead of raising
    country = countries.get(x, None)
    return country[0] if country else 'unknown'

df['country_name'] = df['country'].apply(get_country_name)
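If the lookup is still slow on a large frame, a sketch of a faster variant (assuming, as countries.get already does, that the TLDs line up with ISO 3166 alpha-2 codes): build a plain dict once and use Series.map, which avoids a Python function call per row:

from iso3166 import countries

# Build the lookup once: lower-cased alpha-2 code -> country name.
tld_to_name = {c.alpha2.lower(): c.name for c in countries}

# map() is a per-row dict lookup; missing codes become NaN,
# which fillna turns into 'unknown'.
df['country_name'] = df['country'].map(tld_to_name).fillna('unknown')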

Why do I have this problem with index range? Why does it not work?

I get this error when I try to split one column into several columns. It works when splitting into just one or two columns, but if you try to split into 3, 4, or 5 columns it writes:
ValueError Traceback (most recent call last)
/usr/local/Cellar/jupyterlab/2.1.5/libexec/lib/python3.8/site-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
349 try:
--> 350 return self._range.index(new_key)
351 except ValueError:
ValueError: 2 is not in range
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-19-d4e6a4d03e69> in <module>
22 data_old[Col_1_Label] = newz[0]
23 data_old[Col_2_Label] = newz[1]
---> 24 data_old[Col_3_Label] = newz[2]
25 #data_old[Col_4_Label] = newz[3]
26 #data_old[Col_5_Label] = newz[4]
/usr/local/Cellar/jupyterlab/2.1.5/libexec/lib/python3.8/site-packages/pandas/core/frame.py in __getitem__(self, key)
2798 if self.columns.nlevels > 1:
2799 return self._getitem_multilevel(key)
-> 2800 indexer = self.columns.get_loc(key)
2801 if is_integer(indexer):
2802 indexer = [indexer]
/usr/local/Cellar/jupyterlab/2.1.5/libexec/lib/python3.8/site-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
350 return self._range.index(new_key)
351 except ValueError:
--> 352 raise KeyError(key)
353 return super().get_loc(key, method=method, tolerance=tolerance)
354
KeyError: 2
Here is my code. I have a CSV file, and when pandas reads it, it creates one column with the value 'Контракт'. Then I split it into other columns. But it only splits into two columns, and I want 7 columns! Please help me understand this logic!
import pandas as pd
from pandas import Series, DataFrame
import re
dframe1 = pd.read_csv('po.csv')
columns = ['Контракт']
data_old = pd.read_csv('po.csv', header=None, names=columns)
data_old
# The thing you want to split the column on
SplitOn = ':'
# Name of Column you want to split
Split_Col = 'Контракт'
newz = data_old[Split_Col].str.split(pat=SplitOn, n=-1, expand=True)
# Column Labels (you can add more if you will have more)
Col_1_Label = 'Номер телефону'
Col_2_Label = 'Тарифний пакет'
Col_3_Label = 'Вихідні дзвінки з України за кордон'
Col_4_Label = 'ВАРТІСТЬ ПАКЕТА/ЩОМІСЯЧНА ПЛАТА'
Col_5_Label = 'ЗАМОВЛЕНІ ДОДАТКОВІ ПОСЛУГИ ЗА МЕЖАМИ ПАКЕТА'
Col_6_Label = 'Вартість послуги "Корпоративна мережа'
Col_7_Label = 'ЗАГАЛОМ ЗА КОНТРАКТОМ (БЕЗ ПДВ ТА ПФ)'
data_old[Col_1_Label] = newz[0]
data_old[Col_2_Label] = newz[1]
data_old[Col_3_Label] = newz[2]
#data_old[Col_4_Label] = newz[3]
#data_old[Col_5_Label] = newz[4]
#data_old[Col_6_Label] = newz[5]
#data_old[Col_7_Label] = newz[6]
data_old
Pandas does not support "unstructured text"; you should convert it to a standard format or Python objects and then create a dataframe from it.
Imagine that you have a file with this text named data.txt:
Contract № 12345679 Number of phone: +7984563774
Total price for month : 00.00000
Total price: 10.0000
You can load and process it with Python like this:
import re

with open('data.txt') as f:
    content = f.readlines()

# First line contains the contract number and phone information
contract, phone = content[0].split(':')
# find the contract number using a regex
contract = re.findall(r'\d+', contract)[0]
# The phone is straightforward
phone = phone.strip()
# Second line holds the monthly price, third line the total price
total_month_price = float(content[1].split(':')[1].strip())
total_price = float(content[2].split(':')[1].strip())
Then with these variables you can create a dataframe
df = pd.DataFrame([dict(N_of_contract=contract, total_price=total_price, total_month_price=total_month_price)])
Repeat the same for all files.
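A sketch of that last step (the *.txt glob pattern is an assumption): wrap the parsing above in a function and build one row per file:

import glob
import re
import pandas as pd

def parse_file(path):
    with open(path) as f:
        content = f.readlines()
    contract, phone = content[0].split(':')
    return dict(
        N_of_contract=re.findall(r'\d+', contract)[0],
        phone=phone.strip(),
        total_month_price=float(content[1].split(':')[1].strip()),
        total_price=float(content[2].split(':')[1].strip()),
    )

# One row per file, same columns in every row:
df = pd.DataFrame([parse_file(p) for p in glob.glob('*.txt')])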

Getting Error When Trying to Import Data From Website Using Loop

I'm trying to import data from multiple web pages into a data table using Python.
Basically, I'm trying to download attendance data for certain teams since 2000.
Here is what I have so far:
import requests
import pandas as pd
import numpy as np
#What is the effect of a rival team's performance on a team's attendance
Teams = ['LAA', 'LAD', 'NYY', 'NYM', 'CHC', 'CHW', 'OAK', 'SFG']
Years = []
for year in range(2000,2020):
    Years.append(str(year))
bbattend = pd.DataFrame(columns=['GM_Num','Date','Team','Home','Opp','W/L','R','RA','Inn','W-L','Rank','GB','Time','D/N','Attendance','Streak','Game_Win','Wins','Losses','Net_Wins'])
for team in Teams:
    for year in Years:
        url = 'https://www.baseball-reference.com/teams/' + team + '/' + year + '-schedule-scores.shtml'
        html = requests.get(url).content
        df_list = pd.read_html(html)
        df = df_list[-1]
        #Formatting data table
        df.rename(columns={"Gm#": "GM_Num", "Unnamed: 4": "Home", "Tm": "Team", "D/N": "Night"}, inplace = True)
        df['Home'] = df['Home'].apply(lambda x: 0 if x == '#' else 1)
        df['Game_Win'] = df['W/L'].astype(str).str[0]
        df['Game_Win'] = df['Game_Win'].apply(lambda x: 0 if x == 'L' else 1)
        df['Night'] = df['Night'].apply(lambda x: 1 if x == 'N' else 0)
        df['Streak'] = df['Streak'].apply(lambda x: -1*len(x) if '-' in x else len(x))
        df.drop('Unnamed: 2', axis=1, inplace = True)
        df.drop('Orig. Scheduled', axis=1, inplace = True)
        df.drop('Win', axis=1, inplace = True)
        df.drop('Loss', axis=1, inplace = True)
        df.drop('Save', axis=1, inplace = True)
        #Drop rows that do not have data
        df = df[df['GM_Num'].str.isdigit()]
        WL = df["W-L"].str.split("-", n = 1, expand = True)
        df["Wins"] = WL[0].astype(dtype=np.int64)
        df["Losses"] = WL[1].astype(dtype=np.int64)
        df['Net_Wins'] = df['Wins'] - df['Losses']
        # append returns a new frame, so the result must be assigned back
        bbattend = bbattend.append(df)
bbattend
When I run the body of the loop separately, using a specific link instead of building the URL by concatenation, it seems to work.
However, using this code, I get the error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-77-997e6aeea77e> in <module>
16 url = 'https://www.baseball-reference.com/teams/' + team + '/' + year +'-schedule-scores.shtml'
17 html = requests.get(url).content
---> 18 df_list = pd.read_html(html)
19 df = df_list[-1]
20 #Formatting data table
~/anaconda3/lib/python3.7/site-packages/pandas/io/html.py in read_html(io, match, flavor, header, index_col, skiprows, attrs, parse_dates, tupleize_cols, thousands, encoding, decimal, converters, na_values, keep_default_na, displayed_only)
1092 decimal=decimal, converters=converters, na_values=na_values,
1093 keep_default_na=keep_default_na,
-> 1094 displayed_only=displayed_only)
~/anaconda3/lib/python3.7/site-packages/pandas/io/html.py in _parse(flavor, io, match, attrs, encoding, displayed_only, **kwargs)
914 break
915 else:
--> 916 raise_with_traceback(retained)
917
918 ret = []
~/anaconda3/lib/python3.7/site-packages/pandas/compat/__init__.py in raise_with_traceback(exc, traceback)
418 if traceback == Ellipsis:
419 _, _, traceback = sys.exc_info()
--> 420 raise exc.with_traceback(traceback)
421 else:
422 # this version of raise is a syntax error in Python 3
ValueError: No tables found
I don't really understand what the error message is saying.
I'd appreciate any help!
Some of the pages do not have any table, so df_list = pd.read_html(html) will raise ValueError: No tables found.
You need to use try-except here.
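A minimal sketch of that guard, skipping team/year pages that have no table:

for team in Teams:
    for year in Years:
        url = 'https://www.baseball-reference.com/teams/' + team + '/' + year + '-schedule-scores.shtml'
        html = requests.get(url).content
        try:
            df_list = pd.read_html(html)
        except ValueError:
            # no tables on this page (the team/year combination may not exist)
            continue
        df = df_list[-1]
        # ...formatting as above...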
