I am curious as to why, when I create a DataFrame in the manner below, using lists to create the values in the rows does not graph and instead gives me the error "ValueError: x must be a label or position".
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
values = [9.83, 19.72, 7.19, 3.04]
values
[9.83, 19.72, 7.19, 3.04]
cols = ['Condition', 'No-Show']
conditions = ['Scholarship', 'Hipertension', 'Diabetes', 'Alcoholism']
df = pd.DataFrame(columns = [cols])
df['Condition'] = conditions
df['No-Show'] = values
df
Condition No-Show
0 Scholarship 9.83
1 Hipertension 19.72
2 Diabetes 7.19
3 Alcoholism 3.04
df.plot(kind='bar', x='Condition', y='No-Show');
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [17], in <cell line: 1>()
----> 1 df.plot(kind='bar', x='Condition', y='No-Show')
File ~\anaconda3\lib\site-packages\pandas\plotting\_core.py:938, in
PlotAccessor.__call__(self, *args, **kwargs)
936 x = data_cols[x]
937 elif not isinstance(data[x], ABCSeries):
--> 938 raise ValueError("x must be a label or position")
939 data = data.set_index(x)
940 if y is not None:
941 # check if we have y as int or list of ints
ValueError: x must be a label or position
Yet if I create the same DataFrame a different way, it graphs just fine:
df2 = pd.DataFrame({'Condition': ['Scholarship', 'Hipertension', 'Diabetes', 'Alcoholism'],
                    'No-Show': [9.83, 19.72, 7.19, 3.04]})
df2
Condition No-Show
0 Scholarship 9.83
1 Hipertension 19.72
2 Diabetes 7.19
3 Alcoholism 3.04
df2.plot(kind='bar', x='Condition', y='No-Show')
plt.ylim(0, 50)
#graph appears here just fine
Can someone enlighten me as to why it works the second way and not the first? I am a new student and am confused; I appreciate any insight.
Let's look at pd.DataFrame.info for both dataframes.
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 (Condition,) 4 non-null object
1 (No-Show,) 4 non-null float64
dtypes: float64(1), object(1)
memory usage: 192.0+ bytes
Note, your column headers are one-element tuples rather than plain strings: passing columns=[cols] wraps the list in another list, so pandas builds the labels from it as tuples.
Now, look at info for df2.
df2.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Condition 4 non-null object
1 No-Show 4 non-null float64
dtypes: float64(1), object(1)
memory usage: 192.0+ bytes
Note, your column headers here are plain strings.
As @BigBen states in his comment, you don't need the extra brackets in your DataFrame constructor for df.
FYI, to fix your plot statement while keeping the incorrect DataFrame constructor for df:
df.plot(kind='bar', x=('Condition',), y=('No-Show',))
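Alternatively, to fix the constructor itself, a minimal sketch reusing the cols, conditions, and values lists from above:
df = pd.DataFrame(columns=cols)  # no extra brackets, so the labels stay plain strings
df['Condition'] = conditions
df['No-Show'] = values
df.plot(kind='bar', x='Condition', y='No-Show');  # the string labels now resolve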
I am using df = df.replace('No data', np.nan) on a CSV file that contains 'No data' instead of blank/null entries where there is no numeric data. Using the head() method, I can see that replace does substitute NaN for the 'No data' entries. But when I use df.info(), it says that the datatype of each series is object.
When I instead open the CSV file in Excel and manually change 'No data' to blank/null entries with find and replace, the dataframes look exactly the same when I use df.head(), yet df.info() says the datatypes are floats.
I was wondering why this is, and how I can make the datatypes of my series floats without having to manually edit the CSV files.
If the rest of the data in your columns is numeric, then you should use pd.to_numeric with errors='coerce'; it operates on one Series at a time, so apply it column by column.
import pandas as pd
import numpy as np
# Create data for columns with strings in it
column_data = [1,2,3] + ['no data']
# Construct a DataFrame with two columns
df = pd.DataFrame({'col1':column_data, 'col2':column_data[::-1]})
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
col1 4 non-null object
col2 4 non-null object
dtypes: object(2)
memory usage: 144.0+ bytes
# Replace 'no data' with NaN
df_nan = df.replace('no data', np.nan)
# Cast every column to float
df_result = df_nan.astype({c: float for c in df_nan.columns})
df_result.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
col1 3 non-null float64
col2 3 non-null float64
dtypes: float64(2)
memory usage: 144.0 bytes
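For comparison, a minimal sketch of the pd.to_numeric route mentioned above; applied column by column with errors='coerce', it turns anything unparseable (such as 'no data') into NaN, so the separate replace step becomes unnecessary:
# coerce every column to numeric; unparseable strings become NaN
df_result = df.apply(pd.to_numeric, errors='coerce')
df_result.info()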
Consider using the converters arg in pandas.read_csv(), where you pass a dictionary of column positions referencing a conversion function. The converter below checks for the 'No data' string and conditionally replaces the value with np.nan, otherwise converting to float as is:
import numpy as np
import pandas as pd
c_fct = lambda x: float(x) if 'No data' not in x else np.nan
convertdict = {1:c_fct, 2:c_fct, 3:c_fct, 4:c_fct, 5:c_fct}
df = pd.read_csv('Input.csv', converters=convertdict)
Input CSV
ID Col1 Col2 Col3 Col4 Col5
TGG 0.634516647 0.900464347 0.998505978 0.170422713 0.893340128
GRI No data 0.045915333 0.718398939 0.924813864 No data
NLB 0.921127268 0.614460813 0.677857676 0.343612947 0.559437744
SEI 0.081852313 No data 0.890816385 0.943313021 0.874857844
LOY 0.632556715 0.362855866 0.038702448 0.253762859 No data
OPL 0.375088582 0.268283238 0.761552111 0.589547625 0.192223208
CTK 0.349464541 0.844718987 No data 0.841439909 0.898093646
EUE 0.629784261 0.982589843 0.315670377 0.832419474 0.950044814
JLP 0.543942659 0.988380305 0.417191823 0.823857176 0.542514099
RHK 0.728053447 0.521816539 0.402523435 No data 0.558226706
AEM 0.005495116 0.715363776 0.075508356 0.959119268 0.844730368
VLQ 0.21146319 0.558208766 0.501769554 0.226539046 0.795861461
MDB 0.230514689 0.223163664 No data 0.324636384 0.700716246
LPH 0.853433224 0.582678173 0.633109347 0.432191426 No data
PEP 0.41096305 No data .627776178 0.482359278 0.179863537
UQK 0.252598809 0.497517585 0.276060768 No data 0.087985623
KGJ 0.033985585 0.033702088 anNo data 0.286682709 0.543349787
JUQ 0.25971543 0.142067155 0.597985191 0.219841249 0.699822866
NYW No data 0.17187907 0.157413049 0.209011772 0.592824483
Output
print(df)
# ID Col1 Col2 Col3 Col4 Col5
# 0 TGG 0.634517 0.900464 0.998506 0.170423 0.893340
# 1 GRI NaN 0.045915 0.718399 0.924814 NaN
# 2 NLB 0.921127 0.614461 0.677858 0.343613 0.559438
# 3 SEI 0.081852 NaN 0.890816 0.943313 0.874858
# 4 LOY 0.632557 0.362856 0.038702 0.253763 NaN
# 5 OPL 0.375089 0.268283 0.761552 0.589548 0.192223
# 6 CTK 0.349465 0.844719 NaN 0.841440 0.898094
# 7 EUE 0.629784 0.982590 0.315670 0.832419 0.950045
# 8 JLP 0.543943 0.988380 0.417192 0.823857 0.542514
# 9 RHK 0.728053 0.521817 0.402523 NaN 0.558227
# 10 AEM 0.005495 0.715364 0.075508 0.959119 0.844730
# 11 VLQ 0.211463 0.558209 0.501770 0.226539 0.795861
# 12 MDB 0.230515 0.223164 NaN 0.324636 0.700716
# 13 LPH 0.853433 0.582678 0.633109 0.432191 NaN
# 14 PEP 0.410963 NaN 0.627776 0.482359 0.179864
# 15 UQK 0.252599 0.497518 0.276061 NaN 0.087986
# 16 KGJ 0.033986 0.033702 NaN 0.286683 0.543350
# 17 JUQ 0.259715 0.142067 0.597985 0.219841 0.699823
# 18 NYW NaN 0.171879 0.157413 0.209012 0.592824
print(df.dtypes)
# ID object
# Col1 float64
# Col2 float64
# Col3 float64
# Col4 float64
# Col5 float64
# dtype: object
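As a side note, if no per-value logic is needed, the na_values argument of read_csv may be simpler, assuming the missing-data marker is always the exact cell value 'No data':
# treat the literal string 'No data' as NaN while parsing
df = pd.read_csv('Input.csv', na_values=['No data'])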
I have a CSV file called 'filename' and want to read these data in as float64, except for the column 'hour'. I managed it with the pd.read_csv function and a converter.
df = pd.read_csv("../data/filename.csv",
                 delimiter=';',
                 date_parser=['hour'],
                 skiprows=1,
                 converters={'column1': lambda x: float(x.replace('.', '').replace(',', '.'))})
Now, I have two points:
FIRST:
The delimiter works with ';', but if I take a look at my data in Notepad, there are ',' characters, not ';'. If I use ',' I get: 'pandas.parser.CParserError: Error tokenizing data. C error: Expected 7 fields in line 13, saw 9'.
SECOND:
If I want to use the converter for all columns, how can I do that? What's the right term?
I tried to use dtype=float in the read function, but I get 'AttributeError: 'NoneType' object has no attribute 'dtype''. What happened? That's the reason why I wanted to manage it with the converter.
Data:
,hour,PV,Wind onshore,Wind offshore,PV.1,Wind onshore.1,Wind offshore.1,PV.2,Wind onshore.2,Wind offshore.2
0,1,0.0,"12,985.0","9,614.0",0.0,"32,825.5","9,495.7",0.0,"13,110.3","10,855.5"
1,2,0.0,"12,908.9","9,290.8",0.0,"36,052.3","9,589.1",0.0,"13,670.2","10,828.6"
2,3,0.0,"12,740.9","8,886.9",0.0,"38,540.9","10,087.3",0.0,"14,610.8","10,828.6"
3,4,0.0,"12,485.3","8,644.5",0.0,"40,734.0","10,087.3",0.0,"15,638.3","10,343.7"
4,5,0.0,"11,188.5","8,079.0",0.0,"42,688.0","10,087.3",0.0,"16,809.4","10,343.7"
5,6,0.0,"11,219.0","7,594.2",0.0,"43,333.5","10,025.0",0.0,"18,266.9","10,343.7"
This should work:
In [40]:
# imports needed for this session
import io
import pandas as pd
# text data
temp=''',hour,PV,Wind onshore,Wind offshore,PV.1,Wind onshore.1,Wind offshore.1,PV.2,Wind onshore.2,Wind offshore.2
0,1,0.0,"12,985.0","9,614.0",0.0,"32,825.5","9,495.7",0.0,"13,110.3","10,855.5"
1,2,0.0,"12,908.9","9,290.8",0.0,"36,052.3","9,589.1",0.0,"13,670.2","10,828.6"
2,3,0.0,"12,740.9","8,886.9",0.0,"38,540.9","10,087.3",0.0,"14,610.8","10,828.6"
3,4,0.0,"12,485.3","8,644.5",0.0,"40,734.0","10,087.3",0.0,"15,638.3","10,343.7"
4,5,0.0,"11,188.5","8,079.0",0.0,"42,688.0","10,087.3",0.0,"16,809.4","10,343.7"
5,6,0.0,"11,219.0","7,594.2",0.0,"43,333.5","10,025.0",0.0,"18,266.9","10,343.7"'''
# so read the csv, pass params quotechar and the thousands character
df = pd.read_csv(io.StringIO(temp), quotechar='"', thousands=',')
df
Out[40]:
Unnamed: 0 hour PV Wind onshore Wind offshore PV.1 Wind onshore.1 \
0 0 1 0 12985.0 9614.0 0 32825.5
1 1 2 0 12908.9 9290.8 0 36052.3
2 2 3 0 12740.9 8886.9 0 38540.9
3 3 4 0 12485.3 8644.5 0 40734.0
4 4 5 0 11188.5 8079.0 0 42688.0
5 5 6 0 11219.0 7594.2 0 43333.5
Wind offshore.1 PV.2 Wind onshore.2 Wind offshore.2
0 9495.7 0 13110.3 10855.5
1 9589.1 0 13670.2 10828.6
2 10087.3 0 14610.8 10828.6
3 10087.3 0 15638.3 10343.7
4 10087.3 0 16809.4 10343.7
5 10025.0 0 18266.9 10343.7
In [41]:
# check the dtypes
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 0 to 5
Data columns (total 11 columns):
Unnamed: 0 6 non-null int64
hour 6 non-null int64
PV 6 non-null float64
Wind onshore 6 non-null float64
Wind offshore 6 non-null float64
PV.1 6 non-null float64
Wind onshore.1 6 non-null float64
Wind offshore.1 6 non-null float64
PV.2 6 non-null float64
Wind onshore.2 6 non-null float64
Wind offshore.2 6 non-null float64
dtypes: float64(9), int64(2)
memory usage: 576.0 bytes
So basically you need to pass the quotechar='"' and thousands=',' params to read_csv to achieve what you want, see the docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv
EDIT
If you want to convert after importing (which is a waste when you can do it upfront), then you can do this for each column of interest:
In [43]:
# replace the comma separator
df['Wind onshore'] = df['Wind onshore'].str.replace(',','')
# convert the type
df['Wind onshore'] = df['Wind onshore'].astype(np.float64)
df['Wind onshore'].dtype
Out[43]:
dtype('float64')
It would be faster to replace the comma separator on all the columns of interest first and just call convert_objects like so: df.convert_objects(convert_numeric=True)
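For example, a rough sketch of that batch conversion; convert_objects has since been removed from pandas, so this uses pd.to_numeric instead, assuming the numeric columns are everything except 'Unnamed: 0' and 'hour':
num_cols = df.columns.drop(['Unnamed: 0', 'hour'])
# strip the thousands separators, then convert each remaining column
df[num_cols] = df[num_cols].replace(',', '', regex=True).apply(pd.to_numeric)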
Given a dataframe d such as this:
index col1
1 a
2 a
3 b
4 b
Create a prefiltered group object with new values:
g = d[prefilter].groupby(['some cols']).apply( somefunc )
index col1
2 c
4 d
Now I want to update d to this:
index col1
1 a
2 c
3 b
4 d
I've been hacking away with update, ix, filtering, where, etc. I am guessing there is an obvious solution I am not seeing here. Stuff like this is not working:
d[d.index == db.index]['alert_v'] = db['alert_v']
q90 = g.transform( somefunc )
d.ix[ d['alert_v'] >=q90, 'alert_v'] = 1
d.ix[ d['alert_v'] < q90, 'alert_v'] = 0
d['alert_v'] = np.where( d.index==db.index, db['alert_v'], d['alert_v'] )
Any help is appreciated, thank you.
--edit--
The two dataframes are in the same form: one is simply a filtered version of the other, with different values, which I want to update back into the original.
ValueError: cannot reindex from a duplicate axis
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2186 entries, 1984-12-12 13:33:00 to 1939-03-19 22:54:00
Data columns (total 9 columns):
source 2186 non-null object
subject_id 2186 non-null float64
alert_t 2186 non-null object
variable 2186 non-null object
timeindex 2186 non-null datetime64[ns]
alert_v 2105 non-null float64
value 2186 non-null float64
tavg 54 non-null timedelta64[ns]
iqt 61 non-null object
dtypes: datetime64[ns](1), float64(3), object(4), timedelta64[ns](1)
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1982 entries, 1984-12-12 13:33:00 to 1939-03-19 22:54:00
Data columns (total 9 columns):
source 1982 non-null object
subject_id 1982 non-null float64
alert_t 1982 non-null object
variable 1982 non-null object
timeindex 1982 non-null datetime64[ns]
alert_v 1982 non-null int64
value 1982 non-null float64
tavg 0 non-null timedelta64[ns]
iqt 0 non-null object
dtypes: datetime64[ns](1), float64(2), int64(1), object(4), timedelta64[ns](1)
You want the df.update() function.
Try something like this:
import pandas as pd
df1 = pd.DataFrame({'Index':[1,2,3,4],'Col1':['A', 'B', 'C', 'D']}).set_index('Index')
df2 = pd.DataFrame({'Index':[2,4],'Col1':['E', 'F']}).set_index('Index')
print(df1)
Col1
Index
1 A
2 B
3 C
4 D
df1.update(df2)
print(df1)
Col1
Index
1 A
2 E
3 C
4 F
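One caveat worth noting: update() aligns the two frames on their index labels, so the 'cannot reindex from a duplicate axis' error above suggests the DatetimeIndex contains repeated timestamps. A sketch of checking for and dropping them, assuming the duplicate rows can safely be discarded:
print(d.index.is_unique)  # False indicates duplicate index labels
d = d[~d.index.duplicated(keep='first')]  # keep the first row per timestamp
d.update(db)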