I am curious why, when I create a DataFrame in the manner below (using lists to build the rows), it does not graph and gives me the error "ValueError: x must be a label or position".
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
values = [9.83, 19.72, 7.19, 3.04]
values
[9.83, 19.72, 7.19, 3.04]
cols = ['Condition', 'No-Show']
conditions = ['Scholarship', 'Hipertension', 'Diabetes', 'Alcoholism']
df = pd.DataFrame(columns = [cols])
df['Condition'] = conditions
df['No-Show'] = values
df
Condition No-Show
0 Scholarship 9.83
1 Hipertension 19.72
2 Diabetes 7.19
3 Alcoholism 3.04
df.plot(kind='bar', x='Condition', y='No-Show');
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [17], in <cell line: 1>()
----> 1 df.plot(kind='bar', x='Condition', y='No-Show')
File ~\anaconda3\lib\site-packages\pandas\plotting\_core.py:938, in
PlotAccessor.__call__(self, *args, **kwargs)
936 x = data_cols[x]
937 elif not isinstance(data[x], ABCSeries):
--> 938 raise ValueError("x must be a label or position")
939 data = data.set_index(x)
940 if y is not None:
941 # check if we have y as int or list of ints
ValueError: x must be a label or position
Yet if I create the same DataFrame a different way, it graphs just fine:
df2 = pd.DataFrame({'Condition': ['Scholarship', 'Hipertension', 'Diabetes', 'Alcoholism'],
                    'No-Show': [9.83, 19.72, 7.19, 3.04]})
df2
Condition No-Show
0 Scholarship 9.83
1 Hipertension 19.72
2 Diabetes 7.19
3 Alcoholism 3.04
df2.plot(kind='bar', x='Condition', y='No-Show')
plt.ylim(0, 50)
#graph appears here just fine
Can someone enlighten me as to why it works the second way and not the first? I am a new student and am confused, so I appreciate any insight.
Let's look at pd.DataFrame.info for both dataframes.
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 (Condition,) 4 non-null object
1 (No-Show,) 4 non-null float64
dtypes: float64(1), object(1)
memory usage: 192.0+ bytes
Note that your column headers are one-element tuples, not strings: passing columns=[cols] (a list that contains a list) makes pandas build a MultiIndex, so each column label becomes the tuple ('Condition',) or ('No-Show',), and the plain string 'Condition' no longer matches any column.
Now, look at info for df2.
df2.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Condition 4 non-null object
1 No-Show 4 non-null float64
dtypes: float64(1), object(1)
memory usage: 192.0+ bytes
Note that your column headers here are plain strings.
As @BigBen states in his comment, you don't need the extra brackets in your DataFrame constructor for df.
FYI, to make your plot call work against the tuple column labels produced by the incorrect constructor:
df.plot(kind='bar', x=('Condition',), y=('No-Show',))
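But the cleaner fix is to repair the constructor itself, as noted above. A minimal sketch, reusing the cols, conditions, and values lists from the question:
# pass the list of names directly, without the extra brackets,
# so the column labels stay plain strings
df = pd.DataFrame(columns=cols)
df['Condition'] = conditions
df['No-Show'] = values
df.plot(kind='bar', x='Condition', y='No-Show');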
I'm trying to merge two dataframes: 'new_df' and 'df3'.
new_df contains years and months, and df3 contains years, months and other columns.
I've cast most of the columns as object and tried to merge them.
The merge 'works' in that it doesn't return an error, but my final dataframe is all empty; only the year and month columns are correct.
new_df
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119 entries, 0 to 118
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date_test 119 non-null datetime64[ns]
1 year 119 non-null object
2 month 119 non-null object
dtypes: datetime64[ns](1), object(2)
df3
<class 'pandas.core.frame.DataFrame'>
Int64Index: 191 entries, 53 to 1297
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 case_number 191 non-null object
1 date 191 non-null object
2 year 191 non-null object
3 country 191 non-null object
4 area 191 non-null object
5 location 191 non-null object
6 activity 191 non-null object
7 fatal_y_n 182 non-null object
8 time 172 non-null object
9 species 103 non-null object
10 month 190 non-null object
dtypes: object(11)
I've tried this line of code:
df_joined = pd.merge(left=new_df, right=df3, how='left', on=['year','month'])
I was expecting a table with filled fields in all columns; instead, every column except year and month came back empty.
Your issue is with the data types for month and year in both dataframes: they're of type object, which gets a bit weird during the join.
Here's a great answer that goes into depth about converting types to numbers, but here's what the code might look like before joining:
# convert column "year" and "month" of new_df
new_df["year"] = pd.to_numeric(new_df["year"])
new_df["month"] = pd.to_numeric(new_df["month"])
And make sure you do the same with df3 as well.
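For example, a sketch of the matching conversion and the re-run merge, assuming df3's year and month hold numeric strings:
# convert the join keys in df3 the same way
df3["year"] = pd.to_numeric(df3["year"])
df3["month"] = pd.to_numeric(df3["month"])
# with both sides numeric, equal keys can actually match
df_joined = pd.merge(left=new_df, right=df3, how='left', on=['year', 'month'])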
You may also have a data integrity problem: not sure what you're doing before you get those dataframes, but if something is casting them to object, you may have had a mix of ints/strings or other data types lumped together. Here's a good article that goes over pandas data types. Specifically, an object column can hold a mix of strings and other data, so the join can silently fail to match; the sketch below illustrates this.
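A tiny self-contained illustration, with made-up data, of how mixed types hiding inside object columns produce empty matches:
import pandas as pd
# both 'year' columns are object dtype, but one holds the string '2020'
# and the other holds the integer 2020
left = pd.DataFrame({'year': pd.Series(['2020'], dtype=object), 'v': [1]})
right = pd.DataFrame({'year': pd.Series([2020], dtype=object), 'w': [2]})
print(left.merge(right, on='year', how='left'))  # w is NaN: '2020' != 2020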
Hope that helps!
I would like to select rows using a condition on a column, for example "sex" == "male".
I normally use loc on a DataFrame.
import pandas as pd
dane = pd.read_csv('insurance.csv')
dane.info()
the result is:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 1338 non-null int64
1 sex 1338 non-null object
2 bmi 1338 non-null float64
3 children 1338 non-null int64
4 smoker 1338 non-null object
5 region 1338 non-null object
6 charges 1338 non-null float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
a = dane.loc(dane["sex"] == "male")
And after running this cell I get this error:
TypeError Traceback (most recent call last)
<ipython-input-9-18dd4823c7e4> in <module>()
----> 1 a = dane.loc(dane["sex"] == "male")
1 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/generic.py in _get_axis_number(cls, axis)
544 def _get_axis_number(cls, axis: Axis) -> int:
545 try:
--> 546 return cls._AXIS_TO_AXIS_NUMBER[axis]
547 except KeyError:
548 raise ValueError(f"No axis named {axis} for object type {cls.__name__}")
TypeError: unhashable type: 'Series'
Yet if I run an example from the Internet, everything works:
boxes = {'Color': ['Green','Green','Green','Blue','Blue','Red','Red','Red'],
         'Shape': ['Rectangle','Rectangle','Square','Rectangle','Square','Square','Square','Rectangle'],
         'Price': [10,15,5,5,10,15,15,5]
        }
df = pd.DataFrame(boxes, columns= ['Color','Shape','Price'])
df.info()
select_color = df.loc[df['Color'] == 'Green']
print (select_color)
The result is:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Color 8 non-null object
1 Shape 8 non-null object
2 Price 8 non-null int64
dtypes: int64(1), object(2)
memory usage: 320.0+ bytes
Color Shape Price
0 Green Rectangle 10
1 Green Rectangle 15
2 Green Square 5
What is the reason for the problem in my situation? It is a normal CSV file with the same format of data.
You are making a function call on the loc indexer: dane.loc(dane["sex"] == "male")
where you should be indexing with square brackets: dane.loc[dane["sex"] == "male"]
Calling loc(...) passes your boolean Series as the axis argument, which is why the traceback ends in _get_axis_number with "unhashable type: 'Series'".
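For example, the corrected line, plus the equivalent shorthand without loc:
a = dane.loc[dane["sex"] == "male"]
# plain boolean indexing performs the same row selection
a = dane[dane["sex"] == "male"]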
A tutorial had this dataframe sequels as follows:
title sequel
id
19995 Avatar nan
862 Toy Story 863
863 Toy Story 2 10193
597 Titanic nan
24428 The Avengers nan
<class 'pandas.core.frame.DataFrame'>
Index: 4803 entries, 19995 to 185567
Data columns (total 2 columns):
title 4803 non-null object
sequel 4803 non-null object
dtypes: object(2)
memory usage: 272.6+ KB
The tutorial provided a file sequels.p. However, when I read the file in, my dataframe was different to that of the tutorial
my_sequels = pd.read_pickle('data/pandas/sequels.p')
my_sequels.set_index('id', inplace=True)
my_sequels.head()
title sequel
id
19995 Avatar <NA>
862 Toy Story 863
863 Toy Story 2 10193
597 Titanic <NA>
24428 The Avengers <NA>
my_sequels.info()
<class 'pandas.core.frame.DataFrame'>
Index: 4803 entries, 19995 to 185567
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 title 4803 non-null object
1 sequel 90 non-null Int64
dtypes: Int64(1), object(1)
memory usage: 117.3+ KB
My question is: is there a way to manipulate my_sequels to be similar to sequels, that is, to have my_sequels['sequel'] as an object with 4803 non-null where <NA> becomes nan?
EDIT: the reason I wanted to have my_sequels to be the same as sequels was to avoid the errors from the subsequent steps:
sequels_fin = my_sequels.merge(financials, on='id', how='left')
orig_seq = sequels_fin.merge(sequels_fin, how='inner', left_on='sequel',
right_on='id', right_index=True,
suffixes=('_org','_seq'))
ValueError Traceback (most recent call last)
<ipython-input-5-7215de303684> in <module>
3 orig_seq = sequels_fin.merge(sequels_fin, how='inner', left_on='sequel',
4 right_on='id', right_index=True,
----> 5 suffixes=('_org','_seq'))
ValueError: cannot convert to 'int64'-dtype NumPy array with missing values. Specify an appropriate 'na_value' for this dtype.
I don't think you would want to. The reason you are seeing this is that the tutorial is based on an older version of pandas than the one you are using.
https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
You can detect and operate on missing values as you would probably expect.
arr = pd.array([1, 2, None], dtype=pd.Int64Dtype())
arr.isna()
array([False, False, True])
arr.fillna(0)
<IntegerArray>
[1, 2, 0]
Length: 3, dtype: Int64
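That said, if you really do want the sequel column to show nan instead of <NA> (for example, to sidestep the merge error above), a minimal sketch is to cast the nullable Int64 column to plain float64, since pd.NA becomes np.nan in that conversion:
# float64 has no pd.NA, so missing entries become ordinary NaN
my_sequels['sequel'] = my_sequels['sequel'].astype('float64')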
First, set the index to 'id' (the merge below uses right_index=True, so 'id' must actually be the index):
sequels_fin = sequels_fin.set_index('id')
After that:
orig_seq = sequels_fin.merge(sequels_fin, how='inner', left_on='sequel',
right_on='id', right_index=True,
suffixes=('_org','_seq'))
I've organized my data using pandas, and my procedure is as follows:
import pandas as pd
import numpy as np
df1 = pd.read_table(r'E:\빅데이터 캠퍼스\골목상권 프로파일링 - 서울 열린데이터 광장 3.초기-16년5월분1\17.상권-추정매출\201301-201605\tbsm_trdar_selng.txt\tbsm_trdar_selng_utf8.txt' , sep='|' ,header=None
,dtype = { '0' : pd.np.int})
df1 = df1.replace('201301', int(201301))
df2 = df1[[0 ,1, 2, 3 ,4, 11,12 ,82 ]]
df2_rename = df2.columns = ['STDR_YM_CD', 'TRDAR_CD', 'TRDAR_CD_NM', 'SVC_INDUTY_CD', 'SVC_INDUTY_CD_NM', 'THSMON_SELNG_AMT', 'THSMON_SELNG_CO', 'STOR_CO' ]
print(df2.head(40))
df3_groupby = df2.groupby(['STDR_YM_CD', 'TRDAR_CD' ])
df4_agg = df3_groupby.agg(np.sum)
print(df4_agg.head(30))
When I print df2 I can see both the 11947 and the 11948 values in my TRDAR_CD column.
After that, I used the groupby function and the 11948 values disappeared from my TRDAR_CD column.
Possibly this problem comes from the warning message? The warning is: 'sys:1: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.'
Please help.
print(df2.info()) gives:
RangeIndex: 1089023 entries, 0 to 1089022
Data columns (total 8 columns):
STDR_YM_CD 1089023 non-null object
TRDAR_CD 1089023 non-null int64
TRDAR_CD_NM 1085428 non-null object
SVC_INDUTY_CD 1089023 non-null object
SVC_INDUTY_CD_NM 1089023 non-null object
THSMON_SELNG_AMT 1089023 non-null int64
THSMON_SELNG_CO 1089023 non-null int64
STOR_CO 1089023 non-null int64
dtypes: int64(4), object(4)
memory usage: 66.5+ MB
None
A MultiIndex is created from the first and second columns, and when the first level contains duplicates, pandas by default 'sparsifies' the display of the outer levels to make the console output a bit easier on the eyes; the repeated values are still there, they are just not printed.
You can show every value in the first level of the MultiIndex by setting display.multi_sparse to False.
Sample:
df = pd.DataFrame({'A':[1,1,3],
                   'B':[4,5,6],
                   'C':[7,8,9]})
df.set_index(['A','B'], inplace=True)
print (df)
C
A B
1 4 7
5 8
3 6 9
#temporary set multi_sparse to False
#http://pandas.pydata.org/pandas-docs/stable/options.html#getting-and-setting-options
with pd.option_context('display.multi_sparse', False):
    print (df)
C
A B
1 4 7
1 5 8
3 6 9
EDIT (based on the edit of the question):
I think the problem is that the value 11948 is stored as a string, so it is omitted.
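A quick sketch to check that hypothesis, counting the Python types actually stored in the column:
# if the column mixes int and str, both types show up in the counts
print(df2['TRDAR_CD'].map(type).value_counts())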
EDIT 1 (based on the file):
You can simplify your solution by passing the usecols parameter to read_table and then aggregating with GroupBy.sum:
import pandas as pd
import numpy as np
df2 = pd.read_table(r'tbsm_trdar_selng_utf8.txt',
                    sep='|',
                    header=None,
                    usecols=[0, 1, 2, 3, 4, 11, 12, 82],
                    names=['STDR_YM_CD', 'TRDAR_CD', 'TRDAR_CD_NM', 'SVC_INDUTY_CD', 'SVC_INDUTY_CD_NM', 'THSMON_SELNG_AMT', 'THSMON_SELNG_CO', 'STOR_CO'],
                    dtype={'STDR_YM_CD': int})  # key the dtype by the real column name; '0' matches nothing once names are set
df4_agg = df2.groupby(['STDR_YM_CD', 'TRDAR_CD' ]).sum()
print(df4_agg.head(10))
THSMON_SELNG_AMT THSMON_SELNG_CO STOR_CO
STDR_YM_CD TRDAR_CD
201301 11947 1966588856 74798 73
11948 3404215104 89064 116
11949 1078973946 42005 45
11950 1759827974 93245 71
11953 779024380 21042 84
11954 2367130386 94033 128
11956 511840921 23340 33
11957 329738651 15531 50
11958 1255880439 42774 118
11962 1837895919 66692 68
I have the following dataframe:
Join_Count 1
LSOA11CD
E01006512 15
E01006513 35
E01006514 11
E01006515 11
E01006518 11
...
But when I try to sort it:
BusStopList.sort("LSOA11CD",ascending=1)
I get the following:
Key Error: 'LSOA11CD'
How do I go about sorting this by either the LSOA column or the column full of numbers which doesn't have a heading?
The following is the information produced by Python about this dataframe:
<class 'pandas.core.frame.DataFrame'>
Index: 286 entries, E01006512 to E01033768
Data columns (total 1 columns):
1 286 non-null int64
dtypes: int64(1)
memory usage: 4.5+ KB
'LSOA11CD' is the name of the index; 1 is the name of the column. So you must use sort_index (rather than sort_values):
BusStopList.sort_index(level="LSOA11CD", ascending=True)
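If you instead want to sort by the unnamed numeric column, note that its label is the integer 1, so a sketch would be:
# sort by the count column, whose label is the integer 1
BusStopList.sort_values(by=1, ascending=False)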