Problem with selecting individual rows from pandas data frame [closed] - python

I would like to select rows using a condition on a column, such as "sex" == "male".
I normally use the loc indexer on a DataFrame.
import pandas as pd
dane = pd.read_csv('insurance.csv')
dane.info()
The result is:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 1338 non-null int64
1 sex 1338 non-null object
2 bmi 1338 non-null float64
3 children 1338 non-null int64
4 smoker 1338 non-null object
5 region 1338 non-null object
6 charges 1338 non-null float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
a = dane.loc(dane["sex"] == "male")
After running this cell, I get this error:
TypeError Traceback (most recent call last)
<ipython-input-9-18dd4823c7e4> in <module>()
----> 1 a = dane.loc(dane["sex"] == "male")
1 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/generic.py in _get_axis_number(cls, axis)
544 def _get_axis_number(cls, axis: Axis) -> int:
545 try:
--> 546 return cls._AXIS_TO_AXIS_NUMBER[axis]
547 except KeyError:
548 raise ValueError(f"No axis named {axis} for object type {cls.__name__}")
TypeError: unhashable type: 'Series'
If I run an example from the Internet, everything works:
boxes = {'Color': ['Green','Green','Green','Blue','Blue','Red','Red','Red'],
'Shape': ['Rectangle','Rectangle','Square','Rectangle','Square','Square','Square','Rectangle'],
'Price': [10,15,5,5,10,15,15,5]
}
df = pd.DataFrame(boxes, columns= ['Color','Shape','Price'])
df.info()
select_color = df.loc[df['Color'] == 'Green']
print (select_color)
The result is:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Color 8 non-null object
1 Shape 8 non-null object
2 Price 8 non-null int64
dtypes: int64(1), object(2)
memory usage: 320.0+ bytes
Color Shape Price
0 Green Rectangle 10
1 Green Rectangle 15
2 Green Square 5
What is the reason for the problem in my case? This is a normal CSV file, with the same data format, etc.

You are calling loc like a function, with parentheses: dane.loc(dane["sex"] == "male")
whereas you should index it with square brackets: dane.loc[dane["sex"] == "male"]
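A minimal sketch of the difference, using a small hypothetical frame in place of insurance.csv (the rows are made up for illustration). Calling loc with parentheses hands the boolean Series to pandas as an axis argument, which is why the traceback ends in _get_axis_number; square brackets perform the selection:

```python
import pandas as pd

# Hypothetical stand-in for insurance.csv; the values are made up.
dane = pd.DataFrame({
    "age": [19, 18, 28, 33],
    "sex": ["female", "male", "male", "female"],
    "smoker": ["yes", "no", "no", "no"],
})

mask = dane["sex"] == "male"  # boolean Series, one entry per row

# Parentheses call loc like a function; pandas treats the argument as an axis.
try:
    dane.loc(mask)
    call_failed = False
except (TypeError, ValueError):
    call_failed = True

# Square brackets index with the mask and return the matching rows.
males = dane.loc[mask]
print(males)

# Several conditions can be combined with & or |, each wrapped in parentheses.
male_nonsmokers = dane.loc[(dane["sex"] == "male") & (dane["smoker"] == "no")]
print(male_nonsmokers)
```

The same bracket syntax also accepts a column selection as a second element, for example dane.loc[mask, ["age", "sex"]].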

Related

A merge in pandas is returning only NaN values

I'm trying to merge two dataframes: 'new_df' and 'df3'.
new_df contains years and months, and df3 contains years, months and other columns.
I've cast most of the columns as object, and tried to merge them both.
The merge 'works' in that it doesn't return an error, but my final dataframe is all empty; only the year and month columns are correct.
new_df
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119 entries, 0 to 118
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date_test 119 non-null datetime64[ns]
1 year 119 non-null object
2 month 119 non-null object
dtypes: datetime64[ns](1), object(2)
df3
<class 'pandas.core.frame.DataFrame'>
Int64Index: 191 entries, 53 to 1297
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 case_number 191 non-null object
1 date 191 non-null object
2 year 191 non-null object
3 country 191 non-null object
4 area 191 non-null object
5 location 191 non-null object
6 activity 191 non-null object
7 fatal_y_n 182 non-null object
8 time 172 non-null object
9 species 103 non-null object
10 month 190 non-null object
dtypes: object(11)
I've tried this line of code:
df_joined = pd.merge(left=new_df, right=df3, how='left', on=['year','month'])
I was expecting a table with all columns filled in; instead I got this table:
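Your issue is with the data types of month and year in both dataframes: they're of type object, so the keys may hold a mix of strings and integers. A string such as '1' never compares equal to the integer 1, so the join finds no matching rows and fills the columns from df3 with NaN.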
Here's a great answer that goes into depth about converting types to numbers, but here's what the code might look like before joining:
# convert column "year" and "month" of new_df
new_df["year"] = pd.to_numeric(new_df["year"])
new_df["month"] = pd.to_numeric(new_df["month"])
And make sure you do the same with df3 as well.
You may also have a data integrity problem. I'm not sure what you're doing before you get those data frames, but if the columns end up as 'object', you may have had a mix of ints, strings, or other data types that got merged together. Here's a good article that goes over pandas data types. Specifically, an object column can hold a mix of strings and other data, so the join might behave unexpectedly.
Hope that helps!
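As a rough illustration of the point above, here is a sketch with two tiny made-up frames (not the asker's data). It assumes the mismatch is strings on one side and integers on the other, both stored as object, which is only one possible cause:

```python
import pandas as pd

# Made-up stand-ins: the left keys are strings, the right keys are integers,
# and both sides report dtype "object" in info().
left = pd.DataFrame({
    "year": pd.Series(["2019", "2019"], dtype=object),
    "month": pd.Series(["1", "2"], dtype=object),
})
right = pd.DataFrame({
    "year": pd.Series([2019, 2019], dtype=object),
    "month": pd.Series([1, 2], dtype=object),
    "country": ["USA", "Brazil"],
})

# No error is raised, but "2019" != 2019, so nothing matches.
bad = pd.merge(left, right, how="left", on=["year", "month"])
print(bad)

# Convert the key columns on both sides to numbers, then merge again.
left_num = left.copy()
right_num = right.copy()
for key in ["year", "month"]:
    left_num[key] = pd.to_numeric(left_num[key])
    right_num[key] = pd.to_numeric(right_num[key])

good = pd.merge(left_num, right_num, how="left", on=["year", "month"])
print(good)
```

Passing indicator=True to pd.merge adds a _merge column that shows which rows found a partner, which can help confirm whether the keys are the problem.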

Pandas Data Frame Graphing Issue

I am curious why a data frame created in the manner below, using lists to fill the columns, does not graph and instead gives me the error "ValueError: x must be a label or position"
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
values = [9.83, 19.72, 7.19, 3.04]
values
[9.83, 19.72, 7.19, 3.04]
cols = ['Condition', 'No-Show']
conditions = ['Scholarship', 'Hipertension', 'Diabetes', 'Alcoholism']
df = pd.DataFrame(columns = [cols])
df['Condition'] = conditions
df['No-Show'] = values
df
Condition No-Show
0 Scholarship 9.83
1 Hipertension 19.72
2 Diabetes 7.19
3 Alcoholism 3.04
df.plot(kind='bar', x='Condition', y='No-Show');
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [17], in <cell line: 1>()
----> 1 df.plot(kind='bar', x='Condition', y='No-Show')
File ~\anaconda3\lib\site-packages\pandas\plotting\_core.py:938, in PlotAccessor.__call__(self, *args, **kwargs)
936 x = data_cols[x]
937 elif not isinstance(data[x], ABCSeries):
--> 938 raise ValueError("x must be a label or position")
939 data = data.set_index(x)
940 if y is not None:
941 # check if we have y as int or list of ints
ValueError: x must be a label or position
Yet if I create the same DataFrame a different way, it graphs just fine....
df2 = pd.DataFrame({'Condition': ['Scholarship', 'Hipertension', 'Diatebes', 'Alcoholism'],
'No-Show': [9.83, 19.72, 7.19, 3.04]})
df2
Condition No-Show
0 Scholarship 9.83
1 Hipertension 19.72
2 Diatebes 7.19
3 Alcoholism 3.04
df2.plot(kind='bar', x='Condition', y='No-Show')
plt.ylim(0, 50)
#graph appears here just fine
Can someone enlighten me why it works the second way and not the first? I am a new student and am confused. I appreciate any insight.
Let's look at pd.DataFrame.info for both dataframes.
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 (Condition,) 4 non-null object
1 (No-Show,) 4 non-null float64
dtypes: float64(1), object(1)
memory usage: 192.0+ bytes
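Note that your column headers are one-element tuples such as ('Condition',). Passing columns = [cols], a list wrapped in another list, makes pandas build a MultiIndex instead of plain string labels.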
Now, look at info for df2.
df2.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Condition 4 non-null object
1 No-Show 4 non-null float64
dtypes: float64(1), object(1)
memory usage: 192.0+ bytes
Note your column headers here are strings.
As @BigBen states in his comment, you don't need the extra brackets in the DataFrame constructor for df.
FYI, to make the plot call work on the df built with the incorrect constructor, pass the tuple labels:
df.plot(kind='bar', x=('Condition',), y=('No-Show',))
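A short sketch of what the extra brackets do to the column labels, reusing the names from the question. Plotting is left out so the snippet only needs pandas:

```python
import pandas as pd

cols = ["Condition", "No-Show"]
conditions = ["Scholarship", "Hipertension", "Diabetes", "Alcoholism"]
values = [9.83, 19.72, 7.19, 3.04]

# Extra brackets: a list of lists is read as the levels of a MultiIndex.
nested = pd.DataFrame(columns=[cols])
print(type(nested.columns).__name__)
print(list(nested.columns))  # labels are one-element tuples

# No extra brackets: the labels are plain strings.
flat = pd.DataFrame(columns=cols)
print(list(flat.columns))

# Building the frame from a dict avoids the issue altogether.
df = pd.DataFrame({"Condition": conditions, "No-Show": values})
print(df.columns.tolist())
```

With string labels, df.plot(kind='bar', x='Condition', y='No-Show') can look the columns up by name.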

Changing string to integer in Pandas

The data set has "deaths" as object and I need to convert it to integer. I tried the approach from another thread, but it doesn't seem to work.
Input:
data.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1270 entries, 0 to 1271
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 1270 non-null object
1 leading_cause 1270 non-null object
2 sex 1270 non-null object
3 race_ethnicity 1270 non-null object
4 deaths 1270 non-null object
dtypes: object(5)
memory usage: 59.5+ KB
Input:
df = pd.DataFrame({'deaths':['50','30','28']})
print(df)
Output:
deaths
0 50
1 30
2 28
Input:
print(pd.to_numeric(df.deaths, errors='coerce'))
Output:
0 50
1 30
2 28
Name: deaths, dtype: int64
Input:
df.deaths = pd.to_numeric(df.deaths, errors='coerce').astype('Int64')
print(df)
Output:
deaths
0 50
1 30
2 28
Input:
data.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1270 entries, 0 to 1271
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 1270 non-null object
1 leading_cause 1270 non-null object
2 sex 1270 non-null object
3 race_ethnicity 1270 non-null object
4 deaths 1270 non-null object
dtypes: object(5)
memory usage: 59.5+ KB
If the column contains nulls (np.nan), it cannot be converted to a plain int dtype.
You need to deal with the nulls first.
1 Either replace them with an int value:
df.deaths = df.deaths.fillna(0)
df.deaths = df.deaths.astype(int)
2 Or drop null values:
df = df[df.deaths.notna()]
df.deaths = df.deaths.astype(int)
3 Or (preferred) learn to live with them:
# make your other function accept null values
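The three options above can be sketched on a small made-up column (the "." entry is a hypothetical non-numeric value, not taken from the asker's data). For the third option, this sketch assumes the nullable "Int64" dtype is an acceptable way to keep the missing values:

```python
import pandas as pd

# Made-up column; "." stands in for a value that cannot be parsed.
data = pd.DataFrame({"deaths": ["50", "30", ".", "28"]})

# Unparseable entries become NaN, so the result is float64.
numeric = pd.to_numeric(data["deaths"], errors="coerce")

# Option 1: replace nulls with an int value, then cast.
filled = numeric.fillna(0).astype(int)

# Option 2: drop the nulls, then cast.
dropped = numeric.dropna().astype(int)

# Option 3: keep the nulls by using the nullable integer dtype.
nullable = numeric.astype("Int64")

# Assign the result back to the frame you actually want to change.
data["deaths"] = nullable
data.info()
```

Note that the conversion only shows up in data.info() if the result is assigned back to data itself, rather than to a separate frame.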

AttributeError: Can only use .str accessor with string values! python pandas

I am trying to convert all the cell values (except the date) to floating-point numbers. I can successfully convert the first 3 columns, but I get an error on the last one:
Here is my code:
df['Market Cap_'+str(coin)] = df['Market Cap_'+str(coin)].str.replace(',','').str.replace('$', '').astype(float)
df['Volume_'+str(coin)] = df['Volume_'+str(coin)].str.replace(',','').str.replace('$', '').astype(float)
df['Open_'+str(coin)] = df['Open_'+str(coin)].str.replace(',','').str.replace('$', '').astype(float)
df['Close_'+str(coin)] = df['Close_'+str(coin)].str.replace(',','').str.replace('$', '').astype(float)
Here is df.info():
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30 entries, 1 to 30
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date_ETHEREUM 30 non-null datetime64[ns]
1 Market Cap_ETHEREUM 30 non-null float64
2 Volume_ETHEREUM 30 non-null float64
3 Open_ETHEREUM 30 non-null float64
4 Close_ETHEREUM 30 non-null object
dtypes: datetime64[ns](1), float64(3), object(1)
memory usage: 1.4+ KB
And here is the Error:
AttributeError: Can only use .str accessor with string values!
As you can see, the column type is object (the same as the others were before conversion), but I'm getting an error on this one.
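One possible way to investigate, sketched on a made-up column: an object column is not guaranteed to contain strings, and the .str accessor refuses columns whose values are all non-strings. The sketch assumes that is the situation here, which the question does not confirm; the column name and values are hypothetical.

```python
import pandas as pd

# Hypothetical column: dtype is object, but every value is already a float.
close = pd.Series([1850.5, 1900.25, 1875.0], dtype=object, name="Close_ETHEREUM")

# Count the Python types actually stored in the column.
print(close.map(type).value_counts())

# The .str accessor rejects an object column that holds no strings.
try:
    close.str.replace(",", "", regex=False)
    str_failed = False
except AttributeError:
    str_failed = True

# Casting to str first makes the cleanup work whatever the stored types are.
# regex=False treats "$" as a literal character rather than a regex anchor.
cleaned = (
    close.astype(str)
    .str.replace(",", "", regex=False)
    .str.replace("$", "", regex=False)
    .astype(float)
)
print(cleaned)
```

If the type count shows only numbers, a plain close.astype(float) would be enough and the string cleanup could be skipped for that column.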

Parse correct datetime using Python and pandas

I have two spreadsheets in CSV format that I've been provided with for my master's university course.
Part of the processing of the data involved merging the files, followed by running some reports off the merged content using dates. This I've completed successfully, however...
I'm led to believe the current date format is epoch; for example, the first date on the spreadsheet is 43471.
First, I ran this code to check what format it was looking at:
df = pd.read_csv('bookloans_merged.csv')
df.info()
This returned the result:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1958 entries, 0 to 1957
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Number 1958 non-null int64
1 Title 1958 non-null object
2 Author 1854 non-null object
3 Genre 1958 non-null object
4 SubGenre 1958 non-null object
5 Publisher 1845 non-null object
6 member_number 1958 non-null int64
7 date_of_loan 1958 non-null int64
8 date_of_return 1958 non-null int64
dtypes: int64(4), object(5)
memory usage: 137.8+ KB
I then ran the following code:
# parsing date values
df = pd.read_csv('bookloans_merged.csv')
df[['date_of_loan','date_of_return']] = df[['date_of_loan','date_of_return']].apply(pd.to_datetime, format='%Y-%m-%d %H:%M:%S.%f')
df.to_csv('bookloans_merged_dates.csv', index=False)
Running this again:
df = pd.read_csv('bookloans_merged_dates.csv')
df.info()
I get:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1958 entries, 0 to 1957
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Number 1958 non-null int64
1 Title 1958 non-null object
2 Author 1854 non-null object
3 Genre 1958 non-null object
4 SubGenre 1958 non-null object
5 Publisher 1845 non-null object
6 member_number 1958 non-null int64
7 date_of_loan 1958 non-null datetime64[ns]
8 date_of_return 1958 non-null datetime64[ns]
dtypes: datetime64[ns](2), int64(2), object(5)
memory usage: 137.8+ KB
So I can see that date_of_loan and date_of_return are now datetime64.
The trouble is, all the dates are now showing as values like 1970-01-01 00:00:00.000043471.
How do I get to the 01/03/2019 format, please?
Thanks
David.
I managed to figure this out, with a little help. Here is the answer:
from datetime import datetime
excel_date = 43139
d_time = datetime.fromordinal(datetime(1900, 1, 1).toordinal() + excel_date - 2)
t_time = d_time.timetuple()
print(d_time)
print(t_time)
Here is how I used that premise in my program:
df1 = pd.DataFrame(data_frame, columns=['Title','Author','date_of_loan'])
# Origin 1899-12-30 is equivalent to the "1900-01-01 ... - 2" adjustment in the snippet above
df1['date_of_loan'] = pd.to_datetime(df1['date_of_loan'], unit='D', origin='1899-12-30')
df1 = df1.sort_values('date_of_loan', ascending=True)
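A self-contained sketch of the conversion, assuming the integers are Excel-style serial day numbers as the excel_date snippet suggests. The two sample values are made up; the day/month/year output format is one reading of the "01/03/2019" style asked for in the question.

```python
import pandas as pd

# Without unit/origin, pandas reads an integer as nanoseconds since 1970-01-01,
# which explains the 1970-01-01 00:00:00.000043471 values in the question.
as_nanoseconds = pd.to_datetime(43471)
print(as_nanoseconds)

# Made-up sample of serial day numbers.
loans = pd.DataFrame({"date_of_loan": [43466, 43471]})

# Count whole days from 1899-12-30, the usual origin for Excel serial dates.
loans["date_of_loan"] = pd.to_datetime(
    loans["date_of_loan"], unit="D", origin="1899-12-30"
)

# Format as day/month/year text for display; the column above stays datetime64.
loans["formatted"] = loans["date_of_loan"].dt.strftime("%d/%m/%Y")
print(loans)
```

Keeping the datetime64 column for sorting and filtering, and formatting only for output, avoids sorting the dates as strings.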
