Sorting a Pandas dataframe - python

I have the following dataframe:
Join_Count 1
LSOA11CD
E01006512 15
E01006513 35
E01006514 11
E01006515 11
E01006518 11
...
But when I try to sort it:
BusStopList.sort("LSOA11CD",ascending=1)
I get the following:
KeyError: 'LSOA11CD'
How do I go about sorting this by either the LSOA column or the column full of numbers which doesn't have a heading?
The following is the information produced by Python about this dataframe:
<class 'pandas.core.frame.DataFrame'>
Index: 286 entries, E01006512 to E01033768
Data columns (total 1 columns):
1 286 non-null int64
dtypes: int64(1)
memory usage: 4.5+ KB

'LSOA11CD' is the name of the index and 1 is the name of the column, so you must use sort_index (rather than sort_values):
BusStopList.sort_index(level="LSOA11CD", ascending=True)
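For example, any of these should work (a minimal sketch based on the info() output above; depending on how the frame was built, the column label may be the integer 1 or the string '1'):
# sort by the named index
BusStopList = BusStopList.sort_index(ascending=True)
# or turn the index into a regular column and sort by it
BusStopList = BusStopList.reset_index().sort_values("LSOA11CD")
# or sort by the numeric column itself
BusStopList = BusStopList.sort_values(by=1)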

A merge in pandas is returning only NaN values

I'm trying to merge two dataframes: 'new_df' and 'df3'.
new_df contains years and months, and df3 contains years, months and other columns.
I've cast most of the columns as object and tried to merge them.
The merge 'works' in that it doesn't return an error, but my final dataframe is all empty; only the year and month columns are correct.
new_df
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119 entries, 0 to 118
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date_test 119 non-null datetime64[ns]
1 year 119 non-null object
2 month 119 non-null object
dtypes: datetime64[ns](1), object(2)
df3
<class 'pandas.core.frame.DataFrame'>
Int64Index: 191 entries, 53 to 1297
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 case_number 191 non-null object
1 date 191 non-null object
2 year 191 non-null object
3 country 191 non-null object
4 area 191 non-null object
5 location 191 non-null object
6 activity 191 non-null object
7 fatal_y_n 182 non-null object
8 time 172 non-null object
9 species 103 non-null object
10 month 190 non-null object
dtypes: object(11)
I've tried this line of code:
df_joined = pd.merge(left=new_df, right=df3, how='left', on=['year','month'])
I was expecting a table with filled fields in all columns; instead, every column coming from df3 was NaN.
Your issue is with the data types for month and year in both dataframes - they're of type object, which can behave unexpectedly during a join: keys that look identical may not match if one side holds strings and the other holds numbers.
There's a great answer that goes into depth about converting types to numbers, but here's what the code might look like before joining:
# convert column "year" and "month" of new_df
new_df["year"] = pd.to_numeric(new_df["year"])
new_df["month"] = pd.to_numeric(new_df["month"])
And make sure you do the same with df3 as well.
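For example (using the column names from the info() output above):
# convert the join keys in df3 the same way
df3["year"] = pd.to_numeric(df3["year"])
df3["month"] = pd.to_numeric(df3["month"])
# then retry the merge
df_joined = pd.merge(left=new_df, right=df3, how='left', on=['year', 'month'])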
You may also have a data integrity problem - I'm not sure what you're doing before you get those dataframes, but if pandas is casting a column as object, you may have had a mix of ints/strings or other data types merged together. There's a good article that goes over pandas data types. Specifically, an object column can be a mix of strings and other data, so the join might behave unexpectedly.
Hope that helps!

Pandas Data Frame Graphing Issue

I am curious as to why, when I create a dataframe in the manner below (using lists to create the values in the rows), it does not graph and gives me the error "ValueError: x must be a label or position".
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
values = [9.83, 19.72, 7.19, 3.04]
values
[9.83, 19.72, 7.19, 3.04]
cols = ['Condition', 'No-Show']
conditions = ['Scholarship', 'Hipertension', 'Diabetes', 'Alcoholism']
df = pd.DataFrame(columns = [cols])
df['Condition'] = conditions
df['No-Show'] = values
df
Condition No-Show
0 Scholarship 9.83
1 Hipertension 19.72
2 Diabetes 7.19
3 Alcoholism 3.04
df.plot(kind='bar', x='Condition', y='No-Show');
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [17], in <cell line: 1>()
----> 1 df.plot(kind='bar', x='Condition', y='No-Show')
File ~\anaconda3\lib\site-packages\pandas\plotting\_core.py:938, in
PlotAccessor.__call__(self, *args, **kwargs)
936 x = data_cols[x]
937 elif not isinstance(data[x], ABCSeries):
--> 938 raise ValueError("x must be a label or position")
939 data = data.set_index(x)
940 if y is not None:
941 # check if we have y as int or list of ints
ValueError: x must be a label or position
Yet if I create the same DataFrame a different way, it graphs just fine....
df2 = pd.DataFrame({'Condition': ['Scholarship', 'Hipertension', 'Diatebes', 'Alcoholism'],
'No-Show': [9.83, 19.72, 7.19, 3.04]})
df2
Condition No-Show
0 Scholarship 9.83
1 Hipertension 19.72
2 Diatebes 7.19
3 Alcoholism 3.04
df2.plot(kind='bar', x='Condition', y='No-Show')
plt.ylim(0, 50)
#graph appears here just fine
Can someone enlighten me why it works the second way and not the first? I am a new student and am confused. I appreciate any insight.
Let's look at pd.DataFrame.info for both dataframes.
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 (Condition,) 4 non-null object
1 (No-Show,) 4 non-null float64
dtypes: float64(1), object(1)
memory usage: 192.0+ bytes
Note: your column headers here are one-element tuples rather than plain strings.
Now, look at info for df2.
df2.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Condition 4 non-null object
1 No-Show 4 non-null float64
dtypes: float64(1), object(1)
memory usage: 192.0+ bytes
Note your column headers here are strings.
As @BigBen states in his comment, you don't need the extra brackets in your DataFrame constructor for df.
FYI, to make your plot statement work with the incorrect DataFrame constructor for df as-is:
df.plot(kind='bar', x=('Condition',), y=('No-Show',))
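The cleaner fix, though, is to rebuild df without the extra brackets so the column labels are plain strings (reusing the cols, conditions and values lists from your question):
df = pd.DataFrame(columns=cols)  # cols, not [cols]
df['Condition'] = conditions
df['No-Show'] = values
df.plot(kind='bar', x='Condition', y='No-Show')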

grouping DateTime by day of the week

I suppose it should be an easy question for experienced users. I want to group records by day of the week and get the number of records for each weekday.
Here is my DataFrame rent_week.info():
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1689 entries, 3 to 1832
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 1689 non-null int64
1 createdAt 1689 non-null datetime64[ns]
2 updatedAt 1689 non-null datetime64[ns]
3 endAt 1689 non-null datetime64[ns]
4 timeoutAt 1689 non-null datetime64[ns]
5 powerBankId 1689 non-null int64
6 station_id 1689 non-null int64
7 endPlaceId 1686 non-null float64
8 endStatus 1689 non-null object
9 userId 1689 non-null int64
10 station_name 1689 non-null object
dtypes: datetime64[ns](4), float64(1), int64(4), object(2)
memory usage: 158.3+ KB
Data in 'createdAt' columns looks like "2020-07-19T18:00:27.190010000"
I am trying to add a new column:
rent_week['a_day'] = rent_week['createdAt'].strftime('%A')
and receive an error back: AttributeError: 'Series' object has no attribute 'strftime'.
Meanwhile, if I write:
a_day = datetime.today()
print(a_day.strftime('%A'))
it shows the expected result. In my understanding, a_day and rent_week['createdAt'] are of a similar datetime type.
Even request through:
rent_week['a_day'] = pd.to_datetime(rent_week['createdAt']).strftime('%A')
shows me the same error: no strftime attribute.
I even didn't start grouping my data. What I am expecting in result is a DataFrame with information like:
a_day number_of_records
Monday 101
Tuesday 55
...
Try rent_week['createdAt'].dt.strftime('%A') - note the additional .dt on your DataFrame column/Series object. (Your scalar a_day works without it because a plain datetime object has strftime directly.)
Background: the "similar" type assumption you make is almost correct. However, as a column could be of many types (numeric, string, datetime, geographic, ...), the methods of the underlying values are typically stored in a namespace to not clutter the already broad API (method count) of the Series type itself. That's why string functions are available only through .str, and datetime functions only available through .dt.
You can make a lambda function for the conversion and apply it to the 'createdAt' column. After this step you can groupby based on your requirement. You can take help from this code:
rent_week['a_day'] = rent_week['createdAt'].apply(lambda x: x.strftime('%A'))
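Either way, once the a_day column exists, the count you describe is one more line (a sketch using the columns from your info() output; id is just a convenient non-null column to count):
rent_week['a_day'] = rent_week['createdAt'].dt.strftime('%A')
# number of records per weekday
number_of_records = rent_week.groupby('a_day')['id'].count()
# or, equivalently: rent_week['a_day'].value_counts()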
Thank you Quamar and Ojdo for your contributions. I found the problem: it is in the index
<ipython-input-41-a42a82727cdd>:8: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
rent_week['a_day'] = rent_week['createdAt'].dt.strftime('%A')
as soon as I reset the index
rent_week = rent_week.reset_index()
both variants work as expected!

Group by of dataframe with average of a column

I am really new to Python; I started learning it just a week ago. I have a query and hope you can help me solve it. Thanks in advance!
I have data in below format.
Date Product Price Discount
1/1/2020 A 17,490 30
1/1/2020 B 34,990 21
1/1/2020 C 20,734 11
1/2/2020 A 16,884 26
1/2/2020 B 26,990 40
1/2/2020 C 17,936 10
1/3/2020 A 16,670 36
1/3/2020 B 12,990 13
1/3/2020 C 30,990 43
I want to take the average of the Discount column for each date and end up with just 2 columns. It isn't working out. :(
Date AVG_Discount
1/1/2020 x %
1/2/2020 y %
1/3/2020 z %
What I have tried is below. As I said, I am a novice in Python, so the approach might be incorrect. I need guidance. TIA
mean_col=df.groupby(df['time'])['discount'].mean()
df=df.set_index(['time'])
df['mean_col']=mean_col
df=df.reset_index()
df.groupby(df['time'])['discount'].mean() is already returning a Series with time as the index.
All you need to do is call reset_index on it:
grouped_df = df.groupby(df['time'])['discount'].mean().reset_index()
As Quang Hoang suggested in the comments, you can also pass as_index=False to groupby.
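That variant would look like this (column names taken from your own snippet):
grouped_df = df.groupby('time', as_index=False)['discount'].mean()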
Apparently, you have read your DataFrame from a text file, e.g. a CSV, but with a separator other than a comma.
Run df.info() and I assume you got a result something like below:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 4 columns):
Date 9 non-null object
Product 9 non-null object
Price 9 non-null object
Discount 9 non-null int64
dtypes: int64(1), object(3)
Note that the Date, Product and Price columns are of object type
(actually, strings). This remark is especially important in the case of the
Price column, because to compute a mean you need the source column
to be numeric (not a string).
So first you should convert the Date and Price columns to proper types
(datetime and float). To do it, run:
df.Date = pd.to_datetime(df.Date)
df.Price = df.Price.str.replace(',', '.').astype(float)
Run df.info() again and now the result should be:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 4 columns):
Date 9 non-null datetime64[ns]
Product 9 non-null object
Price 9 non-null float64
Discount 9 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
And now you can compute the mean discount, running:
df.groupby('Date').Discount.mean()
For your data I got:
Date
2020-01-01 20.666667
2020-01-02 25.333333
2020-01-03 30.666667
Name: Discount, dtype: float64
Note that your code sample contains the following errors:
The argument of groupby is a column name (or a list of column names), so df between the parentheses is not needed.
Instead of time you should write Date (you have no time column).
Your Discount column is written starting with a capital D.
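Putting those fixes together, a corrected version of your snippet might look like this (AVG_Discount is just the column name from your expected output):
avg_discount = df.groupby('Date', as_index=False)['Discount'].mean()
avg_discount = avg_discount.rename(columns={'Discount': 'AVG_Discount'})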

pandas dataframe plot columns

After reading a series of files I create a dataframe with 7 columns:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 756 entries, 0 to 755
Data columns:
Fr(Hz) 756 non-null values
res_ohm*m 756 non-null values
phase_mrad 756 non-null values
ImC_S/m 756 non-null values
Rm_S/m 756 non-null values
C_el 756 non-null values
date 756 non-null values
dtypes: float64(6), object(1)
then I want to group the data by column 6 (C_el), which has 12 unique values:
Pairs = data_set.groupby('C_el')
Each group now contains data in multiples of 21 rows (every 21 lines is a new unique dataset); the 21 refers to column 1 (Fr(Hz)), as I use 21 frequencies for each dataset.
What I want to do is create an x, y scatter plot: column 1 (Fr(Hz)) on the X axis and column 3 (phase_mrad) on the Y axis. Each dataset contributes its 21 unique frequency points, and I want to add all available datasets to the same plot, each in a different color.
The final step is to repeat this for the 11 remaining groups (as defined in an earlier step).
Sample datasets are here (A12).
Currently I do this very inelegantly in numpy (multiple_datasets).
I don't know if this will really satisfy your requirement, but I think groupby could do you a big favour. For instance, instead of the code example you provided, you could do this:
for key, group in data_set.groupby('C_el'):
    # -- define the filename, path, etc.
    # e.g. filename = key
    group.to_csv(filename, sep=' ')
See also the groupby documentation. Sorry I can't help you out with more details, but I hope it helps you proceed somewhat.
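For the scatter plot itself, here is a rough sketch in the same spirit (column names taken from your info() output; untested against your actual files):
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for key, group in data_set.groupby('C_el'):
    # one colour per C_el group, 21 frequency points per dataset
    ax.scatter(group['Fr(Hz)'], group['phase_mrad'], label=key)
ax.set_xlabel('Fr(Hz)')
ax.set_ylabel('phase_mrad')
ax.legend()
plt.show()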
