Grouping DateTime by day of the week - python

I suppose this should be an easy question for experienced users. I want to group records by day of the week and get the number of records for each weekday.
Here is my DataFrame rent_week.info():
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1689 entries, 3 to 1832
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 1689 non-null int64
1 createdAt 1689 non-null datetime64[ns]
2 updatedAt 1689 non-null datetime64[ns]
3 endAt 1689 non-null datetime64[ns]
4 timeoutAt 1689 non-null datetime64[ns]
5 powerBankId 1689 non-null int64
6 station_id 1689 non-null int64
7 endPlaceId 1686 non-null float64
8 endStatus 1689 non-null object
9 userId 1689 non-null int64
10 station_name 1689 non-null object
dtypes: datetime64[ns](4), float64(1), int64(4), object(2)
memory usage: 158.3+ KB
Data in the 'createdAt' column looks like "2020-07-19T18:00:27.190010000".
I am trying to add a new column:
rent_week['a_day'] = rent_week['createdAt'].strftime('%A')
and receive an error back: AttributeError: 'Series' object has no attribute 'strftime'.
Meanwhile, if I write:
a_day = datetime.today()
print(a_day.strftime('%A'))
it shows the expected result. In my understanding, a_day and rent_week['createdAt'] have a similar datetime type.
Even going through:
rent_week['a_day'] = pd.to_datetime(rent_week['createdAt']).strftime('%A')
shows me the same error: no strftime attribute.
I haven't even started grouping my data yet. What I am expecting as a result is a DataFrame with information like:
a_day number_of_records
Monday 101
Tuesday 55
...

Try rent_week['createdAt'].dt.strftime('%A') - note the additional .dt on your DataFrame column/Series object.
Background: the "similar type" assumption you make is almost correct. However, as a column could be of many types (numeric, string, datetime, geographic, ...), the methods of the underlying values are typically stored in a namespace so as not to clutter the already broad API (method count) of the Series type itself. That's why string functions are only available through .str, and datetime functions only through .dt.
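Putting the pieces together, here is a minimal sketch of the grouping step the question is after (assuming rent_week['createdAt'] is already datetime64[ns], as the info() output shows):
rent_week['a_day'] = rent_week['createdAt'].dt.strftime('%A')   # weekday name per row
counts = (rent_week.groupby('a_day')
                   .size()
                   .rename('number_of_records')
                   .reset_index())
print(counts)
Note that groupby sorts the weekday names alphabetically; if you want Monday..Sunday order you can reindex the result afterwards.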

You can write a lambda function for the conversion and apply it to the "createdAt" column. After this step you can group by as needed. The following code may help:
rent_week['a_day'] = rent_week['createdAt'].apply(lambda x: x.strftime('%A'))

Thank you Quamar and Ojdo for your contributions. I found the problem: it is in the index
<ipython-input-41-a42a82727cdd>:8: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
rent_week['a_day'] = rent_week['createdAt'].dt.strftime('%A')
as soon as I reset the index
rent_week.reset_index()
both variants work as expected!
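For reference, the warning usually means rent_week was itself a slice/filter of another DataFrame. A common way to avoid it is to work on an explicit copy, and note that reset_index() returns a new DataFrame unless you assign it back (or pass inplace=True); a minimal sketch:
rent_week = rent_week.copy()                                   # explicit copy, not a view of another frame
rent_week['a_day'] = rent_week['createdAt'].dt.strftime('%A')  # no SettingWithCopyWarning now
rent_week = rent_week.reset_index(drop=True)                   # reset_index() returns a new DataFrame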

Related

A merge in pandas is returning only NaN values

I'm trying to merge two dataframes: 'new_df' and 'df3'.
new_df contains years and months, and df3 contains years, months and other columns.
I've cast most of the columns as object, and tried to merge them both.
The merge 'works' in that it doesn't return an error, but my final dataframe is mostly empty; only the year and month columns are correct.
new_df
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119 entries, 0 to 118
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date_test 119 non-null datetime64[ns]
1 year 119 non-null object
2 month 119 non-null object
dtypes: datetime64[ns](1), object(2)
df3
<class 'pandas.core.frame.DataFrame'>
Int64Index: 191 entries, 53 to 1297
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 case_number 191 non-null object
1 date 191 non-null object
2 year 191 non-null object
3 country 191 non-null object
4 area 191 non-null object
5 location 191 non-null object
6 activity 191 non-null object
7 fatal_y_n 182 non-null object
8 time 172 non-null object
9 species 103 non-null object
10 month 190 non-null object
dtypes: object(11)
I've tried this line of code:
df_joined = pd.merge(left=new_df, right=df3, how='left', on=['year','month'])
I was expecting a table with all columns filled in; instead, everything except the year and month columns came back as NaN.
Your issue is with the data types for month and year in both dataframes - they're of type object, which gets a bit weird during the join.
There's a great answer that goes into depth about converting types to numbers, but here's what the code might look like before joining:
# convert column "year" and "month" of new_df
new_df["year"] = pd.to_numeric(new_df["year"])
new_df["month"] = pd.to_numeric(new_df["month"])
And make sure you do the same with df3 as well.
You may also have a data integrity problem - not sure what you're doing before you get those dataframes, but if something is being cast as object, you may have had a mix of ints/strings or other data types that got merged together. Here's a good article that goes over pandas data types. Specifically, an object column can be a mix of strings and other data, so the join might get weird.
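If you want to confirm whether the keys really mix types, a quick check along these lines (using the column names from the question) can help:
# what Python types actually live in the object-typed join keys?
print(new_df['year'].map(type).value_counts())
print(df3['year'].map(type).value_counts())
If one side turns out to hold int objects and the other str objects, no key will ever match, which would explain the all-NaN merge result.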
Hope that helps!

In pandas how to create a new column from part of another, obeying a condition?

In python 3 and pandas I have the dataframe:
lista_projetos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Data columns (total 14 columns):
n_projeto 59 non-null object
autor 59 non-null object
ementa 59 non-null object
resumo 59 non-null object
votacao_nominal 59 non-null object
votacao_nominal_alternativa_emenda 59 non-null object
link_votacao 0 non-null float64
observacao 0 non-null float64
link_emenda 0 non-null float64
indicado_por 59 non-null object
entidade_que_avalia 59 non-null object
favoravel_desfavoravel_indiferente 59 non-null object
explicacao 59 non-null object
link_projeto 59 non-null object
dtypes: float64(3), object(11)
memory usage: 6.5+ KB
The column "link_projeto" has urls, always in this format:
"http://www.camara.gov.br/proposicoesWeb/fichadetramitacao?idProposicao=2171854"
"http://www.camara.gov.br/proposicoesWeb/fichadetramitacao?idProposicao=2147513"
"http://www.camara.gov.br/proposicoesWeb/fichadetramitacao?idProposicao=2168253"
I want to create a new column from the "link_projeto" column, always picking up the final number after the "=" sign.
Like this:
new_column
2171854
2147513
2168253
Please, is there a way to create a new column from part of another?
First, how would you do this on a single value?
>>> link = "http://www.camara.gov.br/proposicoesWeb/fichadetramitacao?idProposicao=2171854"
>>> link.split("=", 1)[1]
'2171854'
But split is a method on str objects; how do you apply it to a column full of strings? Simple: columns (Series and Index) have a str attribute for exactly this purpose:
df.link_projeto.str.split("=", 1)
But split doesn't just return a string, it returns a list of strings. How do we get the last one?
As explained in Splitting and Replacing Strings, you just access str again and index it:
df.link_projeto.str.split("=", 1).str[1]
So:
df["new_column"] = df.link_projeto.str.split("=", 1).str[1]
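An alternative sketch, if you prefer a regex: str.extract can pull the trailing digits directly, and pd.to_numeric can optionally turn them into integers (same df and column name as above; pandas imported as pd). Note that recent pandas versions require the maxsplit argument of str.split as a keyword (n=1).
df["new_column"] = df.link_projeto.str.extract(r"=(\d+)$", expand=False)  # capture the digits after '='
df["new_column"] = pd.to_numeric(df["new_column"])                        # optional: store as integers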

Converting object to Int pandas

Hello, I have an issue converting an entire column of object dtype to integer.
I have a dataframe and I tried to convert some columns that are detected as object into integer (or float), but none of the answers I have already found are working for me.
First status
Then I tried to apply the to_numeric method, but it doesn't work.
To numeric method
Then I tried a custom method that you can find here: Pandas: convert dtype 'object' to int
but that doesn't work either: data3['Title'].astype(str).astype(int)
(I cannot post the image anymore - you will have to trust me that it doesn't work.)
I tried to use the inplace parameter, but it doesn't seem to be supported by those methods.
I am pretty sure the answer is obvious, but I cannot find it.
You need to assign the output back:
# omitting astype(str) may also work
data3['Title'] = data3['Title'].astype(str).astype(int)
Or:
data3['Title'] = pd.to_numeric(data3['Title'])
Sample:
data3 = pd.DataFrame({'Title':['15','12','10']})
print (data3)
Title
0 15
1 12
2 10
print (data3.dtypes)
Title object
dtype: object
data3['Title'] = pd.to_numeric(data3['Title'])
print (data3.dtypes)
Title int64
dtype: object
data3['Title'] = data3['Title'].astype(int)
print (data3.dtypes)
Title int32
dtype: object
As python_enthusiast said, this command works for me too:
data3.Title = data3.Title.str.replace(',', '').astype(float).astype(int)
but it also works fine with
data3.Title = data3.Title.str.replace(',', '').astype(int)
You have to use str before replace in order to get rid of the commas; only then change it to int/float, otherwise you will get an error.
2 years and 11 months later, but here I go.
It's important to check if your data has any spaces, special characters (like commas, dots, or whatever else) first. If yes, then you need to basically remove those and then convert your string data into float and then into an integer (this is what worked for me for the case where my data was numerical values but with commas, like 4,118,662).
data3.Title = data3.Title.str.replace(',', '').astype(float).astype(int)
You can also try this code; it works fine for me:
data3.Title= pd.factorize(data3.Title)[0]
Version that works with Nulls
With older versions of pandas there was no missing-value support for int, but newer versions offer the nullable Int64 dtype, which supports pd.NA.
So, to go from object to int with missing data, you can do this:
df['col'] = df['col'].astype(float)
df['col'] = df['col'].astype('Int64')
By switching to float first you avoid the "object cannot be converted to an IntegerDtype" error.
Note the capital 'I' in Int64.
More info here https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
Working with pd.NA
In pandas 1.0 the new pd.NA datatype was introduced; the goal of pd.NA is to provide a "missing" indicator that can be used consistently across data types (instead of np.nan, None or pd.NaT depending on the data type).
With this in mind they have created the DataFrame.convert_dtypes() and Series.convert_dtypes() functions, which convert to datatypes that support pd.NA. This is currently considered experimental but may well be the way forward.
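A minimal sketch of what that looks like (assuming pandas >= 1.0; the column name here is made up for illustration):
import pandas as pd

df = pd.DataFrame({'col': ['1', '2', None]})  # object dtype with a missing value
df['col'] = pd.to_numeric(df['col'])          # object -> float64, missing becomes NaN
df = df.convert_dtypes()                      # float64 -> nullable Int64 holding pd.NA
print(df.dtypes)                              # col    Int64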
I had a dataset like this
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79902 entries, 0 to 79901
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Query 79902 non-null object
1 Video Title 79902 non-null object
2 Video ID 79902 non-null object
3 Video Views 79902 non-null object
4 Comment ID 79902 non-null object
5 cleaned_comments 79902 non-null object
dtypes: object(6)
memory usage: 5.5+ MB
Removed the None, NaN entries using
dataset = dataset.replace(to_replace='None', value=np.nan).dropna()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 79868 entries, 0 to 79901
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Query 79868 non-null object
1 Video Title 79868 non-null object
2 Video ID 79868 non-null object
3 Video Views 79868 non-null object
4 Comment ID 79868 non-null object
5 cleaned_comments 79868 non-null object
dtypes: object(6)
memory usage: 6.1+ MB
Notice the reduced entries
But the Video Views were floats, as shown in dataset.head()
Then I used
dataset['Video Views'] = pd.to_numeric(dataset['Video Views'])
dataset['Video Views'] = dataset['Video Views'].astype(int)
Now,
<class 'pandas.core.frame.DataFrame'>
Int64Index: 79868 entries, 0 to 79901
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Query 79868 non-null object
1 Video Title 79868 non-null object
2 Video ID 79868 non-null object
3 Video Views 79868 non-null int64
4 Comment ID 79868 non-null object
5 cleaned_comments 79868 non-null object
dtypes: int64(1), object(5)
memory usage: 6.1+ MB
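As an alternative sketch to dropping rows, the nullable Int64 dtype from the earlier answer should also work here, keeping the rows whose views are missing (np imported as in the snippet above):
dataset['Video Views'] = pd.to_numeric(
    dataset['Video Views'].replace('None', np.nan), errors='coerce')
dataset['Video Views'] = dataset['Video Views'].astype('Int64')  # missing views stay as pd.NA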

Investigating different datatypes in Pandas DataFrame

I have 4 files which I want to read with Python / Pandas, the files are: https://github.com/kelsey9649/CS8370Group/tree/master/TaFengDataSet
I stripped away the first row (column titles in Chinese) in all 4 files.
But other from that, the 4 files are supposed to have the same format.
Now I want to read them and merge into one big DataFrame. I tried it by using
pars = {'sep': ';',
        'header': None,
        'names': ['date', 'customer_id', 'age', 'area', 'prod_class', 'prod_id', 'amount', 'asset', 'price'],
        'parse_dates': [0]}
df = pd.DataFrame()
for i in ('01', '02', '12', '11'):
    df = df.append(pd.read_csv(cfg.abspath + 'D' + i, **pars))
BUT: the file D11 gives me a different format for the individual columns and thus cannot be merged properly. The file contains over 200k lines, so I cannot easily look for the problem by hand; as mentioned above, I was assuming it had the same format, but obviously there is some small difference.
What's the easiest way to investigate the problem? Obviously, I cannot check every single line in that file...
When I read the 3 working files and merge them, and read D11 independently, the line
A = pd.read_csv(cfg.abspath+'D11',**pars)
still gives me the following warning:
C:\Python27\lib\site-packages\pandas\io\parsers.py:1130: DtypeWarning: Columns (1,4,5,6,7,8) have mixed types. Specify dtype option on import or set low_memory=False.
data = self._reader.read(nrows)
Using the method .info() in pandas (for A and df) yields:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 594119 entries, 0 to 178215
Data columns (total 9 columns):
date 594119 non-null datetime64[ns]
customer_id 594119 non-null int64
age 594119 non-null object
area 594119 non-null object
prod_class 594119 non-null int64
prod_id 594119 non-null int64
amount 594119 non-null int64
asset 594119 non-null int64
price 594119 non-null int64
dtypes: datetime64[ns](1), int64(6), object(2)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 223623 entries, 0 to 223622
Data columns (total 9 columns):
date 223623 non-null object
customer_id 223623 non-null object
age 223623 non-null object
area 223623 non-null object
prod_class 223623 non-null object
prod_id 223623 non-null object
amount 223623 non-null object
asset 223623 non-null object
price 223623 non-null object
Even if I used the dtype option on import, I would still be worried about wrong/bad results, since datatypes might be cast incorrectly during the import.
How can I track down and solve the issue?
Thanks a lot
Whenever you have a problem that is too boring to be done by hand, the solution is to write a program:
for col in ('age', 'area'):
    for i, val in enumerate(A[col]):
        try:
            int(val)
        except:
            print('Line {}: {} = {}'.format(i, col, val))
This will show you all the lines in the file with non-integer values in the age and area columns. This is the first step in debugging the problem. Once you know what the problematic values are, you can better decide how to deal with them -- maybe by pre-processing (cleaning) the data file, or by using some pandas code to select and fix the problematic values.
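A vectorised variant of the same check, using pandas instead of a plain Python loop (it assumes, as above, that 'age' and 'area' are supposed to be numeric):
for col in ('age', 'area'):
    coerced = pd.to_numeric(A[col], errors='coerce')      # non-numeric values become NaN
    bad = A.loc[coerced.isna() & A[col].notna(), col]     # values that are present but not numeric
    print(col, bad.unique())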

Plotting with GroupBy in Pandas/Python

Although it is straight-forward and easy to plot groupby objects in pandas, I am wondering what the most pythonic (pandastic?) way to grab the unique groups from a groupby object is. For example:
I am working with atmospheric data and trying to plot diurnal trends over a period of several days or more. The following is the DataFrame containing many days worth of data where the timestamp is the index:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 10909 entries, 2013-08-04 12:01:00 to 2013-08-13 17:43:00
Data columns (total 17 columns):
Date 10909 non-null values
Flags 10909 non-null values
Time 10909 non-null values
convt 10909 non-null values
hino 10909 non-null values
hinox 10909 non-null values
intt 10909 non-null values
no 10909 non-null values
nox 10909 non-null values
ozonf 10909 non-null values
pmtt 10909 non-null values
pmtv 10909 non-null values
pres 10909 non-null values
rctt 10909 non-null values
smplf 10909 non-null values
stamp 10909 non-null values
no2 10909 non-null values
dtypes: datetime64[ns](1), float64(11), int64(2), object(3)
To be able to average (and take other statistics) the data at every minute for several days, I group the dataframe:
data = no.groupby('Time')
I can then easily plot the mean NO concentration as well as quartiles:
ax = figure(figsize=(12,8)).add_subplot(111)
title('Diurnal Profile for NO, NO2, and NOx: East St. Louis Air Quality Study')
ylabel('Concentration [ppb]')
data.no.mean().plot(ax=ax, style='b', label='Mean')
data.no.apply(lambda x: percentile(x, 25)).plot(ax=ax, style='r', label='25%')
data.no.apply(lambda x: percentile(x, 75)).plot(ax=ax, style='r', label='75%')
The issue that fuels my question is that, in order to plot more interesting-looking things like fill_between(), it is necessary to know the x-axis information, per the documentation:
fill_between(x, y1, y2=0, where=None, interpolate=False, hold=None, **kwargs)
For the life of me, I cannot figure out the best way to accomplish this. I have tried:
Iterating over the groupby object and creating an array of the groups
Grabbing all of the unique Time entries from the original DataFrame
I can make these work, but I know there is a better way. Python is far too beautiful. Any ideas/hints?
UPDATES:
The statistics can be dumped into a new dataframe using unstack() such as
no_new = no.groupby('Time')['no'].describe().unstack()
no_new.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1440 entries, 00:00 to 23:59
Data columns (total 8 columns):
count 1440 non-null values
mean 1440 non-null values
std 1440 non-null values
min 1440 non-null values
25% 1440 non-null values
50% 1440 non-null values
75% 1440 non-null values
max 1440 non-null values
dtypes: float64(8)
Although I should be able to plot with fill_between() using no_new.index, I receive a TypeError.
Current Plot code and TypeError:
ax = figure(figsize=(12,8)).add_subplot(111)
ax.plot(no_new['mean'])
ax.fill_between(no_new.index, no_new['mean'], no_new['75%'], alpha=.5, facecolor='green')
TypeError:
TypeError Traceback (most recent call last)
<ipython-input-6-47493de920f1> in <module>()
2 ax = figure(figsize=(12,8)).add_subplot(111)
3 ax.plot(no_new['mean'])
----> 4 ax.fill_between(no_new.index, no_new['mean'], no_new['75%'], alpha=.5, facecolor='green')
5 #title('Diurnal Profile for NO, NO2, and NOx: East St. Louis Air Quality Study')
6 #ylabel('Concentration [ppb]')
C:\Users\David\AppData\Local\Enthought\Canopy\User\lib\site-packages\matplotlib\axes.pyc in fill_between(self, x, y1, y2, where, interpolate, **kwargs)
6986
6987 # Convert the arrays so we can work with them
-> 6988 x = ma.masked_invalid(self.convert_xunits(x))
6989 y1 = ma.masked_invalid(self.convert_yunits(y1))
6990 y2 = ma.masked_invalid(self.convert_yunits(y2))
C:\Users\David\AppData\Local\Enthought\Canopy\User\lib\site-packages\numpy\ma\core.pyc in masked_invalid(a, copy)
2237 cls = type(a)
2238 else:
-> 2239 condition = ~(np.isfinite(a))
2240 cls = MaskedArray
2241 result = a.view(cls)
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
The plot as of now looks like this:
Storing the groupby stats (mean/25/75) as columns in a new dataframe and then passing the new dataframe's index as the x parameter of plt.fill_between() works for me (tested with matplotlib 1.3.1). e.g.,
gdf = df.groupby('Time')[col].describe().unstack()
plt.fill_between(gdf.index, gdf['25%'], gdf['75%'], alpha=.5)
gdf.info() should look like this:
<class 'pandas.core.frame.DataFrame'>
Index: 12 entries, 00:00:00 to 22:00:00
Data columns (total 8 columns):
count 12 non-null float64
mean 12 non-null float64
std 12 non-null float64
min 12 non-null float64
25% 12 non-null float64
50% 12 non-null float64
75% 12 non-null float64
max 12 non-null float64
dtypes: float64(8)
Update: to address the TypeError: ufunc 'isfinite' not supported exception, it is necessary to first convert the Time column from a series of string objects in "HH:MM" format to a series of datetime.time objects, which can be done as follows:
df['Time'] = df.Time.map(lambda x: pd.datetools.parse(x).time())
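A side note: pd.datetools was removed in later pandas releases, so on a current version an equivalent conversion would be something along these lines:
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M').dt.time  # "HH:MM" strings -> datetime.time objects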
