I have a dataframe. I am trying to find percentiles of datetimes. I am using the function:
Dataframe:
student, attempts, time
student 1,14, 9/3/2019 12:32:32 AM
student 2,2, 9/3/2019 9:37:14 PM
student 3, 5
student 4, 16, 9/5/2019 8:58:14 PM
studentInfo2 = [14, 4, Timestamp('2019-09-04 00:26:36')]
data['time'] = pd.to_datetime(data['time_0001'], errors='coerce')
perc1_first = stats.percentileofscore(data['time'].notnull(), student2Info[2], 'rank')
where student2Info[2] holds the datetime for a particular student. When I try and do this I get the error:
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Any ideas on how I can get the percentile to calculate correctly even when there are missing times in the columns?
You need to transform the Timestamps into units that percentileofscore can understand. Also, pd.DataFrame.notnull() returns a boolean list that you may use to filter your DataFrame, it does not return the filtered list, so I've updated that for you. Here is a working example:
import pandas as pd
import scipy.stats as stats
data = pd.DataFrame.from_dict({
"student": [1, 2, 3, 4],
"attempts": [14, 2, 5, 16],
"time_0001": [
"9/3/2019 12:32:32 AM",
"9/3/2019 9:37:14 PM",
"",
"9/5/2019 8:58:14 PM"
]
})
student2Info = [14, 4, pd.Timestamp('2019-09-04 00:26:36')]
data['time'] = pd.to_datetime(data['time_0001'], errors='coerce')
perc1_first = stats.percentileofscore(data[data['time'].notnull()].time.transform(pd.Timestamp.toordinal), student2Info[2].toordinal(), 'rank')
print(perc1_first) #-> 66.66666666666667
Related
I can validate a DataFrame index using the DataFrameSchema like this:
import pandera as pa
from pandera import Column, DataFrameSchema, Check, Index
schema = DataFrameSchema(
columns={
"column1": pa.Column(int),
},
index=pa.Index(int, name="index_name"),
)
# raises the error as expected
schema.validate(
pd.DataFrame({"column1": [1, 2, 3]}, index=pd.Index([1, 2, 3], name="index_incorrect_name"))
)
Is there a way to do the same using a SchemaModel?
You can do as follows -
import pandera as pa
from pandera.typing import Index, Series
class Schema(pa.SchemaModel):
idx: Index[int] = pa.Field(ge=0, check_name=True)
column1: Series[int]
df = pd.DataFrame({"column1": [1, 2, 3]}, index=pd.Index([1, 2, 3], name="index_incorrect_name"))
Schema.validate(df)
Found an answer in GitHub
You can use pa.typing.Index to type-annotate an index.
class Schema(pa.SchemaModel):
column1: pa.typing.Series[int]
index_name: pa.typing.Index[int] = pa.Field(check_name=True)
See how you can validate a MultiIndex index: https://pandera.readthedocs.io/en/stable/schema_models.html#multiindex
I have a data frame that is obtained after grouping an initial data frame by the 'hour' and 'site' column. So the current data frame has details of 'value' grouped per 'hour' and 'site'. What I want is to fill the hour which has no 'value' with zero. 'Hour' range is from 0-23. how can I do this?
Left is input, right is expected output
You can try this:
import numpy as np
import pandas as pd
raw_df = pd.DataFrame(
{
"Hour": [1, 2, 4, 12, 0, 2, 7, 13],
"Site": ["x", "x", "x", "x", "y", "y", "y", "y"],
"Value": [1, 1, 1, 1, 1, 1, 1, 1],
}
)
full_hour = pd.DataFrame(
{
"Hour": np.concatenate(
[range(24) for site_name in raw_df["Site"].unique()]
),
"Site": np.concatenate(
[[site_name] * 24 for site_name in raw_df["Site"].unique()]
),
}
)
result = full_hour.merge(raw_df, on=["Hour", "Site"], how="left").fillna(0)
Then you can get what you want. But I suggest you copy your test data in your question instead an image. You know, we have no responsibility to create your data. You should think more about how can make others answer your question comfortably.
So if you want to change the value in hours column to zero, where the value is not in range of 0-23, here is what to do.I actually didn't get your question clearly so i assume this must be what you want.I have taken a dummy example as you have not provided you own data.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Date':['10/2/2011', '11/2/2011', '12/2/2011',
'13/2/2011','14/2/2011'],
'Product':['Umbrella', 'Matress', 'Badminton', 'Shuttle','ewf'],
'Last_Price':[1200, 1500, 1600, 352,'ee'],
'Updated_Price':[12, 24, 0, 1,np.nan],
'Discount':[10, 10, 10, 10, 11]})
df['Updated_Price'] = df['Updated_Price'].fillna(0)
df.loc[df['Updated_Price']>23,'Updated_Price']=0
This replaces all nan values with 0 and and for values greater than 23, also replaces with 0
For the following array;
[[[11, 22, 33]]],[[[32, 12, 3]]], I wanted to extract the 1st row and it should output 11,22,33. However, using the following code, I got [[11, 22, 33]]. How can I remove the double bracket?
df = pd.DataFrame([
[[[11, 22, 33]]],
[[[32, 12, 3]]]
], index=[1, 2], columns=['ColA'])
df[df.index == 1].ColA.item()
Expected output should be in the form of 11,22,33; without the bracket
Use .astype(str) and str.replace with the regex or operator (|). Then we use iat to get the first value:
df['ColA'].astype(str).str.replace('\[|\]', '').iat[0]
Output
'11, 22, 33'
Notice: that the type of your value changed from list to string
Or using native python functions str and replace:
str(df['ColA'].iat[0]).replace('[', '').replace(']', '')
I have a dataframe (df) whose column names are ["Home", "Season", "Date", "Consumption", "Temp"]. Now what I'm trying to do is perform calculations on these dataframe by "Home", "Season", "Temp" and "Consumption".
In[56]: df['Home'].unique().tolist()
Out[56]: [1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
In[57]: df['Season'].unique().tolist()
Out[57]: ['Spring', 'Summer', 'Autumn', 'Winter']
Here is what is done so far:
series = {}
for i in df['Home'].unique().tolist():
for j in df["Season"].unique().tolist():
series[i, j] = df[(df["Home"] == i) & (df["Consumption"] >= 0) & (df["Season"] == j)]
for key, value in series.items():
value["Corr"] = value["Temp"].corr(value["Consumption"])
Here is the dictionary of dataframes named "Series" as an output of loop.
What I expected from last loop is to give me a dictionary of dataframes with a new column i.e. "Corr" added that would have correlated values for "Temp" and "Consumption", but instead it gives a single dataframe for last home in the iteration i.e. 23.
To simply add sixth column named "Corr" in all dataframes in a dictionary that would be a correlation between "Temp" and "Consumption". Can you help me with the above? I'm somehow missing the use of keys in the last loop. Thanks in advance!
All of those loops are entirely unnecessary! Simply call:
df.groupby(['Home', 'Season'])['Consumption', 'Temp'].corr()
(thanks #jezrael for the correction)
One of the answer on How to find the correlation between a group of values in a pandas dataframe column
helped. Avoiding all unnecessary loops. Thanks #jezrael and #JoshFriedlander for suggesting groupby method. Upvote (y).
Posting solution here:
df = df[df["Consumption"] >= 0]
corrs = (df[["Home", "Season", "Temp"]]).groupby(
["Home", "Season"]).corrwith(
df["Consumption"]).rename(
columns = {"Temp" : "Corr"}).reset_index()
df = pd.merge(df, corrs, how = "left", on = ["Home", "Season"])
Please could I solicit some general advice regarding Python lists. I know I shouldn't answer 'open' questions on here but I am worried about setting off on completely the wrong path.
My problem is that I have .csv files that are approximately 600,000 lines long each. Each row of the .csv has 6 fields, of which the first field is a date-time stamp in the format DD/MM/YYYY HH:MM:SS. The next two fields are blank and the last three fields contain float and integer values, so for example:
23/05/2017 16:42:17, , , 1.25545, 1.74733, 12
23/05/2017 16:42:20, , , 1.93741, 1.52387, 14
23/05/2017 16:42:23, , , 1.54875, 1.46258, 11
etc
No two values in column 1 (date-time stamp) will ever be the same.
I need to write a program that will do a few basic operations with the data, such as:
read all of the data into a dictionary, list, set (?) etc as appropriate.
search through the date time stamp column for a particular value.
read through the list and do basic calculations on the floats in columns 4 and 5.
write a new list based on the searches/calculations.
My question is - how should I 'handle' the data and am I likely to run into problems due to the length of the dataset?
For example, should I import all of the data into a list, and each element of the list is a sublist of each rows data? E.g:
[[23/05/2017 16:42:17,'','', 1.25545, 1.74733, 12],[23/05/2017 16:42:20,'','', 1.93741, 1.52387, 14], ...]
Or would it be better to make each date-time stamp the 'key' in a dictionary and make the dictionary 'value' a list with all the other values, e.g:
{'23/05/2017 16:42:17': [ , , 1.25545, 1.74733, 12], ...}
etc
If I use the list approach, is there a way to get Python to 'search' in only the first column for a particular time stamp rather than making it search through 600,000 rows times 6 columns when we know that only the first column contains timestamps?
I apologize if my query is a little vague, but would appreciate any guidance that anyone can offer.
600000 lines aren't that many, your script should run fine with either a list or a dict.
As a test, let's use:
data = [["2017-05-02 17:28:24", 0.85260, 1.16218, 7],
["2017-05-04 05:40:07", 0.72118, 0.47710, 15],
["2017-05-07 19:27:53", 1.79476, 0.47496, 14],
["2017-05-09 01:57:10", 0.44123, 0.13711, 16],
["2017-05-11 07:22:57", 0.17481, 0.69468, 0],
["2017-05-12 10:11:01", 0.27553, 0.47834, 4],
["2017-05-15 05:20:36", 0.01719, 0.51249, 7],
["2017-05-17 14:01:13", 0.35977, 0.50052, 7],
["2017-05-17 22:05:33", 1.68628, 1.90881, 13],
["2017-05-18 14:44:14", 0.32217, 0.96715, 14],
["2017-05-18 20:24:23", 0.90819, 0.36773, 5],
["2017-05-21 12:15:20", 0.49456, 1.12508, 5],
["2017-05-22 07:46:18", 0.59015, 1.04352, 6],
["2017-05-26 01:49:38", 0.44455, 0.26669, 13],
["2017-05-26 18:55:24", 1.33678, 1.24181, 7]]
dict
If you're looking for exact timestamps, a lookup will be much faster with a dict than with a list. You have to know exactly what you're looking for though: "23/05/2017 16:42:17" has a completely different hash than "23/05/2017 16:42:18".
data_as_dict = {l[0]: l[1:] for l in data}
print(data_as_dict)
# {'2017-05-21 12:15:20': [0.49456, 1.12508, 5], '2017-05-18 14:44:14': [0.32217, 0.96715, 14], '2017-05-04 05:40:07': [0.72118, 0.4771, 15], '2017-05-26 01:49:38': [0.44455, 0.26669, 13], '2017-05-17 14:01:13': [0.35977, 0.50052, 7], '2017-05-15 05:20:36': [0.01719, 0.51249, 7], '2017-05-26 18:55:24': [1.33678, 1.24181, 7], '2017-05-07 19:27:53': [1.79476, 0.47496, 14], '2017-05-17 22:05:33': [1.68628, 1.90881, 13], '2017-05-02 17:28:24': [0.8526, 1.16218, 7], '2017-05-22 07:46:18': [0.59015, 1.04352, 6], '2017-05-11 07:22:57': [0.17481, 0.69468, 0], '2017-05-18 20:24:23': [0.90819, 0.36773, 5], '2017-05-12 10:11:01': [0.27553, 0.47834, 4], '2017-05-09 01:57:10': [0.44123, 0.13711, 16]}
print(data_as_dict.get('2017-05-17 14:01:13'))
# [0.35977, 0.50052, 7]
print(data_as_dict.get('2017-05-17 14:01:10'))
# None
Note that your DD/MM/YYYY HH:MM:SS format isn't very convenient : sorting the cells lexicographically won't sort them by datetime. You'd need to use datetime.strptime() first:
from datetime import datetime
data_as_dict = {datetime.strptime(l[0], '%Y-%m-%d %H:%M:%S'): l[1:] for l in data}
print(data_as_dict.get(datetime(2017,5,17,14,1,13)))
# [0.35977, 0.50052, 7]
print(data_as_dict.get(datetime(2017,5,17,14,1,10)))
# None
list with binary search
If you're looking for timestamps ranges, a dict won't help you much. A binary search (e.g. with bisect) on a list of timestamps should be very fast.
import bisect
timestamps = [datetime.strptime(l[0], '%Y-%m-%d %H:%M:%S') for l in data]
i = bisect.bisect(timestamps, datetime(2017,5,17,14,1,10))
print(data[i-1])
# ['2017-05-15 05:20:36', 0.01719, 0.51249, 7]
print(data[i])
# ['2017-05-17 14:01:13', 0.35977, 0.50052, 7]
Database
Before reinventing the wheel, you might want to dump all your CSVs into a small database (sqlite, Postgresql, ...) and use the corresponding queries.
Pandas
If you don't want the added complexity of a database but are ready to invest some time learning a new syntax, you should use pandas.DataFrame. It does exactly what you want, and then some.