I have a dataframe that I would like to split into multiple dataframes using the value in my Date column. Ideally, I would like to split my dataframe by decades. Do I need to use the np.array_split method, or is there a way that does not require numpy?
My Dataframe looks like a larger version of this:
Date Name
0 1746-06-02 Borcke (#p1)
1 1746-09-02 Jordan (#p31)
2 1747-06-02 Sa Majesté (#p32)
3 1752-01-26 Maupertuis (#p4)
4 1755-06-02 Jordan (#p31)
So in this scenario I would ideally want two data frames like these:
Date Name
0 1746-06-02 Borcke (#p1)
1 1746-09-02 Jordan (#p31)
2 1747-06-02 Sa Majesté (#p32)
Date Name
0 1752-01-26 Maupertuis (#p4)
1 1755-06-02 Jordan (#p31)
Building on mozway's answer for getting the decades:
d = {
"Date": [
"1746-06-02",
"1746-09-02",
"1747-06-02",
"1752-01-26",
"1755-06-02",
],
"Name": [
"Borcke (#p1)",
"Jordan (#p31)",
"Sa Majesté (#p32)",
"Maupertuis (#p4)",
"Jord (#p31)",
],
}
import pandas as pd
import math
df = pd.DataFrame(d)
df["years"] = df['Date'].str.extract(r'(^\d{4})', expand=False).astype(int)
df["decades"] = (df["years"] / 10).apply(math.floor) *10
dfs = [g for _,g in df.groupby(df['decades'])]
Using groupby, you can generate a list of DataFrames:
dfs = [g for _, g in df.groupby(df['Date'].str.extract(r'(^\d{3})', expand=False))]
Or, validating the dates:
dfs = [g for _,g in df.groupby(pd.to_datetime(df['Date']).dt.year//10)]
If you prefer a dictionary for indexing by decade:
dfs = dict(list(df.groupby(pd.to_datetime(df['Date']).dt.year//10*10)))
NB. I initially missed that you wanted decades, not years. I updated the answer. The logic remains unchanged.
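As a quick usage check of the dictionary variant (a minimal sketch against the sample data above; the keys are plain integers such as 1740 and 1750):
import pandas as pd

df = pd.DataFrame({
    "Date": ["1746-06-02", "1746-09-02", "1747-06-02", "1752-01-26", "1755-06-02"],
    "Name": ["Borcke (#p1)", "Jordan (#p31)", "Sa Majesté (#p32)",
             "Maupertuis (#p4)", "Jordan (#p31)"],
})

# Keys are decade start years, values are the per-decade sub-DataFrames.
dfs = dict(list(df.groupby(pd.to_datetime(df['Date']).dt.year // 10 * 10)))

print(dfs[1740])   # the three rows from the 1740s
print(dfs[1750])   # the two rows from the 1750s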
I’ve got a dataframe like this one:
df = pd.DataFrame({"ID": [123214, 123214, 321455, 321455, 234325, 234325, 234325, 234325, 132134, 132134, 132134],
"DATETIME": ["2020-05-28", "2020-06-12", "2020-01-06", "2020-01-10", "2020-01-11", "2020-02-06", "2020-07-24", "2020-10-14", "2020-03-04", "2020-09-11", "2020-10-17"],
"CATEGORY": ["computer technology", "early childhood", "early childhood", "shoes and bags", "early childhood", "garden and gardening", "musical instruments", "handmade products", "musical instruments", "early childhood", "beauty"]})
I’d like to:
Group by ID
Where CATEGORY == “early childhood” (input), select the next item bought (next row)
The result should be:
321455 "2020-01-10" "shoes and bags"
234325 "2020-02-06" "garden and gardening"
132134 "2020-10-17" "beauty"
The shift function for Pandas is what I need but I can’t make it work while grouping.
Thanks!
You can create a mask by testing CATEGORY with Series.eq, shifting it within each group using DataFrameGroupBy.shift, replacing the first missing values with False, and passing it to boolean indexing:
#if necessary convert to datetimes and sorting
#df['DATETIME'] = pd.to_datetime(df['DATETIME'])
#df = df.sort_values(['ID','DATETIME'])
mask = df['CATEGORY'].eq('early childhood').groupby(df['ID']).shift(fill_value=False)
df = df[mask]
print(df)
ID DATETIME CATEGORY
3 321455 2020-01-10 shoes and bags
5 234325 2020-02-06 garden and gardening
10 132134 2020-10-17 beauty
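If it helps to see the mechanism, here is the intermediate mask spelled out on the sample data (a small supplementary sketch, not part of the original answer):
import pandas as pd

df = pd.DataFrame({"ID": [123214, 123214, 321455, 321455, 234325, 234325, 234325, 234325, 132134, 132134, 132134],
                   "DATETIME": ["2020-05-28", "2020-06-12", "2020-01-06", "2020-01-10", "2020-01-11", "2020-02-06",
                                "2020-07-24", "2020-10-14", "2020-03-04", "2020-09-11", "2020-10-17"],
                   "CATEGORY": ["computer technology", "early childhood", "early childhood", "shoes and bags",
                                "early childhood", "garden and gardening", "musical instruments", "handmade products",
                                "musical instruments", "early childhood", "beauty"]})

# True where the *previous* row within the same ID was "early childhood".
is_early = df['CATEGORY'].eq('early childhood')            # row-wise test
mask = is_early.groupby(df['ID']).shift(fill_value=False)  # shifted down within each ID
print(pd.concat([df, mask.rename('prev_was_early')], axis=1))
print(df[mask])                                            # the three "next purchase" rows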
I have a dataframe with 4 columns, each containing actor names.
The actors appear in several columns, and I want to find the actor or actress who appears most often across the whole dataframe.
I used mode, but it doesn't work: it gives me the most frequent actor in each column.
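For illustration, this is roughly the issue (a small made-up frame; the real one is larger): DataFrame.mode() returns one modal value per column rather than a single overall winner.
import pandas as pd

# Hypothetical frame with one column per cast slot.
df = pd.DataFrame({
    'actor_1': ['Will Smith', 'Will Smith', 'Will Smith'],
    'actor_2': ['Johnny Depp', 'Morgan Freeman', 'Johnny Depp'],
    'actor_3': ['Mila Kunis', 'Mila Kunis', 'Morgan Freeman'],
    'actor_4': ['Morgan Freeman', 'Morgan Freeman', 'Mila Kunis'],
})

# One modal value per column -- not the most frequent name across the whole frame.
print(df.mode())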
I would strongly advise you to use the Counter class in Python. That way, you can simply add whole rows and columns to the object. The code would look like this:
import pandas as pd
from collections import Counter
# Artificially creating DataFrame
actors = [
["Will Smith","Johnny Depp","Johnny Depp","Johnny Depp"],
["Will Smith","Morgan Freeman","Morgan Freeman","Morgan Freeman"],
["Will Smith","Mila Kunis","Mila Kunis","Mila Kunis"],
["Will Smith","Charlie Sheen","Charlie Sheen","Charlie Sheen"],
]
df = pd.DataFrame(actors)
# Creating counter
counter = Counter()
# inserting the whole row into the counter
for _, row in df.iterrows():
    counter.update(row)
print("counter object:")
print(counter)
# We show the two most common actors
for actor, occurrences in counter.most_common(2):
    print("Actor {} occurred {} times".format(actor, occurrences))
The output would look like this:
counter object:
Counter({'Will Smith': 4, 'Morgan Freeman': 3, 'Johnny Depp': 3, 'Mila Kunis': 3, 'Charlie Sheen': 3})
Actor Will Smith occurred 4 times
Actor Morgan Freeman occurred 3 times
The Counter object solves your problem quite fast, but be aware that counter.update expects an iterable such as a list. You should not update it with a plain string: if you do, the counter counts its individual characters.
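For example (a quick illustration of that caveat):
from collections import Counter

c = Counter()
c.update("Will Smith")      # iterates over the string -> counts single characters, e.g. ('i', 2), ('l', 2)
print(c.most_common(3))

c = Counter()
c.update(["Will Smith"])    # a one-element list -> counts the whole name once
print(c)                    # Counter({'Will Smith': 1})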
Use stack and value_counts to get the entire list of actors/actresses:
df.stack().value_counts()
Using @Ofi91's setup:
# Artificially creating DataFrame
actors = [
["Will Smith","Johnny Depp","Johnny Depp","Johnny Depp"],
["Will Smith","Morgan Freeman","Morgan Freeman","Morgan Freeman"],
["Will Smith","Mila Kunis","Mila Kunis","Mila Kunis"],
["Will Smith","Charlie Sheen","Charlie Sheen","Charlie Sheen"],
]
df = pd.DataFrame(actors)
df.stack().value_counts()
Output:
Will Smith 4
Morgan Freeman 3
Johnny Depp 3
Charlie Sheen 3
Mila Kunis 3
dtype: int64
To find the actor with the most appearances:
df.stack().value_counts().idxmax()
Output:
'Will Smith'
Let's consider your data frame to be like the one above.
First we stack all columns into a single column. Use the code below to achieve that:
df1 = df.stack().reset_index(drop=True).to_frame('actors')
Now, take the value_counts of the actors column using the code
df2 = df1['actors'].value_counts().sort_values(ascending = False)
Here you go: the resulting series has the actor names and their number of occurrences in the data frame.
Happy Analysis!!!
I have been attempting to solve a problem for hours and am stuck on it. Here is the problem outline:
import numpy as np
import pandas as pd
df = pd.DataFrame({'orderid': [10315, 10318, 10321, 10473, 10621, 10253, 10541, 10645],
'customerid': ['ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'HANAR', 'HANAR', 'HANAR'],
'orderdate': ['1996-09-26', '1996-10-01', '1996-10-03', '1997-03-13', '1997-08-05', '1996-07-10', '1997-05-19', '1997-08-26']})
df
orderid customerid orderdate
0 10315 ISLAT 1996-09-26
1 10318 ISLAT 1996-10-01
2 10321 ISLAT 1996-10-03
3 10473 ISLAT 1997-03-13
4 10621 ISLAT 1997-08-05
5 10253 HANAR 1996-07-10
6 10541 HANAR 1997-05-19
7 10645 HANAR 1997-08-26
I would like to select all the customers who have ordered items more than once WITHIN 5 DAYS.
For example, here only one customer (ISLAT) ordered within a 5-day period, and he did it twice.
I would like to get the output in the following format:
Required Output
customerid initial_order_id initial_order_date nextorderid nextorderdate daysbetween
ISLAT 10315 1996-09-26 10318 1996-10-01 5
ISLAT 10318 1996-10-01 10321 1996-10-03 2
First, to be able to count the difference in days, convert the orderdate column to datetime:
df.orderdate = pd.to_datetime(df.orderdate)
Then define the following function:
def fn(grp):
    return grp[(grp.orderdate.shift(-1) - grp.orderdate) / np.timedelta64(1, 'D') <= 5]
And finally apply it:
df.sort_values(['customerid', 'orderdate']).groupby('customerid').apply(fn)
It is a bit tricky because there can be any number of purchase pairs within a 5-day window. It is a good use case for merge_asof, which lets you do approximate-but-not-exact matching of a dataframe with itself.
Input data
import pandas as pd
df = pd.DataFrame({'orderid': [10315, 10318, 10321, 10473, 10621, 10253, 10541, 10645],
'customerid': ['ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'HANAR', 'HANAR', 'HANAR'],
'orderdate': ['1996-09-26', '1996-10-01', '1996-10-03', '1997-03-13', '1997-08-05', '1996-07-10', '1997-05-19', '1997-08-26']})
Define a function that computes the pairs of purchases, given data for a customer.
def compute_purchase_pairs(df):
    # Approximate self join on the date, but not exact.
    df_combined = pd.merge_asof(df, df, left_index=True, right_index=True,
                                suffixes=('_first', '_second'), allow_exact_matches=False)
    # Compute difference
    df_combined['timedelta'] = df_combined['orderdate_first'] - df_combined['orderdate_second']
    return df_combined
Do the preprocessing and compute the pairs
# Convert to datetime
df['orderdate'] = pd.to_datetime(df['orderdate'])
# Sort dataframe from oldest to newest purchase (groupby will not change this order)
df2 = df.sort_values(by='orderdate')
# Create an index for joining
df2 = df2.set_index('orderdate', drop=False)
# Compute purchase pairs for each customer
df_differences = df2.groupby('customerid').apply(compute_purchase_pairs)
# Show only the ones we care about
result = df_differences[df_differences['timedelta'].dt.days <= 5]
result.reset_index(drop=True)
Result
orderid_first customerid_first orderdate_first orderid_second \
0 10318 ISLAT 1996-10-01 10315.0
1 10321 ISLAT 1996-10-03 10318.0
customerid_second orderdate_second timedelta
0 ISLAT 1996-09-26 5 days
1 ISLAT 1996-10-01 2 days
You can create the column 'daysbetween' with sort_values and diff. Then, to get the required layout, join df with a per-customer shift of itself (groupby on customerid, then shift), and finally query the rows where 'daysbetween_next' meets the condition:
# orderdate must already have been converted with pd.to_datetime
df['daysbetween'] = df.sort_values(['customerid', 'orderdate'])['orderdate'].diff().dt.days

df_final = df.join(df.groupby('customerid').shift(-1),
                   lsuffix='_initial', rsuffix='_next')\
             .drop('daysbetween_initial', axis=1)\
             .query('daysbetween_next <= 5 and daysbetween_next >= 0')
It's quite simple. Let's write down the requirements one at a time and build on them.
First, I guess that the customer has a unique id since it's not specified. We'll use that id for identifying customers.
Second, I assume it does not matter if the customer bought 5 days before or after.
My solution is to use a simple filter. Note that this solution can also be implemented in a SQL database.
As a condition, we require the user to be the same. We can achieve this as follows:
new_df = df[df["ID"] == df["ID"].shift(1)]
We create a new DataFrame, namely new_df, with all rows such that the x-th row has the same user id as the (x-1)-th row (i.e. the previous row).
Now, let's search for purchases within 5 days by adding that condition to the previous piece of code:
new_df = df[(df["ID"] == df["ID"].shift(1)) & ((df["Date"] - df["Date"].shift(1)).dt.days <= 5)]
This should do the job. I cannot test it right now, so some fixes may be needed. I'll try to test it as soon as I can.
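A minimal sketch of the same idea, adapted to the column names used in the question (customerid / orderdate) and assuming the frame is sorted by customer and date:
import pandas as pd

df = pd.DataFrame({'orderid': [10315, 10318, 10321, 10473, 10621, 10253, 10541, 10645],
                   'customerid': ['ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'HANAR', 'HANAR', 'HANAR'],
                   'orderdate': ['1996-09-26', '1996-10-01', '1996-10-03', '1997-03-13',
                                 '1997-08-05', '1996-07-10', '1997-05-19', '1997-08-26']})

df['orderdate'] = pd.to_datetime(df['orderdate'])
df = df.sort_values(['customerid', 'orderdate'])

# Same customer as the previous row AND at most 5 days after the previous order.
same_customer = df['customerid'] == df['customerid'].shift(1)
within_5_days = (df['orderdate'] - df['orderdate'].shift(1)).dt.days <= 5
print(df[same_customer & within_5_days])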
I am trying to figure how to visualize some sensor data. I have data collected every 5 minutes for multiple devices, stored in a JSON structure that looks something like this (note that I don't have control over the data structure):
[
{
"group": { "id": "01234" },
"measures": {
"measures": {
"...device 1 uuid...": {
"metric.name.here": {
"mean": [
["2019-04-17T14:30:00+00:00", 300, 1],
["2019-04-17T14:35:00+00:00", 300, 2],
...
]
}
},
"...device 2 uuid...": {
"metric.name.here": {
"mean": [
["2019-04-17T14:30:00+00:00", 300, 0],
["2019-04-17T14:35:00+00:00", 300, 1],
...
]
}
}
}
}
}
]
Each tuple of the form ["2019-04-17T14:30:00+00:00", 300, 0] is [timestamp, granularity, value]. Devices are grouped by project id. Within any given group, I want to take the data for multiple devices and sum them together. E.g., for the above sample data, I want the final series to look like:
["2019-04-17T14:30:00+00:00", 300, 1],
["2019-04-17T14:35:00+00:00", 300, 3],
The series are not necessarily the same length.
Lastly, I want to aggregate these measurements into hourly samples.
I can get the individual series like this:
with open('data.json') as fd:
    data = pd.read_json(fd)

for i, group in enumerate(data.group):
    project = group['project_id']
    instances = data.measures[i]['measures']
    series_for_group = []
    for instance in instances.keys():
        measures = instances[instance][metric][aggregate]
        # build an index from the timestamps
        index = pd.DatetimeIndex(measure[0] for measure in measures)
        # extract values from the data and link them to the index
        series = pd.Series((measure[2] for measure in measures),
                           index=index)
        series_for_group.append(series)
At the bottom of the outer for loop, I have an array of pandas.core.series.Series objects representing the different sets of measurements associated with the current group. I was hoping I could simply add them together as in total = sum(series_for_group) but that produces invalid data.
Am I even reading in this data correctly? This is the first time I've worked with Pandas; I'm not sure whether (a) creating an index followed by (b) populating the data is the correct procedure here.
How would I successfully sum these series together?
How would I resample this data into 1-hour intervals? Looking at this question it looks as if the .groupby and .agg methods are of interest, but it's not clear from that example how to specify the interval size.
Update 1
Maybe I can use concat and groupby? E.g.:
final = pd.concat(all_series).groupby(level=0).sum()
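This seems to work because plain addition of Series aligns on the index, so timestamps present in only one series become NaN, while concat plus groupby(level=0).sum() sums whatever values exist per timestamp. A toy comparison with made-up values:
import pandas as pd

# Two hypothetical device series whose timestamps only partly overlap.
s1 = pd.Series([1, 2], index=pd.DatetimeIndex(['2019-04-17 14:30', '2019-04-17 14:35']))
s2 = pd.Series([3, 4], index=pd.DatetimeIndex(['2019-04-17 14:35', '2019-04-17 14:40']))

# Plain addition aligns on the index: non-overlapping timestamps become NaN.
print(s1 + s2)

# concat + groupby sums whatever values exist for each timestamp.
print(pd.concat([s1, s2]).groupby(level=0).sum())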
What I suggested in the comment is to do something like this:
result = pd.DataFrame({}, columns=['timestamp', 'granularity', 'value',
                                   'project', 'uuid', 'metric', 'agg'])

for i, group in enumerate(data.group):
    project = group['id']
    instances = data.measures[i]['measures']
    for device, measures in instances.items():
        for metric, aggs in measures.items():
            for agg, lst in aggs.items():
                sub_df = pd.DataFrame(lst, columns=['timestamp', 'granularity', 'value'])
                sub_df['project'] = project
                sub_df['uuid'] = device
                sub_df['metric'] = metric
                sub_df['agg'] = agg
                result = pd.concat((result, sub_df), sort=True)

# parse date:
result['timestamp'] = pd.to_datetime(result['timestamp'])
This results in data that looks like this:
agg granularity metric project timestamp uuid value
0 mean 300 metric.name.here 01234 2019-04-17 14:30:00 ...device 1 uuid... 1
1 mean 300 metric.name.here 01234 2019-04-17 14:35:00 ...device 1 uuid... 2
0 mean 300 metric.name.here 01234 2019-04-17 14:30:00 ...device 2 uuid... 0
1 mean 300 metric.name.here 01234 2019-04-17 14:35:00 ...device 2 uuid... 1
then you can do overall aggregation
result.resample('H', on='timestamp').value.sum()
which gives:
timestamp
2019-04-17 14:00:00 4
Freq: H, Name: value, dtype: int64
or groupby aggregation:
result.groupby('uuid').resample('H', on='timestamp').value.sum()
which gives:
uuid timestamp
...device 1 uuid... 2019-04-17 14:00:00 3
...device 2 uuid... 2019-04-17 14:00:00 1
Name: value, dtype: int64
To construct a dataframe (df) from series with different lengths (for example s1, s2, s3), you can try:
df = pd.concat([s1, s2, s3], ignore_index=True, axis=1).fillna('')
Once you have your dataframe constructed:
Ensure all dates are stored as timestamp objects:
df['Date']=pd.to_datetime(df['Date'])
Then, add another column to extract the hour from the date column:
df['Hour']=df['Date'].dt.hour
And then group by hours and sum up the values:
df.groupby('Hour').sum()
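One thing to keep in mind (a small illustrative sketch with made-up values): grouping on the hour of day pools the same hour from different days into one bucket, whereas resample('H', ...) keeps each calendar hour separate.
import pandas as pd

# Hypothetical samples spanning two days.
df = pd.DataFrame({
    'Date': pd.to_datetime(['2019-04-17 14:30', '2019-04-17 14:35', '2019-04-18 14:30']),
    'value': [1, 2, 4],
})

# Hour of day: both days' 14:xx samples land in the same bucket (hour 14 -> 7).
print(df.groupby(df['Date'].dt.hour)['value'].sum())

# Calendar hours: each day keeps its own 14:00 bucket (3 and 4, with empty hours in between).
print(df.resample('H', on='Date')['value'].sum())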
I ended up with what seems like a working solution based on the code in my question. On my system, this takes about 6 seconds to process about 85MB of input data. In comparison, I cancelled Quang's code after 5 minutes.
I don't know if this is the Right Way to process this data, but it produces apparently correct results. I notice that building a list of series, as in this solution, and then making a single pd.concat call is more performant than putting pd.concat inside the loop.
#!/usr/bin/python3

import click
import matplotlib.pyplot as plt
import pandas as pd


@click.command()
@click.option('-a', '--aggregate', default='mean')
@click.option('-p', '--projects')
@click.option('-r', '--resample')
@click.option('-o', '--output')
@click.argument('metric')
@click.argument('datafile', type=click.File(mode='rb'))
def plot_metric(aggregate, projects, output, resample, metric, datafile):
    # Read in a list of project id -> project name mappings, then
    # convert it to a dictionary.
    if projects:
        _projects = pd.read_json(projects)
        projects = {_projects.ID[n]: _projects.Name[n].lstrip('_')
                    for n in range(len(_projects))}
    else:
        projects = {}

    data = pd.read_json(datafile)
    df = pd.DataFrame()

    for i, group in enumerate(data.group):
        project = group['project_id']
        project = projects.get(project, project)
        devices = data.measures[i]['measures']

        all_series = []
        for device, measures in devices.items():
            samples = measures[metric][aggregate]
            index = pd.DatetimeIndex(sample[0] for sample in samples)
            series = pd.Series((sample[2] for sample in samples),
                               index=index)
            all_series.append(series)

        # concatenate all the measurements for this project, then
        # group them using the timestamp and sum the values.
        final = pd.concat(all_series).groupby(level=0).sum()

        # resample the data if requested
        if resample:
            final = final.resample(resample).sum()

        # add series to dataframe
        df[project] = final

    fig, ax = plt.subplots()
    df.plot(ax=ax, figsize=(11, 8.5))
    ax.legend(frameon=False, loc='upper right', ncol=3)

    if output:
        plt.savefig(output)
        plt.close()
    else:
        plt.show()


if __name__ == '__main__':
    plot_metric()
I have the following data frame:
import pandas as pd
data = pd.DataFrame()
data['Home'] = ['A','B','C','D','E','F']
data['HomePoint'] = [3,0,1,1,3,3]
data['Away'] = ['B','C','A','E','D','D']
data['AwayPoint'] = [0,3,1,1,0,0]
I want to group by the columns ['Home', 'Away'], rename them as Team, and sum HomePoint and AwayPoint into a column named Points.
Team Points
A 4
B 0
C 4
D 1
E 4
F 3
How can I do it?
I tried different approaches using the following post:
Link
But I was not able to get the format that I wanted.
Greatly appreciate your advice.
Thanks
Zep.
A simple way is to create two new Series indexed by the teams:
home = pd.Series(data.HomePoint.values, data.Home)
away = pd.Series(data.AwayPoint.values, data.Away)
Then, the result you want is:
home.add(away, fill_value=0).astype(int)
Note that home + away does not work, because team F never played away, so it would result in NaN for them. That is why we use Series.add() with fill_value=0.
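A tiny illustration of that difference (a made-up mini example, not the question's data):
import pandas as pd

home = pd.Series({'E': 3, 'F': 3})   # F only ever appears as a home team
away = pd.Series({'E': 1})

print(home + away)                   # F -> NaN, because it is missing from `away`
print(home.add(away, fill_value=0))  # F -> 3.0, missing values treated as 0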
A complicated way is to use DataFrame.melt():
goo = data.melt(['HomePoint', 'AwayPoint'], var_name='At', value_name='Team')
goo.HomePoint.where(goo.At == 'Home', goo.AwayPoint).groupby(goo.Team).sum()
Or from the other perspective:
ooze = data.melt(['Home', 'Away'])
ooze.value.groupby(ooze.Home.where(ooze.variable == 'HomePoint', ooze.Away)).sum()
You can concatenate the columns of your input dataframe pairwise, then use groupby.sum:
# calculate number of column pairs
n = len(data.columns) // 2
# create list of pairwise dataframes
df_lst = [data.iloc[:, 2*i:2*(i+1)].set_axis(['Team', 'Points'], axis=1, inplace=False) \
          for i in range(n)]
# concatenate list of dataframes
df = pd.concat(df_lst, axis=0)
# perform groupby
res = df.groupby('Team', as_index=False)['Points'].sum()
print(res)
Team Points
0 A 4
1 B 0
2 C 4
3 D 1
4 E 4
5 F 3