My dataframe contains houses (IDs) located in zipcodes that purchased a product on a certain date. I would like to add a column to my dataframe that, for every ID, adds up the number of purchases in the zipcode up until that point, minus 3 months. So for a row that contains a purchase on December 30th, I would add up all the purchases in that zipcode up until September 30th.
I already converted the purchase date column to a datetime format.
My dataset looks like the below. As you can see, between rows 2 and 3 there is a period of almost two years in which nothing happened in that area.
Id:  Zipcode:  Purchase_Date:
1    9999      2017-August-24
2    9999      2017-December-30
3    9999      2019-July-14
4    2000      2017-March-11
5    2000      2017-May-14
etc.
Ideally, the end result would look like this:
Id:  Zipcode:  Purchase_Date:    Cumulative_purchases:
1    9999      2017-August-24    0
2    9999      2017-December-30  1
3    9999      2019-July-14      1
4    2000      2017-March-11     0
5    2000      2017-May-14       0
etc.
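One way to get there (a minimal sketch implementing the rule as literally stated, i.e. counting every purchase in the same zipcode on or before the purchase date minus three months; the helper name is mine):

import numpy as np
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "Zipcode": [9999, 9999, 9999, 2000, 2000],
    "Purchase_Date": pd.to_datetime(
        ["2017-08-24", "2017-12-30", "2019-07-14", "2017-03-11", "2017-05-14"]
    ),
})

def purchases_before_cutoff(group):
    # all purchase dates in this zipcode, sorted
    dates = np.sort(group["Purchase_Date"].to_numpy())
    # cutoff per row: three months before that row's purchase date
    cutoffs = (group["Purchase_Date"] - pd.DateOffset(months=3)).to_numpy()
    # number of purchases on or before each cutoff
    return pd.Series(np.searchsorted(dates, cutoffs, side="right"), index=group.index)

df["Cumulative_purchases"] = (
    df.groupby("Zipcode", group_keys=False).apply(purchases_before_cutoff)
)

Note that on this sample the rule as stated gives 2 for Id 3, since both earlier 9999 purchases fall before 2019-April-14.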
I have the following multi-index dataframe, with ID and Year being part of the index. The Solvency column is based on whether or not there are NaNs in both Profit/Loss and Total Sales for that year.
ID  Year  Profit/Loss  Total Sales  Solvency
0   2008  300.         2000.        1
0   2009  NaN          NaN          0
0   2010  500.         2000.        1
1   2008  300.         2000.        1
1   2009  NaN          NaN          0
1   2010  NaN          NaN          0
However, it is sometimes the case that a company has NaNs in one year but not in the one after, so it is in fact not insolvent and did not disappear from the data set. For my analysis I need to know how many companies drop out over the time period. I am guessing that I need a groupby-based function that checks whether a 0 appears in the Solvency column and then checks whether a 1 ever appears again in the following years for that specific company. The final output should tell how many companies have dropped out in every year.
Year  Count Dropouts
2008  0
2009  1
2010  1
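A minimal sketch of that logic (the toy frame below is reconstructed from the example): a reversed cumulative maximum per company marks whether it is ever solvent again from a given year onwards, and summing the zeros per year reproduces the counts above.

import pandas as pd

# toy frame reconstructed from the example
df = pd.DataFrame({
    "ID":       [0, 0, 0, 1, 1, 1],
    "Year":     [2008, 2009, 2010, 2008, 2009, 2010],
    "Solvency": [1, 0, 1, 1, 0, 0],
}).set_index(["ID", "Year"])

s = df["Solvency"]

# 1 if the company is solvent in this year or any later year,
# 0 if it never becomes solvent again
alive = s[::-1].groupby(level="ID").cummax()[::-1]

# count, per year, the companies that never recover
dropouts_per_year = alive.eq(0).groupby(level="Year").sum()
print(dropouts_per_year)  # 2008: 0, 2009: 1, 2010: 1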
Every month I collect data that contains details of employees to be stored in our database.
I need to find a solution to compare the data stored in the previous month with the newly received data and, for each row where any column changed, return that row in a new dataframe.
I would also need to know, somehow, which columns in each row of this new dataframe changed during the comparison.
There are also some important details to mention:
Each column can also contain blank values in either dataframe;
The dataframes have the same column names but not necessarily the same data types;
The dataframes do not necessarily have the same number of rows;
If a row does not find its Index match, it should not be returned in the new dataframe;
The rows of the dataframes can be matched by a column named "Index".
So, for example, we would have this dataframe (which is just a slice of the real one as it has 63 columns):
df1:
Index  Department  Salary   Manager   Email     Start_Date
1      IT          6000.00  Jack      ax#i.com  01-01-2021
2      HR          7000     O'Donnel  ay#i.com
3      MKT         $7600    Maria     d         30-06-2021
4      I'T         8000     Peter     az#i.com  14-07-2021
df2:
Index  Department  Salary   Manager   Email     Start_Date
1      IT          6000.00  Jack      ax#i.com  01-01-2021
2      HR          7000     O'Donnel  ay#i.com  01-01-2021
3      MKT         7600     Maria     dy#i.com  30-06-2021
4      IT          8000     Peter     az#i.com  14-07-2021
5      IT          9000     John      NOT PROVIDED
6      IT          9900     John      NOT PROVIDED
df3:
Index  Department  Salary  Manager   Email     Start_Date
2      HR          7000    O'Donnel  ay#i.com  01-01-2021
3      MKT         7600    Maria     dy#i.com  30-06-2021
4      IT          8000    Peter     az#i.com  14-07-2021
The differences in this example are:
Start date added for row Index 2;
Salary format corrected and email corrected for row Index 3;
Department format corrected for row Index 4.
What would be the best way to do this comparison?
I am not sure whether there is an easy way to understand what changed in each field, but returning a dataframe with the rows that had at least one change would already be helpful.
Thank you for the support!
I think compare could do the trick: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.compare.html
But first you would need to align the rows between the old and new dataframes via the index:
new_df_to_compare = new_df.loc[old_df.index]
When the datatypes don't match, you would also need to align them:
new_df_to_compare = new_df_to_compare.astype(old_df.dtypes.to_dict())
Then compare should work just like this:
difference_df = old_df.compare(new_df_to_compare)
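Putting those steps together, a minimal sketch (using df1 and df2 from the question and assuming the matching column is literally named "Index"):

import pandas as pd

# use the "Index" column as the row key in both frames
old = df1.set_index("Index")
new = df2.set_index("Index")

# keep only rows whose key exists in both frames, so rows without an
# Index match (5 and 6 here) are dropped
common = old.index.intersection(new.index)
old, new = old.loc[common], new.loc[common]

# align dtypes so compare() does not flag pure type differences
new = new.astype(old.dtypes.to_dict())

# "self" holds the old value, "other" the new one
difference_df = old.compare(new)

compare() drops rows and columns with no differences and shows NaN for unchanged cells, so the result also tells you which fields changed in each remaining row.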
May I know how to select the row with the maximum count after grouping by a column?
Examples:
STATE  COUNTY  POPULATION
1      5571    1000
2      3421    2000
3      6781    3000
2      1234    4000
2      3344    6600
1      5566    9900
I want to find the STATE with the maximum count of counties, and show only STATE and COUNTY (the count), without POPULATION.
The answer should be as below, but I don't know how to code it in Python. Thanks for the help.
STATE  COUNTY
2      3
Try:
u = df.groupby('STATE')['COUNTY'].size()    # number of counties per state
v = u[u.index == u.idxmax()].reset_index()  # keep only the state with the max count
v:
   STATE  COUNTY
0      2       3
Approach:
Group by STATE and then use nunique if you want to count distinct values, or size, on the COUNTY column.
Get the index of the row where the count is the max.
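If distinct counties should be counted, a small variation on the same idea (nunique instead of size):

# distinct counties per state; keep the state with the most
counts = df.groupby('STATE')['COUNTY'].nunique()
result = counts.loc[[counts.idxmax()]].reset_index()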
I have a dataframe, shown in image 1. It is a sample of pubs in London, UK (3337 pubs/rows), and the geometry is at LSOA level. In some LSOAs there is more than one pub. I want my dataframe to summarise the number of pubs in every LSOA. I already have the information by using
psdf['lsoa11nm'].value_counts()
prints out:
City of London 001F 103
City of London 001G 40
Westminster 013B 36
Westminster 018A 36
Westminster 013E 30
...
Lambeth 005A 1
Croydon 043C 1
Hackney 002E 1
Merton 022D 1
Bexley 008B 1
Name: lsoa11nm, Length: 1630, dtype: int64
I can't use this as a new dataframe, because it is a Series keyed by the LSOA name with a single column of counts, as opposed to two columns, one for lsoa11nm and one for the pub count.
Does anyone know how to group the dataframe so that there is only one row for every LSOA, saying how many pubs are in it?
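One way (a minimal sketch, using psdf from the question): groupby().size() plus reset_index turns the counts into a proper two-column dataframe.

# one row per LSOA, with the number of pubs in it
pub_counts = (
    psdf.groupby('lsoa11nm')
        .size()
        .reset_index(name='pub_count')
        .sort_values('pub_count', ascending=False)
)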
I have two tables, one of customer information and one of transaction info.
The customer information includes each person's quality of health (from 0 to 100),
e.g. if I extract just the Name and HealthQuality columns:
John: 70
Mary: 20
Paul: 40
etc etc.
After applying featuretools I noticed a new DIFF(HealthQuality) variable.
According to the docs, this is what DIFF does:
"Compute the difference between the value in a list and the previous value in that list."
Is featuretools calculating the difference between Mary's and John's health quality in this instance?
I don't think this kind of feature synthesis really works for customer records, e.g. CUM_SUM(emails_sent) for John: John's record is one row, and he has a single value for the number of emails we sent him.
For now I am using the ignore_variables=[all_customer_info] option to remove all of the customer data, except for the transactions table of course.
This also leads me into another question.
Using data from the transactions table, John now has a DIFF(MEAN(transactions.amount)). What is DIFF measuring in this instance?
   id  MEAN(transactions.amount)  DIFF(MEAN(transactions.amount))
0   1                  21.950000                              NaN
1   2                  20.000000                        -1.950000
2   3                  35.604581                        15.604581
3   4                        NaN                              NaN
4   5                  22.782682                              NaN
5   6                  35.616306                        12.833624
6   7                  24.560536                       -11.055771
7   8                 331.316552                       306.756016
8   9                  60.565852                      -270.750700
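Judging by the docs' description, DIFF appears to behave like pandas' Series.diff applied to the feature column in row order; as a quick sanity check (not featuretools internals), the DIFF column above can be reproduced from the MEAN values:

import pandas as pd

# MEAN(transactions.amount) values copied from the table above
means = pd.Series([21.950000, 20.000000, 35.604581, None, 22.782682,
                   35.616306, 24.560536, 331.316552, 60.565852])

# each value minus the previous one; NaN propagates, which is why the
# row after a missing mean is NaN too
print(means.diff())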