I have two tables: one of customer information and one of transaction info.
Customer information includes each person's quality of health (from 0 to 100)
e.g. if I extract just the Name and HealthQuality columns:
John: 70
Mary: 20
Paul: 40
etc.
After applying featuretools I noticed a new DIFF(HealthQuality) variable.
According to the docs, this is what DIFF does:
"Compute the difference between the value in a list and the previous value in that list."
Is featuretools calculating the difference between Mary and John's health quality in this instance?
I don't think this kind of feature synthesis really works for customer records, e.g. CUM_SUM(emails_sent) for John: John's record is one row, and he has a single value for the number of emails we sent him.
For now I am using the ignore_variables=[all_customer_info] option to remove all of the customer data, leaving the transactions table alone, of course.
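For context, a rough sketch of a dfs call using ignore_variables, with toy data standing in for the tables above; this assumes the older featuretools API where ignore_variables is a dict keyed by entity name:
import pandas as pd
import featuretools as ft

# Toy stand-ins for the real tables; column names are illustrative.
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'Name': ['John', 'Mary', 'Paul'],
    'HealthQuality': [70, 20, 40],
})
transactions = pd.DataFrame({
    'transaction_id': [1, 2, 3, 4],
    'customer_id': [1, 1, 2, 3],
    'amount': [21.95, 20.00, 35.60, 22.78],
})

es = ft.EntitySet(id='data')
es = es.entity_from_dataframe('customers', customers, index='customer_id')
es = es.entity_from_dataframe('transactions', transactions, index='transaction_id')
es = es.add_relationship(ft.Relationship(es['customers']['customer_id'],
                                         es['transactions']['customer_id']))

# Skip the customer-level fields so dfs only builds features from transactions.
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_entity='customers',
    ignore_variables={'customers': ['HealthQuality']},
)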
This also leads me into another question.
Using data from the transactions table, John now has a DIFF(MEAN(transactions.amount)). What difference is DIFF measuring in this instance?
id MEAN(transactions.amount) DIFF(MEAN(transactions.amount))
0 1 21.950000 NaN
1 2 20.000000 -1.950000
2 3 35.604581 15.604581
3 4 NaN NaN
4 5 22.782682 NaN
5 6 35.616306 12.833624
6 7 24.560536 -11.055771
7 8 331.316552 306.756016
8 9 60.565852 -270.750700
For a school project I am attempting to determine how many times specific words are mentioned in Reddit titles and comments, more specifically, stock ticker mentions. Currently the dataframe looks like this (where type is a string, either title or comment):
body score id created subreddit type mentions
3860 There are much better industrials stocks than ... 1 NaN 2021-03-13 20:32:08+00:00 stocks comment {GE}
3776 I guy I work with told me about PENN about 9 m... 1 NaN 2021-03-13 20:29:30+00:00 investing comment {PENN}
4122 [mp4 link](https://preview.redd.it/ieae3z7suum... 2 NaN 2021-03-13 20:28:43+00:00 StockMarket comment {KB}
2219 If you cant decide, then just buy $GME options 1 NaN 2021-03-13 20:28:12+00:00 wallstreetbets comment {GME}
2229 This sub the most wholesome fucking thing in t... 2 NaN 2021-03-13 20:27:57+00:00 wallstreetbets comment {GME}
The mentions column contains the set of tickers mentioned in the body (there can be more than one). What I wish to do is count the number of unique mentions on a per-subreddit, per-type (comment or title) basis. The result I am looking for would be similar to this:
ticker subreddit type count
GME wallstreetbets comment 5
GME wallstreetbets title 4
GME investing comment 3
GME investing title 2
Repeated for all unique tickers mentioned.
I had used Counters to figure this out with a separate dataframe for each case (i.e. one dataframe for wallstreetbets comments, one for wallstreetbets titles), but I could not figure out how to make it work when confined to a single dataframe.
Sounds like a simple groupby should do it:
df.groupby(['mentions','subreddit','type']).count()
produces
body score id created
mentions subreddit type
{GE} stocks comment 1 1 0 1
{GME} wallstreetbets comment 2 2 0 2
{KB} StockMarket comment 1 1 0 1
{PENN} investing comment 1 1 0 1
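If a row's mentions set can contain more than one ticker and you want each ticker counted separately, as in your desired output, one sketch (assuming mentions really holds set-like values) is to explode it before grouping:
# Explode the set so each ticker gets its own row, then count rows
# per (ticker, subreddit, type).
counts = (
    df.explode('mentions')
      .groupby(['mentions', 'subreddit', 'type'])
      .size()
      .reset_index(name='count')
      .rename(columns={'mentions': 'ticker'})
)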
I have a dataframe with 4 columns: County, Salesperson, Part Number, and $Total. There can be any number of rows, from tens of thousands down to two, or even none.
I'd like to be able to give Python a number (let's assume I'm setting it with an input-box value) for the total number of rows in my new randomized report. However, I would like Python to "randomly" pull those records while giving 3 different fields a tiered priority for maximizing UNIQUE values. I want to make sure that, up to the number of records I've specified (let's use 5 as an example), Python will:
Get 5 records where all 5 values in County are unique, all 5 values in Salesperson are unique, all 5 values in Partnumber are unique. If that's not possible, then:
5 records where all 5 County values are unique, all 5 values in Salesperson are unique, and as many as possible values are unique in Partnumber. If THAT's not possible, then:
5 records where all 5 County values are unique, as many as possible are unique in Salesperson, and given those two columns, as many values as possible are unique in Partnumber. If THAT's not possible, then:
5 records where as many as possible are unique in County, then as many as possible unique in Salesperson, then as many as possible unique in Partnumber
Note that just using a unique key across all 3 fields combined could very well give me lots of repeated values within individual columns, as long as one of the columns has unique values. For instance, with this dataframe, if I wanted 5 records that are unique using all 3 columns as a combined key:
County SalesPerson PartNum $Total
Roberts Marcus 2A 300
Lewis James 100A 400
Roberts Ruby 5Z 100
Midland Marcus 10B 50
Lewis Marcus 5E 400
Middlesex Marcus 1W 25
Fannin Marcus 10E 45
Python might give me this:
County SalesPerson PartNum $Total
Roberts Ruby 5Z 100
Midland Marcus 10B 50
Lewis Marcus 5E 400
Middlesex Marcus 1W 25
Fannin Marcus 10E 45
That gives 5 unique records when all 3 columns are taken as a whole, since all the part numbers are different, but I have lots of duplicate Salesperson values and didn't get a record with James. In the original dataframe, County has 5 unique values, so my resulting records should have all 5 unique County names. I should also have all 3 Salesperson values, but not the Roberts County record for Marcus (since Ruby is in only one record, the algorithm should see that it can still get 5 unique County values and all 3 salespeople as long as it takes the Ruby record and a NON-Roberts record for Marcus). So my output SHOULD be:
County SalesPerson PartNum $Total
Lewis James 100A 400
Roberts Ruby 5Z 100
Midland Marcus 10B 50
Middlesex Marcus 1W 25
Fannin Marcus 10E 45
Now there's this dataframe, and let's say I give "5" again:
County SalesPerson PartNum $Total
Roberts Marcus 2A 300
Lewis James 100A 400
Middlesex Ruby 5Z 100
Middlesex Marcus 1W 50
Middlesex James 100A 400
Middlesex Marcus 2X 25
Fannin Marcus 10E 45
There are only 4 unique County values, so one of them will be represented twice in my report. Given that Python has already made sure to max out the possible unique values in County, it must now max out the unique values in Salesperson. So of my 5 needed records, it will take 2 Middlesex County records and focus on getting as many unique Salesperson values as possible. Ruby will definitely be one of them, as she has only one record and we've already hit the maximum number of unique County values. For the OTHER Middlesex record, Python would NOT pick the James option, because that would give me a duplicate PartNum when a different PartNum is possible with unique County AND unique Salesperson already maxed out; so I would get either one of the Middlesex Marcus records (PartNum 1W or 2X):
County SalesPerson PartNum $Total
Roberts Marcus 2A 300
Lewis James 100A 400
Middlesex Ruby 5Z 100
Middlesex Marcus 1W 50
Fannin Marcus 10E 45
One thing that occurred to me was making 3 lists of unique values, one from each of the three columns, then iterating through each list in order of priority: take the first value of the County list, build a new dataset of all records in the original dataframe where County matches that value; then take the first value of my unique Salesperson list and narrow that dataset to rows where Salesperson matches; then search THAT subset for rows where PartNum is the first PartNum value from my list; then randomly pick a record where all of that succeeded; then start over with the used values removed from their respective lists, repeating until I've reached my specified number (5 in these cases). But I think that breaks whenever any of my lists has fewer unique values than my specified number, it doesn't account for prioritizing one column over another when a higher-priority column doesn't have as many unique values as my specified number, and it doesn't seem very pythonic.
Any help on this would be very much appreciated; I'm having a hard time wrapping my head around it and am not super familiar with Python. I hope my descriptions are clear; please let me know if I need to clarify further. And again, sorry about the length of this; it was difficult for me to communicate it both clearly and briefly.
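A minimal sketch of one greedy reading of that priority order (column names taken from the examples above): pick one row per County first, preferring a row that adds a new Salesperson and then a new PartNum, with random tie-breaks, then top up any remaining slots the same way. It should reproduce both walkthroughs above, but being greedy it is not guaranteed to find the best combination on every input:
import random

def tiered_sample(df, n, cols=('County', 'SalesPerson', 'PartNum')):
    """Greedy sketch of the tiered pull: maximize new County values first,
    then new SalesPerson values, then new PartNum values, breaking ties at
    random. Not an exhaustive search, so it can miss a better combination."""
    seen = {c: set() for c in cols}
    picked = []

    def preference(idx):
        # Tuple of "adds a new value" flags in priority order;
        # True sorts above False, so tuples compare in tier order.
        return tuple(df.at[idx, c] not in seen[c] for c in cols)

    def take(candidates):
        best = max(preference(i) for i in candidates)
        choice = random.choice([i for i in candidates if preference(i) == best])
        picked.append(choice)
        for c in cols:
            seen[c].add(df.at[choice, c])

    # Pass 1: (up to) one row per County, most constrained counties first
    # so single-option counties seed the "seen" sets early.
    groups = sorted(df.groupby(cols[0]).groups.values(), key=len)
    for idxs in groups[:n]:
        take(list(idxs))

    # Pass 2: top up to n rows from whatever is left, same preference.
    leftover = [i for i in df.index if i not in picked]
    while len(picked) < min(n, len(df)):
        take(leftover)
        leftover = [i for i in leftover if i not in picked]

    return df.loc[picked]

# e.g. report = tiered_sample(df, 5)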
I have two dataframes: one with an account number, a purchase ID, a total cost, and a date
and another with account number, money paid, and date:
To make it clear: there are two accounts, 11111 and 33333, but there are some typos in the dataframes.
AccountNumber Purchase ID Total Cost Date
11111 100 10 1/1/2020
333333 100 10 1/1/2020
33333 200 20 2/2/2020
11111 300 30 4/2/2020
AccountNumber Date Money Paid
11111 1/1/2020 5
111111 1/2/2020 2
33333 1/2/2020 1
33333 2/3/2020 15
1111 4/2/2020 30
Each Purchase ID identifies a single purchase; however, multiple accounts may be involved in a purchase, such as accounts 11111 and 33333. Moreover, an account may be used for two different purchases, such as account 11111 with Purchase IDs 100 and 300. In the second dataframe, payments can be made in increments, so I need to use the date to make sure that each payment is associated with the correct Purchase ID. There may also be slight errors in the account numbers, so I need to use a fuzzy match. In the end, I want a dataframe that is grouped by Purchase ID and compares how much the accounts paid vs. the cost of the item:
Purchase ID Date Total Cost Amount Paid $Owed
100 1/1/2020 10 8 2
200 2/2/2020 20 15 5
300 4/2/2020 30 30 0
As you can see, this is a fairly complicated question. I first tried just joining the two dataframes on AccountNumber, but I ran into issues due to the slight differences in the account numbers, as well as the problem of matching each payment to the correct Purchase ID using the date: one danger with merging is that you might accidentally sum up money paid toward the wrong purchase, since accounts can be involved in different purchases.
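Roughly, that first attempt amounts to something like this (purchases and payments are placeholder names for the two dataframes above):
# Naive join on the raw account numbers: typos like 333333 vs 33333 simply
# fail to match, and every payment row for an account attaches to every
# purchase row for that account, regardless of date.
merged = purchases.merge(payments, on='AccountNumber', how='left')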
I'm thinking about iterating through the rows and using if statements/regex, but I feel like that would take too long.
What's the simplest and most efficient solution to this problem? I'm a beginner at pandas and Python.
The library pandas-dedupe can help you link the two dataframes using a combination of active learning and clustering. Have a look at the repo.
Here is the sample code (and step by step explanation):
import pandas as pd
import pandas_dedupe
#load dataframes
dfa = pd.read_csv('file_a.csv')
dfb = pd.read_csv('file_b.csv')
#initiate matching
df_final = pandas_dedupe.link_dataframes(dfa, dfb, ['field_1', 'field_2', 'field_3', 'field_4'])
# At this point pandas_dedupe will ask you to label a sample of records according
# to whether they are distinct or the same observation.
# After that, pandas-dedupe uses its knowledge to cluster together similar records.
#send output to csv
df_final.to_csv('linkage_output.csv')
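Once the two frames are linked, the per-purchase comparison in your desired output is a groupby away. A sketch, assuming the linked output keeps the original column names and that the date-based attribution of payments to purchases is handled separately (it is not handled here):
# Total paid per purchase vs. its cost; column names assumed from the question.
summary = (
    df_final.groupby(['Purchase ID', 'Total Cost'], as_index=False)['Money Paid']
            .sum()
            .rename(columns={'Money Paid': 'Amount Paid'})
)
summary['$Owed'] = summary['Total Cost'] - summary['Amount Paid']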
I am working on an employee database that contains employee records for each quarter, so each employee's name ends up appearing multiple times. I need to pre-process the data into the form of Table 2 to train my model.
Table 1:
Time Stamp Emp_ID Emp_Name Role Job-Level Joining-Date Exp_in_Current Role(Months)
3/31/2014 987943 John Analyst JL3 1/1/2014 3
3/31/2015 987943 John Lead JL4 1/1/2014 2
3/31/2014 987926 Alex Lead JL4 1/2/2013 13
3/31/2015 987926 Alex Manager JL5 1/2/2013 2
I need to process the above table in the following format:
Employee_ID Employee_Name Role_1 Duration_in_Role_1 Role_2 Duration_in_Role_2 (as of today)
987943 John Analyst 12 Lead 31
987926 Alex Lead 24 Manager 31
Could you please help me resolve the above problem?
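A sketch of the reshape step alone, assuming pandas and the column names from Table 1 (the duration arithmetic is a separate step and is not attempted here):
# Number each employee's rows in time order, then pivot Role to wide format.
# Assumes 'Time Stamp' has been parsed as a datetime.
df = df.sort_values('Time Stamp')
df['RoleNum'] = df.groupby('Emp_ID').cumcount() + 1

wide = (df.set_index(['Emp_ID', 'Emp_Name', 'RoleNum'])['Role']
          .unstack('RoleNum'))
wide.columns = [f'Role_{i}' for i in wide.columns]
wide = wide.reset_index()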
I am trying to get a daily status count from the following DataFrame (this is a subset; the real data set is ~14k jobs with overlapping dates, and only one status at any given time within a job):
Job Status User
Date / Time
1/24/2011 10:58:04 1 A Ted
1/24/2011 10:59:20 1 C Bill
2/11/2011 6:53:14 1 A Ted
2/11/2011 6:53:23 1 B Max
2/15/2011 9:43:13 1 C Bill
2/21/2011 15:24:42 1 F Jim
3/2/2011 15:55:22 1 G Phil Jr.
3/4/2011 14:57:45 1 H Ted
3/7/2011 14:11:02 1 I Jim
3/9/2011 9:57:34 1 J Tim
8/18/2014 11:59:35 2 A Ted
8/18/2014 13:56:21 2 F Bill
5/21/2015 9:30:30 2 G Jim
6/5/2015 13:17:54 2 H Jim
6/5/2015 14:40:38 2 I Ted
6/9/2015 10:39:15 2 J Tom
1/16/2015 7:45:58 3 A Phil Jr.
1/16/2015 7:48:23 3 C Jim
3/6/2015 14:09:42 3 A Bill
3/11/2015 11:16:04 3 K Jim
My initial thought (from the following link) was to group by the Job column, fill in the missing dates for each group, and then ffill the statuses down.
Pandas reindex dates in Groupby
I was able to make this work... kinda. If two statuses occurred on the same date, one of them would not be included in the output, and consequently some statuses were missing.
I then found the following, which supposedly handles the duplicate issue, but I am unable to get it to work with my data.
Efficiently re-indexing one level with "forward-fill" in a multi-index dataframe
Am I on the right path in thinking that filling in the missing dates and then forward-filling the statuses down is the correct way to ultimately capture daily counts of individual statuses? Is there another method that might make better use of pandas features that I'm missing?
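A minimal sketch of the reindex-and-ffill idea described above (column names from the sample; collapsing same-day duplicates by keeping the day's last recorded status is an assumption):
import pandas as pd

# df has a DatetimeIndex ('Date / Time') and columns Job, Status, User.
def daily_status_counts(df):
    filled = []
    for job, grp in df.groupby('Job'):
        grp = grp.sort_index()
        # Collapse same-day duplicates: keep the day's last recorded status.
        daily = grp.groupby(grp.index.normalize())['Status'].last()
        # Fill in the missing calendar days and forward-fill the status.
        days = pd.date_range(daily.index.min(), daily.index.max(), freq='D')
        daily = daily.reindex(days).ffill()
        filled.append(pd.DataFrame({'Job': job, 'Status': daily}))
    filled = pd.concat(filled)
    # Daily count of jobs in each status.
    return filled.groupby([filled.index, 'Status']).size()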