Conditional copy of values from one column to another column - python

I have a pandas dataframe that looks something like this:
name job jobchange_rank date
Thisguy Developer 1 2012
Thisguy Analyst 2 2014
Thisguy Data Scientist 3 2015
Anotherguy Developer 1 2018
The jobchange_rank column represents each individual's (based on name) ranked change in position, where rank 1 represents his/her first position, rank 2 his/her second position, etc.
Now for the fun part. I want to create a new column where I can see a person's previous job, something like this:
name job jobchange_rank date previous_job
Thisguy Developer 1 2012 None
Thisguy Analyst 2 2014 Developer
Thisguy Data Scientist 3 2015 Analyst
Anotherguy Developer 1 2018 None
I've created the following code to get the "None" values where there was no job change:
df.loc[df['jobchange_rank'].sub(df['jobchange_rank'].min()) == 0, 'previous_job'] = 'None'
Sadly, I can't seem to figure out how to get the values from the other column where the needed condition applies.
Any help is more than welcome!
Thanks in advance.

This answer assumes that your DataFrame is sorted by name and jobchange_rank; if that is not the case, sort first.
# df = df.sort_values(['name', 'jobchange_rank'])
m = df['name'].eq(df['name'].shift())
df['job'].shift().where(m)
0 NaN
1 Developer
2 Analyst
3 NaN
Name: job, dtype: object
Or use groupby + shift (this assumes the data is at least sorted by jobchange_rank):
df.groupby('name')['job'].shift()
0 NaN
1 Developer
2 Analyst
3 NaN
Name: job, dtype: object
Although the groupby + shift is more concise, on larger inputs that are already sorted like your example it may be faster to avoid the groupby and use the first solution.
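Either result can be assigned straight back as the new column; a minimal sketch (the fillna('None') step only reproduces the literal 'None' strings from your desired output, otherwise you would normally keep the NaN):
df = df.sort_values(['name', 'jobchange_rank'])
df['previous_job'] = df.groupby('name')['job'].shift()
df['previous_job'] = df['previous_job'].fillna('None')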


Compare different df's row by row and return changes

Every month I collect data that contains details of employees to be stored in our database.
I need to compare the data stored in the previous month to the newly received data and, for each row where any of the columns changed, return that row into a new dataframe.
I would also need to know somehow which columns in each row of this new returned dataframe had a change when this comparison happened.
There are also some important details to mention:
Each column can also contain blank values in any of the dataframes;
The dataframes have the same column names but not necessarily the same data type;
The dataframes do not have the same number of rows necessarily;
If a row does not find its Index match, it should not be returned in the new dataframe;
The rows of the dataframes can be matched by a column named "Index".
So, for example, we would have this dataframe (which is just a slice of the real one as it has 63 columns):
df1:
Index Department Salary Manager Email Start_Date
1 IT 6000.00 Jack ax#i.com 01-01-2021
2 HR 7000 O'Donnel ay#i.com
3 MKT $7600 Maria d 30-06-2021
4 I'T 8000 Peter az#i.com 14-07-2021
df2:
Index Department Salary Manager Email Start_Date
1 IT 6000.00 Jack ax#i.com 01-01-2021
2 HR 7000 O'Donnel ay#i.com 01-01-2021
3 MKT 7600 Maria dy#i.com 30-06-2021
4 IT 8000 Peter az#i.com 14-07-2021
5 IT 9000 John NOT PROVIDED
6 IT 9900 John NOT PROVIDED
df3:
Index Department Salary Manager Email Start_Date
2 HR 7000 O'Donnel ay#i.com 01-01-2021
3 MKT 7600 Maria dy#i.com 30-06-2021
4 IT 8000 Peter az#i.com 14-07-2021
The differences in this example are:
Start date added in row of Index 2
Salary format corrected and email corrected for row Index 3
Department format corrected for row Index 4
What would be the best way to do this comparison?
I am not sure if there is an easy solution to understand what changed in each field but returning the dataframe with rows that had at least 1 change would be helpful.
Thank you for the support!
I think compare could do the trick: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.compare.html
But first you would need to align the rows between old and new dataframe via the index:
new_df_to_compare=new_df.loc[old_df.index]
When datatypes don't match, you would also need to align them:
new_df_to_compare = new_df_to_compare.astype(old_df.dtypes.to_dict())
Then compare should work just like this:
difference_df = old_df.compare(new_df_to_compare)
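Putting it together, a rough sketch that matches rows on the "Index" column from the question (old_df/new_df are placeholder names) and keeps only the rows present in both frames:
# align both frames on the "Index" column
old = old_df.set_index('Index')
new = new_df.set_index('Index')
# keep only rows whose Index exists in both frames
common = old.index.intersection(new.index)
old, new = old.loc[common], new.loc[common]
# align dtypes so compare() doesn't flag type-only differences
new = new.astype(old.dtypes.to_dict())
# changed cells only; unchanged cells show up as NaN
difference_df = old.compare(new)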

DataFrame from variable and filtering data

I have a DataFrame and want to extract 3 columns from it, but one of them is an input from the user. I made a list, but I need it to be iterable so I can run a for loop over it.
So far I managed by making a dictionary with 2 of the columns, making a list of each and zipping them... but I really need the 3 columns...
My code:
Data=pd.read_csv(----------)
selec=input("What month would you want to show?")
NewData=[(Data['Country']),(Data['City']),(Data[selec].astype('int64'))]
#here I try to iterate:
iteration=[i for i in NewData if NewData[i]<=25]
print (iteration)
*TypeError: list indices must be integers or slices, not Series*
My CSV is the following:
I want to be able to choose the month with the variable "selec" and filter the results of the month I've chosen... so the output for selec="Feb" would be:
I tried as well with loc/iloc, but no luck at all (unhashable type: 'list').
See the below example for how you can:
select specific columns from a DataFrame by providing a list of columns between the selection brackets (link to tutorial)
select specific rows from a DataFrame by providing a condition between the selection brackets (link to tutorial)
iterate rows of a DataFrame, although I don't suppose you need it - if you'd like to keep working with the DataFrame after filtering it, it's better to use the method mentioned above (you won't have to put the rows back together, and it will likely be more performant because pandas is optimized for bulk operations)
import pandas as pd
# this is just for testing, instead of pd.read_csv(...)
df = pd.DataFrame([
    dict(Country="Spain", City="Madrid", Jan="15", Feb="16", Mar="17", Apr="18", May=""),
    dict(Country="Spain", City="Galicia", Jan="1", Feb="2", Mar="3", Apr="4", May=""),
    dict(Country="France", City="Paris", Jan="0", Feb="2", Mar="3", Apr="4", May=""),
    dict(Country="Algeria", City="Argel", Jan="20", Feb="28", Mar="29", Apr="30", May=""),
])
print("---- Original df:")
print(df)
selec = "Feb" # let's pretend this comes from input()
print("\n---- Just the 3 columns:")
df = df[["Country", "City", selec]] # narrow down the df to just the 3 columns
df[selec] = df[selec].astype("int64") # convert the selec column to proper type
print(df)
print("\n---- Filtered dataframe:")
df1 = df[df[selec] <= 25]
print(df1)
print("\n---- Iterated & filtered rows:")
for row in df.itertuples():
    # we could also use row[3] instead of getattr(...)
    if getattr(row, selec) <= 25:
        print(row)
Output:
---- Original df:
Country City Jan Feb Mar Apr May
0 Spain Madrid 15 16 17 18
1 Spain Galicia 1 2 3 4
2 France Paris 0 2 3 4
3 Algeria Argel 20 28 29 30
---- Just the 3 columns:
Country City Feb
0 Spain Madrid 16
1 Spain Galicia 2
2 France Paris 2
3 Algeria Argel 28
---- Filtered dataframe:
Country City Feb
0 Spain Madrid 16
1 Spain Galicia 2
2 France Paris 2
---- Iterated & filtered rows:
Pandas(Index=0, Country='Spain', City='Madrid', Feb=16)
Pandas(Index=1, Country='Spain', City='Galicia', Feb=2)
Pandas(Index=2, Country='France', City='Paris', Feb=2)
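Since you mentioned trying loc: the filtered frame can also be built in one step with .loc, passing the row condition first and the column list second (equivalent to the df1 above, assuming the selec column has already been converted to int):
df1 = df.loc[df[selec] <= 25, ["Country", "City", selec]]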

How is DIFF calculated on customer demographics in featuretools?

I have two tables of customer information and transaction info.
Customer information includes each person's quality of health (from 0 to 100)
e.g. if I extract just the Name and HealthQuality columns:
John: 70
Mary: 20
Paul: 40
etc etc.
After applying featuretools I noticed a new DIFF(HealthQuality) variable.
According to the docs, this is what DIFF does:
"Compute the difference between the value in a list and the previous value in that list."
Is featuretools calculating the difference between Mary and John's health quality in this instance?
I don't think this kind of feature synthesis really works for customer records, e.g. CUM_SUM(emails_sent) for John. John's record is one row, and he has one value for the number of emails we sent him.
For now I am using the ignore_variables=[all_customer_info] option to remove all of the customer data, except for the transactions table of course.
This also leads me into another question.
Using data from the transactions table, John now has a DIFF(MEAN(transactions.amount)). What is the DIFF measured in this instance?
id MEAN(transactions.amount) DIFF(MEAN(transactions.amount))
0 1 21.950000 NaN
1 2 20.000000 -1.950000
2 3 35.604581 15.604581
3 4 NaN NaN
4 5 22.782682 NaN
5 6 35.616306 12.833624
6 7 24.560536 -11.055771
7 8 331.316552 306.756016
8 9 60.565852 -270.750700
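For what it's worth, the DIFF column in that table is consistent with a plain row-wise difference over the feature matrix, i.e. each row's value minus the previous row's value (NaN when either is missing). A small sketch reproducing it with pandas:
import pandas as pd
# the MEAN(transactions.amount) column from the table above
mean_amount = pd.Series([21.95, 20.0, 35.604581, None, 22.782682,
                         35.616306, 24.560536, 331.316552, 60.565852])
# value in each row minus the value in the previous row -> matches DIFF(...)
mean_amount.diff()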

How to stop a Dataframe subset from "remembering" old values

Sorry for the weird phrasing, but I didn't know how to better describe it. I will be translating my problem to US terms to ease understanding. My problem is, I have a national database with States and Districts and I need to work only with Districts from Florida, so I do this:
df_fl=df.loc[df.state=='florida'].copy()
After some transformations I want to take mean values of every district from Florida, so I do this:
df_final=df_fl.groupby(['district']).mean()
But this brings a dataframe with every district in the database. All rows from districts that are not in Florida are filled with nans. I suppose there's an easy solution to this, but I haven't been able to find it. It's kinda counterintuitive that it works like that, too.
So, can you help me fix this?
Thanks in advance,
Edit:
my data looked like this:
District state Salary
1 Florida 1000
1 Florida 2000
2 Florida 2000
2 Florida 3000
3 California 3000
df_fl, then, looks like this:
District state Salary
1 Florida 1000
1 Florida 2000
2 Florida 2000
2 Florida 3000
And after applying
df_final=df_fl.groupby(['district']).mean()
I expected to get this:
District Salary
1 1500
2 2500
But I'm getting this:
District Salary
1 1500
2 2500
3 nan
Obviously a very simplified version, but the core remains.
It is because your 'District' column is a categorical type.
MCVE
df = pd.DataFrame(dict(
    State=list('CCCCFFFF'),
    District=list('WXWXYYZZ'),
    Value=range(1, 9)
))
Without categorical
df.query('State == "F"').groupby('District').Value.mean()
District
Y 5.5
Z 7.5
Name: Value, dtype: float64
With categorical
df.assign(
    District=pd.Categorical(df.District)
).query('State == "F"').groupby('District').Value.mean()
District
W NaN
X NaN
Y 5.5
Z 7.5
Name: Value, dtype: float64
Solution
There are many ways to do this. One way that preserves the categorical typing is to use the method remove_unused_categories:
df = df.assign(District=df.District.cat.remove_unused_categories())
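Applied to the MCVE above, dropping the unused categories on the filtered subset restores the expected groups:
sub = df.assign(District=pd.Categorical(df.District)).query('State == "F"')
sub = sub.assign(District=sub.District.cat.remove_unused_categories())
sub.groupby('District').Value.mean()
District
Y 5.5
Z 7.5
Name: Value, dtype: float64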
As piRSquared already explained, this only happens with categorical data. Starting from pandas 0.23.0, groupby has a new "observed" argument which toggles this behavior. MCVE taken from piRSquared:
>>> df = pd.DataFrame(dict(
...     State=list('CCCCFFFF'),
...     District=list('WXWXYYZZ'),
...     Value=range(1, 9)
... ))
>>> df.assign(
...     District=pd.Categorical(df.District)
... ).query('State == "F"').groupby('District').Value.mean()
District
W NaN
X NaN
Y 5.5
Z 7.5
Name: Value, dtype: float64
>>> df.assign(
...     District=pd.Categorical(df.District)
... ).query('State == "F"').groupby('District', observed=True).Value.mean()
District
Y 5.5
Z 7.5
Name: Value, dtype: float64

Fill Missing Dates in DataFrame with Duplicate Dates in Groupby

I am trying to get a daily status count from the following DataFrame (it's a subset, the real data set is ~14k jobs with overlapping dates, only one status at any given time within a job):
Job Status User
Date / Time
1/24/2011 10:58:04 1 A Ted
1/24/2011 10:59:20 1 C Bill
2/11/2011 6:53:14 1 A Ted
2/11/2011 6:53:23 1 B Max
2/15/2011 9:43:13 1 C Bill
2/21/2011 15:24:42 1 F Jim
3/2/2011 15:55:22 1 G Phil Jr.
3/4/2011 14:57:45 1 H Ted
3/7/2011 14:11:02 1 I Jim
3/9/2011 9:57:34 1 J Tim
8/18/2014 11:59:35 2 A Ted
8/18/2014 13:56:21 2 F Bill
5/21/2015 9:30:30 2 G Jim
6/5/2015 13:17:54 2 H Jim
6/5/2015 14:40:38 2 I Ted
6/9/2015 10:39:15 2 J Tom
1/16/2015 7:45:58 3 A Phil Jr.
1/16/2015 7:48:23 3 C Jim
3/6/2015 14:09:42 3 A Bill
3/11/2015 11:16:04 3 K Jim
My initial thought (from the following link) was to groupby the job column, fill in the missing dates for each group and then ffill the statuses down.
Pandas reindex dates in Groupby
I was able to make this work... kinda... if two statuses occurred on the same date, one would not be included in the output, and consequently some statuses were missing.
I then found the following; it supposedly handles the duplicate issue, but I am unable to get it to work with my data.
Efficiently re-indexing one level with "forward-fill" in a multi-index dataframe
Am I on the right path in thinking that filling in the missing dates and then ffilling the statuses down is the correct way to ultimately capture daily counts of individual statuses? Is there another method that might better use pandas features that I'm missing?
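If it helps to make that idea concrete, here is a rough, untested sketch of the "fill missing dates per job, then ffill the status" approach (it assumes the frame above is called df with Date / Time as a datetime index, and it resolves duplicate dates within a job by keeping the last status recorded that day):
import pandas as pd
df.index = pd.to_datetime(df.index)
pieces = []
for job, g in df.groupby('Job'):
    # one row per calendar day for this job, keeping the last status recorded that day
    daily_job = g.sort_index().resample('D').last().ffill()
    pieces.append(daily_job)
daily = pd.concat(pieces)
# daily count of each status across all jobs
counts = daily.groupby([daily.index, 'Status']).size()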
