Merging DataFrames with "uneven" data - python

Excuse the title, I'm not even sure how to label what I'm trying to do. I have data in a DataFrame that looks like this:
Name Month Status
---- ----- ------
Bob Jan Good
Bob Feb Good
Bob Mar Bad
Martha Feb Bad
John Jan Good
John Mar Bad
Not every 'Name' has every 'Month', so some 'Name'/'Month' combinations have no 'Status'. What I want to get is:
Name Month Status
---- ----- ------
Bob Jan Good
Bob Feb Good
Bob Mar Bad
Martha Jan N/A
Martha Feb Bad
Martha Mar N/A
John Jan Good
John Feb N/A
John Mar Bad
Where the missing months are filled in with a placeholder value (N/A) in the 'Status' column.
What I've tried so far is to export all of the unique 'Month' values to a list, convert it to a DataFrame, then join/merge the two DataFrames, but I can't get anything to work.
What is the best way to do this?

You can take advantage of pandas' indexing to reshape the data:
Step 1: create a new index from the unique values of the Name and Month columns:
new_index = pd.MultiIndex.from_product(
    (df.Name.unique(), df.Month.unique()), names=["Name", "Month"]
)
Step 2: set Name and Month as the index, reindex with new_index, and call reset_index to get your final output:
df.set_index(["Name", "Month"]).reindex(new_index).reset_index()
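Put together, the two steps above can be sketched as a self-contained script. The sample frame is rebuilt from the question; passing fill_value="N/A" to reindex is an addition here (not part of the answer) so that the missing rows show the placeholder from the desired output instead of NaN.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Name": ["Bob", "Bob", "Bob", "Martha", "John", "John"],
    "Month": ["Jan", "Feb", "Mar", "Feb", "Jan", "Mar"],
    "Status": ["Good", "Good", "Bad", "Bad", "Good", "Bad"],
})

# Step 1: every Name paired with every Month, in order of first appearance
new_index = pd.MultiIndex.from_product(
    (df["Name"].unique(), df["Month"].unique()), names=["Name", "Month"]
)

# Step 2: reindex onto the full grid; fill_value is an assumption added here
result = (
    df.set_index(["Name", "Month"])
    .reindex(new_index, fill_value="N/A")
    .reset_index()
)
print(result)
```

Because unique() preserves the order of first appearance, the rows come out in the same Name and Month order as the input rather than sorted.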
UPDATE 2021/01/08:
You can use the complete function from pyjanitor; at the moment you have to install the latest development version from GitHub:
# install latest dev version
# pip install git+https://github.com/ericmjl/pyjanitor.git
import janitor  # the pyjanitor package is imported under the name "janitor"
df.complete("Name", "Month")

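You can treat the month as a categorical column, then let GroupBy do the heavy lifting:
df['Month'] = pd.Categorical(df['Month'])
# observed=False keeps unobserved categories; newer pandas versions may not do this by default
df.groupby(['Name', 'Month'], as_index=False, observed=False).first()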
Name Month Status
0 Bob Feb Good
1 Bob Jan Good
2 Bob Mar Bad
3 John Feb NaN
4 John Jan Good
5 John Mar Bad
6 Martha Feb Bad
7 Martha Jan NaN
8 Martha Mar NaN
The secret sauce here is that pandas treats missing "categories" by inserting a NaN there.
Caveat: This always sorts your data.
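If the sorting is the only concern, one possible workaround is an ordered categorical, so that the sort follows calendar order rather than alphabetical order. This is a sketch under assumptions: the explicit category list ["Jan", "Feb", "Mar"] and the use of observed=False are additions here, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Bob", "Bob", "Bob", "Martha", "John", "John"],
    "Month": ["Jan", "Feb", "Mar", "Feb", "Jan", "Mar"],
    "Status": ["Good", "Good", "Bad", "Bad", "Good", "Bad"],
})

# Assumption: the full, ordered list of months is known up front
df["Month"] = pd.Categorical(
    df["Month"], categories=["Jan", "Feb", "Mar"], ordered=True
)

# observed=False asks groupby to emit every category, including unseen ones
result = (
    df.groupby(["Name", "Month"], observed=False)["Status"]
    .first()
    .reset_index()
)
print(result)
```

Names are still sorted alphabetically (Bob, John, Martha), but within each name the months now appear as Jan, Feb, Mar.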

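Use pivot, then stack without dropping the missing combinations:
# keyword arguments are used because newer pandas versions no longer accept positional ones in pivot
df=df.pivot(index='Name', columns='Month', values='Status').stack(dropna=False).to_frame('Status').reset_index()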
Name Month Status
0 Bob Feb Good
1 Bob Jan Good
2 Bob Mar Bad
3 John Feb NaN
4 John Jan Good
5 John Mar Bad
6 Martha Feb Bad
7 Martha Jan NaN
8 Martha Mar NaN

Related

How to fill missing values in a dataframe based on group value counts?

I have a pandas DataFrame with 2 columns: Year(int) and Condition(string). In column Condition I have a nan value and I want to replace it based on information from groupby operation.
import pandas as pd
import numpy as np
year = [2015, 2016, 2017, 2016, 2016, 2017, 2015, 2016, 2015, 2015]
cond = ["good", "good", "excellent", "good", 'excellent','excellent', np.nan, 'good','excellent', 'good']
X = pd.DataFrame({'year': year, 'condition': cond})
stat = X.groupby('year')['condition'].value_counts()
It gives:
print(X)
year condition
0 2015 good
1 2016 good
2 2017 excellent
3 2016 good
4 2016 excellent
5 2017 excellent
6 2015 NaN
7 2016 good
8 2015 excellent
9 2015 good
print(stat)
year condition
2015 good 2
excellent 1
2016 good 3
excellent 1
2017 excellent 2
The NaN value in row 6 belongs to year 2015, and stat shows that the most frequent condition for 2015 is 'good', so I want to replace this NaN with 'good'.
I have tried fillna and the .transform method, but it does not work :(
I would be grateful for any help.
I did a little extra transformation to get stat as a dictionary mapping the year to its highest frequency name (credit to this answer):
In[0]:
fill_dict = stat.unstack().idxmax(axis=1).to_dict()
fill_dict
Out[0]:
{2015: 'good', 2016: 'good', 2017: 'excellent'}
Then use fillna with map based on this dictionary (credit to this answer):
In[0]:
X['condition'] = X['condition'].fillna(X['year'].map(fill_dict))
X
Out[0]:
year condition
0 2015 good
1 2016 good
2 2017 excellent
3 2016 good
4 2016 excellent
5 2017 excellent
6 2015 good
7 2016 good
8 2015 excellent
9 2015 good
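Since the question mentions trying .transform, here is a minimal sketch of that route as an alternative to the dictionary. It assumes every year has at least one non-missing condition (otherwise mode() is empty and .iloc[0] raises); the name most_common is made up for illustration.

```python
import numpy as np
import pandas as pd

year = [2015, 2016, 2017, 2016, 2016, 2017, 2015, 2016, 2015, 2015]
cond = ["good", "good", "excellent", "good", "excellent", "excellent",
        np.nan, "good", "excellent", "good"]
X = pd.DataFrame({"year": year, "condition": cond})

# For each row, the most frequent condition within that row's year.
# mode() ignores NaN; ties are resolved by taking the first value it returns.
most_common = X.groupby("year")["condition"].transform(lambda s: s.mode().iloc[0])

# Only the missing entries are replaced
X["condition"] = X["condition"].fillna(most_common)
print(X)
```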

How to filter certain values in consecutive months?

I have a dataframe structured as follows:
Name Month Grade
Sue Jan D
Sue Feb D
Jason Mar B
Sue Mar D
Jason Jan B
Sue Apr A
Jason Feb C
I want to get the list of students who got a D in 3 consecutive months within the past 6 months. In the example above, Sue will be on the list since she got a D in Jan, Feb and Mar. How can I do that using Python, pandas or NumPy?
I tried to solve your problem. I do have a solution for you but it may not be the fastest in terms of efficiency / code execution. Please see below:
newdf = df.pivot(index='Name', columns='Month', values='Grade')
newdf = newdf[['Jan', 'Feb', 'Mar', 'Apr']].fillna(-1)
newdf['concatenated'] = newdf['Jan'].astype('str') + newdf['Feb'].astype('str') + newdf['Mar'].astype('str') + newdf['Apr'].astype('str')
newdf[newdf['concatenated'].str.contains('DDD', regex=False, na=False)]
Output will be like:
Month Jan Feb Mar Apr concatenated
Name
Sue D D D A DDDA
If you just want the names, then use the following command instead.
newdf[newdf['concatenated'].str.contains('DDD', regex=False, na=False)].index.to_list()
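I came up with this.
df['Month_Nr'] = pd.to_datetime(df.Month, format='%b').dt.month
names = df.Name.unique()
students = np.array([])
for name in names:
    # rows where this student got a D, in calendar order
    d_rows = df[(df.Name == name) & (df.Grade == 'D')].sort_values('Month_Nr')
    # summed month gaps = distance between the first and last D month;
    # this does not verify that every month in between is also a D
    if d_rows['Month_Nr'].diff().cumsum().max() >= 2:
        students = np.append(students, name)
print(students)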
Output:
['Sue']
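A stricter check for truly consecutive months can be sketched by labelling runs of adjacent D months and measuring the longest run. This is an illustrative variant, not the answerer's code: the helper has_d_streak is a made-up name, and it assumes one grade per student per month with all months in the same calendar year.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Sue", "Sue", "Jason", "Sue", "Jason", "Sue", "Jason"],
    "Month": ["Jan", "Feb", "Mar", "Mar", "Jan", "Apr", "Feb"],
    "Grade": ["D", "D", "B", "D", "B", "A", "C"],
})
df["Month_Nr"] = pd.to_datetime(df["Month"], format="%b").dt.month


def has_d_streak(group, length=3):
    """True if the group has at least `length` D grades in adjacent months."""
    months = group.loc[group["Grade"] == "D", "Month_Nr"].sort_values()
    if months.empty:
        return False
    # A new run starts whenever the gap to the previous D month is not exactly 1
    run_id = months.diff().ne(1).cumsum()
    return bool(run_id.value_counts().max() >= length)


students = [name for name, group in df.groupby("Name") if has_d_streak(group)]
print(students)
```

With this version, a student with D grades in Jan, Mar and Apr would not be listed, because the longest run of adjacent months is only two.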
You have a few ways to deal with this. The first is to use my previous solution, but this requires mapping months to academic-year numbers (i.e. September = 1, August = 12) so that you can apply arithmetic to work out consecutive values.
The following instead converts the Month into a datetime and works out the difference in months; we can then apply a cumulative sum and filter for counts of 3 or more.
from io import StringIO

import numpy as np
import pandas as pd

d = StringIO("""Name Month Grade
Sue Jan D
Sue Feb D
Jason Mar B
Sue Dec D
Jason Jan B
Sue Apr A
Jason Feb C""")
df = pd.read_csv(d, sep=r'\s+')
df['date'] = pd.to_datetime(df['Month'], format='%b').dt.normalize()
# set any values greater than June to the previous year.
df['date'] = np.where(df['date'].dt.month > 6,
                      df['date'] - pd.DateOffset(years=1), df['date'])
df.sort_values(['Name', 'date'], inplace=True)
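def month_diff(date):
    # Absolute month number, so adjacent months differ by exactly 1.
    # This avoids dividing by np.timedelta64(1, "M"), which newer pandas versions may reject.
    months = date.dt.year * 12 + date.dt.month
    cumulative_months = months.diff().eq(1).cumsum() + 1
    return cumulative_months
# transform keeps the original row index, so the result aligns with df
df['count'] = df.groupby(["Name", "Grade"])["date"].transform(month_diff)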
print(df.drop('date',axis=1))
Name Month Grade count
4 Jason Jan B 1
6 Jason Feb C 1
2 Jason Mar B 1
3 Sue Dec D 1
0 Sue Jan D 2
1 Sue Feb D 3
5 Sue Apr A 1
print(df.loc[df['Name'] == 'Sue'])
Name Month Grade date count
3 Sue Dec D 1899-12-01 1
0 Sue Jan D 1900-01-01 2
1 Sue Feb D 1900-02-01 3
5 Sue Apr A 1900-04-01 1

How do I groupby two columns and create a loop to subplots?

I have a large dataframe (df) with this structure:
year person purchase
2016 Peter 0
2016 Peter 223820
2016 Peter 0
2017 Peter 261740
2017 Peter 339987
2018 Peter 200000
2016 Carol 256400
2017 Carol 33083820
2017 Carol 154711
2018 Carol 3401000
2016 Frank 824043
2017 Frank 300000
2018 Frank 214416259
2018 Frank 4268825
2018 Frank 463080
2016 Rita 0
To see how much each person spent per year I do groupby year and person, which gives me what I want.
code:
df1 = df.groupby(['person','year']).sum().reset_index()
How do I create a loop to create subplots for each person containing what he/she spent on purchase each year?
So a subplot for each person where x = year and y = purchase.
I've tried a lot of different things explained here but none seems to work.
Thanks!
You can either do pivot_table or groupby().sum().unstack('person') and then plot:
(df.pivot_table(index='year',
                columns='person',
                values='purchase',
                aggfunc='sum')
   .plot(subplots=True)
);
Or
(df.groupby(['person','year'])['purchase']
   .sum()
   .unstack('person')
   .plot(subplots=True)
);
Output: a figure with one subplot per person (year on the x-axis, summed purchase on the y-axis).
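Since the question asks for an explicit loop, the same plot can also be sketched with matplotlib directly. This is one possible layout rather than the answerer's method: the Agg backend, the single-column grid, and the names totals and people are assumptions made for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    "year": [2016, 2016, 2016, 2017, 2017, 2018, 2016, 2017, 2017, 2018,
             2016, 2017, 2018, 2018, 2018, 2016],
    "person": ["Peter"] * 6 + ["Carol"] * 4 + ["Frank"] * 5 + ["Rita"],
    "purchase": [0, 223820, 0, 261740, 339987, 200000, 256400, 33083820,
                 154711, 3401000, 824043, 300000, 214416259, 4268825, 463080, 0],
})

# Total spent per person per year, indexed by (person, year)
totals = df.groupby(["person", "year"])["purchase"].sum()
people = totals.index.get_level_values("person").unique()

# One row of the figure per person, sharing the year axis
fig, axes = plt.subplots(len(people), 1, sharex=True, squeeze=False)
for ax, person in zip(axes.ravel(), people):
    totals.loc[person].plot(ax=ax, marker="o", title=person)
    ax.set_ylabel("purchase")
fig.tight_layout()
```

In an interactive session you would drop the matplotlib.use("Agg") line and finish with plt.show(); the marker makes single-year entries such as Rita's visible.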

Python - Extract multiple values from string in pandas df

I've searched for an answer for the following question but haven't found the answer yet. I have a large dataset like this small example:
df =
A B
1 I bought 3 apples in 2013
3 I went to the store in 2020 and got milk
1 In 2015 and 2019 I went on holiday to Spain
2 When I was 17, in 2014 I got a new car
3 I got my present in 2018 and it broke down in 2019
What I would like is to extract all the values of > 1950 and have this as an end result:
A B C
1 I bought 3 apples in 2013 2013
3 I went to the store in 2020 and got milk 2020
1 In 2015 and 2019 I went on holiday to Spain 2015_2019
2 When I was 17, in 2014 I got a new car 2014
3 I got my present in 2018 and it broke down in 2019 2018_2019
I tried to extract values first, but didn't get further than:
df["C"] = df["B"].str.extract('(\d+)').astype(int)
df["C"] = df["B"].apply(lambda x: re.search(r'\d+', x).group())
But all I get are error messages (I've only started python and working with texts a few weeks ago..). Could someone help me?
With single regex pattern (considering your comment "need the year it took place"):
In [268]: pat = re.compile(r'\b(19(?:[6-9]\d|5[1-9])|[2-9]\d{3})\b')
In [269]: df['C'] = df['B'].apply(lambda x: '_'.join(pat.findall(x)))
In [270]: df
Out[270]:
A B C
0 1 I bought 3 apples in 2013 2013
1 3 I went to the store in 2020 and got milk 2020
2 1 In 2015 and 2019 I went on holiday to Spain 2015_2019
3 2 When I was 17, in 2014 I got a new car 2014
4 3 I got my present in 2018 and it broke down in ... 2018_2019
Here's one way using str.findall and joining those items from the resulting lists that are greater than 1950:
s = df["B"].str.findall(r'\d+')
df['C'] = s.apply(lambda x: '_'.join(i for i in x if int(i) > 1950))
A B C
0 1 I bought 3 apples in 2013 2013
1 3 I went to the store in 2020 and got milk 2020
2 1 In 2015 and 2019 I went on holiday to Spain 2015_2019
3 2 When I was 17, in 2014 I got a new car 2014
4 3 I got my present in 2018 and it broke down in ... 2018_2019
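The two ideas can also be combined without apply, using str.findall with a non-capturing pattern followed by str.join. This is a sketch, not either answerer's code: the pattern assumes that a "year" means a standalone four-digit number from 1951 to 9999.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 3, 1, 2, 3],
    "B": [
        "I bought 3 apples in 2013",
        "I went to the store in 2020 and got milk",
        "In 2015 and 2019 I went on holiday to Spain",
        "When I was 17, in 2014 I got a new car",
        "I got my present in 2018 and it broke down in 2019",
    ],
})

# 1951-1999 or 2000-9999, bounded on both sides so longer numbers are skipped
year_pattern = r"\b(?:19(?:5[1-9]|[6-9]\d)|[2-9]\d{3})\b"

# findall yields a list of matches per row; str.join glues each list together
df["C"] = df["B"].str.findall(year_pattern).str.join("_")
print(df["C"].tolist())
```

Rows without any matching year end up with an empty string in column C.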

How to save split data in pandas in reverse order?

You can use this to create the dataframe:
dataframe = pd.DataFrame({'release' : ['7 June 2013', '2012', '31 January 2013',
                                       'February 2008', '17 June 2014', '2013']})
I am trying to split the data and save it into 3 columns named "day", "month" and "year", using this command:
dataframe[['day','month','year']] = dataframe['release'].str.rsplit(expand=True)
This works perfectly when a row has 3 strings, but whenever a row has fewer than 3 strings, the data is saved in the wrong place.
I have tried split and rsplit; both give the same result.
Is there any way to get the data into the right place?
The last part is the year, and it is present in every row, so it should be saved first; then the month, if it is present, otherwise nothing; and the day should be stored the same way.
You could reverse the split parts of each row before expanding them:
In [17]: dataframe[['year', 'month', 'day']] = dataframe['release'].apply(
             lambda x: pd.Series(x.split()[::-1]))
In [18]: dataframe
Out[18]:
release year month day
0 7 June 2013 2013 June 7
1 2012 2012 NaN NaN
2 31 January 2013 2013 January 31
3 February 2008 2008 February NaN
4 17 June 2014 2014 June 17
5 2013 2013 NaN NaN
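Try reversing each split list before expanding it (a DataFrame has no .reverse() method, and reversing the expanded columns would leave the year in the wrong column for short rows).
dataframe[['year','month','day']] = pd.DataFrame(dataframe['release'].str.split().str[::-1].tolist(), index=dataframe.index)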
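A different technique from the ones above, shown only as a sketch, is to describe the layout with a regular expression and let str.extract name the columns. The pattern is an assumption about the data: an optional one- or two-digit day, an optional month word, then a mandatory four-digit year, separated by single spaces.

```python
import pandas as pd

dataframe = pd.DataFrame({"release": ["7 June 2013", "2012", "31 January 2013",
                                      "February 2008", "17 June 2014", "2013"]})

# Named groups become column names; optional parts that are absent become NaN
pattern = r"^(?:(?P<day>\d{1,2}) )?(?:(?P<month>[A-Za-z]+) )?(?P<year>\d{4})$"
parts = dataframe["release"].str.extract(pattern)

dataframe = dataframe.join(parts)
print(dataframe)
```

Rows that do not fit the pattern at all would come back with NaN in all three columns, which makes unexpected formats easy to spot.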
