Reshaping Pandas DataFrame with Repeated Column Index - python

Suppose I have the following DataFrame:
>>> cols = ['model', 'parameter', 'condition', 'value']
>>> df = pd.DataFrame([['BMW', '0-60', 'rain', '7'],
...                    ['BMW', '0-60', 'sun', '7'],
...                    ['BMW', 'mpg', 'rain', '25'],
...                    ['BMW', 'stars', 'rain', '5'],
...                    ['Toyota', '0-60', 'rain', '9'],
...                    ['Toyota', 'mpg', 'rain', '40'],
...                    ['Toyota', 'stars', 'rain', '4']], columns=cols)
>>> df
    model parameter condition value
0     BMW      0-60      rain     7
1     BMW      0-60       sun     7
2     BMW       mpg      rain    25
3     BMW     stars      rain     5
4  Toyota      0-60      rain     9
5  Toyota       mpg      rain    40
6  Toyota     stars      rain     4
This is a list of performance metrics for various cars under different conditions. It's a made-up data set, of course, but it's representative of my problem.
What I ultimately want is to have each observation for a given condition on its own row, and each metric in its own column. That would look something like this:
  parameter condition 0-60  mpg stars
  model
0       BMW      rain    7   25     5
1       BMW       sun    7  NaN   NaN
2    Toyota      rain    9   40     4
Note that I just made up the format above. I don't know if Pandas would generate something exactly like that, but that's the general idea. I would also of course transform the "condition" into a Boolean array and fill in the NaNs.
My problem is that when I try to use the pivot method I get an error. I think this is because my "column" key is repeated (because I have BMW 0-60 stats for the rain and for the sun conditions).
df.pivot(index='model',columns='parameter')
ValueError: Index contains duplicate entries, cannot reshape
Does anyone know of a slick way to do this? I'm finding a lot of these Pandas reshaping methods to be quite obtuse.

You can just change the index and unstack it...
df.set_index(['model', 'condition', 'parameter']).unstack()
returns
                 value
parameter         0-60  mpg stars
model  condition
BMW    rain          7   25     5
       sun           7  NaN   NaN
Toyota rain          9   40     4
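If you don't want the extra 'value' level left on the columns, one small follow-up (a sketch based on the snippet above; the out name is just for illustration) is to select that level and reset the index:
out = df.set_index(['model', 'condition', 'parameter']).unstack()
out = out['value'].reset_index()  # columns become: model, condition, 0-60, mpg, stars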

You can get the result you want using pivot_table and passing the following parameters:
>>> df.pivot_table(index=['model', 'condition'], values='value', columns='parameter')
parameter         0-60  mpg stars
model  condition
BMW    rain          7   25     5
       sun           7  NaN   NaN
Toyota rain          9   40     4
(You may need to ensure the "value" column has a numeric type first; alternatively, you can pass aggfunc=lambda x: x to pivot_table to get around that requirement.)
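For example, one way to make the column numeric first (a minimal sketch, assuming the df from the question, where the values were entered as strings):
df['value'] = pd.to_numeric(df['value'])
df.pivot_table(index=['model', 'condition'], values='value', columns='parameter')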

Related

Splitting series of Strings into Dataframe

I have a big series of strings that I want to split into a dataframe.
The series looks like this:
s = pd.Series({"1":"Name=Marc-Age=48-Car=Tesla",
"2":"Name=Ben-Job=Pilot-Car=Porsche",
"3":"Name=Tom-Age=24-Car=Ford"})
I want to split this into a dataframe looking like this:
   Name  Age    Job      Car
1  Marc   48    NaN    Tesla
2   Ben  NaN  Pilot  Porsche
3   Tom   24    NaN     Ford
I tried to split the strings first by "-" and then by "=", but I don't understand how to continue after that:
df = s.str.split("-", expand=True)
for col in df.columns:
    df[col] = df[col].str.split("=")
I get this:
                   0                 1                   2
1   ['Name', 'Marc']     ['Age', '48']    ['Car', 'Tesla']
2    ['Name', 'Ben']  ['Job', 'Pilot']  ['Car', 'Porsche']
3    ['Name', 'Tom']     ['Age', '24']     ['Car', 'Ford']
I don't know how to continue from here. I can't loop through the rows because my dataset is really big.
Can anyone help on how to go on from here?
If you split, then explode and split again, you can then use a pivot.
import pandas as pd
s = pd.Series({"1":"Name=Marc-Age=48-Car=Tesla",
"2":"Name=Ben-Job=Pilot-Car=Porsche",
"3":"Name=Tom-Age=24-Car=Ford"})
s = s.str.split('-').explode().str.split('=', expand=True).reset_index()
s = s.pivot(index='index', columns=0, values=1).reset_index(drop=True)
Output
   Age      Car    Job  Name
0   48    Tesla    NaN  Marc
1  NaN  Porsche  Pilot   Ben
2   24     Ford    NaN   Tom
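If you prefer to skip the pivot entirely, an equivalent sketch is to build a dictionary per row and let the DataFrame constructor align the columns (using the same s as above):
rows = [dict(item.split('=') for item in text.split('-')) for text in s]
df = pd.DataFrame(rows, index=s.index)  # columns: Name, Age, Job, Car; missing entries become NaN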

Use Groupby to Calculate Average if Date < X

I am trying to use a data frame that includes historical game statistics like the below Df1, and build a second data frame that shows what the various column averages were going into each game (as I show in Df2). How can I use groupby or something else to find the various averages for each team, but only for games with a date prior to the date in that specific row? Example of the historical games data:
Df1 =
       Date    Team   Opponent  Points  Points Against  1st Downs  Win?
    4/16/20  Eagles     Ravens      10              20         10     0
    2/10/20  Eagles    Falcons      30              40          8     0
   12/15/19  Eagles  Cardinals      40              10          7     1
   11/15/19  Eagles     Giants      20              15          5     1
   10/12/19    Jets     Giants      10              18          2     1
Below is the dataframe I'm trying to create. As you can see, it shows the averages for each column, but only for the games that happened prior to each game. Note: this is a simplified example of a much larger data set that I'm working with. In case the context helps, I'm trying to create this dataframe so I can analyze the correlation between the averages and whether the team won.
Df2 =
       Date    Team   Opponent  Avg Pts  Avg Pts Against  Avg 1st Downs  Win %
    4/16/20  Eagles     Ravens     25.0             21.3            7.5    75%
    2/10/20  Eagles    Falcons     30.0             12.0            6.0   100%
   12/15/19  Eagles  Cardinals     20.0             15.0            5.0   100%
   11/15/19  Eagles     Giants      NaN              NaN            NaN    NaN
   10/12/19    Jets     Giants      NaN              NaN            NaN    NaN
Let me know if anything above isn't clear, appreciate the help.
The easiest way is to turn your DataFrame into a time series by making the date the index.
When reading from a file, you can do it like this:
data = pd.read_csv(r'C:\Users\...csv', index_col='Date', parse_dates=True)
This is an example with a CSV file.
You can then select every row up to a given date with a slice:
data[:'the date you want all the dates before']
If you want to build a Series with a time index directly:
index = pd.DatetimeIndex(['2014-07-04', ..., '2015-08-04'])
data = pd.Series([0, 1, 2, 3], index=index)
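For example, a minimal sketch built from a few of the rows in Df1 above (column names and values taken from the question):
import pandas as pd

df = pd.DataFrame({'Team': ['Eagles', 'Eagles', 'Eagles'],
                   'Points': [40, 30, 10]},
                  index=pd.to_datetime(['12/15/19', '2/10/20', '4/16/20'])).sort_index()

prior_games = df[:'2020-02-09']          # every game strictly before 2/10/20
print(prior_games['Points'].mean())      # 40.0, the average points over those earlier games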
Or define your own aggregation function:
import numpy as np

def aggs_under_date(df, date):
    first_team = df['Team'].iloc[0]
    first_opponent = df['Opponent'].iloc[0]
    if df['Date'].iloc[0] <= date:
        avg_points = df['Points'].mean()
        avg_against = df['Points Against'].mean()
        avg_downs = df['1st Downs'].mean()
        win_perc = f"{df['Win?'].sum() / df['Win?'].count() * 100} %"
        return [first_team, first_opponent, avg_points, avg_against, avg_downs, win_perc]
    else:
        return [first_team, first_opponent, np.nan, np.nan, np.nan, np.nan]
And do the groupby, applying the function you just defined:
date_max = pd.to_datetime('11/15/19')
Df1.groupby(['Date']).apply(aggs_under_date, date_max)
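Alternatively, if the goal is the running pre-game averages per team shown in Df2, a possible sketch (not the OP's code; it assumes Df1 is already loaded with exactly the columns shown above) uses a shifted expanding mean within each team:
import pandas as pd

df = Df1.copy()
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['Team', 'Date'])

stat_cols = ['Points', 'Points Against', '1st Downs', 'Win?']
# expanding mean over all earlier games for the same team; shift() excludes the current game
prior = df.groupby('Team')[stat_cols].transform(lambda x: x.expanding().mean().shift())
prior.columns = ['Avg Pts', 'Avg Pts Against', 'Avg 1st Downs', 'Win %']  # Win % is a fraction here
result = pd.concat([df[['Date', 'Team', 'Opponent']], prior], axis=1)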

Pandas group and join

I am new to pandas. I want to analyse the following case. Say a fruit market publishes the prices of its fruits daily between 18:00 and 22:00, updating the list every half hour within that window. Consider the prices the market gives at 18:00:
Fruit Price
Apple 10
Banana 20
Half an hour later, at 18:30, the list has been updated as follows:
Fruit Price
Apple 10
Banana 21
Orange 30
Grapes 25
Pineapple 65
I want to check whether the prices of the fruits have changed between the recent list [18:30] and the earlier one [18:00].
Here I want to get the result as,
Fruit 18:00 18:30
Banana 20 21
To solve this I am thinking of doing the following:
1) Add a time column to the two data frames.
2) Merge the tables into one.
3) Make a pivot table with the fruit name as the index and ['Time', 'Price'] as the columns.
I don't know how to intersect the two data frames grouped by time, i.e. how to get the common rows of the two DataFrames.
You don't need to pivot in this case; we can simply use merge with the suffixes argument to get the desired result:
df_update = pd.merge(df, df2, on='Fruit', how='outer', suffixes=['_1800h', '_1830h'])
       Fruit  Price_1800h  Price_1830h
0      Apple         10.0         10.0
1     Banana         20.0         21.0
2     Orange          NaN         30.0
3     Grapes          NaN         25.0
4  Pineapple          NaN         65.0
Edit
Why are we using the outer argument? Because we want to keep all the new data that appears in df2. If we use inner, for example, we will not get the newly added fruits, as shown below (unless that is the output the OP wants, which is not clear in this case).
df_update = pd.merge(df, df2, on='Fruit', how='inner', suffixes=['_1800h', '_1830h'])
    Fruit  Price_1800h  Price_1830h
0   Apple           10         10.0
1  Banana           20         21.0
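To then reduce the outer-merge result to only the fruits whose price actually changed (the table the question asks for), a possible follow-up on df_update from above:
changed = df_update[df_update['Price_1800h'].notna()
                    & df_update['Price_1830h'].notna()
                    & (df_update['Price_1800h'] != df_update['Price_1830h'])]
print(changed)  # only the Banana row: 20.0 vs 21.0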
If Fruit is the index of your data frame, the following code should work. The idea is to return the rows where the two prices are unequal:
df['1800'] = df1['Price']
df['1830'] = df2['Price']
print(df.loc[df['1800'] != df['1830']])
You can also use datetime in your column heading.

For any index return the minimum of the previous two index

population = pd.DataFrame({'village': pd.Series([15,4,1,2], index=['boys','girls','men','women']),
'town': pd.Series([20,36,26,28], index=['boys','girls', 'men', 'women'])})
Output:
---- town village
boys 20 15
girls 36 4
men 26 1
women 28 2
For any index in the dataframe above, I want that particular index value to be the minimum value between the previous two index values.
For example, I expect the value for men in town to be 20, since it is the smaller of the two values (36, 20).
I tried implementing it using df.shift(2).cummin(axis=0) but that didn't work.
Expected_output:
---- town village
boys NaN NaN
girls NaN NaN
men 20 4
women 26 1
As #Zero said (and so you can mark this as answered), you can use:
population.shift(1).rolling(2).min()
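For reference, a quick check against the frame defined in the question (the first two labels have fewer than two predecessors, hence the NaNs):
result = population.shift(1).rolling(2).min()
print(result)
#        town  village     (column order may differ between pandas versions)
# boys    NaN      NaN
# girls   NaN      NaN
# men    20.0      4.0
# women  26.0      1.0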

Chained conditional count in Pandas

I have a dataframe that looks at how a form has been filled out. Here's an example:
ID    Name  Postcode        Street  Employer  Salary
 1    John       NaN   Craven Road       NaN     NaN
 2     Sue       TD2           NAN       NaN   15000
 3   Jimmy       MW6  Blake Street      Bank   40000
 4   Laura       QE2     Mill Lane       NaN   20000
 5     Sam       NW2   Duke Avenue     Farms   35000
 6  Jordan       SE6           NaN       NaN     NaN
 7     NaN       CB2           NaN   Startup     NaN
I want to return a count of successively filled out columns on the condition that all previous columns have been filled. The final output should look something like:
Name  Postcode  Street  Employer  Salary
   6         5       3         2       2
Is there a good Pandas way of doing this? I suppose there could be a way of applying a mask so that if any previous boolean is zero, the current column is also zero, and then counting that, but I'm not sure if that is the best way.
Thanks!
I think you can use notnull and cummin:
In [99]: df.notnull().cummin(axis=1).sum(axis=0)
Out[99]:
Name 6
Postcode 5
Street 3
Employer 2
Salary 2
dtype: int64
Although note that I had to replace your NAN (Sue's street) with a float NaN before I did that, and I assumed that ID was your index.
The cumulative minimum is one way to implement "applying a mask so that if any previous boolean is given as zero the current column is also zero", as you predicted would work.
Maybe cumprod. BTW, you have the string 'NAN' in your df, so I treat it as notnull here:
df.notnull().cumprod(1).sum()
Out[59]:
ID 7
Name 6
Postcode 5
Street 4
Employer 2
Salary 2
dtype: int64
