One-index columns to multi-index - python

I have a data frame whose headers look like this:
Time, Peter_Price, Peter_variable 1, Peter_variable 2, Maria_Price, Maria_variable 1, Maria_variable 3, John_price, ...
2017 12 985685466 Street 1 12 4984984984 Street 2
2018 10 985785466 Street 3 78 4984974184 Street 8
2019 12 985685466 Street 1 12 4984984984 Street 2
2020 12 985685466 Street 1 12 4984984984 Street 2
2021 12 985685466 Street 1 12 4984984984 Street 2
What would be the best multi-index to compare variables by group later, such as which person has the highest variable 3, or the trend of variable 3 across all people?
I think what I need is something like this, but I'm open to other suggestions (this is my first attempt with a multi-index).
Peter Maria John
Price, variable 1, variable 2, Price, variable 1, variable 3, Price,...
Time

You can try this:
Create data
import pandas as pd
import numpy as np
import itertools
people = ["Peter", "Maria"]
vars = ["Price", "variable 1", "variable 2"]
columns = ["_".join(x) for x in itertools.product(people, vars)]
df = (pd.DataFrame(np.random.rand(10, 6), columns=columns)
        .assign(time=np.arange(2012, 2022)))
print(df.head())
Peter_Price Peter_variable 1 Peter_variable 2 Maria_Price Maria_variable 1 Maria_variable 2 time
0 0.542336 0.201243 0.616050 0.313119 0.652847 0.928497 2012
1 0.587392 0.143169 0.594997 0.553803 0.249188 0.076633 2013
2 0.447318 0.410310 0.443391 0.947064 0.476262 0.230092 2014
3 0.285560 0.018005 0.869387 0.165836 0.399670 0.307120 2015
4 0.422084 0.414453 0.626180 0.658528 0.286265 0.404369 2016
Snippet to try
new_df = df.set_index("time")
new_df.columns = new_df.columns.str.split("_", expand=True)
print(new_df.head())
Peter Maria
Price variable 1 variable 2 Price variable 1 variable 2
time
2012 0.542336 0.201243 0.616050 0.313119 0.652847 0.928497
2013 0.587392 0.143169 0.594997 0.553803 0.249188 0.076633
2014 0.447318 0.410310 0.443391 0.947064 0.476262 0.230092
2015 0.285560 0.018005 0.869387 0.165836 0.399670 0.307120
2016 0.422084 0.414453 0.626180 0.658528 0.286265 0.404369
Then you can use the xs method to select specific variables for a per-person analysis. Subsetting to only "variable 2":
>>> new_df.xs("variable 2", level=1, axis=1)
Peter Maria
time
2012 0.616050 0.928497
2013 0.594997 0.076633
2014 0.443391 0.230092
2015 0.869387 0.307120
2016 0.626180 0.404369
2017 0.443827 0.544415
2018 0.425426 0.176707
2019 0.454269 0.414625
2020 0.863477 0.322609
2021 0.902759 0.821789
Example analysis: for each year, who has the higher "Price"?
>>> new_df.xs("Price", level=1, axis=1).idxmax(axis=1)
time
2012 Peter
2013 Peter
2014 Maria
2015 Peter
2016 Maria
2017 Peter
2018 Maria
2019 Peter
2020 Maria
2021 Peter
dtype: object
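The long format is also handy for the "trend of a variable by people" part of the question. A minimal sketch, assuming new_df built as above (here using "variable 1", which both sample people share):
long_df = new_df.stack(level=0).rename_axis(["time", "person"]).reset_index()
# long_df now has one row per (time, person) with columns Price, variable 1, variable 2
trend = long_df.pivot(index="time", columns="person", values="variable 1")
print(trend)  # each person's "variable 1" over time, ready to plot or compare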

Try:
df=df.set_index('Time')
df.columns = pd.MultiIndex.from_tuples([x.split('_') for x in df.columns])
Output:
Peter Maria
Price variable1 variable2 Price variable1 variable3
Time
2017 12 985685466 Street 1 12 4984984984 Street 2
2018 10 985785466 Street 3 78 4984974184 Street 8
2019 12 985685466 Street 1 12 4984984984 Street 2
2020 12 985685466 Street 1 12 4984984984 Street 2
2021 12 985685466 Street 1 12 4984984984 Street 2

Related

Pandas where function

I'm using the pandas where function to try to find the percentage for each state:
filter1 = df['state']=='California'
filter2 = df['state']=='Texas'
filter3 = df['state']=='Florida'
df['percentage']= df['total'].where(filter1)/df['total'].where(filter1).sum()
The output is
Year state total percentage
2014 California 914198.0 0.134925
2014 Florida 766441.0 NaN
2014 Texas 1045274.0 NaN
2015 California 874642.0 0.129087
2015 Florida 878760.0 NaN
How do I apply the other two filters there as well?
Don't use where; use groupby.transform:
df['percentage'] = df['total'].div(df.groupby('state')['total'].transform('sum'))
Output:
Year state total percentage
0 2014 California 914198.0 0.511056
1 2014 Florida 766441.0 0.465865
2 2014 Texas 1045274.0 1.000000
3 2015 California 874642.0 0.488944
4 2015 Florida 878760.0 0.534135
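For reference, a minimal self-contained sketch of the transform approach, using the values from the question's output (the column names are assumed):
import pandas as pd

df = pd.DataFrame({
    'Year': [2014, 2014, 2014, 2015, 2015],
    'state': ['California', 'Florida', 'Texas', 'California', 'Florida'],
    'total': [914198.0, 766441.0, 1045274.0, 874642.0, 878760.0],
})
# each row's share of its own state's total across all years
df['percentage'] = df['total'].div(df.groupby('state')['total'].transform('sum'))
print(df)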
You can also combine several filters with boolean operators, e.g. df.loc[filter1 | filter2 | filter3] (use | rather than & here, since each row can only match one state).

I want to filter rows from a data frame where the year is 2020 or 2021, using the re.search and re.match functions

Data Frame:
Unnamed: 0 date target insult tweet year
0 1 2014-10-09 thomas-frieden fool Can you believe this fool, Dr. Thomas Frieden ... 2014
1 2 2014-10-09 thomas-frieden DOPE Can you believe this fool, Dr. Thomas Frieden ... 2014
2 3 2015-06-16 politicians all talk and no action Big time in U.S. today - MAKE AMERICA GREAT AG... 2015
3 4 2015-06-24 ben-cardin It's politicians like Cardin that have destroy... Politician #SenatorCardin didn't like that I s... 2015
4 5 2015-06-24 neil-young total hypocrite For the nonbeliever, here is a photo of #Neily... 2015
I want a data frame consisting only of rows where the year is 2020 or 2021, using the search and match methods.
df_filtered = df.loc[df['year'].astype(str).str.contains('2020|2021', regex=True)]
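If re.search / re.match must be used specifically, here is a rough sketch (it assumes the year column may be numeric, hence the astype(str); the tiny frame below is only a stand-in for the real data):
import re
import pandas as pd

df = pd.DataFrame({'year': [2014, 2015, 2020, 2021]})  # stand-in for the question's frame
mask = df['year'].astype(str).apply(lambda y: re.search(r'2020|2021', y) is not None)
df_filtered = df[mask]
print(df_filtered)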

How to fill dataframe's empty/nan cell with conditional column mean

I am trying to fill the (pandas) dataframe's null/empty values using the mean of that specific column.
The data looks like this:
ID Name Industry Year Revenue
1 Treslam Financial Services 2009 $5,387,469
2 Rednimdox Construction 2013
3 Lamtone IT Services 2009 $11,757,018
4 Stripfind Financial Services 2010 $12,329,371
5 Openjocon Construction 2013 $4,273,207
6 Villadox Construction 2012 $1,097,353
7 Sumzoomit Construction 2010 $7,703,652
8 Abcddd Construction 2019
.
.
I am trying to fill those empty cells with the mean of the Revenue column where Industry == 'Construction'.
To get our numerical mean value I did:
df.groupby(['Industry'], as_index = False).mean()
I am trying to do something like this to fill up that empty cell in-place:
(df[df['Industry'] == "Construction"]['Revenue']).fillna("$21212121.01", inplace = True)
...but it is not working. Can anyone tell me how to achieve this? Thanks a lot.
Expected Output:
ID Name Industry Year Revenue
1 Treslam Financial Services 2009 $5,387,469
2 Rednimdox Construction 2013 $21212121.01
3 Lamtone IT Services 2009 $11,757,018
4 Stripfind Financial Services 2010 $12,329,371
5 Openjocon Construction 2013 $4,273,207
6 Villadox Construction 2012 $1,097,353
7 Sumzoomit Construction 2010 $7,703,652
8 Abcddd Construction 2019 $21212121.01
.
.
The fill values below differ from the placeholder used in the expected output because two types of averages are shown: the normal average (NaN excluded) and the average computed over all rows in the group, including those with NaN.
df['Revenue'] = df['Revenue'].replace({r'\$': '', ',': ''}, regex=True)
df['Revenue'] = df['Revenue'].astype(float)
df_mean = df.groupby(['Industry'], as_index = False)['Revenue'].mean()
df_mean
Industry Revenue
0 Construction 4.358071e+06
1 Financial Services 8.858420e+06
2 IT Services 1.175702e+07
df_mean_nan = df.groupby(['Industry'], as_index = False)['Revenue'].agg(Sum='sum', Size='size')
df_mean_nan['Mean_nan'] = df_mean_nan['Sum'] / df_mean_nan['Size']
df_mean_nan
Industry Sum Size Mean_nan
0 Construction 13074212.0 5.0 2614842.4
1 Financial Services 17716840.0 2.0 8858420.0
2 IT Services 11757018.0 1.0 11757018.0
Average taking into account the number of NaNs
df.loc[df['Revenue'].isna(),['Revenue']] = df_mean_nan.loc[df_mean_nan['Industry'] == 'Construction',['Mean_nan']].values
df
ID Name Industry Year Revenue
0 1 Treslam Financial Services 2009 5387469.0
1 2 Rednimdox Construction 2013 2614842.4
2 3 Lamtone IT Services 2009 11757018.0
3 4 Stripfind Financial Services 2010 12329371.0
4 5 Openjocon Construction 2013 4273207.0
5 6 Villadox Construction 2012 1097353.0
6 7 Sumzoomit Construction 2010 7703652.0
7 8 Abcddd Construction 2019 2614842.4
Normal average (NaN is excluded):
df.loc[df['Revenue'].isna(),['Revenue']] = df_mean.loc[df_mean['Industry'] == 'Construction',['Revenue']].values
df
ID Name Industry Year Revenue
0 1 Treslam Financial Services 2009 5.387469e+06
1 2 Rednimdox Construction 2013 4.358071e+06
2 3 Lamtone IT Services 2009 1.175702e+07
3 4 Stripfind Financial Services 2010 1.232937e+07
4 5 Openjocon Construction 2013 4.273207e+06
5 6 Villadox Construction 2012 1.097353e+06
6 7 Sumzoomit Construction 2010 7.703652e+06
7 8 Abcddd Construction 2019 4.358071e+06
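A more concise alternative, as a sketch (it assumes Revenue has already been cleaned to a numeric dtype as above): fill every NaN with its own industry's normal mean in one step.
# per-industry mean (NaN excluded), broadcast to each row, used only where Revenue is NaN
df['Revenue'] = df['Revenue'].fillna(df.groupby('Industry')['Revenue'].transform('mean'))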

Looking back at previous row in data-frame and selecting specific records

I have a dataframe df which looks like:
name year dept metric
0 Steve Jones 2018 A 0.703300236
1 Steve Jones 2019 A 0.255587222
2 Jane Smith 2018 A 0.502505934
3 Jane Smith 2019 B 0.698808749
4 Barry Evans 2019 B 0.941325241
5 Tony Edwards 2017 B 0.880940126
6 Tony Edwards 2018 B 0.649086123
7 Tony Edwards 2019 A 0.881365905
I would like to create two new data-frames: one containing the records where someone has moved from dept A to B, and another where someone has moved from dept B to A. Therefore my desired output is:
name year dept metric
0 Jane Smith 2018 A 0.502505934
1 Tony Edwards 2018 B 0.649086123
name year dept metric
0 Jane Smith 2019 B 0.698808749
1 Tony Edwards 2019 A 0.881365905
The records for the last year that someone is in their old dept are captured in one data-frame, and the first year in the new dept in the other. The records are sorted by name and year, so they are in the correct order.
I've tried :
for row in agg_data.rows:
df['match'] = np.where(df.dept == 'A' and df.dept.shift() =='B','1')
df['match'] = np.where(df.dept == 'B' and df.dept.shift() =='A','2')
and then select the records into a data-frame, but I can't get it to work.
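As an aside, the idea in that attempt can be made to work with vectorized boolean masks; a rough sketch, assuming the frame is df as in the snippet (element-wise Series comparisons need & rather than and, np.select needs an explicit default, and the shift should be done per name):
import numpy as np

prev_dept = df.groupby('name')['dept'].shift()   # previous dept for the same person
df['match'] = np.select(
    [(prev_dept == 'A') & (df['dept'] == 'B'),   # row where someone arrives in B from A
     (prev_dept == 'B') & (df['dept'] == 'A')],  # row where someone arrives in A from B
    ['1', '2'],
    default='')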
I believe you need:
df = df[df.groupby('name')['dept'].transform('nunique') > 1]
df = df.drop_duplicates(['name','dept'], keep='last')
df1 = df.drop_duplicates('name')
print (df1)
name year dept metric
2 Jane Smith 2018 A 0.502506
6 Tony Edwards 2018 B 0.649086
df2 = df.drop_duplicates('name', keep='last')
print (df2)
name year dept metric
3 Jane Smith 2019 B 0.698809
7 Tony Edwards 2019 A 0.881366
You could join the initial dataframe with a shift of itself to get consecutive rows on the same line. Then you filter for the departments you want, requiring the names to match; that gives the indices of one of the expected rows, and the other row is simply at the adjacent index. It gives:
df = agg_data.join(agg_data.shift(), rsuffix='_old')
df1 = df[(df.name_old == df.name) & (df.dept_old == 'A') & (df.dept == 'B')]
print(pd.concat([agg_data.loc[df1.index], agg_data.loc[df1.index - 1]]).sort_index())
df2 = df[(df.name_old == df.name) & (df.dept_old == 'B') & (df.dept == 'A')]
print(pd.concat([agg_data.loc[df2.index], agg_data.loc[df2.index - 1]]).sort_index())
with the following output:
name year dept metric
2 Jane Smith 2018 A 0.502506
3 Jane Smith 2019 B 0.698809
name year dept metric
6 Tony Edwards 2018 B 0.649086
7 Tony Edwards 2019 A 0.881366
I came up with a solution using drop_duplicates, groupby, and rank: df2 is built from the rows with rank == 2, and df1 from the rows with rank == 1 whose name also exists in df2.
df['rk'] = df.sort_values(['name', 'dept', 'year']).drop_duplicates(['name', 'dept'], keep='last').groupby('name').year.rank()
df2 = df[df.rk.eq(2)].drop(columns='rk')
df1 = df[df.rk.eq(1) & df.name.isin(df2.name)].drop(columns='rk')
df1:
name year dept metric
2 Jane Smith 2018 A 0.502506
6 Tony Edwards 2018 B 0.649086
df2:
name year dept metric
3 Jane Smith 2019 B 0.698809
7 Tony Edwards 2019 A 0.881366

In a pandas dataframe, count the number of times a condition occurs in one column?

Background
I have five years of NO2 measurement data in CSV files, one file for every location and year. I have loaded all the files into pandas dataframes in the same format:
Date Hour Location NO2_Level
0 01/01/2016 00 Street 18
1 01/01/2016 01 Street 39
2 01/01/2016 02 Street 129
3 01/01/2016 03 Street 76
4 01/01/2016 04 Street 40
Goal
For each dataframe count the number of times NO2_Level is greater than 150 and output this.
So I wrote a loop that creates all the dataframes from the right directories and cleans them appropriately.
Problem
Whatever I've tried produces results that I know on inspection are incorrect, e.g.:
- the count value for every location in a given year is the same (possible, but unlikely)
- for a year where I know the count should be positive, every location returns 0
What I've tried
I have tried a lot of approaches to getting this value for each dataframe, such as making the column a series:
NO2_Level = pd.Series(df['NO2_Level'])
count = (NO2_Level > 150).sum()
Using DataFrame.count():
count = df[df['NO2_Level'] >= 150].count()
These two approaches have gotten closest to what I want to output
Example to test on
data = {'Date': ['01/01/2016','01/02/2016',' 01/03/2016', '01/04/2016', '01/05/2016'], 'Hour': ['00', '01', '02', '03', '04'], 'Location': ['Street','Street','Street','Street','Street',], 'NO2_Level': [18, 39, 129, 76, 40]}
df = pd.DataFrame(data=data)
NO2_Level = pd.Series(df['NO2_Level'])
count = (NO2_Level > 150).sum()
count
Expected Outputs
So from this I'm trying to get it to output a single line for each dataframe that was made, in the format Location, year, count (of the condition):
Kirkstall Road,2013,47
Haslewood Close,2013,97
...
Jack Lane Hunslet,2015,158
So the above example would produce
Street, 2016, 1
Actual
Every year produces the same count for each location, and for some years (2014) the count doesn't seem to work at all, even though on inspection it should be positive:
Kirkstall Road,2013,47
Haslewood Close,2013,47
Tilbury Terrace,2013,47
Corn Exchange,2013,47
Temple Newsam,2014,0
Queen Street Morley,2014,0
Corn Exchange,2014,0
Tilbury Terrace,2014,0
Haslewood Close,2015,43
Tilbury Terrace,2015,43
Corn Exchange,2015,43
Jack Lane Hunslet,2015,43
Norman Rows,2015,43
Hopefully this helps.
import pandas as pd

ddict = {
    'Date': ['2016-01-01','2016-01-01','2016-01-01','2016-01-01','2016-01-01','2016-01-02'],
    'Hour': ['00','01','02','03','04','02'],
    'Location': ['Street','Street','Street','Street','Street','Street'],
    'N02_Level': [19, 39, 129, 76, 40, 151],
}
df = pd.DataFrame(ddict)
# Convert dates to datetime
df['Date'] = pd.to_datetime(df['Date'])
# Make a Year column
df['Year'] = df['Date'].apply(lambda x: x.strftime('%Y'))
# Group by location and year, count rows with N02_Level > 150
df1 = df[df['N02_Level'] > 150].groupby(['Location','Year']).size().reset_index(name='Count')
# Iterate the results
for i in range(len(df1)):
    loc = df1['Location'][i]
    yr = df1['Year'][i]
    cnt = df1['Count'][i]
    print(f'{loc},{yr},{cnt}')
### To avoid using f-strings
for i in range(len(df1)):
    print('{loc},{yr},{cnt}'.format(loc=df1['Location'][i], yr=df1['Year'][i], cnt=df1['Count'][i]))
Sample data:
Date Hour Location N02_Level
0 2016-01-01 00 Street 19
1 2016-01-01 01 Street 39
2 2016-01-01 02 Street 129
3 2016-01-01 03 Street 76
4 2016-01-01 04 Street 40
5 2016-01-02 02 Street 151
Output:
Street,2016,1
Here is a solution with a randomly generated sample:
import numpy as np
import pandas as pd

def random_dates(start, end, n):
    start_u = start.value // 10 ** 9
    end_u = end.value // 10 ** 9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

location = ['street', 'avenue', 'road', 'town', 'campaign']
df = pd.DataFrame({'Date': random_dates(pd.to_datetime('2015-01-01'), pd.to_datetime('2018-12-31'), 20),
                   'Location': np.random.choice(location, 20),
                   'NOE_level': np.random.randint(low=130, high=200, size=20)})
# Keep only the year of Date
df['Date'] = df['Date'].dt.strftime("%Y")
print(df)
df = df.groupby(['Location', 'Date'])['NOE_level'].apply(lambda x: (x > 150).sum()).reset_index(name='count')
print(df)
Example df generated:
Date Location NOE_level
0 2018 town 191
1 2017 campaign 187
2 2017 town 137
3 2016 avenue 148
4 2017 campaign 195
5 2018 town 181
6 2018 road 187
7 2018 town 184
8 2016 town 155
9 2016 street 183
10 2018 road 136
11 2017 road 171
12 2018 street 165
13 2015 avenue 193
14 2016 campaign 170
15 2016 street 132
16 2016 campaign 165
17 2015 road 161
18 2018 road 161
19 2015 road 140
output:
Location Date count
0 avenue 2015 1
1 avenue 2016 0
2 campaign 2016 2
3 campaign 2017 2
4 road 2015 1
5 road 2017 1
6 road 2018 2
7 street 2016 1
8 street 2018 1
9 town 2016 1
10 town 2017 0
11 town 2018 3
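Finally, since the goal is one Location, year, count line per dataframe, here is a hypothetical sketch of collecting the per-file results (dfs, the mapping of (location, year) to each loaded dataframe, is assumed to come from the loading loop mentioned in the question and is not shown here):
import pandas as pd

# dfs is assumed: {(location, year): dataframe} built by the loading loop
rows = [{'Location': loc, 'Year': year, 'Count': int((frame['NO2_Level'] > 150).sum())}
        for (loc, year), frame in dfs.items()]
summary = pd.DataFrame(rows)
print(summary.to_string(index=False))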
