I am trying to build a dataframe so that I can write it to a CSV easily; otherwise I have to do this process manually.
I'd like this to be my final output. Each person has a month and year combo that starts at 1/1/2014 and goes to 12/1/2016:
Name date
0 ben 1/1/2014
1 ben 2/1/2014
2 ben 3/1/2014
3 ben 4/1/2014
....
12 dan 1/1/2014
13 dan 2/1/2014
14 dan 3/1/2014
My code so far:
import pandas as pd

days = [1]
months = list(range(1, 13))
years = ['2014', '2015', '2016']
listof_people = ['ben','dan','nathan', 'gary', 'Mark', 'Sean', 'Tim', 'Chris']

df = pd.DataFrame({"Name": listof_people})
for month in months:
    df.append({'date': month}, ignore_index=True)
print(df)
When I try looping to create the dataframe, it either does nothing or I get index errors (because the lists have different lengths), and I'm at a loss.
I've done a good bit of searching and found the following similar questions, but I can't reverse-engineer the answers to fit my case.
Filling empty python dataframe using loops
How to build and fill pandas dataframe from for loop?
I don't want anyone to feel like they are "doing my homework", so if I'm derping on something simple, please let me know.
I think you can use product to get all the combinations, then to_datetime to build the date column:
import pandas as pd
from itertools import product

days = [1]
months = list(range(1, 13))
years = ['2014', '2015', '2016']
listof_people = ['ben','dan','nathan', 'gary', 'Mark', 'Sean', 'Tim', 'Chris']

# one row per (Name, month, day, year) combination
df1 = pd.DataFrame(list(product(listof_people, months, days, years)))
df1.columns = ['Name', 'month','day','year']
print(df1)
Name month day year
0 ben 1 1 2014
1 ben 1 1 2015
2 ben 1 1 2016
3 ben 2 1 2014
4 ben 2 1 2015
5 ben 2 1 2016
6 ben 3 1 2014
7 ben 3 1 2015
8 ben 3 1 2016
9 ben 4 1 2014
10 ben 4 1 2015
...
...
# assemble the three component columns into a single datetime column
df1['date'] = pd.to_datetime(df1[['month','day','year']])
df1 = df1[['Name','date']]
print(df1)
Name date
0 ben 2014-01-01
1 ben 2015-01-01
2 ben 2016-01-01
3 ben 2014-02-01
4 ben 2015-02-01
5 ben 2016-02-01
6 ben 2014-03-01
7 ben 2015-03-01
...
...
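Since the stated goal is a CSV in per-person, chronological order, here is a short follow-up sketch; the Categorical trick keeps the original name order rather than sorting alphabetically, and the output filename is a placeholder:
# keep the original person order (not alphabetical) while sorting dates within each person
df1['Name'] = pd.Categorical(df1['Name'], categories=listof_people, ordered=True)
df1 = df1.sort_values(['Name', 'date']).reset_index(drop=True)

# write the result out; 'people_dates.csv' is a placeholder filename
df1.to_csv('people_dates.csv', index=False)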
An alternative with MultiIndex.from_product, which also yields the per-person, chronological order the question asks for:
mux = pd.MultiIndex.from_product(
    [listof_people, years, months],
    names=['Name', 'Year', 'Month'])

df2 = pd.Series(
    1, mux, name='Day'
).reset_index().assign(
    date=lambda d: pd.to_datetime(d[['Year', 'Month', 'Day']])
)[['Name', 'date']]
print(df2)
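As a side note, on pandas 1.2+ you can build the same table without itertools at all, using pd.date_range with a cross merge; a minimal sketch, assuming a recent pandas:
import pandas as pd

listof_people = ['ben', 'dan', 'nathan', 'gary', 'Mark', 'Sean', 'Tim', 'Chris']

# 'MS' = month start, i.e. the 1st of every month from Jan 2014 to Dec 2016
dates = pd.date_range('2014-01-01', '2016-12-01', freq='MS')

df = pd.DataFrame({'Name': listof_people}).merge(
    pd.DataFrame({'date': dates}), how='cross')
print(df)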
I have a data frame whose headers look like this:
Time, Peter_Price, Peter_variable 1, Peter_variable 2, Maria_Price, Maria_variable 1, Maria_variable 3, John_price, ...
2017 12 985685466 Street 1 12 4984984984 Street 2
2018 10 985785466 Street 3 78 4984974184 Street 8
2019 12 985685466 Street 1 12 4984984984 Street 2
2020 12 985685466 Street 1 12 4984984984 Street 2
2021 12 985685466 Street 1 12 4984984984 Street 2
What would be the best multi-index so that I can later compare variables by group, e.g. which person has the highest variable 3, or the trend of variable 3 across people?
I think what I need is something like this, but I'm open to other suggestions (this is my first approach with a multi-index).
Peter Maria John
Price, variable 1, variable 2, Price, variable 1, variable 3, Price,...
Time
You can try this:
Create data
import pandas as pd
import numpy as np
import itertools
people = ["Peter", "Maria"]
vars = ["Price", "variable 1", "variable 2"]
columns = ["_".join(x) for x in itertools.product(people, vars)]
df = (pd.DataFrame(np.random.rand(10, 6), columns=columns)
        .assign(time=np.arange(2012, 2022)))
print(df.head())
Peter_Price Peter_variable 1 Peter_variable 2 Maria_Price Maria_variable 1 Maria_variable 2 time
0 0.542336 0.201243 0.616050 0.313119 0.652847 0.928497 2012
1 0.587392 0.143169 0.594997 0.553803 0.249188 0.076633 2013
2 0.447318 0.410310 0.443391 0.947064 0.476262 0.230092 2014
3 0.285560 0.018005 0.869387 0.165836 0.399670 0.307120 2015
4 0.422084 0.414453 0.626180 0.658528 0.286265 0.404369 2016
Snippet to try
new_df = df.set_index("time")
# split "Person_variable" column names into a two-level column MultiIndex
new_df.columns = new_df.columns.str.split("_", expand=True)
print(new_df.head())
Peter Maria
Price variable 1 variable 2 Price variable 1 variable 2
time
2012 0.542336 0.201243 0.616050 0.313119 0.652847 0.928497
2013 0.587392 0.143169 0.594997 0.553803 0.249188 0.076633
2014 0.447318 0.410310 0.443391 0.947064 0.476262 0.230092
2015 0.285560 0.018005 0.869387 0.165836 0.399670 0.307120
2016 0.422084 0.414453 0.626180 0.658528 0.286265 0.404369
Then you can use the xs method to sub-select specific variables for an individual-level analysis. Subsetting to only "variable 2":
>>> new_df.xs("variable 2", level=1, axis=1)
Peter Maria
time
2012 0.616050 0.928497
2013 0.594997 0.076633
2014 0.443391 0.230092
2015 0.869387 0.307120
2016 0.626180 0.404369
2017 0.443827 0.544415
2018 0.425426 0.176707
2019 0.454269 0.414625
2020 0.863477 0.322609
2021 0.902759 0.821789
Example analysis: for each year, who has the higher "Price"?
>>> new_df.xs("Price", level=1, axis=1).idxmax(axis=1)
time
2012 Peter
2013 Peter
2014 Maria
2015 Peter
2016 Maria
2017 Peter
2018 Maria
2019 Peter
2020 Maria
2021 Peter
dtype: object
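The question also asked about trends by person. A small sketch on the same random demo data (which has "variable 2" standing in for the question's "variable 3"):
# year-over-year relative change of "variable 2" for each person
trend = new_df.xs("variable 2", level=1, axis=1).pct_change()
print(trend.head())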
Try:
df = df.set_index('Time')
df.columns = pd.MultiIndex.from_tuples([x.split('_') for x in df.columns])
Output:
Peter Maria
     Price variable 1 variable 2  Price variable 1 variable 3
Time
2017 12 985685466 Street 1 12 4984984984 Street 2
2018 10 985785466 Street 3 78 4984974184 Street 8
2019 12 985685466 Street 1 12 4984984984 Street 2
2020 12 985685466 Street 1 12 4984984984 Street 2
2021 12 985685466 Street 1 12 4984984984 Street 2
I have a sample dataframe/table as below and I would like to do a simple pivot table in Python to calculate the % difference from the previous year.
DataFrame
Year Month Count Amount Retailer
2019 5 10 100 ABC
2019 3 8 80 XYZ
2020 3 8 80 ABC
2020 5 7 70 XYZ
...
Expected Output
MONTH %Diff
ABC 7 -0.2
XYZ 8 -0.125
EDIT: I would like to reiterate that I want to create the expected output table above, not to join the two tables.
It looks like you need a groupby, not a pivot:
gdf = df.groupby('Retailer')[['Amount']].pct_change()
Then rename and merge with the original df:
df = gdf.rename(columns={'Amount': '%Diff'}).dropna().merge(df, how='left', left_index=True, right_index=True)
   %Diff  Year  Month  Count  Amount Retailer
2 -0.200  2020      3      8      80      ABC
3 -0.125  2020      5      7      70      XYZ
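If you only need the new column, a sketch that assigns directly on the original df and skips the rename/merge; same pct_change logic, assuming rows are sorted by Year within each Retailer:
# % change from the previous year's Amount within each Retailer
df['%Diff'] = df.groupby('Retailer')['Amount'].pct_change()
print(df.dropna(subset=['%Diff']))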
I have a dataframe df which looks like:
name year dept metric
0 Steve Jones 2018 A 0.703300236
1 Steve Jones 2019 A 0.255587222
2 Jane Smith 2018 A 0.502505934
3 Jane Smith 2019 B 0.698808749
4 Barry Evans 2019 B 0.941325241
5 Tony Edwards 2017 B 0.880940126
6 Tony Edwards 2018 B 0.649086123
7 Tony Edwards 2019 A 0.881365905
I would like to create 2 new data-frames: one containing the records where someone has moved from dept A to B, and another where someone has moved from dept B to A. Therefore my desired output is:
name year dept metric
0 Jane Smith 2018 A 0.502505934
1 Tony Edwards 2018 B 0.649086123
name year dept metric
0 Jane Smith 2019 B 0.698808749
1 Tony Edwards 2019 A 0.881365905
The last year that someone is in their old dept is captured in one data-frame, and the first year in their new dept is captured in the other. The records are sorted by name and year, so they will be in the correct order.
I've tried:
for row in agg_data.rows:
    df['match'] = np.where(df.dept == 'A' and df.dept.shift() == 'B', '1')
    df['match'] = np.where(df.dept == 'B' and df.dept.shift() == 'A', '2')
and then select out the records into a data-frame, but I can't get it to work.
I believe you need:
# keep only people who appear in more than one dept
df = df[df.groupby('name')['dept'].transform('nunique') > 1]
# keep the last row of each (name, dept) pair
df = df.drop_duplicates(['name','dept'], keep='last')
df1 = df.drop_duplicates('name')
print (df1)
name year dept metric
2 Jane Smith 2018 A 0.502506
6 Tony Edwards 2018 B 0.649086
df2 = df.drop_duplicates('name', keep='last')
print (df2)
name year dept metric
3 Jane Smith 2019 B 0.698809
7 Tony Edwards 2019 A 0.881366
You could join the initial dataframe with a shifted copy of itself to get consecutive rows on the same line. Then you require the names to match and select the dept transition you want, which gives the indices of one of the expected rows; the other row is simply at the adjacent index. It gives:
df = agg_data.join(agg_data.shift(), rsuffix='_old')
df1 = df[(df.name_old==df.name)&(df.dept_old=='A')&(df.dept=='B')]
print(pd.concat([agg_data.loc[df1.index], agg_data.loc[df1.index-1]]
).sort_index())
df2 = df[(df.name_old==df.name)&(df.dept_old=='B')&(df.dept=='A')]
print(pd.concat([agg_data.loc[df2.index], agg_data.loc[df2.index-1]]
).sort_index())
with following output:
name year dept metric
2 Jane Smith 2018 A 0.502506
3 Jane Smith 2019 B 0.698809
name year dept metric
6 Tony Edwards 2018 B 0.649086
7 Tony Edwards 2019 A 0.881366
I came up with a solution using drop_duplicates, groupby and rank: df2 is built from rows with rank == 2, and df1 from rows with rank == 1 whose name also appears in df2.
df['rk'] = (df.sort_values(['name', 'dept', 'year'])
              .drop_duplicates(['name', 'dept'], keep='last')
              .groupby('name').year.rank())
df2 = df[df.rk.eq(2)].drop(columns='rk')
df1 = df[df.rk.eq(1) & df.name.isin(df2.name)].drop(columns='rk')
df1:
name year dept metric
2 Jane Smith 2018 A 0.502506
6 Tony Edwards 2018 B 0.649086
df2:
name year dept metric
3 Jane Smith 2019 B 0.698809
7 Tony Edwards 2019 A 0.881366
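For completeness, a sketch of the same idea using groupby with shift, which avoids ranks and duplicate-dropping; it assumes, as the question states, that the original df is sorted by name and year:
prev = df.groupby('name')['dept'].shift()    # previous year's dept per person
nxt = df.groupby('name')['dept'].shift(-1)   # next year's dept per person

# last year in the old dept: the dept changes on the next row
df1 = df[((df['dept'] == 'A') & (nxt == 'B')) | ((df['dept'] == 'B') & (nxt == 'A'))]

# first year in the new dept: the dept changed from the previous row
df2 = df[((prev == 'A') & (df['dept'] == 'B')) | ((prev == 'B') & (df['dept'] == 'A'))]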
I have the following code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'clif_cod' : [1,2,3,3,4,4,4],
                   'peds_val_fat' : [10.2, 15.2, 30.9, 14.8, 10.99, 39.9, 54.9],
                   'mes' : [1,2,4,5,5,6,12],
                   'ano' : [2016, 2016, 2016, 2016, 2016, 2016, 2016]})
vetor_valores = df.groupby(['mes','clif_cod']).sum()
which yields me this output:
ano peds_val_fat
mes clif_cod
1 1 2016 10.20
2 2 2016 15.20
4 3 2016 30.90
5 3 2016 14.80
4 2016 10.99
6 4 2016 39.90
12 4 2016 54.90
How do I select rows based on mes and clif_cod?
When I do list(df) I only get ano and peds_val_fat.
IIUC, you can just pass the argument as_index=False to your groupby. You can then access the result as you would any other dataframe:
vetor_valores = df.groupby(['mes','clif_cod'], as_index=False).sum()
>>> vetor_valores
mes clif_cod ano peds_val_fat
0 1 1 2016 10.20
1 2 2 2016 15.20
2 4 3 2016 30.90
3 5 3 2016 14.80
4 5 4 2016 10.99
5 6 4 2016 39.90
6 12 4 2016 54.90
To access values, you can now use iloc or loc as you would any dataframe:
# Select first row:
vetor_valores.iloc[0]
...
Alternatively, if you've already created your groupby and don't want to go back and re-make it, you can reset the index, the result is identical.
vetor_valores.reset_index()
By using pd.IndexSlice:
vetor_valores.loc[[pd.IndexSlice[1,1]],:]
Out[272]:
ano peds_val_fat
mes clif_cod
1 1 2016 10.2
You've got a dataframe with a two-level MultiIndex. Use both values to access rows, e.g., vetor_valores.loc[(4,3)].
Use the axis parameter of .loc:
vetor_valores.loc(axis=0)[1,:]
Output:
ano peds_val_fat
mes clif_cod
1 1 2016 10.2
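As a small addition, xs works on the row MultiIndex as well, which can read more clearly than IndexSlice; this sketch assumes the original MultiIndexed vetor_valores from the question:
# all clif_cod rows where mes == 5
print(vetor_valores.xs(5, level='mes'))

# one specific (mes, clif_cod) combination
print(vetor_valores.loc[(5, 3)])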
I am hoping to count the occurrence of words in a dataframe, grouped by year and name. The data is imported from Excel, and the results will also be exported to Excel.
This is the sample code:
import pandas as pd

source = pd.DataFrame({'Name' : ['John', 'Mike', 'John', 'John'],
                       'Year' : ['1999', '2000', '2000', '2000'],
                       'Message' : ['I Love You', 'Will Remember You',
                                    'Love', 'I Love You']})
Expected results are the following, in a dataframe. Any ideas?
Year Name Message Count
1999 John I 1
1999 John Love 1
1999 John You 1
2000 Mike Will 1
2000 Mike Remember 1
2000 Mike You 1
2000 John Love 2
2000 John I 1
2000 John You 1
I think you can first split the column Message, create a Series by stacking, and add it back to the original source. Finally, groupby with size:
# split column Message into a new df, then create a Series by stack
s = source.Message.str.split(expand=True).stack()
# remove the extra index level added by stack
s.index = s.index.droplevel(-1)
s.name = 'Message'
print(s)
0 I
0 Love
0 You
1 Will
1 Remember
1 You
2 Love
3 I
3 Love
3 You
Name: Message, dtype: object
# remove the old column Message
source = source.drop(['Message'], axis=1)
# join Series s back to source
df = source.join(s)
# aggregate with size
print(df.groupby(['Year', 'Name', 'Message']).size().reset_index(name='count'))
Year Name Message count
0 1999 John I 1
1 1999 John Love 1
2 1999 John You 1
3 2000 John I 1
4 2000 John Love 2
5 2000 John You 1
6 2000 Mike Remember 1
7 2000 Mike Will 1
8 2000 Mike You 1
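On pandas 0.25+, explode is a shorter route to the same table; and since the question mentions exporting back to Excel, a to_excel call is sketched at the end (it needs openpyxl installed, the filename is a placeholder, and it starts from the original source frame defined in the question):
out = (source.assign(Message=source['Message'].str.split())
             .explode('Message')
             .groupby(['Year', 'Name', 'Message'])
             .size()
             .reset_index(name='count'))
print(out)

out.to_excel('word_counts.xlsx', index=False)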