I've used groupby and pivot_table from the pandas package to create the following table:
Input:
q4 = q1[['category','Month']].groupby(['category','Month']).Month.agg({'Count':'count'}).reset_index()
q4 = pd.DataFrame(q4.pivot(index='category',columns='Month').reset_index())
Then the output:
category Count
Month 6 7 8
0 adult-classes 29.0 109.0 162.0
1 air-pollution 27.0 43.0 13.0
2 babies-and-toddlers 4.0 51.0 2.0
3 bicycle 210.0 96.0 23.0
4 building NaN 17.0 NaN
5 buildings-maintenance 23.0 12.0 NaN
6 catering 1351.0 4881.0 1040.0
7 childcare 9.0 NaN NaN
8 city-planning 105.0 81.0 23.0
9 city-services 2461.0 2130.0 1204.0
10 city-taxes 1.0 4.0 42.0
I'm trying to add a condition to the months. The problem I'm having is that after pivoting I can't access the columns.
How can I show only the rows where the count for month 6 < month 7 < month 8?
To flatten your MultiIndex, you can rename your columns (check out this answer).
q4.columns = [''.join([str(c) for c in col]).strip() for col in q4.columns.values]
To remove NaNs:
q4.fillna(0, inplace=True)
To select according to your constraint:
result = q4[(q4['Count6'] < q4['Count7']) & (q4['Count7'] < q4['Count8'])]
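As a follow-up, a minimal sketch of an alternative that avoids the MultiIndex entirely (assuming q1 has the 'category' and 'Month' columns used above and Month holds the integers 6, 7 and 8):
counts = pd.crosstab(q1['category'], q1['Month'])  # same per-month counts; missing combinations become 0
result = counts[(counts[6] < counts[7]) & (counts[7] < counts[8])].reset_index()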
I have multiple pandas data frames with some common columns and some overlapping rows. I would like to combine them in such a way that I have one final data frame with all of the columns and all of the unique rows (overlapping/duplicate rows dropped). The remaining gaps should be NaNs.
I have come up with the function below. In essence it goes through all columns one by one, appending all of the values from each data frame, dropping the duplicates (overlap), and building a new output data frame column by column.
import numpy as np
import pandas as pd

def combine_dfs(dataframes: list):
    ## Identifying all unique columns in all data frames
    columns = []
    for df in dataframes:
        columns.extend(df.columns)
    columns = np.unique(columns)
    ## Appending values from each data frame per column
    output_df = pd.DataFrame()
    for col in columns:
        column = pd.Series(dtype="object", name=col)
        for df in dataframes:
            if col in df.columns:
                column = column.append(df[col])  # note: Series.append is deprecated in recent pandas versions
        ## Removing overlapping data (assuming consistent values)
        column = column[~column.index.duplicated()]
        ## Adding column to output data frame
        column = pd.DataFrame(column)
        output_df = pd.concat([output_df, column], axis=1)
    output_df.sort_index(inplace=True)
    return output_df
df_1 = pd.DataFrame([[10,20,30],[11,21,31],[12,22,32],[13,23,33]], columns=["A","B","C"])
df_2 = pd.DataFrame([[33,43,54],[34,44,54],[35,45,55],[36,46,56]], columns=["C","D","E"], index=[3,4,5,6])
df_3 = pd.DataFrame([[50,60],[51,61],[52,62],[53,63],[54,64]], columns=["E","F"])
print(combine_dfs([df_1,df_2,df_3]))
The output looks like this, as intended:
A B C D E F
0 10.0 20.0 30 NaN 50 60.0
1 11.0 21.0 31 NaN 51 61.0
2 12.0 22.0 32 NaN 52 62.0
3 13.0 23.0 33 43.0 54 63.0
4 NaN NaN 34 44.0 54 64.0
5 NaN NaN 35 45.0 55 NaN
6 NaN NaN 36 46.0 56 NaN
This method works well on small data sets. Is there a way to optimize this?
IIUC you can chain combine_first:
print (df_1.combine_first(df_2).combine_first(df_3))
A B C D E F
0 10.0 20.0 30 NaN 50.0 60.0
1 11.0 21.0 31 NaN 51.0 61.0
2 12.0 22.0 32 NaN 52.0 62.0
3 13.0 23.0 33 43.0 54.0 63.0
4 NaN NaN 34 44.0 54.0 64.0
5 NaN NaN 35 45.0 55.0 NaN
6 NaN NaN 36 46.0 56.0 NaN
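For an arbitrary list of frames, the same chaining can be written with functools.reduce (same assumption as above: overlapping rows hold consistent values, so the first non-NaN value wins):
from functools import reduce

combined = reduce(lambda left, right: left.combine_first(right), [df_1, df_2, df_3])
print(combined)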
Now, I have two dataframes. I used groupby() and count() to export this dataframe (df1). When I used groupby() to count the total number of each category, it filtered out the categories whose count is 0. How can I use Python to get the outcome I want?
Original dataframe:
Cat UR3 VR1 VR VR3
0 ATIDS 137.0 99.0 40.0 84.0
1 BasicCrane 2.0 8.0 3.0 1.0
2 Beam Sensor 27.0 12.0 13.0 14.0
3 CLPS 1.0 NaN NaN 1.0
However, I would like a dataframe that also contains the required categories
(required categories: ATIDS, BasicCrane, LLP, Beam Sensor, CLPS, SPR).
Expected dataframe (the counts for 'LLP' and 'SPR' are 0):
Cat UR3 VR1 VR VR3
0 ATIDS 137.0 99.0 40.0 84.0
1 BasicCrane 2.0 8.0 3.0 1.0
2 LLP NaN NaN NaN NaN
3 Beam Sensor 27.0 12.0 13.0 14.0
4 CLPS 1.0 NaN NaN 1.0
5 SPR NaN NaN NaN NaN
>>> categories
['ATIDS', 'BasicCrane', 'LLP', 'Beam Sensor', 'CLPS', 'SPR']
>>> pd.merge(pd.DataFrame({'Cat': categories}), df, how='outer')
Cat UR3 VR1 VR VR3
0 ATIDS 137.0 99.0 40.0 84.0
1 BasicCrane 2.0 8.0 3.0 1.0
2 LLP NaN NaN NaN NaN
3 Beam Sensor 27.0 12.0 13.0 14.0
4 CLPS 1.0 NaN NaN 1.0
5 SPR NaN NaN NaN NaN
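An equivalent sketch (assuming the 'Cat' values in df are unique) sets 'Cat' as the index and reindexes with the full category list:
result = df.set_index('Cat').reindex(categories).reset_index()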
One easy option is to fill the NaN values with 0 before doing the groupby; the zero values (previously NaN) will then be counted as zero.
df.fillna(0)
I have a dataframe with about 40 columns and about 100000 rows:
ID MONTH_NUM_FROM_EVENT F1 F2 F3 F4 etc…
2 1 4.0 133.0 28.0 NaN
2 2 NaN 132.0 29.0 24.0
2 3 NaN 131.0 NaN 29.0
2 4 4.0 130.0 31.0 7.0
2 5 8.0 129.0 26.0 2.0
2 6 8.0 128.0 25.0 3.0
4 1 5.0 139.0 29.0 7.0
4 2 5.0 138.0 NaN 22.0
4 3 5.0 137.0 30.0 28.0
4 4 5.0 136.0 29.0 25.0
4 5 5.0 135.0 NaN 27.0
4 6 5.0 134.0 27.0 29.0
etc…
Each F column is a 6-month time series (with NaNs) for each client ID.
I want to output a new dataframe without the months, like so:
ID F1 F2 F3 F4 etc…
2
4
etc…
where each cell of the new dataframe is the slope of the 6-month time series for that F column, calculated with the following code example:
from scipy.stats import linregress

x = [6, 5, 4, 3, 2, 1]  # constant for each calculation; months in reverse order because 1 is the last month before the event prediction
y = df.F1[df['ID']==2]
xm = np.ma.masked_array(x, mask=np.isnan(y)).compressed()  # ignore NaNs
ym = np.ma.masked_array(y, mask=np.isnan(y)).compressed()  # ignore NaNs
linregress(xm, ym).slope
What is an efficient way to loop this calculation and create the new dataframe? Thanks in advance...
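For what it's worth, here is a minimal sketch of one way to do this without an explicit loop over IDs (my assumptions: the F columns are named F1 through F4, each ID has exactly 6 rows ordered by MONTH_NUM_FROM_EVENT, and slope_ignoring_nans is a hypothetical helper, not part of the question):
import numpy as np
import pandas as pd
from scipy.stats import linregress

def slope_ignoring_nans(y):
    x = np.array([6, 5, 4, 3, 2, 1])   # the reverse-ordered month constant from above
    y = y.to_numpy()
    mask = ~np.isnan(y)
    if mask.sum() < 2:                 # linregress needs at least two points
        return np.nan
    return linregress(x[mask], y[mask]).slope

slopes = (df.sort_values(['ID', 'MONTH_NUM_FROM_EVENT'])
            .groupby('ID')[['F1', 'F2', 'F3', 'F4']]
            .agg(slope_ignoring_nans))
print(slopes)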
I have data something like this:
import random
import pandas as pd
jobs = ['Agriculture', 'Crafts', 'Labor', 'Professional']
df = pd.DataFrame({
    'JobCategory': [random.choice(jobs) for i in range(300)],
    'Region': [random.randint(1, 5) for i in range(300)],
    'MaritalStatus': [random.choice(['Not Married', 'Married']) for i in range(300)]
})
And I want a simple table showing the count of jobs in each region.
print(pd.pivot_table(df,
                     index='JobCategory',
                     columns='Region',
                     margins=True,
                     aggfunc=len))
Output is
MaritalStatus
Region 1 2 3 4 5 All
JobCategory
Agriculture 13.0 23.0 17.0 18.0 8.0 79.0
Crafts 16.0 13.0 18.0 19.0 14.0 80.0
Labor 15.0 11.0 19.0 11.0 14.0 70.0
Professional 22.0 17.0 16.0 7.0 9.0 71.0
All 66.0 64.0 70.0 55.0 45.0 300.0
I assume "MaritalStatus" is showing up in the output because that is the column that the count is being calculated on. How do I get Pandas to calculate based on the Region-JobCategory count and ignore extraneous columns in the dataframe?
Added in edit ---
I am looking for a table with margin values in the output. The values in the table I show are what I want, but I don't want MaritalStatus to be what is counted. If there is a NaN in that column, e.g. if I change the column definition to
'MaritalStatus':[random.choice(['Not Married', 'Married'])
for i in range(299)].append(np.NaN)
This is the output (both with and without values='MaritalStatus'):
MaritalStatus
Region 1 2 3 4 5 All
JobCategory
Agriculture 16.0 14.0 16.0 14.0 16.0 NaN
Crafts 25.0 17.0 15.0 14.0 16.0 NaN
Labor 14.0 16.0 8.0 17.0 15.0 NaN
Professional 13.0 14.0 14.0 13.0 13.0 NaN
All NaN NaN NaN NaN NaN 0.0
You can fill the NaN values with 0 and then find the len, i.e.
df = pd.DataFrame({
    'JobCategory': [random.choice(jobs) for i in range(300)],
    'Region': [random.randint(1, 5) for i in range(300)],
    # list.append() returns None, so concatenate the trailing NaN instead
    'MaritalStatus': [random.choice(['Not Married', 'Married']) for i in range(299)] + [np.nan]})
df = df.fillna(0)
print(pd.pivot_table(df,
                     index='JobCategory',
                     columns='Region',
                     margins=True,
                     values='MaritalStatus',
                     aggfunc=len))
Output:
Region 1 2 3 4 5 All
JobCategory
Agriculture 19.0 17.0 13.0 20.0 9.0 78.0
Crafts 17.0 14.0 9.0 11.0 16.0 67.0
Labor 10.0 17.0 15.0 19.0 11.0 72.0
Professional 11.0 14.0 19.0 19.0 20.0 83.0
All 57.0 62.0 56.0 69.0 56.0 300.0
If you cut the dataframe down to just the columns that are to be part of the final index, counting rows works without having to refer to another column.
pd.pivot_table(df[['JobCategory', 'Region']],
               index='JobCategory',
               columns='Region',
               margins=True,
               aggfunc=len)
Output is the same as in the question except the line with "MaritalStatus" is not present.
The len aggregation function counts the number of times a value of MaritalStatus appears along a particular JobCategory-Region combination. Thus you're counting the number of JobCategory-Region instances, which is what you're expecting, I guess.
EDIT
We can assign a key value to each record and count or size that value.
df = pd.DataFrame({
    'JobCategory': [random.choice(jobs) for i in range(300)],
    'Region': [random.randint(1, 5) for i in range(300)],
    # list.append() returns None, so concatenate the trailing NaN instead
    'MaritalStatus': [random.choice(['Not Married', 'Married']) for i in range(299)] + [np.nan]})
print(pd.pivot_table(df.assign(key=1),
                     index='JobCategory',
                     columns='Region',
                     margins=True,
                     aggfunc=len,
                     values='key'))
Output:
Region 1 2 3 4 5 All
JobCategory
Agriculture 16.0 14.0 13.0 16.0 16.0 75.0
Crafts 14.0 9.0 17.0 22.0 13.0 75.0
Labor 11.0 18.0 20.0 10.0 16.0 75.0
Professional 16.0 14.0 15.0 14.0 16.0 75.0
All 57.0 55.0 65.0 62.0 61.0 300.0
You could add MaritalStatus as the values parameter, and this would eliminate that extra level in the column index. With aggfunc=len, it really doesn't matter what you select as the values parameter; it is going to return a count of 1 for every row in that aggregation.
So, try:
print(pd.pivot_table(df,
                     index='JobCategory',
                     columns='Region',
                     margins=True,
                     aggfunc=len,
                     values='MaritalStatus'))
Output:
Region 1 2 3 4 5 All
JobCategory
Agriculture 10.0 18.0 10.0 15.0 19.0 72.0
Crafts 11.0 13.0 17.0 11.0 22.0 74.0
Labor 12.0 10.0 18.0 16.0 12.0 68.0
Professional 21.0 16.0 20.0 13.0 16.0 86.0
All 54.0 57.0 65.0 55.0 69.0 300.0
Option 2
Use groupby and size:
df.groupby(['JobCategory','Region']).size()
Output:
JobCategory Region
Agriculture 1 10
2 18
3 10
4 15
5 19
Crafts 1 11
2 13
3 17
4 11
5 22
Labor 1 12
2 10
3 18
4 16
5 12
Professional 1 21
2 16
3 20
4 13
5 16
dtype: int64
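If you want the same wide layout as the pivot table, a quick sketch (the margins row and column would have to be added separately):
df.groupby(['JobCategory', 'Region']).size().unstack(fill_value=0)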
Here is my DataFrame:
0 1 2
0 0 0.0 20.0 NaN
1 1.0 21.0 NaN
2 2.0 22.0 NaN
ID NaN NaN 11111.0
Year NaN NaN 2011.0
1 0 3.0 23.0 NaN
1 4.0 24.0 NaN
2 5.0 25.0 NaN
3 6.0 26.0 NaN
ID NaN NaN 11111.0
Year NaN NaN 2012.0
I want to convert the 'ID' and 'Year' rows to the dataframe index, with 'ID' being level=0 and 'Year' being level=1. I tried using stack() but still cannot figure it out.
Edited: my desired output should look like this:
0 1
11111 2011 0 0.0 20.0
1 1.0 21.0
2 2.0 22.0
2012 0 3.0 23.0
1 4.0 24.0
2 5.0 25.0
3 6.0 26.0
This should work:
df1 = df.loc[pd.IndexSlice[:, ['ID', 'Year']], '2']  # pull the ID/Year values out of column '2'
dfs = df1.unstack()                                   # one row per block, with columns 'ID' and 'Year'
dfi = df1.index                                       # remember the labels of the ID/Year rows
dfn = df.drop(dfi).drop('2', axis=1).unstack()        # keep only the data rows, then move the inner index level to columns
dfn.set_index([dfs.ID, dfs.Year]).stack()             # attach (ID, Year) as the index and bring the rows back
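And a sketch of an alternative under the same assumptions (two-level row index, ID and Year stored in the '2' column, blocks labelled 0 and 1 at the outer level):
meta = df.loc[pd.IndexSlice[:, ['ID', 'Year']], '2'].unstack()  # one row per block with its ID and Year
data = df.drop(['ID', 'Year'], level=1).drop('2', axis=1)       # keep only the numeric rows
blocks = data.index.get_level_values(0)
data.index = pd.MultiIndex.from_arrays([
    meta['ID'].loc[blocks].astype(int).to_numpy(),
    meta['Year'].loc[blocks].astype(int).to_numpy(),
    data.index.get_level_values(1),
])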