TL;DR:
I have two dataframes of different sizes that share an 'id' column which is supposed to act as the index. I need to merge them, group by 'sector' and 'gender', and count the entries in each group.
Long version:
I have a dataframe with 'id' and 'sector', among other columns, describing company personnel, and another dataframe with 'id' and 'gender'. Examples below:
df1:
row*  id   sector          other columns
1     0    Operational     ...
2     0    Administrative  ...
3     1    Sales           ...
4     2    IT              ...
5     3    Operational     ...
6     3    IT              ...
7     4    Sales           ...
[...]
150   100  Operational     ...
151   100  Sales           ...
152   101  IT              ...
*I don't really have a 'row' column; it's only there to make the problem easier to follow.
df2:
row*  id   gender
1     0    Male
2     1    Female
3     2    Female
4     3    Male
5     4    Male
[...]
101   100  Male
102   101  Female
As you can see, one person can be in more than one sector (which seems to make my problem more complicated).
I need to merge the two dataframes and then count how many males and females work in each sector.
FIRST PROBLEM
Decided to make a new df to get only the columns 'id' and 'sector'.
df3 = df1[['id','sector']]
df3 = df3.merge(df2)
I get:
No common columns to perform merge on. Merge options: left_on=None,
right_on=None, left_index=False, right_index=False
Tried using .join() instead of .merge() and I get:
['id'] not in index
Then I tried reset_index(), as suggested in some answers around here, but it didn't really solve my issue:
df1 = df1.reset_index()
df3 = df1[['id','sector']]
df3 = df3.join(df2)
What I got was this:
row*  id   sector          gender
1     0    Operational     Male
2     0    Administrative  Female
3     1    Sales           Female
4     2    IT              Male
5     3    Operational     Male
6     3    IT              ...
7     4    Sales           ...
[...]
150   100  Operational     NaN
151   100  Sales           NaN
152   101  IT              NaN
It didn't respect the 'id' column at all; .join() just concatenated the 'gender' column positionally. Since df2 only has 102 rows, rows 103 to 152 got NaN, and on top of that the 'gender' values were no longer accurate.
SECOND PROBLEM
Decided to power through that in order to get the rest of the work done. I tried this:
df3 = df3.groupby('sector','gender').size()
It raises:
No axis named gender for object type <class 'pandas.core.frame.DataFrame'>
That doesn't really make sense to me, because I can call df3.gender and get the (entire) expected series. If I remove 'gender' from the line above, the groupby actually works, but that alone isn't enough for me. I also tried selecting the column names before the groupby, to no avail.
Expected result should be something like this:
sector          gender  sum
operational     male    20
operational     female  5
administrative  male    10
administrative  female  17
sales           male    12
sales           female  13
IT              male    1
IT              female  11
Not sure if I can answer my own question, but I think I should, since the issue is resolved.
The solutions were very simple, even though I don't understand some of the errors I got.
First problem: add on='id' to the merge:
df3 = df1[['id','sector']].merge(df2, on='id')
Second problem: I was just missing a list, as pointed out by @DYZ:
df3.groupby(['sector','gender']).size()
Feeling quite stupid right now... Must be tired. Thanks DYZ, and sorry for the trouble.
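For anyone skimming, here is a minimal end-to-end sketch of the fixed solution with made-up toy data (only the shape matches the real frames):
import pandas as pd

df1 = pd.DataFrame({'id': [0, 0, 1, 2],
                    'sector': ['Operational', 'Administrative', 'Sales', 'IT']})
df2 = pd.DataFrame({'id': [0, 1, 2],
                    'gender': ['Male', 'Female', 'Female']})

# merge on the shared key, then count rows per (sector, gender) pair
df3 = df1[['id', 'sector']].merge(df2, on='id')
counts = df3.groupby(['sector', 'gender']).size().reset_index(name='count')
print(counts)
This prints one row per (sector, gender) combination, matching the expected result above.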
I have 2 CSV files. They have one common column, which is ID. What I want to do is extract the common rows and build another dataframe: first select Job, and then, since they share the ID column, find the rows whose IDs are the same. Visually, the dataframes should look like this:
Let the first DataFrame be:
ID  Gender  Job           Shift  Wage
1   Male    Engineer      Night  8000
2   Male    Engineer      Night  7865
3   Female  Worker        Day    5870
4   Male    Accountant    Day    5870
5   Female  Architecture  Day    4900
and the second one:
ID  Department
1   IT
2   Quality Control
5   Construction
7   Construction
8   Human Resources
And the new DataFrame should look like:
ID  Department       Job           Wage
1   IT               Engineer      8000
2   Quality Control  Engineer      7865
5   Construction     Architecture  4900
You can use:
df_result = df1.merge(df2, on='ID', how='inner')
If you want to select only certain columns from a given df, use:
df_result = df1[['ID', 'Job', 'Wage']].merge(df2[['ID', 'Department']], on='ID', how='inner')
Use:
df = df2.merge(df1[['ID','Job', 'Wage']], on='ID')
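Both answers rely on the same merge; as a self-contained check, here is a sketch that rebuilds the example frames from the question and reproduces the desired output (how='inner' is merge's default, so it can be omitted):
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                    'Gender': ['Male', 'Male', 'Female', 'Male', 'Female'],
                    'Job': ['Engineer', 'Engineer', 'Worker', 'Accountant', 'Architecture'],
                    'Shift': ['Night', 'Night', 'Day', 'Day', 'Day'],
                    'Wage': [8000, 7865, 5870, 5870, 4900]})
df2 = pd.DataFrame({'ID': [1, 2, 5, 7, 8],
                    'Department': ['IT', 'Quality Control', 'Construction',
                                   'Construction', 'Human Resources']})

# keep only IDs present in both frames; column order follows the left frame
df = df2.merge(df1[['ID', 'Job', 'Wage']], on='ID')
print(df)   # IDs 1, 2 and 5 with Department, Job and Wage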
I was comparing two Excel files containing student information from two schools. However, the files might contain different numbers of rows.
The first step was to import the Excel files into two dataframes:
df1 = pd.read_excel('School A - Information.xlsx')
df2 = pd.read_excel('School B - Information.xlsx')
print(df1)
Name Age Birth_Country Previous Schools
0 tom 10 USA 3
1 nick 15 MEX 1
2 juli 14 CAN 0
3 tom 19 NOR 1
print(df2)
Name Age Birth_Country Previous Schools
0 tom 10 USA 3
1 tom 19 NOR 1
2 nick 15 MEX 4
After this, I would like to check the divergences between those two dataframes (index order is not important). However, I am running into a problem due to the differing sizes of the dataframes.
compare = df1.values == df2.values
<ipython-input-9-7cc64ba0e622>:1: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
compare = df1.values == df2.values
print(compare)
False
Adding to that, I would like to create a third DataFrame that shows the corresponding divergences:
import numpy as np
rows, cols = np.where(compare == False)
for item in zip(rows, cols):
    df1.iloc[item[0], item[1]] = '{} --> {}'.format(df1.iloc[item[0], item[1]], df2.iloc[item[0], item[1]])
However, this code does not work, as the index order may differ between the two dataframes.
My expected output should be the below dataframe:
You can use pd.merge to accomplish this. If you're unfamiliar with dataframe merges, here's a post that describes relational-database merging ideas: link. In this case, we first do a left merge of df2 onto df1 to find how the Previous Schools column differs:
df_merged = pd.merge(df1, df2, how="left", on=["Name", "Age", "Birth_Country"], suffixes=["_A", "_B"])
print(df_merged)
will give you a new dataframe
Name Age Birth_Country Previous Schools_A Previous Schools_B
0 tom 10 USA 3 3.0
1 nick 15 MEX 1 4.0
2 juli 14 CAN 0 NaN
3 tom 19 NOR 1 1.0
This new dataframe has all the information you're looking for. To find just the rows where the Previous Schools entries differ:
df_different = df_merged[df_merged["Previous Schools_A"]!=df_merged["Previous Schools_B"]]
print(df_different)
Name Age Birth_Country Previous Schools_A Previous Schools_B
1 nick 15 MEX 1 4.0
2 juli 14 CAN 0 NaN
and to find the rows where Previous Schools has not changed:
df_unchanged = df_merged[df_merged["Previous Schools_A"]==df_merged["Previous Schools_B"]]
print(df_unchanged)
Name Age Birth_Country Previous Schools_A Previous Schools_B
0 tom 10 USA 3 3.0
3 tom 19 NOR 1 1.0
If I were you, I'd stop here: the final dataframe you want will have generic object column types because of the mix of strings and integers, which will limit its uses... but maybe you need that particular formatting for some reason. In that case, it's all about putting these dataframe subsets together in the right way to get your desired formatting. Here's one way.
First, initialize the final dataframe with the unchanged rows:
df_final = df_unchanged[["Name", "Age", "Birth_Country", "Previous Schools_A"]].copy()
df_final = df_final.rename(columns={"Previous Schools_A": "Previous Schools"})
print(df_final)
Name Age Birth_Country Previous Schools
0 tom 10 USA 3
3 tom 19 NOR 1
Now process the entries that differ between the dataframes. There are two cases here: entries whose value changed (Previous Schools_B is not NaN) and entries that are new (Previous Schools_B is NaN). We'll deal with each in turn:
changed_entries = df_different[~pd.isnull(df_different["Previous Schools_B"])].copy()
changed_entries["Previous Schools"] = changed_entries["Previous Schools_A"].astype('str') + " --> " + changed_entries["Previous Schools_B"].astype('int').astype('str')
changed_entries = changed_entries.drop(columns=["Previous Schools_A", "Previous Schools_B"])
print(changed_entries)
Name Age Birth_Country Previous Schools
1 nick 15 MEX 1 --> 4
and now process the entries that are completely new:
new_entries = df_different[pd.isnull(df_different["Previous Schools_B"])].copy()
new_entries = "NaN --> " + new_entries[["Name", "Age", "Birth_Country", "Previous Schools_A"]].astype('str')
new_entries = new_entries.rename(columns={"Previous Schools_A": "Previous Schools"})
print(new_entries)
Name Age Birth_Country Previous Schools
2 NaN --> juli NaN --> 14 NaN --> CAN NaN --> 0
and finally, concatenate all the dataframes:
df_final = pd.concat([df_final, changed_entries, new_entries])
print(df_final)
Name Age Birth_Country Previous Schools
0 tom 10 USA 3
3 tom 19 NOR 1
1 nick 15 MEX 1 --> 4
2 NaN --> juli NaN --> 14 NaN --> CAN NaN --> 0
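As a side note: if you only need to know which rows matched between the two files, and not the custom '-->' formatting, merge's indicator flag is a more direct route (a sketch using the same key columns as above):
df_flagged = pd.merge(df1, df2, how='outer',
                      on=['Name', 'Age', 'Birth_Country'],
                      suffixes=['_A', '_B'], indicator=True)
# _merge is 'both', 'left_only' (only in school A) or 'right_only' (only in school B)
print(df_flagged[df_flagged['_merge'] != 'both'])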
I have a data frame of 62 undergrads from a state university, with 13 columns (age, class, major, GPA, etc.).
print(studentSurvey)
ID Gender Age Major ... Text Messages
1 F 20 Other 120
2 M 22 CS 50
.
.
.
62 F 21 Retail 200
I want to make pivot tables on studentSurvey. For example, I want to find out how many women took CS as a major, how many men took Other, and so on. The closest I could get was this:
studentSurvey.pivot_table(index="Gender", columns="Major", aggfunc='count')
Age ... Text Messages
Major Accounting CIS Economics/Finance ... Other Retailing/Marketing Undecided
Gender ...
Female 3.0 3.0 7.0 ... 3.0 9.0 NaN
Male 4.0 1.0 4.0 ... 4.0 5.0 3.0
That is not what I require. I only want Gender as the index (rows), the unique values under Major as the columns, and each cell holding the count for that gender and major. I've also tried slicing out just these two columns and pivoting, but the results were mixed up. Can anyone suggest something better? I'm new to advanced reshaping in pandas.
Check crosstab
pd.crosstab(studentSurvey['Gender'], studentSurvey['Major'])
Fix your code
studentSurvey.pivot_table(index="Gender", columns="Major", values="ID", aggfunc="count")
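The values="ID" part is what fixes it: without values, pivot_table aggregates every remaining column (hence the multi-level Age ... Text Messages columns in the output above), while pinning it to a single column yields the flat Gender-by-Major count table.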
Try:
(studentSurvey.groupby(['Gender', 'Major'])
    .size()
    .unstack('Major', fill_value=0)
)
Or you can do crosstab:
pd.crosstab(studentSurvey['Gender'], studentSurvey['Major'])
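If you also want row and column totals, crosstab accepts a margins flag, for example:
pd.crosstab(studentSurvey['Gender'], studentSurvey['Major'], margins=True, margins_name='Total')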
I have a rather cross-platform question; I hope it is not too general.
One of my tables, say customers, consists of my customer IDs and their associated demographic information. Another table, say transaction, contains all purchases made by those customers in the respective shops.
I am interested in analyzing basket compositions together with demographics in Python. Hence, I would like a dataframe with the shops as columns and each customer's total spend at each shop as values.
For clarity,
select *
from customer
where id=1 or id=2
gives me
id age gender
1 35 MALE
2 57 FEMALE
and
select *
from transaction
where id=1 or id=2
gives me
customer_id shop amount
1 2 250
1 2 500
2 3 100
2 7 200
2 11 125
Which should end up in a (preferably) Pandas dataframe as
id age gender shop_2 shop_3 shop_7 shop_11
1 35 MALE 750 0 0 0
2 57 FEMALE 0 100 200 125
so that the last columns are the aggregated baskets of the customers.
I have tried to create a python dictionary of the purchases and amounts for each customer in SQL in the following way:
select customer_id, array_agg(concat(cast(shop as varchar), ' : ', cast(amount as varchar))) as basket
from transaction
group by customer_id
Resulting in
id basket
1 ['2 : 250', '2 : 500']
2 ['3 : 100', '7 : 200', '11 : 125']
which could easily be joined onto the customer table.
However, this solution is not optimal, because the lists hold strings rather than integers, so it takes a lot of manipulation and looping in Python to get the data into the format I want.
Is there a way to aggregate the purchases in SQL that makes it easier for Python to read them and spread them into columns?
One simple solution would be to do the aggregation in pandas using pivot_table on the second dataframe and then merge with the first:
df2 = df2.pivot_table(columns='shop', values='amount', index='customer_id', aggfunc='sum', fill_value=0.0).reset_index()
df = pd.merge(df1, df2, left_on='id', right_on='customer_id')
Resulting dataframe:
id age gender 2 3 7 11
1 35 MALE 750 0 0 0
2 57 FEMALE 0 100 200 125
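If you want the shop_2, shop_3, ... column names from the question rather than bare shop numbers, one option (a sketch of the same pipeline) is to prefix the pivoted columns before resetting the index:
df2 = (df2.pivot_table(index='customer_id', columns='shop',
                       values='amount', aggfunc='sum', fill_value=0)
          .add_prefix('shop_')   # 2 -> 'shop_2', 3 -> 'shop_3', ...
          .reset_index())
df = pd.merge(df1, df2, left_on='id', right_on='customer_id')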
So I got my dataframe (df1):
Number Name Gender Hobby
122 John Male -
123 Patrick Male -
124 Rudy Male -
I want to add data to Hobby based on the Number column. Assume I've got my lists of hobbies, keyed by number, in different dataframes. For example (df2):
Number Hobby
124 Soccer
... ...
... ...
and df3:
Number Hobby
122 Basketball
... ...
... ...
How can I achieve this dataframe:
Number Name Gender Hobby
122 John Male Basketball
123 Patrick Male -
124 Rudy Male Soccer
So far I've already tried the following solution:
Select rows from a DataFrame based on values in a column in pandas
but it only selects some data. How can I update the 'Hobby' column?
Thanks in advance.
You can use map; merge and join will also achieve it:
df['Hobby']=df.Number.map(df1.set_index('Number').Hobby)
df
Out[155]:
Number Name Gender Hobby
0 122 John Male NaN
1 123 Patrick Male NaN
2 124 Rudy Male Soccer
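The NaN rows appear because this maps from only one lookup frame. To pull from both df2 and df3 at once (using the question's df1/df2/df3 naming) and keep the original '-' where no hobby is found, one option is to stack the lookup frames first; a sketch assuming Number is unique across them:
hobby = pd.concat([df2, df3]).set_index('Number')['Hobby']
df1['Hobby'] = df1['Number'].map(hobby).fillna('-')
print(df1)   # 122 Basketball, 123 -, 124 Soccer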