How to aggregate rows in a pandas dataframe - python

I have a dataframe (shown in image 1). It is a sample of pubs in London, UK (3337 pubs/rows), and the geometry is at LSOA level. Some LSOAs contain more than one pub. I want my dataframe to summarise the number of pubs in every LSOA. I already have that information from
psdf['lsoa11nm'].value_counts()
prints out:
City of London 001F 103
City of London 001G 40
Westminster 013B 36
Westminster 018A 36
Westminster 013E 30
...
Lambeth 005A 1
Croydon 043C 1
Hackney 002E 1
Merton 022D 1
Bexley 008B 1
Name: lsoa11nm, Length: 1630, dtype: int64
I can't use this as a new dataframe because it is a key plus a single column, as opposed to two columns where one would be lsoa11nm and the other the pub count.
Does anyone know how to group the dataframe so that there is only one row for every LSOA, stating how many pubs are in it?
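One possible approach, sketched under assumptions: group on lsoa11nm, take the group sizes, and turn the result back into a two-column dataframe. The tiny psdf below and its pub_name column are made up for illustration; only lsoa11nm comes from the question, and pub_count is just a name chosen here.

```python
import pandas as pd

# Hypothetical stand-in for the real psdf (3337 rows in the question).
psdf = pd.DataFrame(
    {
        "pub_name": ["A", "B", "C", "D", "E", "F"],  # assumed column
        "lsoa11nm": [
            "City of London 001F",
            "City of London 001F",
            "City of London 001F",
            "Westminster 013B",
            "Westminster 013B",
            "Lambeth 005A",
        ],
    }
)

# One row per LSOA: size() counts the rows in each group, and
# reset_index(name=...) turns the resulting Series into two columns.
pub_counts = (
    psdf.groupby("lsoa11nm")
    .size()
    .reset_index(name="pub_count")
    .sort_values("pub_count", ascending=False, ignore_index=True)
)
print(pub_counts)
```

The result has the columns lsoa11nm and pub_count, so it can be merged back onto the LSOA geometry if needed.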

Related

Compare different df's row by row and return changes

Every month I collect data that contains details of employees to be stored in our database.
I need a way to compare the data stored in the previous month with the data just received and return, in a new dataframe, every row in which any column changed.
I would also need to know somehow which columns in each row of this new returned dataframe had a change when this comparison happened.
There are also some important details to mention:
Each column can also contain blank values in any of the dataframes;
The dataframes have the same column names but not necessarily the same data type;
The dataframes do not have the same number of rows necessarily;
If a row does not find its Index match, it should not be returned in the new dataframe;
The rows of the dataframes can be matched by a column named "Index"
So, for example, we would have this dataframe (which is just a slice of the real one as it has 63 columns):
df1:
Index Department Salary Manager Email Start_Date
1 IT 6000.00 Jack ax#i.com 01-01-2021
2 HR 7000 O'Donnel ay#i.com
3 MKT $7600 Maria d 30-06-2021
4 I'T 8000 Peter az#i.com 14-07-2021
df2:
Index Department Salary Manager Email Start_Date
1 IT 6000.00 Jack ax#i.com 01-01-2021
2 HR 7000 O'Donnel ay#i.com 01-01-2021
3 MKT 7600 Maria dy#i.com 30-06-2021
4 IT 8000 Peter az#i.com 14-07-2021
5 IT 9000 John NOT PROVIDED
6 IT 9900 John NOT PROVIDED
df3:
Index Department Salary Manager Email Start_Date
2 HR 7000 O'Donnel ay#i.com 01-01-2021
3 MKT 7600 Maria dy#i.com 30-06-2021
4 IT 8000 Peter az#i.com 14-07-2021
The differences in this example are:
Start date added in row of Index 2
Salary format corrected and email corrected for row Index 3
Department format corrected for row Index 4
What would be the best way to do this comparison?
I am not sure if there is an easy solution to understand what changed in each field but returning the dataframe with rows that had at least 1 change would be helpful.
Thank you for the support!
I think compare could do the trick: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.compare.html
But first you would need to align the rows between old and new dataframe via the index:
new_df_to_compare=new_df.loc[old_df.index]
When the data types don't match, you would also need to align them:
new_df_to_compare = new_df_to_compare.astype(old_df.dtypes.to_dict())
Then compare should work just like this:
difference_df = old_df.compare(new_df_to_compare)
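Putting those steps together, here is a minimal sketch. It assumes both frames use the "Index" column as their index, and it restricts both sides to the shared index labels first, since DataFrame.compare requires identically labelled frames. The sample values are a cut-down version of the question's data; old_aligned, new_aligned and changed_rows are names chosen for illustration.

```python
import pandas as pd

old_df = pd.DataFrame(
    {
        "Index": [1, 2, 3, 4],
        "Department": ["IT", "HR", "MKT", "I'T"],
        "Email": ["ax#i.com", "ay#i.com", "d", "az#i.com"],
    }
).set_index("Index")

new_df = pd.DataFrame(
    {
        "Index": [1, 2, 3, 4, 5],
        "Department": ["IT", "HR", "MKT", "IT", "IT"],
        "Email": ["ax#i.com", "ay#i.com", "dy#i.com", "az#i.com", "NOT PROVIDED"],
    }
).set_index("Index")

# Keep only rows whose Index exists in both months.
common = old_df.index.intersection(new_df.index)
old_aligned = old_df.loc[common]

# Same columns, same order, same dtypes as the old frame.
new_aligned = new_df.loc[common, old_df.columns].astype(old_df.dtypes.to_dict())

# compare() keeps only the rows and columns that differ; each changed
# column gets a "self" (old) and an "other" (new) sub-column.
difference_df = old_aligned.compare(new_aligned)

# Full new rows that had at least one change.
changed_rows = new_aligned.loc[difference_df.index]
print(difference_df)
```

In difference_df, a non-null cell marks a column that changed for that row, which covers the "which columns changed" part of the question. Note that astype will raise on values such as "$7600" if the target dtype is numeric, so real data may need cleaning first.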

How to extract unique values from pandas column where values are in list

I want to extract unique cities from city column in pandas dataframe. City column has values in list. How would I extract the cities frequency like:
Lahore 3
Karachi 2
Sydney 1
etc.
Sample dataframe:
Name Age City
a jack 34 [Sydney,Delhi]
b Riti 31 [Lahore,Delhi]
c Aadi 16 [New York, Karachi, Lahore]
d Mohit 32 [Peshawar,Delhi, Karachi]
Thank you
Let us try explode + value_counts
out = df.City.explode().value_counts()
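As a small illustration of that one-liner, the sketch below rebuilds the sample dataframe, assuming each City cell holds a real Python list. If the cells are strings such as "[Sydney,Delhi]", they would need to be parsed into lists before explode has any effect.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["jack", "Riti", "Aadi", "Mohit"],
        "Age": [34, 31, 16, 32],
        "City": [
            ["Sydney", "Delhi"],
            ["Lahore", "Delhi"],
            ["New York", "Karachi", "Lahore"],
            ["Peshawar", "Delhi", "Karachi"],
        ],
    },
    index=list("abcd"),
)

# explode() gives every list element its own row;
# value_counts() then tallies each city.
city_counts = df["City"].explode().value_counts()
print(city_counts)
```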

Drop duplicate rows in a dataframe of particular column

I have a dataframe like the following:
Districtname pincode
0 central delhi 110001
1 central delhi 110002
2 central delhi 110003
3 central delhi 110004
4 central delhi 110005
How can I drop rows based on column DistrictName and select the first unique value
The output I want:
Districtname pincode
0 central delhi 110001
Duplicate rows can be dropped with pandas.DataFrame.drop_duplicates(), which defaults to keeping the first occurrence. In your case df.drop_duplicates(subset="Districtname") should work. If you would like to update the same DataFrame, df.drop_duplicates(subset="Districtname", inplace=True) will do the job. Docs: https://pandas.pydata.org/pandas-docs/version/0.17/generated/pandas.DataFrame.drop_duplicates.html
Use drop_duplicates with inplace=True:
df.drop_duplicates('Districtname',inplace=True)
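For reference, a short sketch of the non-inplace form on data shaped like the question's. The "new delhi" rows are invented here only to show that one row per district is kept.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Districtname": ["central delhi"] * 3 + ["new delhi"] * 2,  # second district is made up
        "pincode": [110001, 110002, 110003, 110011, 110012],
    }
)

# keep="first" is the default; it is spelled out here for clarity.
first_per_district = df.drop_duplicates(subset="Districtname", keep="first")
print(first_per_district)
```

Returning a new dataframe rather than using inplace=True leaves the original df available for other steps.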

Concat 2 Dataframes with 54 entries yields 1 row

I have created 2 dataframes with a common index based on Year and District. There are 58 rows in each dataframe and the Year and Districts are exact matches. Yet when I try to join them, I get a new dataframe with all of the columns combined (which is what I want) but only one single row - New York City. That row exists in both dataframes, as do all the rest, but only this one makes it to the merged DF. I have tried a few different methods of joining the dataframes but they all do the same thing. This example uses:
pd.concat([ groupeddf,Popdf], axis=1)
This is the Popdf with (Year, District) as Index:
Population
Year District
2017 Albany 309612
Allegany 46894
Broome 193639
Cattaraugus 77348
Cayuga 77603
This is the groupeddf indexed on Year and District (some columns eliminated for clarity):
Total SNAP Households Total SNAP Persons \
Year District
2017 Albany 223057 416302
Allegany 36935 69802
Broome 201586 363504
Cattaraugus 75567 144572
Cayuga 64168 121988
This is the merged DF after executing pd.concat([ groupeddf,Popdf], axis=1):
Population Total SNAP Households Total SNAP Persons
Year District
2017 New York City 8622698 11314598 19987958
This shows the merged dataframe has only 1 entry:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1 entries, (2017, New York City) to (2017, New York City)
Data columns (total 4 columns):
Population 1 non-null int64
Total SNAP Households 1 non-null int64
Total SNAP Persons 1 non-null int64
Total SNAP Benefits 1 non-null float64
dtypes: float64(1), int64(3)
memory usage: 170.0+ bytes
UPDATE: I tried another approach and it demonstrates that the indices which appear identical to me, are not being seen as identical.
When I execute this code, I get duplicates instead of a merge:
combined_df = groupeddf.merge(Popdf, how='outer', left_index=True, right_index=True)
The results look like this:
Year District
2017 Albany 223057.0 416302.0
Albany NaN NaN
Allegany 36935.0 69802.0
Allegany NaN NaN
Broome 201586.0 363504.0
Broome NaN NaN
Cattaraugus 75567.0 144572.0
Cattaraugus NaN NaN
The only exception is when you get down to New York City. That one does not duplicate, so it is actually seen as the same index. So there is something wrong with the data, but I am not sure what.
Did you try using merge, like this:
combined_df = pd.merge(groupeddf, Popdf, how='inner', on=['Year', 'District'])
I used an inner join, which combines rows only where the district and year exist in both dataframes. If you want to keep everything from the left dataframe but only the matching rows from the right, use a left join, etc.
It took a while but I finally sorted it out. The District name in the population dataframe had a space at the end of the name, whereas there was no space in the SNAP df.
"Albany " vs "Albany"
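A minimal sketch of one way to clean this up, assuming the trailing space is the only difference between the two District levels. The two small frames below stand in for Popdf and groupeddf, and pop_clean is a name chosen for illustration.

```python
import pandas as pd

# Stand-in for Popdf: note the trailing space in each District name.
pop_df = pd.DataFrame(
    {
        "Year": [2017, 2017],
        "District": ["Albany ", "Allegany "],
        "Population": [309612, 46894],
    }
).set_index(["Year", "District"])

# Stand-in for groupeddf.
snap_df = pd.DataFrame(
    {
        "Year": [2017, 2017],
        "District": ["Albany", "Allegany"],
        "Total SNAP Households": [223057, 36935],
    }
).set_index(["Year", "District"])

# Strip whitespace on the plain column, then rebuild the MultiIndex.
pop_clean = pop_df.reset_index()
pop_clean["District"] = pop_clean["District"].str.strip()
pop_clean = pop_clean.set_index(["Year", "District"])

combined = pd.concat([snap_df, pop_clean], axis=1)
print(combined)
```

Printing repr() of a few index values, for example repr(pop_df.index[0]), is a quick way to spot this kind of invisible mismatch.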

How to avoid using iloc or hard coding the index number pandas to dynamically fetch rows from single data frame into multiple subsets?

My dataframe looks likes this
country1 state1 city1 District1
india 36 20 40
china 27 21 35
honkong 34 21 38
london 32 21 38
company technology car brand population
adf java Ford 40
ydfh java Hyundai 19
klyu java Nissan 47
hy6g dotnet Toyota 20
rghtr dotnet Hyundai 30
htryr dotnet hummer 12
I want to create multiple subsets from a single dataframe. I do not want to use index numbers, the iloc function, or hard-coded positions, because they would leave out any new entry added either after the london entry or after the last entry.
If a new entry arrives, it should also be captured. Any clues on how to do this in pandas or NumPy?
I hope this question is clear.
Assuming your data frame is saved as df, you can use groupby and save the grouped sub-data to a dictionary for future reference.
d = {}
for group, frame in df.groupby('country1'):
    d[group] = frame
Also, if you want to group by multiple columns, pass a list to groupby as follows:
for group, frame in df.groupby(['country1', 'technology']):
    d[group] = frame
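The same idea can be written as a dict comprehension. The sketch below assumes the second table from the question lives in its own flat dataframe; the column names follow the question's header, but the frame itself is rebuilt here for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "company": ["adf", "ydfh", "klyu", "hy6g", "rghtr", "htryr"],
        "technology": ["java", "java", "java", "dotnet", "dotnet", "dotnet"],
        "car brand": ["Ford", "Hyundai", "Nissan", "Toyota", "Hyundai", "hummer"],
        "population": [40, 19, 47, 20, 30, 12],
    }
)

# One sub-dataframe per distinct technology, keyed by its value.
subsets = {key: frame for key, frame in df.groupby("technology")}
print(subsets["java"])
```

Because the groups are derived from the column values rather than row positions, rows appended later land in the matching subset the next time the dictionary is rebuilt.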
