Most efficient way to group, count, then sort? - python

The data has two columns, City and People. I need to group by city and sum the people count.
Table looks something like this (times a million):
City, People
Boston, 1000
Boston, 2000
New York, 2500
Chicago, 2000
In this case Boston would be number 1 with 3000 people. I would need to return the top 5% cities and their people count (sum).
What is the most efficient way to do this? Can pandas scale this up well? Should I keep track of the top 5% or do a sort at the end?

If you would prefer to use Python without external libraries, you could do as follows. First, open the file with the csv module. Then use the built-in sorted function with a custom key (here, the second element converted to a number). Finally, grab the part we want with a slice.
import csv, math

out = []
with open("data.csv", "r") as fi:
    inCsv = csv.reader(fi, delimiter=',')
    for row in inCsv:
        out.append([col.strip() for col in row])

# skip the header, sort numerically by the People column, keep the top 5%
print(sorted(out[1:], key=lambda a: int(a[1]), reverse=True)[:int(math.ceil(len(out) * .05))])
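Note that this only sorts individual rows; the question also asks to group rows by city and sum them first. A minimal pure-Python sketch of that grouping step (my addition, not part of the original answer), assuming the same two-column data.csv layout:

import csv, math
from collections import defaultdict

totals = defaultdict(int)
with open("data.csv", "r") as fi:
    next(fi)  # skip the header row
    for city, people in csv.reader(fi, delimiter=','):
        totals[city.strip()] += int(people.strip())

# sort the (city, total) pairs by total and keep the top 5% of cities
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(ranked[:int(math.ceil(len(ranked) * .05))])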

Use groupby to get sums and rank to get percentiles:
import pandas as pd

df = pd.read_csv('data.csv', skipinitialspace=True)
d1 = df.groupby('City').People.sum()
d1.loc[d1.rank(pct=True) >= .95]
City
Boston 3000
Name: People, dtype: int64
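If you also want the result sorted at the end (as the question asks), you can chain sort_values onto the same result; a small sketch building on the d1 above:

top = d1.loc[d1.rank(pct=True) >= .95].sort_values(ascending=False)
print(top)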

Extract part from an address in pandas dataframe column

I'm working through a pandas tutorial that deals with analyzing sales data (https://www.youtube.com/watch?v=eMOA1pPVUc4&list=PLFCB5Dp81iNVmuoGIqcT5oF4K-7kTI5vp&index=6). The data is already in dataframe format; within the dataframe is one column called "Purchase Address" that contains street, city and state/zip code. The format looks like this:
Purchase Address
917 1st St, Dallas, TX 75001
682 Chestnut St, Boston, MA 02215
...
My idea was to convert the data to a string and to then drop the irrelevant list values. I used the command:
all_data['Splitted Address'] = all_data['Purchase Address'].str.split(',')
That worked for converting the data to a comma separated list of the form
[917 1st St, Dallas, TX 75001]
Now the whole column 'Splitted Address' looks like this, and I am stuck at this point. I simply want to drop list indices 0 and 2 and keep index 1, i.e. the city, in another column.
In the tutorial the solution was laid out using the .apply() method:
all_data['Column'] = all_data['Purchase Address'].apply(lambda x: x.split(',')[1])
This solution definitely looks more elegant than mine so far, but I wondered whether I can reach a solution with my approach with a comparable amount of effort.
Thanks in advance.
Use Series.str.split and select the element by index with .str:
all_data['Column'] = all_data['Purchase Address'].str.split(',').str[1]
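If you want to keep all the split parts around (for example to also grab the state later), str.split also accepts expand=True, which returns a DataFrame you can index by column number. A small variant of the answer above (the City column name is just an example):

parts = all_data['Purchase Address'].str.split(',', expand=True)
all_data['City'] = parts[1].str.strip()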

Using agg() on my dataframe gives no result

I tried to calculate the median and count of a specific column of my data frame:
large_depts = df[df['Department'].isin(Departments_top10)]\
[['Total', 'Department']]\
.groupby('Department')\
.agg([np.median, np.size])
print(large_depts)
It said:
ValueError: no results
But when I checked the dataframe, there were values in my dataframe:
large_depts = df[df['Department'].isin(Departments_top10)]\
[['Total', 'Department']]
print(large_depts)
Total Department
0 677,680.65 Boston Police Department
1 250,893.61 Boston Police Department
2 208,676.89 Boston Police Department
3 319,319.93 Boston Police Department
4 577,123.44 Boston Police Department
I found out that something goes wrong when I try to groupby, but I don't know why:
large_depts = df[df['Department'].isin(Departments_top10)]\
[['Total', 'Department']]\
.groupby('Department')
print(large_depts)
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000000000D1C0A08>
Here's the data: https://data.boston.gov/dataset/418983dc-7cae-42bb-88e4-d56f5adcf869/resource/31358fd1-849a-48e0-8285-e813f6efbdf1/download/employeeearningscy18full.csv
You don't need to select the Department column separately, and you can change np.size to 'count' too. Try this code:
df[df['Department'].isin(Departments_top10)].groupby('Department')['Total'].agg([np.median, 'count'])
You have a couple of errors going on in your code above.
Your Total column is not a numeric type (as you pointed out in the comments, it's a string). I'm assuming you can convert your Total column (though the change is permanent) and your code may then work. I don't have access to your data, so I can't fully check whether your groupby calls are working.
Here's code to change your string to list (as asked in the comments). Not sure if this is what you really want.
str2lst = lambda s: s.split(",")
df['Total'] = [str2lst(i) for i in df['Total']]
EDIT: After looking at your DataFrame (and realizing that Total is a number and not a list), I uncovered several rows that contained the column names as values. Removing these as well as changing your string values to float type:
df.drop([12556, 22124, 22123, 22122, 22121, 22125], inplace = True)
str2float = lambda s: s.replace(',', '')
df['Total'] = [float(str2float(i)) for i in df['Total']]
Now running agg() exactly how you have it in the question will work. Here's my results:
Total
Department median size
BPS Facility Management 53183.315 668.0
BPS Special Education 49875.830 831.0
BPS Substitute Teachers/Nurs 6164.070 1196.0
BPS Transportation 20972.770 506.0
Boston Cntr - Youth & Families 44492.625 584.0
In your last code entry, groupby has to be followed by an aggregation method that tells it what to compute. Think about it intuitively: how are you grouping your variables? If I instructed you to group a set of cards together, you'd ask how? By color? Number? Suits? You told Python to group by Department, but you didn't give it anything to compute per group. So Python returned a "...generic.DataFrameGroupBy object".
Try doing df...groupby('Department').count() and you'll see df grouped by Department.
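As a side note (my addition, not part of the original answer), the string-to-number cleanup can also be done with pandas' own tools, which sidesteps hand-picking the bad rows; errors='coerce' turns rows that still hold header text into NaN so they can be dropped:

df['Total'] = pd.to_numeric(df['Total'].str.replace(',', ''), errors='coerce')
df = df.dropna(subset=['Total'])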

How to sum up every 3 rows by column in Pandas Dataframe Python

I have a pandas dataframe top3 with data as in the image below.
Using the two columns, STNAME and CENSUS2010POP, I need to find the sum for Wyoming (91738 + 75450 + 46133 = 213321), then the sum for Wisconsin (1825699), West Virginia and so on, summing up the 3 counties for each state (and I need to sort them in ascending order after that).
I have tried this code to compute the answer:
topres=top3.groupby('STNAME').sum().sort_values(['CENSUS2010POP'], ascending=False)
Maybe you can suggest a more efficient way to do it? Maybe with a lambda expression?
You can use groupby:
df.groupby('STNAME').sum()
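If you also need the ascending sort mentioned in the question, you can chain sort_values onto the same pattern as your own attempt; a small sketch, assuming the frame is called top3 as in the question:

topres = top3.groupby('STNAME')['CENSUS2010POP'].sum().sort_values()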
Note: I'm starting from the point in the problem before the top 3 counties per state have been selected, and jumping straight to their sum.
I found it helpful with this problem to use a list selection.
I created a data frame view of the counties with:
counties_df=census_df[census_df['SUMLEV'] == 50]
and a separate one of the states so I could get at their names.
states_df=census_df[census_df['SUMLEV'] == 40]
Then I was able to create that sum of the populations of the top 3 counties per state, by looping over all states and summing the largest 3.
res = [(x, counties_df[(counties_df['STNAME']==x)].nlargest(3,['CENSUS2010POP'])['CENSUS2010POP'].sum()) for x in states_df['STNAME']]
I converted that result to a data frame
dfObj = pd.DataFrame(res)
named its columns
dfObj.columns = ['STNAME','POP3']
sorted in place
dfObj.sort_values(by=['POP3'], inplace=True, ascending=False)
and returned the first 3
return dfObj['STNAME'].head(3).tolist()
Definitely groupby is a more compact way to do the above, but I found this way helped me break down the steps (and the associated course had not yet dealt with groupby).
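For reference, here's a compact groupby equivalent of the steps above (my sketch, assuming the same census_df and column names as in this answer):

counties = census_df[census_df['SUMLEV'] == 50]
# per state: sum of the 3 largest county populations
top3_per_state = counties.groupby('STNAME')['CENSUS2010POP'].apply(lambda s: s.nlargest(3).sum())
print(top3_per_state.nlargest(3).index.tolist())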

How to add two specific columns and get a new column as their total using the pandas library?

I'm trying to add two columns and display their total in a new column, and I also need the following:
The total sum of sales in the month of Jan
The minimum sales amount in the month of Feb
The average (mean) sales for the month of Mar
I am also trying to create a data frame called d2 that only contains the rows of d that don't have any missing (NaN) values.
I have implemented the following code
import pandas as pd
new_val = pd.read_csv("/Users/mayur/574_repos_2019/ml-python-class/assignments/data/assg-01-data.csv")
new_val['total'] = 'total'
new_val.to_csv('output.csv', index=False)
display(new_val)
d.head(5)  # it's not showing the top lines of the .csv data
# .CSV file sample data
#account name street city state postal-code Jan Feb Mar total
#0118 Kerl, 3St . Waily Texas 28752.0 10000 62000 35000 total
#0118 mkrt, 1Wst. con Texas 22751.0 12000 88200 15000 total
It's giving me the word 'total' instead of the sum.
When you used new_val['total'] = 'total' you basically told pandas that you want a column in your DataFrame called total where every value is the string 'total'.
What you want to fix is the value assignment. For this I can give you a quick and dirty solution that will hopefully make a more appealing solution clearer to you.
You can iterate through your DataFrame and add the month columns to get the value for the total column.
for i, j in new_val.iterrows():
    new_val.loc[i, 'total'] = j['Jan'] + j['Feb'] + j['Mar']
Note that this iterates through your entire data set, so if your data set is large this is not the best option.
As mentioned by @Cavenfish, new_val['total'] = 'total' creates a column total where the value of every cell is the string 'total'.
You should rather use new_val['total'] = new_val['Jan']+new_val['Feb']+new_val['Mar']
For treatment of NA values you can use a mask, new_val.isna(), which generates booleans for all cells indicating whether or not they are NA. You can then apply any logic on top of it. For your example, the below should work:
new_val.isna().sum(axis=1) == 0
Considering that you now have 4 columns in your dataframe (Jan, Feb, Mar, total), it returns False whenever a row contains an NA. You can then apply this mask to filter the rows, or to assign a default value to new_val['total'] where an NA is encountered in one of the columns for a row.
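A minimal sketch tying this back to the other asks in the question (my addition; it assumes the frame is new_val with Jan, Feb and Mar columns as above):

new_val['total'] = new_val['Jan'] + new_val['Feb'] + new_val['Mar']
jan_total = new_val['Jan'].sum()   # total sales in January
feb_min = new_val['Feb'].min()     # minimum sale in February
mar_mean = new_val['Mar'].mean()   # average (mean) sale in March
d2 = new_val.dropna()              # rows with no missing (NaN) values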

Append column to dataframe containing count of duplicates on another row

New to Python, using 3.x
I have a large CSV containing a list of customer names and addresses.
[Name, City, State]
I want to create a 4th column that is a count of the total number of customers living in the current customer's state.
So for example:
Joe, Dallas, TX
Steve, Austin, TX
Alex, Denver, CO
would become:
Joe, Dallas, TX, 2
Steve, Austin, TX, 2
Alex, Denver, CO, 1
I am able to read the file in, and then use groupby to create a Series that contains the values for the 4th column, but I can't figure out how to take that series and match it against the million+ rows in my actual file.
import pandas as pd
mydata=pd.read_csv(r'C:\Users\customerlist.csv', index_col=False)
mydata=mydata.drop_duplicates(subset='name', keep='first')
mydata['state']=mydata['state'].str.strip()
stateinstalls=(mydata.groupby(mydata.state, as_index=False).size())
stateinstalls gives me a series [2,1] but I lose the corresponding state([TX, CO]). It needs to be a tuple, so that I can then go back and iterate through all rows of my spreadsheet and say something like:
if mydata['state'].isin(stateinstalls(0))
mydata[row]=stateinstalls(1)
I feel very lost. I know there has to be a far simpler way to do this, maybe even in place within the array (like a COUNTIF-type function).
Any pointers are much appreciated.
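One common pattern for exactly this (a sketch I'm adding, not an answer from the original thread) is groupby().transform, which broadcasts the per-state count back onto every row without any manual matching:

import pandas as pd

mydata = pd.read_csv(r'C:\Users\customerlist.csv', index_col=False)
mydata['state'] = mydata['state'].str.strip()
# count of customers in each row's state, aligned back to the original rows
mydata['state_count'] = mydata.groupby('state')['state'].transform('size')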
