Append column to dataframe containing count of duplicates on another row - python

New to Python, using 3.x
I have a large CSV containing a list of customer names and addresses.
[Name, City, State]
I want to create a 4th column that counts the total number of customers living in the current customer's state.
So for example:
Joe, Dallas, TX
Steve, Austin, TX
Alex, Denver, CO
would become:
Joe, Dallas, TX, 2
Steve, Austin, TX, 2
Alex, Denver, CO, 1
I am able to read the file in, and then use groupby to create a Series that contains the values for the 4th column, but I can't figure out how to take that series and match it against the million+ rows in my actual file.
import pandas as pd
mydata=pd.read_csv(r'C:\Users\customerlist.csv', index_col=False)
mydata=mydata.drop_duplicates(subset='name', keep='first')
mydata['state']=mydata['state'].str.strip()
stateinstalls=(mydata.groupby(mydata.state, as_index=False).size())
stateinstalls gives me a Series of counts [2, 1], but I lose the corresponding states ([TX, CO]). It seems like it needs to be a (state, count) pair so that I can then go back and iterate through all the rows of my spreadsheet and say something like:
if mydata['state'].isin(stateinstalls[0]):
    mydata[row] = stateinstalls[1]
I feel very lost. I know there has to be a far simpler way to do this, maybe even in place within the dataframe (like a COUNTIF-type function).
Any pointers are much appreciated.
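One way to get that per-state count without building tuples by hand, as a minimal sketch (it assumes the columns are named name, city and state, as in the snippet above), is to let groupby().transform('size') broadcast each group's size back onto every row:

import pandas as pd

mydata = pd.read_csv(r'C:\Users\customerlist.csv', index_col=False)
mydata = mydata.drop_duplicates(subset='name', keep='first')
mydata['state'] = mydata['state'].str.strip()

# transform('size') returns one value per original row, aligned to the index,
# so it can be assigned directly as the 4th column.
mydata['state_count'] = mydata.groupby('state')['state'].transform('size')

This behaves like a COUNTIF over the state column: every row gets the total number of rows sharing its state.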

Related

Apply function by extracting values from another dataframe

I am new to pandas dataframes. I would like to apply a function to an old dataframe (df1) by extracting values from another dataframe (df2).
df2 looks like this (the actual one has ~500 rows):
Judge old_court_name new_court_name
John eighth circuit first circuit
Ruth us court claims. fifth circuit
Ben district connecticut district ohio
Then I've written a function
def addJudgeCourt(df1, Judge, old_court_name, new_court_name):
How do I tell pandas to extract the last three arguments (the judge and the two court names) by iterating over dataframe2? Thanks!
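A minimal sketch of the iteration part, assuming addJudgeCourt keeps the signature shown above (its body is not given in the question, so the call is only illustrative):

for row in df2.itertuples(index=False):
    df1 = addJudgeCourt(df1, row.Judge, row.old_court_name, row.new_court_name)

itertuples yields one named tuple per row of df2, so the last three arguments are simply that row's judge and the two court names.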

Extract part from an address in pandas dataframe column

I am working through a pandas tutorial that deals with analyzing sales data (https://www.youtube.com/watch?v=eMOA1pPVUc4&list=PLFCB5Dp81iNVmuoGIqcT5oF4K-7kTI5vp&index=6). The data is already in dataframe format; within the dataframe is one column called "Purchase Address" that contains street, city and state/zip code. The format looks like this:
Purchase Address
917 1st St, Dallas, TX 75001
682 Chestnut St, Boston, MA 02215
...
My idea was to convert the data to a list and then drop the irrelevant list values. I used the command:
all_data['Splitted Address'] = all_data['Purchase Address'].str.split(',')
That worked for converting the data to a comma separated list of the form
[917 1st St, Dallas, TX 75001]
Now the whole column 'Splitted Address' looks like this and I am stuck at this point. I simply want to drop the list indices 0 and 2 and keep index 1, i.e. the city, in another column.
In the tutorial the solution was laid out using the .apply() method:
all_data['Column'] = all_data['Purchase Address'].apply(lambda x: x.split(',')[1])
This solution definitely looks more elegant than mine so far, but I wonder whether I can reach a solution with my approach with a comparable amount of effort.
Thanks in advance.
Use Series.str.split and select the element by indexing with .str:
all_data['Column'] = all_data['Purchase Address'].str.split(',').str[1]
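If more than one part of the address is needed later, str.split with expand=True returns the pieces as separate columns in one pass (the 'City' column name here is only illustrative):

parts = all_data['Purchase Address'].str.split(',', expand=True)
all_data['City'] = parts[1].str.strip()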

Creating a dictionary from a dataframe having duplicate column value

I have a dataframe df1 having more than 500k records:
state lat-long
Florida (12.34,34.12)
texas (13.45,56.0)
Ohio (-15,49)
Florida (12.04,34.22)
texas (13.35,56.40)
Ohio (-15.79,49.34)
Florida (12.8764,34.2312)
The lat-long value can differ for a particular state.
I need to get a dictionary like the one below; where the lat-long differs for a state, I need to capture the first occurrence, like this:
dict_state_lat_long = {"Florida":"(12.34,34.12)","texas":"(13.45,56.0)","Ohio":"(-15,49)"}
How can I get this in the most efficient way?
You can use DataFrame.groupby to group the dataframe by state and then apply the aggregate function first to select the first occurring lat-long value in each group.
Then you can use the Series.to_dict() function to convert the result to a Python dict.
Use this:
grouped = df.groupby("state")["lat-long"].agg("first")
dict_state_lat_long = grouped.to_dict()
print(dict_state_lat_long)
Output:
{'Florida': '(12.34,34.12)', 'Ohio': '(-15,49)', 'texas': '(13.45,56.0)'}
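An equivalent alternative, as a sketch, is to drop duplicate states first (keeping the first occurrence) and build the dict from what remains:

dict_state_lat_long = df.drop_duplicates('state').set_index('state')['lat-long'].to_dict()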

How to add a new column to CSV according to certain equality of other columns?

Here is my question. I am working with Python CSV data sets using pandas. I am comparing crime rates in NYC neighborhoods and Airbnb rents in those neighborhoods, using 2 different data sets. What I want to do is check whether the neighborhood names are the same and, if so, add the crime rate column next to the price column of the Airbnb df. However, the indexes are not the same: there are 500 rows for Upper East Side houses while there is only 1 crime figure for the Upper East Side. So how can I combine this info? Help is much needed as I have a report due by tonight, thanks.
So far I have done:
I only loaded both CSV files as dataframes. I then thought about creating a dictionary of crime rates keyed by neighbourhood, and if I find a match between the Airbnb locations and the dictionary locations, I want to append the crime rate value from the dictionary to an empty list. After doing this, I believe the list will be in the same order as the Airbnb locations, so I can add it as a new column to the Airbnb CSV. Sorry, my code is not in a proper state so I can't post it here. I am also stuck at adding the proper dictionary value to the empty list by finding the same locations in the 2 CSVs.
datasets:
http://app.coredata.nyc/?mlb=false&ntii=crime_all_rt&ntr=Community%20District&mz=14&vtl=https%3A%2F%2Fthefurmancenter.carto.com%2Fu%2Fnyufc%2Fapi%2Fv2%2Fviz%2F98d1f16e-95fd-4e52-a2b1-b7abaf634828%2Fviz.json&mln=true&mlp=true&mlat=40.718&ptsb=&nty=2018&mb=roadmap&pf=%7B%22subsidies%22%3Atrue%7D&md=table&mlv=false&mlng=-73.996&btl=Borough&atp=neighborhoods
https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data
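A minimal sketch of the usual pandas approach, assuming both files have been read into dataframes and both contain a neighbourhood name column (the file and column names below are assumptions, not the real ones from the linked datasets):

import pandas as pd

airbnb = pd.read_csv('airbnb.csv')   # assumed file name
crime = pd.read_csv('crime.csv')     # assumed file name

# One crime figure per neighbourhood, then merge it onto every matching Airbnb row.
crime_by_hood = crime.groupby('neighbourhood', as_index=False)['crime_rate'].first()
merged = airbnb.merge(crime_by_hood, on='neighbourhood', how='left')

With a left merge every Airbnb listing keeps its row and simply picks up the crime rate of its neighbourhood, so the many listings in one neighbourhood all get the same value.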

Most efficient way to group, count, then sort?

The data has two columns, City and People. I need to group by city and sum the people.
Table looks something like this (times a million):
City, People
Boston, 1000
Boston, 2000
New York, 2500
Chicago, 2000
In this case Boston would be number 1 with 3000 people. I would need to return the top 5% cities and their people count (sum).
What is the most efficient way to do this? Can pandas scale this up well? Should I keep track of the top 5% or do a sort at the end?
If you would prefer to use Python without external libraries, you could do as follows. First, open the file with the csv module. Then use the built-in sorted function to sort the rows by a custom key (the second element, cast to an int). Finally, grab the slice we want with [].
import csv, math

out = []
with open("data.csv", "r") as fi:
    inCsv = csv.reader(fi, delimiter=',')
    for row in inCsv:
        out.append([col.strip() for col in row])

# Skip the header row, sort by the People column as an integer, and keep the top 5%.
print(sorted(out[1:], key=lambda a: int(a[1]), reverse=True)[:int(math.ceil(len(out) * .05))])
groupby to get sums
rank to get percentiles
df = pd.read_csv('data.csv', skipinitialspace=True)
d1 = df.groupby('City').People.sum()
d1.loc[d1.rank(pct=True) >= .95]
City
Boston 3000
Name: People, dtype: int64
