I am currently trying to crack a programming puzzle involving a very simple dataframe host with two columns, city and amenities (both of object dtype). Entries in both columns can be repeated multiple times. Below are the first few entries of host:
City   Amenities                                                                                                            Price($)
NYC    {TV,"Wireless Internet","Air conditioning","Smoke detector",Essentials,"Lock on bedroom door"}                       8
LA     {"Wireless Internet",Kitchen,Washer,Dryer,"First aid kit",Essentials,"Hair dryer","translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"}   10
SF     {TV,"Cable TV",Internet,"Wireless Internet",Kitchen,"Free parking on premises","Pets live on this property",Dog(s),"Indoor fireplace","Buzzer/wireless intercom",Heating,Washer,Dryer,"Smoke detector","Carbon monoxide detector","First aid kit","Safety card","Fire extinguisher",Essentials,Shampoo,"24-hour check-in",Hangers,"Hair dryer",Iron,"Laptop friendly workspace","translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50","Self Check-In",Lockbox}   15
NYC    {"Wireless Internet","Air conditioning",Kitchen,Heating,"Suitable for events","Smoke detector","Carbon monoxide detector","First aid kit","Fire extinguisher",Essentials,Shampoo,"Lock on bedroom door",Hangers,"translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"}   20
LA     {TV,Internet,"Wireless Internet","Air conditioning",Kitchen,"Free parking on premises",Essentials,Shampoo,"translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"}
LA     {TV,"Cable TV",Internet,"Wireless Internet",Pool,Kitchen,"Free parking on premises",Gym,Breakfast,"Hot tub","Indoor fireplace",Heating,"Family/kid friendly",Washer,Dryer,"Smoke detector","Carbon monoxide detector",Essentials,Shampoo,"Lock on bedroom door",Hangers,"Private entrance"}   28
.....
Question. Output the city with the highest number of amenities.
My attempt. I tried grouping on the city column with host.groupby('city'). Now I need to count the number of elements in each set of Amenities. The len() function does not work directly because each entry is a single string rather than a set: for example, host['amenities'][0] is "{TV,\"Wireless Internet\",\"Air conditioning\",\"Smoke detector\",\"Carbon monoxide detector\",Essentials,\"Lock on bedroom door\",Hangers,Iron}", and applying len() to this output gives 134, the number of characters, which is clearly incorrect. I tried host['amenities'][0].strip('\n'), which seems to remove the backslashes, but len() still gives 134.
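To illustrate what I mean, a minimal check on that first row:
s = host['amenities'][0]
print(type(s), len(s))                 # <class 'str'> 134 -- len() counts characters

# Splitting the string first gives the number of amenities instead
# (none of the amenity names contain a comma, so a plain split is enough here):
print(len(s.strip('{}').split(',')))   # 9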
Can anyone please help me crack this problem?
My solution, inspired by ddejohn's solution:
### Transform each "string-type" entry in column "amenities" to "list" type
host["amenities"] = host["amenities"].str.replace('["{}]', "", regex=True).str.split(",")
## Create a new column that counts all the amenities for each row entry
host["am_count"] = [len(data) for data in host["amenities"]]
## Output the position (argmax) of the largest total after summing `am_count` grouped by `city`
host.groupby("city")["am_count"].agg("sum").argmax()
Solution
import functools
# Process the Amenities strings into sets of strings
host["amenities"] = host["amenities"].str.replace('["{}]', "", regex=True).str.split(",").apply(set)
# Groupby city, perform the set union to remove duplicates, and get count of unique amenities
amenities_by_city = host.groupby("city")["amenities"].apply(lambda x: len(functools.reduce(set.union, x))).reset_index()
Output:
city amenities
0 LA 27
1 NYC 17
2 SF 29
Getting the city with the max number of amenities is achieved with
city_with_most_amenities = amenities_by_city.query("amenities == amenities.max()")
Output:
city amenities
2 SF 29
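A slightly more direct variant of the same idea (just a sketch that skips the reset_index step) lets idxmax return the winning city label directly:
amenities_per_city = host.groupby("city")["amenities"].apply(
    lambda x: len(functools.reduce(set.union, x))
)
print(amenities_per_city.idxmax())  # "SF" for the sample rows shown above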
Related
I work with a lot of CSV data for my job. I am trying to use Pandas to take a member's 'Email' and populate the 'PrimaryMemberEmail' column in their spouse's row. Here is a sample of what I mean:
import pandas as pd
user_data = {'FirstName':['John','Jane','Bob'],
'Lastname':['Snack','Snack','Tack'],
'EmployeeID':['12345','12345S','54321'],
'Email':['John#issues.com','NaN','Bob#issues.com'],
'DOB':['09/07/1988','12/25/1990','07/13/1964'],
'Role':['Employee On Plan','Spouse On Plan','Employee Off Plan'],
'PrimaryMemberEmail':['NaN','NaN','NaN'],
'PrimaryMemberEmployeeId':['NaN','12345','NaN']
}
df = pd.DataFrame(user_data)
I have thousands of rows like this. I need to populate 'PrimaryMemberEmail' only when the user is a spouse, using the 'Email' of their associated primary holder. So in this case I would want to auto-populate 'PrimaryMemberEmail' for Jane Snack with that of her spouse, John Snack, which is 'John#issues.com'. I cannot find a good way to do this. Currently I am using:
p = -1  # p has to be initialised before the loop; -1 is assumed here (the next line just increments it by 1)
for i in df['EmployeeID']:
    p = p + len(df['EmployeeID']) - (len(df['EmployeeID']) - 1)
    EEID = df['EmployeeID'].iloc[p]
    if 'S' in EEID:
        df['PrimaryMemberEmail'].iloc[p] = df['Email'].iloc[p - 1]
What bothers me is that this only works if my file comes in correctly, like how I showed in the example DataFrame. Also my NaN values do not work with dropna() or other methods, but that is a question for another time.
I am new to python and programming. I am trying to add value to myself in my current health career and I find this all very fascinating. Any help is appreciated.
IIUC, map the values and fillna:
df['PrimaryMemberEmail'] = (df['PrimaryMemberEmployeeId']
                            .map(df.set_index('EmployeeID')['Email'])
                            .fillna(df['Email'])
                            )
Alternatively, if you have real NaNs (not strings), use boolean indexing:
df.loc[df['PrimaryMemberEmployeeId'].notna(),
       'PrimaryMemberEmail'] = df['PrimaryMemberEmployeeId'].map(df.set_index('EmployeeID')['Email'])
Output:
  FirstName Lastname EmployeeID         DOB               Role PrimaryMemberEmail PrimaryMemberEmployeeId
0      John    Snack      12345  09/07/1988   Employee On Plan    John#issues.com                     NaN
1      Jane    Snack     12345S  12/25/1990     Spouse On Plan    John#issues.com                   12345
2       Bob     Tack      54321  07/13/1964  Employee Off Plan     Bob#issues.com                     NaN
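As a follow-up on the 'NaN' strings mentioned in the question: dropna() and notna() only recognise real missing values, so one option is to convert the placeholder strings first. A minimal sketch:
import numpy as np

# Turn the literal 'NaN' strings into real missing values so that
# dropna()/notna()/fillna() behave as expected.
df = df.replace('NaN', np.nan)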
Here I want to search the values of the paper_title column within the reference column. If a title is matched/found as whole text, I want to get the _id of the reference row where it matched (not the _id of the paper_title row) and save that _id in the paper_present_in column.
In[1]:
d ={
"_id":
[
"Y100",
"Y100",
"Y100",
"Y101",
"Y101",
"Y101",
"Y102",
"Y102",
"Y102"
]
,
"paper_title":
[
"translation using information on dialogue participants",
"translation using information on dialogue participants",
"translation using information on dialogue participants",
"#emotional tweets",
"#emotional tweets",
"#emotional tweets",
"#supportthecause: identifying motivations to participate in online health campaigns",
"#supportthecause: identifying motivations to participate in online health campaigns",
"#supportthecause: identifying motivations to participate in online health campaigns"
]
,
"reference":
[
"beattie, gs (2005, november) #supportthecause: identifying motivations to participate in online health campaigns may 31, 2017, from",
"burton, n (2012, june 5) depressive realism retrieved may 31, 2017, from",
"gotlib, i h, 27 hammen, c l (1992) #supportthecause: identifying motivations to participate in online health campaigns new york: wiley",
"paul ekman 1992 an argument for basic emotions cognition and emotion, 6(3):169200",
"saif m mohammad 2012a #tagspace: semantic embeddings from hashtags in mail and books to appear in decision support systems",
"robert plutchik 1985 on emotion: the chickenand-egg problem revisited motivation and emotion, 9(2):197200",
"alastair iain johnston, rawi abdelal, yoshiko herrera, and rose mcdermott, editors 2009 translation using information on dialogue participants cambridge university press",
"j richard landis and gary g koch 1977 the measurement of observer agreement for categorical data biometrics, 33(1):159174",
"tomas mikolov, kai chen, greg corrado, and jeffrey dean 2013 #emotional tweets arxiv:13013781"
]
}
import pandas as pd
df=pd.DataFrame(d)
df
Out:
(the dataframe shown above)
Expected results:
And finally, the final result dataframe with unique values, where the paper_present_in column holds, as a list, all the _ids where the title is present in the reference column.
I tried the code below, but it returns the _id of the paper_title row being searched rather than the _id of the reference row where the match occurs. The expected result dataframe gives a clearer idea.
def return_id(paper_title, reference, _id):
    if (paper_title is None) or (reference is None):
        return None
    if paper_title in reference:
        return _id
    else:
        return None

df['paper_present_in'] = df.apply(lambda row: return_id(row['paper_title'], row['reference'], row['_id']), axis=1)
To solve your problem you'll need two dictionaries and a list to store some values temporarily.
# A list to store unique paper titles
unique_paper_title = []
# A dict to map each unique paper title to its _id
mapping_dict_paper_to_id = dict()
# A dict to map each matched row index to an _id
mapping_id_to_idx = dict()
# This gives us the list of unique paper titles
unique_paper_title = df["paper_title"].unique()
# Storing values in the dict mapping_dict_paper_to_id
for value in unique_paper_title:
    mapping_dict_paper_to_id[value] = df["_id"][df["paper_title"] == value].unique()[0]
# Storing values in the dict mapping_id_to_idx
for value in unique_paper_title:
    # This gives us the indexes of the rows whose reference contains the paper_title
    idx_list = df[df['reference'].str.contains(value)].index
    # Storing values in the dictionary
    for idx in idx_list:
        mapping_id_to_idx[idx] = mapping_dict_paper_to_id[value]
# This loop checks whether an index has a matched id and then updates the paper_present_in field accordingly
df['paper_present_in'] = None  # create the column before filling it
for i in df.index:
    if i in mapping_id_to_idx:
        df.loc[i, 'paper_present_in'] = mapping_id_to_idx[i]
    else:
        df.loc[i, 'paper_present_in'] = "None"
The above code checks and updates the searched values in the dataframe.
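If it helps, here is a more compact sketch of the same idea: build a plain title-to-_id lookup once, then scan every reference for any known title. The helper name find_papers is introduced here for illustration and is not from the original post, and it collects all matching _ids as a list:
# Map each unique paper title to its _id.
title_to_id = (df.drop_duplicates("paper_title")
                 .set_index("paper_title")["_id"]
                 .to_dict())

def find_papers(reference):
    """Return the _ids of all paper titles contained in this reference string."""
    matches = [pid for title, pid in title_to_id.items() if title in reference]
    return matches if matches else None

df["paper_present_in"] = df["reference"].apply(find_papers)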
I have a set of real estate ad data. Several of the lines describe the same property, so the data is full of duplicates that are not exactly identical. It looks like this:
ID URL CRAWL_SOURCE PROPERTY_TYPE NEW_BUILD DESCRIPTION IMAGES SURFACE LAND_SURFACE BALCONY_SURFACE ... DEALER_NAME DEALER_TYPE CITY_ID CITY ZIP_CODE DEPT_CODE PUBLICATION_START_DATE PUBLICATION_END_DATE LAST_CRAWL_DATE LAST_PRICE_DECREASE_DATE
0 22c05930-0eb5-11e7-b53d-bbead8ba43fe http://www.avendrealouer.fr/location/levallois... A_VENDRE_A_LOUER APARTMENT False Au rez de chaussée d'un bel immeuble récent,... ["https://cf-medias.avendrealouer.fr/image/_87... 72.0 NaN NaN ... Lamirand Et Associes AGENCY 54178039 Levallois-Perret 92300.0 92 2017-03-22T04:07:56.095 NaN 2017-04-21T18:52:35.733 NaN
1 8d092fa0-bb99-11e8-a7c9-852783b5a69d https://www.bienici.com/annonce/ag440414-16547... BIEN_ICI APARTMENT False Je vous propose un appartement dans la rue Col... ["http://photos.ubiflow.net/440414/165474561/p... 48.0 NaN NaN ... Proprietes Privees MANDATARY 54178039 Levallois-Perret 92300.0 92 2018-09-18T11:04:44.461 NaN 2019-06-06T10:08:10.89 2018-09-25
I want to delete rows that are too similar not to be duplicates and keep only one row, which gathers the CRAWL_SOURCE values of the deleted rows. For instance, let's say I want to keep one row per CRAWL_SOURCE when the descriptions or most of the images are alike. So far I have only found a way to create a new column that flags when descriptions are identical:
df['is_duplicated'] = df.duplicated(['DESCRIPTION'])
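A small follow-up on that flag (a pandas option worth noting, not from the original post): keep=False marks every row of a duplicate group rather than only the later occurrences, which is convenient when the goal is to gather values across the whole group.
# Flag all rows that belong to a group of identical descriptions.
df['is_duplicated'] = df.duplicated(['DESCRIPTION'], keep=False)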
Or when images are the same:
import ast
from io import BytesIO

import imagehash
import requests
from PIL import Image

def image_similarity(imageAurls, imageBurls):
    # The IMAGES column holds a string representation of a list of URLs.
    imageAurls = ast.literal_eval(imageAurls)
    imageBurls = ast.literal_eval(imageBurls)
    for urlA in imageAurls:
        responseA = requests.get(urlA)
        imgA = Image.open(BytesIO(responseA.content))
        for urlB in imageBurls:
            responseB = requests.get(urlB)
            imgB = Image.open(BytesIO(responseB.content))
            # Perceptual hashes: a small hash distance means visually similar images.
            hash0 = imagehash.average_hash(imgA)
            hash1 = imagehash.average_hash(imgB)
            cutoff = 5
            if hash0 - hash1 < cutoff:
                print(urlA)
                print(urlB)
                return 'similar'
    return 'not similar'

# Compare each ad's images with those of the previous row
# (the first row gets NaN here, which would need special handling before literal_eval).
df['NextImage'] = df['IMAGES'].shift()
df['IsSimilar'] = df.apply(lambda x: image_similarity(x['IMAGES'], x['NextImage']), axis=1)
Therefore, how can I delete rows that share the same description, or that share the same photos, and keep a single row which gathers the CRAWL_SOURCE values of the deleted rows?
More generally: how can I delete rows that share the same value in one column and produce a single row that gathers all the values of another column?
Note: if you have any other ideas for spotting ads that might describe the same property, I will be happy to hear them. I think two rows might refer to the same real estate if the following features are alike:
SURFACE LAND_SURFACE BALCONY_SURFACE TERRACE_SURFACE ROOM_COUNT BEDROOM_COUNT BATHROOM_COUNT LUNCHROOM_COUNT TOILET_COUNT FURNISHED FIREPLACE AIR_CONDITIONING GARDEN SWIMMING_POOL BALCONY TERRACE CELLAR PARKING PARKING_COUNT HEATING_TYPES HEATING_MODE FLOOR FLOOR_COUNT CONSTRUCTION_YEAR ELEVATOR CARETAKER ENERGY_CONSUMPTION GREENHOUSE_GAS_CONSUMPTION MARKETING_TYPE PRICE PRICE_M2
What you are looking for is a record linkage method, and this problem has been solved before.
I suggest a library that detects similarities using word-distance calculations and has decent documentation: the Python Record Linkage Toolkit.
Once you import the library, you must index the sources you intend to compare, something like this:
import recordlinkage

indexer = recordlinkage.Index()
# Using 'url' as the blocking key: only record pairs sharing the same url are compared.
indexer.block('url')
candidate_links = indexer.index(df_a, df_b)
c = recordlinkage.Compare()
Let's say you want to compare based on the similarities of strings that don't match exactly:
c.string('descriptionA', 'descriptionB', method='jarowinkler', threshold=0.85)
And if you want an exact match you should use something like:
c.exact('imageUrlA', 'imageUrlB')
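After declaring the comparisons, you would typically compute the feature matrix for the candidate pairs and keep the pairs that agree on enough of them. A minimal sketch (the threshold of one agreeing feature is an assumption, not from the original answer):
# One row of scores per candidate pair, one column per declared comparison.
features = c.compute(candidate_links, df_a, df_b)
# Keep the pairs that agree on at least one comparison.
matches = features[features.sum(axis=1) >= 1]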
Anyway, there are more resources (libraries) based on record linkage.
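Coming back to the general groupby question (collapsing rows that share a DESCRIPTION into one row that gathers their CRAWL_SOURCE values), a minimal sketch for the exact-duplicate case could look like the following; fuzzy duplicates would still need the record-linkage comparisons above:
# Collapse exact-duplicate descriptions: gather all CRAWL_SOURCE values as a
# deduplicated list and keep the first row's value for every other column.
agg_map = {col: 'first' for col in df.columns if col not in ('DESCRIPTION', 'CRAWL_SOURCE')}
agg_map['CRAWL_SOURCE'] = lambda s: sorted(set(s))
deduped = df.groupby('DESCRIPTION', as_index=False).agg(agg_map)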
I want to group a dataframe by a particular column and calculate the sum of the subgroups thus created, while retaining (displaying) all the records in each subgroup.
I am trying to create my own credit card expense tracking program. (I know there are several already available, but the idea is to learn Python.)
I have the usual fields of 'Merchant', 'Date', 'Type' and 'Amount'
I would like to do one of the following:
Group items by merchant, then within each such group, split the amount into two new columns, 'debit' and 'credit'. I also want to be able to sum the amounts under these columns. Repeat this for every merchant group.
If it is not possible to split based on the 'Type' of the transaction (that is, as 'debit' and 'credit'), then I want to be able to sum the debits and credits SEPARATELY and also retain the line items (while displaying, that is). Doing a sum() on the 'Amount' column gives just one number for each merchant, and I verified that it is an incorrect amount.
My data frame looks like this:
Posted_Date Amount Type Merchant
0 04/20/2019 -89.70 Debit UNI
1 04/20/2019 -6.29 Debit BOOKM
2 04/20/2019 -36.42 Debit BROOKLYN
3 04/18/2019 -20.95 Debit MTA*METROCARD
4 04/15/2019 -29.90 Debit ZARA
5 04/15/2019 -7.70 Debit STILES
The code I have, after reading the data into a data frame and marking each transaction as credit or debit, is:
merch_new = df_new.groupby(['Merchant', 'Type'])
merch_new.groups
for key, values in merch_new.groups.items():
    df_new.loc[values, 'Amount'].sum()   # subgroup total (the result is not used yet)
    print(df_new.loc[values], "\n\n")
I was able to split it the way below:
Posted_Date Amount Type Merchant
217 05/23/2019 -41.70 Debit AT
305 04/27/2019 -12.40 Debit AT
Posted_Date Amount Type Merchant
127 07/08/2019 69.25 Credit AT
162 06/21/2019 139.19 Credit AT
Ideally, I would like something like the output below: the line items are displayed along with a total for the given subgroup, in this case for merchant 'AT', and ideally sorted by date.
Date Merchant Credit Debit
305 4/27/2019 AT 0 -12.4
217 5/23/2019 AT 0 -41.7
162 6/21/2019 AT 139.19 0
127 7/8/2019 AT 69.25 0
208.44 -54.1
It appears simple, but I am unable to format it in this way.
EDIT:
I get an error for rename_axis():
rename_axis() got an unexpected keyword argument 'index'
and if I delete the index argument, I get the same error for 'columns'
I searched a lot for the usage (like Benoit showed) but I cannot find any. They all used strings or lists. I tried using:
rename_axis(None,None)
and I get the error:
ValueError: No axis named None for object type <class 'pandas.core.frame.DataFrame'>
I don't know if this is because of the python version I am using (3.6.6). I tried on both Spyder and Jupyter. But I get the same error.
I used:
rename_axis(None, axis=1) and I seem to get the desired results (sort of).
But I am unable to understand how this call is interpreted, since no keyword specifies which argument the None is bound to. Can anyone please explain?
Any help is appreciated!
Thanks a lot!
I think you are trying to achieve something like this:
In [1]:
## Create example
import pandas as pd
cols = ['Posted_Date', 'Amount', 'Type', 'Merchant']
data = [['04/20/2019', -89.70, 'Debit', 'UNI'],
['04/20/2019', -6.29, 'Credit', 'BOOKM'],
['04/20/2019', -36.42, 'Debit', 'BROOKLYN'],
['04/20/2019', -6.29, 'Credit', 'BOOKM'],
['04/20/2019', -54.52, 'Credit', 'BROOKLYN'],
['04/18/2019', -20.95, 'Credit', 'BROOKLYN']]
df = pd.DataFrame(columns=cols, data=data)
## Pivot Table with aggregation function ='sum'
df_final = pd.pivot_table(df, values='Amount', index=['Posted_Date', 'Merchant'],
columns=['Type'], aggfunc='sum').fillna(0).reset_index().rename_axis(index=None, columns=None)
df_final['Total'] = df_final['Debit'] + df_final['Credit']
Out [1]:
Posted_Date Merchant Credit Debit Total
0 04/18/2019 BROOKLYN -20.95 0.00 -20.95
1 04/20/2019 BOOKM -12.58 0.00 -12.58
2 04/20/2019 BROOKLYN -54.52 -36.42 -90.94
3 04/20/2019 UNI 0.00 -89.70 -89.70
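Regarding the EDIT about rename_axis: in rename_axis(None, axis=1), the None is passed positionally as the mapper argument and axis=1 points it at the columns axis, so the call clears the columns name. The separate index=/columns= keywords only exist in newer pandas releases (added around 0.24, if I remember correctly), which would explain the "unexpected keyword argument" error on an older install. A small sketch of the two equivalent spellings:
# Positional mapper plus axis: works on older pandas versions too.
df_final = df_final.rename_axis(None, axis=1)
# Keyword form: needs a pandas version that supports the index=/columns= keywords.
# df_final = df_final.rename_axis(index=None, columns=None)
Separately, if you also want the per-merchant totals row from the desired output, pd.pivot_table accepts margins=True, which appends an 'All' row (and column) with the totals.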
I have the following two dataframes:
url='https://raw.githubusercontent.com/108michael/ms_thesis/master/rgdp_catcode.merge'
df=pd.read_csv(url, index_col=0)
df.head(1)
naics catcode GeoName Description ComponentName year GDP state
0 22 E1600',\t'E1620',\t'A4000',\t'E5000',\t'E3000'... Alabama Utilities Real GDP by state 2004 5205 AL
url='https://raw.githubusercontent.com/108michael/ms_thesis/master/mpl.Bspons.merge'
df1=pd.read_csv(url, index_col=0)
df1.head(1)
state year unemployment log_diff_unemployment id.thomas party type date bills id.fec years_exp session name disposition catcode
0 AK 2006 6.6 -0.044452 1440 Republican sen 2006-05-01 s2686-109 S2AK00010 39 109 National Cable & Telecommunications Association support C4500
Regarding df, I had to manually input the catcode values, and I think that is why the formatting is off. What I would like is to simply have the values without the \t prefix. I want to merge the dfs on catcode, state, and year. In an earlier test, a df1.catcode with only one value per cell was matched against another df.catcode with more than one value per cell, and the merge worked.
So technically, all I need to do is lose the \t before each consecutive value in df.catcode. Additionally, if anyone has done a merge of this sort before, any caveats learned through experience would be appreciated. My merge code looks like this:
mplmerge=pd.merge(df1,df, on=(['catcode', 'state', 'year']), how='left' )
I think this can be done with a regex method; I'm looking at the documentation now.
Cleaning the catcode column in df is rather straightforward:
catcode_fixed = df.catcode.str.findall('[A-Z][0-9]{4}')
This will produce a series with a list of catcodes in every row:
catcode_fixed.head(3)
Out[195]:
0 [E1600, E1620, A4000, E5000, E3000, E1000]
1 [X3000, X3200, L1400, H6000, X5000]
2 [X3000, X3200, L1400, H6000, X5000]
Name: catcode, dtype: object
If I understand correctly what you want, then you need to "ungroup" these lists. Here is the trick, in short:
catcode_fixed = catcode_fixed.apply(pd.Series).stack()
catcode_fixed.index = catcode_fixed.index.droplevel(-1)
So, we've got (note the index values):
catcode_fixed.head(12)
Out[206]:
0 E1600
0 E1620
0 A4000
0 E5000
0 E3000
0 E1000
1 X3000
1 X3200
1 L1400
1 H6000
1 X5000
2 X3000
dtype: object
Now, dropping the old catcode and joining in the new one:
df.drop('catcode',axis = 1, inplace = True)
catcode_fixed.name = 'catcode'
df = df.join(catcode_fixed)
By the way, you may also need to use df1.reset_index() when merging the data frames.
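As a side note, not part of the original answer: newer pandas versions (0.25 and later, if I recall correctly) have Series.explode, which does the same "ungrouping" in a single step while keeping the original index. Starting again from the original df, before the drop/join above:
# One-step alternative to apply(pd.Series).stack() followed by droplevel:
catcode_fixed = df.catcode.str.findall('[A-Z][0-9]{4}').explode()
catcode_fixed.name = 'catcode'
df = df.drop('catcode', axis=1).join(catcode_fixed)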