I have data that looks like this:
Field Value
0 CRD 146099
1 LegalName CHUNG, BUCK CHWEE
2 BusName PRINCIPA FINANCIAL ADVISORS
3 URL https://adviserinfo.sec.gov/IAPD/content/ViewF...
4 CRD 170701
5 LegalName MESSINA AND ASSOCIATES, INC
6 BusName FINANCIAL RESOURCES GROUP
7 URL https://adviserinfo.sec.gov/IAPD/content/ViewF...
8 CRD 133630
9 LegalName ALAN EDELMAN
10 BusName EDELMAN, ALAN
11 URL https://adviserinfo.sec.gov/IAPD/content/ViewF...
12 CRD 131792
13 LegalName RESOURCE MANAGEMENT LLC
14 BusName RESOURCE MANAGEMENT LLC
15 URL https://adviserinfo.sec.gov/IAPD/content/ViewF...
How can I convert it so that CRD, LegalName, BusName, and URL are the columns? I tried pd.melt, but it doesn't seem to be what I'm looking for.
First use str.split to separate the two columns, then build a counter Series with cumcount, create a MultiIndex with set_index, and reshape with unstack:
df[['Field','Value']] = df['Value'].str.split(n=1, expand=True)
groups = df.groupby('Field').cumcount()
df = df.set_index([groups, 'Field'])['Value'].unstack()
print(df)
Field BusName CRD LegalName \
0 PRINCIPA FINANCIAL ADVISORS 146099 CHUNG, BUCK CHWEE
1 FINANCIAL RESOURCES GROUP 170701 MESSINA AND ASSOCIATES, INC
2 EDELMAN, ALAN 133630 ALAN EDELMAN
3 RESOURCE MANAGEMENT LLC 131792 RESOURCE MANAGEMENT LLC
Field URL
0 https://adviserinfo.sec.gov/IAPD/content/ViewF...
1 https://adviserinfo.sec.gov/IAPD/content/ViewF...
2 https://adviserinfo.sec.gov/IAPD/content/ViewF...
3 https://adviserinfo.sec.gov/IAPD/content/ViewF...
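For reference, a minimal self-contained sketch of the same pipeline; the single raw 'Value' column and the sample strings below are assumptions based on the snippet above:
import pandas as pd

# Assumed raw layout: one 'Value' column holding "Field value" strings
raw = pd.DataFrame({'Value': [
    'CRD 146099',
    'LegalName CHUNG, BUCK CHWEE',
    'BusName PRINCIPA FINANCIAL ADVISORS',
    'URL https://adviserinfo.sec.gov/IAPD/content/ViewF...',
]})

raw[['Field', 'Value']] = raw['Value'].str.split(n=1, expand=True)  # split on the first space
groups = raw.groupby('Field').cumcount()                            # record number within each Field
wide = raw.set_index([groups, 'Field'])['Value'].unstack()          # pivot Field values into columns
print(wide)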
I think you're looking for DataFrame.transpose
I've got a dataframe that's currently aggregated by zip code, and looks similar to this:
Year Organization State Zip Number_of_people
2021 A NJ 07090 5
2020 B AZ 09876 3
2021 A NJ 01234 2
2021 C VA 23456 7
2019 A NJ 05385 1
I want to aggregate the dataframe and the Number_of_people column by state instead, combining rows that are identical aside from Number_of_people, so that the data above looks like this:
Year Organization State Number_of_people
2021 A NJ 7
2020 B AZ 3
2021 C VA 7
2019 A NJ 1
In other words, if rows are identical in all columns EXCEPT Number_of_people, I want to combine the rows and add up the Number_of_people values.
I'm stuck on how to approach this problem after deleting the Zip column -- I think I need to group by Year, Organization, and State, but I'm not sure what to do after that.
A more pythonic version, without zip codes:
df.groupby(['Year','Organization','State'], as_index=False)['Number_of_people'].sum()
A more pythonic version without dropping the zip codes first (note that a numeric Zip column will also be summed):
df.groupby(['Year','Organization','State'], as_index=False).sum()
You don't have to drop Zip first if you don't want to; use the syntax below.
import io
import pandas as pd

data = '''Year Organization State Zip Number_of_people
2021 A NJ 07090 5
2020 B AZ 09876 3
2021 A NJ 01234 2
2021 C VA 23456 7
2019 A NJ 05385 1'''
df = pd.read_csv(io.StringIO(data), sep=r'\s+', engine='python', dtype={'Zip': str})  # keep leading zeros in Zip
df[['Year','Organization','State', 'Number_of_people']].groupby(['Year','Organization','State']).sum().reset_index()
Output
Year Organization State Number_of_people
0 2019 A NJ 1
1 2020 B AZ 3
2 2021 A NJ 7
3 2021 C VA 7
If you do want to drop the Zip column first, you can skip the column selection and use this:
df.groupby(['Year','Organization','State']).sum().reset_index()
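For completeness, a short sketch of that drop-first variant, using the same column names as above (the df_no_zip name is just for illustration):
df_no_zip = df.drop(columns='Zip')  # remove the Zip column first
df_no_zip.groupby(['Year', 'Organization', 'State']).sum().reset_index()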
My first data frame:
import pandas as pd

product = pd.DataFrame({
'Product_ID':[101,102,103,104,105,106,107,101],
'Product_name':['Watch','Bag','Shoes','Smartphone','Books','Oil','Laptop','New Watch'],
'Category':['Fashion','Fashion','Fashion','Electronics','Study','Grocery','Electronics','Electronics'],
'Price':[299.0,1350.50,2999.0,14999.0,145.0,110.0,79999.0,9898.0],
'Seller_City':['Delhi','Mumbai','Chennai','Kolkata','Delhi','Chennai','Bengalore','New York']
})
My second data frame has the transactions:
customer = pd.DataFrame({
'id':[1,2,3,4,5,6,7,8,9],
'name':['Olivia','Aditya','Cory','Isabell','Dominic','Tyler','Samuel','Daniel','Jeremy'],
'age':[20,25,15,10,30,65,35,18,23],
'Product_ID':[101,0,106,0,103,104,0,0,107],
'Purchased_Product':['Watch','NA','Oil','NA','Shoes','Smartphone','NA','NA','Laptop'],
'City':['Mumbai','Delhi','Bangalore','Chennai','Chennai','Delhi','Kolkata','Delhi','Mumbai']
})
I want Price from the first data frame to appear in the merged dataframe, with 'Product_ID' as the common key. Note that for Product_ID 101 there are two prices, 299.0 and 9898.0. I want the latter one in the merged data set, i.e. 9898.0, since it is the latest price.
Currently my code is not giving the right answer; it returns both prices:
customerpur = pd.merge(customer, product[['Price', 'Product_ID']], on='Product_ID', how='left')
customerpur
id name age Product_ID Purchased_Product City Price
0 1 Olivia 20 101 Watch Mumbai 299.0
1 1 Olivia 20 101 Watch Mumbai 9898.0
There is no explicit timestamp, so I assume row order in the dataframe reflects recency. You can drop duplicates at the end:
customerpur.drop_duplicates(subset=['id'], keep='last')
result:
id name age Product_ID Purchased_Product City Price
1 1 Olivia 20 101 Watch Mumbai 9898.0
2 2 Aditya 25 0 NA Delhi NaN
3 3 Cory 15 106 Oil Bangalore 110.0
4 4 Isabell 10 0 NA Chennai NaN
5 5 Dominic 30 103 Shoes Chennai 2999.0
6 6 Tyler 65 104 Smartphone Delhi 14999.0
7 7 Samuel 35 0 NA Kolkata NaN
8 8 Daniel 18 0 NA Delhi NaN
9 9 Jeremy 23 107 Laptop Mumbai 79999.0
Note the keep='last' argument, since we keep only the last price registered.
Deduplication should be done before merging if you care about performance or the dataset is huge:
product = product.drop_duplicates(subset=['Product_ID'], keep='last')
In your data frame there is no indicator of the latest entry, so you might need to first remove the earlier entry for Product_ID 101 from the product dataframe as follows:
result_product = product.drop_duplicates(subset=['Product_ID'], keep='last')
This keeps the last entry per Product_ID, and then you can do the merge:
pd.merge(result_product, customer, on='Product_ID')
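A sketch combining both steps while keeping every customer row, as in the original left-join attempt (assuming the last occurrence per Product_ID is the latest price):
latest = product.drop_duplicates(subset=['Product_ID'], keep='last')  # one (latest) price per product
customerpur = pd.merge(customer, latest[['Product_ID', 'Price']], on='Product_ID', how='left')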
I have a pandas dataframe df which contains:
major men women rank
Art 5 4 1
Art 3 5 3
Art 2 4 2
Engineer 7 8 3
Engineer 7 4 4
Business 5 5 4
Business 3 4 2
Basically I need to find the total number of students, men and women combined, per major, regardless of the rank column. For Art, for example, the total of all men + women should be 23; Engineer 26, Business 17.
I have tried
df.groupby(['major']).sum()
But this separately sums the men and women rather than combining their totals.
Just add both columns and then groupby:
(df.men+df.women).groupby(df.major).sum()
major
Art 23
Business 17
Engineer 26
dtype: int64
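If a regular DataFrame with a named total column is preferred, one possibility is to add the combined column first and group on that (a sketch; the total name is made up here):
df.assign(total=df.men + df.women).groupby('major', as_index=False)['total'].sum()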
melt() then groupby():
df.drop(columns='rank').melt('major').groupby('major', as_index=False)['value'].sum()
major value
0 Art 23
1 Business 17
2 Engineer 26
I have lists that are categorized by name, such as:
dining = ['CARLS', 'SUBWAY', 'PIZZA']
bank = ['TRANSFER', 'VENMO', 'SAVE AS YOU GO']
and I want to fill a new column with the category name if any of those strings are found in another column. As an example from my other question here, I have the following data set (an example list of bank transactions):
import pandas as pd
import numpy as np
dining = ['CARLS', 'SUBWAY', 'PIZZA']
bank = ['TRANSFER', 'VENMO', 'SAVE AS YOU GO']
data = [
[-68.23 , 'PAYPAL TRANSFER'],
[-12.46, 'RALPHS #0079'],
[-8.51, 'SAVE AS YOU GO'],
[25.34, 'VENMO CASHOUT'],
[-2.23 , 'PAYPAL TRANSFER'],
[-64.29 , 'PAYPAL TRANSFER'],
[-7.06, 'SUBWAY'],
[-7.03, 'CARLS JR'],
[-2.35, 'SHELL OIL'],
[-35.23, 'CHEVRON GAS']
]
df = pd.DataFrame(data, columns=['amount', 'details'])
df['category'] = np.nan
df
amount details category
0 -68.23 PAYPAL TRANSFER NaN
1 -12.46 RALPHS #0079 NaN
2 -8.51 SAVE AS YOU GO NaN
3 25.34 VENMO CASHOUT NaN
4 -2.23 PAYPAL TRANSFER NaN
5 -64.29 PAYPAL TRANSFER NaN
6 -7.06 SUBWAY NaN
7 -7.03 CARLS JR NaN
8 -2.35 SHELL OIL NaN
9 -35.23 CHEVRON GAS NaN
Is there an efficient way for me to update the category column to either 'dining' or 'bank' based on whether the strings in each list are found in df['details']?
I.e. Desired Output:
amount details category
0 -68.23 PAYPAL TRANSFER bank
1 -12.46 RALPHS #0079 NaN
2 -8.51 SAVE AS YOU GO bank
3 25.34 VENMO CASHOUT bank
4 -2.23 PAYPAL TRANSFER bank
5 -64.29 PAYPAL TRANSFER bank
6 -7.06 SUBWAY dining
7 -7.03 CARLS JR dining
8 -2.35 SHELL OIL NaN
9 -35.23 CHEVRON GAS NaN
Based on my previous question, so far I'm assuming I need to work with a new list that I create using str.extract.
We can do this with np.select since we have multiple conditions:
dining = '|'.join(dining)
bank = '|'.join(bank)
conditions = [
    df['details'].str.contains(dining),
    df['details'].str.contains(bank)
]
choices = ['dining', 'bank']
df['category'] = np.select(conditions, choices, default=np.nan)
amount details category
0 -68.23 PAYPAL TRANSFER bank
1 -12.46 RALPHS #0079 nan
2 -8.51 SAVE AS YOU GO bank
3 25.34 VENMO CASHOUT bank
4 -2.23 PAYPAL TRANSFER bank
5 -64.29 PAYPAL TRANSFER bank
6 -7.06 SUBWAY dining
7 -7.03 CARLS JR dining
8 -2.35 SHELL OIL nan
9 -35.23 CHEVRON GAS nan
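Note that np.select with string choices and default=np.nan produces the literal string 'nan' (visible in the printout above). If real missing values are wanted instead, one option is to pass None as the default (a sketch):
df['category'] = np.select(conditions, choices, default=None)  # None stays a proper missing value in the object column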
You can do this with findall + a dict map:
sub = {**dict.fromkeys(dining, 'dining'), **dict.fromkeys(bank, 'bank')}
df.details.str.findall('|'.join(sub)).str[0].map(sub)
Out[146]:
0 bank
1 NaN
2 bank
3 bank
4 bank
5 bank
6 dining
7 dining
8 NaN
9 NaN
Name: details, dtype: object
#df['category'] = df.details.str.findall('|'.join(sub)).str[0].map(sub)
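For reference, the str.extract route mentioned in the question also works, reusing the sub mapping from the answer above (a sketch):
pattern = '(' + '|'.join(sub) + ')'                                   # one capture group with all keywords
df['category'] = df['details'].str.extract(pattern, expand=False).map(sub)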
My dataset is based on the results of Food Inspections in the City of Chicago.
import pandas as pd
df = pd.read_csv("C:/~/Food_Inspections.csv")
df.head()
Out[1]:
Inspection ID DBA Name \
0 1609238 JR'SJAMAICAN TROPICAL CAFE,INC
1 1609245 BURGER KING
2 1609237 DUNKIN DONUTS / BASKIN ROBINS
3 1609258 CHIPOTLE MEXICAN GRILL
4 1609244 ATARDECER ACAPULQUENO INC.
AKA Name License # Facility Type Risk \
0 NaN 2442496.0 Restaurant Risk 1 (High)
1 BURGER KING 2411124.0 Restaurant Risk 2 (Medium)
2 DUNKIN DONUTS / BASKIN ROBINS 1717126.0 Restaurant Risk 2 (Medium)
3 CHIPOTLE MEXICAN GRILL 1335044.0 Restaurant Risk 1 (High)
4 ATARDECER ACAPULQUENO INC. 1910118.0 Restaurant Risk 1 (High)
Here is how often each facility type appears in the dataset:
df['Facility Type'].value_counts()
Out[3]:
Restaurant 14304
Grocery Store 2647
School 1155
Daycare (2 - 6 Years) 367
Bakery 316
Children's Services Facility 262
Daycare Above and Under 2 Years 248
Long Term Care 169
Daycare Combo 1586 142
Catering 123
Liquor 78
Hospital 68
Mobile Food Preparer 67
Golden Diner 65
Mobile Food Dispenser 51
Special Event 25
Shared Kitchen User (Long Term) 22
Daycare (Under 2 Years) 18
I am trying to create a new set of data containing those rows whose Facility Type has over 50 occurrences in the dataset. How would I approach this?
Please note the actual list of facility counts is MUCH larger; I have cut out most of it because it did not contribute to the question at hand (so simply removing occurrences of "Special Event", "Shared Kitchen User", and "Daycare" is not what I'm looking for).
IIUC, you want GroupBy.filter:
df.groupby('Facility Type').filter(lambda x: len(x) > 50)
Example:
In [9]:
df = pd.DataFrame({'type':list('aabcddddee'), 'value':np.random.randn(10)})
df
Out[9]:
type value
0 a -0.160041
1 a -0.042310
2 b 0.530609
3 c 1.238046
4 d -0.754779
5 d -0.197309
6 d 1.704829
7 d -0.706467
8 e -1.039818
9 e 0.511638
In [10]:
df.groupby('type').filter(lambda x: len(x) > 1)
Out[10]:
type value
0 a -0.160041
1 a -0.042310
4 d -0.754779
5 d -0.197309
6 d 1.704829
7 d -0.706467
8 e -1.039818
9 e 0.511638
Not tested, but should work.
FT = df['Facility Type'].value_counts()
df[df['Facility Type'].isin(FT.index[FT > 50])]
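A related variant that avoids the Python-level lambda in filter is to broadcast group sizes with transform; also not tested here, but in the same spirit:
df[df.groupby('Facility Type')['Facility Type'].transform('size') > 50]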