Add column and fill missing value based on other columns in Pandas - python

For the following input data, I need to fill in the missing office_number values and add a column that indicates whether each office_number is original or was filled in afterwards.
Here is the example data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': ['1010084420', '1010084420', '1010084420', '1010084421', '1010084421', '1010084421', '1010084425'],
                   'building_name': ['A', 'A', 'A', 'East Tower', 'East Tower', 'West Tower', 'T1'],
                   'floor': ['1', '1', '2', '10', '10', '11', '11'],
                   'office_number': ['', '', '205', '', '', '', '1101-1105'],
                   'company_name': ['Ariel Resources Ltd.', 'A.O. Tatneft', '', 'Agrium Inc.', 'Creo Products Inc.', 'Cott Corp.', 'Creo Products Inc.']})
print(df)
Output:
           id building_name floor office_number         company_name
0  1010084420             A     1               Ariel Resources Ltd.
1  1010084420             A     1                       A.O. Tatneft
2  1010084420             A     2           205
3  1010084421    East Tower    10                        Agrium Inc.
4  1010084421    East Tower    10                 Creo Products Inc.
5  1010084421    West Tower    11                         Cott Corp.
6  1010084425            T1    11     1101-1105  Creo Products Inc.
I need to fill office_number when it is empty, per office with the same id, building_name and floor, using the rule: value of floor + 'F' + a running counter 001, 002, 003, etc. I also need to create a column office_num_status that holds 'original' when office_number is not empty and 'filled' otherwise.
This is the final expected result:
id building_name floor office_num_status office_number \
0 1010084420 A 1 filled 1F001
1 1010084420 A 1 filled 1F002
2 1010084420 A 2 original 205
3 1010084421 East Tower 10 filled 10F001
4 1010084421 East Tower 10 filled 10F002
5 1010084421 West Tower 11 filled 11F001
6 1010084425 T1 11 original 1101-1105
company_name
0 Ariel Resources Ltd.
1 A.O. Tatneft
2
3 Agrium Inc.
4 Creo Products Inc.
5 Cott Corp.
6 Creo Products Inc.
What I have done so far is create the office_num_status column, but all of its values come out as 'original':
# method 1
df['office_num_status'] = np.where(df['office_number'].isnull(), 'filled', 'original')
# method 2
df['office_num_status'] = ['filled' if x is None else 'original' for x in df['office_number']]
# method 3
df['office_num_status'] = 'filled'
df.loc[df['office_number'] is not None, 'office_num_status'] = 'original'
Could someone help me finish this? Thanks a lot.

Compare against the empty string instead of a missing value, add a per-group counter with GroupBy.cumcount, and use it to fill the empty values:
# rows where office_number is an empty string (not NaN)
mask = df['office_number'] == ''
df.insert(3, 'office_num_status', np.where(mask, 'filled', 'original'))
# running counter within each (id, building_name, floor) group, zero-padded to 3 digits
s = df.groupby(['id','building_name','floor']).cumcount().add(1).astype(str).str.zfill(3)
df.loc[mask, 'office_number'] = df['floor'].astype(str) + 'F' + s
print (df)
id building_name floor office_num_status office_number \
0 1010084420 A 1 filled 1F001
1 1010084420 A 1 filled 1F002
2 1010084420 A 2 original 205
3 1010084421 East Tower 10 filled 10F001
4 1010084421 East Tower 10 filled 10F002
5 1010084421 West Tower 11 filled 11F001
6 1010084425 T1 11 original 1101-1105
company_name
0 Ariel Resources Ltd.
1 A.O. Tatneft
2
3 Agrium Inc.
4 Creo Products Inc.
5 Cott Corp.
6 Creo Products Inc.
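If the blanks arrive as real NaN values rather than empty strings (for example after reading the data from a file), a minimal sketch of the same idea, assuming the same columns as above:
import numpy as np
import pandas as pd

# same approach, but testing for NaN instead of ''
mask = df['office_number'].isna()
df.insert(3, 'office_num_status', np.where(mask, 'filled', 'original'))
s = df.groupby(['id', 'building_name', 'floor']).cumcount().add(1).astype(str).str.zfill(3)
df.loc[mask, 'office_number'] = df['floor'].astype(str) + 'F' + s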

Get the number of IDs that have the same combination of distinct values in the 'locations' column

I have a table with ids and locations they have been to.
id  Location
1   Maryland
1   Iowa
2   Maryland
2   Texas
3   Georgia
3   Iowa
4   Maryland
4   Iowa
5   Maryland
5   Iowa
5   Texas
I'd like to perform a query that would allow me to get the number of ids per combination.
In this example table, the output would be -
Maryland, Iowa - 2
Maryland, Texas - 1
Georgia, Iowa - 1
Maryland, Iowa, Texas - 1
My original thought was to add up the ASCII values of the distinct locations for each id, see how many ids share each sum, and work out which combination corresponds to each value. I was not able to do that, as SQL Server would not let me cast an nvarchar to a numeric data type. Is there any other way I could use SQL to get the number of ids per combination? Using Python to get the number of ids per combination is also acceptable; however, SQL is preferred.
If you want to solve this in SQL and you are running SQL Server 2017 or later, you can use a CTE to aggregate the locations for each id using STRING_AGG, and then count the occurrences of each aggregated string:
WITH all_locations AS (
    SELECT STRING_AGG(Location, ', ') WITHIN GROUP (ORDER BY Location) AS aloc
    FROM locations
    GROUP BY id
)
SELECT aloc, COUNT(*) AS cnt
FROM all_locations
GROUP BY aloc
ORDER BY cnt, aloc
Output:
aloc cnt
Georgia, Iowa 1
Iowa, Maryland, Texas 1
Maryland, Texas 1
Iowa, Maryland 2
Note that I have applied an ordering to the STRING_AGG to ensure that someone who visits Maryland and then Iowa is treated the same way as someone who visits Iowa and then Maryland. If this is not the desired behaviour, simply delete the WITHIN GROUP clause.
Demo on dbfiddle
Use groupby + agg + value_counts:
new_df = df.groupby('id')['Location'].agg(list).str.join(', ').value_counts().reset_index()
Output:
>>> new_df
index Location
0 Maryland, Iowa 2
1 Maryland, Texas 1
2 Georgia, Iowa 1
3 Maryland, Iowa, Texas 1
Let's do groupby with a string join, then value_counts:
df.groupby('id')['Location'].agg(', '.join).value_counts()
Out[938]:
join
Maryland, Iowa 2
Georgia, Iowa 1
Maryland, Iowa, Texas 1
Maryland, Texas 1
dtype: int64
Aggregate to a frozenset so that groups with the same locations in a different order are counted as one:
df.groupby('id')['Location'].agg(frozenset).value_counts()
Output:
(Maryland, Iowa) 2
(Texas, Maryland) 1
(Georgia, Iowa) 1
(Texas, Maryland, Iowa) 1
Name: Location, dtype: int64
Or a sorted string join:
df.groupby('id')['Location'].agg(lambda x: ', '.join(sorted(x))).value_counts()
output:
Iowa, Maryland 2
Maryland, Texas 1
Georgia, Iowa 1
Iowa, Maryland, Texas 1
Name: Location, dtype: int64
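If a DataFrame with named columns is preferred over a Series, a sketch building on the sorted-join approach above (the column names here are just illustrative):
counts = (df.groupby('id')['Location']
            .agg(lambda x: ', '.join(sorted(x)))
            .value_counts()
            .rename_axis('combination')
            .reset_index(name='num_ids'))
print(counts)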

Folium FeatureGroup in Python

I am trying to create maps using a Folium FeatureGroup, where each feature group comes from rows of a pandas DataFrame. I am able to achieve this when there is a single entry in the DataFrame, but when there is more than one entry and I loop through it in a for loop, I am not able to achieve what I want. Please find the Python code below.
from folium import Map, FeatureGroup, Marker, LayerControl
mapa = Map(location=[35.11567262307692,-89.97423444615382], zoom_start=12,
tiles='Stamen Terrain')
feature_group1 = FeatureGroup(name='Tim')
feature_group2 = FeatureGroup(name='Andrew')
feature_group1.add_child(Marker([35.035075, -89.89969], popup='Tim'))
feature_group2.add_child(Marker([35.821835, -90.70503], popup='Andrew'))
mapa.add_child(feature_group1)
mapa.add_child(feature_group2)
mapa.add_child(LayerControl())
mapa
My dataframe contains the following:
Name Address
0 Dollar Tree #2020 3878 Goodman Rd.
1 Dollar Tree #2020 3878 Goodman Rd.
2 National Guard Products Inc 4985 E Raines Rd
3 434 SAVE A LOT C MID WEST 434 Kelvin 3240 Jackson Ave
4 WALGREENS 06765 108 E HIGHLAND DR
5 Aldi #69 4720 SUMMER AVENUE
6 Richmond, Christopher 1203 Chamberlain Drive
City State Zipcode Group
0 Horn Lake MS 38637 Johnathan Shaw
1 Horn Lake MS 38637 Tony Bonetti
2 Memphis TN 38118 Tony Bonetti
3 Memphis TN 38122 Tony Bonetti
4 JONESBORO AR 72401 Josh Jennings
5 Memphis TN 38122 Josh Jennings
6 Memphis TN 38119 Josh Jennings
full_address Color sequence \
0 3878 Goodman Rd.,Horn Lake,MS,38637,USA blue 1
1 3878 Goodman Rd.,Horn Lake,MS,38637,USA cadetblue 1
2 4985 E Raines Rd,Memphis,TN,38118,USA cadetblue 2
3 3240 Jackson Ave,Memphis,TN,38122,USA cadetblue 3
4 108 E HIGHLAND DR,JONESBORO,AR,72401,USA yellow 1
5 4720 SUMMER AVENUE,Memphis,TN,38122,USA yellow 2
6 1203 Chamberlain Drive,Memphis,TN,38119,USA yellow 3
Latitude Longitude
0 34.962637 -90.069019
1 34.962637 -90.069019
2 35.035367 -89.898428
3 35.165115 -89.952624
4 35.821835 -90.705030
5 35.148707 -89.903760
6 35.098829 -89.866838
However, when I try to loop through the DataFrame in a for loop, I am not able to achieve what I need:
from folium import Map, FeatureGroup, Marker, LayerControl
from folium import plugins

mapa = Map(location=[35.11567262307692, -89.97423444615382], zoom_start=12, tiles='Stamen Terrain')
#mapa.add_tile_layer()
for i in range(0, len(df_addresses)):
    feature_group = FeatureGroup(name=df_addresses.iloc[i]['Group'])
    feature_group.add_child(Marker([df_addresses.iloc[i]['Latitude'], df_addresses.iloc[i]['Longitude']],
                                   popup=('Address: ' + str(df_addresses.iloc[i]['full_address']) + '<br>'
                                          'Tech: ' + str(df_addresses.iloc[i]['Group'])),
                                   icon=plugins.BeautifyIcon(
                                       number=str(df_addresses.iloc[i]['sequence']),
                                       border_width=2,
                                       iconShape='marker',
                                       inner_icon_style='margin-top:2px',
                                       background_color=df_addresses.iloc[i]['Color'],
                                   )))
    mapa.add_child(feature_group)
mapa.add_child(LayerControl())
This is an example dataset because I didn't want to format your df. That said, I think you'll get the idea.
print(df_addresses)
Latitude Longitude Group
0 34.962637 -90.069019 B
1 34.962637 -90.069019 B
2 35.035367 -89.898428 A
3 35.165115 -89.952624 B
4 35.821835 -90.705030 A
5 35.148707 -89.903760 A
6 35.098829 -89.866838 A
After I create the map object (mapa), I perform a groupby on the Group column. I then iterate through each group: first I create a FeatureGroup named with grp_name (A or B), and then for each group I iterate through that group's DataFrame, creating Markers and adding them to the FeatureGroup.
import folium

mapa = folium.Map(location=[35.11567262307692, -89.97423444615382], zoom_start=12,
                  tiles='Stamen Terrain')
for grp_name, df_grp in df_addresses.groupby('Group'):
    feature_group = folium.FeatureGroup(grp_name)
    for row in df_grp.itertuples():
        folium.Marker(location=[row.Latitude, row.Longitude]).add_to(feature_group)
    feature_group.add_to(mapa)
folium.LayerControl().add_to(mapa)
mapa
Regarding the stamenterrain question: if you're referring to its appearance in the layer control box, you can remove it by declaring your map with tiles=None and adding the TileLayer separately with control set to False: folium.TileLayer('Stamen Terrain', control=False).add_to(mapa)
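If you still want the numbered BeautifyIcon markers from the question, here is a sketch of folding them into the grouped loop, assuming folium.plugins is available and that df_addresses has the full_address, sequence and Color columns shown above:
import folium
from folium import plugins

mapa = folium.Map(location=[35.11567262307692, -89.97423444615382], zoom_start=12,
                  tiles='Stamen Terrain')
for grp_name, df_grp in df_addresses.groupby('Group'):
    feature_group = folium.FeatureGroup(grp_name)
    for row in df_grp.itertuples():
        folium.Marker(
            location=[row.Latitude, row.Longitude],
            popup='Address: ' + str(row.full_address) + '<br>Tech: ' + str(row.Group),
            # numbered, coloured marker per stop
            icon=plugins.BeautifyIcon(number=str(row.sequence),
                                      border_width=2,
                                      icon_shape='marker',
                                      inner_icon_style='margin-top:2px',
                                      background_color=row.Color),
        ).add_to(feature_group)
    feature_group.add_to(mapa)
folium.LayerControl().add_to(mapa)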

Constructing a dataframe with multiple columns based on str conditions using a loop - python

I have a webscraped Twitter DataFrame that includes user location. The location variable looks like this:
2 Crockett, Houston County, Texas, 75835, USA
3 NYC, New York, USA
4 Warszawa, mazowieckie, RP
5 Texas, USA
6 Virginia Beach, Virginia, 23451, USA
7 Louisville, Jefferson County, Kentucky, USA
I would like to construct state dummies for all USA states by using a loop.
I have managed to extract users from the USA using
location_usa = location_df['location'].str.contains('usa', case = False)
However, the code would be too bulky if I wrote this for every single state. I have a list of the states as strings.
Also I am unable to use
pd.Series.str.get_dummies()
as there are different locations within the same state and each entry is a whole sentence.
I would like the output to look something like this:
Alabama Alaska Arizona
1 0 0 1
2 0 1 0
3 1 0 0
4 0 0 0
Or the same with Boolean values.
Use .str.extract to get a Series of the states, and then use pd.get_dummies on that Series. Will need to define a list of all 50 states:
import pandas as pd
states = ['Texas', 'New York', 'Kentucky', 'Virginia']
pd.get_dummies(df.col1.str.extract('(' + '|'.join(x+',' for x in states)+ ')')[0].str.strip(','))
Kentucky New York Texas Virginia
0 0 0 1 0
1 0 1 0 0
2 0 0 0 0
3 0 0 1 0
4 0 0 0 1
5 1 0 0 0
Note that I matched on states followed by a ',' as that seems to be the pattern; it allows you to avoid false matches like 'Virginia' with 'Virginia Beach', or more problematic things like 'Washington County, Minnesota'.
If you expect multiple states to match on a single line, then this becomes .extractall, summing across the 0th level:
pd.get_dummies(df.col1.str.extractall('(' + '|'.join(x+',' for x in states)+ ')')[0].str.strip(',')).sum(level=0).clip(upper=1)
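Note that Series.sum(level=0) has been removed in recent pandas versions; a sketch of the same collapse written with a groupby on the index level, using the df and states defined above:
dummies = pd.get_dummies(df.col1.str.extractall('(' + '|'.join(x + ',' for x in states) + ')')[0].str.strip(','))
dummies = dummies.groupby(level=0).sum().clip(upper=1)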
Edit:
Perhaps there are better ways, but this can be a bit safer, as suggested by @BradSolomon, allowing matches on 'State,( optional 5 digit Zip,) USA':
states = ['Texas', 'New York', 'Kentucky', 'Virginia', 'California', 'Pennsylvania']
pat = '(' + '|'.join(x + r',?(\s\d{5},)?\sUSA' for x in states) + ')'
s = df.col1.str.extract(pat)[0].str.split(',').str[0]
Output of s:
0 Texas
1 New York
2 NaN
3 Texas
4 Virginia
5 Kentucky
6 Pennsylvania
Name: 0, dtype: object
from the input:
col1
0 Crockett, Houston County, Texas, 75835, USA
1 NYC, New York, USA
2 Warszawa, mazowieckie, RP
3 Texas, USA
4 Virginia Beach, Virginia, 23451, USA
5 Louisville, Jefferson County, Kentucky, USA
6 California, Pennsylvania, USA
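To go from s to the dummy columns the question asks for, a sketch combining it with get_dummies and reindexing so every state in the list gets a column even when it never matches (the reindex step is an assumption about the desired output):
dummies = pd.get_dummies(s).reindex(columns=states, fill_value=0)
print(dummies)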

A query using pandas

Hello, I need to create a query, using pandas, that finds the counties that belong to regions 1 or 2, whose name starts with 'Washington', and whose POPESTIMATE2015 was greater than their POPESTIMATE2014. The function should return a 5x2 DataFrame with columns = ['STNAME', 'CTYNAME'] and the same index ID as census_df (sorted ascending by index).
You'll find a description of my data in the picture:
Consider the following demo:
In [19]: df
Out[19]:
REGION STNAME CTYNAME POPESTIMATE2014 POPESTIMATE2015
0 0 Washington Washington 10 12
1 1 Washington Washington County 11 13
2 2 Alabama Alabama County 13 15
3 4 Alaska Alaska 14 12
4 3 Montana Montana 10 11
5 2 Washington Washington 15 19
In [20]: qry = "REGION in [1,2] and POPESTIMATE2015 > POPESTIMATE2014 and CTYNAME.str.contains('^Washington')"
In [21]: df.query(qry, engine='python')[['STNAME', 'CTYNAME']]
Out[21]:
STNAME CTYNAME
1 Washington Washington County
5 Washington Washington
Use boolean indexing with a mask created from isin and str.startswith:
mask = (df['REGION'].isin([1, 2]) &
        df['CTYNAME'].str.startswith('Washington') &
        (df['POPESTIMATE2015'] > df['POPESTIMATE2014']))
df = df.loc[mask, ['STNAME', 'CTYNAME']]
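A sketch of wrapping this up as the function the question describes, returning the rows sorted ascending by index (the function name is just illustrative; it assumes the REGION, CTYNAME and POPESTIMATE columns described above):
def washington_counties(census_df):
    # region 1 or 2, county name starting with 'Washington', population growth from 2014 to 2015
    mask = (census_df['REGION'].isin([1, 2]) &
            census_df['CTYNAME'].str.startswith('Washington') &
            (census_df['POPESTIMATE2015'] > census_df['POPESTIMATE2014']))
    return census_df.loc[mask, ['STNAME', 'CTYNAME']].sort_index()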

Selecting data based on number of occurences using Python / Pandas

My dataset is based on the results of Food Inspections in the City of Chicago.
import pandas as pd
df = pd.read_csv("C:/~/Food_Inspections.csv")
df.head()
Out[1]:
Inspection ID DBA Name \
0 1609238 JR'SJAMAICAN TROPICAL CAFE,INC
1 1609245 BURGER KING
2 1609237 DUNKIN DONUTS / BASKIN ROBINS
3 1609258 CHIPOTLE MEXICAN GRILL
4 1609244 ATARDECER ACAPULQUENO INC.
AKA Name License # Facility Type Risk \
0 NaN 2442496.0 Restaurant Risk 1 (High)
1 BURGER KING 2411124.0 Restaurant Risk 2 (Medium)
2 DUNKIN DONUTS / BASKIN ROBINS 1717126.0 Restaurant Risk 2 (Medium)
3 CHIPOTLE MEXICAN GRILL 1335044.0 Restaurant Risk 1 (High)
4 ATARDECER ACAPULQUENO INC. 1910118.0 Restaurant Risk 1 (High)
Here is how often each of the facilities appear in the dataset:
df['Facility Type'].value_counts()
Out[3]:
Restaurant 14304
Grocery Store 2647
School 1155
Daycare (2 - 6 Years) 367
Bakery 316
Children's Services Facility 262
Daycare Above and Under 2 Years 248
Long Term Care 169
Daycare Combo 1586 142
Catering 123
Liquor 78
Hospital 68
Mobile Food Preparer 67
Golden Diner 65
Mobile Food Dispenser 51
Special Event 25
Shared Kitchen User (Long Term) 22
Daycare (Under 2 Years) 18
I am trying to create a new set of data containing the rows whose Facility Type has over 50 occurrences in the dataset. How would I approach this?
Please note that the real list of facility counts is MUCH LARGER; I have cut most of it out as it did not contribute to the question at hand (so simply removing occurrences of "Special Event", "Shared Kitchen User", and "Daycare" is not what I'm looking for).
IIUC then you want to filter:
df.groupby('Facility Type').filter(lambda x: len(x) > 50)
Example:
In [9]:
df = pd.DataFrame({'type':list('aabcddddee'), 'value':np.random.randn(10)})
df
Out[9]:
type value
0 a -0.160041
1 a -0.042310
2 b 0.530609
3 c 1.238046
4 d -0.754779
5 d -0.197309
6 d 1.704829
7 d -0.706467
8 e -1.039818
9 e 0.511638
In [10]:
df.groupby('type').filter(lambda x: len(x) > 1)
Out[10]:
type value
0 a -0.160041
1 a -0.042310
4 d -0.754779
5 d -0.197309
6 d 1.704829
7 d -0.706467
8 e -1.039818
9 e 0.511638
Not tested, but should work.
FT = df['Facility Type'].value_counts()
df[df['Facility Type'].isin(FT.index[FT > 50])]
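Another common idiom, not from the answers above but offered as a sketch using the same df, computes the group sizes with transform, which keeps everything vectorised and avoids the Python-level lambda in filter:
sizes = df.groupby('Facility Type')['Facility Type'].transform('size')
frequent = df[sizes > 50]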
