I have a dataframe (df) and I would like to create a new column called country, which is calculated by looking at the region column: where the region value is present in the EnglandRegions list the country value is set to England, otherwise it is the value from the region column.
Please see below for my desired output:
name salary region B1salary country
0 Jason 42000 London 42000 England
1 Molly 52000 South West England
2 Tina 36000 East Midland England
3 Jake 24000 Wales Wales
4 Amy 73000 West Midlands England
You can see that all the values in country are set to England except for the value assigned to Jake's record, which is set to Wales (as Wales is not in the EnglandRegions list). The code below produces the following error:
File "C:/Users/stacey/Documents/scripts/stacey.py", line 20
df['country'] = np.where((df.loc[df['region'].isin(EnglandRegions)),'England', df['region'])
^
SyntaxError: invalid syntax
The code is as follows:
import pandas as pd
import numpy as np
EnglandRegions = ["London", "South West", "East Midland", "West Midlands", "East Anglia"]
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'salary': [42000, 52000, 36000, 24000, 73000],
'region': ['London', 'South West', 'East Midland', 'Wales', 'West Midlands']}
df = pd.DataFrame(data, columns = ['name', 'salary', 'region'])
df['B1salary'] = np.where((df['salary']>=40000) & (df['salary']<=50000) , df['salary'], '')
df['country'] = np.where((df.loc[df['region'].isin(EnglandRegions)),'England', df['region'])
print(df)
The specific issue the error is pointing at is that you are missing a ] to close your .loc. However, even with that fixed, wrapping the mask in .loc is not what np.where expects as a condition. Try:
df['country'] = np.where(df['region'].isin(EnglandRegions), 'England', df['region'])
This is essentially the same pattern you already used in the line above it for B1salary.
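For reference, a minimal runnable version of the script from the question with that one line corrected:
import pandas as pd
import numpy as np

EnglandRegions = ["London", "South West", "East Midland", "West Midlands", "East Anglia"]

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'salary': [42000, 52000, 36000, 24000, 73000],
        'region': ['London', 'South West', 'East Midland', 'Wales', 'West Midlands']}
df = pd.DataFrame(data, columns=['name', 'salary', 'region'])

df['B1salary'] = np.where((df['salary'] >= 40000) & (df['salary'] <= 50000), df['salary'], '')
# the condition is the boolean mask itself, not a .loc selection
df['country'] = np.where(df['region'].isin(EnglandRegions), 'England', df['region'])
print(df)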
I have one dataframe (df1) which is my raw data, from which I want to filter or extract part of the data. I have another dataframe (df2) which holds my filter conditions. The catch is that if a filter-condition column is blank, that column's condition should be skipped and only the other columns' conditions applied.
Example below:
DF1:
City  District  Town  Country  Continent
NY    WASHIN    DC    US       America
CZCH  SEATLLE   DC    CZCH     Europe
NY    MARYLAND  DC    US       S America
NY    WASHIN    NY    US       America
NY    SEAGA     NJ    UK       Europe
DF2: (sample filter condition table - this table can have multiple conditions)
City  District  Town  Country  Continent
NY              DC
                NJ
Notice that I have left the District, Country and Continent columns blank, as I may or may not use them later. I cannot delete these columns.
OUTPUT DF: should look like this
City  District  Town  Country  Continent
NY    WASHIN    DC    US       America
NY    MARYLAND  DC    US       S America
NY    SEAGA     NJ    UK       Europe
So basically I need a filter-condition table that extracts rows from the raw data for the fields I fill in in the filter table. I cannot change/delete columns in DF2; I can only leave a column blank if I don't require that filter condition.
Thanks in advance,
Nitz
If DF2 always has one row:
df = df1.merge(df2.dropna(axis=1))
print (df)
City District Town Country Continent
0 NY WASHIN DC US America
1 NY MARYLAND DC US S America
If DF2 has multiple rows with missing values:
Sample data:
import numpy as np
import pandas as pd

nan = np.nan
df1 = pd.DataFrame({'City': ['NY', 'CZCH', 'NY', 'NY', 'NY'], 'District': ['WASHIN', 'SEATLLE', 'MARYLAND', 'WASHIN', 'SEAGA'], 'Town': ['DC', 'DC', 'DC', 'NY', 'NJ'], 'Country': ['US', 'CZCH', 'US', 'US', 'UK'], 'Continent': ['America', 'Europe', 'S America', 'America', 'Europe']})
df2 = pd.DataFrame({'City': ['NY', nan], 'District': [nan, nan], 'Town': ['DC', 'NJ'], 'Country': [nan, nan], 'Continent': [nan, nan]})
First remove the missing values by reshaping with DataFrame.stack:
print (df2.stack())
0 City NY
Town DC
1 Town NJ
dtype: object
Then, for each group, compare the df1 columns that appear in that group's index against the values from df2:
m = [df1[list(v.droplevel(0).index)].eq(v.droplevel(0)).all(axis=1)
for k, v in df2.stack().groupby(level=0)]
print (m)
[0 True
1 False
2 True
3 False
4 False
dtype: bool, 0 False
1 False
2 False
3 False
4 True
dtype: bool]
Combine the masks with np.logical_or.reduce and filter with boolean indexing:
print (np.logical_or.reduce(m))
[ True False True False True]
df = df1[np.logical_or.reduce(m)]
print (df)
City District Town Country Continent
0 NY WASHIN DC US America
2 NY MARYLAND DC US S America
4 NY SEAGA NJ UK Europe
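A slower but perhaps more readable sketch of the same idea (not part of the original answer, reusing df1 and df2 from the sample data above) builds one boolean mask per df2 row from its non-null cells and ORs the masks together:
mask = pd.Series(False, index=df1.index)
for _, cond in df2.iterrows():
    cond = cond.dropna()          # keep only the filled-in conditions of this row
    if cond.empty:                # an all-blank row imposes no condition
        continue
    mask |= df1[cond.index].eq(cond).all(axis=1)
print(df1[mask])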
Another possible solution, using numpy broadcasting (it works even when df2 has more than one row):
df1.loc[np.sum(np.sum(
df1.values == df2.values[:, None], axis=2) ==
np.sum(df2.notna().values, axis=1)[:,None], axis=0) == 1]
Output:
City District Town Country Continent
0 NY WASHIN DC US America
2 NY MARYLAND DC US S America
4 NY SEAGA NJ UK Europe
I have a pandas dataframe with several columns, for example:
# name abbr country
0 454 Liverpool UCL England
1 454 Bayern Munich UCL Germany
2 223 Manchester United UEL England
3 454 Manchester City UCL England
and I run a function using .groupby() - but then I want to add to each row of that group the value I calculated once for the group.
The example code is here:
def test_func(abbreviation):
if abbreviation == 'UCL':
return 'UEFA Champions League'
elif abbreviation == 'UEL':
return 'UEFA Europe Leauge'
data = [[454, 'Liverpool', 'UCL', 'England'], [454, 'Bayern Munich', 'UCL', 'Germany'], [223, 'Manchester United', 'UEL', 'England'], [454, 'Manchester City', 'UCL', 'England']]
df = pd.DataFrame(data, columns=['#','name','abbr', 'country'])
competition_df = df.groupby('#').first()
competition_df['competition'] = competition_df.apply(lambda row: test_func(row["abbr"]), axis=1)
and now I would like to add the value of "competition" to every row of the original dataframe (df), based on its group.
Is there a good way (using 'native' pandas) to do it without iterations and lists etc.?
Edit 1:
The final output would then be the original dataframe (df) with the new column:
# name abbr country competition
0 454 Liverpool UCL England UEFA Champions League
1 454 Bayern Munich UCL Germany UEFA Champions League
2 223 Manchester United UEL England UEFA Europe Leauge
3 454 Manchester City UCL England UEFA Champions League
Edit 2:
I managed to get what I want by zipping, but it's a very bad implementation and I am still wondering if I could do it better (and faster, using some pandas functions):
zipped = zip(competition_df.index, competition_df['competition'])
df['competition'] = np.nan
for num, comp in zipped:
df.loc[df['#']==num, 'competition'] = comp
I think these might be helpful.
import pandas
data = [[454, 'Liverpool', 'UCL', 'England'], [454, 'Bayern Munich', 'UCL', 'Germany'], [223, 'Manchester United', 'UEL', 'England'], [454, 'Manchester City', 'UCL', 'England']]
df = pandas.DataFrame(data, columns=['#','name','abbr', 'country'])
# option 1
abbreviation_dict = {
'UCL': 'UEFA Champions League',
'UEL': 'UEFA Europe Leauge'
}
df['competition'] = df['abbr'].replace(abbreviation_dict)
# option 2 using a function
def get_dict_for_replace(unique_values):
some_dict = {}
for unique_value in unique_values:
if unique_value == 'UCL':
value_1 = 'UEFA Champions League' # or whatever is complicated
some_dict.update({'UCL': value_1})
elif unique_value == 'UEL':
value_2 = 'UEFA Europe Leauge' # or whatever is complicated
some_dict.update({'UEL': value_2})
return some_dict
# get your unique values,
unique_values = df['abbr'].unique()
# get your dictionary
abbreviation_dict = get_dict_for_replace(unique_values)
df['competition'] = df['abbr'].replace(abbreviation_dict)
Without knowing your exact problem, this is probably the most general approach if you want to use a function: run each calculation once, build a dictionary, and pass it to the dataframe. You can probably pack the dictionary more efficiently based on your actual requirements.
aside: Using groupby on '#' instead of 'abbr' might have unwanted consequences unless the mapping is 1-to-1.
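If the per-group value really does have to be computed once per group, as with the question's competition_df, a sketch using Series.map (my own suggestion, not part of the answer above) replaces the zip loop from Edit 2:
import pandas as pd

data = [[454, 'Liverpool', 'UCL', 'England'], [454, 'Bayern Munich', 'UCL', 'Germany'], [223, 'Manchester United', 'UEL', 'England'], [454, 'Manchester City', 'UCL', 'England']]
df = pd.DataFrame(data, columns=['#', 'name', 'abbr', 'country'])

def test_func(abbreviation):
    if abbreviation == 'UCL':
        return 'UEFA Champions League'
    elif abbreviation == 'UEL':
        return 'UEFA Europe Leauge'

# compute the value once per '#' group, exactly as in the question ...
competition_df = df.groupby('#').first()
competition_df['competition'] = competition_df['abbr'].apply(test_func)

# ... then map it back onto every row of df via the '#' key (the index of competition_df)
df['competition'] = df['#'].map(competition_df['competition'])
print(df)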
Hi, I am trying to clean up data but am having trouble reading a JSON file into separate dataframe columns. I have thousands of records like this in a file:
{"hotel_class": 4.0,
"region_id": 60763,
"url": "http://www.tripadvisor.com/Hotel_Review-g60763-d113317-Reviews Casablanca_Hotel_Times_Square-New_York_City_New_York.html",
"phone": "",
"details": null,
"address": {"region": "NY", "street-address": "147 West 43rd Street", "postal-code": "10036", "locality": "New York City"},
"type": "hotel",
"id": 113317,
"name": "Casablanca Hotel Times Square"}
I tried to load it as:
with open('offering.txt') as datafile:
data_json = json.load(datafile)
but it is giving an error, i.e.
JSONDecodeError: Extra data: line 2 column 1 (char 398)
so I tried doing it with
df=pd.read_json('offering.txt',lines=True)
but if I do it this way, my address column has nested values and I want to separate them into different columns. How can I do that?
df['address']
0 {'region': 'NY', 'street-address': '147 West 4...
1 {'region': 'CA', 'street-address': '300 S Dohe...
2 {'region': 'NY', 'street-address': '790 Eighth...
3 {'region': 'NY', 'street-address': '152 West 5...
4 {'region': 'NY', 'street-address': '130 West 4...
Name: address, Length: 4333, dtype: object
Try:
df = pd.read_json("offering.txt", lines=True)
df_out = pd.concat([df, df.pop("address").apply(pd.Series)], axis=1)
print(df_out)
Prints:
hotel_class region_id url phone details type id name region street-address postal-code locality
0 4 60763 http://www.tripadvisor.com/Hotel_Review-g60763-d113317-Reviews Casablanca_Hotel_Times_Square-New_York_City_New_York.html NaN hotel 113317 Casablanca Hotel Times Square NY 147 West 43rd Street 10036 New York City
1 5 60763 http://www.tripadvisor.com/Hotel_Review-g60763-d113317-Reviews Casablanca_Hotel_Times_Square-New_York_City_New_York.html NaN hotel 113317 Casablanca Hotel Times Square CA 147 West 43rd Street 10036 New York City
...
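An alternative sketch (not from the answer above) uses pd.json_normalize, available in pandas 1.0+, which is usually faster than .apply(pd.Series) on large files:
import pandas as pd

df = pd.read_json("offering.txt", lines=True)
# flatten the nested 'address' dicts into their own columns, then join them back on
address = pd.json_normalize(df.pop("address").tolist())
df_out = pd.concat([df, address], axis=1)
print(df_out)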
I have a pandas DataFrame like this:
city country city_population
0 New York USA 8300000
1 London UK 8900000
2 Paris France 2100000
3 Chicago USA 2700000
4 Manchester UK 510000
5 Marseille France 860000
I want to create a new column country_population containing, for each country, the sum of the populations of its cities. I have tried:
df['Country population'] = df['city_population'].sum().where(df['country'])
But this won't work; could I have some advice on the problem?
Sounds like you're looking for groupby
import pandas as pd
data = {
'city': ['New York', 'London', 'Paris', 'Chicago', 'Manchester', 'Marseille'],
'country': ['USA', 'UK', 'France', 'USA', 'UK', 'France'],
'city_population': [8_300_000, 8_900_000, 2_100_000, 2_700_000, 510_000, 860_000]
}
df = pd.DataFrame.from_dict(data)
# group by country, access 'city_population' column, sum
pop = df.groupby('country')['city_population'].sum()
print(pop)
output:
country
France 2960000
UK 9410000
USA 11000000
Name: city_population, dtype: int64
Appending this Series to the DataFrame (arguably discouraged, though, since it stores information redundantly and doesn't really fit the structure of the original DataFrame):
# add to existing df
pop.rename('country_population', inplace=True)
# how='left' to preserve original ordering of df
df = df.merge(pop, how='left', on='country')
print(df)
output:
city country city_population country_population
0 New York USA 8300000 11000000
1 London UK 8900000 9410000
2 Paris France 2100000 2960000
3 Chicago USA 2700000 11000000
4 Manchester UK 510000 9410000
5 Marseille France 860000 2960000
based on #Vaishali's comment, a one-liner
df['Country population'] = df.groupby('country')['city_population'].transform('sum')
I have been trying to merge two geopandas dataframes based on a column and am getting some really weird results. To test this point I made two simple dataframes, and merged them:
import pandas as pd
import geopandas as gpd
df = pd.DataFrame(
{'City': ['Buenos Aires', 'Brasilia', 'Santiago', 'Bogota', 'Caracas'],
'Country': ['Argentina', 'Brazil', 'Chile', 'Colombia', 'Venezuela'],
'Latitude': [-34.58, -15.78, -33.45, 4.60, 10.48],
'Longitude': [-58.66, -47.91, -70.66, -74.08, -66.86]})
gdf = gpd.GeoDataFrame(
df, geometry=gpd.points_from_xy(df.Longitude, df.Latitude))
df2 = pd.DataFrame(
{'Capital': ['Buenos Aires', 'Brasilia', 'Santiago', 'Bogota',
'Caracas'],
'Abbreviation': ['ARG', 'BRA', 'CHI', 'COL', 'VZL']})
combined_df = gdf.merge(df2, left_on='City', right_on='Capital')
print(combined_df)
When I print the results, I get what I expected:
City Country ... Capital Abbreviation
0 Buenos Aires Argentina ... Buenos Aires ARG
1 Brasilia Brazil ... Brasilia BRA
2 Santiago Chile ... Santiago CHI
3 Bogota Colombia ... Bogota COL
4 Caracas Venezuela ... Caracas VZL
The two datasets are merged based on their common columns, the 'City' column and the 'Capital' column.
I have some other data I am working with. Here is a link to it
Both of the files are geopackages I've read in as geodataframes. Dataframe 1 has 16166 rows. Dataframe 2 has 15511 rows. They have a common ID column, 'ALTPARNO' and 'altparno'. Here is the code I've tried to use to read them in and merge them:
import geopandas as gpd
dataframe1 = gpd.read_file(filepath, layer='allkeepers_2019')
dataframe2 = gpd.read_file(filepath, layer='keepers_2019')
results = dataframe1.merge(dataframe2, left_on='altparno', right_on='ALTPARNO')
When I look at my results, I have a dataframe with over 4 million rows (should be around 15,000).
What is going on?
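A likely explanation (my own hypothesis, not part of the original post) is duplicated key values: merge emits one output row for every matching left/right pair, so duplicates on both sides multiply. A quick diagnostic, reusing the names from the code above, might be:
# count duplicated join keys on each side
print(dataframe1['altparno'].duplicated().sum())
print(dataframe2['ALTPARNO'].duplicated().sum())

# ask pandas to verify the assumed key relationship; this raises MergeError if violated
results = dataframe1.merge(dataframe2, left_on='altparno', right_on='ALTPARNO',
                           validate='one_to_one')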