I have a dataframe that looks like this:
                   Population2010
State County
AL    Baldwin               90332
      Douglas               92082
      Rolling               52000
CA    Orange              3879602
      San Diego           4364594
      Los Angeles        12123562
CO    Boulder              161818
      Denver               737728
      Jefferson            222368
AZ    Maricopa            2239378
      Pinal                448888
      Pima                1000564
I would like to sort the data by population in descending order while keeping the rows grouped by state:
                   Population2010
State County
AL    Douglas               92082
      Baldwin               90332
      Rolling               52000
CA    Los Angeles        12123562
      San Diego           4364594
      Orange              3879602
CO    Denver               737728
      Jefferson            222368
      Boulder              161818
AZ    Maricopa            2239378
      Pima                1000564
      Pinal                448888
Then I would like to sum the two largest population entries for each state and return the two states with the highest sums:
'CA', 'AZ'
Question 1:
df.sort_values(['Population2010'], ascending=False)\
.reindex(sorted(df.index.get_level_values(0).unique()), level=0)
or
df.sort_values('Population2010', ascending=False)\
.sort_index(level=0, ascending=[True])
Output:
                   Population2010
State County
AL    Douglas               92082
      Baldwin               90332
      Rolling               52000
AZ    Maricopa            2239378
      Pima                1000564
      Pinal                448888
CA    Los Angeles        12123562
      San Diego           4364594
      Orange              3879602
CO    Denver               737728
      Jefferson            222368
      Boulder              161818
First, sort the entire dataframe by population in descending order. Then take the level-0 index values, sort them, and use them to reindex on level=0, which orders the groups by state while keeping the population order within each group.
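As a hedged alternative to the two-step approach above, the same ordering can be sketched with a single sort_values call. This assumes a pandas version (0.23 or later) in which sort_values accepts index level names alongside column names. The small frame below is a subset of the question's data, rebuilt only for illustration.

```python
import pandas as pd

# Subset of the question's data, rebuilt for illustration.
df = pd.DataFrame(
    {
        "State": ["AL", "AL", "AL", "CA", "CA", "CA"],
        "County": ["Baldwin", "Douglas", "Rolling", "Orange", "San Diego", "Los Angeles"],
        "Population2010": [90332, 92082, 52000, 3879602, 4364594, 12123562],
    }
).set_index(["State", "County"])

# Sort by the "State" index level (ascending), then by population (descending).
result = df.sort_values(["State", "Population2010"], ascending=[True, False])
print(result)
```

Because both keys are given explicitly, this version does not depend on how a later index sort treats the order within each state.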
Question 2 (a calculation somewhat unrelated to the first):
df.groupby('State')['Population2010']\
.apply(lambda x: x.nlargest(2).sum())\
.nlargest(2).index.tolist()
Output:
['CA', 'AZ']
Use nlargest to find the two largest values within each state and sum them, then use nlargest again to find the two states with the largest sums.
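To make the two nlargest steps easier to follow, here is a sketch that keeps the intermediate per-state sums in a variable (per_state and top_states are names introduced for illustration). It rebuilds the question's data and groups on the State index level.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "State": ["AL"] * 3 + ["CA"] * 3 + ["CO"] * 3 + ["AZ"] * 3,
        "County": ["Baldwin", "Douglas", "Rolling", "Orange", "San Diego", "Los Angeles",
                   "Boulder", "Denver", "Jefferson", "Maricopa", "Pinal", "Pima"],
        "Population2010": [90332, 92082, 52000, 3879602, 4364594, 12123562,
                           161818, 737728, 222368, 2239378, 448888, 1000564],
    }
).set_index(["State", "County"])

# Step 1: sum of the two largest county populations within each state.
per_state = df.groupby(level="State")["Population2010"].apply(lambda s: s.nlargest(2).sum())
print(per_state.to_dict())

# Step 2: the two states with the largest sums.
top_states = per_state.nlargest(2).index.tolist()
print(top_states)  # ['CA', 'AZ']
```

For this data the intermediate sums are 182414 (AL), 3239942 (AZ), 16488156 (CA) and 960096 (CO), which is why CA and AZ come out on top.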
I have been looking for a way to create a dataframe of columns and their unique values. I know this has fewer use cases, but it would be a great way to get an initial idea of the unique values. It would look something like this:
State     County   City
Colorado  Denver   Denver
Colorado  El Paso  Colorado Springs
Colorado  Larimar  Fort Collins
Colorado  Larimar  Loveland
Turns into this...
State     County   City
Colorado  Denver   Denver
          El Paso  Colorado Springs
          Larimar  Fort Collins
                   Loveland
I would use mask and a lambda
df.mask(cond=df.apply(lambda x : x.duplicated(keep='first')), other='')
      State   County              City
0  Colorado   Denver            Denver
1            El Paso  Colorado Springs
2            Larimar      Fort Collins
3                             Loveland
Reproducible example. Please add one to your future questions to help others answer them.
import pandas as pd
df = pd.DataFrame({
    'State': ['Colorado', 'Colorado', 'Colorado', 'Colorado'],
    'County': ['Denver', 'El Paso', 'Larimar', 'Larimar'],
    'City': ['Denver', 'Colorado Springs', 'Fort Collins', 'Loveland']
})
df
      State   County              City
0  Colorado   Denver            Denver
1  Colorado  El Paso  Colorado Springs
2  Colorado  Larimar      Fort Collins
3  Colorado  Larimar          Loveland
Drop duplicates from each column separately and then concatenate. Fill NaN with empty string.
pd.concat([df[col].drop_duplicates() for col in df], axis=1).fillna('')
      State   County              City
0  Colorado   Denver            Denver
1            El Paso  Colorado Springs
2            Larimar      Fort Collins
3                             Loveland
This is the best solution I have come up with; I hope it helps others looking for something similar!
def create_unique_df(df) -> pd.DataFrame:
    """Take a dataframe and create a new one containing the unique values of each column.

    Note: it only works for two columns or more.

    :param df: dataframe you want to see unique values for
    :type df: pandas.DataFrame
    :return: dataframe of columns with unique values
    """
    # using list() allows us to combine lists down the line
    data_series = df.apply(lambda x: list(x.unique()))
    list_df = data_series.to_frame()
    # To create a df from lists they all need to be the same length, so we append null
    # values to the lists to make them the same length. First, find the difference in
    # length between the longest list and the rest
    list_df['needed_nulls'] = list_df[0].str.len().max() - list_df[0].str.len()
    # Second, create a column of lists with one None value
    list_df['null_list_placeholder'] = [[None] for _ in range(list_df.shape[0])]
    # Third, multiply the null list by the difference to get a list we can add to the list
    # of unique values, making all the lists the same length.
    # Example: [None] * 3 == [None, None, None]
    list_df['null_list_needed'] = list_df.null_list_placeholder * list_df.needed_nulls
    list_df['full_list'] = list_df[0] + list_df.null_list_needed
    unique_df = pd.DataFrame(
        list_df['full_list'].to_dict()
    )
    return unique_df
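A shorter sketch of the same idea, assuming it is acceptable to move each column's unique values to the top and pad the remainder: build one Series per column from unique() and let the DataFrame constructor align them on a fresh integer index. The helper name unique_values_frame is hypothetical and not part of pandas.

```python
import pandas as pd


def unique_values_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Return a frame whose columns hold each input column's unique values."""
    # Series of different lengths are aligned on their 0..n-1 index;
    # shorter columns are padded with NaN, which is replaced here by ''.
    columns = {col: pd.Series(df[col].unique()) for col in df.columns}
    return pd.DataFrame(columns).fillna("")


df = pd.DataFrame({
    "State": ["Colorado", "Colorado", "Colorado", "Colorado"],
    "County": ["Denver", "El Paso", "Larimar", "Larimar"],
    "City": ["Denver", "Colorado Springs", "Fort Collins", "Loveland"],
})
unique_df = unique_values_frame(df)
print(unique_df)
```

Unlike masking duplicates in place, this compacts the unique values upward, so a value no longer sits on the row where it first appeared.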
The data I have is something similar to this:
country  population  area      city         city_population
USA      331893745   9833520   New York     8804190
USA      331893745   9833520   Los Angeles  3898747
USA      331893745   9833520   Chicago      2746388
UK       243610      66366000  London       7556900
UK       243610      66366000  Birmingham   984333
Canada   9984670     38532853  Toronto      2600000
Canada   9984670     38532853  Montreal     1600000
Canada   9984670     38532853  Calgary      1019942
I am looking for output like this:
country  population  area      cities
USA      331893745   9833520   {'New York' : 8804190, 'Los Angeles' : 3898747, 'Chicago' : 2746388}
UK       243610      66366000  {'London' : 7556900, 'Birmingham' : 984333}
Canada   9984670     38532853  {'Toronto' : 2600000, 'Montreal' : 1600000, 'Calgary' : 1019942}
So basically I want to group by the country column and then put city and city_population into a JSON-like column while keeping the other columns.
Any help is appreciated.
What you want is pandas' groupby function, which groups rows that share the same values in one or more columns. These groups can then be transformed with other functions, depending on your problem. In your case, I would apply a lambda function that takes the city and city_population columns and builds a dictionary (a JSON-like structure). The last two calls only tidy up the index and set the column name.
(df.groupby(by=['country', 'population', 'area'])
.apply(lambda x: dict(zip(x['city'], x['city_population'])))
.reset_index()
.rename(columns={0:'Cities'}))
Output:
country population area Cities
0 Canada 9984670 38532853 {'Toronto': 2600000, 'Montreal': 1600000, 'Calgary': 1019942}
1 UK 243610 66366000 {'London': 7556900, 'Birmingham': 984333}
2 USA 331893745 9833520 {'New York': 8804190, 'Los Angeles': 3898747, 'Chicago': 2746388}
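A variant sketch of the same groupby idea, under two assumptions of mine: sort=False is passed so the countries stay in their original order, and only the two city columns are selected before apply so the lambda sees just the data it needs. The column name cities follows the question's desired output, and the frame is a subset of the question's data.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "country": ["USA", "USA", "USA", "UK", "UK"],
        "population": [331893745, 331893745, 331893745, 243610, 243610],
        "area": [9833520, 9833520, 9833520, 66366000, 66366000],
        "city": ["New York", "Los Angeles", "Chicago", "London", "Birmingham"],
        "city_population": [8804190, 3898747, 2746388, 7556900, 984333],
    }
)

cities = (
    # sort=False keeps the groups in their order of first appearance.
    df.groupby(["country", "population", "area"], sort=False)[["city", "city_population"]]
    # Build one {city: city_population} dict per group.
    .apply(lambda g: dict(zip(g["city"], g["city_population"])))
    # Turn the group keys back into columns and name the dict column.
    .reset_index(name="cities")
)
print(cities)
```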
I have a dataframe column 'address' with values like this in each row:
3466B, Jerome Avenue, The Bronx, Bronx County, New York, 10467, United States, (40.881836199999995, -73.88176324294639)
Jackson Heights 74th Street - Roosevelt Avenue (7), 75th Street, Queens, Queens County, New York, 11372, United States, (40.74691655, -73.8914737373454)
I need only to keep the value Bronx / Queens / Manhattan / Staten Island from each row.
Is there any way to do this?
Thanks in advance.
One option, assuming the value is always in the same position, is to use .split(', ')[2]:
"3466B, Jerome Avenue, The Bronx, Bronx County, New York, 10467, United States, (40.881836199999995, -73.88176324294639)".split(', ')[2]
If the source file is a CSV (Comma-separated values), I would have a look at pandas and pandas.read_csv('filename.csv') and leverage all the nice features that are in pandas.
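Applied to a whole dataframe column, the positional split can be sketched with the .str accessor. This assumes, as above, that the borough is always the third comma-separated piece; the column name district is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"address": [
    "3466B, Jerome Avenue, The Bronx, Bronx County, New York, 10467, United States, (40.881836199999995, -73.88176324294639)",
    "Jackson Heights 74th Street - Roosevelt Avenue (7), 75th Street, Queens, Queens County, New York, 11372, United States, (40.74691655, -73.8914737373454)",
]})

# Split every address on ', ' and keep the third piece (index 2).
df["district"] = df["address"].str.split(", ").str[2]
print(df["district"].tolist())  # ['The Bronx', 'Queens']
```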
If the values are not always in the same position and you only need to check whether a value is in a given set:
import pandas as pd
df = pd.DataFrame(["The Bronx", "Queens", "Man"])
df.isin(["Queens", "The Bronx"])
You could add a column, let's call it 'district' and then populate it like this.
import pandas as pd
df = pd.DataFrame({'address':["3466B, Jerome Avenue, The Bronx, Bronx County, New York, 10467, United States, (40.881836199999995, -73.88176324294639)",
"Jackson Heights 74th Street - Roosevelt Avenue (7), 75th Street, Queens, Queens County, New York, 11372, United States, (40.74691655, -73.8914737373454)"]})
districts = ['Bronx','Queens','Manhattan', 'Staten Island']
df['district'] = ''
for district in districts:
    # mark every row whose address mentions this district
    df.loc[df['address'].str.contains(district), 'district'] = district
print(df)
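The loop above can also be sketched as a single vectorized call with str.extract, assuming the first borough name found in the address is the one wanted. The third address is a made-up placeholder that shows what happens when nothing matches.

```python
import re

import pandas as pd

df = pd.DataFrame({"address": [
    "3466B, Jerome Avenue, The Bronx, Bronx County, New York, 10467, United States, (40.881836199999995, -73.88176324294639)",
    "Jackson Heights 74th Street - Roosevelt Avenue (7), 75th Street, Queens, Queens County, New York, 11372, United States, (40.74691655, -73.8914737373454)",
    "Somewhere else entirely",  # placeholder row with no borough in it
]})

districts = ["Bronx", "Queens", "Manhattan", "Staten Island"]

# One capture group that matches any of the district names.
pattern = "(" + "|".join(re.escape(d) for d in districts) + ")"

# expand=False returns a Series; rows without a match become NaN.
df["district"] = df["address"].str.extract(pattern, expand=False)
print(df["district"].tolist())
```

Rows with no match end up as NaN here rather than as an empty string, so add .fillna('') if the empty-string behaviour of the loop is preferred.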
First DataFrame: housing. This DataFrame has a MultiIndex (State, RegionName) and some relevant values in three other columns.
State         RegionName    2008q3         2009q2         Ratio
New York      New York      499766.666667  465833.333333  1.072844
California    Los Angeles   469500.000000  413900.000000  1.134332
Illinois      Chicago       232000.000000  219700.000000  1.055985
Pennsylvania  Philadelphia  116933.333333  116166.666667  1.006600
Arizona       Phoenix       193766.666667  168233.333333  1.151773
Second DataFrame: list_of_university_towns. It contains the names of states and some regions, and has a default numeric index.
   State     RegionName
1  Alabama   Auburn
2  Alabama   Florence
3  Alabama   Jacksonville
4  Arizona   Phoenix
5  Illinois  Chicago
Now the inner join of the two dataframes :
uniHousingData = pd.merge(list_of_university_towns,housing,how="inner",on=["State","RegionName"])
This returns an empty uniHousingData dataframe, while it should contain the bottom two rows (index 4 and 5 from list_of_university_towns).
What am I doing wrong?
I found the issue. There was a trailing space at the end of the strings in the RegionName column of the second dataframe. I used the strip() method to remove the space and it worked like a charm.
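A minimal sketch of that fix, assuming the whitespace is only in RegionName. The two small frames are cut-down stand-ins for the ones in the question; for simplicity housing uses plain columns here instead of a MultiIndex, and towns stands in for list_of_university_towns.

```python
import pandas as pd

housing = pd.DataFrame(
    {
        "State": ["Illinois", "Arizona"],
        "RegionName": ["Chicago", "Phoenix"],
        "Ratio": [1.055985, 1.151773],
    }
)
# Stand-in for list_of_university_towns, with trailing spaces in RegionName.
towns = pd.DataFrame(
    {
        "State": ["Alabama", "Arizona", "Illinois"],
        "RegionName": ["Auburn ", "Phoenix ", "Chicago "],
    }
)

# "Phoenix " != "Phoenix", so the inner join finds nothing.
before = pd.merge(towns, housing, how="inner", on=["State", "RegionName"])
print(len(before))  # 0

# Strip leading/trailing whitespace from the key column, then merge again.
towns["RegionName"] = towns["RegionName"].str.strip()
after = pd.merge(towns, housing, how="inner", on=["State", "RegionName"])
print(after)
```

When an inner merge comes back empty unexpectedly, printing the key column with repr() or .tolist() is a quick way to spot stray whitespace.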
I have a copy of a dataframe that looks like this:
heatmap_df = test['coords'].copy()
heatmap_df
0 [(Manhattanville, Manhattan, Manhattan Communi...
1 [(Mainz, Rheinland-Pfalz, 55116, Deutschland, ...
2 [(Ithaca, Ithaca Town, Tompkins County, New Yo...
3 [(Starr Hill, Charlottesville, Virginia, 22903...
4 [(Neuchâtel, District de Neuchâtel, Neuchâtel,...
5 [(Newark, Licking County, Ohio, 43055, United ...
6 [(Mae, Cass County, Minnesota, United States o...
7 [(Columbus, Franklin County, Ohio, 43210, Unit...
8 [(Canaanville, Athens County, Ohio, 45701, Uni...
9 [(Arizona, United States of America, (34.39534...
10 [(Enschede, Overijssel, Nederland, (52.2233632...
11 [(Gent, Oost-Vlaanderen, Vlaanderen, België - ...
12 [(Reno, Washoe County, Nevada, 89557, United S...
13 [(Grenoble, Isère, Auvergne-Rhône-Alpes, Franc...
14 [(Columbus, Franklin County, Ohio, 43210, Unit...
Each row has this format with some coordinates:
heatmap_df[2]
[Location(Ithaca, Ithaca Town, Tompkins County, New York, 14853, United States of America, (42.44770298533052, -76.48085858627931, 0.0)),
Location(Chapel Hill, Orange County, North Carolina, 27515, United States of America, (35.916920469999994, -79.05664845999999, 0.0))]
I want to pull the latitude and longitude from each row and store them as separate columns in the dataframe heatmap_df. I have this so far, but I suck at writing loops. My loop is not working recursively; it only prints out the last coordinates.
x = np.arange(start=0, stop=3, step=1)
for i in x:
    point_i = (heatmap_df[i][0].latitude, heatmap_df[i][0].longitude)
    i = i+1
point_i
(42.44770298533052, -76.48085858627931)
I am trying to make a heat map with all the coordinates using Folium. Can someone help please? Thank you
Python doesn't know what you are trying to do; it assumes you want to store the tuple (heatmap_df[i][0].latitude, heatmap_df[i][0].longitude) in the variable point_i on every iteration, so the value is overwritten each time. Instead, declare a list outside the loop and append a [latitude, longitude] list to it on each pass; the resulting list of lists can easily be turned into a DataFrame. Also, the loop in your example isn't recursive; it is ordinary iteration.
Try this:
x = np.arange(start=0, stop=3, step=1)
points = []
for i in x:
    # take the first Location in each row and keep its coordinates
    points.append([heatmap_df[i][0].latitude, heatmap_df[i][0].longitude])
print(points)
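If every location in every row is wanted rather than only the first, the same append idea can be sketched as a nested comprehension that feeds a DataFrame. FakeLocation is a hypothetical stand-in for the real Location objects, with only the two attributes used here, and the second row's coordinates are placeholder values.

```python
from collections import namedtuple

import pandas as pd

# Hypothetical stand-in for the Location objects in the question.
FakeLocation = namedtuple("FakeLocation", ["latitude", "longitude"])

heatmap_df = pd.Series(
    [
        [
            FakeLocation(42.44770298533052, -76.48085858627931),
            FakeLocation(35.916920469999994, -79.05664845999999),
        ],
        [FakeLocation(10.0, 20.0)],  # placeholder coordinates
    ]
)

# One [latitude, longitude] pair per location, across all rows.
points = [[loc.latitude, loc.longitude] for row in heatmap_df for loc in row]

# Separate latitude and longitude columns, one row per location.
coords = pd.DataFrame(points, columns=["latitude", "longitude"])
print(coords)
```

The points list keeps the list-of-lists shape built by the loop above, while coords holds the same values as two named columns.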