Merge not working on two DataFrames with a multi-level index - python

First DataFrame: housing. This DataFrame has a MultiIndex (State, RegionName) and values in three other columns.
State RegionName 2008q3 2009q2 Ratio
New York New York 499766.666667 465833.333333 1.072844
California Los Angeles 469500.000000 413900.000000 1.134332
Illinois Chicago 232000.000000 219700.000000 1.055985
Pennsylvania Philadelphia 116933.333333 116166.666667 1.006600
Arizona Phoenix 193766.666667 168233.333333 1.151773
Second DataFrame: list_of_university_towns. It contains state and region names and has a default numeric index.
State RegionName
1 Alabama Auburn
2 Alabama Florence
3 Alabama Jacksonville
4 Arizona Phoenix
5 Illinois Chicago
Now the inner join of the two DataFrames:
uniHousingData = pd.merge(list_of_university_towns,housing,how="inner",on=["State","RegionName"])
This produces an empty uniHousingData DataFrame, while it should contain the last two rows (index 4 and 5 of list_of_university_towns).
What am I doing wrong?

I found the issue: there was a trailing space in the strings of the RegionName column of the second DataFrame. Removing it with the str.strip() method made the merge work like a charm.
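A minimal sketch of the fix described above, on made-up frames (the data here is hypothetical; only the trailing space reproduces the reported bug):

```python
import pandas as pd

housing = pd.DataFrame({'State': ['Arizona', 'Illinois'],
                        'RegionName': ['Phoenix', 'Chicago'],
                        'Ratio': [1.151773, 1.055985]})
# Note the trailing spaces in RegionName, mirroring the bug above.
towns = pd.DataFrame({'State': ['Arizona', 'Illinois'],
                      'RegionName': ['Phoenix ', 'Chicago ']})

# The inner merge finds no matches because 'Phoenix ' != 'Phoenix'.
print(len(pd.merge(towns, housing, how='inner', on=['State', 'RegionName'])))

# Stripping whitespace from the join key makes the keys match.
towns['RegionName'] = towns['RegionName'].str.strip()
merged = pd.merge(towns, housing, how='inner', on=['State', 'RegionName'])
print(len(merged))
```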


Best way to replicate SQL "update case when..." with Pandas?

I have this sample data set
City
LAL
NYK
Dallas
Detroit
SF
Chicago
Denver
Phoenix
Toronto
I want to update certain values with specific replacements and leave the rest as they are.
So, with SQL I would do something like this:
update table1
set city = case
    when city = 'LAL' then 'Los Angeles'
    when city = 'NYK' then 'New York'
    else city
end
What would be the best way to do this in Pandas?
Use replace on the City column:
df['City'] = df['City'].replace({"LAL": "Los Angeles", "NYK": "New York"})
output:
City
0 Los Angeles
1 New York
2 Dallas
3 Detroit
4 SF
5 Chicago
6 Denver
7 Phoenix
8 Toronto
You can replace the values directly with boolean indexing (use .loc rather than chained indexing, which can fail with a SettingWithCopyWarning):
replacement_dict = {"LAL": "Los Angeles", "NYK": "New York"}
for key, value in replacement_dict.items():
    df.loc[df['City'] == key, 'City'] = value
You can replace them using replace(). One option is to pass a dict.
Example
df = pd.DataFrame({'City':["LAL","NYK","Dallas","Detroit","SF","Chicago","Denver","Phoenix","Toronto"]})
df.replace({"LAL": "Los Angeles", "NYK": "New York"})
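A further alternative, not shown in the answers above, is Series.map with fillna; map() yields NaN for cities not in the dict, and fillna restores the original value, mimicking the SQL "else city" branch. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'City': ['LAL', 'NYK', 'Dallas', 'Chicago']})

mapping = {'LAL': 'Los Angeles', 'NYK': 'New York'}
# Unmapped cities become NaN after map(); fillna keeps the original value.
df['City'] = df['City'].map(mapping).fillna(df['City'])
print(df['City'].tolist())
```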

How to Split a column into two by comma delimiter, and put a value without comma in second column and not in first?

I have a column in a df that I want to split into two columns on a comma delimiter. If a value has no comma, I want to put it into the second column instead of the first.
Origin
New York, USA
England
Russia
London, England
California, USA
USA
I want the result to be:
Location     Country
New York     USA
NaN          England
NaN          Russia
London       England
California   USA
NaN          USA
I tried this code, which does not work as intended:
df['Location'], df['Country'] = df['Origin'].str.split(',', 1)
We can try using str.extract here:
df["Location"] = df["Origin"].str.extract(r'(.*),')
df["Country"] = df["Origin"].str.extract(r'(\w+(?: \w+)*)$')
Here is a way by using str.extract() and named groups
df['Origin'].str.extract(r'(?P<Location>[A-Za-z ]+(?=,))?(?:, )?(?P<Country>\w+)')
Output:
Location Country
0 New York USA
1 NaN England
2 NaN Russia
3 London England
4 California USA
5 NaN USA
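An alternative to regex (my own sketch, not from the answers above) is str.split with expand=True, then shifting lone values into the Country column:

```python
import pandas as pd

df = pd.DataFrame({'Origin': ['New York, USA', 'England', 'Russia',
                              'London, England', 'California, USA', 'USA']})

parts = df['Origin'].str.split(',', n=1, expand=True)
# Rows without a comma get NaN in parts[1]; their single value belongs in Country.
df['Country'] = parts[1].str.strip().fillna(parts[0])
df['Location'] = parts[0].where(parts[1].notna())
print(df[['Location', 'Country']])
```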

Python: no change in a pandas dataframe column when using apply function [duplicate]

This question already has answers here:
Pandas df.apply does not modify DataFrame
(2 answers)
Closed 1 year ago.
As a reproducible example, I created the following dataframe:
dictionary = {'Metropolitan area': ['New York City', 'New York City', 'Los Angeles', 'Los Angeles'],
              'Population (2016 est.)[8]': [20153634, 20153634, 13310447, 13310447],
              'NBA': ['Knicks', ' ', ' ', 'Clippers']}
df = pd.DataFrame(dictionary)
To substitute any space present in df['NBA'] with None, I created the following function:
def transform(x):
    if len(x) < 2:
        return None
    else:
        return x
which I apply over df['NBA'] using .apply method:
df['NBA'].apply(transform)
After doing this, I get the following output, which seems successful:
> 0      Knicks
1        None
2        None
3    Clippers
Name: NBA, dtype: object
But here is the problem: when I inspect df, df['NBA'] has not been transformed. The column is as it was at the beginning, and the spaces are still there, not replaced by None:
Metropolitan area Population (2016 est.)[8] NBA
0 New York City 20153634 Knicks
1 New York City 20153634
2 Los Angeles 13310447
3 Los Angeles 13310447 Clippers
What am I doing wrong? Am I misunderstanding the .apply method?
The command df['NBA'].apply(transform) on its own performs the operation but does not save the result back to the original DataFrame in memory, so you have to assign the new column:
df['NBA'] = df['NBA'].apply(transform)
and the resulting DataFrame should be:
Metropolitan area Population (2016 est.)[8] NBA
0 New York City 20153634 Knicks
1 New York City 20153634 None
2 Los Angeles 13310447 None
3 Los Angeles 13310447 Clippers
Assign the results of apply back to the column.
df['NBA'] = df['NBA'].apply(transform)
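As a side note, for this particular transform a vectorized alternative (my sketch, not part of the answers above) is mask, which turns whitespace-only entries into NaN without a custom function:

```python
import pandas as pd

df = pd.DataFrame({'NBA': ['Knicks', ' ', ' ', 'Clippers']})

# mask() replaces values where the condition is True with NaN,
# so blank or whitespace-only entries become missing values.
df['NBA'] = df['NBA'].mask(df['NBA'].str.strip() == '')
print(df['NBA'].tolist())
```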

Loop and store coordinates

I have a copy of a dataframe that looks like this:
heatmap_df = test['coords'].copy()
heatmap_df
0 [(Manhattanville, Manhattan, Manhattan Communi...
1 [(Mainz, Rheinland-Pfalz, 55116, Deutschland, ...
2 [(Ithaca, Ithaca Town, Tompkins County, New Yo...
3 [(Starr Hill, Charlottesville, Virginia, 22903...
4 [(Neuchâtel, District de Neuchâtel, Neuchâtel,...
5 [(Newark, Licking County, Ohio, 43055, United ...
6 [(Mae, Cass County, Minnesota, United States o...
7 [(Columbus, Franklin County, Ohio, 43210, Unit...
8 [(Canaanville, Athens County, Ohio, 45701, Uni...
9 [(Arizona, United States of America, (34.39534...
10 [(Enschede, Overijssel, Nederland, (52.2233632...
11 [(Gent, Oost-Vlaanderen, Vlaanderen, België - ...
12 [(Reno, Washoe County, Nevada, 89557, United S...
13 [(Grenoble, Isère, Auvergne-Rhône-Alpes, Franc...
14 [(Columbus, Franklin County, Ohio, 43210, Unit...
Each row has this format with some coordinates:
heatmap_df[2]
[Location(Ithaca, Ithaca Town, Tompkins County, New York, 14853, United States of America, (42.44770298533052, -76.48085858627931, 0.0)),
Location(Chapel Hill, Orange County, North Carolina, 27515, United States of America, (35.916920469999994, -79.05664845999999, 0.0))]
I want to pull the latitude and longitude from each row and store them as separate columns in the dataframe heatmap_df. I have this so far, but I suck at writing loops. My loop is not working recursively; it only prints out the last coordinates.
x = np.arange(start=0, stop=3, step=1)
for i in x:
    point_i = (heatmap_df[i][0].latitude, heatmap_df[i][0].longitude)
    i = i + 1
point_i
(42.44770298533052, -76.48085858627931)
I am trying to make a heat map with all the coordinates using Folium. Can someone help please? Thank you
Python doesn't know what you are trying to do; it assumes you want to store the tuple (heatmap_df[i][0].latitude, heatmap_df[i][0].longitude) in the variable point_i on every iteration, so it is overwritten each time. Instead, declare a list outside the loop, then append a [lat, lon] list on each iteration, building a list of lists that can easily become a DataFrame. Also, your loop isn't recursive; recursion means a function calling itself, which is not needed here. (The i = i + 1 is also redundant: a for loop reassigns i itself on each iteration.)
Try this:
x = np.arange(start=0, stop=3, step=1)
points = []
for i in x:
    points.append([heatmap_df[i][0].latitude, heatmap_df[i][0].longitude])
print(points)
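The same idea fits in a list comprehension over the whole Series. Here is a self-contained sketch using a namedtuple stand-in, since the real rows hold geopy Location objects that are not reproduced here:

```python
from collections import namedtuple
import pandas as pd

# Stand-in for the geopy Location objects in the original data.
Location = namedtuple('Location', ['address', 'latitude', 'longitude'])
heatmap_df = pd.Series([
    [Location('Ithaca, NY, USA', 42.44770298533052, -76.48085858627931)],
    [Location('Chapel Hill, NC, USA', 35.916920469999994, -79.05664845999999)],
])

# First Location of each row -> (lat, lon), collected into a DataFrame
# whose rows can feed a Folium heat map.
points = [(row[0].latitude, row[0].longitude) for row in heatmap_df]
coords = pd.DataFrame(points, columns=['latitude', 'longitude'])
print(coords)
```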

Sum each item in a Panda Series and sort by largest

Hello, I have imported a CSV file as a pandas DataFrame and am trying to perform the tasks below.
Data Frame Model:
STATE County POP
1 Alabama Autauga County 54571
2 Alabama Baldwin County 182265
3 Alabama Barbour County 27457
...
3168 Wisconsin Wood County 74749
3170 Wyoming Albany County 36299
3171 Wyoming Big Horn County 11668
3172 Wyoming Campbell County 46133
1.) Get a list of top two counties per State
2.) Get the sum of the top two counties for each State
3.) List the top two states with the largest population sorted from largest to smallest
I was able to accomplish item 1 using the below. Is there a way I can remove the index value from this output?
census_df.groupby('STATE')['POP'].nlargest(2)
STATE
Alabama 37 658466
49 412992
Alaska 71 291826
76 97581
Arizona 106 3817117
109 980263
Arkansas 174 382748
118 221339
But when I try to sum the top two per state, x.sum() sums the entire Series. Is there a way to sum each group in the Series? Also, I'm not sure I am using the most efficient method to gather this info. Any help would be appreciated.
My desired output would be:
Top two most populated states:
STATE POP_SUM
Arkansas 382748
Wisconsin 271431
If I understand the issue correctly, you can pass the level argument to sum to keep the grouping by state:
x.sum(level=0)
(Note: the level= argument was removed from sum in pandas 2.0; the equivalent is x.groupby(level=0).sum().)
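Putting all three steps together on a small made-up slice of the data (my sketch; it uses groupby(level=0).sum(), which works across pandas versions):

```python
import pandas as pd

census_df = pd.DataFrame({
    'STATE': ['Alabama', 'Alabama', 'Alabama', 'Wyoming', 'Wyoming', 'Wyoming'],
    'County': ['Autauga', 'Baldwin', 'Barbour', 'Albany', 'Big Horn', 'Campbell'],
    'POP': [54571, 182265, 27457, 36299, 11668, 46133],
})

# 1) top two counties per state
top_two = census_df.groupby('STATE')['POP'].nlargest(2)

# 2) sum of the top two per state (equivalent to x.sum(level=0))
state_sums = top_two.groupby(level=0).sum()

# 3) the two states with the largest sums, from largest to smallest
print(state_sums.nlargest(2))
```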
