Create a dataframe with columns and their unique values in pandas - python

I have been trying to find a way to create a dataframe of columns and their unique values. I know this has fewer use cases, but it would be a great way to get an initial idea of the unique values in each column. It would look something like this...
State     County   City
Colorado  Denver   Denver
Colorado  El Paso  Colorado Springs
Colorado  Larimar  Fort Collins
Colorado  Larimar  Loveland
Turns into this...
State     County   City
Colorado  Denver   Denver
          El Paso  Colorado Springs
          Larimar  Fort Collins
                   Loveland

I would use mask and a lambda:
df.mask(cond=df.apply(lambda x: x.duplicated(keep='first')), other='')
      State   County              City
0  Colorado   Denver            Denver
1            El Paso  Colorado Springs
2            Larimar      Fort Collins
3                             Loveland
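For reference, here is the boolean mask driving the replacement, evaluated against the reproducible df defined in the next answer (a sketch): the True cells are the ones mask replaces with other=''.
df.apply(lambda x: x.duplicated(keep='first'))
#    State  County   City
# 0  False   False  False
# 1   True   False  False
# 2   True   False  False
# 3   True    True  False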

Here is a reproducible example. Please include one in your future questions to help others answer them.
import pandas as pd
df = pd.DataFrame({
    'State': ['Colorado', 'Colorado', 'Colorado', 'Colorado'],
    'County': ['Denver', 'El Paso', 'Larimar', 'Larimar'],
    'City': ['Denver', 'Colorado Springs', 'Fort Collins', 'Loveland']
})
df
      State   County              City
0  Colorado   Denver            Denver
1  Colorado  El Paso  Colorado Springs
2  Colorado  Larimar      Fort Collins
3  Colorado  Larimar          Loveland
Drop duplicates from each column separately, then concatenate. Fill NaN with an empty string.
pd.concat([df[col].drop_duplicates() for col in df], axis=1).fillna('')
      State   County              City
0  Colorado   Denver            Denver
1            El Paso  Colorado Springs
2            Larimar      Fort Collins
3                             Loveland

This is the best solution I have come up with; I hope it helps others looking for something like it!
def create_unique_df(df) -> pd.DataFrame:
    """Take a dataframe and create a new one containing the unique values of each column.

    Note: it only works for two or more columns.

    :param df: dataframe you want to see unique values for
    :type df: pandas.DataFrame
    :return: dataframe of columns with unique values
    """
    # using list() allows us to combine lists down the line
    data_series = df.apply(lambda x: list(x.unique()))
    list_df = data_series.to_frame()
    # To create a df from lists they all need to be the same length, so we can
    # append None values to the shorter lists. First find the difference in length
    # between the longest list and the rest.
    list_df['needed_nulls'] = list_df[0].str.len().max() - list_df[0].str.len()
    # Second, create a column of lists each holding a single None value.
    list_df['null_list_placeholder'] = [[None] for _ in range(list_df.shape[0])]
    # Third, multiply the None list by the difference to get the padding to append
    # to each list of unique values, making all the lists the same length.
    # Example: [None] * 3 == [None, None, None]
    list_df['null_list_needed'] = list_df.null_list_placeholder * list_df.needed_nulls
    list_df['full_list'] = list_df[0] + list_df.null_list_needed
    unique_df = pd.DataFrame(list_df['full_list'].to_dict())
    return unique_df
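A quick usage sketch with the reproducible df from the answer above; fillna('') blanks the trailing None cells to match the target layout:
unique_df = create_unique_df(df)
print(unique_df.fillna(''))
#       State   County              City
# 0  Colorado   Denver            Denver
# 1            El Paso  Colorado Springs
# 2            Larimar      Fort Collins
# 3                             Loveland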

Pandas Split 1 Column into Multiple Columns where Delimited Split size Can Vary

I have some address data like:
Address
Buffalo, NY, 14201
Stackoverflow Street, New York, NY, 99999
I'd like to split these into columns like:
              Street      City  State    Zip
                 NaN   Buffalo     NY  14201
Stackoverflow Street  New York     NY  99999
Essentially, I'd like to shift my strings over by one in each column in the result.
With Pandas I know I can split columns like:
import pandas as pd
df = pd.DataFrame(
    data={'Address': ['Buffalo, NY, 14201', 'Stackoverflow Street, New York, NY, 99999']}
)
df[['Street', 'City', 'State', 'Zip']] = (
    df['Address']
    .str.split(',', expand=True)
    .applymap(lambda col: col.strip() if col else col)
)
but I need to figure out how to conditionally shift the columns when a split produces only 3 parts.
First, create a function that reverses the split for each row. If you split normally, the NaN padding from expand=True lands in the last columns (the Zip side); if you reverse each split list first, the padding still lands in the last columns, but since the list is reversed it now falls on the Street side.
Then, apply it to all rows.
Then, rename the columns because they will be integers.
Finally, set them in the right order.
fn = lambda x: pd.Series([i.strip() for i in reversed(x.split(','))])
pad = df['Address'].apply(fn)
pad looks like this right now:
       0   1         2                     3
0  14201  NY   Buffalo                   NaN
1  99999  NY  New York  Stackoverflow Street
Just need to rename the columns and flip the order back.
pad.rename(columns={0: 'Zip', 1: 'State', 2: 'City', 3: 'Street'}, inplace=True)
df = pad[['Street', 'City', 'State', 'Zip']]
Output:
                 Street      City State    Zip
0                   NaN   Buffalo    NY  14201
1  Stackoverflow Street  New York    NY  99999
Use a bit of numpy magic to reorder the columns with None on the left:
import numpy as np

df2 = df['Address'].str.split(',', expand=True)
df[['Street', 'City', 'State', 'Zip']] = df2.to_numpy()[
    np.arange(len(df))[:, None], np.argsort(df2.notna())
]
Output:
                                      Address                Street      City State    Zip
0                          Buffalo, NY, 14201                  None   Buffalo    NY  14201
1  Stackoverflow Street, New York, NY, 99999  Stackoverflow Street  New York    NY  99999
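To see why this works, here is the intermediate argsort (a sketch): df2.notna() is False for missing cells, and np.argsort sorts False before True, so each row's column order lists the missing positions first.
np.argsort(df2.notna())
# array([[3, 0, 1, 2],
#        [0, 1, 2, 3]])
# row 0 pulls its missing column (3) to the front; row 1 keeps its order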
Another idea: prepend as many commas as needed so that every row has n-1 (here 3) before splitting:
df[['Street', 'City', 'State', 'Zip']] = (
    df['Address'].str.count(',')
    .rsub(4-1).map(lambda x: ','*x)
    .add(df['Address'])
    .str.split(',', expand=True)
)
Output:
                                      Address                Street      City State    Zip
0                          Buffalo, NY, 14201                         Buffalo    NY  14201
1  Stackoverflow Street, New York, NY, 99999  Stackoverflow Street  New York    NY  99999
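For illustration, the intermediate padded strings look like this (a sketch of the pipeline just before the split):
df['Address'].str.count(',').rsub(4-1).map(lambda x: ','*x).add(df['Address'])
# 0                           ,Buffalo, NY, 14201
# 1    Stackoverflow Street, New York, NY, 99999
# Name: Address, dtype: object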
Well, I found a solution, but I'm not sure if there is something more performant out there. Open to other ideas.
def split_shift(s: str) -> list[str]:
    split_str: list[str] = s.split(',')
    # If the split has only 3 items, shift things over by inserting NA in front
    if len(split_str) == 3:
        split_str.insert(0, pd.NA)
    return split_str

df[['Street', 'City', 'State', 'Zip']] = pd.DataFrame(df['Address'].apply(split_shift).tolist())
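One caveat worth noting (an assumption about the general case, not this example df): pd.DataFrame(...tolist()) builds a fresh RangeIndex, so if df has a non-default index the assignment can misalign. Passing index=df.index keeps it aligned:
parts = pd.DataFrame(df['Address'].apply(split_shift).tolist(), index=df.index)
df[['Street', 'City', 'State', 'Zip']] = parts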

Delete value Pandas Column

I started learning pandas and am trying to analyze some data.
In my data there is a column country which contains a few countries; I only want to take the first value and put it in a new column.
For example, the first index has Colombia, Mexico, United States and I only want to keep the first one, Colombia ([0]), and delete the other countries ([1:x]). Is this possible?
I tried a few things like loc, iloc, and drop() but hit a dead end, so I'm asking here.
You can use Series.str.split:
df['country'] = df['country'].str.split(',').str[0]
Consider below df for example:
In [1520]: df = pd.DataFrame({'country': ['Colombia, Mexico, US', 'Croatia, Slovenia, Serbia', 'Denmark', 'Denmark, Brazil']})

In [1521]: df
Out[1521]:
                     country
0       Colombia, Mexico, US
1  Croatia, Slovenia, Serbia
2                    Denmark
3            Denmark, Brazil

In [1523]: df['country'] = df['country'].str.split(',').str[0]

In [1524]: df
Out[1524]:
    country
0  Colombia
1   Croatia
2   Denmark
3   Denmark
Use .str.split():
df['country'] = df['country'].str.split(',', expand=True)[0]

Loop and store coordinates

I have a copy of a dataframe that looks like this:
heatmap_df = test['coords'].copy()
heatmap_df
0 [(Manhattanville, Manhattan, Manhattan Communi...
1 [(Mainz, Rheinland-Pfalz, 55116, Deutschland, ...
2 [(Ithaca, Ithaca Town, Tompkins County, New Yo...
3 [(Starr Hill, Charlottesville, Virginia, 22903...
4 [(Neuchâtel, District de Neuchâtel, Neuchâtel,...
5 [(Newark, Licking County, Ohio, 43055, United ...
6 [(Mae, Cass County, Minnesota, United States o...
7 [(Columbus, Franklin County, Ohio, 43210, Unit...
8 [(Canaanville, Athens County, Ohio, 45701, Uni...
9 [(Arizona, United States of America, (34.39534...
10 [(Enschede, Overijssel, Nederland, (52.2233632...
11 [(Gent, Oost-Vlaanderen, Vlaanderen, België - ...
12 [(Reno, Washoe County, Nevada, 89557, United S...
13 [(Grenoble, Isère, Auvergne-Rhône-Alpes, Franc...
14 [(Columbus, Franklin County, Ohio, 43210, Unit...
Each row has this format with some coordinates:
heatmap_df[2]
[Location(Ithaca, Ithaca Town, Tompkins County, New York, 14853, United States of America, (42.44770298533052, -76.48085858627931, 0.0)),
Location(Chapel Hill, Orange County, North Carolina, 27515, United States of America, (35.916920469999994, -79.05664845999999, 0.0))]
I want to pull the latitude and longitude from each row and store them as separate columns in the dataframe heatmap_df. I have this so far, but I suck at writing loops. My loop is not working recursively; it only prints out the last coordinates.
x = np.arange(start=0, stop=3, step=1)
for i in x:
    point_i = (heatmap_df[i][0].latitude, heatmap_df[i][0].longitude)
    i = i + 1
point_i
(42.44770298533052, -76.48085858627931)
I am trying to make a heat map with all the coordinates using Folium. Can someone help please? Thank you
Python doesn't know what you are trying to do; it assumes you want to store the tuple (heatmap_df[i][0].latitude, heatmap_df[i][0].longitude) in the variable point_i on every iteration, so the value gets overwritten each time. You want to declare a list outside the loop, then append a [lat, long] pair to it on each iteration, creating a list of lists that can easily become a DataFrame. Also, the loop in your example isn't recursive; check this out for recursion.
Try this:
x = np.arange(start=0, stop=3, step=1)
points = []
for i in x:
    points.append([heatmap_df[i][0].latitude, heatmap_df[i][0].longitude])
print(points)
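For what it's worth, a list comprehension can replace the counter loop entirely, and the [lat, lon] pairs can feed folium's HeatMap plugin directly (a sketch, assuming each row of heatmap_df holds geopy-style Location objects with .latitude/.longitude attributes):
import folium
from folium.plugins import HeatMap

# first Location per row, first three rows, as [lat, lon] pairs
points = [[row[0].latitude, row[0].longitude] for row in heatmap_df[:3]]
m = folium.Map(location=points[0], zoom_start=4)
HeatMap(points).add_to(m)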

pandas finding inverse of merge

I have two pandas dataframes: one is a list of states, cities, and a capital flag, with a MultiIndex of (state, city); the other is a non-indexed (or default-indexed, if that's more appropriate) list of states and their capitals. I need to perform an inner join on the two and then also find out which items in the cities df are NOT in the join.
Cities:
                  capital
state    city
Ohio     Akron          N
         Toledo         N
         Columbus       N
Colorado Boulder        N
         Denver         N
States:
           state        city
0  West Virginia  Charleston
1           Ohio    Columbus
Inner join to find the capital of Ohio:
pd.merge(cities, states, on=['state', 'city'], how='inner')
  state      city capital
0  Ohio  Columbus       N
Now I need to get a df that includes everything in the cities df EXCEPT Columbus, Ohio. I've been looking at variations of .isin(), both with and without reset_index(), but I can't get it to work.
Here is the code to create the cities and states dfs. I have set_index() as a separate call because if I try to do it when I create the df, I get ValueError: Shape of passed values is (3, 3), indices imply (2, 3), and I haven't figured out a way around it.
cities = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Ohio', 'Colorado', 'Colorado'],
                       'city': ['Akron', 'Toledo', 'Columbus', 'Boulder', 'Denver'],
                       'capital': ['N', 'N', 'N', 'N', 'N']},
                      columns=['state', 'city', 'capital'])
# set_index takes a list of columns, and returns a new frame unless inplace=True
cities = cities.set_index(['state', 'city'])
states = pd.DataFrame({'state': ['West Virginia', 'Ohio'], 'city': ['Charleston', 'Columbus']})
IIUC, you could use merge with how='outer' and indicator='source', and then keep only the rows that are 'left_only':
merge = cities.merge(states, on=['state', 'city'], how='outer', indicator='source')
result = merge[merge.source.eq('left_only')].drop('source', axis=1)
print(result)
Output

      state     city capital
0      Ohio    Akron       N
1      Ohio   Toledo       N
3  Colorado  Boulder       N
4  Colorado   Denver       N
As an alternative you could use isin, in the following way:
mask = ~cities.reset_index().city.isin(states.city)
print(cities[pd.Series(data=mask.values, index=cities.index)])
Output

                  capital
state    city
Ohio     Akron          N
         Toledo         N
Colorado Boulder        N
         Denver         N
The idea of the second approach is to create a boolean mask whose index matches the one in cities. A variation on the second approach is the following:
# drop the index
re_indexed = cities.reset_index()
# find the mask
mask = ~re_indexed.city.isin(states.city)
# reindex back
result = re_indexed[mask].set_index(['state', 'city'])
print(result)
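One caveat with matching on city alone (an assumption about the general case, not this example data): a city name that appears in more than one state would be dropped everywhere it occurs. A sketch that matches the full (state, city) pair against the MultiIndex instead:
# plain (state, city) tuples from the states frame
pairs = list(states[['state', 'city']].itertuples(index=False, name=None))
result = cities[~cities.index.isin(pairs)]
print(result)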

Order Subindex in dataframe and sum the top "n" entries

I have a dataframe that looks like this:
                   Population2010
State County
AL    Baldwin               90332
      Douglas               92082
      Rolling               52000
CA    Orange              3879602
      San Diego           4364594
      Los Angeles        12123562
CO    Boulder              161818
      Denver               737728
      Jefferson            222368
AZ    Maricopa            2239378
      Pinal                448888
      Pima                1000564
I would like to put the data in descending order based on population while still keeping it grouped by state:
                   Population2010
State County
AL    Douglas               92082
      Baldwin               90332
      Rolling               52000
CA    Los Angeles        12123562
      San Diego           4364594
      Orange              3879602
CO    Denver               737728
      Jefferson            222368
      Boulder              161818
AZ    Maricopa            2239378
      Pima                1000564
      Pinal                448888
and then I would like to sum the first two population entries within each state and report the two states with the highest sums:
'CA', 'AZ'
Question 1:
df.sort_values('Population2010', ascending=False)\
  .reindex(sorted(df.index.get_level_values(0).unique()), level=0)
or
df.sort_values('Population2010', ascending=False)\
  .sort_index(level=0, ascending=True, sort_remaining=False)
Output:
                   Population2010
State County
AL    Douglas               92082
      Baldwin               90332
      Rolling               52000
AZ    Maricopa            2239378
      Pima                1000564
      Pinal                448888
CA    Los Angeles        12123562
      San Diego           4364594
      Orange              3879602
CO    Denver               737728
      Jefferson            222368
      Boulder              161818
First, sort the entire dataframe by values descending; then take the unique values of index level 0, sort them, and use them to reindex on level=0, which regroups the sorted rows by state.
Question 2, a somewhat unrelated calculation to the first:
df.groupby('State')['Population2010']\
  .apply(lambda x: x.nlargest(2).sum())\
  .nlargest(2).index.tolist()
Output:
['CA', 'AZ']
Use nlargest to find the two largest values per state group and sum them, then use nlargest again to find the two largest states by those sums.
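For reference, the intermediate per-state sums before the final nlargest look like this (a sketch, numbers computed from the sample data above):
df.groupby('State')['Population2010'].apply(lambda x: x.nlargest(2).sum()).sort_values(ascending=False)
# State
# CA    16488156
# AZ     3239942
# CO      960096
# AL      182414
# Name: Population2010, dtype: int64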
