Creating a dictionary from a dataframe having duplicate column values - Python

I have a dataframe df1 having more than 500k records:
state lat-long
Florida (12.34,34.12)
texas (13.45,56.0)
Ohio (-15,49)
Florida (12.04,34.22)
texas (13.35,56.40)
Ohio (-15.79,49.34)
Florida (12.8764,34.2312)
The lat-long value can differ for a particular state, but I need a dictionary that captures only the first occurrence for each state, like this:
dict_state_lat_long = {"Florida":"(12.34,34.12)","texas":"(13.45,56.0)","Ohio":"(-15,49)"}
How can I get this in the most efficient way?

You can use DataFrame.groupby to group the dataframe by state, then apply the first aggregation to select the first occurring lat-long value within each group.
Then you can use Series.to_dict() to convert the result to a Python dict.
Use this:
grouped = df.groupby("state")["lat-long"].agg("first")
dict_state_lat_long = grouped.to_dict()
print(dict_state_lat_long)
Output:
{'Florida': '(12.34,34.12)', 'Ohio': '(-15,49)', 'texas': '(13.45,56.0)'}
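As an alternative that avoids groupby entirely, you can drop duplicate states and build the dict directly; this is a sketch assuming the same df with string-valued lat-long entries:

# keep only the first row per state, then map state -> lat-long
dict_state_lat_long = (
    df.drop_duplicates(subset="state", keep="first")
      .set_index("state")["lat-long"]
      .to_dict()
)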

Related

Pandas Groupby with no aggregation for single result

I have a DataFrame in pandas that shows the percentage of men in a city/state. The dataframe df looks like the following (note this is not my actual usage/data but my datatypes are similar):
STATE CITY PERC_MEN
ALABAMA ABBEVILLE 41.3%
ALABAMA ADAMSVILLE 53.5%
....
WYOMING WRIGHT 46.6%
Each state/city combination has exactly one value.
How do I show the city/percentage values for a given state? My code looks like the following (I need the first line, where I group by STATE, because I do other stuff with the data):
for state, state_df in df.groupby(by=['STATE']):
print(state_df.groupby(by=['CITY'])['PERC_MEN'])
However this prints <pandas.core.groupby.generic.SeriesGroupBy object at 0xXXXXXXX>
Normally for groupbys I use an aggregate like mean() or sum(), but is there a way to just return the value?
I wouldn't iterate the dataframe. Set an index and slice:
df = df.set_index(['STATE', 'CITY'])
df.xs(('ALABAMA', 'ABBEVILLE'), level=['STATE', 'CITY'])
or
df.loc[('ALABAMA', 'ABBEVILLE'),:]
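To get just the scalar value rather than a one-row slice, you can also name the column in .loc; a minimal sketch, assuming the MultiIndex set above:

# returns the single PERC_MEN value for that state/city pair
df.loc[('ALABAMA', 'ABBEVILLE'), 'PERC_MEN']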

Append column to dataframe containing count of duplicate values in another column

New to Python, using 3.x
I have a large CSV containing a list of customer names and addresses.
[Name, City, State]
I want to create a 4th column that counts the total number of customers living in the current customer's state.
So for example:
Joe, Dallas, TX
Steve, Austin, TX
Alex, Denver, CO
would become:
Joe, Dallas, TX, 2
Steve, Austin, TX, 2
Alex, Denver, CO, 1
I am able to read the file in and then use groupby to create a Series that contains the values for the 4th column, but I can't figure out how to match that Series against the million+ rows in my actual file.
import pandas as pd
mydata=pd.read_csv(r'C:\Users\customerlist.csv', index_col=False)
mydata=mydata.drop_duplicates(subset='name', keep='first')
mydata['state']=mydata['state'].str.strip()
stateinstalls=(mydata.groupby(mydata.state, as_index=False).size())
stateinstalls gives me a Series [2, 1], but I lose the corresponding states ([TX, CO]). It would need to be a tuple, so that I could then go back, iterate through all the rows of my spreadsheet, and say something like:
if mydata['state'].isin(stateinstalls(0))
    mydata[row] = stateinstalls(1)
I feel very lost. I know there has to be a far simpler way to do this, ideally in place within the array (like a COUNTIF-type function).
Any pointers are much appreciated.
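One simpler approach is groupby with transform, which computes the per-state count and broadcasts it back onto every row, aligned on the original index; a minimal sketch, assuming the columns shown above (the state_count column name is just illustrative):

import pandas as pd

mydata = pd.read_csv(r'C:\Users\customerlist.csv', index_col=False)
mydata['state'] = mydata['state'].str.strip()
# transform('size') returns one count per row instead of one per group
mydata['state_count'] = mydata.groupby('state')['state'].transform('size')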

Can I combine groupby data?

I have two columns, home and away. So one row will be England vs Brazil and the next row will be Brazil vs England. How can I count occurrences of Brazil vs England and England vs Brazil as one count?
Based on previous solutions, I have tried
results.groupby(["home_team", "away_team"]).size()
results.groupby(["away_team", "home_team"]).size()
however this does not give me the outcome that I am looking for.
Undesired output:
home_team away_team
England Brazil 1
away_team home_team
Brazil England 1
I would like to see:
England Brazil 2
Maybe you need the below:
import pandas as pd

df = pd.DataFrame({
    'home': ['England', 'Brazil', 'Spain'],
    'away': ['Brazil', 'England', 'Germany']
})
pd.Series('-'.join(sorted(tup)) for tup in zip(df['home'], df['away'])).value_counts()
Output:
Brazil-England 2
Germany-Spain 1
dtype: int64
PS: If you do not like the - between team names, you can use:
pd.Series(' '.join(sorted(tup)) for tup in zip(df['home'], df['away'])).value_counts()
You can sort the values with numpy.sort, create a DataFrame and use your original solution:
import numpy as np

df1 = (pd.DataFrame(np.sort(df[['home', 'away']], axis=1), columns=['home', 'away'])
         .groupby(['home', 'away'])
         .size())
Option 1
You can use numpy.sort to sort the values of the dataframe. However, as that sorts in place, it is better to work on a copy of the dataframe:
dfTeams = pd.DataFrame(data=df.values.copy(), columns=['team1', 'team2'])
dfTeams.values.sort()
(I changed the column names because the sorting changes their meaning.)
After having done this, you can use your groupby on the new frame:
dfTeams.groupby(['team1', 'team2']).size()
Option 2
Since a more general title for your question would be something like "how can I count combinations of values in multiple columns, independently of their order", you could use a set.
A set object is an unordered collection of distinct hashable objects. More precisely, create a Series of frozensets (which, unlike plain sets, are themselves hashable and can therefore be counted) and then count values:
pd.Series(map(lambda home, away: frozenset({home, away}),
              df['home'],
              df['away'])).value_counts()
Note: I use the dataframe in #Harv Ipan's answer.
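With that sample dataframe, this yields a count of 2 for the England/Brazil pairing and 1 for the Spain/Germany pairing, with frozenset objects as the index; if you want a more readable index, you can map each frozenset to a sorted tuple before counting.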

Create multiple, new columns from dict values by mapping against a single column

This question is related to posts on creating a new column by mapping/lookup using a dictionary (Adding a new pandas column with mapped value from a dictionary and pandas - add new column to dataframe from dictionary). However, what if I want to create multiple new columns with dictionary values?
For argument's sake, let's say I have the following df:
country
0 bolivia
1 canada
2 ghana
And in a different dataframe, I have country mappings:
country country_id category color
0 canada 11 north red
1 bolivia 12 central blue
2 ghana 13 south green
I've been using pd.merge to merge the mapping dataframe to my df, using country and another index as keys. It basically does the job and gives me my desired output:
country country_id category color
0 bolivia 12 central blue
1 canada 11 north red
2 ghana 13 south green
But lately, I've been wanting to experiment with dictionaries. I suppose a related question is how one decides between pd.merge and dictionaries for this task.
For one-off columns, I'll create a new column by mapping to a dictionary:
country_dict = dict(zip(entity['country'], entity['country_id']))
df['country_id'] = df['country'].map(country_dict)
It seems impractical to define a function that takes in different dictionaries and creates each new column separately (e.g., dict(zip(key, value1)), dict(zip(key, value2))). I'm stuck on how to create multiple columns at the same time. I started over and tried turning the country-mapping worksheet into a dictionary:
entity_dict = entity.set_index('country').T.to_dict('list')
and then from there, converting the dict values to columns:
entity_mapping = pd.DataFrame.from_dict(entity_dict, orient = 'index')
entity_mapping.columns = ['col1', 'col2', 'col3']
And I've been stuck going around in circles for the past few days. Any help/feedback would be appreciated!
OK, after tackling this for a while, I guess pd.merge makes the most sense.
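That said, if you do want the dictionary route, one option is a dict of row-dicts via to_dict('index'), expanded back into columns; this is a sketch, assuming the mapping frame is named entity as above:

# one nested dict per country:
# {'bolivia': {'country_id': 12, 'category': 'central', 'color': 'blue'}, ...}
entity_dict = entity.set_index('country').to_dict('index')

# map each country to its row-dict, then expand the dicts into columns
mapped = df['country'].map(entity_dict)
df[['country_id', 'category', 'color']] = pd.DataFrame(mapped.tolist(), index=df.index)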

Python DataFrame: How to continue grouping by after several operations on a DataFrame

I have a dataframe with states, counties and population statistics with the below columns:
SUMLEV REGION DIVISION STATE COUNTY STNAME CTYNAME CENSUS2010POP
And with the below line I am grouping the dataframe and sorting the counties by population within each state:
sorted_df = temp_df.groupby(['STNAME']).apply(lambda x: x.sort_values(['CENSUS2010POP'], ascending = False))
After the sorting I want to keep only the 3 largest counties population-wise:
largestcty = sorted_df.groupby(['STNAME'])["CENSUS2010POP"].nlargest(3)
And as the next step I would like to sum those values with the below command:
top3sum = largestcty.groupby(['STNAME']).sum()
But the problem now is that the key 'STNAME' is not in the series after the group by. My question is how to preserve the keys of the original DataFrame in the series?
So after applying the answer I have top3sum as a dataframe:
top3sum = pd.DataFrame(largestcty.groupby(['STNAME'])['CENSUS2010POP'].sum(), columns=['CENSUS2010POP'])
top3sum[:8]
>>>
STNAME CENSUS2010POP
Alabama 1406269
Alaska 478402
Arizona 5173150
Arkansas 807152
California 15924150
Colorado 1794424
Connecticut 2673320
Delaware 897934
This is what the top3sum data looks like, and then I am getting:
cnty = top3sum['CENSUS2010POP'].idxmax()
and cnty = 'California'. But then, trying to use cnty with top3sum['STNAME'], I am receiving a KeyError.
Your issue is that, after the second grouping, you only select the CENSUS2010POP column and pick the three largest values.
Please note that you don't need to sort in advance before applying nlargest, so the first command is unnecessary. But if you sort, you can easily pick the first 3 lines of each sorted group:
largestcty = temp_df.groupby(['STNAME']).apply(lambda x: x.sort_values(['CENSUS2010POP'], ascending=False).head(3))
Then you need to adapt the sum command in order to select your desired column:
top3sum = largestcty.groupby(['STNAME'])['CENSUS2010POP'].sum()
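Putting it together without the intermediate apply, a sketch of the whole pipeline, assuming the column names above; note that STNAME ends up as the index of top3sum, so use the index rather than a 'STNAME' column, which is what caused the KeyError:

# three most populous counties per state, then the sum per state
top3sum = (temp_df.sort_values('CENSUS2010POP', ascending=False)
                  .groupby('STNAME').head(3)
                  .groupby('STNAME')['CENSUS2010POP'].sum())

top3sum.idxmax()   # the state name comes from the index, e.g. 'California'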
