I want to change the data in a row
database:
Name
City
Country
John
Toronto,Canada
Canada
Smith
Seattle,United States
United States
Raj
Greater Toronto Area,Canada
Canada
The Records in the city with "," should be removed only the name of the city should be their others should be deleted
output required
Name
City
Country
John
Toronto
Canada
Smith
Seattle
United States
Raj
Greater Toronto Area
Canada
USE -df['City'] = df['City'].str.split(',').str[0]
Reproducible Code-
# Import pandas library
import pandas as pd
# initialize list of lists
data = [['tom', 'Toronto ,ON','Canada'], [ "Raj","Greater Toronto Area, Canada","Canada"]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'City','Country'])
# print dataframe.
df['City'] = df['City'].str.split(',').str[0]
df
Ouptut-
Name City Country
0 tom Toronto Canada
1 Raj Greater Toronto Area Canada
Ref link-https://stackoverflow.com/questions/40705480/python-pandas-remove-everything-after-a-delimiter-in-a-string
Related
I have a column in a df that I want to split into two columns splitting by comma delimiter. If the value in that column does not have a comma I want to put that into the second column instead of first.
Origin
New York, USA
England
Russia
London, England
California, USA
USA
I want the result to be:
Location
Country
New York
USA
NaN
England
NaN
Russia
London
England
California
USA
NaN
USA
I used this code
df['Location'], df['Country'] = df['Origin'].str.split(',', 1)
We can try using str.extract here:
df["Location"] = df["Origin"].str.extract(r'(.*),')
df["Country"] = df["Origin"].str.extract(r'(\w+(?: \w+)*)$')
Here is a way by using str.extract() and named groups
df['Origin'].str.extract(r'(?P<Location>[A-Za-z ]+(?=,))?(?:, )?(?P<Country>\w+)')
Output:
Location Country
0 New York USA
1 NaN England
2 NaN Russia
3 London England
4 California USA
5 NaN USA
I have a dictionary and a dataframe, for example:
data={'Name': ['Tom', 'Joseph', 'Krish', 'John']}
df=pd.DataFrame(data)
print(df)
city={"New York": "123",
"LA":"456",
"Miami":"789"}
Output:
Name
0 Tom
1 Joseph
2 Krish
3 John
I've created a column called CITY by using the following:
df["CITY"]=np.random.choice(list(city), len(df))
df
Name CITY
0 Tom New York
1 Joseph LA
2 Krish Miami
3 John New Yor
Now, I would like to generate a new column - CITY2 with a random item from city dictionary, but I would like CITY will be a different item than CITY2, so basically when I'm generating CITY2 I need to exclude CITY item.
It's worth mentioning that my real df is quite large so I need it to be effective as possible.
Thanks in advance.
continue with approach you have used
have used pd.Series() as a convenience to remove value that has already been used
wrapped in apply() to get value of each row
data={'Name': ['Tom', 'Joseph', 'Krish', 'John']}
df=pd.DataFrame(data)
city={"New York": "123",
"LA":"456",
"Miami":"789"}
df["CITY"]=np.random.choice(list(city), len(df))
df["CITY2"] = df["CITY"].apply(lambda x: np.random.choice(pd.Series(city).drop(x).index))
Name
CITY
CITY2
0
Tom
Miami
New York
1
Joseph
LA
Miami
2
Krish
New York
Miami
3
John
New York
LA
You could also first group by "CITY", remove the current city per group from the city dict and then create the new random list of cities.
Maybe this is faster because you don't have to drop one city per row, but per group.
city2 = pd.Series()
for key,group in df.groupby('CITY'):
cities_subset = np.delete(np.array(list(city)),list(city).index(key))
city2 = city2.append(pd.Series(np.random.choice(cities_subset, len(group)),index=group.index))
df["CITY2"] = city2
This gives for example:
Name CITY CITY2
0 Tom New York LA
1 Joseph New York Miami
2 Krish LA New York
3 John New York LA
I have a pandas DataFrame like this:
city country city_population
0 New York USA 8300000
1 London UK 8900000
2 Paris France 2100000
3 Chicago USA 2700000
4 Manchester UK 510000
5 Marseille France 860000
I want to create a new column country_population by calculating a sum of every city for each country. I have tried:
df['Country population'] = df['city_population'].sum().where(df['country'])
But this won't work, could I have some advise on the problem?
Sounds like you're looking for groupby
import pandas as pd
data = {
'city': ['New York', 'London', 'Paris', 'Chicago', 'Manchester', 'Marseille'],
'country': ['USA', 'UK', 'France', 'USA', 'UK', 'France'],
'city_population': [8_300_000, 8_900_000, 2_100_000, 2_700_000, 510_000, 860_000]
}
df = pd.DataFrame.from_dict(data)
# group by country, access 'city_population' column, sum
pop = df.groupby('country')['city_population'].sum()
print(pop)
output:
country
France 2960000
UK 9410000
USA 11000000
Name: city_population, dtype: int64
Appending this Series to the DataFrame. (Arguably discouraged though, since it stores information redundantly and doesn't really fit the structure of the original DataFrame):
# add to existing df
pop.rename('country_population', inplace=True)
# how='left' to preserve original ordering of df
df = df.merge(pop, how='left', on='country')
print(df)
output:
city country city_population country_population
0 New York USA 8300000 11000000
1 London UK 8900000 9410000
2 Paris France 2100000 2960000
3 Chicago USA 2700000 11000000
4 Manchester UK 510000 9410000
5 Marseille France 860000 2960000
based on #Vaishali's comment, a one-liner
df['Country population'] = df.groupby([ 'country']).transform('sum')['city_population']
I have a dictionary dict of dataframes such as:
{
‘table_1’: name color type
Banana Yellow Fruit,
‘another_table_1’: city state country
Atlanta Georgia United States,
‘and_another_table_1’: firstname middlename lastname
John Patrick Snow,
‘table_2’: name color type
Red Apple Fruit,
‘another_table_2’: city state country
Arlington Virginia United States,
‘and_another_table_2’: firstname middlename lastname
Alex Justin Brown,
‘table_3’: name color type
Lettuce Green Vegetable,
‘another_table_3’: city state country
Dallas Texas United States,
‘and_another_table_3’: firstname middlename lastname
Michael Alex Smith }
I would like to merge these dataframes together based on their names so that in the end I will have only 3 dataframes:
table
name color type
Banana Yellow Fruit
Red Apple Fruit
Lettuce Green Vegetable
another_table
city state country
Atlanta Georgia United States
Arlington Virginia United States
Dallas Texas United States
and_another_table
firstname middlename lastname
John Patrick Snow
Alex Justin Brown
Michael Alex Smith
Based on my initial research it seems like this should be possible with Python:
By using .split, dictionary comprehension and itertools.groupby to group together dataframes inside the dictionary based on key names
Creating dictionary of dictionaries with these grouped results
Using pandas.concat function to loop through these dictionaries and group dataframes together
I don't have a lot of experience with Python and I am a bit lost on how to actually code this.
I have reviewed
How to group similar items in a list? and
Merge dataframes in a dictionary posts but they were not as helpful because in my case name length of dataframes varies.
Also I do not want to hardcode any dataframe names, because there are more than a 1000 of them.
Here is one way:
Give this dictionary of dataframes:
dd = {'table_1': pd.DataFrame({'Name':['Banana'], 'color':['Yellow'], 'type':'Fruit'}),
'table_2': pd.DataFrame({'Name':['Apple'], 'color':['Red'], 'type':'Fruit'}),
'another_table_1':pd.DataFrame({'city':['Atlanta'],'state':['Georgia'], 'Country':['United States']}),
'another_table_2':pd.DataFrame({'city':['Arlinton'],'state':['Virginia'], 'Country':['United States']}),
'and_another_table_1':pd.DataFrame({'firstname':['John'], 'middlename':['Patrick'], 'lastnme':['Snow']}),
'and_another_table_2':pd.DataFrame({'firstname':['Alex'], 'middlename':['Justin'], 'lastnme':['Brown']}),
}
tables = set([i.rsplit('_', 1)[0] for i in dd.keys()])
dict_of_dfs = {i:pd.concat([dd[x] for x in dd.keys() if x.startswith(i)]) for i in tables}
Outputs a new dictionary of combined tables:
dict_of_dfs['table']
# Name color type
# 0 Banana Yellow Fruit
# 0 Apple Red Fruit
dict_of_dfs['another_table']
# city state Country
# 0 Atlanta Georgia United States
# 0 Arlinton Virginia United States
dict_of_dfs['and_another_table']
# firstname middlename lastnme
# 0 John Patrick Snow
# 0 Alex Justin Brown
Another way using defaultdict from collections, create a list of combined dataframes:
from collections import defaultdict
import pandas as pd
dd = {'table_1': pd.DataFrame({'Name':['Banana'], 'color':['Yellow'], 'type':'Fruit'}),
'table_2': pd.DataFrame({'Name':['Apple'], 'color':['Red'], 'type':'Fruit'}),
'another_table_1':pd.DataFrame({'city':['Atlanta'],'state':['Georgia'], 'Country':['United States']}),
'another_table_2':pd.DataFrame({'city':['Arlinton'],'state':['Virginia'], 'Country':['United States']}),
'and_another_table_1':pd.DataFrame({'firstname':['John'], 'middlename':['Patrick'], 'lastnme':['Snow']}),
'and_another_table_2':pd.DataFrame({'firstname':['Alex'], 'middlename':['Justin'], 'lastnme':['Brown']}),
}
tables = set([i.rsplit('_', 1)[0] for i in dd.keys()])
d = defaultdict(list)
[d[i].append(dd[k]) for i in tables for k in dd.keys() if k.startswith(i)]
l_of_dfs = [pd.concat(d[i]) for i in d.keys()]
print(l_of_dfs[0])
print('\n')
print(l_of_dfs[1])
print('\n')
print(l_of_dfs[2])
Output:
city state Country
0 Atlanta Georgia United States
0 Arlinton Virginia United States
firstname middlename lastnme
0 John Patrick Snow
0 Alex Justin Brown
Name color type
0 Banana Yellow Fruit
0 Apple Red Fruit
So, here is my dataframe
import pandas as pd
cols = ['Name','Country','Income']
vals = [['Steve','USA',40000],['Matt','UK',40000],['John','USA',40000],['Martin','France',40000],]
x = pd.DataFrame(vals,columns=cols)
I have another list:
europe = ['UK','France']
I want to create a new column 'Continent' if x.Country is in europe
You need numpy.where with condition with isin:
x['Continent'] = np.where(x['Country'].isin(europe), 'Europe', 'Not Europe')
print (x)
Name Country Income Continent
0 Steve USA 40000 Not Europe
1 Matt UK 40000 Europe
2 John USA 40000 Not Europe
3 Martin France 40000 Europe
Or you can using isin directly
x['New Column']='Not Europe'
x.loc[x.Country.isin(europe),'New Column']='Europe'
Out[612]:
Name Country Income New Column
0 Steve USA 40000 Not Europe
1 Matt UK 40000 Europe
2 John USA 40000 Not Europe
3 Martin France 40000 Europe