If I have the following dataframe 'countries':
country info
england london-europe
scotland edinburgh-europe
china beijing-asia
unitedstates washington-north_america
I would like to take the info field and remove everything after the '-', so that it becomes:
country info
england london
scotland edinburgh
china beijing
unitedstates washington
How do I do this?
Try:
countries['info'] = countries['info'].str.split('-').str[0]
Output:
country info
0 england london
1 scotland edinburgh
2 china beijing
3 unitedstates washington
You just need to keep the first part of the string after a split on the dash character:
countries['info'] = countries['info'].str.split('-').str[0]
Or, equivalently, you can use
countries['info'] = countries['info'].str.split('-').map(lambda x: x[0])
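If you prefer not to split at all, a regex replace is a minimal alternative (a sketch, not one of the answers above):
# Drop the first dash and everything after it.
countries['info'] = countries['info'].str.replace('-.*', '', regex=True)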
You can also use str.extract with the pattern r"(\w+)(?=\-)".
Ex:
print(df['info'].str.extract(r"(\w+)(?=\-)"))
Output:
info
0 london
1 edinburgh
2 beijing
3 washington
I have a column in a df that I want to split into two columns on a comma delimiter. If the value in that column does not have a comma, I want to put it into the second column instead of the first.
Origin
New York, USA
England
Russia
London, England
California, USA
USA
I want the result to be:
Location    Country
New York    USA
NaN         England
NaN         Russia
London      England
California  USA
NaN         USA
I tried this code, but it does not produce the result I want:
df['Location'], df['Country'] = df['Origin'].str.split(',', 1)
We can try using str.extract here:
df["Location"] = df["Origin"].str.extract(r'(.*),')
df["Country"] = df["Origin"].str.extract(r'(\w+(?: \w+)*)$')
Here is a way using str.extract() and named groups:
df['Origin'].str.extract(r'(?P<Location>[A-Za-z ]+(?=,))?(?:, )?(?P<Country>\w+)')
Output:
Location Country
0 New York USA
1 NaN England
2 NaN Russia
3 London England
4 California USA
5 NaN USA
I have two dataframes:
Table1:
Table2:
How to find:
1. The country-city combinations that are present only in Table2 but not in Table1. Here [India-Mumbai] is the output.
2. For each country-city combination present in both tables, the "Initiatives" that are present in Table2 but not in Table1. Here the output is {"India-Bangalore": [Textile, Irrigation], "USA-Texas": [Irrigation]}.
To answer the first question, we can use the merge method and keep only the NaN rows:
>>> df_merged = pd.merge(df_1, df_2, on=['Country', 'City'], how='left', suffixes = ['_1', '_2'])
>>> df_merged[df_merged['Initiative_2'].isnull()][['Country', 'City']]
Country City
13 India Mumbai
For the next question, we first need to remove the NaN rows from the previously merged DataFrame:
>>> df_both_table = df_merged[~df_merged['Initiative_2'].isnull()]
>>> df_both_table
Country City Initiative_1 Initiative_2
0 India Bangalore Plants Plants
1 India Bangalore Plants Textile
2 India Bangalore Plants Irrigation
3 India Bangalore Industries Plants
4 India Bangalore Industries Textile
5 India Bangalore Industries Irrigation
6 India Bangalore Roads Plants
7 India Bangalore Roads Textile
8 India Bangalore Roads Irrigation
9 USA Texas Plants Plants
10 USA Texas Plants Irrigation
11 USA Texas Roads Plants
12 USA Texas Roads Irrigation
Then, we can filter on the rows where columns Initiative_1 and Initiative_2 differ and use a groupby to get the list of Initiative_2:
>>> df_unique_initiative_2 = df_both_table[~(df_both_table['Initiative_1'] == df_both_table['Initiative_2'])]
>>> df_list_initiative_2 = df_unique_initiative_2.groupby(['Country', 'City'])['Initiative_2'].unique().reset_index()
>>> df_list_initiative_2
Country City Initiative_2
0 India Bangalore [Textile, Irrigation, Plants]
1 USA Texas [Irrigation, Plants]
We do the same, but this time on Initiative_1, to get that list as well:
>>> df_list_initiative_1 = df_unique_initiative_2.groupby(['Country', 'City'])['Initiative_1'].unique().reset_index()
>>> df_list_initiative_1
Country City Initiative_1
0 India Bangalore [Plants, Industries, Roads]
1 USA Texas [Plants, Roads]
To finish, we use sets to remove the remaining redundant Initiative_1 elements and get the expected result:
>>> df_list_initiative_2['Initiative'] = (df_list_initiative_2['Initiative_2'].map(set)-df_list_initiative_1['Initiative_1'].map(set)).map(list)
>>> df_list_initiative_2[['Country', 'City', 'Initiative']]
Country City Initiative
0 India Bangalore [Textile, Irrigation]
1 USA Texas [Irrigation]
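For the first question, merge's indicator flag is another route; here is a minimal sketch with made-up rows (the original tables were posted as images, so df_1 and df_2 below are only illustrative stand-ins for Table1 and Table2):
import pandas as pd

# Illustrative rows only; the real Table1/Table2 were not reproduced here.
df_1 = pd.DataFrame({'Country': ['India', 'India', 'USA'],
                     'City': ['Bangalore', 'Mumbai', 'Texas'],
                     'Initiative': ['Plants', 'Roads', 'Plants']})
df_2 = pd.DataFrame({'Country': ['India', 'India', 'USA'],
                     'City': ['Bangalore', 'Delhi', 'Texas'],
                     'Initiative': ['Textile', 'Plants', 'Irrigation']})

# indicator=True adds a _merge column telling which side each combo came from.
combos = (df_1[['Country', 'City']].drop_duplicates()
          .merge(df_2[['Country', 'City']].drop_duplicates(),
                 how='outer', indicator=True))
only_in_table2 = combos.loc[combos['_merge'] == 'right_only',
                            ['Country', 'City']]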
Alternative approach (df1 your Table1, df2 your Table2):
combos_1, combos_2 = set(zip(df1.Country, df1.City)), set(zip(df2.Country, df2.City))
in_2_but_not_in_1 = [f"{country}-{city}" for country, city in combos_2 - combos_1]
initiatives = {
f"{country}-{city}": (
set(df2.Initiative[df2.Country.eq(country) & df2.City.eq(city)])
- set(df1.Initiative[df1.Country.eq(country) & df1.City.eq(city)])
)
for country, city in combos_1 & combos_2
}
Results:
['India-Delhi']
{'India-Bangalore': {'Irrigation', 'Textile'}, 'USA-Texas': {'Irrigation'}}
I think you got "The country-city combinations that are present only in Table2 but not Table1. Here [India-Mumbai] is the output" wrong: the combination India-Mumbai is not present in Table2, is it?
I have a dataframe (see the example below, because the real data is sensitive) with a full term (string), a text snippet (one big string) containing the full term, and an abbreviation (string). I have been struggling with how to replace the full term in the text snippet with the corresponding abbreviation. Can anyone help? Example:
term text_snippet abbr
0 aanvullend onderzoek aanvullend onderzoek is vereist om... ao/
So I want to end up with:
term text_snippet abbr
0 aanvullend onderzoek ao/ is vereist om... ao/
You can use apply and replace terms with abbrs:
df['text_snippet'] = df.apply(
lambda x: x['text_snippet'].replace(x['term'], x['abbr']), axis=1)
df
Output:
term text_snippet abbr
0 aanvullend onderzoek ao/ is vereist om... ao/
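On large frames, a sketch of an apply-free variant that zips the three columns (same result, usually faster than a row-wise apply):
# Plain Python loop over aligned columns; avoids per-row apply overhead.
df['text_snippet'] = [text.replace(term, abbr)
                      for text, term, abbr
                      in zip(df['text_snippet'], df['term'], df['abbr'])]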
Here is a solution. I made up a simple dataframe:
df = pd.DataFrame({'term':['United States of America','Germany','Japan'],
'text_snippet':['United States of America is in America',
'Germany is in Europe',
'Japan is in Asia'],
'abbr':['USA','DE','JP']})
df
Dataframe before solution:
term text_snippet abbr
0 United States of America United States of America is in America USA
1 Germany Germany is in Europe DE
2 Japan Japan is in Asia JP
Use the apply function on every row:
df['text_snippet'] = df.apply(lambda row: row['text_snippet'].replace(row['term'], row['abbr']), axis=1)
Output:
term text_snippet abbr
0 United States of America USA is in America USA
1 Germany DE is in Europe DE
2 Japan JP is in Asia JP
Let df be the dataframe as follows:
date text
0 2019-6-7 London is good.
1 2019-5-8 I am going to Paris.
2 2019-4-4 Do you want to go to London?
3 2019-3-7 I love Paris!
I would like to add a column city, which indicates the city contained in text, that is,
date text city
0 2019-6-7 London is good. London
1 2019-5-8 I am going to Paris. Paris
2 2019-4-4 Do you want to go to London? London
3 2019-3-7 I love Paris! Paris
How to do it without using lambda?
You can first make sure you have the list of cities, then use str.findall:
df.text.str.findall('London|Paris').str[0]
Out[320]:
0 London
1 Paris
2 London
3 Paris
Name: text, dtype: object
df['city'] = df.text.str.findall('London|Paris').str[0]
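If the list of cities is longer, a sketch of building the alternation pattern from a Python list (the cities list here is a hypothetical stand-in):
import re

cities = ['London', 'Paris']  # hypothetical list of known cities
# re.escape guards against city names containing regex metacharacters.
pattern = '|'.join(map(re.escape, cities))

df['city'] = df.text.str.findall(pattern).str[0]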
Adding to @WenYoBen's method: if each text contains only one of Paris or London, then str.extract is better:
regex = '(London|Paris)'
df['city'] = df.text.str.extract(regex)
df
date text city
0 2019-6-7 London is good. London
1 2019-5-8 I am going to Paris. Paris
2 2019-4-4 Do you want to go to London? London
3 2019-3-7 I love Paris! Paris
And if you want all the cities from your regex in each text, then str.extractall is an option too:
df['city'] = df.text.str.extractall(regex).values
df
date text city
0 2019-6-7 London is good. London
1 2019-5-8 I am going to Paris. Paris
2 2019-4-4 Do you want to go to London? London
3 2019-3-7 I love Paris! Paris
Note that if a text contains multiple matches, str.extractall returns one row per match, so assigning .values back like this only works when every row has exactly one match.
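If some rows may contain several matches, a sketch that gathers all of a row's matches into a list instead of assuming one match per row:
# extractall returns one row per match with a (row, match) MultiIndex;
# group on the original row index to collect each row's matches into a list.
df['cities'] = (df.text.str.extractall(regex)[0]
                .groupby(level=0).agg(list))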
I am looking for an efficient way to select matching rows from two dataframes based on a shared value, and to upsert these into a new dataframe that compares the intersection of the two and shows where particular fields differ.
Example:
DataFrame1
FirstName, City
Mark, London
Mary, Dallas
Abi, Madrid
Eve, Paris
Robin, New York
DataFrame2
FirstName, City
Mark, Berlin
Abi, Delhi
Eve, Paris
Mary, Dallas
Francis, Rome
In the dataframes, I have potential matching/overlapping rows on 'FirstName', so the intersection on these is:
Mark, Mary, Abi, Eve
excluded from the join are:
Robin, Francis
I construct a dataframe that allows values from both to be compared:
DataFrameMatch
FirstName_1, FirstName_2, FirstName_Match, City_1, City_2, City_Match
And insert/update (upsert) so my output is:
DataFrameMatch
FirstName_1 FirstName_2 FirstName_Match City_1 City_2 City_Match
Mark Mark True London Berlin False
Abi Abi True Madrid Delhi False
Mary Mary True Dallas Dallas True
Eve Eve True Paris Paris True
I can then report on the difference between the two lists, and what particular fields are different.
merge
According to your output, you only want rows where 'FirstName' matches, plus another column that evaluates whether the cities match:
d1.merge(d2, on='FirstName', suffixes=['_1', '_2']).eval('City_Match = City_1 == City_2')
FirstName City_1 City_2 City_Match
0 Mark London Berlin False
1 Mary Dallas Dallas True
2 Abi Madrid Delhi False
3 Eve Paris Paris True
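If you also want the exact DataFrameMatch layout from the question, a sketch that adds the redundant FirstName columns on top of the merge (after an inner join on 'FirstName', both sides are equal by construction):
m = d1.merge(d2, on='FirstName', suffixes=['_1', '_2'])
m = m.assign(FirstName_1=m['FirstName'], FirstName_2=m['FirstName'],
             FirstName_Match=True,  # equal by construction after the join
             City_Match=m['City_1'] == m['City_2'])
m = m[['FirstName_1', 'FirstName_2', 'FirstName_Match',
       'City_1', 'City_2', 'City_Match']]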
Details
You could do a simple merge and end up with
FirstName City
0 Mary Dallas
1 Eve Paris
A plain merge takes all common columns as join keys by default, so I had to restrict them via the on argument, hence on='FirstName':
d1.merge(d2, on='FirstName')
FirstName City_x City_y
0 Mark London Berlin
1 Mary Dallas Dallas
2 Abi Madrid Delhi
3 Eve Paris Paris
Which gets us closer but now I want to adjust those suffixes.
d1.merge(d2, on='FirstName', suffixes=['_1', '_2'])
FirstName City_1 City_2
0 Mark London Berlin
1 Mary Dallas Dallas
2 Abi Madrid Delhi
3 Eve Paris Paris
Lastly, I'll add a new column that shows the evaluation of 'City_1' being equal to 'City_2'. I chose to use pandas.DataFrame.eval. You can see the results above.
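For reference, the same column can be added without eval, using assign; a small sketch of the equivalent:
# assign computes the comparison column directly instead of parsing a string.
(d1.merge(d2, on='FirstName', suffixes=['_1', '_2'])
   .assign(City_Match=lambda d: d['City_1'] == d['City_2']))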