Using regex in python for a dynamic string - python

I have a pandas column of strings that don't all follow the same pattern, something like this:
{'iso_2': 'FR', 'iso_3': 'FRA', 'name': 'France'}
{'iso': 'FR', 'iso_2': 'USA', 'name': 'United States of America'}
{'iso_3': 'FR', 'iso_4': 'FRA', 'name': 'France'}
How do I only keep the name of the country for every row? I would only like to keep "France", "United States of America", "France".
I tried building a regex pattern, something like this:
r"^\W+[a-z]+_[0-9]\W+"
But this turns out to be very specific, and if there is a slight change in the string the pattern won't work. How do we resolve this?

As you have dictionaries in the column, you can get the values of the name keys:
import pandas as pd
df = pd.DataFrame({'col':[{'iso_2': 'FR', 'iso_3': 'FRA', 'name': 'France'},
{'iso': 'FR', 'iso_2': 'USA', 'name': 'United States of America'},
{'iso_3': 'FR', 'iso_4': 'FRA', 'name': 'France'}]})
df['col'] = df['col'].apply(lambda x: x["name"])
Output of df['col']:
0 France
1 United States of America
2 France
Name: col, dtype: object
If the column contains stringified dictionaries, you can use ast.literal_eval before accessing the name key value:
import pandas as pd
import ast
df = pd.DataFrame({'col':["{'iso_2': 'FR', 'iso_3': 'FRA', 'name': 'France'}",
"{'iso': 'FR', 'iso_2': 'USA', 'name': 'United States of America'}",
"{'iso_3': 'FR', 'iso_4': 'FRA', 'name': 'France'}"]})
df['col'] = df['col'].apply(lambda x: ast.literal_eval(x)["name"])
And in case your column is totally messed up, yes, you can resort to regex:
df['col'] = df['col'].str.extract(r"""['"]name['"]\s*:\s*['"]([^"']+)""")
# or to support escaped " and ':
df['col'] = df['col'].str.extract(r"""['"]name['"]\s*:\s*['"]([^"'\\]+(?:\\.[^'"\\]*)*)""")
>>> df['col']
0
0 France
1 United States of America
2 France
See the regex demo.
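If the column mixes real dictionaries, stringified dictionaries and unparseable values, the first two approaches can be combined. The following is only a minimal sketch: the helper name extract_name and the third sample row are made up for illustration, and rows that cannot be parsed come back as missing values.

```python
import ast

import pandas as pd


def extract_name(cell):
    """Return the 'name' value from a dict or a stringified dict, else None."""
    if isinstance(cell, dict):
        return cell.get("name")
    try:
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError, TypeError):
        # Not a valid Python literal, so there is nothing to extract
        return None
    return parsed.get("name") if isinstance(parsed, dict) else None


df = pd.DataFrame({"col": [
    {"iso_2": "FR", "iso_3": "FRA", "name": "France"},
    "{'iso': 'FR', 'iso_2': 'USA', 'name': 'United States of America'}",
    "not a dictionary at all",  # hypothetical malformed row
]})

names = df["col"].apply(extract_name)
print(names)
```

Rows that fail to parse can then be handled separately, for example with the regex above.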

Related

Replace list of items in Pandas dataframe by tree leafs

I have a tree, Locations, which has a Continent -> Country -> Location hierarchy. I also have a dataframe in which each row holds a list of entries from this tree.
How can I replace the entries in each row's list with the corresponding leaves of the tree?
My creativity with apply or map, and possibly a lambda function, is lacking.
Minimal example;
import pandas as pd
Locations = {
'Europe':
{'Germany': ['Berlin'],
'France': ['Paris','Bordeaux']},
'Asia':
{'China': ['Hong Kong'],
'Indonesia': ['Jakarta']},
'North America':
{'United States':['New York','Washington']}}
df = pd.DataFrame({'Persons': ['A', 'B'], 'Locations': [
['North America','United States','Asia','France'],
['North America','Asia','Europe','Germany']]})
df = df.apply(...)?
df = df.map(...)?
# How to end up with:
pd.DataFrame({'Persons': ['A', 'B'], 'Locations': [
['New York','Washington','Hong Kong','Jakarta','Paris','Bordeaux'],
['New York','Washington','Hong Kong','Jakarta','Paris','Bordeaux','Berlin']]})
# Note: the order of the locations doesn't matter, so this is also OK
pd.DataFrame({'Persons': ['A', 'B'], 'Locations': [
['Jakarta','Washington','Hong Kong','Paris','New York','Bordeaux'],
['Jakarta','Berlin','Washington','Hong Kong','Paris','New York','Bordeaux']]})
You do not really need the apply method. You can start by changing the structure of your Locations dictionary in order to map the actual values to your exploded data frame. Then, just combine several explode, drop_duplicates and groupby statements with different aggregation logics to produce your desired result.
Code:
import pandas as pd
from collections import defaultdict
from itertools import chain
Locations = {
'Europe':{'Germany': ['Berlin'], 'France': ['Paris','Bordeaux']},
'Asia': {'China': ['Hong Kong'], 'Indonesia': ['Jakarta']},
'North America': {'United States': ['New York','Washington']}
}
df = pd.DataFrame({'Persons':['A', 'B'], 'Locations': [['North America','United States','Asia','France'], ['North America','Asia','Europe']]})
mapping_locs = defaultdict(list)
for key, val in Locations.items():
    mapping_locs[key] = list(chain.from_iterable(list(val.values())))
    for lkey, lval in val.items():
        mapping_locs[lkey] = lval
(
df.assign(
mapped_locations=(
df.explode("Locations")["Locations"].map(mapping_locs).reset_index()
.explode("Locations").drop_duplicates(subset=["index", "Locations"])
.groupby(level=0).agg({"index": "first", "Locations": list})
.groupby("index").apply(lambda x: list(chain.from_iterable(x["Locations"])))
)
)
)
Output:
Persons Locations mapped_locations
0 A [North America, United States, Asia, France] [New York, Washington, Hong Kong, Jakarta, Par...
1 B [North America, Asia, Europe] [New York, Washington, Hong Kong, Jakarta, Ber...
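For comparison, the same mapping idea can also be written with a plain dictionary and a small helper applied per row. This is an alternative sketch, not the approach of the answer above: the names leaves and to_leaves are made up for illustration, and duplicates are dropped in first-seen order.

```python
import pandas as pd

Locations = {
    'Europe': {'Germany': ['Berlin'], 'France': ['Paris', 'Bordeaux']},
    'Asia': {'China': ['Hong Kong'], 'Indonesia': ['Jakarta']},
    'North America': {'United States': ['New York', 'Washington']},
}

# Flatten the tree: every continent and every country maps to its leaf cities
leaves = {}
for continent, countries in Locations.items():
    leaves[continent] = [city for cities in countries.values() for city in cities]
    leaves.update(countries)


def to_leaves(entries):
    """Replace tree entries by their leaves, dropping duplicates."""
    cities = (city for entry in entries for city in leaves.get(entry, []))
    # dict.fromkeys removes duplicates while keeping first-seen order
    return list(dict.fromkeys(cities))


df = pd.DataFrame({'Persons': ['A', 'B'], 'Locations': [
    ['North America', 'United States', 'Asia', 'France'],
    ['North America', 'Asia', 'Europe', 'Germany']]})

df['Locations'] = df['Locations'].apply(to_leaves)
print(df)
```

Entries that are not in the tree are silently skipped here; whether that is the right behaviour depends on the data.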

Whats wrong? Pandas

I get "SyntaxError: invalid syntax" when executing the code below. It is supposed to group countries by continent, but Python reports that the = is invalid. What should I put there instead?
def country_kl(country):
    if country = ['United States', 'Mexico', 'Canada', 'Bahamas', 'Chile', 'Brazil', 'Colombia','British Virgin Islands'
                  ,'Peru','Uruguay','Turks and Caicos Islands','Cambodia','Bermuda','Argentina']:
        return '1'
    elif country = ['France', 'Spain', 'Germany', 'Switzerland', 'Belgium', 'United Kingdom', 'Austria', 'Italy', 'Swaziland'
                    ,'Russia' , 'Sweden','Czechia','Monaco','Denmark','Poland','Norway','Netherlands','Portugal','Turkey','Finland',
                    'Ukraine','Andorra','Hungary','Greece','Romania','Slovakia','Liechtenstein','Guernsey','Ireland']:
        return '2'
    elif country = ['India','China', 'Singapore', 'Hong Kong', 'Australia', 'Japan']:
        return '3'
    elif country = ['United Arab Emirates',
                    'Thailand','Malaysia','New Zealand','South Korea','Philippines','Taiwan','Israel','Vietnam','Cayman Islands',
                    'Kazakhstan' ,'Georgia','Bahrain','Nepal','Qatar','Oman','Lebanon']:
        return '3'
    else :
        return '4'
One more error in your code is that you used a single "=", which
means assignment, not comparison.
To compare two values use "==" (double "=").
But to check whether the value of a variable is contained
in a list you have to use the in operator, just as Ilya suggested in his comment.
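A shortened sketch of the function with the in operator might look like this. The list names are made up for illustration and the lists are cut down to a few countries from the question.

```python
# Shortened lists; the names AMERICAS, EUROPE and ASIA_PACIFIC are illustrative
AMERICAS = ['United States', 'Mexico', 'Canada']
EUROPE = ['France', 'Spain', 'Germany']
ASIA_PACIFIC = ['India', 'China', 'Singapore', 'Hong Kong', 'Australia', 'Japan']


def country_kl(country):
    # "in" tests membership; "=" would be an assignment and is a syntax error here
    if country in AMERICAS:
        return '1'
    elif country in EUROPE:
        return '2'
    elif country in ASIA_PACIFIC:
        return '3'
    else:
        return '4'


print(country_kl('Mexico'))
```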
Another, more readable and elegant solution is:
Create a dictionary, where the key is country name and the
value is your expected result for this country. Something like:
countries = {'United States': '1', 'Mexico': '1', 'France': '2', 'Spain': '2',
'India': '3', 'China': '3', 'Singapore': '3'}
(include other countries too).
Look up this dictionary, with default value of '4', which you
used in your code:
result = countries.get(country, '4')  # dict.get takes the default positionally, not as a keyword
And by the way: Your question and code have nothing to do with Pandas.
You use ordinary, pythonic list and (as I suppose) a string variable.
But since you marked your question also with Pandas tag,
I came up also with a pandasonic solution:
Create a Series from the above dictionary:
ctr = pd.Series(countries.values(), index=countries.keys())
Lookup this Series, also with a default value:
result = ctr.get(country, default='4')
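If the country names live in a DataFrame column, the same dictionary can be applied to the whole column at once. This is a sketch only: the frame and the column names country and group are assumptions, not part of the question.

```python
import pandas as pd

countries = {'United States': '1', 'Mexico': '1', 'France': '2', 'Spain': '2',
             'India': '3', 'China': '3', 'Singapore': '3'}

# Hypothetical frame with one country name per row
df = pd.DataFrame({'country': ['Mexico', 'Spain', 'China', 'Atlantis']})

# map() looks every value up in the dictionary; unknown names become NaN,
# which fillna() then replaces with the default group '4'
df['group'] = df['country'].map(countries).fillna('4')
print(df)
```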

Aggregating and group by in Pandas considering some conditions

I have an excel file which simplified has the following structure and which I read as a dataframe:
df = pd.DataFrame({'ISIN':['US02079K3059', 'US02079K3059', 'US02079K3059', 'US02079K3059', 'US02079K3059', 'US02079K3059', 'US02079K3059', 'US02079K3059', 'US00206R1023'],
'Name':['ALPHABET INC.CL.A DL-,001', 'Alphabet Inc Class A', 'ALPHABET INC CLASS A', 'ALPHABET A', 'ALPHABET INC CLASS A', 'ALPHABET A', 'Alphabet Inc. Class C', 'Alphabet Inc. Class A', 'AT&T Inc'],
'Country':['United States', 'United States', 'United States', '', 'United States', 'United States', 'United States', 'United States', 'United States'],
'Category':[ '', 'big', 'big', '', 'big', 'test', 'test', 'test', 'average'],
'Category2':['important', '', 'important', '', '', '', '', '', 'irrelevant'],
'Value':[1000, 750, 60, 50, 160, 9, 10, 10, 1]})
I would love to group by ISIN and add up the values and calculate the sum like
df1 = df.groupby('ISIN')[['Value']].sum()
The problem with this approach is that I don't get the other fields 'Name', 'Country', 'Category', 'Category2'.
My objective is to get as a result the following data aggregated dataframe:
df1 = pd.DataFrame({'ISIN':['US02079K3059', 'US00206R1023'],
'Name':['ALPHABET A', 'AT&T Inc'],
'Country':['United States', 'United States'],
'Category':['big', 'average'],
'Category2':['important', 'irrelevant'],
'Value':[2049, 1]})
If you compare df to df1, you will recognize some criteria/conditions I applied:
for every 'ISIN' most commonly appearing field value should be used, e.g. 'United States' in column 'Country'
If field values are equally most common, the first appearing of the most common should be used, e.g. 'big' and 'test' in column 'Category'
Exception: empty values don't count, e.g. Category2, even though '' is the most common value, 'important' is used as final value.
How can I achieve this goal? Anyone who can help me out?
Try converting '' to NaN, dropping the 'Value' column, grouping by 'ISIN' and taking the mode of each remaining column. Then map the per-'ISIN' sum of 'Value' back onto the 'ISIN' column to create the 'Value' column of the final result.
The idea is that converting the empty string '' to NaN keeps it from counting toward the mode, since mode() uses dropna=True by default. A helper function handles the case where every value of a column within an 'ISIN' group is NaN: mode() then returns an empty Series and .iat[0] raises IndexError.
def f(x):
    try:
        return x.mode().iat[0]
    except IndexError:
        return float('NaN')
Finally:
out=(df.replace('',float('NaN'))
.drop(columns='Value')
.groupby('ISIN',as_index=False).agg(f))
out['Value']=out['ISIN'].map(df.groupby('ISIN')['Value'].sum())
out['Value_perc']=out['Value'].div(out['Value'].sum()).round(5)
OR
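Pass dropna=False to mode() inside an anonymous function, which makes the try/except unnecessary. Note that NaN then counts as a value, so a column whose entries within a group are mostly empty (such as Category2 here) comes out as NaN: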
out=(df.replace('',float('NaN'))
.drop(columns='Value')
.groupby('ISIN',as_index=False).agg(lambda x:x.mode(dropna=False).iat[0]))
out['Value']=out['ISIN'].map(df.groupby('ISIN')['Value'].sum())
out['Value_perc']=out['Value'].div(out['Value'].sum()).round(5)
Printing out now shows one aggregated row per ISIN.
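The question also asks that ties go to the value that appears first, whereas mode() returns tied values in sorted order. One way to sketch that rule explicitly is with collections.Counter, whose most_common() keeps first-encountered order among equal counts. The helper name first_mode is made up, and the 'Name' column is left out to keep the example short.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    "ISIN": ["US02079K3059"] * 8 + ["US00206R1023"],
    "Country": ["United States", "United States", "United States", "",
                "United States", "United States", "United States",
                "United States", "United States"],
    "Category": ["", "big", "big", "", "big", "test", "test", "test", "average"],
    "Category2": ["important", "", "important", "", "", "", "", "", "irrelevant"],
    "Value": [1000, 750, 60, 50, 160, 9, 10, 10, 1],
})


def first_mode(values):
    """Most frequent non-empty value; ties go to the value seen first."""
    counts = Counter(v for v in values if pd.notna(v) and v != "")
    # most_common() keeps first-encountered order among equal counts
    return counts.most_common(1)[0][0] if counts else ""


# One aggregation per column: the custom mode for text, a plain sum for Value
agg_spec = {col: first_mode for col in ["Country", "Category", "Category2"]}
agg_spec["Value"] = "sum"

out = df.groupby("ISIN", as_index=False, sort=False).agg(agg_spec)
print(out)
```

Because empty strings are skipped inside the helper, no conversion to NaN is needed in this variant.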

How to turn a list of a list of dictionaries into a dataframe via loop

I have a list of a list of dictionaries. I managed to access each list-element within the outer list and convert the dictionary via pandas into a data-frame. I then save the DF and later concat it. That's a perfect result. But I need a loop to do that for big data.
Here is my MWE which works fine in principle.
import pandas as pd
mwe = [
[{"name": "Norway", "population": 5223256, "area": 323802.0, "gini": 25.8}],
[{"name": "Switzerland", "population": 8341600, "area": 41284.0, "gini": 33.7}],
[{"name": "Australia", "population": 24117360, "area": 7692024.0, "gini": 30.5}],
]
df0 = pd.DataFrame.from_dict(mwe[0])
df1 = pd.DataFrame.from_dict(mwe[1])
df2 = pd.DataFrame.from_dict(mwe[2])
frames = [df0, df1, df2]
result = pd.concat(frames)
It creates a nice table.
Here is what I tried to create a list of data frames:
for i in range(len(mwe)):
    frame = pd.DataFrame()
    frame = pd.DataFrame.from_dict(mwe[i])
    frames = []
    frames.append(frame)
Addendum: Thanks for all the answers. They are working on my MWE. Which made me notice that there are some strange entries in my dataset. No solution works for my dataset, since I have an inner-list element which contains two dictionaries (due to non unique data retrieval):
....
[{'name': 'United States Minor Outlying Islands', 'population': 300},
{'name': 'United States of America',
'population': 323947000,
'area': 9629091.0,
'gini': 48.0}],
...
How can I drop the entry for "United States Minor Outlying Islands"?
You could get each dict out of the containing list and just have a list of dict:
import pandas as pd
mwe = [[{'name': 'Norway', 'population': 5223256, 'area': 323802.0, 'gini': 25.8}],
[{'name': 'Switzerland',
'population': 8341600,
'area': 41284.0,
'gini': 33.7}],
[{'name': 'Australia',
'population': 24117360,
'area': 7692024.0,
'gini': 30.5}]]
# use x.pop() so that you aren't carrying around copies of the data
# for a "big data" application
df = pd.DataFrame([x.pop() for x in mwe])
df.head()
area gini name population
0 323802.0 25.8 Norway 5223256
1 41284.0 33.7 Switzerland 8341600
2 7692024.0 30.5 Australia 24117360
By bringing the list comprehension into the dataframe declaration, that list is temporary, and you don't have to worry about the cleanup. pop will also consume the dictionaries out of mwe, minimizing the amount of copies you are carrying around in memory
As a note, when doing this, mwe will then look like:
mwe
[[], [], []]
Because the contents of the sub-lists have been popped out
EDIT: New Question Content
If your data contains duplicates, or at least entries you don't want, and the undesired entries don't have matching columns to the rest of the dataset (which appears to be the case), it becomes a bit trickier to avoid copying data as above:
mwe.append([{'name': 'United States Minor Outlying Islands', 'population': 300}, {'name': 'United States of America', 'population': 323947000, 'area': 9629091.0, 'gini': 48.0}])
key_check = dict.fromkeys(["name", "population", "area", "gini"])
# the easy way but copies data
# (the outer loop over mwe must come first in the comprehension)
df = pd.DataFrame([item for data in mwe
                   for item in data
                   if item.keys() == key_check.keys()])
Since you'll still have the data hanging around in mwe, it might be better to use a generator:
def get_filtered_data(mwe):
    for data in mwe:
        while data: # when data is empty, the while loop will end
            item = data.pop() # still consumes data out of mwe
            if item.keys() == key_check.keys():
                yield item # will minimize data copying through lazy evaluation
df = pd.DataFrame([x for x in get_filtered_data(mwe)])
area gini name population
0 323802.0 25.8 Norway 5223256
1 41284.0 33.7 Switzerland 8341600
2 7692024.0 30.5 Australia 24117360
3 9629091.0 48.0 United States of America 323947000
Again, this is under the assumption that non-desired entries have invalid columns, which appears to be the case here, specifically. Otherwise, this will at least flatten out the data structure so you can filter it with pandas later
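Create an empty DataFrame and loop over the list, calling df.append on each iteration. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so this only runs on older versions: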
>>> import pandas as pd
mwe = [[{'name': 'Norway', 'population': 5223256, 'area': 323802.0, 'gini': 25.8}],
[{'name': 'Switzerland',
'population': 8341600,
'area': 41284.0,
'gini': 33.7}],
[{'name': 'Australia',
'population': 24117360,
'area': 7692024.0,
'gini': 30.5}]]
>>> df = pd.DataFrame()
>>> for country in mwe:
... df = df.append(country)
...
>>> df
area gini name population
0 323802.0 25.8 Norway 5223256
0 41284.0 33.7 Switzerland 8341600
0 7692024.0 30.5 Australia 24117360
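On pandas 2.0 and later, where DataFrame.append no longer exists, the same loop can be sketched with pd.concat. This also fixes the loop from the question by building the list of frames once, outside the loop. The use of ignore_index=True is an assumption about the desired row labels.

```python
import pandas as pd

mwe = [
    [{"name": "Norway", "population": 5223256, "area": 323802.0, "gini": 25.8}],
    [{"name": "Switzerland", "population": 8341600, "area": 41284.0, "gini": 33.7}],
    [{"name": "Australia", "population": 24117360, "area": 7692024.0, "gini": 30.5}],
]

# Build one small frame per inner list, then concatenate them in a single call
frames = [pd.DataFrame(inner) for inner in mwe]

# ignore_index=True renumbers the rows 0..n-1 instead of repeating index 0
result = pd.concat(frames, ignore_index=True)
print(result)
```

Concatenating once at the end avoids copying the growing frame on every iteration.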
Try this :
df = pd.DataFrame(columns = ['name', 'population', 'area', 'gini'])
for i in range(len(mwe)):
    df.loc[i] = list(mwe[i][0].values())
Output :
name population area gini
0 Norway 5223256 323802.0 25.8
1 Switzerland 8341600 41284.0 33.7
2 Australia 24117360 7692024.0 30.5

Strip Quote from key Using DictReader

I am currently reading data from a CSV file, and I want to turn it into a dictionary of key-value pairs.
I was able to do that using csv.DictReader. But is there any way to strip the quotes from the keys?
I have it print out like this
{'COUNTRY': 'Germany', 'price': '49', 'currency': 'EUR', 'ID': '1', 'CITY': 'Munich'}
{'COUNTRY': 'United Kingdom', 'price': '40', 'currency': 'GBP', 'ID': '2', 'CITY': 'London'}
{'COUNTRY': 'United Kingdom', 'price': '40', 'currency': 'GBP', 'ID': '3', 'CITY': 'Liverpool'}
Is there any way to make it look like this?
{COUNTRY: 'Germany', price: '49', currency: 'EUR', ID: '1', CITY: 'Munich'}
{COUNTRY: 'United Kingdom', price: '40', currency: 'GBP', ID: '2', CITY: 'London'}
{COUNTRY: 'United Kingdom', price: '40', currency: 'GBP', ID: '3', CITY: 'Liverpool'}
import csv
# open the file in a with block so it is closed after reading
with open("201611022225.csv", newline="") as f:
    input_file = csv.DictReader(f)
    for row in input_file:
        print(row)
Python uses quotes to indicate that a value is a string object when printing a dictionary. In your case, the dictionary uses strings as keys, so when you print it, it shows the quotes. The quotes are not stored as part of the data; they only indicate the data type.
For example, if you write a key itself (rather than the whole dictionary) to a text file and open the file later, it will not show quotes.
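If the unquoted layout is only needed for display, the row can be formatted by hand. This is a small sketch; the helper name format_row is made up for illustration.

```python
row = {'COUNTRY': 'Germany', 'price': '49', 'currency': 'EUR', 'ID': '1', 'CITY': 'Munich'}

# The key itself contains no quote characters; they only appear in the repr
first_key = next(iter(row))
print(first_key)


def format_row(row):
    """Render a dict with bare keys and quoted (repr) values."""
    pairs = ", ".join(f"{key}: {value!r}" for key, value in row.items())
    return "{" + pairs + "}"


formatted = format_row(row)
print(formatted)
```

The result is a display string only and cannot be parsed back into a dictionary.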
