What's wrong? Pandas - Python

SyntaxError: invalid syntax. When I execute the code it does not work. I am trying to create groups by continent, but Python flags the = as invalid. What should I put instead?
def country_kl(country):
    if country = ['United States', 'Mexico', 'Canada', 'Bahamas', 'Chile', 'Brazil', 'Colombia',
                  'British Virgin Islands', 'Peru', 'Uruguay', 'Turks and Caicos Islands',
                  'Cambodia', 'Bermuda', 'Argentina']:
        return '1'
    elif country = ['France', 'Spain', 'Germany', 'Switzerland', 'Belgium', 'United Kingdom',
                    'Austria', 'Italy', 'Swaziland', 'Russia', 'Sweden', 'Czechia', 'Monaco',
                    'Denmark', 'Poland', 'Norway', 'Netherlands', 'Portugal', 'Turkey', 'Finland',
                    'Ukraine', 'Andorra', 'Hungary', 'Greece', 'Romania', 'Slovakia',
                    'Liechtenstein', 'Guernsey', 'Ireland']:
        return '2'
    elif country = ['India', 'China', 'Singapore', 'Hong Kong', 'Australia', 'Japan']:
        return '3'
    elif country = ['United Arab Emirates', 'Thailand', 'Malaysia', 'New Zealand', 'South Korea',
                    'Philippines', 'Taiwan', 'Israel', 'Vietnam', 'Cayman Islands', 'Kazakhstan',
                    'Georgia', 'Bahrain', 'Nepal', 'Qatar', 'Oman', 'Lebanon']:
        return '3'
    else:
        return '4'

One more error in your code is that you used a single "=", which actually means assignment.
To compare two values, use "==" (a double "=").
But of course, to check whether the value of a variable is contained in a list, you have to use the in operator, just as Ilya suggested in his comment.
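For completeness, here is a minimal corrected sketch of the function using the in operator (the country lists are abridged here for brevity; paste in the full lists from the question):

def country_kl(country):
    # Membership test with 'in'; a single '=' is assignment and is a
    # syntax error inside an if condition.
    # Lists abridged -- use the full lists from the question.
    if country in ['United States', 'Mexico', 'Canada', 'Brazil', 'Argentina']:
        return '1'
    elif country in ['France', 'Spain', 'Germany', 'United Kingdom', 'Italy']:
        return '2'
    elif country in ['India', 'China', 'Singapore', 'Hong Kong', 'Australia', 'Japan']:
        return '3'
    else:
        return '4'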
Another, more readable and elegant solution is:
Create a dictionary where the key is the country name and the value is your expected result for that country. Something like:
countries = {'United States': '1', 'Mexico': '1', 'France': '2', 'Spain': '2',
'India': '3', 'China': '3', 'Singapore': '3'}
(include other countries too).
Look up this dictionary with a default value of '4', the one you used in your code:
result = countries.get(country, '4')
And by the way: your question and code have nothing to do with Pandas.
You use an ordinary Python list and (as I suppose) a string variable.
But since you also tagged your question with Pandas,
I came up with a pandasonic solution as well:
Create a Series from the above dictionary:
ctr = pd.Series(countries)
Lookup this Series, also with a default value:
result = ctr.get(country, default='4')
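If the goal is to label a whole DataFrame column (an assumption; the question never shows a DataFrame), Series.map with the same dictionary does the lookup in one vectorised step. A minimal sketch:

import pandas as pd

# Hypothetical DataFrame with a 'Country' column (not shown in the question).
df = pd.DataFrame({'Country': ['United States', 'France', 'India', 'Egypt']})

countries = {'United States': '1', 'Mexico': '1', 'France': '2', 'Spain': '2',
             'India': '3', 'China': '3', 'Singapore': '3'}

# map() yields NaN for unknown countries; fillna supplies the '4' fallback.
df['group'] = df['Country'].map(countries).fillna('4')
print(df)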

Related

python dictionary in dictionary changing value [duplicate]

teams = ['Argentina', 'Australia', 'Belgium', 'Brazil', 'Colombia', 'Costa Rica', 'Croatia', 'Denmark', 'Egypt', 'England', 'France', 'Germany', 'Iceland', 'Iran', 'Japan', 'Mexico', 'Morocco', 'Nigeria', 'Panama', 'Peru', 'Poland', 'Portugal', 'Russia', 'Saudi Arabia', 'Senegal', 'Serbia', 'South Korea', 'Spain', 'Sweden', 'Switzerland', 'Tunisia', 'Uruguay']
wins = {"wins" : 0}
combined_data = dict.fromkeys(teams, wins)
combined_data["Russia"]['wins'] +=1
print(combined_data["Belgium"]['wins'])
It should print 0, but all 'wins' values go up by 1 and it prints 1.
I also tried a 'wins' dictionary with 32 keys, but that didn't work either.
Many thanks in advance.
The wins dictionary is shared across all keys inside the combined_data dictionary.
To resolve this issue, use a dictionary comprehension instead:
combined_data = {team: {"wins" : 0} for team in teams}
Your code is not creating a new dictionary for each team, it's putting a reference to the exact same dictionary in each key's value. You can see this by running id() on two separate keys of the dictionary:
id(combined_data['Argentina']) # 5077746752
id(combined_data['Sweden']) # 5077746752
To do what you mean to do, you should use something like a dictionary comprehension:
combined_data = {team: {"wins" : 0} for team in teams}
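As a quick sanity check (a sketch, not from the original answers), the comprehension really does create one independent inner dictionary per team:

teams = ['Argentina', 'Belgium', 'Russia']  # shortened list for the demo

combined_data = {team: {"wins": 0} for team in teams}

# Each value is now a distinct object, so the ids differ.
print(id(combined_data['Argentina']) == id(combined_data['Russia']))  # False

# Incrementing one team no longer touches the others.
combined_data['Russia']['wins'] += 1
print(combined_data['Belgium']['wins'])  # 0
print(combined_data['Russia']['wins'])   # 1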

Using regex in python for a dynamic string

I have a pandas column with strings which don't follow the same pattern, something like this:
{'iso_2': 'FR', 'iso_3': 'FRA', 'name': 'France'}
{'iso': 'FR', 'iso_2': 'USA', 'name': 'United States of America'}
{'iso_3': 'FR', 'iso_4': 'FRA', 'name': 'France'}
How do I keep only the name of the country for every row? I would only like to keep "France", "United States of America", "France".
I tried building a regex pattern, something like this:
r"^\W+[a-z]+_[0-9]\W+"
But this turns out to be very specific, and if there is a slight change in the string the pattern won't work. How do we resolve this?
As you have dictionaries in the column, you can get the values of the name keys:
import pandas as pd
df = pd.DataFrame({'col': [{'iso_2': 'FR', 'iso_3': 'FRA', 'name': 'France'},
                           {'iso': 'FR', 'iso_2': 'USA', 'name': 'United States of America'},
                           {'iso_3': 'FR', 'iso_4': 'FRA', 'name': 'France'}]})
df['col'] = df['col'].apply(lambda x: x["name"])
Output of df['col']:
0                       France
1    United States of America
2                       France
Name: col, dtype: object
If the column contains stringified dictionaries, you can use ast.literal_eval before accessing the name key value:
import pandas as pd
import ast
df = pd.DataFrame({'col':["{'iso_2': 'FR', 'iso_3': 'FRA', 'name': 'France'}",
"{'iso': 'FR', 'iso_2': 'USA', 'name': 'United States of America'}",
"{'iso_3': 'FR', 'iso_4': 'FRA', 'name': 'France'}"]})
df['col'] = df['col'].apply(lambda x: ast.literal_eval(x)["name"])
And in case your column is totally messed up, yes, you can resort to regex:
df['col'] = df['col'].str.extract(r"""['"]name['"]\s*:\s*['"]([^"']+)""")
# or to support escaped " and ':
df['col'] = df['col'].str.extract(r"""['"]name['"]\s*:\s*['"]([^"'\\]+(?:\\.[^'"\\]*)*)""")

>>> df['col']
0                       France
1    United States of America
2                       France

Select Rows where one cell contains 'some' word and store in a variable

I'm trying to analyze StackOverflow Survey Data which can be found here.
I want to get all rows which contain "United States" in the "Country" column and then store the values in a variable.
For example:
my_data = `{'Age1stCode': 12, 13, 15, 16, 18
'Country': 'India', 'China', 'United States', 'England', 'United States'
'x': 'a', 'b', 'c', 'd', 'e'
}`
What I want:
my_result_data = `{'Age1stCode': 15, 18
'Country': 'United States', 'United States'
'x': 'c', 'e'
}`
If you need to filter for or search a specific string in a specific column of a DataFrame,
you just need to use
df_query = df.loc[df['Country'].str.contains('United States')]
You can pass case=False to make the search case-insensitive (it is case-sensitive by default).
Check the pandas documentation on str.contains for more details.
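Put together for the survey data, a minimal sketch could look like this (the file name is an assumption; point it at the CSV you downloaded):

import pandas as pd

# Hypothetical file name for the downloaded survey results.
df = pd.read_csv("survey_results_public.csv")

# Keep only rows whose 'Country' column contains "United States";
# na=False skips missing values instead of raising an error.
my_result_data = df.loc[df['Country'].str.contains('United States', na=False)]
print(my_result_data[['Age1stCode', 'Country']].head())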

Aggregating and group by in Pandas considering some conditions

I have an Excel file which, simplified, has the following structure, and which I read as a DataFrame:
df = pd.DataFrame({'ISIN':['US02079K3059', 'US02079K3059', 'US02079K3059', 'US02079K3059', 'US02079K3059', 'US02079K3059', 'US02079K3059', 'US02079K3059', 'US00206R1023'],
'Name':['ALPHABET INC.CL.A DL-,001', 'Alphabet Inc Class A', 'ALPHABET INC CLASS A', 'ALPHABET A', 'ALPHABET INC CLASS A', 'ALPHABET A', 'Alphabet Inc. Class C', 'Alphabet Inc. Class A', 'AT&T Inc'],
'Country':['United States', 'United States', 'United States', '', 'United States', 'United States', 'United States', 'United States', 'United States'],
'Category':[ '', 'big', 'big', '', 'big', 'test', 'test', 'test', 'average'],
'Category2':['important', '', 'important', '', '', '', '', '', 'irrelevant'],
'Value':[1000, 750, 60, 50, 160, 9, 10, 10, 1]})
I would love to group by ISIN, add up the values and calculate the sum like
df1 = df.groupby('ISIN')['Value'].sum()
The problem with this approach is that I don't get the other fields 'Name', 'Country', 'Category', 'Category2'.
My objective is to get as a result the following data aggregated dataframe:
df1 = pd.DataFrame({'ISIN':['US02079K3059', 'US00206R1023'],
'Name':['ALPHABET A', 'AT&T Inc'],
'Country':['United States', 'United States'],
'Category':['big', 'average'],
'Category2':['important', 'irrelevant'],
'Value':[2049, 1]})
If you compare df to df1, you will recognize some criteria/conditions I applied:
For every 'ISIN', the most commonly appearing field value should be used, e.g. 'United States' in column 'Country'.
If field values are equally common, the first appearing of the most common should be used, e.g. 'big' and 'test' in column 'Category'.
Exception: empty values don't count; e.g. in 'Category2', even though '' is the most common value, 'important' is used as the final value.
How can I achieve this goal? Can anyone help me out?
Try converting '' to NaN, then drop the 'Value' column, group by 'ISIN' and calculate the mode, and finally map the sum of the 'Value' column grouped by 'ISIN' back to the 'ISIN' column to create the 'Value' column in your final result.
Basically, the idea is to convert the empty string '' to NaN so that it doesn't count in the mode. We also define a function to handle the case where the mode of a particular column grouped by 'ISIN' is empty because of dropna=True in the mode() method:
def f(x):
    try:
        return x.mode().iat[0]
    except IndexError:
        return float('NaN')
Finally:
out = (df.replace('', float('NaN'))
         .drop(columns='Value')
         .groupby('ISIN', as_index=False).agg(f))
out['Value'] = out['ISIN'].map(df.groupby('ISIN')['Value'].sum())
out['Value_perc'] = out['Value'].div(out['Value'].sum()).round(5)
OR
via passing dropna=False to the mode() method and using an anonymous function:
out = (df.replace('', float('NaN'))
         .drop(columns='Value')
         .groupby('ISIN', as_index=False).agg(lambda x: x.mode(dropna=False).iat[0]))
out['Value'] = out['ISIN'].map(df.groupby('ISIN')['Value'].sum())
out['Value_perc'] = out['Value'].div(out['Value'].sum()).round(5)
Now, if you print out, you will get your desired output.

Strip Quote from key Using DictReader

I am currently reading data from a CSV file, and I wanted to turn it into a dictionary of key-value pairs.
I was able to do that using csv.DictReader. But is there any way to strip the quotes from the keys?
It currently prints out like this:
{'COUNTRY': 'Germany', 'price': '49', 'currency': 'EUR', 'ID': '1', 'CITY': 'Munich'}
{'COUNTRY': 'United Kingdom', 'price': '40', 'currency': 'GBP', 'ID': '2', 'CITY': 'London'}
{'COUNTRY': 'United Kingdom', 'price': '40', 'currency': 'GBP', 'ID': '3', 'CITY': 'Liverpool'}
Is there any way to make it look like this?
{COUNTRY: 'Germany', price: '49', currency: 'EUR', ID: '1', CITY: 'Munich'}
{COUNTRY: 'United Kingdom', price: '40', currency: 'GBP', ID: '2', CITY: 'London'}
{COUNTRY: 'United Kingdom', price: '40', currency: 'GBP', ID: '3', CITY: 'Liverpool'}
import csv

input_file = csv.DictReader(open("201611022225.csv"))
for row in input_file:
    print(row)
Python uses quotes to indicate that a value is a string object when printing. In your case, the dictionary uses strings as keys, so when you print it, the quotes show up. But the quotes are not actually saved as part of the data; they just indicate the data type.
For example, if you write this data to a text file and open it later, it will not show the quotes.
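If the quotes around the keys are purely a cosmetic concern in the printed output, one option (a sketch, not part of the original answer) is to format each row yourself instead of relying on the dictionary's repr():

import csv

with open("201611022225.csv") as f:
    for row in csv.DictReader(f):
        # Reproduce the desired "{KEY: 'value', ...}" form from the question.
        print("{" + ", ".join("{}: '{}'".format(k, v) for k, v in row.items()) + "}")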
