Create new column based on value of another column - python

I have a solution below that gives me a new column as a universal identifier, but what if there is additional data in the NAME column? How can I tweak the code below to do a wildcard-like (substring) search?
Basically, if German/german or Mexican/mexican appears anywhere in that row's value, the new column should contain Euro or South American respectively.
df["Identifier"] = df["NAME"].str.lower().replace(
    to_replace=['german', 'mexican'],
    value=['Euro', 'South American']
)
print(df)
      NAME      Identifier
0   German            Euro
1   german            Euro
2  Mexican  South American
3  mexican  South American
Desired output
                NAME      Identifier
0        1990 German            Euro
1        german 1998            Euro
2    country Mexican  South American
3  mexican city 2006  South American

Based on an answer in this post:
r = '(german|mexican)'
c = dict(german='Euro', mexican='South American')
df['Identifier'] = df['NAME'].str.lower().str.extract(r, expand=False).map(c)
Another approach would be using np.where with those two conditions, but there is probably a more elegant solution.

The code below will work. I tried it using the apply function but somehow couldn't get that right; I may revisit it at some point. Meanwhile, here is working code:
df3['identifier'] = ''
js_ref = [{'german': 'Euro'}, {'mexican': 'South American'}]
for i in range(len(df3)):
    for l in js_ref:
        for k, v in l.items():
            if k.lower() in df3.name[i].lower():
                df3.identifier[i] = v
                break
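A vectorized alternative to the loop above, sketched with np.select and case-insensitive str.contains (sample data taken from the desired output; the two keyword-to-label pairs are the same as in the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'NAME': ['1990 German', 'german 1998',
                            'country Mexican', 'mexican city 2006']})

conditions = [
    df['NAME'].str.contains('german', case=False),
    df['NAME'].str.contains('mexican', case=False),
]
choices = ['Euro', 'South American']

# Rows matching no keyword fall back to the default value
df['Identifier'] = np.select(conditions, choices, default='')
print(df)
```

np.select evaluates the conditions in order, so if a row could match both keywords, the first one wins.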

Compare values between 2 dataframes and transform data

The main aim of this script is to compare the format of the ZIP codes present in the csv against the official ZIP-code regex for each country; where the format does not match, the script should transform the data and output everything in one final dataframe.
I have 2 csv files, one (countries.csv) containing the following columns & data examples:
INPUT:
Contact ID  Country  Zip Code
1           USA      71293
2           Italy    IT 2310219
and another csv (Regex.csv) with the following data examples:
Country  Regex format
USA      [0-9]{5}(?:-[0-9]{4})?
Italy    \d{5}
Now, the first csv has some 35k records so I would like to create a function which loops through the regex.csv (Dataframe) to grab the country column and also the regex format. Then it would loop through the country list to grab every instance where regex['country'] == countries['country'] and it would apply the regex transformation to the ZIP Codes for that country.
So far I have this function but I can't get it to work.
def REGI(dframe):
    dframe = pd.DataFrame().reindex_like(contacts)
    cols = list(contacts.columns)
    for index, row in mergeOne.iterrows():
        country = row['Country']
        reg = row[r'regex']
        for i, r in contactsS.iterrows():
            if (r['Country of Residence'] == country
                    or r['Country of Residence.1'] == country
                    or r['Mailing Country (text only)'] == country
                    or r['Other Country (text only)'] == country):
                dframe.loc[i] = r
        dframe['Mailing Zip/Postal Code'] = (
            dframe['Mailing Zip/Postal Code'].apply(str)
            .str.extractall(reg).unstack()
            .apply(lambda x: ','.join(x.dropna()), axis=1)
        )
    contacts.loc[contacts['Contact ID'].isin(dframe['Contact ID']), cols] = dframe[cols]
    dframe = dframe.dropna(how='all')
    return dframe
['Contact ID'] is being used as an identifier column.
The second for loop works on its own however I would need to manually re-type a new dataframe name, regex format and country name (without the first for loop).
At the moment I am getting the following error:
ValueError: pattern contains no capture groups
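For reference, this ValueError is raised whenever str.extractall is given a pattern with no capture group, which happens here for regexes like Italy's \d{5}; wrapping the whole pattern in parentheses fixes it. A minimal reproduction:

```python
import pandas as pd

s = pd.Series(["IT 2310219", "71293"])

# str.extractall requires at least one capture group in the pattern
try:
    s.str.extractall(r"\d{5}")          # no group -> raises ValueError
except ValueError as e:
    print(e)                            # pattern contains no capture groups

# Wrapping the pattern in parentheses adds a capture group:
print(s.str.extractall(r"(\d{5})"))
```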
Example as text
Account ID  Country         Zip/Postal Code
1           United Kingdom  WV9 5BT
2           Ireland         D24 EO29
3           Latvia          1009
4           United Kingdom  EN6 1JE
5           Italy           22010
REGEX table
Country         Regex
United Kingdom  ([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})
Latvia          [L]{1}[V]{1}-{4}
Ireland         STRNG_LTN_EXT_255
Italy           \d{5}
Based on your response to my comment, I would suggest to directly fix the zip code using your regexes:
df3 = df2.set_index('Country')
df1['corrected_Zip'] = (df1.groupby('Country')['Zip Code']
                           .apply(lambda x: x.str.extract(
                               '(%s)' % df3.loc[x.name, 'Regex format']))
                       )
df1
This groups by country, applies that country's regex, and extracts the matching value.
output:
   Contact ID Country    Zip Code corrected_Zip
0           1     USA       71293         71293
1           2   Italy  IT 2310219         23102
NB. if you want you can directly overwrite Zip Code by doing df1['Zip Code'] = …
NB2. This will work only if all countries have an entry in df2; if that is not the case, you need to add a check for it (let me know)
NB3. if you want to know which rows had an invalid zip, you can fetch them using:
df1[df1['Zip Code']!=df1['corrected_Zip']]
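A self-contained sketch of the same idea as a row-wise variant built on a merge, so countries missing from the regex table are simply skipped (data reuses the examples above; 'Narnia' is a made-up country with no regex entry):

```python
import re

import pandas as pd

df1 = pd.DataFrame({
    'Contact ID': [1, 2, 3],
    'Country': ['USA', 'Italy', 'Narnia'],      # 'Narnia' has no regex entry
    'Zip Code': ['71293', 'IT 2310219', '99999'],
})
df2 = pd.DataFrame({
    'Country': ['USA', 'Italy'],
    'Regex format': [r'[0-9]{5}(?:-[0-9]{4})?', r'\d{5}'],
})

# Left merge keeps every contact; unmatched countries get NaN for the regex
merged = df1.merge(df2, on='Country', how='left')

def fix_zip(row):
    pattern = row['Regex format']
    if pd.isna(pattern):                        # no regex known for this country
        return pd.NA
    m = re.search(pattern, str(row['Zip Code']))
    return m.group(0) if m else pd.NA

df1['corrected_Zip'] = merged.apply(fix_zip, axis=1)
print(df1)
```

re.search with m.group(0) keeps the first substring that matches the country's pattern, mirroring what str.extract does with a single capture group.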

How can I get the max value from groupby object with multiple values?

Sorry if the question is confusing; I was not sure how to word it. Please let me know if this is a duplicate question.
I have a groupby object that looks like this:
us.groupby(['category_id', 'title']).sum()[['views']]
category_id      title                                       views
Autos & Vehicle  1980 toyota corolla liftback commercial     13061
                 1992 Chevy Lumina Euro commercial        18470406
                 2019 Chevrolet Silverado First Look         13061
Music            Backyard Boys                                 133
                 Eminem - Song                                1223
                 Cardi B - Wap                             1111122
Travel & Events  Welcome to Winter PUNderland               437576
                 What Spring Looks Like Around The World  17554672
And I want to get only max value for each category, such as:
category_id title views
Autos & Vehicle 1992 Chevy Lumina Euro commercial 18470406
Music Cardi B - Wap 1111122
Travel & Events What Spring Looks Like Around The World 17554672
How can I do this?
I tried the .first() method, and also something like us.groupby(['category_id', 'title']).sum()[['views']].sort_values(by='views', ascending=False)[:1], but that only gives the first row of the entire dataframe. Is there a function I can use to keep only the max value per group?
Thank you!
You can try:
us_group = us.groupby(['category_id', 'title']).sum()[['views']]
(us_group.reset_index()
         .sort_values('views')
         .drop_duplicates('category_id', keep='last'))
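An equivalent approach, if the sort-then-dedupe feels indirect, is idxmax per group on the summed frame — a sketch with made-up sample data:

```python
import pandas as pd

us = pd.DataFrame({
    'category_id': ['Autos', 'Autos', 'Music', 'Music'],
    'title':       ['a', 'b', 'c', 'd'],
    'views':       [100, 500, 30, 70],
})

summed = us.groupby(['category_id', 'title'], as_index=False)['views'].sum()

# idxmax gives the row label of the maximum views within each category
top = summed.loc[summed.groupby('category_id')['views'].idxmax()]
print(top)
```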

pandas operations inside a for-loop

Here is a sample of my data
threats =
                                            binomial_name
continent      threat_type
Africa         Agriculture & Aquaculture              143
               Biological Resource Use                102
               Climate Change                           3
               Commercial Development                  36
               Energy Production & Mining              30
...            ...                                    ...
South America  Human Intrusions                         1
               Invasive Species                         3
               Natural System Modifications             1
               Transportation Corridor                  2
               Unknown                                 38
I want to use a for loop and obtain and append together the top 5 values of each continent into a data frame.
Here is my code -
continents = threats.continent.unique()
for i in continents:
    continen = (threats
                .query('continent == i')
                .groupby(['continent', 'threat_type'])
                .sort_values(by='binomial_name', ascending=False)
                .head())
    top5 = appended_data.append(continen)
I am however getting the error - KeyError: 'i'
Where am I going wrong?
So, the canonical way to do this:
df.groupby('continent', as_index=False).apply(
    lambda grp: grp.nlargest(5, 'binomial_name'))
If you want to do this in a loop, replace this part:
for i in continents:
    continen = threats[threats['continent'] == i].nlargest(5, 'binomial_name')
    appended_data.append(continen)
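For reference, the KeyError comes from query('continent == i'): inside query, a bare i is looked up as a column name, so the loop variable must be prefixed with @. A minimal runnable sketch of the top-N-per-group loop (sample data made up, top 2 per continent for brevity):

```python
import pandas as pd

threats = pd.DataFrame({
    'continent': ['Africa', 'Africa', 'Africa',
                  'South America', 'South America'],
    'threat_type': ['Agriculture', 'Climate Change', 'Mining',
                    'Invasive Species', 'Unknown'],
    'binomial_name': [143, 3, 30, 3, 38],
})

# @i tells query() to use the Python variable, not a column named 'i'
parts = []
for i in threats['continent'].unique():
    parts.append(threats.query('continent == @i').nlargest(2, 'binomial_name'))
top = pd.concat(parts)
print(top)
```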

Rank country by crop using pandas DataFrame

My DataFrame looks like this:
,Area,Item,Year,Unit,Value
524473,Ecuador,Sesame,2018,tonnes,16.0
524602,Ecuador,Sorghum,2018,tonnes,14988.0
524776,Ecuador,Soybeans,2018,tonnes,25504.0
524907,Ecuador,Spices nes,2018,tonnes,746.0
525021,Ecuador,Strawberries,2018,tonnes,1450.0
525195,Ecuador,Sugar beet,2018,tonnes,4636.0
525369,Ecuador,Sugar cane,2018,tonnes,7502251.0
...
1075710,Mexico,Tomatoes,2018,tonnes,4559375.0
1075865,Mexico,Triticale,2018,tonnes,25403.0
1076039,Mexico,Vanilla,2018,tonnes,495.0
1076213,Mexico,"Vegetables, fresh nes",2018,tonnes,901706.0
1076315,Mexico,"Vegetables, leguminous nes",2018,tonnes,75232.0
1076469,Mexico,Vetches,2018,tonnes,93966.0
1076643,Mexico,"Walnuts, with shell",2018,tonnes,159535.0
1076817,Mexico,Watermelons,2018,tonnes,1472459.0
1076991,Mexico,Wheat,2018,tonnes,2943445.0
1077134,Mexico,Yautia (cocoyam),2018,tonnes,38330.0
1077308,Mexico,Cereals (Rice Milled Eqv),2018,tonnes,35974485.0
In DataFrame there are all countries of the world and all agriculture products.
Here is what I want to do:
Choose a country, for example France.
Find France's place in the world ranking for the production of a particular crop.
And so on for all crops.
France ranks 1 in the world in oats production.
France ranks 2 in the world in cucumber production.
France ranks 2 in the world in rye production.
France ranks .... and so on on each product if France produces it.
I started with
df = df.loc[df.groupby('Item')['Value'].idxmax()]
but I need not only first place but also second, third, fourth, and so on. Help me please.
I am very new to pandas.
You can assign a rank column:
df['rank'] = df.groupby('Item')['Value'].rank(ascending=False)
and then extract information for a country with:
df[df['Area']=='France']
Check with rank:
s = df.groupby('Item')['Value'].rank(ascending=False)
Then:
d = {x: y for x, y in df.groupby(s)}
d[1]  # outputs the rank-one rows
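Putting the rank-column answer together into the requested sentences — a sketch with made-up sample values (the real data would come from the FAOSTAT-style frame above):

```python
import pandas as pd

df = pd.DataFrame({
    'Area':  ['France', 'Germany', 'France', 'Germany', 'Spain'],
    'Item':  ['Oats', 'Oats', 'Rye', 'Rye', 'Rye'],
    'Value': [500.0, 300.0, 200.0, 400.0, 100.0],
})

# Rank each country within each crop, highest production first
df['rank'] = df.groupby('Item')['Value'].rank(ascending=False).astype(int)

for _, row in df[df['Area'] == 'France'].iterrows():
    print(f"France ranks {row['rank']} in the world in {row['Item']} production.")
```

Note that rank() yields fractional ranks on ties; method='min' or method='dense' can be passed to control that.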

Extract Strings from Dataframe

from pandas import DataFrame,Series
import pandas as pd
df
   text                                               region
0  The Five College Region                            The Five College Region
1  South Hadley (Mount Holyoke College)               South Hadley
2  Waltham (Bentley University), (Brandeis Univer..) Waltham
The region should be extracted from text:
If the row contains "(", remove everything from "(" onward, then strip the trailing whitespace.
If the row doesn't contain "(", copy it to region unchanged.
I know I can handle this with the str.extract function, but I'm having trouble writing the right regex pattern.
df['Region'] =df['text'].str.extract(r'(.+)\(.*')
This regex pattern cannot extract the first row, which contains no parenthesis.
I also know that the split function works for this problem:
str.split('(')[0]
But I don't know how to put the result in a column.
Hope to receive answers covering both methods.
option 1
assign + str.split
df.text.str.split(r'\s*\(').str[0]
0 The Five College Region
1 South Hadley
2 Waltham
Name: text, dtype: object
df.assign(region=df.text.str.split(r'\s*\(').str[0])
text region
0 The Five College Region The Five College Region
1 South Hadley (Mount Holyoke College) South Hadley
2 Waltham (Bentley University), (Brandeis Univer..) Waltham
option 2
join + str.extract
df.text.str.extract(r'(?P<region>[^\(]+)\s*\(*', expand=False)
0 The Five College Region
1 South Hadley
2 Waltham
Name: text, dtype: object
df.join(df.text.str.extract(r'(?P<region>[^\(]+)\s*\(*', expand=False))
text region
0 The Five College Region The Five College Region
1 South Hadley (Mount Holyoke College) South Hadley
2 Waltham (Bentley University), (Brandeis Univer..) Waltham
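A runnable end-to-end sketch of the split approach, with str.strip added to remove the whitespace left before the "(" (sample rows copied from the question):

```python
import pandas as pd

df = pd.DataFrame({'text': [
    'The Five College Region',
    'South Hadley (Mount Holyoke College)',
    'Waltham (Bentley University), (Brandeis University)',
]})

# Split on the first '(' and strip surrounding whitespace;
# rows without '(' pass through unchanged.
df['region'] = df['text'].str.split('(', n=1).str[0].str.strip()
print(df)
```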
