The main aim of this script is to compare the format of each ZIP code in the CSV against the official ZIP code regex for its country; where the format does not match, the script should transform the data and output everything in one final dataframe.
I have 2 CSV files, one (countries.csv) containing the following columns and data examples:

Contact ID  Country  Zip Code
1           USA      71293
2           Italy    IT 2310219
and another CSV (Regex.csv) with the following data examples:

Country  Regex format
USA      [0-9]{5}(?:-[0-9]{4})?
Italy    \d{5}
Now, the first CSV has some 35k records, so I would like to create a function which loops through the Regex.csv dataframe to grab the country column and the regex format. It would then loop through the country list, grab every instance where regex['Country'] == countries['Country'], and apply the regex transformation to the Zip Codes for that country.
So far I have this function, but I can't get it to work:
def REGI(dframe):
    dframe = pd.DataFrame().reindex_like(contacts)
    cols = list(contacts.columns)
    for index, row in mergeOne.iterrows():
        country = row['Country']
        reg = row['regex']
        for i, r in contactsS.iterrows():
            if (r['Country of Residence'] == country or r['Country of Residence.1'] == country or
                    r['Mailing Country (text only)'] == country or r['Other Country (text only)'] == country):
                dframe.loc[i] = r
        dframe['Mailing Zip/Postal Code'] = dframe['Mailing Zip/Postal Code'].apply(str).str.extractall(reg).unstack().apply(lambda x: ','.join(x.dropna()), axis=1)
        contacts.loc[contacts['Contact ID'].isin(dframe['Contact ID']), cols] = dframe[cols]
    dframe = dframe.dropna(how='all')
    return dframe
['Contact ID'] is being used as an identifier column.
The second for loop works on its own; however, without the first for loop I would need to manually re-type a new dataframe name, regex format, and country name each time.
At the moment I am getting the following error:

ValueError: pattern contains no capture groups
(I removed some columns to mimic the example given above.)
If I paste the results into a new dataframe, it returns the following:
Example as text:

Account ID  Country         Zip/Postal Code
1           United Kingdom  WV9 5BT
2           Ireland         D24 EO29
3           Latvia          1009
4           United Kingdom  EN6 1JE
5           Italy           22010
REGEX table:

Country         Regex
United Kingdom  (full regex below)
Latvia          [L]{1}[V]{1}-{4}
Ireland         STRNG_LTN_EXT_255
Italy           \d{5}
United Kingdom regex:
([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})
Based on your response to my comment, I would suggest directly fixing the zip codes using your regexes:
df3 = df2.set_index('Country')
df1['corrected_Zip'] = (df1.groupby('Country')
                           ['Zip Code']
                           .apply(lambda x: x.str.extract('(%s)' % df3.loc[x.name, 'Regex format']))
                        )
df1
This groups by country, applies the regex for that country, and extracts the value. Note that wrapping the pattern in '(%s)' also adds the capture group that str.extract requires, which is what your original "pattern contains no capture groups" ValueError was complaining about.
output:
  Contact ID Country    Zip Code corrected_Zip
0          1     USA       71293         71293
1          2   Italy  IT 2310219         23102
NB: if you want, you can directly overwrite Zip Code by doing df1['Zip Code'] = …
NB2: this will only work if all countries have an entry in df2; if that is not the case, you need to add a check for it (let me know).
NB3: if you want to know which rows had an invalid zip, you can fetch them using:
df1[df1['Zip Code']!=df1['corrected_Zip']]
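Regarding NB2, a minimal sketch of such a check (the fix_zip helper and the group_keys=False argument are my additions, not part of the answer above):

def fix_zip(group):
    # group.name is the country of this group
    if group.name not in df3.index:
        return group  # no regex known for this country: leave values as-is
    pattern = '(%s)' % df3.loc[group.name, 'Regex format']
    return group.str.extract(pattern, expand=False)

df1['corrected_Zip'] = (df1.groupby('Country', group_keys=False)
                           ['Zip Code']
                           .apply(fix_zip))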
Sorry if the question is confusing, I was not sure how to word it. Please let me know if this is a duplicate question.
I have a groupby object that looks like this:
us.groupby(['category_id', 'title']).sum()[['views']]
category_id      title                                     views
Autos & Vehicle  1980 toyota corolla liftback commercial   13061
                 1992 Chevy Lumina Euro commercial         18470406
                 2019 Chevrolet Silverado First Look       13061
Music            Backyard Boys                             133
                 Eminem - Song                             1223
                 Cardi B - Wap                             1111122
Travel & Events  Welcome to Winter PUNderland              437576
                 What Spring Looks Like Around The World   17554672
And I want to get only the max value for each category, such as:
category_id      title                                     views
Autos & Vehicle  1992 Chevy Lumina Euro commercial         18470406
Music            Cardi B - Wap                             1111122
Travel & Events  What Spring Looks Like Around The World   17554672
How can I do this?
I tried the .first() method, and also something like us.groupby(['category_id', 'title']).sum()[['views']].sort_values(by='views', ascending=False)[:1], but that only gives the first row of the entire dataframe. Is there a function I can use to keep only the max value of each group?
Thank you!
You can try:
us_group = us.groupby(['category_id', 'title']).sum()[['views']]
(us_group.reset_index().sort_values(['views'])
         .drop_duplicates('category_id', keep='last')
)
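This sorts the rows by views (ascending), so for each category_id its last occurrence is the row with the most views; drop_duplicates(..., keep='last') then keeps exactly that row. An equivalent sketch with idxmax (my own illustration, assuming views contains no NaNs):

# take, per category, the row whose views is largest
tmp = us_group.reset_index()
tmp.loc[tmp.groupby('category_id')['views'].idxmax()]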
Here is a sample of my data:
threats =
                                            binomial_name
continent      threat_type
Africa         Agriculture & Aquaculture              143
               Biological Resource Use                102
               Climate Change                           3
               Commercial Development                  36
               Energy Production & Mining              30
...            ...                                    ...
South America  Human Intrusions                         1
               Invasive Species                         3
               Natural System Modifications             1
               Transportation Corridor                  2
               Unknown                                 38
I want to use a for loop to obtain the top 5 values for each continent and append them together into one data frame.
Here is my code:
continents = threats.continent.unique()

for i in continents:
    continen = (threats
                .query('continent == i')
                .groupby(['continent','threat_type'])
                .sort_values(by=('binomial_name'), ascending=False)
                .head())
    top5 = appended_data.append(continen)
However, I am getting this error: KeyError: 'i'
Where am I going wrong?
So, the canonical way to do this:
df.groupby('continent', as_index=False).apply(
    lambda grp: grp.nlargest(5, 'binomial_name'))
If you want to do this in a loop, replace this part:
for i in continents:
    continen = threats[threats['continent'] == i].nlargest(5, 'binomial_name')
    appended_data.append(continen)
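Put together end to end, the loop might look like this (a sketch that assumes threats is a flat frame with continent and binomial_name columns; collecting frames in a list and concatenating once is the idiomatic replacement for repeatedly calling append):

import pandas as pd

appended_data = []  # collect the per-continent top-5 frames here
for i in threats['continent'].unique():
    # boolean indexing sidesteps the KeyError: inside .query() a bare i
    # is looked up as a column name, so you would need '@i' there instead
    appended_data.append(threats[threats['continent'] == i]
                               .nlargest(5, 'binomial_name'))

top5 = pd.concat(appended_data, ignore_index=True)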
My DataFrame looks like this:
,Area,Item,Year,Unit,Value
524473,Ecuador,Sesame,2018,tonnes,16.0
524602,Ecuador,Sorghum,2018,tonnes,14988.0
524776,Ecuador,Soybeans,2018,tonnes,25504.0
524907,Ecuador,Spices nes,2018,tonnes,746.0
525021,Ecuador,Strawberries,2018,tonnes,1450.0
525195,Ecuador,Sugar beet,2018,tonnes,4636.0
525369,Ecuador,Sugar cane,2018,tonnes,7502251.0
...
1075710,Mexico,Tomatoes,2018,tonnes,4559375.0
1075865,Mexico,Triticale,2018,tonnes,25403.0
1076039,Mexico,Vanilla,2018,tonnes,495.0
1076213,Mexico,"Vegetables, fresh nes",2018,tonnes,901706.0
1076315,Mexico,"Vegetables, leguminous nes",2018,tonnes,75232.0
1076469,Mexico,Vetches,2018,tonnes,93966.0
1076643,Mexico,"Walnuts, with shell",2018,tonnes,159535.0
1076817,Mexico,Watermelons,2018,tonnes,1472459.0
1076991,Mexico,Wheat,2018,tonnes,2943445.0
1077134,Mexico,Yautia (cocoyam),2018,tonnes,38330.0
1077308,Mexico,Cereals (Rice Milled Eqv),2018,tonnes,35974485.0
The DataFrame contains all the countries of the world and all agricultural products.
This is what I want to do:
Choose a country, for example France.
Find France's place in the world ranking for the production of a particular crop.
And so on for all crops.
France ranks 1 in the world in oats production.
France ranks 2 in the world in cucumber production.
France ranks 2 in the world in rye production.
France ranks .... and so on on each product if France produces it.
I started with

df = df.loc[df.groupby('Item')['Value'].idxmax()]

but I need not only first place, but also the second, third, fourth.... Please help me; I am very new to pandas.
You can assign a rank column:
df['rank'] = df.groupby('Item')['Value'].rank(ascending=False)
and then extract information for a country with:
df[df['Area']=='France']
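If you then want the literal "France ranks …" sentences from the question, a small sketch on top of the rank column (the loop below is my own illustration):

france = df[df['Area'] == 'France'].sort_values('rank')
for _, row in france.iterrows():
    # note: with ties you may want rank(method='min') to get integer ranks
    print(f"France ranks {int(row['rank'])} in the world in {row['Item']} production.")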
Check with rank:

s = df.groupby('Item')['Value'].rank(ascending=False)

Then:

d = {x: y for x, y in df.groupby(s)}
d[1]  # the rows that rank first
from pandas import DataFrame, Series
import pandas as pd
df
text                                                region
The Five College Region                             The Five College Region
South Hadley (Mount Holyoke College)                South Hadley
Waltham (Bentley University), (Brandeis Univer..)   Waltham
The region should be extracted from text:
If the row contains "(", remove everything after "(" and then trim the trailing white space.
If the row doesn't contain "(", keep it and copy it to region.
I know I can deal with this using the str.extract function, but I'm having trouble writing the right regex pattern:
df['Region'] = df['text'].str.extract(r'(.+)\(.*')
This regex pattern cannot extract the first string (the row without parentheses).
I also acknowledge that using the split function works for this problem:
str.split('(')[0]
But I don't know how to put the result in a column.
Hope to receive answers covering both methods.
option 1
assign + str.split
df.text.str.split(r'\s*\(').str[0]
0 The Five College Region
1 South Hadley
2 Waltham
Name: text, dtype: object
df.assign(region=df.text.str.split(r'\s*\(').str[0])
   text                                                region
0  The Five College Region                             The Five College Region
1  South Hadley (Mount Holyoke College)                South Hadley
2  Waltham (Bentley University), (Brandeis Univer..)   Waltham
option 2
join + str.extract
df.text.str.extract(r'(?P<region>[^\(]+)\s*\(*', expand=False)
0 The Five College Region
1 South Hadley
2 Waltham
Name: text, dtype: object
df.join(df.text.str.extract(r'(?P<region>[^\(]+)\s*\(*', expand=False))
   text                                                region
0  The Five College Region                             The Five College Region
1  South Hadley (Mount Holyoke College)                South Hadley
2  Waltham (Bentley University), (Brandeis Univer..)   Waltham
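For completeness, a third route (my own sketch, not part of either option above): str.replace also leaves rows without "(" untouched:

# strip an optional " (...)" tail; rows with no "(" simply don't match
df['region'] = df.text.str.replace(r'\s*\(.*$', '', regex=True)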