I have a dataframe like this:
Filtered_data
['defence possessed russia china','factors driving china modernise']
['force bolster pentagon','strike capabilities pentagon congress detailing china']
['missiles warheads', 'deterrent face continued advances']
......
......
I just want to split each list element into sub-elements (tokenized words). So the output I'm looking for is:
Filtered_data
[defence, possessed, russia, factors, driving, china, modernise]
[force, bolster, strike, capabilities, pentagon, congress, detailing, china]
[missiles, warheads, deterrent, face, continued, advances]
Here is the code I have tried:
for text in df['Filtered_data'].iteritems():
    for i in text.split():
        print(i)
Use a list comprehension with split and flattening:
df['Filtered_data'] = df['Filtered_data'].apply(lambda x: [z for y in x for z in y.split()])
print (df)
Filtered_data
0 [defence, possessed, russia, china, factors, d...
1 [force, bolster, pentagon, strike, capabilitie...
2 [missiles, warheads, deterrent, face, continue...
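If the nested comprehension is hard to read, here is the same logic unpacked into a plain function (just a readability rewrite of the lambda above, not a different method):
# Equivalent to the lambda: flatten each cell's list of strings into words
def tokenize(sentences):
    words = []
    for sentence in sentences:          # each cell holds a list of strings
        words.extend(sentence.split())  # split every string on whitespace
    return words

df['Filtered_data'] = df['Filtered_data'].apply(tokenize)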
EDIT:
For unique values, the standard way is to use sets:
df['Filtered_data'] = df['Filtered_data'].apply(lambda x: list(set([z for y in x for z in y.split()])))
print (df)
Filtered_data
0 [russia, factors, defence, driving, china, mod...
1 [capabilities, detailing, china, force, pentag...
2 [deterrent, advances, face, warheads, missiles...
But if the ordering of values is important, use pandas.unique:
df['Filtered_data'] = df['Filtered_data'].apply(lambda x: pd.unique([z for y in x for z in y.split()]).tolist())
print (df)
Filtered_data
0 [defence, possessed, russia, china, factors, d...
1 [force, bolster, pentagon, strike, capabilitie...
2 [missiles, warheads, deterrent, face, continue...
You can use itertools.chain + toolz.unique. The benefit of toolz.unique versus set is that it preserves ordering.
import pandas as pd
from itertools import chain
from toolz import unique

df = pd.DataFrame({'strings': [['defence possessed russia china', 'factors driving china modernise'],
                               ['force bolster pentagon', 'strike capabilities pentagon congress detailing china'],
                               ['missiles warheads', 'deterrent face continued advances']]})
df['words'] = df['strings'].apply(lambda x: list(unique(chain.from_iterable(i.split() for i in x))))
print(df.iloc[0]['words'])
['defence', 'possessed', 'russia', 'china', 'factors', 'driving', 'modernise']
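If you'd rather not add the toolz dependency, dict.fromkeys preserves insertion order on Python 3.7+ and can stand in for toolz.unique; a minimal sketch:
from itertools import chain

# dict.fromkeys keeps the first occurrence of each word, in order
df['words'] = df['strings'].apply(
    lambda x: list(dict.fromkeys(chain.from_iterable(i.split() for i in x))))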
I have a pandas dataset like below:
import pandas as pd

data = {'id': ['001', '002', '003'],
        'address': ["William J. Clare\n290 Valley Dr.\nCasper, WY 82604\nUSA, United States",
                    "1180 Shelard Tower\nMinneapolis, MN 55426\nUSA, United States",
                    "William N. Barnard\n145 S. Durbin\nCasper, WY 82601\nUSA, United States"]}
df = pd.DataFrame(data)
print(df)
I need to split the address column on the \n delimiter and create new columns: Name, Address Line 1, City, State, Zipcode, and Country, like below:
id  Name                addressline1        City         State  Zipcode  Country
1   William J. Clare    290 Valley Dr.      Casper       WY     82604    United States
2   null                1180 Shelard Tower  Minneapolis  MN     55426    United States
3   William N. Barnard  145 S. Durbin       Casper       WY     82601    United States
I am learning Python and have been working on this since morning. Any help will be greatly appreciated.
Thanks,
Right now, pandas returns a table with two columns. If you look at the values in the second column, the essential information is separated by commas. Therefore, if you saved your dataframe as df, you can do the following:
df['address_and_city'] = df['address'].apply(lambda x: x.split(',')[0])
df['state_and_postal'] = df['address'].apply(lambda x: x.split(',')[1])
df['country'] = df['address'].apply(lambda x: x.split(',')[2])
Now you have three additional columns in your dataframe; the last one already contains the full country information. From the first two columns you just created, you can extract the info you need in a similar way.
df['address_first_line'] = df['address_and_city'].apply(lambda x: ' '.join(x.split('\n')[:-1]))
df['city'] = df['address_and_city'].apply(lambda x: x.split('\n')[-1])
df['state'] = df['state_and_postal'].apply(lambda x: x.split(' ')[1])
df['postal'] = df['state_and_postal'].apply(lambda x: x.split(' ')[2].split('\n')[0])
Now you should have all the columns you need. You can remove the excess columns with:
df.drop(columns=['address','address_and_city','state_and_postal'], inplace=True)
Of course, it can all be done faster and with fewer lines of code, but I think this is the clearest way of doing it, which I hope you will find useful. If you don't understand what I did there, check the documentation for the split and join string methods, and also for the apply method, native to pandas.
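For what it's worth, here is one way the shorter version might look: a single helper applied once that returns all of the target columns. This is only a sketch; it assumes every address follows the sample's structure (an optional name line, a street line, a "City, ST ZIP" line, and a "USA, Country" line), and parse_address is a name made up for illustration:
def parse_address(addr):
    lines = addr.split('\n')
    city, state_zip = lines[-2].split(',')   # e.g. "Casper, WY 82604"
    state, zipcode = state_zip.split()
    return pd.Series({'Name': lines[0] if len(lines) == 4 else None,
                      'addressline1': lines[-3],
                      'City': city,
                      'State': state,
                      'Zipcode': zipcode,
                      'Country': lines[-1].split(',')[-1].strip()})

df = df.join(df['address'].apply(parse_address))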
I have a dataframe column with typos.
ID  Bankname
1   Bank of America
2   bnk of America
3   Jp Morg
4   Jp Morgan
And I have a list with the right names of the banks.
["Bank of America", "JPMorgan Chase]
I want to check the wrong bank names and replace them with the right names from the list, with the help of Levenshtein distance.
Here is one simple way to do it using the Python standard library's difflib module, which provides helpers for computing deltas.
from difflib import SequenceMatcher

# Define a helper function
def match(x, values, threshold):
    def ratio(a, b):
        return SequenceMatcher(None, a, b).ratio()
    results = {
        value: ratio(value, x) for value in values if ratio(value, x) > threshold
    }
    return max(results, key=results.get) if results else x
And then:
import pandas as pd

df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4],
        "Bankname": ["Bank of America", "bnk of America", "Jp Morg", "Jp Morgan"],
    }
)
names = ["Bank of America", "JPMorgan Chase"]
df["Bankname"] = df["Bankname"].apply(lambda x: match(x, names, 0.4))
So that:
print(df)
# Output
ID Bankname
0 1 Bank of America
1 2 Bank of America
2 3 JPMorgan Chase
3 4 JPMorgan Chase
Of course, you can replace the inner ratio function with any other, more appropriate sequence matcher.
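Since the question mentions Levenshtein distance specifically: SequenceMatcher uses Ratcliff/Obershelp matching, not Levenshtein. If the third-party Levenshtein package is installed (pip install Levenshtein), swapping it in is a one-line change, sketched here:
import Levenshtein

def ratio(a, b):
    # Levenshtein.ratio is a normalized similarity score in [0, 1],
    # so it drops into the match helper unchanged
    return Levenshtein.ratio(a, b)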
My DataFrame looks like this:
,Area,Item,Year,Unit,Value
524473,Ecuador,Sesame,2018,tonnes,16.0
524602,Ecuador,Sorghum,2018,tonnes,14988.0
524776,Ecuador,Soybeans,2018,tonnes,25504.0
524907,Ecuador,Spices nes,2018,tonnes,746.0
525021,Ecuador,Strawberries,2018,tonnes,1450.0
525195,Ecuador,Sugar beet,2018,tonnes,4636.0
525369,Ecuador,Sugar cane,2018,tonnes,7502251.0
...
1075710,Mexico,Tomatoes,2018,tonnes,4559375.0
1075865,Mexico,Triticale,2018,tonnes,25403.0
1076039,Mexico,Vanilla,2018,tonnes,495.0
1076213,Mexico,"Vegetables, fresh nes",2018,tonnes,901706.0
1076315,Mexico,"Vegetables, leguminous nes",2018,tonnes,75232.0
1076469,Mexico,Vetches,2018,tonnes,93966.0
1076643,Mexico,"Walnuts, with shell",2018,tonnes,159535.0
1076817,Mexico,Watermelons,2018,tonnes,1472459.0
1076991,Mexico,Wheat,2018,tonnes,2943445.0
1077134,Mexico,Yautia (cocoyam),2018,tonnes,38330.0
1077308,Mexico,Cereals (Rice Milled Eqv),2018,tonnes,35974485.0
The DataFrame contains all the countries of the world and all agricultural products.
Here's what I want to do:
Choose a country, for example France.
Find France's place in the world ranking for the production of a particular crop.
And so on for all crops.
France ranks 1 in the world in oats production.
France ranks 2 in the world in cucumber production.
France ranks 2 in the world in rye production.
France ranks .... and so on on each product if France produces it.
I started with:
df = df.loc[df.groupby('Item')['Value'].idxmax()]
but I need not only first place, but also second, third, fourth.... Help me please.
I am very new to pandas.
You can assign a rank column:
df['rank'] = df.groupby('Item')['Value'].rank(ascending=False)
and then extract information for a country with:
df[df['Area']=='France']
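Putting the two steps together, a small sketch that prints the ranking sentences from the question (column names as above; method='min' is an assumption so that ties come out as whole numbers):
df['rank'] = df.groupby('Item')['Value'].rank(ascending=False, method='min')
france = df[df['Area'] == 'France'].sort_values('rank')
for _, row in france.iterrows():
    print(f"France ranks {int(row['rank'])} in the world in {row['Item']} production.")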
Check with rank:
s = df.groupby('Item')['Value'].rank(ascending=False)
Then:
d = {x: y for x, y in df.groupby(s)}
d[1]  # outputs the rank-one rows
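One caveat with this dict-of-groups approach: the keys are the float ranks produced by rank (1.0, 2.0, ...), and d[1] only works because 1 == 1.0 in Python. For example:
print(d[1][['Area', 'Item', 'Value']])  # the top-ranked row for every Item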
I have a dataframe with a column called regional_codes. Now I need to add a new column to the dataframe where the regional codes are replaced by the list of countries attributed to that region.
For example, if regional_codes contains ['asia'], then I need my new column to hold the list of Asian countries, like ['china','japan','india','bangladesh'...]
Currently, I have created a separate list for each region and I use something like this:
asia_list= ['asia','china','japan','india'...]
output_list = []
output_list+= [asia_list for w in regional_codes if w in asia_list]
output_list+= [africa_list for w in regional_codes if w in africa_list]
and so on until all the regional lists are exhausted.
The code above gives exactly the results I need, and it is efficient in terms of running time as well. However, I feel like I am doing this in a very roundabout way, so I am looking for suggestions that can help me shorten my code.
One way I found to do this is to create a dataframe holding the association between your regional codes and the regional lists:
import pandas as pd
import itertools
import numpy as np
# DF is your dataframe
# df is the dataframe containing the association between regional codes and regional lists
df = pd.DataFrame({'regional_code': ['asia', 'africa', 'europe'],
                   'regional_list': [['China', 'Japan'], ['Morocco', 'Nigeria', 'Ghana'], ['France', 'UK', 'Germany', 'Spain']]})
#   regional_code                regional_list
# 0          asia               [China, Japan]
# 1        africa    [Morocco, Nigeria, Ghana]
# 2        europe  [France, UK, Germany, Spain]
df2 = pd.DataFrame({'regional_code': [['asia', 'africa'], ['africa', 'europe']], 'regional_list': [1, 2]})
#       regional_code  regional_list
# 0    [asia, africa]              1
# 1  [africa, europe]              2
df2['list'] = df2.apply(lambda x: list(itertools.chain.from_iterable(
    df.loc[df['regional_code'] == i, 'regional_list'] for i in x.loc['regional_code'])), axis=1)
#       regional_code  regional_list                                         list
# 0    [asia, africa]              1  [[China, Japan], [Morocco, Nigeria, Ghana]]
# 1  [africa, europe]              2  [[Morocco, Nigeria, Ghana], [France, UK, Germa...
Now we flatten df2['list']:
df2['list'] = df2['list'].apply(np.concatenate)
#       regional_code  regional_list                                     list
# 0    [asia, africa]              1  [China, Japan, Morocco, Nigeria, Ghana]
# 1  [africa, europe]              2  [Morocco, Nigeria, Ghana, France, UK, Germany,...
I hope this answers your question.
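As a simpler alternative (a sketch that assumes the associations fit in a plain dict rather than a lookup dataframe), the same flattening can be written as a dict lookup inside one comprehension:
region_map = {'asia': ['China', 'Japan'],
              'africa': ['Morocco', 'Nigeria', 'Ghana'],
              'europe': ['France', 'UK', 'Germany', 'Spain']}

df2['list'] = df2['regional_code'].apply(
    lambda codes: [country for code in codes for country in region_map.get(code, [])])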
TL;DR: How can I improve my code and make it more Pythonic?
Hi,
One of the interesting challenges we were given in a tutorial was the following:
"There are X missing entries in the data frame with an associated code but a 'blank' entry next to the code. This is a random occurrence across the data frame. Using your knowledge of pandas, map each missing 'blank' entry to the associated code."
So this looks like the following:
code  name
001   Australia
002   London
...
001   <blank>
The approach I have used is as follows:
Loop through the entire dataframe and identify blank ("") entries, then replace each blank by copying in the correct name for the associated code.
code_names = ["",
              'Economic management',
              'Public sector governance',
              'Rule of law',
              'Financial and private sector development',
              'Trade and integration',
              'Social protection and risk management',
              'Social dev/gender/inclusion',
              'Human development',
              'Urban development',
              'Rural development',
              'Environment and natural resources management']
df_copy = df_.copy()
# Looks through each code name, and if it is empty, stores the proper name in its place
for x in range(len(df_copy.mjtheme_namecode)):
    for y in range(len(df_copy.mjtheme_namecode[x])):
        if df_copy.mjtheme_namecode[x][y]['name'] == "":
            df_copy.mjtheme_namecode[x][y]['name'] = code_names[int(df_copy.mjtheme_namecode[x][y]['code'])]
limit = 25
counter = 0
for x in range(len(df_copy.mjtheme_namecode)):
    for y in range(len(df_copy.mjtheme_namecode[x])):
        print(df_copy.mjtheme_namecode[x][y])
        counter += 1
        if counter >= limit:
            break
While the above approach works, is there a better, more Pythonic way of achieving what I'm after? I feel the approach I have used is very clunky, as my skills are not very well developed yet.
Thank you!
Method 1:
One way to do this would be to replace all your "" blanks with NaN, sort the dataframe by code and name, and use fillna(method='ffill'):
Starting with this:
>>> df
code name
0 1 Australia
1 2 London
2 1
You can apply the following:
import numpy as np

new_df = (df.replace({'name': {'': np.nan}})
            .sort_values(['code', 'name'])
            .fillna(method='ffill')
            .sort_index())
>>> new_df
code name
0 1 Australia
1 2 London
2 1 Australia
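To see why the forward fill lands on the right rows, here is the intermediate state after the replace and sort steps: within each code the NaN sorts last, so ffill copies the name from a row with the same code:
>>> df.replace({'name': {'': np.nan}}).sort_values(['code', 'name'])
   code       name
0     1  Australia
2     1        NaN
1     2     London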
Method 2:
This is more convoluted, but will work as well:
Using groupby, first, and squeeze, you can create a pd.Series mapping the codes to non-blank names, and use .map to map that series onto your code column:
df['name'] = (df['code']
              .map(df.replace({'name': {'': np.nan}})
                     .sort_values(['code', 'name'])
                     .groupby('code')
                     .first()
                     .squeeze()))
>>> df
code name
0 1 Australia
1 2 London
2 1 Australia
Explanation: The pd.Series map that this creates looks like this:
code
1 Australia
2 London
And it works because it gets the first instance of every code (via the groupby), sorted in such a manner that the NaNs are last. So as long as each code is associated with at least one name, this method will work.
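A third variant along the same lines (an equivalent sketch, starting again from the original df): build the code-to-name mapping explicitly by dropping the blank rows, then map it onto the code column:
# Keep only rows with a real name, one per code, and index by code
mapping = (df.loc[df['name'] != '']
             .drop_duplicates('code')
             .set_index('code')['name'])
df['name'] = df['code'].map(mapping)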