I am working on an assignment for the Coursera Introduction to Data Science course. I have a dataframe with 'Country' as the index and 'Rank' as one of the columns. When I try to reduce the dataframe to include only the rows for countries ranked 1-15, the following works but excludes Iran, which is ranked 13.
df.set_index('Country', inplace=True)
df.loc['Iran', 'Rank'] = 13  # I did this in case there was some sort of corruption in the original data
df_top15 = df.where(df.Rank < 16).dropna().copy()
return df_top15
When I try
df_top15 = df.where(df.Rank == 12).dropna().copy()
I get the row for Spain.
But when I try
df_top15 = df.where(df.Rank == 13).dropna().copy()
I just get the column headers, no row for Iran.
I also tried
df.Rank == 13
and got a series with False for all countries but Iran, which was True.
Any idea what could be causing this?
Your code works fine:
df = pd.DataFrame([['Italy', 5],
                   ['Iran', 13],
                   ['Timbuktu', 20]],
                  columns=['Country', 'Rank'])
res = df.where(df.Rank < 16).dropna()
print(res)
Country Rank
0 Italy 5.0
1 Iran 13.0
However, I dislike this method because the masking first converts some values to NaN, so the dtype of your Rank series becomes float.
A better idea, in my opinion, is to use query or loc. Using either method obviates the need for dropna:
res = df.query('Rank < 16')
res = df.loc[df['Rank'] < 16]
print(res)
Country Rank
0 Italy 5
1 Iran 13
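To see the dtype difference in practice, here is a minimal check (using the small example frame above; nothing else assumed):
print(df.where(df.Rank < 16).dropna().dtypes)  # Rank is cast to float64
print(df.query('Rank < 16').dtypes)            # Rank stays int64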
Related
So, I've got two dataframes, one with 54k rows and 1 column and another with 139k rows and 3 columns. I need to check whether the values of a column from the first dataframe lie between the values of two columns in the second dataframe, and if they do, replace that particular value in the first dataframe with the corresponding string value from the second dataframe.
I tried doing it with simple for loops and if/else statements, but the number of iterations is huge and my cell takes forever to run. I've attached some snippets below; if there is a better way to rewrite that particular part of the code, it would be a great help. Thanks in advance.
First DataFrame:
ip_address_to_clean
IP_Address_clean
0 815237196
1 1577685417
2 979279225
3 3250268602
4 2103448748
... ...
54208 4145673247
54209 1344187002
54210 3156712153
54211 1947493810
54212 2872038579
54213 rows × 1 columns
Second DataFrame:
ip_boundaries_file
country lower_bound_ip_address_clean upper_bound_ip_address_clean
0 Australia 16777216 16777471
1 China 16777472 16777727
2 China 16777728 16778239
3 Australia 16778240 16779263
4 China 16779264 16781311
... ... ... ...
138841 Hong Kong 3758092288 3758093311
138842 India 3758093312 3758094335
138843 China 3758095360 3758095871
138844 Singapore 3758095872 3758096127
138845 Australia 3758096128 3758096383
138846 rows × 3 columns
Code I've written :
ip_address_to_clean_copy = ip_address_to_clean.copy()
o_ip = ip_address_to_clean['IP_Address_clean'].values
l_b = ip_boundaries_file['lower_bound_ip_address_clean'].values
for i in range(len(o_ip)):
    for j in range(len(l_b)):
        if (ip_address_to_clean['IP_Address_clean'][i] > ip_boundaries_file['lower_bound_ip_address_clean'][j]) and (ip_address_to_clean['IP_Address_clean'][i] < ip_boundaries_file['upper_bound_ip_address_clean'][j]):
            ip_address_to_clean_copy['IP_Address_clean'][i] = ip_boundaries_file['country'][j]
            #print(ip_address_to_clean_copy['IP_Address_clean'][i])
            #print(i)
This works (I tested it on small tables).
replacement1 = [None]*3758096384
replacement2 = []
for _, row in ip_boundaries_file.iterrows():
    a, b, c = row['lower_bound_ip_address_clean'], row['upper_bound_ip_address_clean'], row['country']
    replacement1[a+1:b] = [len(replacement2)]*(b-a-1)
    replacement2.append(c)
ip_address_to_clean_copy['IP_Address_clean'] = ip_address_to_clean_copy['IP_Address_clean'].apply(lambda x: replacement2[replacement1[x]] if (x < len(replacement1) and replacement1[x] is not None) else x)
I tweaked the lambda function to keep the original ip if it's not in the replacement table.
Notes:
Compared to my comment, I added the replacement2 table to hold the actual strings, and put the indexes in replacement1 to make it more memory efficient.
This is based on one of the methods to sort a list in O(n) when you know the contained values are bounded.
Example:
Inputs:
ip_address_to_clean = pd.DataFrame([10,33,2,179,2345,123], columns = ['IP_Address_clean'])
ip_boundaries_file = pd.DataFrame([['China', 1, 12],
                                   ['Australia', 20, 40],
                                   ['China', 2000, 3000],
                                   ['France', 100, 150]],
                                  columns=['country', 'lower_bound_ip_address_clean',
                                           'upper_bound_ip_address_clean'])
Output:
ip_address_to_clean_copy
# Out[13]:
# IP_Address_clean
# 0 China
# 1 Australia
# 2 China
# 3 179
# 4 China
# 5 France
As I mentioned in another comment, here's another script that performs a binary (dichotomy) search on the 2nd DataFrame; it runs in O(n log(p)), which is slower than the script above, but consumes far less memory!
def replace(n, df):
    # assumes df is sorted by 'lower_bound_ip_address_clean'
    if len(df) == 0:
        return n
    i = len(df)//2
    if df.iloc[i]['lower_bound_ip_address_clean'] < n < df.iloc[i]['upper_bound_ip_address_clean']:
        return df.iloc[i]['country']
    elif len(df) == 1:
        return n
    else:
        if n <= df.iloc[i]['lower_bound_ip_address_clean']:
            return replace(n, df.iloc[:i])
        else:
            return replace(n, df.iloc[i+1:])

ip_address_to_clean_copy['IP_Address_clean'] = ip_address_to_clean['IP_Address_clean'].apply(lambda x: replace(x, ip_boundaries_file))
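For reference (a sketch of an alternative, not the script above), pandas can also do this kind of interval lookup in a vectorized way with pd.merge_asof: sort both frames, match each IP to the closest lower bound below it, then keep the match only when the IP also falls under the upper bound. This assumes the column names shown in the question:
import pandas as pd

ips = ip_address_to_clean.sort_values('IP_Address_clean')
bounds = ip_boundaries_file.sort_values('lower_bound_ip_address_clean')
merged = pd.merge_asof(ips, bounds,
                       left_on='IP_Address_clean',
                       right_on='lower_bound_ip_address_clean',
                       direction='backward',
                       allow_exact_matches=False)  # mirror the strict '>' comparison on the lower bound
in_range = merged['IP_Address_clean'] < merged['upper_bound_ip_address_clean']
merged['country'] = merged['country'].where(in_range)  # NaN when the IP is not inside any range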
I am working on the following dataset: https://drive.google.com/file/d/1UVgSfIO-46aLKHeyk2LuKV6nVyFjBdWX/view?usp=sharing
I am trying to replace the countries in the "Nationality" column whose value_counts() are less than 450 with the value of "Others".
def collapse_category(df):
    df.loc[df['Nationality'].map(df['Nationality'].value_counts(normalize=True)
                                 .lt(450)), 'Nationality'] = 'Others'
    print(df['Nationality'].unique())
This is the code I used, but it returns this result: ['Others']
Here is the link to my notebook for reference: https://colab.research.google.com/drive/1MfwwBfi9_4E1BaZcPnS7KJjTy8xVsgZO?usp=sharing
The issue is that value_counts(normalize=True) returns proportions, which are all far below 450, so your mask is True for every row and every nationality gets replaced with 'Others'. Use the absolute counts with boolean indexing instead:
s = df['Nationality'].value_counts()
df.loc[df['Nationality'].isin(s[s<450].index), 'Nationality'] = 'Others'
New value_counts after the change:
FRA 12307
PRT 11382
DEU 10164
GBR 8610
Others 5354
ESP 4864
USA 3398
... ...
FIN 632
RUS 578
ROU 475
Name: Nationality, dtype: int64
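A minimal check of the diagnosis (run on the original df, before the replacement):
props = df['Nationality'].value_counts(normalize=True)  # proportions, all < 1
print(props.lt(450).all())                              # True -> the original mask selected every row
counts = df['Nationality'].value_counts()               # absolute counts
print(counts.lt(450).sum())                             # number of nationalities collapsed into 'Others'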
value_filter = df.Nationality.value_counts().lt(450)
temp_dict = value_filter[value_filter].replace({True: "Others"}).to_dict()
df = df.replace(temp_dict)
In general, the third line will replace matching values anywhere in df rather than in one particular column, but the above code will work for your data.
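If you want to restrict the replacement to the Nationality column only, something like this should work:
df['Nationality'] = df['Nationality'].replace(temp_dict)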
I'm really amateur-level with both python and pandas, but I'm trying to solve an issue for work that's stumping me.
I have two dataframes, let's call them dfA and dfB:
dfA:
project_id Category Initiative
10
20
30
40
dfB:
project_id field_id value
10 100 lorem
10 200 lorem1
10 300 lorem2
20 200 ipsum
20 300 ipsum1
20 500 ipsum2
Let's say I know "Category" from dfA correlates to field_id "100" from dfB, and "Initiative" correlates to field_id "200".
I need to look through dfB and for a given project_id/field_id combination, take the corresponding value in the "value" column and place it in the correct cell in dfA.
The result would look like this:
dfA:
project_id Category Initiative
10 lorem lorem1
20 ipsum
30
40
Bonus difficulty: not every project in dfA exists in dfB, and not every field_id is used in every project_id.
I hope I've explained this well enough; I feel like there must be a relatively simple way to handle this that I'm missing.
You could do something like this, although it's not very elegant; there must be a better way. I had to use try/except for the cases where the project_id is not available in dfB. I put NaN values for the missing ones, but you can easily use empty strings instead.
def get_value(row):
    try:
        res = dfB[(dfB['field_id'] == 100) & (dfB['project_id'] == row['project_id'])]['value'].iloc[0]
    except IndexError:
        res = np.nan
    row['Category'] = res
    try:
        res = dfB[(dfB['field_id'] == 200) & (dfB['project_id'] == row['project_id'])]['value'].iloc[0]
    except IndexError:
        res = np.nan
    row['Initiative'] = res
    return row

dfA = dfA.apply(get_value, axis=1)
EDIT: as mentioned in the comments, this is not very flexible since some values are hardcoded, but you can easily change that with something like the code below. This way, if a field_id changes or you need to add or remove a column, you only have to update the dictionary.
columns_fields = {"Category": 100, "Initiative": 200}

def get_value(row):
    for key, value in columns_fields.items():
        try:
            res = dfB[(dfB['field_id'] == value) & (dfB['project_id'] == row['project_id'])]['value'].iloc[0]
        except IndexError:
            res = np.nan
        row[key] = res
    return row

dfA = dfA.apply(get_value, axis=1)
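For reference (a sketch, not the approach above), a vectorized alternative is to pivot dfB so each field_id becomes its own column, then map the values onto dfA by project_id. This assumes each project_id/field_id pair appears at most once in dfB:
wide = dfB.pivot(index='project_id', columns='field_id', values='value')
for col, fid in columns_fields.items():
    if fid in wide.columns:                          # skip field_ids that never occur in dfB
        dfA[col] = dfA['project_id'].map(wide[fid])  # project_ids missing from dfB become NaN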
I summed the YTD using this code:
df['YTD19'] = df['January 19']+df['February 19']+df['March 19']+df['April 19']+df['May 19'] + df['June 19']
df['YTD20'] = df['January 20']+df['February 20']+df['March 20']+df['April 20']+df['May 20'] + df['June 20']
But as a result, some rows (especially those with null values) did not sum correctly.
Could you please help me improve my code?
To improve your code, you can first replace whitespace-only values with NaN. Then, to create YTD19, you can sum all the columns that contain '19' in their name using filter(like=...); similar logic applies for YTD20:
# replace empty string and records with only spaces
df = df.replace(r'^\s*$', np.nan, regex=True)
# create your 2 columns
df['YTD19'] = df.filter(like='19').sum(axis=1)
df['YTD20'] = df.filter(like='20').sum(axis=1)
>>> df[['Manufacturer','Category','Country','YTD19','YTD20']]
Manufacturer Category Country YTD19 YTD20
0 X Joist Czech 2910 2677.0
1 Y Joist Poland 3269 2366.0
2 Z Joist Slovakia 4204 2012.0
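One caveat: filter(like='19') matches any column whose name contains '19', including the new YTD19 column if the cell is re-run. If that is a concern, a stricter regex filter on the month columns should work (a sketch, assuming the column names end in ' 19' / ' 20'):
df['YTD19'] = df.filter(regex=r' 19$').sum(axis=1)
df['YTD20'] = df.filter(regex=r' 20$').sum(axis=1)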
I found out what the problem was:
I forgot to replace null values with zero, which is why it printed the wrong output. So the only thing needed is:
df = df.fillna(0)
I have the following two dataframes:
url='https://raw.githubusercontent.com/108michael/ms_thesis/master/rgdp_catcode.merge'
df=pd.read_csv(url, index_col=0)
df.head(1)
naics catcode GeoName Description ComponentName year GDP state
0 22 E1600',\t'E1620',\t'A4000',\t'E5000',\t'E3000'... Alabama Utilities Real GDP by state 2004 5205 AL
url='https://raw.githubusercontent.com/108michael/ms_thesis/master/mpl.Bspons.merge'
df1=pd.read_csv(url, index_col=0)
df1.head(1)
state year unemployment log_diff_unemployment id.thomas party type date bills id.fec years_exp session name disposition catcode
0 AK 2006 6.6 -0.044452 1440 Republican sen 2006-05-01 s2686-109 S2AK00010 39 109 National Cable & Telecommunications Association support C4500
Regarding df, I had to manually input the catcode values. I think that is why the formatting is off. What I would like is to simply have the values without the \t prefix. I want to merge the dfs on catcode, state, year. I made a test earlier wherein a df1.catcode with only one value per cell was matched with the values in another df.catcode that had more than one value per cell and it worked.
So technically, all I need to do is lose the \t before each consecutive value in df.catcode, but additionally, if anyone has ever done a merge of this sort before, any 'caveats' learned through experience would be appreciated. My merge code looks like this:
mplmerge = pd.merge(df1, df, on=['catcode', 'state', 'year'], how='left')
I think this can be done with the regex method, I'm looking at the documentation now.
Cleaning catcode column in df is rather straightforward:
catcode_fixed = df.catcode.str.findall('[A-Z][0-9]{4}')
This will produce a series with a list of catcodes in every row:
catcode_fixed.head(3)
Out[195]:
0 [E1600, E1620, A4000, E5000, E3000, E1000]
1 [X3000, X3200, L1400, H6000, X5000]
2 [X3000, X3200, L1400, H6000, X5000]
Name: catcode, dtype: object
If I understand correctly what you want, then you need to "ungroup" these lists. Here is the trick, in short:
catcode_fixed = catcode_fixed.apply(pd.Series).stack()
catcode_fixed.index = catcode_fixed.index.droplevel(-1)
So, we've got (note the index values):
catcode_fixed.head(12)
Out[206]:
0 E1600
0 E1620
0 A4000
0 E5000
0 E3000
0 E1000
1 X3000
1 X3200
1 L1400
1 H6000
1 X5000
2 X3000
dtype: object
Now, dropping the old catcode and joining in the new one:
df.drop('catcode', axis=1, inplace=True)
catcode_fixed.name = 'catcode'
df = df.join(catcode_fixed)
By the way, you may also need to use df1.reset_index() when merging the data frames.
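As a side note (not part of the answer above), newer pandas versions (0.25+) offer Series.explode, which produces the same "ungrouped" result in one step; a minimal sketch using the same df:
catcode_fixed = df.catcode.str.findall('[A-Z][0-9]{4}').explode()  # one catcode per row, original index repeated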