How to properly use pandas vectorization? - python

According to an article, vectorization is much faster than applying a function to a pandas DataFrame column.
But I have a somewhat special case like this:
import pandas as pd
df = pd.DataFrame({'IP': [ '1.0.64.2', '100.23.154.63', '54.62.1.3']})
def compare3rd(ip):
    """Check if the 3rd part of an IP is greater than 100 or not"""
    ip_3rd = ip.split('.')[2]
    if int(ip_3rd) > 100:
        return True
    else:
        return False
# This works but is very slow
df['check_results'] = df.IP.apply(lambda x: compare3rd(x))
print df
# This is supposed to be much faster
# But it doesn't work ...
df['check_results_2'] = compare3rd(df['IP'].values)
print df
The full error traceback looks like this:
Traceback (most recent call last):
  File "test.py", line 16, in <module>
    df['check_results_2'] = compare3rd(df['IP'].values)
  File "test.py", line 6, in compare3rd
    ip_3rd = ip.split('.')[2]
AttributeError: 'numpy.ndarray' object has no attribute 'split'
My question is: how do I properly use this vectorization method in this case?

Check with the str accessor in pandas:
df.IP.str.split('.').str[2].astype(int) > 100
0    False
1     True
2    False
Name: IP, dtype: bool
Since you mention vectorize, there is also numpy.vectorize:
import numpy as np
np.vectorize(compare3rd)(df.IP.values)
array([False, True, False])
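A minimal follow-up sketch (not part of the original answers): either result can be assigned straight back to a new column, and note that numpy.vectorize is essentially a Python-level loop under the hood, so the str-accessor version is usually the faster of the two.
import pandas as pd
df = pd.DataFrame({'IP': ['1.0.64.2', '100.23.154.63', '54.62.1.3']})
# Vectorised string operations: split each IP, take its 3rd part, compare with 100.
df['check_results_2'] = df.IP.str.split('.').str[2].astype(int) > 100
print(df)
#               IP  check_results_2
# 0       1.0.64.2            False
# 1  100.23.154.63             True
# 2      54.62.1.3            False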

Related

Apply function to data frame in python

In the following code, I try to define a function first and then apply it to a DataFrame to reset the geozone.
import pandas as pd
testdata ={'country': ['USA','AUT','CHE','ABC'], 'geozone':[0,0,0,0]}
d =pd.DataFrame.from_dict(testdata, orient = 'columns')
def setgeozone(dataframe, dcountry, dgeozone):
    dataframe.loc[dataframe['dcountry'].isin(['USA','CAN']), 'dgeozone'] = 1
    dataframe.loc[dataframe['dcountry'].isin(['AUT','BEL']), 'dgeozone'] = 2
    dataframe.loc[dataframe['dcountry'].isin(['CHE','DNK']), 'dgeozone'] = 3

setgeozone(d, country, geozone)
I got an error message saying:
Traceback (most recent call last):
  File "<ipython-input-56-98dad4781f73>", line 1, in <module>
    setgeozone(d, country, geozone)
NameError: name 'country' is not defined
Can someone help me understand what I did wrong?
Many thanks.
You don't need to pass parameters other than the DataFrame itself to your function. Try this:
def setgeozone(df):
    df.loc[df['country'].isin(['USA','CAN']), 'geozone'] = 1
    df.loc[df['country'].isin(['AUT','BEL']), 'geozone'] = 2
    df.loc[df['country'].isin(['CHE','DNK']), 'geozone'] = 3

setgeozone(df)
Here are two other (arguably better) ways to accomplish what you need:
Use map:
df["geozone"] = df["country"].map({"USA": 1, "CAN": 1, "AUT": 2, "BEL": 2, "CHE": 3, "DNK": 3})
Use numpy.select:
import numpy as np
df["geozone"] = np.select([df["country"].isin(["USA", "CAN"]), df["country"].isin(["AUT", "BEL"]), df["country"].isin(["CHE", "DNK"])],
[1, 2, 3])
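A minimal end-to-end sketch using the question's sample data (the default=0 is an assumption for countries not covered by any rule, so 'ABC' keeps its original value):
import pandas as pd
import numpy as np

d = pd.DataFrame({'country': ['USA', 'AUT', 'CHE', 'ABC'], 'geozone': [0, 0, 0, 0]})
# Each condition maps a group of countries to its geozone; unmatched rows get the default.
d['geozone'] = np.select(
    [d['country'].isin(['USA', 'CAN']),
     d['country'].isin(['AUT', 'BEL']),
     d['country'].isin(['CHE', 'DNK'])],
    [1, 2, 3],
    default=0)
print(d)
#   country  geozone
# 0     USA        1
# 1     AUT        2
# 2     CHE        3
# 3     ABC        0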

AttributeError: 'method_descriptor' object has no attribute 'df_new' in python when replacing string

I am writing a short piece of code to remove web browser version numbers from the names in a column of a pandas DataFrame, i.e. to replace a string containing alphabetic and numeric characters with just the alphabetic characters.
I have written:
df_new=(rename())
str.replace.df_new[new_browser]('[.*0-9]',"",regex=True)
I am getting this error message and I don't understand what it's telling me:
AttributeError Traceback (most recent call last)
<ipython-input-4-d8c6f9119b9f> in <module>
3 df_new=(rename())
4
----> 5 str.replace.df_new[new_browser]('[.*0-9]',"",regex=True)
AttributeError: 'method_descriptor' object has no attribute 'df_new'
The code above follows this code/function in a Jupyter Notebook:
import pandas as pd
import numpy as np
import re
# write a function to change column names using a dictionary
def rename():
    dict_col = {'Browser':'new_browser', 'Page':'web_page', 'URL':'web_address', 'Visitor ID':'visitor_ID'}
    df = pd.read_csv('dummy_webchat_data.csv')
    for y in df.columns:
        if y in dict_col:
            df_new = df.rename(columns={y:dict_col}[y])
    return df_new
rename()
I've been having trouble with the DataFrame updates not being recognised when I next call it. Usually in a Jupyter Notebook I just keep writing the amendments to the df and it retains the updates, but even the code df_new.head(1) needs to be written like this to work after the function is first run (mentioning this as it feels like a similar problem, even though the error messages are different):
df_new=(rename())
df_new.head(1)
Can anyone help me, please?
Best
Miriam
The error tells you that you are not using the Series.str.replace() method correctly.
You have:
str.replace.df_new[new_browser]('[.*0-9]',"",regex=True)
when what you want is:
df_new['new_browser'].str.replace('[.*0-9]', "", regex=True)
(Note that new_browser needs to be a quoted column name here, since it was never defined as a variable.)
See this:
>>> import pandas as pd
>>>
>>> s = pd.Series(['a1', 'b2', 'c3', 'd4'])
>>> s.str.replace('[.*0-9]',"",regex=True)
0 a
1 b
2 c
3 d
dtype: object
and compare it with this (which is what you wrote; s here plays the role of your df_new['new_browser']):
>>> import pandas as pd
>>>
>>> s = pd.Series(['a1', 'b2', 'c3', 'd4'])
>>> str.replace.s('[.*0-9]',"",regex=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'method_descriptor' object has no attribute 's'
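As a further sketch (not from the original answer, and assuming the DataFrame returned by rename() has a new_browser column), the cleaned values can be assigned back to the column:
df_new = rename()
# '[.*0-9]' is a character class matching dots, asterisks and digits, so this strips version numbers.
df_new['new_browser'] = df_new['new_browser'].str.replace('[.*0-9]', '', regex=True)
df_new.head(1)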

Why is this error occurring when I am using filter in pandas: TypeError: 'int' object is not iterable

When I try to remove some elements which satisfy a particular condition, Python throws the following error:
TypeError Traceback (most recent call last)
<ipython-input-25-93addf38c9f9> in <module>()
4
5 df = pd.read_csv('fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv;
----> 6 df = filter(df,~('-02-29' in df['Date']))
7 '''tmax = []; tmin = []
8 for dates in df['Date']:
TypeError: 'int' object is not iterable
The following is the code:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data/C2A2_data/BinnedCsvs_d400/fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv');
df = filter(df,~('-02-29' in df['Date']))
What could I be doing wrong?
Following is the sample data (posted as an image in the original question).
Use df.filter() (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.filter.html)
Also please attach the csv so we can run it locally.
The error comes from Python's built-in filter(), whose signature is filter(function, iterable): here df is passed as the function and ~('-02-29' in df['Date']) evaluates to an integer, which is not iterable. Another way to do this is to use one of pandas' string methods for Boolean indexing:
df = df[~ df['Date'].str.contains('-02-29')]
You will still have to make sure that all the dates are actually strings first.
Edit:
Seeing the picture of your data, maybe this is what you want (slashes instead of hyphens):
df = df[~ df['Date'].str.contains('/02/29')]
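A slightly fuller sketch of that approach (the Date column name comes from the question; the hyphenated format is an assumption, swap in '/02/29' if your dates use slashes):
df['Date'] = df['Date'].astype(str)            # make sure the dates are strings
df = df[~df['Date'].str.contains('-02-29')]    # drop any 29 February rows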

Python - Pandas - resample issue

I am trying to convert a pandas Series with a certain frequency to a Series with a different frequency. Therefore I used the resample function, but it does not recognize, for instance, that 'M' is a subperiod of '3M', and raises an error:
import pandas as pd
idx_1 = pd.period_range('2017-01-01', periods=6, freq='M')
data_1 = pd.Series(range(6), index=idx_1)
data_higher_freq = data_1.resample('3M', kind="Period").sum()
Raises the following exception:
Traceback (most recent call last):
  File "/home/mitch/Programs/Infrastructure_software/Sandbox/spyderTest.py", line 15, in <module>
    data_higher_freq = data_1.resample('3M', kind="Period").sum()
  File "/home/mitch/anaconda3/lib/python3.6/site-packages/pandas/core/resample.py", line 758, in f
    return self._downsample(_method, min_count=min_count)
  File "/home/mitch/anaconda3/lib/python3.6/site-packages/pandas/core/resample.py", line 1061, in _downsample
    'sub or super periods'.format(ax.freq, self.freq))
pandas._libs.tslibs.period.IncompatibleFrequency: Frequency <MonthEnd> cannot be resampled to <3 * MonthEnds>, as they are not sub or super periods
This seems to be due to the pd.tseries.frequencies.is_subperiod function:
import pandas as pd
pd.tseries.frequencies.is_subperiod('M', '3M')
pd.tseries.frequencies.is_subperiod('M', 'Q')
Indeed it returns False for the first command and True for the second.
I would really appreciate any hints about a solution.
Thanks.
Try changing from a PeriodIndex to a DatetimeIndex before resampling:
import pandas as pd
idx_1 = pd.period_range('2017-01-01', periods=6, freq='M')
data_1 = pd.Series(range(6), index=idx_1)
data_1.index = data_1.index.astype('datetime64[ns]')
data_higher_freq = data_1.resample('3M', kind='period').sum()
Output:
data_higher_freq
Out[582]:
2017-01 3
2017-04 12
Freq: 3M, dtype: int64
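A variant sketch (not from the original answer): to_timestamp() is the dedicated method for turning a PeriodIndex into a DatetimeIndex, and kind='period' keeps period labels on the result, as above:
import pandas as pd

idx_1 = pd.period_range('2017-01-01', periods=6, freq='M')
data_1 = pd.Series(range(6), index=idx_1)
# Convert the PeriodIndex to timestamps, resample to 3-month bins, keep period labels.
data_higher_freq = data_1.to_timestamp().resample('3M', kind='period').sum()
print(data_higher_freq)
# 2017-01     3
# 2017-04    12
# Freq: 3M, dtype: int64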

How to solve TypeError: 'numpy.ndarray' object is not callable on Python

I am working on aggregating a JSON file in Python.
I use a list comprehension to get all the agency_responsible values:
import pandas as pd
import numpy as np
url = "http://311api.cityofchicago.org/open311/v2/requests.json";
d= pd.read_json(url)
ar = [x.get("agency_responsible") for x in d.values()]
I got this error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'numpy.ndarray' object is not callable
Then I tried to solve this by adding numpy and dealing with the array:
import numpy as np
np.[x.get("agency_responsible") for x in d.values()]
But it seems that it doesn't work out!
values is a property of a DataFrame, not a method. Just use d.values to access the array.
In fact, I think what you want is simply:
ar = d['agency_responsible'].values
or
ar = d.agency_responsible.values
Here's an actual session:
In [1]: import pandas as pd
In [2]: url = "http://311api.cityofchicago.org/open311/v2/requests.json"
In [3]: d = pd.read_json(url)
In [4]: type(d)
Out[4]: pandas.core.frame.DataFrame
In [5]: ar = d.agency_responsible.values
In [6]: ar[0]
Out[6]: u'Bureau of Street Operations - Graffiti'
In [7]: ar[:4]
Out[7]:
array([u'Bureau of Street Operations - Graffiti',
u'Division of Electrical Operations CDOT',
u'Bureau of Rodent Control - S/S',
u'Division of Electrical Operations CDOT'], dtype=object)
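Since the goal is aggregation, a minimal follow-on sketch (not from the original answer): once the column is accessed correctly, counting requests per agency is a one-liner.
# Count how many requests each agency is responsible for.
ar_counts = d['agency_responsible'].value_counts()
print(ar_counts.head())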
