I want to find the country names for a DataFrame column of top-level domains such as 'de', 'it', 'us' by using the iso3166 package.
Some domains in the dataset do not exist in iso3166, so a KeyError gets raised.
I tried to work around the error by letting the code return Boolean values, but it runs for a really long time. It would be great to know how to speed it up.
Sample data: df['country']
0 an
1 de
2 it
My code (note: this code does not raise a KeyError; my question is how to make it faster):
df['country_name'] = df['country'].apply(lambda x: countries.get(x)[0] if \
    df['country'].str.find(x).any() == countries.get(x)[1].lower() else 'unknown')
df['country'] is the DataFrame column. countries.get() is for getting country names from iso3166.
df['country'].str.find(x).any() finds top-level domains and countries.get(x)[1].lower() returns top-level domains. If they are the same, I use countries.get(x)[0] to return the country name.
Expected output
country country_name
an unknown
de Germany
it Italy
The error if I run df['country_name'] = df['country'].apply(lambda x: countries.get(x)[0]) (I renamed the dataframe, so it differs from the error message):
KeyError Traceback (most recent call last)
<ipython-input-110-d51176ce2978> in <module>
----> 1 bf['country_name'] = bf['country'].apply(lambda x: countries.get(x)[0])
/opt/anaconda3/lib/python3.8/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
3846 else:
3847 values = self.astype(object).values
-> 3848 mapped = lib.map_infer(values, f, convert=convert_dtype)
3849
3850 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()
<ipython-input-110-d51176ce2978> in <lambda>(x)
----> 1 bf['country_name'] = bf['country'].apply(lambda x: countries.get(x)[0])
/opt/anaconda3/lib/python3.8/site-packages/iso3166/__init__.py in get(self, key, default)
358
359 if r == NOT_FOUND:
--> 360 raise KeyError(key)
361
362 return r
KeyError: 'an'
A little error handling and defining your logic outside of the apply() method should get you where you want to go. Something like:
def get_country_name(x):
    try:
        return countries.get(x)[0]
    except KeyError:
        return 'unknown'

df['country_name'] = df['country'].apply(get_country_name)
This is James Tollefson's answer, narrowed down to its core. I didn't want to change his answer too much; here's the implementation:
def get_country_name(x: str) -> str:
    return countries[x][0] if countries.get(x) else 'unknown'
df['country_name'] = df['country'].apply(get_country_name)
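Since the original question was also about speed: if the lookups dominate, building a plain dict once and using Series.map is usually much faster than calling countries.get inside apply. A minimal sketch, with the iso3166 lookup mocked as a hard-coded dict (the names here are made up; with the real package you would build the dict from its records once):

```python
import pandas as pd

# Hypothetical stand-in for an iso3166-derived lookup table.
country_names = {'de': 'Germany', 'it': 'Italy'}

df = pd.DataFrame({'country': ['an', 'de', 'it']})

# map does one hashed dict lookup per row; fillna covers unknown domains.
df['country_name'] = df['country'].map(country_names).fillna('unknown')
print(df['country_name'].tolist())  # ['unknown', 'Germany', 'Italy']
```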
Why do these two functions work properly and run with no errors?
def avgSalespWpY(df, weekStart_col_name, sku_col_name, order_col_name):
    df['SKU_WEEK'] = pd.DatetimeIndex(df[weekStart_col_name]).week
    grouped_data = df.groupby(['SKU_WEEK']).mean().reset_index()
    return grouped_data[['SKU_WEEK', 'R_ORDER_QT']]

def avg_sales(df, prediction_window):
    earliest_date = prediction_window['WEEK_START_DT'].min()
    filter_date = df[df['WEEK_START_DT'] < earliest_date]
    avg_sales_df = avgSalespWpY(filter_date, 'WEEK_START_DT', 'SKU', 'R_ORDER_QT')
    prediction_window['SKU_WEEK'] = pd.DatetimeIndex(prediction_window['WEEK_START_DT']).week
    avg_sales_df.columns = ['SKU_WEEK', 'OA_SPW']
    avg_sales_total = avg_sales_df['OA_SPW'].mean()
    prediction_window = pd.merge(prediction_window, avg_sales_df, how='left', on='SKU_WEEK')
    prediction_window = prediction_window.loc[:, prediction_window.columns != 'SKU_WEEK']
    prediction_window['OA_SPW'] = prediction_window['OA_SPW'].fillna(avg_sales_total)
    return prediction_window
But these two do not work and raise an error?
def avgSalespMpY(df, weekStart_col_name, order_col_name):
    df['SKU_MONTH'] = pd.DatetimeIndex(df[weekStart_col_name]).month
    grouped_data = df.groupby(['SKU_MONTH'])[order_col_name].mean()
    return grouped_data

def avg_salesMY(df, prediction_window):
    earliest_date = prediction_window['WEEK_START_DT'].min()
    filter_date = df[df['WEEK_START_DT'] < earliest_date]
    avg_sales_df = avgSalespMpY(filter_date, 'WEEK_START_DT', 'R_ORDER_QT')
    prediction_window['SKU_MONTH'] = pd.DatetimeIndex(prediction_window['WEEK_START_DT']).month
    avg_sales_df.columns = ['SKU_MONTH', 'OA_SPW']
    avg_sales_total = avg_sales_df['OA_SPW'].mean()
    prediction_window = pd.merge(prediction_window, avg_sales_df, how='left', on='SKU_MONTH')
    prediction_window = prediction_window.loc[:, prediction_window.columns != 'SKU_MONTH']
    prediction_window['OA_SPW'] = prediction_window['OA_SPW'].fillna(avg_sales_total)
    print(prediction_window['OA_SPW'])
    return prediction_window
This is the error I am getting when I run the second function (avg_salesMY):
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3360 try:
-> 3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
5 frames
pandas/_libs/index_class_helper.pxi in pandas._libs.index.Int64Engine._check_type()
pandas/_libs/index_class_helper.pxi in pandas._libs.index.Int64Engine._check_type()
KeyError: 'OA_SPW'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
-> 3363 raise KeyError(key) from err
3364
3365 if is_scalar(key) and isna(key) and not self.hasnans:
KeyError: 'OA_SPW'
The output for the first function (avgSalespMpY) is what we expect and is as follows:
SKU_MONTH
1 535.687921
2 577.925649
3 611.837803
4 678.377140
5 496.411170
6 601.806244
7 688.770197
8 574.510967
9 636.203457
10 876.896305
11 719.614757
12 553.642329
Name: R_ORDER_QT, dtype: float64
I am very confused about what is going on because the two segments are almost identical. There is an issue accessing the OA_SPW column, but the one code block sits right above the other, so why does it suddenly not run properly?
The first pair is supposed to take in data, find the overall average sales per week of the year, and then add that output to the prediction data. The second pair is supposed to do the same per month. The biggest difference is that one's results are week by week and the other's are month by month.
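The KeyError in the traceback is consistent with the one structural difference between the two pairs: the weekly helper calls .reset_index(), which returns a DataFrame, while the monthly helper returns the raw groupby result, a Series, which has no 'OA_SPW' column to rename or select. A tiny sketch with made-up data:

```python
import pandas as pd

# Toy data (invented) illustrating the structural difference.
df = pd.DataFrame({
    'WEEK_START_DT': ['2021-01-04', '2021-02-01', '2021-02-08'],
    'R_ORDER_QT': [10.0, 20.0, 30.0],
})
df['SKU_MONTH'] = pd.DatetimeIndex(df['WEEK_START_DT']).month

# Without reset_index() the groupby mean is a Series indexed by month...
as_series = df.groupby(['SKU_MONTH'])['R_ORDER_QT'].mean()
print(type(as_series).__name__)  # Series -> no 'OA_SPW' column to select

# ...with reset_index() it becomes a DataFrame, as in the weekly version.
as_frame = as_series.reset_index()
as_frame.columns = ['SKU_MONTH', 'OA_SPW']  # renaming now works
print(as_frame['OA_SPW'].tolist())  # [10.0, 25.0]
```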
Could anyone please help me with why I am getting the below error? Everything worked before when I used the same logic; it broke after I converted my date columns to the appropriate data type.
Below is the line of code I am trying to run
data['OPEN_DT'] = data['OPEN_DT'].apply(lambda x: datetime.strptime(x,'%Y-%m-%d') if len(x[:x.find ('-')]) == 4 else datetime.strptime(x, '%d-%m-%Y'))
Error being received :
AttributeError Traceback (most recent call last)
<ipython-input-93-f0a22bfffeee> in <module>
----> 1 data['OPEN_DT'] = data['OPEN_DT'].apply(lambda x: datetime.strptime(x,'%Y-%m-%d') if len(x[:x.find ('-')]) == 4 else datetime.strptime(x, '%d-%m-%Y'))
~\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
3846 else:
3847 values = self.astype(object).values
-> 3848 mapped = lib.map_infer(values, f, convert=convert_dtype)
3849
3850 if len(mapped) and isinstance(mapped[0], Series):
pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()
<ipython-input-93-f0a22bfffeee> in <lambda>(x)
----> 1 data['OPEN_DT'] = data['OPEN_DT'].apply(lambda x: datetime.strptime(x,'%Y-%m-%d') if len(x[:x.find ('-')]) == 4 else datetime.strptime(x, '%d-%m-%Y'))
AttributeError: 'Timestamp' object has no attribute 'find'
ValueError: time data '30/09/2020' does not match format '%d-%m-%Y'
Many thanks.
As the ValueError shows, %d-%m-%Y needs to be changed to %d/%m/%Y to read the date 30/09/2020.
from datetime import datetime
import pandas as pd
data = {}
dates = {'27/09/2020', '28/09/2020', '29/09/2020', '30/09/2020'}
data['OPEN_DT'] = pd.Series(list(dates))
data['OPEN_DT'] = data['OPEN_DT'].apply(lambda x: datetime.strptime(x,'%Y/%m/%d') if len(x[:str(x).find ('-')]) == 4 else datetime.strptime(x, '%d/%m/%Y'))
print(data)
x is a Timestamp; you should convert it to str and then look for '-' in it:
str(x).find('-')
And why don't you simply use infer_datetime_format to let pandas detect the format automatically?
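A variant of the same idea that branches on which separator appears, rather than on its position (sample dates are made up):

```python
import pandas as pd
from datetime import datetime

s = pd.Series(['2020-09-30', '30/09/2020'])  # mixed formats, invented sample

def parse_date(x):
    x = str(x)  # guard against Timestamp entries
    if '-' in x:
        return datetime.strptime(x, '%Y-%m-%d')
    return datetime.strptime(x, '%d/%m/%Y')

parsed = s.apply(parse_date)
print(parsed.tolist())
```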
I got this error when trying to split one column into several columns. It splits fine into one or two columns, but if you want to split into 3, 4, or 5 columns it writes:
ValueError Traceback (most recent call last)
/usr/local/Cellar/jupyterlab/2.1.5/libexec/lib/python3.8/site-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
349 try:
--> 350 return self._range.index(new_key)
351 except ValueError:
ValueError: 2 is not in range
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-19-d4e6a4d03e69> in <module>
22 data_old[Col_1_Label] = newz[0]
23 data_old[Col_2_Label] = newz[1]
---> 24 data_old[Col_3_Label] = newz[2]
25 #data_old[Col_4_Label] = newz[3]
26 #data_old[Col_5_Label] = newz[4]
/usr/local/Cellar/jupyterlab/2.1.5/libexec/lib/python3.8/site-packages/pandas/core/frame.py in __getitem__(self, key)
2798 if self.columns.nlevels > 1:
2799 return self._getitem_multilevel(key)
-> 2800 indexer = self.columns.get_loc(key)
2801 if is_integer(indexer):
2802 indexer = [indexer]
/usr/local/Cellar/jupyterlab/2.1.5/libexec/lib/python3.8/site-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
350 return self._range.index(new_key)
351 except ValueError:
--> 352 raise KeyError(key)
353 return super().get_loc(key, method=method, tolerance=tolerance)
354
KeyError: 2
Here is my code. I have a CSV file, and when pandas reads it, it creates one column named 'Контракт'. Then I split it into other columns, but it only splits as far as two columns. I want 7 columns! Please help me understand this logic!
import pandas as pd
from pandas import Series, DataFrame
import re
dframe1 = pd.read_csv('po.csv')
columns = ['Контракт']
data_old = pd.read_csv('po.csv', header=None, names=columns)
data_old
# The thing you want to split the column on
SplitOn = ':'
# Name of Column you want to split
Split_Col = 'Контракт'
newz = data_old[Split_Col].str.split(pat=SplitOn, n=-1, expand=True)
# Column Labels (you can add more if you will have more)
Col_1_Label = 'Номер телефону'
Col_2_Label = 'Тарифний пакет'
Col_3_Label = 'Вихідні дзвінки з України за кордон'
Col_4_Label = 'ВАРТІСТЬ ПАКЕТА/ЩОМІСЯЧНА ПЛАТА'
Col_5_Label = 'ЗАМОВЛЕНІ ДОДАТКОВІ ПОСЛУГИ ЗА МЕЖАМИ ПАКЕТА'
Col_6_Label = 'Вартість послуги "Корпоративна мережа'
Col_7_Label = 'ЗАГАЛОМ ЗА КОНТРАКТОМ (БЕЗ ПДВ ТА ПФ)'
data_old[Col_1_Label] = newz[0]
data_old[Col_2_Label] = newz[1]
data_old[Col_3_Label] = newz[2]
#data_old[Col_4_Label] = newz[3]
#data_old[Col_5_Label] = newz[4]
#data_old[Col_6_Label] = newz[5]
#data_old[Col_7_Label] = newz[6]
data_old
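The KeyError: 2 in the traceback can be reproduced on a toy Series: str.split(expand=True) only creates as many columns as the row with the most parts, so if no row contains three ':'-separated pieces, column 2 never exists. A sketch with made-up values:

```python
import pandas as pd

s = pd.Series(['a:b', 'c:d'])
parts = s.str.split(pat=':', n=-1, expand=True)
print(list(parts.columns))   # [0, 1] -> parts[2] would raise KeyError: 2

# Once some row has three parts, column 2 appears (missing parts become None):
s2 = pd.Series(['a:b:c', 'd:e'])
parts2 = s2.str.split(pat=':', n=-1, expand=True)
print(list(parts2.columns))  # [0, 1, 2]
```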
Pandas does not support "unstructured text"; you should convert it to a standard format or Python objects and then create a dataframe from it.
Imagine that you have a file with this text named data.txt:
Contract № 12345679 Number of phone: +7984563774
Total price for month : 00.00000
Total price: 10.0000
You can load and process it with Python like this:
import re

with open('data.txt') as f:
    content = f.readlines()

# First line contains the contract number and the phone information
contract, phone = content[0].split(':')
# Find the contract number using a regex
contract = re.findall(r'\d+', contract)[0]
# The phone is straightforward
phone = phone.strip()
# Second and third lines hold the monthly and total prices
total_month_price = float(content[1].split(':')[1].strip())
total_price = float(content[2].split(':')[1].strip())
Then, with these variables, you can create a dataframe:
df = pd.DataFrame([dict(N_of_contract=contract, total_price=total_price, total_month_price=total_month_price)])
Repeat the same for all files.
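The "repeat for all files" step might look like the sketch below. Everything here is hypothetical: the folder, filenames, and file contents are invented to match the format above.

```python
import glob
import os
import re
import tempfile

import pandas as pd

# Create two sample files in the format shown above (hypothetical data).
tmpdir = tempfile.mkdtemp()
samples = {
    'a.txt': 'Contract № 111 Number of phone: +700\n'
             'Total price for month : 1.5\nTotal price: 3.0\n',
    'b.txt': 'Contract № 222 Number of phone: +701\n'
             'Total price for month : 2.5\nTotal price: 5.0\n',
}
for name, text in samples.items():
    with open(os.path.join(tmpdir, name), 'w', encoding='utf-8') as f:
        f.write(text)

# Parse each file into a dict, then build one dataframe from all the rows.
rows = []
for path in sorted(glob.glob(os.path.join(tmpdir, '*.txt'))):
    with open(path, encoding='utf-8') as f:
        content = f.readlines()
    contract, phone = content[0].split(':')
    rows.append(dict(
        N_of_contract=re.findall(r'\d+', contract)[0],
        phone=phone.strip(),
        total_month_price=float(content[1].split(':')[1].strip()),
        total_price=float(content[2].split(':')[1].strip()),
    ))

df = pd.DataFrame(rows)
print(df)
```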
I'm trying to extract dates from txt files using datefinder.find_dates, which returns a generator object. Everything works fine until I try to convert the generator to a list, at which point I get the following error.
I have been looking around for a solution but I can't figure one out; I'm not sure I really understand the problem either.
import datefinder
import glob

path = "some_path/*.txt"
files = glob.glob(path)
dates_dict = {}

for name in files:
    with open(name, encoding='utf8') as f:
        dates_dict[name] = list(datefinder.find_dates(f.read()))
Returns :
---------------------------------------------------------------------------
OverflowError Traceback (most recent call last)
<ipython-input-53-a4b508b01fe8> in <module>()
1 for name in files:
2 with open(name, encoding='utf8') as f:
----> 3 dates_dict[name] = list(datefinder.find_dates(f.read()))
C:\ProgramData\Anaconda3\lib\site-packages\datefinder\__init__.py in
find_dates(self, text, source, index, strict)
29 ):
30
---> 31 as_dt = self.parse_date_string(date_string, captures)
32 if as_dt is None:
33 ## Dateutil couldn't make heads or tails of it
C:\ProgramData\Anaconda3\lib\site-packages\datefinder\__init__.py in
parse_date_string(self, date_string, captures)
99 # otherwise self._find_and_replace method might corrupt
them
100 try:
--> 101 as_dt = parser.parse(date_string, default=self.base_date)
102 except ValueError:
103 # replace tokens that are problematic for dateutil
C:\ProgramData\Anaconda3\lib\site-packages\dateutil\parser\_parser.py in
parse(timestr, parserinfo, **kwargs)
1354 return parser(parserinfo).parse(timestr, **kwargs)
1355 else:
-> 1356 return DEFAULTPARSER.parse(timestr, **kwargs)
1357
1358
C:\ProgramData\Anaconda3\lib\site-packages\dateutil\parser\_parser.py in
parse(self, timestr, default, ignoretz, tzinfos, **kwargs)
651 raise ValueError("String does not contain a date:",
timestr)
652
--> 653 ret = self._build_naive(res, default)
654
655 if not ignoretz:
C:\ProgramData\Anaconda3\lib\site-packages\dateutil\parser\_parser.py in
_build_naive(self, res, default)
1222 cday = default.day if res.day is None else res.day
1223
-> 1224 if cday > monthrange(cyear, cmonth)[1]:
1225 repl['day'] = monthrange(cyear, cmonth)[1]
1226
C:\ProgramData\Anaconda3\lib\calendar.py in monthrange(year, month)
122 if not 1 <= month <= 12:
123 raise IllegalMonthError(month)
--> 124 day1 = weekday(year, month, 1)
125 ndays = mdays[month] + (month == February and isleap(year))
126 return day1, ndays
C:\ProgramData\Anaconda3\lib\calendar.py in weekday(year, month, day)
114 """Return weekday (0-6 ~ Mon-Sun) for year (1970-...), month(1- 12),
115 day (1-31)."""
--> 116 return datetime.date(year, month, day).weekday()
117
118
OverflowError: Python int too large to convert to C long
Can someone explain this clearly?
Thanks in advance
RE-EDIT: After taking into consideration the remarks that were made, I found a minimal, readable, and verifiable example. The error occurs on:
import datefinder

generator = datefinder.find_dates("466990103060049")
for s in generator:
    pass
This looks to be a bug in the library you are using. It is trying to parse the string as a year, but the year is too big to be handled by Python's datetime. The library that datefinder uses documents that it raises an OverflowError in this instance, but datefinder ignores this possibility.
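The root cause can be reproduced with the standard library alone: the digit string gets treated as a year, and such a year does not fit in the C int that datetime uses internally:

```python
import datetime

# 466990103060049 interpreted as a year overflows datetime's internal C int.
try:
    datetime.date(466990103060049, 1, 1)
    error_name = None
except OverflowError as exc:
    error_name = type(exc).__name__
print(error_name)
```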
One quick and dirty hack just to get it working would be to do:
>>> datefinder.ValueError = ValueError, OverflowError
>>> list(datefinder.find_dates("2019/02/01 is a date and 466990103060049 is not"))
[datetime.datetime(2019, 2, 1, 0, 0)]
This seems possibly related to a Japanese-language problem, so I asked on the Japanese StackOverflow as well.
When I use just a string object, it works fine.
I tried to encode it, but I couldn't find the reason for this error.
Could you please give me advice?
MeCab is an open source text segmentation library for use with text written in the Japanese language originally developed by the Nara Institute of Science and Technology and currently maintained by Taku Kudou (工藤拓) as part of his work on the Google Japanese Input project.
https://en.wikipedia.org/wiki/MeCab
sample.csv
0,今日も夜まで働きました。
1,オフィスには誰もいませんが、エラーと格闘中
2,デバッグばかりしていますが、どうにもなりません。
This is the Pandas Python 3 code:
import pandas as pd
import MeCab
# https://en.wikipedia.org/wiki/MeCab
from tqdm import tqdm_notebook as tqdm

# This is working...
df = pd.read_csv('sample.csv', encoding='utf-8')
m = MeCab.Tagger("-Ochasen")
text = "りんごを食べました、そして、みかんも食べました"
a = m.parse(text)
print(a)  # working!

# But I want to use Pandas's Series
def extractKeyword(text):
    """Morphological analysis of text and returning a list of only nouns"""
    tagger = MeCab.Tagger('-Ochasen')
    node = tagger.parseToNode(text)
    keywords = []
    while node:
        if node.feature.split(",")[0] == u"名詞":  # this means noun
            keywords.append(node.surface)
        node = node.next
    return keywords

aa = extractKeyword(text)  # working!!

me = df.apply(lambda x: extractKeyword(x))
# TypeError: ("in method 'Tagger_parseToNode', argument 2 of type 'char const *'", 'occurred at index 0')
This is the trace error
りんご リンゴ りんご 名詞-一般
を ヲ を 助詞-格助詞-一般
食べ タベ 食べる 動詞-自立 一段 連用形
まし マシ ます 助動詞 特殊・マス 連用形
た タ た 助動詞 特殊・タ 基本形
、 、 、 記号-読点
そして ソシテ そして 接続詞
、 、 、 記号-読点
みかん ミカン みかん 名詞-一般
も モ も 助詞-係助詞
食べ タベ 食べる 動詞-自立 一段 連用形
まし マシ ます 助動詞 特殊・マス 連用形
た タ た 助動詞 特殊・タ 基本形
EOS
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-174-81a0d5d62dc4> in <module>()
32 aa = extractKeyword(text) #working!!
33
---> 34 me = df.apply(lambda x: extractKeyword(x))
~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
4260 f, axis,
4261 reduce=reduce,
-> 4262 ignore_failures=ignore_failures)
4263 else:
4264 return self._apply_broadcast(f, axis)
~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in _apply_standard(self, func, axis, ignore_failures, reduce)
4356 try:
4357 for i, v in enumerate(series_gen):
-> 4358 results[i] = func(v)
4359 keys.append(v.name)
4360 except Exception as e:
<ipython-input-174-81a0d5d62dc4> in <lambda>(x)
32 aa = extractKeyword(text) #working!!
33
---> 34 me = df.apply(lambda x: extractKeyword(x))
<ipython-input-174-81a0d5d62dc4> in extractKeyword(text)
20 """Morphological analysis of text and returning a list of only nouns"""
21 tagger = MeCab.Tagger('-Ochasen')
---> 22 node = tagger.parseToNode(text)
23 keywords = []
24 while node:
~/anaconda3/lib/python3.6/site-packages/MeCab.py in parseToNode(self, *args)
280 __repr__ = _swig_repr
281 def parse(self, *args): return _MeCab.Tagger_parse(self, *args)
--> 282 def parseToNode(self, *args): return _MeCab.Tagger_parseToNode(self, *args)
283 def parseNBest(self, *args): return _MeCab.Tagger_parseNBest(self, *args)
284 def parseNBestInit(self, *args): return _MeCab.Tagger_parseNBestInit(self, *args)
TypeError: ("in method 'Tagger_parseToNode', argument 2 of type 'char const *'", 'occurred at index 0')
I see you got some help on the Japanese StackOverflow, but here's an answer in English:
The first thing to fix is that read_csv was treating the first line of your example.csv as the header. To fix that, use the names argument in read_csv.
Next, df.apply will by default apply the function on columns of the dataframe. You need to do something like df.apply(lambda x: extractKeyword(x['String']), axis=1), but this won't work because each sentence will have a different number of nouns and Pandas will complain it cannot stack a 1x2 array on top of a 1x5 array. The simplest way is to apply on the Series of String.
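The column-vs-cell behaviour of apply can be seen without MeCab at all (toy frame, made-up values):

```python
import pandas as pd

df = pd.DataFrame({'Number': [0, 1], 'String': ['ab', 'cd']})

# DataFrame.apply hands each *column* (a Series) to the function...
print(df.apply(lambda col: type(col).__name__).tolist())  # ['Series', 'Series']

# ...while Series.apply hands each *cell* (here a str) to the function,
# which is the shape parseToNode expects.
print(df['String'].apply(lambda cell: type(cell).__name__).tolist())  # ['str', 'str']
```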
The final problem is that there's a bug in the MeCab Python 3 bindings: see https://github.com/SamuraiT/mecab-python3/issues/3 You found a workaround by running parseToNode twice; you can also call parse before parseToNode.
Putting these three things together:
import pandas as pd
import MeCab

df = pd.read_csv('sample.csv', encoding='utf-8', names=['Number', 'String'])

def extractKeyword(text):
    """Morphological analysis of text and returning a list of only nouns"""
    tagger = MeCab.Tagger('-Ochasen')
    tagger.parse(text)
    node = tagger.parseToNode(text)
    keywords = []
    while node:
        if node.feature.split(",")[0] == u"名詞":  # this means noun
            keywords.append(node.surface)
        node = node.next
    return keywords

me = df['String'].apply(extractKeyword)
print(me)
When you run this script with the example.csv you provided:
➜ python3 demo.py
0 [今日, 夜]
1 [オフィス, 誰, エラー, 格闘, 中]
2 [デバッグ]
Name: String, dtype: object
parseToNode fails every time, so I needed to put this code
tagger.parseToNode('dummy')
before
node = tagger.parseToNode(text)
and it worked!
But I don't know the reason; maybe the parseToNode method has a bug.
def extractKeyword(text):
    """Morphological analysis of text and returning a list of only nouns"""
    tagger = MeCab.Tagger('-Ochasen')
    tagger.parseToNode('ダミー')
    node = tagger.parseToNode(text)
    keywords = []
    while node:
        if node.feature.split(",")[0] == u"名詞":  # this means noun
            keywords.append(node.surface)
        node = node.next
    return keywords