I want to create a web service that summarizes the text at a given URL using Python, BeautifulSoup and NLTK.
However, I run into the following error in Azure ML Studio.
Setup in Azure:
The Enter Data module contains a Wikipedia URL.
The Execute Python Script module contains the following code:
import pandas as pd
import urllib.request as ur
from bs4 import BeautifulSoup
def azureml_main(dataframe1="https://en.wikipedia.org/wiki/Fluid_mechanics", dataframe2 = None):
    wiki = dataframe1[0].to_string()
    page = ur.urlopen(wiki)
    soup = BeautifulSoup(page)
    df= pd.DataFrame([soup.find_all('p')[0].get_text()], columns =['article_text'])
    return dataframe1,
Running this experiment produces the following error:
Error 0085: The following error occurred during script evaluation, please view the output log for more information:
---------- Start of error message from Python interpreter ----------
Caught exception while executing function: Traceback (most recent call last):
File "C:\pyhome\lib\site-packages\pandas\indexes\base.py", line 1876, in get_loc
return self._engine.get_loc(key)
File "pandas\index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas\index.c:4027)
File "pandas\index.pyx", line 157, in pandas.index.IndexEngine.get_loc (pandas\index.c:3891)
File "pandas\hashtable.pyx", line 675, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12408)
File "pandas\hashtable.pyx", line 683, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12359)
KeyError: 0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\server\invokepy.py", line 199, in batch
odfs = mod.azureml_main(*idfs)
File "C:\temp\84d7e9fbcfe54596a2e7de022b4d236c.py", line 23, in azureml_main
wiki = dataframe1[0][0].to_string()
File "C:\pyhome\lib\site-packages\pandas\core\frame.py", line 1992, in __getitem__
return self._getitem_column(key)
File "C:\pyhome\lib\site-packages\pandas\core\frame.py", line 1999, in _getitem_column
return self._get_item_cache(key)
File "C:\pyhome\lib\site-packages\pandas\core\generic.py", line 1345, in _get_item_cache
values = self._data.get(item)
File "C:\pyhome\lib\site-packages\pandas\core\internals.py", line 3225, in get
loc = self.items.get_loc(item)
File "C:\pyhome\lib\site-packages\pandas\indexes\base.py", line 1878, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas\index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas\index.c:4027)
File "pandas\index.pyx", line 157, in pandas.index.IndexEngine.get_loc (pandas\index.c:3891)
File "pandas\hashtable.pyx", line 675, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12408)
File "pandas\hashtable.pyx", line 683, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12359)
KeyError: 0
Process returned with non-zero exit code 1
---------- End of error message from Python interpreter ----------
Start time: UTC 11/11/2018 15:34:21
End time: UTC 11/11/2018 15:34:30
I am using Anaconda 4.0/Python 3.5 to run this snippet.
When I assign the URL directly to the variable wiki, the code runs successfully on my local machine.
I am not sure why I cannot fetch the value from the input dataframe1.
The input dataframe has no header, so dataframe1[0] should fetch the URL directly.
Thanks for any help with this.
Your dataframe1 looks like this:
dataframe1 = {'Col1' : ['https://en.wikipedia.org/wiki/Finite_element_method']}
The column key is not an integer index but the label 'Col1'. You can fix the lookup with
wiki = dataframe1['Col1'].to_string(index=0)
but that causes another problem: the URL is truncated if it is too long,
https://en.wikipedia.org/wiki/Finite_element....
so it is better to use
wiki = dataframe1['Col1'][0]
Another error is
return dataframe1,
which should be
return df,
Fixed code:
import pandas as pd
import urllib.request as ur
from bs4 import BeautifulSoup
def azureml_main(dataframe1="https://en.wikipedia.org/wiki/Fluid_mechanics", dataframe2 = None):
    wiki = dataframe1['Col1'][0]
    page = ur.urlopen(wiki)
    soup = BeautifulSoup(page)
    df= pd.DataFrame([soup.find_all('p')[0].get_text()], columns=['article_text'])
    return df,
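The difference between label-based and position-based lookup can be checked locally without Azure. The sketch below builds a stand-in for the Enter Data output; the column name 'Col1' is taken from the answer above and is an assumption about what the module actually produces. Selecting by position with .iloc is one way to avoid depending on the column name at all.

```python
import pandas as pd

# Stand-in for the Enter Data module output: one column named 'Col1'
# (assumed name; the real module output may use a different label).
dataframe1 = pd.DataFrame(
    {"Col1": ["https://en.wikipedia.org/wiki/Finite_element_method"]}
)

# Label-based lookup with an integer fails: no column is labelled 0.
try:
    dataframe1[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Either name the column explicitly, or select by position with .iloc,
# which works whatever the column happens to be called.
url_by_label = dataframe1["Col1"][0]
url_by_position = dataframe1.iloc[0, 0]

print(lookup_failed)
print(url_by_label == url_by_position)
```

Both lookups return the full, untruncated string because they read the cell value directly rather than a formatted representation of it.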
Related
I'm new to coding (about 2 weeks in) and I just ran into a problem. I was following a tutorial, like most of us did in the beginning. The task was to add a new column called "Month". To do that, they suggest taking the first 2 characters from the column called "Order Date". I copied the code from the tutorial letter for letter; the only difference is that I am using PyCharm and they use Jupyter Notebook. I like PyCharm, so maybe someone knows how to solve this.
The code is the following:
import pandas as pd
import os
files = [file for file in os.listdir("./Files")]
allmonths = pd.DataFrame()
for file in files:
    df = pd.read_csv("./Files/" + file)
    allmonths = pd.concat([allmonths,df])
alldata = pd.read_csv("allmonths.csv")
### Month Column addition
alldata["Month"] = alldata["Order Date"].str[0:2]
allmonths['Month']
print(alldata.head())
The Traceback:
Traceback (most recent call last):
File "D:\Coding\Sales_Data\venv\lib\site-packages\pandas\core\indexes\base.py", line 3621, in get_loc
return self._engine.get_loc(casted_key)
File "pandas\_libs\index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Order Date'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "D:\Coding\Sales_Data\Sales Anal.py", line 11, in <module>
alldata["Month"] = alldata["Order Date"].str[0:2]
File "D:\Coding\Sales_Data\venv\lib\site-packages\pandas\core\frame.py", line 3505, in __getitem__
indexer = self.columns.get_loc(key)
File "D:\Coding\Sales_Data\venv\lib\site-packages\pandas\core\indexes\base.py", line 3623, in get_loc
raise KeyError(key) from err
KeyError: 'Order Date'
I know the problem has something to do with the column names, and maybe PyCharm can't read them from the CSV file. But I don't know how to solve it.
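The traceback only says that no column is labelled exactly 'Order Date'; it does not say why. One thing worth noting is that the script reads "allmonths.csv", which the code shown never writes, so that file may not contain what the tutorial expects. A minimal diagnostic sketch, assuming (hypothetically) that the header carries stray whitespace, is to print the labels pandas actually parsed and normalise them. The CSV text here is invented for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content: note the stray space after "Order Date".
csv_text = (
    "Order ID,Order Date ,Product\n"
    "1,04/19/19 08:46,USB-C Cable\n"
    "2,12/07/19 22:30,Monitor\n"
)
alldata = pd.read_csv(io.StringIO(csv_text))

# Step 1: look at the exact column labels pandas parsed.
print(alldata.columns.tolist())

# Step 2: strip whitespace from the labels, then the lookup works.
alldata.columns = alldata.columns.str.strip()
alldata["Month"] = alldata["Order Date"].str[0:2]
print(alldata["Month"].tolist())
```

If the printed labels do not include anything resembling "Order Date", the file itself is the problem rather than the editor, since PyCharm and Jupyter run the same pandas code.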
Could someone help me figure out why my files don't open?
import pandas as pd
file = "C://Dev//20211103_logfile Box 2.8.xlsx"
temp=pd.read_excel(file)
Here is the full error!
PS C:\Dev> & C:/Users/keyur/AppData/Local/Programs/Python/Python39/python.exe c:/Dev/test_excel.py
C:\Users\keyur\AppData\Local\Programs\Python\Python39\lib\site-packages\openpyxl\reader\workbook.py:88:
UserWarning: File contains an invalid specification for 20211103_logfile. This will be removed
warn(msg)
Traceback (most recent call last):
File "c:\Dev\test_excel.py", line 6, in <module>
temp=pd.read_excel(file)
File "C:\Users\keyur\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "C:\Users\keyur\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\excel\_base.py", line 372, in read_excel
data = io.parse(
File "C:\Users\keyur\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\excel\_base.py", line 1272, in parse
return self._reader.parse(
File "C:\Users\keyur\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\excel\_base.py", line 537, in parse
sheet = self.get_sheet_by_index(asheetname)
File "C:\Users\keyur\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\excel\_openpyxl.py", line 546, in get_sheet_by_index
self.raise_if_bad_sheet_by_index(index)
File "C:\Users\keyur\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\excel\_base.py", line 468, in raise_if_bad_sheet_by_index
raise ValueError(
ValueError: Worksheet index 0 is invalid, 0 worksheets found
PS C:\Dev>
There is a problem with your Excel file.
Try creating a new workbook, copy and paste all the data into it, then try again. This method worked for me.
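Before rebuilding the workbook, it can help to see what the file actually contains. An .xlsx file is a ZIP package, and Excel conventionally stores each sheet under xl/worksheets/. The sketch below lists those parts using only the standard library; the helper name worksheet_parts is hypothetical, and the check assumes the conventional layout, which a file written by another tool may not follow. It runs here against an in-memory stand-in rather than a real workbook.

```python
import io
import zipfile


def worksheet_parts(xlsx_source):
    """Return the worksheet XML parts stored inside an .xlsx package."""
    with zipfile.ZipFile(xlsx_source) as package:
        return sorted(
            name
            for name in package.namelist()
            if name.startswith("xl/worksheets/") and name.endswith(".xml")
        )


# Demonstration on an in-memory stand-in, not a real workbook.
fake_xlsx = io.BytesIO()
with zipfile.ZipFile(fake_xlsx, "w") as package:
    package.writestr("xl/workbook.xml", "<workbook/>")
    package.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")
fake_xlsx.seek(0)

print(worksheet_parts(fake_xlsx))
```

If worksheet parts are present but pandas still reports 0 worksheets, the sheet list in xl/workbook.xml is a likely place to look, which would be consistent with the openpyxl warning in the output above. Re-saving the data from Excel, as suggested, rewrites that file.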
I am working on a project that requires web scraping from my university's site, https://erp.aktu.ac.in/WebPages/OneView/OneView.aspx . When I enter a roll number (e.g. 1513310***, where *** runs from 001 to 100), the result is shown, but when I copy the URL and paste it into the browser again, it redirects me to the roll number entry page. I assume the same thing happens when fetching it with the pd.read_html() function. Is there any way to bypass it?
import pandas as pd
>>> pd.read_html('https://erp.aktu.ac.in/WebPages/OneView/OVEngine.aspx?enc=NnCOpTxI4+e2v6OtxoLaIVhtGRRyQHWhl51tE9IxJAlzwgkcwHudd8EEQQF6+chV')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python36\lib\site-packages\pandas\io\html.py", line 1100, in read_html
displayed_only=displayed_only,
File "C:\Python36\lib\site-packages\pandas\io\html.py", line 915, in _parse
raise retained
File "C:\Python36\lib\site-packages\pandas\io\html.py", line 895, in _parse
tables = p.parse_tables()
File "C:\Python36\lib\site-packages\pandas\io\html.py", line 213, in parse_tables
tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
File "C:\Python36\lib\site-packages\pandas\io\html.py", line 545, in _parse_tables
raise ValueError("No tables found")
ValueError: No tables found
The error occurs because the result page cannot be obtained. Is there any way around it?
Beginner here. I am trying to load this table with Python so I can work out how to manipulate it and gain some insight, with the eventual intention of calculating the WOE and/or running a regression.
The command ran fine on a test db of two rows that I created, so it must be something to do with the format of the CSV I'm trying to use. It is a file with 8000 customers and 50 associated variables, including some dates and then counts, sums and averages for 30, 60 and 90 day windows of a number of different factors. Could any of this be the reason I get the error message at the bottom?
(The * characters are just redactions.)
data = pd.read_csv("C:\Users\******\Desktop\*******.csv")
>>> data = pd.read_csv(r"C:\Users\******\Desktop\**************")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\******\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 702, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\******\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 429, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "C:\Users\******\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 895, in __init__
self._make_engine(self.engine)
File "C:\Users\******\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1122, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "C:\Users\******\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1853, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 387, in pandas._libs.parsers.TextReader.__cinit__
File "pandas/_libs/parsers.pyx", line 705, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] File b'C:\\Users\\******\\Desktop\\**************' does not exist: b'C:\\Users\\******\\Desktop\\**************'
....
Add r (raw string) before the opening quote:
data = pd.read_csv(r"C:\Users\******\Desktop\*******.csv")
You should replace each single backslash with a double backslash, like so:
data = pd.read_csv("C:\\Users\\******\\Desktop\\*******.csv")
or prefix the path with r:
data = pd.read_csv(r"C:\Users\******\Desktop\*******.csv")
See the Python documentation on string literals for a full description of which characters need escaping.
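The two spellings above produce the same string, which can be verified directly. The sketch below uses a placeholder user name and file name (both assumptions) and also shows why the unescaped form fails in Python 3: in a normal string literal, \U starts a Unicode escape, so "C:\Users..." is rejected at compile time.

```python
# Two spellings of the same Windows path (names are placeholders).
escaped = "C:\\Users\\someone\\Desktop\\data.csv"
raw = r"C:\Users\someone\Desktop\data.csv"

# Both contain single backslashes once parsed.
print(escaped == raw)

# An escape sequence collapses to one character; a raw string keeps two.
print(len("\t"), len(r"\t"))

# Compiling source that contains the unescaped literal fails, because
# \U must be followed by eight hex digits.
try:
    compile(r'path = "C:\Users\someone"', "<example>", "exec")
    unescaped_is_syntax_error = False
except SyntaxError:
    unescaped_is_syntax_error = True
print(unescaped_is_syntax_error)
```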
It's better to create a separate folder that holds both your script and your CSV file.
Then you can read the file by name only. In an environment with tab completion, press Tab inside the parentheses:
the suggestions will show whether the file is available or not.
df = pd.read_csv('filename.csv')
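Where tab completion is not available, the same check can be done in code. This is a minimal sketch using pathlib; the helper name csv_files_in and the file names are hypothetical, and a temporary folder stands in for the real one. It lists the CSV files that exist and shows that a name without its extension does not match, which is one possible cause of a FileNotFoundError like the one in the question.

```python
import tempfile
from pathlib import Path


def csv_files_in(folder):
    """List the names of the .csv files that exist in the given folder."""
    return sorted(
        p.name for p in Path(folder).iterdir() if p.suffix.lower() == ".csv"
    )


with tempfile.TemporaryDirectory() as folder:
    # Stand-in files; the real folder would hold the customer CSV.
    (Path(folder) / "customers.csv").write_text("id,name\n1,Ann\n")
    (Path(folder) / "notes.txt").write_text("not a csv\n")

    found = csv_files_in(folder)
    # The same name without ".csv" is a different path and does not exist.
    without_extension_exists = (Path(folder) / "customers").exists()

print(found)
print(without_extension_exists)
```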
I would like to create a date range using pandas' date_range(). I am fairly sure it worked before I updated the pandas package.
I am trying to create the date range with the following line of code:
date_time_index = pd.date_range(start='1/1/2018', periods=8760, freq='H')
and here is the error message:
ValueError: Error parsing datetime string "1/1/2018" at position 1
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "main.py", line 36, in <module>
date_time_index = pd.date_range(start='1/1/2018', periods=8760, freq='H')
File "/usr/local/lib/python3.6/dist-packages/pandas/tseries/index.py", line 2024, in date_range
closed=closed, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/pandas/util/decorators.py", line 91, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/pandas/tseries/index.py", line 301, in __new__
ambiguous=ambiguous)
File "/usr/local/lib/python3.6/dist-packages/pandas/tseries/index.py", line 403, in _generate
start = Timestamp(start)
File "pandas/tslib.pyx", line 406, in pandas.tslib.Timestamp.__new__ (pandas/tslib.c:9940)
File "pandas/tslib.pyx", line 1401, in pandas.tslib.convert_to_tsobject (pandas/tslib.c:25239)
File "pandas/tslib.pyx", line 1516, in pandas.tslib.convert_str_to_tsobject (pandas/tslib.c:26859)
File "pandas/src/datetime.pxd", line 141, in datetime._string_t
SystemError: <class 'str'> returned a result with an error set
What am I doing wrong?
date_range() in pandas 0.19.1 does not work with the input I gave. After I updated pandas to 0.23.4, everything works.
To upgrade:
pip3 install --upgrade pandas
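After upgrading, it is worth confirming which version is actually imported and that the index has the expected shape. The sketch below builds the same hourly index for 2018 but passes a Timestamp and an offset object instead of strings, so no date-string or frequency-alias parsing is involved. This is an alternative spelling, not a verified workaround for the 0.19.1 error.

```python
import pandas as pd

# Confirm which pandas version the interpreter is really using.
print(pd.__version__)

# One entry per hour of 2018 (365 days * 24 hours = 8760 periods).
date_time_index = pd.date_range(
    start=pd.Timestamp(year=2018, month=1, day=1),
    periods=8760,
    freq=pd.offsets.Hour(),
)

print(date_time_index[0], date_time_index[-1], len(date_time_index))
```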