Python UDF - import/read external files

I would like to import other Python or CSV files into my Python UDF to perform some operations, for example comparing the table data (which flows in as a stream, row by row) against a row of an external .csv file.
When I try to read the .csv file, I get this error:
IOError: File /home/abc/xyz/myfile.csv does not exist
The same code works perfectly well when it is run as a regular Python script (not as a UDF).

If I understood it right, you can try
ADD FILE [your complete file path]
or
ADD FILES [your directory path]
Before your code can refer to anything on the cluster, you must add it to the distributed cache so that the code running there can access it.
You can have a look at the Hive CLI documentation:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli
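On the Python side, a file shipped with ADD FILE is placed in the task's working directory, so the UDF opens it by its base name and not by the original local path. A minimal sketch of a streaming script under that assumption follows; the function names, the tab delimiter and the match/no_match column are illustrative and not from the question.

```python
import csv
import sys


def load_reference_rows(path):
    """Read the reference CSV into a set of tuples for fast membership tests."""
    with open(path, newline="") as handle:
        return {tuple(row) for row in csv.reader(handle)}


def tag_rows(lines, reference_rows, delimiter="\t"):
    """Yield each input line with an extra 'match' / 'no_match' column."""
    for line in lines:
        fields = tuple(line.rstrip("\n").split(delimiter))
        flag = "match" if fields in reference_rows else "no_match"
        yield delimiter.join(fields + (flag,))


def main(reference_name="myfile.csv"):
    # Assumption: the file was registered with ADD FILE, so it is opened by
    # base name from the current working directory of the task.
    reference = load_reference_rows(reference_name)
    for out_line in tag_rows(sys.stdin, reference):
        print(out_line)
```

To use this as the UDF script, end the file with a call to main() and invoke the script from the query after the ADD FILE statements for both the script and the .csv file.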

Be careful with the syntax! A mistake can cause many problems, and unfortunately the query language interpreter does not show where the problem comes from; it only shows a generic error report.
Look at this similar problem, which was caused by a syntax issue in the file path:
Accessing external file in Python UDF

Related

Jupyter Notebook issue

I ran some commands in Jupyter Notebook and expected the data from a .csv file to be printed in tabular form, but the output I get is incomplete.
This is the result I get from the .csv file
I ran this command:
df1=pandas.read_csv("supermarkets.csv", on_bad_lines='skip')
df1
I expected a tabular printout like the one in the attached image.
The data gets printed in well-tabulated form here
Here is a link to the online version of the file
[pythonhow.com/supermarkets.csv]

Getting good, clean data whose file extension correctly matches the actual content is often a challenge. Assessing the state of the input data is always an important first step.
It appears the data you are trying to get is also online here. GitHub will render that as a table in the browser because it has a viewer mode. To look at the 'raw' file content, click here. You'll see it is a nice comma-delimited file, with columns separated by commas and each row on its own line. The header with the column names is on the first line.
Now open the file you are working with in a good text editor and compare it to the content I pointed you at. That should show you what the issue is.
At this point you may just wish to switch to the version of the file that I pointed you at.
Use the link below to obtain it as proper csv file:
https://raw.githubusercontent.com/kenvilar/data-analysis-using-python/master/supermarkets.csv
You should be able to paste that link into your browser, then right-click on the page and choose 'Save as..' to download it to your local machine. The obtained file should open just fine using the code you showed in the screenshot in your post here.
Please work on writing better questions with specific titles, see here for guidance. The title at present is overly broad and is actually not accurate. This code would not work with the data you apparently have even if you were running it inside a Python code-based script. And so it is not a Jupyter notebook issue. For how to think about making it specific, a good thing to keep in mind is to write for your future self. If you continue to use notebooks you'll have hundreds that would be considered a 'Jupyter Notebook issue', but what makes this issue different from those?

I believe there is an issue with your csv file, not the code.
To me it looks like the data in your csv file is written in JSON format.
Have you opened the supermarkets.csv file in Excel? It should look like a table, not a JSON-formatted file.
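A quick way to check what the file really contains before handing it to pandas is to peek at its first characters. This is a rough sketch only: the function name and the heuristics (a leading brace or bracket means JSON, a comma in the first line means CSV) are assumptions made for illustration.

```python
def sniff_format(path, sample_size=2048):
    """Guess whether a text file holds JSON or comma-delimited text."""
    # utf-8-sig silently drops a byte order mark if one is present.
    with open(path, encoding="utf-8-sig") as handle:
        sample = handle.read(sample_size).lstrip()
    if not sample:
        return "unknown"
    if sample[0] in "{[":
        return "json"
    first_line = sample.splitlines()[0]
    return "csv" if "," in first_line else "unknown"
```

If this reports "json" for supermarkets.csv, the file extension does not match the content, and read_csv is the wrong reader for it.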

Did you try df1.head() to see if the csv got read in the first place? (A pandas DataFrame has no show() method; that one belongs to Spark DataFrames.)

How to solve "new-line character seen in unquoted field - do you need to open the file in universal-newline mode?"

We have CSV files that are generated with pandas and uploaded to S3. A downstream process uses Stitch, where we encounter this issue. In Python we use utf8-sig as the encoding, and the load often fails. Is there a way to cleanse the data from Python itself, and how can I identify the special character causing the issue? I have tried vi and other editors, but so far I have been unable to find the real cause.
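One way to hunt for the offending character from Python is to scan the raw bytes for carriage returns that are not part of a CRLF pair and for other control bytes. The sketch below assumes, without confirmation from the question, that a bare carriage return inside a field is what trips the downstream parser; the function name is illustrative.

```python
def find_suspect_bytes(data):
    """Return (offset, byte_value, label) for bare CRs and other control bytes."""
    hits = []
    for offset, value in enumerate(data):
        if value == 0x0D:
            # A CR is only suspicious when it is not followed by LF.
            if data[offset + 1:offset + 2] != b"\n":
                hits.append((offset, value, "bare CR"))
        elif value < 0x20 and value not in (0x09, 0x0A):
            # Any control byte other than tab and line feed.
            hits.append((offset, value, "control"))
    return hits
```

Read the file with open(path, "rb").read() and pass the bytes in; the offsets returned show where to look in a hex viewer.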

Error when using Writer.Close() function within my Pandas and Openpyxl code

I have written a code which combines some CSV files into a single Excel file, and ended the 'writer' with the code:
writer.save()
writer.close()
However, I get the following error when trying to then open that file after the code has finalised:
We found a problem with some content in 'the file.xlsx'. Do you want us to try to recover as much as we can? If you trust the source of this workbook, click Yes.'
This seems to be purely related to the 'writer.close()' call, as without it I don't get the error. However, without it I cannot open the file at all, as it states that someone else is using it (i.e. openpyxl).
I'm not sure if it is relevant, but my file system runs on a OneDrive cloud-based system.
My plan after 'writer.close()' is to pause the script so that I can print the Excel file to PDF by hand (I found this to be unreliable via Python), and then 'hit continue' to carry on with sending the PDF by email.
Any ideas on how to resolve this error?

Without seeing more of your code, and maybe an example of the data you are writing, it's tough to make any assumptions. Based on the error you are experiencing, the issue is likely caused by the data going into the xlsx file and not by the 'writer' itself. This is Excel saying that data in your file is 'corrupted' by its standards and needs to be fixed.
You should be able to do a 'recovery' of the file through Excel, and it will identify the problem spots in your file, which you can then trace back into your Python program and address properly to eliminate the problem.
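Another possibility worth ruling out, which is an assumption and not something the question confirms, is that the workbook is being finalised twice: in pandas 1.x, ExcelWriter.close() is documented as a synonym for save(), so calling both writes the file two times. A sketch that lets a with-block close the writer exactly once is shown here; the function name and the one-sheet-per-CSV layout are illustrative.

```python
from pathlib import Path

import pandas as pd


def combine_csvs_to_excel(csv_paths, out_path):
    """Write each CSV to its own sheet and finalise the workbook exactly once."""
    # Leaving the with-block saves and closes the writer, so no explicit
    # writer.save() or writer.close() call follows it.
    with pd.ExcelWriter(out_path, engine="openpyxl") as writer:
        for csv_path in csv_paths:
            frame = pd.read_csv(csv_path)
            # Excel limits sheet names to 31 characters.
            sheet_name = Path(csv_path).stem[:31]
            frame.to_excel(writer, sheet_name=sheet_name, index=False)
```

Once the block exits, the file handle is released, which should also address the "someone else is using it" message.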

Outputting A .xls File In Python

I have been teaching myself Python to automate some of our work processes. So far reading from Excel files (.xls, .xlsx) has gone great.
Currently I have hit a bit of a snag. Although I can output .xlsx files fine, the software system that we have to use for our primary work task can only take .xls files as an input - it cannot handle .xlsx files, and the vendor sees no reason to add .xlsx support at any point in the foreseeable future.
When I try to output a .xls file using either Pandas or OpenPyXl, and open that file in Excel, I get a warning that the file format and extension of the file do not match, which leads me to think that attempting to open this file using our software could lead to some pretty unexpected consequences (because it's actually a .xlsx file, just not named as such)
I've tried to search for how to fix this all on Google, but all I can find are guides for how to convert a .xls file to a .xlsx file (which is almost the opposite of what I need). So I was wondering if anybody could please help me on whether this can be achieved, and if it can, how.
Thank you very much for your time

Under the pandas.DataFrame.to_excel documentation you should notice a parameter called engine, which states:
engine : str, optional
Write engine to use, openpyxl or xlsxwriter. You can also set this via the options io.excel.xlsx.writer, io.excel.xls.writer, and io.excel.xlsm.writer.
What it does not state is that the engine param is automatically picked based on your file extension -- therefore, easy fix:
import pandas as pd
df = pd.DataFrame({"data": [1, 2, 3]})
df.to_excel("file.xls") # Notice desired file extension.
This will automatically use the xlwt engine, so make sure you have it installed via pip install xlwt.

Felipe is right: the filename extension sets the engine parameter.
So basically all the warning is saying is that the old Excel ".xls" format is no longer supported in pandas. If you give the output spreadsheet the ".xlsx" extension, the warning message disappears.

I FINALLY have the answer!
I have LibreOffice installed and am using the following on the command line on Windows:
"C:\Program Files\LibreOffice\program\soffice.exe" --headless --convert-to xlsx test2.xls
Currently trying to use subprocess to automate this.
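A possible shape for the subprocess call, as a sketch only: it builds the same command line as above and hands it to subprocess.run. The install path is the one quoted in this answer; the function names and the optional --outdir argument are illustrative additions.

```python
import subprocess

# Path as quoted in the answer above; adjust it for other installations.
SOFFICE = r"C:\Program Files\LibreOffice\program\soffice.exe"


def build_convert_command(source, target_format="xlsx", outdir=None, soffice=SOFFICE):
    """Return the soffice argument list for a headless conversion."""
    command = [soffice, "--headless", "--convert-to", target_format]
    if outdir is not None:
        command += ["--outdir", str(outdir)]
    command.append(str(source))
    return command


def convert(source, **kwargs):
    """Run the conversion and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(build_convert_command(source, **kwargs), check=True)
```

Passing the arguments as a list avoids shell quoting problems with the space in "Program Files". Calling convert("test2.xls") mirrors the command shown above.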

how to convert a .csv file to .bag in python?

I have some VLP16 LiDAR data in .csv format and need to load it in ROS RViz, for which I need a rosbag file (.bag). I looked through the ROS tutorials, but all I found was how to convert .bag to .csv.

I'm not an expert in processing .bag files, but I think you need to go through your CSV file and add the values manually using the rosbag Python API.

Not a direct answer, but check this Python script, which might help you.
For C++, I suggest this repository: convert_csv_to_rosbag, which is even closer to what you asked for.
However, it seems that you will need to write it yourself based on these examples.
