I'm trying to parse firewall logs with Python and pandas, but I'm having trouble getting the correct separator to work.
My current log data:
num�date�time�orig�type�action�alert�i/f_name�i/f_dir�product�log_sys_message�origin_id�ProductFamily�src�dst�proto�message_info�service�s_port�rule�rule_uid�rule_name�service_id�xlatesrc�xlatedst�NAT_rulenum�NAT_addtnl_rulenum�xlatedport�xlatesport�ICMP�ICMP Type�ICMP Code�rule_guid�hit�policy�first_hit_time�last_hit_time�log_id�description�status�version�comment�update_service�TCP packet out of state�tcp_flags�sys_message:�inzone�outzone�Protection Name�Severity�Confidence Level�protection_id�SmartDefense Profile�Performance Impact�Industry Reference�Protection Type�Update Version�Attack Info�attack�capture_uuid�FollowUp�Total logs�Suppressed logs
0�24Oct2017�23:59:00�10.100.255.190�control� ��daemon�inbound�VPN-1 & FireWall-1�Log file has been switched to: 2017-10-24_235900.log�cteafmfw1�Network��������������������������������������������������
and the code:
import pandas as pd
file = pd.read_csv('2017-10-25_235900.log-export.csv', sep='\xff',
                   header=0, index_col=False)
print(file)
When I run this, I can see that the separator is not processed. I've also tried assigning it to a variable with the value chr(255), as was proposed for a similar issue, but I cannot seem to get this separator processed at all.
I know that I can preprocess the file and replace the separator, but as there are already tons of data files using this separator, it would be nice to know whether it is even possible to get this working as-is.
For others wondering the same:
Adding "encoding='latin-1'" to read_csv params solved this
Thanks @COLDSPEED
Related
Question: We are encountering the following error when loading a data file that has a two-character delimiter into an Azure SQL DB. What might we be doing wrong, and how can the issue be resolved?
Using a Python notebook in Azure Databricks, we are trying to load a data file into Azure SQL DB. The delimiter in the data file is the two-character sequence ~*. With the following code we get the errors shown below:
Error: the 'low_memory' option is not supported with the 'python' engine
Code:
import sqlalchemy as sq
import pandas as pd
data_df = pd.read_csv('/dbfs/FileStore/tables/MyDataFile.txt', sep='~*', engine='python', low_memory=False, quotechar='"', header='infer', encoding='cp1252')
.............
.............
Remarks: If we remove the low_memory option, we get the following error instead. With other data files that are even larger than this one but use a single-character delimiter, we don't get this error.
ConnectException: Connection refused (Connection refused)
Error while obtaining a new communication channel
ConnectException error: This is often caused by an OOM error that causes the connection to the Python REPL to be closed. Check your query's memory usage.
From the documentation of Pandas.read_csv():
In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine.
Since it's being interpreted as a regular expression, and * has a special meaning in regex, you need to escape it. Use sep=r'~\*'.
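A hedged sketch of the corrected call, reusing the parameters from the question (low_memory dropped, since the python engine doesn't support it):
import pandas as pd

# r'~\*' escapes the '*' so the two-character delimiter is matched literally;
# a multi-character separator forces the python engine in any case.
data_df = pd.read_csv('/dbfs/FileStore/tables/MyDataFile.txt',
                      sep=r'~\*',
                      engine='python',
                      quotechar='"',
                      header='infer',
                      encoding='cp1252')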
Probably your file is too large and the dataframe does not fit in memory. Can you try to split the processing up? I.e. read 1000 lines, make a dataframe from that, push it to SQL, then read the next 1000 lines, and so on?
The nrows and skiprows arguments to read_csv can be used for this.
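A rough sketch of that loop under the question's assumptions (file path, delimiter, and encoding copied from the thread; where each chunk goes is up to you). pandas' chunksize parameter does the same job with less bookkeeping.
import pandas as pd

chunk_size = 1000
start = 0

while True:
    # skiprows skips the data rows already read in earlier iterations;
    # row 0 (the header) is never skipped, so column names stay attached.
    chunk = pd.read_csv('/dbfs/FileStore/tables/MyDataFile.txt',
                        sep=r'~\*', engine='python', encoding='cp1252',
                        header=0,
                        skiprows=range(1, start + 1),
                        nrows=chunk_size)
    if chunk.empty:
        break
    # push the chunk to SQL here, e.g. chunk.to_sql(..., if_exists='append')
    start += chunk_size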
Maybe a workaround: preprocess the file with sed 's/~\*/;/g' to replace the two-character delimiter with a single character; then you can use the C engine, which has a lower memory footprint.
I’m trying to read an unknown large csv file with pandas.
I came across some errors so I added the following arguments:
df = pd.read_csv(csv_file, engine="python", error_bad_lines=False, warn_bad_lines=True)
It works well, skipping offending lines, and errors are printed to the terminal correctly, such as:
Skipping line 31175: field larger than field limit (131072)
However, I’d like to save all errors to a variable instead of printing them.
How can I do it?
Note that I have a big program here and can't change the output of all logs from file=sys.stdout to something else. I need a case specific solution.
Thanks!
Use the on_bad_lines capability instead (available in pandas 1.4+):
badlines_list = []

def badlines_collect(bad_line: list[str]) -> None:
    badlines_list.append(bad_line)
    return None

df = pd.read_csv(csv_file, engine="python", on_bad_lines=badlines_collect)
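After the read, each offending row is available as a list of its raw fields; a small illustrative loop (not part of the original answer):
for fields in badlines_list:
    print(f"skipped a row with {len(fields)} fields: {fields[:3]} ...")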
I currently have a zip file which contains a list of N folders, each containing 1+ .csv files. I am looking to simply read in a selection of these .csv files from the zip and use pandas to create a list of DataFrames.
I've done this successfully the 'manual' way where I unzip the files locally and just read in the individual .csv's.
However, when I use the zipfile method, I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xab in position ****: invalid start byte
I thought this would be a straightforward task, but I seem to be missing some step. I've given my code below. I suspect the issue is rooted in the way zipfile unpacks the documents compared to macOS (technically The Unarchiver). I have generated a test zip file and successfully got a pandas DataFrame output; I'm just getting myself mixed up on how to achieve the same result on the 'real' data.
Sadly I am not able to post the original data in question here.
import pandas as pd
from zipfile import ZipFile

# Sample loader for testing
sample_path = "Sample_ZipFile.zip"

with ZipFile(sample_path) as zipfiles:
    sample_file_names = [file.filename for file in zipfiles.infolist() if file.filename[-4:] == '.csv']
    data = zipfiles.open(sample_file_names[0])
    testdat = pd.read_csv(data, dtype='str', index_col=False)
So after some frustrated searching the next morning, I eventually stumbled across a similar problem on the pandas GitHub page, which you can look at here.
It simply seems to come down to a difference in how Google Colab and Jupyter handle pd.read_csv (and pd.to_csv).
For anyone stumbling across the same error, I managed to get through the problem using:
Adding engine='python' to pd.read_csv()
OR adding encoding='cp1252' which a colleague suggested.
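A hedged sketch of the zip loader with the second fix applied (path and names taken from the test snippet above; cp1252 decodes the 0xab byte that utf-8 rejects):
import pandas as pd
from zipfile import ZipFile

sample_path = "Sample_ZipFile.zip"

with ZipFile(sample_path) as zipfiles:
    csv_names = [f.filename for f in zipfiles.infolist() if f.filename.endswith('.csv')]
    with zipfiles.open(csv_names[0]) as data:
        # pandas applies the encoding to the binary stream returned by ZipFile.open
        testdat = pd.read_csv(data, dtype='str', index_col=False, encoding='cp1252')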
I am assuming I was just lucky in my Jupyter Notebooks up until now in not seeing any encoding bugs. But I hope this answer helps anyone who might get as stuck as I did...
I have the current latest versions of pandas, openpyxl, and xlrd:
openpyxl : 3.0.6.
pandas : 1.2.2.
xlrd : 2.0.1.
I have a generated Excel xlsx file (exported from a web application).
I read it in pandas:
myexcelfile = pd.read_excel(easy_payfile, engine="openpyxl")
Everything goes ok, I can successfully read the file.
But I do get a warning:
/Users/*******/projects/environments/venv/lib/python3.8/site-packages/openpyxl/styles/stylesheet.py:214: UserWarning: Workbook contains no default style, apply openpyxl's default
warn("Workbook contains no default style, apply openpyxl's default")
The documentation doesn't shed too much light on it.
Is there any way I can add an option to avoid this warning?
I prefer not to suppress it.
I don't think the library offers you a way to disable this, so you are going to need to use the warnings package directly.
A simple, targeted solution to the problem would be:
import warnings

with warnings.catch_warnings(record=True):
    warnings.simplefilter("always")
    myexcelfile = pd.read_excel(easy_payfile, engine="openpyxl")
Passing the engine parameter got rid of the warning for me: df = pd.read_excel("my.xlsx", engine="openpyxl"). The default is None, so I think it is just warning you that it is using openpyxl for the default style.
I had the same warning. I just changed the sheet name of my Excel file from "sheet_1" to "Sheet1", and the warning disappeared. Very similar to Yoan's case. I think pandas should fix this warning later.
@ruhanbidart's solution is better because you turn off warnings just for the call to read_excel, but if you have dozens of calls to pd.read_excel, you can simply disable all warnings:
import warnings
warnings.simplefilter("ignore")
I had the exact same warning and was unable to read the file. In my case the problem was coming from the sheet name in the Excel file.
The initial name contained a . (e.g. MDM.TARGET). I simply replaced the . with _ and everything was fine.
In my situation some column names had a dollar sign ($) in them. Replacing '$' with '_' solved the issue.
I'm new to Databricks and need help writing a pandas dataframe to the Databricks local file system (DBFS).
I searched on Google but could not find any case similar to this, and also tried the help guide provided by Databricks (attached), but that did not work either. I attempted the changes below to try my luck; the commands run just fine, but the file never gets written to the directory (the expected wrtdftodbfs.txt file is not created).
df.to_csv("/dbfs/FileStore/NJ/wrtdftodbfs.txt")
Result: throws the below error
FileNotFoundError: [Errno 2] No such file or directory:
'/dbfs/FileStore/NJ/wrtdftodbfs.txt'
df.to_csv("\\dbfs\\FileStore\\NJ\\wrtdftodbfs.txt")
Result: No errors, but nothing written either
df.to_csv("dbfs\\FileStore\\NJ\\wrtdftodbfs.txt")
Result: No errors, but nothing written either
df.to_csv(path ="\\dbfs\\FileStore\\NJ\\",file="wrtdftodbfs.txt")
Result: TypeError: to_csv() got an unexpected keyword argument 'path'
df.to_csv("dbfs:\\FileStore\\NJ\\wrtdftodbfs.txt")
Result: No errors, but nothing written either
df.to_csv("dbfs:\\dbfs\\FileStore\\NJ\\wrtdftodbfs.txt")
Result: No errors, but nothing written either
The directory exists and files created manually show up, but pandas to_csv never writes anything, nor does it error out.
dbutils.fs.put("/dbfs/FileStore/NJ/tst.txt","Testing file creation and existence")
dbutils.fs.ls("dbfs/FileStore/NJ")
Out[186]: [FileInfo(path='dbfs:/dbfs/FileStore/NJ/tst.txt',
name='tst.txt', size=35)]
Appreciate your time and pardon me if the enclosed details are not clear enough.
Try this in your Databricks notebook:
import pandas as pd
from io import StringIO
data = """
CODE,L,PS
5d8A,N,P60490
5d8b,H,P80377
5d8C,O,P60491
"""
df = pd.read_csv(StringIO(data), sep=',')
#print(df)
df.to_csv('/dbfs/FileStore/NJ/file1.txt')
pandas_df = pd.read_csv("/dbfs/FileStore/NJ/file1.txt", header='infer')
print(pandas_df)
This worked out for me:
outname = 'pre-processed.csv'
outdir = '/dbfs/FileStore/'
dfPandas.to_csv(outdir+outname, index=False, encoding="utf-8")
To download the file, add files/filename to your notebook URL (before the question mark ?):
https://community.cloud.databricks.com/files/pre-processed.csv?o=189989883924552#
(you need to edit your home url, for me is :
https://community.cloud.databricks.com/?o=189989883924552#)
(Screenshot: DBFS file explorer)