String methods fail with Modin, but the same code works with Pandas - python

I'm currently trying to improve processing speed on several large log files, to extract some metrics and then store them in a Postgres database. Right now I'm working on just the first step: filtering down to only the relevant lines of the log after processing them.
This is the sample code that currently works in regular Pandas:
import os
import regex as re
import pandas as pd
fp = "server.log"
data_lines = []
with open(fp, "rt", encoding="utf8") as file:
    lines = file.readlines()
    # data_lines += [
    #     line for line in lines
    #     if "POST" in line
    # ]
    data_lines += lines
# Processing
df = pd.DataFrame({"src": data_lines})
df.src = df.src.astype("string")
df = df[df.src.str.contains("POST")]
But when I try to replace import pandas as pd with import modin.pandas as pd, I get this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xee in position 67: invalid continuation byte
As shown, the text file is being opened with the correct encoding, and no error is thrown when the same code runs with plain Pandas. Please advise in case this is not the intended way to use Modin.
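One workaround worth trying (a sketch only, assuming the offending bytes can be replaced rather than preserved) is to decode leniently before the data ever reaches Modin:
# Hypothetical workaround: errors="replace" turns any byte that is not valid
# UTF-8 into U+FFFD instead of raising, so Modin's workers never see raw bad bytes.
with open(fp, "rt", encoding="utf8", errors="replace") as file:
    data_lines = file.readlines()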

Related

Read CSV file into pandas dataframe from FTPS server

I am unable to grab the data from a CSV file on an FTPS server to put it into a pandas dataframe. I am able to get into the directory and see all of the files there, but I haven't been able to access the document itself.
Here is my code:
from ftplib import FTP_TLS
import socket
import pandas as pd

server = FTP_TLS('server', certfile=r'C:/')
server.login(user, pw)
# get into respective directory
server.cwd('Banana')
server.prot_p()
# This piece here is needed in order to see what is in my directory; I don't understand why.
# Something about the server not being set up correctly?
server.af = socket.AF_INET6
# check location
server.pwd()
# check files
server.dir()
# Get CSV file data
import io
download_file = io.BytesIO()
server.retrbinary('RETR file.csv', download_file.write)
download_file.seek(0)
file_to_process = pd.read_csv(download_file, engine='python')
The problem I get is that the last block, from import io down to file_to_process, just sits there and does nothing. Maybe it times out? I'm unsure of the issue.
New error is this:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 3376: character maps to <undefined>
Edit: Now I'm trying to save to disk. But this code deletes the contents of the file. Do I not understand how write works?
filematch = 'Try20.csv'
target_dir = r'\\server'
import os
for filename in server.nlst(filematch):
    target_file_name = os.path.join(target_dir, os.path.basename(filename))
    with open(target_file_name, 'wb') as fhandle:
        server.retrbinary('RETR %s' % filename, fhandle.write)
Secondly, I don't understand how to get the contents of fhandle into a dataframe now.
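For that last part, one approach (a sketch only, assuming the download loop above succeeded; latin-1 is an assumption here, chosen because it accepts every byte and so avoids the earlier 'charmap' error) is to read the saved file back from disk:
import pandas as pd

# Hypothetical: read the file the loop just wrote; latin-1 never raises
# UnicodeDecodeError, though it may mis-render some non-Latin characters.
df = pd.read_csv(target_file_name, encoding='latin-1')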

"UnicodeEncodeError: 'charmap' codec can't encode character" When Writing to csv Using a Webscraper

I've written a webscraper that scrapes NBA box score data from Basketball-Reference. The specific webpage where my error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u0107' in position 11: character maps to <undefined>
is occurring is here. Lastly, the specific player data that trips it up and throws this UnicodeEncodeError is this one (although I am sure the error is more general and would be produced by any name containing a less common accent mark).
The minimal reproducible code:
def get_boxscore_basic_table(tag):  # used to select only specific tables
    tag_id = tag.get("id")
    tag_class = tag.get("class")
    return (tag_id and tag_class) and ("basic" in tag_id and "section_wrapper" in tag_class and "toggleable" not in tag_class)
import requests
from bs4 import BeautifulSoup
import lxml
import csv
import re
website = 'https://www.basketball-reference.com/boxscores/202003110MIA.html'
r = requests.get(website).text
soup = BeautifulSoup(r, 'lxml')
tables = soup.find_all(get_boxscore_basic_table)
in_file = open('boxscore.csv', 'w', newline='')
csv_writer = csv.writer(in_file)
column_names = ['Player','Name','MP','FG','FGA','FG%','3P','3PA','3P%','FT','FTA','FT%','ORB','DRB','TRB','AST','STL','BLK','TOV','PF','PTS','+/-']
csv_writer.writerow(column_names)
for table in tables:
    rows = table.select('tbody tr')
    for row in rows:
        building_player = []  # temporary container to hold player and stats
        player_name = row.th.text
        if 'Reserves' not in player_name:
            building_player.append(player_name)
            stats = row.select('td.right')
            for stat in stats:
                building_player.append(stat.text)
            csv_writer.writerow(building_player)  # writing to csv
in_file.close()
What is the best way around this?
I've seen some suggestions online about changing the encoding, and specifically about calling the .encode('utf-8') method on the string before writing to the csv, but it seems that this .encode() method, although it stops the error from being thrown, has several problems of its own. For instance, player_name.encode('utf-8') before writing turns the name 'Willy Hernangómez' into "b'Willy Hernang\xc3\xb3mez'" within my csv... not exactly a step in the right direction.
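I believe what's happening there is that csv.writer calls str() on each cell, and in Python 3 str() of a bytes object keeps the b'...' wrapper; a quick illustration:
# str() of a bytes object includes the b'...' repr, which is exactly
# what ends up in the csv cell
print(str('Willy Hernangómez'.encode('utf-8')))  # prints: b'Willy Hernang\xc3\xb3mez'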
Any help with this and an explanation as to what is happening would be much appreciated!
Use
in_file = open('boxscore.csv', 'w', newline='', encoding='utf-8')
instead of
in_file = open('boxscore.csv', 'w', newline='')
and keep everything else the same. Make sure you open the file in Excel with UTF-8 encoding.
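If Excel still garbles the accented names, writing a byte-order mark usually lets it auto-detect the encoding; a small variant of the same line (assumption: you intend to open the csv directly in Excel):
# utf-8-sig prepends a BOM, which Excel uses to detect UTF-8
in_file = open('boxscore.csv', 'w', newline='', encoding='utf-8-sig')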

Reading from csv to pandas, chardet and error bad lines options do not work in my case

I checked similar questions before writing here. I also tried try/except, where try does nothing and except prints the bad line, but that didn't solve my issue. This is what I currently have:
import csv
import pandas as pd
import chardet

# Detect the file's encoding first
with open("full_data.csv", 'rb') as f:
    result = chardet.detect(f.read())  # or readline if the file is large

df1 = pd.read_csv("full_data.csv", sep=';', encoding=result['encoding'],
                  error_bad_lines=False, low_memory=False, quoting=csv.QUOTE_NONE)
But I still get the error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 9: invalid start byte
Is there any option similar to errors='replace' in open()? Or any other solution?
Using the engine option solves my problem:
df1 = pd.read_csv("full_data.csv", sep=";", engine="python")
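If you are on pandas 1.3 or newer, there is also a direct analogue of open()'s errors='replace'; a sketch using the same file and separator as above:
# encoding_errors was added in pandas 1.3; invalid bytes become U+FFFD
df1 = pd.read_csv("full_data.csv", sep=";", encoding_errors="replace")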

Unable to convert JSON file into a dataframe

I am building a recommendation engine. This JSON file contains event data, and I want to convert it into a dataframe. I tried the read_json method, but it gives an error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81
in position 21573281: character maps to <undefined>
Below are some entries from the JSON file:
{"_id":{"$oid":"57a30ce268fd0809ec4d194f"},"session":{"start_timestamp":{"$numberLong":"1470183490481"},"session_id":"def5faa9-20160803-001810481"},"metrics":{},"arrival_timestamp":{"$numberLong":"1470183523054"},"event_type":"OfferViewed","event_timestamp":{"$numberLong":"1470183505399"},"event_version":"3.0","application":{"package_name":"com.think.vito","title":"Vito","version_code":"5","app_id":"7ffa58dab3c646cea642e961ff8a8070","cognito_identity_pool_id":"us-east-1:4d9cf803-0487-44ec-be27-1e160d15df74","version_name":"2.0.0.0","sdk":{"version":"2.2.2","name":"aws-sdk-android"}},"client":{"cognito_id":"us-east-1:2e26918b-f7b1-471e-9df4-b931509f7d37","client_id":"ee0b61b0-85cf-4b2f-960e-e2aedef5faa9"},"device":{"locale":{"country":"US","code":"en_US","language":"en"},"platform":{"version":"5.1.1","name":"ANDROID"},"make":"YU","model":"AO5510"},"attributes":{"Category":"120000","CustomerID":"4078","OfferID":"45436"}}
{"_id":{"$oid":"57a30ce268fd0809ec4d1950"},"session":{"start_timestamp":{"$numberLong":"1470183490481"},"session_id":"def5faa9-20160803-001810481"},"metrics":{},"arrival_timestamp":{"$numberLong":"1470183523054"},"event_type":"ContextMenuItemSelected","event_timestamp":{"$numberLong":"1470183500206"},"event_version":"3.0","application":{"package_name":"com.think.vito","title":"Vito","version_code":"5","app_id":"7ffa58dab3c646cea642e961ff8a8070","cognito_identity_pool_id":"us-east-1:4d9cf803-0487-44ec-be27-1e160d15df74","version_name":"2.0.0.0","sdk":{"version":"2.2.2","name":"aws-sdk-android"}},"client":{"cognito_id":"us-east-1:2e26918b-f7b1-471e-9df4-b931509f7d37","client_id":"ee0b61b0-85cf-4b2f-960e-e2aedef5faa9"},"device":{"locale":{"country":"US","code":"en_US","language":"en"},"platform":{"version":"5.1.1","name":"ANDROID"},"make":"YU","model":"AO5510"},"attributes":{"MenuItem":"OfferList","CustomerID":"4078"}}
{"_id":{"$oid":"57a30ce268fd0809ec4d1951"},"session":{"start_timestamp":{"$numberLong":"1470183490481"},"session_id":"def5faa9-20160803-001810481"},"metrics":{},"arrival_timestamp":{"$numberLong":"1470183523054"},"event_type":"CategoryPageCategorySelection","event_timestamp":{"$numberLong":"1470183499171"},"event_version":"3.0","application":{"package_name":"com.think.vito","title":"Vito","version_code":"5","app_id":"7ffa58dab3c646cea642e961ff8a8070","cognito_identity_pool_id":"us-east-1:4d9cf803-0487-44ec-be27-1e160d15df74","version_name":"2.0.0.0","sdk":{"version":"2.2.2","name":"aws-sdk-android"}},"client":{"cognito_id":"us-east-1:2e26918b-f7b1-471e-9df4-b931509f7d37","client_id":"ee0b61b0-85cf-4b2f-960e-e2aedef5faa9"},"device":{"locale":{"country":"US","code":"en_US","language":"en"},"platform":{"version":"5.1.1","name":"ANDROID"},"make":"YU","model":"AO5510"},"attributes":{"Category":"Recharge","CustomerID":"4078"}}
{"_id":{"$oid":"57a30ce268fd0809ec4d1952"},"session":{"start_timestamp":{"$numberLong":"1470183490481"},"session_id":"def5faa9-20160803-001810481"},"metrics":{},"arrival_timestamp":{"$numberLong":"1470183523054"},"event_type":"_session.start","event_timestamp":{"$numberLong":"1470183490481"},"event_version":"3.0","application":{"package_name":"com.think.vito","title":"Vito","version_code":"5","app_id":"7ffa58dab3c646cea642e961ff8a8070","cognito_identity_pool_id":"us-east-1:4d9cf803-0487-44ec-be27-1e160d15df74","version_name":"2.0.0.0","sdk":{"version":"2.2.2","name":"aws-sdk-android"}},"client":{"cognito_id":"us-east-1:2e26918b-f7b1-471e-9df4-b931509f7d37","client_id":"ee0b61b0-85cf-4b2f-960e-e2aedef5faa9"},"device":{"locale":{"country":"US","code":"en_US","language":"en"},"platform":{"version":"5.1.1","name":"ANDROID"},"make":"YU","model":"AO5510"},"attributes":{"CustomerID":"4078"}}
{"_id":{"$oid":"57a30ce268fd0809ec4d1953"},"session":{"start_timestamp":{"$numberLong":"1470181311752"},"session_id":"def5faa9-20160802-234151752","stop_timestamp":{"$numberLong":"1470181484875"}},"metrics":{},"arrival_timestamp":{"$numberLong":"1470183523054"},"event_type":"_session.stop","event_timestamp":{"$numberLong":"1470183490480"},"event_version":"3.0","application":{"package_name":"com.think.vito","title":"Vito","version_code":"5","app_id":"7ffa58dab3c646cea642e961ff8a8070","cognito_identity_pool_id":"us-east-1:4d9cf803-0487-44ec-be27-1e160d15df74","version_name":"2.0.0.0","sdk":{"version":"2.2.2","name":"aws-sdk-android"}},"client":{"cognito_id":"us-east-1:2e26918b-f7b1-471e-9df4-b931509f7d37","client_id":"ee0b61b0-85cf-4b2f-960e-e2aedef5faa9"},"device":{"locale":{"country":"US","code":"en_US","language":"en"},"platform":{"version":"5.1.1","name":"ANDROID"},"make":"YU","model":"AO5510"},"attributes":{}}
{"_id":{"$oid":"57a30ce268fd0809ec4d1954"},"session":{"start_timestamp":{"$numberLong":"1470193238841"},"session_id":"7b606a93-20160803-030038841"},"metrics":{},"arrival_timestamp":{"$numberLong":"1470193295093"},"event_type":"_session.start","event_timestamp":{"$numberLong":"1470193238844"},"event_version":"3.0","application":{"package_name":"com.think.vito","title":"Vito","version_code":"2","app_id":"7ffa58dab3c646cea642e961ff8a8070","cognito_identity_pool_id":"us-east-1:4d9cf803-0487-44ec-be27-1e160d15df74","version_name":"1.0.2","sdk":{"version":"2.2.2","name":"aws-sdk-android"}},"client":{"cognito_id":"us-east-1:e96515c9-5824-4c66-a42f-33cceb78b6e3","client_id":"efed74fd-40d8-41a2-b37e-e85c7b606a93"},"device":{"locale":{"country":"GB","code":"en_GB","language":"en"},"platform":{"version":"5.1.1","name":"ANDROID"},"make":"samsung","model":"SM-J200G"},"attributes":{}}
{"_id":{"$oid":"57a30ce268fd0809ec4d1955"},"session":{"start_timestamp":{"$numberLong":"1470193253960"},"session_id":"7b606a93-20160803-030053960","stop_timestamp":{"$numberLong":"1470193256359"}},"metrics":{},"arrival_timestamp":{"$numberLong":"1470193404776"},"event_type":"_session.stop","event_timestamp":{"$numberLong":"1470193278227"},"event_version":"3.0","application":{"package_name":"com.think.vito","title":"Vito","version_code":"2","app_id":"7ffa58dab3c646cea642e961ff8a8070","cognito_identity_pool_id":"us-east-1:4d9cf803-0487-44ec-be27-1e160d15df74","version_name":"1.0.2","sdk":{"version":"2.2.2","name":"aws-sdk-android"}},"client":{"cognito_id":"us-east-1:e96515c9-5824-4c66-a42f-33cceb78b6e3","client_id":"efed74fd-40d8-41a2-b37e-e85c7b606a93"},"device":{"locale":{"country":"GB","code":"en_GB","language":"en"},"platform":{"version":"5.1.1","name":"ANDROID"},"make":"samsung","model":"SM-J200G"},"attributes":{}}
{"_id":{"$oid":"57a30ce268fd0809ec4d1956"},"session":{"start_timestamp":{"$numberLong":"1470193253960"},"session_id":"7b606a93-20160803-030053960"},"metrics":{},"arrival_timestamp":{"$numberLong":"1470193404776"},"event_type":"_session.start","event_timestamp":{"$numberLong":"1470193253960"},"event_version":"3.0","application":{"package_name":"com.think.vito","title":"Vito","version_code":"2","app_id":"7ffa58dab3c646cea642e961ff8a8070","cognito_identity_pool_id":"us-east-1:4d9cf803-0487-44ec-be27-1e160d15df74","version_name":"1.0.2","sdk":{"version":"2.2.2","name":"aws-sdk-android"}},"client":{"cognito_id":"us-east-1:e96515c9-5824-4c66-a42f-33cceb78b6e3","client_id":"efed74fd-40d8-41a2-b37e-e85c7b606a93"},"device":{"locale":{"country":"GB","code":"en_GB","language":"en"},"platform":{"version":"5.1.1","name":"ANDROID"},"make":"samsung","model":"SM-J200G"},"attributes":{}}
{"_id":{"$oid":"57a30ce268fd0809ec4d1957"},"session":{"start_timestamp":{"$numberLong":"1470193238841"},"session_id":"7b606a93-20160803-030038841","stop_timestamp":{"$numberLong":"1470193244581"}},"metrics":{},"arrival_timestamp":{"$numberLong":"1470193404776"},"event_type":"_session.stop","event_timestamp":{"$numberLong":"1470193253959"},"event_version":"3.0","application":{"package_name":"com.think.vito","title":"Vito","version_code":"2","app_id":"7ffa58dab3c646cea642e961ff8a8070","cognito_identity_pool_id":"us-east-1:4d9cf803-0487-44ec-be27-1e160d15df74","version_name":"1.0.2","sdk":{"version":"2.2.2","name":"aws-sdk-android"}},"client":{"cognito_id":"us-east-1:e96515c9-5824-4c66-a42f-33cceb78b6e3","client_id":"efed74fd-40d8-41a2-b37e-e85c7b606a93"},"device":{"locale":{"country":"GB","code":"en_GB","language":"en"},"platform":{"version":"5.1.1","name":"ANDROID"},"make":"samsung","model":"SM-J200G"},"attributes":{}}
{"_id":{"$oid":"57a30ce268fd0809ec4d1958"},"session":{"start_timestamp":{"$numberLong":"1470193331290"},"session_id":"7b606a93-20160803-030211290"},"metrics":{},"arrival_timestamp":{"$numberLong":"1470193404776"},"event_type":"_session.start","event_timestamp":{"$numberLong":"1470193331291"},"event_version":"3.0","application":{"package_name":"com.think.vito","title":"Vito","version_code":"2","app_id":"7ffa58dab3c646cea642e961ff8a8070","cognito_identity_pool_id":"us-east-1:4d9cf803-0487-44ec-be27-1e160d15df74","version_name":"1.0.2","sdk":{"version":"2.2.2","name":"aws-sdk-android"}},"client":{"cognito_id":"us-east-1:e96515c9-5824-4c66-a42f-33cceb78b6e3","client_id":"efed74fd-40d8-41a2-b37e-e85c7b606a93"},"device":{"locale":{"country":"GB","code":"en_GB","language":"en"},"platform":{"version":"5.1.1","name":"ANDROID"},"make":"samsung","model":"SM-J200G"},"attributes":{}}
Wrong encoding. Explicitly read it as UTF-8, and strip the 'dirty' line feeds (LF, aka \n):
import pandas

with open(datafilename, encoding="utf8") as f:
    # Reading the file as a list of lines
    data = f.readlines()
# Removing useless whitespace
data = [line.rstrip() for line in data]
# Re-joining the lines into one newline-delimited JSON string
data = '\n'.join(data)
# Loading the dataframe from the JSON string (one object per line)
df = pandas.read_json(data, lines=True)
You could try using:
import json

with open('myfile.json') as json_data:
    d = json.load(json_data)
print(d)
Without more info it's difficult to advise.
As the error says, you have an issue with the encoding. When you read in the file, you need to change the encoding:
file = open(filename, encoding="utf8")
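Since the sample above is newline-delimited JSON (one object per line), it may be simplest to let pandas handle both concerns at once; a sketch, with 'events.json' standing in for the actual (unnamed) file:
import pandas as pd

# lines=True parses one JSON object per line; the encoding is passed through
df = pd.read_json('events.json', lines=True, encoding='utf8')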

Properly encoding sc.textFile data (python 2.7)

My CSV was originally created by Excel. Anticipating encoding anomalies, I opened and re-saved the file with UTF-8 BOM encoding using Sublime Text.
Imported into the notebook:
filepath = "file:///Volumes/PASSPORT/Inserts/IMAGETRAC/csv/universe_wcsv.csv"
uverse = sc.textFile(filepath)
header = uverse.first()
data = uverse.filter(lambda x: x != header)
Formatted my fields:
fields = header.replace(" ", "_").replace("/", "_").split(",")
Structured the data:
import csv
from StringIO import StringIO
from collections import namedtuple
Products = namedtuple("Products", fields, verbose=True)
def parse(row):
    reader = csv.reader(StringIO(row))
    row = reader.next()
    return Products(*row)
products = data.map(parse)
If I then do products.first(), I get the first record as expected. However, if I want to, say, see the count by brand and so run:
products.map(lambda x: x.brand).countByValue()
I still get a UnicodeEncodeError-related Py4JJavaError:
File "<ipython-input-18-4cc0cb8c6fe7>", line 3, in parse
UnicodeEncodeError: 'ascii' codec can't encode character u'\xab' in
position 125: ordinal not in range(128)
How can I fix this code?
The csv module in legacy Python versions doesn't support Unicode input. Personally, I would recommend using the Spark csv data source:
df = spark.read.option("header", "true").csv(filepath)
fields = [c.strip().replace(" ", "_").replace("/", "_") for c in df.columns]
df.toDF(*fields).rdd
For most applications, Row objects should work as well as namedtuple (Row extends tuple and provides similar attribute getters), but you can easily convert one into the other.
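For instance, a hypothetical conversion back to the Products namedtuple from the question (Row is a tuple subclass, so unpacking works directly):
products = df.toDF(*fields).rdd.map(lambda row: Products(*row))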
You could also try reading the data without decoding:
uverse = sc.textFile(filepath, use_unicode=False)
and decoding fields manually after initial parsing:
(data
    .map(parse)
    .map(lambda prod: Products(*[x.decode("utf-8") for x in prod])))
Related question Reading a UTF8 CSV file with Python
