Decoding UTF8 literals in a CSV file - python

Question:
Does anyone know how I could transform this b"it\\xe2\\x80\\x99s time to eat" into this: it's time to eat?
More details & my code:
Hello everyone,
I'm currently working with a CSV file which is full of rows with UTF8 literals in them, for example:
b"it\xe2\x80\x99s time to eat"
The end goal is to get something like this:
it's time to eat
To achieve this I have tried using the following code:
import pandas as pd
file_open = pd.read_csv("/Users/Downloads/tweets.csv")
file_open["text"]=file_open["text"].str.replace("b\'", "")
file_open["text"]=file_open["text"].str.encode('ascii').astype(str)
file_open["text"]=file_open["text"].str.replace("b\"", "")[:-1]
print(file_open["text"])
After running the code the row that I took as an example is printed out as:
it\xe2\x80\x99s time to eat
I have tried solving this issue using the following code to open the CSV file:
file_open = pd.read_csv("/Users/Downloads/tweets.csv", encoding = "utf-8")
which printed out the example row in the following manner:
it\xe2\x80\x99s time to eat
and I have also tried decoding the rows using this:
file_open["text"]=file_open["text"].str.decode('utf-8')
Which gave me the following error:
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
Thank you very much in advance for your help.

b"it\\xe2\\x80\\x99s time to eat" sounds like your file contains an escaped encoding.
In general, you can convert this to a proper Python3 string with something like:
x = b"it\\xe2\\x80\\x99s time to eat"
x = x.decode('unicode-escape').encode('latin1').decode('utf8')
print(x) # it’s time to eat
(Use of .encode('latin1') explained here)
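To make the role of each step visible, here is a minimal sketch that splits the one-liner into its three stages. The variable names step1 and step2 are only for illustration. The point of latin1 is that it maps code points 0 to 255 onto the identical byte values, so it turns the wrongly decoded characters back into the original UTF-8 bytes.

```python
raw = b"it\\xe2\\x80\\x99s time to eat"  # bytes that contain literal backslashes

# Stage 1: interpret the escape sequences; each \xNN becomes one code point
step1 = raw.decode("unicode_escape")

# Stage 2: latin1 turns code points 0-255 back into the same byte values
step2 = step1.encode("latin1")

# Stage 3: the bytes are now real UTF-8 and can be decoded normally
text = step2.decode("utf8")

print(repr(step1))
print(step2)
print(text)
```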
So, if after you use pd.read_csv(..., encoding="utf8") you still have escaped strings, you can do something like:
pd.read_csv(..., encoding="unicode-escape")
# ...
# Now, your values will be strings but improperly decoded:
# itâs time to eat
#
# So we encode to bytes then decode properly:
val = val.encode('latin1').decode('utf8')
print(val) # it’s time to eat
But I think it's probably better to do this to the whole file instead of to each value individually, for example with StringIO (if the file isn't too big):
from io import StringIO
import pandas as pd
# Read the csv file into a StringIO object
sio = StringIO()
with open('yourfile.csv', 'r', encoding='unicode-escape') as f:
    for line in f:
        line = line.encode('latin1').decode('utf8')
        sio.write(line)
sio.seek(0) # Reset file pointer to the beginning
# Call read_csv, passing the StringIO object
df = pd.read_csv(sio, encoding="utf8")

Related

Read in-memory csv file with open() in python

I have a dataframe. I want to write this dataframe from an excel file to a csv without saving it to disk. Looked like StringIO() was the clear solution. I then wanted to open the file like object from in memory with open() but was getting a type error. How do I get around this type error and read the in memory csv file with open()? Or, actually, is it wrong to assume the open() will work?
TypeError: expected str, bytes or os.PathLike object, not StringIO
The error points to the line below, which comes from the code further down.
f = open(writer_file)
To get a proper example I had to open the file "pandas_example" after creation. I then removed a row, then the code runs into an empty row.
from pandas import util
df = util.testing.makeDataFrame()
df.to_excel('pandas_example.xlsx')
df_1 = pd.read_excel('pandas_example.xlsx')
writer_file = io.StringIO()
write_to_the_in_mem_file = csv.writer(writer_file, dialect='excel', delimiter=',')
write_to_the_in_mem_file.writerow(df_1)
f = open(writer_file)
while f.readline() not in (',,,,,,,,,,,,,,,,,,\n', '\n'):
    pass
final_df = pd.read_csv(f, header=None)
f.close()
Think of writer_file as what is returned from open(); you don't need to open it again.
For example:
import pandas as pd
from pandas import util
import io
# Create test file
df = util.testing.makeDataFrame()
df.to_excel('pandas_example.xlsx')
df_1 = pd.read_excel('pandas_example.xlsx')
writer_file = io.StringIO()
df_1.to_csv(writer_file)
writer_file.seek(0) # Rewind back to the start
for line in writer_file:
    print(line.strip())
The to_csv() call writes the dataframe into your memory file in CSV format.
After writer_file = io.StringIO(), writer_file is already a file-like object. It already has a readline method, as well as read, write, seek, etc. See io.TextIOBase, which io.StringIO inherits from.
In other words, open(writer_file) is unnecessary or, rather, it results in a type error (as you already observed).
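The same idea can be sketched with the standard library alone. The sample rows below are made up for illustration. The buffer is written, rewound, and then read back directly, with no call to open():

```python
import csv
import io

# Hypothetical sample rows, used only for illustration
rows = [["name", "value"], ["a", "1"], ["b", "2"]]

buffer = io.StringIO()              # already a file-like object
csv.writer(buffer).writerows(rows)  # write CSV text into memory

buffer.seek(0)                      # rewind before reading
read_back = list(csv.reader(buffer))
print(read_back)
```

Forgetting the seek(0) call is the usual pitfall: the read would start at the end of the buffer and return nothing.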

Decoding bytes in pandas returns float instead of strings [duplicate]

I have used tweepy to store the text of tweets in a csv file using Python csv.writer(), but I had to encode the text in utf-8 before storing, otherwise tweepy throws a weird error.
Now, the text data is stored like this:
"b'Lorem Ipsum\xc2\xa0Assignment '"
I tried to decode this using this code (there is more data in other columns, text is in 3rd column):
with open('data.csv','rt',encoding='utf-8') as f:
    reader = csv.reader(f,delimiter=',')
    for row in reader:
        print(row[3])
But, it doesn't decode the text. I cannot use .decode('utf-8') as the csv reader reads data as strings i.e. type(row[3]) is 'str' and I can't seem to convert it into bytes, the data gets encoded once more!
How can I decode the text data?
Edit: Here's a sample line from the csv file:
67783591545656656999,3415844,1450443669.0,b'Virginia School District Closes After Backlash Over Arabic Assignment: The Augusta County school district in\xe2\x80\xa6 | #abcde',52,18
Note: If the solution is in the encoding process, please note that I cannot afford to download the entire data again.
The easiest way is as below. Try it out.
import csv
from io import StringIO
byte_content = b"iam byte content"
content = byte_content.decode()
file = StringIO(content)
csv_data = csv.reader(file, delimiter=",")
If your input file really contains strings with Python syntax b prefixes on them, one way to work around it (even though it's not really a valid format for csv data to contain) would be to use Python's ast.literal_eval() function as @Ry suggested, although I would use it in a slightly different manner, as shown below.
This will provide a safe way to parse strings in the file which are prefixed with a b indicating they are byte-strings. The rest will be passed through unchanged.
Note that this doesn't require reading the entire CSV file into memory.
import ast
import csv
def _parse_bytes(field):
    """Convert string represented in Python byte-string literal b'' syntax into
    a decoded character string - otherwise return it unchanged.
    """
    try:
        result = ast.literal_eval(field)
    except (ValueError, SyntaxError):
        return field  # Not a Python literal, so pass it through unchanged.
    return result.decode() if isinstance(result, bytes) else field
def my_csv_reader(filename, /, **kwargs):
    with open(filename, 'r', newline='') as file:
        for row in csv.reader(file, **kwargs):
            yield [_parse_bytes(field) for field in row]
reader = my_csv_reader('bytes_data.csv', delimiter=',')
for row in reader:
    print(row)
You can use ast.literal_eval to convert the incorrect fields back to bytes safely:
import ast
def _parse_bytes(bytes_repr):
    result = ast.literal_eval(bytes_repr)
    if not isinstance(result, bytes):
        raise ValueError("Malformed bytes repr")
    return result
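As a quick check of the ast.literal_eval approach on the sample value from the question, here is a small sketch. The helper name decode_bytes_repr is hypothetical, and it assumes the field holds a complete Python bytes literal whose content is UTF-8.

```python
import ast

def decode_bytes_repr(field):
    """Turn text like "b'...'" into a decoded str; return other text unchanged."""
    try:
        value = ast.literal_eval(field)
    except (ValueError, SyntaxError):
        return field  # not a Python literal at all
    return value.decode("utf-8") if isinstance(value, bytes) else field

# The field as the csv reader sees it: literal backslashes, not real bytes
sample = "b'Lorem Ipsum\\xc2\\xa0Assignment '"

print(repr(decode_bytes_repr(sample)))
print(decode_bytes_repr("plain text"))
```

Fields that parse as some other literal, such as the numeric columns, are returned as the original text rather than converted.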

Import Data from scraping into CSV

I'm using pycharm and Python 3.7.
I would like to write data to a csv, but my code writes only the first line of my data to the file. Does anyone know why?
This is my code:
from pytrends.request import TrendReq
import csv
pytrend = TrendReq()
pytrend.build_payload(kw_list=['auto model A',
'auto model C'])
# Interest Over Time
interest_over_time_df = pytrend.interest_over_time()
print(interest_over_time_df.head(100))
writer=csv.writer(open("C:\\Users\\
Desktop\\Data\\c.csv", 'w', encoding='utf-8'))
writer.writerow(interest_over_time_df)
Try using pandas:
import pandas as pd
interest_over_time_df.to_csv("file.csv")
I once ran into the same problem and solved it like this:
with open("file.csv", "rb") as fh:
In detail:
r = read mode
b = binary mode: the file is treated as binary, so its contents stay as bytes and no decoding is attempted. Binary mode does not accept an encoding argument.
Otherwise Python tries to convert the byte array (a bytes object that it assumes to be a UTF-8-encoded string) to a Unicode string (str). This is decoding according to UTF-8 rules. When it tries this, it encounters a byte sequence that is not allowed in UTF-8-encoded strings (namely this 0xff at position 0).
You could try something like:
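import csv
# In Python 3, csv.writer needs a text-mode file opened with newline=""
with open(<path to output_csv>, "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    # itertuples() yields one tuple per row, with the index first
    for line in interest_over_time_df.itertuples():
        writer.writerow(line)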
Read more here: https://www.pythonforbeginners.com/files/with-statement-in-python
You need to loop over the data and write it line by line.
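To illustrate the difference between writing one row and writing every row, here is a small sketch with the standard csv module. The rows are invented stand-ins for the scraped data, and an in-memory buffer replaces the real output file.

```python
import csv
import io

header = ["date", "auto model A", "auto model C"]
rows = [["2019-01-06", 50, 40], ["2019-01-13", 55, 42]]  # hypothetical values

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)  # writerow() writes exactly one line
for row in rows:         # so every data row needs its own call
    writer.writerow(row)

lines = buffer.getvalue().splitlines()
print(len(lines))
```

The loop can also be replaced by a single writer.writerows(rows) call.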

Strange character while reading a CSV file

I try to read a CSV file in Python, but the first element in the first row is read with a strange character attached to the 0, although that character isn't in the file; it's just a simple 0. Here is the code I used:
matriceDist=[]
file=csv.reader(open("distanceComm.csv","r"),delimiter=";")
for row in file:
    matriceDist.append(row)
print (matriceDist)
I had this same issue. Save your Excel file as CSV (MS-DOS) instead of UTF-8 and those odd characters should be gone.
Specifying the byte order mark when opening the file as follows solved my issue:
open('inputfilename.csv', 'r', encoding='utf-8-sig')
Just using pandas with an explicit encoding (utf-8, for example) is going to be easier:
import pandas as pd
df = pd.read_csv('distanceComm.csv', header=None, encoding = 'utf8', delimiter=';')
print(df)
I don't know what your input file is, but since it has a UTF-8 byte order mark, you can use the utf-8-sig codec, which strips it:
import codecs
import csv
matriceDist=[]
file=csv.reader(codecs.open('distanceComm.csv', encoding='utf-8-sig'),delimiter=";")
for row in file:
    matriceDist.append(row)
print (matriceDist)
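The effect of the byte order mark can be reproduced without a file. In this sketch the sample bytes are made up; it only shows how the utf-8 and utf-8-sig codecs treat the same input.

```python
import codecs

# Hypothetical file content: a UTF-8 byte order mark followed by the first row
data = codecs.BOM_UTF8 + "0;1;2\n".encode("utf-8")

with_bom = data.decode("utf-8")         # the mark survives as U+FEFF
without_bom = data.decode("utf-8-sig")  # the mark is stripped

print(repr(with_bom))
print(repr(without_bom))
```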

Pandas guess delimiter with sep=None

Pandas documentation has this:
With sep=None, read_csv will try to infer the delimiter automatically
in some cases by “sniffing”.
How can I access pandas' guess for the delimiter?
I want to read in 10 lines of my file, have pandas guess the delimiter, and start up my GUI with that delimiter already selected. But I don't know how to access what pandas thinks is the delimiter.
Also, is there a way to pass pandas a list of strings to restrict its guesses to?
Looking at the source code, I doubt that it's possible to get the delimiter out of read_csv. But pandas internally uses the Sniffer class from the csv module. Here's an example that should get you going:
import csv
s = csv.Sniffer()
print(s.sniff("a,b,c").delimiter)
print(s.sniff("a;b;c").delimiter)
print(s.sniff("a#b#c").delimiter)
Output:
,
;
#
What remains is reading the first line from a file and feeding it to the Sniffer.sniff() function, but I'll leave that up to you.
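One possible way to do that remaining step is sketched below. The function name guess_delimiter and its defaults are hypothetical. It reads the first ten lines, and it uses the delimiters argument of Sniffer.sniff() to restrict the guesses to a given set of characters. An in-memory buffer stands in for the real file.

```python
import csv
import io
from itertools import islice

def guess_delimiter(file_obj, candidates=",;\t|", max_lines=10):
    """Sniff the delimiter from the first max_lines lines of an open text file."""
    sample = "".join(islice(file_obj, max_lines))
    # delimiters limits the characters the sniffer is allowed to pick
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter

# In-memory stand-in for a real file, with made-up content
fake_file = io.StringIO("name;age;city\nAlice;30;Paris\nBob;25;Rome\n")
delimiter = guess_delimiter(fake_file)
print(repr(delimiter))
```

sniff() raises csv.Error when it cannot determine a delimiter, so a GUI would probably want to catch that and fall back to a default.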
The csv.Sniffer is the simplest solution, but it doesn't work if you need to use compressed files.
Here's what's working, although it uses a private member, so beware:
reader = pd.read_csv('path/to/file.tar.gz', sep=None, engine='python', iterator=True)
sep = reader._engine.data.dialect.delimiter
reader.close()
