How to read files (with special characters) with Pandas?

I have an encoding/decoding problem.
Although I used "utf-8" to read the file into a DataFrame with the code shown below, the characters look very different in the output. The language is French. I would be very happy if you could help with this; thank you in advance.
The first line of the data examined:
b"Sur la #route des stations ou de la maison\xf0\x9f\x9a\x98\xe2\x9d\x84\xef\xb8\x8f?\nCet apr\xc3\xa8s-midi, les #gendarmes veilleront sur vous, comme dans l'#Yonne, o\xc3\xb9 les exc\xc3\xa8s de #vitesse & les comportements dangereux des usagers de l'#A6 seront verbalis\xc3\xa9s\xe2\x9a\xa0\xef\xb8\x8f\nAlors prudence, \xc3\xa9quipez-vous & n'oubliez-pas la r\xc3\xa8gle des 3\xf0\x9f\x85\xbf\xef\xb8\x8f !"
import pandas as pd
data = pd.read_csv('C:\\Users\\Lenovo\\Desktop\\gendarmerie_tweets.csv', delimiter=";", encoding="utf-8")
data.head()
Output:
text
0 b"Sur la #route des stations ou de la maison\x...
1 b"#Guyane Soutien \xc3\xa0 nos 10 #gendarmes e...
2 b'#CoupDeCoeur \xf0\x9f\x92\x99 Journ\xc3\xa9e...
3 b'RT #servicepublicfr: \xf0\x9f\x97\xb3\xef\xb...
4 b"\xe2\x9c\x85 7 personnes interpell\xc3\xa9es...

I believe in cases like this you can try a different encoding. The encoding parameter that might help you solve this issue is 'ISO-8859-1':
data = pd.read_csv('C:\\Users\\Lenovo\\Desktop\\gendarmerie_tweets.csv', delimiter=";", encoding='iso-8859-1')
Edit:
Given the output of reading the file:
<_io.TextIOWrapper name='C:\\Users\\Lenovo\\Desktop\\gendarmerie_tweets.csv' mode='r' encoding='cp1254'>
According to Python's codecs, cp1254 (alias windows-1254) is the Turkish code page, so I suggested trying latin5 and windows-1254 too, but none of these options seems to help.
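Note that the b"..." prefixes in the output suggest each cell contains the text of a Python bytes literal rather than the decoded text itself; in that case no encoding argument to read_csv will help. A minimal sketch of converting those literals back to real strings (assuming the column is named text, as the output suggests, and that every cell really is a valid bytes literal):
import ast
import pandas as pd

data = pd.read_csv('C:\\Users\\Lenovo\\Desktop\\gendarmerie_tweets.csv', delimiter=";", encoding="utf-8")
# Re-parse each cell as a bytes literal, then decode the raw bytes as UTF-8
data['text'] = data['text'].apply(lambda s: ast.literal_eval(s).decode('utf-8'))
data.head()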

Related

Lemmatization taking forever with Spacy

I'm trying to lemmatize chat records in a dataframe using spaCy. My code is:
nlp = spacy.load("es_core_news_sm")
df["text_lemma"] = df["text"].apply(lambda row: " ".join([w.lemma_ for w in nlp(row)]))
I have approx. 600,000 rows and the apply takes more than two hours to execute. Is there a faster package/way to lemmatize? (I need a solution that works for Spanish.)
I have only tried using spacy package
The slowdown comes from the multiple calls to the spaCy pipeline via nlp(). The faster way to process large texts is to process them as a stream using nlp.pipe(). When I tested this on 5000 rows of dummy text, it offered a ~3.874x speed improvement (~9.759 sec vs ~2.519 sec) over the original method. There are ways to improve this further if required; see this checklist for spaCy optimisation I made.
Solution
# Assume dataframe (df) already contains a column "text" with the texts
import spacy

# Load spaCy pipeline
nlp = spacy.load("es_core_news_sm")

# Process the texts as a stream via `nlp.pipe()` and iterate over the results, extracting lemmas
lemma_text_list = []
for doc in nlp.pipe(df["text"]):
    lemma_text_list.append(" ".join(token.lemma_ for token in doc))
df["text_lemma"] = lemma_text_list
Full code for testing timings
import spacy
import pandas as pd
import time

# Random Spanish sentences
rand_es_sentences = [
    "Tus drafts influirán en la puntuación de las cartas según tu número de puntos DCI.",
    "Información facilitada por la División de Conferencias de la OMI en los cuestionarios enviados por la DCI.",
    "Oleg me ha dicho que tenías que decirme algo.",
    "Era como tú, muy buena con los ordenadores.",
    "Mas David tomó la fortaleza de Sion, que es la ciudad de David."]

# Duplicate sentences specified number of times
es_text = [sent for i in range(1000) for sent in rand_es_sentences]
# Create data-frame
df = pd.DataFrame({"text": es_text})
# Load spaCy pipeline
nlp = spacy.load("es_core_news_sm")

# Original method (very slow due to multiple calls to `nlp()`)
t0 = time.time()
df["text_lemma_1"] = df["text"].apply(lambda row: " ".join([w.lemma_ for w in nlp(row)]))
t1 = time.time()
print("Total time: {}".format(t1 - t0))  # ~9.759 seconds on 5000 rows

# Faster method processing rows as stream via `nlp.pipe()`
t0 = time.time()
lemma_text_list = []
for doc in nlp.pipe(df["text"]):
    lemma_text_list.append(" ".join(token.lemma_ for token in doc))
df["text_lemma_2"] = lemma_text_list
t1 = time.time()
print("Total time: {}".format(t1 - t0))  # ~2.519 seconds on 5000 rows

Error when trying to encode text in python

I've been trying for hours to get this text to display correctly. I tried several approaches with utf-8 and nothing works. Can anybody help me?
I believe this is not a duplicate question, because I have tried everything and failed.
Here is an example of one of the codes I used:
string_old = u"\u00c2\u00bfQu\u00c3\u00a9 le pasar\u00c3\u00a1 a quien desobedezca los mandamientos? "
print(string_old.encode("utf-8"))
Result:
>>> b'\xc3\x82\xc2\xbfQu\xc3\x83\xc2\xa9 le pasar\xc3\x83\xc2\xa1 a quien desobedezca los mandamientos? '
I expect the following result:
>>> "¿Qué le pasará a quien desobedezca los mandamientos? "
The string was wrongly decoded as Latin1 (or cp1252):
string_old.encode('latin1').decode('utf8')
# '¿Qué le pasará a quien desobedezca los mandamientos? '
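This works because Latin-1 maps the 256 byte values one-to-one onto the first 256 code points, so .encode('latin1') losslessly recovers the raw UTF-8 bytes that were decoded with the wrong codec. A minimal round trip reproducing and then repairing the mojibake:
original = "¿Qué le pasará a quien desobedezca los mandamientos? "
mojibake = original.encode("utf8").decode("latin1")  # reproduces the broken string
print(mojibake)                                      # Â¿QuÃ© le pasarÃ¡ ...
print(mojibake.encode("latin1").decode("utf8"))      # ¿Qué le pasará ...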

How to use multiline flag in python regex?

I want to transform chunks of text into a database of single-line entries with regex, but I don't know why the regex group isn't recognized.
Maybe the multiline flag isn't set properly.
I am a beginner at Python.
import re

with open("a-j-0101.txt", encoding="cp1252") as f:
    start = 1
    ecx = r"(?P<entrcnt>[0-9]{1,3}) célébrités ou évènements"
    ec1 = ""
    nmx = r"(?P<ename>.+)\r\nAfficher le.*"
    nm1 = ""
    for line in f:
        if start == 1:
            out = open('AST0101.txt' + ".txt", "w", encoding="cp1252")  # utf8 cp1252
            ec1 = re.search(ecx, line)
            out.write(ec1.group("entrcnt"))
            start = 0
            out.write(r"\r\n")
        nm1 = re.search(nmx, line, re.M)
        out.write(str(nm1.group("ename")).rstrip('\r\n'))
    out.close()
But I get the error:
File "C:\work-python\transform-asth-b.py", line 16, in <module>
out.write(str(nm1.group("ename")).rstrip('\r\n'))
builtins.AttributeError: 'NoneType' object has no attribute 'group'
here is the input:
210 célébrités ou évènements ont été trouvés pour la date du 1er janvier.
Création de l'euro
Afficher le...
...
...
...
expected output:
210
Création de l'euro ;...
... ;...
... ;...
EDIT: I tried changing nmx to match \n or \r\n, but got no result:
nmx=r"(?P<ename>.+)(\n|\r\n)Afficher le"
best regards
In this statement:
nm1 = re.search(nmx, line, re.M)
you get a NoneType object (nm1 = None) because no match was found. So investigate why the nmx pattern never matches.
By the way, wherever a None result is possible, you can guard against it before calling .group():
if nm1 is not None:
    out.write(str(nm1.group("ename")).rstrip('\r\n'))
else:
    # handle the no-match case
If you are reading a single line at a time, there is no way for a regex to match on a previous line you have read and then forgotten.
If you read a group of lines, you can apply a regex to the collection of lines, and the multiline flag will do something useful. But your current code should probably simply search for r'^Afficher le\.\.\.' and use the state machine (start == 0 or start == 1) to do this in the right context.
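For example, a sketch of that whole-file approach, reusing the question's file name. Note that in text mode Python's universal newlines translate \r\n to \n while reading, which is another reason the original \r\n pattern could never match:
import re

with open("a-j-0101.txt", encoding="cp1252") as f:
    text = f.read()

count = re.search(r"(?P<entrcnt>[0-9]{1,3}) célébrités ou évènements", text)
if count:
    print(count.group("entrcnt"))
# With re.M, ^ anchors at every line start; .+ stops before the newline,
# so this captures each line immediately followed by a line starting "Afficher le"
names = re.findall(r"^(?P<ename>.+)\nAfficher le", text, re.M)
for name in names:
    print(name)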

removing line breaks in a csv file

I have a csv file in which each line begins with (#) and all fields within a line are separated by (;). One of the fields, which contains text (""[ ]""), has some line breaks that produce errors while importing the whole csv file into Excel or Access: the text after the line breaks is treated as independent lines instead of following the structure of the table.
#4627289301; Lima, Peru; 490; 835551022915420161; Sat Feb 25 18:04:22 +0000 2017; ""[OJO!
la premiacin de los #Oscar, nuestros amigos de #cinencuentro revisan las categoras.
+info: co/plHcfSIfn8]""; 0
#624974422; None; 114; 835551038581137416; Sat Feb 25 18:04:26 +0000 2017; ""[Porque nunca dejamos de amar]""; 0
any help with this using a python script? or any other solution...
as output I would like to have the lines:
#4627289301; Lima, Peru; 490; 835551022915420161; Sat Feb 25 18:04:22 +0000 2017; ""[OJO! la premiacin de los #Oscar, nuestros amigos de #cinencuentro revisan las categoras. +info: co/plHcfSIfn8]""; 0
#624974422; None; 114; 835551038581137416; Sat Feb 25 18:04:26 +0000 2017; ""[Porque nunca dejamos de amar]""; 0
any help? I have a csv file (54 MB) with a lot of lines with line breaks... some other lines are ok...
You should share your expected output as well.
Anyway, I suggest you first clean your file to remove the newline characters; then you can read it as csv. One solution can be (I believe someone will suggest something better :-) )
Clean the file (on Linux): the first sed joins the whole file into one line by replacing newlines with spaces, and the second reinserts a newline before each "#" record marker:
sed ':a;N;$!ba;s/\n/ /g' input_file | sed "s/ #/\n#/g" > output_file
Read file as csv (You can read it using any other method)
import pandas as pd
df = pd.read_csv('output_file', delimiter=';', header=None)
df.to_csv('your_csv_file_name', index=False)
Let's see if it helps you :-)
You can search for lines that are followed by a line that doesn't start with "#", like this \r?\n+(?!#\d+;).
The following was generated from this regex101 demo. It replaces such line ends with a space. You can change that to whatever you like.
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"\r?\n+(?!#\d+;)"
test_str = ("#4627289301; Lima, Peru; 490; 835551022915420161; Sat Feb 25 18:04:22 +0000 2017; \"\"[OJO!\n"
"la premiacin de los #Oscar, nuestros amigos de #cinencuentro revisan las categoras.\n"
"+info: co/plHcfSIfn8]\"\"; 0\n"
"#624974422; None; 114; 835551038581137416; Sat Feb 25 18:04:26 +0000 2017; \"\"[Porque nunca dejamos de amar]\"\"; 0")
subst = " "
# You can manually specify the number of replacements by changing the 4th argument
result = re.sub(regex, subst, test_str, 0, re.MULTILINE)
if result:
    print(result)
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
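Applied to the real file rather than a test string, the same substitution might look like this (file names and encoding are assumptions; newline='' stops Python from translating \r\n so the pattern can still see it):
import re

regex = r"\r?\n+(?!#\d+;)"

with open("input_file.csv", encoding="utf8", newline="") as f:
    text = f.read()

with open("output_file.csv", "w", encoding="utf8", newline="") as f:
    f.write(re.sub(regex, " ", text))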

Need help writing and reading with Python

At the moment I have written this code:
class device:
    naam_device = ''
    stroomverbruik = 0

aantal_devices = int(input("geef het aantal devices op: "))
i = aantal_devices
x = 0
voorwerp = {}
while i > 0:
    voorwerp[x] = device()
    i = i - 1
    x = x + 1
i = 0
while i < aantal_devices:
    voorwerp[i].naam_device = input("Wat is device %d voor een device: " % (i+1))
    # invalid input still needs to be handled here, e.g. if the user enters a string or char instead of a float
    voorwerp[i].stroomverbruik = float(input("hoeveel ampére is uw device?: "))
    i += 1
i = 0
totaal = 0.0
## test print loop
while i < aantal_devices:
    print(voorwerp[i].naam_device, voorwerp[i].stroomverbruik)
    # this total still has to be written to a file, so that after 256 entries a grand total can be computed
    totaal = totaal + voorwerp[i].stroomverbruik
    i = i + 1
print("totaal ampére = ", totaal)
aantal_koelbox = int(input("Hoeveel koelboxen neemt u mee?: "))
if aantal_koelbox <= 2 or aantal_koelbox > aantal_devices:
    if aantal_koelbox > aantal_devices:
        toestaan = input("Deelt u de overige koelboxen met mede-deelnemers (ja/nee)?: ")
        if toestaan == "ja":
            print("Uw gegevens worden opgeslagen! u bent succesvol geregistreerd.")
        if toestaan == "nee":
            print("Uw gegevens worden niet opgeslagen! u voldoet niet aan de eisen.")
else:
    print("Uw gegevens worden niet opgeslagen! u voldoet niet aan de eisen.")
Now I want to write the value of totaal to a file, and later, once 256 of these inputs have been saved, I want to write another program that reads the 256 values, sums them, and divides the result by 14. If someone could help me on the right track with writing the values and reading them back, I can try to figure out the last part myself.
I've been trying for 2 days now and still haven't found a good solution for writing and reading.
The tutorial covers this very nicely, as MattDMo points out. But I'll summarize the relevant part here.
The key idea is to open a file, then write each totaal in some format, then make sure the file gets closed at the end.
What format? Well, that depends on your data. Sometimes you have fixed-shape records, which you can store as CSV rows. Sometimes you have arbitrary Python objects, which you can store as pickles. But in this case, you can get away with the simplest format of all: a line of text. As long as your data are single values that can be unambiguously converted to text and back, and don't contain newlines or other special characters, this works. So:
with open('thefile.txt', 'w') as f:
    while i < aantal_devices:
        print(voorwerp[i].naam_device, voorwerp[i].stroomverbruik)
        # write each running total to the file so a grand total can be computed after 256 entries
        totaal = totaal + voorwerp[i].stroomverbruik
        f.write('{}\n'.format(totaal))
        i = i + 1
That's it. The open opens the file, creating it if necessary. The with makes sure it gets closed at the end of the block. The write writes a line consisting of whatever's in totaal, formatted as a string, followed by a newline character.
To read it back later is even simpler:
with open('thefile.txt') as f:
    for line in f:
        totaal = float(line)  # totaal is a float, so parse it back as one
        # now do stuff with totaal
Use serialization to store the data in the files, then de-serialize it back into its original state for computation.
By serializing the data you can restore it to its original state (value and type, i.e. 1234 as an int and not as a string).
Off you go to the docs :) : https://docs.python.org/2/library/pickle.html
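For example, a minimal sketch of that idea (file name hypothetical): pickle.dump appends one record per run, and repeated pickle.load calls read them back with their original float type:
import pickle

with open("totalen.pkl", "ab") as f:   # 'ab' appends one pickled record per program run
    pickle.dump(totaal, f)

# Later, in the second program: load records until the file is exhausted
values = []
with open("totalen.pkl", "rb") as f:
    while True:
        try:
            values.append(pickle.load(f))
        except EOFError:
            break
print(sum(values) / 14)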
P.S. For people to be able to help you, the code needs to be readable. That way you'll get better answers in the future.
You can write them to a file like so:
import os

with open(os.path.join(output_dir, filename), 'w') as output_file:
    output_file.write("%s" % totaal)
And then sum them like this:
total = 0
for input_file in os.listdir(output_dir):
    path = os.path.join(output_dir, input_file)  # os.listdir() returns bare names, so join with the directory
    if os.path.isfile(path):
        with open(path, 'r') as infile:
            total += float(infile.read())        # totaal was written as a float
print(total / 14)
However, I would consider whether you really need to write each totaal to a separate file. There's probably a better way to solve your problem, but I think this should do what you asked for.
P.S. I would try to read your code and make a more educated attempt to help you, but I don't know Dutch!
