I have an encoding/decoding problem.
Although I used "utf-8" to read the file into a DataFrame with the code shown below, the characters look very different in the output. The language is French. I would be very happy if you could help with this; thank you in advance.
The first line of the data being examined:
b"Sur la #route des stations ou de la maison\xf0\x9f\x9a\x98\xe2\x9d\x84\xef\xb8\x8f?\nCet apr\xc3\xa8s-midi, les #gendarmes veilleront sur vous, comme dans l'#Yonne, o\xc3\xb9 les exc\xc3\xa8s de #vitesse & les comportements dangereux des usagers de l'#A6 seront verbalis\xc3\xa9s\xe2\x9a\xa0\xef\xb8\x8f\nAlors prudence, \xc3\xa9quipez-vous & n'oubliez-pas la r\xc3\xa8gle des 3\xf0\x9f\x85\xbf\xef\xb8\x8f !"
import pandas as pd
data = pd.read_csv('C:\\Users\\Lenovo\\Desktop\\gendarmerie_tweets.csv', delimiter=";", encoding="utf-8")
data.head()
Output:
text
0 b"Sur la #route des stations ou de la maison\x...
1 b"#Guyane Soutien \xc3\xa0 nos 10 #gendarmes e...
2 b'#CoupDeCoeur \xf0\x9f\x92\x99 Journ\xc3\xa9e...
3 b'RT #servicepublicfr: \xf0\x9f\x97\xb3\xef\xb...
4 b"\xe2\x9c\x85 7 personnes interpell\xc3\xa9es...
I believe in cases like this you can try a different encoding. The encoding parameter that might help you solve this issue is 'ISO-8859-1':
data = pd.read_csv('C:\\Users\\Lenovo\\Desktop\\gendarmerie_tweets.csv', delimiter=";", encoding='iso-8859-1')
Edit:
Given the output of reading the file:
<_io.TextIOWrapper name='C:\\Users\\Lenovo\\Desktop\\gendarmerie_tweets.csv' mode='r' encoding='cp1254'>
According to Python's codecs, cp1254 (alias windows-1254) is the Turkish code page, so I suggested trying latin5 and windows-1254 too, but none of these options seems to help.
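If you want to keep experimenting, here is a small sketch of my own (not part of the original answer) that reads a few rows with each candidate encoding so you can eyeball which one produces readable French:
import pandas as pd

candidates = ["utf-8", "iso-8859-1", "cp1254", "latin5"]
for enc in candidates:
    try:
        sample = pd.read_csv('C:\\Users\\Lenovo\\Desktop\\gendarmerie_tweets.csv',
                             delimiter=";", encoding=enc, nrows=5)
        print(enc, "->", sample["text"].iloc[0][:80])  # first 80 characters of the first tweet
    except (UnicodeDecodeError, LookupError) as err:
        print(enc, "failed:", err)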
In Python 3 I have a series of links to "fixed-width files". They are pages with public information about companies, and each line has information about one company.
Example links:
http://idg.receita.fazenda.gov.br/orientacao/tributaria/cadastros/cadastro-nacional-de-pessoas-juridicas-cnpj/consultas/download/F.K03200UF.D71214AC
and
http://idg.receita.fazenda.gov.br/orientacao/tributaria/cadastros/cadastro-nacional-de-pessoas-juridicas-cnpj/consultas/download/F.K03200UF.D71214RO
I have these links in a dictionary. The key is the name of the region of the country in which the companies are located, and the value is the link:
for chave, valor in dict_val.items():
    print(f'Region of country: {chave} - and link with information: {valor}')
Region of country: Acre - and link with information: http://idg.receita.fazenda.gov.br/orientacao/tributaria/cadastros/cadastro-nacional-de-pessoas-juridicas-cnpj/consultas/download/F.K03200UF.D71214AC
Region of country: Espírito Santo - and link with information: http://idg.receita.fazenda.gov.br/orientacao/tributaria/cadastros/cadastro-nacional-de-pessoas-juridicas-cnpj/consultas/download/F.K03200UF.D71214ES
...
I want to read these links (fixed-width files) and save the content to a CSV file. Example content:
0107397388000155ASSOCIACAO CULTURAL
02073973880001552 16MARIA DO SOCORRO RODRIGUES ALVES BRAGA
0101904573000102ABREU E SILVA COMERCIO DE MEDICAMENTOS LTDA-ME - ME
02019045730001022 49JETEBERSON OLIVEIRA DE ABREU
02019045730001022 49LUZINETE SANTOS DA SILVA ABREU
0101668652000161CONSELHO ESCOLAR DA ESCOLA ULISSES GUIMARAES
02016686520001612 10REGINA CLAUDIA RAMOS DA SILVA PESSOA
0101631137000107FORTERM * REPRESENTACOES E COMERCIO LTDA
02016311370001072 49ANTONIO MARCOS GONCALVES
02016311370001072 22IVANEIDE BERNARDO DE MENEZES
But to fill the rows of the CSV columns I need to split and test each line of these fixed-width files.
I must follow rules like these:
1. If the line begins with "01", it is a line with the company's registration number and its name. Example: "0107397388000155ASSOCIACAO CULTURAL"
1.1 - The "01" indicates this
1.2 - The next 14 positions on the line are the company code - starts at position 3 and ends at 16 - (07397388000155)
1.3 - The following 150 positions are the company name - starts at position 17 and ends at 166 - (ASSOCIACAO CULTURAL)
and
2. If the line starts with "02", it has information about the partners of the company. Example: "02073973880001552 16MARIA DO SOCORRO RODRIGUES ALVES BRAGA"
2.1 - The "02" indicates this
2.2 - The next fourteen positions are the company registration code - starts at position 3 and ends at 16 - (07397388000155)
2.3 - The next number is a member identifier code, which can be 1, 2 or 3 - starts and ends at position 17 - (2)
2.4 - The next fourteen positions are another code identifying the member - starts at position 18 and ends at 31 - ("" - in this case it is empty)
2.5 - The next two positions are another code identifying the member - starts at position 32 and ends at 33 - (16)
2.6 - And the 150 final positions are the name of the partner - starts at position 34 and ends at 183 - (MARIA DO SOCORRO RODRIGUES ALVES BRAGA)
In this case, would one possible strategy be to save each link as a TXT file and then try to separate the positions?
Or is there a better way to parse fixed-width files?
You can take a look at any of the modules for fetching URLs. I recommend Requests, although you can use urllib, which comes bundled with Python.
With that in mind, you can get the text from the page, and seeing as it doesn't require a login of any form, with requests it would simply be a matter of:
import requests
r = requests.get('Your link from receita.fazenda.gov.br')
page_text = r.text
Read more in the Quickstart section of requests. I'll leave the 'position-separating' to you.
Hint: Use regex.
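For illustration only (this is my sketch, not part of the original answer): since the field boundaries are fixed, plain string slicing works just as well as a regex. A minimal version, assuming the 1-based positions described in the question and the page_text fetched above:
def parse_line(line):
    # Split one fixed-width line according to the position rules in the question
    if line.startswith("01"):
        return {"type": "01",
                "company_code": line[2:16],            # positions 3-16
                "company_name": line[16:166].strip()}  # positions 17-166
    if line.startswith("02"):
        return {"type": "02",
                "company_code": line[2:16],            # positions 3-16
                "member_id": line[16:17],              # position 17
                "member_code": line[17:31].strip(),    # positions 18-31
                "member_kind": line[31:33],            # positions 32-33
                "member_name": line[33:183].strip()}   # positions 34-183
    return None

rows = [r for r in (parse_line(l) for l in page_text.splitlines()) if r is not None]
From there you can hand rows to csv.DictWriter or pandas to write the CSV.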
Using scrapy it's possible to read the content from the link as a stream and process it without saving it to a file. The documentation for scrapy is here.
There's also a related question here: How do you open a file stream for reading using Scrapy?
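As a rough illustration only (my sketch, not taken from either answer), the basic shape of such a spider could look like this, assuming the dict_val dictionary of links from the question and a line parser such as the parse_line sketched above:
import scrapy

class CnpjSpider(scrapy.Spider):
    name = "cnpj"
    start_urls = list(dict_val.values())  # the links from the question

    def parse(self, response):
        # response.text holds the fetched fixed-width content
        for line in response.text.splitlines():
            record = parse_line(line)  # hypothetical helper, e.g. the slicing sketch above
            if record is not None:
                yield record
Running it with scrapy runspider spider.py -o companies.csv would then produce a CSV without saving intermediate text files.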
NOTE: this question covers why the script is so slow. However, if you are more the kind of person who wants to improve something, you can take a look at my post on CodeReview, which aims to improve the performance.
I am working on a project which crunches plain text files (.lst).
The file names (fileName) are important because I'll extract the node (e.g. abessijn) and component (e.g. WR-P-E-A) from them into a dataframe. Examples:
abessijn.WR-P-E-A.lst
A-bom.WR-P-E-A.lst
acroniem.WR-P-E-C.lst
acroniem.WR-P-E-G.lst
adapter.WR-P-E-A.lst
adapter.WR-P-E-C.lst
adapter.WR-P-E-G.lst
Each file consists of one or more lines. Each line consists of a sentence (inside <sentence> tags). Example (abessijn.WR-P-E-A.lst):
/home/nobackup/SONAR/COMPACT/WR-P-E-A/WR-P-E-A0000364.data.ids.xml: <sentence>Vooral mijn abessijn ruikt heerlijk kruidig .. : ) )</sentence>
/home/nobackup/SONAR/COMPACT/WR-P-E-A/WR-P-E-A0000364.data.ids.xml: <sentence>Mijn abessijn denkt daar heel anders over .. : ) ) Maar mijn kinderen richt ik ook niet af , zit niet in mijn bloed .</sentence>
From each line I extract the sentence, make some small modifications to it, and call it sentence. Up next is an element called leftContext, which is the part of the sentence that comes before node (e.g. abessijn) when the sentence is split on it. Finally, from leftContext I get precedingWord, which is the word preceding node in sentence, i.e. the rightmost word in leftContext (with some limitations, such as allowing a compound formed with a hyphen). Example:
ID | filename | node | component | precedingWord | leftContext | sentence
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 adapter.WR-P-P-F.lst adapter WR-P-P-F aanpassingseenheid Een aanpassingseenheid ( Een aanpassingseenheid ( adapter ) ,
2 adapter.WR-P-P-F.lst adapter WR-P-P-F toestel Het toestel ( Het toestel ( adapter ) draagt zorg voor de overbrenging van gegevens
3 adapter.WR-P-P-F.lst adapter WR-P-P-F de de aansluiting tussen de sensor en de de aansluiting tussen de sensor en de adapter ,
4 airbag.WS-U-E-A.lst airbag WS-U-E-A den ja voor den ja voor den airbag op te pompen eh :p
5 airbag.WS-U-E-A.lst airbag WS-U-E-A ne Dobby , als ze valt heeft ze dan wel al ne Dobby , als ze valt heeft ze dan wel al ne airbag hee
That dataframe is exported as dataset.csv.
After that, the actual point of my project comes in: I create a frequency table that takes node and precedingWord into account. I define variables neuter and non_neuter, e.g. (in Python)
neuter = ["het", "Het"]
non_neuter = ["de","De"]
and a rest category, unspecified. When precedingWord is an item from one of the lists, it is assigned to that category. Example of a frequency table output:
node | neuter | nonNeuter | unspecified
-------------------------------------------------
A-bom 0 4 2
acroniem 3 0 2
act 3 2 1
The frequency list is exported as frequencies.csv.
I started out with R, considering that later on I'd do some statistical analyses on the frequencies. My current R script (also available as paste):
# ---
# STEP 0: Preparations
start_time <- Sys.time()
## 1. Set working directory in R
setwd("")
## 2. Load required library/libraries
library(dplyr)
library(mclm)
library(stringi)
## 3. Create directory where we'll save our dataset(s)
dir.create("../R/dataset", showWarnings = FALSE)
# ---
# STEP 1: Loop through files, get data from the filename
## 1. Create first dataframe, based on filename of all files
files <- list.files(pattern="*.lst", full.names=T, recursive=FALSE)
d <- data.frame(fileName = unname(sapply(files, basename)), stringsAsFactors = FALSE)
## 2. Create additional columns (word & component) based on filename
d$node <- sub("\\..+", "", d$fileName, perl=TRUE)
d$node <- tolower(d$node)
d$component <- gsub("^[^\\.]+\\.|\\.lst$", "", d$fileName, perl=TRUE)
# ---
# STEP 2: Loop through files again, but now also through its contents
# In other words: get the sentences
## 1. Create second set which is an rbind of multiple frames
## One two-column data.frame per file
## First column is fileName, second column is data from each file
e <- do.call(rbind, lapply(files, function(x) {
data.frame(fileName = x, sentence = readLines(x, encoding="UTF-8"), stringsAsFactors = FALSE)
}))
## 2. Clean fileName
e$fileName <- sub("^\\.\\/", "", e$fileName, perl=TRUE)
## 3. Get the sentence and clean
e$sentence <- gsub(".*?<sentence>(.*?)</sentence>", "\\1", e$sentence, perl=TRUE)
e$sentence <- tolower(e$sentence)
# Remove floating space before/after punctuation
e$sentence <- gsub("\\s(?:(?=[.,:;?!) ])|(?<=\\( ))", "\\1", e$sentence, perl=TRUE)
# Add space after triple dots ...
e$sentence <- gsub("\\.{3}(?=[^\\s])", "... ", e$sentence, perl=TRUE)
# Transform HTML entities into characters
# It is unfortunate that there's no easier way to do this
# E.g. Python provides the HTML package which can unescape (decode) HTML
# characters
e$sentence <- gsub("&apos;", "'", e$sentence, perl=TRUE)
e$sentence <- gsub("&amp;", "&", e$sentence, perl=TRUE)
# Prevent R from wrongly interpreting &quot;, so replace it with single quotes
e$sentence <- gsub("&quot;|\"", "'", e$sentence, perl=TRUE)
# Get rid of some characters we can't use such as ³ and ¾
e$sentence <- gsub("[^[:graph:]\\s]", "", e$sentence, perl=TRUE)
# ---
# STEP 3:
# Create final dataframe
## 1. Merge d and e by common column name fileName
df <- merge(d, e, by="fileName", all=TRUE)
## 2. Make sure that only those sentences in which df$node is present in df$sentence are taken into account
matchFunction <- function(x, y) any(x == y)
matchedFrame <- with(df, mapply(matchFunction, node, stri_split_regex(sentence, "[ :?.,]")))
df <- df[matchedFrame, ]
## 3. Create leftContext based on the split of the word and the sentence
# Use paste0 to make sure we are looking for the node, not a compound
# node can only be preceded by a space, but can be followed by punctuation as well
contexts <- strsplit(df$sentence, paste0("(^| )", df$node, "( |[!\",.:;?})\\]])"), perl=TRUE)
df$leftContext <- sapply(contexts, `[`, 1)
## 4. Get the word preceding the node
df$precedingWord <- gsub("^.*\\b(?<!-)(\\w+(?:-\\w+)*)[^\\w]*$","\\1", df$leftContext, perl=TRUE)
## 5. Improve readability by sorting columns
df <- df[c("fileName", "component", "precedingWord", "node", "leftContext", "sentence")]
## 6. Write dataset to dataset dir
write.dataset(df,"../R/dataset/r-dataset.csv")
# ---
# STEP 4:
# Create dataset with frequencies
## 1. Define neuter and nonNeuter classes
neuter <- c("het")
non.neuter<- c("de")
## 2. Mutate df to fit into usable frame
freq <- mutate(df, gender = ifelse(!df$precedingWord %in% c(neuter, non.neuter), "unspecified",
ifelse(df$precedingWord %in% neuter, "neuter", "non_neuter")))
## 3. Transform into table, but still usable as data frame (i.e. matrix)
## Also add column name "node"
freqTable <- table(freq$node, freq$gender) %>%
as.data.frame.matrix %>%
mutate(node = row.names(.))
## 4. Small adjustements
freqTable <- freqTable[,c(4,1:3)]
## 5. Write dataset to dataset dir
write.dataset(freqTable,"../R/dataset/r-frequencies.csv")
diff <- Sys.time() - start_time # calculate difference
print(diff) # print in nice format
However, since I'm using a big dataset (16,500 files, all with multiple lines) it seemed to take quite a long time. On my system the whole process took about an hour and a quarter. I thought to myself that there ought to be a language out there that could do this more quickly, so I went and taught myself some Python and asked a lot of questions here on SO.
Finally I came up with the following script (paste).
import os, pandas as pd, numpy as np, regex as re
from glob import glob
from datetime import datetime
from html import unescape
start_time = datetime.now()
# Create empty dataframe with correct column names
columnNames = ["fileName", "component", "precedingWord", "node", "leftContext", "sentence" ]
df = pd.DataFrame(data=np.zeros((0,len(columnNames))), columns=columnNames)
# Create correct path where to fetch files
subdir = "rawdata"
path = os.path.abspath(os.path.join(os.getcwd(), os.pardir, subdir))
# "Cache" regex
# See http://stackoverflow.com/q/452104/1150683
p_filename = re.compile(r"[./\\]")
p_sentence = re.compile(r"<sentence>(.*?)</sentence>")
p_typography = re.compile(r" (?:(?=[.,:;?!) ])|(?<=\( ))")
p_non_graph = re.compile(r"[^\x21-\x7E\s]")
p_quote = re.compile(r"\"")
p_ellipsis = re.compile(r"\.{3}(?=[^ ])")
p_last_word = re.compile(r"^.*\b(?<!-)(\w+(?:-\w+)*)[^\w]*$", re.U)
# Loop files in folder
for file in glob(path+"\\*.lst"):
    with open(file, encoding="utf-8") as f:
        [n, c] = p_filename.split(file.lower())[-3:-1]
        fn = ".".join([n, c])
        for line in f:
            s = p_sentence.search(unescape(line)).group(1)
            s = s.lower()
            s = p_typography.sub("", s)
            s = p_non_graph.sub("", s)
            s = p_quote.sub("'", s)
            s = p_ellipsis.sub("... ", s)
            if n in re.split(r"[ :?.,]", s):
                lc = re.split(r"(^| )" + n + "( |[!\",.:;?})\]])", s)[0]
                pw = p_last_word.sub("\\1", lc)
                df = df.append([dict(fileName=fn, component=c,
                                     precedingWord=pw, node=n,
                                     leftContext=lc, sentence=s)])
                continue
# Reset indices
df.reset_index(drop=True, inplace=True)
# Export dataset
df.to_csv("dataset/py-dataset.csv", sep="\t", encoding="utf-8")
# Let's make a frequency list
# Create new dataframe
# Define neuter and non_neuter
neuter = ["het"]
non_neuter = ["de"]
# Create crosstab
df.loc[df.precedingWord.isin(neuter), "gender"] = "neuter"
df.loc[df.precedingWord.isin(non_neuter), "gender"] = "non_neuter"
df.loc[df.precedingWord.isin(neuter + non_neuter)==0, "gender"] = "rest"
freqDf = pd.crosstab(df.node, df.gender)
freqDf.to_csv("dataset/py-frequencies.csv", sep="\t", encoding="utf-8")
# How long has the script been running?
time_difference = datetime.now() - start_time
print("Time difference of", time_difference)
After making sure that the output of both scripts is identical, I thought I'd put them to the test.
I am running on Windows 10 64 bit with a quad-core processor and 8 GB RAM. For R I'm using RGui 64 bit 3.2.2, and Python runs on version 3.4.3 (Anaconda) and is executed in Spyder. Note that I'm running Python in 32 bit because I'd like to use the nltk module in the future, and they discourage users from using the 64-bit version.
What I found was that R finished in approximately 55 minutes. But Python has been running for two hours straight already and I can see in the variable explorer that it's only at business.wr-p-p-g.lst (files are sorted alphabetically). It is waaaaayyyy slower!
So what I did was make a test case and see how both scripts perform with a much smaller dataset. I took around 100 files (instead of 16,500) and ran the script. Again, R was much faster. R finished in around 2 seconds, Python in 17!
Seeing that the goal of Python was to make everything go more smoothly, I was confused. I read that Python was fast (and R rather slow), so where did I go wrong? What is the problem? Is Python slower at reading files and lines, or at doing regexes? Or is R simply better equipped to deal with dataframes, so that it can't be beaten by pandas? Or is my code simply badly optimised, and should Python indeed be the victor?
My question is thus: why is Python slower than R in this case, and - if possible - how can we improve Python to shine?
Everyone who is willing to give either script a try can download the test data I used here. Please give me a heads-up when you have downloaded the files.
The most horribly inefficient thing you do is calling the DataFrame.append method in a loop, i.e.
df = pandas.DataFrame(...)
for file in files:
    ...
    for line in file:
        ...
        df = df.append(...)
NumPy data structures are designed with functional programming in mind, so this operation is not meant to be used in an iterative, imperative fashion: the call doesn't change your data frame in place but creates a new one, resulting in an enormous increase in time and memory complexity. If you really want to use data frames, accumulate your rows in a list and pass it to the DataFrame constructor, e.g.
pre_df = []
for file in files:
    ...
    for line in file:
        ...
        pre_df.append(processed_line)

df = pandas.DataFrame(pre_df, ...)
This is the easiest way since it will introduce minimal changes to the code you have. But the better (and computationally beautiful) way is to figure out how to generate your dataset lazily. This can easily be achieved by splitting your workflow into discrete functions (in the functional programming sense) and composing them using lazy generator expressions and/or the map and filter higher-order functions (imap and ifilter in Python 2). Then you can use the resulting generator to build your dataframe, e.g.
df = pandas.DataFrame.from_records(processed_lines_generator, columns=column_names, ...)
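For instance, here is a minimal sketch of my own (not the answerer's code) of such a lazy pipeline, assuming a hypothetical process_line helper that returns one record dict per matching line and None otherwise:
import pandas as pd
from glob import glob

def records(paths):
    # Lazily yield one processed record per matching line, file by file
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                record = process_line(path, line)  # hypothetical helper
                if record is not None:
                    yield record

column_names = ["fileName", "component", "precedingWord",
                "node", "leftContext", "sentence"]
df = pd.DataFrame.from_records(records(glob("rawdata/*.lst")), columns=column_names)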
As for reading multiple files in one run you might want to read this.
P.S.
If you've got performance issues you should profile your code before trying to optimise it.
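For example, a quick sketch of mine (assuming the per-file processing is wrapped in a hypothetical main() function):
import cProfile, pstats

cProfile.run("main()", "stats.prof")  # profile the hypothetical main()
pstats.Stats("stats.prof").sort_stats("cumulative").print_stats(20)
The cumulative column will tell you whether the time goes into reading files, the regexes, or the dataframe construction.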
At the moment I have written this code:
class device:
    naam_device = ''
    stroomverbruik = 0

aantal_devices = int(input("geef het aantal devices op: "))
i = aantal_devices
x = 0
voorwerp = {}

while i > 0:
    voorwerp[x] = device()
    i = i - 1
    x = x + 1

i = 0
while i < aantal_devices:
    voorwerp[i].naam_device = input("Wat is device %d voor een device: " % (i+1))
    # TODO: still need to handle invalid input, e.g. if the user enters a string or char instead of a float
    voorwerp[i].stroomverbruik = float(input("hoeveel ampére is uw device?: "))
    i += 1

i = 0
totaal = 0.0

## test while print
while i < aantal_devices:
    print(voorwerp[i].naam_device, voorwerp[i].stroomverbruik)
    # this total still has to be written to a file so that after 256 entries a grand total can be determined
    totaal = totaal + voorwerp[i].stroomverbruik
    i = i + 1

print("totaal ampére = ", totaal)
aantal_koelbox = int(input("Hoeveel koelboxen neemt u mee?: "))
if aantal_koelbox <= 2 or aantal_koelbox > aantal_devices:
    if aantal_koelbox > aantal_devices:
        toestaan = input("Deelt u de overige koelboxen met mede-deelnemers (ja/nee)?: ")
        if toestaan == "ja":
            print("Uw gegevens worden opgeslagen! u bent succesvol geregistreerd.")
        if toestaan == "nee":
            print("Uw gegevens worden niet opgeslagen! u voldoet niet aan de eisen.")
else:
    print("Uw gegevens worden niet opgeslagen! u voldoet niet aan de eisen.")
Now I want to write the value of totaal to a file, and later, when I have saved 256 of these inputs, I want to write another program that reads the 256 values, sums them and divides that number by 14. If someone could help me get on the right track with writing the values and reading them back later, I can try to work out the last part myself.
But I've been trying for 2 days now and still haven't found a good solution for writing and reading.
The tutorial covers this very nicely, as MattDMo points out. But I'll summarize the relevant part here.
The key idea is to open a file, then write each totaal in some format, then make sure the file gets closed at the end.
What format? Well, that depends on your data. Sometimes you have fixed-shape records, which you can store as CSV rows. Sometimes you have arbitrary Python objects, which you can store as pickles. But in this case, you can get away with using the simplest format of all: a line of text. As long as your data are single values that can be unambiguously converted to text and back, and don't have any newline or other special characters in them, this works. So:
with open('thefile.txt', 'w') as f:
    while i < aantal_devices:
        print(voorwerp[i].naam_device, voorwerp[i].stroomverbruik)
        # this total still has to be written to a file so that after 256 entries a grand total can be determined
        totaal = totaal + voorwerp[i].stroomverbruik
        f.write('{}\n'.format(totaal))
        i = i + 1
That's it. The open opens the file, creating it if necessary. The with makes sure it gets closed at the end of the block. The write writes a line consisting of whatever's in totaal, formatted as a string, followed by a newline character.
To read it back later is even simpler:
with open('thefile.txt') as f:
    for line in f:
        totaal = float(line)  # totaal was written as a float, so convert it back with float()
        # now do stuff with totaal
Use serialization to store the data in the files and then de-serialize it back into its original state for computation.
By serializing the data you can restore it to its original state (value and type, i.e. 1234 as an int and not as a string).
Off you go to the docs :) : https://docs.python.org/2/library/pickle.html
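For example, a minimal sketch of my own (not from the original answer), assuming you append one pickled totaal per run to a hypothetical totals.pkl file:
import pickle

# Write: append one pickled value per run
with open('totals.pkl', 'ab') as f:
    pickle.dump(totaal, f)

# Read back later: unpickle values until the file is exhausted
totals = []
with open('totals.pkl', 'rb') as f:
    while True:
        try:
            totals.append(pickle.load(f))
        except EOFError:
            break

print(sum(totals) / 14)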
P.S. For people to be able to help you, the code needs to be readable. That way you can get a better answer in the future.
You can write them to a file like so:
with open(os.path.join(output_dir, filename), 'w') as output_file:
    output_file.write("%s" % totaal)
And then sum them like this:
total = 0
for input_file in os.listdir(output_dir):
    # os.listdir() returns bare names, so join with the directory before testing
    if os.path.isfile(os.path.join(output_dir, input_file)):
        with open(os.path.join(output_dir, input_file), 'r') as infile:
            total += float(infile.read())  # the written totaal is a float

print(total / 14)
However, I would consider whether you really need to write each totaal to a separate file. There's probably a better way to solve your problem, but I think this should do what you asked for.
P.S. I would try to read your code and make a more educated attempt to help you, but I don't know Dutch!