Issue reading text file with pound sign - python

I was trying to read a tab-delimited text file like this:
1 2# 3
using:
test = genfromtxt('test2.txt', delimiter='\t', dtype = 'string', skip_header=0)
However, I only get 1 and 2 in the output. The # acts like an ending character in the txt file. Is there any way to solve this if I want to read the pound sign as a string?

the_string.split('\t') should do the job if you don't have to use genfromtxt
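If you do have to use genfromtxt: its default comment character is #, so everything after the pound sign is stripped. Passing comments=None disables comment handling. A minimal sketch, assuming NumPy is available:

```python
import io

import numpy as np

# genfromtxt strips everything after '#' by default; comments=None
# keeps the pound sign as ordinary data.
data = io.StringIO('1\t2#\t3')
test = np.genfromtxt(data, delimiter='\t', dtype=str, comments=None)
print(test)  # ['1' '2#' '3']
```

Note that dtype=str is the Python 3 spelling; dtype='string' only worked on Python 2.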

How can I decode the text data?

I have used tweepy to store the text of tweets in a csv file using Python csv.writer(), but I had to encode the text in utf-8 before storing, otherwise tweepy throws a weird error.
import pandas as pd
data = pd.read_csv(r'C:\Users\Lenovo\Desktop\_Carabinieri_10_tweets.csv', delimiter=",", encoding="utf-8")
print(data.head())
Now, the text data is stored like this:
OUTPUT
id … text
0 1228280254256623616 … b'RT #MinisteroDifesa: #14febbraio Il Ministro…
1 1228257366841405441 … b'\xe2\x80\x9cNon t\xe2\x80\x99ama chi amor ti…
2 1228235394954620928 … b'Eseguite dai #Carabinieri del Nucleo Investi…
3 1228219588589965316 … b'Il pianeta brucia\nConosci il black carbon?...
4 1228020579485261824 … b'RT #Coninews: Emozioni tricolore \xe2\x9c\xa…
Although I passed encoding="utf-8" when reading the file into a DataFrame with the code shown above, the characters look very different in the output: it looks like byte strings. The language is Italian.
I tried to decode this (there is more data in other columns; the text is in the second column), but it doesn't decode the text. I cannot use .decode('utf-8') because the csv reader reads the data as strings, i.e. type(row[2]) is 'str', and I can't seem to convert it into bytes; the data just gets encoded once more!
How can I decode the text data?
I would be very happy if you can help with this, thank you in advance.
The problem likely comes from the way you have written your csv file. I would bet a coin that when read as text (with a simple text editor like notepad, notepad++, or vi) it actually contains:
1228280254256623616,…,b'RT #MinisteroDifesa: #14febbraio Il Ministro...'
1228257366841405441,…,b'\xe2\x80\x9cNon t\xe2\x80\x99ama chi amor ti...'
...
or:
1228280254256623616,…,"b'RT #MinisteroDifesa: #14febbraio Il Ministro...'"
1228257366841405441,…,"b'\xe2\x80\x9cNon t\xe2\x80\x99ama chi amor ti...'"
...
Pandas read_csv then correctly reads the text representation of a byte string.
The correct fix would be to write true UTF-8 encoded strings, but as I do not know the code, I cannot propose a fix.
A possible workaround is to use ast.literal_eval to convert the text representation into a byte string and decode it:
import ast
df['text'] = df['text'].apply(lambda x: ast.literal_eval(x).decode('utf8'))
It should give:
id ... text
0 1228280254256623616 ... RT #MinisteroDifesa: #14febbraio Il Ministro...
1 1228257366841405441 ... “Non t’ama chi amor ti...
...
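As a self-contained illustration of the workaround (the sample cell below is hypothetical, shortened from the asker's output):

```python
import ast

# A cell stored as the *text representation* of a UTF-8 byte string,
# which is how it ends up in the csv file.
cell = "b'\\xe2\\x80\\x9cNon t\\xe2\\x80\\x99ama chi amor ti'"

# literal_eval turns the text back into real bytes; decode yields the text.
fixed = ast.literal_eval(cell).decode('utf8')
print(fixed)  # “Non t’ama chi amor ti
```

This only works because the cells are valid Python bytes literals; if the file were written with proper UTF-8 strings in the first place, no post-processing would be needed.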

How to import a text file that both contains values and text in python?

I am aware that a lot of questions are already asked on this topic, but none of them worked for my specific case.
I want to import a text file in python, and want to be able to access each value separately in python. My text file looks like this (it's separated by tabs):
example dataset
For example, the data '1086: CampNou' is written in one cell. I am mainly interested in getting access to the values presented here. Does anybody have a clue how to do this?
1086: CampNou 2084: Hospi 2090: Sants 2094: BCN-S 2096: BCN-N 2101: UNI 2105: B23 Total
1086: CampNou 0 15,6508 12,5812 30,3729 50,2963 0 56,0408 164,942
2084: Hospi 15,7804 0 19,3732 37,1791 54,1852 27,4028 59,9297 213,85
2090: Sants 12,8067 22,1304 0 30,6268 56,7759 29,9935 62,5204 214,854
2096: BCN-N 51,135 54,8545 57,3742 46,0102 0 45,6746 56,8001 311,849
2101: UNI 0 28,9589 31,4786 37,5029 31,6773 0 50,2681 179,886
2105: B23 51,1242 38,5838 57,3634 75,1552 56,7478 40,2728 0 319,247
Total 130,846 160,178 178,171 256,847 249,683 143,344 285,559 1404,63
You can use pandas to open and manipulate your data. Since the file is tab-delimited and uses decimal commas, pass sep='\t' and decimal=',':
import pandas as pd
df = pd.read_csv("mytext.txt", sep='\t', decimal=',')
This should read your file properly.
def read_file(filename):
    """Return the content of a file."""
    with open(filename, 'r') as file:
        return file.read()

content = read_file("the_file.txt")  # or whatever your text file is called
items = content.split('\t')
Then your values will be in the list items: ['', '1086: CampNou', '2084: Hospi', '2090: Sants', ...]
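If you want the values keyed by their row and column labels, here is a stdlib-only sketch that splits each line on tabs and converts the decimal-comma numbers (the sample string below is an assumed, shortened version of the file):

```python
import csv
import io

# Shortened sample of the tab-delimited matrix; note the decimal commas.
sample = ("\t1086: CampNou\t2084: Hospi\tTotal\n"
          "1086: CampNou\t0\t15,6508\t164,942\n"
          "2084: Hospi\t15,7804\t0\t213,85\n")

reader = csv.reader(io.StringIO(sample), delimiter='\t')
header = next(reader)[1:]  # column labels, skipping the empty corner cell
table = {}
for row in reader:
    label, values = row[0], row[1:]
    table[label] = {col: float(v.replace(',', '.'))
                    for col, v in zip(header, values)}

print(table['1086: CampNou']['2084: Hospi'])  # 15.6508
```

For the real file, replace io.StringIO(sample) with open("mytext.txt").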

Get rid of unicode decimal characters

I have a huge file which looks like this :
6814;gymnocéphale;185;151;49
6815;gymnodonte;83;330;0
6816;gymnosome;287;105;42
6817;hà mã;69;305;0
6818;hải âu;81;294;0
6819;hải cẩu;64;338;0
6820;hải yến;62;269;0
6848;histiophore;57;262;0
6849;hiverneur;56;248;0
6850;h&#7893mang;54;298;0
6851;holobranche;97;329;0
6852;hoplopode;65;296;0
6853;h&#432&#417u cao cổ152;298;0
6854;huy&#7873n đề62;324;0
6855;hyalosome;73;371;0
6883;jumarre;83;295;0
6884;kéc;86;326;0
6885;kền kền;73;303;0
6886;khoang;64;323;0
6887;khướu;62;325;0
As you can see, the file contains some unicode decimal references; I would like to replace all of them with their latin characters before using the file. Even when opening it with utf-8 encoding, the errors are not suppressed.
Do you know a way to do it? I want to create a dictionary and retrieve the numbers at index 2.
for: 6883;jumarre;83;295;0 => I have 83
for: 6887;khướu;62;325;0 => I have &#7899 => which is false, I should have 62
import codecs

with codecs.open('JeuxdeMotsPolarise_test.txt', 'r', 'utf-8', errors='ignore') as text_file:
    content = text_file.read()
dico_lexique = {i.split(";")[1]: i.split(";")[2:] for i in content.split("\n") if i}
This is the result when trying #serge's proposition, but it leaves blank characters where the references were:
6814;gymnocéphale;185;151;49
6815;gymnodonte;83;330;0
6816;gymnosome;287;105;42
6817;hà mã;69;305;0
6818;hi âu;81;294;0
6819;hi cu;64;338;0
6820;hi yn;62;269;0
6848;histiophore;57;262;0
6849;hiverneur;56;248;0
6850;h mang;54;298;0
6851;holobranche;97;329;0
6852;hoplopode;65;296;0
6853;hu cao c;152;298;0
6854;huyn ;62;324;0
6855;hyalosome;73;371;0
6883;jumarre;83;295;0
6884;kéc;86;326;0
6885;kn kn;73;303;0
6886;khoang;64;323;0
6887;khu;62;325;0
Edit: I redownloaded the original file and the missing ";" errors were corrected.
For example:
=> 6850;hổ mang;54;298;0 (that is how it appears in the now updated file)
Thank you everybody.
#PanagiotisKanavos has correctly guessed that html.unescape is able to replace the xml character references with their unicode characters. The hard part is that some refs are correctly ended with their terminating semicolon (;) while others are not. And in the latter case, if an entity is followed by a semicolon separator, the separator will be eaten by the conversion, shifting the following fields.
So the only reliable way is to:
process the file line by line as a CSV file with ; delimiter
if needed, concatenate the middle field from the second to the fourth starting from the end
unescape that middle field
If you want to convert the file, you could do:
import csv
import html

with open('file.csv') as fd, open('fixed.csv', 'w', newline='') as fdout:
    rd = csv.reader(fd, delimiter=';')
    wr = csv.writer(fdout, delimiter=';')
    for row in rd:
        if len(row) > 5:
            row[1] = ';'.join(row[1:len(row)-3])
            del row[2:len(row)-3]
        row[1] = html.unescape(row[1])
        wr.writerow(row)
If you only want to build a mapping from field 0 to field 2:
import csv

values = {}
with open('file.csv') as fd:
    rd = csv.reader(fd, delimiter=';')
    for row in rd:
        values[row[0]] = row[-3]
This text isn't UTF8 or Unicode in general. It's HTML-encoded text, most likely Vietnamese. Those escape sequences correspond to Vietnamese characters, for example &#432 is ư - in fact, I just typed the escape sequence in the SO edit box and the correct character appeared. &#7899; is ớ.
Copying the entire text outside a code block produces
6814;gymnocéphale;185;151;49
6815;gymnodonte;83;330;0
6816;gymnosome;287;105;42
6817;hà mã;69;305;0
6818;hải âu;81;294;0
6819;hải cẩu;64;338;0
6820;hải yến;62;269;0
6848;histiophore;57;262;0
6849;hiverneur;56;248;0
6850;h&#7893mang;54;298;0
6851;holobranche;97;329;0
6852;hoplopode;65;296;0
6853;h&#432&#417u cao cổ152;298;0
6854;huy&#7873n đề62;324;0
6855;hyalosome;73;371;0
6883;jumarre;83;295;0
6884;kéc;86;326;0
6885;kền kền;73;303;0
6886;khoang;64;323;0
6887;khướu;62;325;0
Googling for Họ Khướu returns this Wikipedia page about Họ Khướu.
I think it's safe to assume this is HTML-encoded Vietnamese text. To convert it to Unicode you can use html.unescape :
import html
line='6887;khướu;62;325;0'
properLine=html.unescape(line)
UPDATE
The text posted above is just the original text with an extra newline per line. It's SO's markdown renderer that converts the escape sequences to the corresponding glyphs.
The funny thing is that this line :
6853;h&#432&#417u cao cổ152;298;0
Can't be rendered because the HTML entities aren't properly terminated. html.unescape on the other hand will convert the characters. Clearly, html.unescape is far more forgiving than SO's markdown renderer.
Either of these lines :
html.unescape('6853;hươu cao cổ152;298;0')
html.unescape('6853;h&#432&#417u cao cổ152;298;0')
Returns :
6853;h\u01b0\u01a1u cao c\u1ed5152;298;0
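That forgiveness is easy to verify (the raw line below is an assumed reconstruction of what the file contains before rendering):

```python
import html

# html.unescape resolves numeric character references even without the
# terminating semicolon, unlike SO's stricter markdown renderer.
raw = '6853;h&#432&#417u cao c&#7893;152;298;0'
print(html.unescape(raw))  # 6853;hươu cao cổ152;298;0
```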
Repair the file first before you load it into a CSV parser.
Assuming Maarten in the comments is right, change the encoding:
iconv -f cp1252 -t utf-8 < JeuxdeMotsPolarise_test.txt > JeuxdeMotsPolarise_test.utf8.txt
Then replace the escapes with proper characters.
perl -C -i -lpe'
s/&#([0-9]+);?/chr $1/eg; # replace entities
s/;?(\d+;\d+;\d+)$/;$1/; # put back semicolon
# if it was consumed accidentally
' JeuxdeMotsPolarise_test.utf8.txt
Contents of JeuxdeMotsPolarise_test.utf8.txt after running the substitutions:
6814;gymnocéphale;185;151;49
6815;gymnodonte;83;330;0
6816;gymnosome;287;105;42
6817;hà mã;69;305;0
6818;hải âu;81;294;0
6819;hải cẩu;64;338;0
6820;hải yến;62;269;0
6848;histiophore;57;262;0
6849;hiverneur;56;248;0
6850;hổmang;54;298;0
6851;holobranche;97;329;0
6852;hoplopode;65;296;0
6853;hươu cao cổ;152;298;0
6854;huyền đề;62;324;0
6855;hyalosome;73;371;0
6883;jumarre;83;295;0
6884;kéc;86;326;0
6885;kền kền;73;303;0
6886;khoang;64;323;0
6887;khướu;62;325;0
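For those without perl at hand, the same two substitutions can be sketched in Python (same heuristic as above: put back a semicolon that an unterminated entity may have swallowed):

```python
import re

def fix_line(line):
    # Replace decimal character references, terminated or not.
    line = re.sub(r'&#([0-9]+);?', lambda m: chr(int(m.group(1))), line)
    # Put back the semicolon before the three trailing numeric fields
    # if the conversion consumed it.
    line = re.sub(r';?(\d+;\d+;\d+)$', r';\1', line)
    return line

print(fix_line('6853;h&#432&#417u cao c&#7893;152;298;0'))
# 6853;hươu cao cổ;152;298;0
```

Lines that were already well-formed pass through unchanged, since the second substitution rewrites the existing semicolon in place.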

Data reading - csv

I have some data in a .DFX file and I am trying to read it as a csv with pandas. But it has some special characters which are not read by pandas; they are separators as well. I attached one line from it.
The "DC4" character is removed when I print the file. The SI is read as a space, correctly. I tried several encodings (utf-8, latin1, etc.), but with no success.
I attached the printed first line as well, and marked the place where the characters should be.
My code is simple:
import pandas
file_log = pandas.read_csv("file_log.DFX", header=None)
print(file_log)
I hope I was clear and someone has an idea.
Thanks in advance!
EDIT:
The input. LINK: drive.google.com/open?id=0BxMDhep-LHOIVGcybmsya2JVM28
The expected output:
88.4373 0 12.07.2014/17:05:22 38.0366 38.5179 1.3448 31.9839
30.0070 0 12.07.2014/17:14:27 38.0084 38.5091 0.0056 0.0033
By examining the example.DFX in hex (with xxd), the two separators are 0x14 and 0x0f respectively.
Read the csv with multiple separators using the python engine:
import pandas
sep1 = chr(0x14)  # the one shown as DC4
sep2 = chr(0x0f)  # the one shown as SI
file_log = pandas.read_csv('example.DFX', header=None, sep='{}|{}'.format(sep1, sep2), engine='python')
print(file_log)
And you get:
0 1 2 3 4 5 6 7
0 88.4373 0 12.07.2014/17:05:22 38.0366 38.5179 1.3448 31.9839 NaN
1 30.0070 0 12.07.2014/17:14:27 38.0084 38.5091 0.0056 0.0033 NaN
It seems it has an empty column at the end. But I'm sure you can handle that.
The encoding seems to be ASCII here. DC4 stands for "device control 4" and SI for "Shift In". These are control characters in an ASCII file and not printable. Thus you cannot see them when you issue a "print(file_log)", although it might do something depending on your terminal to view this (like \n would do a new-line).
Try typing file_log in your interpreter to get the representation of that variable and check if those special characters are included. Chances are that you'll see DC4 in the representation as '\x14' which means hexadecimal 14.
You may then further process these strings in your program by using string manipulation like replace.
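Building on that suggestion, a stdlib-only sketch that splits one line on the two control characters directly (the line content below is assumed from the expected output above):

```python
import re

# 0x14 (DC4) and 0x0f (SI) act as field separators in the file.
line = '88.4373\x140\x0f12.07.2014/17:05:22\x1438.0366\x1438.5179\x141.3448\x1431.9839'
fields = re.split('[\x14\x0f]', line)
print(fields)
# ['88.4373', '0', '12.07.2014/17:05:22', '38.0366', '38.5179', '1.3448', '31.9839']
```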

Zeroes appearing when reading file (where there aren't any)

When reading a file (UTF-8 Unicode text, csv) with Python on Linux, either with:
csv.reader()
file()
values of some columns get a zero as their first character (there are no zeroes in the input), others get a few zeroes, which are not seen when viewing the file with Geany or any other editor. For example:
Input
10016;9167DE1;Tom;Sawyer ;Street 22;2610;Wil;;378983561;tom#hotmail.com;1979-08-10 00:00:00.000;0;1;Wil;081208608;NULL;2;IZMH726;2010-08-30 15:02:55.777;2013-06-24 08:17:22.763;0;1;1;1;NULL
Output
10016;9167DE1;Tom;Sawyer ;Street 22;2610;Wil;;0378983561;tom#hotmail.com;1979-08-10 00:00:00.000;0;1;Wil;081208608;NULL;2;IZMH726;2010-08-30 15:02:55.777;2013-06-24 08:17:22.763;0;1;1;1;NULL
See 378983561 > 0378983561
Reading with:
f = open('/home/foo/data.csv', 'r')
data = f.read()
split_data = data.splitlines()
lines = list(line.split(';') for line in split_data)
print(lines[51220][8])
>>> '0378983561'  # should have been '378983561' (reads like this in Geany etc.)
Same result with csv.reader().
Help me solve the mystery, what could be the cause of this? Could it be related to encoding/decoding?
The data you're getting is a string.
print(lines[51220][8])
>>> '0378983561'
If you want to use it as an integer, you should parse it.
print(int(lines[51220][8]))
>>> 378983561
If you want it back as a string without the leading zero, convert it again.
print(str(int(lines[51220][8])))
>>> '378983561'
csv.reader treats all columns as strings. Conversion to the appropriate type is up to you, as in:
print(int(lines[51220][8]))
