CSV file with Arabic characters is displayed as symbols in Excel

I am using Python to extract Arabic tweets from Twitter and save them as a CSV file, but when I open the saved file in Excel, the Arabic text is displayed as symbols. However, it looks fine in Python, Notepad, and Word.
What is causing the problem?

This is a problem I face frequently with Microsoft Excel when opening CSV files that contain Arabic characters. Try the following workaround, which I tested on the latest versions of Microsoft Excel on both Windows and macOS:
Open Excel with a blank workbook.
On the Data tab, click the From Text button (if it is not activated, make sure an empty cell is selected).
Browse to the CSV file and select it.
In the Text Import Wizard, change the File origin to "Unicode (UTF-8)".
Click Next and, under Delimiters, select the delimiter used in your file, e.g. comma.
Click Finish and select where to import the data.
The Arabic characters should now display correctly.

Just use encoding='utf-8-sig' instead of encoding='utf-8', as follows:
import csv

data = "اردو"
# utf-8-sig writes a byte order mark (BOM), which Excel uses to detect UTF-8
with open('example.csv', 'w', encoding='utf-8-sig', newline='') as fh:
    writer = csv.writer(fh)
    writer.writerow([data])
It worked on my machine.
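
To see what `utf-8-sig` changes, the sketch below writes a small CSV and checks its first three bytes. It assumes Python 3; the helper names `write_csv_with_bom` and `starts_with_bom` and the file name `bom_demo.csv` are made up for illustration. The only difference from plain `utf-8` is the leading byte order mark (EF BB BF), which Excel appears to use as the hint that the file is UTF-8.

```python
import codecs
import csv
import os
import tempfile


def write_csv_with_bom(path, rows):
    """Write rows to a CSV file encoded as UTF-8 with a leading BOM."""
    # newline='' is what the csv module documentation recommends for writer files
    with open(path, "w", encoding="utf-8-sig", newline="") as fh:
        csv.writer(fh).writerows(rows)


def starts_with_bom(path):
    """Return True if the file begins with the UTF-8 byte order mark."""
    with open(path, "rb") as fh:
        return fh.read(3) == codecs.BOM_UTF8


# Hypothetical demo file in a throwaway directory
demo_path = os.path.join(tempfile.mkdtemp(), "bom_demo.csv")
write_csv_with_bom(demo_path, [["اردو"], ["مرحبا"]])
print(starts_with_bom(demo_path))
```

Reading the file back with `encoding='utf-8-sig'` strips the mark again, so the round trip through Python is lossless.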

The only solution I have found for saving Arabic text to an Excel file from Python is to use pandas and write an .xlsx file instead of a CSV; xlsx works far better here. Here is the code I put together, which worked for me:
import pandas as pd

def turn_into_csv(data, csver):
    # Despite its name, this writes an .xlsx file; csver is the output path without extension
    ids = []
    texts = []
    for each in data:
        texts.append(each["full_text"])
        ids.append(str(each["id"]))
    df = pd.DataFrame({'ID': ids, 'FULL_TEXT': texts})
    # The context manager saves and closes the workbook on exit.
    # (writer.save() and the encoding argument of to_excel were removed in pandas 2.0.)
    with pd.ExcelWriter(csver + '.xlsx', engine='xlsxwriter') as writer:
        df.to_excel(writer, sheet_name='Sheet1')

The fastest way, after saving the .csv file from Python:
Open the .csv file in Notepad++.
From the Encoding menu, choose UTF-8-BOM.
Click Save As and save it under the same name with the .csv extension (e.g. data.csv), keeping the file type as it is (.txt).
Reopen the file with Microsoft Excel.
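
The same re-encoding can be scripted instead of done by hand. This is a minimal sketch, assuming the source file is already valid UTF-8; the function name `add_utf8_bom` and the file names are hypothetical. It copies the bytes unchanged and only prepends the byte order mark when it is missing.

```python
import codecs
import os
import tempfile


def add_utf8_bom(src_path, dst_path):
    """Copy a UTF-8 file to dst_path, prepending a BOM if it has none."""
    with open(src_path, "rb") as src:
        data = src.read()
    if not data.startswith(codecs.BOM_UTF8):
        data = codecs.BOM_UTF8 + data
    with open(dst_path, "wb") as dst:
        dst.write(data)


# Demo with throwaway files (names are placeholders)
tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "data.csv")
bom_path = os.path.join(tmp_dir, "data_bom.csv")
with open(plain_path, "w", encoding="utf-8", newline="") as fh:
    fh.write("id,text\r\n1,مرحبا\r\n")
add_utf8_bom(plain_path, bom_path)
```

Running the function on a file that already has the mark leaves the content as it is, so it is safe to apply twice.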

Excel is known to have an awful CSV import system. Long story short: if you import a CSV file on the same system that just exported it, it will work smoothly. Otherwise, the CSV file is expected to use the Windows system encoding and delimiter.
A rather awkward but robust approach is to use LibreOffice or Oracle OpenOffice. Both are far behind Excel on every feature except CSV handling: they let you specify the delimiter and optional quoting character along with the encoding of the CSV file, and you can save the resulting file as xlsx.

Although my CSV file was already encoded as UTF-8, explicitly saving it again with Notepad resolved the issue.
Steps:
Open your CSV file in Notepad.
Click File --> Save as...
In the "Encoding" drop-down, select UTF-8.
Rename your file using the .csv extension.
Click Save.
Reopen the file with Excel.

Related

exporting to csv converts text to date

From Python I want to export a dataframe to CSV format.
The dataframe contains two columns like this
So when I write this:
df['NAME'] = df['NAME'].astype(str) # or .astype('string')
df.to_csv('output.csv',index=False,sep=';')
The Excel output in CSV format returns this:
and reads the value "MAY8218" as the date "may-18", while I want it to be read as "MAY8218".
I've tried many approaches, but none of them works. I don't want an alternative such as putting quotation marks to the left and right of the value.
Thanks.

If you want to export the dataframe for use in Excel, just export it as xlsx. It works for me and keeps the value as a string in its original format.
df.to_excel('output.xlsx',index=False)

The CSV format is a text format. The file contains no hint about the type of each field. The problem is that Excel has the worst possible support for CSV files: it assumes that CSV files always use its own conventions when you try to read one. In short, an Excel installation can only correctly read what it has written itself...
That means you cannot prevent Excel from interpreting the CSV data the way it wants, at least when you open a CSV file directly. Fortunately you have other options:
Import the CSV file instead of opening it. This way you get options to configure how the file should be processed.
Use LibreOffice Calc for processing CSV files. LibreOffice is a little behind Microsoft Office on most points, except for CSV file handling, where its support is excellent.
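
To make the "no type hints" point concrete, here is a small sketch using only Python's standard csv module. The `VALUE` column and its content are invented for illustration. It writes a semicolon-delimited row and reads it back: the file holds the literal characters MAY8218 and nothing else, so any date conversion happens in the program that opens the file, not in the file.

```python
import csv
import io

# Write a tiny semicolon-delimited CSV into memory
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=";")
writer.writerow(["NAME", "VALUE"])
writer.writerow(["MAY8218", "42"])
raw_text = buffer.getvalue()

# Read it back: every field comes out as a plain string
rows = list(csv.reader(io.StringIO(raw_text), delimiter=";"))
print(repr(raw_text))
print(rows)
```

Even the number in the second column comes back as a string, because the format itself carries no types.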

Python - trying to import/open incorrectly formatted .xls file

I'm trying to write some Python code that needs to take data from an .xls file created by another application (outside of my control). I've tried using pandas and xlrd, and neither is able to open the file. I get these error messages:
"Excel file format cannot be determined, you must specify an engine manually." using Pandas.
"Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\t'" using xlrd
I think it has to do with the way the file is exported from the program that creates it. When opened directly through Excel, I get the error message "The file format and extension don't match". However, you can ignore this message and the file opens in a usable format and can be edited and all of the expected values are in the right cells etc. Interestingly, when I go to save the file in Excel, the default option that comes up is a webpage.
Currently I have a workaround: I can open the file in Excel, save it as a .csv, and then read it into Python as a CSV. This does have to be done through Excel, though; if I just change the file extension to .csv, the resulting file is garbage.
However, ideally I would like to avoid the user having to do anything manually. I would greatly appreciate any suggestions on how this might be possible (e.g. can I 'open' the file in Excel and save it through Excel using Python commands?) or on any packages or commands I can use to open or fix badly formatted .xls files.
Cheers!
P.S. I'm pretty new to Python and only have experience in R otherwise so my current knowledge is quite limited, apologies in advance!

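Try this:
from pathlib import Path
import pandas as pd

file_path = Path(filename)  # filename: path to the downloaded file
# Note: the openpyxl engine reads .xlsx-style files, not the legacy binary .xls format
df = pd.read_excel(file_path, engine='openpyxl')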
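
Before picking an engine, it may help to check what the file really is. The symptoms in the question (xlrd finding b'\r\n\t' where it expects a BOF record, and Excel offering "web page" as the save format) suggest, though this is only a guess, that the file is not a binary spreadsheet at all. The sketch below is a hypothetical helper, `sniff_spreadsheet_format`, that classifies a file by its first bytes using the well-known zip and OLE2 signatures; the category names are invented for illustration.

```python
import codecs

ZIP_MAGIC = b"PK\x03\x04"                         # zip container, used by .xlsx
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # OLE2 container, used by legacy .xls


def sniff_spreadsheet_format(head: bytes) -> str:
    """Guess the container format from the first bytes of a file."""
    if head.startswith(ZIP_MAGIC):
        return "xlsx"
    if head.startswith(OLE2_MAGIC):
        return "xls"
    # Skip an optional UTF-8 BOM and leading whitespace, then look for markup
    text_start = head[3:] if head.startswith(codecs.BOM_UTF8) else head
    if text_start.lstrip().startswith(b"<"):
        return "markup"
    return "unknown"


def sniff_file(path) -> str:
    """Read the first 64 bytes of a file and classify them."""
    with open(path, "rb") as fh:
        return sniff_spreadsheet_format(fh.read(64))


print(sniff_spreadsheet_format(b"\r\n\t<html><table>"))
```

If the result is "markup", `pandas.read_html` may be worth trying (it needs an HTML parser such as lxml installed); if it is "unknown", the file may be delimited text that `pandas.read_csv` can handle with the right separator.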

Any way to stop auto reformatting of data in excel

I am asking a follow up question from here (File downloaded is different from what is on server).
I have datetime values in a CSV file that are getting reformatted.
My CSV has data like this: 1-Jan-15,1-Feb-15,1-Mar-15.
But the reformatted CSV looks like Jan-15, Feb-15, Mar-15...
Is there any way to stop the automatic reformatting of data?

Instead of opening the .csv file directly in Excel, open a new blank workbook in Excel and use Get Data from Text (under the Data tab of the Ribbon) to import the .csv file.
This will open the Text Import Wizard, which has 3 total screens.
On Step 1, choose Delimited.
On Step 2, choose Comma.
And on Step 3, highlight all columns with dates in them and choose Text.
Click Finish.
The General format (which is also what happens by default if you open a .csv file in Excel directly) will recognize the dates as being dates and reformat them according to your locale settings. By instructing Excel to interpret those columns as text, they will not be recognized as dates and therefore left as they are.

Convert Excel zip file content to actual Excel file?

I am using the cmis package available in Python to download a document from a FileNet repository. I am using the getContentStream method available in the package. However, it returns file content that begins with 'Pk' and ends in 'PK'. When I googled it, I learned that this is Excel zip package content. Is there a way to save the content into an Excel file? I should be able to open the downloaded file in Excel. I am using the code below, but I get the error "a bytes-like object is required, not 'str'". I noticed that the type of result is string.io.
# export the result
result = testDoc.getContentStream()
outfile = open('sample.xlsx', 'wb')
outfile.write(result.read())
result.close()
outfile.close()

Hi there and welcome to Stack Overflow. There are a few things I noticed about your post.
To address the error you are getting directly: you opened outfile in binary mode, but result.read() returns a Unicode string, which is why you get this error. You can try encoding it before passing it to outfile.write() (e.g. outfile.write(result.read().encode())).
You can also write the Unicode string directly:
from zipfile import ZipFile

result = testDoc.getContentStream()
result_text = result.read()
# filepath: destination path of the zip file to create
with ZipFile(filepath, 'w') as zf:
    zf.writestr('filename_that_is_zipped', result_text)
Note: I am not sure what you have in your content stream, but keep in mind that an Excel file is made up of XML files zipped together. The minimum file structure you need for an Excel file is as follows:
_rels/.rels contains excel schemas
docProps/app.xml contains number of sheets and sheet names
docProps/core.xml boiler plate user info and date created
xl/workbook.xml contains sheet names rdId to workbook link
xl/worksheets/sheet1.xml (and more sheets in this folder) contains cell data for each sheet
xl/_rels/workbook.xml.rels contains sheet file locations within zipfile
xl/sharedStrings.xml if you have string only cell values
[Content_Types].xml applies schemas to file types
I recently went through piecing together an excel file from scratch, if you want to see the code check out https://github.com/PydPiper/pylightxl
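
The part list above can be turned into a quick sanity check before saving. This is a rough sketch, not a full validation: `EXPECTED_PARTS` is a subset I picked from the list, and the helpers `missing_xlsx_parts`, `save_xlsx_bytes`, and `build_placeholder_package` are hypothetical names. It assumes the downloaded content is available as bytes, in which case the package is already zipped and can be written unchanged in binary mode.

```python
import io
import zipfile

# Parts assumed essential, taken from the list above (an assumption, not a spec)
EXPECTED_PARTS = {"[Content_Types].xml", "_rels/.rels", "xl/workbook.xml"}


def missing_xlsx_parts(data: bytes) -> set:
    """Return the expected parts that are absent from the zip package."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return EXPECTED_PARTS - set(zf.namelist())


def save_xlsx_bytes(data: bytes, path: str) -> None:
    """Write an already-zipped package to disk without altering it."""
    if not data.startswith(b"PK"):
        raise ValueError("content does not look like a zip package")
    with open(path, "wb") as fh:
        fh.write(data)


def build_placeholder_package(names) -> bytes:
    """Build an in-memory zip with dummy XML parts (Excel cannot open this)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        for name in names:
            zf.writestr(name, "<placeholder/>")
    return buffer.getvalue()


complete = build_placeholder_package(sorted(EXPECTED_PARTS))
partial = build_placeholder_package(["xl/workbook.xml"])
print(missing_xlsx_parts(complete), missing_xlsx_parts(partial))
```

If the stream yields text instead of bytes, the content has probably already been decoded somewhere along the way, and re-encoding it may not restore the original bytes.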

Download Excel Spreadsheet Python

I am still pretty new to Python, so perhaps I am missing something obvious. I am trying to download a simple spreadsheet from Google Docs, save the file, and open it in Excel. When I did a test run with text files instead of excel files, it worked fine. However, using xls and xlsx, when excel opens the newly downloaded file, it says that the data is corrupted. How can I fix this?
import urllib2
print "Downloading..."
myfile = urllib2.urlopen("https://docs.google.com/spreadsheet/pub?key=0AoJYUIVnE85odGZxVHkybGxYRXF1TFpuQXdqZlJwNXc&output=xls")
output = open('C:\\Users\\Lucas\\Desktop\\downloaded.xlsx', 'w')
output.write(myfile.read())
output.close()
print "Done"
import subprocess
subprocess.call(['C:\\Program Files (x86)\\Microsoft Office\\Office14\\EXCEL.exe', 'C:\\Users\\Lucas\\Desktop\\downloaded.xlsx'])

You would want to open the file in 'wb' mode; take a look at the documentation for open().

You're writing the file in plain-text, ascii mode. Excel documents are not plain text: under this assumption, you'll mishandle the content.
To use data as-is, with zero assumptions about its format, you use binary mode. Here:
output = open('C:\\Users\\Lucas\\Desktop\\downloaded.xlsx', 'wb')
Notice the 'b' flag at the end.
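
The code in the question is Python 2 (urllib2, print statements). For reference, a Python 3 version of the same idea might look like the sketch below; the function name `download_binary` is made up, and the URL and destination path are whatever the caller supplies. The response is streamed straight into a file opened with 'wb', so the bytes are never decoded or newline-translated.

```python
import shutil
import urllib.request


def download_binary(url, dest_path):
    """Stream the response body to dest_path without decoding it."""
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
```

A call would look like `download_binary(spreadsheet_url, 'downloaded.xlsx')`, where `spreadsheet_url` stands for the export link from the question.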
