Export DataFrame from Python to CSV - python

I have a data frame I created with my original data appended with the topics from topic modeling. I keep running into errors when trying to export the data table into csv.
I've tried both csv module and pandas but get errors from both.
The data table has 1765 rows so writing the file row by row is not really an option.
When using pandas, the most common errors are
DataFrame constructor not properly called!
and
function object has no attribute 'to_csv'
Code used:
import pandas as pd
data = (before.head)
df = pd.DataFrame(before.head)
df.to_csv (r'C:\Users\***\Desktop\beforetopics.csv', index = False, header=True)
print (df)
For the CSV module, there have been several errors such as
iterable expected, not method
Basically, how do I export this table (screenshot attached) into a csv file?

What is the command that you're trying to run?
Try this:
dataframe.to_csv('file_name.csv')
Or if it is the unicode error that you're coming across,
Try this:
dataframe.to_csv('file_name.csv', header=True, index=False, encoding='utf-8')
Since your dataframe's name is before,
Try this:
before.to_csv('file_name.csv', header=True, index=False, encoding='utf-8')

You can use the to_csv function:
before.to_csv('file_name.csv')
If you need extra options, check the pandas documentation for DataFrame.to_csv.
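Both pandas errors in the question are consistent with passing the method object `before.head` (no parentheses) instead of a DataFrame. Here is a minimal sketch, assuming `before` is already a DataFrame; the sample data and the output location are made up for illustration:

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the asker's `before` DataFrame.
before = pd.DataFrame({"text": ["first doc", "second doc"], "topic": [0, 1]})

# Without parentheses, `before.head` is a bound method, not data.
print(type(before.head))

# Calling the method returns a DataFrame holding the first rows.
preview = before.head()

# Write the whole table in one call; no row-by-row loop is needed.
out_path = os.path.join(tempfile.gettempdir(), "beforetopics.csv")
before.to_csv(out_path, index=False, header=True, encoding="utf-8")

# Read the file back to confirm the round trip.
round_trip = pd.read_csv(out_path)
print(round_trip)
```

If only the first few rows are wanted in the file, `before.head().to_csv(...)` should work the same way.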

Related

Cannot read content in CSV File in Pandas

I have a dataset from the State Security Department in my county that has some problems.
I can't read the records at all from the file that is made available in CSV, bringing up only empty records. When I convert the file to XLSX it does get read.
I would like to know if there is any possible solution to the above problem.
The dataset is available at: here or here.
I tried the code below, but I only get nulls, except for the first row in the first column:
df = pd.read_csv('mensal_ss.csv', sep=';', names=cols, encoding='latin1')
Thank you!
If you use utf-16 as the encoding, it seems to work. Note, however, that the year rows complicate the parsing, so you may need some extra manipulation of the CSV to get around that, depending on what you want to do with the data:
df = pd.read_csv('mensal_ss.csv', sep=';', encoding='utf-16')
Try using 'utf-16-le':
import pandas as pd
df = pd.read_csv('mensal_ss.csv', sep=';', encoding='utf-16-le')
print(df.head())
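To illustrate why the encoding matters, here is a small sketch that writes a semicolon-separated file as UTF-16 and reads it back. The file name and contents are invented for the example and only stand in for the real `mensal_ss.csv`; checking the first two bytes for a byte-order mark is one rough way to guess that a file is UTF-16.

```python
import codecs
import os
import tempfile

import pandas as pd

# Hypothetical sample file, saved as UTF-16 (Python writes a BOM first).
path = os.path.join(tempfile.gettempdir(), "utf16_sample.csv")
with open(path, "w", encoding="utf-16", newline="") as f:
    f.write("municipio;ocorrencias\nSão Paulo;10\nCampinas;3\n")

# A UTF-16 file usually starts with a two-byte byte-order mark.
with open(path, "rb") as f:
    has_utf16_bom = f.read(2) in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print(has_utf16_bom)

# Reading with the matching encoding and separator gives proper columns.
df = pd.read_csv(path, sep=";", encoding="utf-16")
print(df)
```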

Importing CSV Data and formatting in Excel via Python

I am importing CSV based data in an Excel spreadsheet via Python. I would like to know if it is possible to import the data and divide it in several columns (like we would do via the importing menu under DATA in Excel).
So far, I have read my CSV into a pandas DataFrame and exported it to Excel, but all my data is clustered in one column:
df = pd.read_csv(r'C:\Users\Contractuel\Desktop\Test\Candiac_TypeLum_UTF8.csv')
writer = pd.ExcelWriter('TypeLum_TEST.xlsx')
df.to_excel(writer, index=False)
writer.save()
Thanks!
The read_csv function takes a sep= argument that tells pandas which character separates the fields. You probably need to set it to match the separator used in your CSV file. The default is a comma (,), but CSV files sometimes use ; or other characters as separators.
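A short sketch of the effect of `sep`, using an in-memory string in place of the real file. The column names and values are invented, and the assumption that the file uses `;` is only a guess about the asker's data:

```python
import io

import pandas as pd

# Hypothetical semicolon-separated content.
raw = "id;type;power\n1;LED;40\n2;HPS;100\n"

# With the default sep=",", each line ends up in a single column.
wrong = pd.read_csv(io.StringIO(raw))
print(wrong.shape)

# With the right separator, the fields are split into three columns.
right = pd.read_csv(io.StringIO(raw), sep=";")
print(right.shape)
print(right)
```

Once the DataFrame has separate columns, `to_excel` should write them to separate spreadsheet columns as well.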

Unable to get correct output from tsv file using pandas

I have a TSV file that I am trying to read with pandas. The first two rows of the file are of no use and need to be ignored. However, the output comes out as two columns: the first column is named Index, and the name of the second column is a random row from the file.
import pandas as pd
data = pd.read_csv('zahlen.csv', sep='\t', skiprows=2)
Please refer to the screenshot below.
The second column name is in bold black and is one of the rows from the file. Moreover, using '\t' as the delimiter does not separate the values into different columns. I am using the Spyder IDE for this. Am I doing something wrong here?
Try this:
data = pd.read_table('zahlen.csv', header=None, skiprows=2)
read_table() works like read_csv() but uses a tab as its default separator, so it suits TSV files. header=None then makes pandas treat the first remaining row as data instead of as column names.
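A minimal sketch of the combination of `skiprows` and `header=None`, using made-up tab-separated content in memory instead of `zahlen.csv`. It assumes the file really is tab-delimited; if the values still land in one column, the file is probably using some other separator.

```python
import io

import pandas as pd

# Hypothetical file: two useless lines, then tab-separated rows with no header.
raw = "junk line 1\njunk line 2\n1\t2.5\tabc\n2\t3.5\tdef\n"

# Skip the first two lines and treat every remaining row as data.
data = pd.read_csv(io.StringIO(raw), sep="\t", skiprows=2, header=None)
print(data)

# read_table defaults to a tab separator, so this should give the same result.
same = pd.read_table(io.StringIO(raw), skiprows=2, header=None)
print(data.equals(same))
```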

How to write a dataframe in pyspark having null values to CSV

I'm using the below code to write to a CSV file.
df.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").option("nullValue"," ").save("/home/user/test_table/")
When I execute it, I get the following error:
java.lang.UnsupportedOperationException: CSV data source does not support null data type.
Could anyone please help?
I had the same problem (though without the nullValue option in that command) and solved it by using the fillna method.
I also realised that fillna was not working with _corrupt_record, so I dropped that column since I didn't need it.
df = df.drop('_corrupt_record')
df = df.fillna("")
df.write.option('header', 'true').format('csv').save('file_csv')

Export Pandas data frame with text column containing utf-8 text and URLs to Excel

My Pandas data frame consists of Tweets and the metadata of each tweet (300,000 rows). Some of my colleagues need to work with this data in Excel, which is why I need to export it.
I wanted to use either .to_csv or .to_excel which are both provided by Pandas but I can't get it to work properly.
When I use .to_csv my problem is that it keeps failing in the text part of the data frame. I've played around with different separators but the file is never 100% aligned. The text column seems to contain tabs, pipe characters etc. which confuses Excel.
df.to_csv('test.csv', sep='\t', encoding='utf-8')
When I try to use .to_excel together with the xlsxwriter engine, I run into a different problem: my text column contains too many URLs (I think). xlsxwriter tries to turn these URLs into special clickable links instead of just handling them as strings. I've found some information on how to circumvent this but, again, I can't get it to work.
The following bit of code should be used to disable the function that I think is causing trouble:
workbook = xlsxwriter.Workbook(filename, {'strings_to_urls': False})
However, when using to_excel I can't seem to adjust this setting of the Workbook object before I load the data frame into the Excel file.
In short how do I export a column with wildly varying text from a Pandas data frame to something that Excel understands?
edit:
example:
#geertwilderspvv #telegraaf ach Wilders toch, nep-voorzitter van een nep-partij met maar één lid, \nzeur niet over nep-premier of parlement!
So in this case it is obviously a line break in my data. I will try to find some more examples.
edit2:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<recoveryLog xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"><logFileName>error047600_01.xml</logFileName><summary>Er zijn fouten aangetroffen in bestand C:\Users\Guy Mahieu\Desktop\Vu ipython notebook\pandas_simple.xlsx</summary><removedRecords summary="Hier volgt een lijst van verwijderde records:"><removedRecord>Verwijderde records: Formule van het onderdeel /xl/worksheets/sheet1.xml</removedRecord></removedRecords></recoveryLog>
Translation of Dutch stuff:
Errors were found in "file". Here follows a list of removed records: removed records: formula of the part /xl/worksheets/sheet1.xml
I don't think it is currently possible to pass XlsxWriter constructor options via the Pandas API, but you can work around the strings_to_urls issue as follows:
import pandas as pd
df = pd.DataFrame({'Data': ['http://python.org']})
# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('pandas_simple.xlsx', engine='xlsxwriter')
# Don't convert url-like strings to urls.
writer.book.strings_to_urls = False
# Convert the dataframe to an XlsxWriter Excel object.
df.to_excel(writer, sheet_name='Sheet1')
# Close the Pandas Excel writer and output the Excel file.
writer.close()
Update: In more recent versions of Pandas you can pass XlsxWriter constructor options to ExcelWriter() directly, so you do not need to set writer.book.strings_to_urls yourself:
writer = pd.ExcelWriter('pandas_simple.xlsx',
                        engine='xlsxwriter',
                        options={'strings_to_urls': False})
The same options dict can also turn off the conversion of strings to formulas:
writer = pd.ExcelWriter(report_file, engine='xlsxwriter',
                        options={'strings_to_urls': False,
                                 'strings_to_formulas': False})
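In newer pandas releases (1.3 and later, as far as I know) the `options` keyword shown above is replaced by `engine_kwargs`, and with pandas 2.x the old keyword is no longer accepted. A hedged sketch of the equivalent call, assuming the xlsxwriter package is installed; the sample data and the temporary output path are invented for illustration:

```python
import os
import tempfile

import pandas as pd

# Hypothetical data: a URL-like string and a string starting with "=".
df = pd.DataFrame({"Data": ["http://python.org", "=not a formula"]})
path = os.path.join(tempfile.gettempdir(), "pandas_simple.xlsx")

# engine_kwargs is forwarded to xlsxwriter.Workbook; its "options" dict
# carries the constructor options that keep strings as plain text.
with pd.ExcelWriter(
    path,
    engine="xlsxwriter",
    engine_kwargs={
        "options": {"strings_to_urls": False, "strings_to_formulas": False}
    },
) as writer:
    df.to_excel(writer, sheet_name="Sheet1", index=False)

print(os.path.getsize(path) > 0)
```

Using the writer as a context manager closes the file on exit, so no explicit save or close call is needed.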
