I am importing CSV-based data into an Excel spreadsheet via Python. I would like to know whether it is possible to import the data and split it into several columns (as we would do via the import menu under DATA in Excel).
So far, I have read my CSV into a pandas DataFrame and written it to Excel, but all my data ends up in one column:
import pandas as pd
df = pd.read_csv(r'C:\Users\Contractuel\Desktop\Test\Candiac_TypeLum_UTF8.csv')
writer = pd.ExcelWriter('TypeLum_TEST.xlsx')
df.to_excel(writer, index=False)
writer.close()  # writer.save() was removed in pandas 2.0
Thanks!
The read_csv method takes a sep= argument that tells pandas which character separates the fields. You probably need it to specify the separator used in your CSV file. The default is a comma (,), but CSV files sometimes use ; or another character as the separator.
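As a minimal sketch of the effect of sep=, the snippet below parses the same text twice. The semicolon-separated sample and its column names are made up for illustration; the real file may use a different separator.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated content standing in for the real file.
raw = "id;type;power\n1;LED;40\n2;HPS;100\n"

# Default separator (comma): each line is read as one single field.
one_column = pd.read_csv(io.StringIO(raw))

# Explicit separator: the fields are split into separate columns.
split = pd.read_csv(io.StringIO(raw), sep=";")

print(one_column.shape)
print(split.columns.tolist())
```

Writing the second DataFrame with to_excel should then give one Excel column per field.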
Is there a method I can use to output the inferred schema on a large CSV using pandas?
In addition, is there any way to have it tell me, along with each type, whether the column is nullable/blank based on the CSV?
File is about 500k rows with 250 columns.
With my new job, I'm constantly being handed CSV files with zero format documentation.
Is it necessary to load the whole CSV file? If so, you could use the read_csv function once you know the separator (for example, by running cat on the file to inspect it). Then call .info() on the result:
df = pd.read_csv(path_to_file,...)
df.info()
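If a table is easier to work with than the printed output of .info(), something like the following sketch could summarise the inferred type and null count per column. The sample data and the name schema are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical sample standing in for the real 500k-row file.
sample = io.StringIO("id,name,score\n1,alice,3.5\n2,,4.0\n3,carol,\n")

df = pd.read_csv(sample)

# One row per column: inferred dtype, whether any value is missing, null count.
schema = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "nullable": df.isna().any(),
    "null_count": df.isna().sum(),
})
print(schema)
```

For a very large file, passing nrows= to read_csv would give a faster but only approximate answer, since rows outside the sample could still contain blanks or other types.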
My Python program converts Excel files (.xlsx) into a CSV file using pandas' read_excel and to_csv functions, and at some point in the future the CSV is converted back into an Excel file. Preserving the data is fine, but of course all of the formatting and styling is lost. So I could use some help capturing that information so I can reapply it after converting the CSV back into an Excel file.
import pandas as pd
import xlsxwriter
EXCEL_PATH_FROM = r'C:\absolute\path\to\excel.xlsx'
EXCEL_PATH_TO = r'C:\absolute\path\to\other\excel.xlsx'
CSV_PATH = r'C:\absolute\path\to\csv.csv'
# read excel and convert to csv
def saveData():
    read_excel = pd.read_excel(EXCEL_PATH_FROM)
    print("writing csv...")
    read_excel.to_csv(CSV_PATH, index=None, header=True)

# get csv data and import that data into an excel file
def createFromData():
    csv = pd.read_csv(CSV_PATH)
    excel = pd.ExcelWriter(EXCEL_PATH_TO, engine='xlsxwriter')
    csv.to_excel(excel, index=None)
    excel.close()  # ExcelWriter.save() was removed in pandas 2.0
Some ideas I had were to save the Excel file as XML and insert format and style information as attributes, or to create both a CSV and an XML file from the Excel file (one for data and one for styling). One problem I have is figuring out how to access that information.
Are there currently any packages that support Python 3 (I am using 3.8) that could help simplify this process? I dug through openpyxl's documentation; it has some stylesheet classes, but I don't think they are meant to be used directly, and I couldn't figure out how to use them.
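One possible way to get at the styling, sketched here under the assumption that openpyxl is installed, is to read the per-cell style attributes (font, number_format) rather than the stylesheet classes. The in-memory workbook and the styles dictionary are illustrative only; only two attributes are captured.

```python
import io
from openpyxl import Workbook, load_workbook
from openpyxl.styles import Font

# Build a tiny styled workbook in memory (stand-in for the real .xlsx file).
wb = Workbook()
ws = wb.active
ws["A1"] = "Total"
ws["A1"].font = Font(bold=True)
ws["B1"] = 0.25
ws["B1"].number_format = "0.00%"
buffer = io.BytesIO()
wb.save(buffer)
buffer.seek(0)

# Reload it and collect a few style attributes per cell coordinate.
styles = {}
for row in load_workbook(buffer).active.iter_rows():
    for cell in row:
        styles[cell.coordinate] = {
            "bold": bool(cell.font.bold),
            "number_format": cell.number_format,
        }
print(styles)
```

A dictionary like this could be dumped to a side file next to the CSV and reapplied when the Excel file is rebuilt, which is close to the "one file for data, one for styling" idea.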
I have a data frame I created with my original data appended with the topics from topic modeling. I keep running into errors when trying to export the data table into csv.
I've tried both csv module and pandas but get errors from both.
The data table has 1765 rows so writing the file row by row is not really an option.
When using pandas, the most common errors are
DataFrame constructor not properly called!
and
function object has no attribute 'to_csv'
Code used:
import pandas as pd
data = (before.head)
df = pd.DataFrame(before.head)
df.to_csv (r'C:\Users\***\Desktop\beforetopics.csv', index = False, header=True)
print (df)
For the CSV module, there have been several errors such as
iterable expected, not method
Basically, how do I export this table (screenshot attached) into a csv file?
What is the command that you're trying to run?
Try this:
dataframe.to_csv('file_name.csv')
Or, if you are running into a Unicode error, try this:
dataframe.to_csv('file_name.csv', header=True, index=False, encoding='utf-8')
Since your DataFrame is named before, try this:
before.to_csv('file_name.csv', header=True, index=False, encoding='utf-8')
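A likely cause of the errors in the question is that before.head, written without parentheses, refers to the method itself rather than to a table. The sketch below uses a made-up two-row DataFrame named before to show the difference. to_csv is called without a path so that it returns the CSV text instead of writing a file.

```python
import pandas as pd

# Hypothetical stand-in for the asker's DataFrame named `before`.
before = pd.DataFrame({"text": ["a", "b"], "topic": [0, 1]})

# `before.head` (no parentheses) is a method object, not data.
is_method = callable(before.head)

# `before.head()` returns a DataFrame with the first rows.
preview = before.head()

# Export the whole table; with no path, to_csv returns the CSV as a string.
csv_text = before.to_csv(index=False, header=True)
print(csv_text)
```

Passing a file path as the first argument writes the same text to disk.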
You can use the to_csv function:
before.to_csv('file_name.csv')
If you need extra options, see the pandas documentation for DataFrame.to_csv.
The code below simply reads in an Excel file, stores it as a DataFrame, and writes the DataFrame back into an Excel file. When I open the output file in Excel, the columns (dates, numbers) are not the same: some are text, some are numbers, etc.
import pandas as pd
df = pd.read_csv("test.csv", encoding = "ISO-8859-1", dtype=object)
writer = pd.ExcelWriter('outputt.xlsx', engine='xlsxwriter')
df.to_excel(writer, index = False, sheet_name='Sheet1') #drop the index
writer.close()  # writer.save() was removed in pandas 2.0
Is there a way to have the column types (as defined in the initial file) be preserved or revert back to the datatypes when the file was read in?
You are reading in a CSV file, which is not the same as an Excel file. You can open a CSV file with Excel on Windows, but the encoding is different when the file is saved. You can format cells according to the XlsxWriter specifications.
However, it is important to note that XlsxWriter cannot format cells that already have a format applied, such as the header or index, or cells containing dates or datetime objects. If you have multiple data types in a single column, that will also be problematic, as pandas will then default that column to object. The type of an "object" item is inferred on output, so again it is dynamically assigned as a "best guess".
When you read your CSV, you should specify the types if you want them to be maintained. Otherwise they are assigned dynamically: without explicit dtypes, pandas infers each column's type from the values it parses, and with dtype=object, as in your code, every column is kept as generic objects.
Change the line where you read in to include dtypes and they will be preserved in output. I am going to assume your columns have headers "ColumnA", "ColumnB", "ColumnC":
import pandas as pd
from datetime import datetime
df = pd.read_csv("test.csv", encoding="ISO-8859-1",
                 dtype={'ColumnA': int,
                        'ColumnB': float,
                        'ColumnC': str})
Let's use "ColumnC" as an example of a date column. I like to first read dates in as strings, then enforce the formatting I want. So you could add this:
df['ColumnC'] = pd.to_datetime(df['ColumnC']).dt.strftime('%m/%d/%Y')
# date would look like: 06/08/2016, but you can look at other formatting for dt.strftime
This will ensure specific types in output. Further formatting can be applied such as the number of decimals in a float, including percents in output by following guides here.
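A small self-contained sketch of this approach follows. The in-memory CSV and its ISO-style dates are assumptions made for illustration; the column names follow the example above.

```python
import io
import pandas as pd

# Hypothetical CSV content using the assumed headers ColumnA, ColumnB, ColumnC.
raw = "ColumnA,ColumnB,ColumnC\n1,2.5,2016-06-08\n2,3.0,2016-06-09\n"

# Declare the type of every column up front; dates are read as plain strings.
df = pd.read_csv(
    io.StringIO(raw),
    dtype={"ColumnA": int, "ColumnB": float, "ColumnC": str},
)

# Parse the strings into real datetimes.
df["ColumnC"] = pd.to_datetime(df["ColumnC"], format="%Y-%m-%d")

# Optional display form; note that strftime produces strings again.
formatted = df["ColumnC"].dt.strftime("%m/%d/%Y")

print(df.dtypes)
print(formatted.tolist())
```

Keeping the column as datetimes lets the Excel writer store real dates, while the strftime result would be written as text.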
My advice if you have columns with multiple data types: Don't. This is unorganized and makes use cases much more complex for downstream applications. Spend more time organizing data on the front end so you have less headache on the back end.
I have a TSV file which I am trying to read with pandas. The first two rows of the file are of no use and need to be ignored. However, the output comes out as two columns: the first column is named Index and the second column is named after a random row from the CSV file.
import pandas as pd
data = pd.read_csv('zahlen.csv', sep='\t', skiprows=2)
Please refer to the screenshot below.
The second column name is in bold black and is one of the rows from the file. Moreover, using '\t' as the delimiter does not separate the values into different columns. I am using the Spyder IDE for this. Am I doing something wrong here?
Try this:
data = pd.read_table('zahlen.csv', header=None, skiprows=2)
read_table() uses a tab as its default separator, which suits TSV files; read_csv() is the same parser with a comma as the default. header=None then makes pandas treat the first remaining row as data instead of as the header.
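To illustrate, here is a minimal sketch on made-up tab-separated text with two junk lines at the top. It assumes the real file is genuinely tab-separated; if the values still land in one column, the file probably uses a different separator.

```python
import io
import pandas as pd

# Hypothetical TSV content: two unwanted lines, then tab-separated rows.
raw = "junk line 1\njunk line 2\n1\t10\t100\n2\t20\t200\n"

# read_table defaults to sep="\t"; skip the first two lines, no header row.
data = pd.read_table(io.StringIO(raw), header=None, skiprows=2)

# The same call expressed with read_csv and an explicit tab separator.
same = pd.read_csv(io.StringIO(raw), sep="\t", header=None, skiprows=2)

print(data)
```

With header=None the columns are simply numbered 0, 1, 2, and names can be assigned afterwards via the names= argument or data.columns.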