I've been trying to use Python and Pandas to take a csv as input, clean the dataset, and assign the output to a new csv file. One of the columns in the original csv has trademark symbols. When I export the new csv, the columns sometimes have ™ instead of just the trademark symbol, or sometimes they're turned into ™. This is how I imported the original csv and exported the new csv:
import pandas as pd
df = pd.read_csv("original_df.csv", encoding='latin1', dtype='unicode')
This is how I exported a new dataframe to csv:
df_new.to_csv('new_test_df.csv', index=False)
How do I export the string without the extra symbols (i.e. how it was in the original)?
Thanks!
Just fixed this same problem. Answer and explanation can be found here
The quick answer is to use the encoding "utf-8-sig".
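As a rough sketch of what that could look like on the export side: the DataFrame contents and the temporary path below are made up for illustration, and it assumes the strings are already correct in memory. If the original file is really UTF-8, reading it with encoding='latin1' would garble the symbol before the export ever runs, so the read encoding may need the same attention.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned data containing a trademark symbol.
df_new = pd.DataFrame({"product": ["Widget™", "Gadget™"]})

# Write to a temporary folder so the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "new_test_df.csv")

# "utf-8-sig" writes UTF-8 with a leading byte order mark (BOM),
# which Excel uses as a hint to decode the file as UTF-8.
df_new.to_csv(path, index=False, encoding="utf-8-sig")

# Check that the file starts with the BOM bytes.
with open(path, "rb") as handle:
    starts_with_bom = handle.read().startswith(b"\xef\xbb\xbf")

# Reading back with the same encoding strips the BOM again.
round_trip = pd.read_csv(path, encoding="utf-8-sig")
```

Reading the file back with the same encoding should return the original strings, which is a quick way to check the export before opening it in a spreadsheet.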
please see attached photo
here's the image
I only need to import a specific column with conditions (such as specific data found in that column). I also want to get rid of the unnecessary columns, but dropping them one by one takes too much code. What specific code or syntax is applicable?
How to get a column from pandas dataframe is answered in Read specific columns from a csv file with csv module?
To quote:
Pandas is spectacular for dealing with csv files, and the following
code would be all you need to read a csv and save an entire column
into a variable:
import pandas as pd
df = pd.read_csv(csv_file)
saved_column = df.column_name #you can also use df['column_name']
So in your case, you just save the filtered data frame in a new variable.
This means you assign newdf = data.loc[...... and then use the code snippet above to extract the column you want, for example newdf.continent.
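A minimal sketch of the whole flow, assuming made-up column names (country, continent, pop) and an in-memory CSV in place of the real file. The usecols argument of read_csv loads only the listed columns, which avoids dropping the rest afterwards; the filter condition is only an example.

```python
import io

import pandas as pd

# Stand-in for the real CSV file; the data here is invented.
csv_file = io.StringIO(
    "country,continent,pop\n"
    "Kenya,Africa,53\n"
    "Peru,South America,33\n"
    "Ghana,Africa,31\n"
)

# Load only the columns that are needed instead of dropping the others later.
data = pd.read_csv(csv_file, usecols=["country", "continent"])

# Keep the rows that satisfy the condition, stored in a new variable.
newdf = data.loc[data["continent"] == "Africa"]

# Extract a single column from the filtered frame.
saved_column = newdf["country"]
```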
I have a problem saving a pandas DataFrame to CSV. I run the code in a Jupyter notebook and everything works fine. After running the same code on a server, column values are saved to random columns…
csvPath = r''+str(pathlib.Path().absolute())+ '/products/'+brand['file_name']+'_products.csv'
productFrame.to_csv(csvPath,index=True)
I printed the DataFrame before saving and it looks as it should. After saving, I open the file and the values are mixed up.
How to make it always work in the proper way?
If you want to force the column order when exporting to csv, use
df[cols].to_csv()
where cols is a list of column names in the desired order.
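For illustration, here is a small sketch with an invented three-column frame; the column names and the in-memory buffer are assumptions, not taken from the question.

```python
import io

import pandas as pd

# Invented frame whose columns are not in the order wanted in the file.
df = pd.DataFrame({"price": [9.5, 3.0], "name": ["a", "b"], "sku": [1, 2]})

# The order the columns should have in the exported CSV.
cols = ["sku", "name", "price"]

# Selecting with the list fixes the column order before writing.
buffer = io.StringIO()
df[cols].to_csv(buffer, index=False)

lines = buffer.getvalue().splitlines()
header = lines[0]
```

to_csv also accepts a columns argument, so df.to_csv(buffer, columns=cols, index=False) should give the same result.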
I have a dataframe where one of the columns contains strings, and each string has some leading tab formatting. Below is a snippet of what it looks like:
formatted_line_items[1:3]
Out[393]: ['\t<string1>', '\t\t<string2>']
However when I write the dataframe using to_csv the formatting is lost. How can I write this to a csv file or excel file and still retain the formatting?
EDIT: I got to know that csv doesn't retain formatting so I used the pandas to_excel function but still no luck with the formatting.
Just found XlsxWriter has a set_indent function where we can specify the indentation.
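One way this might be wired up, as a sketch only: count the leading tabs of each string, strip them, and pass the count as the indent level of an XlsxWriter cell format. The helper names split_indent and write_indented are hypothetical, and the writer function assumes the XlsxWriter package is installed.

```python
def split_indent(text):
    """Return (number of leading tabs, text without those tabs)."""
    stripped = text.lstrip("\t")
    return len(text) - len(stripped), stripped


def write_indented(path, items):
    """Write each string to column A, indented by its leading-tab count."""
    # Imported here so split_indent stays usable without XlsxWriter installed.
    import xlsxwriter

    workbook = xlsxwriter.Workbook(path)
    worksheet = workbook.add_worksheet()
    for row, item in enumerate(items):
        level, text = split_indent(item)
        cell_format = workbook.add_format()
        cell_format.set_indent(level)  # indentation level of the cell text
        worksheet.write(row, 0, text, cell_format)
    workbook.close()


# The two sample values from the question.
levels = [split_indent(s) for s in ["\t<string1>", "\t\t<string2>"]]
```

Calling write_indented('items.xlsx', formatted_line_items) would then produce a workbook in which the indentation is stored as cell formatting rather than as tab characters.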
import pandas as pd
check = pd.read_csv('1.csv')
nocheck = check['CUSIP'].str[:-1]
nocheck = nocheck.to_frame()
nocheck['CUSIP'] = nocheck['CUSIP'].astype(str)
nocheck.to_csv('NoCheck.csv')
This works but while writing the csv, a value for an identifier like 0003418 (type = str) converts to 3418 (type = general) when the csv file is opened in Excel. How do I avoid this?
I couldn't find a dupe for this question, so I'll post my comment as a solution.
This is an Excel issue, not a Python error. Excel autoformats columns that look numeric, which removes leading zeros. You can "fix" this by forcing pandas to quote when writing:
import csv
# insert pandas code from question here
# use csv.QUOTE_ALL when writing CSV.
nocheck.to_csv('NoCheck.csv', quoting=csv.QUOTE_ALL)
Note that this will actually put quotes around each value in your CSV. It will render the way you want in Excel, but you may run into issues if you try to read the file some other way.
Another solution is to write the CSV without quoting and import it into Excel with that column's format set to "Text" rather than "General".
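To see what quoting does to the file itself, here is a small sketch with two made-up identifiers written to an in-memory buffer. It also shows one way to read the file back in pandas without losing the zeros, by passing dtype=str.

```python
import csv
import io

import pandas as pd

# Made-up identifiers with leading zeros, stored as strings.
nocheck = pd.DataFrame({"CUSIP": ["0003418", "0123456"]})

# QUOTE_ALL wraps every field, including the header, in double quotes.
buffer = io.StringIO()
nocheck.to_csv(buffer, index=False, quoting=csv.QUOTE_ALL)
csv_text = buffer.getvalue()
quoted_lines = csv_text.splitlines()

# pandas infers column types when reading, even for quoted fields,
# so dtype=str is requested to keep the leading zeros.
round_trip = pd.read_csv(io.StringIO(csv_text), dtype=str)
```

The zeros are present in the CSV text either way; what differs is how each reader interprets the column, so the dtype (or the import format in a spreadsheet) is worth setting explicitly.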
I have a TSV file which I am trying to read with pandas. The first two rows of the file are of no use and need to be ignored. However, the output comes out as two columns: the name of the first column is Index and the name of the second column is a random row from the csv file.
import pandas as pd
data = pd.read_csv('zahlen.csv', sep='\t', skiprows=2)
Please refer to the screenshot below.
The second column name is in bold black and is one of the rows from the file. Moreover, using '\t' as the delimiter does not separate the values into different columns. I am using the Spyder IDE for this. Am I doing something wrong here?
Try this:
data = pd.read_table('zahlen.csv', header=None, skiprows=2)
read_table() uses a tab as its default delimiter, so it suits TSV files; otherwise it behaves like read_csv(). header=None then makes the first remaining row data instead of a header.
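A quick way to check the behaviour is to run it on a small in-memory sample. The content below is invented and only mimics a file with two unusable lines followed by tab-separated rows.

```python
import io

import pandas as pd

# Invented sample: two lines to skip, then tab-separated data rows.
raw = (
    "junk line 1\n"
    "junk line 2\n"
    "1\t2.5\talpha\n"
    "2\t3.5\tbeta\n"
)

# skiprows=2 drops the first two lines; header=None keeps the next
# line as data and labels the columns 0, 1, 2.
data = pd.read_table(io.StringIO(raw), header=None, skiprows=2)

first_row = data.iloc[0].tolist()
```

If the real file still ends up in a single column, the separator is probably not a tab, and it may help to inspect a raw line with repr() to see what the delimiter actually is.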