I have a problem with saving a pandas DataFrame to csv. When I run the code in a Jupyter notebook everything works fine; after running the same code on a server, column values are saved to random columns…
import pathlib

# Build an absolute path to the output file from the current working directory.
csvPath = str(pathlib.Path().absolute()) + '/products/' + brand['file_name'] + '_products.csv'
productFrame.to_csv(csvPath, index=True)
I printed the DataFrame before saving and it looks as it should. After saving, I open the file and the values are mixed up. How can I make it always save in the proper way?
If you want to force the column order when exporting to csv, use
df[cols].to_csv()
where cols is a list of column names in the desired order.
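For example, a minimal sketch (the column names here are made up):

import pandas as pd

df = pd.DataFrame({"b": [2, 4], "a": [1, 3], "c": [5, 6]})

# Hypothetical column order; use your own column names here.
cols = ["a", "b", "c"]

# Selecting with a list returns the columns in exactly that order,
# so the csv layout no longer depends on the DataFrame's internal order.
df[cols].to_csv("products.csv", index=True)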
I was wondering why I get funny behaviour when using a csv file that has been "changed" in Excel.
I have a csv file of around 211,029 rows which I load into pandas from a Jupyter notebook.
The simplest example of a "change" I can give is clicking the filter icon in Excel, saving the file, unclicking the filter icon, and saving again (making no actual changes to the data).
When I pass my csv file through pandas, after a few filter operations, some rows go missing.
This is in contrast to doing absolutely nothing with the csv file: leaving it completely alone gives me the correct number of rows after filtering, while the "changed" file does not.
Why is this? Is it because of the number of rows in the csv file? Are we supposed to leave csv files untouched if we are planning to filter them through pandas anyway?
(As a side note I'm using Excel on a MacBook.)
Excel does not leave any file "untouched". It applies formatting to every file it opens (e.g. float values like "5.06" can be interpreted as a date and changed to "05 Jun"). Depending on the expected datatype, these rows may be displayed wrongly or go missing in your notebook.
Better use sed or awk to manipulate csv files (or a text editor for smaller files).
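If you want to see what Excel actually changed, one sketch (the filenames here are hypothetical) is to reload both versions with all type inference disabled and diff them:

import pandas as pd

# Hypothetical filenames: the untouched export and the copy re-saved by Excel.
raw = pd.read_csv("data_untouched.csv", dtype=str)
touched = pd.read_csv("data_after_excel.csv", dtype=str)

# dtype=str disables pandas' type inference, so any reformatting Excel
# applied (dates, stripped zeros, scientific notation) shows up literally.
print(len(raw), len(touched))  # compare row counts first
# compare() needs identically labeled frames of the same shape
print(raw.compare(touched).head())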
I've been trying to use Python and pandas to take a csv as input, clean the dataset, and write the output to a new csv file. One of the columns in the original csv has trademark symbols. When I export the new csv, those values sometimes come out as mojibake such as â„¢ instead of just the trademark symbol (™). This is how I imported the original csv:
import pandas as pd
df = pd.read_csv("original_df.csv", encoding="latin1", dtype="unicode")
This is how I exported a new dataframe to csv:
df_new.to_csv('new_test_df.csv', index=False)
How do I export the string without the extra symbols (i.e. how it was in the original)?
Thanks!
Just fixed this same problem. The answer and an explanation can be found here.
The quick answer is to use encoding "utf-8-sig".
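For instance, a sketch (this assumes the source file is really UTF-8 encoded; filenames are taken from the question):

import pandas as pd

# Read with the encoding the file was really written in; latin1 never
# errors, but it silently mis-decodes UTF-8 bytes such as the ™ symbol.
df = pd.read_csv("original_df.csv", encoding="utf-8", dtype="unicode")

# utf-8-sig writes a BOM, which makes Excel open the file as UTF-8
# instead of guessing a legacy codepage.
df.to_csv("new_test_df.csv", index=False, encoding="utf-8-sig")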
I have multiple scripts, and each of them has a DataFrame. I want to export one column from each script/DataFrame into a single csv.
I can create a csv from my "first" script with one column:
Vorhersagen = pd.DataFrame(columns=["solar_prediction"])
Vorhersagen["solar_prediction"] = Erzeugung["erzeugung_predicted"]
Vorhersagen.to_csv(r"H:/.../Vorhersagen_2017.csv")
Now I have a csv (called "Vorhersagen_2017") with the column "solar_prediction". But how can I add another column (from another script) to the same csv as a second column? (The columns have the same length.)
If I understood correctly, you want to update the csv file by running different scripts. If that is the case, I would just read the file, append the new column, and save the file again. Something like:
df = pd.read_csv('Vorhersagen_2017.csv', ...)
df2 = pd.concat([df, df1], axis=1)  # df1 is the dataframe created by your second script
df2.to_csv(...)
Then you would have to run this iteratively in all your scripts.
However, I think it is more efficient to import all your scripts as modules in a main script and run them from there. From the main script you could easily concatenate the various columns and save them to a csv at once.
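A minimal sketch of that main-script approach (the module and column names are made up):

import pandas as pd

# Hypothetical modules: each script exposes its forecast as a pandas Series.
from solar_script import solar_prediction
from wind_script import wind_prediction

# axis=1 puts each Series in its own column; rows are aligned on the index.
vorhersagen = pd.concat(
    [solar_prediction.rename("solar_prediction"),
     wind_prediction.rename("wind_prediction")],
    axis=1,
)
vorhersagen.to_csv("Vorhersagen_2017.csv")  # path shortened; use your full output path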
I'm importing an .xlsx file with pd.read_excel(). I received the data as a csv file and used Excel to separate it by comma, so I get a proper .xlsx file with columns etc. Six of the dataframe columns have a number as header (e.g. 5030, 5031, ...). When I try to change those column names with df = df.rename(columns={...}), it does not work. df["5030"] does not work either; it throws KeyError: '5030'. The same code works for columns that have regular, non-integer names.
However, when I import the raw .csv file with pd.read_csv(), all of the code above does work: I can simply rename the column names. The dataframes look exactly the same when imported with either technique, but apparently something is different.
It is not a serious issue, since I can change the column names to non-integers manually in Excel, but I'm very curious what the underlying "problem" is here and how these two functions behave differently.
Thanks!
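For what it's worth, a sketch of what is probably happening (the filename is hypothetical): read_excel() keeps the header cells' types, so numeric headers become integer labels, while read_csv() parses the header row as text:

import pandas as pd

df = pd.read_excel("data.xlsx")  # hypothetical filename

print(df.columns)  # e.g. Index([5030, 5031, ...], dtype='int64') - integers, not strings
# df["5030"] raises KeyError because the label is the int 5030, not the string "5030".

df = df.rename(columns={5030: "col_5030"})  # works when the key is an int
df.columns = df.columns.astype(str)         # or normalize all headers to strings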
I have a tsv file which I am trying to read with pandas. The first two rows of the file are of no use and need to be skipped. However, the output I get has only two columns: the first column is named Index and the second column is named after a random row from the file.
import pandas as pd
data = pd.read_csv('zahlen.csv', sep='\t', skiprows=2)
The second column's name is actually one of the rows from the file. Moreover, using '\t' as the delimiter does not separate the values into different columns. I am using the Spyder IDE for this. Am I doing something wrong here?
Try this:
data = pd.read_table('zahlen.csv', header=None, skiprows=2)
read_table() is better suited to tsv files, while read_csv() is a more specialized version of it that defaults to comma separation. With header=None, the first row is treated as data instead of being used as the header.
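An equivalent sketch using read_csv(), with hypothetical column names assigned afterwards:

import pandas as pd

# Same result with read_csv; the tab separator must be passed explicitly.
data = pd.read_csv("zahlen.csv", sep="\t", header=None, skiprows=2)

# With header=None pandas labels the columns 0, 1, 2, ...;
# hypothetical names can be set afterwards (or via the names= parameter).
data.columns = [f"col_{i}" for i in range(data.shape[1])]
print(data.head())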