Python: Using Pandas library. How to keep quotes on text? - python

I'm using the following Python code with the Pandas library. The purpose of the code is to join 2 CSV files, and it works as expected. In the CSV files all the values are wrapped in "". When using the Pandas library they disappear. I wonder what I can do to keep them? I have read the documentation and tried lots of options but can't seem to get it right.
Any help is much appreciated.
Code:
import pandas
csv1 = pandas.read_csv('WS-Produktlista-2015-01-25.csv', quotechar='"',comment='"')
csv2 = pandas.read_csv('WS-Prislista-2015-01-25.csv', quotechar='"', comment='"')
merged = csv1.merge(csv2, on='id')
merged.to_csv("output.csv", index=False)
Instead of getting a line like this:
"1","Cologne","4711","4711","100ml",
I'm getting:
1,Cologne,4711,4711,100ml,
EDIT:
I have now found the problem. My files contain a header with 16 columns, and each data line contains 16 values separated by ",".
I just found that some lines contain values within "" that themselves contain ",". This confuses the parser: instead of the expected 15 commas, it finds 18. One example below:
"23210","Cosmetic","Lancome","Eyes Virtuose Palette Makeup",**"7,2g"**,"W","Decorative range","5x**1,2**g Eye Shadow + **1,2**g Powder","http://image.jpg","","3660732000104","","No","","1","1"
How can I make the parser ignore the commas inside quoted values?
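A likely fix, sketched below on two small stand-in files (the file names and columns here are invented for illustration): with quotechar='"', pandas already treats commas inside quoted fields as data rather than separators, so the comment='"' argument in the code above is what breaks the parsing and should be dropped; passing quoting=csv.QUOTE_ALL to to_csv then restores the quotes in the output.

```python
import csv
import pandas as pd

# Two tiny stand-in files; the real product/price lists are assumed
# to look like this, with every value quoted.
with open('products.csv', 'w') as f:
    f.write('"id","name","size"\n"1","Cologne","100ml"\n"2","Palette","7,2g"\n')
with open('prices.csv', 'w') as f:
    f.write('"id","price"\n"1","19,90"\n"2","45,00"\n')

# quotechar='"' (the default) makes pandas treat commas inside quoted
# fields as data, not separators; no comment= argument is needed.
csv1 = pd.read_csv('products.csv', quotechar='"')
csv2 = pd.read_csv('prices.csv', quotechar='"')

merged = csv1.merge(csv2, on='id')

# quoting=csv.QUOTE_ALL wraps every field in quotes again on output.
merged.to_csv('output.csv', index=False, quoting=csv.QUOTE_ALL)
print(open('output.csv').read().splitlines()[1])  # "1","Cologne","100ml","19,90"
```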

Related

reading columns of csv file using pandas not working

I am trying to read the following .csv file column by column, but usecols is not working; it gives the following error:
ValueError: Usecols do not match columns, columns expected but not found: ['sources', 'RMS']
this is how I am reading it:
train=pd.read_csv("parameters.csv", usecols = ['sources','RMS'])
And this is my csv file:
how can I read each column of this file?
Edit: I had an unclosed ", but the problem persists.
If you want to read every column, just remove the usecols part, like so: train=pd.read_csv("parameters.csv"). I'm guessing the reason it doesn't work is that your columns have spaces after the names, so the actual name of one of your columns is something like "sources ".
It might be an error on your part, but you have unclosed quotation marks in your code snippet.
If the order of the columns will always be the same, you can also pass an integer list to usecols:
df = pd.read_csv('file.csv', usecols=[0, 4])  # this selects just columns 0 and 4
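A minimal sketch of the whitespace explanation above, using an inline stand-in for parameters.csv (the header content here is invented): stripping the column names makes selection by name work again.

```python
import io
import pandas as pd

# Inline stand-in for parameters.csv whose header has stray spaces
# around the column names.
raw = io.StringIO("sources , RMS \n1,0.5\n2,0.7\n")

df = pd.read_csv(raw)
# Strip the whitespace so the names match what usecols would expect.
df.columns = df.columns.str.strip()
print(list(df.columns))  # ['sources', 'RMS']
```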

Pandas html to df - commas in numbers

I'm a newbie in Python. I need to download some tables from Polish-language webpages.
I have a problem with commas in numbers: it seems that Pandas deletes them.
For example:
import pandas as pd
x = pd.read_html('https://www.gpw.pl/wskazniki', encoding='utf-8', decimal=",")[1]
The result in the C/WK column is "021" instead of "0,21".
How do I download it properly, or change it to "0.21"?
Thank you
The issue is with the thousands separator, which also defaults to a comma.
To read the data and parse it correctly, use:
pd.read_html('https://www.gpw.pl/wskazniki',encoding = 'utf-8', decimal=',', thousands='.')[1]
The result is:
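The effect of the decimal=/thousands= pair is easy to check offline: read_csv accepts the same two parameters, so the parsing can be demonstrated on inline data (the sample values below are invented):

```python
import io
import pandas as pd

# Inline sample mimicking Polish number formatting: comma as the
# decimal separator, dot as the thousands separator.
data = io.StringIO("Spolka;C/WK\nA;0,21\nB;1.234,5\n")

df = pd.read_csv(data, sep=';', decimal=',', thousands='.')
print(df['C/WK'].tolist())  # [0.21, 1234.5]
```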

how to write comma separated list items to csv in a single column in python

I have a list (fulllist) of 292 items and converted it to a data frame. Then I tried writing it to a csv in Python.
import pandas as pd
my_df = pd.DataFrame(fulllist)
my_df.to_csv('Desktop/pgm/111.csv', index=False,sep=',')
But some of the comma-separated values are spread across several columns of the csv. I am trying to keep each value in a single column.
Portion of output is shown below.
I have tried with writerows, but it won't work.
import csv
with open('Desktop/pgm/111.csv', "wb") as f:
    writer = csv.writer(fulllist)
    writer.writerows(fulllist)
I also tried "".join each time the length of a list item is greater than 1, but that did not give the result either. How do I make a proper csv so that each field fills its own column?
My expected output csv is
Please keep in mind that .csv files are in fact plain text files, and how a given piece of software understands .csv depends on its implementation; for example, some allow a newline character as part of a field when it is between " and ", while others treat every newline character as the start of the next row.
Do you have to use the .csv format? If not, consider other possibilities:
DSV (https://en.wikipedia.org/wiki/Delimiter-separated_values) is similar to csv, but you can use for example ; instead of ,, which should help if you do not have ; in your data.
openpyxl allows writing and reading of .xlsx files.
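If the .csv format is required, the writerows attempt in the question needs two small fixes, sketched here on an invented stand-in for fulllist: the writer must be given the file object (not the list), and each item must be wrapped in a one-element row so it lands in a single column; the csv module then quotes any item containing commas automatically.

```python
import csv

# Invented stand-in for `fulllist`: items that themselves contain commas.
fulllist = ['a,b,c', 'd,e', 'f']

with open('111.csv', 'w', newline='') as f:
    writer = csv.writer(f)                         # pass the file, not the list
    writer.writerows([item] for item in fulllist)  # one-element row per item

# Rows containing commas come back quoted, e.g. "a,b,c", so reading
# the file again yields one value per row:
with open('111.csv', newline='') as f:
    print([row for row in csv.reader(f)])  # [['a,b,c'], ['d,e'], ['f']]
```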

Count number of columns in multiple csv files in directory

I have a directory that contains a large number of CSV files (more than 1000). I am using python pandas library to count the number of columns in each CSV file.
But the problem is that the separator used in some of the CSV files is not only "," but also "|" and ";".
How do I tackle this problem?
import pandas as pd
import csv
import os
from collections import OrderedDict
path="C:\\Users\\Username\\Documents\\Sample_Data_August10\\outbound"
files=os.listdir(path)
col_count_dict=OrderedDict()
for file in files:
    df=pd.read_csv(os.path.join(path,file),error_bad_lines=False,sep=",|;|\|",engine='python')
    col_count_dict[file]=len(df.columns)
I am storing it as a dictionary.
I am getting an error like:
Error could possibly be due to quotes being ignored when a multi-char delimiter is used
I have used sep=None, but that didn't work.
Edit :
One of the csv is like this :
Number|CommentText|CreationDate|Detail|EventDate|ProfileLocale_ISO|Event_Number|Message_Number|ProfileInformation_Number|Substitute_UserNo|User_UserNo
Second one is like:
Number,Description
I can't reveal the data. I have just given the column name as the data is sensitive.
Update
After a little bit of tweaking, and using print statements to inspect the code from andrey-portnoy's answer, I found that the csv sniffer was identifying the delimiter of the "|" file as "e", so with an if statement I changed it back to "|". Now it gives me the correct output.
Also, in place of read() I used readline() in the following line of Andrey's answer: dialect = csv.Sniffer().sniff(csvfile.read(1024))
But the problem remains unsolved: I only figured this out after a lot of inspection, and I may not guess correctly every time, which can lead to errors.
Any help will be awaited.
By specifying the separator as sep=",|;|\|", you make that whole string a separator.
Instead, you want to use the Sniffer from the csv module to detect the CSV dialect used in each file, in particular the delimiter.
For example, for a single file example.csv:
import csv
import pandas as pd

with open('example.csv', newline='') as csvfile:
    dialect = csv.Sniffer().sniff(csvfile.read(1024))
    sep = dialect.delimiter

df = pd.read_csv('example.csv', sep=sep)
Don't enable the Python engine by default, as it is much slower.
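To address the misdetection mentioned in the update: Sniffer.sniff() accepts a delimiters= argument that restricts the candidates, so it can no longer pick an ordinary letter such as "e". A sketch of the full directory loop, run here on two invented sample files mimicking the headers from the question:

```python
import csv
import os
import pandas as pd

path = 'outbound'  # invented directory of mixed-delimiter files
os.makedirs(path, exist_ok=True)
with open(os.path.join(path, 'a.csv'), 'w', newline='') as f:
    f.write('Number|CommentText|CreationDate\n1|hello|2015-01-25\n')
with open(os.path.join(path, 'b.csv'), 'w', newline='') as f:
    f.write('Number,Description\n1,widget\n')

col_count_dict = {}
for name in os.listdir(path):
    full = os.path.join(path, name)
    with open(full, newline='') as f:
        # Restricting the candidates to , ; | stops the sniffer from
        # ever choosing a letter as the delimiter.
        dialect = csv.Sniffer().sniff(f.read(1024), delimiters=',;|')
    df = pd.read_csv(full, sep=dialect.delimiter)
    col_count_dict[name] = len(df.columns)

print(sorted(col_count_dict.items()))  # [('a.csv', 3), ('b.csv', 2)]
```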

Unable to get correct output from tsv file using pandas

I have a tsv file which I am trying to read with the help of pandas. The first two rows of the file are of no use and need to be ignored. However, the output I get has only two columns: the name of the first column is Index, and the name of the second column is a random row from the file.
import pandas as pd
data = pd.read_csv('zahlen.csv', sep='\t', skiprows=2)
Please refer to the screenshot below.
The second column name is in bold black, and it is one of the rows from the file. Moreover, using '\t' as the delimiter does not separate the values into different columns. I am using the Spyder IDE for this. Am I doing something wrong here?
Try this:
data = pd.read_table('zahlen.csv', header=None, skiprows=2)
read_table() is better suited to tsv files, and read_csv() is a more specialized version of it. header=None then makes the first row data instead of a header.
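A quick check of that suggestion on an inline stand-in for zahlen.csv (the contents are invented: two junk rows followed by tab-separated data), here using read_csv with sep='\t', which is equivalent to read_table:

```python
import io
import pandas as pd

raw = io.StringIO(
    "some banner line\n"
    "another useless line\n"
    "1\t10\t100\n"
    "2\t20\t200\n"
)

# skiprows=2 drops the junk; header=None keeps the first data row as data.
df = pd.read_csv(raw, sep='\t', header=None, skiprows=2)
print(df.shape)  # (2, 3)
```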
