I have data that looks like this:
id,receiver_id,date,name,age
123,"a,b,c",2012-03-05,"john",32
456,"x,y,z",2012-06-05,"max",49
789,"abc",2012-07-05,"sam",19
In a nutshell, the delimiter is a comma (,). However, when a field is surrounded by quotes, it may contain commas within the value.
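For what it's worth, both Python's csv module and pandas already handle quoted fields like this by default (the default quotechar is "); a minimal sketch, with the sample data inlined as a string:

```python
import io

import pandas as pd

raw = """id,receiver_id,date,name,age
123,"a,b,c",2012-03-05,"john",32
456,"x,y,z",2012-06-05,"max",49
789,"abc",2012-07-05,"sam",19"""

# The default quotechar ('"') tells the parser to keep commas
# inside quoted fields instead of splitting on them.
df = pd.read_csv(io.StringIO(raw))
print(df["receiver_id"].tolist())  # ['a,b,c', 'x,y,z', 'abc']
```

No extra parameters are needed as long as the quoting in the file is consistent.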
I need help with data cleaning.
How do I transform the language column (D) into the form shown in column (F)?
Basically, just get rid of the brackets and apostrophes, and keep the comma between each language.
It can be done using either Python or Excel itself.
Thanks!
I tried to Google this but it didn't work.
You could use pandas to read the csv into a dataframe then "apply" a function to the column that did something like this:
def clean(value: str) -> str:
    for c in "[']":
        value = value.replace(c, "")
    return value
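For example, applied with pandas (the column name "D" and the sample values are assumptions for illustration):

```python
import pandas as pd

def clean(value: str) -> str:
    # Strip the bracket and apostrophe characters left over
    # from the stringified list.
    for c in "[']":
        value = value.replace(c, "")
    return value

# Hypothetical column holding stringified lists of languages
df = pd.DataFrame({"D": ["['English', 'French']", "['German']"]})
df["F"] = df["D"].apply(clean)
print(df["F"].tolist())  # ['English, French', 'German']
```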
It's worth noting that if you remove the brackets you'll have an "embedded" delimiter, meaning you won't be able to save this as a CSV without a few headaches.
If you want to clean it up in Excel you could, similarly to the function above, add a formula that replaces each of the unwanted characters with an empty string, something like this:
=SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(D2,"'",""),"[",""),"]","")
*where D2 is the first language cell
FWIW, I'd take a few minutes to play with this in pandas too -- it's always good to pick up a new skill
Based on the image, the data type in column 'D' seems to be a list of elements, hence when written to the cells, the square brackets are included.
There is a simple trick to handle this:
While you iterate over the rows of the column (basically cell by cell), simply run a list comprehension wrapped in a join, something like the following:
''.join(_val for _val in str(_cell_val) if _val not in ['[', ']', '\''])
Search for "list to str conversion using list comprehension" and you should find enough examples.
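If the underlying values really are Python lists written to the cells as strings, another option is to parse them back and join on a comma; a sketch, assuming the stringified-list format described above:

```python
import ast

def to_plain(cell_val) -> str:
    # Parse the stringified list back into a real list,
    # then join its elements with ", ".
    items = ast.literal_eval(str(cell_val))
    return ", ".join(items)

print(to_plain("['English', 'French']"))  # English, French
```

This is a bit more robust than character stripping because it won't mangle a language name that happens to contain a bracket or apostrophe.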
I'm working with a pandas dataframe that I want to plot. Instead of being floats, some values inside the dataframe are strings because they contain a semicolon.
This causes a ValueError when I try to plot the dataframe. I have found that the semicolons only occur at the end of each line.
Is there a keyword in the read_csv method to let pandas recognize the semicolons so they can be removed?
If the semicolon is being used to separate lines, you can use the lineterminator parameter to correctly parse the file:
pd.read_csv(..., lineterminator=";")
Pandas CSV documentation
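A quick sketch with an in-memory string (note that the C parser only accepts a single-character lineterminator):

```python
import io

import pandas as pd

# Rows separated by ";" instead of newlines
data = "a,b;1,2;3,4"
df = pd.read_csv(io.StringIO(data), lineterminator=";")
print(df)
```

With the semicolon treated as the line terminator, the values parse as numbers rather than strings, so plotting works.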
I have a CSV file with the following structure:
word1|word2|word3,word4,0.20,0.20,0.11,0.54,2.70,0.07,1.75
That is, a first column of strings (some capitalized, some not) that are separated by '|' and by ',' (this marks differences in patterns of association) and then 7 digits each separated by ','.
N.B. this dataframe has many millions of rows. I have tried to load it as follows:
pd.read_csv('pattern_association.csv', sep=r',(?!\D)', engine='python', chunksize=10000)
I have followed advice on here to use a regular expression that splits on every comma followed by a digit. I need one that keeps the first column as a whole string (ignoring the commas between words) while still parsing out the 7 columns of digits.
How can I get pandas to parse this?
I always get this error:
Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
I have tried many variations and the regex I am using seems to work outside the context of pandas on toy expressions.
Thanks for any tips.
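For what it's worth, that error usually goes away if you disable quote handling entirely; a sketch using quoting=csv.QUOTE_NONE, with the sample row inlined for illustration:

```python
import csv
import io

import pandas as pd

line = "word1|word2|word3,word4,0.20,0.20,0.11,0.54,2.70,0.07,1.75\n"

# quoting=csv.QUOTE_NONE tells the parser not to look for quote
# characters, which sidesteps the multi-char-delimiter error.
# The lookahead ,(?!\D) only splits on commas followed by a digit,
# so the comma between word3 and word4 is left alone.
df = pd.read_csv(io.StringIO(line), sep=r',(?!\D)', engine='python',
                 quoting=csv.QUOTE_NONE, header=None)
print(df.iloc[0, 0])  # word1|word2|word3,word4
```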
I am currently trying to write a dataframe to a CSV using to_csv. The input and output for the data are below. How do I write the to_csv call so that the fields with commas still get double-quoted, but the row with Katie doesn't get additional double quotes?
Input:
Title
Johnny,Appleseed
Lauren,Appleseed
Katie_"Appleseed"
Output:
Title
"Johnny,Appleseed"
"Lauren,Appleseed"
Katie_""Appleseed""
Code:
df.to_csv(r"filelocation", sep=',')
Escaping quotes by doubling them is part of the CSV standard (RFC 4180):
"7. If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote."
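A quick sketch showing that the default to_csv output, doubled quotes and all, round-trips cleanly (column name and values taken from the example above):

```python
import io

import pandas as pd

df = pd.DataFrame({"Title": ['Johnny,Appleseed',
                             'Lauren,Appleseed',
                             'Katie_"Appleseed"']})
buf = io.StringIO()
df.to_csv(buf, index=False)
print(buf.getvalue())
# Title
# "Johnny,Appleseed"
# "Lauren,Appleseed"
# "Katie_""Appleseed"""

# Reading it back recovers the original values exactly.
buf.seek(0)
round_trip = pd.read_csv(buf)
print(round_trip["Title"].tolist() == df["Title"].tolist())  # True
```

So the doubled quotes are not extra data; any standard-compliant CSV reader will undo them on the way back in.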
I am using the following code to export my data frame to csv:
data.write.format('com.databricks.spark.csv').options(delimiter="\t", codec="org.apache.hadoop.io.compress.GzipCodec").save('s3a://myBucket/myPath')
Note that I use delimiter="\t", as I don't want to add additional quotation marks around each field. However, when I checked the output csv file, there are still some fields which are enclosed by quotation marks. e.g.
abcdABCDAAbbcd ....
1234_3456ABCD ...
"-12345678AbCd" ...
It seems that the quotation mark appears when the leading character of a field is "-". Why is this happening and is there a way to avoid this? Thanks!
You don't use all the options provided by the CSV writer. It has a quoteMode parameter, which takes one of four values (descriptions from the org.apache.commons.csv documentation):
ALL - quotes all fields
MINIMAL (default) - quotes fields which contain special characters such as a delimiter, quotes character or any of the characters in line separator
NON_NUMERIC - quotes all non-numeric fields
NONE - never quotes fields
If you want to avoid quoting, the last option looks like a good choice, doesn't it?
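Applied to the write above, that would look something like the following (a sketch, not runnable standalone; quoteMode is specific to the spark-csv package):

```python
data.write.format('com.databricks.spark.csv') \
    .options(delimiter="\t", quoteMode="NONE",
             codec="org.apache.hadoop.io.compress.GzipCodec") \
    .save('s3a://myBucket/myPath')
```

Note that with quoting disabled, any field that actually contains a tab would produce an ambiguous output file, so this trades safety for cleanliness.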