Remove Semicolons as Line Delimiters Reading csv-file Using pandas.read_csv - python

I'm working with a pandas dataframe that I want to plot. Some values in the dataframe are strings rather than floats because they contain a semicolon.
This causes a ValueError when the dataframe is plotted. I have found that the semicolons occur only at the end of each line.
Is there a keyword argument in the read_csv method that lets pandas recognize the semicolons so they can be removed?

If the semicolon is being used to separate lines, you can use the lineterminator parameter to correctly parse the file:
pd.read_csv(..., lineterminator=";")
Pandas CSV documentation
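For instance, assuming ';' is the only record separator in the file (if lines actually end in ';' followed by a newline, it may be simpler to strip the semicolons before parsing), a minimal sketch with the data inlined:

```python
import io

import pandas as pd

# hypothetical inline data standing in for the file: ';' terminates each record
raw = "0.1,0.2;0.3,0.4;"
df = pd.read_csv(io.StringIO(raw), lineterminator=";", header=None)
print(df.shape)
```

With the semicolon treated as the line terminator, the values parse as floats and no stray ';' ends up inside a field.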

Related

python- CSV parsing issue- ignore the separator within the curly brackets

I'm facing an issue parsing the following CSV file row:
UPDATED,464,**"{\"node-id\":\"\",\"change-type\":\"UPDATED\",\"object-type\":\"service\",\"internalgeneratedepoch\":1674472915591000,\"topic-name\":\"Service\",\"object-id\":\"wdm_tpdr_service1\",\"changed-attributes\":{\"lifecycle-state\":{\"old-value\":\" \",\"new-value\":\"planned\"},\"administrative-state\":{\"old-value\":\" \",\"new-value\":\"outOfService\"}},\"internaleventid\":464}"**,1674472915591000,,wdm_tpdr_service1,service
The issue is with the column 3 data (highlighted in bold), which has commas inside the curly braces and double quotes. I am not able to read this column as a single data point; pandas splits it at the commas, which are read as separators. Can someone help, please?
Want to read the following string as a single data point:
"{"node-id":"","change-type":"UPDATED","object-type":"service","internalgeneratedepoch":1674472915591000,"topic-name":"Service","object-id":"wdm_tpdr_service1","changed-attributes":{"lifecycle-state":{"old-value":" ","new-value":"planned"},"administrative-state":{"old-value":" ","new-value":"outOfService"}},"internaleventid":464}"
Tried this code:
csv_input = pd.read_csv(file_name, delimiter=',(?![^{]*})',engine="python",index_col=False)
But it's not working for all the rows.
Any help will be appreciated.
The code you have provided doesn't work reliably because the regex delimiter only holds for some rows: the negative lookahead in ,(?![^{]*}) treats a comma as a separator whenever a '{' appears before the next '}', so commas that precede a nested object inside the JSON are wrongly split. To fix this, you can either drop the regex and use a plain comma as the delimiter, or look for a more specific pattern within the string in the delimiter argument, such as a certain set of characters or words.
You can try using the json library to parse the string in the third column:
import json
import pandas as pd

# escapechar lets pandas undo the backslash-escaped quotes in the sample,
# so the whole JSON object stays in one field
csv_input = pd.read_csv(file_name, escapechar="\\")
# select the third column (position 2) by position
third_column = csv_input.iloc[:, 2]
# parse each cell as JSON and store the results in a new column
csv_input['parsed_data'] = [json.loads(x) for x in third_column]
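To see why the escape character matters here, a shortened, hypothetical stand-in for the sample row can be parsed end to end (the field names and values below are invented for illustration):

```python
import io
import json

import pandas as pd

# shortened stand-in for the sample row: the third field is a double-quoted
# JSON object whose inner quotes are backslash-escaped
raw = 'UPDATED,464,"{\\"a\\":1,\\"b\\":{\\"c\\":2}}",1674472915591000\n'
# escapechar="\\" makes pandas treat \" as a literal quote inside the field
df = pd.read_csv(io.StringIO(raw), header=None, escapechar="\\")
obj = json.loads(df.iloc[0, 2])
print(obj)
```

Without escapechar, the backslash-escaped quotes prematurely close the quoted field and the commas inside the JSON are read as separators.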

How to parse a csv file with a custom delimiter

I have a csv file with a custom delimiter as "$&$" like this:
$&$value$&$,$&$type$&$,$&$text$&$
$&$N$&$,$&$N$&$,$&$text of the message$&$
$&$N$&$,$&$F$&$,$&$text of the message_2$&$
$&$B$&$,$&$N$&$,$&$text of the message_3$&$
and I'm not able to parse it with the following code:
df = pd.read_csv('messages.csv', delimiter='$\&$', engine='python')
Can you help me, please?
From the docs:
... separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'.
So, to fix your case it should be like this:
df = pd.read_csv('messages.csv', delimiter= '\$\&\$,\$\&\$|\$\&\$', usecols=[1,2,3])
Note that there will be two additional columns, the first and the last. They exist because every row starts and ends with $&$, and the separator between fields is actually $&$,$&$. usecols drops those two edge columns.
This is the output from the provided sample (first row shown):
  value type                 text
0     N    N  text of the message
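The answer above can be checked end to end with the sample inlined (a sketch; an in-memory buffer stands in for messages.csv):

```python
import io

import pandas as pd

raw = (
    "$&$value$&$,$&$type$&$,$&$text$&$\n"
    "$&$N$&$,$&$N$&$,$&$text of the message$&$\n"
)
# the regex splits on the full field separator '$&$,$&$' or on the
# leading/trailing '$&$', producing two empty edge columns dropped by usecols
df = pd.read_csv(io.StringIO(raw), sep=r"\$\&\$,\$\&\$|\$\&\$",
                 engine="python", usecols=[1, 2, 3])
print(df)
```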

Reading text file into pandas which contains escape characters results in missing data

I have a long list of regex patterns that I want to use for replacements in some text data, so I created a text file with all the regex patterns and the replacements I want for them. However, when reading this text file as a pandas dataframe, one particular regex pattern is missing from the dataframe. I tried reading it as a plain text file instead and building a pandas dataframe out of it, but since some of the regex patterns contain "," it becomes difficult to split each line into two columns.
The keys (regex patterns) and values (what I want to replace them with) look as follows:
key,value
"\bnan\b|\bnone\b"," "
"^\."," "
"\s+"," "
"\.+","."
"-+","-"
"\(\)"," "
When I read this as a dataframe using the following line of code,
normalization_mapper = pd.read_csv(f"{os.getcwd()}/normalize_dict.txt", sep=",")
I also tried the dtype argument as follows:
dtype={'key': 'string', 'value': 'string'}
What I can't figure out is how to handle the missing "\(\)" pattern in the dataframe. I'm pretty sure it's related to how the pandas reader treats escape characters; is there a way to handle this? Also, weirdly, escape characters in other rows cause no issues when reading the text file.
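One way to rule out pandas-specific behaviour is to parse the mapping with the stdlib csv module, which applies no backslash processing by default; a diagnostic sketch using an inline copy of the sample in place of normalize_dict.txt:

```python
import csv
import io

# inline stand-in for normalize_dict.txt; backslashes are literal characters
raw = (
    "key,value\n"
    '"\\bnan\\b|\\bnone\\b"," "\n'
    '"\\(\\)"," "\n'
)
mapping = {row["key"]: row["value"] for row in csv.DictReader(io.StringIO(raw))}
print(mapping)
```

If the "\(\)" row survives here but not through read_csv, the difference points at the read_csv options (e.g. an escapechar or NA handling setting) rather than the file itself.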

How to use pandas read_csv in this situation?

I have a data that looks like this
id,receiver_id,date,name,age
123,"a,b,c",2012-03-05,"john",32
456,"x,y,z",2012-06-05,"max",49
789,"abc",2012-07-05,"sam",19
In a nutshell, the delimiter is a comma (,). However, where a particular field is surrounded by quotes, there may be commas within the value.
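With this layout, read_csv's default quoting already keeps quoted commas inside a single field, so no extra options should be needed; a sketch with the sample inlined:

```python
import io

import pandas as pd

raw = (
    "id,receiver_id,date,name,age\n"
    '123,"a,b,c",2012-03-05,"john",32\n'
    '456,"x,y,z",2012-06-05,"max",49\n'
    '789,"abc",2012-07-05,"sam",19\n'
)
# the default quotechar '"' groups "a,b,c" into one field
df = pd.read_csv(io.StringIO(raw))
print(df.shape)
```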

How does the pandas read_csv parse regular expressions, exactly?

I have a CSV file with the following structure:
word1|word2|word3,word4,0.20,0.20,0.11,0.54,2.70,0.07,1.75
That is, a first column of strings (some capitalized, some not) that are separated by '|' and by ',' (this marks differences in patterns of association), and then seven numeric values, each separated by ','.
n.b. This dataframe has multiple millions of rows. I have tried to load it as follows:
pd.read_csv('pattern_association.csv',sep= ',(?!\D)', engine='python',chunksize=10000)
I have followed advice on here to use a regular expression that splits on every comma followed by a digit, but I need one that both keeps the first column as a whole string (ignoring the commas between words) and also parses out the seven numeric columns.
How can I get pandas to parse this?
I always get this error:
Error could possibly be due to quotes being ignored when a
multi-char delimiter is used.
I have tried many variations and the regex I am using seems to work outside the context of pandas on toy expressions.
Thanks for any tips.
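On the single sample line shown, the regex does split into one string column plus seven numeric columns; the quoting error on the full file usually means some rows produce a different field count (for example, a comma directly followed by a digit inside the first column), so it is worth verifying the pattern on the data itself. A sketch with the sample inlined:

```python
import io

import pandas as pd

sample = "word1|word2|word3,word4,0.20,0.20,0.11,0.54,2.70,0.07,1.75\n"
# split only on commas that are immediately followed by a digit,
# leaving the comma between word3 and word4 inside the first field
df = pd.read_csv(io.StringIO(sample), sep=r",(?!\D)", engine="python",
                 header=None)
print(df.shape)
```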
