How to parse a CSV file with a custom delimiter in Python

I have a CSV file with "$&$" as a custom delimiter, like this:
$&$value$&$,$&$type$&$,$&$text$&$
$&$N$&$,$&$N$&$,$&$text of the message$&$
$&$N$&$,$&$F$&$,$&$text of the message_2$&$
$&$B$&$,$&$N$&$,$&$text of the message_3$&$
and I'm not able to parse it with the following code:
df = pd.read_csv('messages.csv', delimiter='$\&$', engine='python')
Can you help me, please?

From the docs:
... separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'.
So, to fix your case it should be like this:
df = pd.read_csv('messages.csv', sep=r'\$&\$,\$&\$|\$&\$', engine='python', usecols=[1, 2, 3])
Note that there will be two additional columns, the first and the last. They exist because every row starts and ends with $&$, while the actual separator between fields is $&$,$&$. The usecols argument drops them.
This is the output from the provided sample:
  value type                   text
0     N    N    text of the message
1     N    F  text of the message_2
2     B    N  text of the message_3
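Putting it together, here is a minimal, self-contained sketch of this approach using the sample from the question (io.StringIO stands in for messages.csv):

```python
import io

import pandas as pd

# Sample from the question; in practice this would be messages.csv
data = """$&$value$&$,$&$type$&$,$&$text$&$
$&$N$&$,$&$N$&$,$&$text of the message$&$
$&$N$&$,$&$F$&$,$&$text of the message_2$&$
$&$B$&$,$&$N$&$,$&$text of the message_3$&$"""

# The field separator is really "$&$,$&$"; the bare "$&$" alternative
# matches the leading/trailing markers, which yields an empty first and
# last column that usecols then drops.
df = pd.read_csv(
    io.StringIO(data),
    sep=r"\$&\$,\$&\$|\$&\$",
    engine="python",
    usecols=[1, 2, 3],
)
print(df)
```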

Related

python- CSV parsing issue- ignore the separator within the curly brackets

Facing an issue for parsing the following CSV file row:
UPDATED,464,**"{\"node-id\":\"\",\"change-type\":\"UPDATED\",\"object-type\":\"service\",\"internalgeneratedepoch\":1674472915591000,\"topic-name\":\"Service\",\"object-id\":\"wdm_tpdr_service1\",\"changed-attributes\":{\"lifecycle-state\":{\"old-value\":\" \",\"new-value\":\"planned\"},\"administrative-state\":{\"old-value\":\" \",\"new-value\":\"outOfService\"}},\"internaleventid\":464}"**,1674472915591000,,wdm_tpdr_service1,service
The issue is with the column 3 data (highlighted in bold), which has commas inside the curly braces and double quotes. I am not able to read this column as a single data point; pandas splits it at the commas, which are read as separators. Can someone help, please?
Want to read the following string as a single data point:
"{"node-id":"","change-type":"UPDATED","object-type":"service","internalgeneratedepoch":1674472915591000,"topic-name":"Service","object-id":"wdm_tpdr_service1","changed-attributes":{"lifecycle-state":{"old-value":" ","new-value":"planned"},"administrative-state":{"old-value":" ","new-value":"outOfService"}},"internaleventid":464}"
Tried this code:
csv_input = pd.read_csv(file_name, delimiter=',(?![^{]*})',engine="python",index_col=False)
But it's not working for all the rows.
Any help will be appreciated.
The code you have provided doesn't work reliably because the regex you use as the delimiter, ,(?![^{]*}), is only a heuristic: it matches any comma that is not followed by a closing curly brace without an intervening opening brace. Because the JSON in column 3 contains nested braces, the lookahead misidentifies some of its commas as separators, so some rows split incorrectly. To fix this, you can either fall back to a plain comma delimiter and repair the affected column afterwards, or look for a more specific pattern in the delimiter argument.
You can try using the json library to parse the string in the third column:
import json
import pandas as pd

csv_input = pd.read_csv(file_name, header=None)
# select the third column of the csv (position 2)
third_column = csv_input.iloc[:, 2]
# parse each string in the column as JSON
parsed_data = third_column.apply(json.loads)
# use the parsed json data however you want
# If you want to store the parsed data in the csv, add it as a new column.
csv_input['parsed_data'] = parsed_data
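Alternatively, since the inner quotes in the sample row are backslash-escaped, you may be able to tell pandas about the escape character so the whole quoted JSON blob is read as a single value. A minimal sketch on a shortened version of the sample row (this assumes every row in the file escapes its inner quotes with a backslash):

```python
import io
import json

import pandas as pd

# Shortened version of the row from the question: the third field is a
# quoted JSON blob whose inner quotes are escaped with backslashes.
row = ('UPDATED,464,"{\\"node-id\\":\\"\\",\\"change-type\\":\\"UPDATED\\",'
       '\\"internaleventid\\":464}",1674472915591000,,wdm_tpdr_service1,service')

# escapechar makes the parser treat \" inside the quoted field as a literal
# quote, so the commas inside the JSON no longer split the field
df = pd.read_csv(io.StringIO(row), header=None, escapechar="\\")
parsed = json.loads(df.iloc[0, 2])
print(parsed["change-type"])
```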

Read Plain Text Document in pandas, only one column

(Screenshot of the whitespace-separated text file omitted.)
How can I fix this? Thank you.
You have to specify that your csv file's separator is whitespace.
To do so, add sep='\s+', which says that the separator is one or more whitespace characters (a regex expression).
The better way
You can instead pass delim_whitespace=True as a parameter, which does the same thing and can be faster than the regex shown above.
So your code should be like:
pd.read_csv("beer_drug_1687_1739.txt", header=None, delim_whitespace=True)
I also see that your first row has the names of your columns, so change header=None to header=0 to pick them up.
If you have any questions feel free to ask.
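A small self-contained sketch of the idea on made-up data (note that in recent pandas versions delim_whitespace is deprecated in favour of sep='\s+'):

```python
import io

import pandas as pd

# Made-up sample: columns separated by varying numbers of spaces,
# with the column names in the first row
data = """name  qty     price
beer     3    5.50
soda    12    1.25"""

# header=0 takes the first row as column names
df = pd.read_csv(io.StringIO(data), sep=r"\s+", header=0)
print(df)
```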

Remove Semicolons as Line Delimiters Reading csv-file Using pandas.read_csv

I'm working with a pandas dataframe that I want to plot. Instead of being float numbers, some values inside the dataframe are strings because they have a semicolon.
This causes a ValueError when the dataframe is to be plotted. I have found out that the semicolons only occur at the end of each line.
Is there a keyword in the read_csv method to let pandas recognize the semicolons so they can be removed?
If the semicolon is being used to separate lines, you can use the lineterminator parameter to correctly parse the file:
pd.read_csv(..., lineterminator=";")
Pandas CSV documentation
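For example, with made-up data where semicolons, rather than newlines, terminate the records:

```python
import io

import pandas as pd

# Made-up sample: records separated by ";" instead of newlines
data = "1.0,2.0;3.0,4.0;5.0,6.0"

# lineterminator tells the C parser to break rows on ";"
df = pd.read_csv(io.StringIO(data), header=None, lineterminator=";")
print(df)
```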

How does the pandas read_csv parse regular expressions, exactly?

I have a CSV file with the following structure:
word1|word2|word3,word4,0.20,0.20,0.11,0.54,2.70,0.07,1.75
That is, a first column of strings (some capitalized, some not) that are separated by '|' and by ',' (this marks differences in patterns of association) and then 7 digits each separated by ','.
n.b. This dataframe has multiple millions of rows. I have tried to load it as follows:
pd.read_csv('pattern_association.csv',sep= ',(?!\D)', engine='python',chunksize=10000)
I have followed advice found here to use a regular expression that splits only on commas followed by a digit, but I need one that both keeps the first column as a whole string (ignoring the comma between its words) and parses out the 7 columns of digits.
How can I get pandas to parse this?
I always get the error: "Error could possibly be due to quotes being ignored when a multi-char delimiter is used."
I have tried many variations and the regex I am using seems to work outside the context of pandas on toy expressions.
Thanks for any tips.
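For what it's worth, a lookahead of this kind does work on the sample row with the Python engine when quoting is disabled. A sketch (quoting=3, i.e. csv.QUOTE_NONE, is an assumption that the file contains no quoted fields; it sidesteps the quote-related parser error):

```python
import io

import pandas as pd

line = "word1|word2|word3,word4,0.20,0.20,0.11,0.54,2.70,0.07,1.75\n"

# Split only on commas that are immediately followed by a digit, so the
# comma before word4 inside the first column is left alone.
df = pd.read_csv(
    io.StringIO(line),
    sep=r",(?=\d)",
    engine="python",
    header=None,
    quoting=3,  # csv.QUOTE_NONE: do not treat quote characters specially
)
print(df)
```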

Customizing the separator in pandas read_csv

I am reading many different data files into various pandas dataframes. The columns in these files are separated by spaces, but the number of spaces differs from file to file (some use one space, some use two, and so on). Thus, every time I import a file, I have to open it, check how many spaces were used, and pass that many spaces to sep:
import pandas as pd
df = pd.read_csv('myfile.dat', sep = ' ')
Is there any way I can tell pandas to assume "any number of spaces" as the separator? Also, is there any way I can tell pandas to use either tab (\t) or spaces as the separator?
Yes, you can use a simple regular expression like sep='\s+' to denote one or more spaces.
You can also use the parameter skipinitialspace=True which skips the leading spaces after any delimiter.
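A quick sketch with made-up data showing that sep='\s+' handles tabs and runs of spaces alike:

```python
import io

import pandas as pd

# Made-up data: a tab after the first column, multiple spaces elsewhere
data = "a\tb   c\n1\t2   3\n"

# "\s+" matches any run of whitespace, so tabs and spaces both work
df = pd.read_csv(io.StringIO(data), sep=r"\s+")
print(df)
```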
You can directly use delim_whitespace:
import pandas as pd
df = pd.read_csv('myfile.dat', delim_whitespace=True)
The argument delim_whitespace controls whether or not whitespace (e.g. ' ' or '\t') will be used as the separator. See pandas.read_csv for details.
One thing I found: if you use an unsupported separator, pandas/Dask has to fall back to the Python engine instead of the C engine, which is a good deal slower.
