How does the pandas read_csv parse regular expressions, exactly? - python

I have a CSV file with the following structure:
word1|word2|word3,word4,0.20,0.20,0.11,0.54,2.70,0.07,1.75
That is, the first column holds strings (some capitalized, some not) separated by '|' and by ',' (this marks differences in patterns of association), followed by seven numeric values, each separated by ','.
N.B. This dataframe has several million rows. I have tried to load it as follows:
pd.read_csv('pattern_association.csv', sep=r',(?!\D)', engine='python', chunksize=10000)
I have followed advice on here to use a regular expression that matches only the commas followed by a digit. What I need is a pattern that keeps the first column together as one string (ignoring the commas between words) while still splitting out the seven numeric columns.
How can I get pandas to parse this?
I always get the error:
Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
I have tried many variations, and the regex I am using seems to work on toy examples outside the context of pandas.
Thanks for any tips.
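
One workaround, offered as a sketch rather than anything from the thread (the column names below are invented): since only the last seven fields are numeric, each line can be split on its final seven commas with str.rsplit, which leaves the commas inside the leading string column untouched.

import pandas as pd

# Sketch: split off the 7 trailing numeric fields from the right so the
# commas inside the leading pattern column are never treated as separators.
rows = []
with open('pattern_association.csv') as fh:
    for line in fh:
        rows.append(line.rstrip('\n').rsplit(',', 7))

cols = ['pattern'] + [f'value_{i}' for i in range(1, 8)]  # hypothetical names
df = pd.DataFrame(rows, columns=cols)
df[cols[1:]] = df[cols[1:]].astype(float)

Alternatively, sep=r',(?=\d)' with engine='python' splits only on commas directly followed by a digit, assuming no word in the first column starts with a digit; it carries the same quote-handling caveat as any regex separator, though.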

Related

How to parse a csv file with a custom delimiter

I have a csv file with a custom delimiter as "$&$" like this:
$&$value$&$,$&$type$&$,$&$text$&$
$&$N$&$,$&$N$&$,$&$text of the message$&$
$&$N$&$,$&$F$&$,$&$text of the message_2$&$
$&$B$&$,$&$N$&$,$&$text of the message_3$&$
and I'm not able to parse it with the following code:
df = pd.read_csv('messages.csv', delimiter='$\&$', engine='python')
Can you help me, please?
From the docs:
... separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'.
So, to fix your case it should be like this:
df = pd.read_csv('messages.csv', delimiter=r'\$\&\$,\$\&\$|\$\&\$', engine='python', usecols=[1, 2, 3])
Note that there would otherwise be two additional columns, the first and the last. They exist because every row starts and ends with $&$, and the effective field separator is actually $&$,$&$. The usecols argument gets rid of them.
This is the output from the provided sample:
  value type                   text
0     N    N    text of the message
1     N    F  text of the message_2
2     B    N  text of the message_3

Reading text file into pandas which contains escape characters results in missing data

I have a long list of regex patterns whose matches I want to replace in some text data, so I created a text file with all the regex patterns and the replacements I want for them. However, when reading this text file as a pandas dataframe, one particular regex pattern is missing from the dataframe. I tried to read it as a plain text file instead and build a pandas dataframe out of it, but since some of the regex patterns contain ',' in them, it becomes difficult to separate each line into two columns.
The keys (regex patterns) and values (the replacements I want) look as follows:
key,value
"\bnan\b|\bnone\b"," "
"^\."," "
"\s+"," "
"\.+","."
"-+","-"
"\(\)"," "
When I read this as a dataframe using the following line of code,
normalization_mapper = pd.read_csv(f"{os.getcwd()}/normalize_dict.txt", sep=",")
I also tried the datatype argument as follows
dtype={'key':'string', 'value':'string'}
What I can't figure out is how to handle the missing "\(\)" pattern in the dataframe. I'm pretty sure it's related to how the pandas reader handles escape characters. Is there a way to deal with this? Also, weirdly, the escape characters in the other rows cause no issues when reading the text file.
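
One way to sidestep pandas here entirely (a sketch, not from the original post): the stdlib csv module honors the double quotes around each field, so the commas inside the patterns split correctly and backslashes pass through untouched.

import csv

# Sketch: DictReader respects the quoting in the file, so a row like
# "\(\)"," " keeps its backslashes verbatim and its inner comma stays data.
with open('normalize_dict.txt', newline='') as fh:
    reader = csv.DictReader(fh)
    mapping = {row['key']: row['value'] for row in reader}

print(mapping.get(r'\(\)'))  # the pattern reported missing above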

Python: Replacing alphanumeric values in Dataframe

I have words with \t and \r at the beginning of the words that I am trying to strip out without stripping the actual words.
For example "\tWant to go to the mall.\rTo eat something."
I have tried a few things from SO over three days. It's a pandas DataFrame, so I thought this answer pertained best:
Pandas DataFrame: remove unwanted parts from strings in a column
But formulating from that for my own solution is not working.
i = df['Column'].replace(regex=False,inplace=False,to_replace='\t',value='')
I did not want to use regex, since the expression has been difficult to write given that I am attempting to strip out '\t' and, if possible, also '\r'.
Here is my regular expression: https://regex101.com/r/92CUV5/5
Try the following code:
import re

def remove_chars(text):
    # strip every tab and carriage return from the string
    return re.sub(r'[\t\r]', '', text)

i = df['Column'].map(remove_chars)
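
A vectorized alternative (a sketch, assuming the column holds plain strings) avoids calling a Python function per row:

i = df['Column'].str.replace(r'[\t\r]', '', regex=True)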

Pyspark: help filtering out any rows which have unwanted characters

Writing to a parquet file gives me an error stating that the characters " ,;{}()\n\t=" are not allowed.
I'd like to eliminate rows that have any of these characters anywhere.
Would I use "like", "rlike" or something else?
I have tried this:
df = df.filter(df.account_number.rlike('*\n*', '*\ *','*,*','*;*','*{*','*}*','*)*','*(*','*\t*') == False)
Obviously this does not work. I'm unsure what the right regex syntax is, or if I even need a regex in this particular case.
You would use rlike since it's for regular expressions:
df.filter(~df.account_number.rlike("[ ,;{}()\n\t=]"))
Putting characters between [ and ] creates a character class, which matches any one of the listed characters.
I don't see why these characters wouldn't be allowed in the dataframe rows; more likely there is an invalid character in one of the column names instead, since parquet restricts those. You can use .withColumnRenamed() to rename it.
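
If the column names are indeed the culprit, a sketch along these lines would sanitize them before writing (the output path is hypothetical):

import re

# Sketch: replace every character parquet rejects with an underscore.
for old in df.columns:
    new = re.sub(r'[ ,;{}()\n\t=]', '_', old)
    if new != old:
        df = df.withColumnRenamed(old, new)

df.write.parquet('output_path')  # hypothetical path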

HTML/XML special characters cause row breaks when reading a CSV file

I'm trying to read a semicolon-delimited CSV file into Python. One column contains some XML code, and in some rows these codes contain special entities such as &lt; for < and so on. The semicolons in those entities cause wrong row breaks, resulting in an inconsistent number of columns for certain rows. Is there a way to avoid that without replacing every problematic character?
Here is an example of such rows (I've shortened it for matter of visibility):
20160210-12:45:43:047;C2ALLIANCE.EAM.EVENT.EAMEVENTREPORT.DPROB14;<?xml version="1.0"?><FAP:Message><eam:Data id="LOTTYPE">R&amp;D</eam:Data></FAP:Message>;EVENT;DPROB14;
There are actually 5 columns, while the semicolon in &amp; causes an extra break, so my code gets the wrong number of columns.
I need certain columns and I use numpy:
data = numpy.genfromtxt('csvfile.csv', delimiter=";", dtype='str',usecols=(0, 1, 3), skip_header=1)
If the offending semicolon was between quotation marks, it would be possible to ignore it using pandas; but here, it's completely taken for a delimiter (I'm not the author of the data).
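
One workaround, as a sketch rather than a tested answer (it assumes only the XML column can contain entity semicolons): split each line twice from the left and twice from the right, so any stray semicolons stay inside the middle field. This recovers the columns usecols=(0, 1, 3) asked for.

import numpy

rows = []
with open('csvfile.csv') as fh:
    next(fh)  # header line, mirroring skip_header=1
    for line in fh:
        first, second, rest = line.rstrip('\n').rstrip(';').split(';', 2)
        xml, col3, col4 = rest.rsplit(';', 2)  # the last two fields hold no stray ';'
        rows.append((first, second, col3))

data = numpy.array(rows)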
