I have a CSV file with the following structure:
word1|word2|word3,word4,0.20,0.20,0.11,0.54,2.70,0.07,1.75
That is, the first column holds strings (some capitalized, some not) separated by '|' and by ',' (this marks differences in patterns of association), followed by seven numeric values, each separated by ','.
N.B. This file has several million rows. I have tried to load it as follows:
pd.read_csv('pattern_association.csv', sep=r',(?!\D)', engine='python', chunksize=10000)
I have followed the advice on here to use a regular expression that splits only on commas followed by a digit, but I need one that both keeps the first column as a single string (ignoring the commas between words) and still parses out the seven numeric columns.
How can I get pandas to parse this?
I always get this error:
Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
I have tried many variations, and the regex I am using seems to work on toy examples outside the context of pandas.
Thanks for any tips.
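One way around the regex-separator limitation is to sidestep read_csv's splitting entirely and split each line on its last seven commas, so the commas inside the first string column survive. A minimal sketch, assuming the file has no header and no quoting; the column names are placeholders of my own:

import pandas as pd

names = ['patterns'] + [f'score{i}' for i in range(1, 8)]  # placeholder names

def read_in_chunks(path, chunksize=10000):
    rows = []
    with open(path) as fh:
        for line in fh:
            # rsplit with maxsplit=7 splits only on the last 7 commas
            rows.append(line.rstrip('\n').rsplit(',', 7))
            if len(rows) == chunksize:
                yield pd.DataFrame(rows, columns=names)
                rows = []
    if rows:
        yield pd.DataFrame(rows, columns=names)

for chunk in read_in_chunks('pattern_association.csv'):
    chunk[names[1:]] = chunk[names[1:]].astype(float)
    # ... process each chunk here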
I have a csv file with a custom delimiter, "$&$", like this:
$&$value$&$,$&$type$&$,$&$text$&$
$&$N$&$,$&$N$&$,$&$text of the message$&$
$&$N$&$,$&$F$&$,$&$text of the message_2$&$
$&$B$&$,$&$N$&$,$&$text of the message_3$&$
and I'm not able to parse it with the following code:
df = pd.read_csv('messages.csv', delimiter='$\&$', engine='python')
Can you help me, please?
From the docs:
... separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'.
So, to fix your case it should be like this:
df = pd.read_csv('messages.csv', delimiter=r'\$\&\$,\$\&\$|\$\&\$', engine='python', usecols=[1, 2, 3])
Note that there will be two additional columns, the first and the last. They exist because every row starts and ends with $&$, while the delimiter between fields is actually $&$,$&$. The usecols argument drops both of them.
This is the output from the provided sample:
   value  type                   text
0      N     N    text of the message
1      N     F  text of the message_2
2      B     N  text of the message_3
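Alternatively, since every field is wrapped in the same literal marker, you could read the file as ordinary comma-separated data and strip the wrappers afterwards. A minimal sketch, assuming no commas ever occur inside the values themselves:

import pandas as pd

df = pd.read_csv('messages.csv')                  # the $&$ wrappers come along for the ride
df.columns = [c.strip('$&') for c in df.columns]  # strip them from the header ...
df = df.apply(lambda col: col.str.strip('$&'))    # ... and from every value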
I have a long list of regex patterns that I want to replace in some text data, so I created a text file with all the regex patterns and the replacements I want for them. However, when I read this text file as a pandas dataframe, one particular regex pattern is missing from the dataframe. I tried reading it as a plain text file instead and building a dataframe out of it, but since some of the regex patterns contain "," it becomes difficult to split each line into two columns.
The keys (regex patterns) and values (what I want to replace them with) look as follows:
key,value
"\bnan\b|\bnone\b"," "
"^\."," "
"\s+"," "
"\.+","."
"-+","-"
"\(\)"," "
When I read this as a dataframe using the following line of code,
normalization_mapper = pd.read_csv(f"{os.getcwd()}/normalize_dict.txt", sep=",")
I also tried the dtype argument, as follows:
dtype={'key':'string', 'value':'string'}
What I can't figure out is how to handle this missing "\(\)" pattern in the dataframe. I'm fairly sure it's related to how the pandas reader handles escape characters. Is there a way to deal with this? Also, weirdly, the escape characters in the other rows cause no issues when reading the text file.
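One way to sidestep pandas here, since the file is small: build the mapping with the standard csv module, which respects the quotes around each field. A minimal sketch, assuming the file looks exactly like the sample above:

import csv

with open('normalize_dict.txt', newline='') as fh:
    reader = csv.reader(fh)
    next(reader)                          # skip the "key,value" header
    normalization_mapper = dict(reader)   # regex pattern -> replacement

print(r'\(\)' in normalization_mapper)    # True: the pattern survives intact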
I have words with \t and \r at the beginning that I am trying to strip out without stripping the actual words.
For example "\tWant to go to the mall.\rTo eat something."
I have tried a few things from SO over three days. It's a pandas DataFrame, so I thought this answer applied best:
Pandas DataFrame: remove unwanted parts from strings in a column
But formulating from that for my own solution is not working.
i = df['Column'].replace(regex=False, inplace=False, to_replace='\t', value='')
I did not want to use regex, since the expression has been difficult to write given that I am trying to strip out '\t' and, if possible, '\r' as well.
Here is my regular expression: https://regex101.com/r/92CUV5/5
Try the following code:
import re

def remove_chars(text):
    # remove every tab and carriage return character
    return re.sub(r'[\t\r]', '', text)

i = df['Column'].map(remove_chars)
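If the column holds strings, the same substitution is also available as a vectorized one-liner through pandas' string accessor:

i = df['Column'].str.replace(r'[\t\r]', '', regex=True)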
Writing to a parquet file gives me an error stating that the characters " ,;{}()\n\t=" are not allowed.
I'd like to eliminate rows that have any of these characters anywhere.
Would I use "like", "rlike" or something else?
I have tried this:
df = df.filter(df.account_number.rlike('*\n*', '*\ *','*,*','*;*','*{*','*}*','*)*','*(*','*\t*') == False)
Obviously this does not work. I'm unsure what the right regex syntax is, or if I even need a regex in this particular case.
You would use rlike since it's for regular expressions:
df.filter(~df.account_number.rlike("[ ,;{}()\n\t=]"))
Putting characters between [] makes the pattern match any one of those characters.
That said, I don't see why these characters wouldn't be allowed in the DataFrame's rows; more likely there is an invalid character in one of the column names. You can use .withColumnRenamed() to rename it.
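If it does turn out to be the column names, here is a minimal sketch of that fix (the output path is a placeholder):

import re

# replace every character parquet rejects with an underscore, column by column
for name in df.columns:
    df = df.withColumnRenamed(name, re.sub(r'[ ,;{}()\n\t=]', '_', name))
df.write.parquet('output_path')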
I'm trying to read a semicolon-delimited CSV file into Python. There is a column which contains some XML code, and in some rows these codes contain special entities such as &lt; for < and so on. The semicolons in those entities cause spurious field breaks, resulting in an inconsistent number of columns for certain rows. Is there a way to avoid that without replacing every problematic character?
Here is an example of such rows (I've shortened it for matter of visibility):
20160210-12:45:43:047;C2ALLIANCE.EAM.EVENT.EAMEVENTREPORT.DPROB14;<?xml version="1.0"?><FAP:Message><eam:Data id="LOTTYPE">R&amp;D</eam:Data></FAP:Message>;EVENT;DPROB14;
There are actually 5 columns, while the semicolon in &amp; causes an extra break, so my code gets the wrong number of columns.
I need certain columns and I use numpy:
data = numpy.genfromtxt('csvfile.csv', delimiter=";", dtype='str',usecols=(0, 1, 3), skip_header=1)
If the offending semicolons were between quotation marks, it would be possible to ignore them using pandas; but here they are indistinguishable from delimiters (and I'm not the author of the data).
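One workaround, sketched below under the assumption that some placeholder character (here '\x01') never occurs in the data: temporarily hide the semicolon that terminates each XML entity, parse as usual, then restore it.

import io
import re
import numpy

# matches named (&amp;) and numeric (&#38;) entities; group 1 excludes the ';'
entity = re.compile(r'(&[A-Za-z]+|&#[0-9]+);')

with open('csvfile.csv') as fh:
    protected = entity.sub('\\1\x01', fh.read())  # hide the entity semicolons

data = numpy.genfromtxt(io.StringIO(protected), delimiter=';', dtype='str',
                        usecols=(0, 1, 3), skip_header=1)
data = numpy.char.replace(data, '\x01', ';')      # restore them afterwards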