Pandas ignores separator passed as parameter - python

I am working on a function that, among other tasks, is supposed to read a csv in pandas. I would like to pass the separator as a string parameter. However, for some reason, probably something to do with regular expressions, pandas is completely ignoring the separator I pass and defaults to '\t', which does not parse my data correctly.
import pandas as pd

def open_df(separator):
    df = pd.read_csv('filename.csv', sep=separator)
    return df
The question is: how am I supposed to pass the separator parameter in this case?

Please check this link:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
sep : str, default ','
Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python's builtin sniffer tool, csv.Sniffer. In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'.

I passed the separator as a "raw" string and that worked fine for me.
If you use a raw string, \ is interpreted as a normal character, so \t works too.
When you call open_df(), write an r before the string quotes, like open_df(r"\t").
Example:
test_string = r"\t\n"
print(test_string)
\t\n
And I also passed "python" as the engine parameter so the parser warning is not shown :-).
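
A minimal sketch putting both suggestions together (filename.csv is the placeholder from the question):

import pandas as pd

def open_df(separator):
    # engine='python' avoids the ParserWarning that regex separators trigger
    df = pd.read_csv('filename.csv', sep=separator, engine='python')
    return df

df = open_df(r"\t")  # pass the separator as a raw string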

Related

How to parse a csv file with a custom delimiter

I have a csv file with "$&$" as a custom delimiter, like this:
$&$value$&$,$&$type$&$,$&$text$&$
$&$N$&$,$&$N$&$,$&$text of the message$&$
$&$N$&$,$&$F$&$,$&$text of the message_2$&$
$&$B$&$,$&$N$&$,$&$text of the message_3$&$
and I'm not able to parse it with the following code:
df = pd.read_csv('messages.csv', delimiter='$\&$', engine='python')
Can you help me, please?
From the docs:
... separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'.
So, to fix your case it should be like this:
df = pd.read_csv('messages.csv', delimiter=r'\$\&\$,\$\&\$|\$\&\$', engine='python', usecols=[1, 2, 3])
Note that there will be 2 additional columns, the first one and the last one. They exist because every row starts and ends with $&$, and the delimiter in the middle of a row is actually $&$,$&$. usecols gets rid of them.
This is the output from the provided sample (header plus the first row):

  value type                 text
0     N    N  text of the message
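
If you'd rather not escape the delimiter by hand, re.escape can build an equivalent pattern; a sketch under the same assumptions:

import re
import pandas as pd

# re.escape builds the escaped piece for us; on Python 3.7+ it yields '\\$&\\$'
d = re.escape('$&$')
df = pd.read_csv('messages.csv', delimiter=f'{d},{d}|{d}', engine='python', usecols=[1, 2, 3])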

Pandas doesn't recognize "||" as a string to split on

I'm trying to split a DataFrame column in two and get the left part as the result, but pandas doesn't recognize the string and gives me empty output.
q=['Sar || var','lol ||']
y=pd.DataFrame(q)
split_data = y[0].str.split("||", n = 1, expand = False).str[0]
print(split_data)
Out:
0
1
Name: 0, dtype: object
The documentation is somewhat deceptive for this method. What is happening is that for patterns longer than 1 character, pandas interprets the separator as a regular expression.
You can use "||" as a literal, non-regex separator by escaping the character "|" (which has special meaning in regular expressions) using a backslash:
series.str.split("\\|\\|")
Note that python provides a "raw" syntax for string literals that can be useful for writing regular expressions, removing the need to escape the backslashes themselves:
series.str.split(r"\|\|")
You can consult the documentation for the re module for a list of special characters that will need to be escaped when using multi-character separators. Alternatively, just use the function re.escape:
import re
series.str.split(re.escape("||"))
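
Applied to the data from the question, a quick sketch of the fix:

import pandas as pd

q = ['Sar || var', 'lol ||']
y = pd.DataFrame(q)
split_data = y[0].str.split(r"\|\|", n=1, expand=False).str[0]
print(split_data)
# 0    Sar
# 1    lol
# Name: 0, dtype: object  (the values keep their trailing space)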

split() not producing expected results

I have a problem with Python's split and I can't figure out what I am missing that makes it not work properly. I have used similar splits before and they worked just fine.
content = open(file).read()
Sep = content.split(r'Document [a-zA-Z0-9]{25}\n')
The file I am reading is something very easy as:
"I like coffee.
Document CLASSAR020181030eeat0000l
I like tea as well.
Document CLASSAR020181030eeat0000l
I like both coffee and tea."
str.split() splits using a fixed delimiter, not a regular expression. You need to use re.split().
import re
sep = re.split(r'Document [a-zA-Z0-9]{25}\n', content)
Error - regular expression syntax on string methods
content is a string, so content.split(...) invokes the string method, which expects a fixed-string separator, not a regular expression.
Solution - Use re module
You can instead use methods within the regular expression module, as you're using regular expression syntax:
import re
with open(file, 'r') as fp:
    content = fp.read()

pattern = re.compile(r'Document \w{25}\n')
separated = pattern.split(content)
The with block is just best practice for opening files in Python. It is a context manager that automatically closes your file when you're finished. You may run into problems in the future if you don't use this.
The regular expression I have used is slightly different from yours, but does almost exactly the same thing: \w is short for [a-zA-Z0-9_], i.e. it matches any alphanumeric character plus the underscore.
We are using the split method again, but this time it is the one from the re module, not str, since pattern is a compiled regular expression object.
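
For reference, a self-contained sketch with the sample inlined as a string, showing what either version returns:

import re

content = ("I like coffee.\n"
           "Document CLASSAR020181030eeat0000l\n"
           "I like tea as well.\n"
           "Document CLASSAR020181030eeat0000l\n"
           "I like both coffee and tea.")

print(re.split(r'Document \w{25}\n', content))
# ['I like coffee.\n', 'I like tea as well.\n', 'I like both coffee and tea.']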

Loading regular expression patterns from external source?

I have a series of regular expression patterns defined for automated processing of text. Due to the design of the program, it's better to have these patterns separate in a text file, namely a JSON file. The pattern in Python is of r'' type, but all I can provide is a string. I'd like to retain functionalities such as grouping. I'd like to have features such as entities ([A-z]), so I'm not talking about escaping everything.
I'm using Python 3.4. How do I properly load these patterns into the re module? And what kind of escaping problem should I watch out for?
I am not sure exactly what you want, but have a look at this:
If you have a file called input.txt containing \d+, then you can use it this way:
import re

f = open("input.txt", "r")
x = "asasd3243sdfdsf23234sdsdf"
print(re.findall(f.readline().strip(), x))  # strip() removes the trailing newline
Output: ['3243', '23234']
Since the pattern comes from the file at runtime, Python's string-literal escaping never applies to it: the text is used exactly as it appears in the file, so you need not escape anything.
The r'' thing in Python is not a different type from a simple ''. The r'' syntax simply creates a string that looks exactly like the one you typed: the \n sequence stays as \n and isn't turned into a newline (the same happens with other escape sequences). The little r simply disables escape processing for everything you type.
Check it yourself with these few lines in the console:
print('test \n test')
print(r'test \n test')
print(type(r''))
print(type(''))
Now, when you read strings from a JSON file, the unescaping is done for you. I don't know how you will create the JSON file, but you should take a look at the json module and its load function, which will let you read a JSON file.
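
A minimal sketch, assuming a hypothetical patterns.json that maps names to pattern strings:

import json
import re

# patterns.json is hypothetical; it could contain:
# {"word": "[A-Za-z]+", "number": "\\d+"}
with open('patterns.json') as fp:
    raw = json.load(fp)

# json.load already turns the JSON escape "\\d" into the two characters \d,
# so the strings can be compiled directly and grouping still works:
patterns = {name: re.compile(p) for name, p in raw.items()}
print(patterns['number'].findall('abc 123 def 45'))  # ['123', '45']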
You can use re.escape to escape the strings. However, that escapes everything, and you might want to keep some special characters. I'd just use the strings as-is and be careful about placing \ in the right places.
BTW: if you have many regular expressions, matching might get slow. You might want to consider alternatives such as esmre.

Pandas read_csv. Using '^A' as a delimiter

I have a csv file where the field separators are ^A characters. When I try
df = pd.read_csv(p_file, sep='^A')
The file looks as follows:
0J0NrQDHHx^A989.0^A1
0J0NrQDHHx^A1204.0^A1
0U0NrQDHHx^A1654.0^A1
0N0NrQDHHx^A1679.0^A3
...
However, when I run the command above, I get everything in one column. Why?
Use sep='\^A':
pd.read_csv(p_file, sep='\^A')
The reason is that sep also accepts regular expressions, and ^ has a special meaning in regular expressions, so the \ is used to escape it.
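
As in the first answer above, a raw string plus engine='python' keeps this tidy; a sketch reusing p_file from the question (header=None is my assumption, since the sample shows no header row):

import pandas as pd

# r'\^A' keeps the backslash literal; engine='python' silences the regex-separator warning
df = pd.read_csv(p_file, sep=r'\^A', engine='python', header=None)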
