I am reading many different data files into pandas dataframes. The columns in these files are separated by spaces, but the number of spaces differs from file to file (some use a single space, others two, and so on). So every time I import a file, I have to open it, count the spaces used, and pass exactly that many spaces to sep:
import pandas as pd
df = pd.read_csv('myfile.dat', sep = ' ')
Is there any way I can tell pandas to assume "any number of spaces" as the separator? Also, is there any way I can tell pandas to use either tab (\t) or spaces as the separator?
Yes, you can use the simple regular expression sep=r'\s+' to denote one or more whitespace characters. Since \s also matches tabs, this covers the "tab or spaces" case as well.
You can also use the parameter skipinitialspace=True which skips the leading spaces after any delimiter.
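To illustrate, a minimal sketch with made-up inline data (io.StringIO stands in for a real file) that mixes single spaces, multiple spaces, and a tab:
import io
import pandas as pd

# columns separated by one space, several spaces, and a tab
data = "col1 col2  col3\n1  2\t3\n4 5   6\n"
df = pd.read_csv(io.StringIO(data), sep=r'\s+')
print(df)
#    col1  col2  col3
# 0     1     2     3
# 1     4     5     6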
You can directly use delim_whitespace:
import pandas as pd
df = pd.read_csv('myfile.dat', delim_whitespace=True)
The argument delim_whitespace controls whether whitespace (e.g. ' ' or '\t') is used as the separator. See pandas.read_csv for details. Note that delim_whitespace has since been deprecated (pandas 2.2) in favor of sep=r'\s+'.
One thing I found: if you use a separator the C engine does not support, pandas/Dask will have to fall back to the Python engine instead, which is a good deal slower.
I have a csv file with a custom delimiter as "$&$" like this:
$&$value$&$,$&$type$&$,$&$text$&$
$&$N$&$,$&$N$&$,$&$text of the message$&$
$&$N$&$,$&$F$&$,$&$text of the message_2$&$
$&$B$&$,$&$N$&$,$&$text of the message_3$&$
and I'm not able to parse it with the following code:
df = pd.read_csv('messages.csv', delimiter='$\&$', engine='python')
Can you help me, please?
From the docs:
... separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'.
So, to fix your case it should be like this:
df = pd.read_csv('messages.csv', delimiter=r'\$\&\$,\$\&\$|\$\&\$', usecols=[1,2,3])
Note that there will be two additional columns, the first and the last. They exist because all data rows start and end with $&$, and the delimiter between fields is actually $&$,$&$. So usecols gets rid of them.
This is the output from the provided sample:
   value  type                   text
0      N     N    text of the message
1      N     F  text of the message_2
2      B     N  text of the message_3
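For reference, here is a self-contained version of the fix that reproduces the output above, with io.StringIO standing in for messages.csv (engine='python' is passed explicitly, since a regex delimiter forces the Python engine anyway, and being explicit silences the parser warning):
import io
import pandas as pd

data = (
    "$&$value$&$,$&$type$&$,$&$text$&$\n"
    "$&$N$&$,$&$N$&$,$&$text of the message$&$\n"
    "$&$N$&$,$&$F$&$,$&$text of the message_2$&$\n"
    "$&$B$&$,$&$N$&$,$&$text of the message_3$&$\n"
)
df = pd.read_csv(io.StringIO(data), sep=r'\$\&\$,\$\&\$|\$\&\$',
                 engine='python', usecols=[1, 2, 3])
print(df)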
I would like to split some columns in my dataframe based on certain keywords and integers. In Excel, this would look like this, using the movable delimiter:
[screenshot of Excel's Text to Columns dialog]
I am aware of pandas' str.split, but it seems limited to one delimiter at a time and does not seem to account for integers. With regex, I could do something like this to split the string accordingly:
import re

s = "zone entries bin 1 zone center"
s = re.split(r'(bin)|(\s+[0-9]+\s+)', s)
(I am not great at regex, and with this result I'd have to remove the None values.) But it seems like regex patterns won't work with pandas' str.split. What is the best way to accomplish this Text to Columns functionality?
str.split supports regular expressions: Series.str.split(pat=None, n=-1, expand=False)
pat : str, optional
string or regular expression to split on. If not specified, split on whitespace.
See pandas docs. Also here is an answer with good examples.
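To illustrate, a minimal sketch on the sample string from the question (the column name 'label' is made up); the pattern treats 'bin' plus the integer that follows it as a single delimiter:
import pandas as pd

df = pd.DataFrame({'label': ['zone entries bin 1 zone center']})
# a multi-character pat is interpreted as a regular expression
parts = df['label'].str.split(r'\s*bin\s+\d+\s*', expand=True)
print(parts)
#               0            1
# 0  zone entries  zone center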
I'm working with a pandas dataframe that I want to plot. Instead of being floats, some values inside the dataframe are strings because they have a semicolon attached. This causes a ValueError when the dataframe is plotted. I have found out that the semicolons only occur at the end of each line.
Is there a keyword in the read_csv method to let pandas recognize the semicolons so they can be removed?
If the semicolon is being used to separate lines, you can use the lineterminator parameter to correctly parse the file:
pd.read_csv(..., lineterminator=";")
Pandas CSV documentation
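To illustrate, a minimal sketch with made-up inline data where ';' terminates each record (note that lineterminator must be a single character and is only supported by the C engine):
import io
import pandas as pd

data = "1.0,2.0;3.0,4.0;"
df = pd.read_csv(io.StringIO(data), lineterminator=";", header=None)
print(df)
#      0    1
# 0  1.0  2.0
# 1  3.0  4.0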
I am working on a function that, among other tasks, is supposed to read a csv in pandas. As one of the parameters, I would like to pass the separator as a string. However, for some reason, probably something to do with regular expressions, pandas is totally ignoring the separator I pass and defaults to '\t', which does not parse my data correctly.
import pandas as pd
def open_df(separator):
    df = pd.read_csv('filename.csv', sep=separator)
    return df
The question is: how am I supposed to pass the separator parameter in this case?
Please check this link:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
sep : str, default ‘,’
Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can,
meaning the latter will be used and automatically detect the separator
by Python’s builtin sniffer tool, csv.Sniffer. In addition, separators
longer than 1 character and different from '\s+' will be interpreted
as regular expressions and will also force the use of the Python
parsing engine. Note that regex delimiters are prone to ignoring
quoted data. Regex example: '\r\t'.
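For example, here is a minimal sketch (with made-up inline data) that lets the sniffer detect the delimiter:
import io
import pandas as pd

# sep=None with the python engine makes csv.Sniffer guess the delimiter
data = "a;b;c\n1;2;3\n"
df = pd.read_csv(io.StringIO(data), sep=None, engine='python')
print(df)
#    a  b  c
# 0  1  2  3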
I passed the separator string as a "raw" string and that worked fine for me.
If you use a raw string, \ is interpreted as a normal character, so \t will work too.
When you call open_df() you have to write an r before the string quotes, like open_df(r"\t").
Example:
test_string = r"\t\n"
print(test_string)
\t\n
And I also passed "python" as the engine parameter in order to not display the parser warning :-).
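Putting the two suggestions together, a minimal sketch of the resulting call (the filename is the hypothetical one from the question):
import pandas as pd

def open_df(separator):
    # the python engine handles multi-character/regex separators
    # and avoids the parser warning
    return pd.read_csv('filename.csv', sep=separator, engine='python')

# r"\t" reaches pandas as the two characters '\' and 't', which the
# python engine interprets as a regular expression matching a tab
df = open_df(r"\t")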
I have a CSV file with the following structure:
word1|word2|word3,word4,0.20,0.20,0.11,0.54,2.70,0.07,1.75
That is, a first column of strings (some capitalized, some not) separated by '|' and by ',' (this marks differences in patterns of association), followed by seven numeric values, each separated by ','.
N.B. this file has many millions of rows. I have tried to load it as follows:
pd.read_csv('pattern_association.csv', sep=r',(?!\D)', engine='python', chunksize=10000)
I have followed advice on here to use a regular expression that splits only on a comma followed by a digit, but I need one that both keeps the first column as a whole string, ignoring the commas between the words, and also parses out the seven columns made up of numbers.
How can I get pandas to parse this?
I always get this error:
Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
I have tried many variations and the regex I am using seems to work outside the context of pandas on toy expressions.
Thanks for any tips.