How can I fix this? Thank you.
You have to specify that your CSV file's separator is whitespace.
To do so, add sep='\s+', which says that the separator in the file is one or more whitespace characters (a regex).
The better way
Specify delim_whitespace=True instead, as it is faster than the regex shown above.
So your code should look like:
pd.read_csv("beer_drug_1687_1739.txt", header=None, delim_whitespace=True)
I also see that your first row contains the column names, so change header=None to header=0 to pick them up.
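Putting both fixes together, the call becomes:
pd.read_csv("beer_drug_1687_1739.txt", header=0, delim_whitespace=True)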
If you have any questions feel free to ask.
I'm trying to automate a process, but one of the files has a very strange separation.
The columns are separated by spaces, but some rows have more spaces than others.
Does anyone have an idea how to solve this?
Thanks a lot! :D
First of all, I'd make sure that this is not an artefact of the interface you use for viewing the file, because some of them might actually just display tabs like this.
You can split the file using regular expressions on multiple spaces.
import re

# assuming `file` is an already-open file object, e.g. file = open("data.txt")
for line in file:
    # strip() avoids an empty trailing field from the newline
    fields = re.split(r"\s+", line.strip())
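If the rows should end up in pandas anyway, the split fields can be assembled into a DataFrame. A minimal sketch, assuming the file is called data.txt and its first line holds the column names:
import re
import pandas as pd

with open("data.txt") as f:
    rows = [re.split(r"\s+", line.strip()) for line in f]

# the first split line becomes the header, the remaining lines the data
df = pd.DataFrame(rows[1:], columns=rows[0])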
I have a CSV file with "$&$" as a custom delimiter, like this:
$&$value$&$,$&$type$&$,$&$text$&$
$&$N$&$,$&$N$&$,$&$text of the message$&$
$&$N$&$,$&$F$&$,$&$text of the message_2$&$
$&$B$&$,$&$N$&$,$&$text of the message_3$&$
and I'm not able to parse it with the following code:
df = pd.read_csv('messages.csv', delimiter='$\&$', engine='python')
Can you help me, please?
From the docs:
... separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'.
So, to fix your case it should be like this:
df = pd.read_csv('messages.csv', delimiter=r'\$\&\$,\$\&\$|\$\&\$', usecols=[1, 2, 3])
Note that there would otherwise be two additional columns, the first and the last. They exist because every row starts and ends with $&$. In addition, the field delimiter is actually $&$,$&$. So usecols gets rid of them.
This is the output from the provided sample:

  value type                   text
0     N    N    text of the message
1     N    F  text of the message_2
2     B    N  text of the message_3
I have a long list of regex patterns that I want to replace in some text data, so I created a text file with all the regex patterns and the replacements I want for them. However, when reading this text file as a pandas dataframe, one particular regex pattern is missing from the dataframe. I tried to read it as a plain text file instead and build a pandas dataframe out of it, but since some of the regex patterns contain "," in them, it becomes difficult to split each line into two columns.
The key (regex pattern) and value (what I want to replace it with) pairs look as follows:
key,value
"\bnan\b|\bnone\b"," "
"^\."," "
"\s+"," "
"\.+","."
"-+","-"
"\(\)"," "
When I read this as a dataframe using the following line of code,
normalization_mapper = pd.read_csv(f"{os.getcwd()}/normalize_dict.txt", sep=",")
I also tried the dtype argument, as follows:
dtype={'key':'string', 'value':'string'}
What I can't figure out is how to handle the missing "\(\)" pattern in the dataframe. I'm pretty sure it's related to how the pandas reader treats escape characters. Is there a way to handle this? Also, weirdly, the escape characters in the other rows cause no issues when reading the text file.
I'm working with a pandas dataframe that I want to plot. Instead of being float numbers, some values inside the dataframe are strings because they have a semicolon.
This causes a ValueError when the dataframe is to be plotted. I have found out that the semicolons only occur at the end of each line.
Is there a keyword in the read_csv method to let pandas recognize the semicolons so they can be removed?
If the semicolon is being used to separate lines, you can use the lineterminator parameter to correctly parse the file:
pd.read_csv(..., lineterminator=";")
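For instance, a minimal sketch with an in-memory file (the data is illustrative):
import io
import pandas as pd

# ',' separates the two columns, ';' terminates each row
data = "1.0,2.0;3.0,4.0;"
df = pd.read_csv(io.StringIO(data), header=None, lineterminator=";")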
Pandas CSV documentation
I am reading many different data files into various pandas dataframes. The columns in these data files are separated by spaces, but the number of spaces differs from file to file (some use a single space, others two, and so on). Thus, every time I import a file, I have to manually open it, check how many spaces were used, and pass exactly that many spaces in sep:
import pandas as pd
df = pd.read_csv('myfile.dat', sep = ' ')
Is there any way I can tell pandas to assume "any number of spaces" as the separator? Also, is there any way I can tell pandas to use either tab (\t) or spaces as the separator?
Yes, you can use a simple regular expression like sep='\s+', which matches one or more whitespace characters (spaces or tabs).
You can also use the parameter skipinitialspace=True which skips the leading spaces after any delimiter.
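A minimal sketch (the file name is a placeholder) showing that one regex covers tabs and any number of spaces:
import pandas as pd

# '\s+' matches any run of spaces and/or tabs, so one call handles all the files
df = pd.read_csv("myfile.dat", sep=r"\s+")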
You can directly use delim_whitespace:
import pandas as pd
df = pd.read_csv('myfile.dat', delim_whitespace=True)
The argument delim_whitespace controls whether whitespace (e.g. ' ' or '\t') is used as the separator. See pandas.read_csv for details.
One thing I found: if you use a separator that the C engine does not support (such as a regex), pandas/Dask will have to use the Python engine instead, which is a good deal slower.
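A minimal sketch of the difference (file names are placeholders):
import pandas as pd

# a multi-character regex separator forces the slower Python engine
df1 = pd.read_csv("data1.txt", sep=r"\s*;\s*", engine="python")

# a single-character separator can stay on the fast C engine
df2 = pd.read_csv("data2.csv", sep=";", engine="c")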