Understanding parsing function here - python

I have googled it, but it is not clear to me what the parse function is doing here. If someone could clarify it for me, I would be grateful.
Data = pd.ExcelFile(filename[0])
ncols = Data.book.sheet_by_index(0).ncols # number of columns in the first sheet (xlrd Book object)
Data_df = Data.parse(0, converters={i : str for i in range(ncols-1)}, encoding="utf-8")

I presume that the snippet presented was preceded with
import pandas as pd
The ExcelFile class is described here, in the Pandas documentation. The ExcelFile.parse function is a thin wrapper around pd.read_excel; the converters argument is described in the last link:
converters : dict, default None
Dict of functions for converting values in certain columns. Keys can either be integers or column labels, values are functions that take one input argument, the Excel cell content, and return the transformed content.
The book object accessed in line 2 is part of the xlrd package, which is the underlying implementation used by pandas to read Excel files. It is documented here, and the sheet_by_index method here (although those just do what you might expect); the ncols field in a Sheet is documented here, and it just returns the number of columns in the sheet, ignoring trailing empty columns.
In short, range(ncols-1) will produce the indices of all the columns except the last one, so the converters dictionary {i : str for i in range(ncols-1)} has the effect of treating every column except the last as a simple string, instead of attempting to parse each cell to decide its datatype.
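To see the effect in isolation, here is a minimal sketch; the workbook name "data.xlsx" and the column count are hypothetical stand-ins:
import pandas as pd

xl = pd.ExcelFile("data.xlsx")
ncols = 4  # in the question this comes from xl.book.sheet_by_index(0).ncols (xlrd)

# Parse the first sheet, forcing every column except the last to str.
# A cell containing 007 is kept as the string "007" rather than the int 7.
df = xl.parse(0, converters={i: str for i in range(ncols - 1)})
print(df.dtypes)  # first ncols-1 columns are object (str); the last is inferred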

.strip() with in-place solution not working

I'm trying to find a solution for stripping blank spaces from some strings in my DataFrame. I found this solution, where someone said this:
I agree with the other answers that there's no inplace parameter for the strip function, as seen in the documentation for str.strip.
To add to that: I've found the str functions for pandas Series usually used when selecting specific rows, like df[df['Name'].str.contains('69')]. I'd say this is a possible reason that it doesn't have an inplace parameter -- it's not meant to be completely "stand-alone" like rename or drop.
Also to add! I think a more pythonic solution is to use negative indices instead:
data['Name'] = data['Name'].str.strip().str[-5:]
This way, we don't have to assume that there are 18 characters, and we'll consistently get the "last 5 characters" instead!
So, I have a list of DataFrames called 'dataframes'. On the first dataframe (which is dataframes[0]), I have a column named 'cnj' with string values, some of them with a blank space at the end. For example:
Input:
dataframes[0]['cnj'][9]
Output:
'0100758-73.2019.5.01.0064 '
So, following the comment above, I did this:
Input:
dataframes[0]['cnj'] = dataframes[0]['cnj'].strip()
Then I get the following error:
AttributeError: 'Series' object has no attribute 'strip'
Since the solution given on the other topic worked, what am I doing wrong to get this error? It seemed to me it shouldn't work because it's a Series, but it should get the same result as the one mentioned above (data['Name'] = data['Name'].str.strip().str[-5:]), right?
Use
dataframes[0]['cnj']=dataframes[0]['cnj'].str.strip()
or better yet, store the dataframe in a variable first:
df0=dataframes[0]
df0['cnj']=df0['cnj'].str.strip()
The code in the solution you posted uses the .str accessor:
data['Name'] = data['Name'].str.strip().str[-5:]
The pandas Series object has no string or date manipulation methods of its own. These are exposed through the Series.str and Series.dt accessor objects.
The result of Series.str.strip() is a new series. That's why .str[-5:] is needed to retrieve the last 5 characters. That results in a new series again. The expression is equivalent to:
temp_series=data['Name'].str.strip()
data['Name'] = temp_series.str[-5:]
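A tiny self-contained sketch of the same chain, with made-up data:
import pandas as pd

data = pd.DataFrame({"Name": ["  0100758-73.2019 ", "  0100999-11.2020 "]})

stripped = data["Name"].str.strip()  # new Series with the whitespace removed
last5 = stripped.str[-5:]            # new Series again: the last 5 characters
data["Name"] = last5

print(data["Name"].tolist())  # ['.2019', '.2020']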
You could just apply a transformation function on the column values like this.
data["Name"] = data["Name"].apply(lambda x: str(x).strip()[-5:])
If what you need is a Series or DataFrame of strings without the trailing spaces (at least, that's my understanding of your question), use str.rstrip(), which works on both Series and DataFrame objects.
Note: strip() is a plain Python string method, so the error you are getting is expected when you call it directly on a Series.
Refer to the pandas documentation and try the str.rstrip() provided by pandas. For str.strip() you can refer to the documentation as well; it works for me.
In your case, assuming the column name is s, you can use the code below (remember to assign the result back, since str.strip() returns a new Series):
df[s] = df[s].str.strip()
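For the question's exact data, the trailing-space variant is enough; a quick sketch:
import pandas as pd

# str.rstrip removes only trailing whitespace ('...0064 ' -> '...0064'):
s = pd.Series(["0100758-73.2019.5.01.0064 "])
print(repr(s.str.rstrip().iloc[0]))  # '0100758-73.2019.5.01.0064'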

CSV numeric values cannot be interpreted by Excel/Google Sheets

(This is a mix between a code and a 'user' issue, but since I suspect the issue is code, I opted to post on Stack Overflow instead of Super User.)
I generated a .csv file with the pandas.DataFrame.to_csv() method. This file consists of 2 columns: one is a label (text) and the other is a numeric value called accuracy (float). The delimiter used to separate columns is a comma (,) and all float values are stored with a dot decimal separator, like this: 0.9438245862
Even though this column is saved as float, Excel and Google Sheets infer its type as text. And when I try to format this column as a number, they ignore the "0." and return a very high value instead of decimals, like:
(text) 0.9438245862 => (number) 9438245862,00
I double-checked my .csv file by reimporting it with pandas.read_csv() and printing dataframe.dtypes, and the column is imported as float successfully.
I'd appreciate some guidance on what I am missing.
Thanks,
By itself, the csv file should be correct. Both you and pandas agree on the delimiter and the floating point format. But Excel might not, depending on your locale. A simple way to make sure is to write a tiny Excel sheet containing, on the first row, one text value and one floating point value. You then export the file as csv and check what the delimiter and floating point formats are.
AFAIK, it is much easier to change your Python code to follow what your Excel expects than to try to explain to Excel that the format of CSV files can vary...
I know that you can change the delimiter and floating point format in the current locale on a Windows system. But it is a global setting...
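As an illustration of adapting the Python side, here is a hedged sketch; whether your Excel expects ';' as the field delimiter and ',' as the decimal separator depends on your locale:
import pandas as pd

# For locales where the comma is the decimal separator, Excel typically
# also expects ';' as the field delimiter.
df = pd.DataFrame({"label": ["a", "b"], "accuracy": [0.9438245862, 0.87]})
df.to_csv("for_excel.csv", sep=";", decimal=",", index=False)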
A short example of data would be most useful here. Otherwise we have no idea what you're actually writing/reading. But I'll hazard a guess based on the information you've provided.
The pandas dataframe will have column names. These column names will be text. Unless you tell Excel/Sheets to use the first row as the column name, it will have to treat the column as text. If this isn't the case, could you perhaps save the head of the dataframe to a csv, check it in a text editor, and see how Excel/Sheets imports it. Then include those five rows and two columns in your follow up.
The coding is not necessarily the issue here, but a combination of various factors. I am assuming that your computer is not using the dot character as a decimal separator, due to your language settings (for example, French, Dutch, etc.). Instead, your computer (and thus also Excel) is likely using a comma as a decimal separator.
If you want to open the data of your analysis / work later with Excel with little to no changes, you can either opt to change how Excel works or how you store the data to a CSV file.
Choosing the latter, you can specify the decimal character for the df.to_csv method via its decimal keyword. You should then also remember to change the decimal character when importing your data again (if you want to read the data back).
Continuing with the approach of adapting your Python code, you can use the following code snippets to change how you write the dataframe to a csv:
import pandas as pd
# ... some transformations here ...
df.to_csv('myfile.csv', decimal=',')
If you, then, want to read that output file back in with Python (using Pandas), you can use the following:
import pandas as pd
df = pd.read_csv('myfile.csv', decimal=',')

How can I effectively load data on Stack Overflow questions using Pandas read_clipboard?

I notice a lot of Pandas questions on Stack Overflow only include a few rows of their data as text, without the accompanying code to generate/reproduce it. I am aware of the existence of read_clipboard, but I am unable to figure out how to effectively call this function to read data in many situations, such as when there are white spaces in the header names, or Python objects such as lists in the columns.
How can I use pd.read_clipboard more effectively to read data pasted in unconventional formats that don't lend themselves to easy reading using the default arguments? Are there situations where read_clipboard comes up short?
read_clipboard: Beginner's Guide
read_clipboard is truly a saving grace for anyone starting out to answer questions in the Pandas tag. Unfortunately, pandas veterans also know that the data provided in questions isn't always easy to grok into a terminal due to various complications in the format of the data posted.
Thankfully, read_clipboard has arguments that make handling most of these cases possible (and easy). Here are some common use cases and their corresponding arguments.
Common Use Cases
read_clipboard uses read_csv under the hood with white space separator, so a lot of the techniques for parsing data from CSV apply here, such as
parsing columns with spaces in the data
use sep with a regex argument. First, ensure there are at least two spaces between columns and at most one consecutive white space inside the column's data itself. Then you can use sep=r'\s{2,}', which means "separate columns by looking for at least two consecutive white spaces" (note: engine='python' is required for multicharacter or regex separators):
df = pd.read_clipboard(..., sep=r'\s{2,}', engine='python')
Also see How do you handle column names having spaces in them when using pd.read_clipboard?.
reading a series instead of DataFrame
use squeeze=True; you will likely also need header=None if the first row is also data.
s = pd.read_clipboard(..., header=None, squeeze=True)
Also see Could there be an easier way to use pandas read_clipboard to read a Series?.
loading data with custom header names
use names=[...] in conjunction with header=None and skiprows=[0] to ignore existing headers.
df = pd.read_clipboard(..., header=None, names=['a', 'b', 'c'], skiprows=[0])
loading data without any headers
use header=None
set one or more columns as the index
use index_col=[...] with the appropriate label or index
parsing dates
use parse_dates with the appropriate format. If parsing datetimes (i.e., columns where the date and the timestamp are separated by a space), you will likely also need sep=r'\s{2,}' while ensuring your columns are separated by at least two spaces. See the sketch below.
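A hedged sketch for that last case, assuming a hypothetical 'date' column and columns separated by at least two spaces on the clipboard:
import pandas as pd

df = pd.read_clipboard(sep=r'\s{2,}', engine='python', parse_dates=['date'])
print(df.dtypes)  # 'date' should now be datetime64[ns]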
See this answer by me for a more comprehensive list on read_csv arguments for other cases not covered here.
Caveats
read_clipboard is a Swiss Army knife. However, it
cannot read data in prettytable/tabulate formats (IOW, borders make it harder)
See Reading in a pretty-printed/formatted dataframe using pd.read_clipboard? for solutions to tackle this.
cannot correctly parse MultiIndexes unless all elements in the index are specified.
See Copying MultiIndex dataframes with pd.read_clipboard? for solutions to tackle this.
cannot ignore/handle ellipses in data
my suggested method is to manually remove ellipses before printing
cannot parse columns of lists (or other objects) as anything other than string. The columns will need to be converted separately, as shown in How do you read in a dataframe with lists using pd.read_clipboard?.
cannot read text from images (so please don't use images as a means to share your data with folks!)
The one weakness of this function is that it doesn't capture contents of Ctrl + C if the copy is performed from a PDF file. Testing it this way results in an empty read.
But by using a regular text editor, it goes just fine. Here is an example using randomly typed text:
>>> pd.read_clipboard()
Empty DataFrame
Columns: [sfsesfsdsxcvfsdf]
Index: []

How to read_csv float values with the correct decimals in Python [duplicate]

I have a csv file containing numerical values such as 1524.449677. There are always exactly 6 decimal places.
When I import the csv file (and other columns) via pandas read_csv, the column automatically gets the datatype object. My issue is that the values are shown as 2470.6911370000003 which actually should be 2470.691137. Or the value 2484.30691 is shown as 2484.3069100000002.
This seems to be a datatype issue in some way. I tried to explicitly provide the data type when importing via read_csv by giving the dtype argument as {'columnname': np.float64}. Still the issue did not go away.
How can I get the values imported and shown exactly as they are in the source csv file?
Pandas uses a dedicated decimal-to-binary converter that compromises accuracy in preference to speed.
Passing float_precision='round_trip' to read_csv fixes this.
Check out this page for more detail on this.
After processing your data, if you want to save it back in a csv file, you can pass float_format="%.nf" to the corresponding method.
A full example:
import pandas as pd
df_in = pd.read_csv(source_file, float_precision='round_trip')
df_out = ... # some processing of df_in
df_out.to_csv(target_file, float_format="%.3f") # for 3 decimal places
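If you want to see the difference without a file on disk, a small in-memory check (hedged: the exact trailing digits can vary across pandas versions) might look like:
import io
import pandas as pd

csv = "x\n2470.691137\n"
fast = pd.read_csv(io.StringIO(csv))
exact = pd.read_csv(io.StringIO(csv), float_precision='round_trip')
print(fast["x"][0])   # may show 2470.6911370000003 with the default parser
print(exact["x"][0])  # 2470.691137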
I realise this is an old question, but maybe this will help someone else:
I had a similar problem, but couldn't quite use the same solution. Unfortunately the float_precision option only exists when using the C engine and not with the python engine. So if you have to use the python engine for some other reason (for example because the C engine can't deal with regex literals as delimiters), this little "trick" worked for me:
In the pd.read_csv arguments, define dtype=str and then convert your dataframe to whatever dtype you want, e.g. df = df.astype('float64').
Bit of a hack, but it seems to work. If anyone has any suggestions on how to solve this in a better way, let me know.
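A hedged sketch of that workaround (the file name, the regex separator ';+', and the column name 'x' are hypothetical):
import pandas as pd

# A regex separator forces the python engine, which does not support
# float_precision, so read everything as str and convert afterwards:
df = pd.read_csv("myfile.csv", sep=r";+", engine="python", dtype=str)
df["x"] = df["x"].astype("float64")  # convert the numeric column(s)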

Why does `head` need `()` and `shape` does not?

In the following code, I import a csv file into Python's pandas library and display the first 5 rows, and query the 'shape' of the pandas dataframe.
import pandas as pd
data = pd.read_csv('my_file.csv')
data.head() #returns the first 5 rows of the dataframe
data.shape # displays the number of rows and columns of the dataframe
Why is it that the head() method requires empty parentheses after head but shape does not? Does it have to do with their types?
If I called head without following it with the empty parentheses, I would not get the same result. Is it that head is a method and shape is just an attribute?
How could I generalize the answer to the above question to the rest of Python? I am trying to learn not just about pandas here but Python in general. For example, a sentence such as "When _____ is the case, one must include empty parentheses if no arguments will be provided, but for other attributes one does not have to?
The reason that head is a method and not an attribute most likely has to do with performance. If head were an attribute, it would mean that every time you wrangle a dataframe, pandas would have to precompute the slice of data and store it in the head attribute, which would be a waste of resources. The same goes for the other methods called with empty parentheses.
In the case of shape, it is provided as an attribute since this information is essential to any dataframe manipulation; thus it is precomputed and available without a call.
When you call data.head() you are calling the method head(self) on the object data. However, when you write data.shape, you are referencing a public attribute of the object data.
It is good to keep in mind that there is a distinct difference between methods and object attributes. You can read up on it here
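A minimal sketch of the distinction, using a hypothetical class (not pandas internals):
class Table:
    def __init__(self, rows):
        self._rows = rows
        self.shape = (len(rows),)  # attribute: stored once, just looked up

    def head(self, n=5):           # method: does work each time it is called
        return self._rows[:n]

t = Table(list(range(100)))
print(t.shape)   # (100,) -- attribute access, no parentheses
print(t.head())  # [0, 1, 2, 3, 4] -- method call, parentheses required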
