(This is a mix between a code and a 'user' issue, but since I suspect the issue is code-related, I opted to post on Stack Overflow instead of Super User.)
I generated a .csv file with the pandas.DataFrame.to_csv() method. The file consists of two columns: one is a label (text) and the other is a numeric value called accuracy (float). The delimiter separating the columns is a comma (,) and all float values use a dot as the decimal mark, like this: 0.9438245862
Even though the column is saved as float, Excel and Google Sheets infer its type as text. And when I try to format this column as a number, they ignore the leading "0." and return a very large value instead of the decimal, like:
(text) 0.9438245862 => (number) 9438245862,00
I double-checked my .csv file by reimporting it with pandas.read_csv() and printing dataframe.dtypes, and the column is imported as float successfully.
I'd appreciate some guidance on what I'm missing.
Thanks,
By itself, the csv file should be correct: both you and pandas agree on the delimiter and the floating point format. But Excel might not agree with you, depending on your locale. A simple way to check is to create a tiny Excel sheet containing one text value and one floating point value on the first row, export the file as csv, and inspect which delimiter and floating point format Excel used.
AFAIK, it is much easier to change your Python code to match what your Excel expects than to try to explain to Excel that the format of CSV files can vary...
I know that you can change the delimiter and decimal separator in the current locale on a Windows system, but it is a global setting...
A short example of data would be most useful here. Otherwise we have no idea what you're actually writing/reading. But I'll hazard a guess based on the information you've provided.
The pandas dataframe will have column names, and those column names are text. Unless you tell Excel/Sheets to use the first row as the header, it will have to treat the whole column as text. If this isn't the case, could you save the head of the dataframe to a csv, check it in a text editor, and see how Excel/Sheets imports it? Then include those five rows and two columns in your follow-up.
The coding is not necessarily the issue here, but a combination of various factors. I am assuming that your computer is not using the dot character as a decimal separator, due to your language settings (for example, French, Dutch, etc). Instead your computer (and thus also Excel) is likely using a comma as a decimal separator.
If you want to open the output of your analysis later in Excel with little to no changes, you can either change how Excel works or change how you store the data in the CSV file.
Choosing the latter, you can specify the decimal character via the "decimal" keyword of the df.to_csv method. Remember that you then also have to specify the same decimal character when importing the data again (if you want to read the data back).
Continuing with the approach of adapting your Python code, you can use the following snippets to change how you write the dataframe to a csv:
import pandas as pd
... some transformations here ...
df.to_csv('myfile.csv', decimal=',')
If you, then, want to read that output file back in with Python (using Pandas), you can use the following:
import pandas as pd
df = pd.read_csv('myfile.csv', decimal=',')
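One extra point worth knowing: locales that use a comma as the decimal mark usually pair it with a semicolon as the CSV delimiter, which is also what Excel expects there. A round-trip sketch (the filename is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"label": ["a", "b"], "accuracy": [0.9438245862, 0.5]})

# Locales that use ',' as the decimal mark typically pair it with ';' as delimiter
df.to_csv("myfile.csv", sep=";", decimal=",", index=False)

# Read it back with the same settings; the float dtype survives the round trip
df2 = pd.read_csv("myfile.csv", sep=";", decimal=",")
print(df2["accuracy"].dtype)  # float64
```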
I am using Python 3.7
I need to load data from two different sources (both csv) and determine which rows from the one source are not in the second source.
I have used pandas data-frames to load the data and do a comparison between the two sources of data.
I loaded the data from the csv file and a value like 2010392 is turned to 2010392.0 in the data-frame column.
I have read quite a number of articles about formatting data-frame columns; unfortunately, most of them are about date and time conversions.
I came across the article "Format integer column of Dataframe in Python pandas" at http://www.datasciencemadesimple.com/format-integer-column-of-dataframe-in-python-pandas/ which does not solve my problem.
Based on the above mentioned article I have tried the following:
pd.to_numeric(data02['IDDLECT'], downcast='integer')
Out[63]:
0 2010392.0
1 111777967.0
2 2010392.0
3 2012554.0
4 2010392.0
5 2010392.0
6 2010392.0
7 1170126.0
and as you can see, the column values still have a decimal point with a zero.
I expect loading the dataframe from a csv file to keep a number such as 2010392 as 2010392, not 2010392.0.
Here is the code that I have tried:
import pandas as pd
data = pd.read_csv("timetable_all_2019-2_groups.csv")
data02 = data.drop_duplicates()
print(f'Len data {len(data)}')
print(data.head(20))
print(f'Len data02 {len(data02)}')
print(data02.head(20))
pd.to_numeric(data02['IDDLECT'], downcast='integer')
Here are a few lines of the csv file. The data in the one source looks like this:
IDDCYR,IDDSUBJ,IDDOT,IDDGRPTYP,IDDCLASSGROUP,IDDLECT,IDDPRIMARY
019,AAACA1B,VF,C,A1,2010392,Y
2019,AAACA1B,VF,C,A1,111777967,N
2019,AAACA3B,VF,C,A1,2010392,Y
2019,AAACA3B,VF,C,A1,2012554,N
2019,AAACB2A,VF,C,B1,2010392,Y
2019,AAACB2A,VF,P,B2,2010392,Y
2019,AAACB2A,VF,C,B1,2010392,N
2019,AAACB2A,VF,P,B2,1170126,N
2019,AAACH1A,VF,C,A1,2010392,Y
Looks like you have data that is not of integer type. Once loaded, you should clean that data and then convert the column to int.
From your description, you have NaN and/or inf values. You could impute the missing values with the mode, mean, median or a constant value. You can achieve that either with pandas or with sklearn's imputer, which is dedicated to imputing missing values.
Note that if you use mean, you may end up with a float number, so make sure to get the mean as an int.
The imputation method you choose really depends on what you'll use this data for later. If you want to understand the data, filling NaNs with 0 may skew aggregations later (e.g. if you want to know what the mean is, it won't be accurate).
That being said, I see you're dealing with categorical data. One option here is to use dtype='category'. If you later fit a model with this and leave the ids as numbers, the model can infer weird things that are not correct (e.g. that the sum of two ids equals some third id, or that higher ids are more important than lower ones... things that a priori make no sense and should not be left to chance).
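As a small sketch of that categorical option (column name taken from the question, data values illustrative):

```python
import pandas as pd

s = pd.Series([2010392, 111777967, 2010392], name="IDDLECT")

# Ids become labels rather than quantities; no numeric meaning is implied
s_cat = s.astype("category")
print(s_cat.dtype)  # category
```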
Hope this helps!
data02['IDDLECT'] = data02['IDDLECT'].fillna(0).astype('int')
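If filling missing ids with 0 is undesirable, recent pandas versions also offer a nullable integer dtype that keeps the missing values while displaying whole numbers. A sketch using the question's column name with illustrative data:

```python
import numpy as np
import pandas as pd

s = pd.Series([2010392.0, np.nan, 111777967.0], name="IDDLECT")

# 'Int64' (capital I) is pandas' nullable integer dtype: whole numbers, NaN kept as <NA>
s_int = s.astype("Int64")
print(s_int.dtype)  # Int64
```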
I have been using pandas but am open to all suggestions. I'm not an expert at scripting and am at a complete loss. My goal is the following:
Merge multiple CSV files. Was able to do this in Pandas and have a dataframe with the merged dataset.
[Screenshot of the merged dataset]
Delete the duplicated "GEO" columns after the first set. This part doesn't let me use df = df.loc[:,~df.columns.duplicated()] because they are not technically duplicates: the repeated column names end with .1, .2, etc., which I am guessing the concat adds. Another problem is that some columns have a duplicated column name but hold different data. I have been using the first row as the index since it's always the same coded values, but this row is unnecessary and will be deleted later in the script. This is my biggest problem right now.
Delete certain columns, such as the ones with the "Margins". I use ~df2.columns.str.startswith for this and have no trouble with it.
Replace spaces, ":" and ";" with underscores in the first row. I have no clue how to do this.
Insert a new column, write a '=TEXT(B1,0)' formula, fill it down the whole column (the formula changing to B2, B3, etc.), then copy the column and paste as values. I was able to do this in openpyxl, although I was having trouble and could not check the final output due to Excel problems.
source = excel.Workbooks.Open(filename)
excel.Range("C1:C1337").Select()
excel.Selection.Copy()
excel.Selection.PasteSpecial(Paste=constants.xlPasteValues)
Not sure if it works, and I was wondering whether this is possible in pandas or win32com, or whether I should stay with openpyxl. Thanks all!
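For the suffixed GEO columns and the header cleanup specifically, a pandas-only sketch may help. It assumes the duplicates picked up .1, .2 suffixes during the merge; the column names here are made up, so adjust the pattern to your real data:

```python
import re
import pandas as pd

# Toy frame standing in for the merged dataset; real column names will differ
df = pd.DataFrame({"GEO": ["A"], "GEO.1": ["A"], "GEO.2": ["A"], "Total: Pop": [1]})

# Keep the first GEO; drop the '.<number>'-suffixed copies the concat created
df = df.loc[:, [not re.match(r"GEO\.\d+$", c) for c in df.columns]]

# Replace spaces, ':' and ';' in the header row with underscores
df.columns = [re.sub(r"[ :;]", "_", c) for c in df.columns]
print(list(df.columns))  # ['GEO', 'Total__Pop']
```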
I have a pandas dataframe with a lot of % and currency columns coming in from a sql table as float datatypes.
I'm doing the below to convert it to currency format.
I am appending this data to an excel spreadsheet with openpyxl.
While the output currently shows cost as $123, $100, etc., the apply presumably converts the values to strings, losing the ability to sum them after generating the data.
How can I make the data display the $ before cost yet retain its ability to sum the values?
(df['Cost']).apply(lambda x: '${:,.2f}'.format(x))
The dataframe I am attempting to append to Excel is very large and wide: about 480 columns and 30K rows, with mixed data (strings, currency, % and plain old floats). I'm not sure I can specify formats cell by cell, so I was hoping to handle it in the dataframe and output it. I also read that number formats are strings in openpyxl, so that would not help in this case.
Any other ideas/options for me?
Thanks in advance.
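One possible way out, assuming the sheet is written with openpyxl: the "strings" in openpyxl's number formats are only display instructions; the cell values themselves stay numeric, so SUM and later numeric processing still work. A sketch with an illustrative Cost column:

```python
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.append(["Cost"])
for value in (123.0, 100.0):
    ws.append([value])

# The format string only controls display; the stored value remains a float,
# so Excel's SUM (and any later numeric processing) keeps working
for (cell,) in ws.iter_rows(min_row=2, min_col=1, max_col=1):
    cell.number_format = '"$"#,##0.00'

print(ws["A2"].value)  # 123.0
```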
I am currently using Python, and I have a dataframe including a column with PartNumbers.
These part numbers have various patterns: e.g. 500-1222-33, 48L48 etc.
However, I want to remove rows having the following format: e.g. 06/06/3582.
Is there a way to remove the rows with these value-patterns from the dataframe?
Thanks in advance.
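A sketch, assuming the column is named PartNumbers (adjust to the real name) and the unwanted values follow a dd/dd/dddd shape:

```python
import pandas as pd

df = pd.DataFrame({"PartNumbers": ["500-1222-33", "48L48", "06/06/3582"]})

# Keep only rows that do NOT fully match the dd/dd/dddd pattern
mask = df["PartNumbers"].str.match(r"\d{2}/\d{2}/\d{4}$")
df = df[~mask]
print(df["PartNumbers"].tolist())  # ['500-1222-33', '48L48']
```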