Handling different date formats in a column including null - python

I have a csv with a date column, which I am fetching using pandas and then trying to insert into a MySQL db.
My requirement is that it should look for null values and make them None, and then, irrespective of the date format, convert the date to this format: 2019-07-02. Below is my code
df = pd.read_csv (r'C:\Users\adminuser\Desktop\CSAExcel\test.csv', parse_dates = True , usecols = ['number','active','short_description','incident_state','caller_id','assignment_group','assigned_to','sys_created_on','sys_created_by','opened_at','opened_by','closed_at','closed_by','resolved_at','u_reported_by','u_reported_by.country','u_type'],encoding='cp1252')
df2 = df.replace(np.nan, '', regex=True)
df2['created_on']= df2['created_on'].apply(lambda t: None if pd.isnull(t) else datetime.datetime.fromtimestamp(t).strftime('%Y-%m-%d'))
I am getting the error: an integer is required (got type str)

I see a few problems here. My guess is you spent some time googling solutions and have different approaches woven into one. I'll attempt to give you an easy-to-understand approach, but for common problems like parsing dates, various methods exist. In your snippet, there are two clear problems.
First, you replace the NaN values with empty strings, but then in the lambda function you condition on a null value, so this is redundant; we can remove the second line.
As an aside, why do you want None instead of np.nan?
Second, the lambda function assumes a POSIX timestamp, similar to what is returned by time.time(). The column 'created_on' does not contain that, but a string.
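A quick illustration of the mismatch (the timestamp value here is just an arbitrary example):
import datetime
datetime.datetime.fromtimestamp(1562025600)    # works: a POSIX timestamp (int/float)
datetime.datetime.fromtimestamp('2019-07-02')  # raises the TypeError from the question: an integer is required (got type str)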
My approach does the following. It uses pd.to_datetime(), which does the heavy lifting here: it converts the str value to a datetime object (actually, it accepts various datatypes, including strings or Series of strings). If you pass a NaN value into to_datetime(), it will return a NaT value, so afterwards we replace those with None.
Can you check whether the following works?
df1 = pd.read_csv(r'C:\Users\adminuser\Desktop\CSAExcel\test.csv', usecols = ['number','active','short_description','incident_state','caller_id','assignment_group','assigned_to','sys_created_on','sys_created_by','opened_at','opened_by','closed_at','closed_by','resolved_at','u_reported_by','u_reported_by.country','u_type'], encoding='cp1252')
df1['created_on'] = pd.to_datetime(df1['created_on'], format='%Y-%m-%d')
df1['created_on'] = df1['created_on'].apply(lambda t: None if pd.isnull(t) else t)
This is an explicit way, but pandas has a quicker way. In your original snippet, you also used the argument parse_dates=True in read_csv(). You can print df1.dtypes to see if the column created_on was already successfully converted to datetime objects. If so, you'd only have to change the NaT/NaN values to None.
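For instance, a minimal sketch of that quicker route (the column name 'created_on' is assumed here to match the snippets above, and usecols is trimmed for brevity; adjust both to your real csv):
import pandas as pd
df1 = pd.read_csv(r'C:\Users\adminuser\Desktop\CSAExcel\test.csv',
                  parse_dates=['created_on'], encoding='cp1252')
print(df1.dtypes)  # 'created_on' should now show as datetime64[ns]
# Format as 'YYYY-MM-DD' strings; NaT becomes NaN at this point.
formatted = df1['created_on'].dt.strftime('%Y-%m-%d')
# Replace the NaN entries with None so the MySQL driver inserts NULL.
df1['created_on'] = formatted.where(formatted.notnull(), None)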

Related

Pandas.read_sql_query reads int as float

I am exporting an SQL database into csv using Pandas.read_sql_query + df.to_csv and have the problem that integer fields are represented as float in the DataFrame.
My code:
conn = pymysql.connect(server, port)
chunk = pandas.read_sql_query('''select * from table''', conn)
df = pandas.DataFrame(chunk) # here int values are float
df.to_csv()
I am exporting a number of tables this way, and the problem is that int fields are exported as float (with a dot). Also, I don't know upfront which column has which type (the code is supposed to be generic for all tables).
However, my target is to export everything as is - in strings.
The things I've tried (without success):
df.applymap(str)
read_sql_query(coerce_float=False)
df.fillna('')
DataFrame(dtype=object) / DataFrame(dtype=str)
Of course, I can post-process the data afterwards to type-cast the integers, but it would be better to do it during the initial import.
UPD: My dataset has NULL values. They should be replaced with empty strings (as the purpose is to typecast all columns to strings).
Pandas infers the datatype from a sample of the data. If an integer column has null values, pandas assigns it the float datatype, because NaN is of float type and because pandas is based on numpy, which has no nullable integer type. So try making sure that you don't have any nulls, or that nulls are replaced with 0's if that makes sense for your dataset (0 being an integer).
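A quick way to see that promotion in action:
import pandas as pd
print(pd.Series([1, 2, 3]).dtype)     # int64
print(pd.Series([1, 2, None]).dtype)  # float64 - the null forces the column to float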
Also, another way to go about this would be to specify the dtypes on data import, but you have to use a special kind of integer, the nullable Int64Dtype, as per the docs:
"If you need to represent integers with possibly missing values, use one of the nullable-integer extension dtypes provided by pandas"
So I found a solution by post-processing the data (force converting to strings) using applymap:
def convert_to_int(x):
    try:
        return str(int(x))
    except:
        return x

df = df.applymap(convert_to_int)
This step takes significant time to run, but it solves my problem.

Need help with a sorting error in Pandas

I have a data frame that looks like this (Pandas DF screenshot):
I exported it to excel to be able to see it more easily. But basically I am trying to sort it by SeqNo ascending and it isn't counting correctly. Instead of going 0,0,0,0,1,1,1,1,2,2,2,2 it goes 0,0,0,0,0,1,1,1,1,10,10,10,10. Please help out if possible. Here is the code that I have to sort it; I have tried many other methods but it just isn't sorting correctly.
final_df = df.sort_values(by=['SeqNo'])
Based on your description I think it is treating the column values as "String" instead of "int". You can confirm this by checking the datatype of the column (e.g. use df.info() to check the datatypes of all the columns in the dataframe).
One option to resolve this is to convert that particular column from string to "int" before sorting and exporting to excel. You can apply pandas' to_numeric() function for this; please check the pandas documentation for to_numeric() (refer to https://www.linkedin.com/pulse/change-data-type-columns-pandas-mohit-sharma/ for a sample).
First of all, try the command given below to verify the type of data you have been given, because it's important to understand your data first:
print(df.dtypes)
The command above will display the datatypes of all the columns. Then look for the SeqNo datatype. If your output for SeqNo shows something like:
SeqNo object
dtype: object
then your data is in string format and you have to convert it to an integer or numeric format. There are two ways to convert it:
1. Using the astype(int) method
df['SeqNo'] = df['SeqNo'].astype(int)
2. Using the to_numeric method
df['SeqNo'] = pd.to_numeric(df['SeqNo'])
After this step, verify that the datatype has been changed by typing print(df.dtypes) again; it should now show output similar to:
SeqNo int32
dtype: object
Now you can print the data after sorting it in ascending order:
final_df = df.sort_values(by = ['SeqNo'], ascending = True)

python pandas quantlib.time.date.Date

I have two dataframes:
import pandas as pd
from quantlib.time.date import Date
cols = ['ColStr','ColDate']
dataset1 = [['A',Date(2017,1,1)],['B',Date(2017,2,2)]]
x = pd.DataFrame(dataset1,columns=cols)
dataset2 = [['A','2017-01-01'],['B','2017-02-04']]
y = pd.DataFrame(dataset2,columns=cols)
Now, I want to compare the two tables. I have written another set of code that compares the two (larger) dataframes, and it works for strings and numerical values.
My problem is that with the column 'ColDate', one being string type and the other being Date type, I am not able to validate that the row with 'ColStr' = 'A' is a match and the row with 'ColStr' = 'B' is a mismatch.
I would have to
(1) either convert y.ColDate to Date,
(2) or convert x.ColDate to str with a similar format as y.ColDate.
How do I achieve one or the other?
I guess that you need to cast them to a single common type, using something like x['ColDate'] = x.ColDate.map(convert_type) or any other method that iterates over the column values. Check other functions from the pandas docs, like apply().
The convert_type function should be defined in your program and accept a single argument to be passed into map().
And when the columns have the same type, you can compare them using any method you like.
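A minimal sketch of such a convert_type, assuming the quantlib Date object exposes year/month/day attributes (if your binding exposes methods such as d.year(), d.month(), d.dayOfMonth() instead, adjust accordingly):
def convert_type(d):
    # Hypothetical converter: render a quantlib Date as an ISO date string.
    return '%04d-%02d-%02d' % (d.year, d.month, d.day)

x['ColDate'] = x['ColDate'].map(convert_type)
# Both columns now hold strings like '2017-01-01' and compare directly:
print(x['ColDate'] == y['ColDate'])  # row A -> True, row B -> False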
You probably want to use the dt.strftime() function (note that this requires the column to hold pandas datetime values, so you may need to convert it first):
x['ColDate'].dt.strftime("%Y-%m-%d")

Receiving KeyError when converting a column from float to int

I created a pandas data frame using read_csv.
I then changed the column name of the 0th column from 'Unnamed' to 'Experiment_Number'.
The values in this column are floating point numbers and I've been trying to convert them to integers using:
df['Experiment_Number'] = df['Experiment_Number'].astype(int)
I get this error:
KeyError: 'Experiment_Number'
I've been trying every way since yesterday, for example also
df['Experiment_Number'] = df.astype({'Experiment_Number': int})
and many other variations.
Can someone please help? I'm new to pandas and this close to giving up on this :(
Any help will be appreciated
I had used this for renaming the column before:
df.columns.values[0] = 'Experiment_Number'
This should have worked. The fact that it didn't suggests there may be special characters or unprintable characters in your column names.
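One quick, hedged way to check for that is to look at the repr of each column name, so that stray whitespace or unprintable characters become visible:
print([repr(c) for c in df.columns])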
I can offer another possible suggestion, using df.rename:
df = df.rename(columns={df.columns[0] : 'Experiment_Number'})
You can convert the type during your read_csv() call and then rename the column afterward, as in:
df = pandas.read_csv(filename,
                     dtype={'Unnamed': 'float'},   # inform read_csv this field is float
                     converters={'Unnamed': int})  # apply the int() function
df.rename(columns={'Unnamed': 'Experiment_Number'}, inplace=True)
The dtype is not strictly necessary, because the converter will override it in this case, but it is wise to get in the habit of always providing a dtype for every field of your input. It is annoying, for example, how pandas treats integers as floats by default. Also, you may later remove the converters option without worry, if you specified dtype.

Force Pandas to Read in Column as raw unicode

I need to account for folks entering data into a spreadsheet completely wrong. I cannot control their behavior because I'm scraping it from another website. However, there is some truly bad data entry, such as the following for "Tons" of cargo (values entered like 1,19,55 and 1,18,62):
Lovely, right? I need to figure out a way to read numbers like that into pandas without pandas auto-casting them to dates, after which point it's impossible to convert them back to 11955 and 11862. To add a cherry on top, the following won't work:
dfx = pd.read_excel(ii,header=None,dtype={'Tons': str})
because often the data has no column headers, and I'm inferring the header from the order of the data, which thankfully doesn't change. So how do I get pandas to be agreeable here?
Once I read in the data, even if I then change the entire column to unicode or string, it'll just be a unicode or string representation of the date:
2055-01-19 00:00:00
2062-01-18 00:00:00
So I need to read it in either "raw" (not sure what that means) as 1,19,55 without pandas trying to guess at the type, or just somehow as a number ignoring the commas...
Thanks!
You can create a converter for the column Tons to format the data as you want, as the pd.read_excel documentation explains:
converters : dict, default None. Dict of functions for converting values in certain columns. Keys can either be integers or column labels, values are functions that take one input argument, the Excel cell content, and return the transformed content.
For example, you can use the following converter:
tons_converter = lambda x: int("".join(x.split(',')))
dfx = pd.read_excel(ii,header=None,dtype={0: str}, converters={0: tons_converter})
Reproducible example
Here's an example creating a csv file on the fly and applying the conversion.
from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO
import pandas as pd
data = """
1,125,125
10,578,589
12
"""
tons_converter = lambda x: int("".join(x.split(',')))
dfx = pd.read_csv(StringIO(data),header=None,dtype=object, sep="|", converters={0: tons_converter})
print(dfx.head())
The output is what you want:
0
0 1125125
1 10578589
2 12
