General background
I've written a function which incorporates a MySQL query, with some munging on the returned data (pulled into a pandas df).
from sqlalchemy import create_engine
import pandas as pd

enginedb = create_engine("mysql+mysqlconnector://user:pswd@10.0.10.26:3306/db",
                         encoding='latin1')
query = ("""Select blah blah""")
df = pd.read_sql(query, enginedb)
This works fine - the query is a significant one with multiple joins etc. However, it transpired that for a certain lot within the db the datatypes were off: for almost all 'normal' lots, the columns came back as int64, some object, and a datetime64[ns]... but for one lot (so far), all but the datetime were returning as object.
Issue
I need to do a stack - one of the columns is a list, and I've got some code to take each item of the list and stack them down row by row:
cols = list(df)
cols = cols[:-1]  # every column except the last one ('data')
df_stack = df.set_index(cols)['data'].apply(pd.Series).stack()
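For context, here is a minimal sketch (made-up data, not the real lot) of what that stack produces when 'data' really does hold lists: one output row per list element, indexed by the remaining key columns.

import pandas as pd

toy = pd.DataFrame({'id': [1, 2], 'data': [[10, 20], [30]]})
keys = list(toy)[:-1]  # every column except the list column, here just ['id']
stacked = toy.set_index(keys)['data'].apply(pd.Series).stack()
# 'stacked' is a Series with a MultiIndex of (id, position) and one value per list element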
The problem is this doesn't work for this 'odd' lot with the non-standard datatypes (the reason for the non-standard datatypes is an upstream ETL process that I can't affect).
The exact error is:
'Series' object has no attribute 'stack'
Consequently I incorporated an if/else statement, checking whether the dtype of one of the columns is incorrect and, if so, changing it:
if df['id'].dtype == 'int64':
    df_stack = df.set_index(cols)['data'].apply(pd.Series).stack()
    df_stack = df_stack.reset_index()
else:
    df_stack = df.apply(pd.to_numeric, errors='coerce')
    # it can't be more specific than for all the columns, as there are a LOT
But this is having no effect - I've included in the function (containing the query and subsequent munging) print-outs of df.dtypes and df_stack.dtypes, and the conversion changes nothing.
Why is this?
EDIT
I've added a picture showing the code (at right) that attempts to catch the incorrectly-dtyped lot (12384), plus the print-outs before and after the pd.to_numeric call (both show only objects, no numeric columns).
My underlying question has two parts:
What would cause 'Series' object has no attribute 'stack'? (more fundamentally than wrong datatype - or at least why is datatype an issue?)
Why would pd.to_numeric not cause any change here?
Related
More of a conceptual question.
When I import files into Python (without specifying the data types) -- just straight up df = pd.read_csv("blah.csv") or df = pd.read_excel("blah.xls") -- Python naturally guesses the data types of the columns.
No issues here.
However, sometimes when I am working with one of the columns, say an object column, and I know for certain that Python guessed correctly, my .str functions don't work as intended, or I get an error. Yet if I specify the column data type as str after importation, everything works as intended.
I also noticed that if I specify one of the object columns as a str datatype after importation, the size of the object increases. So I am guessing Python's object type is different from a "string object" datatype? What causes this discrepancy?
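For illustration, a small sketch (made-up values) of the behaviour being described: on an object column holding mixed types, the .str accessor quietly skips the non-string entries, whereas after an explicit astype(str) every element really is a string.

import pandas as pd

s = pd.Series([1.5, 'ABC', None])   # dtype is object, but not every element is a string
s.str.lower()                       # -> NaN, 'abc', NaN  (non-strings become NaN)
s.astype(str).str.lower()           # -> '1.5', 'abc', 'none'  (everything is now a string)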
I am having the same problem as described here Cell value coming as series in pandas
While a solution is provided, there is no explanation of why a cell would come back as a Series when I would expect it to be a string (which I understand is dtype=object).
My dataframe has columns as below
Serial Number object
Device ID int64
Device Name object
I am extracting a single device:
device = s_devices[s_devices['Device ID'] == 17177529]
print(device['Device ID'])
prints fine as I would expect
17177529
print(device['Device Name'])
prints like below, like a Series:
49 10.112.165.182
Name: Device Name, dtype: object
What can be done? I can see that I could use ".values" to get only the IP (10.112.165.182), but I am wondering what is causing the difference between dtype float and dtype object at import or elsewhere. I am reading from Excel.
As far as I understand, your code should always output a Series. So the problem is probably in the code you are not describing. Also, the question you are referring to uses ix (which doesn't exist in the latest version of pandas), which indicates pandas version may also be an issue.
By the way, I don't think values is a good choice for your case, because it is used when you want an array, not an element. (Also values is not recommended anymore)
If you just want to extract an element, try:
# If there are multiple elements, the first one will be extracted.
print(device['Device Name'].iloc[0])
or
# If there are multiple elements, ValueError will be raised.
print(device['Device Name'].item())
I'm a Pandas newbie, so please bear with me.
Overview: I started with a free-form text file created by a data harvesting script that remotely accessed dozens of different kinds of devices, and multiple instances of each. I used OpenRefine (a truly wonderful tool) to munge that into a CSV that was then input to dataframe df using Pandas in a JupyterLab notebook.
My first inspection of the data showed the 'Timestamp' column was not monotonic. I accessed individual data sources as follows, in this case for the 'T-meter' data source. (The technique was taken from a search result - I don't really understand it, but it worked.)
cond = df['Source']=='T-meter'
rows = df.loc[cond, :]
df_tmeter = pd.DataFrame(columns=df.columns)
df_tmeter = df_tmeter.append(rows, ignore_index=True)
then checked each as follows:
df_tmeter['Timestamp'].is_monotonic
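As an aside, a shorter equivalent of that selection technique (a sketch, not the original code) is a plain boolean-mask selection:

# select the 'T-meter' rows directly and renumber the index
df_tmeter = df.loc[df['Source'] == 'T-meter'].reset_index(drop=True)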
Fortunately, the problem was easy to identify and fix: Some sensors were resetting, then sending bad (but still monotonic) timestamps until their clocks were updated. I wrote the function healing() to cleanly patch such errors, and it worked a treat:
df_tmeter['healed'] = df_tmeter['Timestamp'].apply(healing)
Now for my questions:
How do I get the 'healed' values back into the original df['Timestamp'] column for only the 'T-meter' items in df['Source']?
Given the function healing(), is there a clean way to do this directly on df?
Thanks!
Edit: I first thought I should be using 'views' into df, but other operations on the data would either generate errors, or silently turn the views into copies.
I wrote a wrapper function heal_row() for healing():
def heal_row(row):
    if row['Source'] == 'T-meter':   # Redundant check, but safe!
        row['Timestamp'] = healing(row['Timestamp'])
    return row
then did the following:
df = df.apply(lambda row: row if row['Source'] != 'T-meter' else heal_row(row), axis=1)
This ordering is important, since healing() is stateful based on the prior row(s), and thus can't be the default operation.
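A sketch of a more direct way to write the healed values back into df (assuming, as above, that healing() only needs to see the 'T-meter' rows in their original order) is to assign through a boolean mask:

cond = df['Source'] == 'T-meter'
# heal just the masked rows and write the result straight back into df
df.loc[cond, 'Timestamp'] = df.loc[cond, 'Timestamp'].apply(healing)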
I am looking to filter a dataset based off of whether a certain ID does not appear in a different dataframe.
While I'm not super attached to this approach if there's a better way I'm not familiar with, my plan is to apply a Boolean function to my dataset, put the results in a new column, and then filter the entire dataset on that True/False result.
My main dataframe is df, and my other dataframe with the IDs in it is called ID:
def groups():
    if df['owner_id'] not in ID['owner_id']:
        return True
    return False
This ends up being accepted (no syntax problems), so I then go to apply it to my dataframe, which fails:
df['ID Groups?'] = df.apply(lambda row: groups(), axis=1)
Result:
TypeError: ("'Series' objects are mutable, thus they cannot be hashed", 'occurred at index 0')
It seems that somewhere the data I'm trying to use (the IDs are both letters and numbers, so strings) is incorrectly formatted.
I have two questions:
Is my proposed method the best way of going about this?
How can I fix the error that I'm seeing?
My apologies if it's something super obvious; I have very limited exposure to Python and coding as a whole, but I wasn't able to find anywhere that this type of question had already been addressed.
Expression to keep only those rows in df whose owner_id appears in ID:
df = df[df['owner_id'].isin(ID['owner_id'])]
A lambda expression is going to be way slower than this.
isin is the Pandas way. not in is the Python collections way.
The reason you are getting this error is that df['owner_id'] not in ID['owner_id'] hashes the left-hand side to figure out whether it is present in the right-hand side. df['owner_id'] is of type Series and is not hashable, as reported. Luckily, hashing is not needed here.
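Since the original goal was to keep rows whose owner_id does not appear in ID, the same isin mask can simply be inverted; a sketch:

# keep only the rows whose owner_id is absent from ID
df = df[~df['owner_id'].isin(ID['owner_id'])]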
I am applying a string manipulation function to a Pandas data frame column whose length is north of a million rows. Because of some bad data partway through, it fails with:
AttributeError: 'float' object has no attribute 'lower'
Is there a way I can save the progress made so far on the column?
Let's say the manipulation function is:
def clean_strings(strg):
    strg = strg.lower()  # lower-case the string
    return strg
And is applied to the data frame as
df_sample['clean_content'] = df_sample['content'].apply(clean_strings)
Where 'content' is the column with strings and 'clean_content' is the new column added.
Please suggest other approaches. TIA
First, use map, as your input is only one column and map is faster than apply:
df_sample['clean_content'] = df_sample['content'].map(clean_strings)
Secondly, just cast your column to string type so your function can run:
df_sample['content'] = df_sample['content'].astype(str)

def clean_strings(strg):
    strg = strg.lower()  # lower-case the (now guaranteed) string
    return strg
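Putting the two suggestions together, a sketch of the whole step would be:

# cast to string first so every element has .lower(), then map the cleaner over the column
df_sample['clean_content'] = df_sample['content'].astype(str).map(clean_strings)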
Is there a way I can save the progress made so far on the column?
Unfortunately not; these function calls are meant to act atomically on the dataframe, meaning either the entire operation succeeds or it fails. I'm assuming str.lower is just a representative example and you're actually doing much more in your function. That means this is a job for exception handling.
def clean_string(row):
    try:
        return row.lower()
    except AttributeError:
        return row
If a particular record fails, you can handle the raised exception inside the function itself, controlling what is returned in that case.
You'd call the function appropriately -
df_sample['clean_content'] = df_sample['content'].apply(clean_string)
Note that content is a column of objects, and objects generally offer very poor performance in terms of vectorised operations. I'd recommend performing a cast to string -
df_sample['content'] = df_sample['content'].astype(str)
After this, consider using pandas' vectorised .str accessor functions in place of clean_string.
For reference, if all you want to do is lowercase your string column, use str.lower -
df_sample['content'] = df_sample['content'].astype(str).str.lower()
Note that, for an object column, you can still use the .str accessor. However, non-string elements will be coerced to NaN -
df_sample['content'] = df_sample['content'].str.lower() # assuming `content` is of `object` type