PySpark: Remove UTF null character from a PySpark DataFrame

I have a pyspark dataframe similar to the following:
from pyspark.sql import Row

df = sql_context.createDataFrame([
    Row(a=3, b=[4, 5, 6], c=[10, 11, 12], d='bar', e='utf friendly'),
    Row(a=2, b=[1, 2, 3], c=[7, 8, 9], d='foo', e=u'ab\u0000the')
])
One of the values in column e contains the UTF null character \u0000. If I try to load this df into a PostgreSQL database, I get the following error:
ERROR: invalid byte sequence for encoding "UTF8": 0x00
which makes sense. How can I efficiently remove the null character from the pyspark dataframe before loading the data into postgres?
I have tried using some of the pyspark.sql.functions to clean the data first, without success. encode, decode, and regexp_replace did not work:
df.select(regexp_replace(col('e'), u'\u0000', ''))
df.select(encode(col('e'), 'UTF-8'))
df.select(decode(col('e'), 'UTF-8'))
Ideally, I would like to clean the entire dataframe without specifying exactly which columns or what the violating character is, since I don't necessarily know this information ahead of time.
I am using a PostgreSQL 9.4.9 database with UTF8 encoding.

Ah wait - I think I have it. If I do something like this, it seems to work:
null = u'\u0000'
new_df = df.withColumn('e', regexp_replace(df['e'], null, ''))
And then mapping to all string columns:
string_columns = ['d', 'e']
new_df = df.select(
    *(regexp_replace(col(c), null, '').alias(c) if c in string_columns else c
      for c in df.columns)
)
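To avoid hardcoding the column names, the string columns can be derived from the DataFrame's schema. A minimal sketch of that generalization (assuming the same df as above):
from pyspark.sql.functions import col, regexp_replace

# Pick out all string-typed columns from the schema
string_columns = [c for c, dtype in df.dtypes if dtype == 'string']

null_char = u'\u0000'
clean_df = df.select(
    *(regexp_replace(col(c), null_char, '').alias(c) if c in string_columns else col(c)
      for c in df.columns)
)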

You can use DataFrame.fillna() to replace null values.
Replace null values, alias for na.fill(). DataFrame.fillna() and DataFrameNaFunctions.fill() are aliases of each other.
Parameters:
value – int, long, float, string, or dict. Value to replace null values with. If the value is a dict, then subset is ignored and value must be a mapping from column name (string) to replacement value. The replacement value must be an int, long, float, or string.
subset – optional list of column names to consider. Columns specified in subset that do not have a matching data type are ignored. For example, if value is a string, and subset contains a non-string column, then the non-string column is simply ignored.
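For example, a short usage sketch of fillna on the DataFrame from the question (keeping in mind that fillna targets null values, not the \u0000 character itself):
# Replace nulls in the string columns with an empty string
new_df = df.fillna('', subset=['d', 'e'])

# Or pass a dict mapping each column to its own replacement value
new_df = df.fillna({'d': 'unknown', 'e': ''})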

Related

Pandas.read_sql_query reads int as float

I am exporting a SQL database to CSV using pandas.read_sql_query + df.to_csv, and have the problem that integer fields are represented as floats in the DataFrame.
My code:
import pymysql
import pandas

conn = pymysql.connect(server, port)
chunk = pandas.read_sql_query('''select * from table''', conn)
df = pandas.DataFrame(chunk)  # here int values are float
df.to_csv()
I am exporting a number of tables this way, and the problem is that int fields are exported as floats (with a dot). Also, I don't know upfront which column has which type (the code is supposed to be generic for all tables).
However, my goal is to export everything as-is, in strings.
The things I've tried (without success):
df.applymap(str)
read_sql_query(coerce_float=False)
df.fillna('')
DataFrame(dtype=object) / DataFrame(dtype=str)
Of course, I can then post-process the data to type-cast integers, but it would be better to do it during the initial import.
UPD: My dataset has NULL values. They should be replaced with empty strings (as the purpose is to typecast all columns to strings).
Pandas infers the datatype from a sample of the data. If an integer column has null values, pandas assigns it the float datatype, because NaN is of float type and because pandas is based on numpy, which has no nullable integer type. So try making sure that you don't have any nulls, or that nulls are replaced with 0s if that makes sense in your dataset, 0 being an integer.
Also, another way to go about this would be to specify the dtypes on data import, but you have to use a special kind of integer, the nullable Int64 extension dtype, as per the docs:
"If you need to represent integers with possibly missing values, use one of the nullable-integer extension dtypes provided by pandas"
So I found a solution by post-processing the data (force converting to strings) using applymap:
def convert_to_int(x):
    # Cast float-typed whole numbers back to int, then to string
    try:
        return str(int(x))
    except (ValueError, TypeError):
        return x

df = df.applymap(convert_to_int)
This step takes significant processing time, but it solves my problem.

AttributeError in Pandas using strip() when column is empty

I have a column in a Pandas dataframe that sometimes contains blank rows. I want to use str.strip() to tidy up the rows that contain strings, but this gives me the following error when a row is empty:
AttributeError: Can only use .str accessor with string values!
This is the code:
ts_df['Message'] = ts_df['Message'].str.strip()
How do I ignore the blank rows?
str.strip() should be able to handle NaN values if your column contains only strings and NaN. So most probably your column is mixed with other non-string types (e.g. values that are actually of type int or float, not strings like '1' or '2.5').
If you want to clean up the column and keep only string-type values, you can cast it to string with .astype(str). However, NaN will also be cast to the string 'nan' when the column is cast. Hence, you have to replace NaN with an empty string first, using .fillna(''), before casting to string type, as follows:
ts_df['Message'] = ts_df['Message'].fillna('').astype(str).str.strip()
Maybe your column contains null values, which results in the dtype being float64 instead of str. Try converting the column to string first using astype(str):
ts_df['Message'] = ts_df['Message'].astype(str).str.strip()

Changing dataframe column dtypes in Pandas

I am using df.columns to fetch the header of the dataframe and store it in a list. A is the list of the header values of the dataframe.
A=list(df.columns)
But each element of the list is a string, even though some of my header values are ints. Below is an example of the header:
A=['ABC','1345','Acv-1234']
But I want '1345' to appear in the list as an int, not as a string, like this:
A=['ABC',1345,'Acv-1234']
Can anyone suggest an approach for this?
A simple way to do it is to iterate through the columns and check whether the column name (a string) contains only digits (str.isdecimal()); if so, convert it to int, otherwise keep it as a string.
In one line:
A = [int(x) if x.isdecimal() else x for x in df.columns ]
I suspect that '1345' is already a string in your df.columns before you assign them to list A. You should look at the source of your df, and how the columns are assigned, in order to control the column types.
However, you can always change df.columns as you want at any time with:
df.columns = ['ABC', 1345, 'Acv-1234']

Python: Handling different date formats in a column, including nulls

I have a CSV with a date column. After fetching it using pandas, I am trying to insert it into a MySQL db.
My requirement is that it should look for null values and make them None, and then, irrespective of the input date format, convert the dates to the format 2019-07-02. Below is my code:
df = pd.read_csv (r'C:\Users\adminuser\Desktop\CSAExcel\test.csv', parse_dates = True , usecols = ['number','active','short_description','incident_state','caller_id','assignment_group','assigned_to','sys_created_on','sys_created_by','opened_at','opened_by','closed_at','closed_by','resolved_at','u_reported_by','u_reported_by.country','u_type'],encoding='cp1252')
df2 = df1.replace(np.nan, '', regex=True)
df2['created_on']= df2['created_on'].apply(lambda t: None if pd.isnull(t) else datetime.datetime.fromtimestamp(t).strftime('%Y-%m-%d'))
I am getting the error: an integer is required (got type str)
I see a few problems here. My guess is you spent some time googling solutions and have different approaches woven into one. I'll attempt to give you an easy-to-understand approach, but for common problems like parsing dates, various methods exist. In your snippet, there are two clear problems.
First, you replace the NaN values with empty strings, but then in the lambda function you condition on a null value, so this is redundant. We can remove the second line.
As an aside, why do you want None instead of np.nan?
Second, the lambda function assumes a POSIX timestamp, similar to what is returned by time.time(). The column 'created_on' does not contain that, but a string.
My approach will do the following. It uses pd.to_datetime(), which does the heavy lifting here. It converts the str value to a datetime object (actually, it takes various datatypes, including strings or Series of strings). If you pass a NaN value to to_datetime(), it returns a NaT value. So then we replace those with None.
Can you try if the following works?
df1 = pd.read_csv(r'C:\Users\adminuser\Desktop\CSAExcel\test.csv', usecols=['number','active','short_description','incident_state','caller_id','assignment_group','assigned_to','sys_created_on','sys_created_by','opened_at','opened_by','closed_at','closed_by','resolved_at','u_reported_by','u_reported_by.country','u_type'], encoding='cp1252')
df1['created_on'] = pd.to_datetime(df1['created_on'], format='%Y-%m-%d')
df1['created_on'] = df1['created_on'].apply(lambda t: None if pd.isnull(t) else t)
This is an explicit way, but pandas has a quicker way. In your original snippet, you also used the argument parse_dates=True in read_csv(). You can print df1.dtypes to see if the column created_on was already successfully converted to datetime objects. If so, you'd only have to change the NaT/NaN values to None.
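If you need None across the whole frame (e.g. so the MySQL driver inserts NULL), a minimal sketch:
# Cast to object first so None is not coerced back to NaN,
# then keep values where notnull() is True and use None elsewhere
df1 = df1.astype(object).where(df1.notnull(), None)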

Converting strings to ints in a DataFrame

How do I convert a DataFrame column containing strings and "-" values to floats?
I have tried pd.to_numeric and pd.Series.astype(int), but I haven't had success.
What do you recommend?
If I understand correctly, you want pandas to convert the string 7.629.352 to the float value 7629352.0 and the string (21.808.956) to the value -21808956.0.
For the first part, it is directly possible with the thousands parameter, and it is even possible to process - as a NaN:
m = read_csv(..., thousands='.', na_values='-')
The real problem is the parentheses around negative values.
You could use a Python function to convert the values (a sketch of that follows the code below). A possible alternative is to post-process the dataframe column-wise:
m = read_csv(..., thousands='.', na_values='-')
for col in m.columns:
    if m[col].dtype == np.dtype('object'):
        # Strip thousands separators, turn (x) into -x, then cast to float
        m[col] = (m[col].str.replace(r'\.', '', regex=True)
                        .str.replace(r'\((.*)\)', r'-\1', regex=True)
                        .astype('float64'))
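The per-value converter approach mentioned above could look like this (a minimal sketch; the file name 'data.csv' and column name 'amount' are hypothetical):
import pandas as pd

def parse_amount(s):
    # '-' means missing; '(x)' means negative; '.' is the thousands separator
    s = s.strip()
    if s == '-':
        return float('nan')
    negative = s.startswith('(') and s.endswith(')')
    s = s.strip('()').replace('.', '')
    return -float(s) if negative else float(s)

m = pd.read_csv('data.csv', converters={'amount': parse_amount})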
