How to remove commas in a column within a PySpark DataFrame - python

Hi all, thanks for taking the time to help me on this.
Right now I have uploaded a csv into spark and the type of the dataframe is pyspark.sql.dataframe.DataFrame
I have a column of numbers (that are strings in this case, though). They are numbers like 6,000 and I just want to remove all the commas from them. I have tried df.select("col").replace(',' , '') and df.withColumn('col', regexp_replace('col', ',', '')) but I keep getting the error "DataFrame object does not support item assignment".
Any ideas? I'm fairly new to Spark

You should really be casting it:
from pyspark.sql.types import IntegerType
df = df.withColumn("col", df["col"].cast(IntegerType()))
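Note that casting alone will turn values like "6,000" into null, so here is a sketch that combines your regexp_replace attempt with the cast (assuming the column is named "col" as above):
from pyspark.sql.functions import regexp_replace
from pyspark.sql.types import IntegerType

# Strip the commas first, then cast the cleaned string to an integer
df = df.withColumn("col", regexp_replace("col", ",", "").cast(IntegerType()))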

Related

How to replace a string with a different string in the column of a data frame

I have a dataframe Adult with a column workclass containing thousands of rows. The column contains different string values, and I would like to replace every "?" string with "Private". I have tried different variations of the code:
Adult.loc[:,'workclass'] = Adult.loc[:,'workclass'].replace(to_replace="?", value=str("Private"))
After running the code I do not get an error, but when I run Adult.workclass.unique() the "?" is still in the data frame. How would I go about replacing the string correctly?
Thanks in advance
Try the following code:
# Pass regex=False so the "?" is treated literally; it is a regex quantifier otherwise
Adult['workclass'] = Adult['workclass'].str.replace('?', 'Private', regex=False)
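To confirm the replacement took, re-check the unique values afterwards:
Adult.workclass.unique()  # '?' should no longer appear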

Passing PySpark pandas_udf data limit?

The problem is simple. Please observe the code below.
@pyf.pandas_udf(pyt.StructType(RESULTS_SCHEMA_LIST), pyf.PandasUDFType.GROUPED_MAP)
def train_udf(df):
    return train_ml_model(df=df)
results_df = complete_df.groupby('training-zone').apply(train_udf)
One of the columns of the results_df is typically a very large string (>4e6 characters). This isn't a problem for a pandas.DataFrame, nor for a spark.DataFrame when I convert the pandas dataframe to a spark dataframe myself; it only becomes a problem when the pandas_udf() attempts the conversion. The error returned is: pyarrow.lib.ArrowInvalid: could not convert **string** with type pyarrow.lib.StringValue: did not recognize the Python value type when inferring an Arrow data type
This UDF does work if I don't return the problematic column or I make the problematic column only contain some small string such as 'wow that is cool', so I know the problem is not with the udf itself per se.
I know the function train_ml_model() works because when I get a random group from the spark dataframe then convert it to a pandas dataframe and pass it to train_ml_model() it produces the expected pandas dataframe with the column with a massive string.
I know spark can handle such large strings because when I convert the pandas dataframe to a spark dataframe using spark.createDataFrame() the spark dataframe contains the full expected value.
PS: Why is pyarrow even trying to infer the datatype when I pass the types to the pandas_udf()?
Any help would be very much appreciated!
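One workaround worth trying, purely as a sketch (the column name result_col is a hypothetical stand-in for the large-string column): pin the offending column to a plain-string dtype inside the UDF, so Arrow serialization never has to infer a type from raw Python objects.
@pyf.pandas_udf(pyt.StructType(RESULTS_SCHEMA_LIST), pyf.PandasUDFType.GROUPED_MAP)
def train_udf(df):
    result = train_ml_model(df=df)
    # 'result_col' is hypothetical; astype(str) fixes the dtype so pyarrow
    # does not fall back to per-value type inference on object columns
    result['result_col'] = result['result_col'].astype(str)
    return result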

String conversion to dataframe

In this screenshot, data (a string) and df2 (a pandas dataframe) store the same data - a timestamp and a value.
How do I get data into a similar dataframe so I can append its values to df2, ending up with all the data records and all the df2 records in one dataframe that matches the current format of df2?
I can post what I've tried so far, but all I get is errors :(
import ast
import pandas as pd

data = "[[1212.1221, -10.5],[2232.55, -19.44],[32432.87655, -445.88]]"
df = pd.DataFrame(ast.literal_eval(data), columns=['index', 'data'])
Looks like your string data is correctly formatted JSON (JSON looks a lot like Python literals but is strict about double quotes over single quotes). Try:
import json
parsed = json.loads(data)  # here a list of [timestamp, value] lists; renamed to avoid shadowing the builtin dict
This will convert your string into plain Python objects from which you can easily create and manipulate DataFrames.
EDIT:
If any of your strings have single quotes, you can remedy this using str.replace("'", "\"") to convert them to double quotes. This will only cause problems if for whatever reason your data has incorrectly paired quotes.
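Putting it together, a minimal sketch that parses the string and appends the rows to df2 (the column names 'index' and 'data' are taken from the attempt above; adjust them to df2's real layout):
import json
import pandas as pd

data = "[[1212.1221, -10.5],[2232.55, -19.44],[32432.87655, -445.88]]"
df2 = pd.DataFrame(columns=['index', 'data'])  # stand-in for the existing frame

# Parse the JSON string and append its rows to df2
new_rows = pd.DataFrame(json.loads(data), columns=['index', 'data'])
df2 = pd.concat([df2, new_rows], ignore_index=True)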

Filtering chunks by a string

I have a csv file with 60M plus rows. I am only interested in a subset of these and would like to put them in a dataframe.
Here is the code I am using:
iter_csv = pd.read_csv('/Users/xxxx/Documents/globqa-pgurlbymrkt-Report.csv', iterator=True, chunksize=1000)
df = pd.concat([chunk[chunk['Site Market (evar13)'].str.contains("Canada", na=False)] for chunk in iter_csv])
based on the answer here: pandas: filter lines on load in read_csv
I get the following error:
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
Can't seem to figure out what's wrong and would appreciate any guidance here.
Try verifying that the data is actually represented as strings first.
What does the last chunk return that you are expecting to use .contains() on?
It seems the data may be missing there, and a chunk of missing values would not have a string dtype.
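A sketch of one way around that, assuming the failure comes from chunks where the column holds no string values (e.g. all missing): force the column to str before filtering.
import pandas as pd

iter_csv = pd.read_csv('/Users/xxxx/Documents/globqa-pgurlbymrkt-Report.csv', iterator=True, chunksize=1000)
# astype(str) guarantees the .str accessor works even for all-NaN chunks
df = pd.concat([chunk[chunk['Site Market (evar13)'].astype(str).str.contains("Canada", na=False)] for chunk in iter_csv])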

Error occurred when using df.fillna(0)

Very simple code using spark + python:
df = spark.read.option("header","true").csv(file_name)
df = df.fillna(0)
but an error occurred:
pyspark.sql.utils.AnalysisException: u'Cannot resolve column name "cp_com.game.shns.uc" among (ProductVersion, IMEI, FROMTIME, TOTIME, STATISTICTIME, TimeStamp, label, MD5, cp_com.game.shns.uc, cp_com.yunchang....
What's wrong with it? cp_com.game.shns.uc is in the list.
Spark does not handle the dot character in column names well (the dot is interpreted as struct field access, and there is a known Spark issue about this), so you need to replace the dots with underscores before working on the csv.
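A minimal sketch of that rename, assuming the goal is simply to swap every dot for an underscore before calling fillna:
df = spark.read.option("header", "true").csv(file_name)

# Rebuild the dataframe with dot-free column names so fillna can resolve them
df = df.toDF(*[c.replace('.', '_') for c in df.columns])
df = df.fillna(0)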
