Unable to format double values in pyspark - python

I have CSV data like the following:
time_value,annual_salary
5/01/19 01:02:16,120.56
06/01/19 2:02:17,12800
7/01/19 03:02:18,123.00
08/01/19 4:02:19,123isdhad
I want to keep only numeric values, including decimals. In other words, I want to drop the last record because its annual_salary is alphanumeric, and I was able to do that. But when I tried to convert the column to proper decimal values, I got incorrect results. Below is my code:
df = df.withColumn("annual_salary", regexp_replace(col("annual_salary"), "\.", ""))
df = df.filter(~col("annual_salary").rlike("[^0-9]"))
df.show(truncate=False)
df.withColumn("annual_salary", col("annual_salary").cast("double")).show(truncate=False)
But the records it gives me are incorrect.
Expected output:
annual_salary
120.56
12800.00
123.00
What could be wrong here? Do I need to implement a UDF for this type of conversion?

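Try casting to DecimalType. Note that the regexp_replace call in your code strips the decimal point, so 120.56 becomes 12056. Leave the dot in place and filter out the alphanumeric rows instead:
from pyspark.sql.types import DecimalType
df.where(~col('annual_salary').rlike('[A-Za-z]')).withColumn('annual_salary', col('annual_salary').cast(DecimalType(38,2))).show()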
+----------------+-------------+
|      time_value|annual_salary|
+----------------+-------------+
|5/01/19 01:02:16|       120.56|
|06/01/19 2:02:17|     12800.00|
|7/01/19 03:02:18|       123.00|
+----------------+-------------+
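The logic behind this answer (keep the decimal point, drop rows that are not purely numeric, fix the scale at two digits) can be sketched in plain Python without a Spark session. This is only an illustration of the idea, not PySpark code: the rows list and the numeric pattern are assumptions made for the example.

```python
import re
from decimal import Decimal

# Sample salary strings taken from the question's CSV.
rows = ["120.56", "12800", "123.00", "123isdhad"]

# Digits with an optional fractional part; anything else is rejected.
numeric = re.compile(r"\d+(\.\d+)?")

# Keep numeric rows and fix the scale at two decimal places,
# similar in spirit to casting to DecimalType(38, 2).
salaries = [
    Decimal(value).quantize(Decimal("0.01"))
    for value in rows
    if numeric.fullmatch(value)
]

print([str(s) for s in salaries])  # ['120.56', '12800.00', '123.00']
```

Removing the dot before the conversion, as the question's code does, would turn 120.56 into 12056, which is why the point has to survive the cleaning step.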

Related

Problem with converting a string that contains numbers to a numbers data type

Currently, I'm working with a column called 'amount' that contains transaction amounts. The column has a string data type and I would like to convert it to a numeric data type.
The code I wrote for the conversion runs, but because it removes the ',' before converting, the decimal digits are merged into the whole number, which produces extremely high values in my data. For example, 100000,95 became 10000095. I used the following code:
df["amount"] = df["amount"].str.replace(',', '')
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
Can someone help me with this problem?
EDIT: Not all values contain decimals. I'm looking for a solution for only the values that contain a ','.
Replace the comma with a dot instead of removing it, then convert with pd.to_numeric as before:
df["amount"] = df["amount"].str.replace(',', '.')
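A minimal end-to-end sketch of this fix, assuming the comma is always a decimal separator and never a thousands separator. The sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: some values have a decimal comma, some do not.
df = pd.DataFrame({"amount": ["100000,95", "2500", "12,5"]})

# Turn the decimal comma into a dot (literal replacement, no regex),
# then convert; values that still fail to parse become NaN.
df["amount"] = pd.to_numeric(
    df["amount"].str.replace(",", ".", regex=False),
    errors="coerce",
)

print(df["amount"].tolist())
```

Values without a comma pass through the replacement unchanged, so no special handling is needed for them.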

Converting float to integer based on the result of join in python with pandas

I am doing a left and right join to get my data. Now I want to concatenate the fixed text "Hello/ProductID=" with the result of my join, which must be an integer. I don't know why the value I get is a float.
Since this is my URL, I need the value as an integer:
df = df.join(df.set_index(['ID','Type'])['ProductID'].rename('PID'), on=['ID','UniqueCode'])
df["URL"]= "Hello/ProductID=" + df['PID'].apply(str)
The real result is as below:
Hello/ProductID=1221.0
My expected result should be as below:
Hello/ProductID=1221
I tried replicating the condition, and it happens because PID is of float type. You may have to convert it to an integer type to obtain the desired result.
I used the replace method to remove the trailing .0:
df['URL'] = df['URL'].str.replace(r'\.0$', '', regex=True)
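A join that leaves some rows unmatched fills them with NaN, and NaN forces an integer column to float. One way to avoid the string clean-up is to convert PID to pandas' nullable integer type before building the URL. The sketch below uses made-up data and assumes unmatched rows should get no URL at all.

```python
import pandas as pd

# Hypothetical join result: the missing match turned the column into float.
df = pd.DataFrame({"PID": [1221.0, None, 87.0]})

# "Int64" (capital I) is the nullable integer dtype; it keeps missing values.
pid = df["PID"].astype("Int64")

# Build the URL only where a product ID exists.
df["URL"] = ("Hello/ProductID=" + pid.astype(str)).where(pid.notna())

print(df["URL"].tolist())
```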

Converting commas to dots from excel file in pandas

I have this kind of column in excel file:
Numbers
13.264.999,99
1.028,10
756,4
1.100,1
So when I load it with pd.read_excel, some numbers like 756,4 get converted to 756.4 and become floats, while the other three from the example above remain unchanged and are strings.
Now I want to have the column in this form (type float):
Numbers
13264999.99
1028.10
756.4
1100.1
However when converting the loaded column from excel using this code:
df["Numbers"]=df["Numbers"].str.replace('.','')
df["Numbers"]=df["Numbers"].str.replace(',','.')
df["Numbers"]=df["Numbers"].astype(float)
I get:
Numbers
13264999.99
1028.10
nan
1100.1
What to do?
Okay, so I managed to solve this issue.
First I convert every value to a string and replace every comma with a dot.
Then I remove every dot except the last one so that the numbers can be converted to float easily:
df["Numbers"]=df["Numbers"].astype(str).str.replace(",", ".", regex=False)
df["Numbers"]=df["Numbers"].str.replace(r'\.(?=.*?\.)', '', regex=True)
df["Numbers"]=df["Numbers"].astype(float)
As shown in the comment by Anton vBR, passing the parameter thousands='.' when reading the file gets the data read in correctly.
You can also try reading the Excel file with string as the default type:
df=pd.read_excel('file.xlsx',dtype=str)
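If the column has already been loaded with a mix of strings and floats, a small per-value parser is another option. The sketch below is an assumption-laden illustration: parse_european is a hypothetical helper, and it assumes every string uses '.' for thousands and ',' for decimals.

```python
import pandas as pd

def parse_european(value):
    """Convert '1.028,10'-style strings to float; pass real numbers through."""
    if isinstance(value, (int, float)):
        return float(value)
    # Drop thousands dots first, then turn the decimal comma into a dot.
    return float(str(value).replace(".", "").replace(",", "."))

# Mixed column as described in the question: 756.4 was already read as a float.
numbers = pd.Series(["13.264.999,99", "1.028,10", 756.4, "1.100,1"], dtype=object)
parsed = numbers.map(parse_european)

print(parsed.tolist())  # [13264999.99, 1028.1, 756.4, 1100.1]
```

The isinstance check is what keeps 756.4 from becoming NaN, which is what the .str accessor produces for non-string values.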

How do I change the number format of an integer in a dataframe?

I have imported data from a CSV file into a dataframe. One of the columns is a reference number that should have six digits, but some of the reference numbers have only 3, 4 or 5 digits. Is there a function similar to this one in Excel: =TEXT(A1,"000000")?
I've tried searching the internet for documentation on how to use the format and display functions in pandas, but I couldn't find the answer I was looking for. An example of the issue is shown below:
Actual: 10158
Desired: 010158
Actual: 101
Desired: 010100
Try reading it directly in the right format with this:
df = pd.read_csv("yourfile.csv", dtype = {"ColumnName" : "float64"})
It is probably already a string.
You can always check the column types of a dataframe with df.dtypes.
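For padding with leading zeros, one option not mentioned above is Series.str.zfill, which behaves like Excel's TEXT(A1,"000000"). The sketch below uses a hypothetical column name ref and sample values. Note that it left-pads, so 101 becomes 000101.

```python
import pandas as pd

# Hypothetical reference numbers of varying length, stored as integers.
df = pd.DataFrame({"ref": [10158, 101, 123456]})

# Zero-padding only makes sense on strings, so convert first, then pad to 6.
df["ref"] = df["ref"].astype(str).str.zfill(6)

print(df["ref"].tolist())  # ['010158', '000101', '123456']
```

The padded column has to stay a string; converting it back to a number would drop the leading zeros again.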

Error message when multiplying two columns together

I'm trying to create a new column in my data set using pandas that is the product of two columns. One column is a value in dollars called Price and the other is a number called Installs. Running the multiplication by itself gives me the error 'can't multiply sequence by non-int of type 'str''.
I tried running the following code to convert the strings into numbers:
pd.to_numeric(appdata['Installs'], errors ='ignore')
pd.to_numeric(appdata['Price'], errors= 'ignore')
appdata[Income]= appdata['Installs'] * appdata[('Price')]
But this gives me the same error.
What other way could I convert my data into integer format?
Thanks in advance.
pd.to_numeric() does not edit the column in place. You should do:
appdata['Installs'] = pd.to_numeric(appdata['Installs'], errors ='ignore')
appdata['Price'] = pd.to_numeric(appdata['Price'], errors= 'ignore')
appdata['Income']= appdata['Installs'] * appdata['Price']
# remove , and + from 'Installs' to make the cells look integer to Pandas
appdata['Installs'] = pd.to_numeric(appdata['Installs'].str.replace(r'[\,\+]+', '', regex=True))
# remove , from 'Price' to make the cells look integer to Pandas
appdata['Price'] = pd.to_numeric(appdata['Price'].str.replace(r'[\,]+', '', regex=True))
# calculate "Income"
appdata['Income'] = appdata['Installs'] * appdata['Price']
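A small worked example of the clean-then-convert approach, using made-up rows. The '$' prefix on Price is an assumption about the data, not something stated in the question.

```python
import pandas as pd

# Hypothetical rows: Installs has ',' and '+', Price may carry a '$'.
appdata = pd.DataFrame({
    "Installs": ["10,000+", "500+"],
    "Price": ["$0.50", "2"],
})

# Strip the decorations, then assign the converted Series back to the column.
appdata["Installs"] = pd.to_numeric(
    appdata["Installs"].str.replace(r"[,+]", "", regex=True)
)
appdata["Price"] = pd.to_numeric(
    appdata["Price"].str.replace("$", "", regex=False)
)

appdata["Income"] = appdata["Installs"] * appdata["Price"]
print(appdata["Income"].tolist())  # [5000.0, 1000.0]
```

Leaving errors at its default makes pd.to_numeric raise on anything that still is not numeric, which surfaces dirty cells instead of silently keeping them as strings.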
