cleaning numeric columns in pandas - python

I'm having trouble working with a scraped CSV file in pandas.
I have several columns; one of them contains prices such as '1 800 €'.
After importing the CSV as a DataFrame, I cannot convert this column to integer.
I removed the euro symbol without a problem:
data['prix'] = data['prix'].str.strip('€')
I tried to remove the space with the same approach, but the space remained:
data['prix'] = data['prix'].str.strip()
or
data['prix'] = data['prix'].str.strip(' ')
or
data['prix'] = data['prix'].str.replace(' ', '')
I tried to force the conversion to int:
data['prix'] = pd.to_numeric(data['prix'], errors='coerce')
My column was filled with NaN values.
I tried converting the dtypes before replacing the space:
data = data.convert_dtypes(convert_string=True)
But same result: the spaces are still there and I cannot convert to integer.
I looked at the dataset in Excel and could not identify any special problem in the data.
I also tried changing the encoding in read_csv ... same outcome.
In this same dataset I had the same problem with the mileage, stored as '15 256 km',
and there I had no trouble cleaning it up and converting to int ...
I would like to try a regex that copies only the digits of the field and creates a new column from them.
How should I proceed? I am also open to other ideas.
Thank you

Use str.findall:
data['prix2'] = data['prix'].str.findall(r'\d+').str.join('').astype(int)
# Or, if that raises an exception:
data['prix2'] = pd.to_numeric(data['prix'].str.findall(r'\d+').str.join(''), errors='coerce')
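The stubborn space in '1 800 €' is very likely a non-breaking space (U+00A0, common in scraped HTML), which str.strip(' ') will not remove; the digits-only approach above sidesteps that entirely. A minimal sketch with made-up data illustrating that assumption:

```python
import pandas as pd

# Hypothetical scraped data: the thousands separator is a non-breaking space
data = pd.DataFrame({'prix': ['1\u00a0800 €', '2\u00a0500 €']})

# Keep only the digits, glue them back together, then convert to int
data['prix2'] = data['prix'].str.findall(r'\d+').str.join('').astype(int)
print(data['prix2'].tolist())  # [1800, 2500]
```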

To delete the whitespace use this line:
data['prix'] = data['prix'].str.replace(' ', '')
and to convert the strings to int use this line:
data['prix'] = data['prix'].astype(int)
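If a literal replace still leaves the space behind, it is probably a non-breaking space rather than an ordinary one; a regex replace on \s catches both. A short sketch under that assumption:

```python
import pandas as pd

# '\u00a0' is a non-breaking space, the usual culprit in scraped prices
data = pd.DataFrame({'prix': ['1\u00a0800', '2 500']})

# \s matches regular and non-breaking spaces alike
data['prix'] = data['prix'].str.replace(r'\s', '', regex=True).astype(int)
print(data['prix'].tolist())  # [1800, 2500]
```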

Related

Struggling to clean the column in pandas

I need help with cleaning a column of my data. Basically, the cells of this column contain dates, times, letters, floating-point numbers and many other kinds of data. The datatype of the column is 'object'.
What I want to do is remove all the dates, replace them with empty cells, and keep only the times in the entire column. Then I want to insert the average time into the empty cells.
I'm using PyCharm and pandas to clean the column.
I would imagine you can achieve this with something along the lines of the code below. For the time format, it seems that for your data column just checking whether a token contains two colons is enough; you can also specify something more robust:
def string_splitter(x):
    y = []
    for stuff in x.split():
        if stuff.count(":") == 2:  # <you can also replace this with a more robust pattern for time>
            y.append(stuff)
        else:
            y.append("")  # <add your string for indicating empty space>
    return " ".join(y)

df['column_name'] = df['column_name'].apply(string_splitter)
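Alternatively, if the goal is to keep only the times and fill the gaps with the average time, a vectorised sketch with str.extract and to_timedelta might look like this (the sample data is invented for illustration):

```python
import pandas as pd

s = pd.Series(['2021-03-01 12:30:45 ok', '09:15:00', 'no time here'])

# Pull out the first hh:mm:ss match per cell (NaT where there is none)
times = pd.to_timedelta(s.str.extract(r'(\d{1,2}:\d{2}:\d{2})')[0])

# Fill the missing cells with the average of the extracted times
filled = times.fillna(times.mean())
print(filled.tolist())
```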

Delete categorical data from a column and leave only the numerical data? I want to delete the word "SUBM-" from "SUBM-1245" in Python

I have a column in Python whose data type is object, but I want to change it to integer.
The records on that column show :
SUBM - 4562
SUBM - 4563
and all the information in that column is like that. I want to delete the word "SUBM - " from the records and apply something like Excel's "replace with"; I will add 0 to the cells left empty so that only numerical data remains. Can anyone suggest a way to do that?
If you are working with a column in Python, I assume you are using pandas to parse your table. In this case, you can simply use
df["mycolumn"] = df["mycolumn"].str.replace("SUBM-","")
However, you still have a column of type "object" then. A safe way to convert it to numeric is this, where you basically throw away everything that can't be converted to a number:
df["mycolumn"] = pd.to_numeric(df["mycolumn"], errors="coerce", downcast="integer")
If you specifically need integer values (float not acceptable for you in case of NaN), you can afterwards fill empty cells with 0 and convert the column to integer:
df["mycolumn"] = df["mycolumn"].fillna(0).astype(int)  # if you specifically need integers
An alternative is to extract the numeric values using a regular expression. This automatically returns NaN where the expression does not match (i.e., also when "SUBM-" is not present in the cell):
df["mycolumn"] = df["mycolumn"].str.extract("SUBM-([0-9]*)")
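Note that the records shown contain spaces around the dash ("SUBM - 4562"), so a literal replace of "SUBM-" would miss them; extracting the digits handles both spellings. A sketch on made-up data combining the steps above:

```python
import pandas as pd

df = pd.DataFrame({'mycolumn': ['SUBM - 4562', 'SUBM-4563', 'no id']})

# Grab the digits, coerce unmatched rows to NaN, then fill with 0 as requested
df['mycolumn'] = (
    pd.to_numeric(df['mycolumn'].str.extract(r'(\d+)')[0], errors='coerce')
      .fillna(0)
      .astype(int)
)
print(df['mycolumn'].tolist())  # [4562, 4563, 0]
```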

Problem with converting a string that contains numbers to a numbers data type

Currently, I'm working with a column called 'amount' that contains transaction amounts. This column is of the string datatype and I would like to convert it to a numeric datatype.
The problem I ran into: the code I wrote to convert the strings to numbers worked, except that when it removed the ',', the decimals were merged in, which produces extremely high values in my data. So 100000,95 became 10000095. I used the following code:
df["amount"] = df["amount"].str.replace(',', '')
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
Can someone help me with this problem?
EDIT: Not all values contain decimals. I'm looking for a solution for only the values that contain a ','.
You need to replace the comma with a dot if you want floats:
df["amount"] = df["amount"].str.replace(',', '.')
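Since only some values contain a comma, a plain replace followed by to_numeric leaves the comma-free values untouched. A minimal sketch on invented values:

```python
import pandas as pd

df = pd.DataFrame({'amount': ['100000,95', '2500', '13,5']})

# Swap the decimal comma for a dot, then convert; values without a comma pass through
df['amount'] = pd.to_numeric(df['amount'].str.replace(',', '.', regex=False),
                             errors='coerce')
print(df['amount'].tolist())  # [100000.95, 2500.0, 13.5]
```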

Converting commas to dots from excel file in pandas

I have this kind of column in excel file:
Numbers
13.264.999,99
1.028,10
756,4
1.100,1
So when I load it with pd.read_excel some numbers like 756,4 get converted to 756.4 and become floats while other 3 from the example above remain the same and are strings.
Now I want to have the column in this form (numeric type):
Numbers
13264999.99
1028.10
756.4
1100.1
However when converting the loaded column from excel using this code:
df["Numbers"]=df["Numbers"].str.replace('.','')
df["Numbers"]=df["Numbers"].str.replace(',','.')
df["Numbers"]=df["Numbers"].astype(float)
I get:
Numbers
13264999.99
1028.10
nan
1100.1
What to do?
Okay, so I managed to solve this issue:
First I convert every value to a string and replace every comma with a dot.
Then I drop every dot except the last one, so that the numbers can be converted to float easily (regex=True is needed because recent pandas versions default str.replace to literal matching):
df["Numbers"] = df["Numbers"].astype(str).str.replace(",", ".", regex=False)
df["Numbers"] = df["Numbers"].str.replace(r'\.(?=.*?\.)', '', regex=True)
df["Numbers"] = df["Numbers"].astype(float)
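On a small sample mimicking the mixed column (note that 756,4 has already become the float 756.4 on load, which is why astype(str) comes first), the steps above work out as follows:

```python
import pandas as pd

# Mixed column as read_excel might deliver it: strings plus one float
s = pd.Series(['13.264.999,99', '1.028,10', 756.4, '1.100,1'])

nums = (
    s.astype(str)
     .str.replace(',', '.', regex=False)           # decimal comma -> dot
     .str.replace(r'\.(?=.*?\.)', '', regex=True)  # drop every dot but the last
     .astype(float)
)
print(nums.tolist())  # [13264999.99, 1028.1, 756.4, 1100.1]
```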
As shown in the comment by Anton vBR, using the parameter thousands='.' in read_excel will get the data read in correctly.
You can also try reading the Excel file with string as the default type:
df=pd.read_excel('file.xlsx',dtype=str)

Extract a string from a CSV cell containing special characters in Python

I'm writing a Python program to extract specific values from each cell in a .CSV file column and then make all the extracted values new columns.
Sample column cell (this is actually a small part; the real cell contains much more data):
AudioStreams":[{"JitterInterArrival":10,"JitterInterArrivalMax":24,"PacketLossRate":0.01353227,"PacketLossRateMax":0.09027778,"BurstDensity":null,"BurstDuration":null,"BurstGapDensity":null,"BurstGapDuration":null,"BandwidthEst":25245423,"RoundTrip":520,"RoundTripMax":11099,"PacketUtilization":2843,"RatioConcealedSamplesAvg":0.02746676,"ConcealedRatioMax":0.01598402,"PayloadDescription":"SIREN","AudioSampleRate":16000,"AudioFECUsed":true,"SendListenMOS":null,"OverallAvgNetworkMOS":3.487248,"DegradationAvg":0.2727518,"DegradationMax":0.2727518,"NetworkJitterAvg":253.0633,"NetworkJitterMax":1149.659,"JitterBufferSizeAvg":220,"JitterBufferSizeMax":1211,"PossibleDataMissing":false,"StreamDirection":"FROM-to-
One value I'm trying to extract is the number 10 between "JitterInterArrival": and ,"JitterInterArrivalMax". But since each cell contains relatively long strings with special characters around the value (such as double quotes), opener=re.escape(r"***") and closer=re.escape(r"***") wouldn't work.
Does anyone know a better solution? Thanks a lot!
IIUC, you have a json string and wish to get values from its attributes. So, given
s = '''
{"AudioStreams":[{"JitterInterArrival":10,"JitterInterArrivalMax":24,"PacketLossRate":0.01353227,"PacketLossRateMax":0.09027778,"BurstDensity":null,
"BurstDuration":null,"BurstGapDensity":null,"BurstGapDuration":null,"BandwidthEst":25245423,"RoundTrip":520,"RoundTripMax":11099,"PacketUtilization":2843,"RatioConcealedSamplesAvg":0.02746676,"ConcealedRatioMax":0.01598402,"PayloadDescription":"SIREN","AudioSampleRate":16000,"AudioFECUsed":true,"SendListenMOS":null,"OverallAvgNetworkMOS":3.487248,"DegradationAvg":0.2727518,
"DegradationMax":0.2727518,"NetworkJitterAvg":253.0633,
"NetworkJitterMax":1149.659,"JitterBufferSizeAvg":220,"JitterBufferSizeMax":1211,
"PossibleDataMissing":false}]}
'''
You can do
>>> import json
>>> data = json.loads(s)
>>> data['AudioStreams'][0]['JitterInterArrival']
10
In a data frame scenario, if you have a column col of strings such as the above, e.g.
df = pd.DataFrame({"col": [s]})
You can use transform, passing json.loads as the argument:
df.col.transform(json.loads)
to get a Series of dictionaries. Then, you can manipulate these dicts or just access the data as done above.
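To turn such a column into regular DataFrame columns in one go, pandas' json_normalize can flatten the parsed dicts. A sketch on a shortened stand-in for the sample (keys trimmed for brevity):

```python
import json
import pandas as pd

# Shortened stand-in for the real cell content
s = '{"AudioStreams":[{"JitterInterArrival":10,"JitterInterArrivalMax":24}]}'
df = pd.DataFrame({'col': [s, s]})

# Parse each cell into a dict, then flatten the AudioStreams records into columns
parsed = df['col'].transform(json.loads)
streams = pd.json_normalize(parsed.tolist(), record_path='AudioStreams')
print(streams['JitterInterArrival'].tolist())  # [10, 10]
```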
