I have this kind of column in excel file:
Numbers
13.264.999,99
1.028,10
756,4
1.100,1
When I load it with pd.read_excel, a value like 756,4 gets parsed as the float 756.4, while the other three values from the example above stay strings.
Now I want the column in this form (as type float):
Numbers
13264999.99
1028.10
756.4
1100.1
However, when I convert the loaded column with this code:
df["Numbers"] = df["Numbers"].str.replace('.', '', regex=False)
df["Numbers"] = df["Numbers"].str.replace(',', '.', regex=False)
df["Numbers"] = df["Numbers"].astype(float)
I get:
Numbers
13264999.99
1028.10
nan
1100.1
What to do?
Okay, so I managed to solve this issue.
First I convert every value to a string and replace every comma with a dot.
Then I keep only the last dot, so the numbers can be converted to float easily:
df["Numbers"] = df["Numbers"].astype(str).str.replace(",", ".", regex=False)
# drop every dot that still has another dot after it, i.e. keep only the last
df["Numbers"] = df["Numbers"].str.replace(r'\.(?=.*?\.)', '', regex=True)
df["Numbers"] = df["Numbers"].astype(float)
As shown in the comment by Anton vBR, using the parameter thousands='.' (together with decimal=',' for the decimal comma) gets the data read in correctly.
You can also try reading the Excel file with everything typed as strings:
df = pd.read_excel('file.xlsx', dtype=str)
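With everything loaded as strings, a replace chain then works uniformly on every row; a minimal sketch (the column name Numbers and the sample values are taken from the question):

```python
import pandas as pd

# Same values the original Excel column would yield with dtype=str
df = pd.DataFrame({"Numbers": ["13.264.999,99", "1.028,10", "756,4", "1.100,1"]})

# Drop the thousands separators, then turn the decimal comma into a dot
df["Numbers"] = (
    df["Numbers"]
    .str.replace(".", "", regex=False)
    .str.replace(",", ".", regex=False)
    .astype(float)
)
print(df["Numbers"].tolist())  # [13264999.99, 1028.1, 756.4, 1100.1]
```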
I need help cleaning a column of my data. Each cell of the column can contain dates, times, letters, floating-point numbers and many other kinds of data; the column's dtype is object.
What I want to do is remove all the dates, replacing them with empty cells, and keep only the times in the entire column. Then I want to insert the average time into the empty cells.
I'm using PyCharm and pandas to clean the column.
I would imagine you can achieve this with something along the lines of the code below. For the time format, it seems that for your data column just checking whether a token contains two colons is enough; you can also use something more robust:
def string_splitter(x):
    y = []
    for stuff in x.split():
        if stuff.count(":") == 2:  # you can also use a more robust pattern for times
            y.append(stuff)
        else:
            y.append("")  # add your string for indicating empty space
    return " ".join(y)

df['column_name'] = df['column_name'].apply(string_splitter)
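For the second half of the question (filling the emptied cells with the average time), a self-contained sketch under the assumption that times look like HH:MM:SS; the sample values are made up:

```python
import pandas as pd

# Mixed column: only the HH:MM:SS tokens should survive
s = pd.Series(["12:30:15", "2021-05-01", "hello", "01:10:45"])

# Keep values matching a time pattern, blank out the rest,
# then fill the gaps with the column's mean time.
times = pd.to_timedelta(s.where(s.str.fullmatch(r"\d{1,2}:\d{2}:\d{2}")))
filled = times.fillna(times.mean())
print(filled.tolist())
```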
I'm having difficulties working with a scraped CSV file in pandas.
There are several columns; one of them contains prices such as '1 800 €'.
After importing the CSV as a DataFrame, I cannot convert this column to integer.
I removed the euro symbol without any problem:
data['prix']= data['prix'].str.strip('€')
I tried to remove the space the same way, but it remained:
data['prix']= data['prix'].str.strip()
or
data['prix']= data['prix'].str.strip(' ')
or
data['prix']= data['prix'].str.replace(' ', '')
I then tried to force the conversion to int:
data['prix']= pd.to_numeric(data['prix'], errors='coerce')
but the column was filled with NaN values.
I also tried converting to string dtype before replacing the space:
data = data.convert_dtypes(convert_string=True)
But same result: the spaces are still present and I cannot convert to integer.
I looked at the dataset in Excel and could not identify any special problem in the data.
I also tried changing the encoding in read_csv, with no luck.
In the same dataset I had a similar problem with the mileage, stored as '15 256 km', and there I had no trouble cleaning and converting to int.
I would like to try a regex that copies only the digits of the field into a new column.
How should I proceed?
I am also open to other ideas.
Thank you
Use str.findall:
data['prix2'] = data['prix'].str.findall(r'\d+').str.join('').astype(int)
# Or, if that raises an exception:
data['prix2'] = pd.to_numeric(data['prix'].str.findall(r'(\d+)').str.join(''), errors='coerce')
To delete the whitespace use this line:
data['prix'] = data['prix'].str.replace(" ", "")
and to convert the strings to int use this line:
data['prix'] = [int(i) for i in data['prix']]
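If a plain space replacement leaves the character behind, the "space" in scraped data is often a non-breaking space (U+00A0) rather than an ASCII space, which str.strip() and str.replace(' ', '') will not touch. A sketch that strips every non-digit in one go (sample values made up):

```python
import pandas as pd

# '\xa0' is the non-breaking space frequently produced by web scraping
data = pd.DataFrame({"prix": ["1\xa0800\xa0€", "950 €"]})

# Remove everything that is not a digit, then convert
data["prix"] = data["prix"].str.replace(r"\D", "", regex=True).astype(int)
print(data["prix"].tolist())  # [1800, 950]
```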
Currently, I'm working with a column called 'amount' that contains transaction amounts. The column is of string dtype and I would like to convert it to a numeric dtype.
The problem is that the code below works, but when it removes the ',' the decimal part gets merged into the integer part, producing extremely high values: 100000,95 becomes 10000095. This is the code I used:
df["amount"] = df["amount"].str.replace(',', '')
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
Can someone help me with this problem?
EDIT: Not all values contain decimals. I'm looking for a solution for only the values that contain a ','.
You need to replace the comma with a dot if you want floats:
df["amount"] = df["amount"].str.replace(',', '.')
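Combined with to_numeric, a minimal sketch (sample values made up; this assumes the comma is only ever a decimal separator, as in the question's edit):

```python
import pandas as pd

df = pd.DataFrame({"amount": ["100000,95", "2500", "17,5"]})

# Swap the decimal comma for a dot, then convert; values without
# a comma pass through unchanged.
df["amount"] = pd.to_numeric(
    df["amount"].str.replace(",", ".", regex=False), errors="coerce"
)
print(df["amount"].tolist())  # [100000.95, 2500.0, 17.5]
```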
I'm writing a Python program to extract specific values from each cell in a .CSV file column and then make all the extracted values new columns.
Sample column cell (this is actually a small part; the real cell contains much more data):
AudioStreams":[{"JitterInterArrival":10,"JitterInterArrivalMax":24,"PacketLossRate":0.01353227,"PacketLossRateMax":0.09027778,"BurstDensity":null,"BurstDuration":null,"BurstGapDensity":null,"BurstGapDuration":null,"BandwidthEst":25245423,"RoundTrip":520,"RoundTripMax":11099,"PacketUtilization":2843,"RatioConcealedSamplesAvg":0.02746676,"ConcealedRatioMax":0.01598402,"PayloadDescription":"SIREN","AudioSampleRate":16000,"AudioFECUsed":true,"SendListenMOS":null,"OverallAvgNetworkMOS":3.487248,"DegradationAvg":0.2727518,"DegradationMax":0.2727518,"NetworkJitterAvg":253.0633,"NetworkJitterMax":1149.659,"JitterBufferSizeAvg":220,"JitterBufferSizeMax":1211,"PossibleDataMissing":false,"StreamDirection":"FROM-to-
One value I'm trying to extract is the number 10 between "JitterInterArrival": and ,"JitterInterArrivalMax". But since each cell contains relatively long strings with special characters (such as "") around the value, opener=re.escape(r"***") and closer=re.escape(r"***") wouldn't work.
Does anyone know a better solution? Thanks a lot!
IIUC, you have a JSON string and wish to get values from its attributes. So, given
s = '''
{"AudioStreams":[{"JitterInterArrival":10,"JitterInterArrivalMax":24,"PacketLossRate":0.01353227,"PacketLossRateMax":0.09027778,"BurstDensity":null,
"BurstDuration":null,"BurstGapDensity":null,"BurstGapDuration":null,"BandwidthEst":25245423,"RoundTrip":520,"RoundTripMax":11099,"PacketUtilization":2843,"RatioConcealedSamplesAvg":0.02746676,"ConcealedRatioMax":0.01598402,"PayloadDescription":"SIREN","AudioSampleRate":16000,"AudioFECUsed":true,"SendListenMOS":null,"OverallAvgNetworkMOS":3.487248,"DegradationAvg":0.2727518,
"DegradationMax":0.2727518,"NetworkJitterAvg":253.0633,
"NetworkJitterMax":1149.659,"JitterBufferSizeAvg":220,"JitterBufferSizeMax":1211,
"PossibleDataMissing":false}]}
'''
You can do
>>> import json
>>> data = json.loads(s)
>>> data['AudioStreams'][0]['JitterInterArrival']
10
In a data frame scenario, if you have a column col of strings such as the above, e.g.
df = pd.DataFrame({"col": [s]})
You can use transform, passing json.loads as the argument:
df.col.transform(json.loads)
to get a Series of dictionaries. Then, you can manipulate these dicts or just access the data as done above.
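If you want each attribute as its own column afterwards, pandas.json_normalize can expand the parsed dictionaries; a sketch with a shortened record:

```python
import json

import pandas as pd

s = '{"AudioStreams":[{"JitterInterArrival":10,"JitterInterArrivalMax":24}]}'
df = pd.DataFrame({"col": [s]})

# Parse each cell, grab the first audio stream, and expand it into columns
streams = df["col"].apply(lambda v: json.loads(v)["AudioStreams"][0])
expanded = pd.json_normalize(streams.tolist())
print(expanded["JitterInterArrival"].iloc[0])  # 10
```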
I am trying to read a 4-column txt file and create a 5th column. Column 1 is a string and columns 2-4 are numbers, but they are being read in as strings. Two questions: how do I convert columns 2-4 to floats so my script can multiply them, and how do I then create a 5th column that is the product of two of the existing columns?
You can cast strings to floats in Python like so:
>>> float('1.25')
1.25
Just cast them to float using float(x), where x is the string that contains the number.
Whenever you iterate over the lines, you can run each value through a try/except:
try:
    value = float(value)
except ValueError:
    pass  # or do some kind of error handling here
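Putting the whole task together, a self-contained sketch with a stand-in file and hypothetical column names:

```python
import io

import pandas as pd

# Stand-in for the 4-column txt file: column 1 is text, columns 2-4 are numbers
txt = io.StringIO("name a b c\nfoo 1.5 2.0 3.0\nbar 4.0 0.5 2.5\n")
df = pd.read_csv(txt, sep=r"\s+")

# Force columns 2-4 to float (a no-op if read_csv already inferred them),
# then build the fifth column as the product of two of them
df[["a", "b", "c"]] = df[["a", "b", "c"]].astype(float)
df["product"] = df["a"] * df["b"]
print(df["product"].tolist())  # [3.0, 2.0]
```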