I'm looking to convert a column to lower case. The issue is that in some instances the string within the column contains only numbers; in my real-life case this is due to poor data entry. Instead of having these values converted to NaN, I would like to keep the numeric string as is. What is the best approach to achieving this?
Below is my current code and output:
import pandas as pd
df = pd.DataFrame({'col':['G5051', 'G5052', 5053, 'G5054']})
df['col'].str.lower()
Current Output

0    g5051
1    g5052
2      NaN
3    g5054
Name: col, dtype: object

Desired Output

0    g5051
1    g5052
2     5053
3    g5054
Name: col, dtype: object
Just convert the column to strings first:
import pandas as pd
df = pd.DataFrame({'col':['G5051', 'G5052', 5053, 'G5054']})
print(df['col'].astype(str).str.lower())
Pre-define the data as str when constructing the DataFrame:
import pandas as pd
df = pd.DataFrame({'col':['G5051', 'G5052', 5053, 'G5054']}, dtype=str)
print(df['col'].str.lower())
To add a slight variation to Tim Roberts' solution, without using the .str accessor:
import pandas as pd
df = pd.DataFrame({'col':['G5051', 'G5052', 5053, 'G5054']})
print(df['col'].astype(str).apply(lambda x: x.lower()))
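If you would rather keep the badly-entered numeric value as an actual number rather than the string '5053', here is a small sketch (not from the answers above) that lowercases only the entries that are strings:
import pandas as pd

df = pd.DataFrame({'col': ['G5051', 'G5052', 5053, 'G5054']})
# lowercase only str values; non-strings (the int 5053) pass through unchanged
print(df['col'].map(lambda x: x.lower() if isinstance(x, str) else x))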
I have this csv file called input.csv
KEY;Rate;BYld;DataAsOfDate
CH04;0.719;0.674;2020-01-29
CH03;1.5;0.148;2020-01-29
then I execute the following code:
import pandas as pd
input_df = pd.read_csv('input.csv', sep=";")
input_df.to_csv('output.csv', sep=";")
and get the following output.csv file
KEY;Rate;BYld;DataAsOfDate
CH04;0.7190000000000001;0.674;2020-01-29
CH03;1.5;0.14800000000000002;2020-01-29
I was hoping for and expecting an output like this, so that I can use a tool like winmerge.org to detect real differences on each row (my real code actually modifies the dataframe - this Stack Overflow example is for demonstration only):
KEY;Rate;BYld;DataAsOfDate
CH04;0.719;0.674;2020-01-29
CH03;1.5;0.148;2020-01-29
What is the idiomatic way to achieve such an unmodified output with pandas?
Python does not use traditional rounding; it uses banker's rounding (round half to even) to avoid bias problems. However, if being close is not a problem, you could use the round function and replace the 2 with whichever number of decimal places you would like to round to:
import pandas as pd

d = [['CH04', 0.719, 0.674, '2020-01-29']]
df = pd.DataFrame(d, columns=['KEY', 'Rate', 'BYld', 'DataAsOfDate'])
# round the Rate column to 2 decimal places
df['Rate'] = df['Rate'].apply(lambda x: round(x, 2))
df
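If the goal is only to control how the floats are printed rather than to change the data, to_csv's float_format parameter is another option. A minimal sketch, assuming three decimal places are enough (note it pads 1.5 to 1.500, so the text is close to, but not byte-identical with, the input):
import pandas as pd

input_df = pd.read_csv('input.csv', sep=";")
# format floats on output only; the dataframe keeps full precision
input_df.to_csv('output.csv', sep=";", index=False, float_format='%.3f')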
Using @Prokos' idea I changed the code like this:
import pandas as pd
input_df = pd.read_csv('input.csv', dtype=str, sep=";")
input_df.to_csv('str_output.csv', sep=";", index=False)
and that meets the requirement - all columns come out unchanged.
I am currently learning pandas and I am using an IMDb movies database, in which one of the columns is the duration of the movies. However, one of the values is "None", so I can't calculate the mean because this string sits in the middle of the numbers. I thought of changing the "None" to 0, but that would skew the results, as can be seen with the code below.
dur_temp = duration.replace("None", 0)
dur_temp = dur_temp.astype(float)
descricao_duration = dur_temp.mean()
Any ideas on what I should do in order not to skew the data? I also graphed it, and there the skew becomes even clearer.
You can replace "None" with numpy.nan instead of using 0.
Something like this should do the trick:
import numpy as np
dur_temp = duration.replace("None", np.nan)
descricao_duration = dur_temp.mean()
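A minimal, self-contained demo; the duration values here are made up, and an explicit cast to float is added because the example series holds strings:
import pandas as pd
import numpy as np

duration = pd.Series(['90', '120', 'None', '100'])  # hypothetical stand-in for the real column
dur_temp = duration.replace("None", np.nan).astype(float)
descricao_duration = dur_temp.mean()  # NaN is skipped: (90 + 120 + 100) / 3
print(descricao_duration)  # 103.33...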
If you want it to work for any string in your pandas Series, you could use pd.to_numeric:
pd.to_numeric(dur_temp, errors='coerce').mean()
This way, all the values that cannot be converted to float will be replaced by NaN, regardless of what the string is.
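For example, with made-up values:
import pandas as pd

dur_temp = pd.Series(['90', 'None', 'n/a', '100'])
# 'None' and 'n/a' both coerce to NaN and are excluded from the mean
print(pd.to_numeric(dur_temp, errors='coerce').mean())  # 95.0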
Just filter by condition, like this:
df[df['a'] != 'None']  # assuming the values you average are in column a
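A quick sketch of that approach, with a made-up column a of duration strings:
import pandas as pd

df = pd.DataFrame({'a': ['90', 'None', '120']})
# keep only the real numbers, then cast and average
print(df[df['a'] != 'None']['a'].astype(float).mean())  # 105.0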
Make them np.nan values.
I am writing this as an answer because I can't comment:
df = df.replace('None', np.nan)
or, in place:
df.replace('None', np.nan, inplace=True)
You can use fillna(value=np.nan) as shown below:
descricao_duration = dur_temp.fillna(value=np.nan).mean()
Demo:
import pandas as pd
import numpy as np
dur_temp = pd.DataFrame({'duration': [10, 20, None, 15, None]})
descricao_duration = dur_temp.fillna(value=np.nan).mean()
print(descricao_duration)
Output:
duration    15.0
dtype: float64
I want to change existing dataframe values to NaN. What is the way to do it when I need to change several columns?
dataframe['A', 'B'....] = np.nan
I tried this but nothing changed
Double brackets are required in this case, to pass the list of columns to the dataframe index operator []. For the OP's case it would be:
dataframe[['A', 'B'....]] = np.nan
Reproducible example:
import numpy as np
import pandas as pd
data = {'ColA': [1, 2, 3], 'ColB': [4, 5, 6], 'ColC': [7, 8, 9], 'ColD': [-1, -2, -3]}
df = pd.DataFrame(data, index=['I1', 'I2', 'I3'])
print(df)
df[['ColA','ColD']]=np.nan
print(df)
Note:
This solution was originally suggested via comment, now included as an answer with a reproducible example for future reference.
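If you prefer label-based indexing, a .loc assignment does the same thing; continuing the reproducible example above:
df.loc[:, ['ColA', 'ColD']] = np.nan  # same effect, explicit about selecting all rows
print(df)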
I want to replace "?" with NaN in Python.
The following code does not work, and I am not sure why.
import pandas as pd
import numpy as np

col_names = ['BI_RADS', 'age', 'shape', 'margin', 'density', 'severity']
dataset = pd.read_csv('mammographic_masses.data.txt', names=col_names)
dataset.replace("?", np.NaN)
After executing the above code, I still get those question marks in the dataset.
The format of the dataset looks like the followings:
5,67,3,5,3,1
4,43,1,1,?,1
5,58,?,5,3,1
4,28,1,1,3,0
5,74,1,5,?,1
Use inplace=True
Ex:
dataset.replace("?", np.NaN, inplace=True)
Recently I needed to write a Python script to find out how many times a specific string occurs in an Excel sheet.
I noted that we can use xlwings.Range('A1').table.formula to achieve this task, but only if the cells are contiguous. If the cells are not contiguous, how can I accomplish that?
It's a little hacky, but why not.
By the way, I'm assuming you are using Python 3.x.
First we'll create a new boolean dataframe that marks where the value you are looking for appears.
import pandas as pd
import numpy as np

df = pd.read_excel('path_to_your_excel..')
# True wherever a cell is a string equal to the value you want to find
b = df.applymap(lambda x: x == 'value_you_want_to_find' if isinstance(x, str) else False)
and then simply sum all occurrences:
print(np.count_nonzero(b.values))
As clarified in the comments, if you already have a dataframe, you can simply use count (Note: there must be a better way of doing it):
df = pd.DataFrame({'col_a': ['a'], 'col_b': ['ab'], 'col_c': ['c']})
string_to_search = '^a$' # should actually be a regex, in this example searching for 'a'
print(sum(df[col].str.count(string_to_search).sum() for col in df.columns))
>> 1