In my df I have a salary_range column, which contains ranges like 100 000 - 150 000. I'd like to modify this column so it takes the first value as an int. So in this example I'd like to change "100 000 - 150 000" (string) to 100000 (int). Unfortunately salary_range is full of NaN, and I don't really know how to use if/where statements in pandas.
I tried doing something like this: df['salary_range'] = np.where(df['salary_range']!='NaN',) but I don't know what I should write as the second argument of np.where. Obviously I can't just use str(salary_range), so I don't know how to do it.
You first need a mask for the rows where the value is not NaN. This can be done using the following code.
df['salary_range'].notna()
The above returns a Series containing True/False values. Now you can select the non-NaN rows using the following code.
df[df['salary_range'].notna()]
Next you will need to parse the entries of this subset, which can be done in many ways, one of which is the following (use .loc for the assignment, otherwise the chained indexing silently fails to write back to df).
mask = df['salary_range'].notna()
df.loc[mask, 'salary_range'] = (df.loc[mask, 'salary_range']
                                .str.split('-').str[0]
                                .str.replace(' ', '', regex=False)
                                .astype(int))
This will only change the rows where the column is not null. Since you did not include your code, I can't help much more without knowing the context. Hope this helps.
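Putting the pieces together, here is a minimal runnable sketch with made-up sample data (the real column of course comes from your own df):
import numpy as np
import pandas as pd

# made-up example data, only for illustration
df = pd.DataFrame({'salary_range': ['100 000 - 150 000', np.nan, '80 000 - 90 000']})

mask = df['salary_range'].notna()
df.loc[mask, 'salary_range'] = (df.loc[mask, 'salary_range']
                                .str.split('-').str[0]              # keep the part before the dash
                                .str.replace(' ', '', regex=False)  # drop the thousands separators
                                .astype(int))
print(df)   # salary_range now holds 100000, NaN, 80000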
Coming from an Excel background, I find indexes so confusing in code.
Typically I'll make something an index that I feel should be one, then lose the functionality I would have had when it was a column.
I've a df with 4-digit years from 2015 to 2113 as the index. If I call a for loop on the index they are class type int (shouldn't matter for my purposes).
I then want to take a cut that's just 2020, so I do
df[df.index==2020] and it returns a blank df where there is data to return
If I do df.loc[2020] it says it can't do label indexing on ints.
I just want to slice the data by years (so I can say just give me 2020 onward for example)
What am I doing wrong? Feel like I'm missing something fundamental?
I created a mock df to reproduce the problem for the question but that works fine.
If I do a for loop on the index of both the problem df and the example one they both return class int for each row
If I do example_df.index though it returns
Int64Index([2019, 2020, 2021], dtype='int64', name='Yr')
If I do the same on the problem df, it returns
Index(['2019', '2020', '2021'], dtype='object')
The above look like strings to me, but the loop says they are int?
Original problem index comes from Excel with set_index, so I can't reproduce an example here.
Any ideas?
On the problem df, indeed the index's data type is string.
Index(['2019', '2020', '2021'], dtype='object')
When you write
df[df.index==2020]
A blank result is expected because you are searching for the int 2020, not the string '2020'.
Then, the code
df.loc[2020]
fails for a related reason: .loc does label-based lookup, and on a string index there is no integer label 2020 (df.loc['2020'] would work on the current index), so pandas raises an error instead of returning anything.
So the code
df[df.index==2020]
is the right approach, but you first need to change the datatype of your index.
df.index = [int(i) for i in df.index]
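For completeness, a small self-contained sketch (mock data, not your actual df) showing the conversion and the "2020 onward" slice you asked about; df.index.astype(int) is an equivalent, slightly more idiomatic way to write the conversion above.
import pandas as pd

# mock frame with a string year index, mirroring the problem df
df = pd.DataFrame({'val': [1, 2, 3]}, index=pd.Index(['2019', '2020', '2021'], name='Yr'))

df.index = df.index.astype(int)   # same effect as the list comprehension

print(df[df.index == 2020])   # exact year now matches
print(df[df.index >= 2020])   # 2020 onward
print(df.loc[2020])           # label lookup also works once the index is int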
I have a date column in my DataFrame say df_dob and it looks like -
id       DOB
23312    31-12-9999
1482     31-12-9999
807      #VALUE!
2201     06-12-1925
653      01/01/1855
108      01/01/1855
768      1967-02-20
What I want to print is a list of unique years like ['9999', '1925', '1855', '1967'];
basically, through this list I just want to check whether any unwanted year is present or not.
I have tried (code pasted below) but I get ValueError: time data 01/01/1855 doesn't match format specified, and I could not resolve it.
df_dob['DOB'] = df_dob['DOB'].replace('01/01/1855 00:00:00', '1855-01-01')
df_dob['DOB'] = pd.to_datetime(df_dob.DOB, format='%Y-%m-%d')
df_dob['DOB'] = df_dob['DOB'].dt.strftime('%Y-%m-%d')
print(np.unique(df_dob['DOB']))
# print(list(df_dob['DOB'].year.unique()))
P.S - when I print df_dob['DOB'], I get values like - 1967-02-20 00:00:00
Can you try this?
df_dob["DOB"] = pd.to_datetime(df_DOB["Date"])
df_dob['YOB'] = df_dob['DOB'].dt.strftime('%Y')
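One caveat with the sample data in the question: pd.to_datetime will fail on #VALUE! and on 31-12-9999, because year 9999 is outside the datetime64[ns] range. A workaround, assuming df_dob as in the question, is to coerce those entries to NaT so the conversion at least runs; note that the coerced rows then drop out of the year list, which is exactly what the regex approach further down avoids.
# errors="coerce" turns unparseable entries (#VALUE!) and out-of-range dates
# (year 9999) into NaT instead of raising; dayfirst=True matches the dd-mm-yyyy rows
df_dob["DOB"] = pd.to_datetime(df_dob["DOB"], errors="coerce", dayfirst=True)
df_dob["YOB"] = df_dob["DOB"].dt.strftime("%Y")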
Use pandas' unique for this, on the year only.
So try:
print(df['DOB'].dt.year.unique())
Also, you don't need to stringify your time, and you don't need to replace anything; pandas is smart enough to do it for you. So your overall code becomes:
df_dob['DOB'] = pd.to_datetime(df_dob.DOB) # No need to pass format if there isn't some specific anomaly
print(df['DOB'].dt.year.unique())
Edit:
Another method:
Since you have an out-of-bounds problem (the year 9999 cannot be represented as a pandas datetime), you can avoid converting to datetime altogether and instead find all the four-digit numbers in the column using a regex.
So,
df['DOB'].str.extract(r'(\d{4})')[0].unique()
[0] because unique() is a method of a pd.Series, not a DataFrame, so we take the first (and only) column of the DataFrame that str.extract returns.
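A quick sanity check of that line against the sample rows from the question (mock frame, for illustration only):
import pandas as pd

df = pd.DataFrame({'DOB': ['31-12-9999', '31-12-9999', '#VALUE!',
                           '06-12-1925', '01/01/1855', '01/01/1855', '1967-02-20']})

print(df['DOB'].str.extract(r'(\d{4})')[0].unique())
# ['9999' nan '1925' '1855' '1967']   (#VALUE! has no four-digit number, hence the nan)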
The first thing you need to know is whether the resulting values (which you said look like 1967-02-20 00:00:00) are datetimes or not. That's as simple as df_dob.info()
If the result says something similar to datetime64[ns] for the DOB column, you're good. If not, you'll need to cast it as a datetime. You have a couple of different formats, so that might be part of your problem; because there are several ways of doing this and it's a separate question, I'm not addressing it here.
We're going to leverage the speed of sets, plus a bit of pandas, and then convert that back to a list, as you wanted the final version to be.
years = list({i for i in df_dob['DOB'].dt.year})
And just a side note, you can't use [] instead of list() as you'll end with a list with a single element that's a set.
That's a list, as you indicated. If you want it as a column, you won't get unique values.
Nitish's answer will also work but gives you something like: array([9999, 1925, 1855, 1967])
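To see the [] versus list() point concretely (toy values, not the real data):
years_set = {1855, 1925, 1967}
list(years_set)   # -> a flat list of ints, e.g. [1855, 1925, 1967]
[years_set]       # -> [{1855, 1925, 1967}], a one-element list whose single item is a set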
I have a large set of measurement data which contains 350 columns after filtering (for example A0 to A49, B0 to B49, F0 to F49) with some random numbers.
Now I want to look into B0 to B49 and check whether the values fall in a given range (say between 20 and 30). If not, I want to delete those columns from the measurement data.
How to do this in python with pandas?
I would also like to know about faster methods for this filtering.
sample data:https://docs.google.com/spreadsheets/d/17Xjc81jkjS-64B4FGZ06SzYDRnc6J27m/edit?usp=sharing&ouid=106137353367530025738&rtpof=true&sd=true
(In Pandas) You can apply a function to every element of a DataFrame using the applymap function. You can also apply aggregating actions to get a single value out of a whole column. Put those two things together and you have what you want.
For instance, you want to know if a given set of columns (the "B" ones) have values in some range (say, 20:30). So you want to verify the values at the element level, but collect the column names as output.
You can do that with the following code. Execute them separately/progressively to understand what they are doing.
>>> b_cols_of_interest_indx = df.filter(regex='^B').applymap(lambda x:20<x<30).any()
>>> b_cols_of_interest_indx[b_cols_of_interest_indx]
B19 True
B21 True
dtype: bool
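If the end goal is, as in the question, to actually delete the B columns that never fall in the range, one way to continue from the boolean index above (a sketch that reuses df and b_cols_of_interest_indx from the previous snippet) is:
# columns whose flag is False never had a value strictly between 20 and 30
b_cols_to_drop = b_cols_of_interest_indx[~b_cols_of_interest_indx].index
df = df.drop(columns=b_cols_to_drop)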
I have a pandas column, 'function', of job functions:
IT, HR, etc.,
but I have a few variations of each function
('IT application', 'IT,Digital,Digital', etc.).
I want to change all values that contain IT to just IT, for example.
I tried:
df['function'].str.contains('IT')
df['function'].isin(['IT'])
which gives only partial results.
I wanted something like:
'IT' in df.loc[:,'function']
but a solution that would work for the whole column and not for one index at a time.
If there is a solution that doesn't need a loop, that would be great.
This should work:
df['function'] = df['function'].str.replace(r'^.*IT.*$', 'IT', regex=True)
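A regex-free alternative is a boolean mask plus .loc assignment (na=False just skips missing values, in case the column contains any):
mask = df['function'].str.contains('IT', na=False)
df.loc[mask, 'function'] = 'IT'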
I want to make a DataFrame from data of another DataFrame. My first table has 3 columns; I chose the min value of one of the columns and I want to take the two other corresponding values and put them in another DataFrame. But when I do, I get extra information that won't let me convert it to float64. What should I do?
a = fp['w']
b = fp[r'$\Omega_m$']
data = {"best_value_w": [a], "best_value_$\Omega$": [b], "errors": [1]}
bv_table = pd.DataFrame(data, index=['1"$\sigma$"', '2"$\sigma$"', '3"$\sigma$"'])
Here is what I get: the values show up wrapped in brackets and the column dtype is object. What I want is the same table but without the brackets, with dtype float instead of object.
I found the answer: I should use astype like this:
bv_table['best_value_w'] = bv_table['best_value_w'].astype(float)
bv_table['best_value_$\Omega$'] = bv_table['best_value_$\Omega$'].astype(float)
Then I got the table I wanted, with float dtypes instead of object. Done!
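As a small follow-up, both value columns can also be cast in one statement (same bv_table as above):
value_cols = ['best_value_w', 'best_value_$\Omega$']
bv_table[value_cols] = bv_table[value_cols].astype(float)
print(bv_table.dtypes)   # the two value columns now report float64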