Pandas indexing by integer - python

Coming from an Excel background, I find indexes confusing in code.
Typically I'll make something an index that I feel should be one, then lose the functionality I would have had when it was a column.
I have a df with 4-digit years from 2015 to 2113 as the index. If I loop over the index, each element is of class int (which shouldn't matter for my purposes).
I then want to take a cut that's just 2020, so I do
df[df.index==2020] and it returns a blank df where there is data to return.
If I do df.loc[2020] it says it can't do label indexing on ints.
I just want to slice the data by years (so I can say just give me 2020 onward, for example).
What am I doing wrong? I feel like I'm missing something fundamental.
I created a mock df to reproduce the problem for the question, but that one works fine.
If I do a for loop on the index of both the problem df and the example one, they both return class int for each row.
If I do example_df.index, though, it returns
Int64Index([2019, 2020, 2021], dtype='int64', name='Yr')
If I do the same on the problem df, it returns
Index(['2019', '2020', '2021'], dtype='object')
The above look like strings to me, but the loop says they are int?
The original problem index comes from Excel with set_index, so I can't reproduce an example here.
Any ideas?

On the problem df, the index's data type is indeed string:
Index(['2019', '2020', '2021'], dtype='object')
When you write
df[df.index==2020]
a blank result is expected, because you are searching for the int 2020, not the string '2020'.
The code
df.loc[2020]
fails for the same reason: loc looks rows up by label, and your labels are the strings '2019', '2020', ..., so the integer label 2020 does not exist.
So
df[df.index==2020]
is the right approach, but first you need to change the data type of your index:
df.index = [int(i) for i in df.index]
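A minimal sketch of the same idea using Index.astype, plus the "2020 onward" slice from the question (assuming the year index is sorted ascending; the frame below is made up for illustration):
import pandas as pd

# hypothetical frame whose year index was read in as strings, as in the question
df = pd.DataFrame({'val': [1, 2, 3, 4]},
                  index=pd.Index(['2019', '2020', '2021', '2022'], name='Yr'))

df.index = df.index.astype(int)   # convert the string labels to integers

print(df[df.index == 2020])       # just 2020
print(df.loc[2020:])              # 2020 onward (works on a sorted integer index)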

Related

Get unique years from a date column in pandas DataFrame

I have a date column in my DataFrame, say df_dob, and it looks like this:
id      DOB
23312   31-12-9999
1482    31-12-9999
807     #VALUE!
2201    06-12-1925
653     01/01/1855
108     01/01/1855
768     1967-02-20
What I want to print is a list of unique years like ['9999', '1925', '1855', '1967'].
Basically, through this list I just want to check whether some unwanted year is present or not.
I have tried (code pasted below) but I get ValueError: time data 01/01/1855 doesn't match format specified, and I could not resolve it.
df_dob['DOB'] = df_dob['DOB'].replace('01/01/1855 00:00:00', '1855-01-01')
df_dob['DOB'] = pd.to_datetime(df_dob.DOB, format='%Y-%m-%d')
df_dob['DOB'] = df_dob['DOB'].dt.strftime('%Y-%m-%d')
print(np.unique(df_dob['DOB']))
# print(list(df_dob['DOB'].year.unique()))
P.S. When I print df_dob['DOB'], I get values like 1967-02-20 00:00:00.
Can you try this?
df_dob["DOB"] = pd.to_datetime(df_dob["DOB"])
df_dob['YOB'] = df_dob['DOB'].dt.strftime('%Y')
Use pandas' unique for this, and on the year only.
So try:
print(df_dob['DOB'].dt.year.unique())
Also, you don't need to stringify your time. Also, you don't need to replace anything; pandas is smart enough to do it for you. So your overall code becomes:
df_dob['DOB'] = pd.to_datetime(df_dob.DOB)  # no need to pass format unless there is some specific anomaly
print(df_dob['DOB'].dt.year.unique())
Edit:
Another method:
Since you have an out-of-bounds problem (years like 9999 are beyond pandas' Timestamp range), another method you can try is not converting to datetime at all, but rather finding the four-digit numbers in the column using a regex:
df_dob['DOB'].str.extract(r'(\d{4})')[0].unique()
The [0] is there because unique() is a method of a Series, not a DataFrame, so we take the first (and only) column of the DataFrame that str.extract returns.
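A small runnable sketch of that regex approach on the sample data from the question (the #VALUE! row just becomes NaN, which dropna removes):
import pandas as pd

df_dob = pd.DataFrame({'DOB': ['31-12-9999', '31-12-9999', '#VALUE!',
                               '06-12-1925', '01/01/1855', '01/01/1855',
                               '1967-02-20']})

years = (df_dob['DOB']
         .str.extract(r'(\d{4})')[0]   # first run of 4 digits per row, NaN if none
         .dropna()                     # drops the '#VALUE!' row
         .unique())
print(list(years))                     # ['9999', '1925', '1855', '1967']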
The first thing you need to know is whether the resulting values (which you said look like 1967-02-20 00:00:00) are datetimes or not. That's as simple as df_dob.info().
If the result says something like datetime64[ns] for the DOB column, you're good. If not, you'll need to cast it as a datetime. You have a couple of different formats, so that might be part of your problem; since there are several ways of doing that and it's a separate question, I'm not addressing it here.
We are going to leverage the speed of sets, plus a bit of pandas, and then convert back to a list, since you wanted the final result as a list.
years = list({i for i in df_dob['DOB'].dt.year})
Just a side note: you can't use [] instead of list(), as you'd end up with a list whose single element is a set.
That's a list, as you indicated. If you want it as a column instead, you won't get unique values.
Nitish's answer will also work, but it gives you something like: array([9999, 1925, 1855, 1967])

Problem with omitting extra information when importing a value in DataFrame?

I want to make a DataFrame from the data of another DataFrame. My first table has 3 columns; I chose the min value of one of the columns, and I want to take the two other corresponding values and put them in another DataFrame. But when I import them I get extra information that won't let me convert the columns to float64. What should I do?
a= fp['w']
b= fp[r'$\Omega_m$']
data={"best_value_w": [a], "best_value_$\Omega$": [b], "errors":[1]}
bv_table= pd.DataFrame(data, index=['1"$\sigma$"', '2"$\sigma$"', '3"$\sigma$"'])
Here is what I get (screenshot omitted): the values show up wrapped in brackets and the dtype is object.
What I want is the same table but without the brackets, with dtype float instead of object.
I found the answer: I should use astype like this:
bv_table['best_value_w'] = bv_table['best_value_w'].astype(float)
bv_table[r'best_value_$\Omega$'] = bv_table[r'best_value_$\Omega$'].astype(float)
Then I got a table like the one I wanted.
Done!
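A minimal sketch of an alternative, assuming a and b are plain scalars: if you pass them to the data dict without the surrounding brackets, pandas broadcasts each scalar across the index and the columns come out as float64 directly, so no astype step is needed. The values below are made-up stand-ins for fp['w'] and fp[r'$\Omega_m$'].
import pandas as pd

a, b = -1.02, 0.31   # hypothetical best-fit values

data = {"best_value_w": float(a), r"best_value_$\Omega$": float(b), "errors": 1.0}
bv_table = pd.DataFrame(data, index=[r'1$\sigma$', r'2$\sigma$', r'3$\sigma$'])

print(bv_table.dtypes)   # all three columns are float64, no brackets in the cells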

Pandas change range into int

In my df I have a salary_range column, which contains ranges like 100 000 - 150 000. I'd like to modify this column so that it takes the first value as an int; so in this example I'd like to change "100 000 - 150 000" (string) to 100000 (int). Unfortunately salary_range is full of NaN, and I don't really know how to use if/where statements in pandas.
I tried something like this: df['salary_range'] = np.where(df['salary_range']!='NaN', ) but I don't know what to write as the second argument of np.where. Obviously I can't just use str(salary_range), so I don't know how to do it.
You first need to restrict yourself to the rows where the value is not NaN. This can be done with the following code.
df['salary_range'].notna()
The above returns a boolean Series (True where a value is present). You can then select and update only those rows with .loc, parsing the entries of that subset, for example like this:
mask = df['salary_range'].notna()
df.loc[mask, 'salary_range'] = df.loc[mask, 'salary_range'].str.split('-').str[0].str.replace(' ', '', regex=False).astype(int)
This only changes the rows where the column is not null. Since you did not include your code, I can't help much more without knowing the context. Hope this helps.
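For completeness, a sketch of a fully vectorized variant that skips the masking step (assuming the separator is always a hyphen); pd.to_numeric leaves the NaN rows as NaN, so the resulting column is float64 rather than int:
import numpy as np
import pandas as pd

df = pd.DataFrame({'salary_range': ['100 000 - 150 000', np.nan, '80 000 - 90 000']})

first_value = (df['salary_range']
               .str.split('-').str[0]                # text before the dash, NaN stays NaN
               .str.replace(' ', '', regex=False))   # drop the thousands separators
df['salary_range'] = pd.to_numeric(first_value)

print(df)   # 100000.0, NaN, 80000.0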

Replace Value in Dataframe Column based upon value in another column within the same dataframe

I have a pandas dataframe in which some rows didn't pull in correctly, so the values were pushed over into the next column. Therefore I have a column that is mostly null but has a few instances where there is a value that should go in the previous column. Below is an example of what it looks like.
(screenshot of the example dataframe omitted)
I need to replace the 12345 and 45678 in the Approver column with JJones in the NeedtoDelete column.
I am not sure if a for loop, or a regular expression is the right way to go. I also came across the replace function, but I'm not sure how I would set that up in this scenario. Below is the code I have tried thus far (Q1Q2 is the df name):
for Q1Q2['Approver'] in Q1Q2:
Replacement = Q1Q2.loc[Q1Q2['Need to Delete'].notnull()]
Q1Q2.loc[Replacement] = Q1Q2['Approver']
Q1Q2.loc[Q1Q2['Need to Delete'].notnull(), ['Approver'] == Q1Q2['Need to Delete']]
If you could help me fix either attempts above, or point me in the right direction, it would be greatly appreciated. Thanks in advance!
You can use boolean indexing:
r=Q1Q2['Need to Delete'].notnull()
Q1Q2.loc[r,'Approver']=Q1Q2.loc[r,'Need to Delete']
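As a side note, a one-line alternative (a sketch, not part of the original answer) is to fill Approver from Need to Delete wherever the latter is non-null; the sample frame below is made up to mirror the screenshot:
import numpy as np
import pandas as pd

Q1Q2 = pd.DataFrame({'Approver': ['JSmith', 12345, 45678],
                     'Need to Delete': [np.nan, 'JJones', 'JJones']})

# take 'Need to Delete' where it is present, otherwise keep the existing 'Approver'
Q1Q2['Approver'] = Q1Q2['Need to Delete'].fillna(Q1Q2['Approver'])
print(Q1Q2)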

How to change column values according to size

I have a dataframe df in a PySpark setting. I want to modify a column, say it is called A, whose datatype is string. I want to change its values according to their length: if a row's value is only one character long, I want to append "0" to it; otherwise I keep the default value. The name of the modified column must still be A. This is for a Jupyter Notebook using PySpark 3.
This is what I have tried so far:
df = df.withColumn("A", when(size(df.col("A")) == 1, concat(df.col("A"), lit("0"))).otherwise(df.col("A")))
I also tried the same code without the df.col calls.
When I run this code, it complains that the syntax is invalid, but I don't see the error.
df.withColumn("temp", when(length(df.A) == 1, concat(df.A, lit("0"))).\
otherwise(df.A)).drop("A").withColumnRenamed('temp', 'A')
What I understood after reading your question is that you were ending up with an extra column A.
So you want the old column A replaced by the new column A. I therefore created a temp column with your required logic, then dropped column A, then renamed the temp column to A.
Listen here child...
To pick a column from a DataFrame in PySpark, you must not use df.col(...): that method only exists in the Scala/Java API. In PySpark, the correct way is just to reference the column from the DF: df.colName (or df["colName"], or pyspark.sql.functions.col("colName")).
To get the length of your string, use the length function. The size function is for array and map columns.
And for the grand solution... (drums drums drums)
df.withColumn("A", when(length(df.A) == 1, concat(df.A, lit("0"))).otherwise(df.A))
Por favor!
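A self-contained sketch of that solution with the required imports (the sample values and the SparkSession setup are made up for illustration):
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, length, concat, lit

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("ab",), ("xyz",)], ["A"])

# append "0" when the value is a single character, otherwise keep it as-is
df = df.withColumn("A", when(length(df.A) == 1, concat(df.A, lit("0"))).otherwise(df.A))
df.show()   # "a" becomes "a0"; "ab" and "xyz" are unchanged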
