Pandas error: cannot reindex from a duplicate axis - python

I have a DataFrame named Mj_rank with a DatetimeIndex called date, which looks like this:
            A      B      C    ...
date
2016-01-29  False  False  True
2016-01-30  False  False  True
2016-02-01  True   True   True
...
2017-12-29  False  True   True
Currently the data is daily, but I would like to resample it into a new DataFrame that keeps the value at every 6-month mark.
Therefore I did:
Mj_rank_s = Mj_rank.resample('6M').asfreq().tail()
which gives me this output:
ValueError: cannot reindex from a duplicate axis
Strangely enough, if I use other methods like max() or min() it works fine, but not asfreq().
I tried different approaches based on existing Stack Overflow suggestions, such as adding the following before the resample, but it didn't work:
Mj_rank = Mj_rank.reset_index()
Mj_rank['date'] = pd.to_datetime(Mj_rank['date'])
Mj_rank = Mj_rank.set_index('date')
Thanks a lot!
Edit:
Thanks to jezrael, who pointed out that I had duplicate index values, which I found using:
Mj_rank[Mj_rank.index.duplicated(keep=False)]
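For anyone hitting the same error, a minimal sketch of one possible fix, assuming it is acceptable to keep only the first row per duplicated timestamp:
# Inspect the duplicated timestamps (jezrael's suggestion)
print(Mj_rank[Mj_rank.index.duplicated(keep=False)])

# Drop the duplicates, keeping the first occurrence, then resample as before
Mj_rank_unique = Mj_rank[~Mj_rank.index.duplicated(keep='first')]
Mj_rank_s = Mj_rank_unique.resample('6M').asfreq().tail()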

Pandas groupby .all method

I am trying to understand how to use .all, for example:
import pandas as pd
df = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    "score": [1, 2, 3, 4, 5, 3, 4, 5, 5, 6, 7, 8]
})
When I try:
df.groupby("user_id").all(lambda x: x["score"] > 2)
I get:
         score
user_id
1         True
2         True
3         True
But I expect:
         score
user_id
1        False    # since not all of user 1's scores are greater than 2
2         True
3         True
In fact it doesn't even matter what value I pass instead of 2, the result DataFrame always has True for the score column.
Why do I get the result that I get? How can I get my expected result?
I looked at the documentation: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.all.html, but it is very brief and did not help me.
The line
df.groupby("user_id").all(lambda x: x["score"] > 2)
is not asking "are all scores larger than 2?"; the lambda argument is ignored, so it is really asking "are all values in the group truthy?"
To ask what you really want, you need to do the following:
df['score'].gt(2).groupby(df['user_id']).all()
Out
user_id
1 False
2 True
3 True
groupby.all does not take a function as a parameter. Its only parameter (skipna) accepts a boolean and is used to change how NaN values are interpreted.
You probably want:
df['score'].gt(2).groupby(df['user_id']).all()
Which can also be written as:
df.assign(flag=df['score'].gt(2)).groupby('user_id')['flag'].all()
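Putting it together, a minimal runnable sketch using the same data as above:
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    "score": [1, 2, 3, 4, 5, 3, 4, 5, 5, 6, 7, 8]
})

# Build the per-row condition first, then ask whether it holds for every row in each group
result = df['score'].gt(2).groupby(df['user_id']).all()
print(result)
# user_id
# 1    False
# 2     True
# 3     True
# Name: score, dtype: bool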

How to find error value on pandas dataframe?

While reading a CSV file exported from Excel into a pandas DataFrame, I got a value containing a symbol, such as 2$3.74836730957, where it should be 243.74836730957 (it seems the 4 was mistaken for a $). Is there any way to find such values and change them to NaN in the DataFrame?
CSV file:
You can use pd.to_numeric to get, for each column, a boolean value that denotes whether it contains only numerical values. To check all columns you can do:
df.apply(lambda s: pd.to_numeric(s, errors='coerce').notnull().all())
And the output would look like:
A True
B False
C True
D False
...
dtype: bool
Now if you want to know which specific row(s) are not numerical you can use np.isreal:
import numpy as np
df.applymap(np.isreal)
A B C D
item
r1 True True True True
r2 True True True True
r3 True True True False
...
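If the goal is to actually replace the malformed values with NaN rather than just locate them, a minimal sketch, assuming every column in df is meant to be numeric, is to coerce each column with pd.to_numeric:
import pandas as pd

# Any value that cannot be parsed as a number (e.g. '2$3.74836730957') becomes NaN.
# Restrict the column list if some columns are supposed to stay as text.
df = df.apply(pd.to_numeric, errors='coerce')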

Split strings in DataFrame and keep only certain parts

I have a DataFrame like this:
import pandas as pd

x = ['3.13.1.7-2.1', '3.21.1.8-2.2', '4.20.1.6-2.1', '4.8.1.2-2.0', '5.23.1.10-2.2']
df = pd.DataFrame(data=x, columns=['id'])
id
0 3.13.1.7-2.1
1 3.21.1.8-2.2
2 4.20.1.6-2.1
3 4.8.1.2-2.0
4 5.23.1.10-2.2
I need to split each id string on the periods, and then I need to know when the second part is 13 and the third part is 1. Ideally, I would have one additional column that is a boolean value (in the above example, index 0 would be TRUE and all others would be FALSE). But I could live with multiple additional columns, where one or more contain individual string parts, and one is for said boolean value.
I first tried to just split the string into parts:
df['id_split'] = df['id'].apply(lambda x: str(x).split('.'))
This worked; however, if I try to isolate only the second part of the string like this...
df['id_split'] = df['id'].apply(lambda x: str(x).split('.')[1])
...I get an error that the list index is out of range.
Yet, if I check any individual index in the DataFrame like this...
df['id_split'][0][1]
...this works, producing only the second item in the list of strings.
I guess I'm not familiar enough with what the .apply() method is doing to know why it won't accept list indices. But anyway, I'd like to know how I can isolate just the second and third parts of each string, check their values, and output a boolean based on those values, in a scalable manner (the actual dataset is millions of rows). Thanks!
Let's use str.split to get the parts, then you can compare:
parts = df['id'].str.split(r'\.', expand=True)
(parts[[1,2]] == ['13','1']).all(1)
Output:
0 True
1 False
2 False
3 False
4 False
dtype: bool
You can do something like this
df['flag'] = df['id'].apply(lambda x: True if x.split('.')[1] == '13' and x.split('.')[2]=='1' else False)
Output
id flag
0 3.13.1.7-2.1 True
1 3.21.1.8-2.2 False
2 4.20.1.6-2.1 False
3 4.8.1.2-2.0 False
4 5.23.1.10-2.2 False
You can do it directly, like below:
df['new'] = df['id'].apply(lambda x: str(x).split('.')[1]=='13' and str(x).split('.')[2]=='1')
>>> print(df)
id new
0 3.13.1.7-2.1 True
1 3.21.1.8-2.2 False
2 4.20.1.6-2.1 False
3 4.8.1.2-2.0 False
4 5.23.1.10-2.2 False
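For millions of rows, another option worth considering is a single vectorized regex with str.match; this is a sketch assuming the id values always start with dot-separated numeric parts:
import pandas as pd

x = ['3.13.1.7-2.1', '3.21.1.8-2.2', '4.20.1.6-2.1', '4.8.1.2-2.0', '5.23.1.10-2.2']
df = pd.DataFrame({'id': x})

# True where the second dot-separated part is 13 and the third is 1
df['flag'] = df['id'].str.match(r'^[^.]+\.13\.1\.')
print(df)
#               id   flag
# 0   3.13.1.7-2.1   True
# 1   3.21.1.8-2.2  False
# ...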

Using .loc or .iloc instead of .ix

I am using python 3.6.
I have a pandas.core.frame.DataFrame and would like to filter the entire DataFrame based on if the column called "Closed Date" is not null. In other words, if it is null in the "Closed Date" column, then remove the whole row from the DataFrame.
My code right now is the following:
data = raw_data.ix[raw_data['Closed Date'].notnull()]
Though it gets the job done, I get a warning message saying the following:
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing
I tried this code:
data1 = raw_data.loc[raw_data.notnull(), 'Closed Date']
But get this error:
ValueError: Cannot index with multidimensional key
How do I fix this? Any suggestions?
This should work for you:
data1 = raw_data.loc[raw_data['Closed Date'].notnull()]
.ix was very similar to the current .loc (which is why the correct .loc syntax is equivalent to what you were originally doing with .ix). The difference, according to this detailed answer, is: "ix usually tries to behave like loc but falls back to behaving like iloc if a label is not present in the index"
Example:
Taking this dataframe as an example (let's call it raw_data):
Closed Date x
0 1.0 1.0
1 2.0 2.0
2 3.0 NaN
3 NaN 3.0
4 4.0 4.0
raw_data.notnull() returns this DataFrame:
Closed Date x
0 True True
1 True True
2 True False
3 False True
4 True True
You can't index using .loc based on a dataframe of boolean values. However, when you do raw_data['Closed Date'].notnull(), you end up with a Series:
0 True
1 True
2 True
3 False
4 True
Which can be passed to .loc as a sort of "boolean filter" to apply onto your dataframe.
Alternate Solution
As pointed out by John Clemens, the same can be achieved with raw_data.dropna(subset=['Closed Date']). The documentation for the .dropna method outlines how this could be more flexible in some situations (for instance, allowing to drop rows or columns in which any or all values are NaN using the how argument, etc...)
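A minimal sketch showing both options side by side, using the example raw_data above:
import numpy as np
import pandas as pd

raw_data = pd.DataFrame({'Closed Date': [1.0, 2.0, 3.0, np.nan, 4.0],
                         'x': [1.0, 2.0, np.nan, 3.0, 4.0]})

# Boolean-mask filtering with .loc
data1 = raw_data.loc[raw_data['Closed Date'].notnull()]

# Equivalent: drop rows where 'Closed Date' is NaN
data2 = raw_data.dropna(subset=['Closed Date'])

print(data1.equals(data2))  # True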

Find values of data frame in another dataframe in python

I have two DataFrames. df_1 contains:
["TP", "MP"]
and df_2 contains:
["This is case 12389TP12098", "12378MP899 is now resolved", "12356DCT is pending"]
I want to use the values in df_1 to search each entry of df_2 and return those which match; in this case, the two entries that contain TP and MP.
I tried something like this:
df_2.str.contains(df_1)
You need to do it separately for each element of df_1. Pandas will help you:
df_1.apply(df_2.str.contains)
Out:
       0      1      2
0   True  False  False
1  False   True  False
That's a matrix of all combinations. You can pretty it up:
matches = df_1.apply(df_2.str.contains)
matches.index = df_1
matches.columns = df_2
matches
Out:
    This is case 12389TP12098  12378MP899 is now resolved  12356DCT is pending
TP                       True                       False                False
MP                      False                        True                False
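If the end goal is the matching entries of df_2 themselves rather than the full True/False matrix, one option, sketched here under the assumption that df_1 and df_2 are Series of strings, is to join the search terms into a single pattern:
import pandas as pd

df_1 = pd.Series(["TP", "MP"])
df_2 = pd.Series(["This is case 12389TP12098",
                  "12378MP899 is now resolved",
                  "12356DCT is pending"])

# Build one pattern such as 'TP|MP' and keep only the rows of df_2 containing any term
pattern = '|'.join(df_1)
print(df_2[df_2.str.contains(pattern)])
# 0     This is case 12389TP12098
# 1    12378MP899 is now resolved
# dtype: object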
