I want to replace the values in a specific column using dictionary values, but nothing changes. I have tried several ways, but none of them work. My dataset has 47k rows and my dictionary has 30 different words, so I will only show some of them.
My dataset:
Dictionary:
rolechange = {"\\Adv":"Adversary",
"\\Sci":"Scientist",
"\\Inn":"Innocent",
"\\Und":"Undetermined"}
I'm trying:
movies_df["Role Type"].replace(rolechange, inplace=True)
It does not give an error, but the result is the same. I couldn't find a similar question on here; sorry if this is a duplicate.
You just have to create raw strings (prefix r):
rolechange = {r"\\Adv":"Adversary",
r"\\Sci":"Scientist",
r"\\Inn":"Innocent",
r"\\Und":"Undetermined"}
>>> df['Role Type'].replace(rolechange)
0 Scientist
1 Innocent
2 Undetermined
Name: Role Type, dtype: object
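Note that calling replace with inplace=True on a single selected column can silently fail to update the parent DataFrame (particularly under copy-on-write in pandas 2.x), so assigning the result back is the safer pattern. A minimal sketch, with a made-up three-row column standing in for the real data:
import pandas as pd

movies_df = pd.DataFrame({"Role Type": [r"\\Sci", r"\\Inn", r"\\Und"]})

rolechange = {r"\\Adv": "Adversary",
              r"\\Sci": "Scientist",
              r"\\Inn": "Innocent",
              r"\\Und": "Undetermined"}

# Assign the result back instead of using inplace=True, which may
# operate on a temporary copy of the column and leave movies_df unchanged.
movies_df["Role Type"] = movies_df["Role Type"].replace(rolechange)
print(movies_df)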
Good evening! I'm using pandas in a Jupyter Notebook. I have a huge dataframe representing the full history of posts from 26 channels in a messenger. It has a column "dialog_id" which indicates the dialog in which each message was sent (so there can be only 26 unique values in the column, but there are more than 700k rows, and the df is sorted by time, not by id, so it is kinda chaotic). I have to split this dataframe into 2 different ones (one will contain the full history of 13 channels, and the other will contain the history of the remaining 13 channels). I know the ids by which I have to split, and they are random as well. For example, one is -1001232032465 and the other is -1001153765346.
The question is: how do I do this most elegantly and adequately?
I know I can do it somehow with df.loc[], but I don't want to write out 13 lines of df.loc[]. I've tried to use logical operators for this, like:
df1.loc[(df["dialog_id"] == '-1001708255880') & (df["dialog_id"] == '-1001645788710')]
but it doesn't work. I suppose I'm using them wrong. I expect a solution with any method creating a new df with the use of logical operators. In verbal expression, I think it should sound like "put the row in a new df if the dialog_id is x, or the dialog_id is y, or the dialog_id is z, etc.". Please help me!
The easiest way seems to be just setting up a query. (As a side note, your & condition can never match: a single dialog_id cannot equal two different values at once, so you would need | there instead.)
import pandas as pd

df = pd.DataFrame(dict(col_id=[1, 2, 3, 4], other=[5, 6, 7, 8]))
channel_groupA = [1,2]
channel_groupB = [3,4]
df_groupA = df.query(f'col_id == {channel_groupA}')
df_groupB = df.query(f'col_id == {channel_groupB}')
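An equivalent approach is boolean indexing with isin, which reads naturally as the "dialog_id is x, or y, or z" rule you described. A sketch on the same toy data, assuming the real column is dialog_id:
import pandas as pd

df = pd.DataFrame(dict(dialog_id=[1, 2, 3, 4], other=[5, 6, 7, 8]))
channel_groupA = [1, 2]  # the 13 real channel ids would go here

# Rows whose dialog_id is in group A; the complement (~) is group B.
maskA = df["dialog_id"].isin(channel_groupA)
df_groupA = df[maskA]
df_groupB = df[~maskA]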
I have a date column in my DataFrame, say df_dob, and it looks like this:
id     DOB
23312  31-12-9999
1482   31-12-9999
807    #VALUE!
2201   06-12-1925
653    01/01/1855
108    01/01/1855
768    1967-02-20
What I want to print is a list of unique years like `['9999', '1925', '1855', '1967']`.
Basically, through this list I just want to check whether any unwanted year is present or not.
I have tried (code pasted below), but I'm getting ValueError: time data 01/01/1855 doesn't match format specified, and I could not resolve it.
df_dob['DOB'] = df_dob['DOB'].replace('01/01/1855 00:00:00', '1855-01-01')
df_dob['DOB'] = pd.to_datetime(df_dob.DOB, format='%Y-%m-%d')
df_dob['DOB'] = df_dob['DOB'].dt.strftime('%Y-%m-%d')
print(np.unique(df_dob['DOB']))
# print(list(df_dob['DOB'].year.unique()))
P.S. When I print df_dob['DOB'], I get values like 1967-02-20 00:00:00.
Can you try this?
df_dob["DOB"] = pd.to_datetime(df_dob["DOB"])
df_dob['YOB'] = df_dob['DOB'].dt.strftime('%Y')
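One caveat: 31-12-9999 is outside the range that datetime64[ns] can represent, so a plain pd.to_datetime call raises OutOfBoundsDatetime on this data. Passing errors='coerce' turns out-of-range or unparseable values into NaT instead. A sketch under that assumption:
import pandas as pd

s = pd.Series(["31-12-9999", "#VALUE!", "06-12-1925"])

# Out-of-range dates and junk like '#VALUE!' become NaT instead of raising.
dob = pd.to_datetime(s, format="%d-%m-%Y", errors="coerce")
print(dob.dt.year.dropna().unique())  # only 1925 survives here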
Use pandas' unique for this, and on the year only.
So try:
print(df_dob['DOB'].dt.year.unique())
Also, you don't need to stringify your time, and you don't need to replace anything; pandas is smart enough to do that for you. So your overall code becomes:
df_dob['DOB'] = pd.to_datetime(df_dob.DOB)  # No need to pass a format unless there is some specific anomaly
print(df_dob['DOB'].dt.year.unique())
Edit:
Another method:
Since you have an out-of-bounds problem (year 9999 cannot be represented as datetime64[ns]), you can try not converting to datetime at all, but rather finding the four-digit number in each value using a regex.
So:
df['DOB'].str.extract(r'(\d{4})')[0].unique()
The [0] is needed because unique() is a method of pd.Series, not of DataFrame; str.extract returns a DataFrame, so we take its first column.
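A self-contained sketch of that regex approach on data shaped like the sample above:
import pandas as pd

df = pd.DataFrame({"DOB": ["31-12-9999", "#VALUE!", "06-12-1925",
                           "01/01/1855", "1967-02-20"]})

# Pull the first four-digit run out of each string; rows with no
# match (like '#VALUE!') become NaN, which dropna() removes.
years = df["DOB"].str.extract(r"(\d{4})")[0].dropna().unique()
print(years)  # ['9999' '1925' '1855' '1967']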
The first thing you need to know is whether the resulting values (which you said look like 1967-02-20 00:00:00) are datetimes or not. That's as simple as df_dob.info()
If the result says something like datetime64[ns] for the DOB column, you're good. If not, you'll need to cast it as a datetime. You have a couple of different formats, so that might be part of your problem. Also, because there are several ways of doing this and it's a separate question, I'm not addressing it here.
We're going to leverage the speed of sets, plus a bit of pandas, and then convert that back to a list, as you wanted the final version to be.
years = list({i for i in df_dob['DOB'].dt.year})
And just a side note: you can't use [] instead of list(), as you'd end up with a list whose single element is a set.
That's a list, as you indicated. If you want it as a column instead, you won't get unique values.
Nitish's answer will also work, but gives you something like array([9999, 1925, 1855, 1967]).
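If you want the exact list-of-strings format from the question, converting that integer array is a one-liner. A small sketch, using a hard-coded array standing in for dt.year.unique():
import numpy as np

years = np.array([9999, 1925, 1855, 1967])  # what dt.year.unique() returns
year_list = [str(y) for y in years]
print(year_list)  # ['9999', '1925', '1855', '1967']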
I want to make the columns of Salary_Data_split variables, depending on Sal_name (type: list), where:
Sal_name = ['Success_S_1', 'Failure_S_1', 'Success_S_2', 'Failure_S_2','Success_S_4', 'Failure_S_4','Success_S_7', 'Failure_S_7','Success_S_8', 'Failure_S_8']
and Salary_Data_split must be as follows: it contains Salary plus the existing columns listed in Sal_name, like:
Salary_Data_split = data[["Salary",'Success_S_1', 'Failure_S_1', 'Success_S_2', 'Failure_S_2','Success_S_4', 'Failure_S_4','Success_S_7', 'Failure_S_7','Success_S_8', 'Failure_S_8']]
I have tried this code, but it doesn't work:
Salary_Data_split = data[["Salary", Sal_name]]
Please always include example data in your posts. It's also important to always include error messages; that way, your question is a lot clearer. I am guessing data is your dataframe with the column Salary and the columns listed in Sal_name, which you want to combine into Salary_Data_split?
data['sal_Data_Split'] = [data['Salary'], data['Sal_name']]
This will put the columns Salary and Sal_name in a list, resulting in a nested list if data['Sal_name'] is itself a list. The way you wrote Salary_Data_split = data[["Salary", Sal_name]] in your original post tries to index two columns of the dataframe at once. You also forgot the quotation marks around Sal_name, if a column of that name is what you meant.
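If the goal is to select the Salary column together with every column listed in Sal_name, concatenating the two lists before indexing is probably what was intended. A sketch, assuming data really contains all of those columns (the list is shortened here):
import pandas as pd

Sal_name = ["Success_S_1", "Failure_S_1"]  # shortened for the sketch
data = pd.DataFrame(0, index=[0, 1], columns=["Salary"] + Sal_name)

# "Salary" is a single label and Sal_name is already a list of labels,
# so build one flat list before indexing.
Salary_Data_split = data[["Salary"] + Sal_name]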
I have a csv dataset which, for whatever reason, has an extra asterisk (*) at the end of some names. I am trying to remove them, but I'm having trouble. I just want to replace the name in the case where it ends with a *, and otherwise keep it as-is.
I have tried a couple of variations of the following, but with little success.
import pandas as pd
people = pd.read_csv("people.csv")
people.loc[people["name"].str[-1] == "*"] = people["name"].str[:-1]
Here I am getting the following error:
ValueError: Must have equal len keys and value when setting with an iterable
I understand why this is wrong, but I'm not sure how else to reference the values I want to change.
I could instead do something like:
starred = people.loc[people["name"].str[-1] == "*"]
starred["name"] = starred["name"].str[:-1]
I get a warning here, but this kind of works. The problem is that the result only contains the previously starred people, not all of them.
I'm kind of new to this, so apologies if this is simple. I feel like it shouldn't be too hard, there should be some function to do this, but I don't know what it is.
Your syntax for pd.DataFrame.loc needs to include a column label:
df = pd.DataFrame({'name': ['John*', 'Rose', 'Summer', 'Mark*']})
df.loc[df['name'].str[-1] == '*', 'name'] = df['name'].str[:-1]
print(df)
name
0 John
1 Rose
2 Summer
3 Mark
If you only specify the first part of the indexer, you will be filtering by row label only, which returns a dataframe. You cannot assign a series to a dataframe.
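An alternative that skips the mask entirely is a regex replace anchored to the end of the string. A sketch on the same toy data:
import pandas as pd

df = pd.DataFrame({'name': ['John*', 'Rose', 'Summer', 'Mark*']})

# r'\*$' matches a literal asterisk only at the end of the name,
# so unstarred names pass through untouched.
df['name'] = df['name'].str.replace(r'\*$', '', regex=True)
print(df)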
I have a large pandas dataframe where I am running group-by operations.
CHROM POS Data01 Data02 ......
1 ....................
1 ...................
2 ..................
2 ............
scaf_9 .............
scaf_9 ............
So, I am doing:
my_data_grouped = my_data.groupby('CHROM')
for chr_, data in my_data_grouped:
    # do something with chr_
    # write something from that chr_ data
Everything is fine with small data and with data where there is no string-type CHROM (i.e. scaf_9). But with very large data that includes scaf_9, I am getting two separate groups for CHROM 2. There isn't really an error message, and it is not affecting the computation; the issue is that when I write the data out by group, group 2 is split unequally across two files.
It is becoming very hard for me to trace back the origin of this problem, since there is no error message and it works well with small data. My only assumptions are:
Is there a certain limit on the number of rows in the total dataframe vs. the grouped dataframe that the pandas module can handle? What is the fix to this problem?
Could it be that most of the 2 values are treated as integer objects, while some (the later part, close to scaf_9) are treated as string objects? Is this possible?
Sorry, I am only making assumptions here; it has become impossible for me to find the origin of the problem.
Post Edit:
I have also tried running sort_by(['CHROM']) before doing the groupby, but the problem still persists.
Is there any possible fix for this issue?
Thanks,
In my opinion there is a data problem, most likely some whitespace, which makes pandas process what looks like one value as separate groups.
The solution should be to remove the trailing whitespace first:
df.index = df.index.astype(str).str.strip()
You can also check the unique string values of the index:
a = df.index[df.index.map(type) == str].unique().tolist()
If the first column is not the index:
df['CHROM'] = df['CHROM'].astype(str).str.strip()
a = df.loc[df['CHROM'].map(type) == str, 'CHROM'].unique().tolist()
EDIT:
The final solution turned out to be simpler: casting to str, like:
df['CHROM'] = df['CHROM'].astype(str)
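A small reproduction of the mixed-type failure and the casting fix. A sketch, assuming the file reader left CHROM holding both the int 2 and the str '2':
import pandas as pd

# int 2 and str '2' hash differently, so they form two separate groups.
df = pd.DataFrame({'CHROM': [1, 2, '2', 'scaf_9'], 'POS': [10, 20, 30, 40]})
print(df.groupby('CHROM', sort=False).size())  # 2 and '2' appear separately

# Casting the whole column to str collapses them into one group.
df['CHROM'] = df['CHROM'].astype(str)
print(df.groupby('CHROM', sort=False).size())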