I want to lower-case all values in a DataFrame, and use the following code:
import pandas as pd
import numpy as np
path1= "C:\\Users\\IBM_ADMIN\\Desktop\\ml-1m\\SELECT_FROM_HRAP2P3_SAAS_ZTXDMPARAM_201611291745.csv"
frame1 = pd.read_csv(path1,encoding='utf8',dtype = {'COUNTRY_CODE': str})
for x in frame1:
    frame1[x] = frame1[x].str.lower()
frame1
but I get the following error for this line:
frame1[x] = frame1[x].str.lower()
error:
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
I don't know the reason.
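For reference, the error appears because `.str` only works on object (string) columns, and the loop also hits any non-string columns. A minimal sketch with made-up data (the original CSV and its `AMOUNT`-style numeric column are stand-ins, not reproduced from the file) that restricts the loop to string columns:

```python
import pandas as pd

# Stand-in for the CSV data; the AMOUNT column is illustrative
frame1 = pd.DataFrame({'COUNTRY_CODE': ['US', 'DE'], 'AMOUNT': [1.5, 2.0]})

# .str only works on object (string) columns, so restrict the loop to those
for x in frame1.select_dtypes(include=['object']).columns:
    frame1[x] = frame1[x].str.lower()

print(frame1)
```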
You can use the applymap function.
import pandas as pd
df1 = pd.DataFrame({'MovieName': ['LIGHTS OUT', 'Legend'], 'Actors':['MARIA Bello', 'Tom Hard']})
df2 = df1.applymap(lambda x: x.lower())
print(df1, "\n")
print(df2)
Output:
Actors MovieName
0 MARIA Bello LIGHTS OUT
1 Tom Hard Legend
Actors MovieName
0 maria bello lights out
1 tom hard legend
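Note that applymap calls the function on every cell, so a plain `x.lower()` will raise on non-string cells, which is the same root cause as the original error. A hedged variant with a hypothetical numeric `Year` column, guarding with isinstance:

```python
import pandas as pd

# 'Year' is a made-up numeric column added to show the guard
df = pd.DataFrame({'MovieName': ['LIGHTS OUT', 'Legend'], 'Year': [2016, 2015]})

# Only lower-case cells that are actually strings; everything else passes through
lowered = df.applymap(lambda x: x.lower() if isinstance(x, str) else x)
print(lowered)
```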
Try using str.lower on the Series object.
Suppose your DataFrame looks like below:
df = pd.DataFrame(dict(name=["HERE", "We", "are"]))
name
0 HERE
1 We
2 are
Then lower-case all values; the output:
df['name'] = df['name'].str.lower()
name
0 here
1 we
2 are
You can try this:
df2 = pd.DataFrame({ 'A' : 1.,
'B' : pd.Timestamp('20130102'),
'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
'D' : np.array([3] * 4,dtype='int32'),
'E' : pd.Series(["TEST","Train","test","train"]),
'F' : 'foo' })
mylist = list(df2.select_dtypes(include=['object']).columns)  # strings are stored as object dtype in a DataFrame
for i in mylist:
    df2[i] = df2[i].str.lower()
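To see why only the string columns are picked up, a small sketch (with a trimmed-down version of the frame above) showing what select_dtypes returns:

```python
import pandas as pd

# Trimmed-down version of the frame above
df2 = pd.DataFrame({'A': 1.,
                    'E': pd.Series(["TEST", "Train", "test", "train"]),
                    'F': 'foo'})

# Only the string columns E and F are stored with object dtype
cols = list(df2.select_dtypes(include=['object']).columns)
print(cols)
```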
Related
I have a dataframe where I want to create a dummy variable that takes the value 1 when the Asset Class starts with a D. I want to cover all variants that start with a D. How would you do it?
The data looks like
dic = {'Asset Class': ['D.1', 'D.12', 'D.34','nan', 'F.3', 'G.12', 'D.2', 'nan']}
df = pd.DataFrame(dic)
What I want to have is
dic_want = {'Asset Class': ['D.1', 'D.12', 'D.34', 'nan', 'F.3', 'G.12', 'D.2', 'nan'],
'Asset Dummy': [1,1,1,0,0,0,1,0]}
df_want = pd.DataFrame(dic_want)
I tried
df_want["Asset Dummy"] = ((df["Asset Class"] == df.filter(like="D"))).astype(int)
where I get the following error message: ValueError: Columns must be same length as key
I also tried
CSDB["test"] = ((CSDB["PAC2"] == CSDB.str.startswith('D'))).astype(int)
where I get the error message AttributeError: 'DataFrame' object has no attribute 'str'.
I tried to transform my object to a string with the standard methods (astype(str) and to_string()), but it also does not work. This is probably another problem, but I have found only one post with the same question, and it does not have a satisfactory answer.
Any ideas how I can solve my problem?
There are many ways to create a new column based on conditions; this is one of them:
import pandas as pd
import numpy as np
dic = {'Asset Class': ['D.1', 'D.12', 'D.34', 'F.3', 'G.12', 'D.2']}
df = pd.DataFrame(dic)
df['Dummy'] = np.where(df['Asset Class'].str.contains("D"), 1, 0)
Here's a link to more: https://www.dataquest.io/blog/tutorial-add-column-pandas-dataframe-based-on-if-else-condition/
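One caveat: str.contains("D") flags a D anywhere in the value, not just at the start. A small sketch with a made-up 'XD.2' row showing how str.startswith restricts the match to a leading D:

```python
import numpy as np
import pandas as pd

# 'XD.2' is a made-up value that contains (but does not start with) a D
df = pd.DataFrame({'Asset Class': ['D.1', 'F.3', 'XD.2']})

# startswith (or str.contains with the regex '^D') anchors the match
df['Dummy'] = np.where(df['Asset Class'].str.startswith('D'), 1, 0)
print(df['Dummy'].tolist())
```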
You can use Series.str.startswith on df['Asset Class']:
>>> dic = {'Asset Class': ['D.1', 'D.12', 'D.34', 'nan', 'F.3', 'G.12', 'D.2', 'nan']}
>>> df = pd.DataFrame(dic)
>>> df['Asset Dummy'] = df['Asset Class'].str.startswith('D').astype(int)
>>> df
Asset Class Asset Dummy
0 D.1 1
1 D.12 1
2 D.34 1
3 nan 0
4 F.3 0
5 G.12 0
6 D.2 1
7 nan 0
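If the column held real missing values (np.nan) rather than the literal string 'nan', startswith would return NaN for those rows and the astype(int) cast would fail; passing na=False keeps the result boolean. A sketch under that assumption:

```python
import numpy as np
import pandas as pd

# Real NaN instead of the string 'nan'
df = pd.DataFrame({'Asset Class': ['D.1', np.nan, 'F.3']})

# na=False makes startswith return False for missing values,
# so the result stays boolean and the int cast works
df['Asset Dummy'] = df['Asset Class'].str.startswith('D', na=False).astype(int)
print(df['Asset Dummy'].tolist())
```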
I have a df where I want to query while using values from itertuples() from another dataframe:
matching_group = Pandas(Index=0, match='TEST')
df.query("column_a == @matching_group.match")
"matching_group" is coming from df.itertuples() and I can't change that data type. The query above works.
But now I need to cast "matching_group.match" to lowercase.
matching_group = Pandas(Index=0, match='TEST')
df.query("column_a == @matching_group.match.lower()")
This does not work.
It's hard to create a minimal viable example here.
How can I cast a variable used via @ in a df.query() to lowercase?
Your code with named tuples works well for me; one possible reason for it not matching could be trailing whitespace, which you can remove with strip:
df = pd.DataFrame({ 'column_a': ['test', 'tesT', 'No']})
from collections import namedtuple
Pandas = namedtuple('Pandas', 'Index match')
matching_group = Pandas(Index=0, match='TEST')
print (matching_group)
Pandas(Index=0, match='TEST')
df3 = df.query("column_a == @matching_group.match.lower()")
print (df3)
column_a
0 test
df3 = df.query("column_a.str.strip() == @matching_group.match.lower().strip()")
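A self-contained sketch of the strip idea, with made-up whitespace on both sides of the comparison (engine='python' is passed because .str methods inside query require the Python engine):

```python
import pandas as pd
from collections import namedtuple

# Made-up data with stray whitespace on both sides of the comparison
df = pd.DataFrame({'column_a': ['test ', 'tesT', 'No']})
Pandas = namedtuple('Pandas', 'Index match')
matching_group = Pandas(Index=0, match=' TEST')

# .str methods inside query need the Python engine
df3 = df.query("column_a.str.strip() == @matching_group.match.lower().strip()",
               engine='python')
print(df3)
```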
Input Toy Example
df = pd.DataFrame({
'test':['abc', 'DEF'],
'num':[1,2]
})
val='Abc' # variable to be matched
Input df
test num
0 abc 1
1 DEF 2
Code
df.query('test == @val.lower()')
Output
test num
0 abc 1
Tested on pandas version
pd.__version__  # '1.2.4'
I have a dataframe which has a list in one column that I want to convert into a simple string.
id data_words_nostops
26561364 [andrographolide, major, labdane, diterpenoid]
26561979 [dgat, plays, critical, role, hepatic, triglyc]
26562217 [despite, success, imatinib, inhibiting, bcr]
DESIRED OUTPUT
id data_words_nostops
26561364 andrographolide, major, labdane, diterpenoid
26561979 dgat, plays, critical, role, hepatic, triglyc
26562217 despite, success, imatinib, inhibiting, bcr
Try this :
df['data_words_nostops'] = df['data_words_nostops'].apply(lambda row : ','.join(row))
Complete code :
import pandas as pd
l1 = ['26561364', '26561979', '26562217']
l2 = [['andrographolide', 'major', 'labdane', 'diterpenoid'],['dgat', 'plays', 'critical', 'role', 'hepatic', 'triglyc'],['despite', 'success', 'imatinib', 'inhibiting', 'bcr']]
df = pd.DataFrame(list(zip(l1, l2)),
columns =['id', 'data_words_nostops'])
df['data_words_nostops'] = df['data_words_nostops'].apply(lambda row : ','.join(row))
Output :
id data_words_nostops
0 26561364 andrographolide,major,labdane,diterpenoid
1 26561979 dgat,plays,critical,role,hepatic,triglyc
2 26562217 despite,success,imatinib,inhibiting,bcr
df["data_words_nostops"] = df.apply(lambda row: ",".join(row["data_words_nostops"]), axis=1)
You can use pandas' str.join for this:
df["data_words_nostops"] = df["data_words_nostops"].str.join(",")
df
id data_words_nostops
0 26561364 andrographolide,major,labdane,diterpenoid
1 26561979 dgat,plays,critical,role,hepatic,triglyc
2 26562217 despite,success,imatinib,inhibiting,bcr
I tried the following as well
df_ready['data_words_nostops_Joined'] = df_ready.data_words_nostops.apply(', '.join)
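A related gotcha worth hedging: if the frame has been saved to CSV and reloaded, each cell becomes the *string* representation of a list, and str.join would then join that string's characters. Under that assumption, ast.literal_eval can recover the lists first:

```python
import pandas as pd
from ast import literal_eval

# Hypothetical round-trip through CSV: the cell is the string
# "['dgat', 'plays']", not a Python list
df = pd.DataFrame({'data_words_nostops': ["['dgat', 'plays']"]})

# Recover the lists first, then join as before
df['data_words_nostops'] = df['data_words_nostops'].apply(literal_eval).str.join(',')
print(df['data_words_nostops'].tolist())
```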
I am trying to access the attribute Job-Label from a dataframe to do the following
df.Job-Label.iloc[i] = "something"
But this gives an error "can't assign to operator". How can I make this work?
If I write it as
df.iloc[i]["Job-Label"] = "something"
It creates a copy and no changes are made to the original dataframe.
EDIT:
Here is the snippet of code I am using this in
for i in range(len(job_data)):
    x = []
    if len(job_data.iloc[i]["Job-Label"]) > 1:
        for job in job_data.iloc[i]["Job-Label"]:
            if int(job) not in x:
                x.append(int(job))
    job_data.Job-Label.iloc[i] = frozenset(x)
I guess you want to do:
for index in job_data.index.tolist():
    x = []
    if len(job_data.loc[index, "Job-Label"]) > 1:
        for job in job_data.loc[index, "Job-Label"]:
            if int(job) not in x:
                x.append(int(job))
    job_data.loc[index, "Job-Label"] = frozenset(x)
The error is raised because Python interprets the hyphen in the name as the minus operator.
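To make the fix concrete, a toy sketch (the frame here is made up) showing that a single .loc call with the bracketed column name both parses correctly and writes to the original frame rather than a copy:

```python
import pandas as pd

# Toy frame; the column values are made up
df = pd.DataFrame({'Job-Label': ['a', 'b']})

# df.Job-Label parses as (df.Job) - (Label); a single .loc call with the
# bracketed name avoids that and writes to the original frame, not a copy
df.loc[0, 'Job-Label'] = 'something'
print(df)
```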
Probably not what you're searching for, but I would just reformat the column names to be clean:
import pandas as pd
# create data
data_dict = {}
data_dict['my-col-1'] = ['Apple', 'Orange']
data_dict['my-col-2'] = [1.5, 3.24]
data_dict['weight'] = [12.03, 48.]
# create dataframe from dict
df = pd.DataFrame.from_dict(data_dict)
print(df)
# Rename any column whose name contains a hyphen
dict_rename = {}
for col in df.columns.values:
    if '-' in col:
        print("Renaming col %s" % col)
        dict_rename[col] = col.replace("-", "_")
df = df.rename(columns=dict_rename)
print(df)
The code output:
my-col-1 my-col-2 weight
0 Apple 1.50 12.03
1 Orange 3.24 48.00
Renaming col my-col-1
Renaming col my-col-2
my_col_1 my_col_2 weight
0 Apple 1.50 12.03
1 Orange 3.24 48.00
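As a possible shorthand for the loop above, the whole rename can also be done with the vectorised string methods on the columns Index; a sketch with a trimmed-down frame:

```python
import pandas as pd

df = pd.DataFrame({'my-col-1': ['Apple', 'Orange'], 'weight': [12.03, 48.0]})

# Vectorised rename: replace '-' with '_' across the whole columns Index
df.columns = df.columns.str.replace('-', '_')
print(list(df.columns))
```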
Let's say I have a Pandas DataFrame like following.
df = pd.DataFrame({'Name' : ['A','B','C'],
'Country' : ['US','UK','SL']})
Country Name
0 US A
1 UK B
2 SL C
And I have a csv like the following.
Name,Extended
A,Jorge
B,Alex
E,Mark
F,Bindu
I need to check whether each df['Name'] is in the csv and, if so, get the "Extended" value; if not, just keep the "Name". So my expected output is like the following.
Country Name Extended
0 US A Jorge
1 UK B Alex
2 SL C C
The following shows what I tried so far.
f = open('mycsv.csv', 'r')
lines = f.readlines()

def parse(x):
    for line in lines:
        if x in line.split(',')[0]:
            return line.strip().split(',')[1]

df['Extended'] = df['Name'].apply(parse)
Name Country Extended
0 A US Jorge
1 B UK Alex
2 C SL None
I cannot figure out how to get the "Name" for C in "Extended" (the else part in the code). Any help?
You can use the fillna function from pandas like this:
import pandas as pd

df1 = pd.DataFrame({'Name': ['A', 'B', 'C'],
                    'Country': ['US', 'UK', 'SL']})
df2 = pd.read_csv('mycsv.csv')
df_merge = pd.merge(df1, df2, how="left", on="Name")
df_merge["Extended"] = df_merge["Extended"].fillna(df_merge["Name"])
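A self-contained version of this merge-and-fillna approach, using StringIO in place of mycsv.csv so it can be run as-is:

```python
import pandas as pd
from io import StringIO

df1 = pd.DataFrame({'Name': ['A', 'B', 'C'], 'Country': ['US', 'UK', 'SL']})
csv_text = "Name,Extended\nA,Jorge\nB,Alex\nE,Mark\nF,Bindu\n"
df2 = pd.read_csv(StringIO(csv_text))

# Left merge keeps every row of df1; names missing from the csv get NaN
# in Extended, which is then filled from the Name column itself
df_merge = pd.merge(df1, df2, how='left', on='Name')
df_merge['Extended'] = df_merge['Extended'].fillna(df_merge['Name'])
print(df_merge)
```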
You could just load the csv as a df and then assign using where:
df['Name'] = df2['Extended'].where(df2['Name'] != df2['Extended'], df2['Name'])
So here we use the boolean condition to test if 'Name' is not equal to 'Extended' and use that value, otherwise just use 'Name'.
Also, is 'Extended' always either different from or the same as 'Name'? If so, why not just assign the value of Extended to the dataframe:
df['Name'] = df2['Extended']
This would be a lot simpler.