I have imported an Excel file with some data and removed missing values.
df = pd.read_excel (r'file.xlsx', na_values = missing_values)
I'm trying to split the string values into lists for later processing.
df['GENRE'] = df['GENRE'].map(lambda x: x.split(','))
df['ACTORS'] = df['ACTORS'].map(lambda x: x.split(',')[:3])
df['DIRECTOR'] = df['DIRECTOR'].map(lambda x: x.split(','))
But it gives me the following error: AttributeError: 'list' object has no attribute 'split'
I've done the same with a CSV file and it worked. Could it be because it's Excel?
I'm sure it's simple, but I can't get my head around it.
Try using str.split, the Pandas way:
df['GENRE'] = df['GENRE'].str.split(',')
df['ACTORS'] = df['ACTORS'].str.split(',').str[:3]
df['DIRECTOR'] = df['DIRECTOR'].str.split(',')
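As a rough illustration of why the accessor is more forgiving than a lambda, here is a minimal sketch on made-up data (the values below are invented, not taken from the asker's file). It assumes that some cells are missing, which `.str.split` passes through as NaN instead of raising.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the second GENRE cell is missing on purpose.
df = pd.DataFrame({
    "GENRE": ["Action,Drama", np.nan],
    "ACTORS": ["A,B,C,D", "E,F"],
})

# .str.split works element-wise and leaves missing values as NaN.
genres = df["GENRE"].str.split(",")

# .str[:3] slices each resulting list, keeping at most three actors.
actors = df["ACTORS"].str.split(",").str[:3]

print(genres)
print(actors)
```

A lambda calling `x.split(',')` would fail on the missing cell, because NaN is a float and has no `split` method.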
I am working on a script that imports an Excel file, iterates through a column called "Title," and returns False if a certain keyword is present in "Title." The script runs until I get to the part where I want to export another CSV file that gives me a separate column. My error is as follows: AttributeError: 'int' object has no attribute 'lower'
Based on this error, I converted df.Title to a string using df['Title'].astype(str), but I get the same error.
import pandas as pd
data = pd.read_excel(r'C:/Users/Downloads/61_MONDAY_PROCESS_9.16.19.xlsx')
df = pd.DataFrame(data, columns=['Date Added','Track Item', 'Retailer Item ID','UPC','Title','Manufacturer','Brand',
                                 'Client Product Group','Category','Subcategory',
                                 'Amazon Sub Category','Segment','Platform'])
df['Title'].astype(str)
df['Retailer Item ID'].astype(str)
excludes = ['chainsaw','pail','leaf blower','HYOUJIN','brush','dryer','genie','Genuine Joe',
            'backpack','curling iron','dog','cat','wig','animal','dryer',':','tea', 'Adidas', 'Fila',
            'Reebok','Puma','Nike','basket','extension','extensions','batteries','battery','[EXPLICIT]']
my_excludes = [set(x.lower().split()) for x in excludes]
match_titles = [e for e in df.Title.astype(str)
                if any(keywords.issubset(e.lower().split()) for keywords in my_excludes)]
def is_match(title, excludes = my_excludes):
    if any(keywords.issubset(title.lower().split()) for keywords in my_excludes):
        return True
    return False
This is the part that returns the error:
df['match_titles'] = df['Title'].apply(is_match)
result = df[df['match_titles']]['Retailer Item ID']
print(df)
df.to_csv('Asin_List(9.18.19).csv',index=False)
Use the following code to import your file:
data = pd.read_excel(r'C:/Users/Downloads/61_MONDAY_PROCESS_9.16.19.xlsx',
                     dtype='str')
pandas.read_excel accepts an optional dtype parameter.
You can also use it to set different data types for different columns:
e.g. dtype={'Retailer Item ID': int, 'Title': str}
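The same conversion can also be done after loading. The sketch below uses a small made-up frame (the titles are invented) to show one likely reason the asker's `astype(str)` call had no effect: `astype` returns a new Series, so the result has to be assigned back to the column.

```python
import pandas as pd

# Hypothetical data: one title is a number, as can happen when reading Excel.
df = pd.DataFrame({"Title": ["Leaf Blower 3000", 12345]})

# Calling astype without assignment returns a new Series and leaves df unchanged.
df["Title"].astype(str)
still_int = isinstance(df["Title"].iloc[1], int)

# Assigning the result back actually converts the column.
df["Title"] = df["Title"].astype(str)
lowered = df["Title"].str.lower().tolist()

print(still_int)
print(lowered)
```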
At the line where you wrote
match_titles = [e for e in df.Title.astype(str) if any(keywords.issubset(e.lower().split()) for keywords in my_excludes)]
Python can hand you an integer as the variable e instead of the string you want when the column still contains non-string values. Note that df.Title.astype(str) returns a new Series and leaves df itself unchanged unless you assign the result back. If you want to iterate through the column by position, you could try
match_titles = [e for e in df.iloc[:, 4].astype(str) if any(keywords.issubset(e.lower().split()) for keywords in my_excludes)]
df.iloc[:, 4] returns the fifth column of the dataframe df (positions are zero-based), which is the Title column you want. If this doesn't work, try the items() method (iteritems() in older pandas versions).
The main idea is to make sure that what you iterate over is the column's contents, converted to strings.
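Another way to sidestep the error is to coerce each value inside the matching function itself. This is only a sketch under assumptions: the exclude list is shortened and the sample rows are invented for illustration.

```python
import pandas as pd

# Shortened, hypothetical exclude list for illustration.
excludes = ["leaf blower", "dog", "Genuine Joe"]
my_excludes = [set(x.lower().split()) for x in excludes]

def is_match(title, excludes=my_excludes):
    # str() guards against ints or floats that Excel cells may contain.
    words = set(str(title).lower().split())
    return any(keywords.issubset(words) for keywords in excludes)

# Made-up rows; the second title is an integer on purpose.
df = pd.DataFrame({
    "Title": ["Electric Leaf Blower", 12345, "Dog leash"],
    "Retailer Item ID": ["A1", "B2", "C3"],
})

df["match_titles"] = df["Title"].apply(is_match)
result = df.loc[df["match_titles"], "Retailer Item ID"].tolist()
print(result)
```

With this shape, `apply` no longer depends on the column having been converted beforehand.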
I have over 8 million rows of text where I want to remove all stop words and also lemmatize the text using dask.map_partitions() but get the following error:
AttributeError: 'Series' object has no attribute 'split'
Is there any way to apply the function to the dataset?
Thanks for the help.
import pandas as pd
import dask.dataframe as dd
from spacy.lang.en import stop_words
cachedStopWords = list(stop_words.STOP_WORDS)
def stopwords_lemmatizing(text):
    return [word for word in text.split() if word not in cachedStopWords]
text = 'any length of text'
data = [{'content': text}]
df = pd.DataFrame(data, index=[0])
ddf = dd.from_pandas(df, npartitions=1)
ddf['content'] = ddf['content'].map_partitions(stopwords_lemmatizing, meta='f8')
map_partitions, as the name suggests, works on each partition of your overall dask dataframe, each of which is a pandas dataframe ( http://docs.dask.org/en/latest/dataframe.html#design ). Your function works value-by-value on a series, so what you actually wanted was the simple map:
ddf['content'] = ddf['content'].map(stopwords_lemmatizing)
(if you want to provide the meta here, it should be a zero-length Series rather than dataframe, e.g., meta=pd.Series(dtype='O')).
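The difference can be reproduced without dask. In the sketch below a plain pandas Series stands in for one partition (an assumption made to keep the example small), and the stop-word set is a tiny made-up one rather than spaCy's list.

```python
import pandas as pd

# Tiny hypothetical stop-word set, standing in for spaCy's STOP_WORDS.
stop_words = {"any", "of"}

def remove_stopwords(text):
    return [word for word in text.split() if word not in stop_words]

# A pandas Series plays the role of a single dask partition here.
partition = pd.Series(["any length of text"])

# Passing the whole partition mimics what map_partitions does.
try:
    remove_stopwords(partition)
    error = None
except AttributeError as exc:
    error = str(exc)

# map calls the function once per value, which is what the function expects.
mapped = partition.map(remove_stopwords)

print(error)
print(mapped.iloc[0])
```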
I'm attempting to convert a pipelinedRDD in pyspark to a dataframe. This is the code snippet:
newRDD = rdd.map(lambda row: Row(row.__fields__ + ["tag"])(row + (tagScripts(row), )))
df = newRDD.toDF()
When I run the code though, I receive this error:
'list' object has no attribute 'encode'
I've tried multiple other combinations, such as converting it to a Pandas dataframe using:
newRDD = rdd.map(lambda row: Row(row.__fields__ + ["tag"])(row + (tagScripts(row), )))
df = newRDD.toPandas()
But then I end up receiving this error:
AttributeError: 'PipelinedRDD' object has no attribute 'toPandas'
Any help would be greatly appreciated. Thank you for your time.
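rdd.toDF() only becomes available on an RDD once a SparkSession has been created, and toPandas() is a DataFrame method, not an RDD method.
To fix your code, try the following:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.textFile(...)
newRDD = rdd.map(...)
df = newRDD.toDF()
pandas_df = df.toPandas()  # toPandas() is called on the DataFrame, not on the RDD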
Here is my Python code, which throws an error when executed.
def split_cell(s):
    a = s.split(".")
    b = a[1].split("::=")
    return (a[0].lower(),b[0].lower(),b[1].lower())
logic_tbl,logic_col,logic_value = split_cell(rules['logic_1'][ith_rule])
mems = logic_tbl[logic_tbl[logic_col]==logic_value]['mbr_id'].tolist()
The split_cell function works fine, and all the columns in logic_tbl are of object dtype.
Here is the traceback:
Got this corrected!
logic_tbl contains the name of a pandas dataframe.
logic_col contains the name of a column in that dataframe.
logic_value contains the value to match in the logic_col column of the logic_tbl dataframe.
mems = logic_tbl[logic_tbl[logic_col]==logic_value]['mbr_id'].tolist()
I was trying it as above, but Python treats logic_tbl as a string, so no dataframe-level operations are performed on it.
So I created a dictionary like this:
dt_dict={}
dt_dict['a_med_clm_diag'] = a_med_clm_diag
And modified my code as below,
mems = dt_dict[logic_tbl][dt_dict[logic_tbl][logic_col]==logic_value]['mbr_id'].tolist()
This works as expected. I came to this idea when I wrote:
mems = logic_tbl[logic_tbl[logic_col]==logic_value,'mbr_id']
And this threw a message like: "'logic_tbl' is a string Nothing to filter".
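The dictionary approach can be sketched end to end as follows. The column name `diag_cd`, the sample rows, and the rule string are all invented for illustration; only the dataframe name `a_med_clm_diag`, the `mbr_id` column, and `split_cell` come from the post.

```python
import pandas as pd

# Hypothetical contents; only the names a_med_clm_diag and mbr_id are from the post.
a_med_clm_diag = pd.DataFrame({
    "diag_cd": ["e11", "i10", "e11"],
    "mbr_id": [101, 102, 103],
})

# Map each table name (a string) to the actual dataframe object.
dt_dict = {"a_med_clm_diag": a_med_clm_diag}

def split_cell(s):
    a = s.split(".")
    b = a[1].split("::=")
    return (a[0].lower(), b[0].lower(), b[1].lower())

# Made-up rule string in the "table.column::=value" shape split_cell expects.
logic_tbl, logic_col, logic_value = split_cell("A_MED_CLM_DIAG.DIAG_CD::=E11")

# Look the dataframe up by name, then filter it as usual.
table = dt_dict[logic_tbl]
mems = table.loc[table[logic_col] == logic_value, "mbr_id"].tolist()
print(mems)
```

Looking the table up once and keeping it in a local variable also avoids repeating `dt_dict[logic_tbl]` inside the filter expression.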
Try writing that last statement like the code below:
filt = numpy.array([a == logic_value for a in logic_col])
mems = [i for indx, i in enumerate(logic_col) if filt[indx]]
Does this work?
I am trying to do change data capture on two dataframes. The logic is to merge the two dataframes, group by the keys, and then loop over the groups with count > 1 to see which column was updated. I am getting a strange error. Any help is appreciated.
code
import pandas as pd
import numpy as np
pd.set_option('display.height', 1000)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
print("reading wolverine xlxs")
# defining metadata
df_header = ['DisplayName','StoreLanguage','Territory','WorkType','EntryType','TitleInternalAlias',
'TitleDisplayUnlimited','LocalizationType','LicenseType','LicenseRightsDescription',
'FormatProfile','Start','End','PriceType','PriceValue','SRP','Description',
'OtherTerms','OtherInstructions','ContentID','ProductID','EncodeID','AvailID',
'Metadata', 'AltID', 'SuppressionLiftDate','SpecialPreOrderFulfillDate','ReleaseYear','ReleaseHistoryOriginal','ReleaseHistoryPhysicalHV',
'ExceptionFlag','RatingSystem','RatingValue','RatingReason','RentalDuration','WatchDuration','CaptionIncluded','CaptionExemption','Any','ContractID',
'ServiceProvider','TotalRunTime','HoldbackLanguage','HoldbackExclusionLanguage']
df_w01 = pd.read_excel("wolverine_1.xlsx", names = df_header)
df_w02 = pd.read_excel("wolverine_2.xlsx", names = df_header)
df_w01['version'] = 'OLD'
df_w02['version'] = 'NEW'
#print(df_w01)
df_m_d = pd.concat([df_w01, df_w02], ignore_index = True)
first_pass = df_m_d[df_m_d.duplicated(['StoreLanguage','Territory','TitleInternalAlias','LocalizationType','LicenseType','FormatProfile'], keep=False)]
first_pass_keep_duplicate = df_m_d[df_m_d.duplicated(['StoreLanguage','Territory','TitleInternalAlias','LocalizationType','LicenseType','FormatProfile'], keep='first')]
group_by_1 = first_pass.groupby(['StoreLanguage','Territory','TitleInternalAlias','LocalizationType','LicenseType','FormatProfile'])
for i,rows in group_by_1.iterrows():
    print("rownumber", i)
    print (rows)
print(first_pass)
And the error I get:
AttributeError: Cannot access callable attribute 'iterrows' of 'DataFrameGroupBy' objects, try using the 'apply' method
Any help is much appreciated.
Your GroupBy object supports iteration, so instead of
for i,rows in group_by_1.iterrows():
    print("rownumber", i)
    print (rows)
you need to do something like
for name, group in group_by_1:
    print(name)
    print(group)
then you can do what you need to do with each group
See the docs
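For the change-data-capture goal, iterating over the groups might look like the sketch below. It uses a made-up two-column frame with a single grouping key (the real data has many more columns and keys) and assumes each group has at most one OLD and one NEW row.

```python
import pandas as pd

# Hypothetical, heavily reduced version of the concatenated OLD/NEW frame.
df = pd.DataFrame({
    "Territory": ["US", "US", "FR", "FR"],
    "PriceValue": [9.99, 12.99, 7.99, 7.99],
    "version": ["OLD", "NEW", "OLD", "NEW"],
})

changed = {}
for name, group in df.groupby("Territory"):
    # Only groups with both versions present can be compared.
    if len(group) > 1:
        old = group[group["version"] == "OLD"].iloc[0]
        new = group[group["version"] == "NEW"].iloc[0]
        # Collect the columns whose value differs between OLD and NEW.
        changed[name] = [
            col for col in group.columns
            if col != "version" and old[col] != new[col]
        ]

print(changed)
```

Note that a plain `!=` comparison treats two missing values as different, so real data with NaNs would need extra handling.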
Why not do as suggested and use apply? Something like:
def print_rows(rows):
    print(rows)
group_by_1.apply(print_rows)