I am trying to write a Python function that checks whether the data in a dataframe follows a certain structure.
In my case, I need to ensure that the id column is structured like this: ID0101-10
Here is my code, but it is not working; I keep getting an indexing error:
i = 0
for i in df["id"]:
    if ('-' in df["id"]):
        df["id"].iloc[i] = df["id"].iloc[i]
        i += 1
    else:
        df.drop(df["id"].iloc[i])
        i += 1
If you're curious about my data, it looks like this:
id name
ID0101-10 John
ID0101-11 Mary
8454 Test
MMMM MMMM
ID0101-01 Ben
MN87876 00.00
I am trying to clean my data by dropping the dummy values.
EDIT: I get this error:
TypeError: Cannot index by location index with a non-integer key
Any help is appreciated, thanks.
If I understand correctly, you can do this:
import pandas as pd
df = pd.DataFrame({'id':['ID0101-10', 'ID0101-11', '8454', 'MMMM', 'ID0101-01', 'MN87876'],
'name':['John', 'Mary', 'Test', 'MMMM', 'Ben', '00.00']})
result = df[df['id'].str.startswith('ID0101-')]
print(result)
Output:
id name
0 ID0101-10 John
1 ID0101-11 Mary
4 ID0101-01 Ben
As a general rule, you rarely need to loop over pandas dataframes; it's almost always faster to use native pandas functions.
For more complex matches you can use regular expressions: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.match.html
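For the stricter check mentioned above, here is a minimal sketch using a regular expression. It assumes the valid format is "ID", four digits, a hyphen, and two digits; that pattern is inferred from the sample ids and may need adjusting for the real data.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['ID0101-10', 'ID0101-11', '8454', 'MMMM', 'ID0101-01', 'MN87876'],
    'name': ['John', 'Mary', 'Test', 'MMMM', 'Ben', '00.00'],
})

# Assumed format: "ID", four digits, a hyphen, two digits (e.g. ID0101-10).
pattern = r'ID\d{4}-\d{2}'

# fullmatch requires the whole string to fit the pattern;
# na=False treats missing ids as non-matching.
mask = df['id'].str.fullmatch(pattern, na=False)
clean = df[mask]
print(clean)
```

Unlike startswith, this also rejects ids that begin correctly but carry trailing junk.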
I have dataframe df_my that looks like this
id name age major
----------------------------------------
0 1 Mark 34 English
1 2 Tom 55 Art
2 3 Peter 31 Science
3 4 Mohammad 23 Math
4 5 Mike 47 Art
...
I am trying to get the value of major (only)
I used this and it works fine when I know the id of the record
df_my["major"][3]
returns
"Math"
great
but I want to get the major for a variable record
I used
i = 3
df_my.loc[df_my["id"]==i]["major"]
and also used
i = 3
df_my[df_my["id"]==i]["major"]
but they both return
3 Math
The result includes the record index too.
How can I get only the major and nothing else?
You could use squeeze:
i = 3
out = df_my.loc[df_my['id']==i,'major'].squeeze()
Another option is iat:
out = df_my.loc[df_my['id']==i,'major'].iat[0]
Output:
'Science'
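Both options above assume the condition matches exactly one row. As a rough sketch of how the edge cases could be handled, the example below rebuilds the sample frame from the question and uses Series.item(), which raises a ValueError unless exactly one element is present. The id 99 and the None fallback are illustrative assumptions.

```python
import pandas as pd

df_my = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'name': ['Mark', 'Tom', 'Peter', 'Mohammad', 'Mike'],
    'age': [34, 55, 31, 23, 47],
    'major': ['English', 'Art', 'Science', 'Math', 'Art'],
})

i = 3
matches = df_my.loc[df_my['id'] == i, 'major']

# .item() raises ValueError unless exactly one row matched,
# which guards against missing or duplicated ids.
major = matches.item()

# For an id that may be absent (99 is a made-up example), fall back to a default.
missing = df_my.loc[df_my['id'] == 99, 'major']
fallback = missing.iat[0] if not missing.empty else None
```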
I also stumbled over this problem, from a little different angle:
df = pd.DataFrame({'First Name': ['Kumar'],
'Last Name': ['Ram'],
'Country': ['India'],
'num_var': 1})
>>> df.loc[(df['First Name'] == 'Kumar'), "num_var"]
0 1
Name: num_var, dtype: int64
>>> type(df.loc[(df['First Name'] == 'Kumar'), "num_var"])
<class 'pandas.core.series.Series'>
So it returns a Series (albeit one with only a single element). If you access the value through the index, you receive the integer.
df.loc[0, "num_var"]
1
type(df.loc[0, "num_var"])
<class 'numpy.int64'>
The answer on how to select the respective single value was already given above. However, I think it is interesting to note that accessing through a single index label gives the single value (assuming the index is unique), whereas accessing through a condition returns a Series. This is because a single index label refers to exactly one value, whereas a condition can match several rows.
If one of the columns of your dataframe is the natural primary index for those data, then it's usually a good idea to make pandas aware of it by setting the index accordingly:
df_my.set_index('id', inplace=True)
Now you can easily get just the major value for any id value i:
df_my.loc[i, 'major']
Note that for i = 3, the output is 'Science', which is expected, as noted in the comments to your question above.
Here is the dataset:
df = pd.read_csv('https://data.lacity.org/api/views/d5tf-ez2w/rows.csv?accessType=DOWNLOAD')
The problem:
I have a pandas dataframe of traffic accidents in Los Angeles.
Each accident has a column of mo_codes which is a string of numerical codes (which I converted into a list of codes). Here is a screenshot:
I also have a dictionary of mo_codes description for each respective mo_code and loaded in the notebook.
Now, using the code below I can combine the numeric code with the description:
mo_code_list_final = []
for i in range(20):
    for j in df.mo_codes.iloc[i]:
        print(i, mo_code_dict[j])
So, I haven't added this as a column to Pandas yet. I wanted to ask if there is a better way to solve the problem I have which is, how best to add the textual description in pandas as a column.
Also, is there an easier way to process this with a pandas function like .assign instead of the for loop. Maybe a list comprehension to process the mo_codes into a new dataframe with the description?
Thanks in advance.
ps. if there is a technical word for this type of problem, pls let me know.
import pandas
codes = {0:'Test1',1:'test 2',2:'test 3',3:'test 4'}
df1 = pandas.DataFrame([["red",[0,1,2],5],["blue",[3,1],6]],columns=[0,'codes',2])
# first explode the list into its own rows
df2 = df1['codes'].apply(pandas.Series).stack().astype(int).reset_index(level=1, drop=True).to_frame('codes').join(df1[[0,2]])
#now use map to apply the text descriptions
df2['desc'] = df2['codes'].map(codes)
print(df2)
"""
codes 0 2 desc
0 0 red 5 Test1
0 1 red 5 test 2
0 2 red 5 test 3
1 3 blue 6 test 4
1 1 blue 6 test 2
"""
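In recent pandas versions (0.25 and later), the "explode the list into its own rows" step can also be written with DataFrame.explode. The sketch below reuses the toy data from above; the column names colour and value are placeholders chosen for readability, not names from the original dataset.

```python
import pandas as pd

codes = {0: 'Test1', 1: 'test 2', 2: 'test 3', 3: 'test 4'}
df1 = pd.DataFrame([['red', [0, 1, 2], 5], ['blue', [3, 1], 6]],
                   columns=['colour', 'codes', 'value'])

# One row per code; the original row index is repeated for each list element.
long = df1.explode('codes')
long['desc'] = long['codes'].map(codes)

# Optionally collapse back to one row per original record,
# holding a list of descriptions.
df1['desc'] = long.groupby(level=0)['desc'].agg(list)
print(df1)
```

The long form is usually easier to filter and count; the collapsed form keeps the original row count.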
I figured out how to finally do this. However, I found the answer in Javascript but the same concept applies.
You simply create a dictionary of mocodes and its string value.
export const mocodesDict = {
"0100": "Suspect Impersonate",
"0101": "Aid victim",
"0102": "Blind",
"0103": "Crippled",
...
}
After that, it's as simple as doing this
mocodesDict[item]
where item is the code you want to convert.
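The same dictionary lookup carries over to Python. Here is a minimal sketch that adds the descriptions as a new column; the small mo_code_dict and the three-row frame are made-up stand-ins for the real lookup table and data, and the 'Unknown' fallback is an assumption.

```python
import pandas as pd

# Hypothetical stand-ins for the real lookup table and dataset.
mo_code_dict = {'0100': 'Suspect Impersonate', '0101': 'Aid victim', '0102': 'Blind'}
df = pd.DataFrame({'mo_codes': [['0100', '0102'], ['0101'], []]})

# Translate every code in each row's list; .get returns a fallback
# for unknown codes instead of raising KeyError.
df['mo_desc'] = df['mo_codes'].apply(
    lambda codes: [mo_code_dict.get(code, 'Unknown') for code in codes]
)
print(df)
```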
Apologies for the messy title: Problem as follows:
I have some data frame of the form:
df1 =
Entries
0 "A Level"
1 "GCSE"
2 "BSC"
I also have a data frame of the form:
df2 =
Secondary Undergrad
0 "A Level" "BSC"
1 "GCSE" "BA"
2 "AS Level" "MSc"
I have a function which searches each entry in df1, looking for the words in each column of df2. The words that match are saved (Words_Present):
def word_search(df,group,words):
    Yes, No = 0,0
    Words_Present = []
    for i in words:
        match_object = re.search(i,df)
        if match_object:
            Words_Present.append(i)
            Yes = 1
        else:
            No = 0
    if Yes == 1:
        Attribute = 1
    return Attribute
I apply this function over all entries in df1, and all columns in df2, using the following iteration:
for i in df2:
    terms = df2[i].values.tolist()
    df1[i] = df1['course'][0:1].apply(lambda x: word_search(x,i,terms))
This yields an output df which looks something like:
df1 =
Entries Secondary undergrad
0 "A Level" 1 0
1 "GCSE" 1 0
2 "AS Level" 1 0
I want to amend the word_search function to output the Words_Present list as well as the Attribute, and put these into a new column, so that my eventual df1 looks like:
Desired dataframe:
Entries Secondary Words Found undergrad Words Found
0 "A Level" 1 "A Level" 0
1 "GCSE" 1 "GCSE" 0
2 "AS Level" 1 "AS Level" 0
If I do:
def word_search(df,group,words):
    Yes, No = 0,0
    Words_Present = []
    for i in words:
        match_object = re.search(i,df)
        if match_object:
            Words_Present.append(i)
            Yes = 1
        else:
            No = 0
    if Yes == 1:
        Attribute = 1
    if Yes == 0:
        Attribute = 0
    return Attribute,Words_Present
My function therefore now has multiple outputs. So applying the following:
for i in df2:
    terms = df2[i].values.tolist()
    df1[i] = df1['course'][0:1].apply(lambda x: word_search(x,i,terms))
My Output Looks like this:
Entries Secondary undergrad
0 "A Level" [1,"A Level"] 0
1 "GCSE" [1, "GCSE"] 0
2 "AS Level" [1, "AS Level"] 0
The output of pd.apply() is always a pandas series, so it just shoves everything into the single cell of df[i] where i = secondary.
Is it possible to split the output of .apply into two separate columns, as shown in the desired dataframe?
I have consulted many questions, but none seem to deal directly with yielding multiple columns when the function contained within the apply statement has multiple outputs:
Applying function with multiple arguments to create a new pandas column
Create multiple columns in Pandas Dataframe from one function
Apply pandas function to column to create multiple new columns?
For example, I have also tried:
for i in df2:
    terms = df2[i].values.tolist()
    [df1[i],df1[i]+"Present"] = pd.concat([df1['course'][0:1].apply(lambda x: word_search(x,i,terms))])
but this simply yields errors such as:
raise ValueError('Length of values does not match length of ' 'index')
Is there a way to use apply, but still extract the extra information directly into multiple columns?
Many thanks, apologies for the length.
The direct answer to your question is yes: use the apply method of the DataFrame object, so you'd be doing df1.apply().
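One common way to spread a function's two return values over two columns is to have the applied function return a pd.Series, which pandas expands into a DataFrame. The sketch below is an illustration under assumptions: word_search is simplified to two arguments, the matched words are joined into one string, and the _words_found column suffix is a made-up naming choice.

```python
import re
import pandas as pd

df1 = pd.DataFrame({'Entries': ['A Level', 'GCSE', 'BSC']})
df2 = pd.DataFrame({'Secondary': ['A Level', 'GCSE', 'AS Level'],
                    'Undergrad': ['BSC', 'BA', 'MSc']})

def word_search(text, words):
    """Return (flag, matches): flag is 1 if any word occurs in text."""
    found = [w for w in words if re.search(re.escape(w), text)]
    return int(bool(found)), ', '.join(found)

for col in df2:
    terms = df2[col].tolist()
    # Returning a Series from the applied function makes apply()
    # produce a DataFrame with one column per Series label.
    result = df1['Entries'].apply(
        lambda x: pd.Series(word_search(x, terms),
                            index=[col, col + '_words_found'])
    )
    df1 = df1.join(result)

print(df1)
```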
However, for this problem, and anything in pandas in general, try to vectorise rather than iterate through rows -- it's faster and cleaner.
It looks like you are trying to classify Entries into Secondary or Undergrad, and saving the keyword used to make the match. If you assume that each element of Entries has no more than one keyword match (i.e. you won't run into 'GCSE A Level'), you can do the following:
df = df1.copy()
df['secondary_words_found'] = df.Entries.str.extract('(A Level|GCSE|AS Level)')
df['undergrad_words_found'] = df.Entries.str.extract('(BSC|BA|MSc)')
df['secondary'] = df.secondary_words_found.notnull() * 1
df['undergrad'] = df.undergrad_words_found.notnull() * 1
EDIT:
In response to your issue with having many more categories and keywords, you can continue in the spirit of this solution by using an appropriate for loop and doing '(' + '|'.join(df2['Undergrad'].values) + ')' inside the extract method.
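As a sketch of the loop described in the edit above, the pattern can be built from each column of df2. The sample frames are rebuilt from the question; re.escape and expand=False are additions not in the original answer, used so that keywords are matched literally and extract returns a Series.

```python
import re
import pandas as pd

df1 = pd.DataFrame({'Entries': ['A Level', 'GCSE', 'BSC']})
df2 = pd.DataFrame({'Secondary': ['A Level', 'GCSE', 'AS Level'],
                    'Undergrad': ['BSC', 'BA', 'MSc']})

df = df1.copy()
for category in df2.columns:
    # Escape each keyword so regex metacharacters are matched literally.
    pattern = '(' + '|'.join(re.escape(word) for word in df2[category]) + ')'
    # With a single capture group and expand=False, extract returns a Series.
    found = df['Entries'].str.extract(pattern, expand=False)
    df[category.lower() + '_words_found'] = found
    df[category.lower()] = found.notna().astype(int)

print(df)
```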
However, if you have exact matches, you can do everything by a combination of pivots and joins:
keywords = df2.stack().to_frame('Entries').reset_index().drop('level_0', axis = 1).rename(columns={'level_1':'category'})
df = df1.merge(keywords, how = 'left')
for colname in df.category.dropna().unique():
    df[colname] = (df.category == colname) * 1 # Your indicator variable
    df.loc[df.category == colname, colname + '_words_found'] = df.loc[df.category == colname, 'Entries']
The first line 'pivots' your table of keywords into a 2-column dataframe of keywords and categories. Your keyword column must be the same as the column in df1; in SQL, this would be called the foreign key that you are going to join these tables on.
Also, you generally want to avoid having duplicate indexes or columns, which in your case, was Words Found in the desired dataframe!
For the sake of completeness, if you insisted on using the apply method, you would iterate over each row of the DataFrame; your code would look something like this:
secondary_words = df2.Secondary.values
undergrad_words = df2.Undergrad.values
# classify is a placeholder name; s is one row of df1, so s.Entries is a
# plain string and membership is tested with `in` rather than .isin
def classify(s):
    if s.Entries in secondary_words:
        return pd.Series({'Entries':s.Entries, 'Secondary':1, 'secondary_words_found':s.Entries, 'Undergrad':0, 'undergrad_words_found':''})
    elif s.Entries in undergrad_words:
        return pd.Series({'Entries':s.Entries, 'Secondary':0, 'secondary_words_found':'', 'Undergrad':1, 'undergrad_words_found':s.Entries})
    else:
        return pd.Series({'Entries':s.Entries, 'Secondary':0, 'secondary_words_found':'', 'Undergrad':0, 'undergrad_words_found':''})
df = df1.apply(classify, axis=1)
This second version will only work in the cases you want it to if the element in Entries is exactly the same as its corresponding element in df2. You certainly don't want to do this, as it's messier, and will be noticeably slower if you have a lot of data to work with.
I have a huge dataframe with a lot of dates. I want to apply a function with multiple arguments to a set of those columns in order to create a new one in this dataframe.
The function I have is the following (it works correctly):
def func(*args):
    count=0
    for i in args:
        if i=="Cool":
            count+=1
    return count
I create a new column in my dataframe applying this function to a set of columns:
dates=["2000","2001","2002","2003","2004","2005","2006","2007","2009",]
df["new_Column"]=df.apply(lambda row : func(row[date] for date in dates), axis = 1)
However, after execution my new_Column is constantly equal to zero. The problem comes from the last line for sure. Any ideas?
It's because you're passing a generator object as the only argument to func. Since a generator object is not 'Cool', you get 0.
As others noticed, your question is not complete. But as far as I can tell, you have a dataframe which looks like this
import pandas as pd
df = pd.DataFrame({'2000': ['Cool', 'yay', 'nope'], '2001': ['ugly', 'cool', 'nice']})
So you can rewrite your func
def func(lst):
    count=0
    for i in lst:
        if i=="Cool":
            count+=1
    return count
And create new column with list constructor
df["new_Column"]=df.apply(lambda row : func(list(row[date] for date in ['2000', '2001'])), axis = 1)
and receive
2000 2001 new_Column
0 Cool ugly 1
1 yay cool 0
2 nope nice 0
If this is the case, there is a pure pandas solution
df['new_Column2']=df[df.isin(['Cool'])].count(axis=1)
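The generator issue can also be fixed without changing the original variadic func, by unpacking the generator with a star at the call site. This is a small sketch on the same toy frame; the dates list is shortened to the two columns that exist there, and new_Column2 shows one possible vectorised variant restricted to those columns.

```python
import pandas as pd

def func(*args):
    count = 0
    for i in args:
        if i == "Cool":
            count += 1
    return count

df = pd.DataFrame({'2000': ['Cool', 'yay', 'nope'], '2001': ['ugly', 'cool', 'nice']})
dates = ['2000', '2001']

# Without the star, func receives a single generator object;
# with it, every cell value becomes its own positional argument.
df['new_Column'] = df.apply(lambda row: func(*(row[date] for date in dates)), axis=1)

# Vectorised variant limited to the date columns (case-sensitive comparison).
df['new_Column2'] = df[dates].eq('Cool').sum(axis=1)
print(df)
```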
My data analysis repeatedly falls back on a simple but iffy motif, namely "groupby everything except". Take this multi-index example, df:
accuracy velocity
name condition trial
john a 1 -1.403105 0.419850
2 -0.879487 0.141615
b 1 0.880945 1.951347
2 0.103741 0.015548
hans a 1 1.425816 2.556959
2 -0.117703 0.595807
b 1 -1.136137 0.001417
2 0.082444 -1.184703
What I want to do now, for instance, is averaging over all available trials while retaining info about names and conditions. This is easily achieved:
average = df.groupby(level=('name', 'condition')).mean()
Under real-world conditions, however, there's a lot more metadata stored in the multi-index. The index easily spans 8-10 columns per row. So the pattern above becomes quite unwieldy. Ultimately, I'm looking for a "discard" operation; I want to perform an operation that throws out or reduces a single index column. In the case above, that's trial number.
Should I just bite the bullet or is there a more idiomatic way of going about this? This might well be an anti-pattern! I want to build a decent intuition when it comes to the "true pandas way"... Thanks in advance.
You could define a helper-function for this:
def allbut(*names):
    names = set(names)
    return [item for item in levels if item not in names]
Demo:
import numpy as np
import pandas as pd
levels = ('name', 'condition', 'trial')
names = ('john', 'hans')
conditions = list('ab')
trials = range(1, 3)
idx = pd.MultiIndex.from_product(
    [names, conditions, trials], names=levels)
df = pd.DataFrame(np.random.randn(len(idx), 2),
                  index=idx, columns=('accuracy', 'velocity'))
def allbut(*names):
    names = set(names)
    return [item for item in levels if item not in names]
In [40]: df.groupby(level=allbut('condition')).mean()
Out[40]:
accuracy velocity
trial name
1 hans 0.086303 0.131395
john 0.454824 -0.259495
2 hans -0.234961 -0.626495
john 0.614730 -0.144183
You can remove more than one level too:
In [53]: df.groupby(level=allbut('name', 'trial')).mean()
Out[53]:
accuracy velocity
condition
a -0.597178 -0.370377
b -0.126996 -0.037003
In the documentation of groupby, there is an example of how to group by all but one specified column of a multiindex. It uses the .difference method of the index names:
df.groupby(level=df.index.names.difference(['name']))
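To make that concrete, here is a short sketch that averages over trial while keeping every other level. The frame mirrors the structure from the question but is filled with deterministic values (an assumption, chosen so the result is easy to check by hand).

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product(
    [('john', 'hans'), ('a', 'b'), (1, 2)],
    names=('name', 'condition', 'trial'))
df = pd.DataFrame(np.arange(16, dtype=float).reshape(8, 2),
                  index=idx, columns=('accuracy', 'velocity'))

# Every index level except 'trial', in the original order.
keep = df.index.names.difference(['trial'])

# Averaging over the discarded level leaves one row per (name, condition).
average = df.groupby(level=keep).mean()
print(average)
```

Because the level list is derived from the index itself, the same line keeps working when more metadata levels are added.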