I am wondering how to pass pandas data frame column values into a regular expression. I have tried the below but get "TypeError: 'Series' objects are mutable, thus they cannot be hashed"
I'm after the result below. (I could just use a different regex, but I was wondering how this might be done dynamically.)
Thoughts appreciated :)
to_search   search_string  search_results
ABC-T3-123  ABC            ABC-T3
ABC-T2-123  ABC            ABC-T2
DEF-T1-123  DEF            DEF-T1
import pandas as pd
# create list for data frame
data = [['ABC-T3-123', 'ABC'], ['ABC-T2-123', 'ABC'], ['DEF-T1-123', 'DEF']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['to_search', 'search_string'])
df['search_results']=df['to_search'].str.extract("(" + df['search_string'] + "-T[0-9])")
I know that you want an efficient solution, but pandas string methods such as str.extract expect a single pattern and do not accept a Series of patterns. Here is an apply-based solution, which I think is the only viable approach here, short of simplifying the regular expression:
import re
searched = df.apply(lambda row: re.search("(" + row['search_string'] + "-T[0-9])", row['to_search']).group(1), axis=1)
Output:
>>> searched
0 ABC-T3
1 ABC-T2
2 DEF-T1
dtype: object
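The row-wise idea above can also be written as a list comprehension. This is a minimal sketch, not the answerer's code: the helper name extract_prefix is hypothetical, and the use of re.escape and the None fallback for rows without a match are assumptions added for safety.

```python
import re

import pandas as pd

df = pd.DataFrame(
    [["ABC-T3-123", "ABC"], ["ABC-T2-123", "ABC"], ["DEF-T1-123", "DEF"]],
    columns=["to_search", "search_string"],
)


def extract_prefix(text, prefix):
    """Return '<prefix>-T<digit>' if it occurs in text, else None."""
    # re.escape keeps special characters in the prefix from being read as regex syntax.
    match = re.search("(" + re.escape(prefix) + r"-T[0-9])", text)
    return match.group(1) if match else None


# Build one pattern per row by pairing the two columns.
df["search_results"] = [
    extract_prefix(text, prefix)
    for text, prefix in zip(df["to_search"], df["search_string"])
]
```

Unlike calling .group(1) directly, this version leaves None in rows where the pattern is not found instead of raising an AttributeError.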
I am quite new to Python programming.
I am working with the following dataframe:
Before
Note that in column "FBgn", there is a mix of FBgn and FBtr string values. I would like to replace the FBtr-containing values with FBgn values provided in the adjacent column called "## FlyBase_FBgn". However, I want to keep the FBgn values in column "FBgn". Maybe keep in mind that I am showing only a portion of the dataframe (reality: 1432 rows). How would I do that? I tried the replace() method from Pandas, but it did not work.
This is actually what I would like to have:
After
Thanks a lot!
With Pandas, you could try:
df.loc[df["FBgn"].str.contains("FBtr"), "FBgn"] = df["## FlyBase_FBgn"]
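As a small illustration of the boolean-mask assignment above, here is a sketch on made-up data. The identifier values are invented for the example, and the variable name is_transcript is hypothetical; only the column names come from the question.

```python
import pandas as pd

# Toy data; the real dataframe has 1432 rows.
df = pd.DataFrame({
    "FBgn": ["FBtr0001", "FBgn0002", "FBtr0003"],
    "## FlyBase_FBgn": ["FBgn1111", "FBgn0002", "FBgn3333"],
})

# True for rows whose "FBgn" value is actually an FBtr identifier.
is_transcript = df["FBgn"].str.startswith("FBtr")

# Overwrite only those rows; rows that already hold FBgn values are untouched.
df.loc[is_transcript, "FBgn"] = df.loc[is_transcript, "## FlyBase_FBgn"]
```

The assignment aligns on the index, so selecting the same rows on the right-hand side is optional but makes the intent explicit.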
Welcome to Stack Overflow. Next time, please provide more information, including your code; it is always helpful.
Please see the code below. I think you need something similar:
import pandas as pd
#ignore the dict1, I just wanted to recreate your df
dict1= {"FBgn": ['FBtr389394949', 'FBgn3093840', 'FBtr000025'], "FBtr": ['FBgn546466646', '', 'FBgn15565555']}
df = pd.DataFrame(dict1) #recreating your dataframe
#print df
print(df)
#function to replace the values
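def replace_values(df):
    # walk over every row by its (default integer) index label
    for i in range(len(df)):
        if 'tr' in df.loc[i, 'FBgn']:
            # .loc avoids chained assignment, which may not modify df
            df.loc[i, 'FBgn'] = df.loc[i, 'FBtr']
    return df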
df = replace_values(df)
#print new df
print(df)
I have a Pandas dataframe with a lot of columns looking like p_d_d_c0, p_d_d_c1, ... p_d_d_g1, p_d_d_g2, ....
df =
a b c p_d_d_c0 p_d_d_c1 p_d_d_c2 ... p_d_d_g0 p_d_d_g1 ...
All the columns that conform to the regex need to be selected, and their datatypes need to be changed from object to float. In particular, the columns that look like p_d_d_c* and p_d_d_g* are all object types, and I would like to change them to float types. Is there a way to select columns in bulk using a regular expression and change them to float types?
I tried the answer from here, but it takes a lot of time and memory as I have hundreds of these columns.
df.filter(regex=("p_d_d_.*"))
I also tried:
df.select(lambda col: col.startswith('p_d_d_g'), axis=1)
But, it gives an error:
AttributeError: 'DataFrame' object has no attribute 'select'
My Pandas version is 1.0.1
So, how to select columns in bulk and change their data types using regex?
Try this:
import pandas as pd
# sample dataframe
df = pd.DataFrame(data={"co1":[1,2,3,4], "co22":[4,3,2,1], "co3":[2,3,2,4], "abc":[5,4,3,2]})
# select all columns which have co in it
floatcols = [col for col in df.columns if "co" in col]
for floatcol in floatcols:
    df[floatcol] = df[floatcol].astype(float)
From the same link, and with some astype magic.
column_vals = df.columns.map(lambda x: x.startswith("p_d_d_"))
train_temp = df.loc(axis=1)[column_vals]
train_temp = train_temp.astype(float)
EDIT:
To modify the original dataframe, do something like this:
column_vals = [x for x in df.columns if x.startswith("p_d_d_")]
df[column_vals] = df[column_vals].astype(float)
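Since the question asks for a regex specifically, the same in-place conversion can be sketched with DataFrame.filter. The sample data and the exact pattern below are assumptions made for illustration; the pattern should be adjusted to the real column names.

```python
import pandas as pd

# Toy frame with string-valued columns, standing in for the real data.
df = pd.DataFrame({
    "a": ["x", "y"],
    "p_d_d_c0": ["1.5", "2"],
    "p_d_d_g1": ["3", "4.25"],
    "p_d_d_x": ["7", "8"],
})

# filter(regex=...) keeps the columns whose names match the pattern;
# only the names are needed here, not a copy of the data.
target_cols = list(df.filter(regex=r"^p_d_d_[cg]\d+$").columns)

# Convert all matching columns in one assignment.
df[target_cols] = df[target_cols].astype(float)
```

If some values cannot be parsed as numbers, astype(float) raises; in that case pd.to_numeric with errors="coerce" applied per column may be a better fit.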
I am a beginner at coding, and since this is a very simple question, I know there must be answers out there. However, I've searched for about a half hour, typing countless queries in google, and all has flown over my head.
Let's say I have a dataframe with columns "Name" and "Hobbies" and 2 people, so 2 rows. Currently, I have the hobbies as strings in the form "hobby1, hobby2". I would like to change this into ["hobby1", "hobby2"].
hobbies_as_string = df.iloc[0, 2]
hobbies_as_list = hobbies_as_string.split(',')
df.iloc[0, -2] = hobbies_as_list
However, this fails with an error: ValueError: Must have equal len keys and value when setting with an iterable. I don't understand why, since if I get hobbies_as_string as a copy, I'm able to assign the hobbies column as a list with no problem. I'm also able to assign a string such as "Hey" to df.iloc[0, -2], and that works fine. I'm guessing it has to do with the ValueError. Why won't pandas let me assign a list?
Thank you very much for your help and explanation.
Use the "at" method to replace a value with a list
import pandas as pd
# create a dataframe
df = pd.DataFrame(data={'Name': ['Stinky', 'Lou'],
'Hobbies': ['Shooting Sports', 'Poker']})
# replace Lous hobby of poker with a list of degen hobbies with the at method
df.at[1, 'Hobbies'] = ['Poker', 'Ponies', 'Dice']
Are you looking to apply a split row-wise to each value into a list?
import pandas as pd
df = pd.DataFrame({'Name' : ['John', 'Kate'],
'Hobbies' : ["Hobby1, Hobby2", "Hobby2, Hobby3"]})
df['Hobbies'] = df['Hobbies'].apply(lambda x: x.split(','))
df
Or, if you are not a big lambda user, you can call str.split() on the entire column, which is easier:
import pandas as pd
df = pd.DataFrame({'Name' : ['John', 'Kate'],
'Hobbies' : ["Hobby1, Hobby2", "Hobby2, Hobby3"]})
df['Hobbies'] = df['Hobbies'].str.split(",")
df
Output:
Name Hobbies
0 John [Hobby1, Hobby2]
1 Kate [Hobby2, Hobby3]
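One detail worth checking with the split approaches above: splitting "Hobby1, Hobby2" on "," alone leaves a leading space on the second item. The sketch below strips each item after splitting; this extra step is my assumption about what is wanted, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Kate"],
    "Hobbies": ["Hobby1, Hobby2", "Hobby2, Hobby3"],
})

# Split on the comma, then strip surrounding whitespace from every item.
df["Hobbies"] = df["Hobbies"].str.split(",").apply(
    lambda items: [item.strip() for item in items]
)
```

Splitting on the two-character separator ", " would also work when the spacing is known to be consistent.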
Another way of doing it
df=pd.DataFrame({'hobbiesStrings':['"hobby1, hobby2"']})
df
Replace each comma followed by whitespace with "," and put the hobbiesStrings values in a list:
x=df.hobbiesStrings.str.replace(r'((?<=)(\,\s+)+)','","', regex=True).values.tolist()
x
Here I use a regular expression.
Basically, I am replacing a comma (\,) followed by whitespace (\s) with ",". Passing regex=True is needed because str.replace treats the pattern as a literal string by default in pandas 2.0 and later.
Then rewrite the column using df.assign:
df=df.assign(hobbies_stringsnes=[x])
Chained together:
df=df.assign(hobbies_stringsnes=[df.hobbiesStrings.str.replace(r'((\,\s))','","', regex=True).values.tolist()])
df
When I create a dataframe using concat like this:
import pandas as pd
dfa = pd.DataFrame({'a':[1],'b':[2]})
dfb = pd.DataFrame({'a':[3],'b':[4]})
dfc = pd.concat([dfa,dfb])
And I try to reference like I would for any other DataFrame I get the following result:
>>> dfc['a'][0]
0 1
0 3
Name: a, dtype: int64
I would expect my concatenated DataFrame to behave like a normal DataFrame and return the integer that I want like this simple DataFrame does:
>>> dfa['a'][0]
1
I am just a beginner, is there a simple explanation for why the same call is returning an entire DataFrame and not the single entry that I want? Or, even better, an easy way to get my concatenated DataFrame to respond like a normal DataFrame when I try to reference it? Or should I be using something other than concat?
You've mistaken what normal behavior is. dfc['a'][0] is a label lookup, so it matches every row whose index value is 0. There are two such rows, because you concatenated two dataframes that each have an index value of 0.
To select by position 0 instead, use iloc:
dfc['a'].iloc[0]
or you could have constructed dfc like
dfc = pd.concat([dfa,dfb], ignore_index=True)
dfc['a'][0]
Both returning
1
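To make the label-versus-position distinction concrete, here is a short sketch built on the question's own dataframes. The variable names by_label, by_position, and dfc_reset are hypothetical.

```python
import pandas as pd

dfa = pd.DataFrame({"a": [1], "b": [2]})
dfb = pd.DataFrame({"a": [3], "b": [4]})

# Each input keeps its own index, so the result has two rows labelled 0.
dfc = pd.concat([dfa, dfb])

by_label = dfc["a"].loc[0]      # label lookup: a Series holding both rows
by_position = dfc["a"].iloc[0]  # positional lookup: the first value only

# ignore_index=True renumbers the rows 0..n-1, making the labels unique.
dfc_reset = pd.concat([dfa, dfb], ignore_index=True)
```

With unique labels, dfc_reset["a"].loc[0] returns a single value again.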
EDITED (thx piRSquared's comment)
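Use append() instead of pd.concat(). Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so this only works on older versions: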
dfc = dfa.append(dfb, ignore_index=True)
dfc['a'][0]
1
Pandas beginner here. I'm looking to return a full column's data and I've seen a couple of different methods for this.
What is the difference between the two entries below, if any? It looks like they return the same thing.
loansData['int_rate']
loansData.int_rate
The latter is basically syntactic sugar for the former. There are (at least) a couple of gotchas:
If the name of the column is not a valid Python identifier (e.g., if the column name is my column name?!), you must use the former.
Somewhat surprisingly, you can only use the former form to completely correctly add a new column (see, e.g., here).
Example for latter statement:
import pandas as pd
df = pd.DataFrame({'a': range(4)})
df.b = range(4)
>>> df.columns
Index([u'a'], dtype='object')
For some reason, though, df.b returns the correct results.
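The behavior above can be reproduced with the sketch below. As far as I understand it, df.b "works" only because the assignment stored a plain Python attribute on the object, not a column; the warnings filter is there because pandas warns about this kind of assignment. The variable names are hypothetical.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"a": range(4)})

with warnings.catch_warnings():
    # pandas emits a UserWarning when a list-like is set via attribute access.
    warnings.simplefilter("ignore")
    df.b = range(4)  # sets an ordinary attribute, not a column

columns_after_attribute = list(df.columns)

# Bracket assignment is the reliable way to add a new column.
df["b"] = range(4)
columns_after_brackets = list(df.columns)
```

Dot notation is therefore best kept for reading existing columns, with bracket notation used for creating them.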
They do return the same thing. Column names in pandas are akin to dictionary keys that each refer to a Series. Column names are also exposed as attributes of the DataFrame object.
The first method is preferred, as it allows spaces and other characters that are not valid in Python identifiers.
For a more complete explanation, I recommend you take a look at this article:
http://byumcl.bitbucket.org/bootcamp2013/labs/pd_types.html#pandas-types
Search 'Access using dict notation' to find the examples where they show that these two methods return identical values.
They're the same but for me the first method handles spaces in column names and illegal characters so is preferred, example:
In [115]:
df = pd.DataFrame(columns=['a', ' a', '1a'])
df
Out[115]:
Empty DataFrame
Columns: [a, a, 1a]
Index: []
In [116]:
print(df.a) # works
print(df[' a']) # works
print(df.1a) # error
File "<ipython-input-116-4fa4129a400e>", line 3
print(df.1a)
^
SyntaxError: invalid syntax
When you use the dot, pandas first looks for an attribute with that name and only then falls back to the column. If a column name matches an existing attribute or method, the dot will not do what you expect.
Example:
In [121]:
df = pd.DataFrame(columns=['index'], data = np.random.randn(3))
df
Out[121]:
index
0 0.062698
1 -1.066654
2 -1.560549
In [122]:
df.index
Out[122]:
Int64Index([0, 1, 2], dtype='int64')
The above has now shown the index as opposed to the column 'index'
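A compact way to see the two lookups side by side is sketched below, using fixed values instead of random ones so the result is reproducible. The variable names are hypothetical.

```python
import pandas as pd

# A column that happens to share its name with the DataFrame.index attribute.
df = pd.DataFrame({"index": [0.1, 0.2, 0.3]})

column_values = df["index"]  # bracket access: always the column
row_labels = df.index        # dot access: the built-in row index wins
```

The same clash happens with any column named after a DataFrame attribute or method, such as count, size, or T.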
If you are working on an ML project and want to extract the feature and target variables separately, the code below may be useful.
It takes the column names as a list and slices that list, treating the last column as the target and all others as features. In this code, data is the DataFrame.
len_col=len(data.columns)
total_col=list(data.columns)
Target_col_Y=total_col[-1]
Feature_col_X=total_col[0:-1]
print('The dependent variable is')
print(Target_col_Y)
print('The independent variables are')
print(Feature_col_X)
The output looks like this:
The dependent variable is
output
The independent variables are
['age', 'job', 'marital', 'education','day_of_week', ... etc]
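The snippet above only produces the column names. To get the actual feature and target data, one option is positional slicing with iloc. This is a sketch on an invented three-column frame; it assumes, as above, that the target is the last column.

```python
import pandas as pd

# Tiny stand-in for the real dataset; the last column is the target.
data = pd.DataFrame({
    "age": [30, 41],
    "job": ["admin", "tech"],
    "output": [0, 1],
})

features_X = data.iloc[:, :-1]  # every column except the last
target_y = data.iloc[:, -1]     # the last column as a Series
```

Selecting by name, for example data.drop(columns="output"), is a safer alternative when the column order might change.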