Is it possible to add a new column to a dataframe whose values come from applying a regular expression to the text in the first column? How could this be done?
re.compile(r'\S+#\S+')
I would like to apply that regexp to the text in the first column of each row and add the outcome as another column.
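import re
for idx, data_string in df['first_column'].items():
    # do things with the data_string here, e.g. apply the regexp
    match = re.search(r'\S+#\S+', data_string)
    result = match.group() if match else None
    # save result in second column
    df.loc[idx, 'second_column'] = result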
Maybe I misunderstand you, but isn't it just a matter of iterating over all rows and saving the result of your regexp in the second column?
All columns of a pandas DataFrame must have the same length, so rows that do not match the regex get NaN as a placeholder; once those rows are dropped, only the matching rows remain in the dataframe.
You just need to define a function that applies the regex to a string, then use the apply method of the pandas Series and insert the result into the dataframe.
import re
import numpy as np
import pandas as pd
df = pd.DataFrame({'col_1':['123','12','b23','134'],'col_2':['a','b','c','d']})
df
Out[1]:
col_1 col_2
0 123 a
1 12 b
2 b23 c
3 134 d
def regex(string):
    pattern = re.compile(r"\d{1,2}")
    result = pattern.match(string)
    if result:
        return result.group()
    return np.nan  # NaN when there is no match, so those rows can be dropped later
new_col = df.col_1.apply(regex)
df.insert(loc =2,column='new_col',value=new_col)
df = df.dropna()
df
Out[2]:
col_1 col_2 new_col
0 123 a 12
1 12 b 12
3 134 d 13
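As an alternative to writing a helper function, pandas also offers vectorized string methods. The following is a minimal sketch, assuming a column named first_column (a hypothetical name, as is the sample data) and using the \S+#\S+ pattern from the question. Series.str.extract needs a capture group and keeps every row, with NaN where nothing matched; Series.str.findall returns all matches per row as a list.

```python
import pandas as pd

# Hypothetical sample data; the column names are assumptions for illustration.
df = pd.DataFrame({"first_column": ["see a#b here", "no match", "x#y and c#d"]})

# expand=False returns a Series holding the first match per row (NaN if none).
df["second_column"] = df["first_column"].str.extract(r"(\S+#\S+)", expand=False)

# str.findall returns a list of every match per row (empty list if none).
df["all_matches"] = df["first_column"].str.findall(r"\S+#\S+")

print(df)
```

If only the matching rows are wanted, df.dropna(subset=["second_column"]) can be applied afterwards, as in the answer above.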
I have the following dataframe:
df = pd.DataFrame([['A', 1],['B', 2],['C', 3]], columns=['index', 'result'])
index  result
A      1
B      2
C      3
I would like to create a new column, for example by multiplying the column 'result' by two, and I am curious whether there is a way to do it in pandas the way pyspark does it.
In pyspark:
df = df\
.withColumn("result_multiplied", F.col("result")*2)
I don't like having to write the name of the dataframe every time I perform an operation, as is done in pandas, such as:
In pandas:
df['result_multiplied'] = df['result']*2
Use DataFrame.assign:
df = df.assign(result_multiplied = df['result']*2)
Or, if the column result is modified earlier in the same chain, a lambda function is necessary so that the calculation uses the processed values of column result:
df = df.assign(result_multiplied = lambda x: x['result']*2)
Sample to show the difference: result_multiplied is computed from the original df['result'], while result_multiplied1 uses the column after mul(2):
df = df.mul(2).assign(result_multiplied = df['result']*2,
                      result_multiplied1 = lambda x: x['result']*2)
print (df)
index result result_multiplied result_multiplied1
0 AA 2 2 4
1 BB 4 4 8
2 CC 6 6 12
I have a dataset imported from a CSV file to a dataframe in Python. I want to remove some specific rows from this dataframe and append them to an empty dataframe. So far I have tried to remove row 1 and 0 from the "big" dataframe called df and put these into dff using this code:
dff = pd.DataFrame()  # Create empty dataframe
for x in range(0, 2):
    dff = dff.append(df.iloc[x])  # Append the first 2 rows from df to dff
# How to remove appended rows from df?
This seems to work, but the columns are flipped: e.g., if df has the order A, B, C, then dff gets the order C, B, A. Other than that the data is correct. Also, how do I remove a specific row from a dataframe?
If your goal is just to move the first two rows into another dataframe, you don't need a loop; just slice:
import pandas as pd
df = pd.DataFrame({"col1": [1,2,3,4,5,6], "col2": [11,22,33,44,55,66]})
dff = df.iloc[:2]
df = df.iloc[2:]
Will give you:
dff
Out[6]:
col1 col2
0 1 11
1 2 22
df
Out[8]:
col1 col2
2 3 33
3 4 44
4 5 55
5 6 66
If your list of desired rows is more complex than just the first two, per your example, a more generic method could be:
dff = df.iloc[[1,3,5]] # Your list of row numbers
df = df.iloc[~df.index.isin(dff.index)]
This means that even if the index column isn't sequential integers, any rows that you used to populate dff will be removed from df.
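To illustrate the point about non-sequential indexes, here is a small sketch with a hypothetical string index. It uses DataFrame.drop with the selected labels rather than the boolean mask above, and it assumes the index labels are unique.

```python
import pandas as pd

# Hypothetical data with a non-sequential string index (labels assumed unique).
df = pd.DataFrame({"col1": [1, 2, 3, 4]}, index=["w", "x", "y", "z"])

dff = df.iloc[[1, 3]]          # pick rows by position
df = df.drop(index=dff.index)  # remove those same rows by label

print(dff.index.tolist())
print(df.index.tolist())
```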
I managed to solve it by doing:
dff = pd.DataFrame()
dff = df.iloc[:0]
This copies only the column headers of df (e.g. A, B, C) into dff, since df.iloc[:0] selects no rows. After that, appending works as it should for any row; e.g. row 1150 can be appended and removed using:
# DataFrame.append was removed in pandas 2.0, so pd.concat is used here
dff = pd.concat([dff, df.iloc[[1150]]])
df = df.drop(df.index[1150])
So I am able to get what I want if I only filter by one item, but can't figure out how to filter by two items.
Basically I have a data set with a potentially unlimited number of rows and 26 columns. I want to filter rows based on the values in columns A and B, and return the data in C and D only if both A and B match the values passed into the function. The A and B values will differ, but they are specified by being passed into the function.
It seems simple to me, but when I try to run the second filter on the first filtered df, my returned df is empty.
>>> import pandas as pd
>>> df = pd.DataFrame({"A":[1,2,3,4], "B":[7,6,5,4], "C":[9,8,7,6], "D":[0,1,0,1]})
>>> df = df.loc[(df.A>1) & (df.B>4), ["C", "D"]]
>>> print(df)
C D
1 8 1
2 7 0
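Since the question asks for a function that receives the A and B values, the same idea could be wrapped as below. This is only a sketch: the function name select_c_d and the sample data are hypothetical, and it assumes an exact equality match on both columns. Building one combined mask avoids running a second filter on an already filtered dataframe.

```python
import pandas as pd

def select_c_d(df, a_value, b_value):
    """Return columns C and D for rows where A == a_value and B == b_value."""
    mask = (df["A"] == a_value) & (df["B"] == b_value)
    return df.loc[mask, ["C", "D"]]

# Hypothetical sample data for illustration.
df = pd.DataFrame({"A": [1, 2, 2, 4], "B": [7, 6, 5, 6],
                   "C": [9, 8, 7, 6], "D": [0, 1, 0, 1]})

result = select_c_d(df, a_value=2, b_value=6)
print(result)
```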
I have a column say A with strings and another column B with binary values 1/0.
I am trying to match a regular expression in column A and update column B accordingly.
If this is my regular expression
pattern_1 = re.compile(r'\bstudent', re.IGNORECASE)
I would like the table to look like below.
A B
I am a teacher 0
I am a student 1
Student group 1
You can use pandas to create the dataframe and fill the new column by checking each row's data:
import pandas as pd
import re
pattern_1 = re.compile(r'\bstudent', re.IGNORECASE)
data = [['I am a teacher',0],['I am a student ',0],['Student group', 0]]
df = pd.DataFrame(data, columns =['A','B'])
print("original df:",df)
df['B'] = df.apply(lambda row: 1 if pattern_1.search(row.A) else row.B , axis=1)
print("\n\nmodified df:",df)
output:
original df: A B
0 I am a teacher 0
1 I am a student 0
2 Student group 0
modified df: A B
0 I am a teacher 0
1 I am a student 1
2 Student group 1
You don't specify how your columns are stored, but this sounds like a job for a basic for-loop with enumerate.
Assuming that A and B are lists:
for i, a_value in enumerate(A):
    # search each string, not the whole list; int() gives the 1/0 values
    B[i] = int(bool(pattern_1.search(a_value)))
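If A is a dataframe column, a vectorized alternative to the row-wise apply or the loop is Series.str.contains. The sketch below assumes the sample data from the question; passing flags=re.IGNORECASE mirrors the compiled pattern_1, and astype(int) turns the boolean result into 1/0.

```python
import re
import pandas as pd

df = pd.DataFrame({"A": ["I am a teacher", "I am a student", "Student group"]})

# str.contains treats the pattern as a regex by default;
# the flag makes the match case-insensitive, like pattern_1 in the question.
df["B"] = df["A"].str.contains(r"\bstudent", flags=re.IGNORECASE).astype(int)

print(df)
```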
Replacing Periods in DF's Columns
I was wondering if there is an efficient way to replace periods in pandas dataframes without having to iterate through each row and call .replace() on the row.
import pandas as pd
df = pd.DataFrame.from_dict({'column':['Sam M.']})
df.column = df.column.replace('.','')
print(df)
Result
column
0 None
Desired Result
column
0 Sam M
df['column'].str.replace('.', '', regex=False)
0 Sam M
Name: column, dtype: object
Because . is a special character in regular expressions, put a backslash ('\') in front of it and pass regex=True so the pattern is treated as a regex:
Solution:
df['column'].str.replace(r'\.', '', regex=True)
Example:
df['column'] = df['column'].str.replace(r'\.', '', regex=True)
print(df)
Output:
column
0 Sam M