Remove certain words from string - python

I have a dataframe df with a column containing strings. I have another dataframe df2 with a single column (so it could also be a Series) which contains one word per row.
I would like to remove from df all the words that appear in df2.
Example:

df:
                     ColString
0  I would like to buy apples.

df2:
  Wordlist
0     like
1   apples

Result:
          ColString
0  I would to buy .
Any ideas? Thanks for the help!

You can use replace with regex=True:
df1.col.replace(df2.Wordlist.str.cat(sep='|'),'',regex=True)
Out[510]:
0 I would to buy .
Name: col, dtype: object
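If any of the words can also occur inside longer words (for example "apple" inside "apples"), a minimal sketch of the same idea with word boundaries wrapped around the joined pattern (column names follow the example above):

import pandas as pd

df = pd.DataFrame({'ColString': ['I would like to buy apples.']})
df2 = pd.DataFrame({'Wordlist': ['like', 'apples']})

# Join the words with '|' and anchor each alternative at word boundaries
pattern = r'\b(?:' + df2['Wordlist'].str.cat(sep='|') + r')\b'
df['ColString'] = df['ColString'].str.replace(pattern, '', regex=True)
print(df.loc[0, 'ColString'])
# I would  to buy .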

Related

Left justify pandas string column with pattern

I have a large pandas dataset with a messy string column which contains for example:
72.1
61
25.73.20
33.12
I'd like to fill the gaps so every value matches a pattern like XX.XX.XX (where each X is a digit):
72.10.00
61.00.00
25.73.20
33.12.00
thank you!
How about defining base_str = '00.00.00' and then padding each string in each row with the tail of base_str:
base_str = '00.00.00'
df = pd.DataFrame({'ms_str':['72.1','61','25.73.20','33.12']})
print(df)
# Append the part of base_str beyond each string's current length
df['ms_str'] = df['ms_str'].apply(lambda x: x + base_str[len(x):])
print(df)
Output:
     ms_str
0      72.1
1        61
2  25.73.20
3     33.12

     ms_str
0  72.10.00
1  61.00.00
2  25.73.20
3  33.12.00
Here is a vectorized solution that works for this particular pattern: first pad with zeros on the right, then replace every third character with a dot:
df['col'].str.ljust(8, fillchar='0').str.replace(r'(..).', r'\1.', regex=True)
Output:
0 72.10.00
1 61.00.00
2 25.73.20
3 33.12.00
Name: col, dtype: object
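Put together with the example frame from the other answer, a minimal runnable sketch of the vectorized version (the column name ms_str is assumed):

import pandas as pd

df = pd.DataFrame({'ms_str': ['72.1', '61', '25.73.20', '33.12']})
# Pad to 8 characters with zeros on the right, then force a dot at every third position
df['ms_str'] = (df['ms_str'].str.ljust(8, fillchar='0')
                            .str.replace(r'(..).', r'\1.', regex=True))
print(df)
#      ms_str
# 0  72.10.00
# 1  61.00.00
# 2  25.73.20
# 3  33.12.00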

Drop row in pandas if it contains condition

I am trying to drop rows in pandas based on whether or not the cells in the "Price" column contain "/". I have referred to the question: Drop rows in pandas if they contain "???".
As such, I have tried both of the following:
df = df[~df["Price"].str.contains('/')]
and
df = df[~df["Price"].str.contains('/',regex=False)]
However, both give the error:
AttributeError: Can only use .str accessor with string values!
For reference, the first few rows of my dataframe are as follows:
    Fruit Price
0   Apple     3
1   Apple   2/3
2  Banana     2
3  Orange   6/7
May I know what went wrong and how can I fix this problem? Thank you very much!
Try this:
df = df[~df['Price'].astype(str).str.contains('/')]
print(df)
    Fruit Price
0   Apple     3
2  Banana     2
You need to convert the Price column to string first and then apply this operation; the Price column presumably doesn't have a string dtype:
df['Price'] = df['Price'].astype(str)
and then try
df = df[~df["Price"].str.contains('/',regex=False)]

Matching regex in two different dataframe Python

I'm having trouble matching regexes across two different dataframes that are linked by Type and Country. Here is a sample of the data df and the regex df. Note that the two dataframes have different shapes because the regex df contains only unique values.
Data df:
Country  Type  Data
MY       ABC   MY1234567890
IT       ABC   IT1234567890
PL       PQR   PL123456
MY       XYZ   456792abc
IT       ABC   MY45889976
IT       ABC   IT567888976

Regex df:
Country  Type  Regex
MY       ABC   ^MY[0-9]{10}
IT       ABC   ^IT[0-9]{10}
PL       PQR   ^PL
MY       XYZ   ^\w{6,10}$
I have tried merging them together and just using a lambda to do the matching. Below is my code:
df.merge(df_regex,left_on='Country',right_on="Country")
df['Data Quality'] = df.apply(lambda r:re.match(r['Regex'],r['Data']) and 1 or 0, axis=1)
But it adds another row for each different Type and Country, so there is a lot of duplication, which is inefficient and time consuming.
Is there any Pythonic way to match the data to its country and type when the reference is in another dataframe, without merging the two dataframes? If a value matches its own regex, it should return 1, else 0.
To avoid repetition based on Type, include Type in the join keys as well, then apply the lambda:
import re

df2 = df.merge(df_regex, on=['Country', 'Type'])
df2['Data Quality'] = df2.apply(lambda r:re.match(r['Regex'],r['Data']) and 1 or 0, axis=1)
df2
It will give you the following output.
  Country Type          Data         Regex  Data Quality
0      MY  ABC  MY1234567890  ^MY[0-9]{10}             1
1      IT  ABC  IT1234567890  ^IT[0-9]{10}             1
2      IT  ABC    MY45889976  ^IT[0-9]{10}             0
3      IT  ABC   IT567888976  ^IT[0-9]{10}             0
4      PL  PQR      PL123456           ^PL             1
5      MY  XYZ     456792abc    ^\w{6,10}$             1
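If you want to avoid the merge entirely, as the question asks, one possible sketch is to build a lookup of compiled regexes keyed by (Country, Type) and apply it row by row; this assumes every (Country, Type) pair in the data df has exactly one entry in the regex df:

import re

# Lookup table: (Country, Type) -> compiled regex
patterns = {(c, t): re.compile(rx)
            for c, t, rx in df_regex[['Country', 'Type', 'Regex']].itertuples(index=False)}

df['Data Quality'] = df.apply(
    lambda r: 1 if patterns[(r['Country'], r['Type'])].match(r['Data']) else 0,
    axis=1)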

Explode List containing many dictionaries in Pandas dataframe

I have a dataset (in a dataframe) which looks as follows:
_id  paper_title  references                                                                        full_text
1    XYZ          [{'abc': 'something', 'def': 'something'}, {'def': 'something'}, ...many others]  something
2    XYZ          [{'abc': 'something', 'def': 'something'}, {'def': 'something'}, ...many others]  something
3    XYZ          [{'abc': 'something'}, {'def': 'something'}, ...many others]                      something
Expected:
_id  paper_title  abc        def        full_text
1    XYZ          something  something  something
                  something  something
                  ...
                  (one row per dict in the list for that _id)
2    XYZ          something  something  something
                  something  something
                  ...
                  (one row per dict in the list for that _id)
I have tried df['column_name'].apply(pd.Series).apply(pd.Series) to split the list and dictionaries into dataframe columns, but it doesn't help, as it didn't split the dictionaries.
First row of my dataframe:
df.head(1)
Assuming the references column of your original DataFrame holds lists of dictionaries, each with one key:value pair and the key named 'reference':
print(df)
id paper_title references full_text
0 1 xyz [{'reference': 'description1'}, {'reference': ... some text
1 2 xyz [{'reference': 'descriptiona'}, {'reference': ... more text
2 3 xyz [{'reference': 'descriptioni'}, {'reference': ... even more text
Then you can use concat to separate out your references with their index:
df1 = pd.concat([pd.DataFrame(i) for i in df['references']], keys = df.index).reset_index(level=1,drop=True)
print(df1)
reference
0 description1
0 description2
0 description3
1 descriptiona
1 descriptionb
1 descriptionc
2 descriptioni
2 descriptionii
2 descriptioniii
Then use DataFrame.join to join the columns back together on their index:
df = df.drop('references', axis=1).join(df1).reset_index(drop=True)
print(df)
id paper_title full_text reference
0 1 xyz some text description1
1 1 xyz some text description2
2 1 xyz some text description3
3 2 xyz more text descriptiona
4 2 xyz more text descriptionb
5 2 xyz more text descriptionc
6 3 xyz even more text descriptioni
7 3 xyz even more text descriptionii
8 3 xyz even more text descriptioniii
After a lot of reading of the pandas documentation, I found that the explode method combined with apply(pd.Series) is the easiest way to do what I was asking in the question.
Here is the Code:
# Explode the lists so each list element gets its own row (the index is repeated)
df = df.explode('reference')
# Split each dictionary inside a cell into columns and merge back with the original dataframe (like A∪B in set theory)
df = df['reference'].apply(pd.Series).merge(df, left_index=True, right_index=True, how='outer')
Side note: when merging, check for unique values in the columns, as there will be many columns with duplicated values.
I hope this helps anyone with a dataframe/Series column holding lists of multiple dictionaries who wants to split the dictionary keys into new columns with their values as rows.
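On newer pandas (explode needs 0.25+, pd.json_normalize needs 1.0+), the same reshaping can also be written with explode plus json_normalize, which handles dictionaries that have more than one key; a sketch assuming a 'references' column holding non-empty lists of dicts:

import pandas as pd

# One row per dictionary, then reset the index so the later join lines up
exploded = df.explode('references').reset_index(drop=True)
# Turn each dictionary into its own set of columns, then join the other fields back on
expanded = pd.json_normalize(exploded['references'].tolist())
result = exploded.drop(columns='references').join(expanded)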

Counting all string values in a given column of a table and grouping based on a third column

I have three columns. The table looks like this:

ID.  names       tag
1.   john.       1
2.   sam         0
3.   sam,robin.  1
4.   robin.      1

ID: type integer
names: type string
tag: type integer (just 0 or 1)
What I want is to find how many times each name is repeated, grouped by 0 and 1. This is to be done in Python.
The answer must look like:

        0   1
John   23  12
Robin  32  10
sam     9  30
Using extractall and crosstab:
s = df.names.str.extractall(r'(\w+)').reset_index(1, drop=True).join(df.tag)
pd.crosstab(s[0], s['tag'])
tag    0  1
0
john   0  1
robin  0  2
sam    1  1
Because of the nature of your names column, there is some re-processing that needs to be done before you can get value counts. In the case of your example dataframe, this could look something like:
my_counts = (df.set_index(['ID.', 'tag'])
               # Get rid of periods and split on commas
               .names.str.strip('.').str.split(',')
               .apply(pd.Series)
               .stack()
               .reset_index([0, 1])
               # rename column 0 for consistency, easier reading
               .rename(columns={0: 'names'})
               # Get value counts of names per tag:
               .groupby('tag')['names']
               .value_counts()
               .unstack('tag', fill_value=0))
>>> my_counts
tag    0  1
names
john   0  1
robin  0  2
sam    1  1
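On pandas 0.25+ a shorter variant of the same pre-processing uses str.split plus explode before the crosstab; a sketch with the question's column names:

import pandas as pd

df = pd.DataFrame({'ID.': [1, 2, 3, 4],
                   'names': ['john.', 'sam', 'sam,robin.', 'robin.'],
                   'tag': [1, 0, 1, 1]})

# Strip trailing periods, split on commas, then give each name its own row
s = df.assign(names=df['names'].str.strip('.').str.split(',')).explode('names')
print(pd.crosstab(s['names'], s['tag']))
# tag    0  1
# names
# john   0  1
# robin  0  2
# sam    1  1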
