I would like to remove the characters between [] (including the brackets), and currently I am doing
df['Text'] = df['Text'].str.replace(r"\[.*\]","")
But the output isn't desirable: before it is [image] This document and after it is ******* This document, where * is whitespace.
How do I get rid of this whitespace?
Edit 1
The Text column of df looks like below:
ID Text
0 REAL ESTATE LEASE THIS INDUSTRIAL REAL ESTAT...
5 Lease AureementMade and signed on the \ of Aug...
6 FIRST AMENDMENT OF LEASEDATE: August 31, 2001L...
8 [image: image0.jpg] Jack[image: image1.jb2] ...
9 [image: image0.jpg] ABC SALES Meeting 97...
14 FIRST AMENDMENT OF LEASETHIS FIRST AMENDMENT O...
17 [image: image0.tif] Deep ML LEASE SERVI...
22 [image: image0.jpg] F 15 083 EX [image: image1...
26 LEASE AGREEMENT—GROSS LEASEBASIC LEASE PROVISI...
28 [image: image0.jpg] 17. Medical VERIFICATION...
31 [image: image0.jpg] [image: image1.jb2] PLL 3...
32 SUBLEASETHIS SUBLEASE this “Sublease” made as ...
34 [image: image0.tif] Lease Agreement May 10, 20...
35 13057968.3 1 Initials: _____ _____ SECOND ...
42 [image: image0.jpg] Jack Dowson Buy Real MI...
46 Deep – Machine Learning LEASE B...
I would like to see
ID Text
0 REAL ESTATE LEASE THIS INDUSTRIAL REAL ESTAT...
5 Lease AureementMade and signed on the \ of Aug...
6 FIRST AMENDMENT OF LEASEDATE: August 31, 2001L...
8 Jack ...
9 ABC SALES Meeting 97...
14 FIRST AMENDMENT OF LEASETHIS FIRST AMENDMENT O...
17 Deep ML LEASE SERVI...
22 F 15 083 EX ...
26 LEASE AGREEMENT—GROSS LEASEBASIC LEASE PROVISI...
28 17. Medical VERIFICATION...
31 PLL 3...
32 SUBLEASETHIS SUBLEASE this “Sublease” made as ...
34 Lease Agreement May 10, 20...
35 13057968.3 1 Initials: _____ _____ SECOND ...
42 Jack Dowson Buy Real MI...
46 Deep – Machine Learning LEASE B...
Looks like you need .str.strip()
Ex:
df = pd.DataFrame({"ID": [1,2,3], "Text": ["[image: 123.jpg] This document", "[image: image.jpg] Readers of the article", "The agreement between [image: image.jpg] two parties"]})
df["Text"] = df["Text"].str.replace(r"(\s*\[.*?\]\s*)", " ").str.strip()
print(df)
Output:
0 This document
1 Readers of the article
2 The agreement between two parties
Name: Text, dtype: object
Add an optional space (?) to your regex, so the whole regex (match part) should be:
r'\[.*\] ?'
Another hint: your regex is enclosed in parentheses (a capturing group). They are not needed here, so remove them.
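A quick sketch of that suggestion applied to two rows from the question, using a non-greedy .*? so a row containing several bracketed tags is not wiped from the first [ to the last ]:
import pandas as pd

df = pd.DataFrame({"Text": ["[image: image0.jpg] ABC SALES Meeting",
                            "[image: image0.tif] Lease Agreement May 10, 20"]})
# The trailing ' ?' swallows the space each removed tag leaves behind.
df["Text"] = df["Text"].str.replace(r"\[.*?\] ?", "", regex=True)
print(df["Text"].tolist())
Output:
['ABC SALES Meeting', 'Lease Agreement May 10, 20']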
I have this df1 with a lot of different news articles. An example of a news article is this:
'Today is Monday Aug. 17 the 230th day of 2020 . There are 136 days left in the year . On August 17 2017 a van plowed through pedestrians along a packed promenade in the Spanish city of Barcelona killing 13 people and injuring 120 . A 14th victim died later from injuries . Another man was stabbed to death in a carjacking that night as the van driver made his getaway and a woman died early the next day in a vehicle-and-knife attack in a nearby coastal town . Six by police two more died when a bomb workshop exploded . In 1915 a mob in Cobb County Georgia lynched Jewish businessman Leo Frank 31 whose death sentence for the murder of 13-year-old Mary Phagan had been commuted to life imprisonment . Frank who d maintained his innocence was pardoned by the state of Georgia in 1986 . In 1960 the newly renamed Beatles formerly the Silver Beetles began their first gig in Hamburg West Germany Teamsters union president Jimmy Hoffa was sentenced in Chicago to five years in federal prison for defrauding his union s pension fund . Hoffa was released in 1971 after President Richard Nixon commuted his sentence for this conviction and jury tampering . In 1969 Hurricane Camille slammed into the Mississippi coast as a Category 5 storm that was blamed for 256 U.S. deaths three in Cuba . In 1978 the first successful trans-Atlantic balloon flight ended as Maxie Anderson Ben Abruzzo and Larry Newman landed In 1982 the first commercially produced compact discs a recording of ABBA s The Visitors were pressed at a Philips factory near Hanover West Germany .'
And I have this df2 with all the words from the news articles in the column "Word" with their corresponding LIWC category in the second column.
Data example:
data = {'Word': ['killing','even','guilty','brain'], 'Category': ['Affect', 'Adverb', 'Anx','Body']}
What I'm trying to do is calculate, for each article in df1, how many words from each category in df2 occur. So I want to create a column for each category mentioned in df2["Category"].
And it should look like this in the end:
Content | Achieve | Affiliation | affect
article text here | 6 | 2 | 2
article text here | 2 | 43 | 2
article text here | 6 | 8 | 8
article text here | 2 | 13 | 7
Since it's all strings, I tried str.findall, but this returns all NAs for everything. This is what I tried:
from collections import Counter
liwc = df1['articles'].str.findall(fr"'({'|'.join(df2)})'") \
           .apply(lambda x: pd.Series(Counter(x), index=df2["Category"].unique())) \
           .fillna(0).astype(int)
Either a pandas or an R solution would be equally great.
First flatten the df2 values into a dictionary and add word boundaries \b...\b. Then pass the pattern to Series.str.extractall, map the matched words to categories with Series.map, create a DataFrame with reset_index, pass it to crosstab, and finally append the counts to the original with DataFrame.join:
df1 = pd.DataFrame({'articles':['Today is killing Aug. 17 the 230th day of 2020',
'Today is brain Aug. 17 the guilty day of 2020 ']})
print (df1)
articles
0 Today is killing Aug. 17 the 230th day of 2020
1 Today is brain Aug. 17 the guilty day of 2020
If the Word column contains lists of values, as in the picture:
data = {'Word': [['killing'],['even'],['guilty'],['brain']],
'Category': ['Affect', 'Adverb', 'Anx','Body']}
df2 = pd.DataFrame(data)
print (df2)
Word Category
0 [killing] Affect
1 [even] Adverb
2 [guilty] Anx
3 [brain] Body
d = {x: b for a, b in zip(df2['Word'], df2['Category']) for x in a}
print (d)
{'killing': 'Affect', 'even': 'Adverb', 'guilty': 'Anx', 'brain': 'Body'}
If df2 is different, with plain strings in Word:
data = {'Word': ['killing','even','guilty','brain'],
'Category': ['Affect', 'Adverb', 'Anx','Body']}
df2 = pd.DataFrame(data)
print (df2)
Word Category
0 killing Affect
1 even Adverb
2 guilty Anx
3 brain Body
d = dict(zip(df2['Word'], df2['Category']))
print (d)
{'killing': 'Affect', 'even': 'Adverb', 'guilty': 'Anx', 'brain': 'Body'}
import re
# thanks to Wiktor Stribiżew for improving the solution
pat = r"\b(?:{})\b".format("|".join(re.escape(x) for x in d))
df = df1['articles'].str.extractall(rf'({pat})')[0].map(d).reset_index(name='Category')
df = df1.join(pd.crosstab(df['level_0'], df['Category']))
print (df)
articles Affect Anx Body
0 Today is killing Aug. 17 the 230th day of 2020 1 0 0
1 Today is brain Aug. 17 the guilty day of 2020 0 1 1
You can craft a custom regex with named capturing groups and use str.extractall.
With your dictionary the custom regex would be '(?P<Affect>\\bkilling\\b)|(?P<Adverb>\\beven\\b)|(?P<Anx>\\bguilty\\b)|(?P<Body>\\bbrain\\b)'
Then groupby+max the notna results, convert to int and join to the original dataframe (see the note after the output for getting counts instead of 0/1 flags):
regex = '|'.join(fr'(?P<{k}>\b{v}\b)' for v,k in zip(*data.values()))
(df1.join(df1['articles'].str.extractall(regex, flags=2)  # flags=2 is re.IGNORECASE
          .notna().groupby(level=0).max()
          .astype(int)
          )
)
output:
articles Affect Adverb Anx Body
0 Today is killing Aug. 17 the 230th day of 2020 1 0 0 0
1 Today is brain Aug. 17 the guilty day of 2020 0 0 1 1
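Side note: groupby+max yields 0/1 presence flags. Since the question asks how many words of each category occur, swapping max for sum gives actual counts; a sketch reusing regex and df1 from above:
counts = (df1['articles'].str.extractall(regex, flags=2)
          .notna().groupby(level=0).sum()
          .astype(int))
print(df1.join(counts))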
To whom it may concern,
I have a very large dataframe (MasterDataFrame) that contains ~180K groups that I would like to split into 5 smaller DataFrames and process each smaller DataFrame separately. Does anyone know of any way that I could achieve this split into 5 smaller DataFrames without accidentally splitting/jeopardizing the integrity of any of the groups from the MasterDataFrame? In other words, I would like for the 5 smaller DataFrames to not have overlapping groups.
Thanks in advance,
Christos
This is what my dataset looks like:
|======MasterDataset======|
Name Age Employer
Tom 12 Walmart
Nick 15 Disney
Chris 18 Walmart
Darren 19 KMart
Nate 43 ESPN
Harry 23 Walmart
Uriel 24 KMart
Matt 23 Disney
. . .
. . .
. . .
I need to be able to split my dataset such that the groups shown in the MasterDataset above are preserved. The smaller DataFrames into which my MasterDataset will be split need to look like this:
|======SubDataset1======|
Name Age Employer
Tom 12 Walmart
Chris 18 Walmart
Harry 23 Walmart
Darren 19 KMart
Uriel 24 KMart
|======SubDataset2======|
Name Age Employer
Nick 15 Disney
Matt 23 Disney
I assume that by "groups" you mean the number of lines.
For that, .iloc should be perfect.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html
df_1 = df.iloc[0:100000, :]
df_2 = df.iloc[100000:200000, :]  # iloc's end is exclusive, so the next slice starts at 100000
....
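If the groups themselves must stay intact, though, slicing by row count can cut a group in half. A minimal sketch of a group-preserving split, assuming Employer is the grouping column: deal the groups round-robin into 5 buckets, so no group ever spans two sub-DataFrames.
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Nick', 'Chris', 'Darren'],
                   'Age': [12, 15, 18, 19],
                   'Employer': ['Walmart', 'Disney', 'Walmart', 'KMart']})

# Every row of a given Employer shares one group number, so taking it
# modulo 5 can never split a group across buckets.
bucket = df.groupby('Employer', sort=False).ngroup() % 5
sub_dfs = [df[bucket == i] for i in range(5)]
This balances the number of groups per bucket rather than the number of rows, but with ~180K groups the row counts should come out roughly even.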
I am trying to run a hypothesis test using an OLS model. I am fitting the OLS for tweet count based on four groups that I have in my data frame: Athletes, CEOs, Politicians, and Celebrities. Each name is labeled with its group in one column.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

frames = [CEO_df, athletes_df, Celebrity_df, politicians_df]
final_df = pd.concat(frames)
final_df = final_df.reindex(columns=["name","group","tweet_count","retweet_count","favorite_count"])
final_df

model = ols("tweet_count ~ C(group)", data=final_df).fit()
table = sm.stats.anova_lm(model, typ=2)
print(table)
I want to do something along the lines of:
model=ols("tweet_count ~ C(Athlete) + C(Celebrity) + C(CEO) + C(Politicians)", data=final_df).fit()
table=sm.stats.anova_lm(model, typ=2)
print(table)
Is that even possible? How else will I be able to run a hypothesis test with those conditions?
Here is my printed final_df:
name group tweet_count retweet_count favorite_count
0 #aws_cloud # #ReInvent R “Ray” Wang 王瑞光 #1A CEO 6 6 0
1 Aaron Levie CEO 48 1140 18624
2 Andrew Mason CEO 24 0 0
3 Bill Gates CEO 114 78204 439020
4 Bill Gross CEO 36 486 1668
... ... ... ... ... ...
56 Tim Kaine Politician 48 8346 50898
57 Tim O'Reilly Politician 14 28 0
58 Trey Gowdy Politician 12 1314 6780
59 Vice President Mike Pence Politician 84 1146408 0
60 klay thompson Politician 48 41676 309924
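One note on the proposed formula: in statsmodels/patsy formulas, C(group) already expands the four labels into indicator variables, so a separate term per group is usually unnecessary, and the first model is the usual one-way ANOVA. If explicit dummy columns are wanted anyway, a sketch (the column names produced by get_dummies follow whatever labels the group column actually holds, assumed here to be Athlete, Celebrity, CEO and Politician):
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# One-hot encode the group labels into 0/1 columns.
final_df = final_df.join(pd.get_dummies(final_df['group'], dtype=int))

# Drop one level (Politician) to avoid perfect collinearity with the intercept.
model = ols("tweet_count ~ Athlete + Celebrity + CEO", data=final_df).fit()
print(sm.stats.anova_lm(model, typ=2))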
I want to count paragraphs in the text column of a data frame. However, it turns out that my result ends up with zeros inside the list. Does anybody know how to fix it? Thank you so much.
Here is my code:
def count_paragraphs(df):
    paragraph_count = []
    linecount = 0
    for i in df.text:
        if i in ('\n','\r\n'):
            if linecount == 0:
                paragraphcount = paragraphcount + 1
    return paragraph_count

count_paragraphs(df)
df.text
0 On Saturday, September 17 at 8:30 pm EST, an e...
1 Story highlights "This, though, is certain: to...
2 Critical Counties is a CNN series exploring 11...
3 McCain Criticized Trump for Arpaio’s Pardon… S...
4 Story highlights Obams reaffirms US commitment...
5 Obama weighs in on the debate\n\nPresident Bar...
6 Story highlights Ted Cruz refused to endorse T...
7 Last week I wrote an article titled “Donald Tr...
8 Story highlights Trump has 45%, Clinton 42% an...
9 Less than a day after protests over the police...
10 I woke up this morning to find a variation of ...
11 Thanks in part to the declassification of Defe...
12 The Democrats are using an intimidation tactic...
13 Dolly Kyle has written a scathing “tell all” b...
14 The Haitians in the audience have some newswor...
15 The man arrested Monday in connection with the...
16 Back when the news first broke about the pay-t...
17 Chicago Environmentalist Scumbags\n\nLeftists ...
18 Well THAT’S Weird. If the Birther movement is ...
19 Former President Bill Clinton and his Clinton ...
Name: text, dtype: object
Use Series.str.count:
def count_paragraphs(df):
    return df.text.str.count(r'\n\n').tolist()

count_paragraphs(df)
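A quick check on a made-up two-row frame; note this counts the '\n\n' separators, so a text with three paragraphs yields 2:
import pandas as pd

df = pd.DataFrame({'text': ['one\n\ntwo\n\nthree', 'single paragraph']})
print(count_paragraphs(df))
Output:
[2, 0]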
This is my answer and it works!
def count_paragraphs(df):
    paragraph_count = []
    for i in range(len(df)):
        paragraph_count.append(df.text[i].count('\n\n'))
    return paragraph_count

count_paragraphs(df)
I'm pulling in the data frame using tabula. Unfortunately, the data is arranged in rows as below. I need to take the first 23 rows and use them as column headers for the remainder of the data. I need each row to contain these 23 headers for each of about 60 clinics.
Col \
0 Date
1 Clinic
2 Location
3 Clinic Manager
4 Lease Cost
5 Square Footage
6 Lease Expiration
8 Care Provided
9 # of Providers (Full Time)
10 # FTE's Providing Care
11 # Providers (Part-Time)
12 Patients seen per week
13 Number of patients in rooms per provider
14 Number of patients in waiting room
15 # Exam Rooms
16 Procedure rooms
17 Other rooms
18 Specify other
20 Other data:
21 TI Needs:
23 Conclusion & Recommendation
24 Date
25 Clinic
26 Location
27 Clinic Manager
28 Lease Cost
29 Square Footage
30 Lease Expiration
32 Care Provided
33 # of Providers (Full Time)
34 # FTE's Providing Care
35 # Providers (Part-Time)
36 Patients seen per week
37 Number of patients in rooms per provider
38 Number of patients in waiting room
39 # Exam Rooms
40 Procedure rooms
41 Other rooms
42 Specify other
44 Other data:
45 TI Needs:
47 Conclusion & Recommendation
Val
0 9/13/2017
1 Gray Medical Center
2 1234 E. 164th Ave Thornton CA 12345
3 Jane Doe
4 $23,074.80 Rent, $5,392.88 CAM
5 9,840
6 7/31/2023
8 Family Medicine
9 12
10 14
11 1
12 750
13 4
14 2
15 31
16 1
17 X-Ray, Phlebotomist/blood draw
18 NaN
20 Facilities assistance needed. 50% of business...
21 Paint and Carpet (flooring is in good conditio...
23 Lay out and occupancy flow are good for this p...
24 9/13/2017
25 Main Cardiology
26 12000 Wall St Suite 13 Main CA 12345
27 John Doe
28 $9610.42 Rent, $2,937.33 CAM
29 4,406
30 5/31/2024
32 Cardiology
33 2
34 11, 2 - P.T.
35 2
36 188
37 0
38 2
39 6
40 0
41 1 - Pacemaker, 1 - Treadmill, 1- Echo, 1 - Ech...
42 Nurse Office, MA station, Reading Room, 2 Phys...
44 Occupied in Emerus building. Needs facilities ...
45 New build out, great condition.
47 Practice recently relocated from 84th and Alco...
I was able to get my data frame in a better place by fixing the headers. I'm re-posting the first 3 "groups" of data to better illustrate the structure of the data frame. Everything repeats (headers and values) for each clinic.
Try this:
df2 = pd.DataFrame(df[23:].values.reshape(-1, 23),
                   columns=df[:23][0])
print(df2)
Here the number 23 is the number of header rows per group, which becomes the number of columns in the result df; you can replace it with however many fields each clinic actually has.
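A quick illustration of that reshape trick on a made-up single-column frame with 3 fields per clinic:
import pandas as pd

df = pd.DataFrame({0: ['Date', 'Clinic', 'Manager',
                       '9/13/2017', 'Gray Medical Center', 'Jane Doe',
                       '9/13/2017', 'Main Cardiology', 'John Doe']})

n = 3  # fields per clinic; 23 in the question
df2 = pd.DataFrame(df[n:].values.reshape(-1, n), columns=df[:n][0])
print(df2)
Output:
0       Date               Clinic   Manager
0  9/13/2017  Gray Medical Center  Jane Doe
1  9/13/2017      Main Cardiology  John Doe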