from pandas import DataFrame, Series
import pandas as pd
df
text region
The Five College Region The Five College Region
South Hadley (Mount Holyoke College) South Hadley
Waltham (Bentley University), (Brandeis Univer..) Waltham
The region should be extracted from text.
If the row contains "(", remove anything after "(" and then remove the trailing whitespace.
If the row doesn't contain "(", keep it and copy it to the region column.
I know I can handle this with the str.extract function, but I'm having trouble writing the right regex pattern.
df['Region'] = df['text'].str.extract(r'(.+)\(.*')
This regex pattern cannot extract the first string, because that row has no "(" for the pattern to match.
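For reference, one possible fix is to make the opening parenthesis optional, so rows without one still match (a sketch):
df['Region'] = df['text'].str.extract(r'(.+?)\s*(?:\(|$)', expand=False)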
I also know that using the split function works for this problem:
str.split('(')[0]
But I don't know how to put the result in a column.
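For what it's worth, the split result can be assigned straight to a new column (a sketch; .str.strip() handles the trailing whitespace):
df['Region'] = df['text'].str.split('(').str[0].str.strip()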
I hope to receive answers covering both methods.
option 1
assign + str.split
df.text.str.split(r'\s*\(').str[0]
0 The Five College Region
1 South Hadley
2 Waltham
Name: text, dtype: object
df.assign(region=df.text.str.split(r'\s*\(').str[0])
text region
0 The Five College Region The Five College Region
1 South Hadley (Mount Holyoke College) South Hadley
2 Waltham (Bentley University), (Brandeis Univer..) Waltham
option 2
join + str.extract
df.text.str.extract(r'(?P<region>[^\(]+)\s*\(*', expand=False)
0 The Five College Region
1 South Hadley
2 Waltham
Name: text, dtype: object
df.join(df.text.str.extract(r'(?P<region>[^\(]+)\s*\(*', expand=False))
text region
0 The Five College Region The Five College Region
1 South Hadley (Mount Holyoke College) South Hadley
2 Waltham (Bentley University), (Brandeis Univer..) Waltham
Related
I am using Google Colab and there is a folder called 'examples' containing three txt files.
I am using the following code to read them and convert them to pandas:
import glob
import tqdm
import pandas as pd
dataset_filepaths = glob.glob('examples/*.txt')
for filepath in tqdm.tqdm(dataset_filepaths):
    df = pd.read_csv(filepath)
If you print dataset_filepaths you will see:
['examples/kate_middleton.txt',
'examples/jane_doe.txt',
'examples/daniel_craig.txt']
which is correct. However, df ends up holding only a single document, because the loop overwrites it on every iteration. Could you please let me know how we can create a pandas DataFrame in the following form:
index text
-----------------
0 text0
1 text1
. .
. .
. .
Updated: @Steven Rumbalski, using your code:
dfs = [pd.read_csv(filepath) for filepath in tqdm.tqdm(dataset_filepaths)]
dfs
The output looks like this
[Empty DataFrame
Columns: [Kate Middleton is the wife of Prince William. She is a mother of 3 children; 2 boys and a girl. Kate is educated to university level and that is where she met her future husband. Kate dresses elegantly and is often seen carrying out charity work. However, she is a mum first and foremost and the interactions we see with her children are adorable. Kate’s sister, Pippa, has followed Kate into the public eye. She was born in 1982 and will soon turn 40. When pregnant, Kate suffers from a debilitating illness called Hyperemesis Gravidarum, which was little known about until it was reported that Kate had it.]
Index: [], Empty DataFrame
Columns: [Jane Doe was born in December 1978 and is currently living in London, United Kingdom.]
Index: [], Empty DataFrame
Columns: [He is an English film actor known for playing James Bond in the 007 series of films. Since 2005, he has been playing the character but he confirmed that No Time to Die would be his last James Bond film. He was born in Chester on 2nd of March in 1968. He moved to Liverpool when his parents divorced and lived there until he was sixteen years old. He auditioned and was accepted into the National Youth Theatre and moved down to London. He studied at Guildhall School of Music and Drama. He has appeared in many films.]
Index: []]
How can I convert it into the form that I want?
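The DataFrames come back empty because pd.read_csv treats each file's text as a header row. A minimal sketch of a workaround, assuming each .txt file is a block of free text rather than a real CSV, is to read the files directly and build one DataFrame from the strings:
import glob
import pandas as pd

dataset_filepaths = sorted(glob.glob('examples/*.txt'))
texts = []
for filepath in dataset_filepaths:
    with open(filepath, encoding='utf-8') as f:
        texts.append(f.read())  # one file -> one row of text
df = pd.DataFrame({'text': texts})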
I have a solution below that gives me a new column as a universal identifier, but what if there is additional data in the NAME column? How can I tweak the code below to account for a wildcard-like search term?
Basically I want that if German/german or Mexican/mexican is anywhere in the row value, the new column gets the value Euro or South American.
df["Identifier"] = (df["NAME"].str.lower().replace(
to_replace = ['german', 'mexican'],
value = ['Euro', 'South American']
))
print(df)
NAME Identifier
0 German Euro
1 german Euro
2 Mexican South American
3 mexican South American
Desired output
NAME Identifier
0 1990 German Euro
1 german 1998 Euro
2 country Mexican South American
3 mexican city 2006 South American
Based on an answer in this post:
r = '(german|mexican)'
c = dict(german='Euro', mexican='South American')
df['Identifier'] = df['NAME'].str.lower().str.extract(r, expand=False).map(c)
Another approach would be using np.where with those two conditions, but probably there is a more elegant solution.
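For illustration, the np.where variant might look like this (a sketch; np.where is binary, so the two conditions have to be nested):
import numpy as np

lower = df['NAME'].str.lower()
df['Identifier'] = np.where(lower.str.contains('german'), 'Euro',
                            np.where(lower.str.contains('mexican'), 'South American', None))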
The code below will work. I tried it using the apply function but somehow couldn't get it working; probably I will in some time. Meanwhile, workable code below:
df3['identifier'] = ''
js_ref = [{'german': 'Euro'}, {'mexican': 'South American'}]
for i in range(len(df3)):
    for mapping in js_ref:
        for k, v in mapping.items():
            if k.lower() in df3.name[i].lower():
                df3.loc[i, 'identifier'] = v  # .loc avoids chained assignment
                break
So I have three pandas DataFrames (train_original, train_augmented, test). Overall it is about 700k lines. And I would like to remove all cities in a cities list, common_cities, from them. But tqdm in the notebook cell suggests that it would take about 24 hrs to replace all of them from a list of 33000 cities.
dataframe example (train_original):
   id  name_1                            name_2
   0   sun blinds decoration paris inc.  indl de cuautitlan sa cv
   1   eih ltd. dongguan wei shi         plastic new york product co., ltd.
   2   jsh ltd. (hk) mexico city         arab shipbuilding seoul and repair yard madrid c
common_cities list example
common_cities = ['moscow', 'madrid', 'san francisco', 'mexico city']
What the output is supposed to be:
   id  name_1                      name_2
   0   sun blinds decoration inc.  indl de sa cv
   1   eih ltd. wei shi            plastic product co., ltd.
   2   jsh ltd. (hk)               arab shipbuilding and repair yard c
My solution in such a case worked well on a small filter-word list, but when the list is large, the performance is low.
%%time
import re
from tqdm import tqdm
for city in tqdm(common_cities):
    train_original.replace(re.compile(fr'\b({city})\b'), '', inplace=True)
    train_augmented.replace(re.compile(fr'\b({city})\b'), '', inplace=True)
    test.replace(re.compile(fr'\b({city})\b'), '', inplace=True)
P.S.: I presume it's not great to use a list comprehension to split the string and substitute city names, because a city name could be more than two words.
Any suggestions, ideas on approach to make a quick replacement on Pandas Dataframes in such situations?
Instead of iterating over the huge DataFrames once for each city, remember that pandas replace accepts dictionaries with all the replacements to be done in a single go.
Therefore we can start by creating the dictionary and then using it with replace:
# Word-bounded regex keys; regex=True makes replace match substrings
# inside the cells rather than whole cell values.
replacements = {fr'\b{x}\b': '' for x in common_cities}
train_original = train_original.replace(replacements, regex=True)
train_augmented = train_augmented.replace(replacements, regex=True)
test = test.replace(replacements, regex=True)
Edit: Reading the documentation, it might be even easier, because replace also accepts lists of values to be replaced (regex=True is still needed so the cities match as substrings):
train_original = train_original.replace(common_cities, '', regex=True)
train_augmented = train_augmented.replace(common_cities, '', regex=True)
test = test.replace(common_cities, '', regex=True)
I am trying to match every cell value of a pandas column against a particular string; how do I check that?
There is one DataFrame and one string. I want to search the entire df column inside the string, and it should return the matching elements from the column.
I am looking for a solution like this in MySQL:
select * from table where "string" like CONCAT('%',columnname,'%')
Dataframe:
area office_type
0 c d a (o) S.O
1 dr.b.a. chowk S.O
2 ghorpuri bazar S.O
3 n.w. college S.O
4 pune cantt east S.O
5 pune H.O
6 pune new bazar S.O
7 sachapir street S.O
Code:
tmp_df = my_df_main[my_df_main['area'].str.contains("asasa sdsd sachapir street sdsds ffff")]
In the above example, "sachapir street" is in the pandas column area and it is also in the string, so it should return "sachapir street" as the matching word.
I know it should work like a reverse contains; I tried code like:
tmp_df=my_df_main["asasa sdsd sachapir street sdsds ffff".str.contains(my_df_main['area'])]
Any idea how to do that?
Finally I did this using import pandasql as ps:
query = "SELECT area,office_type FROM my_df_main where 'asasa sdsd sachapir street sdsds ffff' like '%'||area||'%'"
tmp_df = ps.sqldf(query, locals())
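For comparison, the same reverse match can be written in plain pandas (a sketch, assuming a simple substring test against the search string is enough):
search = "asasa sdsd sachapir street sdsds ffff"
tmp_df = my_df_main[my_df_main['area'].apply(lambda area: area in search)]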
The first line of the Excel file contains words with a \n character in each cell.
eg:
Month "East North Central\n(NSA)" "East North Central\n(SA)" "East South Central\n(NSA)"
So while converting to csv using this code:
data_xls = pd.read_excel('/home/scripts/usless/HP_PO_hist.xls', 'sheet1', index_col=4, skiprows=3)
data_xls.to_csv('HH_PO_output.csv', encoding='utf-8')
It converts the chars after \n into new lines like:
,Month,"East North Central
(NSA)","East North Central
(SA)","East South Central
(NSA)","East South Central
But the expected output is like:
Month East North Central (NSA) East North Central (SA) East South Central (NSA) East South Central (SA)
How can I remove this \n character only from the header line while converting to csv in Python?
I used the following dummy data frame:
import pandas as pd
columns=["Month", "East North Central\n(NSA)", "East North Central\n(SA)", "East South Central\n(NSA)"]
df = pd.DataFrame(columns=columns)
When exporting to csv via df.to_csv I get the same line break behavior (pandas 0.19.2):
,Month,"East North Central
(NSA)","East North Central
(SA)","East South Central
(NSA)"
One solution for this is to simply replace the \n with a whitespace like this:
df.columns = df.columns.str.replace("\n", " ")
This provides the desired result:
,Month,East North Central (NSA),East North Central (SA),East South Central (NSA)
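Applied to the original conversion code, the cleanup slots in between reading and writing (a sketch reusing the paths from the question):
data_xls = pd.read_excel('/home/scripts/usless/HP_PO_hist.xls', 'sheet1', index_col=4, skiprows=3)
data_xls.columns = data_xls.columns.str.replace('\n', ' ')  # flatten header cells
data_xls.to_csv('HH_PO_output.csv', encoding='utf-8')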