I have a dataframe that has one column only, like the following (a minimal example):
import pandas as pd
dataframe = pd.DataFrame({'text': ['##weather', 'how is today?', 'we go out', '##rain',
                                   'my day is rainy', 'I am not feeling well', 'rainy blues',
                                   '##flower', 'the blue flower', 'she likes red',
                                   'this flower is nice']})
I would like to add a second column called 'id' that increments every time a row contains '##'. So my desired output would be:
                     text   id
0               ##weather  100
1           how is today?  100
2               we go out  100
3                  ##rain  101
4         my day is rainy  101
5   I am not feeling well  101
6             rainy blues  101
7                ##flower  102
8         the blue flower  102
9           she likes red  102
10    this flower is nice  102
So far I have done the following, which does not return the right output:
dataframe['id'] = 100
dataframe.loc[dataframe['text'].str.contains('## intent:'), 'id'] += 1
You can try groupby with ngroup:
# each '##' row starts a new block; the cumulative sum gives every block its own key
m = dataframe['text'].str.contains('##').cumsum()
# ngroup numbers the blocks 0, 1, 2, ...; add 100 so the ids start at 100
dataframe['id'] = dataframe.groupby(m).ngroup() + 100
print(dataframe)
                     text   id
0               ##weather  100
1           how is today?  100
2               we go out  100
3                  ##rain  101
4         my day is rainy  101
5   I am not feeling well  101
6             rainy blues  101
7                ##flower  102
8         the blue flower  102
9           she likes red  102
10    this flower is nice  102
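Since the '##' blocks appear in order, the cumulative sum is itself an increasing id, so you can skip the groupby entirely (a minimal equivalent sketch, assuming the first row is a '##' header):
dataframe['id'] = dataframe['text'].str.contains('##').cumsum() + 99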
I have extracted some data online and I would like to reverse the first column order.
Here is my code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
data = []
soup = BeautifulSoup(requests.get("https://kworb.net/spotify/country/us_weekly.html").content, 'html.parser')
for e in soup.select('#spotifyweekly tr:has(td)'):
    data.append({
        'Frequency': e.td.text,
        'Artists': e.a.text,
        'Songs': e.a.find_next_sibling('a').text
    })
data2 = data[:100]
print(data2)
data = pd.DataFrame(data2).to_excel('Kworb_Weekly.xlsx', index = False)
And here is my output (screenshot): https://i.stack.imgur.com/TmGmI.png
I've used [::-1], but it reversed all the columns, and I only want to reverse the first column.
Your first column is 'Frequency', so you can get that column from the data frame and slice on both sides of the assignment:
data = pd.DataFrame(data2)
print(data)
data['Frequency'][::1] = data['Frequency'][::-1]
print(data)
Got this as the output:
Frequency Artists Songs
0 1 SZA Kill Bill
1 2 PinkPantheress Boy's a liar Pt. 2
2 3 Miley Cyrus Flowers
3 4 Morgan Wallen Last Night
4 5 Lil Uzi Vert Just Wanna Rock
.. ... ... ...
95 96 Lizzo Special
96 97 Glass Animals Heat Waves
97 98 Frank Ocean Pink + White
98 99 Foo Fighters Everlong
99 100 Meghan Trainor Made You Look
[100 rows x 3 columns]
Frequency Artists Songs
0 100 SZA Kill Bill
1 99 PinkPantheress Boy's a liar Pt. 2
2 98 Miley Cyrus Flowers
3 97 Morgan Wallen Last Night
4 96 Lil Uzi Vert Just Wanna Rock
.. ... ... ...
95 5 Lizzo Special
96 4 Glass Animals Heat Waves
97 3 Frank Ocean Pink + White
98 2 Foo Fighters Everlong
99 1 Meghan Trainor Made You Look
[100 rows x 3 columns]
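If you prefer to avoid the chained slice assignment (newer pandas versions may warn about it), a minimal equivalent sketch is:
data['Frequency'] = data['Frequency'].to_numpy()[::-1]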
How can I remove only the first 2 characters in a string that starts with '11'?
My df:
Product1 Id
0 Waterproof Liner 114890
1 Phone Tripod 981150
2 Waterproof Pants 0
3 baby Kids play Mat 1198547
4 Hiking BACKPACKS 113114
5 security Camera 111160
Product1 object
Id object
dtype: object
Expected output:
Product1 Id
0 Waterproof Liner 4890
1 Phone Tripod 981150
2 Waterproof Pants 0
3 baby Kids play Mat 98547
4 Hiking BACKPACKS 3114
5 security Camera 1160
I wrote this:
df1['Id'] = df1['Id'].str.replace("11","")
But I got this output:
Product1 Id
0 Waterproof Liner 4890
1 Phone Tripod 9850
2 Waterproof Pants 0
3 baby Kids play Mat 98547
4 Hiking BACKPACKS 34
5 security Camera 60
Force the match to the beginning of the string (recent pandas versions need regex=True here, since str.replace no longer treats the pattern as a regex by default):
df1['Id'] = df1['Id'].str.replace("^11", "", regex=True)
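Alternatively, assuming pandas >= 1.4, str.removeprefix does the same without a regex (a sketch):
df1['Id'] = df1['Id'].str.removeprefix('11')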
I want to compare this dataframe df1:
Product Price
0 Waterproof Liner 40
1 Phone Tripod 50
2 Waterproof Pants 0
3 baby Kids play Mat 985
4 Hiking BACKPACKS 34
5 security Camera 160
with df2 as shown below:
Product Id
0 Home Security IP Camera 508760
1 Hiking Backpacks – Spring Products 287950
2 Waterproof Eyebrow Liner 678897
3 Waterproof Pants – Winter Product 987340
4 Baby Kids Water Play Mat – Summer Product 111500
I want to compare the Product column in df1 with the Product column in df2, in order to find the correct id for each product. And if the similarity is < 80, it should put 'Remove' in the Id field.
NB: the text of the Product column in df1 and df2 is not a 100% match.
Can anyone help me with this, or how can I use fuzzywuzzy to get the correct id?
Here is my code:
import pandas as pd
from fuzzywuzzy import process
data1 = {'Product1': ['Waterproof Liner','Phone Tripod','Waterproof Pants','baby Kids play Mat','Hiking BACKPACKS','security Camera'],
'Price':[40,50,0,985,34,160]}
data2 = {'Product2': ['Home Security IP Camera','Hiking Backpacks – Spring Products','Waterproof Eyebrow Liner',
'Waterproof Pants – Winter Product','Baby Kids Water Play Mat – Summer Product'],
'Id': [508760,287950,678897,987340,111500],}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
dfm = pd.DataFrame(df1["Product1"].apply(lambda x: process.extractOne(x, df2["Product2"]))
.tolist(), columns=['Product1',"match_comp", "Id"])
What I got:
Product1 match_comp Id
0 Waterproof Eyebrow Liner 86 2
1 Waterproof Eyebrow Liner 50 2
2 Waterproof Pants – Winter Product 90 3
3 Baby Kids Water Play Mat – Summer Product 86 4
4 Hiking Backpacks – Spring Products 90 1
5 Home Security IP Camera 86 0
What is expected:
Product Price ID
0 Waterproof Liner 40 678897
1 Phone Tripod 50 Remove
2 Waterproof Pants 0 987340
3 baby Kids play Mat 985 111500
4 Hiking BACKPACKS 34 287950
5 security Camera 160 508760
You can make a wrapper function:
def extract(s):
    # extractOne against a Series returns (matched value, score, index)
    name, score, _ = process.extractOne(s, df2["Product2"], score_cutoff=0)
    if score < 80:
        return 'Remove'
    return df2.set_index('Product2').loc[name, 'Id']

df1['ID'] = df1["Product1"].apply(extract)
output:
Product1 Price ID
0 Waterproof Liner 40 678897
1 Phone Tripod 50 Remove
2 Waterproof Pants 0 987340
3 baby Kids play Mat 985 111500
4 Hiking BACKPACKS 34 287950
5 security Camera 160 508760
NB: the output is not exactly what you expect; you would have to explain why rows 4/5 should be dropped.
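As a side note, df2.set_index('Product2') is rebuilt on every row above; a minimal refinement (the id_by_product name is my own) builds the lookup once:
id_by_product = df2.set_index('Product2')['Id']

def extract(s):
    name, score, _ = process.extractOne(s, df2["Product2"], score_cutoff=0)
    return 'Remove' if score < 80 else id_by_product[name]

df1['ID'] = df1["Product1"].apply(extract)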
I want to merge two dataframes by partial string match.
I have two data frames to combine. The first, df1, consists of 130,000 rows like this:
id text xc1 xc2
1 adidas men shoes 52465 220
2 vakko men suits 49220 224
3 burberry men shirt 78248 289
4 prada women shoes 45780 789
5 lcwaikiki men sunglasses 34788 745
and the second, df2, consists of 8,000 rows like this:
id keyword abc1 abc2
1 men shoes 1000 11
2 men suits 2000 12
3 men shirt 3000 13
4 women socks 4000 14
5 men sunglasses 5000 15
After matching keyword against text, the output should look like this:
id text xc1 xc2 keyword abc1 abc2
1 adidas men shoes 52465 220 men shoes 1000 11
2 vakko men suits 49220 224 men suits 2000 12
3 burberry men shirt 78248 289 men shirt 3000 13
4 lcwaikiki men sunglasses 34788 745 men sunglasses 5000 15
Let's approach this by cross joining the 2 dataframes and then filtering by matching string with substring, as follows:
import re

df3 = df1.merge(df2, how='cross')  # for Pandas version >= 1.2.0 (released in Dec 2020)
# keep only the rows where keyword occurs in text as whole words
mask = df3.apply(lambda x: re.search(rf"\b{x['keyword']}\b", str(x['text'])) is not None, axis=1)
df_out = df3.loc[mask]
If your Pandas version is older than 1.2.0 (released in Dec 2020) and does not support merge with how='cross', you can replace the merge statement with:
# For Pandas version < 1.2.0
df3 = df1.assign(key=1).merge(df2.assign(key=1), on='key').drop('key', axis=1)
After the cross join, we create a boolean mask to filter for the cases where keyword is found within text, using re.search within .apply().
We have to use re.search instead of the simple Python substring test stringA in stringB found in most similar StackOverflow answers: such a test gives a false match, because 'men suits' in 'women suits' returns True.
We use regex with a pair of word boundary \b meta-characters around the keyword (regex pattern: rf"\b{x['keyword']}\b") to ensure matching only for whole word match for text in df1, i.e. men suits in df2 would not match with women suits in df1 since the word women does not have a word boundary between the letters wo and men.
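One caveat (my addition): if a keyword can contain regex metacharacters such as '+' or '(', escape it before building the pattern:
mask = df3.apply(lambda x: re.search(rf"\b{re.escape(x['keyword'])}\b", str(x['text'])) is not None, axis=1)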
Result:
print(df_out)
id_x text xc1 xc2 id_y keyword abc1 abc2
0 1 adidas men shoes 52465 220 1 men shoes 1000 11
6 2 vakko men suits 49220 224 2 men suits 2000 12
12 3 burberry men shirt 78248 289 3 men shirt 3000 13
24 5 lcwaikiki men sunglasses 34788 745 5 men sunglasses 5000 15
Here, columns id_x and id_y are the original id column in df1 and df2 respectively. As seen from the comment, these are just row numbers of the dataframes that you may not care about. We can then remove these 2 columns and reset index to clean up the layout:
df_out = df_out.drop(['id_x', 'id_y'], axis=1).reset_index(drop=True)
Final outcome
print(df_out)
text xc1 xc2 keyword abc1 abc2
0 adidas men shoes 52465 220 men shoes 1000 11
1 vakko men suits 49220 224 men suits 2000 12
2 burberry men shirt 78248 289 men shirt 3000 13
3 lcwaikiki men sunglasses 34788 745 men sunglasses 5000 15
Let's start by ordering the keywords longest-first, so that "women suits" matches before "men suits":
lkeys = df2.keyword.reindex(df2.keyword.str.len().sort_values(ascending=False).index)
Now define a matching function; each text value from df1 will be passed as s to find a matching keyword:
def is_match(arr, s):
    # return the first (longest) keyword found in s, or None if nothing matches
    for a in arr:
        if a in s:
            return a
    return None
Now we can extract the keyword from each text in df1, and add it to a new column:
df1['keyword'] = df1['text'].apply(lambda x: is_match(lkeys, x))
We now have everything we need for a standard merge:
pd.merge(df1, df2, on='keyword')
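Note (my addition): rows whose text matches no keyword get None in the new column and are silently dropped by the default inner join, which is what the desired output shows; use pd.merge(df1, df2, on='keyword', how='left') to keep them instead.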
Suppose I have a DataFrame like so:
Item Check Date Inventory
Apple 1/1/2020 50
Banana 1/1/2020 80
Apple 1/2/2020 75
Banana 1/2/2020 300
Apple 2/1/2020 100
Apple 2/2/2020 98
Banana 2/2/2020 341
Apple 2/3/2020 95
Banana 2/3/2020 328
Apple 2/4/2020 90
Apple 2/5/2020 85
Banana 2/5/2020 325
I want to find the average rate of change in the inventory for a given item starting from the max inventory count, then use that to compute what day the inventory will reach zero.
So for apples it would be, starting from 2/1: (2 + 3 + 5 + 5) / 4 = 3.75; similarly for bananas, starting from 2/2: (13 + 3) / 2 = 8.
Since there are different items, I have used:
apples = df[df["Item"] == "Apple"]
to get a dataframe for just the apples, then used:
apples["Inventory"].idxmax()
to find the row with the max inventory count.
However, this gives me the row label of the row for the original dataframe. So I'm not sure where to go from here since my plan was to then get the date off the row with the max inventory count, then ignore any dates before that.
You can still use idxmax, but with transform:
# keep, for each item, only the rows at or after that item's inventory peak
s = df[df.index >= df.groupby('Item').Inventory.transform('idxmax')]
# average day-over-day drop from the peak (diff is negative, so negate it)
out = s.groupby('Item')['Inventory'].apply(lambda x: -x.diff().mean())
Item
Apple 3.75
Banana 8.00
Name: Inventory, dtype: float64
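To get the day the inventory reaches zero, a minimal follow-up sketch (my addition; the days_to_zero and zero_date names are my own, assuming the rows are in date order and the decline continues linearly at the average rate):
# latest recorded inventory per item
last = df.groupby('Item')['Inventory'].last()
# days until zero at the average daily drop, e.g. Apple: 85 / 3.75 ≈ 22.7
days_to_zero = last / out
# convert to a calendar date from the latest check date
zero_date = pd.to_datetime(df.groupby('Item')['Check Date'].last()) + pd.to_timedelta(days_to_zero, unit='D')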