How to check if a string contains any word from a dataframe column - Python

I am trying to check a particular string against all the cell values of a pandas column. How do I do that?
There is one dataframe and one string; I want to search the entire df column against the string, and it should return the matching elements from the column.
I am looking for a solution like this one in MySQL:
select * from table where "string" like CONCAT('%',columnname,'%')
Dataframe:
area office_type
0 c d a (o) S.O
1 dr.b.a. chowk S.O
2 ghorpuri bazar S.O
3 n.w. college S.O
4 pune cantt east S.O
5 pune H.O
6 pune new bazar S.O
7 sachapir street S.O
Code:
tmp_df=my_df_main[my_df_main['area'].str.contains("asasa sdsd sachapir street sdsds ffff")]
In the above example, "sachapir street" is a value in the pandas column area and it also occurs in the string, so it should return "sachapir street" as the matching word.
I know it should work like a reverse contains, so I tried code like
tmp_df=my_df_main["asasa sdsd sachapir street sdsds ffff".str.contains(my_df_main['area'])]
Any idea how to do that?

Finally, I did this using "import pandasql as ps":
query = "SELECT area,office_type FROM my_df_main where 'asasa sdsd sachapir street sdsds ffff' like '%'||area||'%'"
tmp_df = ps.sqldf(query, locals())
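For reference, a plain-pandas alternative (a sketch, assuming the same my_df_main and search string as above) is to test each area value for membership in the string:
import pandas as pd

my_df_main = pd.DataFrame({
    "area": ["c d a (o)", "dr.b.a. chowk", "sachapir street"],
    "office_type": ["S.O", "S.O", "S.O"],
})
search = "asasa sdsd sachapir street sdsds ffff"

# Keep rows whose 'area' value occurs as a substring of the search string
tmp_df = my_df_main[my_df_main["area"].apply(lambda a: a in search)]
print(tmp_df)  # -> 2  sachapir street  S.O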

Related

Create new column based on value of another column

I have a solution below that gives me a new column as a universal identifier, but what if there is additional data in the NAME column? How can I tweak the code below to account for a wildcard-like search term?
Basically, if German/german or Mexican/mexican appears anywhere in the row value, the new column should be given the value Euro or South American respectively.
df["Identifier"] = (df["NAME"].str.lower().replace(
to_replace = ['german', 'mexican'],
value = ['Euro', 'South American']
))
print(df)
NAME Identifier
0 German Euro
1 german Euro
2 Mexican South American
3 mexican South American
Desired output
NAME Identifier
0 1990 German Euro
1 german 1998 Euro
2 country Mexican South American
3 mexican city 2006 South American
Based on an answer in this post:
r = '(german|mexican)'
c = dict(german='Euro', mexican='South American')
df['Identifier'] = df['NAME'].str.lower().str.extract(r, expand=False).map(c)
Another approach would be using np.where with those two conditions, but there is probably a more elegant solution.
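For completeness, here is a sketch of that approach using np.select, which generalizes np.where to multiple conditions (assuming the df from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({"NAME": ["1990 German", "german 1998", "country Mexican", "mexican city 2006"]})
name = df["NAME"].str.lower()
# Each condition maps positionally to its choice; rows matching neither get None
df["Identifier"] = np.select(
    [name.str.contains("german"), name.str.contains("mexican")],
    ["Euro", "South American"],
    default=None,
)
print(df)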
The code below will work. I tried it using the apply function but somehow couldn't get that working; I'll probably figure it out at some point. Meanwhile, workable code is below:
df3['identifier'] = ''
js_ref = [{'german': 'Euro'}, {'mexican': 'South American'}]
for i in range(len(df3)):
    for l in js_ref:
        for k, v in l.items():
            if k.lower() in df3.name[i].lower():
                df3.identifier[i] = v
                break

Find the city with highest number of amenities

I am currently trying to crack a programming puzzle that has the very simple dataframe host with two columns named city and amenities (both object dtype). Entries in both columns can be repeated multiple times. Below are the first few entries of host:
City Amenities Price($)
NYC {TV,"Wireless Internet","Air conditioning","Smoke detector",Essentials,"Lock on bedroom door"} 8
LA {"Wireless Internet",Kitchen,Washer,Dryer,"First aid kit",Essentials,"Hair dryer","translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"} 10
SF {TV,"Cable TV",Internet,"Wireless Internet",Kitchen,"Free parking on premises","Pets live on this property",Dog(s),"Indoor fireplace","Buzzer/wireless intercom",Heating,Washer,Dryer,"Smoke detector","Carbon monoxide detector","First aid kit","Safety card","Fire extinguisher",Essentials,Shampoo,"24-hour check-in",Hangers,"Hair dryer",Iron,"Laptop friendly workspace","translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50","Self Check-In",Lockbox} 15
NYC {"Wireless Internet","Air conditioning",Kitchen,Heating,"Suitable for events","Smoke detector","Carbon monoxide detector","First aid kit","Fire extinguisher",Essentials,Shampoo,"Lock on bedroom door",Hangers,"translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"} 20
LA {TV,Internet,"Wireless Internet","Air conditioning",Kitchen,"Free parking on premises",Essentials,Shampoo,"translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"}
LA {TV,"Cable TV",Internet,"Wireless Internet",Pool,Kitchen,"Free parking on premises",Gym,Breakfast,"Hot tub","Indoor fireplace",Heating,"Family/kid friendly",Washer,Dryer,"Smoke detector","Carbon monoxide detector",Essentials,Shampoo,"Lock on bedroom door",Hangers,"Private entrance"} 28
.....
Question. Output the city with the highest number of amenities.
My attempt. I tried using the groupby() function to group on the city column with host.groupby('city'). Now I need to count the number of elements in each set of Amenities. The len() function did not work because each entry is stored as one long string rather than a set. For example, host['amenities'][0] is "{TV,\"Wireless Internet\",\"Air conditioning\",\"Smoke detector\",\"Carbon monoxide detector\",Essentials,\"Lock on bedroom door\",Hangers,Iron}", and applying len() to this output results in 134 (the number of characters), which is clearly incorrect. I tried host['amenities'][0].strip('\n') to remove the backslashes, but the len() function still gives 134.
Can anyone please help me crack this problem?
My solution, inspired by ddejohn's solution:
### Transform each "string-type" entry in column "amenities" to "list" type
host["amenities"] = host["amenities"].str.replace('["{}]', "", regex=True).str.split(",")
## Create a new column that counts all the amenities for each row entry
host["am_count"] = [len(data) for data in host["amenities"]]
## Output the index resulting from aggregating `am_count` grouped by `city`
host.groupby("city")["am_count"].agg("sum").argmax()
Solution
import functools
# Process the Amenities strings into sets of strings
host["amenities"] = host["amenities"].str.replace('["{}]', "", regex=True).str.split(",").apply(set)
# Groupby city, perform the set union to remove duplicates, and get count of unique amenities
amenities_by_city = host.groupby("city")["amenities"].apply(lambda x: len(functools.reduce(set.union, x))).reset_index()
Output:
city amenities
0 LA 27
1 NYC 17
2 SF 29
Getting the city with the max number of amenities is achieved with
city_with_most_amenities = amenities_by_city.query("amenities == amenities.max()")
Output:
city amenities
2 SF 29
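Alternatively, if only the city name is needed, an idxmax-based variant (a sketch reusing the amenities_by_city frame from above) avoids the query:
# idxmax returns the index label (here the city) of the maximum value
top_city = amenities_by_city.set_index("city")["amenities"].idxmax()
print(top_city)  # SF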

Compare values between 2 dataframes and transform data

The main aim of this script is to compare the regex format of the data present in the csv with the official ZIP Code regex format for that country, and if the format does not match, the script would carry out transformations on said data and output it all in one final dataframe.
I have 2 csv files, one (countries.csv) containing the following columns and data examples:
INPUT:
Contact ID  Country  Zip Code
1           USA      71293
2           Italy    IT 2310219
and another csv (Regex.csv) with the following data examples:
Country  Regex format
USA      [0-9]{5}(?:-[0-9]{4})?
Italy    \d{5}
Now, the first csv has some 35k records, so I would like to create a function which loops through Regex.csv (as a dataframe) to grab the country column and the regex format. Then it would loop through the country list to grab every instance where regex['country'] == countries['country'], and apply the regex transformation to the zip codes for that country.
So far I have this function, but I can't get it to work.
def REGI(dframe):
    dframe = pd.DataFrame().reindex_like(contacts)
    cols = list(contacts.columns)
    for index, row in mergeOne.iterrows():
        country = row['Country']
        reg = row[r'regex']
        for i, r in contactsS.iterrows():
            if (r['Country of Residence'] == country or r['Country of Residence.1'] == country or r['Mailing Country (text only)'] == country or r['Other Country (text only)'] == country):
                dframe.loc[i] = r
                dframe['Mailing Zip/Postal Code'] = dframe['Mailing Zip/Postal Code'].apply(str).str.extractall(reg).unstack().apply(lambda x: ','.join(x.dropna()), axis=1)
    contacts.loc[contacts['Contact ID'].isin(dframe['Contact ID']), cols] = dframe[cols]
    dframe = dframe.dropna(how='all')
    return dframe
['Contact ID'] is being used as an identifier column.
The second for loop works on its own; however, without the first for loop I would need to manually re-type a new dataframe name, regex format, and country name each time.
At the moment I am getting the following error:
ValueError: pattern contains no capture groups
(I removed some columns to mimic the example given above.)
If I paste the results into a new dataframe, it returns the following:
Example as text:
Account ID  Country         Zip/Postal Code
1           United Kingdom  WV9 5BT
2           Ireland         D24 EO29
3           Latvia          1009
4           United Kingdom  EN6 1JE
5           Italy           22010
REGEX table:
Country         Regex
United Kingdom  ([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})
Latvia          [L]{1}[V]{1}-{4}
Ireland         STRNG_LTN_EXT_255
Italy           \d{5}
United Kingdom regex:
([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})
Based on your response to my comment, I would suggest directly fixing the zip codes using your regexes:
df3 = df2.set_index('Country')
df1['corrected_Zip'] = (df1.groupby('Country')['Zip Code']
                           .apply(lambda x: x.str.extract('(%s)' % df3.loc[x.name, 'Regex format']))
                        )
df1
This groups by country, applies the regex for that country, and extracts the value.
output:
Contact ID Country Zip Code corrected_Zip
0 1 USA 71293 71293
1 2 Italy IT 2310219 23102
NB. If you want, you can directly overwrite Zip Code by doing df1['Zip Code'] = …
NB2. This will work only if all countries have an entry in df2; if this is not the case, you need to add a check for that (let me know).
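A minimal sketch of such a check, assuming df1 and df3 as defined above:
# Countries present in df1 but missing a regex entry in df3
missing = set(df1['Country']) - set(df3.index)
if missing:
    print('No regex entry for:', missing)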
NB3. If you want to know which rows had an invalid zip, you can fetch them using:
df1[df1['Zip Code']!=df1['corrected_Zip']]

pandas row manipulation - if startswith keyword found, append row to end of previous row

I have a question regarding text-file handling. My text file prints as one column. The column has data scattered throughout the rows and visually looks great and somewhat uniform; however, it is still just one column. Ultimately, I'd like to append each row where a keyword is found to the end of the previous row, until the data is one long row. Then I'll use str.split() to cut sections into columns as I need.
In Excel (code below, top) I took this same text file, removed headers, aligned left, and performed searches for keywords. When found, Excel has a nice feature called offset where you can place or append the cell value basically anywhere using offset(x,y).value from the active-cell start position. Once done, I would delete the row. This allowed me to get the data into a tabular column format that I could work with.
What I Need:
The Python code below will cycle down through each row looking for the keyword 'Address:'. This part of the code works. Once it finds the keyword, the next line should append that row to the end of the previous row. This is where my problem is: I cannot find a way to get the active row number into a variable so I can use it in place of the word [index] for the active row, or [index-1] for the previous row.
Excel Code of similar task
Do
Set Rng = WorkRng.Find("Address", LookIn:=xlValues)
If Not Rng Is Nothing Then
Rng.Offset(-1, 2).Value = Rng.Value
Rng.Value = ""
End If
Loop While Not Rng Is Nothing
Python Equivalent
import pandas as pd
from pandas import DataFrame, Series

file = {'Test': ['Last Name: Nobody', 'First Name: Tommy', 'Address: 1234 West Juniper St.',
                 'Fav Toy', 'Notes', 'Time Slot']}
df = pd.DataFrame(file)
Test
0 Last Name: Nobody
1 First Name: Tommy
2 Address: 1234 West Juniper St.
3 Fav Toy
4 Notes
5 Time Slot
I've tried the following:
for line in df.Test:
    if line.startswith('Address:'):
        # The line below does not work with the index statement
        df.loc[[index-1], :].values = df.loc[index-1].values + ' ' + df.loc[index].values
    else:
        pass

# df.loc[[1],:] = df.loc[1].values + ' ' + df.loc[2].values  # copies row 2 to the end of row 1,
#                                                            # works with static row numbers only
# df.drop([2,0], inplace=True)  # deletes row from df
Expected output:
Test
0 Last Name: Nobody
1 First Name: Tommy Address: 1234 West Juniper St.
2 Address: 1234 West Juniper St.
3 Fav Toy
4 Notes
5 Time Slot
I am trying to wrap my head around the whole series-vectorization approach but am still stuck trying loops that I'm semi-familiar with. If there is a way to achieve this, please point me in the right direction.
As always, I appreciate your time and your knowledge. Please let me know if you can help with this issue.
Thank you,
Use Series.shift on Test, then use Series.str.startswith to create a boolean mask, then use boolean indexing with this mask to update the values in the Test column:
s = df['Test'].shift(-1)
m = s.str.startswith('Address', na=False)
df.loc[m, 'Test'] += (' ' + s[m])
Result:
Test
0 Last Name: Nobody
1 First Name: Tommy Address: 1234 West Juniper St.
2 Address: 1234 West Juniper St.
3 Fav Toy
4 Notes
5 Time Slot
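If, as in the Excel workflow, the appended 'Address:' rows should then be removed, one possible follow-up (a sketch reusing the mask m from above) is:
# m flags rows whose *next* row starts with 'Address', so shifting m down
# by one flags the 'Address' rows themselves; keep everything else
df = df[~m.shift(1, fill_value=False)].reset_index(drop=True)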

Extracting Information from a complex string

Hi, I have the following two columns in a pandas array. As you can see, the second column holds a lot of information. As far as I understand, it is some form of a "list", but with double instead of single quotation marks.
Customer Name Details
Jacob "[{""Name"":""Phone"",""Value"":""03477444556""},{""Name"":""Type"",""Value"":""Apartment""},{""Name"":""No - Name"",""Value"":""1210""},{""Name"":""Apartment N me"",""Value"":""Khudadaad Height  E-11\/1""},{""Name"":""Street"",""Value"":null},{""Name"":""Sector"",""Value"":null},{""Name"":""Landmark"",""Value"":null}]"
John "[{""Name"":""Phone"",""Value"":""03477444550""},{""Name"":""Type"",""Value"":null},{""Name"":""No - Name"",""Value"":""10""},{""Name"":""Apartment Name"",""Val e"":""Khudadaad Height  E-11\/1""},{""Name"":""Street"",""Value"":null},{""Name"":""Sector"",""Value"":null},{""Name"":""Landmark"",""Value"":null}]"
Smith "[{""Name"":""Phone"",""Value"":""03475649292""},{""Name"":""Type"",""Value"":""House""},{""Name"":""No - Name"",""Value"":""1 a""},{""Name"":""Apartment Name"" ""Value"":null},{""Name"":""Street"",""Value"":null},{""Name"":""Sector"",""Value"":""f 7 3""},{""Name"":""Landmark"",""Value"":null}]"
Adam "[{""Name"":""Phone"",""Value"":""03466700079""},{""Name"":""Type"",""Value"":""Office""},{""Name"":""No - Name"",""Value"":""ptcl head quarter""},{""Name"":""A artment Name"",""Value"":null},{""Name"":""Street"",""Value"":null},{""Name"":""Sector"",""Value"":""g\/8\/4""},{""Name"":""Landmark"",""Value"":null}]"
Carlos "[{""Name"":""Phone"",""Value"":""03466700079""},{""Name"":""Type"",""Value"":""Office""},{""Name"":""No - Name"",""Value"":""ptcl head quarter""},{""Name"":""A artment Name"",""Value"":null},{""Name"":""Street"",""Value"":null},{""Name"":""Sector"",""Value"":""g\/8\/4""},{""Name"":""Landmark"",""Value"":null}]"
Ali "[{""Name"":""Phone"",""Value"":""03465403134""},{""Name"":""Type"",""Value"":""House""},{""Name"":""No - Name"",""Value"":""55-B ""},{""Name"":""Apartment Name ",""Value"":null},{""Name"":""Street"",""Value"":""21""},{""Name"":""Sector"",""Value"":""F 10\/2""},{""Name"":""Landmark"",""Value"":null}]"
Anyhow, this is how I'm interpreting this information: it contains seven different rows for each customer, with each row containing four different values. So for the first customer, Jacob, the value in row 1, column 4 is "03477444556"; this row contains each customer's phone number. Similarly, for each customer, row 3, column 4 contains their location.
I'm interested in creating a column which would contain, say, the phone number for all customers. How can I go about it so that from the above I can get this:
Customer Name | Phone Number
Jacob | 03477444556
John | 03477444550
Smith | 03475649292
Adam | 03466700079
Carlos | 03466700079
Ali | 03465403134
And be able to do the above for any information contained within the master column.
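One possible approach (a sketch, assuming that once the CSV is loaded the doubled quotes are unescaped, so each Details cell holds a valid JSON list of {"Name": ..., "Value": ...} records): parse each cell with json.loads and pull out the entry whose Name matches the field you want.
import json
import pandas as pd

# Hypothetical single-row example of the data described above
df = pd.DataFrame({
    "Customer Name": ["Jacob"],
    "Details": ['[{"Name":"Phone","Value":"03477444556"},{"Name":"Type","Value":"Apartment"}]'],
})

def get_detail(details_str, field):
    # Each cell is a JSON list of {"Name": ..., "Value": ...} records
    records = json.loads(details_str)
    return next((r["Value"] for r in records if r["Name"] == field), None)

df["Phone Number"] = df["Details"].apply(lambda s: get_detail(s, "Phone"))
print(df[["Customer Name", "Phone Number"]])
The same get_detail call with field="Type", field="Sector", etc. would extract any other piece of information from the master column.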
