Extracting Information from a complex string

Hi, I have the following two columns in a pandas DataFrame. As you can see, the second column holds a lot of information. As far as I understand, it is some form of a "list", but with double instead of single quotation marks.
Customer Name Details
Jacob "[{""Name"":""Phone"",""Value"":""03477444556""},{""Name"":""Type"",""Value"":""Apartment""},{""Name"":""No - Name"",""Value"":""1210""},{""Name"":""Apartment Name"",""Value"":""Khudadaad Height  E-11\/1""},{""Name"":""Street"",""Value"":null},{""Name"":""Sector"",""Value"":null},{""Name"":""Landmark"",""Value"":null}]"
John "[{""Name"":""Phone"",""Value"":""03477444550""},{""Name"":""Type"",""Value"":null},{""Name"":""No - Name"",""Value"":""10""},{""Name"":""Apartment Name"",""Value"":""Khudadaad Height  E-11\/1""},{""Name"":""Street"",""Value"":null},{""Name"":""Sector"",""Value"":null},{""Name"":""Landmark"",""Value"":null}]"
Smith "[{""Name"":""Phone"",""Value"":""03475649292""},{""Name"":""Type"",""Value"":""House""},{""Name"":""No - Name"",""Value"":""1 a""},{""Name"":""Apartment Name"",""Value"":null},{""Name"":""Street"",""Value"":null},{""Name"":""Sector"",""Value"":""f 7 3""},{""Name"":""Landmark"",""Value"":null}]"
Adam "[{""Name"":""Phone"",""Value"":""03466700079""},{""Name"":""Type"",""Value"":""Office""},{""Name"":""No - Name"",""Value"":""ptcl head quarter""},{""Name"":""Apartment Name"",""Value"":null},{""Name"":""Street"",""Value"":null},{""Name"":""Sector"",""Value"":""g\/8\/4""},{""Name"":""Landmark"",""Value"":null}]"
Carlos "[{""Name"":""Phone"",""Value"":""03466700079""},{""Name"":""Type"",""Value"":""Office""},{""Name"":""No - Name"",""Value"":""ptcl head quarter""},{""Name"":""Apartment Name"",""Value"":null},{""Name"":""Street"",""Value"":null},{""Name"":""Sector"",""Value"":""g\/8\/4""},{""Name"":""Landmark"",""Value"":null}]"
Ali "[{""Name"":""Phone"",""Value"":""03465403134""},{""Name"":""Type"",""Value"":""House""},{""Name"":""No - Name"",""Value"":""55-B ""},{""Name"":""Apartment Name"",""Value"":null},{""Name"":""Street"",""Value"":""21""},{""Name"":""Sector"",""Value"":""F 10\/2""},{""Name"":""Landmark"",""Value"":null}]"
Anyhow this is how I'm interpreting this information. It contains seven different rows against each customer with each row containing within it four different values. So for the first customer Jacob the value in Row 1 and column 4 is " 03477444556 ". This row for each customer contains their phone number. Similarly, against each customer row 3 column 4 contains their location.
I'm interested in creating a column which would contain say the phone number for all customers. How can I go about it so that from the above I can get this:
Customer Name | Phone Number
Jacob | 03477444556
John | 03477444550
Smith | 03475649292
Adam | 03466700079
Carlos | 03466700079
Ali | 03465403134
And be able to do the above for any information contained within the master column.
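One possible approach, sketched under the assumption that each Details cell is a JSON array of {"Name": ..., "Value": ...} objects (the doubled quotation marks look like ordinary CSV escaping, which a CSV reader such as pd.read_csv undoes on load). The helper name details_to_dict and the two-row sample are made up for illustration:

```python
import json

import pandas as pd

# Small stand-in for the real data, with the CSV quote-doubling already undone.
df = pd.DataFrame({
    "Customer Name": ["Jacob", "John"],
    "Details": [
        '[{"Name":"Phone","Value":"03477444556"},{"Name":"Type","Value":"Apartment"}]',
        '[{"Name":"Phone","Value":"03477444550"},{"Name":"Type","Value":null}]',
    ],
})


def details_to_dict(raw):
    """Hypothetical helper: turn one Details string into {Name: Value}."""
    return {item["Name"]: item["Value"] for item in json.loads(raw)}


# Parse every cell once, then pull out whichever field is needed.
parsed = df["Details"].apply(details_to_dict)
df["Phone Number"] = parsed.apply(lambda d: d.get("Phone"))

# Or expand every Name into its own column in one go.
wide = pd.DataFrame(parsed.tolist(), index=df.index)

print(df[["Customer Name", "Phone Number"]])
```

Swapping "Phone" for any other Name (for example "Sector") should give the corresponding column, and wide holds all of them at once. Keeping the phone numbers as strings preserves the leading zero.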

Related

Setting specific rows to the value found in a row if differing index

I work with a lot of CSV data for my job. I am trying to use pandas to copy a member's 'Email' into the 'PrimaryMemberEmail' column of their spouse's row. Here is a sample of what I mean:
import pandas as pd
user_data = {
    'FirstName': ['John', 'Jane', 'Bob'],
    'Lastname': ['Snack', 'Snack', 'Tack'],
    'EmployeeID': ['12345', '12345S', '54321'],
    'Email': ['John#issues.com', 'NaN', 'Bob#issues.com'],
    'DOB': ['09/07/1988', '12/25/1990', '07/13/1964'],
    'Role': ['Employee On Plan', 'Spouse On Plan', 'Employee Off Plan'],
    'PrimaryMemberEmail': ['NaN', 'NaN', 'NaN'],
    'PrimaryMemberEmployeeId': ['NaN', '12345', 'NaN'],
}
df = pd.DataFrame(user_data)
I have thousands of rows like this. I need to populate 'PrimaryMemberEmail' only when the user is a spouse, using the 'Email' of the associated primary holder. In this case I would want to autopopulate 'PrimaryMemberEmail' for Jane Snack with the email of her spouse, John Snack, which is 'John#issues.com'. I cannot find a good way to do this. Currently I am using:
p = -1
for i in df['EmployeeID']:
    p = p + 1  # position of the current row
    EEID = df['EmployeeID'].iloc[p]
    if 'S' in EEID:
        # copy the email from the row directly above (assumes the primary member comes first)
        df.loc[df.index[p], 'PrimaryMemberEmail'] = df['Email'].iloc[p - 1]
What bothers me is that this only works if my file comes in correctly, like how I showed in the example DataFrame. Also my NaN values do not work with dropna() or other methods, but that is a question for another time.
I am new to python and programming. I am trying to add value to myself in my current health career and I find this all very fascinating. Any help is appreciated.

IIUC, map the values and fillna:
df['PrimaryMemberEmail'] = (df['PrimaryMemberEmployeeId']
    .map(df.set_index('EmployeeID')['Email'])
    .fillna(df['PrimaryMemberEmail'])
)
Alternatively, if you have real NaNs (not strings), use boolean indexing:
df.loc[df['PrimaryMemberEmployeeId'].notna(),
       'PrimaryMemberEmail'] = df['PrimaryMemberEmployeeId'].map(df.set_index('EmployeeID')['Email'])
output:
  FirstName Lastname EmployeeID            Email         DOB               Role PrimaryMemberEmail PrimaryMemberEmployeeId
0      John    Snack      12345  John#issues.com  09/07/1988   Employee On Plan                NaN                     NaN
1      Jane    Snack     12345S              NaN  12/25/1990     Spouse On Plan    John#issues.com                   12345
2       Bob     Tack      54321   Bob#issues.com  07/13/1964  Employee Off Plan                NaN                     NaN
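On the side issue from the question, dropna() most likely skips those cells because they hold the text 'NaN' rather than a real missing value. A minimal sketch, assuming the placeholder is always the exact string 'NaN', that converts the placeholders first and then does the same lookup by employee ID (the variable name email_by_id is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "FirstName": ["John", "Jane", "Bob"],
    "EmployeeID": ["12345", "12345S", "54321"],
    "Email": ["John#issues.com", "NaN", "Bob#issues.com"],
    "PrimaryMemberEmail": ["NaN", "NaN", "NaN"],
    "PrimaryMemberEmployeeId": ["NaN", "12345", "NaN"],
})

# Turn the literal text "NaN" into real missing values so isna/dropna see them.
df = df.replace("NaN", np.nan)

# Lookup table: employee ID -> that employee's own email.
email_by_id = df.set_index("EmployeeID")["Email"]

# Look up each row's primary member; keep any value that was already present.
df["PrimaryMemberEmail"] = (
    df["PrimaryMemberEmployeeId"].map(email_by_id).fillna(df["PrimaryMemberEmail"])
)

print(df[["FirstName", "PrimaryMemberEmail"]])
```

Because the lookup goes through EmployeeID rather than the previous row, it does not depend on the spouse appearing directly below the primary member in the file.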

Find the city with highest number of amenities

I am currently trying to crack a programming puzzle that has a very simple dataframe, host, with 2 columns named city and amenities (both are object datatype). Entries in both columns can be repeated multiple times. The first few entries of host are shown below:
City | Amenities | Price($)
NYC | {TV,"Wireless Internet","Air conditioning","Smoke detector",Essentials,"Lock on bedroom door"} | 8
LA | {"Wireless Internet",Kitchen,Washer,Dryer,"First aid kit",Essentials,"Hair dryer","translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"} | 10
SF | {TV,"Cable TV",Internet,"Wireless Internet",Kitchen,"Free parking on premises","Pets live on this property",Dog(s),"Indoor fireplace","Buzzer/wireless intercom",Heating,Washer,Dryer,"Smoke detector","Carbon monoxide detector","First aid kit","Safety card","Fire extinguisher",Essentials,Shampoo,"24-hour check-in",Hangers,"Hair dryer",Iron,"Laptop friendly workspace","translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50","Self Check-In",Lockbox} | 15
NYC | {"Wireless Internet","Air conditioning",Kitchen,Heating,"Suitable for events","Smoke detector","Carbon monoxide detector","First aid kit","Fire extinguisher",Essentials,Shampoo,"Lock on bedroom door",Hangers,"translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"} | 20
LA | {TV,Internet,"Wireless Internet","Air conditioning",Kitchen,"Free parking on premises",Essentials,Shampoo,"translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"} |
LA | {TV,"Cable TV",Internet,"Wireless Internet",Pool,Kitchen,"Free parking on premises",Gym,Breakfast,"Hot tub","Indoor fireplace",Heating,"Family/kid friendly",Washer,Dryer,"Smoke detector","Carbon monoxide detector",Essentials,Shampoo,"Lock on bedroom door",Hangers,"Private entrance"} | 28
.....
Question. Output the city with the highest number of amenities.
My attempt. I tried using groupby() function to group it based on column city using host.groupby('city'). Now, I need to count successfully the number of elements in each set of Amenities. Since the data types are different, the len() function did not work because there are \ between each element in the set (for example, if I use host['amenities'][0], the output is "{TV,\"Wireless Internet\",\"Air conditioning\",\"Smoke detector\",\"Carbon monoxide detector\",Essentials,\"Lock on bedroom door\",Hangers,Iron}". Applying len() to this output would result in 134, which is clearly incorrect). I tried using host['amenities'][0].strip('\n') which removes the \, but the len() function still gives 134.
Can anyone please help me crack this problem?
My solution, inspired by ddejohn's solution:
# Transform each string entry in column "amenities" into a list
host["amenities"] = host["amenities"].str.replace('["{}]', "", regex=True).str.split(",")
# Create a new column that counts the amenities for each row entry
host["am_count"] = [len(data) for data in host["amenities"]]
# Sum `am_count` per city and return the city label with the largest total
host.groupby("city")["am_count"].agg("sum").idxmax()

Solution
import functools
# Process the Amenities strings into sets of strings
host["amenities"] = host["amenities"].str.replace('["{}]', "", regex=True).str.split(",").apply(set)
# Groupby city, perform the set union to remove duplicates, and get count of unique amenities
amenities_by_city = host.groupby("city")["amenities"].apply(lambda x: len(functools.reduce(set.union, x))).reset_index()
Output:
city amenities
0 LA 27
1 NYC 17
2 SF 29
Getting the city with the max number of amenities is achieved with
city_with_most_amenities = amenities_by_city.query("amenities == amenities.max()")
Output:
city amenities
2 SF 29
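For comparison, the same unique-amenity count can be sketched with explode and nunique instead of a set union. The three-row sample below is invented for illustration, and the split on "," assumes that no amenity name itself contains a comma:

```python
import pandas as pd

# Tiny invented sample in the same "{a,"b c",d}" format as the question.
host = pd.DataFrame({
    "city": ["NYC", "LA", "NYC"],
    "amenities": [
        '{TV,"Wireless Internet","Air conditioning"}',
        '{"Wireless Internet",Kitchen}',
        "{TV,Kitchen,Heating}",
    ],
})

# Strip the braces and quotes, then split each string into a list of names.
amenity_lists = (
    host["amenities"]
    .str.strip("{}")
    .str.replace('"', "", regex=False)
    .str.split(",")
)

# One row per (city, amenity) pair, then count distinct amenities per city.
exploded = host.assign(amenity=amenity_lists).explode("amenity")
unique_counts = exploded.groupby("city")["amenity"].nunique()

top_city = unique_counts.idxmax()
print(unique_counts)
print(top_city)
```

idxmax returns the city label itself, whereas argmax would return only its integer position in the grouped result.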

How to check whether a string contains any value from a DataFrame column

I am trying to check which cell values of a pandas column occur inside a particular string. How do I do that?
There is one DataFrame and one string. I want to search for every value of a DataFrame column inside the string and get back the rows whose value matches.
I am looking for a solution like this one in MySQL:
select * from table where "string" like CONCAT('%',columnname,'%')
Dataframe:
area office_type
0 c d a (o) S.O
1 dr.b.a. chowk S.O
2 ghorpuri bazar S.O
3 n.w. college S.O
4 pune cantt east S.O
5 pune H.O
6 pune new bazar S.O
7 sachapir street S.O
Code:
tmp_df=my_df_main[my_df_main['area'].str.contains("asasa sdsd sachapir street sdsds ffff")]
In the above example, "sachapir street" is present both in the area column and in the string, so it should return "sachapir street" as the match.
I know the check should go the other way round, so I tried:
tmp_df=my_df_main["asasa sdsd sachapir street sdsds ffff".str.contains(my_df_main['area'])]
Any idea how to do that?

Finally I did this using "import pandasql as ps"
query = "SELECT area,office_type FROM my_df_main where 'asasa sdsd sachapir street sdsds ffff' like '%'||area||'%'"
tmp_df = ps.sqldf(query, locals())
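The same reverse lookup can probably be done in plain pandas, without pandasql, by testing each area value with Python's `in` operator. This is only a sketch on the sample data from the question, and the variable name text is illustrative:

```python
import pandas as pd

my_df_main = pd.DataFrame({
    "area": ["c d a (o)", "dr.b.a. chowk", "ghorpuri bazar", "n.w. college",
             "pune cantt east", "pune", "pune new bazar", "sachapir street"],
    "office_type": ["S.O", "S.O", "S.O", "S.O", "S.O", "H.O", "S.O", "S.O"],
})

text = "asasa sdsd sachapir street sdsds ffff"

# True for rows whose area value occurs somewhere inside the search string.
mask = my_df_main["area"].apply(lambda area: area in text)
tmp_df = my_df_main[mask]

print(tmp_df)
```

Like the SQL LIKE version, this is a plain substring test, so a short area such as "pune" would also match a string that merely contains those letters inside a longer word.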

Compare multiple cells in openpyxl

I need to compare multiple cells in openpyxl, but I have not been successful. To be more precise, I have an .xlsx file that I import into my Python script, which contains 4 columns and around 70,000 rows. Rows that share the same first 3 columns must be merged into one, summing the number that appears in the fourth column.
For example
Row 1 .. Type of material: A | Location: NY | Month of sale: January | Cost: 100
..
Row 239 Type of material: A | Location: NY | Month of sale: January | Cost: 150
..
Row 1020 Type of material: A | Location: NY | Month of sale: January | Cost: 80
..
etc
Assuming that only such matches existed, a new data table must be generated (for example in a data sheet) where only one row appears in this way:
Type of material: A | Location: NY | Month of sale: January | Cost: 330 (sum of costs)
And so on, with all the data in .xlsx file to get a new consolidated table.
I hope to have been clear with the explanation, but if it was not, I can be even more precise if necessary.
As I mentioned at the beginning, I have not been successful so far, so I will appreciate any help!
Thank you very much

Instead of reading it via openpyxl, I would use pandas:
import pandas as pd
raw_data = pd.read_excel(filename, header=0)
summary = raw_data.groupby(['Type of material', 'Location', 'Month of sale'])['Cost'].sum()
If this raises a KeyError, you'll need to fix the labels so they match the column headers in your file.
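To see what the grouping produces, here is a small in-memory sketch that uses the three rows from the question plus one invented row for a second material. The column labels are assumed to match the headers in the workbook:

```python
import pandas as pd

# Stand-in for pd.read_excel(filename): the rows from the question plus one extra.
raw_data = pd.DataFrame({
    "Type of material": ["A", "A", "A", "B"],
    "Location": ["NY", "NY", "NY", "NY"],
    "Month of sale": ["January", "January", "January", "January"],
    "Cost": [100, 150, 80, 40],
})

# as_index=False keeps the three key columns as ordinary columns in the result.
summary = raw_data.groupby(
    ["Type of material", "Location", "Month of sale"], as_index=False
)["Cost"].sum()

print(summary)

# The consolidated table could then be written out, for example with
# summary.to_excel("consolidated.xlsx", index=False)  # hypothetical file name
```

Each distinct combination of the first three columns ends up as one row, with Cost holding the sum for that combination.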

Iterate through and compare customer data from two .csv files, export single file with most recent updated customer information (Python)

I have two .csv files with the following customer information as headers:
First name
Last name
Email
Phone number
Street name
House number
City
Zip code
Country
Date -- last time customer information was updated
I want to go through both files, and export a single file with the most recent customer information.
For example,
File 1 contains the following for a single customer:
First name - John
Last name - Smith
Email - jsmith#verizon.net
Phone number - 123 456 7890
Street name - Baker St
House number - 50
City - London
Zip code - 12345
Country - England
Date - 01-06-2016 (DD-MM-YYYY)
And file 2 contains the following information for the same customer:
First name - John
Last name - Smith
Email - jsmith#gmail.com
Phone number - 098 765 4321
Street name - Baker St
House number - 50
City - London
Zip code - 12345
Country - England
Date - 01-10-2016
I want to use the information for this customer from file 2 in the exported file.
Any suggestions how to go about doing this in Python?
Thanks!

I suggest you use pandas. You can create two DataFrames and then update the first frame with the second. I found a question that looks similar to yours (https://stackoverflow.com/questions/7971513/using-one-data-frame-to-update-another). I hope it helps.
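A different route from the update approach is to stack both files and keep the newest row per customer. The sketch below assumes a customer is identified by first and last name (the email and phone number change between the files, so they cannot serve as the key) and uses small in-memory frames in place of the two pd.read_csv calls. The second customer is invented for illustration:

```python
import pandas as pd

# Stand-ins for pd.read_csv("file1.csv") and pd.read_csv("file2.csv").
file1 = pd.DataFrame({
    "First name": ["John", "Jane"],
    "Last name": ["Smith", "Doe"],
    "Email": ["jsmith#verizon.net", "jdoe#example.com"],
    "Date": ["01-06-2016", "15-03-2016"],
})
file2 = pd.DataFrame({
    "First name": ["John"],
    "Last name": ["Smith"],
    "Email": ["jsmith#gmail.com"],
    "Date": ["01-10-2016"],
})

combined = pd.concat([file1, file2], ignore_index=True)

# The dates are DD-MM-YYYY, so parse them explicitly before sorting.
combined["Date"] = pd.to_datetime(combined["Date"], format="%d-%m-%Y")

# Sort oldest to newest, then keep only the last (newest) row per customer.
latest = (
    combined.sort_values("Date")
    .drop_duplicates(subset=["First name", "Last name"], keep="last")
    .reset_index(drop=True)
)

print(latest)
# latest.to_csv("merged.csv", index=False)  # hypothetical output file name
```

Customers that appear in only one file are kept as they are, and the choice of key columns is the main thing to adapt to the real data.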
