How to populate a DataFrame with values drawn from a CSV in Python
I'm trying to fill an existing spreadsheet with values from a separate CSV file with Python.
I have this long CSV file with emails and matching domains that I want to insert into a spreadsheet of business contact information. Basically, insert email into the email column where the 'website' column matches up.
The spreadsheet I'm trying to populate looks like this:
| Index | business_name | email | website            |
| ----- | ------------- | ----- | ------------------ |
| 0     | Apple         |       | www.apple.com      |
| 1     | Home Depot    |       | www.home-depot.com |
| 4     | Amazon        |       | www.amazon.com     |
| 6     | Microsoft     |       | www.microsoft.com  |
The CSV file I'm taking contacts from looks like this:
steve#apple.com, www.apple.com
jeff#amazon.com, www.amazon.com
marc#amazon.com, www.amazon.com
john#amazon.com, www.amazon.com
marc#salesforce.com, www.salesforce.com
dan#salesforce.com, www.salesforce.com
In Python:
import pandas as pd

index = [0, 1, 4, 6]
business_name = ["apple", "home depot", "amazon", "microsoft"]
email = ["" for i in range(4)]
website = ["www.apple.com", "www.home-depot.com", "www.amazon.com", "www.microsoft.com"]
col1 = ["steve#apple.com", "jeff#amazon.com", "marc#amazon.com", "john#amazon.com", "marc#salesforce.com", "dan#salesforce.com"]
col2 = ["www.apple.com", "www.amazon.com", "www.amazon.com", "www.amazon.com", "www.salesforce.com", "www.salesforce.com"]
# spreadsheet to insert values into
spreadsheet_df = pd.DataFrame({"index": index, "business_name": business_name, "email": email, "website": website})
# CSV file that is read
csv_df = pd.DataFrame({"col1": col1, "col2": col2})
Desired Output:
| Index | business_name | email           | website            |
| ----- | ------------- | --------------- | ------------------ |
| 0     | Apple         | steve#apple.com | www.apple.com      |
| 1     | Home Depot    | NaN             | www.home-depot.com |
| 4     | Amazon        | jeff#amazon.com | www.amazon.com     |
| 6     | Microsoft     | NaN             | www.microsoft.com  |
I want to iterate through every row in the CSV file, find where its second column matches the website column of the spreadsheet, and insert the corresponding email (the value in the CSV's first column) into the spreadsheet's email column.
Up until now, I've had to manually insert email contacts from the CSV file into the spreadsheet, which has become very tedious. Please save me from this monotony.
I've scoured Stack Overflow for an identical or similar thread but cannot find one. I apologize if there is a thread with this same issue, or if my post is confusing or lacking information, as it is my first. There are multiple entries for a single domain, so ideally I want to append every entry in the CSV file to its matching row and column in the spreadsheet. This seems like an easy task at first, but it has become a massive headache for me.
Welcome to Stack Overflow! In the future, please follow the site's guidelines for asking questions, and in this case the pandas community guidelines for providing sample data as well. Following these guidelines is important both for how the community can help you and for how you can help the community.
First, you need to provide a minimal and reproducible example for those helping you:
# Setup
import pandas as pd

index = [0, 1, 4, 6]
business_name = ["apple", "home depot", "amazon", "microsoft"]
email = ["" for i in range(4)]
website = ["www.apple.com", "www.home-depot.com", "www.amazon.com", "www.microsoft.com"]
col1 = ["steve#apple.com", "jeff#amazon.com", "marc#amazon.com", "john#amazon.com", "marc#salesforce.com", "dan#salesforce.com"]
col2 = ["www.apple.com", "www.amazon.com", "www.amazon.com", "www.amazon.com", "www.salesforce.com", "www.salesforce.com"]
# Create DataFrames
# In your code this is where you would read in the CSV and spreadsheet via pandas
spreadsheet_df = pd.DataFrame({"index": index, "business_name": business_name, "email": email, "website": website})
csv_df = pd.DataFrame({"col1": col1, "col2": col2})
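In the real script, these two frames would come from your files rather than from in-memory lists; here is a minimal sketch of that step, assuming the spreadsheet is an Excel file and the CSV has no header row (both file names are hypothetical):
# hypothetical paths; substitute your own
spreadsheet_df = pd.read_excel('contacts.xlsx')
csv_df = pd.read_csv('emails.csv', header=None, names=['col1', 'col2'])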
This will also help others who are reviewing this question in the future.
If I understand you correctly, you're looking to attach an email address to every company you have on the spreadsheet.
You can accomplish this by reading the CSV and the spreadsheet into DataFrames and merging them:
# Merge my two dataframes
df = spreadsheet_df.merge(csv_df, left_on="website", right_on="col2", how="left")
# Only keep the columns I want
df = df[["index", "business_name", "email", "website", "col1"]]
Output:
   index business_name email             website                col1
0      0         apple        www.apple.com       steve#apple.com
1      1    home depot        www.home-depot.com              NaN
2      4        amazon        www.amazon.com      jeff#amazon.com
3      4        amazon        www.amazon.com      marc#amazon.com
4      4        amazon        www.amazon.com      john#amazon.com
5      6     microsoft        www.microsoft.com               NaN
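To match the desired column layout, a small follow-on sketch that drops the now-empty email column and renames col1 (the duplicate Amazon rows are kept):
df = df.drop(columns=['email']).rename(columns={'col1': 'email'})
df = df[['index', 'business_name', 'email', 'website']]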
Note that this keeps a separate row for every matching email, so it doesn't exactly match the desired output above, which keeps only one email per business.
If you want to associate only the first email for each business in the CSV file with its website, you can do a groupby/first on the CSV and then merge with the business DataFrame. I'm also going to drop the original email column, since it is empty and serves no purpose.
import pandas
index = [0, 1, 4, 6]
business_name = ["apple", "home depot", "amazon", "microsoft"]
email = ["" for i in range(4)]
website = ["www.apple.com", "www.home-depot.com", "www.amazon.com", "www.microsoft.com"]
col1 = ["steve#apple.com", "jeff#amazon.com", "marc#amazon.com", "john#amazon.com", "marc#salesforce.com", "Dan#salesforce.com"]
col2 = ["www.apple.com", "www.amazon.com", "www.amazon.com", "www.amazon.com", "www.salesforce.com", "www.salesforce.com"]
# spreadsheet to insert values into
business = pandas.DataFrame({"index":index, "business_name":business_name, "email":email, "website":website})
# csv file that is read
email = pandas.DataFrame({"email":col1, "website":col2})
output = (
    business
    .drop(columns=['email'])  # this column is empty and needs to be overwritten
    .merge(
        email.groupby('website', as_index=False).first(),  # just the first email per website
        on='website', how='left'  # left join -> keep all rows from `business`
    )
    .loc[:, business.columns]  # get your original column order back
)
And I get:
   index business_name            email             website
0      0         apple  steve#apple.com       www.apple.com
1      1    home depot              NaN  www.home-depot.com
2      4        amazon  jeff#amazon.com      www.amazon.com
3      6     microsoft              NaN   www.microsoft.com
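The question also mentions wanting every entry for a matching domain, not just the first. A hedged variant of the same idea: swap .first() for an aggregation that joins all the e-mails per website into one cell (the comma separator is an assumption):
# keep every email per website, comma-separated, instead of just the first
all_emails = email.groupby('website', as_index=False).agg({'email': ', '.join})
output = (
    business
    .drop(columns=['email'])
    .merge(all_emails, on='website', how='left')
    .loc[:, business.columns]
)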
Assuming that the spreadsheet is also a pandas DataFrame and that it looks exactly like your example, there is a straightforward way of doing this using boolean indexing. I advise you to read about it further here: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
First, I suggest that you turn your CSV file into a dictionary where the website is the key and the e-mail address is the value. Note that a dictionary cannot contain identical keys, so where a domain has several contacts, all but one e-mail address would disappear; seeing as you don't need more than one contact per domain, this works well. Achieving this is easily done by reading in the CSV file as a pandas Series:
# pd.Series.from_csv was removed in pandas 1.0; read the two columns and squeeze to a Series
series = pd.read_csv('path_to_csv', header=None, index_col=0).squeeze('columns')
dict_contacts = series.to_dict()
Note that the order here is inverted: you would have the e-mail as the key and the domain as the value. You can swap them around like this:
dict_contacts = {value:key for key, value in dict_contacts.items()}
The reason for this step is that I believe it is easier to work with an expanding list of clients this way.
Having done that, you could then simply do:
for domain in dict_contacts:
    # avoid chained assignment; select the rows whose website matches and set the email
    df1.loc[df1['website'] == domain, 'email'] = dict_contacts[domain]
What this does is filter the rows whose website matches each unique key in the dictionary (i.e. the domain) and assign them the value of that key, i.e. the e-mail.
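As an aside, once the dictionary exists, the whole loop can be replaced by a single vectorized call; a minimal sketch:
# map each website to its e-mail; websites with no contact become NaN
df1['email'] = df1['website'].map(dict_contacts)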
Finally, I have deliberately tried to provide you with a general solution, one that wouldn't require additional work even if you had 2,000 different clients with unique domains and e-mails.
Related
Python variables as index
I have an Excel file with several sheets; the column headings differ, but the structure is the same. I want to convert it to JSON, but I already have a problem: how can I make my first column (which has a different heading in each sheet) the index in pandas?
import pandas
datapath = 'myfile.xlsx'
datasheet = 'testsheet'
data = pandas.read_excel(datapath, sheet_name=datasheet)
index_1 = data.columns[0]
# now my problem; in bash I would do it like:
#   chipset = data.$(echo $index_1)
#   print(chipset)
# can anyone please give me a solution?
The Excel file looks like this:
Sheet s1:
s1col1    | s1col2
sc11data1 | sc12data1
sc11data2 | sc12data2
Sheet s2:
s2col1   | s2col2
sc21data | sc22data
I don't know the exact heading names in a sheet, but the first column is always the index in my JSON.
I don't quite understand your question. Do you mean you want to set the first column as the index? Doesn't data.set_index(index_1, inplace=True) work?
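For the JSON conversion the question mentions, a minimal sketch building on that answer (the orient value is an assumption about the desired JSON shape):
data.set_index(index_1, inplace=True)
# one JSON object per row, keyed by the values of the first column
print(data.to_json(orient='index'))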
Python: Possible to pass a list of values to an if statement?
I am merging multiple Excel sheets into a pandas DataFrame and extracting the email address from a Report_Name column. I have been able to execute what I need, but the hardship is that there are over 5000 email addresses that I would have to run checks against (in this example only five). While I could type out all the conditions for these, it's not practical. So my question is: can I pass a single list that includes these 5000 email addresses as a condition (somehow) in an if statement? Below is my code showing how I am currently doing it.
| **Report_Name** |
| --------------- |
| SYN-Laptops-Nov10 (002) |
| something_offer bozo#domain.com |
| another thing foxtrot#domain.com |
| my offer is attached ooo 12-31 rocksteps#domain.com |
| copy of offer dolphin#domain.com |
| private offering copy chomps#domain.com |
# ----------- extract the email address from the Report_Name column -----------
# blank list to collect and store the email addresses
collected_emails = []
# ----------- for loop to iterate through the values under the ['Report_Name'] column -----------
for report_name_value in excel_df['Report_Name']:
    if 'bozo#domain.com' in report_name_value:
        collected_emails.append('bozo#domain.com')
    elif 'foxtrot#domain.com' in report_name_value:
        collected_emails.append('foxtrot#domain.com')
    elif 'rocksteps#domain.com' in report_name_value:
        collected_emails.append('rocksteps#domain.com')
    elif 'dolphin#domain.com' in report_name_value:
        collected_emails.append('dolphin#domain.com')
    elif 'chomps#domain.com' in report_name_value:
        collected_emails.append('chomps#domain.com')
    else:
        collected_emails.append('No Email Address')
# create DataFrame for the collected emails
collected_emails_df = pd.DataFrame(collected_emails, columns=['Email_Address'])
# create master_df to concat both the excel_df and collected_emails_df together
master_df = pd.concat([excel_df, collected_emails_df], axis=1)
# export master DataFrame to an Excel file and save it on the SharePoint directory
master_df.to_excel(output_local + 'Offers.xlsx', index=False)
RESULT:
| **Report_Name** | **Email_Address** |
| --------------- | ----------------- |
| SYN-Laptops-Nov10 (002) | No Email Address |
| something_offer bozo#domain.com | bozo#domain.com |
| another thing foxtrot#domain.com | foxtrot#domain.com |
| my offer is attached ooo 12-31 rocksteps#domain.com | rocksteps#domain.com |
| copy of offer dolphin#domain.com | dolphin#domain.com |
| private offering copy chomps#domain.com | chomps#domain.com |
I am a beginner with Python and was unable to pull up any references on how to tackle this problem specifically, hence the post. Thanks for your time; I appreciate any advice you can offer.
Sounds like a good case for lambda and filter!
list_to_check = ['bozo#domain.com', ...]  # pass this in to your function
# True if the candidate email appears in any Report_Name value
is_match_func = lambda x: excel_df['Report_Name'].str.contains(x, regex=False).any()
all_matches = list(filter(is_match_func, list_to_check))
for email in all_matches:
    collected_emails.append(email)
Given:
# df
                                         Report_Name
0                            SYN-Laptops-Nov10 (002)
1                    something_offer bozo#domain.com
2                   another thing foxtrot#domain.com
3  my offer is attached ooo 12-31 rocksteps#domai...
4                   copy of offer dolphin#domain.com
5           private offering copy chomps#domain.com
Doing:
df['Emails'] = df.Report_Name.str.extract(r'(\w+#\w+\.\w+)')
print(df)
Regex explanation: basically, we know that there aren't spaces in an email, and that they all contain [chars # chars . chars].
Output:
                                         Report_Name                Emails
0                            SYN-Laptops-Nov10 (002)                   NaN
1                    something_offer bozo#domain.com       bozo#domain.com
2                   another thing foxtrot#domain.com    foxtrot#domain.com
3  my offer is attached ooo 12-31 rocksteps#domai...  rocksteps#domain.com
4                   copy of offer dolphin#domain.com    dolphin#domain.com
5           private offering copy chomps#domain.com     chomps#domain.com
It's not clear what you want to do after this point...
You can simply use pandas.Series.str.extract with a regular expression to extract all the emails. Try this:
collected_emails = (
    excel_df['Report_Name']
    .str.extract(r'([a-zA-Z0-9._-]+#.+\.com)')
    .dropna()
    .squeeze()
    .tolist()
)
# Output:
print(collected_emails)
['bozo#domain.com', 'foxtrot#domain.com', 'rocksteps#domain.com', 'dolphin#domain.com', 'chomps#domain.com']
How to extract desired sections from a JSON string
I want to know how to clean up my data so that I can better understand it and sift through it more easily. So far I have been able to download a public Google Sheets document and convert it into a CSV file. In the browser's developer tools the data looks neatly organized, but when I actually print it in a Jupyter notebook it is messy and hard to read:
b'/O_o/\ngoogle.visualization.Query.setResponse({"version":"0.6","reqId":"0output=csv","status":"ok","sig":"1241529276","table":{"cols":[{"id":"A","label":"Entity","type":"string"},{"id":"B","label":"Week","type":"number","pattern":"General"},{"id":"C","label":"Day","type":"date","pattern":"yyyy-mm-dd"},{"id":"D","label":"Flights 2019 (Reference)","type":"number","pattern":"General"},{"id":"E","label":"Flights","type":"number","pattern":"General"},{"id":"F","label":"% vs 2019 (Daily)","type":"number","pattern":"General"},{"id":"G","label":"Flights (7-day moving average)","type":"number","pattern":"General"},{"id":"H","label":"% vs 2019 (7-day Moving Average)","type":"number","pattern":"General"},{"id":"I","label":"Day 2019","type":"date","pattern":"yyyy-mm-dd"},{"id":"J","label":"Day Previous Year","type":"date","pattern":"yyyy-mm-dd"},{"id":"K","label":"Flights Previous Year","type":"number","pattern":"General"}],"rows":[{"c":[{"v":"Albania"},{"v":36.0,"f":"36"},{"v":"Date(2020,8,1)","f":"2020-09-01"},{"v":129.0,"f":"129"},{"v":64.0,"f":"64"},{"v":-0.503875968992248,"f":"-0,503875969"},{"v":71.5714285714286,"f":"71,57142857"},{"v":-0.291371994342291,"f":"-0,2913719943"},{"v":"Date(2019,8,3)","f":"2019-09-03"},{"v":"Date(2019,8,3)","f":"2019-09-03"},{"v":129.0,"f":"129"}]},{"c":[{"v":"Albania"},{"v":36.0,"f":"36"},{"v":"Date(2020,8,2)","f":"2020-09-02"},{"v":92.0,"f":"92"},{"v":59.0,"f":"59"},{"v":-0.358695652173913,"f":"-0,3586956522"},{"v":70.0,"f":"70"},{"v":-0.300998573466476,"f":"-0,3009985735"},{"v":"Date(2019,8,4)","f":"2019-09-04"},{"v":"Date(2019,8,4)","f":"2019-09-04"},{"v":92.0,"f":"92"}]},{"c":[{"v":"Albania"},{"v":36.0,"f":"36"},{"v":"Date(2020,8,3)","f":"2020-09-03"},{"v":96.0,"f":"96"},{"v":67.0,"f":"67"},{"v":-0.302083333333333,"f":"-0,3020833333"},
Is there a pandas way to clean this data up? Essentially, what I am trying to do is extract three variables from the data: a country, a date, and a number. The payload starts with a "rows" key; each row gives a country, a date, then a bunch of associated numbers. What I want to get is the country name, a specific date, and a specific number.
For example, here is an example section; this sequence is repeated throughout the data:
{"c":[{"v":"Albania"},{"v":36.0,"f":"36"},{"v":"Date(2020,8,1)","f":"2020-09-01"},{"v":129.0,"f":"129"},{"v":64.0,"f":"64"},{"v":-0.503875968992248,"f":"-0,503875969"},{"v":71.5714285714286,"f":"71,57142857"},{"v":-0.291371994342291,"f":"-0,2913719943"},{"v":"Date(2019,8,3)","f":"2019-09-03"},{"v":"Date(2019,8,3)","f":"2019-09-03"},{"v":129.0,"f":"129"}]},
Of this section of the data, I only want to get out the country name "Albania", the date "2020-09-01", and the number -0.5038.
Here is the code I used to grab the Google Sheets data and save it as a CSV:
import requests
import pandas as pd

r = requests.get('https://docs.google.com/spreadsheets/d/1GJ6CvZ_mgtjdrUyo3h2dU3YvWOahbYvPHpGLgovyhtI/gviz/tq?usp=sharing&tqx=reqId%3A0output=csv')
data = r.content
print(data)
Please, any and all advice would be amazing. Thank you.
I'm not sure how you arrived at this CSV file; the easiest way would be to get the JSON directly with requests, load it as a dict, and process it. Nonetheless, a solution for the current file would be:
import requests
import pandas as pd
import json

r = requests.get('https://docs.google.com/spreadsheets/d/1GJ6CvZ_mgtjdrUyo3h2dU3YvWOahbYvPHpGLgovyhtI/gviz/tq?usp=sharing&tqx=reqId%3A0output=json')
data = r.content
# clean up the string so only the JSON data is left
data = json.loads(data.decode('utf-8').split("(", 1)[1].rsplit(")", 1)[0])
# pull out the country, the formatted date, and the "% vs 2019 (Daily)" value from each row
d = [[i['c'][0]['v'], i['c'][2]['f'], i['c'][5]['v']] for i in data['table']['rows']]
df = pd.DataFrame(d, columns=['country', 'date', 'number'])
Output:
|    | country | date       |    number |
|---:|:--------|:-----------|----------:|
|  0 | Albania | 2020-09-01 | -0.503876 |
|  1 | Albania | 2020-09-02 | -0.358696 |
|  2 | Albania | 2020-09-03 | -0.302083 |
|  3 | Albania | 2020-09-04 | -0.135922 |
|  4 | Albania | 2020-09-05 | -0.43617  |
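As an aside, a publicly shared Google Sheet can usually be fetched as plain CSV through the export endpoint, skipping the JavaScript wrapper entirely; a minimal sketch (whether this works depends on the sheet's sharing settings):
import pandas as pd

sheet_id = '1GJ6CvZ_mgtjdrUyo3h2dU3YvWOahbYvPHpGLgovyhtI'
# for public documents, the export endpoint serves the first sheet as plain CSV
df = pd.read_csv(f'https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv')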
Pandas not displaying all columns when writing to CSV
I am attempting to export a dataset that looks like this:
+----------------+--------+------------+------------+
| Province_State | Admin2 | 03/28/2020 | 03/29/2020 |
+----------------+--------+------------+------------+
| South Dakota   | Aurora | 1          | 2          |
| South Dakota   | Beedle | 1          | 3          |
+----------------+--------+------------+------------+
However, the actual CSV file I am getting is like so:
+----------------+------------+------------+
| Province_State | 03/28/2020 | 03/29/2020 |
+----------------+------------+------------+
| South Dakota   | 1          | 2          |
| South Dakota   | 1          | 3          |
+----------------+------------+------------+
Using this code here (runnable by calling createCSV(); it pulls data from the COVID government GitHub repo):
import csv           # csv reader
import pandas as pd  # csv parser
import collections   # not needed
import requests      # retrieves URL from gov data

def getFile():
    url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv'
    response = requests.get(url)
    print('Writing file...')
    open('us_deaths.csv', 'wb').write(response.content)

# takes raw data from link; creates a CSV for each unique state and removes unneeded headings
def createCSV():
    getFile()
    # init data
    data = pd.read_csv('us_deaths.csv', delimiter=',')
    # drop extra columns
    data.drop(['UID'], axis=1, inplace=True)
    data.drop(['iso2'], axis=1, inplace=True)
    data.drop(['iso3'], axis=1, inplace=True)
    data.drop(['code3'], axis=1, inplace=True)
    data.drop(['FIPS'], axis=1, inplace=True)
    #data.drop(['Admin2'], axis=1, inplace=True)
    data.drop(['Country_Region'], axis=1, inplace=True)
    data.drop(['Lat'], axis=1, inplace=True)
    data.drop(['Long_'], axis=1, inplace=True)
    data.drop(['Combined_Key'], axis=1, inplace=True)
    #data.drop(['Province_State'], axis=1, inplace=True)
    data.to_csv('DEBUGDATA2.csv')

    # sets Province_State as primary key; searches based on date and key
    # to create new CSVs in the root directory of the Python app
    data = data.set_index('Province_State')
    data = data.iloc[:, 2:].rename(columns=pd.to_datetime, errors='ignore')
    for name, g in data.groupby(level='Province_State'):
        g[pd.date_range('03/23/2020', '03/29/20')] \
            .to_csv('{0}_confirmed_deaths.csv'.format(name))
The reason for the loop is to set the date columns (everything after the first two) to dates, so that I can select only 03/23/2020 and beyond. If anyone has a better method of doing this, I would love to know. To ensure it works, it prints out all the field names, including Admin2 (county name), Province_State, and the rest of the dates. However, in my CSV, as you can see, Admin2 seems to have disappeared. I am not sure how to make this work; if anyone has any ideas, that'd be great!
I changed
data = data.set_index('Province_State')
to
data = data.set_index(['Province_State', 'Admin2'])
I needed to create a multi-key index to allow the Admin2 column to show. Any smoother tips on the date-range section are welcome. Thanks for the help, all!
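On the "better method" note in the question, a hedged sketch: the repeated drop calls can be collapsed into a single drop with a column list (names taken from the question's code):
# drop all the unneeded columns in one call
data = data.drop(columns=['UID', 'iso2', 'iso3', 'code3', 'FIPS',
                          'Country_Region', 'Lat', 'Long_', 'Combined_Key'])
data = data.set_index(['Province_State', 'Admin2'])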
Designing an expandable command-line interface with lists
These are the three lists I have:
# made up data
products = ['apple','banana','orange']
prices = ['£0.11','£0.07','£0.05']
dates = ['02/04/2017','14/09/2018','06/08/2016']
Important to know: the data in these lists will vary, along with their size, although they will maintain the same data type. The first elements of each list are linked, likewise for the second and third elements, and so on.
Desired command-line interface:
Product | Price | Date of Purchase
--------|-------|------------------
apple   | £0.11 | 02/04/2017
--------|-------|------------------
banana  | £0.07 | 14/09/2018
--------|-------|------------------
orange  | £0.05 | 06/08/2016
I want to create a table like this. It should obviously continue if there are more elements in each list, but I don't know how to create it. I could do
print("""
Product | Price | Date of Purchase
--------|-------|------------------
%s | %s | %s
""" % (products[0], prices[0], dates[0]))
but I think this would be hardcoding the interface, which isn't ideal because the lists have undetermined lengths. Any help?
If you want a version that doesn't utilize a library, here's a fairly simple function that makes use of some list comprehensions:
def print_table(headers, *columns):
    # Ignore any columns of data that don't have a header
    columns = columns[:len(headers)]
    # Start with a space to set the header off from the left edge,
    # then join the header strings with " | "
    print(" " + " | ".join(headers))
    # Draw the header separator with column dividers based on header length
    print("|".join(['-' * (len(header) + 2) for header in headers]))
    # Iterate over all lists passed in, and combine them together in a tuple by row
    for row in zip(*columns):
        # Center the contents within the space available in the column,
        # based on the header width
        print("|".join([
            col.center((len(headers[idx]) + 2), ' ')
            for idx, col in enumerate(row)
        ]))
This doesn't handle cell values that are longer than the column header length + 2, but that would be easy to implement with a truncation of the cell contents.
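For instance, calling it with the question's data:
print_table(['Product', 'Price', 'Date of Purchase'], products, prices, dates)
Each cell is centered under its header, so the table grows cleanly with any number of rows.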
Try pandas:
import pandas as pd

products = ['apple','banana','orange']
prices = ['£0.11','£0.07','£0.05']
dates = ['02/04/2017','14/09/2018','06/08/2016']

df = pd.DataFrame({"Product": products, "Price": prices, "Date of Purchase": dates})
print(df)
Output:
  Product  Price Date of Purchase
0   apple  £0.11       02/04/2017
1  banana  £0.07       14/09/2018
2  orange  £0.05       06/08/2016
from beautifultable import BeautifulTable

table = BeautifulTable()

# made up data
products = ['apple','banana','orange']
prices = ['£0.11','£0.07','£0.05']
dates = ['02/04/2017','14/09/2018','06/08/2016']

table.column_headers = ['Product', 'Price', 'Date of Purchase']
for i in zip(products, prices, dates):
    table.append_row(list(i))
print(table)
Output is:
+---------+-------+------------------+
| Product | Price | Date of Purchase |
+---------+-------+------------------+
|  apple  | £0.11 |    02/04/2017    |
+---------+-------+------------------+
| banana  | £0.07 |    14/09/2018    |
+---------+-------+------------------+
| orange  | £0.05 |    06/08/2016    |
+---------+-------+------------------+