Python variables as index

I have an Excel file with several sheets; the column headings differ between sheets, but the structure is the same. I want to convert it to JSON, but I'm already stuck at the first step: how can I use my first column (whose heading varies) as the index in pandas?
import pandas
datapath = 'myfile.xlsx'
datasheet = 'testsheet'
data = pandas.read_excel(datapath, sheet_name=datasheet)
index_1 = data.columns[0]
# now my problem, in bash I would do it like:
# chipset = data.$(echo $index_1)
print(chipset)
# can anyone give me please a solution?
I have an Excel file with sheets (sx) like this:
s1:
s1col1    | s1col2
sc11data1 | sc12data1
sc11data2 | sc12data2
---
s2:
s2col1   | s2col2
sc21data | sc22data
---
I don't know in advance what the exact heading in a sheet is, but the first column should always become the index in my JSON.

I don't quite understand your question. Do you mean you want to set the first column as the index? Doesn't data.set_index(index_1, inplace=True) work?
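If the goal is just the bash-style lookup from the question - fetching a column whose heading is only known at runtime - plain bracket indexing does it. A minimal sketch, with made-up data standing in for the Excel sheet:

```python
import pandas as pd

# made-up stand-in for pandas.read_excel(datapath, sheet_name=datasheet)
data = pd.DataFrame({"chipset": ["a1", "b2"], "speed": [100, 200]})

index_1 = data.columns[0]   # first column heading, whatever it happens to be
chipset = data[index_1]     # the equivalent of bash's data.$(echo $index_1)
print(chipset.tolist())     # -> ['a1', 'b2']

data.set_index(index_1, inplace=True)  # or promote it to the DataFrame index
```

Bracket indexing works with any string variable, so it does not matter that the heading differs per sheet.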

Related

How to populate dataframe with values drawn from a CSV in Python

I'm trying to fill an existing spreadsheet with values from a separate CSV file with Python.
I have this long CSV file with emails and matching domains that I want to insert into a spreadsheet of business contact information. Basically, insert email into the email column where the 'website' column matches up.
The spreadsheet I'm trying to populate looks like this:
| Index | business_name | email| website |
| --- | --------------- |------| ----------------- |
| 0 | Apple | | www.apple.com |
| 1 | Home Depot | | www.home-depot.com|
| 4 | Amazon | | www.amazon.com |
| 6 | Microsoft | | www.microsoft.com |
The CSV file I'm taking contacts from looks like this:
steve#apple.com, www.apple.com
jeff#amazon.com, www.amazon.com
marc#amazon.com, www.amazon.com
john#amazon.com, www.amazon.com
marc#salesforce.com, www.salesforce.com
dan#salesforce.com, www.salesforce.com
in Python:
import pandas as pd
index = [0, 1, 4, 6]
business_name = ["apple", "home depot", "amazon", "microsoft"]
email = ["" for i in range(4)]
website = ["www.apple.com", "www.home-depot.com", "www.amazon.com", "www.microsoft.com"]
col1 = ["steve#apple.com", "jeff#amazon.com", "marc#amazon.com", "john#amazon.com", "marc#salesforce.com", "Dan#salesforce.com"]
col2 = ["www.apple.com", "www.amazon.com", "www.amazon.com", "www.amazon.com", "www.salesforce.com", "www.salesforce.com"]
# spreadsheet to insert values into
spreadsheet_df = pd.DataFrame({"index":index, "business_name":business_name, "email":email, "website":website})
# csv file that is read
csv_df = pd.DataFrame({"col1":col1, "col2":col2})
Desired Output:
| Index | business_name | email | website |
| --- | --------------- |---------------------| ----------------- |
| 0 | Apple | steve#apple.com | www.apple.com |
| 1 | Home Depot | NaN | www.home-depot.com|
| 4 | Amazon | jeff#amazon.com | www.amazon.com |
| 6 | Microsoft | NaN | www.microsoft.com |
I want to iterate through every row in the CSV file to find where its second column matches the fourth column of the spreadsheet, then insert the corresponding value from the CSV's first column into the spreadsheet's third column.
Up until now, I've had to manually insert email contacts from the CSV file into the spreadsheet which has become very tedious. Please save me from this monotony.
I've scoured stack overflow for an identical or similar thread but cannot find one. I apologize if there is a thread with this same issue, and if my post is confusing or lacking information as it is my first. There are multiple entries for a single domain, so ideally I want to append every entry in the CSV file to its matching row and column in the spreadsheet. This seems like an easy task at first but has become a massive headache for me.
Welcome to Stack Overflow! In the future, please kindly follow these guidelines; in this scenario, please follow the community pandas guidelines as well. Following these guidelines is important to how the community can help you and how you can help the community.
First, you need to provide a minimal and reproducible example for those helping you:
# Setup
import pandas as pd
index = [0, 1, 4, 6]
business_name = ["apple", "home depot", "amazon", "microsoft"]
email = ["" for i in range(4)]
website = ["www.apple.com", "www.home-depot.com", "www.amazon.com", "www.microsoft.com"]
col1 = ["steve#apple.com", "jeff#amazon.com", "marc#amazon.com", "john#amazon.com", "marc#salesforce.com", "Dan#salesforce.com"]
col2 = ["www.apple.com", "www.amazon.com", "www.amazon.com", "www.amazon.com", "www.salesforce.com", "www.salesforce.com"]
# Create DataFrames
# In your code this is where you would read in the CSV and spreadsheet via pandas
spreadsheet_df = pd.DataFrame({"index":index, "business_name":business_name, "email":email, "website":website})
csv_df = pd.DataFrame({"col1":col1, "col2":col2})
This will also help others who are reviewing this question in the future.
If I understand you correctly, you're looking to provide an email address for every company you have on the spreadsheet.
You can accomplish it by reading in the csv and spreadsheet into a dataframe and merging them:
# Merge my two dataframes
df = spreadsheet_df.merge(csv_df, left_on="website", right_on="col2", how="left")
# Only keep the columns I want
df = df[["index", "business_name", "email", "website", "col1"]]
output:
index business_name email website col1
0 0 apple www.apple.com steve#apple.com
1 1 home depot www.home-depot.com NaN
2 4 amazon www.amazon.com jeff#amazon.com
3 4 amazon www.amazon.com marc#amazon.com
4 4 amazon www.amazon.com john#amazon.com
5 6 microsoft www.microsoft.com NaN
Because you didn't provide an expected output, I don't know if this is correct.
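For completeness, the setup and merge above run end-to-end like this (same data, CSV side abbreviated to three rows):

```python
import pandas as pd

spreadsheet_df = pd.DataFrame({
    "index": [0, 1, 4, 6],
    "business_name": ["apple", "home depot", "amazon", "microsoft"],
    "email": ["", "", "", ""],
    "website": ["www.apple.com", "www.home-depot.com",
                "www.amazon.com", "www.microsoft.com"],
})
csv_df = pd.DataFrame({
    "col1": ["steve#apple.com", "jeff#amazon.com", "marc#amazon.com"],
    "col2": ["www.apple.com", "www.amazon.com", "www.amazon.com"],
})

# left join keeps every spreadsheet row; unmatched websites get NaN in col1
df = spreadsheet_df.merge(csv_df, left_on="website", right_on="col2", how="left")
df = df[["index", "business_name", "email", "website", "col1"]]
```

Note that a business with several matching emails (amazon here) comes back as several rows, which is why the next answer groups the emails first.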
If you want to associate only the first email for a business in the CSV file with a website, you can do a groupby/first on the emails and then merge with the business dataframe. I'm also going to drop the original email column, since it serves no purpose:
import pandas
index = [0, 1, 4, 6]
business_name = ["apple", "home depot", "amazon", "microsoft"]
email = ["" for i in range(4)]
website = ["www.apple.com", "www.home-depot.com", "www.amazon.com", "www.microsoft.com"]
col1 = ["steve#apple.com", "jeff#amazon.com", "marc#amazon.com", "john#amazon.com", "marc#salesforce.com", "Dan#salesforce.com"]
col2 = ["www.apple.com", "www.amazon.com", "www.amazon.com", "www.amazon.com", "www.salesforce.com", "www.salesforce.com"]
# spreadsheet to insert values into
business = pandas.DataFrame({"index":index, "business_name":business_name, "email":email, "website":website})
# csv file that is read
email = pandas.DataFrame({"email":col1, "website":col2})
output = (
    business
    .drop(columns=['email'])  # this is empty and needs to be overwritten
    .merge(
        email.groupby('website', as_index=False).first(),  # just the first email
        on='website', how='left'  # left-join -> keep all rows from `business`
    )
    .loc[:, business.columns]  # get your original column order back
)
And I get:
index business_name email website
0 apple steve#apple.com www.apple.com
1 home depot NaN www.home-depot.com
4 amazon jeff#amazon.com www.amazon.com
6 microsoft NaN www.microsoft.com
Assuming that the spreadsheet is also a pandas dataframe and that it looks exactly like your image, there is a straightforward way of doing this using boolean indexing. I advise you to read further about it here: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
First, I suggest that you turn your CSV file into a dictionary where the website is the key and the e-mail address is the value. Seeing as you don't need more than one contact, this works well. The reason I asked that question is that a dictionary cannot contain identical keys, so some e-mail addresses would disappear. This is easily achieved by reading the CSV file in as a pandas Series and doing the following:
series = pd.Series.from_csv('path_to_csv')
# note: Series.from_csv was removed in pandas 1.0; in modern pandas use
# pd.read_csv('path_to_csv', header=None, index_col=0).squeeze('columns')
contacts_dict = series.to_dict()
Note that your order here would now be incorrect, in that you would have the e-mail as a key and the domain as a value. As such, you can do the following to swap them around:
contacts_dict = {value: key for key, value in contacts_dict.items()}
The reason for this step is that I believe it is easier to work with an expanding list of clients.
Having done that, you could then simply do:
for i in contacts_dict.keys():
    df1.loc[df1['website'] == i, 'e-mail'] = contacts_dict[i]
What this does is filter the rows matching each unique key in the dictionary (i.e. the domain) and assign them the value of that key, i.e. the e-mail.
Finally, I have deliberately attempted to provide you with a solution that is general and thus wouldn't require additional work in case you were to have 2000 different clients with unique domains and e-mails.
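Since pd.Series.from_csv no longer exists in modern pandas, the dictionary approach above can be sketched with pd.read_csv instead - the file contents and column values below are made up for illustration:

```python
import pandas as pd
from io import StringIO

# stand-in for the real CSV of "email,domain" rows
csv_text = "steve@apple.com,www.apple.com\njeff@amazon.com,www.amazon.com\n"
series = pd.read_csv(StringIO(csv_text), header=None, index_col=0).squeeze("columns")

# swap so the domain is the key and the e-mail is the value
contacts_dict = {domain: email for email, domain in series.to_dict().items()}

df1 = pd.DataFrame({"website": ["www.apple.com", "www.unknown.com"],
                    "e-mail": ["", ""]})
for domain in contacts_dict:
    # .loc avoids the chained-assignment pitfall of df1['e-mail'][mask] = ...
    df1.loc[df1["website"] == domain, "e-mail"] = contacts_dict[domain]
```

Rows whose website has no entry in the dictionary are simply left untouched.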

How to open the JSON inside a dataframe

I have a dataframe that comes from SharePoint (Microsoft), and many of its cells contain JSON with metadata. I don't usually work with JSON, so I'm struggling with it.
# df sample
+-------------+----------+
| Id | Event |
+-------------+----------+
| 105 | x |
+-------------+----------+
x = {"#odata.type":"#Microsoft.Azure.Connectors.SharePoint.SPListExpandedReference","Id":1,"Value":"Digital Training"}
How do I assign just the value "Digital Training" to the cell, for example? Note that this occurs across a lot of columns, and I need to solve it for those too. Thanks.
If the Event column consists of dict objects:
df['Value'] = df.apply(lambda x: x['Event']['Value'], axis=1)
If the Event column consists of strings:
import json
df['Value'] = df.apply(lambda x: json.loads(x['Event'])['Value'], axis=1)
Both result in
    Id                                              Event             Value
0  105  {"#odata.type":"#Microsoft.Azure.Connectors.Sh...  Digital Training
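Put together as a runnable sketch (the JSON string and the Id/Event column names are taken from the question; json.loads handles the string case):

```python
import json
import pandas as pd

raw = ('{"#odata.type":"#Microsoft.Azure.Connectors.SharePoint.'
       'SPListExpandedReference","Id":1,"Value":"Digital Training"}')
df = pd.DataFrame({"Id": [105], "Event": [raw]})

# parse each cell's JSON and keep only its "Value" field
df["Value"] = df.apply(lambda x: json.loads(x["Event"])["Value"], axis=1)
```

The same one-liner can be repeated per metadata column that needs unpacking.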

Extract value from specified row and column in CSV file using Python. Cannot use CSV module or pandas module

I have been provided with a .csv file, which has data on covid19. It is in the form of:
district | country | date1 | date2 | date3 |etc
victoria | australia |1 case | 3 cases |7 cases | etc
It is a fairly large file, with 263 rows of countries/districts, and 150 columns of dates.
The program needs to be able to take in an input district, country, and date and print out the number of COVID cases in that location as of that date. (print the value of a specified row and column of a CSV file)
We have been instructed not to use the csv module or the pandas module. I am having trouble understanding where to start. I will add my attempted solutions to this question as I go along. I'm not looking for a complete solution, but any ideas that I could try would be appreciated.
This is what I finally did to solve it. It works perfectly. For reference, the data file I am using is: https://portland-my.sharepoint.com/:x:/g/personal/msharma8-c_ad_cityu_edu_hk/ES7eUlPURzxOqTmRLmcxVEMBtemkKQzLcKD6U6SlbX2-_Q?e=tc5aJF
# for the purpose of this answer I preset the country, province, and date
country = 'Australia'
province = 'New South Wales'
date = '3/10/2020'

with open('covid19.csv', 'r') as f:
    final_list = []
    list0 = f.readline().strip().split(',')  # header row: the dates
    for line in f:
        if line.split(',')[0] == province:
            final_list = line.strip().split(',')
dict1 = dict(zip(list0, final_list))
print(dict1[date])
I will use the same logic to finish the solution.
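The same logic, wrapped into a function that also checks the country column, might look like this - the sample file name and case numbers below are invented:

```python
# tiny stand-in for the real covid19.csv
sample = ("district,country,3/9/2020,3/10/2020\n"
          "New South Wales,Australia,55,77\n"
          "Victoria,Australia,15,18\n")
with open("covid19_sample.csv", "w") as f:
    f.write(sample)

def lookup(path, district, country, date):
    """Return the cell in the row matching (district, country), column `date`."""
    with open(path) as f:
        header = f.readline().rstrip("\n").split(",")
        for line in f:
            fields = line.rstrip("\n").split(",")
            if fields[0] == district and fields[1] == country:
                # zip header and row into a dict, then pick the wanted date
                return dict(zip(header, fields))[date]
    return None

print(lookup("covid19_sample.csv", "Victoria", "Australia", "3/10/2020"))  # -> 18
```

This stays within the constraints (no csv module, no pandas) and returns None when no row matches.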

Python - How to replace all matching text in a column by a reference table - which requires replacing multiple matching text within a cell

Hi I'm totally new to Python but am hoping someone can show me the ropes.
I have a csv reference table which contains over 1000 rows with unique Find values, example of reference table:
| Find          | Replace |
|---------------|---------|
| D2-D32-dog    | Brown   |
| CJ-E4-cat     | Yellow  |
| MG3-K454-bird | Red     |
I need to do a find and replace of text in another csv file. Example of Column in another file that I need to find and replace (over 2000 rows):
| Pets                                 |
|--------------------------------------|
| D2-D32-dog                           |
| CJ-E4-cat, D2-D32-dog                |
| MG3-K454-bird, D2-D32-dog, CJ-E4-cat |
| T2- M45 Pig                          |
| CJ-E4-cat, D2-D32-dog                |
What I need is for python to find and replace, returning the following, and if no reference, return original value:
| Expected output    |
|--------------------|
| Brown              |
| Yellow, Brown      |
| Red, Brown, Yellow |
| T2- M45 Pig        |
| Yellow, Brown      |
Thanking you in advance.
FYI - I don't have any programming experience; I usually use Excel, but was told that Python would be able to achieve this. So I have given it a go in the hope of achieving the above - but it's returning an invalid syntax error...
import pandas as pd

dfRef1 = pd.read_csv(r'C:\Users\Downloads\Lookup.csv')
# File with the Find and Replace table
df = pd.read_csv(r'C:\Users\Downloads\Data.csv')
# File that contains text I want to replace

dfCol = df['Pets'].tolist()
# converting Pets column to list from Data.csv file

for x in dfCol:
    Split = str(x).split(',')
    # asking python to look at each element within row to find and replace
    newlist = []
    for index, refRow in dfRef1.iteritems():
        newRow = []
        for i in Split:
            if i == refRow['Find']:
                newRow.append(refRow['Replace']   # <- missing closing parenthesis
            else                                  # <- missing colon
                newRow.append(refRow['Find'])
        newlist.append(newRow)
newlist
# if match found replace, else return original text
# When run, the code is returning - SyntaxError: invalid syntax
# I've also noticed that dfRef1 is dtype: object
Am I even on the right track? Any advise is greatly appreciated.
I understand the concept of Excel VLOOKUP; however, because the cell value contains multiple lookup items which I need to replace within the same cell, I'm unable to do this in Excel.
Thanks again.
You can save the Excel file as CSV to make your life easier, then strip the file so it contains only the table, without any unnecessary information.
Load the CSV files into Python with pandas:
import pandas as pd
df_table1 = pd.read_csv("file/path/filename.csv")
df_table2 = pd.read_csv("file/path/other_filename.csv")
df_table1[['wanted_to_be_replaced_col_name']] = df_table2[['wanted_col_to_copy']]
For further information and more complex assignments, visit the pandas documentation at https://pandas.pydata.org/
Hint: for a large number of columns, check the iloc function.
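The per-token replacement the question asks for can be sketched with a plain dictionary built from the lookup table. The data below is copied from the question's examples; the function name is mine:

```python
import pandas as pd

# stand-ins for Lookup.csv and Data.csv from the question
ref = pd.DataFrame({"Find": ["D2-D32-dog", "CJ-E4-cat", "MG3-K454-bird"],
                    "Replace": ["Brown", "Yellow", "Red"]})
data = pd.DataFrame({"Pets": ["D2-D32-dog",
                              "CJ-E4-cat, D2-D32-dog",
                              "T2- M45 Pig"]})

# build one dict from the reference table: token -> replacement
mapping = dict(zip(ref["Find"], ref["Replace"]))

def replace_tokens(cell):
    # split the cell on commas, replace each known token, keep unknown ones
    return ", ".join(mapping.get(tok.strip(), tok.strip())
                     for tok in str(cell).split(","))

data["Pets"] = data["Pets"].apply(replace_tokens)
```

Because the dict lookup falls back to the original token, cells like "T2- M45 Pig" pass through unchanged, as the expected output requires.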

Python - Print list of CSV strings in aligned columns

I have written a fragment of code that is fully compatible with both Python 2 and Python 3. The fragment that I wrote parses data and it builds the output as a list of CSV strings.
The script provides an option to:
write the data to a CSV file, or
display it to the stdout.
While I could easily iterate through the list and replace , with \t when displaying to stdout (second bullet option), the items are of arbitrary length, so they don't line up in a nice format due to variances in tabs.
I have done quite a bit of research, and I believe that string format options could accomplish what I'm after. That said, I can't seem to find an example that helps me get the syntax correct.
I would prefer to not use an external library. I am aware that there are many options available if I went that route, but I want the script to be as compatible and simple as possible.
Here is an example:
value1,somevalue2,value3,reallylongvalue4,value5,superlongvalue6
value1,value2,reallylongvalue3,value4,value5,somevalue6
Can you help me please? Any suggestion will be much appreciated.
import csv
try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO       # Python 3

rows = list(csv.reader(StringIO(
    '''value1,somevalue2,value3,reallylongvalue4,value5,superlongvalue6
value1,value2,reallylongvalue3,value4,value5,somevalue6''')))
widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
for row in rows:
    print(' | '.join(cell.ljust(width) for cell, width in zip(row, widths)))
Output:
value1 | somevalue2 | value3           | reallylongvalue4 | value5 | superlongvalue6
value1 | value2     | reallylongvalue3 | value4           | value5 | somevalue6
def printCsvStringListAsTable(csvStrings):
    # convert to list of lists (list comprehension keeps this Python 2/3 compatible)
    csvStrings = [x.split(',') for x in csvStrings]
    # get max column widths for printing
    widths = []
    for idx in range(len(csvStrings[0])):
        columns = [x[idx] for x in csvStrings]
        widths.append(len(max(columns, key=len)))
    # print the csv strings
    for row in csvStrings:
        cells = []
        for idx, col in enumerate(row):
            fmt = '%-' + str(widths[idx]) + 's'
            cells.append(fmt % col)
        print(' |'.join(cells))

if __name__ == '__main__':
    printCsvStringListAsTable([
        'col1,col2,col3,col4',
        'val1,val2,val3,val4',
        'abadfafdm,afdafag,aadfag,aadfaf',
    ])
Output:
col1      |col2    |col3   |col4
val1      |val2    |val3   |val4
abadfafdm |afdafag |aadfag |aadfaf
The answer by Alex Hall above is definitely better - it is a terser form of the same code I have written.
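Since the question asked about string format options specifically: str.format width specifiers can do the same padding. A minimal Python 3 sketch using the question's sample rows:

```python
rows = [line.split(',') for line in (
    "value1,somevalue2,value3,reallylongvalue4,value5,superlongvalue6",
    "value1,value2,reallylongvalue3,value4,value5,somevalue6",
)]
widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
# build one format string such as '{:<6} | {:<10} | ...' from the widths
fmt = ' | '.join('{:<%d}' % w for w in widths)
for row in rows:
    print(fmt.format(*row))
```

The `<` in each placeholder means left-align; the number is the column width, so short cells are padded with spaces to line up.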
