I have a CSV with a large number of rows, from a user-submitted form. Each row includes a user email, and a field for them to list other user emails in their group. I've written a short script so far using Python and pandas that loads the CSV into a dataframe and cleans up entries.
I want to sort the rows by group, but am running into a few conceptual problems. Since it's user-entered, the list is not necessarily complete or correctly spelled. What's the best way to deal with this? I'm entirely new to parsing data like this and rather inexperienced overall.
Here's some example data to show what I mean:
email,group
user1@a.com, "['user4@b.com','user3@c.com']"
user2@a.com,
user3@c.com, "['user1@a.com']"
user4@b.com, "['user1@a.com','user3@b.com']"
So here user1, user3, and user4 are in a group. The problem is that user3 only listed user1.
My first thought was to append the submitting user's email to the group list, sort each list, and then sort the column alphabetically. However, that only works if everyone's group entries are complete.
I'd rather not pick out 200 groups by hand, but I'm lost as to how to proceed.
This is my current plan in pseudocode:
data           # dataframe containing imported CSV
sorted_groups  # result dataframe with equivalent rows, but sorted into groups

sort(data) by len(data[group])
for each row in data:
    append row to sorted_groups
    search for rows where email == entry in groups
    append matching rows to sorted_groups
    remove matching rows from data
    remove initial row from data
This will definitely fail on misspellings, and only works if at least one person in the group got everything right. It's the best I have at the moment, though.
Thanks for taking the time to read this. Please let me know if I can clarify anything, and point me in the right direction!
I'm not sure how your data is stored, so I'm writing this assuming you have a list of rows of data, and each row contains all of the email addresses entered in the form. e.g.,
rows = [['user1@a.com', 'user4@b.com', 'user3@c.com'],
        ['user2@a.com'],
        ['user3@c.com', 'user1@a.com'],
        ['user4@b.com', 'user1@a.com', 'user3@b.com']]
I'm also assuming that each user belongs to one and only one group, each user has submitted the form, and each user did not misspell their email.
We can obtain a set of valid email addresses using
valid = {row[0] for row in rows}
We can build a dictionary mapping users to groups, merge groups as we go, and remove invalid emails.
ugDict = {}
for row in rows:
    # start with the valid addresses listed on this row
    mergedGroup = set(row) & valid
    # merge in any groups these users already belong to
    for user in row:
        if user in ugDict:
            mergedGroup |= ugDict[user]
    # point every member at the merged group
    for user in mergedGroup:
        ugDict[user] = mergedGroup
This results in a mapping from users to groups; mistyped email addresses are dropped by the intersection with valid. You'll have to decide how to validate emails -- you might just want to ignore the typos.
Now, to get a sorted list of groups, create a set of all groups, and use the sorted function.
sortedGroups = sorted({frozenset(g) for g in ugDict.values()})
frozenset(g) makes Python's set object hashable, so the groups can be collected into a set (and deduplicated) before sorting.
The result?
sortedGroups = [frozenset({'user2@a.com'}),
                frozenset({'user1@a.com', 'user3@c.com', 'user4@b.com'})]
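If the rows live in a pandas DataFrame as in the question, one way to get back to sorted rows is to number the groups and sort on that number. A rough sketch, assuming a DataFrame named data with an email column (the filename is made up):

import pandas as pd

# hypothetical: the question's CSV loaded into a DataFrame
data = pd.read_csv('groups.csv')

# give each group an index, tag every row with its group's index, then sort
group_id = {user: i for i, group in enumerate(sortedGroups) for user in group}
data['group_id'] = data['email'].map(group_id)
sorted_rows = data.sort_values('group_id')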
I am not sure how to fix this. This is the code I want, but I do not want it to continuously repeat the names of the rows in the output.
I'd suggest a few changes to your code.
Firstly, to answer your question, you can remove the multiple occurrences of the words by using:
select_merch = df.loc[df['Category'] == 'Merchandise'].sum()['Cost']
This will make sure to select only the sum of the Cost column for a particular category. Also, this code is very redundant and confusing. What you can do instead is create a list and iterate over it for each category.
list(df['Category'].unique()) will give you a list of all the unique categories. Store it in a list and then iterate over it. Plus, you don't need to do d = pd.DataFrame(df) every time; you can use df itself.
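A rough sketch of that loop, assuming df has 'Category' and 'Cost' columns (the sample values here are made up):

import pandas as pd

df = pd.DataFrame({'Category': ['Merchandise', 'Food', 'Merchandise'],
                   'Cost': [10.0, 5.5, 7.25]})

for category in df['Category'].unique():
    total = df.loc[df['Category'] == category, 'Cost'].sum()
    print(category, total)

pandas can also collapse the whole loop into one call: df.groupby('Category')['Cost'].sum().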
I am working on a project to simply receive information from the exchange and spit it out onto an Excel spreadsheet. I have run into a road block where I am not sure how I would remove an unwanted list which is under the same variable name as a wanted list. For clarification (see the running_program screenshot): the red pen indicates the list I want to remove, and the green the useful information. Removing this will allow me to output the useful information into an Excel spreadsheet to be viewed. This doesn't currently work due to the prior lists being used instead of the useful lists. How would I fix this? There are a few solutions, such as removing said lists or using a different method to export the data to an Excel spreadsheet. Any help would be appreciated.
I have established a connection with the API, then sorted the incoming data into lists using [start:end:step] to get all relevant information together; however, this is what led to sorting the "receiving message:" lines into lists too. So how would I remove this information to only use the useful information?
every_mark_price = Response[4::4]  #gets every mark price value from response
mp_only = []
for items in every_mark_price:
    sort = items.replace("\"mark_price\":", "")  #removes mark_price: from value
    mp_only.append(sort)
df['Mark_Price'] = pd.Series(mp_only)  #changes list into pandas Series and outputs into excel

every_iv = Response[5::4]  #gets every iv value from response
iv_only = []
for items in every_iv:
    sort = items.replace("\"iv\":", "")  #removes iv: from each value
    iv_only.append(sort)
df['IV'] = pd.Series(iv_only)  #changes list to pandas Series and outputs into excel

every_instrument_name = Response[6::4]  #gets every instrument name
in_only = []
for items in every_instrument_name:
    sort = items.replace("\"instrument_name\":", "")
    sort_ = sort.replace("}", "")  #removes unnecessary char
    in_only.append(sort_)  #adds it to new list
df['Instrument_Name'] = pd.Series(in_only)  #changes list into pandas Series and outputs into excel
Where I have used pandas to output into Excel, it will only output the information highlighted in red in the image earlier (the receiving message info), and to prevent this I need to somehow filter said info out or find a way to skip it (see the excel output screenshot).
I have multiple data frames to compare. My problem is the product IDs. One is set up like:
000-000-000-000
Vs
000-000-000
(gross)
I have looked on here, Reddit, YouTube, and even went deep down the rabbit hole trying .join, .append, and some other methods I've never seen before or even understand yet. Is there a way (or even better, some documentation I can read to learn this) to pull the Product ID from the main Excel sheet and compare it to the one(s) that should match? Then I will more than likely make the IDs match in place across all sheets. That way I can use those IDs as the index and do a side-by-side compare of the ID to the row data. Each ID has about 113 values to compare; that's 113 columns, but for each row, if that makes sense.
Example: (the colorful columns are the main sheet that the non-colored column will be compared to)
Additional notes:
The highlighted yellow IDs are "unique", and I won't be changing those, but will instead write them to a list or something and use an if statement to ignore them when found.
Edit:
So I wrote this code, which does almost exactly what I need.
It takes out the "-", which I apply to all my IDs. I just need to make a list of the IDs that are unique, to skip over when taking away the extra digits.
dfSS["Product ID"] = dfSS["Product ID"].str.replace("-", "")
Then this will keep only the first 9 digits, except for the unique IDs:
dfSS["Product ID"] = dfSS["Product ID"].str[:9]
Will add the full code below here once I get it to work 100%.
I am now trying to figure out how to say something like:
lst = [1, 2, 3, 4, 5]
if dfSS["Product ID"] not in lst:
    dfSS["Product ID"] = dfSS["Product ID"].str.replace("-", "").str[:9]
This code does not work, but every day I get closer and closer to being able to compare these similar yet different data frames. The lst is just an example of 000-000-000 Product IDs in a list that I do not want to filter at all, but keep in the data frame.
If the ID transformation is predictable, then one option is to use regex for homogenizing IDs. For example if the situation is just removing the first three digits, then something like the following can be used:
df['short_id'] = df['long_id'].str.extract(r'\d\d\d-([\d-]*)')
If the ID transformation is not so predictable (e.g. due to transcription errors or some other noise in the data) then the best option is to first disambiguate the ID transformation using something like recordlinkage, see the example here.
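For example, on made-up IDs in the long format shown above, the extraction would behave like this (the column names are assumptions):

import pandas as pd

df = pd.DataFrame({'long_id': ['123-456-789-012', '987-654-321-000']})
df['short_id'] = df['long_id'].str.extract(r'\d\d\d-([\d-]*)')
print(df['short_id'].tolist())  # ['456-789-012', '654-321-000']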
OK, solved this for every Product ID with or without dashes, #, letters, etc.:
(\d\d\d-)?[_#\d-]?[a-zA-Z]?
(\d\d\d-)? - This is for the first & second three-integer sets, with their trailing dash, matched zero or one time (optional)
[_#\d-]? - This is for any special chars and additional numbers (optional)
[a-zA-Z]? - This, not sure why, but I had to separate it from the last part because otherwise it wouldn't pick up every letter (optional)
With the above I solved everything I needed for RE.
Where I learned how to improve my RE skills:
RE Documentation
Automate the Boring Stuff- Ch 7
You can test your REs here
An additional way to write this. I put it here to show there is no one way of doing it. RE is super awesome:
(\d{3}-)?[_#\d{3}-]?[a-zA-Z]?
//EDIT: This question is kind of a sub-question. For a shorter and better example, which has better replies, check This Post
I'm very new to Python and even newer to pandas.
I've been working with it for at least a month and I think I've got most of the basics together.
My current task is to write values into a certain cell, in a certain place inside of an xlsx-file.
Situation
I have a very big Excel file including various data, from names to email addresses and everything. As well, I have two lists (.txt files) with the same email addresses as in the Excel file, but those emails got verified according to whether they pass certain security checks or not. Depending on the outcome, they got stored inside of the "Secured.txt" or the "Unsecured.txt" file.
To write and read in the excel-file, I use pandas.
Task
Next to the 'Emails' column in the Excel file, there is a column in which you mark with an entry whether the email is secured or unsecured. My actual task is to insert those entries, depending on which text file the email lies in.
Possible Solution
My approach to solving this problem is to read out each .txt file and store each email address in a variable using a list and a for-loop. Iterating through those emails, I now want to look for the location of the email address inside of the Excel file and access the cell right next to it. Same row, different column. Since the emails got sorted according to their security validation before, I can just put the corresponding value into the validation cell right next to the email.
Question
My question is the following: How do I approach a specific row based on a value in it?
I want to find the place of the cell which includes the actual content of the variable "mails", so I can move over to the cell right next to it. Since I know all the names of the columns, I actually just need the index of the row in which the email lies. I got the x-coordinate and need the y-coordinate.
Example
What I have up until now is the readout of the .txt-file:
import pandas as pd
import os
import re
#fetching the mail address through index number out of the list
with open('Protected/Protected G.txt', 'r') as file:
    #creating the regex pattern to sort out the mail addresses
    rgx = r'\S+@\S+'
    #read the file and convert the list into a string
    content = file.readlines()
    content_str = ''.join(content)
    #get the mails out of the "list" with regex
    mails = re.findall(rgx, content_str)
#put each mail address in a variable
for item in mails:
    print(item)
This dummy-dataframe represents the excel sheet I'm working with:
Dummy-Dataframe:
   Forename  Last Name            Email  Protection
1      John    Kennedy     John@gmx.net
2    Donald      Trump   Donald@gmx.net
3      Bill    Clinton     Bill@gmx.net
4   Richard     Nixton  Richard@gmx.net
I now want to pass the actual address, stored in the variable 'item', to some kind of "locate" function of pandas in order to find out in which row the actual email lies. As soon as I know in which row the address lies, I can then tell pandas to write either an "x", saying the mail is protected, or an "o", meaning the mail is unprotected, in the very next column.
My finished dataframe could look like this:
Finished Dataframe:
   Forename  Last Name            Email  Protection
1      John    Kennedy     John@gmx.net           x
2    Donald      Trump   Donald@gmx.net           o
3      Bill    Clinton     Bill@gmx.net           x
4   Richard     Nixton  Richard@gmx.net           x
I really appreciate the help.
To make sure I understand: you have a text file for protected emails and one for unprotected. I am making a large assumption that you never have an email in both.
import pandas as pd

df = pd.read_csv('Protected/Protected G.txt', header=None, sep=" ")
df.columns = ['Protected Emails']
df2 = pd.read_excel('dummy-excel')

# default everything to 'o', then mark 'x' where the email is in the protected list
df2['Protection'] = 'o'
df2.loc[df2['Email'].isin(df['Protected Emails']), 'Protection'] = 'x'

writer = pd.ExcelWriter('ProtectedEmails.xlsx')
df2.to_excel(writer, 'Sheet1')  # or whatever you want to name your sheet
writer.save()
maybe something like that, though I don't know what the text file of emails looks like.
Your question is different from the content. This is a simple answer that might, somehow, be useful.
Assume that this is a dataframe:
Z = pd.DataFrame([1,2,4,6])
Now, let us access the number 4. There is a single column. Usually, the first column is assigned the name 0 as a heading. The required number, 4, is in the third place of the dataframe. As Python starts the indexes of lists, dfs, arrays, etc. from 0, the index of the number 4 is 2.
print(Z[0][2])
This would output 4.
Try applying the same thing to your data. Just make sure to know the names of the headings. Sometimes they are not numbers, but strings.
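For the original question (finding the row that holds a given email and writing into the column next to it), a boolean mask on the Email column does the lookup. A small sketch, using the dummy columns from the question:

import pandas as pd

df = pd.DataFrame({'Forename': ['John', 'Donald'],
                   'Last Name': ['Kennedy', 'Trump'],
                   'Email': ['John@gmx.net', 'Donald@gmx.net'],
                   'Protection': ['', '']})

item = 'John@gmx.net'  # one address read out of the .txt file
# write 'x' in Protection on every row whose Email equals item
df.loc[df['Email'] == item, 'Protection'] = 'x'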
I'm using the python packages xlrd and xlwt to read and write from excel spreadsheets using python. I can't figure out how to write the code to solve my problem though.
So my data consists of a column of state abbreviations and a column of numbers, 1 through 7. There are about 200-300 entries per state, and I want to figure out how many ones, twos, threes, and so on exist for each state. I'm struggling with what method I'd use to figure this out.
Normally I would post the code I already have, but I don't even know where to begin.
1. Prepare a dictionary to store the results.
2. Get the number of lines with data you have using xlrd, then iterate over each of them.
3. For each state code, if it's not in the dict, create it also as a dict.
4. Then check if the entry you read in the second column exists within the state key of your results dict.
   4.1 If it does not, add the number found in the second column as a key to that state's dict, with a value of one.
   4.2 If it does, just increment the value for that key (+1).
5. Once it has finished looping, your result dict will have the count for each individual entry in each individual state, as sketched below.
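A minimal sketch of those steps; the filename and column positions are assumptions:

import xlrd

book = xlrd.open_workbook('states.xls')  # assumed filename
sheet = book.sheet_by_index(0)

results = {}
for row in range(sheet.nrows):
    state = sheet.cell_value(row, 0)   # first column: state abbreviation
    number = sheet.cell_value(row, 1)  # second column: the 1-7 value
    results.setdefault(state, {})
    results[state][number] = results[state].get(number, 0) + 1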
I'm going to assume you already know how to do the easy part of this and read a spreadsheet into Python as a list of lists. So, you've got something like this:
data = [['CA', 1],
        ['AZ', 2],
        ['NM', 3],
        ['CA', 2]]
Now, what you want is, for each state and for each number, a count of the number of times that number appears. So:
import collections

counts = {}
for state, number in data:
    counts.setdefault(state, collections.Counter())[number] += 1
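With the sample data above, counts ends up holding one Counter per state, for example:

print(counts['CA'])     # Counter({1: 1, 2: 1})
print(counts['CA'][2])  # 1 -- how many CA rows had a 2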