Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 3 years ago.
I have a CSV in which a student appears on multiple lines. The goal is to obtain a CSV where each student's name appears only once, with a "Sports" column that gathers all the sports practiced by that student, separated by spaces (see the screenshots).
(screenshot: csv)
(screenshot: final csv)
I'm not going to post a full solution, as this sounds like a homework problem. If this is in fact for a school assignment, please edit your question to include that information.
From your description, the problem can be broken into three steps, each of which can be written as independent code in your solution:
1. Parse the CSV file.
2. Create a new data structure that reduces the number of rows and adds a new column.
3. Output the data to a new CSV file.
Steps 1 and 3 are the simplest. You will want to use things like with open('file', 'r'), str.split(','), and ",".join().
For step 2, the problem is easier to understand if you think in terms of dictionaries. If you can turn your original data (which is a list of rows) into a dictionary of rows, then it becomes easier to detect duplicates. All dictionary keys must be unique, and you already know that you have a key (the student name) that you would like to be unique, but isn't.
Your code for step 2 will iterate over the list of rows, adding each one to a dictionary using student_name as a unique key. If that key already exists, then instead of adding a new entry, you will need to modify the existing entry's "sports" field.
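To illustrate just the dictionary idea from step 2 (deliberately not a complete solution, for the same homework reasons), here is a minimal sketch; it assumes step 1 has already produced rows as a list of [name, sport] pairs, which is an assumption about your column layout:
merged = {}
for name, sport in rows:              # rows is assumed: a list of [name, sport] pairs from step 1
    if name in merged:
        merged[name] += ' ' + sport   # student seen before: extend the "Sports" string
    else:
        merged[name] = sport          # first occurrence: start the "Sports" string
# merged now maps each student name to a space-separated string of sports; step 3 writes it out.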
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
I have a 278 x 2 data frame, and I want to find the rows that have 2 consecutive declining values in the second column. Here's a snippet:
I'm not sure how to approach this problem. I've searched how to identify consecutive declining values in a data frame, but so far I've only found questions that pertain to consecutive SAME values, which isn't what I'm looking for. I could iterate over the data frame, but I don't believe that's very efficient.
Also, I'm not asking for someone to show me how to code this problem. I'm simply asking for potential ways I could go about solving this problem on my own because I'm unsure of how to approach the issue.
1. Use shift to create a temporary column with all values shifted up one row.
2. Compare the two columns, "GDP" > "shift". This gives you a new column of Boolean values.
3. Look for consecutive True values in this Boolean column. That identifies two consecutive declining values.
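A minimal pandas sketch of that shift-and-compare idea; the toy numbers and the 'Year' column are invented, and only the 'GDP' column name is taken from the steps above:
import pandas as pd

df = pd.DataFrame({'Year': range(2010, 2018),                      # toy stand-in for the 278 x 2 frame
                   'GDP': [5.0, 4.8, 4.5, 4.7, 4.6, 4.4, 4.9, 5.1]})

declining = df['GDP'] > df['GDP'].shift(-1)                        # True where the next value is lower
two_in_a_row = declining & declining.shift(-1, fill_value=False)   # True where two declines start
print(df[two_in_a_row])                                            # rows 2010 and 2013 in this toy data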
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I have a large data file where each row looks as follows; each pipe-delimited value represents a consistent variable (e.g. 1517892812 and 1517892086 represent the Unix timestamp, and the last pipe-delimited field will always be UnixTimestamp):
264|2|8|6|1.32235000|1.33070000|1.31400000|1257.89480966|1517892812
399|10|36|2|1.12329614|1.12659227|1.12000000|148194.47200218|1517892086
How can I pull out the values I need to make variables in Python? For example, looking at a row and getting UnixTimestamp=1517892812 (and other variables) out of it.
I want to pull out each relevant variable per line, work with them, and then look at the next line and reevaluate all of the variable values.
Is RegEx what I should be dealing with here?
No need for regex, you can use split():
int(a.strip().split('|')[-1])
If all the values are numeric and you want a matrix with all your values, you can simply do something like:
[[float(x) for x in line.strip().split('|')] for line in your_data.splitlines()]
You can also use a regex with re.search() (remember to import re first):
int(re.search(r'[^|]+$', text).group())
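If you need several named variables per line rather than just the timestamp, here is a minimal sketch of the split approach; the file name and every field name except the Unix timestamp are invented placeholders, since the question only identifies the last field:
# All field names except the timestamp are placeholders -- rename them to match your data.
FIELDS = ['field_1', 'field_2', 'field_3', 'field_4',
          'field_5', 'field_6', 'field_7', 'field_8', 'unix_timestamp']

def parse_line(line):
    return dict(zip(FIELDS, line.strip().split('|')))

with open('data.txt') as f:               # 'data.txt' is an assumed file name
    for line in f:
        row = parse_line(line)
        unix_timestamp = int(row['unix_timestamp'])
        # ... work with the other values for this line, then move on to the next one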
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
I have two CSV files, and I would like to validate (find the differences and similarities between) the data in these two files.
I am retrieving this data from Vertica, and because the data is so large I would like to do the validation at the CSV level.
csvdiff allows you to compare the semantic contents of two CSV files, ignoring things like row and column ordering in order to get to what’s actually changed. This is useful if you’re comparing the output of an automatic system from one day to the next, so that you can look at just what’s changed.
I don't think you can directly compare sheets using openpyxl without manually looping over each row and writing your own validation code.
Whether that is acceptable depends on your performance goals; if speed is not a requirement it can work, but it will require some additional effort.
Instead, I would use pandas DataFrames for any CSV validation needs; if you can add this dependency, comparing files becomes much easier while keeping performance high.
Here is a link to a complete example:
http://pbpython.com/excel-diff-pandas.html
However, use read_csv() instead of read_excel() to read data from your files.
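As a minimal sketch of one generic way to do this with pandas (an outer merge with an indicator column, not the exact method from the linked article; the file names are assumptions):
import pandas as pd

old = pd.read_csv('export_day1.csv')      # assumed file names -- replace with your exports
new = pd.read_csv('export_day2.csv')

# Merging on all shared columns with indicator=True labels each row as
# 'both', 'left_only' or 'right_only'.
diff = old.merge(new, how='outer', indicator=True)
print(diff[diff['_merge'] == 'both'])     # rows identical in both files (similarities)
print(diff[diff['_merge'] != 'both'])     # rows present in only one file (differences)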
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 7 years ago.
I want to write a Python 3 script to manage my expenses, and I'm going to have a rules filter that says 'if the description contains a particular string, categorize it as x', and these rules will be read in from a text file.
The only way I can think of doing this is to apply str.find() for each rule on the description of each transaction, and break if one is found, but that is quadratic (every rule checked against every transaction). Is there a better way of doing this?
Strip punctuation from the description, and split it into words. Make the words in the description into a set, and the categories into another set.
Since sets, like dictionaries, are built on hash tables, average membership checking is O(1).
Only when a transaction is entered (or changed), intersect both sets to find the categories that apply (if any), and add the categories to your transaction record (dict, namedtuple, whatever).
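A minimal sketch of that set-based idea, assuming the rules file maps a keyword to a category; the sample rules and the sample description are invented:
import string

rules = {'starbucks': 'coffee', 'shell': 'fuel', 'netflix': 'subscriptions'}   # assumed keyword -> category rules

def categorize(description):
    cleaned = description.lower().translate(str.maketrans('', '', string.punctuation))
    words = set(cleaned.split())                     # word set: average O(1) membership checks
    return {rules[w] for w in words & rules.keys()}  # intersect description words with rule keywords

print(categorize('STARBUCKS #1234, downtown'))       # -> {'coffee'}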
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 7 years ago.
I have been given a set of 20,000 entries in Excel. Each entry is a string, and they are all names of events such as: Daytona 500, NASCAR, 3x1 Brand Rep, etc.
Many of the event names are repeated, and I would like to make a list, sort it, and find the most common items in the list and how many times each one appears. I am halfway through my first semester of Python and have just learned about lists, and would like to use Python 2.7 for this task, but I am also open to using Excel or R if one of those makes more sense.
I'm not sure where to start or how to input such a large list into a program.
In Excel I would use a PivotTable; it takes about 15 seconds to set up. In Python, start with your list of entries:
your_list = ['Daytona 500', 'NASCAR'] # more values of course
Now use a dictionary comprehension to count the occurrences of each unique item:
your_dict = {i:your_list.count(i) for i in set(your_list)}
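As an alternative sketch, collections.Counter in the standard library does the same counting and also sorts by frequency; it works the same on Python 2.7, and the sample values here are placeholders:
from collections import Counter

your_list = ['Daytona 500', 'NASCAR', 'Daytona 500']   # placeholder values
counts = Counter(your_list)
print(counts.most_common(10))    # the 10 most frequent event names with their counts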