I have a CSV file with 100,000 rows.
Each row in column A is a sentence comprised of both chars and integers.
I want column B to contain only integers.
I want the new columns to be in the same CSV file.
How can I accomplish this?
If I'm understanding your question correctly, I would use .isdigit() to parse the data in column A. I'm frankly not sure what the format of column A is, so I don't know exactly what you would do with this (if you gave more information I could give a more specific answer). Your solution will likely come in a similar form to this:
def find(lines):
B = []
for line in lines:
numbers = [c for c in line if c.isdigit()]
current = int(''.join(numbers))
# current is the concatenation of all
# integers found in column A from left to right
B.append(current)
return B
Let me know if this makes sense or is even in the right track for your solution. Once again, without knowing what you're trying to do, and what A looks like, I'm not sure what your actual goals are.
EDIT
I'm not going to explain the csv stuff for you, mainly because there is a fantastic resource and library for it included in python here. If you have specific questions related to writing csv, definitely post them.
It sounds like you essentially want to pull int values out of column A then add them to a new column B. There are definitely many ways to solve this, but the general form of the problem is for each row you'll filter out the int, then you'll add the filtered int into the new column. I'll list a couple:
Regex: You could use a pattern such as [0-9]+ to pull the string out of A, then use int(whatever that output is) to cast to int, then store those values in B. I'm a sucker for a good regular expression and this one is fairly straight forward. Regexr is a great resource to learn about this and test your pattern.
Use an algorithm similar to above: The above algorithm worked before, but I've updated it slightly. Now that it's been updated it'll return an array of numbers correspondent to numbers in A from left to right. This is relatively sound, but it doesn't necessarily guarantee you have the right integer, given that if the title has an int in it, it'll mess some things up. It is likely one of the more clear ways of doing this, though.
Related
I'm about to pull my hair out on this. I'm not sure why the index in my array is not being implemented in the second column.
I created this array - project_information :
project_information.append([proj_id,project_text])
When I print this out, I get the rows and columns. It contains about 40 rows.
When I iterate through it to print out the contents, everything comes out fine. I am using this:
for i in range(0,len(project_information)):
project_id = project_information[i][0]
project_text = project_information[i][1]
print(project_id)
print (project_text)
The project_text column contains text, while the project_id contains integers. It prints out perfectly, and the index, changes for both project_id and project_text.
However, I need to use the project_text in a different way, and I am really struggling with this. I need to slice the text to a shorter text for reuse. To do this, I tried:
for i in range(0,len(project_information)):
project_id = project_information[i][0]
project_text = project_information[i][1]
print(project_id)
print (project_text)
if len(project_text) > 5000:
trunc_proj_text = project_text[:1000]
else:
trunc_proj_text = project_text
print (project_id)
print(trunc_proj_text)
The problem I'm having here is that though the project_id column is being iterated through properly, the project_text is not. What I am getting is just the text in the first row for the project_text, sliced, and repeated for as many times as the length of the array.
I have tried different ways, and also a while loop, but it is still not working.
I've also looked at these answers for reference - Slicing,indexing and iterating over 2D Numpy arrays,Efficient iteration over slice in Python, iteration over list slices, and I can't seem to see how they can be applied to my problem.
I'm not well-versed in using Numpy, so is this something that it could help with? I'm well aware this might be simple and I'm missing it because I've been working on various aspects of this project for the past weeks, so I would appreciate a bit of consideration in this.
Thanks in advance.
The problem was with the input list here, so the slicing with this code does in fact work. The code to create the input array has now been fixed. The original code to create the input list was concatenating the strings for each entry, so the project_texts for each appeared different from the end, but all had the same beginning. But viewing this on a console, it was hard to see.
I have multiple data frames to compare. My problem is the product IDs. one is set up like:
000-000-000-000
Vs
000-000-000
(gross)
I have looked on here, reddit, YouTube, and even went deep down the rabbit hole trying .join, .append, some other method I've never seen before, or even understand yet. Is there a way(or even better some documentation I can read on to learn this) to pull the Product ID from the Main excel sheet, compare it to the one(s) that should match. Then i will more than like make the in place ID across all sheets. That way I can use those IDs as the index and do a side by side compare of the ID to row data? Each ID has about 113 values to compare. That's 113 columns, but for each row if that make sense
Example: (colorful columns is main sheet that the non colored column will be compared to)
additional notes:
The highlighted yellow IDs are "unique", and I wont be changing those but instead write them to a list or something and use an if statement to ignore them when found.
Edit:
so I wrote this code which is almost perfect what I need to do with this.
It takes out the "-" which I apply to all my IDs. Just need to make a list of ID that are unique to skip over on taking away the zeros
dfSS["Product ID"] = dfSS["Product ID"].str.replace("-", "")
Then this will only list the digits up to 9 digits, except the unique IDs
dfSS["Product ID"] = dfSS["Product ID"]str[:9]
Will add the full code below here once i get it to work 100%
I am now trying to figure out how to say somethin like
lst =[1,2,3,4,5]
if dfSS["Product ID"] not in lst:
dfSS["Product ID"] = dfSS["Product ID"].str.replace("-", "").str[:9]
This code does not work but everyday I get closer and closer to being able to compare these similar yet different data frames. the lst is just an example of a 000-000-000 Product IDs in a list that I do not want to filter at all. but keep in the data frame
If the ID transformation is predictable, then one option is to use regex for homogenizing IDs. For example if the situation is just removing the first three digits, then something like the following can be used:
df['short_id'] = df['long_id'].str.extract(r'\d\d\d-([\d-]*)')
If the ID transformation is not so predictable (e.g. due to transcription errors or some other noise in the data) then the best option is to first disambiguate the ID transformation using something like recordlinkage, see the example here.
Ok solved this for every Product ID with or without dashes, #, ltters, etc..
(\d\d\d-)?[_#\d-]?[a-zA-Z]?
(\d\d\d-)? -This is for the first & second three integer sets, w/ zero or more matches and a dashes (non-greedy)
[_#\d-]? - This is for any special chars and additional numbers (non-greedy)
[a-zA-Z]? - This, not sure why, but I had to separate from the last part due to it wouldn't pick up every letter. (non-greedy)
With the above I solved everything I needed for RE.
Where I learned how to improve my RE skills:
RE Documentation
Automate the Boring Stuff- Ch 7
You can test you RE's here
Additional way to show this. Put this here to show there is no one way of doing it. RE is super awesome:
(\d{3}-)?[_#\d{3}-]?[a-zA-Z]?
I am relatively new to Python. However, my needs generally only involve simple string manipulation of rigidly formatted data files. I have a specific situation that I have scoured the web trying to solve and have come up blank.
This is the situation. I have a simple list of two-part entires, formatted like this:
name = ['PAUL;25', 'MARY;60', 'PAUL;40', 'NEIL;50', 'MARY;55', 'HELEN;25', ...]
And, I need to keep only one instance of any repeated name (ignoring the number to the right of the ' ; '), keeping only the entry with the highest number, along with that highest value still attached. So the answer would look like this:
ans = ['MARY;60', 'PAUL;40', 'HELEN;25', 'NEIL;50, ...]
The order of the elements in the list is irrelevant, but the format of the ans list entries must remain the same.
I can probably figure out a way to brute force it. I have looked at 2D lists, sets, tuples, etc. But, I can't seem to find the answer. The name list has about a million entries, so I need something that is efficient. I am sure it will be painfully easy for some of you.
Thanks for any input you can provide.
Cheers.
alkemyst
Probably the best data structure for this would be a dictionary, with the entries split up (and converted to integer) and later re-joined.
Something like this:
max_score = {}
for n in name:
person, score_str = n.split(';')
score = int(score_str)
if person not in max_score or max_score[person] < score:
max_score[person] = score
ans = [
'%s;%s' % (person, score)
for person, score in max_score.items()
]
This is a fairly common structure for many functions and programs: first convert the input to an internal representation (in this case, split and convert to integer), then do the logic or calculation (in this case, uniqueness and maximum), then convert to the required output representation (in this case, string separated with ;).
In terms of efficiency, this code looks at each input item once, then at each output item once; there's unlikely to be any approach that can do better than that (certainly not formally, and likely not in practice). All of the per-item operations are constant-time and fast. It accumulates the intermediate answer in memory (in max_score), but again that is unavoidable; if memory is an issue, the input and output could be changed to iterators/generators, but the whole intermediate answer has to be accumulated in max_score before any items can be output.
I have recently stumbled upon a task utilizing some CSV files that are, to say the least, very poorly organized, with one cell containing what should be multiple separate columns. I would like to use this data in a Python script but want to know if it is possible to delete a portion of the row (all of it after a certain point) then write that to a dictionary.
Although I can't show the exact contents of the CSV, it looks like this:
useful. useless useless useless useless
I understand that this will most likely require either a regular expression or an endswith statement, but doing all of that to a CSV file is beyond me. Also, the period written after useful on the CSV should be removed as well, and is not a typo.
If you know the character you want to split on you can use this simple method:
good_data = bad_data.split(".")[0]
good_data = good_data.strip() # remove excess whitespace at start and end
This method will always work. split will return a tuple which will always have at least 1 entry (the full string). Using index may throw an exception.
You can also limit the # of splits that will happen if necessary using split(".", N).
https://docs.python.org/2/library/stdtypes.html#str.split
>>> "good.bad.ugly".split(".", 1)
['good', 'bad.ugly']
>>> "nothing bad".split(".")
['nothing bad']
>>> stuff = "useful useless"
>>> stuff = stuff[:stuff.index(".")]
ValueError: substring not found
Actual Answer
Ok then notice that you can use indexing for strings just like you do for lists. I.e. "this is a very long string but we only want the first 4 letters"[:4] gives "this". If we now new the index of the dot we could just get what you want like that. For exactly that strings have the index method. So in total you do:
stuff = "useful. useless useless useless useless"
stuff = stuff[:stuff.index(".")]
Now stuff is very useful :).
In case we are talking about a file containing multiple lines like that you could do it for each line. Split that line at , and put all in a dictionary.
data = {}
with open("./test.txt") as f:
for i, line in enumerate(f.read().split("\n")):
csv_line = line[:line.index(".")]
for j,col in enumerate(csv_line.split(",")):
data[(i,j)] = col
How one would do this
Notice that most people would not want to do it by hand. It is a common task to work on tabled data and there is a library called pandas for that. Maybe it would be a good idea to familiarise yourself a bit more with python before you dive into pandas though. I think a good point to start is this. Using pandas your task would look like this
import pandas as pd
pd.read_csv("./test.txt", comment=".")
giving you what is called a dataframe.
I have been working on Python for about 1.5yrs and looking for some direction. This is the first time I can't find what I need after doing a lot of searching and must be missing something- most likely searching the wrong terms.
Problem: I am working on an app that has many processes (Could be hundreds or even thousands). Each process may have a unique input and output data format - could be multiline strings, comma separated strings, excel or csv with or without varying headers and many others. I need something that will format the input correctly and handle the output based upon the process. New processes also need to be easily added/defined. I am open to whatever is the best approach, but my thoughts are to use a database that stores the template/data definition and use that to know the format given a process. However, I'm struggling to come up with exactly how, if this is really the best approach, but it needs to be a solution that is scalable. Any direction would be appreciated. Thank you.
A couple simple examples of data
Process 1 example data (multi line string with Header)
Input of
[ABC123, XYZ453, CDE987]
and the resulting data input below would be created:
Barcode
ABC123
XYZ453
CDE987
This code below works, but is not reusable for the example 2.
list = [ABC123, XYZ453, CDE987]
input = "Barcode /r/n"
for l in list:
input = input + l + '/r/n'
Process 2 example input template (comma separated with Header):
Barcode,Location,Param1,Param2
Item1,L1,11,A
Item1,L1,22,B
Item2,L1,33,C
Item2,L2,44,F
Item3,L2,55,B
Item3,L2,66,P
Process 2 example resulting input data (comma separated with Header):
Input of
{'Barcode':['ABC123', 'XYZ453', 'CDE987', 'FGH487', 'YTR123'], 'Location':['Shelf1', 'Shelf2']}
and using the template to create the input data below:
Barcode,Location,Param1,Param2
ABC123,Shelf1,11,A
ABC123,Shelf1,22,B
XYZ453,Shelf1,33,C
XYZ453,Shelf2,44,F
CDE987,Shelf2,55,B
CDE987,Shelf2,66,P
FGH487,Shelf1,11,A
FGH487,Shelf1,22,B
YTR123,Shelf1,33,C
YTR123,Shelf2,44,F
I know how to handle each process with hardcoded loop/dataframe merge, etc. Ive done some abstraction in other cases with dicts. However, how to define/store each format that vary so much and create reusable abstracted code is where I am stuck.
Maybe you can do the output of the functions as a tuple with the keys "datatype" and "output" for the actual output