Removing integer values from an alphanumeric column in python - python

I am new to Python and struggling with a seemingly trivial task. I have an alphanumeric column called region. It contains entries beginning with /, such as /health/blood pressure, as well as integer values. A few typical observations look like this:
/health/blood pressure
/health/diabetes
7867
/fitness
9087
/health/type1 diabetes
Now I want to remove all the rows with integer values. After importing the data set into the Python shell, region shows up with dtype object. I intended to solve this with a regular expression, so I did the following:
pattern='/'
data.region=Series(data.region)
matches=data.region.str.match(pattern)
matches
This gives a boolean Series indicating whether each entry matches the pattern. So I get something like this:
0 true
1 false
2 true
3 true
.........
so on.
Now I am stuck on how to remove the rows for which matches is false. An if statement is not working. If anyone can offer some assistance, that would be great!
Thanks!!

It seems like you are using the pandas framework, which I don't know well, so I am not completely sure this works for your case.
You can try:
import re
matches = [i for i in data.region if re.match(pattern, str(i))]
In Python this is called a list comprehension: it goes through every entry in data.region, checks your pattern, and puts the entry in the list if the pattern matches (that is, if the expression after 'if' is true). The individual entries are plain values rather than Series, so they have no .str accessor; re.match on str(i) does the check instead.
See: https://docs.python.org/2/tutorial/datastructures.html#list-comprehensions
If you want to map those for every region, you can try to create a dictionary that maps the regions to the lists with the following dict comprehension:
matches = {region: [i for i in data.region if re.match(pattern, str(i))] for region in data}
See: https://docs.python.org/2/tutorial/datastructures.html#dictionaries
However, you are definitely leaving the realm of the pandas framework. This could fail if a region is not an integer/string but a list itself, because dictionary keys must be hashable (as I said, I don't know pandas well enough to judge).
In that case you could try:
matches = {}
for region in list_of_regions:
    matches[region] = [i for i in data.region if re.match(pattern, str(i))]
which is basically the same, just with a given list of regions and the dict comprehension made explicit in a for loop.
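For the original goal of dropping the rows where the match is false, staying inside pandas is probably simpler. The following is a minimal sketch, assuming the column holds a mix of strings and integers as in the question; the sample frame is made up from the question's example values, and the cast to str is an assumption to make integer entries evaluate to False instead of NaN.

```python
import pandas as pd

# Sample data modelled on the question; the real frame comes from the asker's import.
data = pd.DataFrame({
    "region": ["/health/blood pressure", "/health/diabetes", 7867,
               "/fitness", 9087, "/health/type1 diabetes"]
})

# Cast to str first so integer entries give False rather than NaN.
matches = data["region"].astype(str).str.match("/")

# Boolean indexing keeps only the rows where the mask is True.
filtered = data[matches].reset_index(drop=True)
print(filtered)
```

The boolean Series is used directly as a row selector, so no if statement or explicit loop is needed.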

Related

Can I amend one data sheet to match another data frame's ID that are almost similar?

I have multiple data frames to compare. My problem is the product IDs. one is set up like:
000-000-000-000
Vs
000-000-000
(gross)
I have looked here, on Reddit, and on YouTube, and went deep down the rabbit hole trying .join, .append, and other methods I had never seen before or don't understand yet. Is there a way (or, even better, some documentation I can read) to pull the Product ID from the main Excel sheet and compare it to the one(s) that should match? I will then more than likely normalize the IDs in place across all sheets, so that I can use them as the index and do a side-by-side comparison of the row data for each ID. Each ID has about 113 values to compare: 113 columns for each row, if that makes sense.
Example: (colorful columns is main sheet that the non colored column will be compared to)
additional notes:
The IDs highlighted in yellow are "unique". I won't be changing those; instead I will write them to a list or something similar and use an if statement to ignore them when found.
Edit:
So I wrote this code, which does almost exactly what I need.
It takes out the "-", which I apply to all my IDs. I just need to make a list of the unique IDs to skip over when trimming the extra digits.
dfSS["Product ID"] = dfSS["Product ID"].str.replace("-", "")
Then this will only list the digits up to 9 digits, except the unique IDs
dfSS["Product ID"] = dfSS["Product ID"].str[:9]
I will add the full code below once I get it to work 100%.
I am now trying to figure out how to say something like
lst =[1,2,3,4,5]
if dfSS["Product ID"] not in lst:
dfSS["Product ID"] = dfSS["Product ID"].str.replace("-", "").str[:9]
This code does not work, but every day I get closer to being able to compare these similar yet different data frames. The lst is just a stand-in for a list of 000-000-000 Product IDs that I do not want to modify at all but want to keep in the data frame.
If the ID transformation is predictable, then one option is to use regex for homogenizing IDs. For example if the situation is just removing the first three digits, then something like the following can be used:
df['short_id'] = df['long_id'].str.extract(r'\d\d\d-([\d-]*)')
If the ID transformation is not so predictable (e.g. due to transcription errors or some other noise in the data) then the best option is to first disambiguate the ID transformation using something like recordlinkage, see the example here.
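The question's "skip the unique IDs" step can be expressed with a boolean mask rather than an if statement, since an if on a whole column is ambiguous in pandas. This is only a sketch under stated assumptions: the sample IDs and the keep_as_is list are hypothetical, and the transformation is the asker's own (strip dashes, keep the first 9 characters).

```python
import pandas as pd

# Hypothetical sample IDs; the real ones come from the asker's sheets.
dfSS = pd.DataFrame({"Product ID": ["123-456-789-012", "000-111-222", "999-888-777-666"]})

# Hypothetical list of "unique" IDs that must be left untouched.
keep_as_is = ["999-888-777-666"]

# True for the rows that must be skipped.
skip = dfSS["Product ID"].isin(keep_as_is)

# Transform only the rows that are not in the skip list.
dfSS.loc[~skip, "Product ID"] = (
    dfSS.loc[~skip, "Product ID"].str.replace("-", "", regex=False).str[:9]
)
print(dfSS)
```

Series.isin builds the mask in one step, and .loc applies the change only where the mask allows it.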
OK, I solved this for every Product ID, with or without dashes, #, letters, etc.:
(\d\d\d-)?[_#\d-]?[a-zA-Z]?
(\d\d\d-)? - An optional group of three digits followed by a dash. The trailing ? here is a quantifier meaning zero or one occurrence; it is not a non-greedy marker.
[_#\d-]? - An optional single character that is an underscore, #, digit, or dash (again zero or one occurrence).
[a-zA-Z]? - An optional single letter. I am not sure why, but I had to separate this from the previous part because it wouldn't pick up every letter otherwise.
With the above I solved everything I needed for RE.
Where I learned how to improve my RE skills:
RE Documentation
Automate the Boring Stuff- Ch 7
You can test your REs here
Additional way to show this. Put this here to show there is no one way of doing it. RE is super awesome:
(\d{3}-)?[_#\d{3}-]?[a-zA-Z]?

Delete a Portion of a CSV Cell in Python

I have recently stumbled upon a task utilizing some CSV files that are, to say the least, very poorly organized, with one cell containing what should be multiple separate columns. I would like to use this data in a Python script but want to know if it is possible to delete a portion of the row (all of it after a certain point) then write that to a dictionary.
Although I can't show the exact contents of the CSV, it looks like this:
useful. useless useless useless useless
I understand that this will most likely require either a regular expression or an endswith statement, but doing all of that to a CSV file is beyond me. Also, the period written after useful on the CSV should be removed as well, and is not a typo.
If you know the character you want to split on you can use this simple method:
good_data = bad_data.split(".")[0]
good_data = good_data.strip() # remove excess whitespace at start and end
This method will always work. split returns a list that always has at least one entry (the full string, when the separator is absent). Using index, by contrast, raises a ValueError if the substring is not found.
You can also limit the number of splits if necessary using split(".", N).
https://docs.python.org/2/library/stdtypes.html#str.split
>>> "good.bad.ugly".split(".", 1)
['good', 'bad.ugly']
>>> "nothing bad".split(".")
['nothing bad']
>>> stuff = "useful useless"
>>> stuff = stuff[:stuff.index(".")]
ValueError: substring not found
Actual Answer
OK, notice that you can slice strings just like you slice lists. For example, "this is a very long string but we only want the first 4 letters"[:4] gives "this". If we knew the index of the dot, we could get what you want the same way. Strings have the index method for exactly that. So in total you do:
stuff = "useful. useless useless useless useless"
stuff = stuff[:stuff.index(".")]
Now stuff is very useful :).
If we are talking about a file containing multiple lines like that, you could do it for each line: cut the line at the dot, split what remains at each comma, and put everything in a dictionary keyed by (row, column).
data = {}
with open("./test.txt") as f:
    for i, line in enumerate(f):
        if "." not in line:
            continue  # skip blank or malformed lines; index() would raise ValueError
        csv_line = line[:line.index(".")]
        for j, col in enumerate(csv_line.split(",")):
            data[(i, j)] = col
How one would do this
Notice that most people would not want to do this by hand. Working on tabular data is a common task, and there is a library called pandas for that. It may be a good idea to familiarise yourself a bit more with Python before you dive into pandas, though. I think a good point to start is this. Using pandas, your task would look like this:
import pandas as pd
df = pd.read_csv("./test.txt", comment=".", header=None)  # header=None: treat the first line as data, not column names
giving you what is called a dataframe.
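To see what the comment parameter does without touching a real file, here is a small self-contained sketch. The two sample lines are invented, and header=None is an assumption that the file has no header row; everything from the first "." to the end of each line is discarded by the parser.

```python
import io
import pandas as pd

# Invented sample content standing in for the real CSV file.
raw = "a,b. useless useless\nc,d. more useless text\n"

# comment="." tells the parser to ignore the rest of a line after the first "."
df = pd.read_csv(io.StringIO(raw), comment=".", header=None)

# One plain dictionary per row, keyed by column position.
rows = df.to_dict(orient="records")
print(rows)
```

This only works when the useful part never contains a "." itself, for example in decimal numbers.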

Can I group graphite results by regex?

I've been using Graphite for some time now to power our Python backend. As part of that, I need to sum different metrics (using sumSeries) selected with wildcards.
Thing is, I need to group them according to a pattern; say I have the following range of metric names:
group.*.item.*
I need to sum the values of all items, for a given group (meaning: group.1.item.*, group.2.item.*, etc)
Unfortunately, I do not know the set of existing group values in advance, so what I do right now is query metrics/index.json, parse the list, and generate the desired query (manually creating sumSeries(group.NUMBER.item.*) for every NUMBER I find in the metrics index).
I was wondering if there was a way to have graphite do this for me, and save the first roundtrip, as the communication and pre-processing are costly (taking more than half the time of the entire process)
Thanks in advance!
If you want a separate line for each group you could use the groupByNode function.
groupByNode(group.*.item.*, 1, "sumSeries")
Here '1' is the node you're grouping by (nodes are zero-indexed, so this is the part after "group") and "sumSeries" is the function each group is fed into.
You can read more about this here: http://graphite.readthedocs.io/en/latest/functions.html#graphite.render.functions.groupByNode
If you want to restrict the second node to numeric values only, you can use a character range. You do this by specifying the range in square brackets [...]. A character range is indicated by 2 characters separated by a dash (-). Note that a range matches a single character, so the pattern below only covers single-digit group names.
group.[0-9].item.*
You can read more about this here:
http://graphite.readthedocs.io/en/latest/render_api.html#paths-and-wildcards
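Since the asker queries Graphite from Python, the single-request version might be sketched as follows. The host name is hypothetical, and the code only builds the render URL (using the target, from, and format parameters of the render API) rather than sending it.

```python
from urllib.parse import urlencode

GRAPHITE_URL = "http://graphite.example.com"  # hypothetical host, replace with your own

def grouped_sum_url(base_url, pattern="group.*.item.*", node=1, time_from="-1h"):
    """Build a render URL that sums the series per value of the given node."""
    target = 'groupByNode({}, {}, "sumSeries")'.format(pattern, node)
    query = urlencode({"target": target, "from": time_from, "format": "json"})
    return "{}/render?{}".format(base_url, query)

url = grouped_sum_url(GRAPHITE_URL)
print(url)
```

Fetching this URL should return one summed series per group in a single round trip, which removes the need to read metrics/index.json first.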

How to find and replace 6 digit numbers within HREF links from map of values across site files, ideally using SED/Python

I need to create a BASH script, ideally using SED, to find and replace lists of values in href URL constructs within HTML site files, looking them up in a map (old to new values). There are around 25K site files to look through, and the map has around 6,000 entries that I have to search through.
All old and new values have 6 digits.
The URL construct is:
One value:
HREF=".*jsp\?.*N=[0-9]{1,}.*"
List of values:
HREF=".*\.jsp\?.*N=[0-9]{1,}+N=[0-9]{1,}+N=[0-9]{1,}...*"
The list of values are delimited by + PLUS symbol, and the list can be 1 to n values in length.
I want to ignore a construct such as this:
HREF=".*\.jsp\?.*N=0.*"
IE the list is only N=0
Effectively I'm only interested in URL's that include one or more values that are in the file map, that are not prepended with CHANGED -- IE the list requires updating.
PLEASE NOTE: in the above construct examples, .* means any character that isn't a digit. I'm only interested in the 6 digit values in the list of values after N=, so I've been trying to isolate the N= list from the rest of the URL construct. It should be noted that this N= list can appear anywhere within the URL construct.
Initially, I want to create a script that produces a report of all links that fulfil the above criteria and contain a 6 digit OLD value that's in the map file, together with the file path, to get an understanding of the links impacted. EG:
Filename link
filea.jsp /jsp/search/results.jsp?N=204200+731&Ntx=mode+matchallpartial&Ntk=gensearch&Ntt=
filea.jsp /jsp/search/BROWSE.jsp?Ntx=mode+matchallpartial&N=213890+217867+731&
fileb.jsp /jsp/search/results.jsp?N=0+450+207827+213767&Ntx=mode+matchallpartial&Ntk=gensearch&Ntt=
Lastly, I'd like to find and replace all 6 digit numbers, within the URL construct lists, as outlined above, as efficiently as possible (I'd like it to be reasonably fast as there could be around 25K files, with 6K values to look up, with potentially multiple values in the list).
**PLEASE NOTE:** There is an additional issue I have, when finding and replacing, is that an old value could have been assigned a new value, that's already been used, that may also have to be replaced.
E.G. If the map file is as below:
MAP-FILE.txt
OLD NEW
214865 218494
214866 217854
214867 214868
214868 218633
... ...
and there is a HREF link such as:
/jsp/search/results.jsp?Ntx=mode+matchallpartial&Ntk=gensearch&N=0+450+214867+214868
214867 changes to 214868 - this would need to be prepended to flag that this value has been changed, and should not be replaced, otherwise what was 214867 would become 218633 as all 214868 would be changed to 218633. Hope this makes sense - I would then need to run through file and remove all 6 digit numbers that had been marked with the prepended flag, such that link would become:
/jsp/search/results.jsp?Ntx=mode+matchallpartial&Ntk=gensearch&N=0+450+214868CHANGED+218633CHANGED
Unless there's a better way to manage these infile changes.
Could someone please help me with this? I'm not an expert with these kinds of changes, so help would be massively appreciated.
Many thanks in advance,
Alex
I will write the outline of the code in a kind of pseudocode, since I don't remember Python well enough to write it quickly.
First find what type the link is (if it contains N=0 then type 3, if it contains "+" then type 2, else type 1) and get a list of the strings in the "N=..." part by exploding (the name of the PHP function) it on the "+" sign.
The first loop is over the links. The second loop is over each N= number. The third loop looks in the map file and finds the replacement value. Load the data of the map file into a variable before all the loops, because reading a file is slow compared with looking a value up in memory.
You replace the value in the third loop, then implode (PHP function) the list of new strings into a new link when returning to the first loop.
You probably have several files containing links, in which case you need another loop over the files.
When dealing with repeated codes you need a while loop that runs until a spare number is found, and you need to save the numbers that are already used in a list.
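The outline above could be sketched in Python roughly as follows. This is not the answer's exact loop structure: it swaps in a dictionary lookup and a single regular-expression pass per link, so every number is replaced at most once and a new value such as 214868 is never replaced again, which avoids the need for a CHANGED flag. The map contents are copied from the question's example; the function and variable names are made up for illustration.

```python
import re

# Old-to-new map, loaded once; values copied from the question's MAP-FILE.txt example.
ID_MAP = {
    "214865": "218494",
    "214866": "217854",
    "214867": "214868",
    "214868": "218633",
}

# "N=" (not part of a longer parameter name) followed by a '+'-separated list of numbers.
N_LIST = re.compile(r"(?<![A-Za-z0-9])N=(\d+(?:\+\d+)*)")

def remap_link(link, id_map):
    """Replace every mapped value in the N= list of a link, in a single pass."""
    def fix_list(match):
        values = match.group(1).split("+")            # "explode" on "+"
        new_values = [id_map.get(v, v) for v in values]
        return "N=" + "+".join(new_values)            # "implode" back into a list
    return N_LIST.sub(fix_list, link)

old = "/jsp/search/results.jsp?Ntx=mode+matchallpartial&Ntk=gensearch&N=0+450+214867+214868"
new = remap_link(old, ID_MAP)
print(new)
```

Values that are not in the map, such as 0 and 450, pass through unchanged, so links containing only N=0 are left as they are. Applying remap_link to each href found in each file would be the remaining outer loops.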

Need to understand function in python

def process_filter_description(filter, images, ial):
    '''Return a new list containing only items from list images that pass
    the description filter (a str). ial is the related image association list.
    Matching is done in a case insensitive manner.
    '''
    images = []
    for items in ial:
Those are the only two lines of code I have so far. What is troubling me is the filter in the function. I really don't know what the filter is supposed to do or how to use it.
In no way am I asking for the full code. I just want help with what the filter is supposed to do and how I can use it.
Like I said in my comment, this is really vague. But I'll try to explain a little about the concept of a filter in python, specifically the filter() function.
The prototype of filter is: iterable <- filter(function, iterable).
iterable is something that can be iterated over. You can look up this term in the docs for a more exact explanation, but for your question, just know that a list is iterable.
function is a function that accepts a single element of the iterable you specify (in this case, an element of the list) and returns a boolean specifying whether the element should exist in the iterable that is returned. If the function returns True, the element will appear in the returned list, if False, it will not.
Here's a short example, showing how you can use the filter() function to filter out all even numbers (which I should point out, is the same as "filtering in" all odd numbers)
def is_odd(i): return i % 2
l = [1, 2, 3, 4, 5]  # This is a list
fl = list(filter(is_odd, l))  # list() is needed on Python 3, where filter() returns an iterator
print(fl)  # This will display [1, 3, 5]
You should convince yourself that is_odd works first. It will return 1 (=True) for odd numbers and 0 (=False) for even numbers.
In practice, you usually use a lambda function instead of defining a single-use top-level function, but you shouldn't worry about that, as this is just fine.
But anyway, you should be able to do something similar to accomplish your goal.
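For comparison, the lambda form mentioned above, together with the equivalent list comprehension, might look like this. It is only a small illustration using the same sample list; the variable names are made up.

```python
numbers = [1, 2, 3, 4, 5]

# Same filter as is_odd, written inline with a lambda.
odd_with_lambda = list(filter(lambda i: i % 2, numbers))

# A list comprehension expresses the same selection without filter().
odd_with_comprehension = [i for i in numbers if i % 2]

print(odd_with_lambda, odd_with_comprehension)
```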
Well it says in the description line:
Return a new list containing only items from list images that pass the description filter (a str)
...
Matching is done in a case insensitive manner
So I'm guessing the filter is just a string. Do you have any kind of text associated with the images, such as a description or name that could be matched against the filter string?
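If that guess is right, the function could be sketched as below. This rests on a loud assumption: the question never shows what ial looks like, so the sketch treats it as a list of (image, description) pairs. The parameter is renamed filter_text here only to avoid shadowing the built-in filter(); the sample data is invented.

```python
def process_filter_description(filter_text, images, ial):
    """Return the items of images whose description contains filter_text.

    ASSUMPTION: ial is a list of (image, description) pairs; the real
    structure is not shown in the question.
    """
    needle = filter_text.lower()  # lower-case both sides for case-insensitive matching
    matching = {image for image, description in ial
                if needle in description.lower()}
    return [image for image in images if image in matching]

# Invented sample data for illustration.
ial = [("cat.png", "A sleepy Cat"), ("dog.png", "Dog in the park"), ("car.png", "Red car")]
result = process_filter_description("CA", ["cat.png", "dog.png", "car.png"], ial)
print(result)
```

Unlike the asker's draft, this version does not reset images to an empty list, because the docstring says the result should be built from the images that were passed in.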
