Removing Indices from a list in Python

I am trying to get an output such as this:
169.764569892, 572870.0, 19.6976
However, I have a problem: the files that I am inputting have a format similar to the output I just showed, but some lines in the data have 'nan' as a value, which I need to remove.
I am trying to use this to do so:
TData_Pre_Out = map(itemgetter(0, 7, 8), HDU_DATA)
TData_Pre_Filter = [Data for Data in TData_Pre_Out if Data != 'nan']
Here I am trying to use a list comprehension to get the 'nan' to go away, but the output still displays it. Any help on properly filtering this would be much appreciated.
EDIT: The improper output looks like this:
169.519361471, nan, nan
instead of what I showed above. Also, some more info:
1) This is coming from a special data file, not a text file, so splitting lines won't work.
2) The input is exactly the same as the output, just mapped using the map() line that I show above and split into the indices I actually need (i.e. instead of using all of a data list like L = [(1,2,3),(3,4,5)], I only pull 1 and 3 from that list, to give you the gist of the data structure).
The data is read in like so:
with pyfits.open(allfiles) as HDU:
    HDU_DATA = HDU[1].data
The syntax is from a specialized program but you get the idea

TData_Pre_Out = map(itemgetter(0, 7, 8), HDU_DATA)
This statement gives you a list of tuples (an iterator of tuples in Python 3). You then compare each whole tuple with a string, so every != comparison succeeds and nothing is filtered out.

Without seeing how you read in your data, the solution can only be guessed.
However, if HDU_DATA stores real NaN values, try the following.
Comparing a variable to NaN does not work with the equality operator ==:
foo == nan
always gives False, even when nan and foo are both NaNs.
Use math.isnan() instead:
import math
...if math.isnan(Data)…
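As a minimal sketch of that check applied to the tuples produced by itemgetter, assuming the fields are real float NaNs. The rows list below is a made-up stand-in for HDU_DATA, not the actual pyfits structure.

```python
import math
from operator import itemgetter

# Hypothetical stand-in for HDU_DATA: each row has at least 9 fields.
rows = [
    (169.764569892, 0, 0, 0, 0, 0, 0, 572870.0, 19.6976),
    (169.519361471, 0, 0, 0, 0, 0, 0, float("nan"), float("nan")),
]

nan = float("nan")
nan_equals_itself = (nan == nan)  # False: NaN never compares equal, even to itself

# Pull fields 0, 7 and 8 from every row; each result is a tuple.
picked = [itemgetter(0, 7, 8)(row) for row in rows]

# Keep only the tuples in which no field is NaN.
filtered = [t for t in picked if not any(math.isnan(v) for v in t)]

for t in filtered:
    print(", ".join(str(v) for v in t))
```

The filter drops a whole tuple as soon as one of its fields is NaN, which matches the goal of never printing a line like "169.519361471, nan, nan".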

Based on my understanding of your description, this could work
with open('path/to/file') as infile:
    for line in infile:
        # strip each field so ' nan' (with a leading space) is also caught
        vals = [v.strip() for v in line.split(',')]
        print([v for v in vals if v != 'nan'])

Related

Getting length of set inside pandas column

I have a set with strings inside a column in a Pandas DataFrame:
x
A {'string1, string2, string3'}
B {'string4, string5, string6'}
I need to get the length of each set and ideally create a new column with the results
x x_length
A {'string1, string2, string3'} 3
B {'string4, string5'} 2
I don't know why, but everything I have tried so far always returns the length of the set as 1.
Here's what I've tried:
df['x_length'] = df['x'].str.len()
df['x_length'] = df['x'].apply(lambda x: len(x))
Custom function from another post:
def to_1D(series):
    return pd.Series([len(x) for _list in series for x in _list])

to_1D(df['x'])
This function returns the number of characters in the whole set, not the length of the set.
I've even tried to convert the set to a list and tried the same functions, but still got the wrong results.
I feel like I'm very close to the answer, but I can't seem to figure it out.
I don't know why, but everything I have tried so far always returns the length of the set as 1.
{'string1, string2, string3'} and {'string4, string5, string6'} are sets holding a single str each (delimited by '), rather than sets with 3 strs each (which would be {'string1', 'string2', 'string3'} and {'string4', 'string5', 'string6'} respectively). So there is a problem somewhere earlier that produces sets with a single element rather than several. After you find and eliminate that problem, your functions should start working as intended.
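To make the difference concrete, here is a small plain-Python sketch. The split_members helper is a hypothetical repair for data that is already in the single-string form; it is not something from the question.

```python
# One element: a single string that happens to contain commas.
single = {'string1, string2, string3'}
# Three elements: three separate strings.
proper = {'string1', 'string2', 'string3'}


def split_members(s):
    """Rebuild a set by splitting each member on commas (hypothetical repair)."""
    return {part.strip() for member in s for part in member.split(',')}


repaired = split_members(single)
lengths = (len(single), len(proper), len(repaired))
print(lengths)
```

If the column really holds such single-string sets, something like df['x'].apply(split_members).apply(len) might give the expected lengths, though fixing the step that builds the sets is probably the cleaner route.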

Extracting data out of Complex JSON

This is the JSON data I am using
In this dataset, I want to extract the value of "yaw" and store it in camera_loc. My first attempt for the code was as follows:
with open(
    "/home/siddhant/catkin_ws/src/view-adaptation/multi_sensors.json"
) as sensors:
    multi_sensors = json.load(sensors)
camera_loc = [
    multi_sensors["sensors"][0]["yaw"],
    multi_sensors["sensors"][1]["yaw"],
    multi_sensors["sensors"][2]["yaw"],
    multi_sensors["sensors"][3]["yaw"],
    multi_sensors["sensors"][4]["yaw"],
    multi_sensors["sensors"][5]["yaw"],
]
This gives me the expected result.
But I want to generalize the same for any number of entries in the 'sensors' array.
I tried executing a 'for' loop and extracting the values for the same as follows:
for i in multi_sensors["sensors"]:
    camera_loc = []
    camera_loc.append(i["yaw"])
However, this method only gives a single value in the camera_loc list which is the last 'yaw' value from the JSON file. I am looking for a better approach or even any modifications to the way I execute the loop so that I can extract all the values of 'yaw' from the JSON file - in this example there are 6 entries but I want to generalize it for 'n' entries that may be created in other cases.
Thank you!
That's because you're defining your camera_loc array inside the loop, meaning every iteration resets it to []. This code should work if you remove the array definition from the loop:
camera_loc = []
for i in multi_sensors["sensors"]:
    camera_loc.append(i["yaw"])
This is also a perfect use case for a List Comprehension:
camera_loc = [i["yaw"] for i in multi_sensors["sensors"]]
Both approaches produce the same list, so you may choose whichever you like most.
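Since the JSON file itself is not shown, here is a self-contained sketch that uses an inline string instead. The field names other than "sensors" and "yaw", and all the values, are invented for illustration.

```python
import json

# Hypothetical stand-in for the contents of multi_sensors.json.
raw = (
    '{"sensors": ['
    '{"id": "cam0", "yaw": 0.0}, '
    '{"id": "cam1", "yaw": 60.0}, '
    '{"id": "cam2", "yaw": -60.0}'
    ']}'
)

multi_sensors = json.loads(raw)

# One pass over the array, whatever its length.
camera_loc = [sensor["yaw"] for sensor in multi_sensors["sensors"]]
print(camera_loc)
```

The comprehension does not care whether the array has 3, 6 or n entries, which is the generalization asked for.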

Pyspark tuple object has no attribute split

I am struggling with a Pyspark assignment. I am required to get the sum of all the viewing numbers per channel. I have 2 sets of files: one showing the shows and the views per show, the other showing the shows and which channels they are shown on (a show can be on multiple channels).
I have performed a join operation on the 2 files and the result looks like this:
[(u'Surreal_News', (u'BAT', u'11')),
(u'Hourly_Sports', (u'CNO', u'79')),
(u'Hourly_Sports', (u'CNO', u'3')),
I now need to extract the channel as the key and then I think do a reduceByKey to get the sum of views for the channels.
I have written this function to extract the chan as the key with the views alongside, so that I can then use reduceByKey to sum the results. However, when I try to display the results of the function below with collect(), I get an "AttributeError: 'tuple' object has no attribute 'split'" error:
def extract_chan_views(show_chan_views):
    key_value = show_chan_views.split(",")
    chan_views = key_value[1].split(",")
    chan = chan_views[0]
    views = int(chan_views[1])
    return (chan,views)
Since this is an assignment, I'll try to explain what's going on rather than just doing the answer. Hopefully that will be more helpful!
This actually isn't anything to do with pySpark; it's just a plain Python issue. Like the error is saying, you're trying to split a tuple, when split is a string operation. Instead access them by index. The object you're passing in:
[(u'Surreal_News', (u'BAT', u'11')),
(u'Hourly_Sports', (u'CNO', u'79')),
(u'Hourly_Sports', (u'CNO', u'3')),
is a list of tuples, where the first index is a unicode string and the second is another tuple. You can split them apart like this (I'll annotate each step with comments):
for item in your_list:
    # item = (u'Surreal_News', (u'BAT', u'11')) on iteration one
    first_index, second_index = item  # this will unpack the two indices
    # now:
    # first_index = u'Surreal_News'
    # second_index = (u'BAT', u'11')
    first_sub_index, second_sub_index = second_index  # unpack again
    # now:
    # first_sub_index = u'BAT'
    # second_sub_index = u'11'
Note that you never had to split on commas anywhere. Also note that the u'11' is a string, not an integer in your data. It can be converted, as long as you're sure it's never malformed, with int(u'11'). Or if you prefer specifying indices to unpacking, you can do the same thing:
first_index, second_index = item
is equivalent to:
first_index = item[0]
second_index = item[1]
Also note that this gets more complicated if you are unsure what form the data will take - that is, if sometimes the objects have two items in them, other times three. In that case unpacking and indexing in a generalized way for a loop require a bit more thought.
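Putting the unpacking together, one possible shape for the extraction function is sketched below. It runs on a plain Python list rather than an RDD, and the summing loop is only a local stand-in for what reduceByKey would do; the variable name joined is an assumption.

```python
from collections import defaultdict

# The joined records from the question, as a plain list for illustration.
joined = [
    (u'Surreal_News', (u'BAT', u'11')),
    (u'Hourly_Sports', (u'CNO', u'79')),
    (u'Hourly_Sports', (u'CNO', u'3')),
]


def extract_chan_views(show_chan_views):
    # Unpack (show, (chan, views)) directly; no splitting on commas needed.
    _show, (chan, views) = show_chan_views
    return (chan, int(views))  # views arrives as a string such as u'11'


# Local stand-in for reduceByKey: add up the views for each channel.
# On an RDD (hypothetically named joined_rdd) the equivalent would be
# joined_rdd.map(extract_chan_views).reduceByKey(lambda a, b: a + b)
totals = defaultdict(int)
for chan, views in map(extract_chan_views, joined):
    totals[chan] += views

print(dict(totals))
```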
I am not exactly resolving your code, but I faced the same error when I applied a join transformation to two datasets.
Let's say A and B are two RDDs.
c = A.join(B)
c is still an RDD, but each of its elements is a tuple of the form (key, (value_from_A, value_from_B)), not a string, so split(",") cannot be applied to an element. Access the parts of the tuple by index instead.
If D is such a tuple:
E = D[1]  # instead of E = D.split(",")[1]

Can I condense these values with a loop?

I have a set of values that get modified like so:
iskDropped = irrelevant1
iskDestroyed = irrelevant2
iskTotal = irrelevant3
iskDropped = condense_value(int(iskDropped[:-7].replace(',','')))
iskDestroyed = condense_value(int(iskDestroyed[:-7].replace(',','')))
iskTotal = condense_value(int(iskTotal[:-7].replace(',','')))
As you can see, all three lines go through the same changes (shortened, commas removed, and condensed) before overwriting their original value.
I want to condense those three lines if possible because it feels inefficient.
I was trying something like this:
for value in [iskDropped,iskDestroyed,iskTotal]:
    value = condense_value(int(value[:-7].replace(',','')))
If you change the assignment into a print statement, it does print the correct values, but it does not overwrite/update the variables (iskDropped, iskDestroyed, and iskTotal) that I need later in the program.
Is it possible to condense these lines in Python? If so can someone point me in the right direction?
You can do it like this:
iskDropped, iskDestroyed, iskTotal = [condense_value(int(value[:-7].replace(',',''))) for value in [iskDropped, iskDestroyed, iskTotal]]
This works by looping through the list of your 3 variables, applying condense_value to each to build a list of the results, and finally unpacking that list back into the original variable names.
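The loop in the question fails because assigning to the loop variable only rebinds that name; it never touches the list or the original variables. The sketch below shows this, plus a dictionary-based alternative. condense_value and the sample strings are hypothetical stand-ins, since neither is shown in the question.

```python
def condense_value(number):
    # Hypothetical stand-in for the real condense_value: express in thousands.
    return number // 1000


def clean(raw):
    # Drop the last 7 characters, remove commas, convert, then condense.
    return condense_value(int(raw[:-7].replace(',', '')))


# Rebinding the loop variable leaves the source list untouched.
values = ["12,345,678.00 ISK"]
for value in values:
    value = clean(value)
unchanged = values[0]

# Alternative: keep the related values in one dict and rebuild it in one step.
isk = {
    "dropped": "12,345,678.00 ISK",
    "destroyed": "1,000.00 ISK",
    "total": "12,346,678.00 ISK",
}
isk = {name: clean(raw) for name, raw in isk.items()}
print(isk)
```

A dict trades three separate variable names for lookups like isk["total"], which may or may not suit the rest of the program.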

Python: How can I find the differences between two lists of strings?

I'm using Python 3. I have two lists of strings and I'm looking for mismatches between the two. The code I have works for smaller lists but not the larger lists I'm writing it for.
Input from the non-working lists is in the following format:
mmec11.mmegifffa.mme.epc.mnc980.mcc310.3gppnetwork.org
mmec13.mmegifffa.mme.epc.mnc980.mcc310.3gppnetwork.org
mmec12.mmegifffa.mme.epc.mnc980.mcc310.3gppnetwork.org
mmec14.mmegifffa.mme.epc.mnc980.mcc310.3gppnetwork.org
My function to compare two lists of data in the above format is:
result = []
for x in mmeList1:
    if x not in mmeList2:
        result.append(x)
return result
The problem is that it's not working: I get an output file of both lists combined into one long list. When I added a test to print "Hi" every time a match was made, nothing happened. Does anyone have any idea where I'm going wrong? I work for a telecommunications company and we're trying to go through large database dumps to find missing MMEs.
I'm wondering if maybe my input function is broken? The function is:
for line in input:
    field = line.split()
    tempMME = field[0]
    result.append(tempMME)
I'm not very experienced with this stuff and I'm wondering if the line.split() function is messing up due to the periods in the MME names?
Thank you for any and all help!
If you don't need to preserve ordering, the following will give all MMEs that exist in mmeList2 but not in mmeList1.
result = list(set(mmeList2) - set(mmeList1))
I tested your compare function and it's working fine, assuming that the data in mmeList1 and mmeList2 is correct.
For example, I ran a test of your compare function using the following data.
mmeList1:
mmec11.mmegifffa.mme.epc.mnc980.mcc310.3gppnetwork.org
mmec13.mmegifffa.mme.epc.mnc980.mcc310.3gppnetwork.org
mmec12.mmegifffa.mme.epc.mnc980.mcc310.3gppnetwork.org
mmec14.mmegifffa.mme.epc.mnc980.mcc310.3gppnetwork.org
mmeList2:
mmec11.mmegifffa.mme.epc.mnc980.mcc310.3gppnetwork.org
mmec13.mmegifffa.mme.epc.mnc980.mcc310.3gppnetwork.org
mmec12.mmegifffa.mme.epc.mnc980.mcc310.3gppnetwork.org
mmec15.mmegifffa.mme.epc.mnc980.mcc310.3gppnetwork.org
Result contained:
mmec14.mmegifffa.mme.epc.mnc980.mcc310.3gppnetwork.org
I suspect the problem is that mmeList1 and mmeList2 don't contain what you think they contain. Unfortunately, we can't help you more without seeing how mmeList1 and mmeList2 are populated.
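One thing that can be ruled out: line.split() with no arguments splits on whitespace only, so the periods in the MME names are left alone. The sketch below checks this on made-up input; read_mmes and the sample text are hypothetical, and the blank-line guard is an addition that the original loop does not have.

```python
import io


def read_mmes(lines):
    """Return the first whitespace-separated field of each non-blank line."""
    result = []
    for line in lines:
        fields = line.split()  # splits on whitespace, never on '.'
        if fields:  # a blank line gives [], and fields[0] would raise IndexError
            result.append(fields[0])
    return result


# Hypothetical input: one line with a trailing column, one blank line.
sample = io.StringIO(
    "mmec11.mmegifffa.mme.epc.mnc980.mcc310.3gppnetwork.org  extra\n"
    "\n"
    "mmec13.mmegifffa.mme.epc.mnc980.mcc310.3gppnetwork.org\n"
)

names = read_mmes(sample)
for name in names:
    print(repr(name))  # repr() makes stray spaces or other hidden characters visible
```

Printing a few entries from each list with repr() is a quick way to see whether the two inputs differ in case, stray characters, or the column being read.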
If you want to see the differences in both directions (i.e. the result should contain mmec14 AND mmec15), then what you want to use is sets.
For example:
mmeSet1 = set(mmeList1)
mmeSet2 = set(mmeList2)
print(mmeSet1.symmetric_difference(mmeSet2))
will result in a set such as (element order may vary):
{'mmec14.mmegifffa.mme.epc.mnc980.mcc310.3gppnetwork.org', 'mmec15.mmegifffa.mme.epc.mnc980.mcc310.3gppnetwork.org'}
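For comparison, the three set operations side by side, using shortened made-up names in place of the full MME hostnames. Wrapping each result in sorted() gives a stable order for writing to a file.

```python
# Shortened, hypothetical names standing in for the full MME hostnames.
mmeList1 = ["mmec11", "mmec13", "mmec12", "mmec14"]
mmeList2 = ["mmec11", "mmec13", "mmec12", "mmec15"]

set1, set2 = set(mmeList1), set(mmeList2)

only_in_1 = sorted(set1 - set2)   # in list 1 but missing from list 2
only_in_2 = sorted(set2 - set1)   # in list 2 but missing from list 1
in_either = sorted(set1 ^ set2)   # symmetric difference: in exactly one list

print(only_in_1, only_in_2, in_either)
```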
First, calling set() on each list removes duplicates, which cuts down the number of iterations. Try this:
result = []
a = list(set(mmeList1))
b = list(set(mmeList2))
for x in a:
    if x not in b:
        result.append(x)
return result
