Grouping data by value in first column - python

I'm trying to group data from a 2-column object based on the value of the first column. I need this data in a list so I can sort it afterwards. I am fetching interface data with SNMP on a large number of machines. In the example I have 2 interfaces, and I need the data grouped by interface, preferably in a list.
The data I get is in the object item:
for i in item:
    print i.oid, i.val
ifDescr lo
ifDescr eth0
ifAdminStatus 1
ifAdminStatus 1
ifOperStatus 1
ifOperStatus 0
I would like to get this data grouped into a list by the value in the first column, so it looks like this:
list=[[lo,1,1], [eth0,1,0]]
The solution I have is so dirty and long that I'm embarrassed to post it here, so any help is appreciated.
Here is my solution so you get a better picture of what I'm talking about. What I did is put each interface's data in a separate list based on item.oid, then iterated through the cpu list and compared it to memory and name based on item.iid. In the end all the data is in the cpu list, where each interface is an element of the list. This solution works, but it is too slow for my needs.
from operator import itemgetter

cpu = []
memory = []
name = []
for item in process:
    if item.oid == 'ifDescr':
        cpu.append([item.iid, int(item.val)])
    if item.oid == 'ifAdminStatus':
        memory.append([item.iid, int(item.val)])
    if item.oid == 'ifOperStatus':
        name.append([item.iid, item.val])
for c in cpu:
    for m in memory:
        if m[0] == c[0]:
            c.append(m[1])
    for n in name:
        if n[0] == c[0]:
            c.append(n[1])
cpu = sorted(cpu, key=itemgetter(1), reverse=True)  # sorting is easy
Is there a pythonic, short and faster way of doing this? The limiting factor is that I get the data in a 2-column object of key/value pairs.

I'm not sure I follow your sorting, as I don't see any order, but to group you can use a dict keyed by oid, using a defaultdict to handle the repeating keys:
data = """ifDescr lo
ifDescr eth0
ifAdminStatus 1
ifAdminStatus 1
ifOperStatus 1
ifOperStatus 0"""
from collections import defaultdict

d = defaultdict(list)
for line in data.splitlines():
    a, b = line.split()
    d[a].append(b)
print(d.items())
[('ifOperStatus', ['1', '0']), ('ifAdminStatus', ['1', '1']), ('ifDescr', ['lo', 'eth0'])]
Using your code, just use the attributes:
for i in item:
    d[i.oid].append(i.val)
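If what you actually want is one list per interface, as in your desired output, the same pattern keyed on the interface index works; a minimal sketch, assuming each item also exposes .iid (as in your own code) and that the OIDs arrive in a consistent order per interface:
from collections import defaultdict

by_iface = defaultdict(list)
for i in item:
    by_iface[i.iid].append(i.val)

result = list(by_iface.values())   # e.g. [['lo', '1', '1'], ['eth0', '1', '0']]
# the status values stay strings here; wrap them in int() when appending if you need numbers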

Pandas is a great way to work with data. Here is a quick code example; check out the official website for more info.
# Python script using pandas and NumPy
from pandas import DataFrame
from numpy import random

# Data, with the dictionary keys defining the columns
data_dictionary = {'a': random.random(5),
                   'b': random.random(5)}

# Make a data frame
data_frame = DataFrame(data_dictionary)
print(data_frame)

# Return a new data frame sorted by column 'a'
# (sort_index(by=...) is long deprecated; sort_values is the current method)
data_frame_sorted = data_frame.sort_values(by='a')
print(data_frame_sorted)
This should run if you have NumPy and pandas installed. If you don't have any clue about installing pandas, go get the Anaconda Python distribution.
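Applied to the SNMP example from the question, the same idea would look roughly like this; a minimal sketch, using the grouped per-OID lists from above and treating ifOperStatus as the sort key (the question does not actually name a key):
import pandas as pd

d = {'ifDescr': ['lo', 'eth0'],
     'ifAdminStatus': [1, 1],
     'ifOperStatus': [1, 0]}
df = pd.DataFrame(d)
df = df.sort_values(by='ifOperStatus', ascending=False)
print(df.values.tolist())   # [['lo', 1, 1], ['eth0', 1, 0]]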

Related

How to replace empty values with reference to another dataframe?

I have 2 data frames. One is a reference table with columns: code and name. The other one is a list of dictionaries; it has code filled in, but some names are empty strings. I was thinking of using two for loops to reach the dictionaries, but I am new to this and unsure how to look up the value in the reference table.
I started with something like this:
for i in sample:
    for j in i:
        if j['name'] == '':
            (j['code'])
I am unsure how to proceed with the code. I think there is a very simple way with the .map() function. Can someone help?
Reference table: (posted as an image in the original question)
Table that needs editing: (posted as an image in the original question)
It seems to me that in this particular case you're using Pandas only to work with Python data structures. If that's the case, it would make sense to ditch Pandas altogether and just use Python data structures - usually, it results in more idiomatic and readable code that often performs better than Pandas with dtype=object.
In any case, here's the code:
import pandas as pd

sample_name = pd.DataFrame(dict(code=[8, 1, 6],
                                name=['Human development',
                                      'Economic managemen',
                                      'Social protection and risk management']))
# We just need a Series.
sample_name = sample_name.set_index('code')['name']

sample = pd.Series([[dict(code=8, name='')],
                    [dict(code=1, name='')],
                    [dict(code=6, name='')]])

def fix_dict(d):
    if not d['name']:
        d['name'] = sample_name.at[d['code']]
    return d

def fix_dicts(dicts):
    return [fix_dict(d) for d in dicts]

result = sample.map(fix_dicts)
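For completeness, the plain-Python route suggested above might look like the following; a minimal sketch, assuming the reference table fits in a plain dict and the data is a list of lists of dicts (the literal values are copied from the example above):
# Lookup dict built from the reference table (code -> name)
reference = {8: 'Human development',
             1: 'Economic managemen',
             6: 'Social protection and risk management'}

sample = [[dict(code=8, name='')],
          [dict(code=1, name='')],
          [dict(code=6, name='')]]

for row in sample:
    for d in row:
        if not d['name']:
            d['name'] = reference[d['code']]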

Pandas - KeyError - Dropping rows by index in a nested loop

I have a Pandas DataFrame named df. I am attempting to use a nested for loop to iterate through each tuple of the dataframe and, at each iteration, compare the tuple with all other tuples in the frame. During the comparison step I am using Python's difflib.SequenceMatcher().ratio() and dropping tuples that have a high similarity (ratio > 0.8).
Problem:
Unfortunately, I am getting a KeyError after the first outer-loop iteration.
I suspect that by dropping the tuples I am invalidating the outer loop's indexer, or that I am invalidating the inner loop's indexer by attempting to access an element that doesn't exist (it was dropped).
Here is the code:
import json
import pandas as pd
import pyreadline
import pprint
from difflib import SequenceMatcher
# Note, this file, 'tweetsR.json', was originally csv, but has been translated to json.
with open("twitter data/tweetsR.json", "r") as read_file:
data = json.load(read_file) # Load the source data set, esport tweets.
df = pd.DataFrame(data) # Load data into a pandas(pd) data frame for pandas utilities.
df = df.drop_duplicates(['text'], keep='first') # Drop tweets with identical text content. Note,
these tweets are likely reposts/retweets, etc.
df = df.reset_index(drop=True) # Adjust the index to reflect dropping of duplicates.
def duplicates(df):
for ind in df.index:
a = df['text'][ind]
for indd in df.index:
if indd != 26747: # Trying to prevent an overstep keyError here
b = df['text'][indd+1]
if similar(a,b) >= 0.80:
df.drop((indd+1), inplace=True)
print(str(ind) + "Completed") # Debugging statement, tells us which iterations have completed
duplicates(df)
Error output (not reproduced here): a KeyError raised after the first outer-loop iteration.
Can anyone help me understand this and/or fix it?
One solution, mentioned by @KazuyaHatta, is itertools.combinations(). Although, the way I've used it (there may be another way), it is O(n^2), so in this case, with about 27,000 tuples, it is nearly 357,714,378 combinations to iterate over (too long).
Here is the code:
# Create a set of the dropped tuples and run this code on bizon overnight.
def duplicates(df):
# Find out how to improve the speed of this
excludes = set()
combos = itertools.combinations(df.index, 2)
for combo in combos:
if str(combo) not in excludes:
if similar(df['text'][combo[0]], df['text'][combo[1]]) > 0.8:
excludes.add(f'{combo[0]}, {combo[1]}')
excludes.add(f'{combo[1]}, {combo[0]}')
print("Dropped: " + str(combo))
print(len(excludes))
duplicates(df)
My next step, which @KazuyaHatta described, is to attempt the dropping-by-mask method.
Note: I unfortunately won't be able to post a sample of the dataset.
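For reference, here is a minimal sketch of the dropping-by-mask idea mentioned above: build a boolean keep mask in one pass and filter once at the end, instead of calling df.drop() inside the loop, so the index is never mutated mid-iteration (the comparisons are still O(n^2)). The similar() helper and the 'text' column are assumed from the question:
def drop_near_duplicates(df, threshold=0.8):
    keep = pd.Series(True, index=df.index)    # True = row survives
    texts = df['text'].tolist()
    labels = list(df.index)
    for i in range(len(labels)):
        if not keep[labels[i]]:
            continue
        for j in range(i + 1, len(labels)):
            if keep[labels[j]] and similar(texts[i], texts[j]) >= threshold:
                keep[labels[j]] = False       # mark the later near-duplicate
    return df[keep]                           # returns a new frame; df itself is untouched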

How to convert Multilevel Dictionary with Irregular Data to Desired Format

Dict = {'Things': {'Car': 'Lambo',
                   'Home': 'NatureVilla',
                   'Gadgets': {'Laptop': {'Programs': {'Data': 'Excel',
                                                       'Officework': 'Word',
                                                       'Coding': {'Python': 'PyCharm',
                                                                  'Java': 'Eclipse',
                                                                  'Others': 'SublimeText'},
                                                       'Wearables': 'SamsungGear',
                                                       'Smartphone': 'Nexus'},
                                          'clothes': 'ArmaaniSuit',
                                          'Bags': 'TravelBags'}}}}
d = {(i,j,k,l,m,n): Dict[i][j][k][l][m][n]
     for i in Dict.keys()
     for j in Dict[i].keys()
     for k in Dict[j].keys()
     for l in Dict[k].keys()
     for m in Dict[l].keys()
     for n in Dict[n].keys()
     }
mux = pd.MultiIndex.from_tuples(d.keys())
df = pd.DataFrame(list(d.values()), index=mux)
print(df)
What I have already done:
I tried to MultiIndex this irregular data using pandas, but I get a KeyError at 'Car'. Then I tried to handle the exception and PASS it, but that results in a SyntaxError, so maybe I have lost direction. If there is any other module or way to index this irregular data and put it in a table somehow, that would help; I have a chunk of raw data like this.
What I am trying to do:
I want to use this data for printing in a QTableView, which is from PyQt5 (I am making a program with a GUI).
Conditions:
This data keeps updating every hour from an API.
What I have thought of till now:
Maybe I can append all this data to MySQL. But when the data updates from the API, only the values will change and the keys will stay the same, so it will require more space.
References:
How to convert a 3-level dictionary to a desired format?
How to build a MultiIndex Pandas DataFrame from a nested dictionary with lists
Any Help will be appreciated. Thanks for reading the question.
Your data is not actually a 6-level dictionary like the dictionary in the 3-level example you referenced. The difference is that your dictionary has data on multiple different levels, e.g. the value 'Lambo' is on the second level of the hierarchy with key ('Things', 'Car'), but the value 'Eclipse' is on the sixth level with key ('Things', 'Gadgets', 'Laptop', 'Programs', 'Coding', 'Java').
If you want to 'flatten' your structure, you will need to decide what to do with the 'missing' key values at deeper levels for values like 'Lambo'.
By the way, maybe this is not actually a solution to your problem; perhaps you need a more appropriate UI widget, like a TreeView, to work with this kind of hierarchical data. But I will try to address your exact question directly.
Unfortunately, there seems to be no easy way to reference all the different-level values uniformly in one simple dict or list comprehension.
Just look at your 'value extractor' (Dict[i][j][k][l][m][n]): there are no values of i, j, k, l, m, n that allow you to get 'Lambo', because to get a Lambo you just need Dict['Things']['Car'] (ironically, in real life it can also be difficult to get a Lambo :-) ).
One straightforward way to solve your task is to extract the second-level data, extract the third-level data, and so on, and combine them together.
E.g. to extract the second-level values you can write something like this:
val_level2 = {(k1, k2): Dict[k1][k2]
              for k1 in Dict
              for k2 in Dict[k1]
              if isinstance(Dict[k1], dict) and
                 not isinstance(Dict[k1][k2], dict)}
but if you want to combine it later with the sixth-level values, you will need to add some padding to your key tuples:
val_level2 = {(k1, k2, '', '', '', ''): Dict[k1][k2]
              for k1 in Dict
              for k2 in Dict[k1]
              if isinstance(Dict[k1], dict) and
                 not isinstance(Dict[k1][k2], dict)}
Later you can combine it all together with something like:
d = {}
d.update(val_level2)
d.update(val_level3)
But usually the most organic way to work with hierarchical data is to use some recursion, like this:
def flatten_dict(d, key_prefix, max_deep):
    return [(tuple(key_prefix + [k] + [''] * (max_deep - len(key_prefix))), v)
            for k, v in d.items() if not isinstance(v, dict)] + \
        sum([flatten_dict(v, key_prefix + [k], max_deep)
             for k, v in d.items() if isinstance(v, dict)], [])
And later with code like this:
d={k:v for k,v in flatten_dict(Dict,[],5)}
mux = pd.MultiIndex.from_tuples(d.keys())
df = pd.DataFrame(list(d.values()), index=mux)
df.reset_index()
I actually get this result with your data (the output table is not reproduced here).
P.S. According to https://www.python.org/dev/peps/pep-0008/#prescriptive-naming-conventions, lowercase_with_underscores is preferred for variable names; CapWords is for classes. So src_dict would be much better than Dict in your case.
Your information looks a lot like JSON, and that is probably what the API is returning. If that's the case, and you are turning it into a dictionary, then you might be better off using Python's json library or even pandas' built-in read_json.
Pandas read json
Python's json
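A minimal sketch of that route, assuming the API hands back a JSON string shaped like the Dict from the question (pd.json_normalize needs pandas 0.25+; older versions expose it as pandas.io.json.json_normalize):
import json
import pandas as pd

api_response = json.dumps(Dict)           # stand-in for the raw API payload
data = json.loads(api_response)           # back to a nested Python dict
flat = pd.json_normalize(data, sep='.')   # one row; columns like 'Things.Car', 'Things.Gadgets.Laptop.clothes', ...
print(flat.T)                             # transpose to get one flattened key per row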

How to feed array of user_ids to flickr.people.getInfo()?

I have been working on extracting flickr users' location (not latitude and longitude but the person's country) by using their user_ids. I have made a dataframe (here's the dataframe) consisting of photo id, owner and a few other columns. My attempt was to feed each owner into a flickr.people.getInfo() query by iterating over the owner column in the dataframe. Here is my attempt:
for index, row in df.iterrows():
    A = np.array(df["owner"])
    for i in range(len(A)):
        B = flickr.people.getInfo(user_id=A[i])
Unfortunately, it yields only one result. After careful examination I found that it belongs to the last user in the dataframe. My dataframe has 250 observations, and I don't know how I could extract the others.
Any help is appreciated.
It seems like you forgot to store the results while iterating over the dataframe. I haven't used the API, but I think this snippet should do it.
result_dict = {}
for idx, owner in df['owner'].iteritems():
    result_dict[owner] = flickr.people.getInfo(user_id=owner)
The results are stored in a dictionary where the user id is the key.
EDIT:
Since the result is JSON, you can use the read_json function to parse it.
Example:
result_list = []
for idx, owner in df['owner'].iteritems():
    result_list.append(pd.read_json(json.dumps(flickr.people.getInfo(user_id=owner))))
    # you may have to set the orient parameter.
    # Options are: 'split', 'records', 'index'. Default is 'index'.
Note: I switched the dictionary to a list, since it is more convenient.
Afterwards you can concatenate the resulting pandas Series together like this:
df = pd.concat(result_list, axis=1).transpose()
I added the transpose() since you probably want the ID as the index.
Afterwards you should be able to sort by the column 'location'.
Hope that helps.
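Alternatively, the result_dict from the first snippet can be turned into a DataFrame in one step; a minimal sketch, assuming each getInfo() call returns a flat, dict-like record and that 'location' is one of its fields (both assumptions, since I haven't used the API either):
import pandas as pd

# result_dict maps user id -> info record, as built above
df_info = pd.DataFrame.from_dict(result_dict, orient='index')
df_info = df_info.sort_values(by='location')   # field name assumed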
The canonical way to achieve that is to use an apply. It will be much more efficient.
import pandas as pd
import numpy as np

np.random.seed(0)

# A function to simulate the call to the API
def get_user_info(id):
    return np.random.randint(id, id + 10)

# Some test data
df = pd.DataFrame({'id': [0, 1, 2], 'name': ['Pierre', 'Paul', 'Jacques']})

# Here the call is made for each ID
df['info'] = df['id'].apply(get_user_info)

#    id     name  info
# 0   0   Pierre     5
# 1   1     Paul     1
# 2   2  Jacques     5
Note, another way to write the same thing is
df['info'] = df['id'].map(lambda x: get_user_info(x))
Before calling the method, have the following lines first.
from flickrapi import FlickrAPI
flickr = FlickrAPI(FLICKR_KEY, FLICKR_SECRET, format='parsed-json')

Python For loop with data (csv)

I have this data:
http://prntscr.com/gojey0
Which keeps going on downward.
How do I find the top 20 most common platforms using Python code?
I'm really lost. I thought of maybe going through the list in a for loop and counting each one? That seems wrong, though...
Use pandas: http://pandas.pydata.org/
something like:
import pandas as pd
df = pd.read_csv("your_csv_file.csv")
top_platforms = df.nlargest(20, "Score")["Platform"]
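Note that nlargest(20, "Score") picks the rows with the 20 highest scores; if "most common" instead means the platforms that appear most often, value_counts is the usual pandas idiom. A minimal sketch, with the column name 'Platform' assumed from the other answers:
import pandas as pd

df = pd.read_csv("your_csv_file.csv")
top_platforms = df["Platform"].value_counts().head(20)   # platform -> number of rows
print(top_platforms)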
A dictionary would be a good choice for collecting this information (a concrete sketch follows these steps):
Initialize an empty dict.
For each row in the csv file:
Get the platform column.
If that platform is not already in the dict, create it with a count of one.
Otherwise if it is already in the dict, increment its count by one.
When you're done, sort the dict by the count value and print the top 20 entries.
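A minimal sketch of those steps using only the standard library; the file name and the 'Platform' column header are assumptions based on the question and the other answers:
import csv

counts = {}
with open('DATA.csv', newline='') as f:
    for row in csv.DictReader(f):
        platform = row['Platform']                     # column name assumed
        counts[platform] = counts.get(platform, 0) + 1

# Sort by count, descending, and keep the top 20
top_20 = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:20]
for platform, count in top_20:
    print(platform, count)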
I would use pandas to read in the csv file:
import pandas as pd
from collections import Counter

df = pd.read_csv('DATA.csv')  # read the csv file into a dataframe *df*

# create a Counter object containing a dictionary
# by invoking the pandas groupby and count methods
d = Counter(dict(df.groupby(['Platform'])['Platform'].count()))
d will be a Counter object "containing" a dictionary of the form {<platform>: <number of occurrences in the dataset>}
You can get the top k most common platforms as follows:
k = 20
d.most_common(k)
>>> [('<platform1>', count1),
('<platform2>', count2),
('<platform3>', count3),
('<platform4>', count4),
....
Hope that helps. In future it would be better to show the head (first few lines) of your data, or what code you have tried so far... or even which data-wrangling tool you're using!

Categories

Resources