I have been working on extracting Flickr users' locations (not latitude and longitude but the person's country) from their user_ids. I have made a dataframe consisting of photo id, owner and a few other columns. My plan was to feed each owner to a flickr.people.getInfo() query by iterating over the owner column of the dataframe. Here is my attempt:
for index, row in df.iterrows():
    A = np.array(df["owner"])
    for i in range(len(A)):
        B = flickr.people.getInfo(user_id=A[i])
Unfortunately, it returns only one result. After careful examination I've found that it belongs to the last user in the dataframe. My dataframe has 250 observations, and I don't know how to extract the others.
Any help is appreciated.
It seems like you forgot to store the results while iterating over the dataframe. I haven't used the API myself, but I think this snippet should do it.
result_dict = {}
for idx, owner in df['owner'].iteritems():
    result_dict[owner] = flickr.people.getInfo(user_id=owner)
The results are stored in a dictionary where the user id is the key.
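If you only need the country string, you can pull it out of each parsed response directly. A minimal sketch, assuming the parsed-json format where the location text sits under person -> location -> '_content' (the key may be missing for users who have not set a location):

locations = {}
for idx, owner in df['owner'].iteritems():
    info = flickr.people.getInfo(user_id=owner)
    # 'location' may be absent if the user has not filled it in
    person = info.get('person', {})
    locations[owner] = person.get('location', {}).get('_content', '')
df['location'] = df['owner'].map(locations)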
EDIT:
Since the result is JSON, you can use the read_json function to parse it.
Example:
import json

result_list = []
for idx, owner in df['owner'].iteritems():
    result_list.append(pd.read_json(json.dumps(flickr.people.getInfo(user_id=owner)), orient='index'))
    # you may have to set the orient parameter.
    # Options are: 'split', 'records', 'index'. Default is 'index'.
Note: I switched the dictionary to a list, since it is more convenient.
Afterwards you can concatenate the resulting pandas Series together like this:
df = pd.concat(result_list, axis=1).transpose()
I added the transpose() since you probably want the ID as the index.
Afterwards you should be able to sort by the column 'location'.
Hope that helps.
The canonical way to achieve that is to use an apply. It will be much more efficient.
import pandas as pd
import numpy as np
np.random.seed(0)
# A function to simulate the call to the API
def get_user_info(id):
    return np.random.randint(id, id + 10)
# Some test data
df = pd.DataFrame({'id': [0,1,2], 'name': ['Pierre', 'Paul', 'Jacques']})
# Here the call is made for each ID
df['info'] = df['id'].apply(get_user_info)
# id name info
# 0 0 Pierre 5
# 1 1 Paul 1
# 2 2 Jacques 5
Note, another way to write the same thing is
df['info'] = df['id'].map(lambda x: get_user_info(x))
Before calling the method, make sure the following lines run first:
from flickrapi import FlickrAPI
flickr = FlickrAPI(FLICKR_KEY, FLICKR_SECRET, format='parsed-json')
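Putting the pieces together, a minimal sketch of the apply version with the real call might look like this (assuming the owner column from the question and the parsed-json client above):

# one API call per owner; each cell holds the parsed JSON response
df['info'] = df['owner'].apply(lambda uid: flickr.people.getInfo(user_id=uid))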
Related
I am trying to build an elegant solution to assigning IDs starting from 0 for the following data:
My attempt at first creating IDs for the 'Person' column looks like this:
df = pd.DataFrame(
    {'Person': ['Tom Jones', 'Bill Smeegle', 'Silvia Geerea'],
     'PersonFriends': [['Bill Smeegle', 'Silvia Geerea'], ['Tom Jones'], ['Han Solo']]})
df['PersonID'] = (df['Person']).astype('category').cat.codes
which produces an integer PersonID for each person.
Now I want to follow the same process but do this for the 'PersonFriends' column to get this result below. How can I apply the same functions to achieve this when I have a list of friends?
I have been able to do this via the hash() function on each name, but the ID generated is long and not very readable. Any help appreciated. Thanks.
Create a dict mapping each person to their ID, then apply it to every list of friends:
id_map = dict(zip(df["Person"], df["PersonID"]))
df["FriendsID"] = df["PersonFriends"].apply(lambda x: [id_map.get(y) for y in x])
I have 2 data frames. One is a reference table with the columns code and name. The other one contains lists of dictionaries. The second data frame has code filled in but some names are empty strings. I am thinking of using 2 for loops to get to the dictionaries, but I am new to this and unsure how to look up the value from the reference table.
Started with something like this:
for i in sample:
    for j in i:
        if j['name'] == '':
            (j['code'])
I am unsure how to proceed with the code. I think there is a very simple way with the .map() function. Can someone help?
Reference table: (attached as an image in the original post)
Table after the needed edit: (attached as an image in the original post)
It seems to me that in this particular case you're using Pandas only to work with Python data structures. If that's the case, it would make sense to ditch Pandas altogether and just use Python data structures - usually, it results in more idiomatic and readable code that often performs better than Pandas with dtype=object.
In any case, here's the code:
import pandas as pd

sample_name = pd.DataFrame(dict(code=[8, 1, 6],
                                name=['Human development',
                                      'Economic managemen',
                                      'Social protection and risk management']))

# We just need a Series.
sample_name = sample_name.set_index('code')['name']

sample = pd.Series([[dict(code=8, name='')],
                    [dict(code=1, name='')],
                    [dict(code=6, name='')]])

def fix_dict(d):
    if not d['name']:
        d['name'] = sample_name.at[d['code']]
    return d

def fix_dicts(dicts):
    return [fix_dict(d) for d in dicts]

result = sample.map(fix_dicts)
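A quick check of the first entry shows the empty name filled in from the reference Series:

print(result[0])
# [{'code': 8, 'name': 'Human development'}]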
I have a csv file "qwi_ak_se_fa_gc_ns_op_u.csv" which contains a lot of observations of 80 variables. One of them is geography which is the county. Every county belongs to something called a Commuting Zone (CZ). Using a matching table given in "czmatcher.csv" I can assign a CZ to every county given in geography.
The code below shows my approach. It simply goes through every row and finds its CZ by scanning the whole "czmatcher.csv" until it finds the right match, then copies the value using .loc. The problem is that this took over 10 hours to run on a 0.5 GB file (2.5 million rows), which isn't that much, and my intuition says this should be faster.
This picture illustrates what the csv files look like. The idea is to construct the "Wanted result (CZ)" column, name it CZ and add it to the dataframe.
File example (attached as an image)
import pandas as pd

data = pd.read_csv("qwi_ak_se_fa_gc_ns_op_u.csv")
czm = pd.read_csv("czmatcher.csv")

sLength = len(data['geography'])
data['CZ'] = 0

# this is just to fill the first value
for j in range(0, len(czm)):
    if data.loc[0, 'geography'] == czm.loc[j, 'FIPS']:
        data.loc[0, 'CZ'] = czm.loc[j, 'CZID']

# now fill the rest
for i in range(1, sLength):
    if data.loc[i, 'geography'] == data.loc[i-1, 'geography']:
        data.loc[i, 'CZ'] = data.loc[i-1, 'CZ']
    else:
        for j in range(0, len(czm)):
            if data.loc[i, 'geography'] == czm.loc[j, 'FIPS']:
                data.loc[i, 'CZ'] = czm.loc[j, 'CZID']
Is there a faster way of doing this?
The best way to do this is a left merge of your dataframes:
data = pd.read_csv("qwi_ak_se_fa_gc_ns_op_u.csv")
czm = pd.read_csv("czmatcher.csv")
I assume the key column is named the same way in both dataframes:
data_final = data.merge(czm, how='left', on='geography')
If it isn't (here it is 'geography' in one frame and 'FIPS' in the other), you can either rename one of the columns,
czm.rename(columns={'FIPS': 'geography'}, inplace=True)
or merge on the differently named keys directly:
data_final = data.merge(czm, how='left', left_on='geography', right_on='FIPS')
Read the docs for further information: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
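For this particular question the key columns are 'geography' and 'FIPS', so a minimal end-to-end sketch (assuming 'CZID' is the only column you need from the matcher) could look like this:

import pandas as pd

data = pd.read_csv("qwi_ak_se_fa_gc_ns_op_u.csv")
czm = pd.read_csv("czmatcher.csv")

# one vectorised lookup instead of the nested loops
data_final = data.merge(czm[['FIPS', 'CZID']], how='left',
                        left_on='geography', right_on='FIPS')
# keep the zone ID under the column name used in the question
data_final = data_final.rename(columns={'CZID': 'CZ'}).drop(columns='FIPS')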
In order to make it faster without reworking your whole solution, I would recommend using Dask DataFrames. To put it simply, Dask reads your csv in chunks and processes them in parallel. After reading the csv you can use the .compute() method to get a pandas df instead of a Dask df.
This will look like this:
import pandas as pd
import dask.dataframe as dd  # IMPORT DASK DATAFRAMES

# YOU NEED TO USE dd.read_csv instead of pd.read_csv
data = dd.read_csv("qwi_ak_se_fa_gc_ns_op_u.csv")
data = data.compute()
czm = dd.read_csv("czmatcher.csv")
czm = czm.compute()

sLength = len(data['geography'])
data['CZ'] = 0

# this is just to fill the first value
for j in range(0, len(czm)):
    if data.loc[0, 'geography'] == czm.loc[j, 'FIPS']:
        data.loc[0, 'CZ'] = czm.loc[j, 'CZID']

# now fill the rest
for i in range(1, sLength):
    if data.loc[i, 'geography'] == data.loc[i-1, 'geography']:
        data.loc[i, 'CZ'] = data.loc[i-1, 'CZ']
    else:
        for j in range(0, len(czm)):
            if data.loc[i, 'geography'] == czm.loc[j, 'FIPS']:
                data.loc[i, 'CZ'] = czm.loc[j, 'CZID']
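If you want Dask to do more of the work, a hedged alternative sketch is to keep the frames lazy and express the lookup as a merge (same 'geography'/'FIPS' column assumption as in the other answer), only calling .compute() at the very end:

import dask.dataframe as dd

data = dd.read_csv("qwi_ak_se_fa_gc_ns_op_u.csv")
czm = dd.read_csv("czmatcher.csv")

# the merge is built lazily and executed in parallel by compute()
merged = data.merge(czm, how='left', left_on='geography', right_on='FIPS')
result = merged.compute()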
I have a function that returns a dataframe to my main. I am trying to store these dataframes in a dictionary, in order to retrieve them again later.
When I run this:
sa_wp5 = get_SA_WP5_value('testfile.txt')
template_dict["SAWP5Country Name"] = sa_wp5
my output looks like the following:
{'SAWP5Country Name': 1 2
0 Australia 047}
where I would rather the output just be the variable itself containing the dataframe.
What am I doing wrong here?
Nothing wrong here. It is just a matter of formatting due to the default __str__() output of a DataFrame object. If the output looks messy, try printing your dict this way:
for key, df in template_dict.items():
    print("%s:" % key)
    print(df.to_string())
    print("-------")
You can use Bunch to store all sorts of objects for easy retrieval.
from sklearn.datasets.base import Bunch
Then create a variable using the Bunch() method:
a = Bunch(df1 = df_template.copy(), df2 = df_other_df.copy())
Then you can simply call them as such:
a.df1
a.df1['col1']
df = a.df2
etc.
It's really effective for storage of objects.
I'm trying to group data from a 2 column object based on the value of the first column. I need this data in a list so I can sort it afterwards. I am fetching interface data with SNMP on a large number of machines. In the example I have 2 interfaces. I need the data grouped by interface, preferably in a list.
The data I get is in the object item:
for i in item:
    print i.oid, i.val
ifDescr lo
ifDescr eth0
ifAdminStatus 1
ifAdminStatus 1
ifOperStatus 1
ifOperStatus 0
I would like to get this data grouped by the value in the first column, in a list that looks like this:
list=[[lo,1,1], [eth0,1,0]]
The solution I have is oh so dirty and long and I'm embarrassed to post it here, so any help is appreciated.
Here is my solution so you get a better picture of what I'm talking about. What I did is put each interface's data in a separate list based on item.oid, then iterated through the cpu list and compared it to memory and name based on item.iid. In the end I have all the data in the cpu list, where each interface is an element of the list. This solution works, but it is too slow for my needs.
from operator import itemgetter

cpu = []
memory = []
name = []

for item in process:
    if item.oid == 'ifDescr':
        cpu.append([item.iid, int(item.val)])
    if item.oid == 'ifAdminStatus':
        memory.append([item.iid, int(item.val)])
    if item.oid == 'ifOperStatus':
        name.append([item.iid, item.val])

for c in cpu:
    for m in memory:
        if m[0] == c[0]:
            c.append(m[1])
    for n in name:
        if n[0] == c[0]:
            c.append(n[1])

cpu = sorted(cpu, key=itemgetter(1), reverse=True)  # sorting is easy
Is there a Pythonic, short and faster way of doing this? The limiting factor is that I get the data in a 2 column object with key=data values.
Not sure I follow your sorting, as I don't see any particular order, but to group you can use a dict keyed by oid, with a defaultdict to handle the repeating keys:
data = """ifDescr lo
ifDescr eth0
ifAdminStatus 1
ifAdminStatus 1
ifOperStatus 1
ifOperStatus 0"""
from collections import defaultdict

d = defaultdict(list)
for line in data.splitlines():
    a, b = line.split()
    d[a].append(b)

print(d.items())
[('ifOperStatus', ['1', '0']), ('ifAdminStatus', ['1', '1']), ('ifDescr', ['lo', 'eth0'])]
Using your actual objects, just use the attributes:
for i in item:
    d[i.oid].append(i.val)
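If you still want the [[lo, 1, 1], [eth0, 1, 0]] shape from your question, one hedged way to build it (assuming every oid list comes back in the same interface order) is to zip the grouped lists together:

rows = [[desc, int(admin), int(oper)]
        for desc, admin, oper in zip(d['ifDescr'],
                                     d['ifAdminStatus'],
                                     d['ifOperStatus'])]
# [['lo', 1, 1], ['eth0', 1, 0]]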
Pandas is a great way to work with data. Here is a quick example code. Check out the official website for more info.
# Python script using Pandas and Numpy
from pandas import DataFrame
from numpy import random

# Data with the dictionary keys defining the columns
data_dictionary = {'a': random.random(5),
                   'b': random.random(5)}

# Make a data frame
data_frame = DataFrame(data_dictionary)
print(data_frame)

# Return a new data frame sorted by the values in column 'a'
data_frame_sorted = data_frame.sort_values(by='a')
print(data_frame_sorted)
This should run if you have numpy and pandas installed. If you don't have any clue about installing pandas, go get the Anaconda Python distribution.
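Tied back to the interface data from the question, a small hedged sketch (the lists below are stand-ins for values you would collect per oid, for example with the defaultdict approach shown above):

from pandas import DataFrame

# hypothetical grouped values, one entry per interface
descr = ['lo', 'eth0']
admin = [1, 1]
oper = [1, 0]

iface_frame = DataFrame({'ifDescr': descr,
                         'ifAdminStatus': admin,
                         'ifOperStatus': oper})
# sort by admin status, highest first
print(iface_frame.sort_values(by='ifAdminStatus', ascending=False))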