I'm new to this forum, so please excuse me if the question format isn't great.
I'm trying to fetch rows from a MySQL database table and print them after processing the columns (one of the columns contains JSON which needs to be expanded). Below are the source data and expected output. It would be great if someone could suggest an easier way to manage this data.
Note: I have achieved this with lots of looping and parsing, but the challenges are:
1) There is no connection between the column names and the data, so when I print the data I don't know the order of the values in the result set, and the column titles I print end up mismatched with the data. Is there a way to keep them in sync?
2) I would like the flexibility to change the order of the columns without much rework.
What is the best possible way to achieve this? I have not explored the pandas library, as I was not sure it was really necessary.
Using Python 3.6.
Sample Data in the table
id, student_name, personal_details, university
1, Sam, {"age":"25","DOL":"2015","Address":{"country":"Poland","city":"Warsaw"},"DegreeStatus":"Granted"},UAW
2, Michael, {"age":"24","DOL":"2016","Address":{"country":"Poland","city":"Toruń"},"DegreeStatus":"Granted"},NCU
I'm querying the database using MySQLdb.connect object, steps below
query = "select * from student_details"
cur.execute(query)
res = cur.fetchall() # get a collection of tuples
db_fields = [z[0] for z in cur.description] # generate list of col_names
Data in variables:
>>>db_fields
['id', 'student_name', 'personal_details', 'university']
>>>res
((1, 'Sam', '{"age":"25","DOL":"2015","Address":{"country":"Poland","city":"Warsaw"},"DegreeStatus":"Granted"}','UAW'),
(2, 'Michael', '{"age":"24","DOL":"2016","Address":{"country":"Poland","city":"Toruń"},"DegreeStatus":"Granted"}','NCU'))
Desired Output:
id, student_name, age, DOL, country, city, DegreeStatus, University
1, 'Sam', 25, 2015, 'Poland', 'Warsaw', 'Granted', 'UAW'
2, 'Michael', 24, 2016, 'Poland', 'Toruń', 'Granted', 'NCU'
A not-too-pythonic way, but easy to understand (and maybe you can write a more pythonic solution), might be:
import json

def unwrap_dict(_input):
    res = dict()
    for k, v in _input.items():
        # Assuming you know there's only one nested level
        if isinstance(v, dict):
            for _k, _v in v.items():
                res[_k] = _v
            continue
        res[k] = v
    return res

all_data = list()
for row in res:  # res is the tuple of rows returned by cur.fetchall()
    record = dict()
    for field, data in zip(db_fields, row):
        # Assuming you know personal_details is the only JSON column
        if field == 'personal_details':
            data = json.loads(data)
        if isinstance(data, dict):
            extra = unwrap_dict(data)
            record.update(extra)
            continue
        record[field] = data
    all_data.append(record)
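Once all_data is built, printing is just a matter of choosing a field order, which also addresses your second challenge: reordering the columns means editing a single list. A minimal sketch (mine, assuming any missing key should print blank):

field_order = ['id', 'student_name', 'age', 'DOL', 'country', 'city',
               'DegreeStatus', 'university']
print(', '.join(field_order))
for record in all_data:
    print(', '.join(str(record.get(field, '')) for field in field_order))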
I'm using the Google Sheets API to get data, which I then pass to Pandas so I can work with it easily.
Let's say I want to get a sheet with the following data (depicted as a JSON object, since tables aren't rendered well here):
{
columns: ['Name', 'Age', 'Tlf.', 'Address'],
data: ['Julie', '35', '12345', '8 Leafy Street']
}
The sheets API will return something along the lines of this:
{
'range': 'Cases!A1:AE999',
'majorDimension': 'ROWS',
'values':
[
['Name', 'Age', 'Tlf.', 'Address'],
['Julie', '35', '12345', '8 Leafy Street']
]
}
This is great and allows me to easily pass the column headings and data to Pandas without much fuss. I do this in the following manner:
values = sheets_api_result["values"]
df = pd.DataFrame(values[1:], columns=values[0])
My Problem
If I have a Gsuite Sheet that looks like the below table, depicted as a key:value data type
{
columns: ['Name', 'Age', 'Tlf.', 'Address'],
data: ['Julie', '35', '', '']
}
I will receive the following response
{
'range': 'Cases!A1:AE999',
'majorDimension': 'ROWS',
'values':
[
['Name', 'Age', 'Tlf.', 'Address'],
['Julie', '35']
]
}
Note that the lengths of the two arrays are not equal, and that instead of None or null values being returned, the data is simply not present in the response.
When working with this data in my code, I end up with an error that looks like this
ValueError: 4 columns passed, passed data had 2 columns
So as far as I can tell I have two options:
Come up with a clever way to pad my response where necessary with None
If possible, instruct the API to return a null value in the JSON where null values exist, especially when the last column(s) have no data at all.
With regards to point 1, I think I can append x None values to the list, where x equals length_of_column_heading_array - length_of_data_array. This does, however, seem ugly, and perhaps there is a more elegant way of doing it.
And with regards to point 2, I haven't managed to find an answer that helps me.
If anyone has any ideas on how I can solve this, I'd be very grateful.
Cheers!
If anyone is interested, here is how I solved the issue.
First, we need to get all the data from the Sheets API.
# define the names of the tabs I want to get
ranges = ['tab1', 'tab2']
# Call the Sheets API
request = service.spreadsheets().values().batchGet(spreadsheetId=document, ranges=ranges,)
response = request.execute()
Now I want to go through every row and ensure that each row's list contains the same number of elements as the first row, which contains the column headings.
# response is the response from the Google Sheets API,
# from the code above. It contains the column headings
# and data from every row.
# valueRanges is the key to access the data.
def extract_case_data(response, keyword):
    for obj in response["valueRanges"]:
        if keyword in obj["range"]:
            values = pad_data(obj["values"])
            df = pd.DataFrame(values[1:], columns=values[0])
            return df
    return None
And finally, the method to pad the data
def pad_data(data: list):
    # build a new list seeded with the column heading row;
    # this is the list which we will return
    return_data = [data[0]]
    for row in data[1:]:
        difference = len(data[0]) - len(row)
        new_row = row
        # append None to the lists which have a shorter
        # length than the column heading list
        for count in range(1, difference + 1):
            new_row.append(None)
        return_data.append(new_row)
    return return_data
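As a quick sanity check (my example, using the truncated sample values from the question):

values = [['Name', 'Age', 'Tlf.', 'Address'],
          ['Julie', '35']]
print(pad_data(values))
# [['Name', 'Age', 'Tlf.', 'Address'], ['Julie', '35', None, None]]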
I'm certainly not saying that this is the best or most elegant solution, but it has done the trick for me.
Hope this helps someone.
Same idea, but with a maybe simpler look:
Get raw values
result = service.spreadsheets().values().get(spreadsheetId=spreadsheet_id, range=data_range).execute()
raw_values = result.get('values', [])
Then complete each row while iterating. Note that row = row + [...] would only rebind the loop variable without touching raw_values, so extend each row in place (expected_length is the length of the header row):
expected_length = len(raw_values[0])
for row in raw_values:
    row += [''] * (expected_length - len(row))
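If you would rather build a padded copy than mutate the rows in place, a list comprehension variant (mine, not part of the answer above) does the same job:

padded = [row + [''] * (expected_length - len(row)) for row in raw_values]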
I'm reading data from a file into a series of lists as follows:
sourceData = [[source, topic, score],[source, topic, score],[source, topic, score]...]
wherein the sources and topics in each list may be the same or different.
What I am trying to achieve is a dictionary which groups the topics associated with each source, and their associated scores (the scores will then be averaged, but for the purpose of this question let's just list them as values of the topic (key)).
The result would ideally look like a list of nested dicts, where each topic maps to its list of scores:
[{SOURCE1: {TOPIC_A: [SCORE1, SCORE2, SCORE3],
            TOPIC_B: [SCORE1, SCORE2, SCORE3],
            TOPIC_C: [SCORE1, SCORE2, SCORE3]}},
 {SOURCE2: {TOPIC_A: [SCORE1, SCORE2, SCORE3],
            TOPIC_B: [SCORE1, SCORE2, SCORE3],
            TOPIC_C: [SCORE1, SCORE2, SCORE3]}}...]
I think the best way to do this would be to create a Counter of the sources, and then a dict for each topics per source, and save each dict as a value for each corresponding source. However I am having trouble iterating properly to get the desired result.
Here's what I have so far:
sourceDict = {}
sourceDictList = []
sourceList = []
for row in sourceData:
    source = row[0]
    topic = row[1]
    score = row[2]
    sourceDict = [source, {topic: score}]
    sourceDictList.append(sourceDict)
    sourceList.append(source)
wherein sourceDictList results in the following: [[source, {topic: score}]...] (essentially reformatting the data from the original list of lists), and sourceList is just a list of all the sources (some repeating).
Then I initialize a counter and match the source from the counter with the source from sourceDictList; if they match, I save the topic:score dict as the value:
from collections import Counter

sourceCounter = Counter(sourceList)
for key, val in sourceCounter.items():
    for dictitem in sourceDictList:
        if dictitem[0] == key:
            sourceCounter[key] = dictitem[1]
But the output is only saving the last topic:score dict to each source. So instead of the desired:
[{SOURCE1: {TOPIC_A: [SCORE1, SCORE2, SCORE3],
            TOPIC_B: [SCORE1, SCORE2, SCORE3],
            TOPIC_C: [SCORE1, SCORE2, SCORE3]}},
 {SOURCE2: {TOPIC_A: [SCORE1, SCORE2, SCORE3],
            TOPIC_B: [SCORE1, SCORE2, SCORE3],
            TOPIC_C: [SCORE1, SCORE2, SCORE3]}}...]
I am only getting:
Counter({SOURCE1: {TOPIC_n: 'SCORE_n'}, SOURCE2: {TOPIC_n: 'SCORE_n'}, SOURCE3: {TOPIC_n: 'SCORE_n'}})
I am under the impression that if there is a unique key saved to a dict, it will append that key:value pair without overwriting previous ones. Am I missing something?
Appreciate any help on this.
We can simply do:
sourceData = [
    ['source1', 'topic1', 'score1'],
    ['source1', 'topic2', 'score1'],
    ['source1', 'topic1', 'score2'],
    ['source2', 'topic1', 'score1'],
    ['source2', 'topic2', 'score2'],
    ['source2', 'topic1', 'score3'],
]

sourceDict = {}
for row in sourceData:
    source = row[0]
    topic = row[1]
    score = row[2]
    if source not in sourceDict:
        # This will be executed when the source
        # comes for the first time.
        sourceDict[source] = {}
    if topic not in sourceDict[source]:
        # This will be executed when the topic
        # inside that source comes for the first time.
        sourceDict[source][topic] = []
    sourceDict[source][topic].append(score)
print(sourceDict)
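Running this on the sourceData above prints:

{'source1': {'topic1': ['score1', 'score2'], 'topic2': ['score1']},
 'source2': {'topic1': ['score1', 'score3'], 'topic2': ['score2']}}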
You can simply use collections.defaultdict:
from collections import defaultdict

sourdata = [['source', 'topic', 2], ['source', 'topic', 3],
            ['source', 'topic2', 3], ['source2', 'topic', 4]]

sourceDict = defaultdict(dict)
for source, topic, score in sourdata:
    topicScoreDict = sourceDict[source]
    topicScoreDict[topic] = topicScoreDict.get(topic, []) + [score]
>>> print(sourceDict)
defaultdict(<class 'dict'>, {'source': {'topic': [2, 3], 'topic2': [3]}, 'source2': {'topic': [4]}})
>>> print(dict(sourceDict))
{'source': {'topic': [2, 3], 'topic2': [3]}, 'source2': {'topic': [4]}}
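A small variation (mine, not part of the answer above): since each inner value is a list, a nested defaultdict reduces the loop body to a single append:

from collections import defaultdict

sourceDict = defaultdict(lambda: defaultdict(list))
for source, topic, score in sourdata:
    sourceDict[source][topic].append(score)

Note the inner values are then defaultdicts themselves; wrap them with dict() if you need plain dicts for printing.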
I'm trying to create a dictionary of dictionaries like this:
food = {"Broccoli": {"Taste": "Bad", "Smell": "Bad"},
"Strawberry": {"Taste": "Good", "Smell": "Good"}}
But I am populating it from an SQL table. So I've pulled the SQL table into an SQL object called "result". And then I got the column names like this:
nutCol = [i[0] for i in result.description]
The table has about 40 characteristics, so it is quite long.
I can do this...
foodList = {}
for id, food in enumerate(result):
    addMe = {str(food[1]): {nutCol[id + 2]: food[2], nutCol[id + 3]:
                            food[3] ...}}
    foodList.update(addMe)
But this of course would look horrible and take a while to write. And I'm still working out how I want to build this whole thing so it's possible I'll need to change it a few times...which could get extremely tedious.
Is there a DRY way of doing this?
To make the solution position independent, you can make use of dict1.update(dict2), which simply merges dict2 into dict1.
In our case, since we have a dict of dicts, we can use dict['key'] as dict1 and simply add any additional key:value pair as dict2.
Here is an example.
food = {"Broccoli": {"Taste": "Bad", "Smell": "Bad"},
"Strawberry": {"Taste": "Good", "Smell": "Good"}}
addthis = {'foo':'bar'}
Suppose you want to add the addthis dict to food['Strawberry']; we can simply use:
food["Strawberry"].update(addthis)
Getting result:
>>> food
{'Strawberry': {'Taste': 'Good', 'foo': 'bar', 'Smell': 'Good'},'Broccoli': {'Taste': 'Bad', 'Smell': 'Bad'}}
Assuming that column 0 is what you wish to use as your key, and you do wish to build a dictionary of dictionaries, then it's:
detail_names = [col[0] for col in result.description[1:]]
foodList = {row[0]: dict(zip(detail_names, row[1:]))
            for row in result}
Generalising, if column k is your identity then it's:
foodList = {row[k]: {col[0]: row[i]
                     for i, col in enumerate(result.description) if i != k}
            for row in result}
(Here each sub dictionary is all columns other than column k)
addMe = {str(food[1]):dict(zip(nutCol[2:],food[2:]))}
zip will take two (or more) lists of items and pair the elements, then you can pass the result to dict to turn the pairs into a dictionary.
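To see what the zip-to-dict step does in isolation (a toy example):

>>> dict(zip(['Taste', 'Smell'], ['Good', 'Good']))
{'Taste': 'Good', 'Smell': 'Good'}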
I currently have a question about Python pandas: I want to filter a DataFrame dynamically using a URL query string.
For example:
CSV:
URL: http://example.com/filter?Name=Sam&Age=21&Gender=male
Hardcoded:
filtered_data = data[
    (data['Name'] == 'Sam') &
    (data['Age'] == 21) &
    (data['Gender'] == 'male')
]
I don't want to hard-code the filter keys as above, because the CSV file can change at any time with different column headers.
Any suggestions?
The easiest way to create this filter dynamically is probably to use np.all.
For example:
import numpy as np
query = {'Name': 'Sam', 'Age': 21, 'Gender': 'male'}
filters = [data[k] == v for k, v in query.items()]
filter_data = data[np.all(filters, axis=0)]
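If the query dict has to be built from the URL itself, the standard library's urllib.parse can produce it (my addition, not part of the answer above):

from urllib.parse import urlparse, parse_qs

url = 'http://example.com/filter?Name=Sam&Age=21&Gender=male'
# parse_qs returns every value as a list, so take the first element
query = {k: v[0] for k, v in parse_qs(urlparse(url).query).items()}
# {'Name': 'Sam', 'Age': '21', 'Gender': 'male'} -- note Age is a string,
# so you may still need to cast types, as the last answer in this thread discusses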
Use df.query. For example:
df = pd.read_csv(url)
conditions = "Name == 'Sam' and Age == 21 and Gender == 'Male'"
filtered_data = df.query(conditions)
You can build the conditions string dynamically using string formatting, like:
conditions = " and ".join("{} == {}".format(col, val)
                          for col, val in zip(df.columns, values))
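One caveat (my note, not the original answer's): with plain {} formatting, string values such as Sam are interpolated unquoted, and df.query will treat them as column names. Formatting with {!r} keeps strings quoted while leaving numbers bare:

conditions = " and ".join("{} == {!r}".format(col, val)
                          for col, val in zip(df.columns, values))
# e.g. "Name == 'Sam' and Age == 21 and Gender == 'male'"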
Typically, your web framework will return the arguments in a dict-like structure. Let's say your args are like this:
args = {
    'Name': ['Sam'],
    'Age': ['21'],  # Note that Age is a string
    'Gender': ['male']
}
You can filter your dataset successively like this:
for key, values in args.items():
    data = data[data[key].isin(values)]
However, this is likely not to match any data for Age, which may have been loaded as an integer. In that case, you could load the CSV file as a string via pd.read_csv(filename, dtype=object), or convert to string before comparison:
for key, values in args.items():
    data = data[data[key].astype(str).isin(values)]
Incidentally, this will also match multiple values. For example, take the URL http://example.com/filter?Name=Sam&Name=Ben&Age=21&Gender=male -- which leads to the structure:
args = {
    'Name': ['Sam', 'Ben'],  # There are 2 names
    'Age': ['21'],
    'Gender': ['male']
}
In this case, both Ben and Sam will be matched, since we're using .isin to match.
Say I have some data with timestamps, prices and amounts. This data can be quite large and matching conditions could occur anywhere in the group. A simple example shown below:
[{"date":1387496043,"price":19.379,"amount":1.000000}
{"date":1387496044,"price":20.20,"amount":2.00000}
{"date":1387496044,"price":10.00,"amount":0.10000}
{"date":1387496044,"price":20.20,"amount":0.300000}]
How could I sort this so I combine the amounts of any item that has the same timestamp and same price?
So the results look like (note the 2.0 and 0.3 amounts have been summed together):
[{"date":1387496043,"price":19.379,"amount":1.000000}
{"date":1387496044,"price":20.20,"amount":2.30000}
{"date":1387496044,"price":10.00,"amount":0.10000}]
I've tried a number of convoluted methods (using Python 2.7.3), but I don't know Python very well. I'm sure there's a good way to find two matching entries, update one with the new amount, and remove the duplicate.
FYI, here is the test data:
L=[{"date":1387496043,"price":19.379,"amount":1.000000},{"date":1387496044,"price":20.20,"amount":2.00000},{"date":1387496044,"price":10.00,"amount":0.10000},{"date":1387496044,"price":20.20,"amount":0.300000}]
A defaultdict-based approach
from collections import defaultdict
d = defaultdict(float)
z = [{"date":1387496043,"price":19.379,"amount":1.000000},
{"date":1387496044,"price":20.20,"amount":2.00000},
{"date":1387496044,"price":10.00,"amount":0.10000},
{"date":1387496044,"price":20.20,"amount":0.300000}]
for x in z:
    d[x["date"], x["price"]] += x["amount"]

# Python 2 syntax, matching the question's Python 2.7.3;
# on Python 3 use d.items() and print(...)
print [{"date": k1, "price": k2, "amount": v} for (k1, k2), v in d.iteritems()]
[{'date': 1387496044, 'price': 10.0, 'amount': 0.1},
{'date': 1387496044, 'price': 20.2, 'amount': 2.3},
{'date': 1387496043, 'price': 19.379, 'amount': 1.0}]
Probably the best way to do this would be to make a dictionary with (date, price) as keys. If you ever encounter a duplicate key, you can combine your fields to keep the keys unique.
def combine(L):
    results = {}
    for item in L:
        key = (item["date"], item["price"])
        if key in results:  # combine them
            results[key] = {"date": item["date"], "price": item["price"],
                            "amount": item["amount"] + results[key]["amount"]}
        else:  # don't need to combine them
            results[key] = item
    return results.values()
This would be a slightly messy O(n) solution to your example that can obviously be generalized to solve your initial problem.
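Called on the test data L from the question, it returns the three combined records (dict ordering is arbitrary on Python 2.7, so the order may differ):

>>> combine(L)
[{'date': 1387496044, 'price': 10.0, 'amount': 0.1},
 {'date': 1387496044, 'price': 20.2, 'amount': 2.3},
 {'date': 1387496043, 'price': 19.379, 'amount': 1.0}]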
FWIW you can do it using database operations:
records = [
    {"date":1387496043,"price":19.379,"amount":1.000000},
    {"date":1387496044,"price":20.20,"amount":2.00000},
    {"date":1387496044,"price":10.00,"amount":0.10000},
    {"date":1387496044,"price":20.20,"amount":0.300000},
]
import sqlite3
db = sqlite3.connect(':memory:')
db.row_factory = sqlite3.Row
db.execute('CREATE TABLE records (date int, price float, amount float)')
db.executemany('INSERT INTO records VALUES (:date, :price, :amount)', records)
sql = 'SELECT date, price, SUM(amount) AS amount FROM records GROUP BY date, price'
records = [dict(row) for row in db.execute(sql)]
print(records)
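On the records above, this should print something like the following (add an ORDER BY clause if you need a guaranteed row order):

[{'date': 1387496043, 'price': 19.379, 'amount': 1.0},
 {'date': 1387496044, 'price': 10.0, 'amount': 0.1},
 {'date': 1387496044, 'price': 20.2, 'amount': 2.3}]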