Let's assume a very simple data structure. In the below example, IDs are unique. "date" and "id" are strings, and "amount" is an integer.
data = [[date1, id1, amount1], [date2, id2, amount2], etc.]
If date1 == date2 and id1 == id2, I'd like to merge the two entries into one and basically add up amount1 and amount2 so that data becomes:
data = [[date1, id1, amount1 + amount2], etc.]
There are many duplicates.
As data is very big (over 100,000 entries), I'd like to do this as efficiently as possible. What I did was create a new "common" field: date and id combined into one string, with metadata that lets me split it apart later (date + id + "_" + str(len(date))).
In terms of complexity, I have four loops:
1. Parse and load data from the external source (it doesn't come in lists) | O(n)
2. Loop over data, create and store the "common" string (date + id + metadata) - I call this "prepared data", where "common" is my encoded field | O(n)
3. Use a Counter() object to dedupe the "prepared data" | O(n)
4. Decode "common" | O(n)
I don't care about memory here, I only care about speed. I could make a nested loop and avoid steps 2, 3 and 4 but that would be a time-complexity disaster (O(n²)).
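For reference, a minimal sketch of steps 2-4 as described (the sample rows are invented for illustration):

from collections import Counter

# invented sample; the real data comes from the external source (step 1)
data = [["2018-04-24", "A1", 10], ["2018-04-24", "A1", 5], ["2018-05-04", "B2", 7]]

# step 2: encode date + id into one "common" string, with the date length as metadata
prepared = [(date + id_ + "_" + str(len(date)), amount) for date, id_, amount in data]

# step 3: sum amounts per "common" key with a Counter
counts = Counter()
for key, amount in prepared:
    counts[key] += amount

# step 4: decode "common" back into date and id
merged = []
for key, total in counts.items():
    combined, date_len = key.rsplit("_", 1)
    merged.append([combined[:int(date_len)], combined[int(date_len):], total])

print(merged)  # [['2018-04-24', 'A1', 15], ['2018-05-04', 'B2', 7]]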
What is the fastest way to do this?
Consider a defaultdict for aggregating data by a unique key:
Given
Some random data
import random
import collections as ct
random.seed(123)
# Random data
dates = ["2018-04-24", "2018-05-04", "2018-07-06"]
ids = "A B C D".split()
amounts = lambda: random.randrange(1, 100)
ch = random.choice
data = [[ch(dates), ch(ids), amounts()] for _ in range(10)]
data
Output
[['2018-04-24', 'C', 12],
['2018-05-04', 'C', 14],
['2018-04-24', 'D', 69],
['2018-07-06', 'C', 44],
['2018-04-24', 'B', 18],
['2018-05-04', 'C', 90],
['2018-04-24', 'B', 1],
['2018-05-04', 'A', 77],
['2018-05-04', 'A', 1],
['2018-05-04', 'D', 14]]
Code
dd = ct.defaultdict(int)
for date, id_, amt in data:
    key = "{}{}_{}".format(date, id_, len(date))
    dd[key] += amt
dd
Output
defaultdict(int,
{'2018-04-24B_10': 19,
'2018-04-24C_10': 12,
'2018-04-24D_10': 69,
'2018-05-04A_10': 78,
'2018-05-04C_10': 104,
'2018-05-04D_10': 14,
'2018-07-06C_10': 44})
Details
A defaultdict is a dictionary that calls a default factory (a specified function) for any missing key. In this case, every date + id combination is added to the dict once. When a key already exists, the amount is added to its value; otherwise the int factory initializes a new entry to 0, and the amount is added to that.
For illustration, you can visualize the aggregated values using a list as the default factory.
dd = ct.defaultdict(list)
for date, id_, val in data:
    key = "{}{}_{}".format(date, id_, len(date))
    dd[key].append(val)
dd
Output
defaultdict(list,
{'2018-04-24B_10': [18, 1],
'2018-04-24C_10': [12],
'2018-04-24D_10': [69],
'2018-05-04A_10': [77, 1],
'2018-05-04C_10': [14, 90],
'2018-05-04D_10': [14],
'2018-07-06C_10': [44]})
We see three keys with duplicates, whose values were appropriately summed in the first output. Regarding efficiency, notice:
keys are made with format(), which should be a bit faster than string concatenation and calling str()
every key and value is computed in the same iteration
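As an aside (a variant sketch, not benchmarked here): since tuples are hashable, (date, id) can serve directly as the dict key, which avoids the encode/decode steps entirely (ct and data as defined above):

dd = ct.defaultdict(int)
for date, id_, amt in data:
    dd[(date, id_)] += amt

# back to the original list-of-lists shape
merged = [[date, id_, amt] for (date, id_), amt in dd.items()]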
Using pandas makes this really easy:
import pandas as pd
df = pd.DataFrame(data, columns=['date', 'id', 'amount'])
df.groupby(['date','id']).sum().reset_index()
For more control you can use agg instead of sum():
df.groupby(['date','id']).agg({'amount':'sum'})
Depending on what you are doing with the data, it may be easier/faster to go this way, simply because so much of pandas is built on compiled C extensions and optimized routines that make it easy to transform and manipulate data.
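If the result needs to end up back in the original list-of-lists shape, a short follow-up on the groupby above (df as defined earlier):

merged = df.groupby(['date', 'id'], as_index=False)['amount'].sum().values.tolist()
# e.g. [['2018-04-24', 'B', 19], ['2018-04-24', 'C', 12], ...]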
You could import the data into a structure that prevents duplicates and then convert it to a list.
data = {
    date1: {
        id1: amount1,
        id2: amount2,
    },
    date2: {
        id3: amount3,
        id4: amount4,
        ....
    },
}
The program's skeleton:
import collections

# the inner defaultdict(int) lets duplicate (date, id) amounts accumulate
ddata = collections.defaultdict(lambda: collections.defaultdict(int))
for date, id_, amount in DATASOURCE:
    ddata[date][id_] += amount

data = [[d, i, a] for d, subd in ddata.items() for i, a in subd.items()]
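A quick check with a few duplicate rows (invented values):

DATASOURCE = [["2018-04-24", "B", 18], ["2018-04-24", "B", 1], ["2018-05-04", "A", 77]]
# running the skeleton above yields:
# [['2018-04-24', 'B', 19], ['2018-05-04', 'A', 77]]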
I am trying to save specific data from my weather station to a dataframe. The code I have retrieves hourly log data as lists with sublists, and simply putting pd.DataFrame does not work due to multiple logs and sublists.
I am trying to make a code that retrieves specific parameters, e.g. tempHigh for each hourly log entry and puts it in a dataframe.
I am able to isolate the 'tempHigh' for the first hour by:
df = wu.hourly()["observations"][0]
x = df["metric"]
x["tempHigh"]
I am afraid I have to deal with my nemesis, Mr. For Loop, to retrieve each hourly log. I was hoping to get some help on how to attack this problem most efficiently.
The output data structure continues in the same form for all hours of the past 7 days. Below I have pasted the output data for the top two log entries.
{
"observations":[
{
"epoch":1607554798,
"humidityAvg":39,
"humidityHigh":44,
"humidityLow":37,
"lat":27.389829,
"lon":33.67048,
"metric":{
"dewptAvg":4,
"dewptHigh":5,
"dewptLow":4,
"heatindexAvg":19,
"heatindexHigh":19,
"heatindexLow":18,
"precipRate":0.0,
"precipTotal":0.0,
"pressureMax":1017.03,
"pressureMin":1016.53,
"pressureTrend":0.0,
"tempAvg":19,
"tempHigh":19,
"tempLow":18,
"windchillAvg":19,
"windchillHigh":19,
"windchillLow":18,
"windgustAvg":8,
"windgustHigh":13,
"windgustLow":2,
"windspeedAvg":6,
"windspeedHigh":10,
"windspeedLow":2
},
"obsTimeLocal":"2020-12-10 00:59:58",
"obsTimeUtc":"2020-12-09T22:59:58Z",
"qcStatus":-1,
"solarRadiationHigh":0.0,
"stationID":"IHURGH2",
"tz":"Africa/Cairo",
"uvHigh":0.0,
"winddirAvg":324
},
{
"epoch":1607558398,
"humidityAvg":48,
"humidityHigh":52,
"humidityLow":44,
"lat":27.389829,
"lon":33.67048,
"metric":{
"dewptAvg":7,
"dewptHigh":8,
"dewptLow":5,
"heatindexAvg":18,
"heatindexHigh":19,
"heatindexLow":17,
"precipRate":0.0,
"precipTotal":0.0,
"pressureMax":1016.93,
"pressureMin":1016.42,
"pressureTrend":-0.31,
"tempAvg":18,
"tempHigh":19,
"tempLow":17,
"windchillAvg":18,
"windchillHigh":19,
"windchillLow":17,
"windgustAvg":10,
"windgustHigh":15,
"windgustLow":4,
"windspeedAvg":8,
"windspeedHigh":13,
"windspeedLow":1
},
"obsTimeLocal":"2020-12-10 01:59:58",
"obsTimeUtc":"2020-12-09T23:59:58Z",
"qcStatus":-1,
"solarRadiationHigh":0.0,
"stationID":"IHURGH2",
"tz":"Africa/Cairo",
"uvHigh":0.0,
"winddirAvg":326
}
]
}
I might have a solution that suits your case. The way I've tackled this challenge is to flatten the entries of the single hourly logs so as not to have a nested dictionary. With 1-dimensional dictionaries (one for each hour), a DataFrame can easily be created with all the measures as columns and the date and time as index. From there on you can select whatever columns you'd like ;)
How do we get there and what do I mean by 'flatten the entries'?
The hourly logs come as single dictionaries with plain key/value pairs, except for 'metric', which is another dictionary. What I want is to get rid of the key 'metric' but not its values. Let's look at an example:
# nested dictionary
original = {'a':1, 'b':2, 'foo':{'c':3}}
# flatten original to
flattened = {'a':1, 'b':2, 'c':3} # got rid of key 'foo' but not its value
The below function achieves exactly that, a 1-dimensional or flat dictionary:
def flatten(dic):
    # look for a nested dict; if one is found, splice its items into the
    # top level and recurse in case more nested dicts remain
    update = False
    for key, val in dic.items():
        if isinstance(val, dict):
            update = True
            break
    if update:
        dic.pop(key)
        dic.update(val)
        flatten(dic)
    return dic
# With data from your weather station
hourly_log = {'epoch': 1607554798, 'humidityAvg': 39, 'humidityHigh': 44, 'humidityLow': 37, 'lat': 27.389829, 'lon': 33.67048, 'metric': {'dewptAvg': 4, 'dewptHigh': 5, 'dewptLow': 4, 'heatindexAvg': 19, 'heatindexHigh': 19, 'heatindexLow': 18, 'precipRate': 0.0, 'precipTotal': 0.0, 'pressureMax': 1017.03, 'pressureMin': 1016.53, 'pressureTrend': 0.0, 'tempAvg': 19, 'tempHigh': 19, 'tempLow': 18, 'windchillAvg': 19, 'windchillHigh': 19, 'windchillLow': 18, 'windgustAvg': 8, 'windgustHigh': 13, 'windgustLow': 2, 'windspeedAvg': 6, 'windspeedHigh': 10, 'windspeedLow': 2}, 'obsTimeLocal': '2020-12-10 00:59:58', 'obsTimeUtc': '2020-12-09T22:59:58Z', 'qcStatus': -1, 'solarRadiationHigh': 0.0, 'stationID': 'IHURGH2', 'tz': 'Africa/Cairo', 'uvHigh': 0.0, 'winddirAvg': 324}
# Flatten with function
flatten(hourly_log)
>>> {'epoch': 1607554798,
'humidityAvg': 39,
'humidityHigh': 44,
'humidityLow': 37,
'lat': 27.389829,
'lon': 33.67048,
'obsTimeLocal': '2020-12-10 00:59:58',
'obsTimeUtc': '2020-12-09T22:59:58Z',
'qcStatus': -1,
'solarRadiationHigh': 0.0,
'stationID': 'IHURGH2',
'tz': 'Africa/Cairo',
'uvHigh': 0.0,
'winddirAvg': 324,
'dewptAvg': 4,
'dewptHigh': 5,
'dewptLow': 4,
'heatindexAvg': 19,
'heatindexHigh': 19,
'heatindexLow': 18,
'precipRate': 0.0,
'precipTotal': 0.0,
'pressureMax': 1017.03,
'pressureMin': 1016.53,
'pressureTrend': 0.0,
...
Notice: 'metric' is gone but not its values!
Now a DataFrame can easily be created for each hourly log, and these can be concatenated into a single DataFrame:
import pandas as pd
hourly_logs = wu.hourly()['observations']
# List of DataFrames for each hour
frames = [pd.DataFrame(flatten(dic), index=[0]).set_index('epoch') for dic in hourly_logs]
# Concatenated to a single one
df = pd.concat(frames)
# With adjusted index as Date and Time
dti = pd.DatetimeIndex(df.index * 10**9)
df.index = pd.MultiIndex.from_arrays([dti.date, dti.time])
# All measures
df.columns
>>> Index(['humidityAvg', 'humidityHigh', 'humidityLow', 'lat', 'lon',
'obsTimeLocal', 'obsTimeUtc', 'qcStatus', 'solarRadiationHigh',
'stationID', 'tz', 'uvHigh', 'winddirAvg', 'dewptAvg', 'dewptHigh',
'dewptLow', 'heatindexAvg', 'heatindexHigh', 'heatindexLow',
'precipRate', 'precipTotal', 'pressureMax', 'pressureMin',
'pressureTrend', 'tempAvg', 'tempHigh', 'tempLow', 'windchillAvg',
'windchillHigh', 'windchillLow', 'windgustAvg', 'windgustHigh',
'windgustLow', 'windspeedAvg', 'windspeedHigh', 'windspeedLow'],
dtype='object')
# Read out specific measures
df[['tempHigh','tempLow','tempAvg']]
Hopefully this is what you've been looking for!
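As a postscript (a sketch, assuming pandas >= 1.0 is available): pd.json_normalize performs this kind of flattening in one call, turning nested keys into dotted column names:

import pandas as pd

hourly_logs = wu.hourly()['observations']
df = pd.json_normalize(hourly_logs)  # nested 'metric' keys become e.g. 'metric.tempHigh'
df[['metric.tempHigh', 'metric.tempLow', 'metric.tempAvg']]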
Pandas accepts a list of dictionaries as input to create a dataframe:
import pandas as pd
input_dict = {"observations":[
{
"epoch":1607554798,
"humidityAvg":39,
"humidityHigh":44,
"humidityLow":37,
"lat":27.389829,
"lon":33.67048,
"metric":{
"dewptAvg":4,
"dewptHigh":5,
"dewptLow":4,
"heatindexAvg":19,
"heatindexHigh":19,
"heatindexLow":18,
"precipRate":0.0,
"precipTotal":0.0,
"pressureMax":1017.03,
"pressureMin":1016.53,
"pressureTrend":0.0,
"tempAvg":19,
"tempHigh":19,
"tempLow":18,
"windchillAvg":19,
"windchillHigh":19,
"windchillLow":18,
"windgustAvg":8,
"windgustHigh":13,
"windgustLow":2,
"windspeedAvg":6,
"windspeedHigh":10,
"windspeedLow":2
},
"obsTimeLocal":"2020-12-10 00:59:58",
"obsTimeUtc":"2020-12-09T22:59:58Z",
"qcStatus":-1,
"solarRadiationHigh":0.0,
"stationID":"IHURGH2",
"tz":"Africa/Cairo",
"uvHigh":0.0,
"winddirAvg":324
},
{
"epoch":1607558398,
"humidityAvg":48,
"humidityHigh":52,
"humidityLow":44,
"lat":27.389829,
"lon":33.67048,
"metric":{
"dewptAvg":7,
"dewptHigh":8,
"dewptLow":5,
"heatindexAvg":18,
"heatindexHigh":19,
"heatindexLow":17,
"precipRate":0.0,
"precipTotal":0.0,
"pressureMax":1016.93,
"pressureMin":1016.42,
"pressureTrend":-0.31,
"tempAvg":18,
"tempHigh":19,
"tempLow":17,
"windchillAvg":18,
"windchillHigh":19,
"windchillLow":17,
"windgustAvg":10,
"windgustHigh":15,
"windgustLow":4,
"windspeedAvg":8,
"windspeedHigh":13,
"windspeedLow":1
},
"obsTimeLocal":"2020-12-10 01:59:58",
"obsTimeUtc":"2020-12-09T23:59:58Z",
"qcStatus":-1,
"solarRadiationHigh":0.0,
"stationID":"IHURGH2",
"tz":"Africa/Cairo",
"uvHigh":0.0,
"winddirAvg":326
}
]
}
observations = input_dict["observations"]
df = pd.DataFrame(observations)
If you now want a list of single "metrics", you need to "flatten" your column of dictionaries. This does use your "Nemesis", but in a Pythonic way:
temperature_high = [d.get("tempHigh") for d in df["metric"].to_list()]
If you want all the metrics in a dataframe, even simpler, just get the list of dictionaries from the specific column:
metrics = pd.DataFrame(df["metric"].to_list())
As you would probably like the timestamp as an index to denote your entries (your rows), you can pick the column epoch, or the more human-readable obsTimeLocal:
metrics = pd.DataFrame(df["metric"].to_list(), index=df["obsTimeLocal"].to_list())
From here you can read specific metrics of your interest:
metrics[["tempHigh", "tempLow"]]
I am new to Pythonland and I have a question. I have a list as below and want to convert it into a dataframe.
I read on Stack Overflow that it is better to create a dictionary than a list, so I created one as follows.
column_names = ["name", "height" , "weight", "grade"] # Actual list has 10 entries
row_names = ["jack", "mick", "nick","pick"]
data = ['100','50','A','107','62','B'] # The actual list has 1640 entries
dic = {key:[] for key in column_names}
dic['name'] = row_names
t = 0
while t < len(data):
    dic['height'].append(data[t])
    t = t + 3

t = 1
while t < len(data):
    dic['weight'].append(data[t])
    t = t + 3
So on and so forth; I have 10 columns, so I wrote the above code 10 times to complete the full dictionary, then converted it to a dataframe. It works perfectly fine, but there has to be a shorter way to do this. I don't know how to refer to a dictionary key by number. Should this be wrapped in a function? Also, how can I automate adding one to the value of t before the next loop? Please help me.
You can iterate through column_names like this:

dic = {key: [] for key in column_names}
dic['name'] = row_names

for t, column_name in enumerate(column_names[1:]):  # skip 'name', it is already filled
    i = t
    while i < len(data):
        dic[column_name].append(data[i])
        i += 3

enumerate automatically counts t up from 0, so each remaining column starts at its own offset into data (the step of 3 matches the number of data columns per row in the sample).
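A quick check (row_names trimmed to two entries so the lengths line up with the six sample values):

import pandas as pd

column_names = ["name", "height", "weight", "grade"]
row_names = ["jack", "mick"]
data = ['100', '50', 'A', '107', '62', 'B']

dic = {key: [] for key in column_names}
dic['name'] = row_names
for t, column_name in enumerate(column_names[1:]):
    i = t
    while i < len(data):
        dic[column_name].append(data[i])
        i += 3

print(pd.DataFrame(dic))
#    name height weight grade
# 0  jack    100     50     A
# 1  mick    107     62     B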
Alternatively, walk data in row-sized chunks and let the end of the list break the loop:

i = 0
while True:
    try:
        for j in column_names[1:]:  # 'name' is already filled from row_names
            dic[j].append(data[i])
            i += 1
    except IndexError as er:
        # once i runs past the end of data, the IndexError ends the loop
        print(er)
        break
The first issue is that all column values are concatenated into a single list. You should first investigate how to prevent that and get a list of lists, with each column's values in a separate list, like [['100', '107'], ['50', '62'], ['A', 'B']]. In any case, you need this data structure to proceed efficiently:

data_cols = column_names[1:]  # the 'name' values live in row_names, not in data
cl_count = len(data_cols)
d_count = len(data)
spl_data = [[data[j] for j in range(i, d_count, cl_count)] for i in range(cl_count)]

Then you can use a dict comprehension (available since Python 2.7):

df = pd.DataFrame({j: spl_data[i] for i, j in enumerate(data_cols)})
df.insert(0, 'name', row_names)
First, we should understand what an ideal dictionary for a dataframe looks like.
A DataFrame can be thought of in two different ways.
One is a traditional collection of rows:

{'row 0': ['jack', 100, 50, 'A'],
 'row 1': ['mick', 107, 62, 'B']}

However, there is a second representation that is more useful, though perhaps not as intuitive at first: a collection of columns:

{'name': ['jack', 'mick'],
 'height': [100, 107],
 'weight': [50, 62],
 'grade': ['A', 'B']}

Now, here is the key thing to realise: the second representation is more useful because it is the representation internally supported and used by DataFrames.
It does not run into datatype conflicts within a single grouping (each column needs one fixed datatype), whereas datatypes can vary across a row. Operations can also be performed easily and consistently on an entire column, because of a consistency that can't be guaranteed across a row.
So, tl;dr: DataFrames are essentially collections of equal-length columns.
So, a dictionary in that representation can easily be converted into a DataFrame.
column_names = ["name", "height" , "weight", "grade"] # Actual list has 10 entries
row_names = ["jack", "mick"]
data = [100, 50,'A', 107, 62,'B'] # The actual list has 1640 entries
With that in mind, the first thing to realize is that, in its current format, data is a very poor representation: it is a collection of rows merged into a single list. If you're the one in control of how data is formed, the first thing to do is to not prepare it this way. The goal is one list per column, so ideally prepare the data in that format. If it is given in this format, however, you need to iterate and collect the values accordingly. Here's a way to do it:
column_names = ["name", "height" , "weight", "grade"] # Actual list has 10 entries
row_names = ["jack", "mick"]
data = [100, 50,'A', 107, 62,'B'] # The actual list has 1640 entries
dic = {key:[] for key in column_names}
dic['name'] = row_names
print(dic)
Output so far:
{'height': [],
'weight': [],
'grade': [],
'name': ['jack', 'mick']} #so, now, names are a column representation with all correct values.
remaining_cols = column_names[1:]
#Explanations for the following part given at the end
data_it = iter(data)
for row in zip(*([data_it] * len(remaining_cols))):
    for i, val in enumerate(row):
        dic[remaining_cols[i]].append(val)
print(dic)
Output:
{'name': ['jack', 'mick'],
'height': [100, 107],
'weight': [50, 62],
'grade': ['A', 'B']}
And we are done with the representation
Finally:
import pandas as pd
df = pd.DataFrame(dic, columns = column_names)
print(df)
name height weight grade
0 jack 100 50 A
1 mick 107 62 B
Edit:
Some explanation for the zip part:
zip takes any iterables and allows us to iterate through them together.

data_it = iter(data)  # prepares an iterator
[data_it] * len(remaining_cols)  # creates references to the same iterator

Here, this is similar to [data_it, data_it, data_it].
The * in *[data_it, data_it, data_it] allows us to unpack the list into 3 separate arguments for the zip function,
so f(*[data_it, data_it, data_it]) is equivalent to f(data_it, data_it, data_it) for any function f.
The magic here is that advancing the iterator through any one reference advances it for all of them, since they point to the same object.
Putting it all together:
zip(*([data_it] * len(remaining_cols))) lets us take 3 items from data at a time and assign them to row.
So, row = (100, 50, 'A') in the first iteration of zip.

for i, val in enumerate(row):  # iterate through the row, tracking the index with enumerate
    dic[remaining_cols[i]].append(val)  # use the index to pick the correct list in the dictionary
Hope that helps.
If you are using Python 3.x, as suggested by l159, you can use a dict comprehension and then create a pandas DataFrame out of it, using the names as row indexes:
import pandas as pd

data = ['100', '50', 'A', '107', '62', 'B', '103', '64', 'C', '105', '78', 'D']
column_names = ["height", "weight", "grade"]
row_names = ["jack", "mick", "nick", "pick"]

df = pd.DataFrame.from_dict(
    {
        row_label: {
            column_label: data[i * len(column_names) + j]
            for j, column_label in enumerate(column_names)
        } for i, row_label in enumerate(row_names)
    },
    orient='index'
)
Actually, the intermediate dictionary is a nested dictionary: the keys of the outer dictionary are the row labels (here, the items of the row_names list); the value associated with each key is a dictionary whose keys are the column labels (i.e., the items in column_names) and whose values are the corresponding elements of the data list.
The from_dict function with orient='index' then creates the DataFrame instance, using the outer keys as row labels.
So, the previous code produces the following result:
height weight grade
jack 100 50 A
mick 107 62 B
nick 103 64 C
pick 105 78 D
I have a TSV file where one of the columns is in a dictionary-like format.
Example headers and one row (notice the string quotes around the Preferences column):
Name, Age, Preferences
Nick, 18, "[{"Hobby":"Football", "Food":"Pizza", "FavoriteNumber":"72"}]"
To read the file into python:
df = pd.read_csv('search_data_assessment.tsv',delimiter='\t')
To remove the strings of the "Preferences" at beginning and end, I used ast.literal_eval:
df["Preferences"] = ast.literal_eval(df["Preferences"])
This raises "ValueError: malformed node or string: 0", but it seems to do the trick.
The question: How can I check all rows and look for "FavoriteNumber" in Preferences, and if it == 72, change it to 100 (arbitrary example)?
You can use pd.Series.apply with a custom function. Just note this is bordering on abuse of Pandas: Pandas isn't designed to hold lists of dictionaries in a Series. Here, you are running a loop in a particularly inefficient way.
from ast import literal_eval
import pandas as pd

df = pd.DataFrame([['Nick', 18, '[{"Hobby":"Football", "Food":"Pizza", "FavoriteNumber":"72"}]']],
                  columns=['Name', 'Age', 'Preferences'])

def updater(x):
    # x is the parsed list of preference dicts for one row
    if x[0]['FavoriteNumber'] == '72':
        x[0]['FavoriteNumber'] = '100'
    return x

df['Preferences'] = df['Preferences'].apply(literal_eval)
df['Preferences'] = df['Preferences'].apply(updater)
print(df['Preferences'].iloc[0])
[{'Hobby': 'Football', 'Food': 'Pizza', 'FavoriteNumber': '100'}]
I have two dictionaries. One maps chapter_id to book_id: {99: 7358, 852: 7358, 456: 7358}. Here just one book is shown as an example, but there are many. The other maps the same chapter_id to some information: {99: ["John Smith", 20, 5], 852: ["Clair White", 15, 10], 456: ["Daniel Dylan", 25, 10]}. Chapter ids are unique across all books.

I have to combine them in such a way that every book gets the information from all the chapters it contains, something like {7358: [[99, 852, 456], ["John Smith", "Clair White", "Daniel Dylan"], [20, 15, 25], [5, 10, 10]]}. I also have a file with a dictionary where each book has the ids of all the chapters it contains.

I know how to do it by looping over both dictionaries (they used to be lists), but that takes ages; that is why they are now dictionaries, and I think I can manage with just one loop over all chapters. But in my head I always come back to looping over books and over chapters. Any ideas are very much appreciated! I will write the final result to a file, so it is not very important whether it is a nested dictionary or something else. Or at least I think so.
If you are open to using other packages, then you might want to have a look at pandas, which will allow you to do many things easily and fast. Here is an example based on the data you provided...
import pandas as pd

d1 = {99: 7358, 852: 7358, 456: 7358}
df1 = pd.DataFrame.from_dict(d1, orient="index")
df1.reset_index(inplace=True)

d2 = {99: ["John Smith", 20, 5], 852: ["Clair White", 15, 10], 456: ["Daniel Dylan", 25, 10]}
df2 = pd.DataFrame.from_dict(d2, orient="index")
df2.reset_index(inplace=True)

df = df1.merge(df2, left_on="index", right_on="index")
df.columns = ["a", "b", "c", "d", "e"]  # a = chapter id, b = book id, c/d/e = chapter info
# all data for 7358 (ie subsetting)
df[df.b == 7358]
# all names as a list
list(df[df.b == 7358].c)
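If the goal is the nested shape from the question, a hedged groupby sketch (using the single-letter column names assigned above, and relying on the merge preserving insertion order):

out = {book: [g["a"].tolist(), g["c"].tolist(), g["d"].tolist(), g["e"].tolist()]
       for book, g in df.groupby("b")}
# {7358: [[99, 852, 456], ['John Smith', 'Clair White', 'Daniel Dylan'], [20, 15, 25], [5, 10, 10]]}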
You could always iterate over the dictionary keys, given that the same keys appear in both dictionaries:
for chapter_id in dict1:
    book_id = dict1[chapter_id]
    chapter_info = dict2[chapter_id]
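A hedged completion of that loop, assembling the structure from the question (dict1/dict2 as named above):

result = {}
for chapter_id in dict1:
    book_id = dict1[chapter_id]
    chapter_info = dict2[chapter_id]
    lists = result.setdefault(book_id, [[], [], [], []])
    lists[0].append(chapter_id)
    for slot, value in zip(lists[1:], chapter_info):
        slot.append(value)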
from collections import defaultdict

def append_all(l, a):
    # element-wise append: a[i] goes onto the end of list l[i]
    if len(l) != len(a):
        raise ValueError
    for i in range(len(l)):
        l[i].append(a[i])

final_dict = defaultdict(lambda: [[], [], [], []])
for chapter, book in d1.items():
    final_dict[book][0].append(chapter)
    append_all(final_dict[book][1:], d2[chapter])
You only need to iterate over the chapters. You can replace the append_all function with explicit appends, but it seemed ugly to do it that way. I'm surprised there's not a method for this, but it may just be that I missed a clever way to use zip here.
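For reference, on the sample dictionaries from the question this produces:

d1 = {99: 7358, 852: 7358, 456: 7358}
d2 = {99: ["John Smith", 20, 5], 852: ["Clair White", 15, 10], 456: ["Daniel Dylan", 25, 10]}

# after the loop above:
# dict(final_dict) == {7358: [[99, 852, 456],
#                             ['John Smith', 'Clair White', 'Daniel Dylan'],
#                             [20, 15, 25],
#                             [5, 10, 10]]}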
Currently I have a question about pandas: I want to filter a dataframe dynamically, using a URL query string.
For example, given the URL:
http://example.com/filter?Name=Sam&Age=21&Gender=male
Hardcoded, the filter looks like this:
filtered_data = data[
    (data['Name'] == 'Sam') &
    (data['Age'] == 21) &
    (data['Gender'] == 'male')
]
I don't want to hardcode the filter keys like this, because the CSV file the data comes from can change at any time, with different column headers.
Any suggestions?
The easiest way to create this filter dynamically is probably to use np.all.
For example:
import numpy as np
query = {'Name': 'Sam', 'Age': 21, 'Gender': 'male'}
filters = [data[k] == v for k, v in query.items()]
filter_data = data[np.all(filters, axis=0)]
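A quick end-to-end check (the sample frame is invented for illustration):

import numpy as np
import pandas as pd

data = pd.DataFrame({'Name': ['Sam', 'Ben'], 'Age': [21, 30], 'Gender': ['male', 'male']})
query = {'Name': 'Sam', 'Age': 21, 'Gender': 'male'}
filters = [data[k] == v for k, v in query.items()]
print(data[np.all(filters, axis=0)])
#   Name  Age Gender
# 0  Sam   21   male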
Use df.query. For example:

df = pd.read_csv(url)
conditions = "Name == 'Sam' and Age == 21 and Gender == 'Male'"
filtered_data = df.query(conditions)

You can build the conditions string dynamically using string formatting:

conditions = " and ".join("{} == {!r}".format(col, val)
                          for col, val in zip(df.columns, values))

(Here values stands for the filter values in column order; the !r conversion keeps quotes around string values so the query parses.)
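To get the filter dict from the URL itself, the standard library's urllib.parse can help (a sketch using the example URL from the question; note all parsed values arrive as strings):

from urllib.parse import urlparse, parse_qsl

url = "http://example.com/filter?Name=Sam&Age=21&Gender=male"
query = dict(parse_qsl(urlparse(url).query))
# {'Name': 'Sam', 'Age': '21', 'Gender': 'male'}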
Typically, your web framework will return the arguments in a dict-like structure. Let's say your args are like this:
args = {
    'Name': ['Sam'],
    'Age': ['21'],  # note that Age is a string
    'Gender': ['male']
}
You can filter your dataset successively like this:
for key, values in args.items():
    data = data[data[key].isin(values)]
However, this is likely not to match any data for Age, which may have been loaded as an integer. In that case, you could load the CSV file as strings via pd.read_csv(filename, dtype=object), or convert the column to string before comparing:
for key, values in args.items():
    data = data[data[key].astype(str).isin(values)]
Incidentally, this will also match multiple values. For example, take the URL http://example.com/filter?Name=Sam&Name=Ben&Age=21&Gender=male -- which leads to the structure:
args = {
    'Name': ['Sam', 'Ben'],  # there are 2 names
    'Age': ['21'],
    'Gender': ['male']
}
In this case, both Ben and Sam will be matched, since we're using .isin to match.