Drop selective columns from a pandas DataFrame while flattening - Python

I have created a dataframe from a JSON but want to keep only the first 5 columns of the result.
Here is a part of the JSON:
{
"lat": 52.517,
"lon": 13.3889,
"timezone": "Europe/Berlin",
"timezone_offset": 7200,
"current": {
"dt": 1628156947,
"sunrise": 1628134359,
"sunset": 1628189532,
"temp": 295.54,
"feels_like": 295.43,
"pressure": 1009,
"humidity": 61,
"dew_point": 287.66,
"uvi": 4.53,
"clouds": 20,
"visibility": 10000,
"wind_speed": 3.58,
"wind_deg": 79,
"wind_gust": 4.92,
"weather": [
{
"id": 801,
"main": "Clouds",
"description": "few clouds",
"icon": "02d"
}
]
},
"hourly": [
{
"dt": 1628154000,
"temp": 295.26,
"feels_like": 295.09,
"pressure": 1009,
"humidity": 60,
"dew_point": 287.14,
"uvi": 4.01,
"clouds": 36,
"visibility": 10000,
"wind_speed": 3.6,
"wind_deg": 83,
"wind_gust": 4.76,
"weather": [
{
"id": 500,
"main": "Rain",
"description": "light rain",
"icon": "10d"
}
],
"pop": 0.49,
"rain": {
"1h": 0.52
}
},
{
"dt": 1628157600,
"temp": 295.54,
"feels_like": 295.43,
"pressure": 1009,
"humidity": 61,
"dew_point": 287.66,
"uvi": 4.53,
"clouds": 20,
"visibility": 10000,
"wind_speed": 3.76,
"wind_deg": 85,
"wind_gust": 4.91,
"weather": [
{
"id": 801,
"main": "Clouds",
"description": "few clouds",
"icon": "02d"
}
],
"pop": 0.55
},
{
"dt": 1628161200,
"temp": 295.58,
"feels_like": 295.42,
"pressure": 1009,
"humidity": 59,
"dew_point": 287.18,
"uvi": 4.9,
"clouds": 36,
"visibility": 10000,
"wind_speed": 3.58,
"wind_deg": 95,
"wind_gust": 4.73,
"weather": [
{
"id": 802,
"main": "Clouds",
"description": "scattered clouds",
"icon": "03d"
}
],
"pop": 0.59
}
]
}
I have flattened the JSON first like this:
df_history = pd.json_normalize(data_history, max_level=1)
That gave me this structure:
lat lon timezone timezone_offset hourly current.dt current.sunrise current.sunset current.temp current.feels_like ... current.humidity current.dew_point current.uvi current.clouds current.visibility current.wind_speed current.wind_deg current.wind_gust current.weather current.rain
0 52.517 13.3889 Europe/Berlin 7200 [{'dt': 1627776000, 'temp': 17.82, 'feels_like... 1627855200 1627874869 1627930649 16.36 16.4 ... 90 14.72 0 0 10000 3.13 254 11.18 [{'id': 500, 'main': 'Rain', 'description': 'l... {'1h': 0.17}
But I want to keep only the columns up to the column "hourly" and then flatten it.
I have tried this but to no avail:
df_history_small = pd.json_normalize(data_history, record_path='hourly',meta=['dt','temp', 'humidity'], errors='ignore')
What am I doing wrong? How can I achieve my goal?
My final goal is to have a dataframe that looks like this:
lat lon timezone timezone_offset timestamp temp feels_like humidity pressure
0 52.517 13.3889 Europe/Berlin 7200 08/01/2021 00:00:00 17.82 17.46 69 1005

Try:
cols = ['lat', 'lon', 'timezone', 'timezone_offset',
'dt', 'temp', 'feels_like', 'humidity']
out = pd.json_normalize(data_history, ['hourly'], meta=cols[:4])[cols]
>>> out
lat lon timezone timezone_offset dt temp feels_like humidity
0 52.517 13.3889 Europe/Berlin 7200 1628154000 295.26 295.09 60
1 52.517 13.3889 Europe/Berlin 7200 1628157600 295.54 295.43 61
2 52.517 13.3889 Europe/Berlin 7200 1628161200 295.58 295.42 59
Feel free to convert dt to timestamp with:
out['timestamp'] = pd.to_datetime(out['dt'], unit='s')
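To reproduce the target layout exactly (pressure included, and dt rendered as a formatted timestamp column), one possible extension of the same call, assuming data_history holds the parsed JSON from above:
import pandas as pd

meta_cols = ['lat', 'lon', 'timezone', 'timezone_offset']
hourly_cols = ['dt', 'temp', 'feels_like', 'humidity', 'pressure']

# flatten the "hourly" records and carry the top-level metadata along
out = pd.json_normalize(data_history, record_path='hourly', meta=meta_cols)
out = out[meta_cols + hourly_cols]

# render the unix epoch as a readable timestamp and drop the raw dt column
out['timestamp'] = pd.to_datetime(out['dt'], unit='s').dt.strftime('%m/%d/%Y %H:%M:%S')
out = out.drop(columns='dt')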

Related

Parsing Nested loops and converting to Dataframe

I have queried device information from MongoDB and the output comes out like this. I need to put it in a dataframe with the following information:
[{'message': {'obj': [{'time': '2022-06-03 00:00:00',
'temp': 33.96,
'humidty': 91.44,
'x0': -543,
'y0': 93,
'z0': -790,
'dmac': 'DD340206D4C6'},
{'time': '2022-06-03 00:00:00',
'temp': 29.86,
'humidty': 80.92,
'x0': 178,
'y0': 774,
'z0': -527,
'dmac': 'DD340206D4C6'},
{'time': '2022-06-03 00:00:00',
'temp': 30.33,
'humidty': 85.11,
'x0': 94,
'y0': -701,
'z0': -737,
'dmac': 'DD340206D4C6'}]}},
{'message': {'obj': [{'time': '2022-06-03 00:00:01',
'temp': 28.82,
'humidty': 85.77,
'x0': -193,
'y0': 423,
'z0': -820,
'dmac': 'DD340206D4C6'},
{'time': '2022-06-03 00:00:01',
'temp': 30.33,
'humidty': 85.11,
'x0': 64,
'y0': -705,
'z0': -744,
'dmac': 'DD340206D4C6'},
{'time': '2022-06-03 00:00:02',
'temp': 33.96,
'humidty': 91.44,
'x0': -541,
'y0': 95,
'z0': -798,
'dmac': 'DD340206D4C6'}]}}
Expected like this:
dmac          temp   humidity  x0    y0  z0    time
DD340206D4C6  29.86  91.44     -543  93  -790  2022-06-03 00:00:00
It is a list of nested dictionaries; each inner array contains 3 records, so I need to put each one in a separate row.
data_in:
in_ = [
{
"message": {
"obj": [
{
"time": "2022-06-03 00:00:00",
"temp": 33.96,
"humidty": 91.44,
"x0": -543,
"y0": 93,
"z0": -790,
"dmac": "DD340206D4C6"
},
{
"time": "2022-06-03 00:00:00",
"temp": 29.86,
"humidty": 80.92,
"x0": 178,
"y0": 774,
"z0": -527,
"dmac": "DD340206D4C6"
},
{
"time": "2022-06-03 00:00:00",
"temp": 30.33,
"humidty": 85.11,
"x0": 94,
"y0": -701,
"z0": -737,
"dmac": "DD340206D4C6"
}
]
}
},
{
"message": {
"obj": [
{
"time": "2022-06-03 00:00:01",
"temp": 28.82,
"humidty": 85.77,
"x0": -193,
"y0": 423,
"z0": -820,
"dmac": "DD340206D4C6"
},
{
"time": "2022-06-03 00:00:01",
"temp": 30.33,
"humidty": 85.11,
"x0": 64,
"y0": -705,
"z0": -744,
"dmac": "DD340206D4C6"
},
{
"time": "2022-06-03 00:00:02",
"temp": 33.96,
"humidty": 91.44,
"x0": -541,
"y0": 95,
"z0": -798,
"dmac": "DD340206D4C6"
}
]
}
}
]
Code:
import pandas as pd
df = pd.json_normalize(in_, ["message", "obj"])
df = df[["dmac", "temp", "humidty", "x0", "y0", "z0", "time"]]
Output:
dmac          temp   humidty  x0    y0    z0    time
DD340206D4C6  33.96  91.44    -543  93    -790  2022-06-03 00:00:00
DD340206D4C6  29.86  80.92    178   774   -527  2022-06-03 00:00:00
DD340206D4C6  30.33  85.11    94    -701  -737  2022-06-03 00:00:00
DD340206D4C6  28.82  85.77    -193  423   -820  2022-06-03 00:00:01
DD340206D4C6  30.33  85.11    64    -705  -744  2022-06-03 00:00:01
DD340206D4C6  33.96  91.44    -541  95    -798  2022-06-03 00:00:02
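The source keys spell humidity as humidty; if the column name from the expected output is wanted, and the time strings should become real datetimes, a small follow-up on the df built above (a sketch):
# rename the misspelled source key and parse the timestamp strings
df = df.rename(columns={"humidty": "humidity"})
df["time"] = pd.to_datetime(df["time"])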

pandas dataframe to custom nested json

I have a pandas dataframe that looks like this:
user_id cat_id prod_id score pref_prod
29762 9 3115 1.000000 335.0
29762 58 1335 1.000000 335.0
234894 58 1335 1.000000 335.0
413276 43 1388 1.000000 335.0
413276 58 1335 1.000000 335.0
413276 73 26 1.000000 335.0
9280593 9 137 1.000000 335.0
9280593 58 1335 1.000000 335.0
9280593 74 160 1.000000 335.0
4554542 66 1612 0.166667 197.0
4554542 66 1406 0.166767 197.0
4554542 66 2021 1.000000 197.0
I want to group this df by user_id & cat_id and convert it to json so that it looks something like this:
{
29762: {
'cat_id': {
9: [{
'prod_id': 3115,
'score': 1.0
}],
58: [{
'prod_id': 1335,
'score': 1.0
}]
},
'pref_prod': 335.0
},
234894: {
'cat_id': {
58: [{
'prod_id': 1335,
'score': 1.0
}]
},
'pref_prod': 335.0
},
413276: {
'cat_id': {
43: [{
'prod_id': 1388,
'score': 1.0,
'fav_provider': 335.0
}],
58: [{
'prod_id': 1335,
'score': 1.0,
'fav_provider': 335.0
}],
73: [{
'prod_id': 26,
'score': 1.0,
}]
},
'pref_prod': 335.0
},
4554542: {
'cat_id': {
66: [{
'prod_id': 1612,
'score': 0.166
}, {
'prod_id': 1406,
'score': 0.16
}, {
'prod_id': 2021,
'score': 1.0,
}]
},
'pref_prod': 197.0
}
}
As of now I can do
gb = df.groupby(['user_id', 'cat_id']).apply(lambda g: g.drop(['user_id', 'cat_id'], axis=1).to_dict(orient='records')).to_dict()
which gives me user_id and cat_id as tuple keys:
{
(29762, 9): [{
'prod_id': 3115,
'score': 1.0,
'pref_prod': 335.0
}],
(29762, 58): [{
'prod_id': 1335,
'score': 1.0,
'pref_prod': 335.0
}],
(234894, 58): [{
'prod_id': 1335,
'score': 1.0,
'pref_prod': 335.0
}],
(413276, 43): [{
'prod_id': 1388,
'score': 1.0,
'pref_prod': 335.0
}],
(413276, 58): [{
'prod_id': 1335,
'score': 1.0,
'pref_prod': 335.0
}],
(413276, 73): [{
'prod_id': 26,
'score': 1.0,
'pref_prod': 335.0
}],
(9280593, 9): [{
'prod_id': 137,
'score': 1.0,
'pref_prod': 335.0
}],
(9280593, 58): [{
'prod_id': 1335,
'score': 1.0,
'pref_prod': 335.0
}],
(9280593, 74): [{
'prod_id': 160,
'score': 1.0,
'pref_prod': 335.0
}],
(4554542, 66): [{
'prod_id': 1612,
'score': 0.16666666666666666,
'pref_prod': 197.0
}, {
'prod_id': 1406,
'score': 0.16676666666666665,
'pref_prod': 197.0
}, {
'prod_id': 2021,
'score': 1.0,
'pref_prod': 197.0
}]
}
How can I get the JSON in the desired format?
I can't think of any direct way to do it with pandas alone, but you can construct a new dictionary with the desired format based on gb, using a defaultdict:
from collections import defaultdict
import json  # just to pretty-print the resulting dictionary

gb = df.groupby(['user_id', 'cat_id']).apply(lambda g: g.drop(['user_id', 'cat_id'], axis=1).to_dict(orient='records')).to_dict()

d = defaultdict(lambda: {'cat_id': {}})
for (user_id, cat_id), records in gb.items():
    for record in records:
        # drop the 'pref_prod' key of each record;
        # I'm assuming it's unique for each (user_id, cat_id) group
        pref_prod = record.pop('pref_prod')
    d[user_id]['cat_id'][cat_id] = records
    d[user_id]['pref_prod'] = pref_prod
>>> print(json.dumps(d, indent=4))
{
"29762": {
"cat_id": {
"9": [
{
"prod_id": 3115,
"score": 1.0
}
],
"58": [
{
"prod_id": 1335,
"score": 1.0
}
]
},
"pref_prod": 335.0
},
"234894": {
"cat_id": {
"58": [
{
"prod_id": 1335,
"score": 1.0
}
]
},
"pref_prod": 335.0
},
"413276": {
"cat_id": {
"43": [
{
"prod_id": 1388,
"score": 1.0
}
],
"58": [
{
"prod_id": 1335,
"score": 1.0
}
],
"73": [
{
"prod_id": 26,
"score": 1.0
}
]
},
"pref_prod": 335.0
},
"4554542": {
"cat_id": {
"66": [
{
"prod_id": 1612,
"score": 0.166667
},
{
"prod_id": 1406,
"score": 0.166767
},
{
"prod_id": 2021,
"score": 1.0
}
]
},
"pref_prod": 197.0
},
"9280593": {
"cat_id": {
"9": [
{
"prod_id": 137,
"score": 1.0
}
],
"58": [
{
"prod_id": 1335,
"score": 1.0
}
],
"74": [
{
"prod_id": 160,
"score": 1.0
}
]
},
"pref_prod": 335.0
}
}
I used a namedtuple created from the dataframe rows to build the JSON tree. If the tree had more than one level, I would use recursion to build it; the dataframe did not contain lists of lists, so recursion was not required.
import io
from collections import namedtuple

import pandas as pd

data = """user_id,cat_id,prod_id,score,pref_prod
29762,9,3115,1.000000,335.0
29762,58,1335,1.000000,335.0
234894,58,1335,1.000000,335.0
413276,43,1388,1.000000,335.0
413276,58,335,1.000000,335.0
413276,73,26,1.000000,335.0
9280593,9,137,1.000000,335.0
9280593,58,1335,1.000000,335.0
9280593,74,160,1.000000,335.0
4554542,66,1612,0.166667,197.0
4554542,66,1406,0.166767,197.0
4554542,66,2021,1.000000,197.0"""
df = pd.read_csv(io.StringIO(data), sep=',')

Record = namedtuple('Generic', ['user_id', 'cat_id', 'prod_id', 'score', 'pref_prod'])

def map_to_record(row):
    return Record(row.user_id, row.cat_id, row.prod_id, row.score, row.pref_prod)

my_list = list(map(map_to_record, df.itertuples()))

def named_tuple_to_json(named_tuple):
    """
    Convert a list of named tuples to a json tree structure.
    """
    json_string = "records:["
    for record in named_tuple:
        json_string += "{"
        json_string += "'user_id': {},'cat_id': {},'prod_id': {},'score': {},'pref_prod': {},".format(
            record.user_id, record.cat_id, record.prod_id, record.score, record.pref_prod)
        json_string += "},"
    json_string += "]"
    return json_string

# convert the list of named tuples to a json tree structure
json_tree = named_tuple_to_json(my_list)
print(json_tree)
Output:
records:[{'user_id': 29762,'cat_id': 9,'prod_id': 3115,'score': 1.0,'pref_prod': 335.0,},{'user_id': 29762,'cat_id': 58,'prod_id': 1335,'score': 1.0,'pref_prod': 335.0,},{'user_id': 234894,'cat_id': 58,'prod_id': 1335,'score': 1.0,'pref_prod': 335.0,},{'user_id': 413276,'cat_id': 43,'prod_id': 1388,'score': 1.0,'pref_prod': 335.0,},{'user_id': 413276,'cat_id': 58,'prod_id': 335,'score': 1.0,'pref_prod': 335.0,},{'user_id': 413276,'cat_id': 73,'prod_id': 26,'score': 1.0,'pref_prod': 335.0,},{'user_id': 9280593,'cat_id': 9,'prod_id': 137,'score': 1.0,'pref_prod': 335.0,},{'user_id': 9280593,'cat_id': 58,'prod_id': 1335,'score': 1.0,'pref_prod': 335.0,},{'user_id': 9280593,'cat_id': 74,'prod_id': 160,'score': 1.0,'pref_prod': 335.0,},{'user_id': 4554542,'cat_id': 66,'prod_id': 1612,'score': 0.166667,'pref_prod': 197.0,},{'user_id': 4554542,'cat_id': 66,'prod_id': 1406,'score': 0.166767,'pref_prod': 197.0,},{'user_id': 4554542,'cat_id': 66,'prod_id': 2021,'score': 1.0,'pref_prod': 197.0,},]
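Note that the string built this way is not valid JSON (single quotes, trailing commas). If parseable JSON output is the goal, one alternative sketch is to serialize the named tuples directly, assuming my_list is the list built above; default=float handles the numpy scalars that itertuples() produces:
import json

# each named tuple becomes a plain dict via _asdict(); default=float converts numpy scalars
json_tree = json.dumps({"records": [record._asdict() for record in my_list]},
                       indent=2, default=float)
print(json_tree)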

Python. Separating timestamp by minute and Sorting into a new list

Hey guys, let's say that I have a dictionary of food types:
food_types = {
    "pasta": [],
    "Seafood": [],
    "Chinese": []
}
Currently I have a list of timestamps that I append to each dictionary value, along with a cost:
import datetime
from dateutil import tz

def utc_to_local(utc_dt):
    return utc_dt.replace(tzinfo=tz.gettz('UTC')).astimezone(tz.gettz('America/New_York'))

def perdelta(start, end, delta):
    curr = start
    while curr < end:
        yield curr
        curr += delta

start_time = utc_to_local(datetime.datetime.utcnow() - datetime.timedelta(minutes=120)).replace(second=0, microsecond=0)
end_time = utc_to_local(datetime.datetime.utcnow() - datetime.timedelta(minutes=5)).replace(second=0, microsecond=0)

for key, value in food_types.items():
    for result in perdelta(start_time, end_time, datetime.timedelta(minutes=1)):
        food_types[key].append({'timestamp': str(result), 'cost': 0})
This results in the following structure:
{
"pasta": [
{
"timestamp": "2021-02-10 13:06:00-05:00",
"cost": 0
},
{
"timestamp": "2021-02-10 13:07:00-05:00",
"cost": 0
},
{
"timestamp": "2021-02-10 13:08:00-05:00",
"cost": 0
},
{
"timestamp": "2021-02-10 13:09:00-05:00",
"cost": 0
},
{
"timestamp": "2021-02-10 13:10:00-05:00",
"cost": 0
}
],
"Seafood": [
{
"timestamp": "2021-02-10 13:06:00-05:00",
"cost": 0
},
{
"timestamp": "2021-02-10 13:07:00-05:00",
"cost": 0
},
{
"timestamp": "2021-02-10 13:08:00-05:00",
"cost": 0
},
{
"timestamp": "2021-02-10 13:09:00-05:00",
"cost": 0
},
{
"timestamp": "2021-02-10 13:10:00-05:00",
"cost": 0
}
],
"Chinese": [
{
"timestamp": "2021-02-10 13:06:00-05:00",
"cost": 0
},
{
"timestamp": "2021-02-10 13:07:00-05:00",
"cost": 0
},
{
"timestamp": "2021-02-10 13:08:00-05:00",
"cost": 0
},
{
"timestamp": "2021-02-10 13:09:00-05:00",
"cost": 0
},
{
"timestamp": "2021-02-10 13:10:00-05:00",
"cost": 0
}
]
}
What I want to achieve is to group the entries for each minute into their own collection, across all food types, for example:
[
  {'pasta': {"timestamp": "2021-02-10 13:06:00-05:00", "cost": 0},
   'seafood': {"timestamp": "2021-02-10 13:06:00-05:00", "cost": 0},
   'chinese': {"timestamp": "2021-02-10 13:06:00-05:00", "cost": 0}},
  {'pasta': {"timestamp": "2021-02-10 13:07:00-05:00", "cost": 0},
   'seafood': {"timestamp": "2021-02-10 13:07:00-05:00", "cost": 0},
   'chinese': {"timestamp": "2021-02-10 13:07:00-05:00", "cost": 0}},
  {'pasta': {"timestamp": "2021-02-10 13:08:00-05:00", "cost": 0},
   'seafood': {"timestamp": "2021-02-10 13:08:00-05:00", "cost": 0},
   'chinese': {"timestamp": "2021-02-10 13:08:00-05:00", "cost": 0}},
  ...
]
and so on until the end of the timestamps. Is there a way I can achieve this?
This should help you sort each list by its timestamp:
x = sorted(x, key=lambda k: k["timestamp"])
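Sorting alone does not produce the per-minute grouping described in the question. A minimal sketch of one way to pivot the food_types dictionary into one entry per minute, assuming every food type has an entry for every timestamp:
from collections import defaultdict

# collect each food type's entry under its timestamp
by_minute = defaultdict(dict)
for food, entries in food_types.items():
    for entry in entries:
        by_minute[entry['timestamp']][food] = entry

# one dict per minute, ordered by timestamp
grouped = [by_minute[ts] for ts in sorted(by_minute)]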

Illegal_argument_exception when importing Twitter into Elasticsearch

I am new to Elasticsearch and am attempting to do some data analysis of Twitter data by importing it into Elasticsearch and running Kibana on it. I'm getting stuck when importing Twitter data into Elasticsearch. Any help is appreciated!
Here's a sample working program that produces the error.
import json
from elasticsearch import Elasticsearch
es = Elasticsearch()
data = json.loads(open("data.json").read())
es.index(index='tweets5', doc_type='tweets', id=data['id'], body=data)
Here's the error:
Traceback (most recent call last):
File "elasticsearch_import_test.py", line 5, in <module>
es.index(index='tweets5', doc_type='tweets', id=data['id'], body=data)
File "/usr/local/lib/python2.7/site-packages/elasticsearch/client/utils.py", line 69, in _wrapped
return func(*args, params=params, **kwargs)
File "/usr/local/lib/python2.7/site-packages/elasticsearch/client/__init__.py", line 279, in index
_make_path(index, doc_type, id), params=params, body=body)
File "/usr/local/lib/python2.7/site-packages/elasticsearch/transport.py", line 329, in perform_request
status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
File "/usr/local/lib/python2.7/site-packages/elasticsearch/connection/http_urllib3.py", line 109, in perform_request
self._raise_error(response.status, raw_data)
File "/usr/local/lib/python2.7/site-packages/elasticsearch/connection/base.py", line 108, in _raise_error
raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch.exceptions.RequestError: TransportError(400, u'illegal_argument_exception', u'[Raza][127.0.0.1:9300][indices:data/write/index[p]]')
Here's an example Twitter JSON file (data.json)
{
"_id": {
"$oid": "570597358c68d71c16b3b722"
},
"contributors": null,
"coordinates": null,
"created_at": "Wed Apr 06 23:09:41 +0000 2016",
"entities": {
"hashtags": [
{
"indices": [
68,
72
],
"text": "dnd"
},
{
"indices": [
73,
79
],
"text": "Nat20"
},
{
"indices": [
80,
93
],
"text": "CriticalRole"
},
{
"indices": [
94,
103
],
"text": "d20babes"
}
],
"media": [
{
"display_url": "pic.twitter.com/YQoxEuEAXV",
"expanded_url": "http://twitter.com/Zenttsilverwing/status/715953298076012545/photo/1",
"id": 715953292849754112,
"id_str": "715953292849754112",
"indices": [
104,
127
],
"media_url": "http://pbs.twimg.com/media/Ce-TugAUsAASZht.jpg",
"media_url_https": "https://pbs.twimg.com/media/Ce-TugAUsAASZht.jpg",
"sizes": {
"large": {
"h": 768,
"resize": "fit",
"w": 1024
},
"medium": {
"h": 450,
"resize": "fit",
"w": 600
},
"small": {
"h": 255,
"resize": "fit",
"w": 340
},
"thumb": {
"h": 150,
"resize": "crop",
"w": 150
}
},
"source_status_id": 715953298076012545,
"source_status_id_str": "715953298076012545",
"source_user_id": 2375847847,
"source_user_id_str": "2375847847",
"type": "photo",
"url": "https://shortened.url/YQoxEuEAXV"
}
],
"symbols": [],
"urls": [
{
"display_url": "darkcastlecollectibles.com",
"expanded_url": "http://www.darkcastlecollectibles.com/",
"indices": [
44,
67
],
"url": "https://shortened.url/SJgFTE0o8h"
}
],
"user_mentions": [
{
"id": 2375847847,
"id_str": "2375847847",
"indices": [
3,
19
],
"name": "Zack Chini",
"screen_name": "Zenttsilverwing"
}
]
},
"extended_entities": {
"media": [
{
"display_url": "pic.twitter.com/YQoxEuEAXV",
"expanded_url": "http://twitter.com/Zenttsilverwing/status/715953298076012545/photo/1",
"id": 715953292849754112,
"id_str": "715953292849754112",
"indices": [
104,
127
],
"media_url": "http://pbs.twimg.com/media/Ce-TugAUsAASZht.jpg",
"media_url_https": "https://pbs.twimg.com/media/Ce-TugAUsAASZht.jpg",
"sizes": {
"large": {
"h": 768,
"resize": "fit",
"w": 1024
},
"medium": {
"h": 450,
"resize": "fit",
"w": 600
},
"small": {
"h": 255,
"resize": "fit",
"w": 340
},
"thumb": {
"h": 150,
"resize": "crop",
"w": 150
}
},
"source_status_id": 715953298076012545,
"source_status_id_str": "715953298076012545",
"source_user_id": 2375847847,
"source_user_id_str": "2375847847",
"type": "photo",
"url": "https://shortened.url/YQoxEuEAXV"
},
{
"display_url": "pic.twitter.com/YQoxEuEAXV",
"expanded_url": "http://twitter.com/Zenttsilverwing/status/715953298076012545/photo/1",
"id": 715953295727009793,
"id_str": "715953295727009793",
"indices": [
104,
127
],
"media_url": "http://pbs.twimg.com/media/Ce-TuquUIAEsVn9.jpg",
"media_url_https": "https://pbs.twimg.com/media/Ce-TuquUIAEsVn9.jpg",
"sizes": {
"large": {
"h": 768,
"resize": "fit",
"w": 1024
},
"medium": {
"h": 450,
"resize": "fit",
"w": 600
},
"small": {
"h": 255,
"resize": "fit",
"w": 340
},
"thumb": {
"h": 150,
"resize": "crop",
"w": 150
}
},
"source_status_id": 715953298076012545,
"source_status_id_str": "715953298076012545",
"source_user_id": 2375847847,
"source_user_id_str": "2375847847",
"type": "photo",
"url": "https://shortened.url/YQoxEuEAXV"
}
]
},
"favorite_count": 0,
"favorited": false,
"filter_level": "low",
"geo": null,
"id": 717851801417031680,
"id_str": "717851801417031680",
"in_reply_to_screen_name": null,
"in_reply_to_status_id": null,
"in_reply_to_status_id_str": null,
"in_reply_to_user_id": null,
"in_reply_to_user_id_str": null,
"is_quote_status": false,
"lang": "en",
"place": null,
"possibly_sensitive": false,
"retweet_count": 0,
"retweeted": false,
"retweeted_status": {
"contributors": null,
"coordinates": null,
"created_at": "Fri Apr 01 17:25:42 +0000 2016",
"entities": {
"hashtags": [
{
"indices": [
47,
51
],
"text": "dnd"
},
{
"indices": [
52,
58
],
"text": "Nat20"
},
{
"indices": [
59,
72
],
"text": "CriticalRole"
},
{
"indices": [
73,
82
],
"text": "d20babes"
}
],
"media": [
{
"display_url": "pic.twitter.com/YQoxEuEAXV",
"expanded_url": "http://twitter.com/Zenttsilverwing/status/715953298076012545/photo/1",
"id": 715953292849754112,
"id_str": "715953292849754112",
"indices": [
83,
106
],
"media_url": "http://pbs.twimg.com/media/Ce-TugAUsAASZht.jpg",
"media_url_https": "https://pbs.twimg.com/media/Ce-TugAUsAASZht.jpg",
"sizes": {
"large": {
"h": 768,
"resize": "fit",
"w": 1024
},
"medium": {
"h": 450,
"resize": "fit",
"w": 600
},
"small": {
"h": 255,
"resize": "fit",
"w": 340
},
"thumb": {
"h": 150,
"resize": "crop",
"w": 150
}
},
"type": "photo",
"url": "https://shortened.url/YQoxEuEAXV"
}
],
"symbols": [],
"urls": [
{
"display_url": "darkcastlecollectibles.com",
"expanded_url": "http://www.darkcastlecollectibles.com/",
"indices": [
23,
46
],
"url": "https://shortened.url/SJgFTE0o8h"
}
],
"user_mentions": []
},
"extended_entities": {
"media": [
{
"display_url": "pic.twitter.com/YQoxEuEAXV",
"expanded_url": "http://twitter.com/Zenttsilverwing/status/715953298076012545/photo/1",
"id": 715953292849754112,
"id_str": "715953292849754112",
"indices": [
83,
106
],
"media_url": "http://pbs.twimg.com/media/Ce-TugAUsAASZht.jpg",
"media_url_https": "https://pbs.twimg.com/media/Ce-TugAUsAASZht.jpg",
"sizes": {
"large": {
"h": 768,
"resize": "fit",
"w": 1024
},
"medium": {
"h": 450,
"resize": "fit",
"w": 600
},
"small": {
"h": 255,
"resize": "fit",
"w": 340
},
"thumb": {
"h": 150,
"resize": "crop",
"w": 150
}
},
"type": "photo",
"url": "https://shortened.url/YQoxEuEAXV"
},
{
"display_url": "pic.twitter.com/YQoxEuEAXV",
"expanded_url": "http://twitter.com/Zenttsilverwing/status/715953298076012545/photo/1",
"id": 715953295727009793,
"id_str": "715953295727009793",
"indices": [
83,
106
],
"media_url": "http://pbs.twimg.com/media/Ce-TuquUIAEsVn9.jpg",
"media_url_https": "https://pbs.twimg.com/media/Ce-TuquUIAEsVn9.jpg",
"sizes": {
"large": {
"h": 768,
"resize": "fit",
"w": 1024
},
"medium": {
"h": 450,
"resize": "fit",
"w": 600
},
"small": {
"h": 255,
"resize": "fit",
"w": 340
},
"thumb": {
"h": 150,
"resize": "crop",
"w": 150
}
},
"type": "photo",
"url": "https://shortened.url/YQoxEuEAXV"
}
]
},
"favorite_count": 5,
"favorited": false,
"filter_level": "low",
"geo": null,
"id": 715953298076012545,
"id_str": "715953298076012545",
"in_reply_to_screen_name": null,
"in_reply_to_status_id": null,
"in_reply_to_status_id_str": null,
"in_reply_to_user_id": null,
"in_reply_to_user_id_str": null,
"is_quote_status": false,
"lang": "en",
"place": null,
"possibly_sensitive": false,
"retweet_count": 1,
"retweeted": false,
"source": "Twitter Web Client",
"text": "coins came in!! Thanks https://shortened.url/SJgFTE0o8h #dnd #Nat20 #CriticalRole #d20babes https://shortened.url/YQoxEuEAXV",
"truncated": false,
"user": {
"contributors_enabled": false,
"created_at": "Thu Mar 06 19:59:14 +0000 2014",
"default_profile": true,
"default_profile_image": false,
"description": "DM Geek Critter Con-man. I am here to like your art ^.^",
"favourites_count": 4990,
"follow_request_sent": null,
"followers_count": 57,
"following": null,
"friends_count": 183,
"geo_enabled": false,
"id": 2375847847,
"id_str": "2375847847",
"is_translator": false,
"lang": "en",
"listed_count": 7,
"location": "Flower Mound, TX",
"name": "Zack Chini",
"notifications": null,
"profile_background_color": "C0DEED",
"profile_background_image_url": "http://abs.twimg.com/images/themes/theme1/bg.png",
"profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme1/bg.png",
"profile_background_tile": false,
"profile_banner_url": "https://pbs.twimg.com/profile_banners/2375847847/1430928759",
"profile_image_url": "http://pbs.twimg.com/profile_images/708816622358663168/mNF4Ysr5_normal.jpg",
"profile_image_url_https": "https://pbs.twimg.com/profile_images/708816622358663168/mNF4Ysr5_normal.jpg",
"profile_link_color": "0084B4",
"profile_sidebar_border_color": "C0DEED",
"profile_sidebar_fill_color": "DDEEF6",
"profile_text_color": "333333",
"profile_use_background_image": true,
"protected": false,
"screen_name": "Zenttsilverwing",
"statuses_count": 551,
"time_zone": null,
"url": null,
"utc_offset": null,
"verified": false
}
},
"source": "Twitter Web Client",
"text": "RT #Zenttsilverwing: coins came in!! Thanks https://shortened.url/SJgFTE0o8h #dnd #Nat20 #CriticalRole #d20babes https://shortened.url/YQoxEuEAXV",
"timestamp_ms": "1459984181156",
"truncated": false,
"user": {
"contributors_enabled": false,
"created_at": "Tue Feb 10 04:31:18 +0000 2009",
"default_profile": false,
"default_profile_image": false,
"description": "I use Twitter to primarily retweet Critter artwork of Critical Role and their own creations. I maintain a list of all the Critter artists I've come across.",
"favourites_count": 17586,
"follow_request_sent": null,
"followers_count": 318,
"following": null,
"friends_count": 651,
"geo_enabled": true,
"id": 20491914,
"id_str": "20491914",
"is_translator": false,
"lang": "en",
"listed_count": 33,
"location": "SanDiego, CA",
"name": "UnknownOutrider",
"notifications": null,
"profile_background_color": "EDECE9",
"profile_background_image_url": "http://abs.twimg.com/images/themes/theme3/bg.gif",
"profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme3/bg.gif",
"profile_background_tile": false,
"profile_image_url": "http://pbs.twimg.com/profile_images/224346493/cartoon_dragon_tattoo_designs_normal.jpg",
"profile_image_url_https": "https://pbs.twimg.com/profile_images/224346493/cartoon_dragon_tattoo_designs_normal.jpg",
"profile_link_color": "088253",
"profile_sidebar_border_color": "D3D2CF",
"profile_sidebar_fill_color": "E3E2DE",
"profile_text_color": "634047",
"profile_use_background_image": true,
"protected": false,
"screen_name": "UnknownOutrider",
"statuses_count": 12760,
"time_zone": "Pacific Time (US & Canada)",
"url": null,
"utc_offset": -25200,
"verified": false
}
}
The reason that doesn't work is that you are trying to index a document with a field named _id, which already exists as a built-in metadata field. So delete that field or rename it:
import json
from elasticsearch import Elasticsearch
es = Elasticsearch()
data = json.loads(open("data.json").read())
# data['id_'] = data['_id']  <= you can rename _id to id_ instead of deleting it
del data['_id']
es.index(index='tweets5', doc_type='tweets', id=data['id'], body=data)
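If other exports contain more MongoDB artifacts than just _id, stripping every top-level field that starts with an underscore before indexing is a reasonable generalization; a sketch, assuming the same data.json as above:
import json
from elasticsearch import Elasticsearch

es = Elasticsearch()
data = json.loads(open("data.json").read())

# drop any top-level keys that collide with Elasticsearch metadata fields (e.g. _id)
data = {key: value for key, value in data.items() if not key.startswith('_')}

es.index(index='tweets5', doc_type='tweets', id=data['id'], body=data)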

Data structure manipulation with Pandas

I have a list of dicts as follows:
[
{
"status": "BV",
"max_total_duration": null,
"min_total_duration": null,
"75th_percentile": 420,
"median": 240.0,
"25th_percentile": 180,
"avg_total_duration": null
},
{
"status": "CORR",
"max_total_duration": null,
"min_total_duration": null,
"75th_percentile": 1380,
"median": 720.0,
"25th_percentile": 420,
"avg_total_duration": null
},
{
"status": "FILL",
"max_total_duration": null,
"min_total_duration": null,
"75th_percentile": 1500,
"median": 840.0,
"25th_percentile": 480,
"avg_total_duration": null
},
{
"status": "INIT",
"max_total_duration": 11280,
"min_total_duration": 120,
"75th_percentile": 720,
"median": 360.0,
"25th_percentile": 180,
"avg_total_duration": 2061
},
]
As is evident, max_total_duration, min_total_duration and avg_total_duration are null for every status except "INIT". What I want is to remove all the null entries, and for "INIT", where max_total_duration, min_total_duration and avg_total_duration have real values, add them as a new dictionary in the list, as follows:
[
{
"status": "BV",
"75th_percentile": 420,
"median": 240.0,
"25th_percentile": 180,
},
{
"status": "CORR",
"75th_percentile": 1380,
"median": 720.0,
"25th_percentile": 420,
},
{
"status": "FILL",
"75th_percentile": 1500,
"median": 840.0,
"25th_percentile": 480,
},
{
"status": "INIT",
"75th_percentile": 720,
"median": 360.0,
"25th_percentile": 180,
},
{
"max_total_duration": 11280,
"min_total_duration": 120,
"avg_total_duration": 2061,
}
]
I have tried doing this by iterating over the list, and it is computationally very expensive. Is there an easier way of doing this with pandas?
data =[
{
"status": "BV",
"max_total_duration": None,
"min_total_duration": None,
"75th_percentile": 420,
"median": 240.0,
"25th_percentile": 180,
"avg_total_duration": None
},
{
"status": "CORR",
"max_total_duration": None,
"min_total_duration": None,
"75th_percentile": 1380,
"median": 720.0,
"25th_percentile": 420,
"avg_total_duration": None
},
{
"status": "FILL",
"max_total_duration": None,
"min_total_duration": None,
"75th_percentile": 1500,
"median": 840.0,
"25th_percentile": 480,
"avg_total_duration": None
},
{
"status": "INIT",
"max_total_duration": 11280,
"min_total_duration": 120,
"75th_percentile": 720,
"median": 360.0,
"25th_percentile": 180,
"avg_total_duration": 2061
},
]
data = [{key: val for key, val in d.items() if val is not None} for d in data]

final = []
for d in data:
    status = d.get('status')
    if status == 'INIT':
        final.append({'max_total_duration': d.get('max_total_duration'),
                      'min_total_duration': d.get('min_total_duration'),
                      'avg_total_duration': d.get('avg_total_duration')})
        del d['max_total_duration']
        del d['min_total_duration']
        del d['avg_total_duration']
    final.append(d)
print(final)
import pandas as pd
# Substituting your 'null' for 'None'
df = pd.DataFrame(data)
>>> df
25th_percentile 75th_percentile avg_total_duration max_total_duration \
0 180 420 NaN NaN
1 420 1380 NaN NaN
2 480 1500 NaN NaN
3 180 720 2061 11280
median min_total_duration status
0 240 NaN BV
1 720 NaN CORR
2 840 NaN FILL
3 360 120 INIT
Grabbing the percentiles part:
df_percentiles = df[['status','25th_percentile','median','75th_percentile']]
>>> df_percentiles
status 25th_percentile median 75th_percentile
0 BV 180 240 420
1 CORR 420 720 1380
2 FILL 480 840 1500
3 INIT 180 360 720
Grabbing the durations part:
df_durations = df[df['status'] == 'INIT'][['max_total_duration','min_total_duration','avg_total_duration']]
>>> df_durations
max_total_duration min_total_duration avg_total_duration
3 11280 120 2061
Combine both parts into a single list:
summary = list(df_percentiles.T.to_dict().values())
summary.extend(df_durations.T.to_dict().values())
>>> summary
[{'25th_percentile': 180,
'75th_percentile': 420,
'median': 240.0,
'status': 'BV'},
{'25th_percentile': 420,
'75th_percentile': 1380,
'median': 720.0,
'status': 'CORR'},
{'25th_percentile': 480,
'75th_percentile': 1500,
'median': 840.0,
'status': 'FILL'},
{'25th_percentile': 180,
'75th_percentile': 720,
'median': 360.0,
'status': 'INIT'},
{'avg_total_duration': 2061.0,
'max_total_duration': 11280.0,
'min_total_duration': 120.0}]
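For reference, a more compact variant of the same split, assuming data is the list of dicts from the question (with None in place of null): drop the duration columns, strip nulls row by row, and collect the INIT durations as a trailing dictionary.
import pandas as pd

df = pd.DataFrame(data)
duration_cols = ['max_total_duration', 'min_total_duration', 'avg_total_duration']

# per-status rows without the (mostly null) duration columns
summary = [row.dropna().to_dict() for _, row in df.drop(columns=duration_cols).iterrows()]

# the INIT durations appended as their own dictionary
summary.extend(df.loc[df['status'] == 'INIT', duration_cols].to_dict(orient='records'))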
