Let's say I have some JSON stored in postgresql like so:
{"the": [0, 4], "time": [1, 5], "is": [2, 6], "here": [3], "now": [7]}
This is an inverted index showing the position of each word, which spells out
the time is here the time is now
I want to put the text from the second example in a separate column. I can convert the inverted index with Python like so:
def convert_index(inverted_index):
    unraveled = {}
    for key, values in inverted_index.items():
        for value in values:
            unraveled[value] = key
    sorted_unraveled = dict(sorted(unraveled.items()))
    result = " ".join(sorted_unraveled.values())
    result = result.replace("\n", "")
    return result
But I would love to do this within postgresql so I am not reading text from one column, running a script somewhere else, then adding text in a separate column. Anybody know of a way to go about that? Can I use some kind of script?
You need to get the keys with jsonb_each(), unpack the arrays with jsonb_array_elements(), and then aggregate the keys in position order:
with my_table(json_col) as (
values
('{"the": [0, 4], "time": [1, 5], "is": [2, 6], "here": [3], "now": [7]}'::jsonb)
)
select string_agg(key, ' ' order by ord::int)
from my_table
cross join jsonb_each(json_col)
cross join jsonb_array_elements(value) as e(ord)
Test it in Db<>fiddle.
I have a dataframe and I am trying to find percentiles of datetimes using the percentileofscore function from scipy.stats.
Dataframe:
student, attempts, time
student 1,14, 9/3/2019 12:32:32 AM
student 2,2, 9/3/2019 9:37:14 PM
student 3, 5
student 4, 16, 9/5/2019 8:58:14 PM
student2Info = [14, 4, Timestamp('2019-09-04 00:26:36')]
data['time'] = pd.to_datetime(data['time_0001'], errors='coerce')
perc1_first = stats.percentileofscore(data['time'].notnull(), student2Info[2], 'rank')
where student2Info[2] holds the datetime for a particular student. When I try and do this I get the error:
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Any ideas on how I can get the percentile to calculate correctly even when there are missing times in the columns?
You need to convert the Timestamps into numbers that percentileofscore can work with. Also, notnull() returns a boolean mask that you can use to filter your DataFrame; it does not return the filtered data itself, so I've changed that part for you. Here is a working example:
import pandas as pd
import scipy.stats as stats
data = pd.DataFrame.from_dict({
"student": [1, 2, 3, 4],
"attempts": [14, 2, 5, 16],
"time_0001": [
"9/3/2019 12:32:32 AM",
"9/3/2019 9:37:14 PM",
"",
"9/5/2019 8:58:14 PM"
]
})
student2Info = [14, 4, pd.Timestamp('2019-09-04 00:26:36')]
data['time'] = pd.to_datetime(data['time_0001'], errors='coerce')
perc1_first = stats.percentileofscore(data[data['time'].notnull()].time.transform(pd.Timestamp.toordinal), student2Info[2].toordinal(), 'rank')
print(perc1_first) #-> 66.66666666666667
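Note that toordinal() keeps only the date, so the time of day is dropped before the comparison. If that matters, here is a minimal standard-library sketch that compares full datetimes instead. The percentile_rank helper is hypothetical (it counts the values strictly below the score plus half of any ties, which is not necessarily identical to scipy's 'rank' kind), and I am assuming the strings are month-first and that an empty string means a missing time:

```python
from datetime import datetime


def percentile_rank(values, score):
    """Percentage of values below score, counting ties as half (hypothetical helper)."""
    below = sum(v < score for v in values)
    equal = sum(v == score for v in values)
    return 100.0 * (below + 0.5 * equal) / len(values)


raw_times = [
    "9/3/2019 12:32:32 AM",
    "9/3/2019 9:37:14 PM",
    "",  # student 3 has no recorded time
    "9/5/2019 8:58:14 PM",
]

# Assumption: month/day/year order; empty strings are skipped as missing values.
times = [datetime.strptime(t, "%m/%d/%Y %I:%M:%S %p") for t in raw_times if t]

score = datetime(2019, 9, 4, 0, 26, 36)
result = percentile_rank(times, score)
print(result)
```

Two of the three recorded times fall before the score, so this prints roughly 66.67, the same as the example above.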
I have a dataframe (df) whose column names are ["Home", "Season", "Date", "Consumption", "Temp"]. What I'm trying to do is perform calculations on this dataframe by "Home", "Season", "Temp" and "Consumption".
In[56]: df['Home'].unique().tolist()
Out[56]: [1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
In[57]: df['Season'].unique().tolist()
Out[57]: ['Spring', 'Summer', 'Autumn', 'Winter']
Here is what is done so far:
series = {}
for i in df['Home'].unique().tolist():
    for j in df["Season"].unique().tolist():
        series[i, j] = df[(df["Home"] == i) & (df["Consumption"] >= 0) & (df["Season"] == j)]
for key, value in series.items():
    value["Corr"] = value["Temp"].corr(value["Consumption"])
Here is the dictionary of dataframes named "Series" as an output of loop.
What I expected from last loop is to give me a dictionary of dataframes with a new column i.e. "Corr" added that would have correlated values for "Temp" and "Consumption", but instead it gives a single dataframe for last home in the iteration i.e. 23.
I simply want to add a sixth column named "Corr" to every dataframe in the dictionary, holding the correlation between "Temp" and "Consumption". Can you help me with the above? I'm somehow missing the use of keys in the last loop. Thanks in advance!
All of those loops are entirely unnecessary! Simply call:
df.groupby(['Home', 'Season'])[['Consumption', 'Temp']].corr()
(thanks #jezrael for the correction)
One of the answers to "How to find the correlation between a group of values in a pandas dataframe column" helped, and it avoids all the unnecessary loops. Thanks #jezrael and #JoshFriedlander for suggesting the groupby method. Upvote (y).
Posting solution here:
df = df[df["Consumption"] >= 0]
corrs = (df[["Home", "Season", "Temp"]]).groupby(
["Home", "Season"]).corrwith(
df["Consumption"]).rename(
columns = {"Temp" : "Corr"}).reset_index()
df = pd.merge(df, corrs, how = "left", on = ["Home", "Season"])
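To make it explicit what the groupby-based answers compute, here is a plain-Python sketch of the same idea: one Pearson correlation between "Temp" and "Consumption" per (Home, Season) pair. The rows are made-up sample values and the pearson helper is hypothetical; in practice the pandas one-liners above are the better choice.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up rows: (Home, Season, Temp, Consumption)
rows = [
    (1, "Spring", 10.0, 5.0),
    (1, "Spring", 12.0, 4.0),
    (1, "Spring", 14.0, 3.0),
    (1, "Summer", 20.0, 1.0),
    (1, "Summer", 25.0, 2.0),
    (1, "Summer", 30.0, 4.0),
]

# Group the (Temp, Consumption) pairs by (Home, Season), keeping Consumption >= 0.
groups = defaultdict(list)
for home, season, temp, consumption in rows:
    if consumption >= 0:
        groups[home, season].append((temp, consumption))

# One correlation value per group, like a "Corr" value shared by the group's rows.
corr_by_group = {key: pearson(*zip(*pairs)) for key, pairs in groups.items()}
print(corr_by_group)
```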
I am trying to drop multiple rows from my data.
I can drop rows using:
dt=dt.drop([40,41,42,43,44,45])
But I was wondering if there is a simpler way. I tried:
dt=dt.drop([40:45])
But sadly it did not work.
I would recommend np.r_
df.drop(np.r_[40:50+1])
In case you want to drop two ranges at the same time
np.r_[40:50+1,1:4+1]
Out[719]: array([40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 1, 2, 3, 4])
Assuming you want to drop a range of positions:
df.drop(df.index[40: 46])
This doesn't assume the indices are integers.
You can use:
dt = dt.drop(range(40,46))
or
dt.drop(range(40,46), inplace=True)
You could generate the list based on a range:
dt=dt.drop([x for x in range(40, 46)])
Or just:
dt=dt.drop(range(40, 46))
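A small aside on why these answers write 46 (or 50+1) rather than the last label itself: both range() and slices exclude their upper bound. A quick sketch, with start and stop as illustrative names:

```python
start, stop = 40, 45

# range() stops one short of its second argument, so add 1 to include `stop`.
labels = list(range(start, stop + 1))
print(labels)

# Slicing follows the same half-open convention.
first_hundred = list(range(100))
positions = first_hundred[start:stop + 1]
print(positions == labels)
```

Keep in mind that drop(range(40, 46)) removes rows by index label, while df.index[40:46] selects by position; the two only coincide when the index is the default 0, 1, 2, ...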
Please could I solicit some general advice regarding Python lists. I know I shouldn't ask 'open' questions on here but I am worried about setting off on completely the wrong path.
My problem is that I have .csv files that are approximately 600,000 lines long each. Each row of the .csv has 6 fields, of which the first field is a date-time stamp in the format DD/MM/YYYY HH:MM:SS. The next two fields are blank and the last three fields contain float and integer values, so for example:
23/05/2017 16:42:17, , , 1.25545, 1.74733, 12
23/05/2017 16:42:20, , , 1.93741, 1.52387, 14
23/05/2017 16:42:23, , , 1.54875, 1.46258, 11
etc
No two values in column 1 (date-time stamp) will ever be the same.
I need to write a program that will do a few basic operations with the data, such as:
read all of the data into a dictionary, list, set (?) etc as appropriate.
search through the date time stamp column for a particular value.
read through the list and do basic calculations on the floats in columns 4 and 5.
write a new list based on the searches/calculations.
My question is - how should I 'handle' the data and am I likely to run into problems due to the length of the dataset?
For example, should I import all of the data into a list, where each element of the list is a sublist of each row's data? E.g:
[[23/05/2017 16:42:17,'','', 1.25545, 1.74733, 12],[23/05/2017 16:42:20,'','', 1.93741, 1.52387, 14], ...]
Or would it be better to make each date-time stamp the 'key' in a dictionary and make the dictionary 'value' a list with all the other values, e.g:
{'23/05/2017 16:42:17': [ , , 1.25545, 1.74733, 12], ...}
etc
If I use the list approach, is there a way to get Python to 'search' in only the first column for a particular time stamp rather than making it search through 600,000 rows times 6 columns when we know that only the first column contains timestamps?
I apologize if my query is a little vague, but would appreciate any guidance that anyone can offer.
600,000 lines aren't that many. Your script should run fine with either a list or a dict.
As a test, let's use:
data = [["2017-05-02 17:28:24", 0.85260, 1.16218, 7],
["2017-05-04 05:40:07", 0.72118, 0.47710, 15],
["2017-05-07 19:27:53", 1.79476, 0.47496, 14],
["2017-05-09 01:57:10", 0.44123, 0.13711, 16],
["2017-05-11 07:22:57", 0.17481, 0.69468, 0],
["2017-05-12 10:11:01", 0.27553, 0.47834, 4],
["2017-05-15 05:20:36", 0.01719, 0.51249, 7],
["2017-05-17 14:01:13", 0.35977, 0.50052, 7],
["2017-05-17 22:05:33", 1.68628, 1.90881, 13],
["2017-05-18 14:44:14", 0.32217, 0.96715, 14],
["2017-05-18 20:24:23", 0.90819, 0.36773, 5],
["2017-05-21 12:15:20", 0.49456, 1.12508, 5],
["2017-05-22 07:46:18", 0.59015, 1.04352, 6],
["2017-05-26 01:49:38", 0.44455, 0.26669, 13],
["2017-05-26 18:55:24", 1.33678, 1.24181, 7]]
dict
If you're looking for exact timestamps, a lookup will be much faster with a dict than with a list. You have to know exactly what you're looking for though: "23/05/2017 16:42:17" has a completely different hash than "23/05/2017 16:42:18".
data_as_dict = {l[0]: l[1:] for l in data}
print(data_as_dict)
# {'2017-05-21 12:15:20': [0.49456, 1.12508, 5], '2017-05-18 14:44:14': [0.32217, 0.96715, 14], '2017-05-04 05:40:07': [0.72118, 0.4771, 15], '2017-05-26 01:49:38': [0.44455, 0.26669, 13], '2017-05-17 14:01:13': [0.35977, 0.50052, 7], '2017-05-15 05:20:36': [0.01719, 0.51249, 7], '2017-05-26 18:55:24': [1.33678, 1.24181, 7], '2017-05-07 19:27:53': [1.79476, 0.47496, 14], '2017-05-17 22:05:33': [1.68628, 1.90881, 13], '2017-05-02 17:28:24': [0.8526, 1.16218, 7], '2017-05-22 07:46:18': [0.59015, 1.04352, 6], '2017-05-11 07:22:57': [0.17481, 0.69468, 0], '2017-05-18 20:24:23': [0.90819, 0.36773, 5], '2017-05-12 10:11:01': [0.27553, 0.47834, 4], '2017-05-09 01:57:10': [0.44123, 0.13711, 16]}
print(data_as_dict.get('2017-05-17 14:01:13'))
# [0.35977, 0.50052, 7]
print(data_as_dict.get('2017-05-17 14:01:10'))
# None
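Note that your DD/MM/YYYY HH:MM:SS format isn't very convenient: sorting the cells lexicographically won't sort them by datetime. You'd need to parse them with datetime.strptime() first (the test data above is written as YYYY-MM-DD, hence the '%Y-%m-%d %H:%M:%S' format string here):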
from datetime import datetime
data_as_dict = {datetime.strptime(l[0], '%Y-%m-%d %H:%M:%S'): l[1:] for l in data}
print(data_as_dict.get(datetime(2017,5,17,14,1,13)))
# [0.35977, 0.50052, 7]
print(data_as_dict.get(datetime(2017,5,17,14,1,10)))
# None
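For the format in the question itself, the same idea could look like the sketch below. It reads the three sample rows from the question through the csv module and keys a dict on parsed datetimes. The io.StringIO object stands in for a real file, and the names sample and records are only illustrative; I am assuming the two blank fields can simply be dropped.

```python
import csv
import io
from datetime import datetime

# Stand-in for open("yourfile.csv", newline=""); rows copied from the question.
sample = io.StringIO(
    "23/05/2017 16:42:17, , , 1.25545, 1.74733, 12\n"
    "23/05/2017 16:42:20, , , 1.93741, 1.52387, 14\n"
    "23/05/2017 16:42:23, , , 1.54875, 1.46258, 11\n"
)

records = {}
for row in csv.reader(sample):
    # Day comes first in this format, so use %d/%m/%Y.
    stamp = datetime.strptime(row[0].strip(), "%d/%m/%Y %H:%M:%S")
    # Skip the two blank fields; float() and int() tolerate the leading spaces.
    records[stamp] = [float(row[3]), float(row[4]), int(row[5])]

print(records.get(datetime(2017, 5, 23, 16, 42, 20)))
print(records.get(datetime(2017, 5, 23, 16, 42, 21)))
```

The first lookup returns the values for that row and the second returns None, since there is no row with that exact timestamp.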
list with binary search
If you're looking for timestamp ranges, a dict won't help you much. A binary search (e.g. with bisect) on a sorted list of timestamps should be very fast.
import bisect
timestamps = [datetime.strptime(l[0], '%Y-%m-%d %H:%M:%S') for l in data]
i = bisect.bisect(timestamps, datetime(2017,5,17,14,1,10))
print(data[i-1])
# ['2017-05-15 05:20:36', 0.01719, 0.51249, 7]
print(data[i])
# ['2017-05-17 14:01:13', 0.35977, 0.50052, 7]
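To pull out every row inside a time window rather than just the neighbours of one timestamp, you could combine bisect_left and bisect_right. A minimal sketch, where rows_between is a hypothetical helper and the rows are a subset of the test data above; it assumes the timestamps are already sorted:

```python
import bisect
from datetime import datetime


def rows_between(timestamps, rows, start, end):
    """Return the rows whose timestamp lies in [start, end]; timestamps must be sorted."""
    lo = bisect.bisect_left(timestamps, start)   # first index with timestamp >= start
    hi = bisect.bisect_right(timestamps, end)    # first index with timestamp > end
    return rows[lo:hi]


rows = [
    ["2017-05-15 05:20:36", 0.01719, 0.51249, 7],
    ["2017-05-17 14:01:13", 0.35977, 0.50052, 7],
    ["2017-05-17 22:05:33", 1.68628, 1.90881, 13],
    ["2017-05-18 14:44:14", 0.32217, 0.96715, 14],
]
timestamps = [datetime.strptime(r[0], "%Y-%m-%d %H:%M:%S") for r in rows]

# Everything recorded on 17 May 2017.
selected = rows_between(
    timestamps, rows,
    datetime(2017, 5, 17), datetime(2017, 5, 17, 23, 59, 59),
)
for row in selected:
    print(row)
```

Each query costs two binary searches plus the size of the result, so it stays fast even with 600,000 rows.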
Database
Before reinventing the wheel, you might want to dump all your CSVs into a small database (sqlite, Postgresql, ...) and use the corresponding queries.
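As a rough sketch of the database route, using the sqlite3 module from the standard library with an in-memory database: the table name readings and the column names stamp, a, b and n are made up for illustration. Storing the timestamps as YYYY-MM-DD HH:MM:SS text means string comparison matches chronological order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path instead to keep the data
conn.execute(
    "CREATE TABLE readings (stamp TEXT PRIMARY KEY, a REAL, b REAL, n INTEGER)"
)

rows = [
    ("2017-05-15 05:20:36", 0.01719, 0.51249, 7),
    ("2017-05-17 14:01:13", 0.35977, 0.50052, 7),
    ("2017-05-17 22:05:33", 1.68628, 1.90881, 13),
    ("2017-05-18 14:44:14", 0.32217, 0.96715, 14),
]
conn.executemany("INSERT INTO readings VALUES (?, ?, ?, ?)", rows)

# Exact lookup; the primary key gives an indexed search.
exact = conn.execute(
    "SELECT a, b, n FROM readings WHERE stamp = ?",
    ("2017-05-17 14:01:13",),
).fetchone()
print(exact)

# Range query plus a basic calculation on one of the float columns.
count, mean_a = conn.execute(
    "SELECT COUNT(*), AVG(a) FROM readings WHERE stamp BETWEEN ? AND ?",
    ("2017-05-17 00:00:00", "2017-05-17 23:59:59"),
).fetchone()
print(count, mean_a)
```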
Pandas
If you don't want the added complexity of a database but are ready to invest some time learning a new syntax, you should use pandas.DataFrame. It does exactly what you want, and then some.