My question is the following: I do not know all the pandas methods very well and I think there is surely a more efficient way to do this. I have to load two tables from .csv files into a Postgres database. These tables are related to each other by an id, which serves as a foreign key and comes from the source data; however, I must relate them through a different id controlled by my own logic.
I explain it graphically in the following image:
I am trying to create a new Series based on the "another_id" I have, applying a function that loops through a DataFrame column to check whether it contains the other code and, if so, returns its id:
def check_foreign_key(id, df_ppal):
    if id:
        for i in df_ppal.index:
            if id == df_ppal.iloc[i]['another_id']:
                return df_ppal.iloc[i]['id']
dfs['id_fk'] = dfs['another_id'].apply(lambda id : check_foreign_key(id, df_ppal))
At this point I think this is not efficient, because I have to loop over the whole column to match the another_id and get the correct id that I need (shown in yellow in the picture).
So I could think about search algorithms to make the task more efficient, but I wonder whether pandas has a method that lets me do this faster in case there are many records.
I need a DataFrame like the following table, which has a new column "ID Principal" obtained by matching Another_code against a column of the other DataFrame.
ID  ID Principal  Another_code
1   12            54
2   12            54
3   13            55
4   14            56
5   14            56
6   14            56
Well, indeed, I was not understanding all the pandas functions very well. I was able to solve my problem using merge; I did not know that pandas had such a good implementation of the typical SQL join.
This documentation helped me a lot:
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging
Pandas Merging 101
Finally my answer:
new_df = principal.merge(secondary, on='another_id')
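For completeness, a minimal sketch of how the merge reproduces the table above, assuming the principal DataFrame holds the id / another_id pairs and the secondary DataFrame carries the rows to be tagged (the column and variable names here are illustrative, not taken from the original data):
import pandas as pd

# Hypothetical data mirroring the table in the question.
principal = pd.DataFrame({'id': [12, 13, 14],
                          'another_id': [54, 55, 56]})
secondary = pd.DataFrame({'row_id': [1, 2, 3, 4, 5, 6],
                          'another_id': [54, 54, 55, 56, 56, 56]})

# Database-style join on the shared key column.
new_df = principal.merge(secondary, on='another_id')
print(new_df)  # every row now carries the matching principal id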
I thank you all!
I have big data (30 million rows).
Each table has id, date, value.
I need to go over each id and, for each id, get a list of its values sorted by date, so that the first value in the list belongs to the oldest date.
Example:
ID DATE VALUE
1 02/03/2020 300
1 04/03/2020 200
2 04/03/2020 456
2 01/03/2020 300
2 05/03/2020 78
Desired table:
ID VALUE_LIST_ORDERED
1 [300,200]
2 [300,456,78]
I can do it with a for loop or with apply, but it is not efficient, and with millions of users it is not feasible.
I thought about using groupby and sorting the dates, but I don't know how to build the list, and if so, is groupby on a pandas DataFrame the best way?
I would love to get some suggestions on how to do it and which kind of df/technology to use.
Thank you!
What you need to do is order your data using pandas.DataFrame.sort_values and then apply the groupby method.
I don't have a huge data set to test this code on, but I believe this would do the trick:
import numpy as np

sorted_data = data.sort_values('DATE')
result = sorted_data.groupby('ID').VALUE.apply(np.array)
and since it's Python you can always put everything in one statement
print(data.sort_values('DATE').groupby('ID').VALUE.apply(np.array))
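A small self-contained sketch of the same idea, using the example rows from the question; the pd.to_datetime step is worth noting, because day-first date strings will not necessarily sort chronologically as plain text (names below are just illustrative):
import pandas as pd

data = pd.DataFrame({'ID': [1, 1, 2, 2, 2],
                     'DATE': ['02/03/2020', '04/03/2020',
                              '04/03/2020', '01/03/2020', '05/03/2020'],
                     'VALUE': [300, 200, 456, 300, 78]})

# Parse the day-first date strings so sorting is chronological, not lexical.
data['DATE'] = pd.to_datetime(data['DATE'], dayfirst=True)

result = data.sort_values('DATE').groupby('ID').VALUE.apply(list)
print(result)
# 1        [300, 200]
# 2    [300, 456, 78]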
Problem and what I want
I have a data file that comprises time series read asynchronously from multiple sensors. Basically for every data element in my file, I have a sensor ID and time at which it was read, but I do not always have all sensors for every time, and read times may not be evenly spaced. Something like:
ID,time,data
0,0,1
1,0,2
2,0,3
0,1,4
2,1,5 # skip some sensors for some time steps
0,2,6
2,2,7
2,3,8
1,5,9 # skip some time steps
2,5,10
Important note: the actual time column is of datetime type.
What I want is to be able to zero-order hold (forward fill) values for every sensor at any time step where that sensor has no reading, and either set to zero or back fill any sensors that are not read at the earliest time steps. In other words, I want a dataframe that looks like it was read from:
ID,time,data
0,0,1
1,0,2
2,0,3
0,1,4
1,1,2 # ID 1 hold value from time step 0
2,1,5
0,2,6
1,2,2 # ID 1 still holding
2,2,7
0,3,6 # ID 0 holding
1,3,2 # ID 1 still holding
2,3,8
0,5,6 # ID 0 still holding, can skip totally missing time steps
1,5,9 # ID 1 finally updates
2,5,10
Pandas attempts so far
I initialize my dataframe and set my indices:
df = pd.read_csv(filename, dtype=int)
df.set_index(['ID', 'time'], inplace=True)
I try to mess with things like:
filled = df.reindex(method='ffill')
or the like with various values passed to the index keyword argument like df.index, ['time'], etc. This always either throws an error because I passed an invalid keyword argument, or does nothing visible to the dataframe. I think it is not recognizing that the data I am looking for is "missing".
I also tried:
df.update(df.groupby(level=0).ffill())
or level=1 based on Multi-Indexed fillna in Pandas, but I get no visible change to the dataframe again, I think because I don't have anything currently where I want my values to go.
Numpy attempt so far
I have had some luck with numpy and non-integer indexing using something like:
data = [np.array(df.loc[level].data) for level in df.index.levels[0]]
shapes = [arr.shape for arr in data]
print(shapes)
# [(3,), (2,), (5,)]
data = [np.array([arr[int(i)] for i in np.linspace(0, arr.shape[0]-1, num=max(shapes)[0])]) for arr in data]
print([arr.shape for arr in data])
# [(5,), (5,), (5,)]
But this has two problems:
It takes me out of the pandas world, and I now have to manually maintain my sensor IDs, time index, etc. along with my feature vector (the actual data column is not just one column but a ton of values from a sensor suite).
Given the number of columns and the size of the actual dataset, this is going to be clunky and inelegant to implement on my real example. I would prefer a way of doing it in pandas.
The application
Ultimately this is just the data-cleaning step for training a recurrent neural network, where for each time step I will need to feed a feature vector that always has the same structure (one set of measurements for each sensor ID for each time step).
Thank you for your help!
Here is one way, using reindex and a categorical time column:
df.time = df.time.astype(pd.CategoricalDtype(categories=[0, 1, 2, 3, 4, 5]))
new_df = df.groupby('time', as_index=False).apply(lambda x: x.set_index('ID').reindex([0, 1, 2])).reset_index()
new_df['data'] = new_df.groupby('ID')['data'].ffill()
new_df.drop('time', axis=1).rename(columns={'level_0': 'time'})
Out[311]:
time ID data
0 0 0 1.0
1 0 1 2.0
2 0 2 3.0
3 1 0 4.0
4 1 1 2.0
5 1 2 5.0
6 2 0 6.0
7 2 1 2.0
8 2 2 7.0
9 3 0 6.0
10 3 1 2.0
11 3 2 8.0
12 4 0 6.0
13 4 1 2.0
14 4 2 8.0
15 5 0 6.0
16 5 1 9.0
17 5 2 10.0
You can keep a dictionary of the last reading for each sensor. You'll have to pick some initial value; the most logical choice is probably to back-fill the earliest reading to earlier times. Once you've populated your last_reading dictionary, you can just sort all the readings by time, update the dictionary for each reading, and then fill in rows according to the dictionary. So after you have your last_reading dictionary initialized:
last_time = readings[0]['time']
for reading in readings:
    if reading['time'] > last_time:
        for ID in ID_list:
            df.loc[last_time, ID] = last_reading[ID]
        last_time = reading['time']
    last_reading[reading['ID']] = reading['data']

# The above loop never writes out the final time step,
# so handle that separately.
for ID in ID_list:
    df.loc[last_time, ID] = last_reading[ID]
This assumes that you have only one reading for each time/sensor pair, and that readings is a list of dictionaries sorted by time. It also assumes that df has the different sensors as columns and the different times as index. Adjust the code as necessary if otherwise. You could probably also optimize it a bit by updating a whole row at once instead of using a for loop, but I didn't want to deal with making sure I had the pandas syntax right.
Looking at the application, though, you might want each cell in the dataframe to hold not a number but a tuple of the last value and the time it was read, so replace last_reading[reading['ID']] = reading['data'] with last_reading[reading['ID']] = [reading['data'], reading['time']]. Your neural net can then decide how to weight data based on how old it is.
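A minimal sketch of the initialization step described above, back-filling each sensor's earliest reading as its starting value (readings, ID_list, and the key names are assumptions matching the pseudocode, not part of the original post):
# Assumes `readings` is a list of dicts sorted by time, e.g.
# readings = [{'ID': 0, 'time': 0, 'data': 1}, {'ID': 1, 'time': 0, 'data': 2}, ...]
ID_list = sorted({reading['ID'] for reading in readings})

last_reading = {}
for reading in readings:                   # readings are already time-ordered,
    if reading['ID'] not in last_reading:  # so the first hit per ID is its earliest value
        last_reading[reading['ID']] = reading['data']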
I got this to work with the following, which I think is pretty general for any case like this where the time index for which you want to fill values is the second in a multi-index with two indices:
# Remove duplicate time indices (happens some in the dataset, pandas freaks out).
df = df[~df.index.duplicated(keep='first')]
# Unstack the dataframe and fill values per serial number forward, backward.
df = df.unstack(level=0)
df.update(df.ffill()) # first ZOH forward
df.update(df.bfill()) # now back fill values that are not seen at the beginning
# Restack the dataframe and re-order the indices.
df = df.stack(level=1)
df = df.swaplevel()
This gets me what I want, although I would love to be able to keep the duplicate time entries if anybody knows of a good way to do this.
You could also use df.update(df.fillna(0)) instead of backfilling if starting unseen values at zero is preferable for a particular application.
I put the above code block in a function called clean_df that takes the dataframe as argument and returns the cleaned dataframe.
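For reference, a sketch of that wrapper (the function name clean_df comes from the post; the body is just the block above):
def clean_df(df):
    """Forward/back fill missing sensor readings on a (ID, time) MultiIndexed frame."""
    # Remove duplicate time indices (pandas cannot unstack them).
    df = df[~df.index.duplicated(keep='first')]
    # Unstack so each sensor ID becomes a column, then fill along time.
    df = df.unstack(level=0)
    df.update(df.ffill())   # zero-order hold forward
    df.update(df.bfill())   # back fill values not seen at the beginning
    # Restack and restore the original (ID, time) index order.
    df = df.stack(level=1)
    df = df.swaplevel()
    return df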
I'm using the Quandl database service API and its python support to download stock financial data.
Right now, I'm using the free SF0 database, which provides yearly operational financial data.
For example, this query code passes the last 6-8 years of data for stock "CRM" to the dataframe.
df=quandl.get('SF0/CRM_REVENUE_MRY')
df
Out[29]:
Value
Date
2010-01-31 1.305583e+09
2011-01-31 1.657139e+09
2012-01-31 2.266539e+09
2013-01-31 3.050195e+09
2014-01-31 4.071003e+09
2015-01-31 5.373586e+09
2016-01-31 6.667216e+09
What I want to do is loop over a list of about 50 stocks and also grab 6-8 other columns from this database, using different query codes appended to the SF0/CRM_ part of the query.
qcolumns = ['REVUSD_MRY',
'GP_MRY',
'INVCAP_MRY',
'DEBT_MRY',
'NETINC_MRY',
'RETEARN_MRY',
'SHARESWADIL_MRY',
'SHARESWA_MRY',
'COR_MRY',
'FCF_MRY',
'DEBTUSD_MRY',
'EBITDAUSD_MRY',
'SGNA_MRY',
'NCFO_MRY',
'RND_MRY']
So, I think I need to:
a) run the query for each column and in each case append to the dataframe.
b) Add column names to the dataframe.
c) Create a dataframe for each stock (should this be a panel or a list of dataframes? Apologies, as I'm new to pandas and dataframes and am on my learning curve).
d) write to CSV
Could you suggest an approach or point me in the right direction?
This code works to do two queries (two columns of data, both date indexed), renames the columns, and then concatenates them.
df=quandl.get('SF0/CRM_REVENUE_MRY')
df = df.rename(columns={'Value': 'REVENUE_MRY'})
dfnext=quandl.get('SF0/CRM_NETINC_MRY')
dfnext = dfnext.rename(columns={'Value': 'CRM_NETINC_MRY'})
frames = [df, dfnext]
dfcombine = pd.concat([df, dfnext], axis=1) # now question is how to add stock tag "CRM" to frame
dfcombine
Out[39]:
REVENUE_MRY CRM_NETINC_MRY
Date
2010-01-31 1.305583e+09 80719000.0
2011-01-31 1.657139e+09 64474000.0
2012-01-31 2.266539e+09 -11572000.0
2013-01-31 3.050195e+09 -270445000.0
2014-01-31 4.071003e+09 -232175000.0
2015-01-31 5.373586e+09 -262688000.0
2016-01-31 6.667216e+09 -47426000.0
I can add a loop to this to get all the columns (there are around 15), but how do I tag each frame for each stock? Use a key? Use a 3D panel? Thanks for helping a struggling Python programmer!
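One possible direction, sketched under the assumption that every query code follows the SF0/<TICKER>_<COLUMN> pattern used above (the ticker list and output file name below are purely illustrative): loop over tickers and column codes, build one frame per stock, and let pd.concat tag each frame with its ticker, so no 3D panel is needed.
import pandas as pd
import quandl

tickers = ['CRM', 'MSFT', 'ORCL']   # illustrative list; swap in your ~50 stocks

frames = {}
for ticker in tickers:
    cols = []
    for qcol in qcolumns:
        data = quandl.get('SF0/{}_{}'.format(ticker, qcol))   # one date-indexed 'Value' column
        cols.append(data.rename(columns={'Value': qcol}))     # name the column after its query code
    frames[ticker] = pd.concat(cols, axis=1)                  # one frame per stock

# Stack the per-stock frames; the dict keys become an outer 'Ticker' index level.
all_stocks = pd.concat(frames)
all_stocks.index.names = ['Ticker', 'Date']
all_stocks.to_csv('fundamentals.csv')                         # hypothetical output file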
I've been working with very basic pandas for a few days but am struggling with my current task:
I have a (non-normalized) time series whose items contain a user id per timestamp, so something like (date, userid, payload). Think of a server log file where I would like to find how many IPs return within a certain time period.
Now I would like to find how many of the users have multiple items within an interval, for example within 4 weeks. So it's more of a sliding window than constant intervals on the t-axis.
So my approaches were:
reindex df_users on userids,
or use a multiindex?
Sadly I didn't find a way to generate the results successfully.
So all in all, I'm not sure how to realize that kind of search with pandas; maybe this is easier to implement in pure Python? Or do I just lack some keywords for this problem?
Some dummy data that I think fits your problem.
df = pd.DataFrame({'id': ['A','A','A','B','B','B','C','C','C'],
'time': ['2013-1-1', '2013-1-2', '2013-1-3',
'2013-1-1', '2013-1-5', '2013-1-7',
'2013-1-1', '2013-1-7', '2013-1-12']})
df['time'] = pd.to_datetime(df['time'])
This approach requires some kind of non-missing numeric column to count with, so just add a dummy one.
df['dummy_numeric'] = 1
My approach to the problem is this. First, groupby the id and iterate, so we are working with one user id's worth of data at a time. Next, resample the irregular data up to daily values so it is normalized.
Then, using the rolling_count function, count the number of observations in each X day window (using 3 here). This works because the upsampled data will be filled with NaN and not counted. Notice that only the numeric column is being passed to rolling_count, and also note the use of double-brackets (which results in a DataFrame being selected rather than a series).
window_days = 3
ids = []
for _, df_gb in df.groupby('id'):
    df_gb = df_gb.set_index('time').resample('D')
    df_gb = pd.rolling_count(df_gb[['dummy_numeric']], window_days).reset_index()
    ids.append(df_gb)
Combine all the data back together and mark the spans with more than one observation:
df_stack = pd.concat(ids, ignore_index=True)
df_stack['multiple_requests'] = (df_stack['dummy_numeric'] > 1).astype(int)
Then groupby and sum, and you should have the right answer.
df_stack.groupby('time')['multiple_requests'].sum()
Out[356]:
time
2013-01-01 0
2013-01-02 1
2013-01-03 1
2013-01-04 0
2013-01-05 0
2013-01-06 0
2013-01-07 1
2013-01-08 0
2013-01-09 0
2013-01-10 0
2013-01-11 0
2013-01-12 0
Name: multiple_requests, dtype: int32
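Note that pd.rolling_count has since been removed from pandas; a rough equivalent with the current .rolling accessor, sketched under the assumption of a reasonably recent pandas version, would look something like this:
import pandas as pd

window_days = 3
pieces = []
for _, df_gb in df.groupby('id'):
    # Upsample to a regular daily index; days without a request become NaN.
    daily = df_gb.set_index('time').resample('D').asfreq()
    # Count the non-NaN observations in each 3-day window.
    daily['obs'] = daily['dummy_numeric'].rolling(window_days, min_periods=1).count()
    pieces.append(daily.reset_index())

df_stack = pd.concat(pieces, ignore_index=True)
df_stack['multiple_requests'] = (df_stack['obs'] > 1).astype(int)
print(df_stack.groupby('time')['multiple_requests'].sum())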