I am streaming data from a websocket into my Python application successfully using these lines of code:
wsClient = GDAX.WebsocketClient(url="wss://ws-feed.gdax.com", products="LTC-USD")
wsClient.start()
I am having trouble saving the results of wsClient.start() into a pandas DataFrame. I'm not sure why records are not appending with these lines of code; can anyone please help me understand why not:
df1 = pd.DataFrame()
for i in wsClient.start():
    df1.append(wsClient.start())
Thank you in advance.
If you look at the documentation for append, you can see that it returns the resulting DataFrame; it alters neither the DataFrame on which append is called nor the DataFrame passed as the argument.
In the body of your loop, you probably meant something like
df1 = df1.append(wsClient.start())
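To see the non-mutating behavior in isolation, here is a tiny sketch with made-up data. (Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, where pd.concat is the replacement.)

import pandas as pd

df = pd.DataFrame({'a': [1]})
result = df.append(pd.DataFrame({'a': [2]}), ignore_index=True)

print(len(df))      # 1 -- the original frame is unchanged
print(len(result))  # 2 -- append returned a new, longer frame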
As DJK correctly notes below, a more efficient alternative is to first collect all the DataFrames in a list, then concatenate them once:
dfs = []
for i in wsClient.start():
    dfs.append(i)
df1 = pd.concat(dfs)
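As a self-contained illustration of the list-then-concat pattern, here is a sketch with hypothetical ticker messages standing in for the websocket output. (Depending on the client, start() may launch a background listener thread rather than return an iterable of messages, so check your library's docs for how messages are actually delivered.)

import pandas as pd

# Hypothetical messages standing in for whatever the feed yields
messages = [{'price': 61.20, 'size': 0.5},
            {'price': 61.25, 'size': 1.2}]

dfs = [pd.DataFrame([msg]) for msg in messages]  # one single-row frame each
df1 = pd.concat(dfs, ignore_index=True)
print(df1)
#    price  size
# 0  61.20   0.5
# 1  61.25   1.2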
I have a DataFrame with a "details" column that I believe is a dictionary. The initial data is a JSON string parsed with json.loads, then converted from a dictionary to DataFrame. I would like to populate a new "affectedDealId" column with the value in data['details']['actions']['affectedDealId'].
I'm hoping I can do this the "DataFrame" way without using a loop with something like:
data['affectedDealId'] = data['details'].get('actions').get('affectedDealId')
To simplify I've tried:
data['actions'] = data['details'].get('actions')
But that ends up as "None".
Also, data['details'] seems to be a Series, although I think it was a dictionary before converting it to a DataFrame.
Alternatively, I do later loop through the DataFrame. How would I access that 'affectedDealId' element?
Below is a screenshot of the DataFrame from the PyCharm debugger.
I'm making some assumptions about the details JSON, but does this help? You will have to adjust the json.loads(x) key/index to extract the right location.
df['affectedDealId'] = df['details'].apply(lambda x: json.loads(x)['affectedDealId'])
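If details is still a raw JSON string and the ID is nested under actions as the question describes, the lambda just needs to walk one level deeper. A minimal sketch with made-up rows:

import json
import pandas as pd

df = pd.DataFrame({'details': ['{"actions": {"affectedDealId": 42}}',
                               '{"actions": {"affectedDealId": 7}}']})

df['affectedDealId'] = df['details'].apply(
    lambda x: json.loads(x)['actions']['affectedDealId'])

print(df['affectedDealId'].tolist())  # [42, 7]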
I think it would be great if you could do something like this.
First, create a new DataFrame from your JSON column by applying pd.Series:
data_frame_new = df['details'].apply(pd.Series)
Then reassign your DataFrame by concatenating data_frame_new with your existing DataFrame:
df = pd.concat([df, data_frame_new], axis=1)
print(df)
This approach worked for me on a recent project.
Your affectedDealId will become a column of its own, with the data populated.
It may be of help to you.
Thanks
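For completeness, here is a runnable sketch of the apply(pd.Series) approach with made-up data shaped like the question, assuming details already holds dictionaries rather than JSON strings:

import pandas as pd

df = pd.DataFrame({'details': [{'actions': {'affectedDealId': 42}},
                               {'actions': {'affectedDealId': 7}}]})

# Expand each dict in 'details' into its own column(s)
data_frame_new = df['details'].apply(pd.Series)

# 'actions' is itself a dict, so dig one level further for the ID
df['affectedDealId'] = data_frame_new['actions'].apply(
    lambda a: a.get('affectedDealId'))

print(df['affectedDealId'].tolist())  # [42, 7]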
I'm a Pandas newbie, so please bear with me.
Overview: I started with a free-form text file created by a data harvesting script that remotely accessed dozens of different kinds of devices, and multiple instances of each. I used OpenRefine (a truly wonderful tool) to munge that into a CSV that was then input to dataframe df using Pandas in a JupyterLab notebook.
My first inspection of the data showed the 'Timestamp' column was not monotonic. I accessed individual data sources as follows, in this case for the 'T-meter' data source. (The technique was taken from a search result - I don't really understand it, but it worked.)
cond = df['Source']=='T-meter'
rows = df.loc[cond, :]
df_tmeter = pd.DataFrame(columns=df.columns)
df_tmeter = df_tmeter.append(rows, ignore_index=True)
then checked each as follows:
df_tmeter['Timestamp'].is_monotonic
Fortunately, the problem was easy to identify and fix: Some sensors were resetting, then sending bad (but still monotonic) timestamps until their clocks were updated. I wrote the function healing() to cleanly patch such errors, and it worked a treat:
df_tmeter['healed'] = df_tmeter['Timestamp'].apply(healing)
Now for my questions:
How do I get the 'healed' values back into the original df['Timestamp'] column for only the 'T-meter' items in df['Source']?
Given the function healing(), is there a clean way to do this directly on df?
Thanks!
Edit: I first thought I should be using 'views' into df, but other operations on the data would either generate errors, or silently turn the views into copies.
I wrote a wrapper function heal_row() for healing():
def heal_row(row):
    if row['Source'] == 'T-meter':  # Redundant check, but safe!
        row['Timestamp'] = healing(row['Timestamp'])
    return row
then did the following:
df = df.apply(lambda row: row if row['Source'] != 'T-meter' else heal_row(row), axis=1)
The ordering is important here, since healing() is stateful based on the prior row(s); rows must be processed in their original sequence.
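For the first question, a simpler alternative (a sketch reusing the question's column names and the healing() function) is boolean-mask assignment with .loc, which writes the healed values straight back into df. Series.apply visits values in index order, so the stateful healing() still sees the T-meter rows in their original sequence:

mask = df['Source'] == 'T-meter'
df.loc[mask, 'Timestamp'] = df.loc[mask, 'Timestamp'].apply(healing)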
The first row of my pandas data table has turned into the column headers. I've tried various renaming and restructuring methods, but nothing has worked. It's something really trivial, but unfortunately I need some help.
The row labelled "0" is supposed to come down as the first data row, "Bachelor". Could someone please point me to the proper way of getting this done?
I think the problem is that your CSV has no header row, so it is possible to let pandas create default range column names:
df_degree = pd.read_csv(file, header=None)
Or it is possible to define custom column names:
df_degree = pd.read_csv(file, names=['col1','col2'])
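A quick sketch of the difference, using an in-memory stand-in for the CSV:

import pandas as pd
from io import StringIO

csv_data = "Bachelor,10\nMaster,20\n"  # hypothetical headerless file

# header=None keeps 'Bachelor' as data and numbers the columns 0, 1, ...
df_degree = pd.read_csv(StringIO(csv_data), header=None)
print(df_degree)
#           0   1
# 0  Bachelor  10
# 1    Master  20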
I have a tab-separated file which I loaded into a pandas DataFrame as below:
import pandas as pd
data1 = pd.DataFrame.from_csv(r"C:\Users\Ashish\Documents\indeed_ml_dataset\train.tsv", sep="\t")
data1
Here is what data1 looks like:
Now, I want to view the column named tags. I don't know whether I should call it a column or not, but I have tried accessing it the usual way:
data2=data1[['tags']]
but it errors out. I have tried several other things as well, using index and loc, but all of them fail. Any suggestions?
To fix this you'll need to remove description from the index by resetting it. Try the below:
data2 = data1.reset_index()
data2['tags']
You'll then be able to select by "tags".
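A small sketch of what reset_index does here, with a made-up frame whose index plays the role of the description column:

import pandas as pd

data1 = pd.DataFrame(
    {'tags': ['part-time-job', 'licence-needed']},
    index=pd.Index(['job one', 'job two'], name='description'))

data2 = data1.reset_index()    # 'description' becomes an ordinary column
print(data2.columns.tolist())  # ['description', 'tags']
print(data2['tags'][0])        # 'part-time-job'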
Try reading your data using pd.read_csv instead of pd.DataFrame.from_csv, as from_csv takes the first column as the index by default (it has since been deprecated in favor of read_csv).
For more info refer to this documentation on the pandas website: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_csv.html
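Assuming the file really is tab-separated, the fix is a one-line change, reusing the path and separator from the question:

import pandas as pd

# read_csv does NOT take the first column as the index by default,
# so 'tags' stays a regular, selectable column
data1 = pd.read_csv(r"C:\Users\Ashish\Documents\indeed_ml_dataset\train.tsv",
                    sep="\t")
data2 = data1[['tags']]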
I am trying to combine two tables row-wise (stack one on top of the other, like using rbind in R). I've followed the steps mentioned in:
Pandas version of rbind
how to combine two data frames in python pandas
But neither "append" nor "concat" is working for me.
About my data
I have two pandas DataFrame objects (type class 'pandas.core.frame.DataFrame'), both with 19 columns. When I print each DataFrame they look fine.
The problem
So I created another pandas DataFrame using:
query_results = pd.DataFrame(columns=header_cols)
and then, in a loop (because sometimes I may be combining more than just 2 tables), I am trying to combine all the tables:
for CCC in CCCList:
    query_results.append(cost_center_query(cccode=CCC))
where cost_center_query is a customized function that returns pandas DataFrame objects with the same column names as query_results.
However, with this, whenever I print query_results I get an empty DataFrame.
Any idea why this is happening? There is no error message either, so I am just confused.
Thank you so much for any advice!
Your query_results stays empty because append returns a new DataFrame rather than modifying the one it is called on, and your loop discards that return value. Consider the concat method on a list of DataFrames instead, which also avoids the repeated object copying of multiple append calls inside a loop. You can even use a list comprehension:
query_results = pd.concat([cost_center_query(cccode=CCC) for CCC in CCCList], ignore_index=True)
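If you would rather keep an explicit loop (say, to log progress per cost center), the key point is unchanged: collect the frames in a plain Python list and concatenate once at the end. A sketch reusing the question's names:

frames = []
for CCC in CCCList:
    frames.append(cost_center_query(cccode=CCC))  # list.append mutates in place

query_results = pd.concat(frames, ignore_index=True)

Concatenating once copies each row a single time, whereas growing a DataFrame with repeated append re-copies all accumulated rows on every iteration.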