Adding to pandas dataframe line by line - python

I'm making a dataframe and I need to add to it line by line. I created the df with
df = pd.DataFrame(columns=('date', 'daily_high', 'daily_low'))
then I'm reading data from an API, so I run
for api in api_list:
    with urllib.request.urlopen(api) as url:
        data = json.loads(url.read().decode())
and I need to put different attributes from data into the dataframe.
I tried to put
df = df.append({'date': datetime.fromtimestamp(data["currently"]["time"]).strftime("20%y%m%d"),
                'daily_high': data["daily"]["data"][0]["temperatureHigh"],
                'daily_low': data["daily"]["data"][0]["temperatureLow"]}, ignore_index=True)
in the for loop, but it was taking a long time and I'm not sure if this is good practice. Is there a better way to do this? Maybe I could create three separate series and join them together?

pandas.DataFrame.append is inefficient for iterative approaches.
From the documentation:
Iteratively appending rows to a DataFrame can be more computationally
intensive than a single concatenate. A better solution is to append
those rows to a list and then concatenate the list with the original
DataFrame all at once.
As mentioned, concatenating the results will be more efficient, but in your case using pandas.DataFrame.from_dict would be even more convenient.
Also, I would use the requests library for fetching the URLs.
import requests
import pandas as pd
from datetime import datetime

d = {'date': [], 'daily_high': [], 'daily_low': []}

for api_url in api_list:
    data = requests.get(api_url).json()
    d['date'].append(datetime.fromtimestamp(data["currently"]["time"]).strftime("20%y%m%d"))
    d['daily_high'].append(data["daily"]["data"][0]["temperatureHigh"])
    d['daily_low'].append(data["daily"]["data"][0]["temperatureLow"])

df = pd.DataFrame.from_dict(d)
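For reference, the list-of-rows pattern the documentation describes would look something like this sketch (reusing the same API fields and imports as above; rows is just an illustrative name):
rows = []
for api_url in api_list:
    data = requests.get(api_url).json()
    # Collect each record as a plain dict; build the DataFrame once at the end.
    rows.append({
        'date': datetime.fromtimestamp(data["currently"]["time"]).strftime("20%y%m%d"),
        'daily_high': data["daily"]["data"][0]["temperatureHigh"],
        'daily_low': data["daily"]["data"][0]["temperatureLow"],
    })

df = pd.DataFrame(rows)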

Related

Unpack a list with many dataframes

I have a problem with a list containing many dataframes. I create them this way:
listWithDf = []
listWithDf.append(file)
Now I want to work with the data inside this list, but as one dataframe holding all the data. I know the way below is very ugly, since it has to be changed every time the number of dataframes changes:
df = pd.concat([listWithDf[0], listWithDf[1], ...])
So I was wondering, is there a better way to unpack a list like that? Or maybe a different way to build a dataframe in a loop that contains the data I need.
Here's a way you can do it, as suggested in the comments by @sjw:
df = pd.concat(listWithDf)
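If the original indices don't matter, you can also reset them during the concatenation, for example:
df = pd.concat(listWithDf, ignore_index=True)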
Here's a method with a loop (but it's unnecessary!):
df = pd.concat([i for i in listWithDf])

Normalize column with JSON data in Pandas dataframe

I have a Pandas dataframe in which one column contains JSON data (the JSON structure is simple: only one level, there is no nested data):
ID,Date,attributes
9001,2020-07-01T00:00:06Z,"{"State":"FL","Source":"Android","Request":"0.001"}"
9002,2020-07-01T00:00:33Z,"{"State":"NY","Source":"Android","Request":"0.001"}"
9003,2020-07-01T00:07:19Z,"{"State":"FL","Source":"ios","Request":"0.001"}"
9004,2020-07-01T00:11:30Z,"{"State":"NY","Source":"windows","Request":"0.001"}"
9005,2020-07-01T00:15:23Z,"{"State":"FL","Source":"ios","Request":"0.001"}"
I would like to normalize the JSON content in the attributes column so the JSON attributes become each a column in the dataframe.
ID,Date,attributes.State,attributes.Source,attributes.Request
9001,2020-07-01T00:00:06Z,FL,Android,0.001
9002,2020-07-01T00:00:33Z,NY,Android,0.001
9003,2020-07-01T00:07:19Z,FL,ios,0.001
9004,2020-07-01T00:11:30Z,NY,windows,0.001
9005,2020-07-01T00:15:23Z,FL,ios,0.001
I have been trying to use pandas' json_normalize, which requires a dictionary. So I figured I would convert the attributes column to a dictionary, but it does not quite work out as expected, for the dictionary has the form:
df.attributes.to_dict()
{0: '{"State":"FL","Source":"Android","Request":"0.001"}',
1: '{"State":"NY","Source":"Android","Request":"0.001"}',
2: '{"State":"FL","Source":"ios","Request":"0.001"}',
3: '{"State":"NY","Source":"windows","Request":"0.001"}',
4: '{"State":"FL","Source":"ios","Request":"0.001"}'}
And the normalization takes the key (0, 1, 2, ...) as the column name instead of the JSON keys.
I have the feeling that I am close but I can't quite work out how to do this exactly. Any idea is welcome.
Thank you!
json_normalize expects to work on an object, not a string.
import json
import pandas as pd
df_final = pd.json_normalize(df.attributes.apply(json.loads))
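To get from there to the full frame shown in the question (ID and Date plus the prefixed attribute columns), you could join the normalized columns back and drop the raw one. A sketch; the attrs name and the column prefixing are only illustrative:
import json
import pandas as pd

# Normalize the JSON column, then prefix the new columns as in the desired output.
attrs = pd.json_normalize(df['attributes'].apply(json.loads))
attrs.columns = ['attributes.' + c for c in attrs.columns]

# Join back onto the original frame and drop the raw column.
df_final = df.drop(columns=['attributes']).join(attrs)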
You shouldn't need to convert to a dictionary first.
Try:
import pandas as pd
pd.json_normalize(df['attributes'])
I found a solution but I am not overly happy with it. I reckon it is very inefficient.
import pandas as pd
import json
# Import full dataframe
df = pd.read_csv(r'D:/tmp/sample_simple.csv', parse_dates=['Date'])
# Create empty dataframe to hold the results of data conversion
df_attributes = pd.DataFrame()
# Loop through the data to fill the dataframe
for index in df.index:
    row_json = json.loads(df.attributes[index])
    normalized_row = pd.json_normalize(row_json)
    # df_attributes = df_attributes.append(normalized_row)  # deprecated method; use concat instead
    df_attributes = pd.concat([df_attributes, normalized_row], ignore_index=True)
# Reset the index of the attributes dataframe
df_attributes = df_attributes.reset_index(drop=True)
# Drop the original attributes column
df = df.drop(columns=['attributes'])
# Join the results
df_final = df.join(df_attributes)
# Show results
print(df_final)
print(df_final.info())
Which gives me the expected result. However, as I said, there are several inefficiencies in it, starting with the dataframe concatenation inside the for loop. According to the documentation, the best practice is to collect the rows in a list and then concatenate once, but I could not figure out how to do that while keeping the shape I wanted. I welcome all critiques and ideas.
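For what it's worth, a list-then-normalize version of the loop above could look like this sketch (same file and column names as in my code; the parsed name is only illustrative):
import json
import pandas as pd

df = pd.read_csv(r'D:/tmp/sample_simple.csv', parse_dates=['Date'])

# Parse every JSON string first, then normalize them all in a single call.
parsed = [json.loads(s) for s in df['attributes']]
df_attributes = pd.json_normalize(parsed)

# Join back onto the original frame, minus the raw attributes column.
df_final = df.drop(columns=['attributes']).join(df_attributes)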

How to write data to existing excel file using pandas?

I want to request some data using the Python module tushare.
With the code below, I can get one line of data each time.
However, I want to send the server a request every 5 seconds or so
and put all the data from a 4-hour window into one Excel file.
I notice that pandas is already built into tushare.
How do I put the data together and generate only one Excel file?
import tushare as ts
df=ts.get_realtime_quotes('000875')
df.to_excel(r'C:\Users\stockfile\000875.xlsx')
You can do it with, for example:
df = df.append(ts.get_realtime_quotes('000875'))
Given the number of calls, it may be better to create a data frame and fill it with data rows as they arrive. Something like this:
from time import sleep

import pandas as pd
import tushare as ts

# Here, just getting the column names:
columns = ts.get_realtime_quotes('000875').columns
# Choose the right number of calls, N.
df = pd.DataFrame(index=range(N), columns=columns)
for i in range(N):
    df.iloc[i] = ts.get_realtime_quotes('000875').iloc[0]
    sleep(5)
Another way to do it (possibly simpler and without preallocating the empty data frame) would be storing answers from tushare in a list and then applying pd.concat.
list_of_dfs = []
for _ in range(N):
    list_of_dfs.append(ts.get_realtime_quotes('000875'))
    sleep(5)

full_df = pd.concat(list_of_dfs)
This way you don't need to know the number of requests in advance (for example, if you decide to write the loop without an explicit number of repetitions).
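To end up with the single Excel file the question asks for, you could then write the combined frame once at the end (path taken from the question):
# One write at the end instead of one file per request.
full_df.to_excel(r'C:\Users\stockfile\000875.xlsx', index=False)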

How to add values from one dataframe into another ignoring the row indices

I have a pandas dataframe called trg_data to collect data that I am producing in batches. Each batch is produced by a sub-routine as a smaller dataframe df with the same number of columns but fewer rows, and I want to insert the values from df into trg_data at a new row position each time.
However, when I use the following statement, df is always inserted at the top (i.e. rows 0 to len(df)).
trg_data.iloc[trg_pt:(trg_pt + len(df))] = df
I'm guessing, but I think the reason may be that even though the slice indicates the desired rows, pandas is using the index of df to decide where to put the data.
As a test I found that I can insert an ndarray at the right position no problem:
trg_data.iloc[trg_pt:(trg_pt + len(df))] = np.ones(df.shape)
How do I get it to ignore the index in df and insert the data where I want it? Or is there an entirely different way of achieving this? At the end of the day I just want to build the dataframe trg_data and save it to file when done. I went down this route because there didn't seem to be an easy way of appending to an existing dataframe.
I've been working at this for over an hour and I can't figure out what to google to find the right answer!
I think I may have the answer (I thought I had already tried this but apparently not):
trg_data.iloc[trg_pt:(trg_pt + len(df))] = df.values
Still, I'm open to other suggestions. There's probably a better way to add data to a dataframe.
The way I would do this is to save all the intermediate dataframes in a list, and then concatenate them together:
import pandas as pd

dfs = []
# get all the intermediate dataframes somehow

# combine into one dataframe
trg_data = pd.concat(dfs)
Both
trg_data = pd.concat([df1, df2, ... dfn], ignore_index=True)
and
trg_data = pd.DataFrame()
for ...:  # loop that generates df
    trg_data = trg_data.append(df, ignore_index=True)  # you can reuse the name df
should work for you. (Note that DataFrame.append is deprecated in newer pandas and removed in 2.0, so the concat form is the safer long-term choice.)

Parsing a JSON string which was loaded from a CSV using Pandas

I am working with CSV files where several of the columns have a simple json object (several key value pairs) while other columns are normal. Here is an example:
name,dob,stats
john smith,1/1/1980,"{""eye_color"": ""brown"", ""height"": 160, ""weight"": 76}"
dave jones,2/2/1981,"{""eye_color"": ""blue"", ""height"": 170, ""weight"": 85}"
bob roberts,3/3/1982,"{""eye_color"": ""green"", ""height"": 180, ""weight"": 94}"
After using df = pandas.read_csv('file.csv'), what's the most efficient way to parse and split the stats column into additional columns?
After about an hour, the only thing I could come up with was:
import json
stdf = df['stats'].apply(json.loads)
stlst = list(stdf)
stjson = json.dumps(stlst)
df.join(pandas.read_json(stjson))
This seems like I'm doing it wrong, and it's quite a bit of work considering I'll need to do this on three columns regularly.
The desired output is the dataframe object below. I added the following lines of code to get there in my (crappy) way:
df = df.join(pandas.read_json(stjson))
del(df['stats'])
In [14]: df
Out[14]:
          name       dob eye_color  height  weight
0   john smith  1/1/1980     brown     160      76
1   dave jones  2/2/1981      blue     170      85
2  bob roberts  3/3/1982     green     180      94
I think applying json.loads is a good idea, but from there you can simply convert it directly to dataframe columns instead of writing and loading it again:
stdf = df['stats'].apply(json.loads)
pd.DataFrame(stdf.tolist()) # or stdf.apply(pd.Series)
or alternatively in one step:
df.join(df['stats'].apply(json.loads).apply(pd.Series))
There is a slightly easier way, but ultimately you'll have to call json.loads. There is a notion of a converter in pandas.read_csv:
converters : dict, optional
Dict of functions for converting values in certain columns. Keys can either be integers or column labels
So first define your custom parser. In this case the below should work:
def CustomParser(data):
    import json
    j1 = json.loads(data)
    return j1
In your case you'll have something like:
df = pandas.read_csv(f1, converters={'stats':CustomParser},header=0)
We are telling read_csv to read the data in the standard way, but for the stats column to use our custom parser. This makes the stats column hold dicts.
From here, we can use a little hack to directly append these columns in one step with the appropriate column names. This will only work for regular data (each JSON object needs to have the same 3 keys, or missing values need to be handled in our CustomParser).
df[sorted(df['stats'][0].keys())] = df['stats'].apply(pandas.Series)
On the left-hand side, we get the new column names from the keys of an element of the stats column; each element in the stats column is a dictionary, so we are doing a bulk assign. On the right-hand side, we break up the 'stats' column using apply to make a data frame out of each key/value pair.
Option 1
If you dumped the column with json.dumps before you wrote it to csv, you can read it back in with:
import json
import pandas as pd
df = pd.read_csv('data/file.csv', converters={'json_column_name': json.loads})
Option 2
If you didn't, then you might need to use this (note that eval executes arbitrary expressions, so only use it on data you trust):
import json
import pandas as pd
df = pd.read_csv('data/file.csv', converters={'json_column_name': eval})
Option 3
For more complicated situations you can write a custom converter like this:
import json
import pandas as pd
def parse_column(data):
    try:
        return json.loads(data)
    except Exception as e:
        print(e)
        return None
df = pd.read_csv('data/file.csv', converters={'json_column_name': parse_column})
Paul's original answer was very nice but not correct in general, because there is no assurance that the ordering of columns is the same on the left-hand side and the right-hand side of the last line. (In fact, it does not seem to work on the test data in the question, instead erroneously switching the height and weight columns.)
We can fix this by ensuring that the list of dict keys on the LHS is sorted. This works because the apply on the RHS automatically sorts by the index, which in this case is the list of column names.
def CustomParser(data):
    import json
    j1 = json.loads(data)
    return j1

df = pandas.read_csv(f1, converters={'stats': CustomParser}, header=0)
df[sorted(df['stats'][0].keys())] = df['stats'].apply(pandas.Series)
The json_normalize function in the pandas.io.json package helps to do this without using a custom function.
(assuming you are loading the data from a file)
import ujson
import pandas as pd
from pandas.io.json import json_normalize

df = pd.read_csv(file_path)
stats_df = json_normalize(df['stats'].apply(ujson.loads).tolist())
stats_df.set_index(df.index, inplace=True)
df = df.join(stats_df)
df.drop(columns=['stats'], inplace=True)
If you have DateTime values in your .csv file, df[sorted(df['stats'][0].keys())] = df['stats'].apply(pandas.Series) will mess up the datetime values.
This link has some tips on how to read a CSV file with JSON strings into a dataframe.
You could do the following to read a CSV file with a JSON string column and convert the JSON strings into columns.
Read your csv into the dataframe (read_df):
import json

read_df = pd.read_csv('yourFile.csv', converters={'state': json.loads}, header=0, quotechar="'")
Convert the JSON string column to a new dataframe:
state_df = read_df['state'].apply(pd.Series)
Merge the two dataframes on the index:
df = pd.merge(read_df, state_df, left_index=True, right_index=True)
