Pandas row to json - python

I have a dataframe in pandas and my goal is to write each row of the dataframe as a new json file.
I'm a bit stuck right now. My intuition was to iterate over the rows of the dataframe (using df.iterrows) and use json.dumps to dump each row to a file, but to no avail.
Any thoughts?

Looping over indices is very inefficient.
A faster technique:
df['json'] = df.apply(lambda x: x.to_json(), axis=1)
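Since the question asks for one JSON file per row, here is a minimal sketch of how the resulting column could then be written out to separate files (the example dataframe and the "row{}.json" file-name pattern are only for illustration):

import pandas as pd

# hypothetical example dataframe; replace with your own data
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# build a column of JSON strings, one per row
df['json'] = df.apply(lambda x: x.to_json(), axis=1)

# write each row's JSON string to its own file
for i, s in df['json'].items():
    with open("row{}.json".format(i), "w") as f:
        f.write(s)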

Pandas DataFrames have a to_json method that will do it for you:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_json.html
If you want each row in its own file you can iterate over the index (and use the index to help name them):
for i in df.index:
    df.loc[i].to_json("row{}.json".format(i))

Extending @MrE's answer: if you're looking to convert multiple columns from each row into another column holding the content in JSON format (and not separate JSON files as output), I've had speed issues while using:
df['json'] = df.apply(lambda x: x.to_json(), axis=1)
I've achieved significant speed improvements on a dataset of 175K records and 5 columns using this line of code:
df['json'] = df.to_json(orient='records', lines=True).splitlines()
Speed went from >1 min to 350 ms.
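A minimal sketch of the two lines side by side on a toy frame (column names are illustrative; the timing above is from the original 175K-row dataset, not this example). Both should produce the same JSON string per row:

import pandas as pd

# toy frame standing in for the 175K-row, 5-column dataset mentioned above
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# slow path: one to_json call per row via apply
df['json_apply'] = df[['a', 'b']].apply(lambda x: x.to_json(), axis=1)

# fast path: a single to_json call for the whole frame, split into one string per row
df['json'] = df[['a', 'b']].to_json(orient='records', lines=True).splitlines()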

Using apply, this can be done as follows:
import json

def writejson(row):
    with open(row["filename"] + '.json', "w") as outfile:
        json.dump(row["json"], outfile, indent=2)

in_df.apply(writejson, axis=1)
Assuming the dataframe has a column named "filename" with filename for each json row.
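A minimal usage sketch under that assumption; the "filename" and "json" column names and their contents here are just illustrative stand-ins:

import json
import pandas as pd

# hypothetical input frame; each row carries a target file name and the data to dump
in_df = pd.DataFrame({
    "filename": ["row0", "row1"],
    "json": [{"a": 1}, {"a": 2}],   # dicts, so json.dump serializes them directly
})

def writejson(row):
    with open(row["filename"] + '.json', "w") as outfile:
        json.dump(row["json"], outfile, indent=2)

in_df.apply(writejson, axis=1)  # writes row0.json and row1.json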

Here's a simple solution:
Transform the dataframe to JSON with one record per line, then simply split the lines:
list_of_jsons = df.to_json(orient='records', lines=True).splitlines()

Related

Converting all pandas column: row to key:value pair json

I am trying to add a new column at the end of my pandas dataframe that will contain the values of the other cells as key:value pairs. I have tried the following:
import json
df["json_formatted"] = df.apply
(
lambda row: json.dumps(row.to_dict(), ensure_ascii=False), axis=1
)
It creates the json_formatted column successfully with all the required data, but the problem is that it also adds json_formatted itself as an extra key. I don't want that. I want the JSON data to contain only the information from the original df columns. How can I do that?
Note: I made ensure_ascii=False because the column names are in Japanese characters.
Create a new variable holding the created column and add it afterwards:
json_formatted = df.apply(lambda row: json.dumps(row.to_dict(), ensure_ascii=False), axis=1)
df['json_formatted'] = json_formatted
This behaviour shouldn't happen, but might be caused by your having run this function more than once. (You added the column, and then ran df.apply on the same dataframe).
You can avoid this by making your columns explicit: df[['col1', 'col2']].apply()
Apply is an expensive operation in Pandas, and if performance matters it is better to avoid it. An alternative way to do this is:
df["json_formatted"] = [json.dumps(s, ensure_ascii=False) for s in df.T.to_dict().values()]

Looping over rows of a CSV imported dataframe using pandas

I am trying to print rows of a dataframe one by one.
I only manage to loop over the columns instead of the rows:
First, I am importing from a csv:
table_csv = pd.read_csv(r'C:\Users\xxx\Desktop\table.csv',sep=';', error_bad_lines=False)
Next, I convert it into a dataframe using pandas:
table_dataframe = DataFrame(table_csv)
I then start the for loop as follows:
for row in table_dataframe:
    print(row)
However, it loops over the columns instead of the rows, and I need to perform alterations on the rows. Does anybody know where this goes wrong, or have an alternative solution?
Check out answers to this question
In short this is how you'd do it:
for index, row in table_dataframe.iterrows():
    print(row)
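If the loop body only reads values, itertuples() is usually faster than iterrows(); a minimal sketch (this alternative is not from the original answer, and the toy frame just stands in for the CSV data):

import pandas as pd

table_dataframe = pd.DataFrame({"a": [1, 2], "b": [3, 4]})  # stand-in for the CSV data

# each row comes back as a lightweight namedtuple
for row in table_dataframe.itertuples(index=False):
    print(row)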

Access latest entry of name X in CSV file

I have a csv file, the columns are:
Date,Time,Spread,Result,Direction,Entry,TP,SL,Bal
And typical entries would look like:
16/07/21,01:25:05,N/A,No Id,Null,N/A,N/A,N/A,N/A
16/07/21,01:30:06,N/A,No Id,Null,N/A,N/A,N/A,N/A
16/07/21,01:35:05,8.06,Did not qualify,Long,N/A,N/A,N/A,N/A
16/07/21,01:38:20,6.61,Trade,Long,1906.03,1912.6440000000002,1900.0,1000.0
16/07/21,01:41:06,N/A,No Id,Null,N/A,N/A,N/A,N/A
How would I access the latest entry where the Result column is equal to Trade, preferably without looping through the whole file?
If it must be a loop, it would have to loop backwards from latest to earliest because it is a large csv file.
If you want to use pandas, try using read_csv with loc:
df = pd.read_csv('yourcsv.csv')
print(df.loc[df['Result'] == 'Trade'].iloc[[-1]])
Load your .csv into a pd.DataFrame and you can get all the rows where df.Result equals Trade like this:
df[df.Result == 'Trade']
If you only want the last one, then use .iloc:
df[df.Result == 'Trade'].iloc[-1]
I hope this is what you are looking for.
I suggest you use pandas, but in case you really cannot, here's an approach.
Assuming the data is in data.csv:
from csv import reader
with open("data.csv") as data:
rows = [row for row in reader(data)]
col = rows[0].index('Result')
res = [row for i, row in enumerate(rows) if i > 0 and row[col] == 'Trade']
I advise against using this; it's way too brittle.
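If you do go the plain-csv route, here is a sketch that walks the rows from latest to earliest and stops at the first match. It still reads the whole file into memory, so this is only a modest improvement; the file name is an assumption:

from csv import reader

with open("data.csv") as data:
    rows = list(reader(data))

col = rows[0].index('Result')

latest_trade = None
# scan backwards so the first hit is the most recent 'Trade' row
for row in reversed(rows[1:]):
    if row[col] == 'Trade':
        latest_trade = row
        break

print(latest_trade)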

How to read only selected lines in a csv via pandas

So the skiprows argument of pd.read_csv() allows you to skip specific rows. But how can I read just selected rows into a pandas dataframe? Say I have a list of row indices that I need to read from the file; how can I achieve that? Passing skiprows = ~line_nos does not work, as the unary operator does not work for lists.
Currently using this method to read out the lines:
def picklines(thefile, whatlines):
    return [x for i, x in enumerate(thefile) if i in whatlines]
And then converting the result into a dataframe. But I'm wondering if there's a better way to do so.
You could use a lambda function to achieve this.
# rows_to_keep are the line_nos you would like to keep
pd.read_csv(path_to_csv, skiprows = lambda x: x not in rows_to_keep)
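One thing to watch: the callable is evaluated against every row index including the header line (row 0), so include 0 in rows_to_keep if you want to keep the column names. A minimal sketch (the file name and indices are illustrative):

import pandas as pd

# keep the header (row 0) plus data lines 3, 7 and 10 of the file
rows_to_keep = {0, 3, 7, 10}

df = pd.read_csv("data.csv", skiprows=lambda x: x not in rows_to_keep)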

concatenate excel datas with python or Excel

Here's my problem, I have an Excel sheet with 2 columns (see below)
I'd like to print (on the Python console or in an Excel cell) all the data in this form:
"1" : ["1123","1165", "1143", "1091", "n"], *** n ∈ [A2; A205]***
We don't really care about column B, but I need to add every postal code in this specific form.
Is there a way to do it with Excel, or in Python with pandas? (If you have any other ideas, I would love to hear them.)
Cheers
I think you can use parse_cols to parse only the first column, and then filter out all rows from 205 to 1000 with skiprows in read_excel:
df = pd.read_excel('test.xls',
                   sheet_name='Sheet1',
                   parse_cols=0,
                   skiprows=list(range(205, 1000)))
print(df)
Last, use tolist to convert the first column to a list:
print({"1": df.iloc[:,0].tolist()})
The simplest solution is to parse only the first column and then use iloc:
df = pd.read_excel('test.xls',
                   parse_cols=0)
print({"1": df.iloc[:206, 0].astype(str).tolist()})
I am not familiar with excel, but pandas could easily handle this problem.
First, read the excel to a DataFrame
import pandas as pd
df = pd.read_excel(filename)
Then, print as you like
print({"1": list(df.iloc[0:N]['A'])})
where N is the number of rows you would like to print. That is it. If the values are not strings, you will need to cast the ints to strings.
Also, there are a lot of parameters in read_excel that control how the Excel file is loaded; you can go through the documentation to set suitable ones.
Hope this is helpful to you.
