So the skiprows argument of pd.read_csv() allows you to skip specific rows. But how can I read only selected rows into a pandas dataframe? I have a list of row indices which I need to read from the file; how can I achieve that? Passing skiprows=~line_nos does not work, since the unary operator does not work on lists.
Currently using this method to read out the lines:
def picklines(thefile, whatlines):
    return [x for i, x in enumerate(thefile) if i in whatlines]
And then converting the result into a dataframe. But I'm wondering if there's a better way to do so.
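For reference, this is roughly how that current approach gets wired back into pandas (the filename and line numbers here are just placeholders):

import io
import pandas as pd

# Hypothetical usage of picklines: grab the selected lines (including line 0,
# the header), then hand them back to pandas via a StringIO buffer.
line_nos = {0, 2, 5, 7}
with open("data.csv") as f:
    selected = picklines(f, line_nos)
df = pd.read_csv(io.StringIO("".join(selected)))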
You could use a lambda function to achieve this.
# rows_to_keep are the line_nos you would like to keep
pd.read_csv(path_to_csv, skiprows = lambda x: x not in rows_to_keep)
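One caveat worth noting: the lambda is evaluated against file line numbers, so line 0 is the header. A minimal sketch, with a placeholder filename and row numbers:

import pandas as pd

rows_to_keep = [0, 2, 5, 7]  # keep 0 so the header line is not skipped
df = pd.read_csv("data.csv", skiprows=lambda x: x not in rows_to_keep)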
I am trying to add a new column at the end of my pandas dataframe that will contain the values of the previous cells as key:value pairs. I have tried the following:
import json
df["json_formatted"] = df.apply
(
lambda row: json.dumps(row.to_dict(), ensure_ascii=False), axis=1
)
It creates the column json_formatted successfully with all the required data, but the problem is that it also adds json_formatted as another extra key. I don't want that. I want the json data to contain only the information from the original df columns. How can I do that?
Note: I made ensure_ascii=False because the column names are in Japanese characters.
Create a new variable holding the created column and add it afterwards:
json_formatted = df.apply(lambda row: json.dumps(row.to_dict(), ensure_ascii=False), axis=1)
df['json_formatted'] = json_formatted
This behaviour shouldn't happen, but it might be caused by running this code more than once (you added the column, and then ran df.apply on the same dataframe).
You can avoid this by making your columns explicit: df[['col1', 'col2']].apply()
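For example, a sketch of that explicit-columns variant ('col1' and 'col2' are placeholders for your original column names):

import json

# Selecting the original columns explicitly keeps json_formatted out of the
# dumped dict even if the cell has been run before.
df["json_formatted"] = df[["col1", "col2"]].apply(
    lambda row: json.dumps(row.to_dict(), ensure_ascii=False), axis=1
)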
Apply is an expensive operation in Pandas, and if performance matters it is better to avoid it. An alternative way to do this is:
df["json_formatted"] = [json.dumps(s, ensure_ascii=False) for s in df.T.to_dict().values()]
I have the following code:
def _load_data_set(self, dataset_file):
    data_frame = pandas.read_csv(dataset_file)
    table = data_frame.values.tolist()
    rows = len(table)
    patients = []
    for i in range(rows):
        first = table[i][0]
        rest = table[i]
        rest.pop(0)
        p = PatientDataSet(first, rest)
        patients.append(p)
    return patients
This code basically iterates over a CSV file (with a header) and, for each row, splits off the first field from the rest and creates a list of PatientDataSet objects.
The input: CSV file with header.
The output: List of PatientDataSet objects.
Although it works, I really don't like how I implemented it, because I pop the first column and the code looks really ugly. Can you suggest a better way to do it?
I'd do it like this:
def _load_data_set(self, dataset_file):
    df = pandas.read_csv(dataset_file, index_col=0)
    return [PatientDataSet(first, rest.tolist()) for first, rest in df.iterrows()]
Passing index_col=0 to read_csv() naturally separates the first column from the rest.
iterrows() gives you the index and the values for each row, which is exactly what you need.
You may be able to remove .tolist() if your PatientDataSet does not actually need a list but can accept rest directly as a Pandas Series. That would be better, to avoid an unnecessary conversion.
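In that case the function would reduce to something like this (a sketch, assuming PatientDataSet can take a Series for rest):

def _load_data_set(self, dataset_file):
    df = pandas.read_csv(dataset_file, index_col=0)
    # rest stays a pandas Series here; no .tolist() conversion
    return [PatientDataSet(first, rest) for first, rest in df.iterrows()]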
I modified a line from this post to conditionally read rows from a csv file:
filename=r'C:\Users\Nutzer\Desktop\Projects\UK_Traffic_Data\test.csv'
df = (pd.read_csv(filename, error_bad_lines=False) [lambda x: x['Accident_Index'].str.startswith('2005')])
This line works perfectly fine for a small test dataset. However, I do have a big csv file to read and it takes a very long time to read the file. Actually, eventually the NotebookApp.iopub_data_rate_limit is reached. My questions are:
Is there a way to improve this code and its performance?
The records in the "Accident_Index" column are sorted. Therefore, it may be a solution to break out of the read statement if a value is reached where "Accident_Index" does not equal str.startswith('2005'). Do you have a suggestion on how to do that?
Here is some example data:
The desired output should be a pandas dataframe containing the top six records.
We could initially read just the specific column we want to filter on with the above condition (assuming this reduces the reading overhead significantly).
# reading the mask column
df_indx = (pd.read_csv(filename, error_bad_lines=False, usecols=['Accident_Index'])
           [lambda x: x['Accident_Index'].str.startswith('2005')])
We could then use the values from this column to read the remaining columns from the file using the skiprows and nrows parameters, since the values are sorted in the input file:
df_data = pd.read_csv(filename, error_bad_lines=False, header=0,
                      skiprows=df_indx.index[0], nrows=df_indx.shape[0])
df_data.columns = ['Accident_index', 'data']
This would give a subset of the data we want. We may not need to get the column names separately.
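Put together, here is a sketch of the same two-pass idea that also keeps the header row (the filename and column name follow the question; error_bad_lines is omitted since newer pandas versions replace it with on_bad_lines):

import pandas as pd

filename = "test.csv"

# Pass 1: read only the filter column and locate the matching block.
idx = pd.read_csv(filename, usecols=["Accident_Index"])
mask = idx["Accident_Index"].str.startswith("2005")

# Pass 2: the column is sorted, so the matches are contiguous. File line 0 is
# the header, so the first matching data row sits at file line first + 1.
first = int(mask.idxmax())   # position of the first matching row
count = int(mask.sum())      # number of matching rows
df_data = pd.read_csv(filename, skiprows=range(1, first + 1), nrows=count)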
I am trying to export a list of pandas dataframes to individual csv files.
I currently have this:
import pandas as pd
import numpy as np
data = {"a":[1,2,3,4,5,6,7,8,9], "b":[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]}
df = pd.DataFrame(data, columns=["a", "b"])
df = np.array_split(df, 3)
I have tried:
for i in df:
    i.to_csv((r'df.csv'))
However this doesn't output all the sub-dataframes, only the last one.
How do I get this to output all the df, with the outputted csv having the names df1.csv, df2.csv, and df3.csv?
It did output all three. It's just that the second one overwrote the first, and likewise with the last one. You have to write them out to three separate filenames.
To achieve this, we need to modify the string based on where we are in the loop. The easiest way to do this is with a counter on the loop. Since the variable 'i' is normally reserved for such counters, I'm going to rename your dummy variable to _df. Don't get confused by it. To get a counter in the loop, we use enumerate.
for i, _df in enumerate(df):
    print(i)
    filename = 'df' + str(i) + '.csv'
    _df.to_csv(filename)  # I think the extra parentheses are unnecessary?
Edit: Just to note, the advantage of this over specifying all the filenames in a list is that you don't need to know the length of the list in advance. It is also helpful if you do know the length but it is large. If it's 3, you know it's 3, and that won't change, then you can specify the filenames as suggested elsewhere.
You can use floor division on the index of the original (pre-split) dataframe, followed by a groupby, to create your individual frames.
for data, group in df.groupby(df.index // 3):
    group.to_csv(f"df{data+1}.csv")
You're writing each data frame to the same file, 'df.csv'. In your for loop, you can specify both the dataframes to save and the files to save them to with zip().
>>> for i, outfile in zip(df, ["df1.csv", "df2.csv", "df3.csv"]):
...     i.to_csv(outfile)
You can do this particular task a number of ways. Here's a loop with enumerate() so you don't have to write the whole list of filenames.
>>> for j, frame in enumerate(df):
...     frame.to_csv(f"df{j+1}.csv")
The data is getting replaced in the same file with each iteration.
Try:
for i, value in enumerate(df):
    value.to_csv('/path/to/folder/df' + str(i) + '.csv')
I have a dataframe in pandas and my goal is to write each row of the dataframe as a new json file.
I'm a bit stuck right now. My intuition was to iterate over the rows of the dataframe (using df.iterrows) and use json.dumps to dump the file but to no avail.
Any thoughts?
Looping over indices is very inefficient.
A faster technique:
df['json'] = df.apply(lambda x: x.to_json(), axis=1)
Pandas DataFrames have a to_json method that will do it for you:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_json.html
If you want each row in its own file you can iterate over the index (and use the index to help name them):
for i in df.index:
    df.loc[i].to_json("row{}.json".format(i))
Extending the answer of #MrE, if you're looking to convert multiple columns from a single row into another column with the content in json format (and not separate json files as output) I've had speed issues while using:
df['json'] = df.apply(lambda x: x.to_json(), axis=1)
I've achieved significant speed improvements on a dataset of 175K records and 5 columns using this line of code:
df['json'] = df.to_json(orient='records', lines=True).splitlines()
Speed went from >1 min to 350 ms.
Using apply, this can be done as follows:
import json

def writejson(row):
    with open(row["filename"] + '.json', "w") as outfile:
        json.dump(row["json"], outfile, indent=2)

in_df.apply(writejson, axis=1)
Assuming the dataframe has a column named "filename" with filename for each json row.
Here's a simple solution:
Transform the dataframe to JSON per record, one JSON object per line, then simply split the lines:
list_of_jsons = df.to_json(orient='records', lines=True).splitlines()