Selecting 1.6M rows of a pandas dataframe [duplicate]

This question already has answers here:
"Large data" workflows using pandas [closed]
(16 answers)
Closed 2 years ago.
I have a csv file with ~2.3M rows. I'd like to save the subset (~1.6M) of the rows which have non-nan values in two columns of the dataframe. I'd like to keep using pandas to do this. Right now, my code looks like:
import pandas as pd
catalog = pd.read_csv('catalog.txt')
slim_list = []
for i in range(len(catalog)):
    if (pd.isna(catalog['z'][i]) == False and pd.isna(catalog['B'][i]) == False):
        slim_list.append(i)
which holds the rows of catalog which have non-nan values. I then make a new catalog with those rows as entries
slim_catalog = pd.DataFrame(columns = catalog.columns)
for j in range(len(slim_list)):
    # look up the saved row position, not the loop counter
    data = (catalog.iloc[slim_list[j]]).to_dict()
    slim_catalog = slim_catalog.append(data, ignore_index = True)
slim_catalog.to_csv('slim_catalog.csv')
This should, in principle, work. It's sped up a little by reading each row into a dict. However, it takes far, far too long to execute for all 2.3M rows. What is a better way to solve this problem?

This is the completely wrong way of doing this in pandas.
Firstly, never iterate over a range, i.e. for i in range(len(catalog)):, and then index into each row individually with catalog['z'][i]. That is incredibly inefficient.
Second, do not build a pandas.DataFrame by calling pd.DataFrame.append in a loop. Each call copies the whole frame, which is a linear operation, so the entire loop takes quadratic time. (DataFrame.append was also deprecated in pandas 1.4 and removed in 2.0.)
But you shouldn't be looping here to begin with. All you need is something like
catalog[catalog.loc[:, ['z', 'B']].notna().all(axis=1)].to_csv('slim_catalog.csv')
Or broken up to perhaps be more readable:
not_nan_zB = catalog.loc[:, ['z', 'B']].notna().all(axis=1)
catalog[not_nan_zB].to_csv('slim_catalog.csv')
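As a quick check of the mask-based approach, here is a minimal sketch on a tiny made-up frame. Only the column names z and B come from the question; the name column and all values are assumptions for illustration. It also shows dropna(subset=...), which should select the same rows.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for catalog.txt; only 'z' and 'B' come from the question.
catalog = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "z": [0.1, np.nan, 0.3, 0.4],
    "B": [21.0, 22.0, np.nan, 24.0],
})

# Boolean mask: True where both 'z' and 'B' are present.
not_nan_zB = catalog[["z", "B"]].notna().all(axis=1)
slim_catalog = catalog[not_nan_zB]

# dropna with subset= is a shorter spelling of the same selection.
slim_catalog_alt = catalog.dropna(subset=["z", "B"])

print(slim_catalog)
```

Either result can then be written out with .to_csv('slim_catalog.csv'), as in the answer.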

Related

What is the most efficient(fastest) way to create a dataframe?

I am working on a project that reads a couple of thousand text documents, creates a dataframe from them, and then trains a model on the dataframe. The most time-consuming aspect of the code is the creation of the dataframe.
Here is how I create the dataframe:
I first create 4-5 lists, then a dictionary with the column names as keys and those lists as values, and then pass the dictionary to pd.DataFrame. I have added print updates after each step, and the dataframe creation step takes the most time.
Method I am using:
line_of_interest = []
line_no = []
file_name = []
for file in file_names:
    with open(file) as txt:
        for i, line in enumerate(txt):
            if 'word of interest' in line:
                line_of_interest.append(line)
                line_no.append(i)
                file_name.append(file)  # record which file the line came from
rows = {'Line_no':line_no,'Line':line_of_interest,'File':file_name}
df = pd.DataFrame(data = rows)
I was wondering if there is a more efficient and less time-consuming way to create the dataframe. I tried looking for similar questions and the only thing I could find was "Most Efficient Way to Create Pandas DataFrame from Web Scraped Data".
Let me know if there is a similar question with a good answer. The only other method of creating a dataframe I know is appending row by row all the values as I discover them, and I don't know a way to check if that is quicker. Thanks!

Fastest way to generate new rows and append them to DataFrame

I want to change the target value in the data set within a certain interval. With 500 rows it takes about 1.5 seconds, but I have around 100,000 rows. Most of the execution time is spent in this process, and I want to speed it up.
What is the fastest and most efficient way to append rows to a DataFrame?
I tried the solution in this link and tried to create a dictionary, but I couldn't get it to work.
Here is the code, which takes around 1.5 seconds for 500 rows.
def add_new(df,base,interval):
    df_appended = pd.DataFrame()
    np.random.seed(5)
    s = np.random.normal(base,interval/3,4)
    s = np.append(s,base)
    for i in range(0,5):
        df_new = df
        df_new["DeltaG"] = s[i]
        df_appended = df_appended.append(df_new)
    return df_appended

DataFrames in pandas are stored in contiguous pieces of memory, so appending to or concatenating DataFrames is very inefficient: each such operation creates a new DataFrame and copies all the data from the old ones.
Basic Python structures such as lists and dicts are different. When you append a new element, Python just stores a reference to it in the existing structure.
So my advice is to do all your data processing on lists or dicts and convert them to a DataFrame at the end.
Another option is to preallocate a DataFrame of the final size and just change values in it using .iloc. But that works only if you know the final size of the resulting DataFrame.
Good examples with code: Add one row to pandas DataFrame
If you need more code examples, let me know.
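The advice above can be sketched for the add_new function from the question: build all the pieces in a plain list and concatenate once at the end. This is a sketch under assumptions, not the asker's final code. The input frame is made up, and the seeded generator np.random.default_rng(5) is a substitution that does not reproduce the values of np.random.seed(5).

```python
import numpy as np
import pandas as pd

def add_new(df, base, interval):
    """Return five stacked copies of df, each with its own DeltaG value."""
    # Seeded generator (assumption: exact random values do not matter here).
    rng = np.random.default_rng(5)
    s = np.append(rng.normal(base, interval / 3, 4), base)
    # assign() returns a new frame each time, so the input df is left untouched.
    pieces = [df.assign(DeltaG=value) for value in s]
    # A single concat copies the data once instead of once per iteration.
    return pd.concat(pieces, ignore_index=True)

# Hypothetical input frame for illustration.
df = pd.DataFrame({"x": [1, 2, 3]})
result = add_new(df, base=10.0, interval=3.0)
print(result.shape)
```

Because the copies are only combined in one pd.concat call, the cost grows linearly with the number of pieces rather than quadratically.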

def add_new(df1,base,interval,has_interval):
    dictionary = {}
    if has_interval == 0:
        for i in range(0,5):
            dictionary[i] = (df1.copy())
    elif has_interval == 1:
        np.random.seed(5)
        s = np.random.normal(base,interval/3,4)
        s = np.append(s,base)
        for i in range(0,5):
            df_new = df1
            df_new[4] = s[i]
            dictionary[i] = (df_new.copy())
    return dictionary
It works. It takes around 10 seconds for the whole data set. Thanks for your answers.

How to efficiently match values from 2 series and add them to a dataframe

I have a csv file "qwi_ak_se_fa_gc_ns_op_u.csv" which contains a lot of observations of 80 variables. One of them is geography which is the county. Every county belongs to something called a Commuting Zone (CZ). Using a matching table given in "czmatcher.csv" I can assign a CZ to every county given in geography.
The code below shows my approach. It simply goes through every row and finds its CZ by scanning the whole "czmatcher.csv" table for the matching row. Then I copy the value over using .loc. The problem is that this took over 10 hours to run on a 0.5 GB file (2.5 million rows), which isn't that much, and my intuition says this should be faster.
This picture illustrates what the csv files look like. The idea would be to construct the "Wanted result (CZ)" column, name it CZ and add it to the dataframe.
File example
import pandas as pd
data = pd.read_csv("qwi_ak_se_fa_gc_ns_op_u.csv")
czm = pd.read_csv("czmatcher.csv")
sLength = len(data['geography'])
data['CZ']=0
#this is just to fill the first value
for j in range(0,len(czm)):
    # compare against row j of the matcher table (the loop variable was unused before)
    if data.loc[0,'geography']==czm.loc[j,'FIPS']:
        data.loc[0,'CZ'] = czm.loc[j,'CZID']
#now fill the rest
for i in range(1,sLength):
    if data.loc[i,'geography']==data.loc[i-1,'geography']:
        data.loc[i,'CZ'] = data.loc[i-1,'CZ']
    else:
        for j in range(0,len(czm)):
            if data.loc[i,'geography']==czm.loc[j,'FIPS']:
                data.loc[i,'CZ'] = czm.loc[j,'CZID']
Is there a faster way of doing this?

The best way to do this is a left merge of your dataframes:
data = pd.read_csv("qwi_ak_se_fa_gc_ns_op_u.csv")
czm = pd.read_csv("czmatcher.csv")
I assume that the key column has the same name in both dataframes (country here):
data_final = data.merge(czm, how='left', on='country')
If it doesn't, you can rename the column first:
data.rename(columns={'col1': 'country'}, inplace=True)
Read the docs for further information: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
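With the column names from the question (geography in the data file, FIPS and CZID in the matching table), the merge can also be written without renaming anything by using left_on and right_on. The sketch below uses tiny made-up frames in place of the two CSV files, so all values are assumptions. The map alternative additionally assumes each FIPS code appears only once in the matching table.

```python
import pandas as pd

# Tiny made-up stand-ins for the two CSV files.
data = pd.DataFrame({"geography": [2013, 2013, 2016, 2020], "value": [1, 2, 3, 4]})
czm = pd.DataFrame({"FIPS": [2013, 2016, 2020], "CZID": [34101, 34101, 34102]})

# Left merge keeps every row of data and looks up the matching CZID.
merged = data.merge(czm, how="left", left_on="geography", right_on="FIPS")
merged = merged.drop(columns="FIPS").rename(columns={"CZID": "CZ"})

# Alternative: map each county through a Series indexed by FIPS
# (assumes FIPS values are unique in czm).
data["CZ"] = data["geography"].map(czm.set_index("FIPS")["CZID"])

print(merged)
```

Both forms replace the nested Python loops with a single vectorized lookup.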

To make it faster without reworking your whole solution, I would recommend using Dask DataFrames. Put simply, Dask reads your csv in chunks and processes each of them in parallel. After reading the csv, you can use the .compute method to get a pandas DataFrame instead of a Dask DataFrame.
It will look like this:
import pandas as pd
import dask.dataframe as dd  # import Dask DataFrames
# Use dd.read_csv instead of pd.read_csv
data = dd.read_csv("qwi_ak_se_fa_gc_ns_op_u.csv")
# compute() returns a pandas DataFrame; the index restarts at 0 in each
# partition, so reset it before indexing rows with .loc[i, ...]
data = data.compute().reset_index(drop=True)
czm = dd.read_csv("czmatcher.csv")
czm = czm.compute().reset_index(drop=True)
sLength = len(data['geography'])
data['CZ']=0
#this is just to fill the first value
for j in range(0,len(czm)):
    if data.loc[0,'geography']==czm.loc[j,'FIPS']:
        data.loc[0,'CZ'] = czm.loc[j,'CZID']
#now fill the rest
for i in range(1,sLength):
    if data.loc[i,'geography']==data.loc[i-1,'geography']:
        data.loc[i,'CZ'] = data.loc[i-1,'CZ']
    else:
        for j in range(0,len(czm)):
            if data.loc[i,'geography']==czm.loc[j,'FIPS']:
                data.loc[i,'CZ'] = czm.loc[j,'CZID']

How to populate arrays with values read in from csv via pandas?

I have created a DataFrame using pandas by reading a csv file. What I want to do is iterate down the rows and collect the values in column 1 into one array, then do the same for the values in column 2 with a different array. This seems like it should be a fairly easy thing to do, so I think I am missing something, but I can't find much online that isn't overly complicated or that does what I want. Stack questions like this one appear to be asking the same thing, but the answers are long and complicated. Is there no way to do this in a few lines of code? Here is what I have set up:
import pandas as pd
#available possible players
playerNames = []
df = pd.read_csv('Fantasy Week 1.csv')
What I anticipate I should be able to do would be something like:
for row in df.columns[1]:
    playerNames.append(row)
This however does not return the desired result.
Essentially, if df =
[1,2,3
4,5,6
7,8,9], I would want my array to be [1,4,7]

Do:
for row in df[df.columns[1]]:
    playerNames.append(row)
Or even better:
print(df[df.columns[1]].tolist())
In this case you want the 1st column's values, so do:
for row in df[df.columns[0]]:
    playerNames.append(row)
Or even better:
print(df[df.columns[0]].tolist())
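The original loop fails because df.columns[1] is only the column's label, so iterating over it walks through the characters of that name rather than the column's values. A minimal sketch with the 3x3 example from the question follows. The column names a, b, c are assumptions, since the question does not give any.

```python
import pandas as pd

# The 3x3 example from the question, with hypothetical column names.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=["a", "b", "c"])

# Select by position: all rows, first column.
first_column = df.iloc[:, 0].tolist()

# Select by label when the column name is known.
second_column = df["b"].tolist()

print(first_column, second_column)
```

Selecting with .iloc avoids the extra df.columns lookup when only the position of the column is known.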

What is a proper idiom in pandas for creating a dataframes from the output of a apply function on a df?

Edit --- I've made some progress, and discovered the drop_duplicates method in pandas, which saves some custom duplicate removal functions I created.
This changes the question in a couple of ways, b/c it changes my initial requirements.
One of the operations I need to conduct is grabbing the latest feed entries --- the feed urls exist in a column in a data frame. Once I've done the apply I get feed objects back:
import pandas as pd
import feedparser
import datetime
df_check_feeds = pd.DataFrame({'account_name':['NYTimes', 'WashPo'],'feed_url':['http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml', 'http://feeds.washingtonpost.com/rss/homepage'], 'last_update':['2015-12-28 23:50:40', '2015-12-28 23:50:40']})
df_check_feeds["feeds_results"] = pd.DataFrame(df_check_feeds.feed_url.apply(lambda feed_url: feedparser.parse(feed_url)))
df_check_feeds["entries"] = df_check_feeds.feeds_results.apply(lambda x: x.entries)
So now I'm stuck with the feed entries in the "entries" column. I'd like to create a new data frame for each row in one apply call and concatenate the resulting frames immediately.
I've expressed the equivalent in a for loop:
frames_list = []
for index in df_check_feeds.index:
    df_temp = pd.DataFrame(df_check_feeds.entries[index])
    # .ix has been removed from pandas; .loc does the same label-based lookup
    df_temp['account_name'] = df_check_feeds.loc[index,'account_name']
    # some error checking on the info here
    frames_list.append(df_temp)
df_total_results = pd.concat(frames_list)
df_total_results
I realize I could do this in a for loop (and indeed have written that), but I feel there is some better, more succinct pandas idiomatic way of writing this statement.

A more compact way could be:
df_total_results = df_check_feeds.groupby('account_name').apply(lambda x: pd.DataFrame(x['entries'].iloc[0]))
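Another option, sketched here under assumptions, is to hand pd.concat a dict of per-account frames so that the account name comes back as a column in one step. To keep the example self-contained, the entries below are made-up lists of dicts standing in for the feedparser results. The sketch also assumes account names are unique.

```python
import pandas as pd

# Hypothetical stand-in for the parsed feeds: each 'entries' cell is a list of dicts.
df_check_feeds = pd.DataFrame({
    "account_name": ["NYTimes", "WashPo"],
    "entries": [
        [{"title": "n1", "link": "u1"}, {"title": "n2", "link": "u2"}],
        [{"title": "w1", "link": "u3"}],
    ],
})

# One frame per account, keyed by account name.
frames = {
    name: pd.DataFrame(entries)
    for name, entries in zip(df_check_feeds["account_name"], df_check_feeds["entries"])
}

# concat uses the dict keys as the outer index level; move that level into a column.
df_total_results = (
    pd.concat(frames, names=["account_name"])
    .reset_index(level="account_name")
    .reset_index(drop=True)
)

print(df_total_results)
```

Any per-entry error checking can go inside the dict comprehension, which keeps the single pd.concat call at the end.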
