I have a large dataframe (~4.7M rows) where one of the columns contains document text. I am trying unsuccessfully to run Gensim summarize on a specific column for the entire dataframe.
df['summary'] = df['variable_content'].apply(lambda x: summarize(x, word_count=200))
Extracting each row of variable_content into a variable and running summarize works well, but is slow and ugly. I also get the error:
ValueError: input must have more than one sentence
but I can't find a row with only one sentence (most have hundreds or thousands). Can anyone help?
You have 4.7 million rows, each of which has hundreds or thousands of sentences, and you expect this to work in finite time? That's what I call "optimism". I would suggest looping through your dataframe and running summarize in chunks of, say, 1,000 rows, saving the work and printing the chunk number as you go. Once it fails, you will know roughly where the failure was, plus you will actually get some results.
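A minimal sketch of that chunked approach, assuming the df and variable_content names from the question, a pre-4.0 gensim (where gensim.summarization.summarize still exists), and that rows tripping the one-sentence check should be skipped rather than abort the run:

import pandas as pd
from gensim.summarization import summarize  # removed in gensim 4.0

def safe_summarize(text):
    # Skip rows that trip the "more than one sentence" check instead of aborting.
    try:
        return summarize(text, word_count=200)
    except ValueError:
        return None

chunk_size = 1000
results = []
for start in range(0, len(df), chunk_size):
    chunk = df['variable_content'].iloc[start:start + chunk_size]
    summaries = chunk.apply(safe_summarize)
    results.append(summaries)
    # Checkpoint each chunk so a crash does not lose earlier work.
    summaries.to_csv('summaries_checkpoint.csv', mode='a', header=False)
    print('finished chunk starting at row', start)

df['summary'] = pd.concat(results)

This also surfaces which chunks (and, via the None values, which rows) triggered the ValueError.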
I work at a place where I use a lot of Python (pandas), and the data keeps getting bigger and bigger: last month I was working with a few hundred thousand rows, a few weeks later with a few million, and now I am working with 42 million rows. Most of my work is taking a dataframe and, for each row, looking up its "equivalent" in another dataframe and processing the data; sometimes it's just a merge, but more often I need to run a function on the equivalent data. Back when it was a few hundred thousand rows it was OK to just use apply and a simple filter, but now it is extremely slow. Recently I switched to vaex, which is way faster than pandas in every respect except apply, and after some searching I found that apply is a last resort that should only be used when there is no other option. So, is there another option? I really don't know.
Some code to explain how I was doing this entire time:
def get_secondary(row: pd.Series):
    cnae = row["cnae_fiscal"]
    cnpj = row["cnpj"]
    # cnaes is another dataframe
    secondary = cnaes[cnaes.cnpj == cnpj]
    return [cnae] + list(secondary["cnae"].values)

empresas["cnae_secundarios"] = empresas.apply(get_secondary, axis=1)
This isn't the only use case, as I said.
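For the lookup pattern above, one common alternative is to precompute the secondary values with a single groupby over cnaes and then map them back onto empresas, so cnaes is not filtered once per row. A rough sketch, assuming empresas and cnaes are the dataframes shown above with columns cnpj, cnae and cnae_fiscal:

# Group the secondary CNAEs once: one pass over cnaes instead of one
# boolean filter of cnaes per row of empresas.
secondary_by_cnpj = cnaes.groupby("cnpj")["cnae"].agg(list)

# Align the precomputed lists to empresas without a per-row lookup.
mapped = empresas["cnpj"].map(secondary_by_cnpj)

# Prepend the primary cnae_fiscal; cnpjs with no secondary entries get an empty list.
empresas["cnae_secundarios"] = [
    [cnae] + (extra if isinstance(extra, list) else [])
    for cnae, extra in zip(empresas["cnae_fiscal"], mapped)
]

The list comprehension is still a Python-level loop, but it only does cheap list concatenation; the expensive part (the per-cnpj lookup) becomes a single groupby plus a hash-based map.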
While compiling a pandas table to plot certain activity on a tool, I have encountered a rare error in the data that creates an extra 2 columns for certain entries. This means that one of my computed columns goes into the table 2 cells further along than the others and kills the plot.
I was hoping to find a way to pull the contents of a single cell in a row and swap it into the other cell beside it, which contains irrelevant information in the error case, but which is used for the plot of all the other pd data.
I've tried a couple of different ways to swap the data around but keep hitting errors.
My attempts to fix it include:
for rows in df['server']:
    if '%USERID' in line:
        df['server'] = df[7]  # both versions of this and below
        df['server'].replace(df['server'], df[7])
    else:
        pass

if '%USERID' in df['server']:  # Attempt to fix missing server name
    df['server'] = df[7];
else:
    pass

if '%USERID' in df['server']:
    return row['7'], row['server']
else:
    pass
I'd like the data from column '7' to be replicated in 'server', but only in the error case, i.e. where the cell in 'server' contains a string starting with '%USERID'.
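For what it's worth, that conditional copy can be done without a loop using a boolean mask and .loc; a sketch, assuming the extra column is keyed as 7 (the snippets above use both df[7] and row['7']):

# Rows where the 'server' cell actually holds the misplaced %USERID token.
mask = df['server'].astype(str).str.startswith('%USERID')

# Only for those rows, copy the value from column 7 into 'server'.
df.loc[mask, 'server'] = df.loc[mask, 7]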
Turns out I was over-thinking this one. I took a step back, worked the code a bit and solved it.
Rather than trying to smash together a one-size-fits-all bit of code for all the data, I built separate lists for the general data and the 2 exceptions I found by writing a nested loop, and created 3 data frames. These were easy enough to manipulate individually and finally concatenate together. All working fine now.
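A rough sketch of that split-and-recombine pattern, with made-up masks standing in for the two exception cases (the actual conditions and fixes aren't given above, so treat every condition here as a placeholder):

import pandas as pd

# Hypothetical masks for the two exception cases.
exception_1 = df['server'].astype(str).str.startswith('%USERID')
exception_2 = df['server'].isna()
general = ~(exception_1 | exception_2)

# Handle each subset separately, then stitch them back together.
df_general = df[general]
df_exc1 = df[exception_1].assign(server=df.loc[exception_1, 7])
df_exc2 = df[exception_2].assign(server='unknown')

df_fixed = pd.concat([df_general, df_exc1, df_exc2]).sort_index()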
I have a .csv file with 3 columns and 500,000+ lines.
I'm trying to get insight into this dataset by counting occurrences of certain tags.
When I started, I used Notepad++'s count function for the tags I found and noted the results by hand. Now that I want to automate that process, I use pandas to do the same thing, but the results differ quite a bit.
Results for all tags summed up are:
Notepad++: 91,500
Excel: 91,677
Python/pandas: 91,034
Quite a difference, and I have no clue how to explain it or how to validate which result I can trust and use.
My Python code looks like this and is fully functional.
import pandas as pd

# CSV.READ | delimiter: ';' | dtype: string | using only the first 3 columns
df = pd.read_csv("xxx.csv", sep=';', dtype="str")

# Fill NaN with "Empty" to allow string operations on every cell
df = df.fillna("Empty")

# Count and sort occurrences of each value in the 'object' column
occurrences = df['object'].value_counts()

# Keep only the rows whose 'object' value contains "Category:"
tags_occurrences = df[df['object'].str.contains("Category:")]

# Count and display occurrences of each tag among those rows
tags_occurrences2 = tags_occurrences['object'].value_counts()
Edit: I already iterated through the other columns, which resulted in finding another 120 tags, but there is still a discrepancy.
In Excel and Notepad++ I just open Ctrl+F, search for "Category:" and use their count functions.
Has anyone had a similar experience or can explain what might cause this?
Are Excel and Notepad++ making errors while counting? I can't imagine pandas (used so heavily in ML and data science) would have such flaws.
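One plausible source of the gap is that Ctrl+F counts every occurrence of the substring (several per cell, and in all columns), while value_counts() on one column counts matching cells once each. A sketch of reproducing the substring count on the pandas side, assuming the same xxx.csv and ';' separator as above:

import pandas as pd

df = pd.read_csv("xxx.csv", sep=';', dtype="str").fillna("Empty")

# Count every occurrence of the substring in every column,
# not just one match per cell of the 'object' column.
total = sum(df[col].str.count("Category:").sum() for col in df.columns)
print(total)

If that number lines up with the Notepad++/Excel figures, the difference is in what is being counted, not in any of the tools miscounting.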
I have a very large data file (foo.sas7bdat) that I would like to filter rows from without loading the whole data file into memory. For example, I can print the first 20 rows of the dataset without loading the entire file into memory by doing the following:
import pandas
import itertools
with pandas.read_sas('foo.sas7bdat') as f:
    for row in itertools.islice(f, 20):
        print(row)
However, I am unclear on how to only print (or preferably place in a new file) only rows that have any column that contain the number 123.1. How can I do this?
Pandas has the ability to pull dataframes one chunk at a time. Following the trail of read_sas() documentation to "chunksize" I came across this:
http://pandas.pydata.org/pandas-docs/stable/io.html#iterating-through-files-chunk-by-chunk
for chunk in pd.read_sas('foo.sas7bdat', iterator=True, chunksize=100000):
    print(chunk)
This would get chunks of 100,000 lines.
As for the filtering, you would need a query on each chunk. I don't know the constraints of the problem, but if you make a DataFrame with all the columns you still might overflow your memory, so an efficient way would be to collect the matching indexes into a set, sort them, and use .iloc to get those entries if you want to put them into a DataFrame.
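A sketch of that chunk-by-chunk filter, writing matches straight to a new CSV; it assumes 123.1 can appear in any column and uses exact float equality, which may need to be relaxed (e.g. with numpy.isclose) depending on the data:

import pandas as pd

first = True
for chunk in pd.read_sas('foo.sas7bdat', chunksize=100000):
    # Keep rows where any column equals 123.1.
    matches = chunk[(chunk == 123.1).any(axis=1)]
    if not matches.empty:
        # Append to the output file; write the header only once.
        matches.to_csv('matches.csv', mode='w' if first else 'a', header=first, index=False)
        first = False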
You may need to use tools that take this into account. Dask is a good alternative for use on clusters.
I have a huge data frame with about 1,041,507 rows.
I wanted to calculate a rolling median for certain values, under certain categories in my data frame. For this I used a groupby followed by apply:
df['rolling_median']=df['value'].groupby(['Category','Subcategory']).apply(pd.rolling_median,7,min_periods=7)
However, this gives me MemoryError: skiplist_insert failed. I will attach the full traceback if needed, but I came across another similar post which points out that this is an issue in pandas for very large sizes (>~ 35000), as can be seen here: https://github.com/pydata/pandas/issues/11697
After this I tried a bit of manipulation to simply get the rolling median by iterating over each group separately:
for index, group in df.groupby(['Category','Subcategory']):
    print pd.rolling_median(group['value'], 7, min_periods=7)
Each group is about 20-25 rows long only. Yet this function fails and shows the same MemoryError after a few iterations.
I ran the code several times, and every time it showed the memory error for different items.
I created some dummy values for anyone to test, here:
import numpy as np
import pandas as pd

index = []
[index.append(x) for y in range(25) for x in np.arange(34000)]
sample = pd.DataFrame(np.arange(34000*25), index=index)

for index, group in sample.groupby(level=0):
    try:
        pd.rolling_median(group[0], 7, 7)
    except MemoryError:
        print index  # note which group triggered the error
        print pd.rolling_median(group[0], 7, 7)
If I run rolling_median again after encountering the MemoryError (as you can see in the above code), it runs fine without raising another exception.
I am not sure how I can calculate my rolling median if it keeps throwing the MemoryError.
Can anyone tell me a better way to calculate the rolling_median, or help me understand the issue here?
The groupby doesn't look right and should change
df['rolling_median']=df['value'].groupby(['Category','Subcategory']).apply(pd.rolling_median,7,min_periods=7)
to
df['rolling_median']=df.groupby(['Category','Subcategory'])['value'].apply(pd.rolling_median,7,min_periods=7)
Otherwise the groupby won't work: df['value'] is a Series containing only the "value" column, so it can't group by ['Category','Subcategory'] because those columns aren't present in it.
Also, the groupby result is going to be smaller than the length of the dataframe, and creating df['rolling_median'] from it will cause a length mismatch.
Hope that helps.
The bug has been fixed in pandas 0.18.0, and the functions rolling_mean() and rolling_median() have since been deprecated.
This was the bug: https://github.com/pydata/pandas/issues/11697
Can be viewed here: http://pandas.pydata.org/pandas-docs/stable/computation.html
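For reference, with the post-0.18 API the deprecated rolling_median() call becomes a groupby followed by .rolling().median(); a sketch assuming the same Category, Subcategory and value column names as the question:

df['rolling_median'] = (
    df.groupby(['Category', 'Subcategory'])['value']
      .rolling(window=7, min_periods=7)
      .median()
      # Drop the group-key levels so the result aligns back onto df's index.
      .reset_index(level=['Category', 'Subcategory'], drop=True)
)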