I am trying to remove non-English tweets from a large dataset in the most efficient way possible. I have tried creating a list of rows that are not English and then removing them, but removing each tweet one at a time takes a long time (the langid.classify() function is not the bottleneck).
def removeLanguage(df):
    rowsToDelete = []
    for i in df.index:  # loop over rows (restored here; without it `i` is undefined)
        text = df['tweet'][i]
        try:
            if langid.classify(text)[0] != 'en':
                rowsToDelete.append(i)
                continue
        except ValueError:
            rowsToDelete.append(i)
            continue
    for i in rowsToDelete:
        df.drop(i, inplace=True)
    return df  # returned so the chained reset_index below works

newDf = removeLanguage(inputDf).reset_index(drop=True)
Is there a more efficient way to remove a set of rows from a DataFrame than df.drop()?
df.drop is pretty efficient, but instead of dropping row by row I'd build a boolean mask and filter in one step, something like this (langid.classify works on a single string, so it goes through apply):
df = df[df['tweet'].apply(lambda t: langid.classify(t)[0] == 'en')]
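If some tweets can make langid raise ValueError, as the try/except in your question suggests, a small wrapper keeps the same mask idea. This is a hedged sketch; inputDf and the 'tweet' column name are taken from your snippet:
import langid

def is_english(text):
    # Treat anything langid cannot classify as non-English,
    # mirroring the except-branch in the original function.
    try:
        return langid.classify(text)[0] == 'en'
    except ValueError:
        return False

# Build the mask once and filter in a single step instead of dropping rows one by one.
newDf = inputDf[inputDf['tweet'].apply(is_english)].reset_index(drop=True)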
I'm trying to build a TensorFlow application in Python, but after importing my data I needed to normalize it. No problem there, except all my columns are now titled like palm.velocity.x. I found a way to rename all of these columns (shown below); there are 230 of them in total, so df.rename and similar methods aren't much help unless they can be used like df.apply, but from what I've looked at there doesn't seem to be a way.
import re

def FixColumnHeading(column):
    columns = re.split(r'\.', column)
    name = []
    for word in range(len(columns)):
        if word > 0:
            columns[word] = columns[word].capitalize()
        name.append(columns[word])
    newColumn = ''
    for part in name:
        newColumn += part
    return newColumn

normalisedData.columns = normalisedData.columns.to_series().apply(lambda x: FixColumnHeading(x))
If anyone can think of a way to improve, please put what you would change below :)
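One possible simplification (a hedged sketch, not tested against your data): df.rename also accepts a callable as the columns mapper, so the helper above can be passed to it directly, avoiding the to_series().apply() round trip:
# Assumes FixColumnHeading and normalisedData from the snippet above.
normalisedData = normalisedData.rename(columns=FixColumnHeading)

# Or, equivalently, inline for the "palm.velocity.x" -> "palmVelocityX" convention:
normalisedData.columns = [
    parts[0] + ''.join(p.capitalize() for p in parts[1:])
    for parts in (c.split('.') for c in normalisedData.columns)
]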
I have a Pandas dataframe (pandas is imported as pd; the frame itself is df in the code below). I am attempting to use a nested for-loop to iterate through each tuple (row) of the dataframe and, at each iteration, compare it with all other tuples in the frame. During the comparison step, I am using Python's difflib.SequenceMatcher().ratio() and dropping tuples that have a high similarity (ratio > 0.8).
Problem:
Unfortunately, I am getting a KeyError after the first outer-loop iteration.
I suspect that, by dropping the tuples, I am invalidating the outer-loop's indexer. Or, I am invalidating the inner-loop's indexer by attempting to access an element that doesn't exist (dropped).
Here is the code:
import json
import pandas as pd
import pyreadline
import pprint
from difflib import SequenceMatcher

def similar(a, b):
    # Helper assumed from the description above: SequenceMatcher similarity ratio.
    return SequenceMatcher(None, a, b).ratio()

# Note, this file, 'tweetsR.json', was originally csv, but has been translated to json.
with open("twitter data/tweetsR.json", "r") as read_file:
    data = json.load(read_file)  # Load the source data set, esport tweets.

df = pd.DataFrame(data)  # Load data into a pandas (pd) data frame for pandas utilities.
df = df.drop_duplicates(['text'], keep='first')  # Drop tweets with identical text content.
                                                 # Note, these tweets are likely reposts/retweets, etc.
df = df.reset_index(drop=True)  # Adjust the index to reflect dropping of duplicates.

def duplicates(df):
    for ind in df.index:
        a = df['text'][ind]
        for indd in df.index:
            if indd != 26747:  # Trying to prevent an overstep KeyError here
                b = df['text'][indd + 1]
                if similar(a, b) >= 0.80:
                    df.drop((indd + 1), inplace=True)
        print(str(ind) + "Completed")  # Debugging statement, tells us which iterations have completed

duplicates(df)
Error Output:
Can anyone help me understand this and/or fix it?
One solution, which was mentioned by @KazuyaHatta, is itertools.combinations(). Although, the way I've used it (there may be another way), it's O(n^2). So, in this case, with roughly 27,000 tuples, it's nearly 357,714,378 combinations to iterate over (too long).
Here is the code:
import itertools

# Create a set of the dropped tuples and run this code on bizon overnight.
def duplicates(df):
    # Find out how to improve the speed of this
    excludes = set()
    combos = itertools.combinations(df.index, 2)
    for combo in combos:
        if str(combo) not in excludes:
            if similar(df['text'][combo[0]], df['text'][combo[1]]) > 0.8:
                excludes.add(f'{combo[0]}, {combo[1]}')
                excludes.add(f'{combo[1]}, {combo[0]}')
                print("Dropped: " + str(combo))
                print(len(excludes))

duplicates(df)
My next step, which @KazuyaHatta described, is to attempt the dropping-by-mask method.
Note: I unfortunately won't be able to post a sample of the dataset.
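For what it's worth, here is a hedged sketch of how that drop-by-mask idea could look. It is untested, assumes the df built above with its 'text' column and the similar() helper, and drops everything in one pass at the end instead of mutating the frame inside the loop:
import itertools

def drop_near_duplicates(df, threshold=0.8):
    to_drop = set()
    for i, j in itertools.combinations(df.index, 2):
        # Skip pairs where one side is already marked, so a dropped row is never looked up.
        if i in to_drop or j in to_drop:
            continue
        if similar(df['text'][i], df['text'][j]) >= threshold:
            to_drop.add(j)  # Keep the first occurrence, mark the later one.
    # One drop at the end keeps the index valid for the whole loop.
    return df.drop(index=to_drop).reset_index(drop=True)

df = drop_near_duplicates(df)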
I need to modify some values of a Pandas dataframe based on a test, and leave the other values intact. I also need to leave the order of the rows intact.
I have working code, based on iterating over the dataframe's rows, but it's horrendously slow. Is there a quicker way to get it done?
Here are two examples of this very slow code:
for index, row in df.iterrows():
    if df.number[index].is_integer():
        df.number[index] = int(df.number[index])

for index, row in df.iterrows():
    if df.string[index] == "XXX":
        df.string[index] = df.other_colum[index].split("\")[0] + df.other_colum[index].split("\")[1]
    else:
        df.string[index] = df.other_colum[index].split("\")[1] + df.other_colum[index].split("\")[0]
Thanks
Generally you want to avoid iterating through rows in a pandas dataframe, as it is slower than the vectorized methods pandas provides for accomplishing the same thing. One way of getting around this is using apply. You would redefine the number column:
df["number"] = df["number"].apply(lambda x: int(x) if x.is_integer() else x)
And (re)define the string column:
df["string"] = df["other column"].apply(lambda x: x.split("\\")[0] + x.split("\\")[1] if x == r"XX\X" else x.split("\\")[1] + x.split("\\")[0])
I made some assumptions based on the data you removed from the problem setup: .split("\") is incorrect syntax, and "other column" above necessarily has to contain a backslash in order for your code (and mine) to work, otherwise .split("\\")[1] will raise an IndexError.
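If the lambda gets unwieldy, a vectorized variant is possible too. This is a hedged sketch under the same assumptions: the column is literally named "other column", it always contains one backslash, and the flag value is r"XX\X":
import numpy as np

# Split once into two columns, then pick the concatenation order with a vectorized condition.
parts = df["other column"].str.split("\\", n=1, expand=True)
df["string"] = np.where(df["string"] == r"XX\X",
                        parts[0] + parts[1],
                        parts[1] + parts[0])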
I'll post a little bit of my code here. Basically I've been manually removing one row at a time that I don't want, but I want it to look nicer than that, so I'm wondering if there's a cleaner way that allows me to delete everything in one line.
data = data[data.city_or_county != 'Alma']
data = data[data.city_or_county != 'Alpine']
data = data[data.city_or_county != 'Altadena']
data = data[data.city_or_county != 'Alsip']
You could use .isin:
data[~data.city_or_county.isin(["Alma", "Alpine", "Alsip", "Altadena"])]
I have a Dataframe with 3 columns:
id,name,team
101,kevin, marketing
102,scott,admin\n
103,peter,finance\n
I am trying to apply a regex function such that I remove the unnecessary spaces. I have got the code that removes these spaces; however, I am unable to loop it through the entire Dataframe.
This is what I have tried thus far:
df['team'] = re.sub(r'[\n\r]*','',df['team'])
But this throws an error AttributeError: 'Series' object has no attribute 're'
Could anyone advise how I could loop this regex through the entire df['team'] column of the Dataframe?
You are almost there; there are two simple ways of doing this:
# option 1 - faster way
df['team'] = [re.sub(r'[\n\r]*','', str(x)) for x in df['team']]
# option 2
df['team'] = df['team'].apply(lambda x: re.sub(r'[\n\r]*','', str(x)))
As long as it's a dataframe, check replace: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html
df['team'].replace( { r"[\n\r]+" : '' }, inplace= True, regex = True)
Regarding the regex: '*' means 0 or more; what you need is '+', which is 1 or more.
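A quick hedged illustration of that * vs + point on plain strings (not from the original answer, just to show why the empty-match case matters):
import re

# 'x*' also matches the empty string at every position, so the replacement fires everywhere.
print(re.sub(r'x*', '-', 'abc'))  # -a-b-c-
# 'x+' needs at least one 'x', so a string without any is left untouched.
print(re.sub(r'x+', '-', 'abc'))  # abc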
Here's a powerful technique to replace multiple words in a pandas column in one step without loops. In my code I wanted to eliminate things like 'CORPORATION', 'LLC', etc. (all of them are in the RemoveDB.csv file) from my column without using a loop. In this scenario I'm removing 40 words from the entire column in one step.
RemoveDB = pd.read_csv('RemoveDB.csv')
RemoveDB = RemoveDB['REMOVE'].tolist()
RemoveDB = '|'.join(RemoveDB)
pattern = re.compile(RemoveDB)
df['NAME']= df['NAME'].str.replace(pattern,'', regex = True)
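One hedged caveat to add: the '|'.join() approach treats every list entry as a regex fragment, so if any entry contains metacharacters (a dot in 'CO.', a plus sign, parentheses), escaping them first keeps the pattern literal. Assuming RemoveDB is still the list produced by .tolist() above, before the join:
# Escape each word before joining, so e.g. 'CO.' matches the literal dot only.
pattern = re.compile('|'.join(re.escape(word) for word in RemoveDB))
df['NAME'] = df['NAME'].str.replace(pattern, '', regex=True)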
Another example (without regex), but maybe still useful for someone.
id = pd.Series(['101','102','103'])
name = pd.Series(['kevin','scott','peter'])
team = pd.Series([' marketing','admin\n', 'finance\n'])
testsO = pd.DataFrame({'id': id, 'name': name, 'team': team})
print(testsO)
testsO['team'] = testsO['team'].str.strip()
print(testsO)
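A hedged follow-up on this last example: .str.strip() only trims whitespace at the ends of each string, which is enough for the sample values (' marketing', 'admin\n'); if newlines could also appear in the middle of a value, the regex-based replace from the earlier answers can be chained with it, for example:
# Remove embedded newlines/carriage returns first, then trim the ends.
testsO['team'] = testsO['team'].str.replace(r'[\n\r]+', '', regex=True).str.strip()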