Drop rows that contain a specific tuple value - python

I have a dataset with a column whose values are tuples of the form ('String', int).
I would like to drop all the rows that contain ('String1', 1), ('String2', 1), or ('String3', 1). I have tried many things but can't get the rows to drop.

I'm not sure what your data looks like, but it sounds like you can just filter out values equal to ('String', 1)
df = df[df['your column'] != ('String', 1)]
For multiple values:
df = df[~(df['your column'].str[0].isin(['String1', 'String2', 'String3']) & (df['your column'].str[1] == 1))]
(Note the extra parentheses around the == comparison: & binds more tightly than ==, so without them the expression is evaluated in the wrong order.)
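If the tuples you want to drop are known in full, a simpler option is to test the whole tuple with isin. A minimal, self-contained sketch (assuming the column is called 'your column', as above):
import pandas as pd

df = pd.DataFrame({'your column': [('String1', 1), ('String2', 2), ('String3', 1), ('Other', 1)]})
drop_these = [('String1', 1), ('String2', 1), ('String3', 1)]
df = df[~df['your column'].isin(drop_these)]  # keeps ('String2', 2) and ('Other', 1)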

Related

Pandas multi index loc

I have a pandas dataframe for graph edges with a multi-index, like so:
df = pd.DataFrame(index=[(1, 2), (2, 3), (3, 4), ...], data=['v1', 'v2', 'v3', ...])
However, doing a simple .loc fails:
df.loc[(1, 2)] # error
df.loc[df.index[0]] # also error
with the message KeyError: 1. Why does it fail? The index clearly shows that the tuple (1, 2) is in it and in the docs I see .loc[] being used similarly.
Edit: Apparently df.loc[[(1, 2)]] works. Go figure. It was probably interpreting the first iterable as separate keys?
Turns out I needed to wrap the key in another iterable, such as a list, so that the whole tuple is used as a single key rather than its individual elements: df.loc[[(1, 2)]].
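For reference, a minimal sketch of that behaviour (the index is built from tuples as in the question):
import pandas as pd

df = pd.DataFrame(index=[(1, 2), (2, 3), (3, 4)], data=['v1', 'v2', 'v3'])
# df.loc[(1, 2)] can be unpacked into separate row/column keys and raise KeyError
df.loc[[(1, 2)]]  # wrapping the tuple in a list keeps it together as a single key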

Create a new column that details if a row in one PySpark dataframe matches a row in a column of another dataframe

I want to create a function that performs a left join in PySpark and creates a new column detailing whether a value in one dataframe's column matches, row by row, a value in the corresponding column of another dataframe.
For example, we have one PySpark dataframe (d1) that has columns ID and Name and another PySpark dataframe (d2) that has the same columns - ID and Name.
I'm trying to make a function that joins these two tables and creates a new column that shows 'True' or 'False' if the same ID exists in both dataframes.
So far, I have this
def doValuesMatch(df1, df2):
    left_join = df1.join(df2, on='ID', how='left')
    df1.withColumn('MATCHES?', .....(not sure what to do here))
I'm new to PySpark, can someone please help me? Thanks in advance.
It may be something like this:
from pyspark.sql import functions as F

data1 = [
    (1, 'come'),
    (2, 'on'),
    (3, 'baby'),
    (4, 'hurry')
]
data2 = [
    (2, 'on'),
    (3, 'baby'),
    (5, 'no')
]
df1 = spark.createDataFrame(data1, ['id', 'name'])
df2 = spark.createDataFrame(data2, ['id', 'name'])
df2 = df2.withColumnRenamed('name', 'name2')
df = df1.join(df2, on='id', how='left').withColumn('MATCHES', F.expr('if(name2 is null, "False", "True")'))
df.show(truncate=False)
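As a side note (not in the original answer), the same MATCHES column can be built with the DataFrame API instead of a SQL expression, using F.when:
df = (df1.join(df2, on='id', how='left')
         .withColumn('MATCHES', F.when(F.col('name2').isNull(), 'False').otherwise('True')))
df.show(truncate=False)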

Match list of strings with column and return corresponding column value

This is my dataframe df3:
My Template files are named like:
AdDape CBS Index Template 6.3.xlsx
AdDape Midlife Index Template 5.3.xlsx
CausalIQ Index Template 5.xlsx
I'm iterating over my Excel files and, in a nested loop, over each file's sheets. I have loaded the file into df2; it has several columns and 3 sheets, as below:
The file's sheets look like this (df2):
I want to get the Grouping value: if my actual filename matches df3.filename, then match its sheet against df3.sheetname and get the corresponding Grouping.
The code I'm using is:
for fname in TemplateFileList:
    excel = pd.ExcelFile(fname)
    for sheet in excel.sheet_names:
        print("\nSelected sheet: ", sheet)
        df2 = pd.read_excel(excel, sheet_name=sheet)
        print('\n', sheet, df2.head(5))
        m = df3['Filename'].apply(lambda x: process.extract(x, fname, limit=10)).all()
        if m:
            print(m)
In simple language, what I want is:
if fname is similar to df3[filename]
then
if fname's sheetname or df2's sheet is similar to df3[sheetname]
then
return the corresponding df3[costgroup]
Note that the actual filenames are not exactly the same as the filenames in df3 (this is the real problem), and the actual sheet names also differ from the sheet names in df3.
I know about fuzzy matching, but I'm not sure how to use it. I used it like this:
m = df3['Filename'].apply(lambda x: process.extract(x, fname, limit=10)).all()
if m:
    print(m)
This is giving me output like this:
[('A', 60), ('d', 60), ('D', 60), ('a', 60), ('t', 60), ('S', 60), ('S', 60), ('a', 60), ('y', 60), ('d', 60)]
I think what you generally want here is to use merge(), which will merge together two dataframes to give all the columns of both.
Your df2 will need columns for filename and sheetname. It looks like you might have to do a little work to get them into the same format as what you have in df3. Once you have that, though, merge will get you the grouping column as well.
df_merged = df1.merge(
    df2,
    how='left',  # this will give all rows from df1, even if they don't have matches in df2
    on=['Filename', 'Sheetname']  # this assumes that the column titles are the same in both dfs
)
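To get the actual filenames and sheet names into the same format as df3 before merging, one rough sketch (assuming df3 has columns 'Filename', 'Sheetname' and 'Grouping', and using fuzzywuzzy's process.extractOne rather than process.extract) could look like this:
import pandas as pd
from fuzzywuzzy import process

def closest(value, choices, cutoff=60):
    # return the best fuzzy match from choices, or None if the score is below the cutoff
    match = process.extractOne(value, choices, score_cutoff=cutoff)
    return match[0] if match else None

for fname in TemplateFileList:
    matched_file = closest(fname, df3['Filename'].tolist())
    excel = pd.ExcelFile(fname)
    for sheet in excel.sheet_names:
        matched_sheet = closest(sheet, df3['Sheetname'].tolist())
        row = df3[(df3['Filename'] == matched_file) & (df3['Sheetname'] == matched_sheet)]
        if not row.empty:
            print(fname, sheet, row['Grouping'].iloc[0])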

Pandas concatenate Multiindex columns with same row index

I want to concatenate the multidimensional output of a NumPy computation back onto the input DataFrame; the output matches the input's shape with regard to rows and the selected columns.
But it fails with: NotImplementedError: Can only union MultiIndex with MultiIndex or Index of tuples, try mi.to_flat_index().union(other) instead.
I do not want to flatten the indices first - so is there another way to get it to work?
import pandas as pd
from pandas import Timestamp
df = pd.DataFrame({('metrik_0', Timestamp('2020-01-01 00:00:00')): {(1, 1): 2.5393693602911447, (1, 5): 4.316896324314225, (1, 6): 4.271001191238499, (1, 9): 2.8712588011247377, (1, 11): 4.0458495954752545}, ('metrik_0', Timestamp('2020-01-01 01:00:00')): {(1, 1): 4.02779063729038, (1, 5): 3.3849606155101224, (1, 6): 4.284114856052976, (1, 9): 3.980919941298365, (1, 11): 5.042488191587525}, ('metrik_0', Timestamp('2020-01-01 02:00:00')): {(1, 1): 2.374592085569529, (1, 5): 3.3405503781564487, (1, 6): 3.4049690284720366, (1, 9): 3.892686173978996, (1, 11): 2.1876998087043127}})
def compute_return_columns_to_df(df, colums_to_process, axis=0):
    method = 'compute_result'
    renamed_base_levels = map(lambda x: f'{x}_{method}', colums_to_process.get_level_values(0).unique())
    renamed_columns = colums_to_process.set_levels(renamed_base_levels, level=0)

    #####
    # perform the calculation in numpy here
    # (the actual computation is irrelevant, so it is omitted in this minimal example)
    result = df[colums_to_process].values
    #####

    result = pd.DataFrame(result, columns=renamed_columns)
    display(result)
    return pd.concat([df, result], axis=1)  # fails with: NotImplementedError: Can only union MultiIndex with MultiIndex or Index of tuples, try mi.to_flat_index().union(other) instead.
    # I do not want to flatten the indices first - so is there another way to get it to work?

compute_return_columns_to_df(df[df.columns[0:3]].head(), df.columns[0:2])
The reason why your code failed is in:
result = df[colums_to_process].values
result = pd.DataFrame(result, columns=renamed_columns)
Note that result has column names with the top index level renamed to metrik_0_compute_result (so far OK), but its row index is the default single-level index, composed of consecutive numbers.
Then, when you concatenate df and result, Pandas attempts to align both source DataFrames on the row index, but they are incompatible (df has a MultiIndex, whereas result has an "ordinary" index).
Change this part of your code to:
result = df[colums_to_process]
result.columns = renamed_columns
This way result keeps the original index and concat raises no exception.
Another remark: your function has an axis parameter which is never used; consider removing it.
Another possible approach
Since result has a default (single-level) index, you can leave the previous part of the code as is, but reset the index in df before joining:
return pd.concat([df.reset_index(drop=True), result], axis=1)
This way both DataFrames have the same indices and you can concatenate
them as well.
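Putting the answer's first fix back into the function from the question, a minimal sketch (keeping the original names and dropping the unused axis parameter) looks like this:
def compute_return_columns_to_df(df, colums_to_process):
    method = 'compute_result'
    renamed_base_levels = map(lambda x: f'{x}_{method}', colums_to_process.get_level_values(0).unique())
    renamed_columns = colums_to_process.set_levels(renamed_base_levels, level=0)

    result = df[colums_to_process]      # keeps the original (MultiIndex) row index
    result.columns = renamed_columns    # only the column labels are replaced

    return pd.concat([df, result], axis=1)  # both frames now share the same row index

compute_return_columns_to_df(df[df.columns[0:3]].head(), df.columns[0:2])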

Want to find the first instance of each unique string in a DataFrame, then create a list marking whether each row is the first unique instance or not

To reword it, I am generating some dummy data. Assuming a list of customers (some with multiple transactions), I want to mark each Unique Customer. Then I will generate related Personal Info such as Gender, Customer ID, etc.
My steps were:
1) Create a list of all unique names
2) Iterate over the "Names" column in my DataFrame
3) When the value in the unique-names list matches the DataFrame "Names" value, append 1 to a list (then delete the name from the unique-names list, so each subsequent instance of that name gets a 0); or leave a 0 if it doesn't match.
I've tried a few methods but none seem to work; this one seemed the closest, but I could not find the answer.
First, the DataFrame:
customers = [('jack', 34),
             ('tom', 30),
             ('jack', 31),
             ('jack', 32),
             ('jon', 16),
             ('tim', 17)]

# Create a DataFrame object
df = pd.DataFrame(customers, columns=['Name', 'Age'])
1) create list of Unique Names
uniques = df.Name.unique().tolist()
uniques
2)
worklist = []
for i in df:
    if df["Name"] == uniques[i]:
        worklist.append(i)
        uniques.remove(i)
    else:
        worklist.append(0)
print(worklist)
print(uniques)
At the end, I should have a list of dummy variables (1,0s)
  [1,1,0,0,1,1]
Similarly, the Unique names list should be empty.
However, I continually get this error.
TypeError: list indices must be integers or slices, not str
The error you have is because you don't loop over what you think you are looping over: when iterating over a DataFrame you actually loop over the column names ("Name" and "Age"), so your code asks for uniques["Name"] and uniques["Age"], which raises the error because list indices cannot be str.
You can build a kind of switch using a dict instead of a list for your uniques variable:
customers = [('jack', 34),
             ('tom', 30),
             ('jack', 31),
             ('jack', 32),
             ('jon', 16),
             ('tim', 17)]

df = pd.DataFrame(customers, columns=['Name', 'Age'])

uniques = {name: True for name in df['Name']}

worklist = []
for name in df["Name"]:
    if uniques[name]:
        worklist.append(1)
        uniques[name] = False
    else:
        worklist.append(0)

print(worklist)
The uniques variable is not empty at the end, though; it is filled with keys whose values are all False. Not sure if that's important; if it is, tell me and I'll edit.
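As an aside (not part of the original answer), pandas can do the first-occurrence marking directly with Series.duplicated, which flags every occurrence after the first:
worklist = (~df['Name'].duplicated()).astype(int).tolist()
print(worklist)  # [1, 1, 0, 0, 1, 1]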
