I am working in Python's pandas with 4 dataframes, each of which has a column of unique identifiers followed by multiple columns of shared attributes. I'd like to generate the cartesian product of the first column, but sum the remaining columns of shared attributes for each unique combination.
I have a working version that separates the first column out into a separate frame, performs the cross merge, and then looks up the shared attributes for each of its unique elements, but it is horribly slow. There has got to be a smarter way to do this. Anyone have any tips?
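One pattern that avoids the separate lookup pass: cross-merge the full frames (attributes included) and sum the suffixed attribute columns in place, folding the same step across all four frames. A minimal sketch with made-up frames and attribute names, assuming pandas 1.2+ for how="cross":

    import pandas as pd
    from functools import reduce

    ATTRS = ["a", "b"]  # the shared attribute columns (hypothetical names)

    # Hypothetical inputs: a unique-id column plus shared attributes.
    dfs = [
        pd.DataFrame({"id": ["x", "y"], "a": [1, 2], "b": [10, 20]}),
        pd.DataFrame({"id": ["p", "q"], "a": [3, 4], "b": [30, 40]}),
        pd.DataFrame({"id": ["m", "n"], "a": [5, 6], "b": [50, 60]}),
        pd.DataFrame({"id": ["s", "t"], "a": [7, 8], "b": [70, 80]}),
    ]

    # Give each id column a distinct name so the cross merges never collide on it.
    dfs = [d.rename(columns={"id": f"id_{i}"}) for i, d in enumerate(dfs)]

    def cross_sum(left, right):
        # The cross merge carries the attributes along with their ids,
        # so no per-combination lookup is needed afterwards.
        out = left.merge(right, how="cross", suffixes=("", "_r"))
        for c in ATTRS:
            out[c] = out[c] + out[c + "_r"]
        return out.drop(columns=[c + "_r" for c in ATTRS])

    result = reduce(cross_sum, dfs)  # 2*2*2*2 = 16 rows with summed attributes

The product still has len(df1) * len(df2) * len(df3) * len(df4) rows, so the merges themselves are the unavoidable cost; this only removes the per-element lookup.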
I'm writing a function that (hopefully) simplifies a complex operation for other users. As part of this, the user passes in some dataframes and an arbitrary boolean Column expression computing something from those dataframes, e.g.
(F.col("first") * F.col("second").getItem(2) < F.col("third")) & (F.col("fourth").startswith("a"))
The dataframes may have dozens of columns each, but I only need the result of this expression, so it should be more efficient to select only the relevant columns before the tables are joined. Is there a way, given an arbitrary Column, to extract the names of the source columns that Column is being computed from, i.e.
["first", "second", "third", "fourth"]?
I'm using PySpark, so an ideal solution would be contained only in Python, but some sort of hack that requires Scala would also be interesting.
Alternatives I've considered would be to require the users to pass the names of the source columns separately, or to simply join the entire tables instead of selecting the relevant columns first. (I don't have a good understanding of Spark internals, so maybe the efficiency loss isn't as great as I think.) I might also be able to do something by cross-referencing the string representation of the column with the list of column names in each dataframe, but I suspect that approach would be unreliable.
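One possible hack: walk the Column's underlying Catalyst expression tree through py4j and collect the names of the UnresolvedAttribute nodes, which are the raw F.col("...") references. A sketch that relies on private internals (Column._jc), so it won't work under Spark Connect and may break between Spark versions:

    from pyspark.sql import functions as F

    def source_columns(col):
        # Best-effort: recurse over the Catalyst tree via py4j and
        # record every unresolved column reference it contains.
        names = set()

        def walk(expr):
            if expr.getClass().getSimpleName() == "UnresolvedAttribute":
                names.add(expr.name())
            children = expr.children()
            for i in range(children.size()):
                walk(children.apply(i))

        walk(col._jc.expr())
        return sorted(names)

    expr = (F.col("first") * F.col("second").getItem(2) < F.col("third")) \
        & F.col("fourth").startswith("a")
    # source_columns(expr) should yield ['first', 'fourth', 'second', 'third']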
I'm sure this code exists, and I've read through a LOT of pandas/Python documentation; in fact, my answer may be contained within "Pandas, append column based on unique subset of column values", but we can't seem to get it to work as shown below.
Using the example below, if both the Company and Place match, we want to combine the rest of the columns. If there is unique data, then we want to retain the data in that column and append it as an additional column.
Here is the visual representation of what we need:
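One way to express this in code (a sketch with a hypothetical Phone column standing in for the example): group on Company and Place, collect the distinct values of each remaining column, then spread those values into numbered extra columns:

    import pandas as pd

    df = pd.DataFrame({
        "Company": ["Acme", "Acme", "Beta"],
        "Place":   ["NY",   "NY",   "LA"],
        "Phone":   ["111",  "222",  "333"],
    })

    # Collect the distinct values of every non-key column per group.
    agg = df.groupby(["Company", "Place"], as_index=False).agg(
        lambda s: list(pd.unique(s.dropna()))
    )

    # Spread each list into Phone_1, Phone_2, ... columns.
    for col in ["Phone"]:
        expanded = pd.DataFrame(agg[col].tolist(), index=agg.index)
        expanded.columns = [f"{col}_{i + 1}" for i in expanded.columns]
        agg = pd.concat([agg.drop(columns=col), expanded], axis=1)

Rows that agree in a column collapse to a single value; rows with unique data keep every value, each in its own appended column.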
I have two dataframes which have similar data in their columns but different column names. I need to identify whether the columns are similar or not.
colName1 = ['movieName', 'movieRating', 'movieDirector', 'movieReleaseDate']
colName2 = ['name', 'release_date', 'director']
My approach is to tokenize colName1 and colName2 and compare the tokens using
- Levenshtein / Jaccard distance
- similarity via TF-IDF scores
But this only works for column names that are lexically similar, e.g. movieName and name. For a pair like 'IMDB_Score' and 'average_rating', this approach is not going to work.
Is there any way word2vec can be utilized for the above-mentioned problem?
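It can be, as long as the column names are split into real words first. A rough sketch using pretrained vectors through gensim (the model name and tokenizer are assumptions): break each name on camelCase/snake_case boundaries, average the token vectors, and compare by cosine similarity:

    import re
    import numpy as np
    import gensim.downloader as api

    # Any pretrained word2vec/GloVe model works; this one is small.
    model = api.load("glove-wiki-gigaword-50")

    def tokens(col_name):
        # 'movieReleaseDate' -> ['movie', 'release', 'date']
        parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", col_name)
        return [p.lower() for p in parts]

    def embed(col_name):
        vecs = [model[t] for t in tokens(col_name) if t in model]
        return np.mean(vecs, axis=0) if vecs else None

    def similarity(a, b):
        va, vb = embed(a), embed(b)
        if va is None or vb is None:
            return 0.0
        return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

    # similarity('IMDB_Score', 'average_rating') should score higher than any
    # string metric would, since 'score' and 'rating' are close in embedding space.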
I work with huge pandas DataFrames: 20 million rows, 30 columns. The rows carry a lot of data, and each row has a "type" that uses certain columns. Because of this, I've currently designed the DataFrame so that some columns hold mixed dtypes, depending on the row's 'type'.
My question is: performance-wise, should I split the mixed-dtype columns into two separate columns or keep them as one? I'm running into problems getting some of these DataFrames to even save (to_pickle), and I'm trying to be as efficient as possible.
The columns could be mixes of float/str, float/int, float/int/str as currently constructed.
Seems to me that it may depend on your subsequent use case, but IMHO I would give each column a single dtype; otherwise functions such as groupby with totals and other common pandas operations simply won't work.
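A small illustration of the cost (values made up): a float/str mix is stored as dtype object, i.e. boxed Python objects, which is what makes those columns slow and awkward to pickle; splitting restores typed storage:

    import pandas as pd

    mixed = pd.Series([1.5, "foo", 2.0, "bar"])
    print(mixed.dtype)                            # object

    # Split into two typed columns.
    nums = pd.to_numeric(mixed, errors="coerce")  # float64, NaN where a string was
    strs = mixed.where(nums.isna())               # the strings, NaN where numeric
    print(nums.dtype)                             # float64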
I need to reconcile two separate dataframes. Each row within the two dataframes has a unique id that I am using to match them. Without using a loop, how can I reconcile one dataframe against the other and vice versa?
I tried merging the two dataframes on the unique id, but I run into problems when there are duplicate rows of data. Is there a way to identify duplicate rows of data and put that data into an array or export it to a CSV?
Your help is much appreciated. Thanks.
Try DataFrame.duplicated and DataFrame.drop_duplicates
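A sketch of how those fit together with an indicator merge (frame and column names are hypothetical): pull the duplicates out to a CSV first so the keys are unique, then reconcile both directions in one outer merge:

    import pandas as pd

    df1 = pd.DataFrame({"id": [1, 2, 2, 3], "amt": [10, 20, 20, 30]})
    df2 = pd.DataFrame({"id": [2, 3, 4],    "amt": [20, 31, 40]})

    # Flag every duplicated key, export for inspection, then de-duplicate.
    dupes = df1[df1.duplicated(subset="id", keep=False)]
    dupes.to_csv("df1_duplicates.csv", index=False)
    df1 = df1.drop_duplicates(subset="id")

    # indicator=True labels rows 'both', 'left_only', or 'right_only',
    # reconciling the frames in both directions in one pass.
    recon = df1.merge(df2, on="id", how="outer",
                      suffixes=("_1", "_2"), indicator=True)
    mismatches = recon[recon["_merge"] != "both"]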