Python: (findall, wide to long, pandas): Data manipulation

So, I am trying to pivot this data (link here) so that all the metrics/numbers end up in a single column, with another column serving as the ID column. Obviously, having a ton of metrics spread across a bunch of columns is much harder to compare and run calculations on than having them all in one column.
I know which tools I need for this: pandas, findall, wide_to_long (or melt), and maybe stack. However, I am having a bit of difficulty putting them all in the right place.
I can easily import the data into the df and view it, but when it comes to using findall with wide_to_long to pivot the data, I get pretty confused. I am basing my idea off of this example (about halfway down, where they use findall / regex to define new column names). I am looking to create a new column for each category of metric (i.e. population estimate is one column and % change is another; they should not all be one column).
Can someone help me set up the syntax correctly for this part? I am not good with the regular expressions used for pattern matching.
Thank you.
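For what it's worth, a minimal sketch of that pattern, using hypothetical column names like pop_est_2019 (the linked data isn't shown here), might look like:

    import re
    import pandas as pd

    # Hypothetical wide layout: one row per area, one column per
    # metric/year pair.
    df = pd.DataFrame({
        "id": ["A", "B"],
        "pop_est_2019": [100, 200],
        "pop_est_2020": [110, 190],
        "pct_change_2019": [1.0, 2.0],
        "pct_change_2020": [1.5, -0.5],
    })

    # Use findall to derive the stub names (the metric categories):
    # everything before a trailing _<year> is a stub.
    stubs = sorted({hits[0] for c in df.columns
                    if (hits := re.findall(r"(.*)_\d{4}$", c))})

    long = pd.wide_to_long(df, stubnames=stubs, i="id", j="year",
                           sep="_", suffix=r"\d{4}").reset_index()
    # long now has one row per id/year pair, with pop_est and
    # pct_change as separate columns rather than one stacked column.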

Related

Complicated string manipulation in pandas without regex

I just looped over a pandas column to convert it from out[156] to out[157] in the image attached below.
Now I have another column that is very complicated. Is it possible to achieve the same thing without using regex (I am completely unfamiliar with regex)? I basically want the figures in float format (e.g. 9ft 2in converted to the equivalent in inches), but there are so many different ways the values were entered.
I would appreciate any advice.
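One regex-free approach, assuming the entries look roughly like '9ft 2in' (hypothetical; the real variants aren't shown in the question), is plain string methods:

    import pandas as pd

    def to_inches(value):
        # Parse strings like '9ft 2in' with plain string methods.
        # Assumes every entry contains 'ft'; anything unparseable
        # becomes NaN instead of raising.
        try:
            text = str(value).lower().replace("in", "").strip()
            feet, _, inches = text.partition("ft")
            return float(feet) * 12 + (float(inches) if inches.strip() else 0.0)
        except ValueError:
            return float("nan")

    s = pd.Series(["9ft 2in", "6ft", "5ft 11in"])
    print(s.map(to_inches))  # 110.0, 72.0, 71.0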

Is there a way to get the source columns from a column object in PySpark?

I'm writing a function that (hopefully) simplifies a complex operation for other users. As part of this, the user passes in some dataframes and an arbitrary boolean Column expression computing something from those dataframes, e.g.
(F.col("first")*F.col("second").getItem(2) < F.col("third")) & (F.col("fourth").startswith("a")).
The dataframes may have dozens of columns each, but I only need the result of this expression, so it should be more efficient to select only the relevant columns before the tables are joined. Is there a way, given an arbitrary Column, to extract the names of the source columns that Column is being computed from, i.e.
["first", "second", "third", "fourth"]?
I'm using PySpark, so an ideal solution would be contained only in Python, but some sort of hack that requires Scala would also be interesting.
Alternatives I've considered would be to require the users to pass the names of the source columns separately, or to simply join the entire tables instead of selecting the relevant columns first. (I don't have a good understanding of Spark internals, so maybe the efficiency loss isn't as big as I think.) I might also be able to do something by cross-referencing the string representation of the column with the list of column names in each dataframe, but I suspect that approach would be unreliable.
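A heuristic sketch of that last idea, cross-referencing the expression's string form with each dataframe's schema. This is a hack rather than a supported API, and substring matching can misfire (a name inside a string literal, or one column name contained in another, will also match):

    def guess_source_columns(expr_col, *dataframes):
        # Heuristic: treat a column as 'used' if its name appears in
        # the expression's string representation.
        text = str(expr_col)
        candidates = {c for df in dataframes for c in df.columns}
        return sorted(c for c in candidates if c in text)

    # Hypothetical usage, with the expression from the question:
    # from pyspark.sql import functions as F
    # expr = (F.col("first") * F.col("second").getItem(2) < F.col("third")) \
    #        & F.col("fourth").startswith("a")
    # guess_source_columns(expr, df1, df2)  # -> ['first', 'fourth', ...]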

Is there a way to map 2 dataframes onto each other to produce a rearranged dataframe (with one dataframe's values acting as the new column names)?

I have one dataframe of readings that come in a particular arrangement due to the nature of the experiment. I also have another dataframe that contains information about each point on the readings dataframe: namely, which chemical was at that point. Note, there are only a few different chemicals across the dataframe, but they are arranged all over it.
What I want to do is create a new, reorganised dataframe where the columns are the types of chemical. My initial thought was to compare the data and information dataframes to produce a dictionary, which I could then transform into a new dataframe. I could not figure out how to do this, and it might not actually be the best approach either!
I have previously achieved this by manually rearranging the points on the dataframe to match the pattern I want, but I'm not happy with that approach and there must be a better way.
Thanks in advance for any help!
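If the two dataframes have identical shapes, so each cell of the information grid labels the same cell of the readings grid, one sketch (with made-up data) is to flatten both and pivot:

    import pandas as pd

    # Hypothetical grids: labels.iloc[i, j] names the chemical measured
    # at readings.iloc[i, j].
    readings = pd.DataFrame([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    labels = pd.DataFrame([["A", "B"], ["B", "A"], ["A", "B"]])

    # Pair each reading with its label, then pivot so each chemical
    # becomes its own column.
    long = pd.DataFrame({
        "chemical": labels.to_numpy().ravel(),
        "value": readings.to_numpy().ravel(),
    })
    long["n"] = long.groupby("chemical").cumcount()
    wide = long.pivot(index="n", columns="chemical", values="value")
    # wide has columns ['A', 'B'], one per chemical.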

General Approach to Working with Data in DataFrames

Question for experienced Pandas users on their approach to working with DataFrame data.
Invariably we want to use Pandas to explore relationships among data elements. Sometimes we use groupby-type functions to get summary-level data on subsets of the data. Sometimes we use plots and charts to compare one column of data against another. I'm sure there are other applications I haven't thought of.
When I speak with other fairly novice users like myself, they generally try to extract portions of a "large" dataframe into smaller dfs that are sorted or formatted properly for analysis or plotting. This approach certainly has disadvantages: if you strip out a subset of the data into a smaller df and then want to run an analysis against a column you left in the bigger df, you have to go back and recut everything.
My question is: is it best practice for more experienced users to keep the large dataframe and syntactically pull out the data so that the effect is the same as or similar to cutting out a smaller df? Or is it better to actually cut out smaller dfs to work with?
Thanks in advance.
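For illustration, one way to work against the full dataframe in place is to filter and aggregate on the fly with boolean masks or query, rather than materialising separate copies (a sketch with made-up column names):

    import pandas as pd

    df = pd.DataFrame({"region": ["N", "S", "N"],
                       "sales": [10, 20, 30],
                       "year": [2020, 2020, 2021]})

    # Filter and aggregate in one pass instead of keeping a smaller df:
    north_total = df.loc[df["region"] == "N", "sales"].sum()

    # Or with query, leaving the full df intact for later analyses:
    by_year = df.query("region == 'N'").groupby("year")["sales"].mean()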

How to use user-defined input for column name in pandas series

I am looking to understand how to use a user-defined variable within a column name. I am using pandas. I have a dataframe with several columns that share the same format, but the code will be run against different column names. I don't want to have to type out each column name every time when only the first part of the name actually changes.
For example,
df['input_same_same']
where the code will reference different columns in which only the first part of the name differs and the rest stays the same.
Is it possible to do something along the lines of:
vari = 'cats' (and the next time I run it I can input dogs, pigs, etc.)
for
df['vari_count_litter']
I have tried using %s within the column name, but that doesn't work.
I'd appreciate any insight into how this is possible. Thanks!
If I understand right, you could do df[vari + '_count_litter']. However, you may be better off using a MultiIndex, which would let you do df[vari, 'count_litter']. It's difficult to say how to set it up without knowing what your data structure is and how you want to access it.
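A sketch of both suggestions from that answer, with a made-up frame (the real structure isn't shown in the question):

    import pandas as pd

    df = pd.DataFrame({"cats_count_litter": [1, 2],
                       "dogs_count_litter": [3, 4]})

    vari = "cats"
    print(df[vari + "_count_litter"])  # plain string concatenation

    # Alternatively, split the names into a MultiIndex so the prefix
    # becomes its own level:
    df.columns = pd.MultiIndex.from_tuples(
        [tuple(c.split("_", 1)) for c in df.columns])
    print(df[vari, "count_litter"])  # same data, selected by level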
