I have column-based data in a CSV file and I would like to manipulate it in several ways. People have pointed me to R because it gives you easy access to both rows and columns, but I am already familiar with Python and would rather use it.
For example, I want to be able to delete all the rows that have a certain value in one of the columns. Or I want to change all the values of one column (e.g., trim the strings). I also want to be able to aggregate rows based on common values (like a SQL GROUP BY).
Is there a way to do this in python without having to write a loop to iterate over all of the rows each time?
Look at the pandas library. It provides a DataFrame type, similar to R's data.frame, that lets you do the kind of thing you're talking about.
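For instance, a quick sketch of the operations you mention (the file path and column names here are made up):

import pandas as pd

df = pd.read_csv("data.csv")

# Delete all rows that have a certain value in one of the columns
df = df[df["status"] != "obsolete"]

# Change all the values of one column, e.g. trim whitespace from a string column
df["name"] = df["name"].str.strip()

# Aggregate rows on common values, like a SQL GROUP BY
totals = df.groupby("city")["amount"].sum()

All of these are vectorized, so there is no explicit loop over the rows.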
I'm writing a function that (hopefully) simplifies a complex operation for other users. As part of this, the user passes in some dataframes and an arbitrary boolean Column expression computing something from those dataframes, e.g.
(F.col("first")*F.col("second").getItem(2) < F.col("third")) & (F.col("fourth").startswith("a")).
The dataframes may have dozens of columns each, but I only need the result of this expression, so it should be more efficient to select only the relevant columns before the tables are joined. Is there a way, given an arbitrary Column, to extract the names of the source columns that Column is being computed from, i.e.
["first", "second", "third", "fourth"]?
I'm using PySpark, so an ideal solution would be contained only in Python, but some sort of hack that requires Scala would also be interesting.
Alternatives I've considered would be to require the users to pass the names of the source columns separately, or to simply join the entire tables instead of selecting the relevant columns first. (I don't have a good understanding of Spark internals, so maybe the efficiency loss isn't as great as I think.) I might also be able to do something by cross-referencing the string representation of the column with the list of column names in each dataframe, but I suspect that approach would be unreliable.
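For what it's worth, a minimal sketch of that cross-referencing idea, matching the Column's string representation against each dataframe's column names (the function name and the word-boundary regex are my own choices, and as suspected it can misfire, e.g. when a column name also appears inside a string literal):

import re

def referenced_columns(expr_col, *dfs):
    # Best-effort guess at which source columns a Column expression uses
    expr_str = str(expr_col)  # the repr text includes the column names used in the expression
    names = set()
    for df in dfs:
        for name in df.columns:
            # Word-boundary match to reduce (but not eliminate) false positives
            if re.search(rf"\b{re.escape(name)}\b", expr_str):
                names.add(name)
    return sorted(names)

# Hypothetical usage, with the expression from above stored in cond and two dataframes df1, df2:
# referenced_columns(cond, df1, df2)  # -> ['first', 'fourth', 'second', 'third']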
This is my first post here and it's based on an issue I've run into and tried to solve at work. I'll try to summarize my issue precisely, as I'm having trouble wrapping my head around a preferred solution. Step #3 below is a real stumper for me.
1) Grab a large data file based on a parquet - no problem
2) Select 5 columns from the parquet and create a dataframe - no problem
import pandas as pd
df = pd.read_parquet('/Users/marmicha/Downloads/sample.parquet',
                     columns=["ts", "session_id", "event", "duration", "sample_data"])
3) But here is where it gets a bit tricky for me. One column (a key column) is called "session_id". Some session_id values are unique, but many are duplicated and have multiple associated rows of data. I wish to iterate through the master dataframe and create a unique dataframe per session_id. Each of these unique (sub) dataframes would have a calculation done that simply gets the SUM of the "duration" column per session_id. Again, that SUM would be unique per session_id, so each sub dataframe would have its own SUM, with a row added listing that total along with the session_id. I'm thinking there is a nested loop that will work for me, but every effort has been a mess to date.
4) Ultimately, I'd like to have a final dataframe that is a collection of these unique sub dataframes. I guess I'd need to define this final dataframe and append each new sub dataframe to it as I iterate through the data. I should be able to do that simply enough.
5) Finally, write this final df to a new parquet file. That should be simple enough, so I won't need help with it.
But that is my challenge in a nutshell. The main design I'd need help with is #3. I've played with itertuples and iterrows.
I think the groupby function will work:
df.groupby('session_id')['duration'].sum()
More info here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
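If you want the totals back as a regular dataframe and written out to a new parquet (to cover steps 4 and 5 as well), a rough sketch along those lines (the output path is made up):

import pandas as pd

df = pd.read_parquet('/Users/marmicha/Downloads/sample.parquet',
                     columns=["ts", "session_id", "event", "duration", "sample_data"])

# One row per session_id with the summed duration - no explicit loop needed
totals = (df.groupby("session_id", as_index=False)["duration"]
            .sum()
            .rename(columns={"duration": "total_duration"}))

totals.to_parquet('/Users/marmicha/Downloads/session_totals.parquet')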
I have a table I put into a pandas DataFrame using
pd.read_parquet(filename)
I have 3 columns of interest in the data set: 2 are data, one is an ID. I have to search through the whole set for values, but discard duplicate IDs.
What is the fastest way to put these IDs in a data structure, or maybe clean the data of duplicates first? I was thinking of a dictionary, but there might already be a faster way to do this in pandas, or some sort of cache to use.
Thanks!
Try
pd.read_parquet(filename).drop_duplicates(['ID'])
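Note that drop_duplicates keeps the first row for each ID by default; the same call with the options spelled out (assuming the column really is named 'ID') would be:

import pandas as pd

df = pd.read_parquet(filename)
deduped = df.drop_duplicates(subset=["ID"], keep="first")  # or keep="last" / keep=False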
I have an API wrapper that pulls data from a specific product. I am facing the problem of how to map the JSON data to the database (PostgreSQL). I have read up on the pandas DataFrame but I am unsure if it is the right way to go. I have a few questions that I need help with.
1) Is it possible to choose which rows get into the dataframe?
2) Every row inside the dataframe needs to be inserted into two different database tables. I would need to insert ten columns into TableA get the id of the newly inserted row and insert five columns including the returned id into TableB. How would I go about this?
3) Is it possible to specify the data types for each column in the dataframe?
4) Is it possible to rename the column names to the database field names?
5) Is it possible to iterate through specific columns and replace certain data?
Is there a specific term for what I am trying to accomplish which I can search for?
Many thanks!
1) Yes, you can. You can follow this tutorial
2) You can achieve this following the same tutorial as before; there's also a rough sketch of the two-table insert after this list.
3) There are 3 main options to convert data types in pandas:
3.1) to_numeric() - provides functionality to safely convert non-numeric types (e.g. strings) to a suitable numeric type. (See also to_datetime() and to_timedelta().)
3.2) astype() - convert (almost) any type to (almost) any other type (even if it's not necessarily sensible to do so). Also allows you to convert to categorical types (very useful).
3.3) infer_objects() - a utility method to convert object columns holding Python objects to a pandas type if possible.
4) You can simply call the .rename function as explained here
5) There are at least 5 ways to iterate over data in pandas. Some are faster than others, but the best choice depends on the case. There's a very good post on GeeksForGeeks about it.
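To make points 2-5 a bit more concrete, here is a rough sketch under assumed names (the api_wrapper call, the JSON fields, and the TableA/TableB column layouts are all made up; the inserts use psycopg2 and PostgreSQL's RETURNING clause):

import pandas as pd
import psycopg2

# (1) Build a dataframe from the API's JSON and keep only the rows you want
records = api_wrapper.fetch()                     # hypothetical call to your wrapper
df = pd.DataFrame.from_records(records)
df = df[df["status"] == "active"]                 # a boolean mask selects the rows

# (4) Rename the JSON keys to the database field names
df = df.rename(columns={"createdAt": "created_at", "qty": "quantity"})

# (3) Make the column dtypes explicit (to_datetime/to_numeric; astype works too)
df["created_at"] = pd.to_datetime(df["created_at"])
df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce")

# (5) Replace certain values in a specific column without an explicit loop
df["country"] = df["country"].replace({"UK": "GB"})

# (2) Insert each row into TableA, grab the generated id, then insert into TableB
conn = psycopg2.connect("dbname=mydb user=me")
with conn, conn.cursor() as cur:
    for row in df.itertuples(index=False):
        cur.execute(
            "INSERT INTO TableA (created_at, quantity, country) "
            "VALUES (%s, %s, %s) RETURNING id",
            # convert pandas/numpy scalars to plain Python types for psycopg2
            (row.created_at.to_pydatetime(), float(row.quantity), row.country),
        )
        new_id = cur.fetchone()[0]
        cur.execute(
            "INSERT INTO TableB (tablea_id, status) VALUES (%s, %s)",
            (new_id, row.status),
        )
conn.close()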
I hope I could help you somehow =)
I am looking to understand how to use a user-defined variable within a column name. I am using pandas. I have a dataframe with several columns that are in the same format, but the code will be run against the different column names. I don't want to have to put in the different column names each time when only the first part of the name actually changes.
For example,
df['input_same_same']
Where the code will call out different columns where only the first part of the column is different and the rest remains the same.
Is it possible to do something along the lines of:
vari='cats' (and the next time I run I can input dogs, pigs, etc)
for
df['vari_count_litter']
I have tried using %s within the column name but that doesn't work.
I'd appreciate any insight or understanding how this is possible. Thanks!
If I understand right, you could do df[vari+'_count_litter']. However, you may be better off using a MultiIndex, which would let you do df[vari, 'count_litter']. It's difficult to say how to set it up without knowing what your data structure is and how you want to access it.
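A small sketch of both options (the cats/dogs data here is just illustrative):

import pandas as pd

df = pd.DataFrame({"cats_count_litter": [1, 2], "dogs_count_litter": [3, 4]})

vari = "cats"
print(df[vari + "_count_litter"])   # plain string concatenation
print(df[f"{vari}_count_litter"])   # or an f-string

# Alternatively, split the prefix out into its own level of a MultiIndex
df.columns = pd.MultiIndex.from_tuples(
    [tuple(c.split("_", 1)) for c in df.columns], names=["animal", "measure"]
)
print(df[vari, "count_litter"])     # same as df[(vari, "count_litter")]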