Pandas df.loc with regex - python

I'm working with a data set consisting of several CSV files of nearly the same form. Each CSV describes a particular date and labels the data by state/province. However, one of the column headers was changed at some point from Province/State to Province_State, so all CSVs created before a certain date use the first format and all CSVs created after that date use the second.
I'm trying to sum up all the entries corresponding to a particular state. At present, the code I'm working with is as follows:
daily_data.loc[daily_data[areaLabel] == location].sum()
where daily_data is the dataframe containing the CSV data, location is the name of the state I'm looking for, and areaLabel is a variable holding either 'Province/State' or 'Province_State' depending on the result of a date check. I would like to eliminate the date check by, e.g., matching the column against a regular expression like Province(/|_)State, but I'm having trouble finding a way to index into a pandas dataframe by regular expression. Is this doable (and in a way that makes the code more elegant rather than less)? If so, I'd appreciate it if someone could point me in the right direction.

Use filter to get the columns that match your regex:
>>> df.filter(regex="Province(/|_)State").columns[0]
'Province/State'
Then use this to select only rows that match your location:
df[df[df.filter(regex="Province(/|_)State").columns[0]]==location].sum()
This assumes, however, that no other columns match the regex.
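If you'd rather normalize the header at load time instead, here is a minimal sketch (the file name and location value are hypothetical):
import pandas as pd

location = "Ontario"  # hypothetical province/state name

# daily_report.csv stands in for one of your dated files
daily_data = pd.read_csv("daily_report.csv")

# Map the newer header back to the older one; rename ignores missing keys,
# so this is a no-op on files that already use 'Province/State'
daily_data = daily_data.rename(columns={"Province_State": "Province/State"})

# One expression now works for files from either era
totals = daily_data.loc[daily_data["Province/State"] == location].sum()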

Related

Tableau Prep + Panda's DF split and compare

I'm trying to use Tableau Prep. My prep data has three columns with multiple rows:
id, valueFromSystem1, valueFromSystem2
123, 56.2|34.0|82.1, 82.1|34.1|56.2
I need to use a Python pandas script to compare the values from each system by id. In the example above I'd expect a new column 'compareResult' with the value False (since they don't match; note the order of the values doesn't matter, just that they all match).
Ideally I could also have another column that specifies which values didn't match: 'nonMatch' - 34.1
Thoughts on how to construct the Python file?
I'd like to have one function to handle all of the above (due to the way Tableau expects things):
def compare()
I'll need something to split valueFromSystem by id, and then I think there's a df.compare?
UPDATE:
Two other things I've found are groupby and equals for sets, i.e.
result = valueFromSystem1.equals(other=valueFromSystem2)
but I'm still trying to put it all together...
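A minimal sketch of one way to put it together, splitting on the pipe and comparing as sets. The column names follow the question, and it assumes the script receives the dataframe as an argument; note the symmetric difference reports mismatched values from both systems, slightly broader than the single 34.1 in the example:
import pandas as pd

def compare(df):
    # Split each pipe-delimited string into a set so order doesn't matter
    s1 = df['valueFromSystem1'].str.split('|').apply(set)
    s2 = df['valueFromSystem2'].str.split('|').apply(set)
    # True only when both systems hold exactly the same values
    df['compareResult'] = s1 == s2
    # Values present in one system but not the other (both directions)
    df['nonMatch'] = [', '.join(sorted(a ^ b)) for a, b in zip(s1, s2)]
    return df

df = pd.DataFrame({'id': [123],
                   'valueFromSystem1': ['56.2|34.0|82.1'],
                   'valueFromSystem2': ['82.1|34.1|56.2']})
print(compare(df))  # compareResult False, nonMatch '34.0, 34.1'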

Extract text from a column by delimiters Python

So far I've only used Power Query to clean and automate files, and I want to step up my game and move to Python, but I'm having some issues and have no one to ask, so I'm coming to you for help. I'm completely new to Python and learning from YouTube videos and the Python for Data Analysis book, so please bear with me for a moment.
To learn, I've been working on a project using a sample CSV file. The file covers several dates and has multiple columns with different data. What I want to do is split the file into different CSVs based on the date in the column "DateFull", which has dates in a yyyy-mm-dd 00:00:00 format, and name the new CSV files with the date.
Looking at YouTube videos, I came up with this piece of code:
import pandas as pd
df = pd.read_csv("sample_file.csv")
split_dates = df['DateFull'].unique()
for date in split_dates:
    df1 = df[df['DateFull'] == date]
    split_file_name = "Samplefile_" + str(date) + ".csv"
    df1.to_csv(split_file_name, index=False)
But when I run it, it errors out because it tries to put the whole value, including the 00:00:00 part, into the file name, and that isn't acceptable (colons aren't allowed in file names on Windows). I've been looking into the split method to separate the DateFull column at the whitespace, but I don't know how to incorporate that into the code.
It's obvious that I don't have any idea of how the structure or logic of the code should be, but my plan was to use the df['DateFull'].str.split() command to create two new columns, one with just the date and one with the 00:00:00 part, then remove the last one and the original DateFull so the trimmed date column replaces it, and use that one to split the CSV.
I know I'm probably overcomplicating it and there's an easier way, maybe just removing the time part from the original column. If that's possible it would be amazing to know how to do it, but I'd also like to know how to do it with my approach, since I'd be practicing more methods even though the resulting code will be redundant.
Any help would be greatly appreciated.
Thank you so much
You can find the documentation for the split() function here.
To do this with split():
str(date).split(" ")[0]
This splits on the whitespace and returns the first (0-indexed) value of the resulting list. With this change your for loop would look like this:
for date in split_dates:
    df1 = df[df['DateFull'] == date]
    split_file_name = "Samplefile_" + str(date).split(" ")[0] + ".csv"
    df1.to_csv(split_file_name, index=False)
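If you'd rather strip the time from the column itself, as you guessed was possible, here is a minimal sketch using pandas' datetime accessor (assuming DateFull parses cleanly as a datetime):
import pandas as pd

df = pd.read_csv("sample_file.csv")
# Parse the column and keep only the calendar-date part
df['DateFull'] = pd.to_datetime(df['DateFull']).dt.date
for date in df['DateFull'].unique():
    # date is now a datetime.date, so str(date) is just yyyy-mm-dd
    df[df['DateFull'] == date].to_csv("Samplefile_" + str(date) + ".csv", index=False)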

Extract pandas dataframe column names from query string

I have a dataset with a lot of fields, so I don't want to load all of it into a pd.DataFrame, but just the basic ones.
Sometimes I would like to do some filtering upon loading, and I would like to apply the filter via the query or eval methods, which means I need a query string of the form, e.g., "PROBABILITY > 10 and DISTANCE <= 50", but those columns then need to be loaded into the dataframe.
Is it possible to extract the column names from the query string in order to load them from the dataset?
I know some magic using regex is possible, but I'm sure that it would break sooner or later, as the conditions get complicated.
So, I'm asking if there is a native pandas way to extract the column names from the query string.
I think you can use usecols when you load your dataframe. I use it when I load a CSV; I don't know whether it's possible with SQL or other formats.
columns_to_use = ['Column1', 'Column3']
pd.read_csv(usecols=columns_to_use, ...)
Thank you
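As for pulling the names out of the query string itself: there is no public pandas helper for this, but here is a minimal sketch using Python's standard ast module (assuming the query is a plain Python expression with no backtick-quoted column names):
import ast

def query_columns(query):
    # Collect every bare name in the expression; if your queries reference
    # local variables or functions, filter those out of the result
    tree = ast.parse(query, mode='eval')
    return {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}

cols = query_columns("PROBABILITY > 10 and DISTANCE <= 50")
# cols == {'PROBABILITY', 'DISTANCE'}, ready to combine with your basic
# columns and pass to usecols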

Binary search through CSV column which may contain duplicates

Using Pandas I have read a CSV file accessed through FTP. The first column, Code, has values sorted like so:
PA0000357,
PA0000358,
PA0000359,
PA0000359,
PA0000360,
PA0000380 ...
The codes may have duplicate numbers. I need to return all rows that match a given code. Since the numbers are sorted I was thinking to use bisect but I'm not sure if or how it works with duplicate codes.
data = pd.read_csv(r, sep=',', index_col=None, parse_dates=['Date'],
                   usecols=['Code', 'PT Code', 'Value'])
data is a dataframe with the Code column I need to search through. Is it worth using bisect, or shall I just go for in? The amount of data is around 500 rows.
It's the classic binary-search-with-duplicates problem: you have to widen the match into a range instead of returning just the middle index.
Ex:
PA0000357, PA0000358, PA0000359, PA0000359, PA0000360, PA0000380 ...
If you hit a PA0000359, you then scan left and right to find the full range of equal codes.
And because you are using Python, please just use in/find (500 is a small number).
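For reference, the standard library's bisect module already handles the duplicate-range part: bisect_left and bisect_right bracket the run of equal values. A minimal sketch on a sorted list of codes:
from bisect import bisect_left, bisect_right

codes = ['PA0000357', 'PA0000358', 'PA0000359',
         'PA0000359', 'PA0000360', 'PA0000380']
target = 'PA0000359'

lo = bisect_left(codes, target)   # first index whose value >= target
hi = bisect_right(codes, target)  # first index whose value > target
matches = codes[lo:hi]            # ['PA0000359', 'PA0000359']
In pandas itself, data[data['Code'] == target] returns the same rows without needing the column to be sorted, which at 500 rows is the simpler choice.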

Create index only (no column values) for new dataframe/series in Pandas

I am fairly new to Python and Pandas and I have not found an answer to this question while searching.
I have multiple CSV data files that all contain a date-time column and corresponding data. I want to create a series/dataframe that contains a specific span of dates (all data is at a 1-minute interval, so if I wanted to look at July, for example, I would set the index to start at July and go until the end).
Can I create a series or dataframe that contains only the date-time intervals as an index and does not contain column info? Or would I create an index (the row numbers) and then fill my column with the dates?
I'm also unsure about using 'pd.merge' vs 'newdataframe = pd.merge'. When I use just pd.merge, nothing comes up in my variable explorer (I use Anaconda's Spyder IDE); only when I use newdataframe = pd.merge does it appear.
Thanks in advance,
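A minimal sketch of the index-only idea, assuming 1-minute data and July of a hypothetical year:
import pandas as pd

# A DatetimeIndex at 1-minute frequency spanning July (year is illustrative;
# older pandas versions use freq='T' instead of 'min')
idx = pd.date_range('2021-07-01', '2021-07-31 23:59', freq='min')

# A dataframe whose only structure is that index; no columns yet
frame = pd.DataFrame(index=idx)

# join/merge return new objects rather than modifying in place, which is
# why nothing shows up in the variable explorer unless you assign the result:
# frame = frame.join(csv_data)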
