I am new to pandas dataframes. I would like to apply a function to an old dataframe (df1) by extracting values from another dataframe (df2).
DF2 looks like this (the actual one has ~500 rows):
Judge  old_court_name        new_court_name
John   eighth circuit        first circuit
Ruth   us court claims.      fifth circuit
Ben    district connecticut  district ohio
Then I've written a function:
def addJudgeCourt(df1, Judge, old_court_name, new_court_name):
How do I tell pandas to extract the last three items by iterating over dataframe2? Thanks!
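One common pattern (a minimal sketch of my own, with a placeholder body for addJudgeCourt, since the question does not show one) is to loop over df2 with itertuples and pass each row's three values to the function:

import pandas as pd

# df2 as shown above
df2 = pd.DataFrame({
    "Judge": ["John", "Ruth", "Ben"],
    "old_court_name": ["eighth circuit", "us court claims.", "district connecticut"],
    "new_court_name": ["first circuit", "fifth circuit", "district ohio"],
})

def addJudgeCourt(df1, Judge, old_court_name, new_court_name):
    # placeholder body -- the asker's actual replacement logic goes here
    return df1

df1 = pd.DataFrame()  # stand-in for the asker's first dataframe

# itertuples yields one named tuple per row; each column is an attribute
for row in df2.itertuples(index=False):
    df1 = addJudgeCourt(df1, row.Judge, row.old_court_name, row.new_court_name)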
I have a table with some company information that we're trying to clean up. In the first column is a clean company name, but not necessarily the correct one. In the second column, there is the correct company name, but often not very clean / missing. Here is an example.
Name       Info
Nike       Nike, a footwear manufacturer is headquartered in Oregon.
ASG Shoes  Reebok
Adidas     None
We're working with this dataset in Pandas. We'd like to follow the rules below.
1. If the Name column is equal to the left side of the Info column, keep the Name column. We would like this check to be dynamic with the length of column 1: for "Nike" it should check the first 4 characters of the Info column, and for "ASG Shoes" the first 9 characters.
2. If rule 1 is false, use the Info column.
3. If Info is None, use the Name column.
The output we seek is a third column produced by these rules. I am hoping someone can help me write this code in an efficient manner. There's a lot going on here and I want to ensure I'm doing this properly. How can I achieve this output with the most efficient Python code possible?
Name       Info                                                        Clean
Nike       Nike, a footwear manufacturer is headquartered in Oregon.  Nike
ASG Shoes  Reebok                                                      Reebok
Adidas     None                                                        Adidas
You can start by creating another column that contains the length of your Name column. This is really straightforward. Let us call the new column Slicers. You can then create a function that slices a string by a certain number and map it over the columns Info and Slicers, where Info is the string column to be sliced and Slicers defines the slicing number. (There may even be a pandas implementation for this, but I do not know of one.) After that, compare the sliced Info with your Name column and assign all matches to your Clean column. Then just apply a pandas coalesce over your desired columns.
The code implementation is given below:
import pandas as pd

def slicer(strings, slicers):
    # Slice only actual strings; leave None/NaN untouched
    return strings[:slicers] if isinstance(strings, str) else strings

df = pd.DataFrame({
    "Name": ["Nike", "ASG Shoes", "Adidas"],
    "Info": ["Nike, a footwear manufacturer is headquartered in Oregon.", "Reebok", None]
})

# Define length column
df["Slicers"] = df["Name"].str.len()
# Slice Info column by length column and overwrite
df["Slicers"] = list(map(slicer, df["Info"], df["Slicers"]))
# Check whether sliced Info column and Name column are equal
mask = df["Name"].eq(df["Slicers"])
# Keep the name where they are equal
df.loc[mask, "Clean"] = df.loc[mask, "Name"]
# Apply coalesce: take the first non-null of Clean, Info, Name per row
coalesce_rules = ["Clean", "Info", "Name"]
df = df.assign(Clean=df[coalesce_rules].bfill(axis=1).iloc[:, 0]).drop(columns=["Slicers"])
Output:
Name Info Clean
0 Nike Nike, a footwear manufacturer is headquartered... Nike
1 ASG Shoes Reebok Reebok
2 Adidas None Adidas
It only needs around five seconds for 3 million rows. Obviously, I do not know whether this is the most efficient way to solve your problem, but I think it's an efficient one.
So let's say I have data like this, with some delimiter like commas, that I want to split into new cells, either across into columns or down into rows.
The Data                      Location
One Museum, Two Museum        City A
3rd Park, 4th Park, 5th Park  City B
How would you do it in either direction? There are lots of methods; why is the method you provide preferred?
Looking for methods in:
Python
Excel
Power Query
R
The Excel manual method: click on Data > Text to Columns. Then just copy and paste if you want the data in one column. This is only good when the data set is small and you are doing it once.
The Power Query method: you do it once for the data source, then click the refresh button when the data changes in the future. The data source can be almost anything, such as a CSV file, a website, etc. Steps below:
1. Pick your data source.
2. From within Excel, choose From Table/Range.
3. Now choose the split method; besides delimiter there are six other choices.
4. For this data I went with custom and used ", ".
5 & 6. To split down into rows you have to select Advanced options and make the selection there.
7. Close & Load.
This is a good method because you don't have to code in Power Query unless you want to.
The Python method
Make sure you have installed pandas with pip, or use conda to install pandas.
The code is like so:
import pandas as pd
df = pd.read_excel('path/to/myexcelfile.xlsx')
df[['key.0', 'key.1', 'key.2']] = df['The Data'].str.split(', ', expand=True)
df.drop(columns=['The Data'], inplace=True)
# stop here if you want the data to be split into new columns
The data now looks like this:
  Location       key.0       key.1     key.2
0   City A  One Museum  Two Museum      None
1   City B    3rd Park    4th Park  5th Park
To get the split into rows proceed with the next code part:
stacked = df.set_index('Location').stack()
# set the name of the new series created
df = stacked.reset_index(name='The Data')
# drop the 'source' level (key.*)
df.drop('level_1', axis=1, inplace=True)
Now this is done, and it looks like this:
  Location    The Data
0   City A  One Museum
1   City A  Two Museum
2   City B    3rd Park
3   City B    4th Park
4   City B    5th Park
The benefit of Python is that it is faster for larger data sets, and you can split using regex in probably a hundred ways. The data source can be all the types you would use for Power Query and more.
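For instance, a minimal sketch of a regex split on the same example data (the pattern r',\s*' is my own choice, not from the original answer):

import pandas as pd

df = pd.DataFrame({
    'The Data': ['One Museum, Two Museum', '3rd Park, 4th Park, 5th Park'],
    'Location': ['City A', 'City B'],
})

# r',\s*' matches a comma plus any amount of trailing whitespace,
# so the split pieces never carry a leading space
parts = df['The Data'].str.split(r',\s*', expand=True, regex=True)
print(parts)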
R
library(data.table)
dt <- fread("yourfile.csv") # or use readxl package for xls files
dt
# Data Location
# 1: One Museum, Two Museum City A
# 2: 3rd Park, 4th Park, 5th Park City B
dt[, .(Data = unlist(strsplit(Data, ", "))), by = Location]
# Location Data
# 1: City A One Museum
# 2: City A Two Museum
# 3: City B 3rd Park
# 4: City B 4th Park
# 5: City B 5th Park
I am working through a pandas tutorial that deals with analyzing sales data (https://www.youtube.com/watch?v=eMOA1pPVUc4&list=PLFCB5Dp81iNVmuoGIqcT5oF4K-7kTI5vp&index=6). The data is already in dataframe format; within the dataframe is one column called "Purchase Address" that contains street, city, and state/zip code. The format looks like this:
Purchase Address
917 1st St, Dallas, TX 75001
682 Chestnut St, Boston, MA 02215
...
My idea was to convert the data to a string and then drop the irrelevant list values. I used the command:
all_data['Splitted Address'] = all_data['Purchase Address'].str.split(',')
That worked for converting the data to a comma-separated list of the form
[917 1st St, Dallas, TX 75001]
Now the whole column 'Splitted Address' looks like this, and I am stuck at this point. I simply wanted to drop the list indices 0 and 2 and keep index 1, i.e. the city, in another column.
In the tutorial, the solution was laid out using the .apply() method:
all_data['Column'] = all_data['Purchase Address'].apply(lambda x: x.split(',')[1])
This solution definitely looks more elegant than mine so far, but I wondered whether I could reach a solution with my approach with a comparable amount of effort.
Thanks in advance.
Use Series.str.split with selection by indexing:
all_data['Column'] = all_data['Purchase Address'].str.split(',').str[1]
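Note that splitting on ',' keeps the leading space of the middle piece (' Dallas'); if that matters, you can chain Series.str.strip() (a small addition of mine, not part of the original answer):

all_data['Column'] = all_data['Purchase Address'].str.split(',').str[1].str.strip()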
I have a dataframe df1 with more than 500k records:
state    lat-long
Florida  (12.34,34.12)
texas    (13.45,56.0)
Ohio     (-15,49)
Florida  (12.04,34.22)
texas    (13.35,56.40)
Ohio     (-15.79,49.34)
Florida  (12.8764,34.2312)
The lat-long value can differ for a particular state, but I need to capture the first occurrence per state, giving a dictionary like this:
dict_state_lat_long = {"Florida":"(12.34,34.12)","texas":"(13.45,56.0)","Ohio":"(-15,49)"}
How can I get this in the most efficient way?
You can use DataFrame.groupby to group the dataframe by state, and then apply the aggregate function first to select the first occurring lat-long value in each group.
Then you can use the to_dict() function to convert the result to a Python dict.
Use this:
grouped = df.groupby("state")["lat-long"].agg("first")
dict_state_lat_long = grouped.to_dict()
print(dict_state_lat_long)
Output:
{'Florida': '(12.34,34.12)', 'Ohio': '(-15,49)', 'texas': '(13.45,56.0)'}
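An equivalent alternative (my own suggestion, not part of the original answer) is to drop duplicate states first and build the dict directly; whether it is faster will depend on the data:

# Keep only the first row per state, then map state -> lat-long
dict_state_lat_long = (
    df.drop_duplicates(subset="state", keep="first")
      .set_index("state")["lat-long"]
      .to_dict()
)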
New to Python, using 3.x
I have a large CSV containing a list of customer names and addresses.
[Name, City, State]
I want to create a fourth column that is a count of the total number of customers living in the current customer's state.
So for example:
Joe, Dallas, TX
Steve, Austin, TX
Alex, Denver, CO
would become:
Joe, Dallas, TX, 2
Steve, Austin, TX, 2
Alex, Denver, CO, 1
I am able to read the file in, and then use groupby to create a Series that contains the values for the 4th column, but I can't figure out how to take that series and match it against the million+ rows in my actual file.
import pandas as pd
mydata = pd.read_csv(r'C:\Users\customerlist.csv', index_col=False)
mydata = mydata.drop_duplicates(subset='name', keep='first')
mydata['state'] = mydata['state'].str.strip()
stateinstalls = mydata.groupby(mydata.state, as_index=False).size()
stateinstalls gives me a Series [2, 1], but I lose the corresponding states ([TX, CO]). It needs to be a tuple, so that I can then go back, iterate through all rows of my spreadsheet, and say something like:
if mydata['state'].isin(stateinstalls(0)):
    mydata[row] = stateinstalls(1)
I feel very lost. I know there has to be a far simpler way to do this, even in place within the array (like a COUNTIF-type function).
Any pointers are much appreciated.
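For what it's worth, a minimal sketch of the usual vectorized approach (my own suggestion, with made-up column names matching the example data):

import pandas as pd

mydata = pd.DataFrame({
    "name": ["Joe", "Steve", "Alex"],
    "city": ["Dallas", "Austin", "Denver"],
    "state": ["TX", "TX", "CO"],
})

# transform('size') computes the group size per state and broadcasts
# it back onto every row, so no manual iteration or merge is needed
mydata["state_count"] = mydata.groupby("state")["state"].transform("size")
print(mydata)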