Group datetime series - python

I have a data frame with the columns “city” and “datetime”. The data records the arrival of VIPs in the city.
City datetime
New York 2022-12-06 10:37:25
New York 2022-12-06 10:42:34
New York 2022-12-06 10:47:12
New York 2022-12-06 10:52:10
New York 2022-12-06 02:37:25
As you can see, the last row stands out from the rest. Each of the first four rows is less than 10 minutes after the row above it, while the last row's datetime differs from the others by more than 10 minutes.
Now I want to split the city into 2 groups: the first four rows as one group and the last row alone as another.
Desired output:
City count datetime
New York 4 ['2022-12-06 10:37:25', '2022-12-06 10:42:34', '2022-12-06 10:47:12', '2022-12-06 10:52:10']
New York 1 ['2022-12-06 02:37:25']
This is my first time using this forum. Any help is greatly appreciated.
I have tried groupby on the “city” column, but it just groups every row with the same city name. I want to group the rows for a city based on the datetime.

You can simply use groupby with Grouper:
import pandas as pd

# create df
df = pd.DataFrame({
    'City': ['New York', 'New York', 'New York', 'New York', 'New York'],
    'datetime': ['2022-12-06 10:37:25', '2022-12-06 10:42:34', '2022-12-06 10:47:12', '2022-12-06 10:52:10',
                 '2022-12-06 02:37:25']
})
# set datetime col as index
df['datetime'] = pd.to_datetime(df['datetime'])
df.set_index('datetime', inplace=True)
df['Date'] = df.index
# groupby City and 15-minute bins; origin='start' anchors the bins at the earliest timestamp
grouped = df.groupby(['City', pd.Grouper(freq='15min', origin='start')])
new_df = grouped.count()
new_df['Dates'] = grouped['Date'].apply(list)
new_df.reset_index().drop('datetime', axis=1)
output:
Learn more: pandas.Grouper
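Note that Grouper uses fixed 15-minute bins, so two arrivals a few minutes apart can still land in different bins if they straddle a bin edge. If the rule is really "start a new group whenever the gap to the previous arrival exceeds 10 minutes", a gap-based sketch might look like the following. This is not the method from the answer above; the column names session, count and datetimes are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["New York"] * 5,
    "datetime": pd.to_datetime([
        "2022-12-06 10:37:25", "2022-12-06 10:42:34", "2022-12-06 10:47:12",
        "2022-12-06 10:52:10", "2022-12-06 02:37:25",
    ]),
})

# Sort so that each row is compared with the previous arrival in the same city.
df = df.sort_values(["City", "datetime"])

# Time since the previous arrival in the same city (NaT for the first arrival).
gap = df.groupby("City")["datetime"].diff()

# A new session starts at the first arrival of a city or after a gap of more than 10 minutes.
df["session"] = (gap.isna() | (gap > pd.Timedelta(minutes=10))).cumsum()

result = (
    df.groupby(["City", "session"])
      .agg(count=("datetime", "count"), datetimes=("datetime", list))
      .reset_index()
      .drop(columns="session")
)
print(result)
```

Because the rows are sorted first, the 02:37:25 arrival forms the first group and the four arrivals between 10:37 and 10:52 form the second.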

Related

How to pivot a table based on the values of one column

let's say I have the below dataframe:
dataframe = pd.DataFrame({'col1': ['Name', 'Location', 'Phone','Name', 'Location'],
'Values': ['Mark', 'New York', '656','John', 'Boston']})
which looks like this:
col1 Values
Name Mark
Location New York
Phone 656
Name John
Location Boston
As you can see, the column names I want are stored as rows in col1, and not every person has a Phone number. Is there a way to transform this dataframe to look like this:
Name Location Phone
Mark New York 656
John Boston NaN
I have tried to transpose in Excel, do a Pivot and a Pivot_Table:
pivoted = pd.pivot_table(data = dataframe, values='Values', columns='col1')
But this comes out incorrectly. Any help would be appreciated.
NOTES: Every new section starts with a Name value and ends before the Name value of the next person.
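Create a new index using cumsum to identify each section, then pivot as usual:
# df is the dataframe from the question
df['index'] = df['col1'].eq('Name').cumsum()
# keyword arguments are required by pivot in pandas 2.0 and later
df.pivot(index='index', columns='col1', values='Values')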
col1 Location Name Phone
index
1 New York Mark 656
2 Boston John NaN
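For reference, here is a self-contained sketch of the same cumsum-then-pivot idea using the question's data. The names person and wide are only illustrative, and the final reindex is an optional extra step that puts the columns in the order the question asked for.

```python
import pandas as pd

dataframe = pd.DataFrame({
    "col1": ["Name", "Location", "Phone", "Name", "Location"],
    "Values": ["Mark", "New York", "656", "John", "Boston"],
})

# Every "Name" row starts a new person, so a running count of "Name" rows labels each section.
person = dataframe["col1"].eq("Name").cumsum()

wide = (
    dataframe.assign(person=person)
    .pivot(index="person", columns="col1", values="Values")
    .reindex(columns=["Name", "Location", "Phone"])  # optional: question's column order
)
print(wide)
```

This relies on the note in the question that every section begins with a Name row; rows that appear before the first Name would end up in a section numbered 0.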

Pandas: How to find whether address in one dataframe is from city and state in another dataframe?

I have a dataframe of addresses as below:
main_df =
address
0 3, my_street, Mumbai, Maharashtra
1 Bangalore Karnataka 45th Avenue
2 TelanganaHyderabad some_street, some apartment
And I have a dataframe with city and state as below (note that a few states have cities with the same name too):
city_state_df =
city state
0 Mumbai Maharashtra
1 Ahmednagar Maharashtra
2 Ahmednagar Bihar
3 Bangalore Karnataka
4 Hyderabad Telangana
I want to have the matching city and state next to each address. I am able to do so with iterrows() and nested for loops, but that takes more than an hour for a mere 15k records. What is the best way to achieve this, considering that the addresses are written in no fixed format and multiple states have cities with the same name?
My code below:
import pandas as pd

main_df = pd.DataFrame({'address': ['3, my_street, Mumbai, Maharashtra', 'Bangalore Karnataka 45th Avenue', 'TelanganaHyderabad some_street, some apartment']})
city_state_df = pd.DataFrame({'city': ['Mumbai', 'Ahmednagar', 'Ahmednagar', 'Bangalore', 'Hyderabad'],
                              'state': ['Maharashtra', 'Maharashtra', 'Bihar', 'Karnataka', 'Telangana']})
# None gives object columns that can hold the city and state strings
main_df['city'] = None
main_df['state'] = None
for i, df_row in main_df.iterrows():
    for j, city_row in city_state_df.iterrows():
        if city_row['city'] in df_row['address']:
            # all states that have a city with this name
            city_filtered = city_state_df[city_state_df['city'] == city_row['city']]
            for k, fil_row in city_filtered.iterrows():
                if fil_row['state'] in df_row['address']:
                    # write through .loc; assigning to df_row only changes a copy
                    main_df.loc[i, 'city'] = fil_row['city']
                    main_df.loc[i, 'state'] = fil_row['state']
                    break
            break
Hello, maybe something like this:
main_df = main_df.reindex(columns=[*main_df.columns.tolist(), 'state', 'city'], fill_value=None)
for i, row in city_state_df.iterrows():
    # regex=False treats the city and state names as literal text
    main_df.loc[(main_df.address.str.contains(row.city, regex=False)) &
                (main_df.address.str.contains(row.state, regex=False)),
                ['city', 'state']] = [row.city, row.state]
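A self-contained sketch of this idea is shown below: loop once over the (city, state) pairs and let pandas test all addresses at once, instead of looping over every address. The function name tag_city_state is hypothetical, and the sketch assumes plain substring matching is good enough for the addresses.

```python
import pandas as pd

main_df = pd.DataFrame({"address": [
    "3, my_street, Mumbai, Maharashtra",
    "Bangalore Karnataka 45th Avenue",
    "TelanganaHyderabad some_street, some apartment",
]})
city_state_df = pd.DataFrame({
    "city": ["Mumbai", "Ahmednagar", "Ahmednagar", "Bangalore", "Hyderabad"],
    "state": ["Maharashtra", "Maharashtra", "Bihar", "Karnataka", "Telangana"],
})


def tag_city_state(addresses: pd.DataFrame, lookup: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `addresses` with city/state filled where both names occur."""
    out = addresses.copy()
    out["city"] = None   # object columns, so strings can be assigned later
    out["state"] = None
    for row in lookup.itertuples(index=False):
        # Literal substring tests over the whole address column at once.
        mask = (out["address"].str.contains(row.city, regex=False)
                & out["address"].str.contains(row.state, regex=False))
        out.loc[mask, "city"] = row.city
        out.loc[mask, "state"] = row.state
    return out


tagged = tag_city_state(main_df, city_state_df)
print(tagged)
```

The work is one vectorized pass per lookup row, so the cost grows with the number of city and state pairs rather than with pairs multiplied by addresses in Python-level loops. If an address matches several pairs, the last matching pair wins.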

How to replace column values based on a list?

I have a list like this:
x = ['Las Vegas', 'San Francisco', 'Dallas']
And a dataframe that looks a bit like this:
import pandas as pd
data = [['Las Vegas (Clark County)', 25], ['New York', 23],
        ['Dallas', 27]]
df = pd.DataFrame(data, columns = ['City', 'Value'])
I want to replace city values in the DF such as "Las Vegas (Clark County)" with "Las Vegas". My dataframe contains multiple cities whose names need to be changed in this way. I know I could use a regex to just strip off the part in parentheses, but I was wondering if there was a more clever, generic way.
Use Series.str.extract with the list values joined by | (regex OR), then replace the non-matched values with the originals using Series.fillna:
df['City'] = df['City'].str.extract(f'({"|".join(x)})', expand=False).fillna(df['City'])
print (df)
City Value
0 Las Vegas 25
1 New York 23
2 Dallas 27
Another idea is to use Series.str.contains in a loop, but it will be slow for a large DataFrame and many values in the list:
for val in x:
    df.loc[df['City'].str.contains(val), 'City'] = val
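One caveat with joining the list directly into a regex is that names containing characters such as ( or . would be interpreted as regex syntax. A hedged variant of the extract approach that escapes each name with re.escape might look like this; trying longer names first is an extra precaution for lists where one name is a prefix of another.

```python
import re

import pandas as pd

x = ["Las Vegas", "San Francisco", "Dallas"]
df = pd.DataFrame(
    [["Las Vegas (Clark County)", 25], ["New York", 23], ["Dallas", 27]],
    columns=["City", "Value"],
)

# Escape each name so it is matched literally, and try longer names first.
names = sorted(x, key=len, reverse=True)
pattern = "(" + "|".join(re.escape(name) for name in names) + ")"

# Rows without a match come back as NaN and keep their original value.
df["City"] = df["City"].str.extract(pattern, expand=False).fillna(df["City"])
print(df)
```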

How to check each time-series entry if name/id is in previous years entries?

I'm stuck.
I have a dataframe where rows are created at the time a customer quotes cost of a product.
My (truncated) data:
import pandas as pd
d = {'Quote Date': pd.to_datetime(['3/10/2016', '3/10/2016', '3/10/2016',
'3/10/2016', '3/11/2017']),
'Customer Name': ['Alice', 'Alice', 'Bob', 'Frank', 'Frank']
}
df = pd.DataFrame(data=d)
I want to check, for each row, whether this is the first interaction I have had with this customer in over a year. My thought is to check each row's customer name against the customer names in the preceding year's worth of rows. If a row's customer name is not in the previous-year subset, then I will set a True value in the new column:
df['Is New']
In practice, the dataframe's shape will be close to (150000000, 5) and I fear adding a calculated column will not scale well.
I also thought to create a multi-index with the date and then customer name, but I was not sure how to execute the necessary search with this indexing.
Please apply any method you believe would be more efficient at checking for the first instance of a customer in the preceding year.
Here is the first approach that came to mind. I don't expect it to scale that well to 150M rows, but give it a try. Also, your truncated data does not produce a very interesting output, so I created some test data in which some users are new, and some are not:
import pandas as pd

# Create example data
d = {'Quote Date': pd.to_datetime(['3/10/2016',
                                   '3/10/2016',
                                   '6/25/2016',
                                   '1/1/2017',
                                   '6/25/2017',
                                   '9/29/2017']),
     'Customer Name': ['Alice', 'Bob', 'Alice', 'Frank', 'Bob', 'Frank']
     }
df = pd.DataFrame(d)
df.set_index('Quote Date', inplace=True)
# Solution: a customer is new if the name is absent from the preceding 365 days
day = pd.DateOffset(days=1)
is_new = [s['Customer Name'] not in df.loc[i - 365*day:i-day]['Customer Name'].values
          for i, s in df.iterrows()]
df['Is New'] = is_new
df.reset_index(inplace=True)
# Result
df
Quote Date Customer Name Is New
0 2016-03-10 Alice True
1 2016-03-10 Bob True
2 2016-06-25 Alice False
3 2017-01-01 Frank True
4 2017-06-25 Bob True
5 2017-09-29 Frank False
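Since the row-by-row loop is unlikely to scale to 150M rows, a vectorized sketch of the same check is given below. It is an alternative to the loop above, not the answer's method: it compares each quote with the same customer's previous quote via a grouped diff. One assumed difference in behavior: a second quote from the same customer on the same day counts as not new here, whereas the loop only looks at earlier days.

```python
import pandas as pd

df = pd.DataFrame({
    "Quote Date": pd.to_datetime(["3/10/2016", "3/10/2016", "6/25/2016",
                                  "1/1/2017", "6/25/2017", "9/29/2017"]),
    "Customer Name": ["Alice", "Bob", "Alice", "Frank", "Bob", "Frank"],
})

# Sort by date so diff() compares each quote with the customer's previous one.
df = df.sort_values("Quote Date", kind="stable")

# Time since the same customer's previous quote (NaT for their first quote).
gap = df.groupby("Customer Name")["Quote Date"].diff()

# New if there is no earlier quote, or the previous one is more than 365 days old.
df["Is New"] = gap.isna() | (gap > pd.Timedelta(days=365))
print(df)
```

On the example data this gives the same Is New column as the loop.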

How do I use a mapping variable to re-index a dataframe?

I have the following data frame:
population GDP
country
United Kingdom 4.5m 10m
Spain 3m 8m
France 2m 6m
I also have the following information in a 2-column dataframe (I am happy for this to be made into another data structure if that would be more beneficial, as the plan is that it will be stored in a VARS file):
county code
Spain es
France fr
United Kingdom uk
The 'mapping' data structure will be in no particular order, as countries will be added and removed at random times.
What is the best way to re-index the data frame to its country code from its country name?
Is there a smart solution that would also work on other columns? For example, if a data frame was indexed on date but one column was df['country'], could you change df['country'] to its country code? Finally, is there a third option that would add an additional column, either country or code, that selects the right code based on a country name in another column?
I think you can use Series.map, but it works only with a Series, so you need Index.to_series. Finally, use rename_axis (new in pandas 0.18.0):
df1.index = df1.index.to_series().map(df2.set_index('county').code)
df1 = df1.rename_axis('county')
#pandas below 0.18.0
#df1.index.name = 'county'
print (df1)
population GDP
county
uk 4.5m 10m
es 3m 8m
fr 2m 6m
This is the same as mapping by a dict:
d = df2.set_index('county').code.to_dict()
print (d)
{'France': 'fr', 'Spain': 'es', 'United Kingdom': 'uk'}
df1.index = df1.index.to_series().map(d)
df1 = df1.rename_axis('county')
#pandas below 0.18.0
#df1.index.name = 'county'
print (df1)
population GDP
county
uk 4.5m 10m
es 3m 8m
fr 2m 6m
EDIT:
Another solution with Index.map, so to_series is omitted:
d = df2.set_index('county').code.to_dict()
print (d)
{'France': 'fr', 'Spain': 'es', 'United Kingdom': 'uk'}
df1.index = df1.index.map(d.get)
df1 = df1.rename_axis('county')
#pandas below 0.18.0
#df1.index.name = 'county'
print (df1)
population GDP
county
uk 4.5m 10m
es 3m 8m
fr 2m 6m
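A self-contained sketch of the Index.map variant, using the question's data, is shown below. The check for unmapped names is an addition of mine and not part of the answer above; it guards against countries that are missing from the mapping, which would otherwise silently become NaN in the new index.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"population": ["4.5m", "3m", "2m"], "GDP": ["10m", "8m", "6m"]},
    index=pd.Index(["United Kingdom", "Spain", "France"], name="country"),
)
df2 = pd.DataFrame({
    "county": ["Spain", "France", "United Kingdom"],  # column name as spelled in the question
    "code": ["es", "fr", "uk"],
})

# Series keyed by country name; the row order of df2 does not matter.
codes = df2.set_index("county")["code"]

# Fail loudly if a country in df1 has no code (assumed to be the desired behavior).
unmapped = df1.index.difference(codes.index)
if len(unmapped) > 0:
    raise KeyError(f"no code for: {list(unmapped)}")

df1.index = df1.index.map(codes).rename("code")
print(df1)
```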
Here are some brief ways to approach your 3 questions. More details below:
1) How to change index based on mapping in separate df
Use df_with_mapping.to_dict("split") to create a dictionary, use a dict comprehension to change it into {"old1":"new1",...,"oldn":"newn"} form, then use df.index = df.base_column.map(dictionary) to get the changed index.
2) How to change index if the new column is in the same df:
df.index = df["column_you_want"]
3) Creating a new column by mapping on a old column:
df["new_column"] = df["old_column"].map({"old1":"new1",...,"oldn":"newn"})
1) Mapping for the current index exists in separate dataframe but you don't have the mapped column in the dataframe yet
This is essentially the same as question 2 with the additional step of creating a dictionary for the mapping you want.
#creating the mapping dictionary in the form of current index : future index
df2 = pd.DataFrame([["es"],["fr"]],index = ["spain","france"])
interm_dict = df2.to_dict("split") #Creates a dictionary split into column labels, index labels and data
#We only want the first column of the data and the index, so we build a new dict with a dict comprehension and zip
mapping_dict = {country:data[0] for country,data in zip(interm_dict["index"],interm_dict['data'])}
df["country"] = df.index #Create a new column if you want to save the index
df.index = pd.Series(df.index).map(mapping_dict) #change the index
df.index.name = "" #Blanks out index name
df = df.drop(columns="county code") #Drops the county code column to avoid duplicate columns
Before:
county code language
spain es spanish
france fr french
After:
language country
es spanish spain
fr french france
2) Changing the current index to one of the columns already in the dataframe
df = pd.DataFrame([["es","spanish"],["fr","french"]], columns = ["county code","language"], index = ["spain", "france"])
df["country"] = df.index #if you want to save the original index
df.index = df["county code"] #The only step you actually need
df.index.name = "" #if you want a blank index name
df = df.drop(columns="county code") #if you don't want the duplicate column
Before:
county code language
spain es spanish
france fr french
After:
language country
es spanish spain
fr french france
3) Creating an additional column based on another column
This is again essentially the same as step 2 except we create an additional column instead of assigning .index to the created series.
df = pd.DataFrame([["es","spanish"],["fr","french"]], columns = ["county code","language"], index = ["spain", "france"])
df["city"] = df["county code"].map({"es":"barcelona","fr":"paris"})
Before:
county code language
spain es spanish
france fr french
After:
county code language city
spain es spanish barcelona
france fr french paris
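To tie this back to the third part of the question, here is a small sketch that adds a code column from a country column on a date-indexed frame, plus a column that falls back to the country name when no code is known. The data and the column names value and country_or_code are hypothetical.

```python
import pandas as pd

codes = {"Spain": "es", "France": "fr", "United Kingdom": "uk"}

# Hypothetical date-indexed frame with a country column.
df = pd.DataFrame(
    {"country": ["France", "Spain", "Italy"], "value": [1, 2, 3]},
    index=pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
)

# Countries missing from the mapping (Italy here) become NaN.
df["code"] = df["country"].map(codes)

# Either the code, or the original country name when no code exists.
df["country_or_code"] = df["code"].fillna(df["country"])
print(df)
```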
