DataFrame filter column - python

I have the following dataframe x_df.
Which city has the 5th-highest total number of Walmart stores (super stores and regular stores combined)?
data_url = 'https://raw.githubusercontent.com/plotly/datasets/master/1962_2006_walmart_store_openings.csv'
x_df = pd.read_csv(data_url, header=0)
x_df['STRSTATE'].where(x_df['type_store'] == 7)

You can use DataFrame.max() to get the max city count, then get the city name:
x_max = x_df[x_df['city_count'] == x_df['city_count'].max()]
x_max["city_name"]

Edit:
I think something like this is what you want:
data_url = 'https://raw.githubusercontent.com/plotly/datasets/master/1962_2006_walmart_store_openings.csv'
x_df = pd.read_csv(data_url, header=0)
city_store_count = x_df.groupby(['STRCITY']).size().sort_values(ascending=False).to_frame()
city_store_count.columns = ['Stores_in_City']
city_store_count.iloc[4]
The fifth-biggest count is actually a tie for third place with ten stores, so you could print the top 10 instead:
city_store_count.head(10)
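The same per-city counting can be sketched more compactly with value_counts; here with toy data standing in for the Walmart CSV (STRCITY is the city column from the file above):

```python
import pandas as pd

# Toy stand-in for the Walmart CSV: one row per store opening
x_df = pd.DataFrame({'STRCITY': ['A'] * 6 + ['B'] * 5 + ['C'] * 4
                                + ['D'] * 3 + ['E'] * 2 + ['F']})

counts = x_df['STRCITY'].value_counts()  # stores per city, sorted descending
print(counts.index[4], counts.iloc[4])   # the 5th-highest city and its count
```

As in the answer above, ties share positions, so inspect counts.head(10) before trusting position 4.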

Related

convert scraped list to pandas dataframe using columns and index

I process and scrape data from each URL (looping over all the given links); it looks like:
for url in urls:
    page = requests.get(url)
    # fetch and process the page here, acquiring one car's info per page
    print(car.name)
    print(car_table)
and the output:
BMW
['color','red','weight','50kg','height','120cm','width','200cm','serial','','owner','']
FORD
['color','','weight','','height','','width','','serial','','owner','']
HONDA
['color','blue','weight','60kg','height','','width','160cm','serial','OMEGA','owner','']
At the end, how can I get a dataframe like the one below, given that I don't know the number of car fields (columns) or the number of cars (index) in advance, but want a df defined with them as columns and index?
print(car_df)
     |color|weight|height|width|serial|owner
BMW  |red  |50    |120   |200  |      |
FORD |     |      |      |     |      |
HONDA|blue |60    |      |160  |OMEGA |
any help appreciated :)
This approach creates a list of dicts as we iterate through the urls, and after the loop converts it to a DataFrame. I'm assuming that car_table is always a column name followed by its value, over and over again.
import pandas as pd
import numpy as np
#Creating lists from your output instead of requesting from the url since you didn't share that
car_names = ['BMW','FORD','HONDA']
car_tables = [
['color','red','weight','50kg','height','120cm','width','200cm','serial','','owner',''],
['color','','weight','','height','','width','','serial','','owner',''],
['color','blue','weight','60kg','height','','width','160cm','serial','OMEGA','owner',''],
]
urls = range(len(car_names))
all_car_data = []
for url in urls:
    car_name = car_names[url]    # using car_names instead of car.name for this example
    car_table = car_tables[url]  # again, you get this value some other way
    car_data = {'name': car_name}
    columns = car_table[::2]   # starting from 0, take every other entry to get just the columns
    values = car_table[1::2]   # starting from 1, take every other entry to get just the values
    # Zip the columns together with the values, then iterate and update the dict
    for col, val in zip(columns, values):
        car_data[col] = val
    # Add the dict to a list to keep track of all the cars
    all_car_data.append(car_data)

# Convert to a dataframe
df = pd.DataFrame(all_car_data)
# df = df.replace({'': np.nan})  # use this if you want to replace the '' with NaNs
df
Output:
name color weight height width serial owner
0 BMW red 50kg 120cm 200cm
1 FORD
2 HONDA blue 60kg 160cm OMEGA
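The column/value pairing above can also be condensed with dict(zip(...)); a minimal sketch on a trimmed version of the same toy lists:

```python
import pandas as pd

car_names = ['BMW', 'FORD', 'HONDA']
car_tables = [
    ['color', 'red', 'weight', '50kg'],
    ['color', '', 'weight', ''],
    ['color', 'blue', 'weight', '60kg'],
]

# zip each table's even entries (column names) with its odd entries (values)
rows = [{'name': n, **dict(zip(t[::2], t[1::2]))}
        for n, t in zip(car_names, car_tables)]
df = pd.DataFrame(rows)
print(df)
```

This relies on the same assumption as the answer: the table strictly alternates column name, value.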

getting data in the same row knowing the element value in the other column(both column name known)

import pandas as pd
import random
data = pd.read_csv("file.csv")
print(data)
Country City
0 German Berlin
1 France Paris
random_country = random.choice(data['Country'])
How do I get the corresponding city name in a quick way please?
Instead of using the retrieved country name to search the dataframe again, it is more efficient to extract the city at the same time. This can be achieved with the pandas.DataFrame.sample method:
random_entry = data.sample(1)
random_country = random_entry['Country']
random_city = random_entry['City']
Try:
idx = random.choice(data.index)
random_country = data.loc[idx, 'Country']
random_city = data.loc[idx, 'City']
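Note that sample(1) returns a one-row DataFrame, so indexing a column still gives a Series; a sketch extracting plain scalars (toy frame matching the printed data):

```python
import pandas as pd

data = pd.DataFrame({'Country': ['German', 'France'],
                     'City': ['Berlin', 'Paris']})

random_entry = data.sample(1)
random_country = random_entry['Country'].iloc[0]  # scalar string, not a Series
random_city = random_entry['City'].iloc[0]
print(random_country, random_city)
```

Because both scalars come from the same sampled row, the city always matches the country.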

pandas get the min/max value of a row in a dataframe of only those rows that contain a certain string in another column

I feel really stupid now, this should be easy.
I got good help here how-to-keep-the-index-of-my-pandas-dataframe-after-normalazation-json
I need to get the min/max value in the column 'price', but only for rows where the column 'type' is buy/sell. Ultimately I also want to get back the 'id' for that specific order.
So first I need the price value, and second I need the corresponding value of 'id'.
You can find the dataframe that I'm working with in the link.
What I can do is find the min/max value of the whole column 'price' like so :
x = df['price'].max() # = max price
and I can sort out all the "buy" type like so:
d = df[['type', 'price']].value_counts(ascending=True).loc['buy']
but I still can't do both at the same time.
You have to use the .loc method on the dataframe in order to filter by type:
import pandas as pd
data = {"type":["buy","other","sell","buy"], "price":[15,222,11,25]}
df = pd.DataFrame(data)
buy_and_sell = df.loc[df['type'].isin(["sell","buy"])]
min_value = buy_and_sell['price'].min()
max_value = buy_and_sell['price'].max()
min_rows = buy_and_sell.loc[buy_and_sell['price']==min_value]
max_rows = buy_and_sell.loc[buy_and_sell['price']==max_value]
min_rows and max_rows can contain multiple rows, because it is possible that the same min price is repeated.
To extract the index just use .index.
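To get the id and price of the extreme order in one step, idxmin/idxmax on the filtered frame returns the label of the extreme row; a sketch on the same toy data (the txid column is assumed here for illustration):

```python
import pandas as pd

df = pd.DataFrame({'type': ['buy', 'other', 'sell', 'buy'],
                   'price': [15, 222, 11, 25],
                   'txid': ['T1', 'T2', 'T3', 'T4']})

buys = df.loc[df['type'] == 'buy']
highest_buy = buys.loc[buys['price'].idxmax()]  # the whole row at the max price
print(highest_buy['txid'], highest_buy['price'])  # → T4 25
```

Unlike calling .min()/.max() on the whole frame, this keeps the id and the price from the same row.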
hbid = df.loc[df.type == 'buy'].min()[['price', 'txid']]
gives me the lowest value of price and the lowest value of txid, not the id that belongs to the order with the lowest price (.min() is computed per column independently). Any help or tips would be greatly appreciated!
0 OMG4EA-Z2WUP-AQJ2XU None ... buy 0.00200000 XBTEUR # limit 14600.0
1 OBTJMX-WTQSU-DNEOES None ... buy 0.00100000 XBTEUR # limit 14700.0
2 OAULXQ-3B5WJ-LMLSUC None ... buy 0.00100000 XBTEUR # limit 14800.0
[3 rows x 23 columns]
highest buy order =
14800.0
here the id and price . . txid =
price 14600.0
txid OAULXQ-3B5WJ-LMLSUC
I'm still not sure how your isin line works; buy_and_sell was not specified ;)
How I did it:
I first found the highest sell, then found the 'txid' for that price, then had to remove the index from the returned Series. And finally I had to strip a leading whitespace from my string; no idea how it got there.
import subprocess

def get_highest_sell_txid():
    hs = df.loc[df.type == 'sell', :].max()['price']
    hsid = df.loc[df.price == hs, :]
    xd = hsid['txid']
    return xd.to_string(index=False)

xd = get_highest_sell_txid()
sd = xd.strip()
cancel_order = 'python -m krakenapi CancelOrder txid=' + sd
subprocess.run(cancel_order.split())  # run() wants a list of args; a plain string would need shell=True

Mapping a column from one dataframe to another in pandas based on condition

I have two dataframes, df_inv and df_sales.
I need to add a column to df_inv with the salesperson's name, based on the doctor he is tagged to in df_sales. This would be a simple merge if the salesperson-to-doctor relationship in df_sales were unique, but ownership of doctors changes between salespersons, and a row with an updated date is added with each transfer.
So if the invoice date is earlier than the updated date, the previous tagging should be used; if there is no previous tagging, it should show NaN. In other words, for each invoice_date in df_inv, the latest earlier updated_date in df_sales should be used for tagging.
The resulting table should be like this
Final Table
I am relatively new to programming and can usually find my way through problems, but I cannot figure this one out. Any help is appreciated.
import pandas as pd
import numpy as np

df_inv = pd.read_excel(r'C:\Users\joy\Desktop\sales indexing\consolidated report.xlsx')
df_sales1 = pd.read_excel(r'C:\Users\joy\Desktop\sales indexing\Sales Person tagging.xlsx')
df_sales2 = df_sales1.sort_values('Updated Date', ascending=False)
df_sales = df_sales2.reset_index(drop=True)

sales_tag = []
sales_dup = []
counter = 0
for inv_dt, doc in zip(df_inv['Invoice_date'], df_inv['Doctor_Name']):
    for sal, ref, update in zip(df_sales['Sales Person'], df_sales['RefDoctor'],
                                df_sales['Updated Date']):
        if ref == doc:
            if update <= inv_dt and sal not in sales_dup:
                sales_tag.append(sal)
                sales_dup.append(ref)
                break
    sales_dup = []
    counter = counter + 1
    if len(sales_tag) < counter:
        sales_tag.append('none')

df_inv['sales_person'] = sales_tag
This appears to work.
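For reference, pandas ships pandas.merge_asof, which does exactly this "latest earlier date" matching without an explicit double loop; a hedged sketch with made-up rows (column names taken from the loop above):

```python
import pandas as pd

df_inv = pd.DataFrame({
    'Invoice_date': pd.to_datetime(['2020-01-05', '2020-02-10']),
    'Doctor_Name': ['Dr A', 'Dr A'],
})
df_sales = pd.DataFrame({
    'Updated Date': pd.to_datetime(['2020-01-01', '2020-02-01']),
    'RefDoctor': ['Dr A', 'Dr A'],
    'Sales Person': ['Alice', 'Bob'],
})

# merge_asof needs both frames sorted on the date keys
result = pd.merge_asof(
    df_inv.sort_values('Invoice_date'),
    df_sales.sort_values('Updated Date'),
    left_on='Invoice_date', right_on='Updated Date',
    left_by='Doctor_Name', right_by='RefDoctor',
    direction='backward',  # match the latest 'Updated Date' <= 'Invoice_date'
)
print(result['Sales Person'].tolist())  # → ['Alice', 'Bob']
```

Invoices with no earlier tagging come back as NaN, which matches the requirement in the question.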

Looping Thru a Nested Dictionary in Python

So I need help looping through the nested dictionaries I have created, in order to answer some questions. My code, which splits the data into the two dictionaries and adds items to them, is as follows:
Link to csv :
https://docs.google.com/document/d/1v68_QQX7Tn96l-b0LMO9YZ4ZAn_KWDMUJboa6LEyPr8/edit?usp=sharing
import csv

region_data = {}
country_data = {}
answers = []
data = []
cuntry = False

f = open('dph_SYB60_T03_Population Growth, Fertility and Mortality Indicators.csv')
reader = csv.DictReader(f)
for line in reader:
    # This gets all the values into a standard dict
    data.append(dict(line))

# This loops through the dicts and creates variables to hold specific items
for i in data:
    location = i['Region/Country/Area']  # collects all of the Region/Country/Area
    years = i['Year']                    # gets all the years
    i_d = i['ID']
    info = i['Footnotes']
    series = i['Series']
    value = float(i['Value'])
    stats = {i['Series']: i['Value']}
    if i['ID'] == '4':
        cuntry = True
    if cuntry:
        if location not in country_data:
            country_data[location] = {}
        if years not in country_data[location]:
            country_data[location][years] = {}
        if series not in country_data[location][years]:
            country_data[location][years][series] = value
    else:
        if location not in region_data:
            region_data[location] = {}
        if years not in region_data[location]:
            region_data[location][years] = {}
        if series not in region_data[location][years]:
            region_data[location][years][series] = value
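The chains of `if location not in …` checks can be collapsed with dict.setdefault, which inserts a default only when the key is missing; a small sketch on made-up rows:

```python
target = {}
rows = [('Africa', '2005', 'Series X', 1.0),
        ('Africa', '2015', 'Series X', 2.0)]

for location, year, series, value in rows:
    # each setdefault returns the existing inner dict, or inserts a fresh one
    target.setdefault(location, {}).setdefault(year, {}).setdefault(series, value)

print(target)
```

Like the original code, an already-present series value is left untouched.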
When I print the dictionary region_data, the output is nested: each region is a key in the dict, each year is a key in that region's dict, and so on.
I want to understand how I can loop through the data and answer a question like:
Which region had the largest numeric decrease in maternal mortality ratio from 2005 to 2015?
Here "Maternal mortality ratio (deaths per 100,000 population)" is a key within the dictionary.
Build a dataframe
Use pandas for that and read your file according to this answer:
import pandas as pd
filename = 'dph_SYB60_T03_Population Growth, Fertility and Mortality Indicators.csv'
df = pd.read_csv(filename)
Build a pivot table
Then you can make a pivot table over "Region/Country/Area" and "Series", using "max" as the aggregate function:
pivot = df.pivot_table(index='Region/Country/Area', columns='Series', values='Value', aggfunc='max')
Sort by your series of interest
Then sort your pivot table by the series name, using the argument "ascending":
df_sort = pivot.sort_values(by='Maternal mortality ratio (deaths per 100,000 population)', ascending=False)
Extract the greatest value in the first row.
Finally you will have the answer to your question.
df_sort['Maternal mortality ratio (deaths per 100,000 population)'].head(1)
Region/Country/Area
Sierra Leone 1986.0
Name: Maternal mortality ratio (deaths per 100,000 population), dtype: float64
Warning: Some of your regions have records before 2005, so you should filter your data only for values between 2005 and 2015.
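To answer the "largest numeric decrease" question directly, one can pivot the two years side by side and subtract; a sketch on toy rows with the same column names as the CSV (the values are made up):

```python
import pandas as pd

serie = 'Maternal mortality ratio (deaths per 100,000 population)'
df = pd.DataFrame({
    'Region/Country/Area': ['A', 'A', 'B', 'B'],
    'Year': [2005, 2015, 2005, 2015],
    'Series': [serie] * 4,
    'Value': [900.0, 400.0, 300.0, 250.0],
})

# keep only the relevant series and the two endpoint years
sub = df[(df['Series'] == serie) & (df['Year'].isin([2005, 2015]))]
wide = sub.pivot_table(index='Region/Country/Area', columns='Year', values='Value')
decrease = (wide[2005] - wide[2015]).sort_values(ascending=False)
print(decrease.head(1))  # region with the largest numeric decrease
```

This also handles the warning above, since only the 2005 and 2015 values enter the comparison.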
If you prefer to loop through dictionaries in Python 3.x, you can use each dictionary's .items() method and nest three loops.
With the top-level dictionary called dict_total here, this code does it:
out_region = None
out_value = None
sel_serie = 'Maternal mortality ratio (deaths per 100,000 population)'
min_year = 2005
max_year = 2015

for reg, dict_reg in dict_total.items():
    print(reg)
    for year, dict_year in dict_reg.items():
        if min_year <= int(year) <= max_year:  # the year keys are strings from the CSV
            print(year)
            for serie, value in dict_year.items():
                if serie == sel_serie and value is not None:
                    print('{} {}'.format(serie, value))
                    if out_value is None or out_value < value:
                        out_value = value
                        out_region = reg

print('Region: {}\nSerie: {} Value: {}'.format(out_region, sel_serie, out_value))
