Selecting a random row from a dataframe in Python based on certain conditions

Hi, this is probably a very basic fix, but I am completely stuck and don't know enough about Python to figure out how to go about this myself. I made a dictionary of restaurants in my city and created a dataframe from it. The whole program is just supposed to pick a random restaurant out of the dataframe. However, I want it to be able to select random restaurants based on certain things. For instance, "Cuisine" is a category, and I want it to be able to select a random restaurant (row) based on the cuisine being Mexican. I hope that makes sense because I am very lost.
My code is also below, but there is not much to it:
import pandas as pd
# Define a dictionary containing restaurant data
data = {'Restaurant':['August Henrys Burger Bar', 'Bridges & Bourbon', 'The Capital Grille', 'Chinatown Inn', 'Chipotle','Condado Tacos','Crafted North','Cristos Mediterranean Grille','Five Guys','Forbes Tavern','Freshii','Genoa Pizza & Bar','Giovannis Pizza & Pasta','Hello Bistro','Joe and Pie Cafe & Pizzeria','Las Velas','Mandarin Gourmet','McCormick and Schmick','Moes Southwest Grill','Nickys Thai Kitchen','Noodles & Company','The Original Oyster House','Pizza Parma','Primanti Bros','Siam Thai Restaurant','The Simple Greek','SlyFox Taphouse','SoFresh','Villa Reale Pizzeria & Restaurant','The Warren','The Yard'],
'Cuisine':['American', 'American', 'American','Asian', 'Mexican','Mexican','American','Mediterranean','American','American','American','Italian','Italian','American','Italian','Mexican','Asian','American','Mexican','Asian','American','American','Italian','American','Asian','Mediterranean','American','American','Italian','American','American'],
'Address':['946 Penn Avenue 412-765-3270', '930 Penn Avenue 412-586-4287', '301 Fifth Avenue 412-338-9100', '522 Third Avenue 412-261-1291', '211 Forbes Avenue 412-224-5586','971 Liberty Avenue 412-281-9111','Marriott City Center 412-471-4000','130 6th Street 412-261-6442','Three PPH Place 412-227-0206','310 Forbes Avenue 412-281-1999','501 Grant Street 412-430-0318','111 Market Street 412-281-6100','123 6th Street 412-281-7060','292 Forbes Avenue 412-434-0100','955 Liberty Avenue 412-738-0603','21 Market Square 412-251-0031','305 Wood Street 412-261-6151','301 Fifth Avenue 412-201-6992','210 Forbes Avenue 412-224-4422','903 Penn Avenue 412-471-8424','476 McMasters Way 412-562-2191','20 Market Square 412-566-7925','963 Liberty Avenue 412-577-7300','2 Market Square 412-261-1599','410 First Avenue 412-281-1122','4313 Market Street 412-261-4976','300 Liberty Avenue 412-586-7474','Five PPG Place Suite 100 412-586-7240','628 Smithfield Street 412-391-3963','245 7th Street 412-201-5888','100 Fifth Avenue 412-291-8182'],
'Operation':['Local', 'Local', 'Franchise', 'Local', 'Franchise','Franchise','Franchise','Local','Franchise','Local','Franchise','Local','Local','Franchise','Franchise','Local','Local','Franchise','Franchise','Local','Franchise','Franchise','Franchise','Franchise','Local','Local','Franchise','Local','Local','Local','Franchise']}
# Convert the dictionary into a DataFrame
df = pd.DataFrame(data)

You can first filter by Cuisine and then use sample to pick a random row:
df.loc[df.Cuisine=='Mexican'].sample(1)
Restaurant Cuisine Address Operation
18 Moes Southwest Grill Mexican 210 Forbes Avenue 412-224-4422 Franchise
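If you want to do the same thing for other categories, one option is to wrap the filter-and-sample step in a small helper function. This is just a sketch using the df defined above; the function name and its defaults are illustrative and not part of the original question:
def random_restaurant(df, column='Cuisine', value='Mexican'):
    # Keep only the rows where `column` equals `value`, then sample one of them.
    matches = df.loc[df[column] == value]
    if matches.empty:
        raise ValueError(f'No rows found where {column} == {value!r}')
    return matches.sample(1)

# Example usage with the df defined above:
# random_restaurant(df, 'Cuisine', 'Mexican')
# random_restaurant(df, 'Operation', 'Local')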

Related

Pyspark AnalysisException Py4JJavaError on transformation withColumn()

I am working with PySpark, using the withColumn() command to do some basic transformation on the dataframe, namely to update the value of a column. I'm looking for some debugging assistance while I also study the problem.
Pyspark is issuing an AnalysisException & Py4JJavaError on the usage of the pyspark.withColumn command.
_c49='EVENT_NARRATIVE' is the column that the withColumn('EVENT_NARRATIVE') call is meant to reference inside the Spark df (dataframe).
from pyspark.sql.functions import *
from pyspark.sql.types import *
df = df.withColumn('EVENT_NARRATIVE', lower(col('EVENT_NARRATIVE')))
Py4JJavaError: An error occurred while calling o100.withColumn.
: org.apache.spark.sql.AnalysisException: cannot resolve '`EVENT_NARRATIVE`' given input columns: [_c3, _c17, _c40, _c21, _c48, _c12, _c39, _c18, _c31, _c10, _c45, _c26, _c5, _c43, _c24, _c33, _c9, _c14, _c1, _c16, _c47, _c20, _c46, _c32, _c22, _c7, _c2, _c42, _c37, _c36, _c30, _c8, _c38, _c23, _c25, _c13, _c29, _c41, _c19, _c44, _c11, _c28, _c6, _c50, _c49, _c0, _c15, _c4, _c34, _c27, _c35];;
'Project [_c0#604, _c1#605, _c2#606, _c3#607, _c4#608, _c5#609, _c6#610, _c7#611, _c8#612, _c9#613, _c10#614, _c11#615, _c12#616, _c13#617, _c14#618, _c15#619, _c16#620, _c17#621, _c18#622, _c19#623, _c20#624, _c21#625, _c22#626, _c23#627, ... 28 more fields]
+- Relation[_c0#604,_c1#605,_c2#606,_c3#607,_c4#608,_c5#609,_c6#610,_c7#611,_c8#612,_c9#613,_c10#614,_c11#615,_c12#616,_c13#617,_c14#618,_c15#619,_c16#620,_c17#621,_c18#622,_c19#623,_c20#624,_c21#625,_c22#626,_c23#627,... 27 more fields] csv
Sample data from df.head():
[Row(_c0='BEGIN_YEARMONTH', _c1='BEGIN_DAY', _c2='BEGIN_TIME', _c3='END_YEARMONTH', _c4='END_DAY', _c5='END_TIME', _c6='EPISODE_ID', _c7='EVENT_ID', _c8='STATE', _c9='STATE_FIPS', _c10='YEAR', _c11='MONTH_NAME', _c12='EVENT_TYPE', _c13='CZ_TYPE', _c14='CZ_FIPS', _c15='CZ_NAME', _c16='WFO', _c17='BEGIN_DATE_TIME', _c18='CZ_TIMEZONE', _c19='END_DATE_TIME', _c20='INJURIES_DIRECT', _c21='INJURIES_INDIRECT', _c22='DEATHS_DIRECT', _c23='DEATHS_INDIRECT', _c24='DAMAGE_PROPERTY', _c25='DAMAGE_CROPS', _c26='SOURCE', _c27='MAGNITUDE', _c28='MAGNITUDE_TYPE', _c29='FLOOD_CAUSE', _c30='CATEGORY', _c31='TOR_F_SCALE', _c32='TOR_LENGTH', _c33='TOR_WIDTH', _c34='TOR_OTHER_WFO', _c35='TOR_OTHER_CZ_STATE', _c36='TOR_OTHER_CZ_FIPS', _c37='TOR_OTHER_CZ_NAME', _c38='BEGIN_RANGE', _c39='BEGIN_AZIMUTH', _c40='BEGIN_LOCATION', _c41='END_RANGE', _c42='END_AZIMUTH', _c43='END_LOCATION', _c44='BEGIN_LAT', _c45='BEGIN_LON', _c46='END_LAT', _c47='END_LON', _c48='EPISODE_NARRATIVE', _c49='EVENT_NARRATIVE', _c50='DATA_SOURCE'),
Row(_c0='201210', _c1='29', _c2='1600', _c3='201210', _c4='29', _c5='1922', _c6='68680', _c7='416744', _c8='NEW HAMPSHIRE', _c9='33', _c10='2012', _c11='October', _c12='High Wind', _c13='Z', _c14='12', _c15='EASTERN HILLSBOROUGH', _c16='BOX', _c17='29-OCT-12 16:00:00', _c18='EST-5', _c19='29-OCT-12 19:22:00', _c20='0', _c21='0', _c22='0', _c23='0', _c24='109.60K', _c25='0.00K', _c26='ASOS', _c27='55.00', _c28='MG', _c29=None, _c30=None, _c31=None, _c32=None, _c33=None, _c34=None, _c35=None, _c36=None, _c37=None, _c38=None, _c39=None, _c40=None, _c41=None, _c42=None, _c43=None, _c44=None, _c45=None, _c46=None, _c47=None, _c48='Sandy, a hybrid storm with both tropical and extra-tropical characteristics, brought high winds and coastal flooding to southern New England. Easterly winds gusted to 50 to 60 mph for interior southern New England; 55 to 65 mph along the eastern Massachusetts coast and along the I-95 corridor in southeast Massachusetts and Rhode Island; and 70 to 80 mph along the southeast Massachusetts and Rhode Island coasts. A few higher higher gusts occurred along the Rhode Island coast. A severe thunderstorm embedded in an outer band associated with Sandy produced wind gusts to 90 mph and concentrated damage in Wareham early Tuesday evening, |a day after the center of Sandy had moved into New Jersey. In general, moderate coastal flooding occurred along the Massachusetts coastline, and major coastal flooding impacted the Rhode Island coastline. The storm surge was generally 2.5 to 4.5 feet along the east coast of Massachusetts, but peaked late Monday afternoon in between high tide cycles. Seas built to between 20 and 25 feet Monday afternoon and evening just off the Massachusetts east coast. Along the south coast, the storm surge was 4 to 6 feet and seas from 30 to a little over 35 feet were observed in the outer coastal waters. The very large waves on top of the storm surge caused destructive coastal flooding along stretches of the Rhode Island exposed south coast. ||Sandy grew into a hurricane over the southwest Caribbean and then headed north across Jamaica, Cuba, and the Bahamas. As Sandy headed north of the Bahamas, the storm interacted with a vigorous weather system moving west to east across the United States and began to take on a hybrid structure. Strong high pressure over southeast Canada helped with the expansion of the strong winds well north of the center of Sandy. In essence, Sandy retained the structure of a hurricane near its center (until shortly before landfall) while taking on more of an extra-tropical cyclone configuration well away from the center. Sandy's track was unusual. The storm headed northeast and then north across the western Atlantic and then sharply turned to the west to make landfall near Atlantic City, NJ during Monday evening. Sandy subsequently weakened and moved west across southern Pennsylvania on Tuesday before turning north and heading across western New York state into Quebec during Tuesday night and Wednesday.', _c49='The Automated Surface Observing System at Manchester-Boston Regional Airport (KMHT) recorded sustained wind speeds of 38 mph and gusts to 63 mph. In Manchester, a tree was downed on Harrison Street. In Hudson, a tree was downed on Lawrence Road, bringing down wires that sparked a fire that damaged a house. In Merrimack, a tree was downed, taking down wires and closing Amherst Road from Meetinghouse Road to Riverside Drive. In Nashua, a tree was downed onto a house on Broad Street, near the Hollils line.
No structural damage was found. Numerous trees were downed, blocking roads.', _c50='CSV')
The column names are in the form of _c followed by a number, presumably because you did not specify header=True while reading the input file. You can do
df = spark.read.csv('filepath', header=True)
so that the column names will be BEGIN_YEARMONTH, BEGIN_DAY, etc. instead of _c0, _c1, ..., and then your withColumn code should work.
You can also consider adding inferSchema=True to ensure that the data types are suitable.
Of course, you can also stick with your current code, and do
df2 = df.withColumn('_c49', lower(col('_c49')))
But that's not a good long-term solution. Column names should be sensible, and you also don't want the header to be one of the rows in your dataframe.
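Putting the suggestions together, a minimal sketch might look like the following; the file path and the SparkSession variable are placeholders, not taken from the original post:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lower

spark = SparkSession.builder.getOrCreate()

# Read the CSV with a header row so the real column names are used,
# and let Spark infer the column types.
df = spark.read.csv('filepath', header=True, inferSchema=True)

# Now the column can be referenced by its real name instead of _c49.
df = df.withColumn('EVENT_NARRATIVE', lower(col('EVENT_NARRATIVE')))
df.select('EVENT_NARRATIVE').show(5, truncate=False)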

Pandas: Remove all words from specific list within dataframe strings in large dataset

So I have three pandas dataframes (train_original, train_augmented, test). Overall it is about 700k rows. I would like to remove all cities from a list of cities, common_cities, but tqdm in the notebook cell suggests that it would take about 24 hrs to do the replacement for a list of 33,000 cities.
dataframe example (train_original):
id  name_1                             name_2
0   sun blinds decoration paris inc.   indl de cuautitlan sa cv
1   eih ltd. dongguan wei shi          plastic new york product co., ltd.
2   jsh ltd. (hk) mexico city          arab shipbuilding seoul and repair yard madrid c
common_cities list example:
common_cities = ['moscow', 'madrid', 'san francisco', 'mexico city']
What the output is supposed to be:
id  name_1                       name_2
0   sun blinds decoration inc.   indl de sa cv
1   eih ltd. wei shi             plastic product co., ltd.
2   jsh ltd. (hk)                arab shipbuilding and repair yard c
My solution worked well when the list of filter words was small, but when the list is large the performance is poor.
%%time
for city in tqdm(common_cities):
    train_original.replace(re.compile(fr'\b({city})\b'), '', inplace=True)
    train_augmented.replace(re.compile(fr'\b({city})\b'), '', inplace=True)
    test.replace(re.compile(fr'\b({city})\b'), '', inplace=True)
P.S.: I presume it's not great to split the strings and substitute city names with a list comprehension, because a city name can be more than one word.
Any suggestions, ideas on approach to make a quick replacement on Pandas Dataframes in such situations?
Instead of iterating over the huge dataframes once per city, remember that pandas replace accepts dictionaries with all the replacements to be done in a single go.
Therefore we can start by creating the dictionary and then using it with replace:
replacements = {x:'' for x in common_cities}
train_original = train_original.replace(replacements)
train_augmented = train_augmented.replace(replacements)
test = test.replace(replacements)
Edit: reading the documentation, it might be even easier, because replace also accepts lists of values to be replaced:
train_original = train_original.replace(common_cities,'')
train_augmented = train_augmented.replace(common_cities,'')
test = test.replace(common_cities,'')
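One caveat to add (my note, not from the original answer): by default replace matches entire cell values, while the question needs city names stripped out of longer strings. For that, the regex form of replace is likely what you want; a minimal self-contained sketch:
import pandas as pd

common_cities = ['moscow', 'madrid', 'san francisco', 'mexico city']

# One word-bounded regex per city, all mapped to '' and applied in a single pass.
replacements = {fr'\b{city}\b': '' for city in common_cities}

df = pd.DataFrame({'name_1': ['jsh ltd. (hk) mexico city'],
                   'name_2': ['arab shipbuilding seoul and repair yard madrid c']})
df = df.replace(replacements, regex=True)
print(df)  # the city names are removed from inside the strings

The same replacements dictionary could then be applied to train_original, train_augmented and test.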

Pandas : Create a new dataframe from 2 different dataframes using fuzzy matching [duplicate]

I have two data frames, each with a different number of rows. Below are a couple of rows from each data set.
df1 =
Company City State ZIP
FREDDIE LEES AMERICAN GOURMET SAUCE St. Louis MO 63101
CITYARCHRIVER 2015 FOUNDATION St. Louis MO 63102
GLAXOSMITHKLINE CONSUMER HEALTHCARE St. Louis MO 63102
LACKEY SHEET METAL St. Louis MO 63102
and
df2 =
FDA Company FDA City FDA State FDA ZIP
LACKEY SHEET METAL St. Louis MO 63102
PRIMUS STERILIZER COMPANY LLC Great Bend KS 67530
HELGET GAS PRODUCTS INC Omaha NE 68127
ORTHOQUEST LLC La Vista NE 68128
I joined them side by side using combined_data = pandas.concat([df1, df2], axis = 1). My next goal is to compare each string under df1['Company'] to each string under df2['FDA Company'] using several different matching functions from the fuzzywuzzy module, and return the value of the best match and its name. I want to store that in a new column. For example, if I ran fuzz.ratio and fuzz.token_sort_ratio on LACKEY SHEET METAL in df1['Company'] against df2['FDA Company'], it would return that the best match was LACKEY SHEET METAL with a score of 100, and this would then be saved under a new column in combined_data. The result would look like
combined_data =
Company City State ZIP FDA Company FDA City FDA State FDA ZIP fuzzy.token_sort_ratio match fuzzy.ratio match
FREDDIE LEES AMERICAN GOURMET SAUCE St. Louis MO 63101 LACKEY SHEET METAL St. Louis MO 63102 LACKEY SHEET METAL 100 LACKEY SHEET METAL 100
CITYARCHRIVER 2015 FOUNDATION St. Louis MO 63102 PRIMUS STERILIZER COMPANY LLC Great Bend KS 67530
GLAXOSMITHKLINE CONSUMER HEALTHCARE St. Louis MO 63102 HELGET GAS PRODUCTS INC Omaha NE 68127
LACKEY SHEET METAL St. Louis MO 63102 ORTHOQUEST LLC La Vista NE 68128
I tried doing
combined_data['name_ratio'] = combined_data.apply(lambda x: fuzz.ratio(x['Company'], x['FDA Company']), axis = 1)
But got an error because the lengths of the columns are different.
I am stumped. How can I accomplish this?
I couldn't tell what you were doing. This is how I would do it.
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
Create a series of tuples to compare:
compare = pd.MultiIndex.from_product([df1['Company'],
                                      df2['FDA Company']]).to_series()
Create a special function to calculate fuzzy metrics and return a series.
def metrics(tup):
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])
Apply metrics to the compare series
compare.apply(metrics)
There are a bunch of ways to do this next part:
Get closest matches to each row of df1
compare.apply(metrics).unstack().idxmax().unstack(0)
Get closest matches to each row of df2
compare.apply(metrics).unstack(0).idxmax().unstack(0)
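Another common pattern (my addition, not part of the original answer) is to use fuzzywuzzy's process.extractOne to find the best match for each row of df1 and store the match and its score as new columns, using the df1 and df2 from the question:
import pandas as pd
from fuzzywuzzy import fuzz, process

choices = df2['FDA Company'].tolist()

def best_match(name):
    # extractOne returns the best-matching choice and its score.
    match, score = process.extractOne(name, choices, scorer=fuzz.token_sort_ratio)
    return pd.Series({'best_fda_match': match, 'token_sort_score': score})

matches = df1['Company'].apply(best_match)
combined_data = pd.concat([df1, matches], axis=1)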

How to convert text article with keywords to pandas data frame

I have text files similar to the one below, about 5,000 of them, and I want to extract the article text into one df column and the keywords, as a list, into another df column. I need this to have more training data.
In the sample below, the article I want to extract is everything from 'Addis Abeba' to 'private bank', and the keywords are all the keywords after 'SUBJECT', without the percentages in brackets.
Sample of the dataset:
Addis Fortune
February 2011
Declaration? AU Action Needed in Favour of Democracy [opinion]
LENGTH: 692 words
Addis Abeba has been hosting delegates and heads of state for the AU Summit. It
is encouraging to see leaders of Africa discussing issues of continental
importance that accelerate the process of integration and thereby put Africa in
a better bargaining position in its relations with the outside world.
Indeed, "United We Stand, Divided We Fall."
It is time that the AU took a bold step to ensure that the leaders of the
continent win the hearts and minds of their citizens. It should ensure the
existence of democratic governments, which, at a minimum, guarantee popular
participation based on an acceptance of political equality among all citizens,
respect for civil liberties, and meaningful checks and balances on the power of
the executive.
This is also indispensable to the realisation of the age-old dream of the
formation of the United States of Africa. Donor countries and organisations also
have moral obligations to extend much needed support in this aspect.
Dawit Haile is a loan officer at a private bank.
SUBJECT: HEADS OF STATE & GOVERNMENT (90%); ELECTIONS (90%); INTERNATIONAL
ASSISTANCE (89%); INTERNATIONAL RELATIONS (73%); GROSS DOMESTIC PRODUCT (70%);
ECONOMIC NEWS (70%); EMBEZZLEMENT (68%); ELECTION FRAUD (68%) Ethiopia;
International Organizations and Africa
GEOGRAPHIC: AFRICA (96%); EGYPT (93%); UNITED STATES (93%); CHINA (92%);
ETHIOPIA (79%); TUNISIA (79%); ISRAEL (79%) Africa
LOAD-DATE: February 8, 2011
LANGUAGE: ENGLISH
PUBLICATION-TYPE: Newspaper
Copyright 2011 AllAfrica Global Media.
All Rights Reserved
2 of 1352 DOCUMENTS
Addis Fortune
February 2011
Gebrekidan Beyene's Prosecutors Repeat Request for 25 Years
BYLINE: Eden Sahle
LENGTH: 815 words
During the appeals hearing last week of Gebrekidan Beyene, a.k.a. Morocco,
general manager and a shareholder of a private limited company by the same name,
prosecutors of the Ethiopian Revenues and Customs Authority (ERCA) requested
almost the same sentence they originally had, in August 2010: a maximum jail
term and confiscation of properties.
However, the lower court's decision to mitigate the sentence was correct and the
Appeals Bench should release Gebrekidan, either as a free man or on parole, the
defence argued. His good behaviour in prison and the investment he had made in
his country should be counted as mitigating circumstances, the lawyer claimed,
also counting the defendant's poor health in mitigation. The case was adjourned
for a verdict until May 2, 2011.
An alleged similar offence involving money laundering and loan sharking against
Ayalew Tesema, board chairman and major shareholder of Ayat Real Estate, is
underway at the Federal High Court.
SUBJECT: LITIGATION (91%); JUSTICE DEPARTMENTS (90%); BANKING & FINANCE (90%);
EXCISE & CUSTOMS (90%); LIMITED LIABILITY COMPANIES (90%); SENTENCING (90%);
APPEALS (89%); LAW COURTS & TRIBUNALS (89%); JAIL SENTENCING (89%); LAWYERS
(89%); VERDICTS (89%); SUPREME COURTS (89%); FINES & PENALTIES (89%);
SETTLEMENTS & DECISIONS (78%); CRIMINAL CONVICTIONS (78%); DECISIONS & RULINGS
(78%); PRISONS (77%); SUITS & CLAIMS (77%); VALUE ADDED TAX (77%); JUDGES (73%);
INCOME TAX (72%); MONEY LAUNDERING (69%); COUNTERFEITING (68%); INTEREST RATES
(55%); ECONOMIC NEWS (55%) Ethiopia; Legal and Judicial Affairs
GEOGRAPHIC: MOROCCO (90%)
LOAD-DATE: March 1, 2011
LANGUAGE: ENGLISH
PUBLICATION-TYPE: Newspaper
My expected result would be:
df
content keywords
1 'string article 1' [HEADS OF STATE & GOVERNMENT, ELECTIONS, ...]
2 'string article 2' [LITIGATION, JUSTICE DEPARTMENTS, ...]
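No answer was posted in this thread, but as a rough illustration of the kind of parsing involved, here is one possible sketch. The file name is a placeholder, the markers (LENGTH, SUBJECT, GEOGRAPHIC, the 'N of M DOCUMENTS' separator) are taken from the sample above, and the regexes would almost certainly need tuning on the full corpus:
import re
import pandas as pd

def parse_document(doc):
    # Article body: the text between the 'LENGTH: ... words' line and 'SUBJECT:'.
    body = re.search(r'LENGTH:\s*\d+\s*words\s*(.*?)\s*SUBJECT:', doc, re.S)
    content = ' '.join(body.group(1).split()) if body else ''

    # Keywords: the SUBJECT block up to 'GEOGRAPHIC:', with '(90%)'-style
    # percentages removed and the remaining entries split on ';'.
    subj = re.search(r'SUBJECT:\s*(.*?)\s*GEOGRAPHIC:', doc, re.S)
    keywords = []
    if subj:
        cleaned = re.sub(r'\(\d+%\)', '', subj.group(1))
        keywords = [k.strip() for k in cleaned.split(';') if k.strip()]
    return content, keywords

with open('articles.txt', encoding='utf-8') as f:
    raw = f.read()

# Split the raw text into individual documents on the 'N of M DOCUMENTS' lines.
docs = re.split(r'\n\s*\d+ of \d+ DOCUMENTS\s*\n', raw)
df = pd.DataFrame([parse_document(d) for d in docs], columns=['content', 'keywords'])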

Pandas:reshape table and plotting different times series in one plot

I'm new to pandas and Python, and I'm working with a 311 dataset. The output I'm trying to get is a plot with 5 time series, one for each NYC borough, where each point in the plot represents the total number of complaints for each "created date" in that period of time. My data is as follows:
Agency Name    Complaint Type    Borough
Created Date
2013-08-30 23:58:55 New York City Police Department Noise - Vehicle BROOKLYN
2013-08-30 23:58:28 New York City Police Department Noise - Vehicle QUEENS
2013-08-30 23:57:46 New York City Police Department Noise - Street/Sidewalk MANHATTAN
2013-08-30 23:55:07 New York City Police Department Noise - Street/Sidewalk QUEENS
2013-08-30 23:55:06 New York City Police Department Noise - Commercial MANHATTAN
X = created date, Y = total number of complaints.
My code so far (after looking at some Stack Overflow questions and libraries):
df=pd.read_csv(sys.argv[1], parse_dates=True)
df.set_index("Created Date", inplace=True)
df2=df[["Borough","Complaint Type"]]
df3=df2.groupby("Complaint Type").count()
df3.plot()
plt.show()
http://imgur.com/D9jrYLf
I made some changes, but it still doesn't work:
df=pd.read_csv(sys.argv[1], parse_dates=True)
df.set_index("Created Date", inplace=True)
df2=df[["Borough","Complaint Type"]]
df3=df[df2.groupby("Complaint Type")].count()
df3.plot()
I'd really appreciate any help. :)
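No answer is recorded for this question, but a common pattern for this kind of plot (a sketch, assuming the CSV really has 'Created Date' and 'Borough' columns as shown above) is to count complaints per date and borough, then unstack the boroughs into columns so that plot() draws one line per borough:
import sys
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv(sys.argv[1], parse_dates=['Created Date'])

# One count per (day, borough), then pivot boroughs into columns
# so each borough becomes its own time series.
counts = (df.groupby([pd.Grouper(key='Created Date', freq='D'), 'Borough'])
            .size()
            .unstack('Borough', fill_value=0))

counts.plot()
plt.xlabel('Created Date')
plt.ylabel('Total number of complaints')
plt.show()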
