I have a data frame created with pandas. One of the columns in the data frame holds URLs, which I would like to match against a pattern and count the number of occurrences.
My logic is that if the match does not return None, then at this stage print('Match'); however, that does not appear to work. Here is a sample of my data and my current code. I would appreciate any tips on how to match a value using pandas, as I have just come back from using a lot of R and don't have much experience with pandas and data frames in Python.
Title,URL,Date,Unique Pageviews
Preparing and Starting DS career,http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:242750,20-Jan-15,163
The Rogue Data Scientist,http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:273425,4-May-15,1108
Is it safe to code after one bottle of wine?,http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:349416,9-Nov-15,1736
Short-Term Forecasting of Electricity Demand,http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:350421,12-Nov-15,1117
Visual directory of 339 tools. Wow!,http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:373786,14-Jan-16,4228
8 Types of Data,http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:377008,23-Jan-16,2829
Very funny video for people who write code,http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:379578,30-Jan-16,2444
Code block (PEP 8 requires two blank lines between functions):

def count_set_words(as_pandas):
    reg_exp = re.match('\b/forum', as_pandas['URL']).any()
    if as_pandas['URL'].str.match(reg_exp, case=False, flags=0, na=np.NAN).any():
        print("Match")


def set_new_columns(as_pandas):
    titles_list = ['Year > 2014', 'Forum', 'Blog', 'Python', 'R',
                   'Machine_Learning', 'Data_Science', 'Data', 'Analytics']
    for number, word in enumerate(titles_list):
        as_pandas.insert(len(as_pandas.columns), titles_list[number], 0)


def open_as_dataframe(file_name_in):
    reader = pd.read_csv(file_name_in, encoding='windows-1251')
    return reader


def main():
    multi_sets = open_as_dataframe('HDT_data5.txt')
    set_new_columns
    count_set_words(multi_sets)


main()
reg_exp in the first line of count_set_words is not a regexp; it tries to check whether the elements in the URL column match '\b/forum'. I think something like:

df = pd.read_csv(file_name_in, encoding='windows-1251')
for ix, row in df.iterrows():
    if re.search(r'/forum', row['URL']) is not None:
        print('this is a match')

would solve your problem (note re.search rather than re.match, since re.match only matches at the start of the string), or even simpler:

df['is_a_match'] = df['URL'].apply(lambda url: re.search(r'/forum', url) is not None)
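A more pandas-idiomatic sketch of the same check (assuming the file name HDT_data5.txt and the URL column from the question's sample data): str.contains runs the pattern over the whole column at once, and summing the boolean column gives the number of occurrences.

import pandas as pd

df = pd.read_csv('HDT_data5.txt', encoding='windows-1251')

# True where the URL contains '/forum', False otherwise; missing URLs count as no match.
df['is_a_match'] = df['URL'].str.contains('/forum', case=False, na=False)

print(df['is_a_match'].sum())  # number of matching rows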
Related
I am having a problem with my code and getting it to work. I'm not sure if I'm sorting this correctly. I am trying to sort without lambda, pandas, or itemgetter.
Here is my code that I am having issues with.
with open('ManufacturerList.csv', 'r') as man_list:
    ml = csv.reader(man_list, delimiter=',')
    for row in ml:
        manufacturerList.append(row)
        print(row)

with open('PriceList.csv', 'r') as price_list:
    pl = csv.reader(price_list, delimiter=',')
    for row in pl:
        priceList.append(row)
        print(row)

with open('ManufacturerList.csv', 'r') as service_list:
    sl = csv.reader(service_list, delimiter=',')
    for row in sl:
        serviceList.append(row)
        print(row)

new_mfl = (sorted(manufacturerList, key='None'))
new_prl = (sorted(priceList, key='None'))
new_sdl = (sorted(serviceList, key='None'))

for x in range(0, len(new_mfl)):
    new_mfl[x].append(priceList[x][1])

for x in range(0, len(new_mfl)):
    new_mfl[x].append(serviceList[x][1])

new_list = new_mfl

inventoryList = (sorted(list, key=1))
I have tried to use a def function to get it to work, but I don't know if I'm doing it right. This is what I tried:
def new_mfl(x):
    return x[0]

x.sort(key=new_mfl)
You can do it like this:
def manufacturer_key(x):
    return x[0]

sorted_mfl = sorted(manufacturerList, key=manufacturer_key)
The key argument is the function that extracts the field of the CSV row that you want to sort by. The same thing can be written with a lambda:
sorted_mfl = sorted(manufacturerList, key=lambda x: x[0])
There are different Dialects and Formatting Parameters that let you handle input and output of data from comma-separated value files. You may be able to do this with fewer statements by using the correct delimiter, which depends on the type of data you handle, together with built-in methods such as split for string data or other ways to sort and manipulate lists. For example, with delimiter=',' the data is split on commas, and iterating over csv.reader gives you each row as a list of values.
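As a minimal sketch of the two kinds of output shown below (the file name values.csv is hypothetical, and the numbers below are simply the original example values): iterating over csv.reader prints each row as a list of strings, while pandas loads the same file into a table with named columns.

import csv
import pandas as pd

# Print each row of a comma-separated file as a list of strings.
with open('values.csv', 'r', newline='') as f:
    for row in csv.reader(f, delimiter=','):
        print(row)

# Load the same file into a DataFrame with explicit column names.
df = pd.read_csv('values.csv', header=None,
                 names=['column1', 'column2', 'column3', 'column4', 'column5'])
print(df)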
['9.310788653967691', '4.065746465800029', '6.6363356879192965', '7.279020237137884', '4.010297786910394']
['9.896092029283933', '7.553018448286675', '0.3268282119829197', '2.348011394854333', '3.964531054345021']
['5.078622663277619', '4.542467725728741', '3.743648062104161', '12.761916277286993', '9.164698479088221']
# out:
column1 column2 column3 column4 column5
0 4.737897984379577 6.078414943611958 2.7021438955897095 5.8736388919905895 7.878958949784588
1 4.436982168483749 3.9453563399358544 12.66647791861843 5.323017508568736 4.156777982870004
2 4.798241413768279 12.690268531982028 9.638858110105895 7.881360524434767 4.2948334000783195
This works because I am using lists that contain single values. For columns or lists of the form sorted_mfl = {'First Name': ['name', 'name', 'name'], 'Second Name': [...], 'ID': [...]}, new_prl = ['example', 'example', 'example'], new_sdl = [...], the data would be combined with something like sorted_mfl + new_prl + new_sdl. Since different modules can also be used to read and manage comma-separated files, you should add more information to your question, such as the data types you use, or create a minimal reproducible example with pandas.
I'll just start from scratch since I feel like I'm lost with all the different possibilities. What I will be talking about is a leaderboard, but it could apply to price tracking as well.
My goal is to scrape data from a website (the all-time leaderboard / hidden), put it in a .csv file and update it daily at noon.
What I have succeeded in so far: scraping the data.
Tried scraping with BS4, but since the data is hidden I couldn't be specific enough to only get the all-time points. I still consider it a success because I'm able to get a table with all the data I need and the date as a header. My problems with this solution are 1) useless data populating the csv and 2) the table being vertical rather than horizontal.
Scraped the data with a CSS selector, but I abandoned this idea because sometimes the page wouldn't load and the data wasn't scraped. I then found out that there's a JSON file containing the data right away.
JSON scraping seems to be the best option, but I'm having trouble creating a csv file that's suitable for making a graph.
This brings me to what I'm struggling with: storing the data in a table that looks like this, where the grey area is the points and DATE1 is the moment the data was scraped:
I'd like not to manipulate the data in the csv file too much. If the table looked like what I picture above, it would be easier to make a graph afterwards, but I'm having trouble. The best I've managed is creating a table that looks like this AND is vertical rather than horizontal:
name,points,date
Dennis,52570,10-23-2020
Dinh,40930,10-23-2020
name,points,date
Dennis,52570,10-23-2020
Dinh,40930,10-23-2020
name,points,date
Dennis,52570,10-23-2020
Dinh,40930,10-23-2020
Thank you for your help.
Here's the code
import pandas as pd
import time
timestr = time.strftime("%Y-%m-%d %H:%M")
url_all_time = 'https://community.koodomobile.com/widget/pointsLeaderboard?period=allTime&maxResults=20&excludeRoles='
data = pd.read_json(url_all_time)
table = pd.DataFrame.from_records(data, index=['name'], columns=['points','name'])
table['date'] = pd.Timestamp.today().strftime('%m-%d-%Y')
table.to_csv('products.csv', index=True, encoding='utf-8')
If what I want is not possible, I might just scrape individually for each member, make one CSV file per member and make a graph that refers to those different files.
So, I've played around with your question a bit and here's what I came up with.
Basically, your best bet for data storage is a lightweight database, as suggested in the comments. However, with a bit of planning, a few hoops to jump through, and some hacky code, you could get away with a simple (sort of) JSON that eventually ends up as a .csv file that looks like this:
Note: the values are the same as I don't want to wait a day or two for the leader-board to actually update.
What I did was rearranging the data that came back from the request to the API and built a structure that looks like this:
"BobTheElectrician": {
"id": 7160010,
"rank": 14,
"score_data": {
"2020-10-24 18:45": 4187,
"2020-10-24 18:57": 4187,
"2020-10-24 19:06": 4187,
"2020-10-24 19:13": 4187
}
Every player is your main key and has, among others, a score_data value. This in turn is a dict that holds a points value for each time you run the script.
Now, the trick is to get this JSON to look like the .csv you want. The question is - how?
Well, since you intend to update all players' data (I just assumed that), they should all have the same number of entries for score_data.
The keys for score_data are your timestamps. Grab any player's score_data keys and you have the date headers, right?
Having said that, you can build your .csv rows the same way: grab player's name and all their point values from score_data. This should get you a list of lists, right? Right.
Then, when you have all this, you just dump that to a .csv file and there you have it!
Putting it all together:
import csv
import json
import os
import random
import time
from urllib.parse import urlencode

import requests

API_URL = "https://community.koodomobile.com/widget/pointsLeaderboard?"
LEADERBOARD_FILE = "leaderboard_data.json"


def get_leaderboard(period: str = "allTime", max_results: int = 20) -> list:
    payload = {"period": period, "maxResults": max_results}
    return requests.get(f"{API_URL}{urlencode(payload)}").json()


def dump_leaderboard_data(leaderboard_data: dict) -> None:
    with open("leaderboard_data.json", "w") as jf:
        json.dump(leaderboard_data, jf, indent=4, sort_keys=True)


def read_leaderboard_data(data_file: str) -> dict:
    with open(data_file) as f:
        return json.load(f)


def parse_leaderboard(leaderboard: list) -> dict:
    return {
        item["name"]: {
            "id": item["id"],
            "score_data": {
                time.strftime("%Y-%m-%d %H:%M"): item["points"],
            },
            "rank": item["leaderboardPosition"],
        } for item in leaderboard
    }


def update_leaderboard_data(target: dict, new_data: dict) -> dict:
    for player, stats in new_data.items():
        target[player]["rank"] = stats["rank"]
        target[player]["score_data"].update(stats["score_data"])
    return target


def leaderboard_to_csv(leaderboard: dict) -> None:
    data_rows = [
        [player] + list(stats["score_data"].values())
        for player, stats in leaderboard.items()
    ]
    random_player = random.choice(list(leaderboard.keys()))
    dates = list(leaderboard[random_player]["score_data"])
    with open("the_data.csv", "w") as output:
        w = csv.writer(output)
        w.writerow([""] + dates)
        w.writerows(data_rows)


def script_runner():
    if os.path.isfile(LEADERBOARD_FILE):
        fresh_data = update_leaderboard_data(
            target=read_leaderboard_data(LEADERBOARD_FILE),
            new_data=parse_leaderboard(get_leaderboard()),
        )
        leaderboard_to_csv(fresh_data)
        dump_leaderboard_data(fresh_data)
    else:
        dump_leaderboard_data(parse_leaderboard(get_leaderboard()))


if __name__ == "__main__":
    script_runner()
The script also checks if you have a JSON file that pretends to be a proper database. If not, it'll write the leader-board data. Next time you run the script, it'll update the JSON and spit out a fresh .csv file.
Hope this answer will nudge you in the right direction.
Hey, since you are loading it into a pandas DataFrame, the operations are fairly simple. I ran your code first:
import pandas as pd
import time
timestr = time.strftime("%Y-%m-%d %H:%M")
url_all_time = 'https://community.koodomobile.com/widget/pointsLeaderboard?period=allTime&maxResults=20&excludeRoles='
data = pd.read_json(url_all_time)
table = pd.DataFrame.from_records(data, index=['name'], columns=['points','name'])
table['date'] = pd.Timestamp.today().strftime('%m-%d-%Y')
Then I added a few more lines of code to modify the DataFrame table to your needs.
idxs = table['date'].index
for i, val in enumerate(idxs):
    table.at[val, table['date'][i]] = table['points'][i]

table = table.drop(['date', 'points'], axis=1)
In the above snippet I am using the DataFrame's ability to assign values by index. First I get the index values of the date column, then I go through each of them, adding a column for the required date (the value from the date column) and filling in the corresponding points according to the indexes pulled earlier.
This gives me the following output:
name 10-24-2020
Dennis 52570.0
Dinh 40930.0
Sophia 26053.0
Mayumi 25300.0
Goran 24689.0
Robert T 19843.0
Allan M 19768.0
Bernard Koodo 14404.0
nim4165 13629.0
Timo Tuokkola 11216.0
rikkster 7338.0
David AKU 5774.0
Ranjan Koodo 4506.0
BobTheElectrician 4170.0
Helen Koodo 3370.0
Mihaela Koodo 2764.0
Fred C 2542.0
Philosoraptor 2122.0
Paul Deschamps 1973.0
Emilia Koodo 1755.0
I can then save this to csv using the last line from your code. Similarly, you can pull data for more dates and format it to add it to the same DataFrame.
table.to_csv('products.csv', index=True, encoding='utf-8')
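As a sketch of that last step (only the file name products.csv and the URL are taken from the code above; the rest is illustrative and assumes one column per scrape date): a later run can read the existing CSV, build today's points as a new column, and join the two on the player name.

import pandas as pd

# Columns so far: one per date, indexed by player name.
previous = pd.read_csv('products.csv', index_col='name')

# Build today's points as a Series named after today's date.
url_all_time = 'https://community.koodomobile.com/widget/pointsLeaderboard?period=allTime&maxResults=20&excludeRoles='
today = pd.read_json(url_all_time).set_index('name')['points']
today = today.rename(pd.Timestamp.today().strftime('%m-%d-%Y'))

# Align players by name and keep anyone seen on either day.
combined = previous.join(today, how='outer')
combined.to_csv('products.csv', encoding='utf-8')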
I have a DataFrame which consists of lists of lists in two separate columns.
import pandas as pd
data = pd.DataFrame()
data["Website"] = [["google.com", "amazon.com"], ["google.com"], ["aol.com", "no website"]]
data["App"] = [["Ok Google", "Alexa"], ["Ok Google"], ["AOL App", "Generic Device"]]
That's how the DataFrame looks.
I need to replace certain strings in the first column (here: "no website") with the corresponding string in the second column (here: "Generic Device"). The replacement string has the same index in its list as the string that needs to be replaced.
What did not work so far:
I tried several forms of str.replace(x,y) for lists and DataFrames and nothing worked. A simple replace(x,y) does not work as I need to replace several different strings. I think I can't get my head around the indexing thing.
I already googled and stackoverflowed for two hours and haven't found a solution yet.
Many thanks in advance! Sorry for bad engrish or noob mistakes, I am still learning.
-Max
Define a replacement function and use apply to run it row by row:
import pandas as pd


def replacements(websites, apps):
    """Substitute items in websites that are found in replace_items."""
    replace_items = ["no website", ]  # can add to this list of keys that trigger replacement
    for i, k in enumerate(websites):
        # Check each item in websites for replacement
        if k in replace_items:
            # This is an item to be replaced
            websites[i] = apps[i]  # replace with corresponding item in apps
    return websites


# Create DataFrame
websites = [["google.com", "amazon.com"], ["google.com"], ["aol.com", "no website"]]
app = [["Ok Google", "Alexa"], ["Ok Google"], ["AOL App", "Generic Device"]]
data = list(zip(websites, app))
df = pd.DataFrame(data, columns=['Websites', 'App'])

# Perform replacement
df['Websites'] = df.apply(lambda row: replacements(row['Websites'], row['App']), axis=1)
print(df)
print(df)
Output
Websites App
0 [google.com, amazon.com] [Ok Google, Alexa]
1 [google.com] [Ok Google]
2 [aol.com, Generic Device] [AOL App, Generic Device]
Try this. You can define the replaceable values in an array and execute:
def f(x, items):
    for rep in items:
        if rep in list(x.Website):
            x.Website[list(x.Website).index(rep)] = list(x.App)[list(x.Website).index(rep)]
    return x


items = ["no website"]
data = data.apply(lambda x: f(x, items), axis=1)
Output:
Website App
0 [google.com, amazon.com] [Ok Google, Alexa]
1 [google.com] [Ok Google]
2 [aol.com, Generic Device] [AOL App, Generic Device]
First of all, Happy Holidays!
I wasn't really sure what your expected output was and I'm not really sure what you have tried previously, but I think that this may work:
data["Website"] = data["Website"].replace("no website", "Generic Device")
I really hope this helps!
You can create a function like this:
def f(replaced_value, col1, col2):
    def r(s):
        while replaced_value in s[col1]:
            s[col1][s[col1].index(replaced_value)] = s[col2][s[col1].index(replaced_value)]
        return s
    return r
and use apply:
df=df.apply(f("no website","Website","App"), axis=1)
print(df)
I'm working on a data frame taken from Adafruit IO, and sadly some of my data is from a time when my project malfunctioned, so some of the values are just equal to NaN.
I tried to remove them with these lines of code:
onlyValidData=temp_data.mask(temp_data['value'] =='NaN')
onlyValidData
This is data retrieved from an Adafruit IO feed, being analyzed with pandas. I tried using the 'where' function too, but it didn't work.
My entire code is:
import pandas as pd
temp_data = pd.read_json('https://io.adafruit.com/api/(...)')
light_data = pd.read_json('https://io.adafruit.com/api/(...)')
temp_data['created_at'] = pd.to_datetime(temp_data['created_at'], infer_datetime_format=True)
temp_data = temp_data.set_index('created_at')
light_data['created_at'] = pd.to_datetime(light_data['created_at'], infer_datetime_format=True)
light_data = light_data.set_index('created_at')
tempVals = pd.Series(temp_data['value'])
lightVals = pd.Series(light_data['value'])
onlyValidData=temp_data.mask(temp_data['value'] =='NaN')
onlyValidData
The output is all of my data for some reason, but it should be only the valid values.
Hey, I think the issue here is that you're looking for values equal to the string 'NaN', while actual NaN values aren't strings, or more specifically aren't anything.
Try using:
onlyValidData = temp_data.mask(temp_data['value'].isnull())
Edit: to remove rows rather than marking all values in that row as NaN:
onlyValidData = temp_data.dropna()
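If the feed delivers the readings as strings, a small additional sketch (only the column name value is taken from the question's code; the rest is an assumption): coercing the column to numbers turns unparseable entries into real NaN values, after which dropna on that column removes only the affected rows.

import pandas as pd

# Unparseable readings become NaN, then only those rows are dropped.
temp_data['value'] = pd.to_numeric(temp_data['value'], errors='coerce')
onlyValidData = temp_data.dropna(subset=['value'])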
I'm trying to read and analyse data from a molecular dynamics simulation, which looks like this but has approximately 50,000 lines:
40 443.217134221125 -1167.16960983145 -930.540717277902 -945.149746592058 14.6090293141563 -76510.1177229871 4955.17798368798 17.0485096390963 17.0485096390963 17.0485096390963
80 659.39103652059 -923.638916369481 -963.088128935875 -984.822539088925 21.7344101530497 14390.2520385682 4392.18167603894 16.3767140226773 16.3767140226773 16.3767140226773
120 410.282687399253 -979.413482414461 -978.270613122515 -991.794079036891 13.5234659143754 -416.30808174241 4398.37322990079 16.3844056974088 16.3844056974088 16.3844056974088
The second column represents temperature. I want the entire contents of the file inside a list, containing sub-lists that group every line by its temperature. So, for example, the first list in the main list would have every line where the temperature is 50 +/- 25 K, the second list every line where the temperature is 100 +/- 25 K, the third 150 +/- 25 K, etc.
Here's the code I have so far:
for nbligne in tqdm(range(0, len(LogFullText), 1), unit=" lignes", disable=False):
    string = LogFullText[nbligne]
    line = string.replace('\n', '')
    Values = line.split(' ')
    divider = float(Values[1])
    number = int(round(divider/ecart, 0))
    if number > 0 and number < (nbpts+1):
        numericValues = []
        for nbresultat in range(0, len(Values)-1, 1):
            numericValues = numericValues + [float(Values[nbresultat+1])]
        TotalResultats[number-1].append(numericValues)
The entire document is stored in the list LogFullText. For each line I remove the trailing \n and split the data using line.split(' '). The variable number then tells me in which "section" of the main list, TotalResultats, the line of data has to be stored; ecart has a value of 50 in my example.
From my testing in IDLE this should work, but in reality the list numericValues gets appended to every section of TotalResultats, which makes the entire "sorting" process pointless, as I simply end up with nbpts copies of the same list.
EDIT: A desired output would be, for example, for TotalResultats[0] to contain only these lines:
440 49.9911561170447 -1002.727121613 -1002.72088094757 -1004.36865629012 1.64777534254374 -2.30045369926927 4346.38067015602 16.319590369315 16.319590369315 16.319590369315
480 42.0678318129411 -1002.69068695093 -1003.09270361295 -1004.47931559314 1.38661198019398 148.219667654185 4345.58826561836 16.3185985476593 16.3185985476593 16.3185985476593
520 43.0855216044083 -1003.4761833678 -1003.33820025832 -1004.75835665467 1.42015639634654 -50.877194096845 4345.23364199522 16.3181546401367 16.3181546401367 16.3181546401367
Whereas TotalResultats[1] would contain these:
29480 109.504432929553 -980.560226069922 -998.958927113452 -1002.5683396275 3.6094125140473 6797.60091557441 4336.52501942717 16.3072458525354 16.3072458525354 16.3072458525354
29520 106.663291994583 -987.853629557979 -998.63436605413 -1002.15013076443 3.51576471029626 3975.43407740646 4344.84444478408 16.3176674266037 16.3176674266037 16.3176674266037
29560 112.712019757891 -1020.65735849343 -998.342638324154 -1002.05777718853 3.71513886437272 -8172.25412368794 4374.81748831773 16.3551041162317 16.3551041162317 16.3551041162317
And TotalResultats[2] would be:
52480 142.86322849701 -983.254970494784 -995.977110177167 -1000.68607319299 4.70896301582636 4687.60299340191 4348.30194824999 16.321994657312 16.321994657312 16.321994657312
52520 159.953459288754 -984.221801201968 -995.711657311665 -1000.9839371836 5.27227987193358 4233.04866428826 4348.82254074761 16.3226460049712 16.3226460049712 16.3226460049712
52560 161.624843851124 -1011.76969126636 -995.320907086768 -1000.64827802848 5.32737094170867 -6023.57133443538 4375.12133631739 16.3554827492176 16.3554827492176 16.3554827492176
In the first case,
TotalResultats[0][0] = [49.9911561170447, -1002.727121613, -1002.72088094757, -1004.36865629012, 1.64777534254374, -2.30045369926927, 4346.38067015602, 16.319590369315, 16.319590369315, 16.319590369315]
If it helps, I'm coding this in Visual Studio, using Python 3.6.8.
Thanks a whole lot!
I recommend using pandas. It's a very powerful tool for handling tabular data in Python; it's like Excel or SQL inside Python. Suppose 1.csv contains the data you provided in the question. Then you can easily load the data, filter it, and save the results:
import pandas as pd
# load data from file into pandas dataframe
df = pd.read_csv('1.csv', header=None, delimiter=' ')
# filter by temperature, column named 0 since there is no header in the file
df2 = df[df[0].between(450, 550)]
# save filtered rows in the same format
df2.to_csv('2.csv', header=None, index=False, sep=' ')
Pandas may be harder to learn than plain Python syntax, but it is well worth it.
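To get the per-temperature grouping described in the question (one group for 50 +/- 25 K, one for 100 +/- 25 K, and so on), a sketch along the same lines; the file name 1.csv and the 50 K bin width ecart are carried over from the answer and the question, the rest is an assumption:

import pandas as pd

ecart = 50  # bin width in K, as in the question

df = pd.read_csv('1.csv', header=None, delimiter=' ')

# Round each line's temperature (second column, index 1) to the nearest multiple of 50 K,
# so 50 +/- 25 K falls in bin 50, 100 +/- 25 K in bin 100, and so on.
bins = (df[1] / ecart).round().astype(int) * ecart

# One sub-DataFrame per temperature bin, in the spirit of TotalResultats.
TotalResultats = [group for _, group in df.groupby(bins)]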