Scraping free proxy list with pandas - python

So I'm using pandas and requests to scrape IP's from https://free-proxy-list.net/ but how do I cover this code
import pandas as pd
resp = requests.get('https://free-proxy-list.net/')
df = pd.read_html(resp.text)[0]
df = (df[(df['Anonymity'] == 'elite proxy')])
print(df.to_string(index=False))
so that the output is list of IP's without anything else. I managed to remove index and only added elite proxy but I can't make a variable that is a list with only IP's and without index.

You can use loc to slice directly the column for the matching rows, and to_list to convert to list:
df.loc[df['Anonymity'].eq('elite proxy'), 'IP Address'].to_list()
output: ['134.119.xxx.xxx', '173.249.xxx.xxx'...]

To get the contents of the 'IP Address' column, subset to the 'IP address' column and use .to_list().
Here's how:
print(df['IP Address'].to_list())

It looks like you are trying to accomplish something like below:
print(df['IP Address'].to_string(index=False))
Also It would be a good idea, after filtering your dataframe to reset its index like below:
df = df.reset_index(drop=True)
So the code snippet would be something like this:
import pandas as pd
resp = requests.get('https://free-proxy-list.net/')
df = pd.read_html(resp.text)[0]
df = (df[(df['Anonymity'] == 'elite proxy')])
df = df.reset_index(drop=True)
print(df['IP Address'].to_string(index=False))

Related

Normalize json column and join with rest of dataframe

This is my first question here on stackoverflow so please don't roast me.
I was trying to find similar problems on the internet and actually there are several, but for me the solutions didn't work.
I have created this dataframe:
import pandas as pd
from ast import literal_eval
d = {'order_id': [1], 'email': ["hi#test.com"], 'line_items': ["[{'sku':'testproduct1', 'quantity':'2'},{'sku':'testproduct2','quantity':'2'}]"]}
orders = pd.DataFrame(data=d)
It looks like this:
order_id email line_items
1 hi#test.com [{'sku':'testproduct1', 'quantity':'2'},{'sku':'testproduct2','quantity':'2'}]
I want the dataframe to look like this:
order_id email line_items.sku line_items.quantity
1 hi#test.com testproduct1 2
1 hi#test.com testproduct2 2
I used the following code to change the type of line_items from string to dict:
orders.line_items = orders.line_items.apply(literal_eval)
Normally I would use json_normalize now to flatten the line_items column. But I also want to keep the id and don't know how to do that. I also want to avoid any loops.
Is there anyone who can help me with this issue?
Kind regards
joant95
If your dictionary really is that strange, then you could try:
d['line_items'] = eval(d['line_items'][0])
df = pd.json_normalize(d, record_path=['line_items'], meta=['order_id', 'email'])
To create d out of orders you could try:
d = orders.to_dict(orient='list')
Or you could try:
orders.line_items = orders.line_items.map(eval)
d = orders.to_dict(orient='records')
df = pd.json_normalize(d, record_path=['line_items'], meta=['order_id', 'email'])
But: I still don't have a clear picture of the situation :)

Pandas Python probelm

import pandas as pd
nba = pd.read_csv("nba.csv")
names = pd.Series(nba['Name'])
data = nba['Salary']
nba_series = (data, index=[names])
print(nba_series)
Hello I am trying to convert the columns 'Name' and 'Salary' into a series from a dataframe. I need to set the names as the index and the salaries as the values but i cannot figure it out. this is my best attempt so far anyone guidance is appreciated
I think you are over-thinking this. Simply construct it with pd.Series(). Note the data needs to be with .values, otherwis eyou'll get Nans
import pandas as pd
nba = pd.read_csv("nba.csv")
nba_series = pd.Series(data=nba['Salary'].values, index=nba['Name'])
Maybe try set_index?
nba.set_index('name', inlace = True )
nba_series = nba['Salary']
This might help you
import pandas as pd
nba = pd.read_csv("nba.csv")
names = nba['Name']
#It's automatically a series
data = nba['Salary']
#Set names as index of series
data.index = nba_series
data.index = names might be correct but depends on the data

How to read this JSON file in Python?

I'm trying to read such a JSON file in Python, to save only two of the values of each response part:
{
"responseHeader":{
"status":0,
"time":2,
"params":{
"q":"query",
"rows":"2",
"wt":"json"}},
"response":{"results":2,"start":0,"docs":[
{
"name":["Peter"],
"country":["England"],
"age":["23"]},
{
"name":["Harry"],
"country":["Wales"],
"age":["30"]}]
}}
For example, I want to put the name and the age in a table. I already tried it this way (based on this topic), but it's not working for me.
import json
import pandas as pd
file = open("myfile.json")
data = json.loads(file)
columns = [dct['name', 'age'] for dct in data['response']]
df = pd.DataFrame(data['response'], columns=columns)
print(df)
I also have seen more solutions of reading a JSON file, but that all were solutions of a JSON file with no other header values at the top, like responseHeader in this case. I don't know how to handle that. Anyone who can help me out?
import json
with open("myfile.json") as f:
columns = [(dic["name"],dic["age"]) for dic in json.load(f)["response"]["docs"]]
print(columns)
result:
[(['Peter'], ['23']), (['Harry'], ['30'])]
You can pass the list data["response"]["docs"] to pandas directly as it's a recordset.
df = pd.DataFrame(data["response"]["docs"])`
print(df)
>>> name country age
0 [Peter] [England] [23]
1 [Harry] [Wales] [30]
The data in you DatFrame will be bracketed though as you can see. If you want to remove the brackets you can consider the following:
for column in df.columns:
df.loc[:, column] = df.loc[:, column].str.get(0)
if column == 'age':
df.loc[:, column] = df.loc[:, column].astype(int)
sample = {"responseHeader":{
"status":0,
"time":2,
"params":{
"q":"query",
"rows":"2",
"wt":"json"}},
"response":{"results":2,"start":0,"docs":[
{
"name":["Peter"],
"country":["England"],
"age":["23"]},
{
"name":["Harry"],
"country":["Wales"],
"age":["30"]}]
}}
data = [(x['name'][0], x['age'][0]) for x in
sample['response']['docs']]
df = pd.DataFrame(names, columns=['name',
'age'])

How to iterate over a CSV file with Pywikibot

I wanted to try uploading a series of items to test.wikidata, creating the item and then adding a statement of inception P571. The csv file sometimes has a date value, sometimes not. When no date value is given, I want to write out a placeholder 'some value'.
Imagine a dataframe like this:
df = {'Object': [1, 2,3], 'Date': [250,,300]}
However, I am not sure using Pywikibot how to iterate over a csv file with pywikibot to create an item for each row and add a statement. Here is the code I wrote:
import pywikibot
import pandas as pd
site = pywikibot.Site("test", "wikidata")
repo = site.data_repository()
df = pd.read_csv('experiment.csv')
item = pywikibot.ItemPage(repo)
for item in df:
date = df['date']
prop_date = pywikibot.Claim(repo, u'P571')
if date=='':
prop_date.setSnakType('somevalue')
else:
target = pywikibot.WbTime(year=date)
prop_date.setTarget(target)
item.addClaim(prop_date)
When I run this through PAWS, I get the message: KeyError: 'date'
But I think the real issue here is that I am not sure how to get Pywikibot to iterate over each row of the dataframe and create a new claim for each new date value. I would value any feedback or suggestions for good examples and documentation. Many thanks!
Looking back on this, the solution was to use .iterrows() or .itertuples() or .loc[] to access the values in the row.
So
for item in df.itertuples():
prop_date = pywikibot.Claim(repo, u'P571')
if item.date=='':
prop_date.setSnakType('somevalue')
else:
target = pywikibot.WbTime(year=date)
prop_date.setTarget(target)
item.addClaim(prop_date)

How to filter census data (From API) with a created list of zipcodes?

I pull the data from the census api using the census wrapper, i would like to filter that data out with a list of zips i compiled.
So i am trying to filter the data from a pull request data of the census. I Have a csv file of the zip i want to use and i have it put into a list already. I have tried a few things such as putting the census in a data frame and trying to filter the zipcode column by my list but i dont think my syntax is correct.
this is just the test data i pulled,
census_data = c.acs5.get(('NAME', 'B25034_010E'),
{'for': 'zip code tabulation area:*'})
census_pd = census_pd.rename(columns={"NAME": "Name", "zip code tabulation area": "Zipcode"})
censusfilter = census_pd['Zipcode'==ziplst]
so i tried this way, and also i tried a for loop where i take census_pd['Zipcode'] and a inner for loop to iterate over the list with a if statement like zip1 == zip2 append to a list.
my dependencys
# Dependencies
import pandas as pd
import requests
import json
import pprint
import numpy as np
import matplotlib.pyplot as plt
import requests
from census import Census
import gmaps
from us import states
# Census & gmaps API Keys
from config import (api_key, gkey)
c = Census(api_key, year=2013)
# Configure gmaps
gmaps.configure(api_key=gkey)
as mentioned i want to filter out whatever data i may pull from the census data specific to the zipcodes i use
It's not clear how your data looks like. I am guessing that you have a scalar column and you want to filter that column using a list. If it is the question then you can use isin built in method to filter the dataframe.
import pandas as pd
data = {'col': [2, 3, 4], 'col2': [1, 2, 3], 'col3': ["asd", "ads", "asdf"]}
df = pd.DataFrame.from_dict(data)
random_list = ["asd", "ads"]
df_filtered = df[df["col3"].isin(random_list)]
The sample data isn't very clear, so below is how to filter a dataframe on a column using a list of values to filter by
import pandas as pd
from io import StringIO
# Example data
df = pd.read_csv(StringIO(
'''zip,some_column
"01234",A1
"01234",A2
"01235",A3
"01236",B1
'''), dtype = {"zip": str})
zips_list = ["01234", "01235"]
# using a join
zips_df = pd.DataFrame({"zip": zips_list})
df1 = df.merge(zips_df, how='inner', on='zip')
print(df1)
# using query
df2 = df.query('zip in #zips_list')
print(df2)
# using an index
df.set_index("zip", inplace=True)
df3=df.loc[zips_list]
print(df3)
Output in all cases:
zip some_column
0 01234 A1
1 01234 A2
2 01235 A3

Categories

Resources