vaex apply does not work when using dataframe columns - python

I am trying to tokenize the first sentence of Wikipedia articles in order to find 'is a' patterns; n-grams of the tokens and the left-over text would be the next step. "Wellington is a town in the UK." becomes "town is a attr_root in the country." Then I find common patterns using n-grams.
For this I need to replace string values in a string column using other string columns in the dataframe. In Pandas I can do this using
df['Test'] = df.apply(lambda x: x['Name'].replace(x['Rep'], x['Sub']), axis=1)
but I cannot find the equivalent vaex method. This issue led me to believe that this should be possible in vaex based on Maarten Breddels' example code; however, when trying it I get the error below.
import pandas as pd
import vaex
df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Rep": ["Braund", "Henry", "Miss."],
        "Sub": ["<surname>", "<name>", "<title>"],
    }
)
dfv = vaex.from_pandas(df)

def func(x, y, z):
    return x.replace(y, z)

dfv['Test'] = dfv.apply(func, arguments=[df.Name.astype('str'), df.Rep.astype('str'), df.Sub.astype('str')])
Gives
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\User\AppData\Roaming\Python\Python37\site-packages\vaex\dataframe.py", line 455, in apply
    arguments = _ensure_strings_from_expressions(arguments)
  File "C:\Users\User\AppData\Roaming\Python\Python37\site-packages\vaex\utils.py", line 780, in _ensure_strings_from_expressions
    return [_ensure_strings_from_expressions(k) for k in expressions]
  File "C:\Users\User\AppData\Roaming\Python\Python37\site-packages\vaex\utils.py", line 780, in <listcomp>
    return [_ensure_strings_from_expressions(k) for k in expressions]
  File "C:\Users\User\AppData\Roaming\Python\Python37\site-packages\vaex\utils.py", line 782, in _ensure_strings_from_expressions
    return _ensure_string_from_expression(expressions)
  File "C:\Users\User\AppData\Roaming\Python\Python37\site-packages\vaex\utils.py", line 775, in _ensure_string_from_expression
    raise ValueError('%r is not of string or Expression type, but %r' % (expression, type(expression)))
ValueError: 0     Braund, Mr. Owen Harris
1    Allen, Mr. William Henry
2    Bonnell, Miss. Elizabeth
Name: Name, dtype: object is not of string or Expression type, but <class 'pandas.core.series.Series'>
How can I accomplish this in vaex?

Turns out I had a bug: I needed to pass the dfv columns to apply instead of the df (pandas) columns.
I also got this faster method from the nice people at vaex.
import pyarrow as pa
import pandas as pd
import vaex

df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Rep": ["Braund", "Henry", "Miss."],
        "Sub": ["<surname>", "<name>", "<title>"],
    }
)
dfv = vaex.from_pandas(df)

@vaex.register_function()
def replacer(x, y, z):
    res = []
    for i, j, k in zip(x.tolist(), y.tolist(), z.tolist()):
        res.append(i.replace(j, k))
    return pa.array(res)

dfv['Test'] = dfv.func.replacer(dfv['Name'], dfv['Rep'], dfv['Sub'])
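Stripped of the vaex and pyarrow wrapping, the core of the registered replacer is just an element-wise replace over three parallel columns. A minimal plain-Python sketch of that logic:

```python
# Element-wise replace over three parallel columns: the same logic
# the registered replacer applies to each chunk (stdlib-only sketch).
names = ["Braund, Mr. Owen Harris", "Allen, Mr. William Henry", "Bonnell, Miss. Elizabeth"]
reps = ["Braund", "Henry", "Miss."]
subs = ["<surname>", "<name>", "<title>"]

result = [name.replace(rep, sub) for name, rep, sub in zip(names, reps, subs)]
print(result)
```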

Related

Compare dynamic number of strings inside a dictionary, storing the best match as a new key

I have the following dictionary, representing a certain ID along with an address. I'm trying to use the Jaro distance algorithm to get the distance between them all (compare the first with all the others, the second with all the others except the first, and so on).
address_dict = [
    {'SiteID': 123, 'Address': '350- Maxwell Rd'},
    {'SiteID': 124, 'Address': '350 Maxwell Rd Ste 500'},
    {'SiteID': 125, 'Address': '350 Maxwell Road'},
    {'SiteID': 126, 'Address': '350 Maxwell Road 500'}
]
What I plan to have is a dictionary that looks like the one below. SiteID 124 has the greatest length and verbosity, so I may use its address as the official one, instead of the address in each of the IDs we have.
address_dict = [
    {'SiteID': 123, 'Address': '350- Maxwell Rd', 'reference_id': 124},
    {'SiteID': 124, 'Address': '350 Maxwell Rd Ste 500', 'reference_id': 124},
    {'SiteID': 125, 'Address': '350 Maxwell Road', 'reference_id': 124},
    {'SiteID': 126, 'Address': '350 Maxwell Road 500', 'reference_id': 124}
]
What it says is: "considering that all the records are similar (depending on the threshold), I'll keep for all those IDs the record with the greatest amount of information - or length".
The way I compare those two strings is pretty simple, actually: jellyfish.jaro_distance(str_1, str_2).
So far, I was trying to build something like this, but it is incomplete. I could not figure out how to make this logic work, but I think it's worth posting what I have so far, so no one has to write the full code from scratch.
counter = 0
for item in address_dict:
    ## Can't figure out how to loop over record one with two, three and four
    similarity = jellyfish.jaro_distance(item['Address'], address_dict[])
    ## Get the record with the greater length
    ## Find the similarity and map it to the reference ID
    if similarity > 0.8:
        address_dict[counter]['reference_id'] = item['SiteID']
    counter += 1
I added some comments that I cannot figure out. Any ideas?
Here is one way to do it with the help of the SequenceMatcher class from the Python standard library's difflib module:
from difflib import SequenceMatcher

import pandas as pd

def similar(a, b):
    """Get similarity ratio between a and b.

    Args:
        a: value.
        b: other value.

    Returns:
        Similarity ratio.
    """
    return SequenceMatcher(None, a, b).ratio()

df = pd.DataFrame(
    [
        {"SiteID": 123, "Address": "350- Maxwell Rd"},
        {"SiteID": 124, "Address": "350 Maxwell Rd Ste 500"},
        {"SiteID": 125, "Address": "350 Maxwell Road"},
        {"SiteID": 126, "Address": "350 Maxwell Road 500"},
    ]
)
# Add ratio against the longest address as a new column
df = df.assign(Match=df["Address"].map(lambda x: similar(x, max(df["Address"], key=len))))
# Add reference_id if ratio >= 0.7
df["reference_id"] = df.apply(
    lambda x: df.loc[df["Match"] == 1, "SiteID"].iloc[0] if x["Match"] >= 0.7 else x["SiteID"],
    axis=1,
)
# Cleanup
df = df.drop(columns="Match")
new_address_dict = df.to_dict(orient="records")
print(new_address_dict)
# Output
[
    {"SiteID": 123, "Address": "350- Maxwell Rd", "reference_id": 124},
    {"SiteID": 124, "Address": "350 Maxwell Rd Ste 500", "reference_id": 124},
    {"SiteID": 125, "Address": "350 Maxwell Road", "reference_id": 124},
    {"SiteID": 126, "Address": "350 Maxwell Road 500", "reference_id": 124},
]
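To get a feel for the ratios SequenceMatcher produces on addresses like these, here is a quick standard-library check (the exact numbers are illustrative):

```python
from difflib import SequenceMatcher

# ratio() is 2*M / (len(a) + len(b)), where M counts matching characters.
a = "350 Maxwell Rd"
b = "350 Maxwell Road"
ratio = SequenceMatcher(None, a, b).ratio()
print(round(ratio, 3))  # comfortably above the 0.7 threshold used above
```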

creating Pandas DataFrame as a cross-product between family x city x member

Sorry if this may seem like a simple question, but I am new to Python.
I would like to create a DataFrame containing 10 values for family names, 10 values for city of birth and, for each family name-city of birth pair, 3 members of that family, each having as "name" a random string of up to 8 characters.
How can I create such a DataFrame?
I don't really know how to use the same family name-city of birth pair for more than one value of "member".
There are a few ways to go about this, but here's a simple one that's easy to follow (with 5 values instead of the required 10, but you get the idea):
import random
import string

import pandas as pd

cities = ["New York", "London", "Paris", "Beijing", "Casablanca"]
names = ["Smith", "Heston", "Dupont", "Torvalds", "Clooney"]

df = pd.DataFrame(
    [
        {
            "city": cities[i],
            "family_name": names[i],
            "first_name": "".join(random.choice(string.ascii_lowercase) for _ in range(8)),
        }
        for i in range(5)
        for _ in range(3)
    ]
)
print(df)
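An alternative that hews closer to the title's "cross-product" phrasing is to build the rows with itertools.product, so the 3-members-per-pair structure is explicit. This is a sketch using the same sample data as above, not the only way to do it:

```python
import itertools
import random
import string

import pandas as pd

cities = ["New York", "London", "Paris", "Beijing", "Casablanca"]
names = ["Smith", "Heston", "Dupont", "Torvalds", "Clooney"]

# product() pairs each (city, family) tuple with each of the 3 member slots,
# so every pair appears exactly three times.
df = pd.DataFrame(
    [
        {
            "city": city,
            "family_name": family,
            "first_name": "".join(random.choice(string.ascii_lowercase) for _ in range(8)),
        }
        for (city, family), _ in itertools.product(zip(cities, names), range(3))
    ]
)
print(df.shape)  # 5 pairs x 3 members = 15 rows
```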

Python - Look for multiple words in a single cell from excel and add corresponding data to JSON using pandas

I'm trying to work on a project related to analytics, and I'm trying to extract data out of an Excel file that looks something like this.
For some reason the image upload doesn't work, so please bear with me as I try to reproduce the Excel data:
User ID | Transcript
9001    | B: How are you?
        | U: Show credit balance
        | B: End
9002    | B: How are you?
        | U: Show bank statement
        | B: End
I want to loop through the entire "Transcript" column and capture certain strings such as "Sample balance", "Bank statement", and "End", keeping in mind that this column contains multiple lines per cell.
Now if I see the data I need, I have to push or append a certain value in a JSON payload that would look something like this:
{
"SampleBal": 1, "BankSt": 1, "End": 2
}
Here's what I have so far:
import re

import pandas as pd

df = pd.read_excel('export.xlsx')
new_df = df.loc[df['TRANSCRIPT'].str.contains('Bank statement', flags=re.I, regex=True)].reset_index(drop=True)
print(new_df)
I'm fairly new to learning Python and was just wondering what the next steps are to make this possible using pandas.
Any help/guide is very much appreciated.
I don't have too much time right now, but I will write an explanation tonight. Maybe the snippet already helps with figuring it out.
import pandas as pd

dummyData = [
    {"Column 1": "Line 1\nLine 2\nLine 3"},
    {"Column 1": "Line 1\nLine 2\nLine 3\nLine 4"}
]
df = pd.DataFrame.from_dict(dummyData)
print(df)

                         Column 1
0          Line 1\nLine 2\nLine 3
1  Line 1\nLine 2\nLine 3\nLine 4

searchWords = ["Line 1", "Line 2", "Line 4"]
wordCount = {}
for index, row in df.iterrows():
    lines = row["Column 1"].split("\n")
    for line in lines:
        if line in searchWords:
            wordCount[line] = wordCount.get(line, 0) + 1
print(wordCount)

{'Line 1': 2, 'Line 2': 2, 'Line 4': 1}
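The same counting can be written a bit more compactly with collections.Counter over the same dummy data; a sketch, not the only way:

```python
from collections import Counter

import pandas as pd

dummyData = [
    {"Column 1": "Line 1\nLine 2\nLine 3"},
    {"Column 1": "Line 1\nLine 2\nLine 3\nLine 4"},
]
df = pd.DataFrame.from_dict(dummyData)
searchWords = {"Line 1", "Line 2", "Line 4"}

# Flatten every cell into lines, keep only the search words, and count them
wordCount = Counter(
    line
    for cell in df["Column 1"]
    for line in cell.split("\n")
    if line in searchWords
)
print(dict(wordCount))  # {'Line 1': 2, 'Line 2': 2, 'Line 4': 1}
```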

python, x in list and x == list[28] deliver different results

I'm trying to find out whether some string is in a list. When using 'if string in list' I get False, but when I try 'if string == list[28]' I get True.
How come? The string is definitely in the list.
import pandas as pd
import numpy as np
import scipy.stats as stats
import re
nba_df=pd.read_csv("assets/nba.csv")
cities=pd.read_html("assets/wikipedia_data.html")[1]
cities=cities.iloc[:-1,[0,3,5,6,7,8]]
nba_df = nba_df[(nba_df['year'] == 2018)]
nba_df['team'] = nba_df['team'].apply(lambda x: x.split('*')[0])
nba_df['team'] = nba_df['team'].apply(lambda x: x.split('(')[0])
nba_df['team'] = nba_df['team'].str.strip()
cityList = cities['Metropolitan area'].str.strip()
actualCities = []
for idx, city in enumerate(nba_df['team']):
    if city == 'New Orleans Pelicans':
        print('string: ', city.split()[0] + ' ' + city.split()[1])
        print('cityList[28]: ', cityList[28])
        print('is string in list: ', (city.split()[0] + ' ' + city.split()[1]) in cityList)
        print('is string == list[28]: ', (city.split()[0] + ' ' + city.split()[1]) == cityList[28])
output:
string: New Orleans
cityList[28]: New Orleans
is string in list: False
is string == list[28]: True
It looks like your issue is related to membership testing with the in operator, particularly as it relates to pandas "containers" such as DataFrames and Series. Keep in mind when you say:
how come? the string is definitely in the list.
This is not quite accurate. Your cityList is a Series object, not a list. This creates some quirks we have to work around, since we cannot treat a Series the same as a list. In general Series behave a bit more like a dictionary rather than a list.
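A minimal illustration of that dictionary-like behavior: `in` on a Series tests the index labels, not the values, just as `in` on a dict tests its keys:

```python
import pandas as pd

s = pd.Series(["New York", "New Orleans"])

# Membership on the Series itself checks the index (labels 0 and 1 here)...
print("New Orleans" in s)         # False
print(1 in s)                     # True
# ...while the underlying values behave like a list
print("New Orleans" in s.values)  # True
```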
I've created a truncated test example for your code, using the setup here:
import pandas as pd
data = {
"Teams": [ "Boston Celtics", "Brooklyn Nets", "New York Knicks", "Philadelphia 76ers", "Toronto Raptors", "Chicago Bulls", "Cleveland Cavaliers", "Detroit Pistons", "Indiana Pacers", "Milwaukee Bucks", "Atlanta Hawks", "Charlotte Hornets", "Miami Heat", "Orlando Magic", "Washington Wizards", "Denver Nuggets", "Minnesota Timberwolves", "Oklahoma City Thunder", "Portland Trail Blazers", "Utah Jazz", "Golden State Warriors", "Los Angeles Clippers", "Los Angeles Lakers", "Phoenix Suns", "Sacramento Kings", "Houston Rockets", "Memphis Grizzlies", "San Antonio Spurs", "New Orleans Pelicans" ],
"Cities": [ "Boston", "Brooklyn", "New York", "Philadelphia", "Toronto", "Chicago", "Cleveland", "Detroit", "Indiana", "Milwaukee", "Atlanta", "Charlotte", "Miami", "Orlando", "Washington", "Denver", "Minnesota", "Oklahoma City", "Portland", "Utah", "Golden", "Los Angeles", "Los Angeles", "Phoenix", "Sacramento", "Houston", "Memphis", "San Antonio", "New Orleans" ]
}
nba_df = pd.DataFrame(data, columns = ['Teams', 'Cities'])
# doing this to mimic your code of storing the Series to cityList
cityList = nba_df['Cities'].str.strip()
print(cityList)
print(type(cityList))
Output:
0 Boston
1 Brooklyn
2 New York
...
28 New Orleans
<class 'pandas.core.series.Series'>
The key is to use cityList.values, rather than just cityList. However, I encourage you to read the Series.values documentation, as Pandas does not recommend using this property anymore (it looks like Series.array was added in 0.24, and they recommend using that instead). Both PandasArray and numpy.ndarray appear to behave a bit more like a list, at least in this example when it comes to membership test. Again, reading the Series.array documentation is highly encouraged.
Example from the terminal:
>>> cityList[28]
'New Orleans'
>>> 'New Orleans' in cityList
False
>>> 'New Orleans' in cityList.values
True
>>> 'New Orleans' in cityList.array
True
You could also just create a list from your cityList (which again, is a Series)
>>> list(cityList)
['Boston', 'Brooklyn', ..., 'New Orleans']
>>> 'New Orleans' in list(cityList)
True
Side Note
I would probably rename your cityList to citySeries or something similar, to make a note in your code that you are not dealing with a list, but a "special" container from the pandas library.
Alternatively, you could just create your cityList like so (note: I'm using your code now, not my example):
cityList = list(cities['Metropolitan area'].str.strip())
I did have to do a bit of research for this answer as I am by no means a pandas expert, so here are the three questions that helped me figure this out:
Indexing a pandas dataframe by integer
membership test in pandas data frame column
How does the "in" and "not in" statement work in python

Convert CSV into JSON. How do I keep values with the same Index?

I am using this database: https://cpj.org/data/killed/?status=Killed&motiveConfirmed%5B%5D=Confirmed&type%5B%5D=Journalist&localOrForeign%5B%5D=Foreign&start_year=1992&end_year=2019&group_by=year
I have preprocessed it into this csv (showing only 2 lines of 159):
year,combinedStatus,fullName,sortName,primaryNationality,secondaryNationality,tertiaryNationality,gender,photoUrl,photoCredit,type,lastStatus,typeOfDeath,status,employedAs,organizations,jobs,coverage,mediums,country,location,region,state,locality,province,localOrForeign,sourcesOfFire,motiveConfirmed,accountabilityCrossfire,accountabilityAssignment,impunityMurder,tortured,captive,threatened,charges,motive,lengthOfSentence,healthProblems,impCountry,entry,sentenceDate,sentence,locationImprisoned
1994,Confirmed,Abdelkader Hireche,,,,,Male,,,Journalist,,Murder,Killed,Staff,Algerian Television (ENTV),Broadcast Reporter,Politics,Television,Algeria,Algiers,,,Algiers,,Foreign,,Confirmed,,,Partial Impunity,No,No,No,,,,,,,,,
2014,Confirmed,Ahmed Hasan Ahmed,,,,,Male,,,Journalist,,Dangerous Assignment,Killed,Staff,Xinhua News Agency,"Camera Operator,Photographer","Human Rights,Politics,War",Internet,Syria,Damascus,,,Damascus,,Foreign,,Confirmed,,,,,,,,,,,,,,,
And I want to make this type of JSON out of it:
"Afghanistan": {"year": 2001, "fullName": "Volker Handloik", "gender": "Male", "typeOfDeath": "Crossfire", "employedAs": "Freelance", "organizations": "freelance reporter", "jobs": "Print Reporter", "coverage": "War", "mediums": "Print", "photoUrl": NaN}, "Somalia": {"year": 1994, "fullName": "Pierre Anceaux", "gender": "Male", "typeOfDeath": "Murder", "employedAs": "Freelance", "organizations": "freelance", "jobs": "Broadcast Reporter", "coverage": "Human Rights", "mediums": "Television", "photoUrl": NaN}
The problem is that Afghanistan (as you can see in the link) has had many journalist deaths. I want to list all these killings under the Index 'Afghanistan'. However, as I currently do it, only the last case (Volker Handloik) in the csv file shows up. How can I get it so every case shows up?
this is my code atm
import pandas as pd
import pprint as pp
import json

# list with stand-ins for empty cells
missing_values = ["n/a", "na", "unknown", "-", ""]
# set missing values to NaN
df = pd.read_csv("data_journalists.csv", na_values=missing_values, skipinitialspace=True, error_bad_lines=False)
# columns to keep
columns_keep = ['year', 'fullName', 'gender', 'typeOfDeath', 'employedAs', 'organizations', 'jobs', 'coverage', 'mediums', 'country', 'photoUrl']
small_df = df[columns_keep]
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    print(small_df)
# create dict with country-column as index
df_dict = small_df.set_index('country').T.to_dict('dict')
print(df_dict)
# make json file from the dict
with open('result.json', 'w') as fp:
    json.dump(df_dict, fp)
# use pretty print to see if dict matches the json example in the exercise
pp.pprint(df_dict)
I want to include all of these names (and more) in the JSON under the index Afghanistan.
I think I will need a list of objects attached to each country's index, so that every country can show all the cases of journalists' deaths instead of only one (each entry being replaced by the next one from the csv). I hope this is clear enough.
I'll keep your code up to the definition of small_df.
After that, we perform a groupby on the 'country' column and call to_json on each group:
country_series = small_df.groupby('country').apply(lambda r : r.drop(['country'], axis=1).to_json())
country_series is a pd.Series with the countries as index.
After that, we create a nested dictionary, so that we have a valid json object:
fullDict = {}
for ind, a in country_series.items():
    b = json.loads(a)
    c = b['fullName']
    smallDict = {}
    for index, journalist in c.items():
        smallDict[journalist] = {}
        for i in b.keys():
            smallDict[journalist][i] = b[i][index]
    fullDict[ind] = smallDict
The naming in my part of the code is pretty bad, but I tried to write all the steps explicitly so that things should be clear.
Finally, we write the results to a file:
with open('result.json', 'w') as f:
    json.dump(fullDict, f)
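If the goal is explicitly "a list of objects per country", one can also skip the transpose and the json round-trip and build the nested structure directly with groupby and to_dict(orient="records"). This is a sketch over a made-up stand-in for small_df ("Jane Doe" is a hypothetical entry), not the asker's real CSV:

```python
import json

import pandas as pd

# Made-up stand-in for small_df; the real data comes from the CSV.
small_df = pd.DataFrame(
    [
        {"country": "Afghanistan", "fullName": "Volker Handloik", "year": 2001},
        {"country": "Afghanistan", "fullName": "Jane Doe", "year": 2001},
        {"country": "Somalia", "fullName": "Pierre Anceaux", "year": 1994},
    ]
)

# Each country maps to a list of record dicts, so nothing gets overwritten.
fullDict = {
    country: group.drop(columns="country").to_dict(orient="records")
    for country, group in small_df.groupby("country")
}
print(json.dumps(fullDict, indent=2))
```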
