Manipulate data in dictionary-column from TSV - python

I have a TSV file where one of the columns is in a dictionary-like format.
Example of the headers and one row (note the string quotes around the Preferences column):
Name, Age, Preferences
Nick, 18, "[{"Hobby":"Football", "Food":"Pizza", "FavoriteNumber":"72"}]"
To read the file into Python:
import pandas as pd
df = pd.read_csv('search_data_assessment.tsv', delimiter='\t')
To strip the quotes at the beginning and end of "Preferences", I used ast.literal_eval:
import ast
df["Preferences"] = ast.literal_eval(df["Preferences"])
This raises "ValueError: malformed node or string: 0", even though it seemed like the right tool for the job.
The question: how can I check all rows, look for "FavoriteNumber" in Preferences, and if it == 72, change it to 100 (arbitrary example)?

You can use pd.Series.apply with a custom function. Just note that this borders on abuse of Pandas: Pandas isn't designed to hold lists of dictionaries in a Series, so you are effectively running a Python-level loop in a particularly inefficient way.
import pandas as pd
from ast import literal_eval

df = pd.DataFrame([['Nick', 18, '[{"Hobby":"Football", "Food":"Pizza", "FavoriteNumber":"72"}]']],
                  columns=['Name', 'Age', 'Preferences'])

def updater(x):
    # x is the parsed list of dicts for one row
    if x[0]['FavoriteNumber'] == '72':
        x[0]['FavoriteNumber'] = '100'
    return x

df['Preferences'] = df['Preferences'].apply(literal_eval)  # parse each string into a list of dicts
df['Preferences'] = df['Preferences'].apply(updater)

print(df['Preferences'].iloc[0])
# [{'Hobby': 'Football', 'Food': 'Pizza', 'FavoriteNumber': '100'}]
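
Since the embedded strings happen to be valid JSON, json.loads would work for the parsing step as well; a minimal sketch, reusing the question's column and data:
import json

import pandas as pd

df = pd.DataFrame([['Nick', 18, '[{"Hobby":"Football", "Food":"Pizza", "FavoriteNumber":"72"}]']],
                  columns=['Name', 'Age', 'Preferences'])

# each cell is a JSON array of objects, so json.loads parses it directly
df['Preferences'] = df['Preferences'].apply(json.loads)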

Related

Implement a "where" method to select column_name and value from string in Python

I've got a little issue while coding a script that takes a CSV string and is supposed to select a column name and value based on the input. The CSV string contains names of NBA players, their universities, etc. When the input is "name" && "Andre Brown", it should search for those values in the given CSV string. I have rough code laid out, but I am unsure how to implement the where method. Any ideas?
import csv
import io

import pandas as pd

class MySelectQuery:
    def __init__(self, table, columns, where):
        self.table = table
        self.columns = columns
        self.where = where

    def __str__(self):
        return f"SELECT {self.columns} FROM {self.table} WHERE {self.where}"

csvString = "name,year_start,year_end,position,height,weight,birth_date,college\nAlaa Abdelnaby,1991,1995,F-C,6-10,240,'June 24, 1968',Duke University\nZaid Abdul-Aziz,1969,1978,C-F,6-9,235,'April 7, 1946',Iowa State University\nKareem Abdul-Jabbar,1970,1989,C,7-2,225,'April 16, 1947','University of California, Los Angeles\nMahmoud Abdul-Rauf,1991,2001,G,6-1,162,'March 9, 1969',Louisiana State University\n"
df = pd.read_csv(io.StringIO(csvString), error_bad_lines=False)
where = "name = 'Alaa Abdelnaby' AND year_start = 1991"
df = df.query(where)
print(df)
The CSV string is transformed into a pandas DataFrame, which should then find the values based on the input; however, I get the error "name 'where' not defined". I believe everything up to the df = ... part is correct, and now I need help implementing the where method. (I've seen one other solution on SO but wasn't able to understand or figure it out.)
# importing pandas
import pandas as pd

record = {
    'Name': ['Ankit', 'Amit', 'Aishwarya', 'Priyanka', 'Priya', 'Shaurya'],
    'Age': [21, 19, 20, 18, 17, 21],
    'Stream': ['Math', 'Commerce', 'Science', 'Math', 'Math', 'Science'],
    'Percentage': [88, 92, 95, 70, 65, 78]}

# create a dataframe
dataframe = pd.DataFrame(record, columns=['Name', 'Age', 'Stream', 'Percentage'])
print("Given Dataframe :\n", dataframe)

options = ['Math', 'Science']

# selecting rows based on condition
rslt_df = dataframe[(dataframe['Age'] == 21) &
                    dataframe['Stream'].isin(options)]
print('\nResult dataframe :\n', rslt_df)
Output:
Result dataframe :
      Name  Age   Stream  Percentage
0    Ankit   21     Math          88
5  Shaurya   21  Science          78
Source: https://www.geeksforgeeks.org/selecting-rows-in-pandas-dataframe-based-on-conditions/
Sometimes Googling does the trick ;)
You need the double = there, and df.query expects Python-style operators, so use and (or &) rather than SQL's AND:
where = "name == 'Alaa Abdelnaby' and year_start == 1991"

Create a nested dict containing list from a file

For example, for the txt file of
Math, Calculus, 5
Math, Vector, 3
Language, English, 4
Language, Spanish, 4
into the dictionary of:
data={'Math':{'name':['Calculus', 'Vector'], 'score':[5,3]}, 'Language':{'name':['English', 'Spanish'], 'score':[4,4]}}
I am having trouble with appending values to build the lists inside the inner dicts. I'm very new to this, so a solution without import statements would be easiest for me to follow. Thank you so much for all your help!
For each line, find the 3 values, then add them to a dict structure
from pathlib import Path

result = {}
for row in Path("test.txt").read_text().splitlines():
    subject_type, subject, score = row.split(", ")
    if subject_type not in result:
        result[subject_type] = {'name': [], 'score': []}
    result[subject_type]['name'].append(subject)
    result[subject_type]['score'].append(int(score))
You can simplify it with the use of a defaultdict that creates the mapping if the key isn't already present
from collections import defaultdict

result = defaultdict(lambda: {'name': [], 'score': []})
for row in Path("test.txt").read_text().splitlines():
    subject_type, subject, score = row.split(", ")
    result[subject_type]['name'].append(subject)
    result[subject_type]['score'].append(int(score))
With pandas.DataFrame you can read the formatted data directly and output it in the shape you want:
import pandas as pd
df = pd.read_csv("test.txt", sep=", ", engine="python", names=['key', 'name', 'score'])
df = df.groupby('key').agg(list)
result = df.to_dict(orient='index')
From your data:
data={'Math':{'name':['Calculus', 'Vector'], 'score':[5,3]},
'Language':{'name':['English', 'Spanish'], 'score':[4,4]}}
If you want to append to the list inside your dictionary, you can do:
data['Math']['name'].append('Algebra')
data['Math']['score'].append(4)
If you want to add a new dictionary, you can do:
data['Science'] = {'name':['Chemistry', 'Biology'], 'score':[2,3]}
I am not sure if that is what you wanted but I hope it helps!

Replace values from pandas dataset with dictionary

I am extracting a column from an Excel document with pandas. After that, for each row of the selected column, I want to replace every key contained in a list of dictionaries with its corresponding value.
import pandas as pd
file_loc = "excelFile.xlsx"
df = pd.read_excel(file_loc, usecols = "C")
In this case, the column I work with is df['Q10'], and this data frame has more than 10k rows.
Traditionally, if I want to replace a value in df, I use:
df['Q10'] = df['Q10'].str.replace('val1', 'val2')
Now, I have a dictionary of words like:
mydic = [
    {
        'key': "wasn't",
        'value': 'was not'
    },
    {
        'key': "I'm",
        'value': 'I am'
    },
    # ... tons of key/value pairs
]
Currently, I have created a function that iterates over mydic and replaces all occurrences one by one:
def replaceContractions(df, mydic):
    for cont in mydic:
        df.str.replace(cont['key'], cont['value'])
Next I call this function, passing my dataframe and mydic:
replaceContractions(df['Q10'], mydic)
First problem: this is very expensive, because mydic has a lot of items and the whole column is scanned once per item.
Second: it doesn't seem to work :(
Any ideas?
Convert your "dictionary" to a more friendly format:
m = {d['key'] : d['value'] for d in mydic}
m
{"I'm": 'I am', "wasn't": 'was not'}
Next, call replace with the regex switch and pass m to it.
df['Q10'] = df['Q10'].replace(m, regex=True)
replace accepts a dictionary of key-replacement pairs, and it should be much faster than iterating over one key-replacement pair at a time.
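
A runnable sketch of the whole pipeline (the column name and sample text are assumed for illustration); note that with regex=True the keys are treated as regular expressions, so re.escape keeps any special characters literal:
import re

import pandas as pd

df = pd.DataFrame({'Q10': ["I'm sure it wasn't him"]})
mydic = [{'key': "wasn't", 'value': 'was not'}, {'key': "I'm", 'value': 'I am'}]

# build {pattern: replacement}, escaping keys so they match literally
m = {re.escape(d['key']): d['value'] for d in mydic}
df['Q10'] = df['Q10'].replace(m, regex=True)
print(df['Q10'].iloc[0])  # I am sure it was not him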

Change order of list of lists according to another list

I have a bunch of CSV files where the first line holds the column names, and now I want to change the column order according to another list.
Example:
[
['date','index','name','position'],
['2003-02-04','23445','Steiner, James','98886'],
['2003-02-04','23446','Holm, Derek','2233'],
...
]
The above order differs slightly between the files, but the same column-names are always available.
So I want the columns to be re-arranged as:
['index','date','name','position']
I can solve it by comparing the first row, making an index for each column, then re-mapping each row into a new list of lists using a for loop.
And while it works, it feels so ugly even my blind old aunt would yell at me if she saw it.
Someone on IRC told me to look at map() and the operator module, but I'm just not experienced enough to puzzle those together. :/
Thanks.
Plain Python
You could use zip to transpose your data:
data = [
['date','index','name','position'],
['2003-02-04','23445','Steiner, James','98886'],
['2003-02-04','23446','Holm, Derek','2233']
]
columns = list(zip(*data))
print(columns)
# [('date', '2003-02-04', '2003-02-04'), ('index', '23445', '23446'), ('name', 'Steiner, James', 'Holm, Derek'), ('position', '98886', '2233')]
It becomes much easier to modify the columns order now.
To calculate the needed permutation (from new position to old position), you can use:
old = data[0]
new = ['index','date','name','position']
mapping = {i: old.index(v) for i, v in enumerate(new)}
# {0: 1, 1: 0, 2: 2, 3: 3}
You can apply the permutation to the columns:
columns = [columns[mapping[i]] for i in range(len(columns))]
# [('index', '23445', '23446'), ('date', '2003-02-04', '2003-02-04'), ('name', 'Steiner, James', 'Holm, Derek'), ('position', '98886', '2233')]
and transpose them back:
list(zip(*columns))
# [('index', 'date', 'name', 'position'), ('23445', '2003-02-04', 'Steiner, James', '98886'), ('23446', '2003-02-04', 'Holm, Derek', '2233')]
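Putting the three steps together as a small helper (a sketch using the data and names from above):
def reorder(data, new_header):
    # permutation from new position to old position, as above
    old = data[0]
    mapping = {i: old.index(v) for i, v in enumerate(new_header)}
    # transpose, permute the columns, transpose back
    columns = list(zip(*data))
    columns = [columns[mapping[i]] for i in range(len(columns))]
    return [list(row) for row in zip(*columns)]

reorder(data, ['index', 'date', 'name', 'position'])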
With Pandas
For this kind of task, you should use pandas.
It can parse CSV files, reorder columns, sort them and keep an index.
If you have already imported the data, you can use these methods: build the DataFrame from the rows, use the first row as the header, and set the index column as the index.
import pandas as pd
df = pd.DataFrame(data[1:], columns=data[0]).set_index('index')
df then becomes:
date name position
index
23445 2003-02-04 Steiner, James 98886
23446 2003-02-04 Holm, Derek 2233
You can avoid those steps by importing the CSV directly with pandas.read_csv. Note that usecols=['index','date','name','position'] only selects the columns; read_csv keeps the file's own column order, so reorder explicitly afterwards.
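A minimal sketch, assuming the data lives in a file named data.csv:
import pandas as pd

df = pd.read_csv('data.csv', usecols=['index', 'date', 'name', 'position'])
df = df[['index', 'date', 'name', 'position']]  # usecols does not reorder, so reindex explicitly
df = df.set_index('index')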
Simple and stupid:
LIST = [
    ['date', 'index', 'name', 'position'],
    ['2003-02-04', '23445', 'Steiner, James', '98886'],
    ['2003-02-04', '23446', 'Holm, Derek', '2233'],
]
NEW_HEADER = ['index', 'date', 'name', 'position']

def swap(lists, new_header):
    mapping = {}
    for lst in lists:
        # build the old-position -> new-position mapping from the header row
        if not mapping:
            mapping = {
                old_pos: new_pos
                for new_pos, new_field in enumerate(new_header)
                for old_pos, old_field in enumerate(lst)
                if new_field == old_field}
        yield [item for _, item in sorted(
            [(mapping[index], item) for index, item in enumerate(lst)])]

if __name__ == '__main__':
    print(LIST)
    print(list(swap(LIST, NEW_HEADER)))
To rearrange your data, you can use a dictionary:
import csv
s = [
['date','index','name','position'],
['2003-02-04','23445','Steiner, James','98886'],
['2003-02-04','23446','Holm, Derek','2233'],
]
new_data = [{a:b for a, b in zip(s[0], i)} for i in s[1:]]
final_data = [[b[c] for c in ['index','date','name','position']] for b in new_data]
with open('filename.csv', 'w', newline='') as f:
    write = csv.writer(f)
    write.writerows(final_data)
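Note that final_data holds only the data rows; if the output file should start with the re-ordered header, write it first:
write.writerow(['index', 'date', 'name', 'position'])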

Filter Pandas DataFrames Using Dynamic URL Query String

Currently I am stuck on a question in python pandas: I want to filter a dataframe dynamically, using a URL query string.
For example, given a CSV with the columns Name, Age and Gender, and the URL:
http://example.com/filter?Name=Sam&Age=21&Gender=male
Hardcoded:
filtered_data = data[
    (data['Name'] == 'Sam') &
    (data['Age'] == 21) &
    (data['Gender'] == 'male')
]
I don't want to hard-code the filter keys like this, because the CSV file can change at any time and come with different column headers.
Any suggestions?
The easiest way to create this filter dynamically is probably to use np.all.
For example:
import numpy as np
query = {'Name': 'Sam', 'Age': 21, 'Gender': 'male'}
filters = [data[k] == v for k, v in query.items()]
filter_data = data[np.all(filters, axis=0)]
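For completeness, a small runnable example of this approach (the DataFrame contents are made up):
import numpy as np
import pandas as pd

data = pd.DataFrame({'Name': ['Sam', 'Ben'],
                     'Age': [21, 30],
                     'Gender': ['male', 'male']})
query = {'Name': 'Sam', 'Age': 21, 'Gender': 'male'}

# one boolean Series per condition; np.all ANDs them row-wise
filters = [data[k] == v for k, v in query.items()]
print(data[np.all(filters, axis=0)])  # only Sam's row matches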
Use df.query. For example:
df = pd.read_csv(url)
conditions = "Name == 'Sam' and Age == 21 and Gender == 'Male'"
filtered_data = df.query(conditions)
You can build the conditions string dynamically using string formatting; the !r conversion quotes string values so the query parses:
conditions = " and ".join("{} == {!r}".format(col, val)
                          for col, val in zip(df.columns, values))
Typically, your web framework will return the arguments in a dict-like structure. Let's say your args are like this:
args = {
    'Name': ['Sam'],
    'Age': ['21'],  # note that Age is a string
    'Gender': ['male']
}
You can filter your dataset successively like this:
for key, values in args.items():
    data = data[data[key].isin(values)]
However, this is likely not to match any data for Age, which pandas may have loaded as an integer. In that case, you could load all CSV columns as strings via pd.read_csv(filename, dtype=object), or convert to string before comparison:
for key, values in args.items():
    data = data[data[key].astype(str).isin(values)]
Incidentally, this will also match multiple values. For example, take the URL http://example.com/filter?Name=Sam&Name=Ben&Age=21&Gender=male -- which leads to the structure:
args = {
    'Name': ['Sam', 'Ben'],  # there are 2 names
    'Age': ['21'],
    'Gender': ['male']
}
In this case, both Ben and Sam will be matched, since we're using .isin to match.
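
If you aren't using a web framework, the standard library's urllib.parse produces exactly this dict-of-lists structure from a query string:
from urllib.parse import urlparse, parse_qs

url = "http://example.com/filter?Name=Sam&Name=Ben&Age=21&Gender=male"
args = parse_qs(urlparse(url).query)
print(args)
# {'Name': ['Sam', 'Ben'], 'Age': ['21'], 'Gender': ['male']}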
