I have a dataframe that consists of lines that look like:
"{'displayName':'MartinscroftTramStop','locationIdentifier':'STATION^15306','normalisedSearchTerm':'MARTINSCROFTTRAMSTOP'}"
How do I split this into columns? I've tried str.slice with start and stop positions. I suspect the problem is all the quotes, but finding and replacing them doesn't seem to work either.
You can handle the first problem, the string object, with the eval('...') function. It evaluates the string as Python code, so it returns the dict itself.
For the second problem, the dict structure, you have multiple choices. Here is one solution:
import pandas as pd

# Evaluate the string to get the dict back
dict_data = eval("{'displayName':'MartinscroftTramStop','locationIdentifier':'STATION^15306','normalisedSearchTerm':'MARTINSCROFTTRAMSTOP'}")

# Organize the data
column_names = dict_data.keys()
data_list = [list(dict_data.values())]  # a row must be a list inside a list
pd.DataFrame(data_list, columns=column_names)
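Since the question is about a whole column of such strings, here is a minimal sketch of applying the same idea column-wise. It assumes the strings live in a column named 'raw' (a name chosen for illustration); ast.literal_eval is a safer stand-in for eval because it only evaluates Python literals.

import ast
import pandas as pd

df = pd.DataFrame({'raw': [
    "{'displayName':'MartinscroftTramStop','locationIdentifier':'STATION^15306','normalisedSearchTerm':'MARTINSCROFTTRAMSTOP'}",
]})

# Parse each string into a dict, then expand the dicts into columns
parsed = df['raw'].apply(ast.literal_eval)
expanded = pd.DataFrame(parsed.tolist())
print(expanded.columns.tolist())
# ['displayName', 'locationIdentifier', 'normalisedSearchTerm']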
I have JSON data at https://steamcommunity.com/id/RednelssGames/inventory/json/730/2 and need to get the names of all the items:
import requests

r = requests.get('https://steamcommunity.com/id/RednelssGames/inventory/json/730/2')
if r.json()['success'] == True:
    for rows in r.json()['rgDescriptions']:
        print(rows['market_hash_name'])

I'm getting the error: TypeError: string indices must be integers
Change the for-loop as follows:
for rows in r.json()['rgDescriptions'].values():
    print(rows['market_hash_name'])
By iterating over a dictionary like you did, you get the keys and not the values (rows). If you want to iterate over the values, you have to iterate over the return value of dict.values().
From the link you provided:
"rgDescriptions":{"4291220570_302028390":
rgDescriptions doesn't return an array, but an object (a dictionary, in this case): notice the opening curly brace ({) rather than a square bracket ([).
By using for rows in r.json()['rgDescriptions']: you end up iterating over the dictionary's keys. The first key of the dictionary seems to be "4291220570_302028390", which is a string.
So when you do print(rows['market_hash_name']), you're attempting to access the 'market_hash_name' "index" of your rows object, but rows is actually a string, so it doesn't work.
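Here is a tiny standalone illustration of the difference (the key and item name below are made up to mirror the shape of the Steam response):

# hypothetical, minimal stand-in for r.json()['rgDescriptions']
rg_descriptions = {
    '4291220570_302028390': {'market_hash_name': 'Example Item'},
}

for key in rg_descriptions:            # iterates over the keys (strings)
    print(key)                         # -> 4291220570_302028390

for item in rg_descriptions.values():  # iterates over the values (dicts)
    print(item['market_hash_name'])    # -> Example Item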
I am trying to make a new column depending on different criteria. I want to add characters to each string depending on its starting characters.
An example of the data:
RH~111~header~120~~~~~~~ball
RL~111~detailed~12~~~~~hat
RA~111~account~13~~~~~~~~~car
I want to change the rows starting with RH and RL, but not the ones starting with RA. So I want it to look like:
RH~111~header~120~~1~~~~~ball
RL~111~detailed~12~~cancel~~~hat
RA~111~account~13~~~~~~~~~car
I have attempted to use str.split, but it doesn't seem to actually split the string up:
(np.where(~df['1'].str.startswith('RH'),
df['1'].str.split('~').str[5],
df['1']))
This references the correct column, but it isn't splitting where I thought it would, and I can't seem to get further than this. I feel like I am not really going about this the right way.
Define a function to replace the element at position pos of the list arr:
def repl(arr, pos):
    arr[pos] = '1' if arr[0] == 'RH' else 'cancel'
    return '~'.join(arr)
Then perform the substitution:
df[0] = df[0].mask(df[0].str.match('^R[HL]'),
                   df[0].str.split('~').apply(repl, pos=5))
Details:
str.match ensures that only the matching rows are substituted.
df[0].str.split('~') splits the column of strings into a column of lists (resulting from splitting each string).
apply(repl, pos=5) computes the value to substitute.
I assumed that you have a DataFrame with a single column, so its column
name is 0 (an integer), instead of '1' (a string).
If this is not the case, change the column name in the code above.
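A quick end-to-end check on the sample rows (a minimal sketch; it reuses repl from above and assumes the one-column DataFrame described here):

import pandas as pd

df = pd.DataFrame(['RH~111~header~120~~~~~~~ball',
                   'RL~111~detailed~12~~~~~hat',
                   'RA~111~account~13~~~~~~~~~car'])

df[0] = df[0].mask(df[0].str.match('^R[HL]'),
                   df[0].str.split('~').apply(repl, pos=5))
print(df[0].tolist())
# ['RH~111~header~120~~1~~~~~ball',
#  'RL~111~detailed~12~~cancel~~~hat',
#  'RA~111~account~13~~~~~~~~~car']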
I am new to Python and am trying to achieve something new. I have a list defined with some string values, like
col_names = 'ABC,DEF,XYZ'.
If I want to extract and use values individually, how can I do that in Python?
Ex: I want to use ABC in one scenario but DEF in another, and so on.
Can I create the list as a dictionary, like below? Would that help?
col_names = {'ABC','DEF','XYZ'}
col_names is a string, not a list. You could use col_names.split(',') to separate each value.
FYI, your second definition of col_names is a set, not a dictionary.
To use values from a list, you reference each value by its index.
For example, in the list ls = ['ABC','DEF','XYZ'], ls[2] would be equal to 'XYZ'.
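Putting both points together, a minimal sketch (the names parts and by_role are just for illustration):

col_names = 'ABC,DEF,XYZ'        # a string, not a list
parts = col_names.split(',')     # ['ABC', 'DEF', 'XYZ']

print(parts[0])  # ABC
print(parts[1])  # DEF

# A dictionary maps keys to values (note the key: value pairs);
# {'ABC','DEF','XYZ'} is a set, which has no keys to look things up by.
by_role = {'first': 'ABC', 'second': 'DEF'}
print(by_role['first'])  # ABC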
Edit: I finally figured it out myself. I kept using select() on the column within the function; that's why it didn't work. I added my solution as comments within the original question, just in case it might be of use for somebody else.
I'm working on an online course where I'm supposed to write the following function:
# TODO: Replace <FILL IN> with appropriate code
# Note that you shouldn't use any RDD operations or need to create custom user defined functions (udfs) to accomplish this task
from pyspark.sql.functions import regexp_replace, trim, col, lower
def removePunctuation(column):
    """Removes punctuation, changes to lower case, and strips leading and trailing spaces.

    Note:
        Only spaces, letters, and numbers should be retained. Other characters should be
        eliminated (e.g. it's becomes its). Leading and trailing spaces should be removed
        after punctuation is removed.

    Args:
        column (Column): A Column containing a sentence.

    Returns:
        Column: A Column named 'sentence' with clean-up operations applied.
    """
    # EDIT: MY SOLUTION
    # column = lower(column)
    # column = regexp_replace(column, r'([^a-z\d\s])+', r'')
    # return trim(column).alias('sentence')
    return <FILL IN>
sentenceDF = sqlContext.createDataFrame([('Hi, you!',),
                                         (' No under_score!',),
                                         (' * Remove punctuation then spaces * ',)],
                                        ['sentence'])
sentenceDF.show(truncate=False)

(sentenceDF
 .select(removePunctuation(col('sentence')))
 .show(truncate=False))
I've written the code that gives me the required output for operations on the DataFrame itself:
# Lower case (avoid naming the result lower, which would shadow the imported function)
lowered = sentenceDF.select(lower(col('sentence')).alias('lower'))
lowered.show()

# Remove punctuation
cleaned = lowered.select(regexp_replace(col('lower'), r'([^a-z\d\s])+', r'').alias('cleaned'))
cleaned.show()

# Trim
sentenceDF = cleaned.select(trim(col('cleaned')).alias('sentence'))
sentenceDF.show(truncate=False)
I just don't know how to implement this code in my function, since the function doesn't operate on the DataFrame, but only on the given column. I've tried different approaches; one was to create a new DataFrame out of the column input using
[...]
df = sqlContext.createDataFrame(column, ['sentence'])
[...]
within the function, but it doesn't work: TypeError: Column is not iterable. Other approaches tried to operate directly on column within the function, always leading to TypeError: 'Column' object is not callable.
I've started with (Py)Spark a few days ago and still have conceptual problems regarding how to deal with rows and columns only. I would really appreciate any kind of help on the current issue.
You can do this in a single line for a plain Python string text (it needs import re; note this is not the Spark Column API):
return re.sub(r'[^a-z0-9\s]', '', text.lower().strip()).strip()
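For the Spark Column the exercise asks about, the same chain can be expressed with the imported Column functions; this sketch mirrors the solution the asker posted in the comments above:

from pyspark.sql.functions import regexp_replace, trim, lower

def removePunctuation(column):
    # lower-case, drop everything except letters, digits and whitespace, then trim the ends
    return trim(regexp_replace(lower(column), r'[^a-z0-9\s]', '')).alias('sentence')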
I want to be able to append to a .txt file each time I run a function.
The output I am trying to write from the function is something like this:
somelist = ['a','b','b','c']
somefloat = -0.64524
sometuple = (235,633,4245,524)
output = (somelist, somefloat, sometuple)  # the output does not need to be in tuple format
Right now, I am outputting like this:
outfile = open('log.txt', 'a')
out = str(output) + '\n'
outfile.write(out)
This kind of works, but I have to read it back in like this:
with open('log.txt', 'r') as myfile:
    mydata = myfile.readlines()

for line in mydata:
    line = eval(line)
Ideally, I would like to be able to import it back directly into a Pandas DataFrame something like this:
dflog = pd.read_csv('log.txt')
and have it generate a three-column dataset, with the first column containing a list (string format is fine), the second column containing a float, and the third column containing a tuple (same deal as the list).
My questions are:
Is there a way to append the output in a format that can be more easily imported into pandas?
Is there a simpler way of doing this? This seems like a pretty common task; I wouldn't be surprised if somebody had already reduced it to a line or two of code.
One way to do this is to separate your columns with a custom separator such as '|'
Say:
somelist = ['a','b','b','c']
somefloat = -0.64524
sometuple = (235,633,4245,524)
output = str(somelist) + "|" + str(somefloat) + "|" + str(sometuple)
(if you want many more columns, use something like '|'.join(str(v) for v in values))
Then, just as before:
outfile = open('log.txt','a')
out = output + '\n'
outfile.write(out)
Then just read the whole file with:
pd.read_csv("log.txt", sep='|')
Do note that storing lists or tuples in pandas cells is discouraged (I couldn't find an official reference for that, though). For faster operations, you might consider splitting your tuples or lists into separate columns so that you're left with floats, integers, or simple strings. Pandas can easily handle automatic column naming if you need it.
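A minimal sketch of that flattening idea (the csv layout and column names here are my own, for illustration): each tuple element gets its own cell, so every column holds a plain scalar.

import csv
import pandas as pd

somelist = ['a', 'b', 'b', 'c']
somefloat = -0.64524
sometuple = (235, 633, 4245, 524)

# append one row: the list joined into one string, the float,
# and each tuple element in its own column
with open('log.csv', 'a', newline='') as f:
    csv.writer(f).writerow([','.join(somelist), somefloat, *sometuple])

df = pd.read_csv('log.csv', header=None,
                 names=['labels', 'score', 't0', 't1', 't2', 't3'])
print(df.dtypes)  # score and t0..t3 come back as numeric dtypes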