I have a mix of lists and plain strings as cell values in a pandas DataFrame. I am trying to convert the lists to strings. The conversion works for the lists, but it also splits the plain strings into single characters. How do I apply this logic only when a cell in that column contains a list?
import pandas as pd

raw_data = {'Name': [['\'John Smith\''], ['\'Jane Doe\'']],
            'id': [['\'A1005\'', '\'A1006\''], 'A200,A400,A500']}
dfRaw = pd.DataFrame(raw_data, columns=['Name', 'id'])
dfRaw['Name'] = dfRaw['Name'].astype(str)
Data
Name id
0 ["'John Smith'"] ['A1005', 'A1006']
1 ["'Jane Doe'"] A200,A400,A500
Need output like this:
Name id
0 ["'John Smith'"] 'A1005','A1006'
1 ["'Jane Doe'"] A200,A400,A500
But the code below is splitting string cell values too.
dfRaw['id'] = dfRaw['id'].apply(lambda x: ','.join([str(i) for i in x]))
Name id
0 ["'John Smith'"] 'A1005','A1006'
1 ["'Jane Doe'"] A,2,0,0,,,A,4,0,0,,,A,5,0,0
You could use a list comprehension to build a new list from the rows in id, joining only those entries that are lists with str.join and leaving the plain strings untouched.
You can check whether an entry is a list using isinstance:
df['id'] = [','.join(i) if isinstance(i, list) else i for i in df['id']]
Output
Name id
0 ['John Smith'] A1005,A1006
1 ['Jane Doe'] A200,A400,A500
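The same check can also be written as a small named function passed to apply. This is only a sketch: the helper name join_if_list and the sample frame are made up for illustration, and it assumes every non-list cell should be returned unchanged.

```python
import pandas as pd

# Hypothetical sample data: one cell holds a list, the other a plain string.
df = pd.DataFrame({'id': [['A1005', 'A1006'], 'A200,A400,A500']})

def join_if_list(value, sep=','):
    """Join list cells into one string; return any other value unchanged."""
    if isinstance(value, list):
        return sep.join(str(item) for item in value)
    return value

df['id'] = df['id'].apply(join_if_list)
print(df['id'].tolist())
```

Converting each item with str() inside the join keeps the helper from failing if a list happens to contain non-string values.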
Related
I have a data frame like this
data = [['Ma', 1, 'too'], ['Ma', 1, 'taa'], ['Ma', 1, 'tuu'],
        ['Ga', 2, 'too'], ['Ga', 2, 'taa'], ['Ga', 2, 'tuu']]
df = pd.DataFrame(data, columns=['NAME', 'ID', 'SUBTYPE'])
NAME ID SUBTYPE
Ma 1 too
Ma 1 taa
Ma 1 tuu
Ga 2 too
Ga 2 taa
Ga 2 tuu
The NAME and ID values repeat, while the SUBTYPE values differ.
I want a list like this:
Ma-1-[too,taa,tuu],Ga-2-[too,taa,tuu]
EDIT: NAME and ID should be always the same.
Generally, to achieve this in Python we would use a dictionary, since its keys cannot be duplicated.
# Combine the NAME and ID columns so they can be used together as a single key.
df["NAMEID"] = df["NAME"] + "-" + df["ID"].astype(str)
# Convert the desired fields to lists.
name_id_list = df["NAMEID"].tolist()
subtype_list = df["SUBTYPE"].tolist()
# Loop through the lists by zipping them together.
results_dict = {}
for name_id, subtype in zip(name_id_list, subtype_list):
    if name_id in results_dict:
        # The key already exists, so append the subtype to its list.
        results_dict[name_id].append(subtype)
    else:
        # The key does not exist yet, so add it with a new one-element list.
        results_dict[name_id] = [subtype]
results_dict will end up looking like:
{'Ma-1': ['too', 'taa', 'tuu'], 'Ga-2': ['too', 'taa', 'tuu']}
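As an alternative to the explicit loop, pandas can do the grouping itself. The sketch below swaps in groupby with agg(list), which is not the method used in the answer above. It assumes the columns are named NAME, ID and SUBTYPE as in the question.

```python
import pandas as pd

data = [['Ma', 1, 'too'], ['Ma', 1, 'taa'], ['Ma', 1, 'tuu'],
        ['Ga', 2, 'too'], ['Ga', 2, 'taa'], ['Ga', 2, 'tuu']]
df = pd.DataFrame(data, columns=['NAME', 'ID', 'SUBTYPE'])

# sort=False keeps the groups in order of first appearance (Ma before Ga).
grouped = df.groupby(['NAME', 'ID'], sort=False)['SUBTYPE'].agg(list)

# The index holds (NAME, ID) tuples; build the "NAME-ID" keys from them.
results_dict = {f"{name}-{id_}": subtypes
                for (name, id_), subtypes in grouped.items()}
print(results_dict)
```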
I have a list:
l = [{'County': 'SentenceCase'}, {'Postcode': 'UpperCase'}]
type(l) equals <class 'list'>
If I load it into a dataframe.
df = pd.DataFrame(l)
The first row then ends up as the column names.
County Postcode
0 SentenceCase NaN
1 NaN UpperCase
I've tried Header=None, etc. but nothing seems to work.
I would want the dataframe to be
Header1 Header2
0 County SentenceCase
1 Postcode UpperCase
I think you are just using the wrong structure. Instead of a list of dictionaries, you should have a list of lists to load into the DataFrame:
[['County', 'SentenceCase'], ['Postcode', 'UpperCase']]
If for some reason you require your original structure, you could use something like this:
new_l = []
for item in l:
    for key, value in item.items():
        new_l.append([key, value])
new_df = pd.DataFrame(new_l)
new_df.columns = ['Header1', 'Header2']
new_df
which will give:
Header1 Header2
0 County SentenceCase
1 Postcode UpperCase
Every dictionary in the list represents a row in the table, and each key is a column header.
It should look like this:
l = [{'Header1': 'County', 'Header2': 'SentenceCase'},
     {'Header1': 'Postcode', 'Header2': 'UpperCase'}]
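If the original list of single-entry dictionaries has to stay as it is, the conversion can be condensed into one comprehension. This is a minimal sketch, assuming each dictionary contributes one (key, value) row and that Header1 and Header2 are the wanted column names.

```python
import pandas as pd

l = [{'County': 'SentenceCase'}, {'Postcode': 'UpperCase'}]

# Flatten each dictionary into (key, value) pairs, one pair per output row.
rows = [(key, value) for item in l for key, value in item.items()]

# Passing columns= names the two positions of every pair.
df = pd.DataFrame(rows, columns=['Header1', 'Header2'])
print(df)
```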
So I have a column called "URL" in my DataFrame Pd1, and I want to remove the duplicate URLs within each row:
URL
row 1 : url1,url1,url2
row 2 : url2,url2,url3
Desired output:
URL
row 1 : url1,url2
row 2 : url2,url3
I assume that your column contains only comma-separated lists of URLs.
One possible solution is to apply a function to the URL column that performs the following steps:
split the source string on each comma (the result is a list of fragments),
create a set from this list (thus eliminating repetitions),
join the elements of this set using a comma,
and then save the result back into the source column.
Something like:
df.URL = df.URL.apply(lambda x: ','.join(set(re.split(',', x))))
As this code uses the re module, you have to import re beforehand. Note that a set is unordered, so the order of the URLs in the result is not guaranteed.
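Because a set does not keep the original order, the joined string may come out in a different order than the input. If the order of first appearance matters, one option is dict.fromkeys, which drops duplicates while keeping insertion order. This is a sketch with a made-up helper name (dedupe_csv), not the answer's own code.

```python
import pandas as pd

df = pd.DataFrame({'URL': ['url1,url1,url2', 'url2,url2,url3']})

def dedupe_csv(cell, sep=','):
    # dict.fromkeys keeps only the first occurrence of each item,
    # in the order the items appeared.
    return sep.join(dict.fromkeys(cell.split(sep)))

df['URL'] = df['URL'].apply(dedupe_csv)
print(df)
```

A plain str.split is enough here, so no import of re is needed.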
Split the strings and apply set:
d = {"url": ["url1,url1,url2",
"url2,url2,url3"]}
df = pd.DataFrame(d)
df.url.str.split(",").apply(set)
df['URL'] = df.URL.str.split(':').apply(lambda x: [x[0],','.join(sorted(set(x[1].split(','))))]).apply(' : '.join)
URL
0 row 1 : url1,url2
1 row 2 : url2,url3
If the data looks like this:
URL
0 url1,url1,url2
1 url2,url2,url3
then use:
df['URL'] = df.URL.str.split(',').apply(lambda x: ','.join(sorted(set(x))))
print(df)
URL
0 url1,url2
1 url2,url3
I have a data frame df1 with 3 columns and a loop that extracts strings from a text, depending on the column names:
exampletext = "Nr1 thisword1 and Nr2 thisword2 and Nr3 thisword3"
Columnnames = ("Nr1", "Nr2", "Nr3")
df1 = pd.DataFrame(columns=Columnnames)
for i in range(0, len(Columnnames)):
    solution = exampletext.find(Columnnames[i])
    lsolution = len(Columnnames[i])
    Solutionwords = exampletext[solution+lsolution:solution+lsolution+10]
Now I want to append Solutionwords at the end of the data frame df1 in the correct field, e.g. when looking for Nr1 I want to append Solutionwords to the column named Nr1.
I tried working with append and creating a list, but this just appends at the end of the list. I need the data frame to separate the words depending on the word I was looking for. Thank you for any help!
Edit for desired output and readability:
The desired output should be a data frame that looks like the following:
Nr1 | Nr2 | Nr3
thisword1 | thisword2 | thisword3
I've assumed that your word for the cell value always follows your column name and is separated by a space. In which case, I'd probably try and achieve this by adding your values to a dictionary and then creating a dataframe from it after it contains the data you want, like this:
example_text = "Nr1 thisword1 and Nr2 thisword2 and Nr3 thisword3"
column_names = ("Nr1", "Nr2", "Nr3")
d = dict()
split_text = example_text.split(' ')
for i, text in enumerate(split_text):
    if text in column_names:
        d[text] = split_text[i+1]
df = pd.DataFrame(d, index=[0])
which will give you:
>>> df
Nr1 Nr2 Nr3
0 thisword1 thisword2 thisword3
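If the text contains each column name more than once, the same idea can collect one list per column so that every occurrence becomes a new row. The sketch below uses an extended example text that is made up for illustration. It assumes every column name occurs the same number of times, since pd.DataFrame raises an error for lists of unequal length.

```python
import pandas as pd

# Hypothetical text in which every column name occurs twice.
example_text = ("Nr1 thisword1 and Nr2 thisword2 and Nr3 thisword3 "
                "and Nr1 again1 and Nr2 again2 and Nr3 again3")
column_names = ("Nr1", "Nr2", "Nr3")

words = example_text.split()
found = {name: [] for name in column_names}

# Pair every token with the one that follows it. A column name at the very
# end of the text has no following token and is skipped.
for token, following in zip(words, words[1:]):
    if token in found:
        found[token].append(following)

df = pd.DataFrame(found)
print(df)
```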
(Image: a picture of what the data looks like)
So, I have a JSON file with data. The file is deeply nested, and I want to take only the words and create a new data frame for each post ID. Can anyone help with this?
You can use apply with list comprehension:
df = pd.DataFrame({'member_info.vocabulary':[[], [{'post_iD':'3913', 'word':'Twisters'},
{'post_iD':'3911', 'word':'articulate'}]]})
df['words'] = df['member_info.vocabulary'].apply(lambda x: [y.get('word') for y in x])
print (df)
member_info.vocabulary words
0 [] []
1 [{'post_iD': '3913', 'word': 'Twisters'}, {'po... [Twisters, articulate]
And if the lists contain only one element, add .str[0] to select the first value of each list:
df = pd.DataFrame({'member_info.vocabulary':[[], [{'post_iD':'3913', 'word':'Twisters'}]]})
df['words'] = df['member_info.vocabulary'].apply(lambda x: [y.get('word') for y in x]).str[0]
print (df)
member_info.vocabulary words
0 [] NaN
1 [{'post_iD': '3913', 'word': 'Twisters'}] Twisters
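To get from the nested column to one data frame per post ID, one possible approach is to flatten the entries first and then group them. This is a sketch that assumes each entry is a dictionary with the keys post_iD and word, as in the sample above. The names words_df and frames_by_post are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'member_info.vocabulary': [
    [],
    [{'post_iD': '3913', 'word': 'Twisters'},
     {'post_iD': '3911', 'word': 'articulate'}],
]})

# Flatten the nested lists into one record per vocabulary entry.
records = [entry for vocab in df['member_info.vocabulary'] for entry in vocab]
words_df = pd.DataFrame(records, columns=['post_iD', 'word'])

# Build one small DataFrame per post id, keyed by that id.
frames_by_post = {post_id: group.reset_index(drop=True)
                  for post_id, group in words_df.groupby('post_iD')}
print(frames_by_post['3913'])
```

Rows with an empty list simply contribute no records, so they do not need special handling.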