I have a pandas dataframe with an array column whose entries are two-part strings, as in the example below. The first part is a datetime and the second part is a price. Records in the dataframe have price_trend arrays of different lengths.
Id Name Color price_trend
1 apple red '1420848000:1.25', '1440201600:1.35', '1443830400:1.52'
2 lemon yellow '1403740800:0.32','1422057600:0.25'
I'd like to split each of the strings in the array into two parts around the colon (:). However, when I run the code below, all the values in price_trend are replaced with NaN:
df['price_trend'] = df['price_trend'].str.split(':')
I would like to keep the array in this dataframe, and not create a new one.
df['price_trend'].apply(lambda x:[i.split(':') for i in x])
0 [['1420848000', '1.25'], ['1440201600', '1.35'], ['1443830400', '1.52']]
1 [['1403740800', '0.32'], ['1422057600', '0.25']]
I assume the code below should work for you:
>>> df={}
>>> df['p']=['1420848000:1.25', '1440201600:1.35', '1443830400:1.52']
>>> df['p']=[ x.split(':') for x in df['p']]
>>> df
{'p': [['1420848000', '1.25'], ['1440201600', '1.35'], ['1443830400', '1.52']]}
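Applied to the original dataframe, the same per-element split can be sketched like this (assuming price_trend holds Python lists of strings; .str.split returns NaN here because the .str accessor only operates on plain string values, not lists):

```python
import pandas as pd

df = pd.DataFrame({
    'Id': [1, 2],
    'Name': ['apple', 'lemon'],
    'Color': ['red', 'yellow'],
    'price_trend': [
        ['1420848000:1.25', '1440201600:1.35', '1443830400:1.52'],
        ['1403740800:0.32', '1422057600:0.25'],
    ],
})

# split every 'timestamp:price' string inside each list, in place
df['price_trend'] = df['price_trend'].apply(lambda lst: [s.split(':') for s in lst])
print(df['price_trend'][1])  # [['1403740800', '0.32'], ['1422057600', '0.25']]
```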
I have a dataframe with a column containing strings that I am trying to replace with a randomly generated string, and keep a dictionary with the originals and the replacements.
Concretely, I have something like this:
col1
0 Marie
1 Marie
2 Lucas
3 Dog
4 Table
5 Dog
And I want to replace those strings with a code. The format of the code is indifferent, but, for example, with a 6-character, letters-only code, the output would look like this:
col1
0 aadfre
1 aadfre
2 qwerty
3 lfkdjs
4 hgyeoy
5 lfkdjs
And I am trying to keep a dictionary of the matching, like this: {'Marie': 'aadfre', 'Lucas': 'qwerty', 'Dog': 'lfkdjs', 'Table': 'hgyeoy'}
Is there any way to do this?
Thanks!!
Try using Python's string module and random.randint.
import pandas as pd
import string
from random import randint
Create the Pandas DataFrame and our dictionary mapping real names to encoded names:
df = pd.DataFrame(['Marie','Marie','Lucas','Dog','Table','Dog'])
secret_names_dict = {name:''.join([string.ascii_lowercase[randint(0,25)] for char in range(6)]) for name in df[0].unique()}
I will break down the steps in this dictionary comprehension.
The code below creates a list of 6 random lowercase characters:
[string.ascii_lowercase[randint(0,25)] for char in range(6)]
and by using ''.join() we join them into a string.
.unique() is the Pandas method for extracting the unique values from a column; we use it to ensure the same values are encoded the same way.
df[0].unique()
The rest is just dictionary comprehension, storing original values and encoded values.
This newly created dictionary can then be used to replace values in the column with the Pandas .replace() method.
df.replace(secret_names_dict)
The result will be:
0
0 loixez
1 loixez
2 pavedm
3 kigahn
4 gybour
5 kigahn
Hope that helps, I tried to keep it as simple as possible.
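Put together, a minimal runnable sketch of the approach (using random.choices, a close equivalent of indexing with randint):

```python
import string
import random
import pandas as pd

df = pd.DataFrame({'col1': ['Marie', 'Marie', 'Lucas', 'Dog', 'Table', 'Dog']})

# one random 6-letter code per unique original value
secret_names_dict = {
    name: ''.join(random.choices(string.ascii_lowercase, k=6))
    for name in df['col1'].unique()
}

df['col1'] = df['col1'].replace(secret_names_dict)
```

Note that two different names could in principle draw the same random code; with 26**6 possibilities this is unlikely, but you can verify with len(set(secret_names_dict.values())) == len(secret_names_dict).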
Using a Jupyter notebook, I have scraped some data from the web, which I named "graphValues". graphValues is a list and the values within it are strings. I would like to put the data of graphValues into a new dataframe named "data" as a column. When I do this, the dataframe contains only one single element, which is the entire graphValues list showing as a row, not a column:
data=pd.DataFrame([graphValues])
print(data)
output:
0 [10,0,0,2,0,3,2,4,4,14,11,20,12,18,43,50,20,80...
Something else I tried is putting graphValues in a dictionary as follows:
code:
graphValues2={'Graph Val': graphValues}
data=pd.DataFrame(graphValues2)
print(data)
This gave an error saying:
ValueError: If using all scalar values, you must pass an index
but if I add an index of length x, the df will just contain the same list x times (x being the length of graphValues, of course).
How can I get the following output? Is there a way without a for loop? What is the most efficient way?
Wanted output:
0 10
1 0
2 0
3 2
4 0
: :
: :
Do not use print; in Jupyter, evaluate the variable on its own to display it. If graphValues is a single comma-separated string, split it first:
data=pd.DataFrame(graphValues.split(','))
data
Or, if graphValues is already a list, pass it directly:
data = pd.DataFrame(graphValues)
data
>>> graphValues="10,0,0,2,0,3".split(",")
>>> data=pd.DataFrame(graphValues)
>>> data
0
0 10
1 0
2 0
3 2
4 0
5 3
pd.DataFrame(['r0c0','r0c1','r0c2']) creates a single column. Add an outer list and pandas treats each inner list as a row (pd.DataFrame([['r0c0','r0c1','r0c2'], ['r1c0','r1c1','r1c2']]) is two rows). Since graphValues was already a list, wrapping it in another list gave you that second, one-row form.
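A quick sketch of that row/column difference, using made-up values in place of the scraped data:

```python
import pandas as pd

graphValues = ['10', '0', '0', '2', '0', '3']  # list of strings, as scraped

col = pd.DataFrame(graphValues)    # one column, six rows
row = pd.DataFrame([graphValues])  # one row, six columns

print(col.shape)  # (6, 1)
print(row.shape)  # (1, 6)

# optionally convert the strings to numbers while building the frame
data = pd.DataFrame({'Graph Val': pd.to_numeric(graphValues)})
```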
I am attempting to iterate over a specific column in my dataframe.
The column is:
df['column'] = ['1.4million', '1,235,000','100million',NaN, '14million', '2.5mill']
I am trying to clean this column and eventually get it all to integers to do more work with. I am stuck on the step to clean out "million." I would like to replace the "million" with five zeros when there is a decimal (ie 1.4million becomes 1.400000) and the "million" with six zeros when there is no decimal (ie 100million becomes 100000000).
To simplify, the first step I'm trying is to just focus on filtering out the values with a decimal and replace those with 5 zeros. I have attempted to use np.where for this, however I cannot use the replace method with numpy.
I also attempted to use pd.DataFrame.where, but am getting an error:
for i,row in df.iterrows():
df.at[i,'column'] = pd.DataFrame.where('.' in df.at[i,'column'],df.at[i,'column'].replace('million',''),df.at[i,'column'])
AttributeError: 'numpy.ndarray' object has no attribute 'replace'
I'm sure there is something I'm missing here. (I'm also sure I'll be told that I don't need to use iterrows here, so I am open to suggestions on that as well.)
Given your sample data - it looks like you can strip out commas and then take all digits (and . characters) until the string mill or end of string and split those out, eg:
x = df['column'].str.replace(',', '').str.extract('(.*?)(mill.*)?$')
This'll give you:
0 1
0 1.4 million
1 1235000 NaN
2 100 million
3 NaN NaN
4 14 million
5 2.5 mill
Then take the number part and multiply it by a million where there's something in column 1, else multiply it by 1, eg:
res = pd.to_numeric(x[0]) * np.where(x[1].notna(), 1_000_000, 1)
That'll give you:
0 1400000.0
1 1235000.0
2 100000000.0
3 NaN
4 14000000.0
5 2500000.0
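For reference, the two steps above run end-to-end like this (a sketch using the sample column from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'column': ['1.4million', '1,235,000', '100million',
                              np.nan, '14million', '2.5mill']})

# strip commas, then split the number from any trailing 'mill...' suffix
x = df['column'].str.replace(',', '', regex=False).str.extract(r'(.*?)(mill.*)?$')

# multiply by a million only where a 'mill...' suffix was present
res = pd.to_numeric(x[0]) * np.where(x[1].notna(), 1_000_000, 1)
```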
Try this:
df['column'].apply(lambda x : x.replace('million','00000'))
Make sure your dtype is string before applying this
For the given data:
df['column'].apply(lambda x: float(str(x).split('m')[0])*10**6
if 'million' in str(x) or 'mill' in str(x) else x)
If the column may contain many different forms of "million", use a regex search instead.
I want to find duplicate items within 2 rows in Excel. So for example my Excel consists of:
list_A list_B
0 ideal ideal
1 brown colour
2 blue blew
3 red red
I checked the pandas documentation and tried the duplicated method, but I simply don't know why it keeps saying the DataFrame is empty. It finds both columns, and I guess it iterates over them, but why doesn't it find the values and compare them?
I also tried using iterrows but honestly don't know how to implement it.
When running the code I get this output:
Empty DataFrame
Columns: [list A, list B]
Index: []
import pandas as pd
pt = pd.read_excel(r"C:\Users\S531\Desktop\pt.xlsx")
dfObj = pd.DataFrame(pt)
doubles = dfObj[dfObj.duplicated()]
print(doubles)
The output I'm looking for is:
list_A list_B
0 ideal ideal
3 red red
Final solved code looks like this:
import pandas as pd
pt = pd.read_excel(r"C:\Users\S531\Desktop\pt.xlsx")
doubles = pt[pt['list_A'] == pt['list_B']]
print(doubles)
The term "duplicate" is usually used to mean rows that are exact duplicates of previous rows (see the documentation of pd.DataFrame.duplicated).
What you are looking for is just the rows where these two columns are equal. For that, you want:
doubles = pt[pt['list_A'] == pt['list_B']]
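As a runnable sketch with the sample data (no Excel file needed):

```python
import pandas as pd

pt = pd.DataFrame({'list_A': ['ideal', 'brown', 'blue', 'red'],
                   'list_B': ['ideal', 'colour', 'blew', 'red']})

# keep only the rows where the two columns hold the same value
doubles = pt[pt['list_A'] == pt['list_B']]
print(doubles)
```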
I am trying to return a specific item from a Pandas DataFrame via conditional selection (and do not want to have to reference the index to do so).
Here is an example:
I have the following dataframe:
Code Colour Fruit
0 1 red apple
1 2 orange orange
2 3 yellow banana
3 4 green pear
4 5 blue blueberry
I enter the following code to search for the code for blueberries:
df[df['Fruit'] == 'blueberry']['Code']
This returns:
4 5
Name: Code, dtype: int64
which is of type:
pandas.core.series.Series
but what I actually want to return is the number 5 of type:
numpy.int64
which I can do if I enter the following code:
df[df['Fruit'] == 'blueberry']['Code'][4]
i.e. referencing the index to give the number 5, but I do not want to have to reference the index!
Is there another syntax that I can deploy here to achieve the same thing?
Thank you!...
Update:
One further idea is this code:
df[df['Fruit'] == 'blueberry']['Code'][df[df['Fruit']=='blueberry'].index[0]]
However, this does not seem particularly elegant (and it references the index). Is there a more concise and precise method that does not need to reference the index or is this strictly necessary?
Thanks!...
Let's try this:
df.loc[df['Fruit'] == 'blueberry','Code'].values[0]
Output:
5
First, use .loc to access the values in your dataframe, using boolean indexing for row selection and the index label for column selection. Then convert the returned Series to an array of values; since there is only one value in that array, you can use index [0] to get the scalar value from that single-element array.
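A self-contained version of that answer, using the dataframe from the question:

```python
import pandas as pd

df = pd.DataFrame({'Code': [1, 2, 3, 4, 5],
                   'Colour': ['red', 'orange', 'yellow', 'green', 'blue'],
                   'Fruit': ['apple', 'orange', 'banana', 'pear', 'blueberry']})

# boolean row selection, column label, then take the single value
code = df.loc[df['Fruit'] == 'blueberry', 'Code'].values[0]
print(code)  # 5
```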
Referencing index is a requirement (unless you use next()^), since a pd.Series is not guaranteed to have one value.
You can use pd.Series.values to extract the values as an array. This also works if you have multiple matches:
res = df.loc[df['Fruit'] == 'blueberry', 'Code'].values
# array([5], dtype=int64)
df2 = pd.concat([df]*5)
res = df2.loc[df2['Fruit'] == 'blueberry', 'Code'].values
# array([5, 5, 5, 5, 5], dtype=int64)
To get a list from the numpy array, you can use .tolist():
res = df.loc[df['Fruit'] == 'blueberry', 'Code'].values.tolist()
Both the array and the list versions can be indexed intuitively, e.g. res[0] for the first item.
^ If you are really opposed to using index, you can use next() to iterate:
next(iter(res))
You can also set your 'Fruit' column as an index
df_fruit_index = df.set_index('Fruit')
and extract the value from the 'Code' column based on the fruit you choose
df_fruit_index.loc['blueberry','Code']
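For example (note that .loc here returns a scalar only because 'blueberry' appears once in the index; with duplicate labels it would return a Series):

```python
import pandas as pd

df = pd.DataFrame({'Code': [1, 2, 3, 4, 5],
                   'Fruit': ['apple', 'orange', 'banana', 'pear', 'blueberry']})

df_fruit_index = df.set_index('Fruit')
code = df_fruit_index.loc['blueberry', 'Code']
print(code)  # 5
```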
Easiest solution: convert the pandas.core.series.Series to an integer!
my_code = int(df[df['Fruit'] == 'blueberry']['Code'])
print(my_code)
Outputs:
5