Concatenate two columns in pandas - python

Good evening, I need help getting two columns together; my brain is stuck right now. Here's my code:
import pandas as pd
import numpy as np
tabela = pd.read_csv('/content/idkfa_linkedin_user_company_202208121105.csv', sep=';')
tabela.head(2)
coluna1 = 'startPeriodoMonth'
coluna2 = 'startPeriodoYear'
pd.concat([coluna1, coluna2])
ERROR: cannot concatenate object of type '<class 'str'>'; only Series and DataFrame objs are valid
I'm currently getting this error, but I really don't know what to do. By the way, I'm a beginner and don't know much about coding, so any help is very appreciated.

I am new to Pandas too, but I think I can help you. You seem to have created two string variables by encapsulating the literal strings 'startPeriodoMonth' and 'startPeriodoYear' in single quotes. I think that what you're trying to do is pass columns from your pandas data frame, and the way to do that is to explicitly reference the data frame and then wrap your column name (still in quotes) in square brackets, like this:
coluna1 = tabela['startPeriodoMonth']
This is why it is saying that you "cannot concatenate object of type '<class 'str'>'": pd.concat only accepts Series and DataFrame objects.
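A minimal sketch of the corrected call (the column names are the ones from the question; axis=1 is an assumption, since the goal seems to be placing the two columns side by side):
import pandas as pd
tabela = pd.read_csv('/content/idkfa_linkedin_user_company_202208121105.csv', sep=';')
# Pass the actual Series, not the column-name strings; axis=1 puts
# the two columns next to each other instead of stacking them.
resultado = pd.concat([tabela['startPeriodoMonth'], tabela['startPeriodoYear']], axis=1)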

From what I understand, coluna1 and coluna2 are columns from tabela. You have two options:
The first is selecting the columns from the dataframe and storing it in a new dataframe.
import pandas as pd
import numpy as np
tabela = pd.read_csv('/content/idkfa_linkedin_user_company_202208121105.csv', sep=';')
tabela.head(2)
coluna1 = 'startPeriodoMonth'
coluna2 = 'startPeriodoYear'
new_df = tabela[[coluna1, coluna2]]
The second option is creating a DataFrame that contains just the desired column (one for each column), followed by concatenating these DataFrames side by side:
coluna1 = 'startPeriodoMonth'
coluna2 = 'startPeriodoYear'
df_column1 = tabela[[coluna1]]
df_column2 = tabela[[coluna2]]
pd_concat = [df_column1, df_column2]
result = pd.concat(pd_concat, axis=1)  # axis=1 places the columns side by side rather than stacking the rows

You can create a new column in your existing data frame to get the desired output.
tabela['month_year'] = tabela[coluna1].apply(str) + '/' + tabela[coluna2].apply(str)
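If coluna1 and coluna2 already hold the column names, an equivalent and slightly more idiomatic variant uses astype(str) (the 'month_year' name is just the one from the line above):
tabela['month_year'] = tabela[coluna1].astype(str) + '/' + tabela[coluna2].astype(str)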

I'm trying to find elements of a column in a dataframe in the columns of another dataframe, but index() is not working for me

I'm trying to find the elements of the column 'Start (m)' of the corrosion dataframe in the column 'Inicio (m)' of the riesgoRel dataframe and get the indices stored in a list. I implemented the following code:
import pandas as pd
corrosion = pd.read_excel('Corrosion.xlsx', index=False)
TPdamage = pd.read_excel('Daños por terceros.xlsx', index=False)
for row in corrosion['Start (m)']:
    indexcorr[row] = riesgoRel['Progresiva Inicio (m)'].index(corrosion['Start (m)'][row])
print(indexcorr)
But when I try to run this, I get the following error: 'RangeIndex' object is not callable. I'm guessing there is a fairly simple mistake somewhere but I cannot figure it out.
Thank you very much.
Let us try this:
final_list = corrosion[corrosion['Start (m)']\
    .isin(riesgoRel['Progresiva Inicio (m)'].values)]\
    .index.tolist()
final_list is the list of all desired indices.
Alternatively, change the file format to CSV and work with a pandas data frame.
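A minimal sketch of that suggestion, assuming the two sheets are exported as Corrosion.csv and RiesgoRel.csv (hypothetical file names):
import pandas as pd
corrosion = pd.read_csv('Corrosion.csv')
riesgoRel = pd.read_csv('RiesgoRel.csv')
# Same isin-based lookup as above, just reading from CSV instead of Excel.
final_list = corrosion[corrosion['Start (m)'].isin(riesgoRel['Progresiva Inicio (m)'])].index.tolist()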

Normalize column with JSON data in Pandas dataframe

I have a Pandas dataframe in which one column contains JSON data (the JSON structure is simple: only one level, there is no nested data):
ID,Date,attributes
9001,2020-07-01T00:00:06Z,"{"State":"FL","Source":"Android","Request":"0.001"}"
9002,2020-07-01T00:00:33Z,"{"State":"NY","Source":"Android","Request":"0.001"}"
9003,2020-07-01T00:07:19Z,"{"State":"FL","Source":"ios","Request":"0.001"}"
9004,2020-07-01T00:11:30Z,"{"State":"NY","Source":"windows","Request":"0.001"}"
9005,2020-07-01T00:15:23Z,"{"State":"FL","Source":"ios","Request":"0.001"}"
I would like to normalize the JSON content in the attributes column so that each JSON attribute becomes its own column in the dataframe:
ID,Date,attributes.State, attributes.Source, attributes.Request
9001,2020-07-01T00:00:06Z,FL,Android,0.001
9002,2020-07-01T00:00:33Z,NY,Android,0.001
9003,2020-07-01T00:07:19Z,FL,ios,0.001
9004,2020-07-01T00:11:30Z,NY,windows,0.001
9005,2020-07-01T00:15:23Z,FL,ios,0.001
I have been trying to use pandas json_normalize, which requires a dictionary. So I figured I would convert the attributes column to a dictionary, but it does not quite work out as expected, because the dictionary has the form:
df.attributes.to_dict()
{0: '{"State":"FL","Source":"Android","Request":"0.001"}',
1: '{"State":"NY","Source":"Android","Request":"0.001"}',
2: '{"State":"FL","Source":"ios","Request":"0.001"}',
3: '{"State":"NY","Source":"windows","Request":"0.001"}',
4: '{"State":"FL","Source":"ios","Request":"0.001"}'}
And the normalization takes the key (0, 1, 2, ...) as the column name instead of the JSON keys.
I have the feeling that I am close but I can't quite work out how to do this exactly. Any idea is welcome.
Thank you!
Normalize expects to work on an object, not a string.
import json
import pandas as pd
df_final = pd.json_normalize(df.attributes.apply(json.loads))
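To keep the other columns, the normalized frame can be joined back onto the original; a sketch, assuming df still has its default RangeIndex so the rows line up:
import json
import pandas as pd
attributes = pd.json_normalize(df['attributes'].apply(json.loads))
# json_normalize returns a fresh RangeIndex, so an index-aligned join works here.
df_final = df.drop(columns=['attributes']).join(attributes)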
You shouldn’t need to convert to a dictionary first.
Try:
import pandas as pd
pd.json_normalize(df['attributes'])
I found a solution, but I am not overly happy with it; I reckon it is very inefficient.
import pandas as pd
import json
# Import full dataframe
df = pd.read_csv(r'D:/tmp/sample_simple.csv', parse_dates=['Date'])
# Create empty dataframe to hold the results of data conversion
df_attributes = pd.DataFrame()
# Loop through the data to fill the dataframe
for index in df.index:
    row_json = json.loads(df.attributes[index])
    normalized_row = pd.json_normalize(row_json)
    # df_attributes.append(normalized_row) is deprecated; use concat instead
    df_attributes = pd.concat([df_attributes, normalized_row], ignore_index=True)
# Reset the index of the attributes dataframe
df_attributes = df_attributes.reset_index(drop=True)
# Drop the original attributes column
df = df.drop(columns=['attributes'])
# Join the results
df_final = df.join(df_attributes)
# Show results
print(df_final)
print(df_final.info())
This gives me the expected result. However, as I said, there are several inefficiencies in it, starting with the dataframe concatenation inside the for loop. According to the documentation, the best practice is to build a list and then concatenate once, but I could not figure out how to do that while keeping the shape I wanted. I welcome all critiques and ideas.
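For what it's worth, a sketch of the list-then-concat pattern mentioned above: collect the per-row frames in a plain Python list and call pd.concat once at the end, which avoids re-copying the growing dataframe on every iteration:
import json
import pandas as pd
# Parse and normalize every row, collecting the results in a list...
normalized_rows = [pd.json_normalize(json.loads(row)) for row in df['attributes']]
# ...then concatenate a single time.
df_attributes = pd.concat(normalized_rows, ignore_index=True)
df_final = df.drop(columns=['attributes']).join(df_attributes)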

Copy DataFrame columns with particular labels to new pandas DataFrame under different labels

Currently I am trying to do the following:
import pandas as pd
import numpy as np
Creating an empty data frame with the column indexes:
currentDataToAdd = pd.read_csv('product-export-empty.csv', header=0)
Importing the data that needs to be reformatted:
newData = pd.read_csv('ChermsideStock.csv', header=0)
Reformatting the data into currentDataToAdd (all labels exist in both data frames):
currentDataToAdd.loc[:, 'sku'] = newData.loc[:, 'Barcode']
currentDataToAdd.loc[:, 'name'] = newData.loc[:, 'Description']
currentDataToAdd.loc[:, 'tax_name'] = newData.loc[:, 'Sales_Tax']
currentDataToAdd.loc[:, 'supply_price'] = newData.loc[:, 'Cost']
currentDataToAdd.loc[:, 'retail_price'] = newData.loc[:, 'Sell']
The problem is that what I'm getting in currentDataToAdd is the data, but in the wrong columns. Please help: what am I doing wrong?
Assuming that 'product-export-empty.csv' just has the column headers and nothing else, and 'ChermsideStock.csv' has your data, and you want the two combined, you can do that by:
df_empty = pd.read_csv('product-export-empty.csv')
df_data = pd.read_csv('ChermsideStock.csv', header=None)
df_data.columns = df_empty.columns
Two things to keep in mind:
it is better to use lower-case variable names with underscores between words
it is better to use very clear names that allow no ambiguity
Pandas is very elegant, and things are usually very simple to do syntax-wise.

python dask dataframe splitting column of tuples into two columns

I am using python 2.7 with dask
I have a dataframe with one column of tuples that I created like this:
table[col] = table.apply(lambda x: (x[col1], x[col2]), axis=1, meta=pd.DataFrame)
I want to convert this tuple column back into two separate columns.
In pandas I would do it like this:
table[[col1,col2]] = table[col].apply(pd.Series)
The point of doing so is that a dask dataframe does not support a multi index, and I want to group by multiple columns, so I wish to create a column of tuples that gives me a single index containing all the values I need (please ignore efficiency vs. a multi index, as dask dataframe does not yet fully support it).
When I try to unpack the tuple column with dask using this code:
rxTable[["a","b"]] = rxTable["tup"].apply(lambda x: s(x), meta = pd.DataFrame, axis = 1)
I get this error
AttributeError: 'Series' object has no attribute 'columns'
when I try
rxTable[["a","b"]] = rxTable["tup"].apply(dd.Series, axis = 1, meta = pd.DataFrame)
I get the same error.
How can I take a column of tuples and convert it into two columns, like I do in pandas with no problem?
Thanks
The best I found so far is converting to a pandas dataframe, converting the column there, then going back to dask:
df1 = df.compute()
df1[["a","b"]] = df1["c"].apply(pd.Series)
df = dd.from_pandas(df1,npartitions=1)
This will work well. If the df is too big for memory, you can either:
1. compute only the wanted column, convert it into two columns, and then use merge to get the split results into the original df (see the sketch after this list), or
2. split the df into chunks, convert each chunk and append it to an HDF5 file, then use dask to read the entire HDF5 file into a dask dataframe.
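A rough sketch of the first option, assuming df is the dask dataframe, 'tup' is the tuple column, and 'a'/'b' are illustrative output names (written against the current dask API, not the Python 2.7-era one):
import pandas as pd
import dask.dataframe as dd
# Compute only the tuple column and split it in pandas, which is cheap
# compared to computing the whole frame.
tup_pd = df['tup'].compute()
split_pd = pd.DataFrame(tup_pd.tolist(), index=tup_pd.index, columns=['a', 'b'])
# Bring the two new columns back into dask and merge on the index.
split_dd = dd.from_pandas(split_pd, npartitions=df.npartitions)
df = df.merge(split_dd, left_index=True, right_index=True)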
I found that this approach works well and avoids converting the dask DataFrame to pandas:
df['a'] = df['tup'].str.partition(sep)[0]
df['b'] = df['tup'].str.partition(sep)[2]
where sep is whatever delimiter you were using in the column to separate the two elements.

How to import all fields from xls as strings into a Pandas dataframe?

I am trying to import a file from xlsx into a Python pandas dataframe. I would like to prevent fields/columns from being interpreted as integers and thus losing leading zeros or other desired heterogeneous formatting.
So for an Excel sheet with 100 columns, I would do the following, using a dict comprehension with range(100):
import pandas as pd
filename = r'C:\DemoFile.xlsx'
fields = {col: str for col in range(100)}
df = pd.read_excel(filename, sheet_name=0, converters=fields)
These import files have a varying number of columns, and I am looking for a way to handle this without changing the range manually every time.
Does somebody have any further suggestions or alternatives for reading Excel files into a dataframe and treating all fields as strings by default?
Many thanks!
Try this:
xl = pd.ExcelFile(r'C:\DemoFile.xlsx')
ncols = xl.book.sheet_by_index(0).ncols
df = xl.parse(0, converters={i : str for i in range(ncols)})
UPDATE:
In [261]: type(xl)
Out[261]: pandas.io.excel.ExcelFile
In [262]: type(xl.book)
Out[262]: xlrd.book.Book
Use dtype=str when calling .read_excel():
import pandas as pd
filename = r'C:\DemoFile.xlsx'
df = pd.read_excel(filename, dtype=str)
The usual solution is:
1. read in one row of data just to get the column names and the number of columns
2. create the dictionary automatically, where each column has a string type
3. re-read the full data using the dictionary created at step 2
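A sketch of those steps, assuming a pandas version whose read_excel supports nrows (the file name is the one from the question):
import pandas as pd
filename = r'C:\DemoFile.xlsx'
# 1. Read only the header row to discover the column names.
header = pd.read_excel(filename, nrows=0)
# 2. Build the converters dictionary automatically, mapping every column to str.
converters = {col: str for col in header.columns}
# 3. Re-read the full file with that dictionary.
df = pd.read_excel(filename, converters=converters)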
