Excel misaligns columns when appending dataframe to csv - python

I have a program that takes a URL as input, and checks it against a df that I'm reading from csv:
Name ID Date URL
0 Faye 111 12/31/16 https://www.url1.com
1 Faye 111 3/31/17 https://www.url2.com
2 Mike 222 3/31/17 https://www.url3.com
3 Mike 222 6/30/18 https://www.url4.com
4 Mike 222 9/30/18 https://www.url5.com
5 Jim 333 9/30/18 https://www.url6.com
If the URL doesn't exist in the df, the program executes some code, and then adds a new row with the URL to the df; else it moves on to another URL.
The program works fine if I just run it, stop it, and restart it. But if I delete an existing row (e.g., [1]) directly from the csv file in Excel to reprocess the data for that one url, the reprocessed data gets tacked onto row [5] of the df, starting at column [4]:
Name ID Date URL
0 Faye 111 12/31/16 https://www.url1.com
2 Mike 222 3/31/17 https://www.url3.com
3 Mike 222 6/30/18 https://www.url4.com
4 Mike 222 9/30/18 https://www.url5.com
5 Jim 333 9/30/18 https://www.url6.com Faye 111 3/31/17 https://www.url2.com
rather than adding at row [6] as a new row, which happens when I remove the row in a plain text editor (rather than Excel):
Name ID Date URL
... ... ... ...
5 Jim 333 9/30/18 https://www.url6.com
6 Faye 111 3/31/17 https://www.url2.com
I'm adding the data to the existing csv via df.to_csv('~/file.csv', mode='a', header=False, index=False), so can anyone identify what I'm doing wrong?

This likely happens because Excel saves the CSV without a trailing newline after the last row, so a chunk appended with mode='a' starts on the same line as the last existing row. Instead of appending another (1-row) dataframe as a separate chunk of the CSV file:
df.to_csv('~/file.csv', mode='a', header=False, index=False)
append it to your dataframe first (note that DataFrame.append is deprecated in recent pandas; pd.concat does the same job):
df = pd.concat([df, your_1_row_df], ignore_index=True)
and only then write the whole dataframe to ~/file.csv, this time with the (default) mode 'w', effectively rewriting the file, and with the header so it isn't lost:
df.to_csv('~/file.csv', index=False)
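For context, a minimal end-to-end sketch of that read-check-rewrite flow; the file path and column names come from the question, while the URL and new-row values are hypothetical:
import pandas as pd

path = 'file.csv'  # '~/file.csv' in the question; pandas expands '~' as well
df = pd.read_csv(path)

url = 'https://www.url7.com'  # hypothetical URL being checked
if url not in df['URL'].values:
    # ... the program's other processing would happen here ...
    new_row = pd.DataFrame([{'Name': 'Ann', 'ID': 444,  # hypothetical values
                             'Date': '12/31/18', 'URL': url}])
    df = pd.concat([df, new_row], ignore_index=True)
    df.to_csv(path, index=False)  # rewrite the whole file, header included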

Related

Add intermediate rows in a dataframe based on the previous record

Given the following dataframe:
ID  direction  country  time
0   IN         USA      12:10
0   OUT        FRA      14:20
0   OUT        ESP      16:11
1   IN         GER      11:13
1   OUT        USA      10:29
2   OUT        USA      09:21
2   OUT        ESP      21:33
I would like to add the following functionality to the above dataframe:
If two consecutive rows for the same ID both have "direction" equal to OUT, an intermediate row is created with the same data as the first OUT row but with its direction changed to IN.
Here is an example applied to the above dataframe:
ID  direction  country  time
0   IN         USA      12:10
0   OUT        FRA      14:20
0   IN         FRA      14:20
0   OUT        ESP      16:11
1   IN         GER      11:13
1   OUT        USA      10:29
2   OUT        USA      09:21
2   IN         USA      09:21
2   OUT        ESP      21:33
Thank you for your help.
Maintain a new dataframe
dfNew = pd.DataFrame()
and loop through each row of the existing dataframe (note that iterrows(), not iteritems(), iterates over rows):
for index, item in dfOld.iterrows():
Look at the value under direction on every iteration, and if it is IN, take that entire row and append it to the new dataframe (append() returns a new dataframe, so assign the result back):
dfNew = dfNew.append(item, ignore_index=True)
If it is OUT, add the entire row as above, but also create a new row
dfNew.loc[len(dfNew.index)] = [value1, value2, value3, ...]
or edit the existing row (contained in item) and add it to the new dataframe as well. A complete sketch of this loop follows below.
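Putting those steps together, a minimal runnable sketch (the dataframe is reconstructed from the question; DataFrame.append is avoided since it is deprecated in recent pandas):
import pandas as pd

# Dataframe reconstructed from the question
dfOld = pd.DataFrame({
    'ID':        [0, 0, 0, 1, 1, 2, 2],
    'direction': ['IN', 'OUT', 'OUT', 'IN', 'OUT', 'OUT', 'OUT'],
    'country':   ['USA', 'FRA', 'ESP', 'GER', 'USA', 'USA', 'ESP'],
    'time':      ['12:10', '14:20', '16:11', '11:13', '10:29', '09:21', '21:33'],
})

rows = []
prev = None
for _, row in dfOld.iterrows():
    # Two consecutive OUT rows for the same ID: insert a copy of the
    # first OUT row, flipped to IN, before the second OUT row
    if (prev is not None and prev['ID'] == row['ID']
            and prev['direction'] == 'OUT' and row['direction'] == 'OUT'):
        filler = prev.copy()
        filler['direction'] = 'IN'
        rows.append(filler)
    rows.append(row)
    prev = row

dfNew = pd.DataFrame(rows).reset_index(drop=True)
print(dfNew)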

I want to count occurrences of a subset in a pandas dataframe

Suppose I have a dataframe with data like the below:
Name Id
Alex 123
John 222
Alex 123
Kendal 333
So I want to add a column which will result:
Name Id Subset Count
Alex 123 2
John 222 1
Alex 123 2
Kendal 333 1
I used the code below but didn't get the expected output:
df['Subset Count'] = df.value_counts(subset=['Name','Id'])
Try via groupby():
df['Subset Count'] = df.groupby(['Name','Id'])['Name'].transform('count')
or via droplevel() and map():
df['Subset Count'] = df['Name'].map(df.value_counts(subset=['Name','Id']).droplevel(1))
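For reference, a minimal runnable version of the groupby() approach, with the dataframe reconstructed from the question. The key point is that transform('count') broadcasts each group's size back onto its rows, so the result aligns with the original index, whereas a bare value_counts() does not:
import pandas as pd

df = pd.DataFrame({'Name': ['Alex', 'John', 'Alex', 'Kendal'],
                   'Id':   [123, 222, 123, 333]})

# transform('count') returns one value per original row, so the
# assignment lines up with df's index
df['Subset Count'] = df.groupby(['Name', 'Id'])['Name'].transform('count')
print(df)
#      Name   Id  Subset Count
# 0    Alex  123             2
# 1    John  222             1
# 2    Alex  123             2
# 3  Kendal  333             1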

pandas read dataframe with partial header

I want to read a file that has a partial header, i.e. some columns have names and some do not. I want to read the file as it is: keep the names of the columns that already have names and leave the rest as they are. Is there any clean way to do that in pandas?
The short answer to your question is no: a pandas dataframe cannot have more than one empty column name. If you try to import a .csv file with multiple empty column names, you won't get the behavior you expect: pandas fills in empty column names as Unnamed: 0, Unnamed: 1, and so on (or possibly something else if the .csv file has a space in place of the column name).
For example, this .csv file, whose column names at indices 0, 3, 4, and 5 are missing...
,Doe,120 jefferson st.,,,
Jack,McGinnis,220 hobo Av.,Phila, PA,09119
"John ""Da Man""",Repici,120 Jefferson St.,Riverside, NJ,08075
Stephen,Tyler,"7452 Terrace ""At the Plaza"" road",SomeTown,SD, 91234
,Blankman,,SomeTown, SD, 00298
"Joan ""the bone"", Anne",Jet,"9th, at Terrace plc",Desert City,CO,00123
...will get imported in the following way:
Unnamed: 0 Doe 120 jefferson st. Unnamed: 3 Unnamed: 4 Unnamed: 5
0 Jack McGinnis 220 hobo Av. Phila PA 9119
1 John "Da Man" Repici 120 Jefferson St. Riverside NJ 8075
2 Stephen Tyler 7452 Terrace "At the Plaza" road SomeTown SD 91234
3 NaN Blankman NaN SomeTown SD 298
4 Joan "the bone", Anne Jet 9th, at Terrace plc Desert City CO 123
If, for example, the names of the first two columns are missing, you will get this structure after reading the file normally with pandas:
df.head()
Unnamed: 0 Unnamed: 1 col3 col4 col5
0 .. ..
1 .. ..
After reading the df, you can rename the unnamed columns as below. Note that the keys must match the generated names (here Unnamed: 0 and Unnamed: 1), and that rename() returns a new dataframe, so assign the result back:
df = df.rename(columns={'Unnamed: 0': 'Col1', 'Unnamed: 1': 'Col2'})
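A minimal runnable sketch of that read-then-rename flow, using an inline CSV with hypothetical column names and values:
import pandas as pd
from io import StringIO

# Hypothetical CSV whose first two column names are missing
csv_data = ",,col3,col4,col5\n1,2,3,4,5\n6,7,8,9,10\n"
df = pd.read_csv(StringIO(csv_data))
print(df.columns.tolist())
# ['Unnamed: 0', 'Unnamed: 1', 'col3', 'col4', 'col5']

# rename() returns a new dataframe, so assign it back
df = df.rename(columns={'Unnamed: 0': 'Col1', 'Unnamed: 1': 'Col2'})
print(df.columns.tolist())
# ['Col1', 'Col2', 'col3', 'col4', 'col5']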

Python: Compare 2 columns and write data to Excel sheets

I need to compare two columns together: "EMAIL" and "LOCATION". I'm using Email because it's more accurate than Name for this issue.
My objective is to find the total number of locations each person worked at, use that total to select which sheet the data will be written to, and copy the original data over to the new sheet (tab). I need the original data copied over with all the duplicate locations, which is where this problem stumps me.
(Image: the full Excel sheet.)
The Excel sheet (SAMPLE) I'm reading in as a dataframe:
(Image: sample spreadsheet.)
Example:
TOMAPPLES#EXAMPLE.COM worked at WENDYS, FRANKS HUT, and WALMART. That sums up to 3 different locations, so those rows would go to a new sheet called SHEET: 3 Different Locations.
SJONES22#GMAIL.COM worked at LONDONS TENT and YOUTUBE. That's 2 different locations, which I would add to a new sheet called SHEET: 2 Different Locations.
MONTYJ#EXAMPLE.COM worked only at WALMART. This user would be added to SHEET: 1 Location.
Outcome: data copied to new sheets.
(Images: Sheet 2, Sheet 3, and Sheet 4, each showing the rows copied for the corresponding number of different locations.)
Thanks for taking the time to look at my problem =)
Hi, check whether the lines below work for you:
import pandas as pd

df = pd.read_excel('sample.xlsx')

df1 = df.groupby(['Name', 'Location', 'Job']).count().reset_index()

# the same grouping with Email added, broken across lines for readability
df2 = (df.groupby(['Name', 'Location', 'Job', 'Email'])
         .agg({'Location': 'count', 'Email': 'count'})
         .rename(columns={'Location': 'Location Count', 'Email': 'Email Count'})
         .reset_index())

print(df1)
print('\n\n')
print(df2)
Below is the output; change the columns to check more variations.
df1
Name Location Job Email
0 Monty Jakarta Manager 1
1 Monty Mumbai Manager 1
2 Sahara Jonesh Paris Cook 2
3 Tom App Jakarta Buser 1
4 Tom App Paris Buser 2
df2 (all columns)
Name Location ... Location Count Email Count
0 Monty Jakarta ... 1 1
1 Monty Mumbai ... 1 1
2 Sahara Jonesh Paris ... 2 2
3 Tom App Jakarta ... 1 1
4 Tom App Paris ... 2 2
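The snippet above only computes the counts. As a possible sketch of the sheet-splitting step the question actually asks for: the column names 'Email' and 'Location' are taken from the sample, while the output filename and sheet-naming scheme are hypothetical:
import pandas as pd

df = pd.read_excel('sample.xlsx')

# Distinct locations per email, broadcast back onto every row so
# duplicate location rows are kept in the copy
df['Location Count'] = df.groupby('Email')['Location'].transform('nunique')

# Write each group of rows to its own sheet, named after the count
with pd.ExcelWriter('output.xlsx') as writer:  # hypothetical output file
    for count, group in df.groupby('Location Count'):
        name = f'{count} Different Locations' if count > 1 else '1 Location'
        group.drop(columns='Location Count').to_excel(
            writer, sheet_name=name, index=False)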

compare columns and generate duplicate rows in mysql (or) python pandas

I am new to MySQL and just getting started with some basic concepts. I have been trying to solve this for a while now; any help is appreciated.
I have a list of users with two phone numbers. I want to compare the two columns (phone numbers) and generate a new row if the data differs between them; otherwise retain the row and make no changes.
The processed data would look like the second table.
Is there any way to achieve this in MySQL?
I also don't mind doing the transformation in a dataframe and then loading it into a table.
id username primary_phone landline
1 John 222 222
2 Michael 123 121
3 lucy 456 456
4 Anderson 900 901
Thanks!!!
Use DataFrame.melt, drop the variable column, and apply DataFrame.drop_duplicates:
df = (df.melt(['id', 'username'], value_name='phone')
        .drop('variable', axis=1)
        .drop_duplicates()
        .sort_values('id'))
print(df)
id username phone
0 1 John 222
1 2 Michael 123
5 2 Michael 121
2 3 lucy 456
3 4 Anderson 900
7 4 Anderson 901
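To make that runnable end to end, a sketch with the question's table reconstructed as the input dataframe (phone numbers kept as integers for brevity):
import pandas as pd

# Input table reconstructed from the question
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'username': ['John', 'Michael', 'lucy', 'Anderson'],
                   'primary_phone': [222, 123, 456, 900],
                   'landline': [222, 121, 456, 901]})

# Stack both phone columns into one, then drop rows where the
# two numbers were identical
df = (df.melt(['id', 'username'], value_name='phone')
        .drop('variable', axis=1)
        .drop_duplicates()
        .sort_values('id'))
print(df)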
